Commit 8aade54

Merge pull request #70 from Jeerhz/adle/rankings

Feat?: Compute Elo Rankings

2 parents 0778134 + 9ac9e5b commit 8aade54

File tree: 16 files changed, +1041 −2142 lines


.env.example

Lines changed: 5 additions & 0 deletions

```diff
@@ -1,7 +1,12 @@
 MISTRAL_API_KEY=""
 OPENAI_API_KEY=""
+ANTHROPIC_API_KEY=""
+GOOGLE_GEMINI_API_KEY=""
 GROK_API_KEY=""
 CEREBRAS_API_KEY=""
+TOGETHER_API_KEY=""
+ANYSCALE_API_KEY=""
+FIREWORKS_API_KEY=""
 DISABLE_LLM="False"

 # AWS credentials
```
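These keys have to end up in the process environment before `agent/llm.py` reads them with `os.environ.get`. A minimal sketch of how they might be loaded, assuming the project uses python-dotenv (the `load_dotenv` call is an assumption, not shown in this diff):

```python
# Hypothetical .env loading: assumes python-dotenv is installed.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory into os.environ

# The providers added by this PR read their keys the same way agent/llm.py does:
together_key = os.environ.get("TOGETHER_API_KEY")
```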

.gitignore

Lines changed: 2 additions & 1 deletion

```diff
@@ -11,4 +11,5 @@ results/
 .DS_Store
 env

-adle_notebook.ipynb
+adle_notebook.ipynb
+elo_rankings.ipynb
```

README.md

Lines changed: 106 additions & 36 deletions
```diff
@@ -35,31 +35,40 @@ As opposed to RL models, which blindly take actions based on the reward function

 # Results

-Our experimentations (342 fights so far) led to the following leaderboard.
+Our experiments (546 fights so far) led to the following leaderboard.
 Each LLM has an ELO score based on its results.

 ## Ranking

 ### ELO ranking

-| Model                          |  Rating |
-| ------------------------------ | ------: |
-| 🥇openai:gpt-3.5-turbo-0125    | 1776.11 |
-| 🥈mistral:mistral-small-latest | 1586.16 |
-| 🥉openai:gpt-4-1106-preview    | 1584.78 |
-| openai:gpt-4                   |  1517.2 |
-| openai:gpt-4-turbo-preview     | 1509.28 |
-| openai:gpt-4-0125-preview      | 1438.92 |
-| mistral:mistral-medium-latest  | 1356.19 |
-| mistral:mistral-large-latest   | 1231.36 |
+| Rank | Model                                                              |  Rating |
+| ---: | :----------------------------------------------------------------- | ------: |
+|    1 | 🥇openai:gpt-4o:text                                               |  1912.5 |
+|    2 | 🥈**openai:gpt-4o-mini:vision**                                    | 1835.27 |
+|    3 | 🥉openai:gpt-4o-mini:text                                          | 1670.89 |
+|    4 | **openai:gpt-4o:vision**                                           | 1656.93 |
+|    5 | **mistral:pixtral-large-latest:vision**                            | 1654.61 |
+|    6 | **mistral:pixtral-12b-2409:vision**                                | 1590.77 |
+|    7 | mistral:pixtral-12b-2409:text                                      | 1569.03 |
+|    8 | together:meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo:text       | 1441.45 |
+|    9 | **anthropic:claude-3-haiku-20240307:vision**                       | 1364.87 |
+|   10 | mistral:pixtral-large-latest:text                                  | 1356.32 |
+|   11 | anthropic:claude-3-haiku-20240307:text                             |  1333.6 |
+|   12 | **anthropic:claude-3-sonnet-20240229:vision**                      | 1314.61 |
+|   13 | **together:meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo:vision** | 1269.84 |
+|   14 | anthropic:claude-3-sonnet-20240229:text                            | 1029.31 |

 ### Win rate matrix

-![Win rate matrix](notebooks/win_rate_matrix.png)
+![Win rate matrix](notebooks/result_matrix.png)

 # Explanation

-Each player is controlled by an LLM.
+Each player can be controlled by a multimodal model or a text-generation model.
+
+### TextRobot
+
 We send to the LLM a text description of the screen. The LLM decides on the next moves its character will make. The next moves depend on its previous moves, the moves of its opponent, and its power and health bars.

 - Agent based
```
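The leaderboard above is the output of the Elo computation this PR adds; the rating code itself is not in the hunks shown here. As a reference point only, this is a minimal sketch of a standard Elo update applied to a fight result. The K-factor of 32 and the 1500 starting rating are assumptions, not values taken from this PR:

```python
# Minimal Elo sketch: K=32 and the 1500 starting rating are assumed defaults.
from collections import defaultdict

K = 32

def expected(r_a: float, r_b: float) -> float:
    """Win probability of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def play(ratings: dict, a: str, b: str, score_a: float) -> None:
    """score_a: 1.0 if `a` won, 0.0 if `a` lost, 0.5 for a draw."""
    e_a = expected(ratings[a], ratings[b])
    ratings[a] += K * (score_a - e_a)
    ratings[b] += K * ((1.0 - score_a) - (1.0 - e_a))  # zero-sum update

ratings = defaultdict(lambda: 1500.0)
play(ratings, "openai:gpt-4o:text", "anthropic:claude-3-sonnet-20240229:text", 1.0)
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```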
```diff
@@ -68,6 +77,10 @@ We send to the LLM a text description of the screen. The LLM decide on the next

 ![fight3 drawio](https://github.com/OpenGenerativeAI/llm-colosseum/assets/78322686/3a212601-f54c-490d-aeb9-6f7c2401ebe6)

+### VisionRobot
+
+We send to the LLM a screenshot of the current state of the game, specifying which character it is controlling. Its decision is based only on this visual information.
+
 # Installation

 - Follow instructions in https://docs.diambra.ai/#installation
```
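The VisionRobot introduced above has to package the current frame as an image the multimodal client can consume. A later hunk calls `self.last_image_to_image_node()` but its body is not part of this diff; the following is a hypothetical sketch of what such a helper could look like with llama-index's `ImageDocument`. The function name, the frame-array argument, and the PNG round-trip are all assumptions:

```python
# Hypothetical frame-to-ImageDocument helper. `frame` is an assumed uint8
# numpy array of shape (H, W, 3) holding the most recent screenshot.
import base64
from io import BytesIO

import numpy as np
from llama_index.core.schema import ImageDocument
from PIL import Image

def frame_to_image_document(frame: np.ndarray) -> ImageDocument:
    buffer = BytesIO()
    Image.fromarray(frame).save(buffer, format="PNG")
    b64 = base64.b64encode(buffer.getvalue()).decode("utf-8")
    # ImageDocument carries the base64 payload to the multimodal client.
    return ImageDocument(image=b64, image_mimetype="image/png")
```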
````diff
@@ -142,43 +155,52 @@ By default, it runs mistral against mistral. To use other models, you need to ch
 from eval.game import Game, Player1, Player2

 def main():
+    # Environment Settings
+
     game = Game(
         render=True,
         save_game=True,
         player_1=Player1(
             nickname="Baby",
-            model="ollama:mistral",  # change this
+            model="ollama:mistral",
+            robot_type="text",  # vision or text
+            temperature=0.7,
         ),
         player_2=Player2(
             nickname="Daddy",
-            model="ollama:mistral",  # change this
+            model="ollama:mistral",
+            robot_type="text",
+            temperature=0.7,
         ),
     )
+
     game.run()
     return 0
+
+
+if __name__ == "__main__":
+    main()
 ```

 The convention we use is `model_provider:model_name`. If you want to use a local model other than Mistral, you can use `ollama:some_other_model`.

 ## How to make my own LLM model play? Can I improve the prompts?

-The LLM is called in `Robot.call_llm()` method of the `agent/robot.py` file.
+The LLM is called in the `<Text|Vision>Robot.call_llm()` method of the `agent/robot.py` file.
+
+#### TextRobot method:

 ```python
 def call_llm(
     self,
-    temperature: float = 0.7,
     max_tokens: int = 50,
     top_p: float = 1.0,
-) -> str:
+) -> Generator[ChatResponse, None, None]:
     """
     Make an API call to the language model.

     Edit this method to change the behavior of the robot!
     """
-    # self.model is a slug like mistral:mistral-small-latest or ollama:mistral
-    provider_name, model_name = get_provider_and_model(self.model)
-    client = get_sync_client(provider_name)  # OpenAI client

     # Generate the prompts
     move_list = "- " + "\n - ".join([move for move in META_INSTRUCTIONS])
````
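The hunk below continues `call_llm` with the full system prompt. For intuition, this is what the `move_list` expression renders to for a hypothetical `META_INSTRUCTIONS`; the three move names are placeholders, not the repository's actual mapping:

```python
# Placeholder META_INSTRUCTIONS: the real mapping lives in the repository.
META_INSTRUCTIONS = {"Move closer": None, "Fireball": None, "Medium Punch": None}

move_list = "- " + "\n - ".join([move for move in META_INSTRUCTIONS])
print(move_list)
# - Move closer
#  - Fireball
#  - Medium Punch
# (note the single leading space on every bullet after the first, from "\n - ")
```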
````diff
@@ -197,28 +219,76 @@ Example if the opponent is far:
 - Fireball
 - Move closer"""

-    # Call the LLM
-    completion = client.chat.completions.create(
-        model=model_name,
-        messages=[
-            {"role": "system", "content": system_prompt},
-            {"role": "user", "content": "Your next moves are:"},
-        ],
-        temperature=temperature,
-        max_tokens=max_tokens,
-        top_p=top_p,
+    start_time = time.time()
+
+    client = get_client(self.model, temperature=self.temperature)
+
+    messages = [
+        ChatMessage(role="system", content=system_prompt),
+        ChatMessage(role="user", content="Your next moves are:"),
+    ]
+    resp = client.stream_chat(messages)
+
+    logger.debug(f"LLM call to {self.model}: {system_prompt}")
+    logger.debug(f"LLM call to {self.model}: {time.time() - start_time}s")
+
+    return resp
+```
+
+#### VisionRobot method:
+
+```python
+def call_llm(
+    self,
+    max_tokens: int = 50,
+    top_p: float = 1.0,
+) -> Generator[CompletionResponse, None, None]:
+    """
+    Make an API call to the language model.
+
+    Edit this method to change the behavior of the robot!
+    """
+
+    # Generate the prompts
+    move_list = "- " + "\n - ".join([move for move in META_INSTRUCTIONS])
+    system_prompt = f"""You are the best and most aggressive Street Fighter III 3rd strike player in the world.
+Your character is {self.character}. Your goal is to beat the other opponent. You respond with a bullet point list of moves.
+
+The current state of the game is given in the following image.
+
+The moves you can use are:
+{move_list}
+----
+Reply with a bullet point list of 3 moves. The format should be: `- <name of the move>` separated by a new line.
+Example if the opponent is close:
+- Move closer
+- Medium Punch
+
+Example if the opponent is far:
+- Fireball
+- Move closer"""
+
+    start_time = time.time()
+
+    client = get_client_multimodal(
+        self.model, temperature=self.temperature
+    )  # MultiModalLLM
+
+    resp = client.stream_complete(
+        prompt=system_prompt, image_documents=[self.last_image_to_image_node()]
     )

-    # Return the string to be parsed with regex
-    llm_response = completion.choices[0].message.content.strip()
-    return llm_response
+    logger.debug(f"LLM call to {self.model}: {system_prompt}")
+    logger.debug(f"LLM call to {self.model}: {time.time() - start_time}s")
+
+    return resp
 ```

-To use another model or other prompts, make a call to another client in this function, change the system prompt, or make any fancy stuff.
+You can personalize your prompts in these functions.

 ### Submit your model

-Create a new class herited from `Robot` that has the changes you want to make and open a PR.
+Create a new class inherited from Robot that has the changes you want to make and open a PR.

 We'll do our best to add it to the ranking!
````
agent/llm.py

Lines changed: 85 additions & 10 deletions
```diff
@@ -1,8 +1,9 @@
 from llama_index.core.llms.function_calling import FunctionCallingLLM
 from llama_index.core.multi_modal_llms.base import MultiModalLLM
+import os


-def get_client(model_str: str) -> FunctionCallingLLM:
+def get_client(model_str: str, temperature: float = 0.7) -> FunctionCallingLLM:
     split_result = model_str.split(":")
     if len(split_result) == 1:
         # Assume default provider to be openai
```
```diff
@@ -19,37 +20,71 @@ def get_client(model_str: str) -> FunctionCallingLLM:
     if provider == "openai":
         from llama_index.llms.openai import OpenAI

-        return OpenAI(model=model_name)
+        return OpenAI(model=model_name, temperature=temperature)
     elif provider == "anthropic":
         from llama_index.llms.anthropic import Anthropic

-        return Anthropic(model=model_name)
+        return Anthropic(model=model_name, temperature=temperature)
     elif provider == "mistral":
         from llama_index.llms.mistralai import MistralAI

         return MistralAI(model=model_name)
     elif provider == "groq":
         from llama_index.llms.groq import Groq

-        return Groq(model=model_name)
+        return Groq(model=model_name, temperature=temperature)

     elif provider == "ollama":
         from llama_index.llms.ollama import Ollama

-        return Ollama(model=model_name)
+        return Ollama(model=model_name, temperature=temperature)
     elif provider == "bedrock":
         from llama_index.llms.bedrock import Bedrock

         return Bedrock(model=model_name)
     elif provider == "cerebras":
         from llama_index.llms.cerebras import Cerebras

-        return Cerebras(model=model_name)
+        return Cerebras(model=model_name, temperature=temperature)
+    elif provider == "gemini":
+        from llama_index.llms.gemini import Gemini
+
+        return Gemini(model=model_name, temperature=temperature)
+
+    elif provider == "anyscale":
+        from llama_index.llms.openai import OpenAI
+
+        return OpenAI(
+            model=model_name,
+            temperature=temperature,
+            api_key=os.environ.get("ANYSCALE_API_KEY"),
+            api_base="https://api.endpoints.anyscale.com/v1/",
+        )
+
+    elif provider == "fireworks":
+        from llama_index.llms.openai import OpenAI
+
+        return OpenAI(
+            model=model_name,
+            temperature=temperature,
+            api_key=os.environ.get("FIREWORKS_API_KEY"),
+            api_base="https://api.fireworks.ai/inference/v1/",
+        )
+
+    elif provider == "together":
+        from llama_index.llms.openai import OpenAI
+
+        return OpenAI(
+            model=model_name,
+            temperature=temperature,
+            api_key=os.environ.get("TOGETHER_API_KEY"),
+            api_base="https://api.together.xyz/v1/",
+        )

     raise ValueError(f"Provider {provider} not found in models")


-def get_client_multimodal(model_str: str) -> MultiModalLLM:
+def get_client_multimodal(model_str: str, temperature: float = 0.7) -> MultiModalLLM:
     split_result = model_str.split(":")
     if len(split_result) == 1:
         # Assume default provider to be openai
```
```diff
@@ -66,16 +101,56 @@ def get_client_multimodal(model_str: str) -> MultiModalLLM:
     if provider == "openai":
         from llama_index.multi_modal_llms.openai import OpenAIMultiModal

-        return OpenAIMultiModal(model=model_name)
+        return OpenAIMultiModal(model=model_name, temperature=temperature)

     if provider == "ollama":
         from llama_index.multi_modal_llms.ollama import OllamaMultiModal

-        return OllamaMultiModal(model=model_name)
+        return OllamaMultiModal(model=model_name, temperature=temperature)

     elif provider == "mistral":
         from llama_index.multi_modal_llms.mistralai import MistralAIMultiModal

-        return MistralAIMultiModal(model=model_name)
+        return MistralAIMultiModal(model=model_name, temperature=temperature)
+
+    elif provider == "gemini":
+        from llama_index.multi_modal_llms.gemini import GeminiMultiModal
+
+        return GeminiMultiModal(model=model_name, temperature=temperature)
+
+    elif provider == "anthropic":
+        from llama_index.multi_modal_llms.anthropic import AnthropicMultiModal
+
+        return AnthropicMultiModal(model=model_name, temperature=temperature)
+
+    elif provider == "anyscale":
+        from llama_index.multi_modal_llms.openai import OpenAIMultiModal
+
+        return OpenAIMultiModal(
+            model=model_name,
+            temperature=temperature,
+            api_key=os.environ.get("ANYSCALE_API_KEY"),
+            api_base="https://api.endpoints.anyscale.com/v1/",
+        )
+
+    elif provider == "fireworks":
+        from llama_index.multi_modal_llms.openai import OpenAIMultiModal
+
+        return OpenAIMultiModal(
+            model=model_name,
+            temperature=temperature,
+            api_key=os.environ.get("FIREWORKS_API_KEY"),
+            api_base="https://api.fireworks.ai/inference/v1/",
+        )
+
+    elif provider == "together":
+        from llama_index.multi_modal_llms.openai import OpenAIMultiModal
+
+        return OpenAIMultiModal(
+            model=model_name,
+            temperature=temperature,
+            api_key=os.environ.get("TOGETHER_API_KEY"),
+            api_base="https://api.together.xyz/v1/",
+        )

     raise ValueError(f"Provider {provider} not found in multimodal models")
```
