Commit c2bd5f7

Merge pull request #202 from THUDM/agentbench_fc
publish agentbench_fc
2 parents 41e6807 + 5bf4eb2 commit c2bd5f7

246 files changed

Lines changed: 436295 additions & 66054 deletions

README.md

Lines changed: 70 additions & 9 deletions
@@ -10,6 +10,74 @@
 👋 Join our <a href="https://join.slack.com/t/agentbenchcol-huw1944/shared_invite/zt-20ixabcuv-31cFLBAkqGQxQkJqrWVEVg" target="_blank">Slack</a> for <i>Q & A</i> or <i><b>collaboration</b> on next version of AgentBench</i>!
 </p>

+## 🔥[2025.10.10] Introducing **AgentBench FC (Function Calling)** based on [AgentRL](https://github.com/THUDM/AgentRL)
+
+The current repository contains the function-calling version of AgentBench, integrated with [AgentRL](https://github.com/THUDM/AgentRL), an end-to-end multitask and multiturn LLM Agent RL framework.
+If you wish to use an older version, you can revert to [v0.1](https://github.com/THUDM/AgentBench/tree/v0.1) or [v0.2](https://github.com/THUDM/AgentBench/tree/v0.2).
+
+Compared to the original AgentBench, this version uses a function-calling style prompt,
+and adds fully-containerized deployment support for the following tasks:
+
+- `alfworld` (AF)
+- `dbbench` (DB)
+- `knowledgegraph` (KG)
+- `os_interaction` (OS)
+- `webshop` (WS)
+
+### Quick Start
+
+We support a quick one-command setup for all the above tasks using Docker Compose.
+
+Before starting, please download or build the following Docker images required by the tasks:
+
+```shell
+# dbbench
+docker pull mysql:8
+
+# os_interaction
+docker build -t local-os/default -f ./data/os_interaction/res/dockerfiles/default data/os_interaction/res/dockerfiles
+docker build -t local-os/packages -f ./data/os_interaction/res/dockerfiles/packages data/os_interaction/res/dockerfiles
+docker build -t local-os/ubuntu -f ./data/os_interaction/res/dockerfiles/ubuntu data/os_interaction/res/dockerfiles
+```
+
+To run the KG freebase server, you will also need a copy of the data found [here](https://github.com/dki-lab/Freebase-Setup).
+Download, extract and place the data at `./virtuoso_db/virtuoso.db` (or modify `extra/docker-compose.yml` and set the mount point to your data location).
+
+Then, you can bring up the stack with:
+
+```shell
+docker compose -f extra/docker-compose.yml up
+```
+
+This command will download or build the necessary Docker images and start the following services in Docker:
+
+- AgentRL Controller
+- `alfworld` task worker (x1, increase as needed)
+- `dbbench` task worker (x1, increase as needed)
+- `knowledgegraph` task worker (x1, increase as needed)
+- `os_interaction` task worker (x1, increase as needed)
+- `webshop` task worker (x1, increase as needed)
+- freebase server (for `knowledgegraph` task)
+- Redis server (for container allocation)
+
+If your machine already has Redis (version 7+) running, you can omit the Redis service from the `docker-compose.yml`.
+
+> [!WARNING]
+> Please note that the `webshop` environment requires ~16GB of RAM to start,
+> and the current implementation of `alfworld` leaks memory and disk space until the task worker is restarted.
+> Make sure your machine has sufficient resources before running.
+
+### Benchmarking Results
+
+We report the results of various models on the test set of AgentBench FC.
+
+![img.png](assets/fc_leaderboard.png)
+
+Please see our [Leaderboard](https://docs.google.com/spreadsheets/d/e/2PACX-1vRR3Wl7wsCgHpwUw1_eUXW_fptAPLL3FkhnW_rua0O1Ji_GIVrpTjY5LaKAhwO-WeARjnY_KNw0SYNJ/pubhtml) for full results.
+Please contact [agentbench_fc&#64;googlegroups.com](mailto:agentbench_fc@googlegroups.com) if you have any questions or would like to contribute your results.
+
+---
+
 ## 🔥[2024.08.13] Introducing [VisualAgentBench](https://github.com/THUDM/VisualAgentBench)

 VisualAgentBench is designed for evaluating and training visual foundation agents based on large multimodal models (LMMs). We introduce 5 distinct environments spanning
@@ -20,16 +88,9 @@ VisualAgentBench is designed for evaluating and training visual foundation agent

 to systematically benchmark 17 LMMs (proprietary & open LMMs). We also provide the trajectory dataset for behavior cloning training on open LMMs for you to develop your own visual foundation agents!

-## 📌Introducing AgentBench v0.2🎉
-
-You are now browsing AgentBench v0.2. If you wish to use the older version, you can revert to [v0.1](https://github.com/THUDM/AgentBench/tree/v0.1).
-
-Based on [v0.1](https://github.com/THUDM/AgentBench/tree/v0.1), we:
+---

-- Updated the framework architecture for easier use and extension
-- Adjusted some task settings
-- Added test results for more models
-- Released the full data for the Dev and Test sets
+The following is the introduction to the original AgentBench (v0.2).

 # AgentBench: Evaluating LLMs as Agents
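
The Quick Start added to the README lists several prerequisites (four Docker images and the Freebase database file) that must exist before `docker compose` is run. A minimal sketch of a pre-flight check follows. The helper `missing_prerequisites` is hypothetical and not part of the repository; the image tags and the `virtuoso_db/virtuoso.db` path are taken from the README text, and the list of locally available images is passed in by the caller rather than queried from Docker.

```python
from pathlib import Path

# Image tags and database path as named in the README Quick Start.
REQUIRED_IMAGES = ["mysql:8", "local-os/default", "local-os/packages", "local-os/ubuntu"]
FREEBASE_DB = Path("virtuoso_db") / "virtuoso.db"


def missing_prerequisites(repo_root, local_images):
    """Return descriptions of Quick Start prerequisites that are absent.

    Hypothetical helper: `local_images` is an iterable of "repository:tag"
    strings describing the images already present on the machine.
    """
    problems = []
    available = set(local_images)
    for image in REQUIRED_IMAGES:
        # Docker treats an untagged name as ":latest", so accept both spellings.
        names = {image} if ":" in image else {image, image + ":latest"}
        if not names & available:
            problems.append(f"Docker image not found: {image}")
    if not (Path(repo_root) / FREEBASE_DB).exists():
        problems.append(f"Freebase data not found: {FREEBASE_DB}")
    return problems
```

One way to obtain `local_images` is to split the output of `docker image ls --format "{{.Repository}}:{{.Tag}}"` into lines; an empty return value means every listed prerequisite was found.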

assets/fc_leaderboard.png

300 KB

configs/tasks/alfworld.yaml

Lines changed: 26 additions & 14 deletions
@@ -1,22 +1,34 @@
 default:
   module: src.server.tasks.alfworld.ALFWorld
-  docker:
-    image: longinyu/agentbench-alfworld
-    command: umask 0; [ -f /root/.setup.sh ] && bash /root/.setup.sh;
   parameters:
     name: alfworld-std
-    data_path: "/AgentBench/data/alfworld"
-    config_path: "src/server/tasks/alfworld/configs/base_config.yaml"
-    prompts_path: "src/server/tasks/alfworld/prompts/alfworld_multiturn_plan_first.json"
-    split: "standard"
-    max_step: 35
-
-alfworld-dev:
-  parameters:
-    name: alfworld-dev
-    split: "dev"
+    concurrency: 16
+    data_path: "/app/data/alfworld"
+    config_path: "/app/src/server/tasks/alfworld/configs/base_config.yaml"
+    prompts_path: "/app/src/server/tasks/alfworld/prompts/alfworld_multiturn_plan_first.json"
+    split: "new_std"
+    max_step: 20
+    tools:
+      - type: "function"
+        function:
+          name: "take_action"
+          description: "Take an action."
+          parameters:
+            type: "object"
+            properties:
+              action:
+                type: "string"
+                description: "The action you would like to take"
+            required:
+              - "action"
+            additionalProperties: False

 alfworld-std:
   parameters:
     name: alfworld-std
-    split: "standard"
+    split: "new_std"
+
+alfworld-env_train:
+  parameters:
+    name: alfworld-env_train
+    split: "train_valid"
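
The `tools` entry added to `alfworld.yaml` declares a single function, `take_action`, whose arguments must be an object with exactly one string field, `action`. The sketch below shows what that constraint means for a model's tool call. `check_arguments` is a hypothetical helper, not code from the repository, and it enforces only the `required`, `additionalProperties`, and string-type rules rather than full JSON Schema.

```python
import json

# The take_action schema, mirroring the YAML added in this diff.
TAKE_ACTION_TOOL = {
    "type": "function",
    "function": {
        "name": "take_action",
        "description": "Take an action.",
        "parameters": {
            "type": "object",
            "properties": {
                "action": {
                    "type": "string",
                    "description": "The action you would like to take",
                },
            },
            "required": ["action"],
            "additionalProperties": False,
        },
    },
}


def check_arguments(tool, raw_arguments):
    """Parse a JSON argument string and apply a subset of the schema rules.

    Hypothetical helper: raises ValueError on malformed JSON, a missing
    required key, an undeclared key, or a non-string value for a string field.
    """
    schema = tool["function"]["parameters"]
    args = json.loads(raw_arguments)
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")
    missing = [key for key in schema.get("required", []) if key not in args]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")
    if schema.get("additionalProperties") is False:
        extra = sorted(set(args) - set(schema["properties"]))
        if extra:
            raise ValueError(f"unexpected arguments: {extra}")
    for key, value in args.items():
        expected = schema["properties"].get(key, {}).get("type")
        if expected == "string" and not isinstance(value, str):
            raise ValueError(f"argument {key!r} must be a string")
    return args
```

For instance, `check_arguments(TAKE_ACTION_TOOL, '{"action": "go to desk 1"}')` returns the parsed dictionary, while an empty object or an extra key raises `ValueError`.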

configs/tasks/dbbench.yaml

Lines changed: 45 additions & 6 deletions
@@ -1,15 +1,54 @@
 default:
-  module: src.server.tasks.dbbench.DBBench
+  module: src.server.tasks.dbbench.DBBenchTask
   parameters:
-    concurrency: 1
+    concurrency: 32
     max_round: 15

-dbbench-dev:
-  parameters:
-    name: dbbench-dev
-    data_file: "data/dbbench/dev.jsonl"
+    tools:
+      - type: "function"
+        function:
+          name: "execute_sql"
+          description: "Executes a given SQL statement on the database and returns the result."
+          parameters:
+            type: "object"
+            properties:
+              query:
+                type: "string"
+                description: "The SQL query to be executed."
+            required:
+              - "query"
+            additionalProperties: False
+      - type: "function"
+        function:
+          name: "commit_final_answer"
+          description: "Commits the final answer after all operations are completed."
+          parameters:
+            type: "object"
+            properties:
+              answers:
+                type: "array"
+                items:
+                  type: "string"
+                description: "The list of final answers to commit."
+            required:
+              - "answers"
+            additionalProperties: False
+
+    env_driver: docker
+    env_options:
+      network_name: dbbench_default
+    state_driver: redis
+    state_options:
+      connection:
+        host: 172.17.0.1

 dbbench-std:
   parameters:
     name: dbbench-std
     data_file: "data/dbbench/standard.jsonl"
+
+dbbench-env_train:
+  parameters:
+    name: dbbench-env_train
+    data_file: "data/dbbench/db_out_new.jsonl"
+    db_file: "data/dbbench/db_train"
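
The two tools declared in `dbbench.yaml` imply a simple episode loop: the agent calls `execute_sql` any number of times, then ends the episode with `commit_final_answer`. The following is a rough sketch of how such calls could be dispatched. It is not the repository's `DBBenchTask`: `run_tool_call` is a hypothetical name, and an in-memory SQLite database stands in for the MySQL container the benchmark actually uses, so the `medals` table is illustrative only.

```python
import json
import sqlite3


def run_tool_call(conn, name, raw_arguments):
    """Dispatch one tool call and return (observation, final_answers).

    Hypothetical dispatcher: final_answers stays None until the agent
    calls commit_final_answer.
    """
    args = json.loads(raw_arguments)
    if name == "execute_sql":
        try:
            rows = conn.execute(args["query"]).fetchall()
            conn.commit()
        except sqlite3.Error as exc:
            # Return the error as an observation so the agent can retry.
            return f"SQL error: {exc}", None
        return str(rows), None
    if name == "commit_final_answer":
        return "", [str(answer) for answer in args["answers"]]
    return f"unknown tool: {name}", None


# Illustrative stand-in database (SQLite instead of MySQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE medals (country TEXT, gold INTEGER)")
conn.executemany("INSERT INTO medals VALUES (?, ?)", [("A", 3), ("B", 5)])
```

Returning SQL errors as observations rather than raising keeps the episode alive within the configured `max_round` budget, which is one plausible design; the actual task may handle errors differently.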

configs/tasks/kg.yaml

Lines changed: 27 additions & 7 deletions
@@ -1,15 +1,35 @@
 default:
   module: "src.server.tasks.knowledgegraph.KnowledgeGraph"
   parameters:
-    round: 15
-    sparql_url: "http://164.107.116.56:3093/sparql"
+    concurrency: 32
+    max_rounds: 15
+    one_shot: false
+    database_file:
+    env_driver: manual
+    env_options:
+      urls:
+        kg: http://localhost:3001/sparql

-kg-dev:
-  parameters:
-    name: "KnowledgeGraph-dev"
-    data_file: "data/knowledgegraph/dev.json"
+    # alternative configuration - automatically start a SPARQL server in a docker container
+    # fill-in the database_file parameter with the absolute path to the freebase db file on the host
+    # and replace the above parameters with the following:
+    #
+    # database_file: /path/to/virtuoso_db/virtuoso.db
+    # env_driver: docker
+    # env_options:
+    #   network_name: knowledgegraph_default
+    # state_driver: redis
+    # state_options:
+    #   connection:
+    #     host: 172.17.0.1

 kg-std:
   parameters:
-    name: "KnowledgeGraph-std"
+    name: "kg-std"
     data_file: "data/knowledgegraph/std.json"
+    one_shot: true
+
+kg-env_train:
+  parameters:
+    name: "kg-env_train"
+    data_file: "data/knowledgegraph/kg_rl_all.json"
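
With `env_driver: manual`, the `kg.yaml` configuration points the task at an already-running SPARQL endpoint, `http://localhost:3001/sparql`. As an illustration of what a query against that URL looks like, the sketch below builds (but does not send) a GET request following the standard SPARQL Protocol, where the query text travels in the `query` parameter. `build_sparql_request` is a hypothetical helper; how the repository's `KnowledgeGraph` task actually issues queries is not shown in this diff.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Value of env_options.urls.kg in the diff above.
SPARQL_URL = "http://localhost:3001/sparql"


def build_sparql_request(query, endpoint=SPARQL_URL):
    """Build, without sending, a SPARQL Protocol GET request (hypothetical helper)."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request("SELECT ?s WHERE { ?s ?p ?o } LIMIT 1")
```

Passing `request` to `urllib.request.urlopen` would execute the query, which only succeeds once the Freebase server from the Quick Start is running on that port.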

configs/tasks/os.yaml

Lines changed: 62 additions & 21 deletions
@@ -1,9 +1,50 @@
-os-dev:
+default:
   module: "src.server.tasks.os_interaction.OSInteraction"
   parameters:
-    name: "os-dev"
-    concurrency: 24
+    concurrency: 32
     round_limit: 8
+    tools:
+      - type: "function"
+        function:
+          name: "bash_action"
+          description: "Execute bash code to perform an operation in the Linux environment."
+          parameters:
+            type: "object"
+            properties:
+              script:
+                type: "string"
+                description: "The bash script to be executed."
+            required:
+              - "script"
+            additionalProperties: False
+
+      - type: "function"
+        function:
+          name: "finish_action"
+          description: "Indicate that the task has been finished or need some additional information to be finished."
+          parameters:
+            type: "object"
+            properties:
+              thought:
+                type: "string"
+                description: "The thought or reason indicating the task is finished."
+            required:
+              - "thought"
+            additionalProperties: False
+
+      - type: "function"
+        function:
+          name: "answer_action"
+          description: "Provide the answer to the question."
+          parameters:
+            type: "object"
+            properties:
+              answer:
+                type: "string"
+                description: "The answer to the question."
+            required:
+              - "answer"
+            additionalProperties: False

     docker_config:
       localhost: local-os
@@ -12,29 +53,18 @@ os-dev:
     scripts:
       directory: data/os_interaction/res/scripts

-    data_config:
-      files:
-        - problem_file: data/os_interaction/data/dev.json
-          script_dir: data/os_interaction/scripts/dev/
-          index_prefix: "dev-001-"
-
-      bk: [ ]
-      ignore: [ ]
+    env_driver: docker
+    env_options:
+      network_name: os_interaction_default
+    state_driver: redis
+    state_options:
+      connection:
+        host: 172.17.0.1

 os-std:
   module: "src.server.tasks.os_interaction.OSInteraction"
   parameters:
     name: "os-std"
-    concurrency: 24
-    round_limit: 8
-
-    docker_config:
-      localhost: local-os
-      directory: data/os_interaction/res/dockerfiles
-
-    scripts:
-      directory: data/os_interaction/res/scripts
-
     data_config:
       files:
         - problem_file: data/os_interaction/data/1/*.json
@@ -61,3 +91,14 @@ os-std:

       bk: [ ]
       ignore: [ ]
+
+os-env_train:
+  parameters:
+    name: "os-env_train"
+    data_config:
+      files:
+        - problem_file: data/os_interaction/train_0317/training.json
+          script_dir: data/os_interaction/scripts/7/
+          index_prefix: "train-0223-"
+      bk: [ ]
+      ignore: [ ]
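
Each task file touched by this commit defines a `default` entry followed by named variants (`os-std`, `os-env_train`, and so on) that list only the keys they change. The sketch below illustrates one way such a layout can be resolved, assuming the loader overlays each variant on `default` with a recursive merge. That merge behaviour is an assumption about the framework, not something the diff shows, and `deep_merge` and `resolve_tasks` are hypothetical names.

```python
import copy


def deep_merge(base, override):
    """Return a copy of `base` with `override` applied recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested mappings are merged key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists in the variant replace the default outright.
            merged[key] = copy.deepcopy(value)
    return merged


def resolve_tasks(config):
    """Overlay every named variant on the `default` entry (assumed semantics)."""
    default = config.get("default", {})
    return {
        name: deep_merge(default, body)
        for name, body in config.items()
        if name != "default"
    }


# A trimmed-down version of os.yaml after this commit.
config = {
    "default": {
        "module": "src.server.tasks.os_interaction.OSInteraction",
        "parameters": {"concurrency": 32, "round_limit": 8},
    },
    "os-env_train": {"parameters": {"name": "os-env_train"}},
}
resolved = resolve_tasks(config)
```

Under this assumption, `resolved["os-env_train"]` carries the shared `module`, `concurrency`, and `round_limit` values alongside its own `name`, which is why the variants in the diff can drop the duplicated keys.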
