README.md: 70 additions & 9 deletions
@@ -10,6 +10,74 @@
👋 Join our <a href="https://join.slack.com/t/agentbenchcol-huw1944/shared_invite/zt-20ixabcuv-31cFLBAkqGQxQkJqrWVEVg" target="_blank">Slack</a> for <i>Q & A</i> or <i><b>collaboration</b> on the next version of AgentBench</i>!
</p>

+## 🔥[2025.10.10] Introducing **AgentBench FC (Function Calling)** based on [AgentRL](https://github.com/THUDM/AgentRL)
+
+The current repository contains the function-calling version of AgentBench, integrated with [AgentRL](https://github.com/THUDM/AgentRL), an end-to-end multi-task and multi-turn LLM agent RL framework.
+If you wish to use an older version, you can revert to [v0.1](https://github.com/THUDM/AgentBench/tree/v0.1) or [v0.2](https://github.com/THUDM/AgentBench/tree/v0.2).
+
+Compared to the original AgentBench, this version uses a function-calling style prompt
+and adds fully containerized deployment support for the following tasks:
+
+- `alfworld` (AF)
+- `dbbench` (DB)
+- `knowledgegraph` (KG)
+- `os_interaction` (OS)
+- `webshop` (WS)
+
+### Quick Start
+
+We support a quick one-command setup for all the above tasks using Docker Compose.
+
+Before starting, please download or build the following Docker images required by the tasks:
+To run the KG Freebase server, you will also need a copy of the data found [here](https://github.com/dki-lab/Freebase-Setup).
+Download, extract, and place the data at `./virtuoso_db/virtuoso.db` (or modify `extra/docker-compose.yml` and set the mount point to your data location).
+
+Then, you can bring up the stack with:
+
+```shell
+docker compose -f extra/docker-compose.yml up
+```
+
+This command will download or build the necessary Docker images and start the following services in Docker:
+
+- AgentRL Controller
+- `alfworld` task worker (x1, increase as needed)
+- `dbbench` task worker (x1, increase as needed)
+- `knowledgegraph` task worker (x1, increase as needed)
+- `os_interaction` task worker (x1, increase as needed)
+- `webshop` task worker (x1, increase as needed)
+- Freebase server (for `knowledgegraph` task)
+- Redis server (for container allocation)
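The "(x1, increase as needed)" notes above suggest running more than one worker per task. One way to do that, sketched here under assumptions, is Compose's `--scale` flag. The service name `alfworld` below is hypothetical: the names actually defined in `extra/docker-compose.yml` can be listed with `docker compose -f extra/docker-compose.yml config --services`. The helper only prints the command so it can be reviewed before running.

```shell
# Print (do not run) a Compose command that starts the stack in the
# background with several replicas of one service.
# NOTE: the service name passed in is an assumption, not taken from the
# repository; check the real names with `config --services` first.
compose_up_scaled() {
  service="$1"    # hypothetical service name, e.g. "alfworld"
  replicas="$2"   # desired number of task workers
  echo "docker compose -f extra/docker-compose.yml up -d --scale ${service}=${replicas}"
}

compose_up_scaled alfworld 3
```

Scaling this way can fail if a service sets a fixed `container_name` or publishes a fixed host port; in that case, adjusting the service definition in the compose file may be needed instead.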
+
+If your machine already has Redis (version 7+) running, you can omit the Redis service from the `docker-compose.yml`.
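Before removing the Redis service, it may help to confirm that the existing server meets the version 7+ requirement. The sketch below assumes `redis-cli` is on the `PATH` and that the server listens on the default host and port; `redis_major_ok` is a helper name introduced here for illustration.

```shell
# Succeed if the major component of a dotted version string is >= 7.
redis_major_ok() {
  major="${1%%.*}"                  # "7.2.4" -> "7"
  [ "$major" -ge 7 ] 2>/dev/null    # empty or non-numeric input fails
}

# Query a locally running server, if redis-cli is available.
if command -v redis-cli >/dev/null 2>&1; then
  # The INFO server output includes a line of the form "redis_version:X.Y.Z".
  ver="$(redis-cli INFO server 2>/dev/null | tr -d '\r' | sed -n 's/^redis_version://p')"
  if redis_major_ok "$ver"; then
    echo "Redis $ver is new enough; the Redis service can be omitted."
  else
    echo "No Redis 7+ detected; keep the Redis service in docker-compose.yml."
  fi
fi
```

If the existing server requires authentication or runs on another host, the query above will report no suitable Redis, so the result is best treated as a hint rather than a definitive answer.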
+
+> [!WARNING]
+> Please note that the `webshop` environment requires ~16GB of RAM to start,
+> and the current implementation of `alfworld` leaks memory and disk space until the task worker is restarted.
+> Make sure your machine has sufficient resources before running.
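Since `webshop` needs roughly 16GB of RAM to start, a quick pre-flight check can catch undersized machines. This is a Linux-only sketch that reads `/proc/meminfo`; the helper name `has_enough_ram` and the choice to compare against total (rather than free) memory are assumptions made here, and the other services need memory on top of that.

```shell
# Succeed if a memory size in KiB is at least the required number of GiB.
has_enough_ram() {
  mem_kib="$1"
  need_gib="$2"
  [ "$mem_kib" -ge $((need_gib * 1024 * 1024)) ]
}

# On Linux, MemTotal in /proc/meminfo is reported in kB (KiB).
if [ -r /proc/meminfo ]; then
  total_kib="$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)"
  if has_enough_ram "$total_kib" 16; then
    echo "Total RAM looks sufficient for webshop."
  else
    echo "Less than 16 GiB of RAM: webshop may fail to start."
  fi
fi
```

Machines sold as 16GB usually report slightly less than 16 GiB in `MemTotal`, so a negative result is better read as a warning than as a hard failure.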
+
+### Benchmarking Results
+
+We report the results of various models on the test set of AgentBench FC.
+
+Please see our [Leaderboard](https://docs.google.com/spreadsheets/d/e/2PACX-1vRR3Wl7wsCgHpwUw1_eUXW_fptAPLL3FkhnW_rua0O1Ji_GIVrpTjY5LaKAhwO-WeARjnY_KNw0SYNJ/pubhtml) for full results.
+Please contact [agentbench_fc@googlegroups.com](mailto:agentbench_fc@googlegroups.com) if you have any questions or would like to contribute your results.
VisualAgentBench is designed for evaluating and training visual foundation agents based on large multimodal models (LMMs). We introduce 5 distinct environments spanning
@@ -20,16 +88,9 @@ VisualAgentBench is designed for evaluating and training visual foundation agent

to systematically benchmark 17 LMMs (proprietary & open LMMs). We also provide the trajectory dataset for behavior cloning training on open LMMs for you to develop your own visual foundation agents!

-## 📌Introducing AgentBench v0.2🎉
-
-You are now browsing AgentBench v0.2. If you wish to use the older version, you can revert to [v0.1](https://github.com/THUDM/AgentBench/tree/v0.1).
-
-Based on [v0.1](https://github.com/THUDM/AgentBench/tree/v0.1), we:
+---

-- Updated the framework architecture for easier use and extension
-- Adjusted some task settings
-- Added test results for more models
-- Released the full data for the Dev and Test sets
+The following is the introduction to the original AgentBench (v0.2).