Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios

✨Introduction

The recent trend of using Large Language Models (LLMs) as tool agents in real-world applications underscores the necessity for comprehensive evaluations of their capabilities, particularly in complex scenarios involving planning, creating, and using tools. However, existing benchmarks typically focus on simple synthesized queries that do not reflect real-world complexity, thereby offering limited perspectives in evaluating tool utilization. To address this issue, we present UltraTool, a novel benchmark designed to improve and evaluate LLMs' ability in tool utilization within real-world scenarios. UltraTool focuses on the entire process of using tools - from planning and creating to applying them in complex tasks. It emphasizes real-world complexities, demanding accurate, multi-step planning for effective problem-solving. A key feature of UltraTool is its independent evaluation of planning with natural language, which happens before tool usage and simplifies the task solving by mapping out the intermediate steps. Thus, unlike previous work, it eliminates the restriction of pre-defined toolset during planning. Through extensive experiments on various LLMs, we offer novel insights into the evaluation of capabilities of LLMs in tool utilization, thereby contributing a fresh perspective to this rapidly evolving field.

📂Folders

data/ – Example and test tasks used for quick runs
datasets/ – Full UltraTool dataset and manifest
database/ – Consolidated JSONL file for end‑to‑end experiments
evaluation/ – Scoring scripts and GPT‑4 based evaluators
inference/ – Utilities for running models through OpenRouter or locally
predictions/ – Sample outputs from GPT‑3.5 and GPT‑4
configs/ – Model configuration table
harness/ – Thin OpenRouter client used by the runner
scripts/ – Helper shell scripts
runner.py – Minimal end‑to‑end experiment runner

🛠️Quick start

Installation

Create a Python 3.8+ environment.
Install dependencies:
```
pip install -r requirements.txt
```
Obtain an API key from OpenRouter and set it as an environment variable:
```
export OPENROUTER_API_KEY="your_openrouter_api_key_here"
```

OpenRouter Integration

This project now uses OpenRouter for accessing various LLM APIs, providing access to the latest models from OpenAI, Anthropic, Meta, Mistral, and Google. The inference utilities read the OPENROUTER_API_KEY environment variable, so be sure to set it before running any scripts:

export OPENROUTER_API_KEY="your_openrouter_api_key_here"

Supported Models

The application now supports the following models via OpenRouter:

OpenAI: gpt-3.5, gpt-4, gpt-4o
Anthropic: claude-3-opus, claude-3-sonnet, claude-3-haiku
Meta: llama-3-70b, llama-3-8b
Mistral: mixtral-8x7b
Google: gemini-pro

Running Experiments

The runner.py script executes end-to-end evaluations. It reads tasks from database/consolidated.jsonl by default. Run it as follows:

python runner.py

Use --manifest to specify a different manifest file if needed.

Inference

OpenRouter Models

# For GPT-4o (latest OpenAI model)
python inference/inference_openai.py --model gpt-4o --language en --tasks ['planning', 'tool_usage_awareness', 'tool_creation', 'tool_usage', 'tool_creation_awareness', 'tool_selection']

# For Claude 3 Opus (Anthropic's most capable model)
python inference/inference_openai.py --model claude-3-opus --language en --tasks ['planning', 'tool_usage_awareness', 'tool_creation', 'tool_usage', 'tool_creation_awareness', 'tool_selection']

# For Llama 3 70B (Meta's large model)
python inference/inference_openai.py --model llama-3-70b --language en --tasks ['planning', 'tool_usage_awareness', 'tool_creation', 'tool_usage', 'tool_creation_awareness', 'tool_selection']

# For GPT-3.5 (cost-effective option)
python inference/inference_openai.py --model gpt-3.5 --language en --tasks ['planning', 'tool_usage_awareness', 'tool_creation', 'tool_usage', 'tool_creation_awareness', 'tool_selection']

# For Mixtral 8x7B (Mistral's mixture of experts)
python inference/inference_openai.py --model mixtral-8x7b --language en --tasks ['planning', 'tool_usage_awareness', 'tool_creation', 'tool_usage', 'tool_creation_awareness', 'tool_selection']

The inference results will be saved in predictions and we have provided the inference results of GPT-3.5 and GPT-4 in predictions.

Open-source LLMs

We choose ChatGLM3 as an example to explain the whole process. To evaluate the ChatGLM3 model using the UltraTool benchmarks, follow these steps:

1. Download model and set the model path

For instance, when employing ChatGLM3, acquire the model from Hugging Face. Subsequently, navigate to the inference/inference_ultraltool.py script and assign the appropriate value to args.model_path by specifying <path_to_your_local_chatglm_model>. Ensure to replace <path_to_your_local_chatglm_model> with the precise directory path leading to your ChatGLM3 model.

2. Update the `run.sh` Script

Location: scripts/run.sh
Modification: Change the model type to chatglm.

At line 9 in run.sh, modify the model_types array:

model_types=(chatglm)

3. Execute the Evaluation

Run the run.sh script:

bash scripts/run.sh

4. Accessing the Results

After running the script, you can find the inference results of ChatGLM3 for English-dataset on all tasks in the UltraTool benchmarks. The results are located under:

predictions/English-dataset/chatglm

Evaluation

Planning

The evaluation of planning relies on GPT-4, with the evaluation pipeline structured as follows:

# For GPT-4o
python evaluation/inference_planning_eval.py --models ['gpt-4o'] --language en
python evaluation/planning.py --models ['gpt-4o'] --language en

# For Claude 3 Opus
python evaluation/inference_planning_eval.py --models ['claude-3-opus'] --language en
python evaluation/planning.py --models ['claude-3-opus'] --language en

# For multiple models comparison
python evaluation/inference_planning_eval.py --models ['gpt-4o', 'claude-3-opus', 'llama-3-70b'] --language en
python evaluation/planning.py --models ['gpt-4o', 'claude-3-opus', 'llama-3-70b'] --language en

Tool creation

Similar to planning evaluation, the evaluation of tool creation also depends on GPT-4. However, it requires an additional post-processing step. The evaluation pipeline is as follows:

# For GPT-4o
python evaluation/post_process_tool_creation.py --models ['gpt-4o'] --language en
python evaluation/inference_tool_creation_eval.py --models ['gpt-4o'] --language en
python evaluation/tool_creation.py --models ['gpt-4o'] --language en

# For Claude 3 Sonnet (balanced performance/cost)
python evaluation/post_process_tool_creation.py --models ['claude-3-sonnet'] --language en
python evaluation/inference_tool_creation_eval.py --models ['claude-3-sonnet'] --language en
python evaluation/tool_creation.py --models ['claude-3-sonnet'] --language en

# For Gemini Pro
python evaluation/post_process_tool_creation.py --models ['gemini-pro'] --language en
python evaluation/inference_tool_creation_eval.py --models ['gemini-pro'] --language en
python evaluation/tool_creation.py --models ['gemini-pro'] --language en

Tool creation awareness

# For GPT-4o
python evaluation/tool_creation_awareness.py --models ['gpt-4o'] --language en

# For Claude 3 models
python evaluation/tool_creation_awareness.py --models ['claude-3-opus'] --language en

Tool usage awareness

# For GPT-4o
python evaluation/tool_usage_awareness.py --models ['gpt-4o'] --language en

# For Llama 3 8B (efficient option)
python evaluation/tool_usage_awareness.py --models ['llama-3-8b'] --language en

# For Claude 3 Haiku (fastest Claude model)
python evaluation/tool_usage_awareness.py --models ['claude-3-haiku'] --language en

Tool selection

# For GPT-4o
python evaluation/tool_selection.py --models ['gpt-4o'] --language en

# For Mixtral 8x7B
python evaluation/tool_selection.py --models ['mixtral-8x7b'] --language en

Tool usage

# For GPT-4o
python evaluation/tool_usage.py --models ['gpt-4o'] --language en

# For multiple models comparison
python evaluation/tool_usage.py --models ['gpt-4o', 'claude-3-opus', 'llama-3-70b'] --language en

📈Benchmark Results

Generating the final report

After running evaluations, consolidate results in reports/results.csv and generate the Markdown summary:

make report

This command runs scripts/render_report.py to render reports/README.md from the CSV and the template at templates/summary.md.j2. The analysis notebook (analysis/analyse.ipynb) loads the same CSV to create the plots referenced in the report.

License

This project is released under the MIT License.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios

✨Introduction

📂Folders

🛠️Quick start

Installation

OpenRouter Integration

Supported Models

Running Experiments

Inference

OpenRouter Models

Open-source LLMs

1. Download model and set the model path

2. Update the `run.sh` Script

3. Execute the Evaluation

4. Accessing the Results

Evaluation

Planning

Tool creation

Tool creation awareness

Tool usage awareness

Tool selection

Tool usage

📈Benchmark Results

Generating the final report

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 44 Commits
analysis		analysis
configs		configs
data		data
database		database
datasets/ultra_swarm		datasets/ultra_swarm
evaluation		evaluation
examples		examples
figures		figures
harness		harness
inference		inference
predictions		predictions
reports		reports
scripts		scripts
templates		templates
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
eval_metrics.py		eval_metrics.py
plan.md		plan.md
requirements.txt		requirements.txt
runner.py		runner.py

Folders and files

Latest commit

History

Repository files navigation

Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios

✨Introduction

📂Folders

🛠️Quick start

Installation

OpenRouter Integration

Supported Models

Running Experiments

Inference

OpenRouter Models

Open-source LLMs

1. Download model and set the model path

2. Update the run.sh Script

3. Execute the Evaluation

4. Accessing the Results

Evaluation

Planning

Tool creation

Tool creation awareness

Tool usage awareness

Tool selection

Tool usage

📈Benchmark Results

Generating the final report

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

2. Update the `run.sh` Script

Packages