This repo is for CityBench: Evaluating the Capabilities of Large Language Model for Urban Tasks
TL;DR: We propose a simulator based global scale benchmark to evaluate the performance of large language models on various urban tasks.
In this paper, we design CityBench, an interactive simulator based evaluation platform, as the first systematic benchmark for evaluating the capabilities of LLMs for diverse tasks in urban research. First, we build CityData to integrate the diverse urban data and CitySimu to simulate fine-grained urban dynamics. Based on CityData and CitySimu, we design 8 representative urban tasks in 2 categories of perception-understanding and decision-making as the CityBench. With extensive results from 30 well-known LLMs and VLMs in 13 cities around the world, we find that advanced LLMs and VLMs can achieve competitive performance in diverse urban tasks requiring commonsense and semantic understanding abilities.
The framework of global evaluation benchmark CityBench, which consists of a simulator CitySimu and 8 representative urban tasks. We can select any city around the world to automatically build new benchmark for it.
Currently, the following cities are supported.
World | Cities | Visual Data | GeoSpatial Data | Human Activity Data | |||
---|---|---|---|---|---|---|---|
Satellite Image |
Street View (sampling) |
Roads | PoI/AoIs | OD flow (10) |
Checkins | ||
Asia | Beijing | 1764 | 7482 | 17043 | 276090 | 1905025 | 21015 |
Shanghai | 5925 | 4170 | 33321 | 57731 | 845188 | 33129 | |
Mumbai | 638 | 6025 | 6296 | 60245 | 309147 | 31521 | |
Tokyo | 1120 | 5514 | 33174 | 1146094 | 969865 | 1044809 | |
Europe | London | 1710 | 4148 | 14418 | 83892 | 1401404 | 173268 |
Paris | 238 | 6044 | 4443 | 21950 | 28362 | 85679 | |
Moscow | 1558 | 5761 | 9850 | 28289 | 979064 | 836313 | |
Americas | NewYork | 320 | 3934 | 5414 | 349348 | 71705 | 390934 |
SanFrancisco | 345 | 4473 | 4171 | 73777 | 61367 | 100249 | |
SaoPaulo | 1332 | 5184 | 28714 | 1681735 | 311830 | 808754 | |
Africa | Nairobi | 336 | 5987 | 2972 | 264101 | 135332 | 25727 |
CapeTown | 896 | 5175 | 5947 | 151711 | 525578 | 11591 | |
Oceania | Sydney | 1935 | 5087 | 21390 | 141997 | 438763 | 54170 |
- citybench # evaluation codes
- citysim # codes for running simulation
- citydata # data used for evaluation
- serving # codes for deploy LLM and MLLM
- results # records of evaluation results
- config.py # global variables in project
- evaluate.py # main evaluation function
Install Python dependencies.
conda create -n citybench python==3.10
pip install -r requirements.txt
For using LLM/VLM API, you need to set API Key as follows
export OpenAI_API_KEY = "" # For OpenAI GPT3.5, GPT4, GPT4o
export DASHSCOPE_API_KEY = "" # For QwenVL
export DeepInfra_API_KEY = "" # For LLama3, Gemma, Mistral
export SiliconFlow_API_KEY = "" # For InternLM or Qwen
Besides, we use vllm for local LLM deployment and VLMEvalKit for VLM deployment.
We provide the CityData dataset for the existing 13 cities respectively. To access the dataset, please refer to CityData-huggingface.
If you want to construct a dataset for new cities, please follow the instruction below:
Please first navigate to the CityBench
directory by using the cd command: cd CityBench
We provide a script for generating maps related to new cities in CitySimu. You need to first define the latitude and longitude range for a city's area, and then run the following command.
python -m citysim.build_map --city_name=Paris --min_lon=2.249 --max_lon=2.4239 --min_lat=48.8115 --max_lat=48.9038 --workers=20
There are some parameters that need to be explained:
min_lon
,max_lon
,min_lat
, andmax_lat
refer to the latitude and longitude range of a specific city area you have defined. Additionally, the latitude and longitude ranges for the 13 existing cities can be found inconfig.py
.sworkers
refers to the number of workers for multiprocessing.
When the corresponding map for a new city has been generated, please update the relevant parameter settings in config.py
.
The tasks Geolocalization and Outdoor Navigation both require street images, and their downloading methods are the same. Please refer to the following instructions.
First, generate points of latitude and longitude in the city you want to obtain street view images.
# For task Geolocalization
python -m citybench.street_view.Randompoints_Gen
# For task Outdoor Navigation
python -m citybench.outdoor_navigation.sample_points_gen
Then, scrape street view images. Since the method for downloading street view images is the same, here we take Outdoor Navigation as an example. If you want to download images for the Geolocalization task, you only need to change the paths for saving the images and results.
For Chinese cities such as Beijing and Shanghai, street view images need to be obtained through Baidu. For other cities, they can be obtained through Google. Please first set the Baidu/Google API Key, and then run the script below.
# Set Baidu API key
export BAIDU_KEY = ""
# Scrape images through Baidu
python -m citybench.outdoor_navigation.crawler_baidu --city_name=SanFrancisco
# Set Google API key
export GOOGLE_API_KEY = ""
# Scrape images through Google
python -m citybench.outdoor_navigation.crawler_google --city_name=SanFrancisco --multi_process_num=20 --index=0 --total_points=50
# Stitch Google images
python -m citybench.outdoor_navigation.stitch_image_patches --city_name=SanFrancisco --multi_process_num=20 --image_size=512 --out_image_size_width=512 --out_image_size_height=512
Here are some new parameters to introduce:
multi_process_num
: Refers to the number of threads for multiprocessing. Ensure that it does not exceed the number of CPU cores to avoid overloading the system.index
: Indicates the starting group index for downloading. The script will begin processing from the specified group.total_points
: Specifies the number of points to download within each group. This determines the total number of locations to process for every group.IMG_SIZE
: The dimensions (width/height) of the originally downloaded images.OUT_IMG_SIZE_W
: The width of the final output panoramic image.OUT_IMG_SIZE_H
: The height of the final output panoramic image.
It's necessary to generate URLs for all street view images required for the Outdoor Navigation task. Place a file named url_mapping.csv
in the location citydata/outdoor_navigation_tasks/NEW_StreetView_Images_CUT/
, containing two columns: image_name
and image_url
.
You can refer to the citybench/remote_sensing/download_rs_img.py
script to download satellite images. Please add the information of the city you want to download into the script first, and then run the following command.
python -m citybench.remote_sensing.download_rs_img --city_name=Beijing
We provide the scripts to generate the evaluation dataset for each task. The following command needs to be executed to generate the evaluation dataset. You can obtain the dataset as the following examples:
# For task GeoQA, Mobility Prediction, Outdoor Navigation, Traffic Signal
python -m citybench.geoqa.data_gen --city_name=Tokyo
# For task Urban Exploration
python -m citybench.urban_exploration.eval --city_name=Tokyo --model_name=LLama3-8B --mode=gen
# For task Poplulation and Objects
python -m citybench.remote_sensing.prepare_image_and_pop --city_name=Tokyo
# For task Geolocalization
python -m citybench.street_view.Build_StreetView_List --city_name=Tokyo
Once the above command is executed, you can find the generated evaluation dataset in the citydata
folder.
Supported models, tasks, cities can refer to config.py. The following is an example evaluation for a specific task, model, and city.
python -m citybench.traffic_signal.run_eval --city_name=London --model_name=LLama3-8B --data_name=mini
data_name
refers to the size of the evaluation dataset, where 'mini' generally represents 10% of the data used in 'all'. After the evaluation code finishes running, you can find the model's response records and metrics in the corresponding folder underresults
. For further metric statistics, please refer to the next step.
We have prepared a metrics.py
file for each task to calculate the models' results on urban tasks in different cities. You can statistical the model results of one task by following the below example.
python -m citybench.mobility_prediction.metrics
The summary results file will appear in the results folder of the corresponding task, e.g., results/prediction_results/mobility_benchmark_result.csv
Evaluate multiple models, tasks, and cities, and output the statistical results.
# The unified portal directly evaluates all tasks
python -m evalaute --model_name=GPT4o,MiniCPM-Llama3-V-2_5 --task_name=geoqa,mobility --city_name=NewYork,Paris,Beijing --data_name=mini
- CityData for
Outdoor Navigation
andTraffic Signal
tasks in some missing cities - Model refusal analysis
- Automated quality control process improvement (LLM-as-judge)
- CityBench-Hard subset with human annotation
- Improvement of
Mobility Prediction
task. More information can be found at AgentMove.
If you find this work helpful, please cite our paper.
@article{Feng2024CityBench,
title={CityBench: Evaluating the Capabilities of Large Language Models for Urban Tasks},
author={Jie Feng, Jun Zhang, Tianhui Liu, Xin Zhang, Tianjian Ouyang, Junbo Yan, Yuwei Du, Siqi Guo, Yong Li},
journal={ArXiv},
year={2024},
primaryClass={cs.AI},
volume={abs/2406.13945},
url={https://arxiv.org/abs/2406.13945}
}
We appreciate the following GitHub repos a lot for their valuable code and efforts.
- https://github.com/xlwang233/LLM-Mob for mobility prediction
- https://github.com/usail-hkust/LLMTSCS for traffic signal control
- https://github.com/THUDM/AgentTuning for multi-choice evaluation
If you have any questions or want to use the code, feel free to contact: Jie Feng (fengjie@tsinghua.edu.cn)