Towards Audio-visual Navigation in Noisy Environments: A Large-scale Benchmark Dataset and An Architecture Considering Multiple Sound-Sources
[AAAI 2025]
Zhanbo Shi · Lin Zhang* · Linfei Li · Ying Shen
Audio-visual navigation has received considerable attention in recent years. However, the majority of related investigations have focused on single sound-source scenarios. Studies of multiple sound-source scenarios remain underexplored due to two limitations. First, the existing audio-visual navigation dataset contains only a limited number of audio samples, making it difficult to simulate diverse multiple sound-source environments. Second, existing navigation frameworks are mainly designed for single sound-source scenarios, so their performance degrades severely in multiple sound-source scenarios. To fill these two research gaps to some extent, we establish a large-scale audio dataset named BeDAViN and propose a new embodied navigation framework called ENMuS3.
If you use this project in your research, please cite the following paper:
Coming Soon
The appendix of our paper can be found here.
This project is developed with Python 3.9 on Ubuntu 22.04. If you are using miniconda or anaconda, you can create an environment with the following instructions.
conda create -n enmus python=3.9 cmake=3.14.0 -y
conda activate enmus
ENMuS3 uses Habitat and SoundSpaces for robot simulation and audio synthesis.
First, install Habitat-Sim v0.2.2 and Habitat-Lab v0.2.2 with the following commands.
git clone https://github.com/facebookresearch/habitat-sim.git
cd habitat-sim
git checkout RLRAudioPropagationUpdate
python setup.py install --headless --audio
git clone https://github.com/facebookresearch/habitat-lab.git
cd habitat-lab
git checkout v0.2.2
pip install -e .
Edit the habitat/tasks/rearrange/rearrange_sim.py file and remove line 36, where FetchRobot is imported.
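If you prefer to script this change, a sed one-liner such as the sketch below comments the import out in place. It assumes the FetchRobot import still sits on line 36 of the v0.2.2 checkout, so verify the line number before running.
sed -i '36s/^/# /' habitat/tasks/rearrange/rearrange_sim.py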
Then install SoundSpaces following the instructions here.
git clone https://github.com/facebookresearch/sound-spaces.git
cd sound-spaces
pip install -e .
Now you can install ENMuS3:
git clone https://github.com/ZhanboShiAI/ENMuS.git
cd ENMuS
pip install -e .
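As an optional sanity check, the following command should exit without errors if all three packages were installed into the active environment (the module names habitat_sim, habitat, and soundspaces are the ones the upstream repositories install; adjust if your versions differ):
python -c "import habitat_sim, habitat, soundspaces; print(habitat_sim.__version__)"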
To use ENMuS3, you first need to prepare the following data.
Scene Dataset from Matterport3D
Matterport3D (MP3D) scene reconstructions are used. The official Matterport3D download script can be accessed by following the instructions on their project webpage. After the scene dataset has been downloaded, you can use the cache_observations.py
script from SoundSpaces to generate cached observations.
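For example, the cached observations can be generated from the sound-spaces checkout roughly as follows; the exact invocation may require additional config or dataset arguments depending on your SoundSpaces version, so check the script before running.
cd sound-spaces
python scripts/cache_observations.py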
Room Impulse Response Dataset from SoundSpaces
The room impulse response dataset is used for audio synthesis. You can download it by following the instructions here.
Audio Dataset
The audio dataset used in this project comes from three sources: our manually recorded audio files, audio samples selected from AudioSet, and audio samples from Freesound. The processed audio wav files are saved under the data directory of this GitHub repository. The raw 24-bit audio recordings, recorded at a sampling rate of 96,000 Hz, can be found here. For information about the original audio clips from AudioSet and Freesound, please refer to this document.
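As a quick sanity check of the bundled audio data, the loop below counts the processed clips per split. It assumes the sounds directory follows the layout shown in the next section and that it is run from the repository root.
# count processed wav files in each split directory
for split in train val test; do
    printf '%s: ' "$split"
    find data/sounds/sound_event_splits -path "*/$split/*.wav" | wc -l
done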
Data Folder Structure
data
├── binaural_rirs                      # binaural RIRs of 2 channels
│   └── [dataset]
│       └── [scene]
│           └── [angle]                # azimuth angle of agent's heading in mesh coordinates
│               └── [receiver]-[source].wav
├── datasets                           # stores datasets of episodes of different splits
│   └── [dataset]
│       └── [version]
│           └── [split]
│               ├── [split].json.gz
│               └── content
│                   └── [scene].json.gz
├── metadata                           # stores metadata of environments
│   └── [dataset]
│       └── [scene]
│           ├── point.txt              # coordinates of all points in mesh coordinates
│           └── graph.pkl              # points are pruned to a connectivity graph
├── pretrained_weights                 # saved pretrained weights for the model
│   └── semantic_audionav
│       └── enmus
│           └── enmus_best_val.pth
├── scene_datasets                     # scene mesh datasets
│   └── [dataset]
│       └── [scene]
│           └── [scene].house (habitat/mesh_semantic.glb)
├── scene_observations                 # pre-rendered scene observations
│   └── [dataset]
│       └── [scene].pkl                # dictionary is in the format of {(receiver, rotation): sim_obs}
└── sounds                             # stores all sounds
    └── sound_event_splits
        ├── noise                      # stores sounds for background noise simulation
        └── sound_event                # stores sounds for sound event simulation
            ├── test                   # splits
            ├── train
            └── val
                └── [sound_id].wav
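A small shell loop like the one below (directory names taken from the layout above) can confirm that the expected top-level folders are in place before training; pretrained_weights is only required if you evaluate with the released checkpoint.
# check that each expected data subdirectory exists
for d in binaural_rirs datasets metadata pretrained_weights scene_datasets scene_observations sounds; do
    [ -d "data/$d" ] && echo "ok      data/$d" || echo "missing data/$d"
done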
Below we show some example commands for training and evaluating ENMuS3 on Matterport3D in multi-source scenarios.
- Training
python sen_baselines/enmus/run.py --exp-config sen_baselines/enmus/config/multi_source/enmus.yaml --decoder-type MSMT --model-dir data/models/mp3d/enmus_multi_source
- Validation (evaluate every 10 checkpoints and generate a validation curve; a note on viewing the curve with TensorBoard follows these commands)
python sen_baselines/enmus/run.py --exp-config sen_baselines/enmus/config/multi_source/enmus_eval.yaml --run-type eval --decoder-type MSMT --model-dir data/models/mp3d/enmus_multi_source --eval-interval 10
- Test the best validation checkpoint based on the validation curve
python sen_baselines/enmus/run.py --exp-config sen_baselines/enmus/config/multi_source/enmus_test.yaml --run-type eval --decoder-type MSMT --model-dir data/models/mp3d/enmus_multi_source EVAL_CKPT_PATH_DIR data/models/mp3d/enmus_multi_source/data/ckpt.XXX.pth
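To inspect the training and validation curves, TensorBoard can be pointed at the model directory. This sketch assumes the trainer follows the SoundSpaces convention of writing logs to a tb subfolder of --model-dir; adjust the path if your config places them elsewhere.
tensorboard --logdir data/models/mp3d/enmus_multi_source/tb --port 6006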
The codebase of this project and our manually collected audio recordings are CC-BY-4.0 licensed, as found in the LICENSE file. The audio samples from AudioSet and Freesound are mainly under the CC-BY-4.0 license. Detailed license information for these audio samples can be found under data/sounds/metadata/.
The trained RL policy models and task datasets are considered data derived from the MP3D scene dataset. Matterport3D-based task datasets and trained models are distributed under the Matterport3D Terms of Use and the CC BY-NC-SA 3.0 US license.