dataset: Add INatSounds#4346

Open
isaac-chung wants to merge 2 commits into main from add-inat-sounds-task

Conversation

Collaborator

@isaac-chung isaac-chung commented Apr 3, 2026

New Dataset: INatSounds

Adds the iNaturalist Sounds dataset for audio-based species classification.

Dataset details

The dataset contains recordings of 5,500+ species across birds, insects, amphibians, mammals, and reptiles, contributed by 27,000+ citizen scientists. The test split contains 49,527 recordings from 2023 observations.

Checklist

  • Explain how this dataset fills a gap in MTEB
    • First large-scale bioacoustics species classification benchmark in MTEB
  • Verify the dataset runs with the mteb package
  • Test with sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  • Test with intfloat/multilingual-e5-small
  • Confirm performance isn't trivial or random
  • Complete all TaskMetadata fields
  • Place task in mteb/tasks directory
  • Import in __init__.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Samoed Samoed added the "new dataset" and "audio" labels on Apr 4, 2026
Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment

would be great to run at least one model on it to make sure that everything works as intended, but otherwise it looks good

It also seems like we are missing the descriptive stats

Contributor

I don't believe we need this - I would rather put it in the repo with the dataset

Collaborator Author

The challenge with this set is the train blob is 100+GB and I struggle to download, unzip, then sample it on my machine.

Collaborator Author

We need some training data for classification

Comment thread: mteb/tasks/classification/zxx/inat_sounds.py
@KennethEnevoldsen KennethEnevoldsen changed the title from "Add dataset: INatSounds" to "dataset: Add INatSounds" on Apr 5, 2026
@isaac-chung
Collaborator Author

I ran a quick test with Yamnet and got about 0.01 as the main score - maybe species level is quite hard for it. Wonder how other models perform on this.
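For context on how low 0.01 is, here is a hedged back-of-envelope check; the 5,569 species count is taken from the taxonomy breakdown discussed later in this thread:

```python
# Back-of-envelope chance baseline for species-level classification.
# 5,569 is the number of distinct species reported in this thread;
# a uniform random guesser would score roughly 1/5569.
n_species = 5_569
chance_accuracy = 1 / n_species
print(f"chance accuracy ~ {chance_accuracy:.5f}")

# An observed main score of ~0.01 is therefore well above chance,
# even though it is very low in absolute terms.
print(f"score / chance ~ {0.01 / chance_accuracy:.0f}x")
```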

@KennethEnevoldsen
Contributor

Hmm, yeah, that seems very low - we talked about also implementing a domain-specific model - might be better to test with one of those.

@isaac-chung
Collaborator Author

Yeah might look into one of these: OpenBEATs - a shikhar7ssu Collection https://share.google/nJLukTgjGY8mWEgZD

Use cross-validation on train split and extract shared bibtex constant.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
"count": 1
},
"Aramus guarauna": {
"count": 1
Contributor

We only have one label per class? I think this is due to the downsample - I don't think that makes a lot of sense there.

Collaborator Author

Likely due to the downsample, yeah. The full set is 49k rows and needs 30+GB to load.

Contributor

We could still make it smaller, but probably good to keep it large enough to be meaningful.

How about:
a stratified subsample of the test set from 49,527 down to 2 recordings per unique label (so about 10k), and then ensure that we have 10 training samples per unique label, so around 50-60k?

I believe currently it would sample the entire dataset and re-embed it every time during evaluation. This could be optimized (e.g. by sampling all the IDs, then encoding only the required documents once), but it is a problem in the task, not the dataset. In this case, to improve runtime we could reduce the number of experiments to 1. I suspect that with the current setting all 10 experiments are essentially the same anyway.
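The stratified-subsample idea can be sketched roughly like this (a minimal illustration, not the code in this PR; the `species` field name and the helper are hypothetical):

```python
import random
from collections import defaultdict

def stratified_subsample(rows, label_key, per_label, seed=42):
    """Keep at most `per_label` examples per unique label.

    Hypothetical helper - the actual PR may downsample differently.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    sampled = []
    for _, group in sorted(by_label.items(), key=lambda kv: kv[0]):
        rng.shuffle(group)
        sampled.extend(group[:per_label])
    return sampled

# Toy data: 3 species with 4 recordings each.
rows = [{"id": i, "species": f"species_{i % 3}"} for i in range(12)]
subset = stratified_subsample(rows, "species", per_label=2)
print(len(subset))  # 3 labels x 2 each -> 6
```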

Collaborator Author

Also the current label is at the most granular species level. Maybe we want to explore something higher level?

| Level         | Distinct | Range           |
|---------------|----------|-----------------|
| supercategory | 5        | too few         |
| order         | 62       | good candidate? |
| family        | 356      | good candidate? |
| genus         | 2,075    | over 2k         |
| species       | 5,569    | over 2k         |
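The distinct counts above could be reproduced from per-recording taxonomy metadata along these lines (toy records; the field names are assumptions about the dataset schema, not the actual iNatSounds layout):

```python
# Count distinct labels at each taxonomy level.
# Toy records standing in for real per-recording taxonomy metadata.
records = [
    {"order": "Passeriformes", "family": "Turdidae", "species": "Turdus migratorius"},
    {"order": "Passeriformes", "family": "Corvidae", "species": "Corvus corax"},
    {"order": "Anura", "family": "Hylidae", "species": "Hyla cinerea"},
]

for level in ("order", "family", "species"):
    distinct = {r[level] for r in records}
    print(level, len(distinct))
```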

Contributor

This could be optimized (e.g. by sampling all the IDs, then encoding then encoding only the required documents once)

I am actually not entirely sure about this part - it seems that there is a cache in place that re-uses embeddings, so I don't think there is an issue. On that point, I still believe that due to the large number of unique labels we end up sampling the whole dataset (we can check this if you have the results file).

Contributor

Also the current label is at the most granular species level. Maybe we want to explore something higher level?

Ahh yeah, that could be a way out. Feels like hierarchical classification. I would probably suggest family, genus, species, but let me ask a friend who is a biologist about what makes sense here.

I am probably leaning toward species still though

Contributor

Talked with my biologist friend - she argued that it makes no sense to predict further up the hierarchy than species, or maybe genus.

She did, however, suggest downsampling the number of species to e.g. 2k - I agree that this is probably the best approach. We could also split it into 5 datasets, one for each supercategory (birds, insects, amphibians, mammals, reptiles).
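Splitting into one dataset per supercategory could be sketched as follows (hypothetical helper and field name, not code from this PR):

```python
from collections import defaultdict

def split_by(rows, key):
    """Group rows into one subset per distinct value of `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)

# Toy rows; the real dataset would carry one supercategory per recording.
rows = [
    {"id": 0, "supercategory": "Birds"},
    {"id": 1, "supercategory": "Insects"},
    {"id": 2, "supercategory": "Birds"},
]
subsets = split_by(rows, "supercategory")
print(sorted(subsets))        # ['Birds', 'Insects']
print(len(subsets["Birds"]))  # 2
```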

Collaborator Author

Makes sense to downsample species. Thanks for asking!

Contributor

No problem - I also heard that a large majority of the dataset is birds (I think ~4k)
