dataset: Add INatSounds#4346

Open
isaac-chung wants to merge 2 commits into main from add-inat-sounds-task

Conversation

Collaborator

@isaac-chung isaac-chung commented Apr 3, 2026

New Dataset: INatSounds

Adds the iNaturalist Sounds dataset for audio-based species classification.

Dataset details

The dataset contains recordings of 5,500+ species across birds, insects, amphibians, mammals, and reptiles, contributed by 27,000+ citizen scientists. The test split contains 49,527 recordings from 2023 observations.

Checklist

  • Explain how this dataset fills a gap in MTEB
    • First large-scale bioacoustics species classification benchmark in MTEB
  • Verify the dataset runs with the mteb package
  • Test with sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  • Test with intfloat/multilingual-e5-small
  • Confirm performance isn't trivial or random
  • Complete all TaskMetadata fields
  • Place task in mteb/tasks directory
  • Import in __init__.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Samoed Samoed added the "new dataset" and "audio" labels on Apr 4, 2026
Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment

would be great to run at least one model on it to make sure that everything works as intended, but otherwise it looks good

It also seems like we are missing the descriptive stats

Contributor

I don't believe we need this - I would rather put it in the repo with the dataset

Collaborator Author

The challenge with this set is the train blob is 100+GB and I struggle to download, unzip, then sample it on my machine.

Collaborator Author

We need some training data for classification

Comment thread: mteb/tasks/classification/zxx/inat_sounds.py
@KennethEnevoldsen KennethEnevoldsen changed the title from "Add dataset: INatSounds" to "dataset: Add INatSounds" on Apr 5, 2026
@isaac-chung
Collaborator Author

I ran a quick test with Yamnet and got about 0.01 as the main score - maybe species level is quite hard for it. Wonder how other models perform on this.
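For context on how low 0.01 is, here is a hedged back-of-envelope check; the 5,569 species count is taken from the taxonomy breakdown discussed later in this thread:

```python
# Back-of-envelope chance baseline for species-level classification.
# 5,569 is the number of distinct species reported in this thread;
# a uniform random guesser would score roughly 1/5569.
n_species = 5_569
chance_accuracy = 1 / n_species
print(f"chance accuracy ~ {chance_accuracy:.5f}")

# An observed main score of ~0.01 is therefore well above chance,
# even though it is very low in absolute terms.
print(f"score / chance ~ {0.01 / chance_accuracy:.0f}x")
```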

@KennethEnevoldsen
Contributor

Hmm, yeah, that seems very low - we talked about also implementing a domain-specific model - might be better to test with one of those.

@isaac-chung
Collaborator Author

Yeah might look into one of these: OpenBEATs - a shikhar7ssu Collection https://share.google/nJLukTgjGY8mWEgZD

Use cross-validation on train split and extract shared bibtex constant.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
"count": 1
},
"Aramus guarauna": {
"count": 1
Contributor

We only have one label per class? I think this is due to the downsample - I don't think that makes a lot of sense there.

Collaborator Author

Likely due to the downsample, yeah. The full set is 49k rows and needs 30+GB to load.

Contributor

We could still make it smaller, but probably good to keep it large enough to be meaningful.

How about:
a stratified subsample of the test set from 49,527 down to 2 recordings per unique label (so about 10k), and then ensure that we have 10 training samples per unique label, so around 50-60k?

I believe currently it would sample the entire dataset and re-embed it every time during evaluation. This could be optimized (e.g. by sampling all the IDs, then encoding only the required documents once), but it is a problem in the task, not the dataset. In this case, to improve runtime we could reduce the number of experiments to 1. I suspect that with the current setting all 10 experiments are essentially the same anyway.
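The stratified-subsample idea can be sketched roughly like this (a minimal illustration, not the code in this PR; the `species` field name and the helper are hypothetical):

```python
import random
from collections import defaultdict

def stratified_subsample(rows, label_key, per_label, seed=42):
    """Keep at most `per_label` examples per unique label.

    Hypothetical helper - the actual PR may downsample differently.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    sampled = []
    for _, group in sorted(by_label.items(), key=lambda kv: kv[0]):
        rng.shuffle(group)
        sampled.extend(group[:per_label])
    return sampled

# Toy data: 3 species with 4 recordings each.
rows = [{"id": i, "species": f"species_{i % 3}"} for i in range(12)]
subset = stratified_subsample(rows, "species", per_label=2)
print(len(subset))  # 3 labels x 2 each -> 6
```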

Collaborator Author

Also the current label is at the most granular species level. Maybe we want to explore something higher level?

| Level         | Distinct | Range           |
|---------------|----------|-----------------|
| supercategory | 5        | too few         |
| order         | 62       | good candidate? |
| family        | 356      | good candidate? |
| genus         | 2,075    | over 2k         |
| species       | 5,569    | over 2k         |
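The distinct counts above could be reproduced from per-recording taxonomy metadata along these lines (toy records; the field names are assumptions about the dataset schema, not the actual iNatSounds layout):

```python
# Count distinct labels at each taxonomy level.
# Toy records standing in for real per-recording taxonomy metadata.
records = [
    {"order": "Passeriformes", "family": "Turdidae", "species": "Turdus migratorius"},
    {"order": "Passeriformes", "family": "Corvidae", "species": "Corvus corax"},
    {"order": "Anura", "family": "Hylidae", "species": "Hyla cinerea"},
]

for level in ("order", "family", "species"):
    distinct = {r[level] for r in records}
    print(level, len(distinct))
```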

Contributor

This could be optimized (e.g. by sampling all the IDs, then encoding then encoding only the required documents once)

I am actually not entirely sure about this part - it seems that there is a cache in place that re-uses embeddings, so I don't think there is an issue. On that point, I still believe that due to the large number of unique labels we end up sampling the whole dataset (we can check this if you have the results file).

Contributor

Also the current label is at the most granular species level. Maybe we want to explore something higher level?

Ahh yeah, that could be a way out. Feels like hierarchical classification. I would probably suggest family, genus, species, but let me ask a friend who is a biologist about what makes sense here.

I am probably leaning toward species still though

Contributor

Talked with my biologist friend - she argued that it makes no sense to predict further up the hierarchy than species, or maybe genus.

She did, however, suggest downsampling the number of species to e.g. 2k - I agree that this is probably the best approach. We could also split it into 5 datasets, one for each supercategory (birds, insects, amphibians, mammals, reptiles).
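Splitting into one dataset per supercategory could be sketched as follows (hypothetical helper and field name, not code from this PR):

```python
from collections import defaultdict

def split_by(rows, key):
    """Group rows into one subset per distinct value of `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)

# Toy rows; the real dataset would carry one supercategory per recording.
rows = [
    {"id": 0, "supercategory": "Birds"},
    {"id": 1, "supercategory": "Insects"},
    {"id": 2, "supercategory": "Birds"},
]
subsets = split_by(rows, "supercategory")
print(sorted(subsets))        # ['Birds', 'Insects']
print(len(subsets["Birds"]))  # 2
```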

Collaborator Author

Makes sense to downsample species. Thanks for asking!

Contributor

No problem - I also heard that a large majority of the dataset is birds (I think ~4k)
