Conversation
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
I don't believe we need this - I would rather put it in the repo with the dataset
The challenge with this set is the train blob is 100+GB and I struggle to download, unzip, then sample it on my machine.
We need some training data for classification
I ran mini on YAMNet and got like 0.01 as the main score - maybe species level is quite hard for it. Wonder how other models perform on this.

hmm, yeah that seems very low - we talked about also implementing a domain-specific model - might be better to test with one of those

Yeah, might look into one of these: OpenBEATs - a shikhar7ssu Collection https://share.google/nJLukTgjGY8mWEgZD
Use cross-validation on train split and extract shared bibtex constant. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
    "count": 1
},
"Aramus guarauna": {
    "count": 1
We only have one label per class? I think this is due to the downsample - I don't think that makes a lot of sense there.
Likely due to the downsample, yeah. The full set is 49k rows and needs 30+GB to load.
We could still make it smaller, but probably good to keep it large enough to be meaningful.
How about:
a stratified subsample of the test set, down from 49,527 to 2 samples per unique label (so about 10k), and then ensure that the train split has 10 samples per unique label, so around 50-60k?
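The per-label cap could look something like this sketch (pure stdlib; `stratified_subsample`, the `species` field, and the row format are illustrative assumptions, not the dataset's actual schema):

```python
from collections import defaultdict
import random

def stratified_subsample(rows, label_key, per_label, seed=42):
    """Keep at most `per_label` randomly chosen rows per unique label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    sample = []
    for _, group in sorted(by_label.items()):
        rng.shuffle(group)
        sample.extend(group[:per_label])
    return sample

# Toy example: 3 rows of "a" and 1 row of "b", capped at 2 per label.
rows = [{"species": "a"}] * 3 + [{"species": "b"}]
subsample = stratified_subsample(rows, "species", per_label=2)
print(len(subsample))  # 3
```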
I believe currently it would sample the entire dataset and re-embed it every time during evaluation. This could be optimized (e.g. by sampling all the IDs, then encoding only the required documents once), but it is a problem in the task, not the dataset. In this case, to improve runtime we could reduce the number of experiments to 1. I suspect with the current setting all 10 experiments are essentially the same anyway.
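A minimal sketch of that optimization - deduplicate before embedding so each distinct document is encoded once (`encode_once` and the toy model are hypothetical; only the `model.encode(list_of_str)` shape is assumed):

```python
def encode_once(model, documents):
    """Embed each distinct document a single time, then map back."""
    unique_docs = list(dict.fromkeys(documents))  # order-preserving dedup
    embeddings = model.encode(unique_docs)        # one vector per unique doc
    index = {doc: i for i, doc in enumerate(unique_docs)}
    return [embeddings[index[doc]] for doc in documents]

class _ToyModel:
    def encode(self, docs):
        return [len(d) for d in docs]  # stand-in "embedding": string length

vectors = encode_once(_ToyModel(), ["aa", "b", "aa"])
print(vectors)  # [2, 1, 2] - "aa" was only encoded once
```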
Also the current label is at the most granular species level. Maybe we want to explore something higher level?
┌───────────────┬──────────┬─────────────────┐
│ Level         │ Distinct │ Range           │
├───────────────┼──────────┼─────────────────┤
│ supercategory │ 5        │ too few         │
├───────────────┼──────────┼─────────────────┤
│ order         │ 62       │ good candidate? │
├───────────────┼──────────┼─────────────────┤
│ family        │ 356      │ good candidate? │
├───────────────┼──────────┼─────────────────┤
│ genus         │ 2,075    │ over 2k         │
├───────────────┼──────────┼─────────────────┤
│ species       │ 5,569    │ over 2k         │
└───────────────┴──────────┴─────────────────┘
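Distinct counts per taxonomic level like those above can be computed with a small helper once the metadata is loaded - a sketch assuming each record carries one field per level (field names are illustrative, not the actual iNatSounds schema):

```python
def distinct_counts(records, levels):
    """Count distinct values at each taxonomic level."""
    return {level: len({r[level] for r in records}) for level in levels}

# Toy records; real ones would come from the dataset's metadata.
records = [
    {"order": "Passeriformes", "family": "Corvidae", "species": "Corvus corax"},
    {"order": "Passeriformes", "family": "Corvidae", "species": "Corvus corone"},
    {"order": "Gruiformes", "family": "Aramidae", "species": "Aramus guarauna"},
]
print(distinct_counts(records, ["order", "family", "species"]))
# {'order': 2, 'family': 2, 'species': 3}
```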
This could be optimized (e.g. by sampling all the IDs, then encoding only the required documents once)
I am actually not entirely sure about this part - it seems there is a cache in place that reuses embeddings, so I don't think there is an issue there. On that point, I still believe that, due to the large number of unique labels, we end up sampling the whole dataset (we can check this if you have the results file).
Also the current label is at the most granular species level. Maybe we want to explore something higher level?
Ahh yeah, that could be a way out. Feels like hierarchical classification. I would probably suggest family, genus, species, but let me ask a friend who is a biologist about what makes sense here.
I am probably leaning toward species still though
Talked with my biologist friend - she argued that it makes no sense to predict higher up the tree: species, or maybe genus.
She did, however, suggest downsampling the number of species to e.g. 2k - I agree that this is probably the best approach. We could also split it into 5 datasets, one for each supercategory (I don't know what a supercategory is, though).
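Capping at the k most frequent species could be sketched like this (hypothetical helper; assumes a `species` field, with e.g. k=2000 in practice):

```python
from collections import Counter

def keep_top_species(rows, k, label_key="species"):
    """Filter rows to only the k most frequent species."""
    counts = Counter(r[label_key] for r in rows)
    keep = {species for species, _ in counts.most_common(k)}
    return [r for r in rows if r[label_key] in keep]

# Toy example: keep the 2 most frequent of 3 species.
rows = [{"species": "a"}] * 3 + [{"species": "b"}] * 2 + [{"species": "c"}]
kept = keep_top_species(rows, k=2)
print(len(kept))  # 5 - all "a" and "b" rows survive, "c" is dropped
```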
Makes sense to down sample species. Thanks for asking!
No problem - I also heard that a large majority of the dataset is birds (I think ~4k)
New Dataset: INatSounds
Adds the iNaturalist Sounds dataset for audio-based species classification.
Dataset details
The dataset contains recordings of 5,500+ species across birds, insects, amphibians, mammals, and reptiles, contributed by 27,000+ citizen scientists. The test split contains 49,527 recordings from 2023 observations.
Checklist
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- intfloat/multilingual-e5-small
- TaskMetadata fields
- mteb/tasks directory
- __init__.py