fix: Add AfriMTEB and AfriE5 #4124
Kosei1227 wants to merge 38 commits into embeddings-benchmark:main
Conversation
Why do you change the existing dataset?
@Kosei1227 This is still unresolved. Why do you change the existing dataset?
@Samoed This is because AfriSenti under the mteb library does not support Oromo. This change explicitly supports this language for better coverage.
Yes, but to include them they need to exist in the dataset, and they don't exist in the mteb repo. If you want to include them we need to update the dataset repo.
I discussed the language coverage of mteb/AfriSentiClassification with my team. For AfriSentiClassification, we use the original dataset: https://huggingface.co/datasets/shmuhammad/AfriSenti-twitter-sentiment. I'm not sure why mteb/AfriSentiClassification does not support Oromo, but since I cannot modify the existing dataset, let's revert the changes in this file.
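For reference, a minimal sketch of how one might verify which language configs a hub dataset actually exposes (e.g. to check whether Oromo is present). The `has_config` / `dataset_supports` helpers and the `"orm"` config name are illustrative assumptions, not part of mteb:

```python
def has_config(configs: list[str], lang_code: str) -> bool:
    """Pure check, kept separate so the logic is testable without network access."""
    return lang_code in configs


def dataset_supports(repo_id: str, lang_code: str) -> bool:
    """Query the hub for a dataset's config names (network call; needs `datasets`)."""
    from datasets import get_dataset_config_names  # deferred: optional dependency

    return has_config(get_dataset_config_names(repo_id), lang_code)
```

For example, `dataset_supports("shmuhammad/AfriSenti-twitter-sentiment", "orm")` would report whether the original source exposes an Oromo subset that the mteb-hosted copy lacks.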
KennethEnevoldsen
left a comment
Hi @Kosei1227, great to see this addition!!
There are some issues that we need to address before the merge; these are mainly caused by v1-v2 changes, but generally the PR's metadata annotations look good.
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
…sses.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Hi @Samoed, @KennethEnevoldsen, Thank you for the detailed reviews! I have addressed the requested changes across 9 commits. Here is a summary of the updates: 1. Model Refinement (
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
@Samoed
Thanks for flagging this. We would prefer to keep SIB200_14Classes. Although it is related to the original SIB200 task, this 14-class variant is substantially more challenging and is a key contribution of our paper. In AfriMTEB, it plays an important role in evaluating fine-grained topic classification for multilingual settings, which is not covered by the original setup.
Hi @Kosei1227 thanks for the changes and sorry about the late response, can I ask you to please answer this question regarding SIB200:
Currently I am a bit unsure about the dataset; it would be nice to know how it is different.
@KennethEnevoldsen and @Samoed, the original SIB-200 release also has a 14-class version (section 2.3 of the paper); a comparison of both is in the Appendix: https://aclanthology.org/2024.eacl-long.14.pdf#page=15.08. The original SIB-200 removed infrequent classes since the task was already challenging enough when it was created in 2023, but that is no longer the case. Please let me know if you have questions. Quote from the paper: "While the SIB-200 dataset only includes seven labels, we are also releasing another version of the dataset that is more challenging with all the 14 labels (excluding "uncategorized"). We compared the performance of English dataset using both seven and 14 labels in Appendix C."
KennethEnevoldsen
left a comment
Thanks for the clarification @dadelani
Based on that I would change it to SIB200Classification.v2 instead.
The next step is to re-upload the datasets under the mteb organization (to ensure that we can maintain them going forward). If one of you has a Hugging Face ID, I can add you to the organization. You can see how to push to the hub in the documentation, but do ask if there are issues. Once the dataset is uploaded you also have to remove the dataset_transform (as it is applied before the upload).
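The re-upload step can be sketched roughly as below, assuming the `datasets` library. `target_repo`, `reupload`, and the repo names are illustrative, not the actual script used in this PR:

```python
def target_repo(org: str, task_name: str) -> str:
    """Hub repo id the task metadata should point to after the transfer."""
    return f"{org}/{task_name}"


def reupload(source_repo: str, task_name: str, org: str = "mteb") -> str:
    """Load the source dataset, apply the old dataset_transform logic
    locally, and push the result under the target org."""
    from datasets import load_dataset  # deferred: network + `datasets` lib needed

    repo_id = target_repo(org, task_name)
    ds = load_dataset(source_repo)
    # ... apply whatever dataset_transform used to do here, once, locally ...
    ds.push_to_hub(repo_id)  # requires `huggingface-cli login` with org write access
    return repo_id
```

Because the transform is baked in before the push, the task class can then drop its `dataset_transform` and point straight at the `mteb/...` repo.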
I would also ask that you go over the comments above and resolve those that you believe are resolved
Once those are done, I think we are pretty much there - thanks for taking the time so far!
This pull request has been automatically marked as stale due to inactivity.
@Kosei1227 should we get this PR finalized? I would love to have it merged into MTEB.
@KennethEnevoldsen Thanks for following up! I'll finalize the PR by this weekend!
Great to hear - looking forward to it!
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Align the 14-label SIB200 task with reviewer feedback by renaming it to `SIB200Classification.v2` and marking `SIB200Classification` as superseded. Update the task registry, benchmark entries, and quality-test references so the renamed task resolves consistently. Made-with: Cursor
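The supersession described in the commit above can be illustrated with a minimal sketch. The `SUPERSEDED` registry and `resolve` function are hypothetical, not mteb's actual internals:

```python
# Old task name -> replacement (illustrative entry from this PR's rename).
SUPERSEDED = {"SIB200Classification": "SIB200Classification.v2"}


def resolve(task_name: str, allow_superseded: bool = False) -> str:
    """Follow supersession links unless the caller opts into old task versions."""
    if allow_superseded or task_name not in SUPERSEDED:
        return task_name
    return resolve(SUPERSEDED[task_name])
```

With this pattern a benchmark entry referencing the old name still resolves to the v2 task, while the old task remains loadable for reproducing earlier results.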
Hi @KennethEnevoldsen @Samoed, quick update on the Hugging Face rehost step: I authenticated on my side, but the failure happens at repo creation, before upload. So if repo creation is restricted while push access to existing repos is allowed, pre-creating the target repos would also unblock me.
I don't think that you have access to the mteb org. You can keep the datasets under your username.
@Kosei1227 I have added you as a contributor on mteb; you can now upload new datasets and edit your own datasets (but not edit existing datasets).
Point the AfriMTEB-related tasks and SIB200 v2 task to the new datasets hosted under the mteb namespace so task loading uses the transferred repositories going forward. Made-with: Cursor
@KennethEnevoldsen Thanks for adding me as a contributor! I have now transferred all new datasets under mteb.
@Kosei1227 Can you look into the comments? There are some unresolved ones.
@Samoed All comments should be resolved now! Thanks for flagging this point.
Restore the original Swahili subset key and remove the unnecessary fast-loading flag so the task only keeps the Oromo support changes. Made-with: Cursor
Remove the Oromo-specific AfriSenti task changes so the file matches the earlier branch version until the dataset can be updated safely. Made-with: Cursor
….v2 and remove old SIB200-14Classes stats Made-with: Cursor
Made-with: Cursor
KennethEnevoldsen
left a comment
I believe this PR is good to merge - @Samoed, do you have anything to add?
Add AfriMTEB tasks and AfriE5 model
Description
This PR registers the AfriMTEB benchmark and adds several new datasets and the AfriE5 model focusing on African languages.
For more details, please see our paper: AfriMTEB and AfriE5: Benchmarking and Adapting Text Embedding Models for African Languages.
New Benchmark
`MTEB(Africa, v1)` (alias `AfriMTEB`). It includes a comprehensive set of tasks across classification, clustering, retrieval, bitext mining, and STS.
New Datasets
- (PairClassification)
- (MultiLabelClassification)
- (MultiLabelClassification)
- (Classification)
- (Classification)
- (Classification)
- (Classification)
New Models
Citation
If you use this benchmark or the AfriE5 model, please cite:
Checklists
Dataset Checklist
- Added to the `mteb/tasks` directory
- Implements the `AbsTask*` interface
- `dataset_transform` method implemented if necessary
- Registered in `__init__.py`
Model Checklist
- Added to the `mteb/models` directory