
fix: Add AfriMTEB and AFriE5 #4124

Open
Kosei1227 wants to merge 38 commits into embeddings-benchmark:main from Kosei1227:afrimteb_afrie5

Conversation

@Kosei1227

Add AfriMTEB tasks and AfriE5 model

Description

This PR registers the AfriMTEB benchmark and adds several new datasets and the AfriE5 model focusing on African languages.

For more details, please see our paper: AfriMTEB and AfriE5: Benchmarking and Adapting Text Embedding Models for African Languages.

New Benchmark

  • AfriMTEB: A new benchmark subset focused on African languages, registered as MTEB(Africa, v1) (alias AfriMTEB). It includes a comprehensive set of tasks across classification, clustering, retrieval, bitext mining, and STS.

New Datasets

  • AfriXNLI (PairClassification)
  • SIB200-14Classes (MultiLabelClassification)
  • EmotionAnalysisPlus (MultiLabelClassification)
  • AfriHateClassification (Classification)
  • AfriSentiClassification (Classification)
  • KinNewsClassification (Classification)
  • InjongoIntent (Classification)

New Models

  • McGill-NLP/AfriE5-Large-instruct: An AfriE5 model adapted from XLM-R.

Citation

If you use this benchmark or the AfriE5 model, please cite:

@article{uemura2025afrimteb,
  author = {Kosei Uemura and Miaoran Zhang and David Ifeoluwa Adelani},
  journal = {arXiv preprint arXiv:2510.23896},
  title = {AfriMTEB and AfriE5: Benchmarking and Adapting Text Embedding Models for African Languages},
  url = {https://arxiv.org/abs/2510.23896},
  year = {2025},
}

Checklists

Dataset Checklist

  • Dataset added to the mteb/tasks directory
  • Task class implementation follows AbsTask* interface
  • Metadata is complete and correct
  • dataset_transform method implemented if necessary
  • Added to __init__.py

Model Checklist

  • Model added to mteb/models directory
  • Model metadata (ModelMeta) defined correctly
  • Loader function defined
  • Languages and other metadata fields populated
  • License information added

@Kosei1227 Kosei1227 changed the title from "Afrimteb afrie5" to "Add AfriMTEB and AFriE5" Feb 20, 2026
Comment thread mteb/benchmarks/benchmarks/benchmarks.py Outdated
Comment thread mteb/models/model_implementations/e5_instruct.py Outdated
Comment thread mteb/models/model_implementations/e5_instruct.py Outdated
Comment thread mteb/tasks/classification/multilingual/afri_hate_classification.py Outdated
Comment thread mteb/tasks/classification/multilingual/afri_hate_classification.py Outdated
Comment thread mteb/tasks/classification/multilingual/afri_hate_classification.py Outdated
Member

Why are you changing an existing dataset?

Member

@Kosei1227 This is still unresolved. Why are you changing an existing dataset?

Author

@Samoed This is because AfriSenti in the mteb library does not support Oromo. This change explicitly adds support for that language for better coverage.

Member

Yes, but to include them they need to exist in the dataset, and they don't exist in the mteb repo. If you want to include them, we need to update the dataset repo.

Author

I discussed the language coverage of mteb/AfriSentiClassification with my team. For AfriSentiClassification, we use the original dataset: https://huggingface.co/datasets/shmuhammad/AfriSenti-twitter-sentiment. I’m not sure why mteb/AfriSentiClassification does not support Oromo, but since I cannot modify the existing dataset, let’s revert the changes in this file.

@Samoed Samoed added the new benchmark label Feb 21, 2026
Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment

Hi @Kosei1227, great to see this addition!!

There are some issues that we need to address before the merge. These are mainly caused by the v1-v2 changes, but in general the PR's metadata annotations look good.

Comment thread mteb/benchmarks/benchmarks/benchmarks.py
Comment thread mteb/benchmarks/benchmarks/benchmarks.py
Comment thread mteb/tasks/classification/multilingual/injongo_intent.py Outdated
Comment thread mteb/tasks/classification/multilingual/kin_news_classification.py Outdated
Comment thread mteb/tasks/classification/multilingual/kin_news_classification.py Outdated
Comment thread mteb/tasks/multilabel_classification/multilingual/sib200_14classes.py Outdated
Comment thread mteb/tasks/multilabel_classification/multilingual/sib200_14classes.py Outdated
@Kosei1227
Author

Hi @Samoed, @KennethEnevoldsen,

Thank you for the detailed reviews! I have addressed the requested changes across 9 commits. Here is a summary of the updates:

1. Model Refinement (e5_instruct.py)

  • Switched to InstructSentenceTransformerModel for AfriE5-Large-instruct.
  • Moved parameters to loader_kwargs and removed the use of partial.
  • Updated the revision to the specific commit hash: 2bbf55df87c1ddd7b20c5626d6f97ca6178766b7.
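
To illustrate the revision pinning mentioned above, here is a minimal sketch (not the PR's code) that checks whether a revision string is a full 40-character commit hash rather than a moving reference such as `main`. The helper name `is_pinned_revision` is hypothetical and not part of mteb.

```python
import re

# Hypothetical helper, not part of mteb: a full Git commit hash (SHA-1)
# is 40 lowercase hexadecimal characters.
_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def is_pinned_revision(revision: str) -> bool:
    """Return True if `revision` looks like a full commit hash."""
    return _FULL_SHA.fullmatch(revision) is not None


# The hash quoted in the list above passes; a branch name does not.
print(is_pinned_revision("2bbf55df87c1ddd7b20c5626d6f97ca6178766b7"))
print(is_pinned_revision("main"))
```

A check like this only confirms the format of the string; it does not verify that the commit exists on the Hub.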

2. Task Modernization (MTEB v2)

  • Refactored all new tasks (AfriHate, KinNews, Injongo, AfriXNLI, SIB200-14Classes, EmotionAnalysisPlus) to inherit directly from AbsTask* base classes, removing MultilingualTask.
  • Updated TaskMetadata to comply with v2 schema:
    • Removed trust_remote_code: True.
    • Updated category from s2s to t2c (Classification) or t2t (Pair Classification).
    • Added fast_loading = True where appropriate.
    • Standardized language codes to swh (Swahili) and gaz (Oromo) for consistency with MMTEB filters.
    • Updated dataset revision to specific commit hashes.
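
The language-code standardization above could look roughly like the following sketch. The mapping keys (`swa`, `orm`), the helper `normalize_lang`, and the `<language>-<script>` tag format are assumptions made for illustration; they are not taken from the PR's diff.

```python
# Assumed mapping from ISO 639-3 macrolanguage codes to the individual
# language codes named above. The keys are illustrative only.
MACRO_TO_INDIVIDUAL = {
    "swa": "swh",  # Swahili (macrolanguage) -> Swahili (individual language)
    "orm": "gaz",  # Oromo (macrolanguage) -> West Central Oromo
}


def normalize_lang(code: str, script: str = "Latn") -> str:
    """Return a `<language>-<script>` tag using the standardized code."""
    return f"{MACRO_TO_INDIVIDUAL.get(code, code)}-{script}"


# Build a subset -> language-tag mapping for a few hypothetical subsets.
eval_langs = {subset: [normalize_lang(subset)] for subset in ["swa", "orm", "hau"]}
print(eval_langs)
```

Codes without an entry in the mapping (here `hau`) pass through unchanged, so only the two standardized languages are rewritten.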

3. SIB200 14-Classes

  • Converted the task from multi-label to single-label classification as requested.
  • Moved the implementation to mteb/tasks/classification/multilingual/sib200_14classes.py and updated the registry.
  • Updated the dataset_transform logic to return a single integer label.
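
As a rough sketch of the multi-label to single-label conversion described above, the function below replaces a one-element label list with a single integer class index. The field names (`labels`, `label`), the label list, and the function `to_single_label` are hypothetical; the PR's actual `dataset_transform` may differ.

```python
# Hypothetical subset of class names; the real task has 14 classes.
LABEL_NAMES = ["health", "politics", "sports"]


def to_single_label(example: dict, label_names: list = LABEL_NAMES) -> dict:
    """Replace a one-element `labels` list with an integer `label` index."""
    labels = example["labels"]
    if len(labels) != 1:
        # A single-label task cannot represent zero or several labels.
        raise ValueError(f"expected exactly one label, got {labels!r}")
    out = {key: value for key, value in example.items() if key != "labels"}
    out["label"] = label_names.index(labels[0])
    return out


print(to_single_label({"text": "Election results announced", "labels": ["politics"]}))
```

Raising on anything other than exactly one label keeps silently truncated examples out of the converted dataset.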

4. Benchmark Configuration

  • Registered the MTEB_AFRICA_LITE benchmark in benchmarks.py, covering the 13 representative datasets and 9 languages mentioned in the paper.
  • Updated the reference for MTEB_AFRICA to the latest arXiv URL.

5. Descriptive Statistics

  • Generated and added descriptive statistics JSON files for InjongoIntent, AfriXNLI, EmotionAnalysisPlus, and SIB200-14Classes.
  • Note: Stats for AfriHate and KinNews are currently pending as they require dataset re-uploads/access updates to the MTEB organization as suggested by @KennethEnevoldsen. My Hugging Face username is KoseiUemura, if you'd like to add me to the organization!

Please let me know if there are any further adjustments needed!

Comment thread mteb/models/model_implementations/e5_instruct.py
Comment thread mteb/tasks/classification/multilingual/injongo_intent.py Outdated
Comment thread mteb/tasks/multilabel_classification/multilingual/sib200_14classes.py Outdated
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
@Kosei1227
Author

Hi @Samoed,

> Can you delete this file then?

Thanks for flagging this.

We would prefer to keep SIB200_14Classes. Although it is related to the original SIB200 task, this 14-class variant is substantially more challenging and is a key contribution of our paper. In AfriMTEB, it plays an important role in evaluating fine-grained topic classification in multilingual settings, which is not covered by the original setup.

@Kosei1227 Kosei1227 requested a review from Samoed March 21, 2026 16:37
@KennethEnevoldsen
Contributor

Hi @Kosei1227, thanks for the changes, and sorry about the late response. Can I ask you to please answer this question regarding SIB200:

Do you know how the new labels were obtained?

Currently, I am a bit unsure about the dataset; it would be nice to know how it is different.

@dadelani

dadelani commented Mar 28, 2026

> Hi @Kosei1227, thanks for the changes, and sorry about the late response. Can I ask you to please answer this question regarding SIB200:
>
> Do you know how the new labels were obtained?
>
> Currently, I am a bit unsure about the dataset; it would be nice to know how it is different.

@KennethEnevoldsen and @Samoed, the original SIB-200 has 14 classes (Section 2.3 of the paper), and a comparison of the two versions is in the appendix (https://aclanthology.org/2024.eacl-long.14.pdf#page=15.08). The original SIB-200 release removed the infrequent classes, since the task was already challenging enough when it was created in 2023, but that is no longer the case. Please let me know if you have questions.

Quote from the paper:

"While the SIB-200 dataset only includes seven labels, we are also releasing another version of the dataset that is more challenging with all the 14 labels (excluding “uncategorized”). We compared the performance of English dataset using both seven and 14 labels in Appendix C."

Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment

Thanks for the clarification @dadelani

Based on that I would change it to SIB200Classification.v2 instead.

The next step is to re-upload the datasets under the mteb organization (to ensure that we can maintain them going forward). If one of you has a Hugging Face ID, I can add you to the organization. You can see how to push to the Hub in the documentation, but do ask if there are issues. Once a dataset is uploaded, you also have to remove its dataset_transform (as it is applied before the upload).

I would also ask that you go over the comments above and mark as resolved those that you believe have been addressed.

Once those are done, I think we are pretty much there. Thanks for taking the time so far!

Comment thread mteb/tasks/classification/multilingual/sib200_classification.py
@github-actions
Contributor

This pull request has been automatically marked as stale due to inactivity.

@github-actions github-actions Bot added the stale label Apr 13, 2026
@KennethEnevoldsen
Contributor

@Kosei1227 should we get this PR finalized? I would love to have it merged into MTEB

@Kosei1227
Author

Kosei1227 commented Apr 13, 2026

@KennethEnevoldsen Thanks for following up! I'll finalize the PR by this weekend!
We are greatly looking forward to adding our benchmark and models to MTEB!

@github-actions github-actions Bot removed the stale label Apr 14, 2026
@KennethEnevoldsen
Contributor

great to hear - looking forward to it!

Kosei1227 and others added 2 commits April 18, 2026 13:43
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Align the 14-label SIB200 task with reviewer feedback by renaming it to `SIB200Classification.v2` and marking `SIB200Classification` as superseded. Update the task registry, benchmark entries, and quality-test references so the renamed task resolves consistently.

Made-with: Cursor
@Kosei1227
Author

Hi @KennethEnevoldsen @Samoed, quick update on the Hugging Face rehost step: I authenticated as KoseiUemura with a write-scoped token and verified the HF client works locally, but creating dataset repos under the mteb/* namespace is still rejected by the Hub API with 403 Forbidden: You don't have the rights to create a dataset under the namespace "mteb".

On my side, hf auth whoami / HfApi.whoami() does not currently show mteb among my orgs, so it looks like either the org invite is still pending/not visible yet, or my current access does not include creating new dataset repos in the org.

The failure happens at repo creation, before upload, so if repo creation is restricted but push access to existing repos is allowed, pre-creating the target repos would also unblock me.
The six intended targets are mteb/AfriHateClassification, mteb/KinNewsClassification, mteb/InjongoIntent, mteb/AfriXNLI, mteb/EmotionAnalysis, and mteb/SIB200Classification.v2.

If you can confirm/add KoseiUemura with repo-creation rights, or pre-create these repos, I can complete the rehost asap.
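
The org-membership check described above can be sketched as a small pure function. The payload shape (an `orgs` list of dictionaries with a `name` key) is an assumption about what `HfApi.whoami()` returns, and the helper `can_see_org` is hypothetical; the example payload is made up for illustration.

```python
def can_see_org(whoami: dict, org: str) -> bool:
    """Return True if `org` appears in the `orgs` list of a whoami payload."""
    return any(entry.get("name") == org for entry in whoami.get("orgs", []))


# Made-up payload in the assumed shape; not real account data.
payload = {"name": "KoseiUemura", "orgs": [{"name": "McGill-NLP"}]}

# If the org is not listed, creating repos under its namespace is likely
# to be rejected, so it is worth checking before attempting an upload.
print(can_see_org(payload, "mteb"))
```

With `huggingface_hub` installed, the payload would come from `HfApi().whoami()` instead of the hand-written dictionary.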

@Samoed
Member

Samoed commented Apr 18, 2026

> I authenticated as KoseiUemura with a write-scoped token and verified the HF client works locally

I don't think you have access to the mteb org. You can keep the datasets under your username.

@KennethEnevoldsen
Contributor

@Kosei1227 I have added you as a contributor on mteb. You can now upload new datasets and edit your own datasets (but not edit existing datasets).

@KennethEnevoldsen KennethEnevoldsen changed the title from "Add AfriMTEB and AFriE5" to "dataset: Add AfriMTEB and AFriE5" Apr 19, 2026
@KennethEnevoldsen KennethEnevoldsen changed the title from "dataset: Add AfriMTEB and AFriE5" to "fix: Add AfriMTEB and AFriE5" Apr 19, 2026
Point the AfriMTEB-related tasks and SIB200 v2 task to the new datasets hosted under the mteb namespace so task loading uses the transferred repositories going forward.

Made-with: Cursor
@Kosei1227
Author

@KennethEnevoldsen Thanks for adding me as a contributor! I have now transferred all new datasets to the mteb organization.

@Samoed
Member

Samoed commented Apr 19, 2026

@Kosei1227 Can you look into the comments? There are some unresolved ones

@Kosei1227
Author

@Samoed All comments should be resolved now! Thanks for flagging this.

Restore the original Swahili subset key and remove the unnecessary fast-loading flag so the task only keeps the Oromo support changes.

Made-with: Cursor
Comment thread mteb/tasks/classification/multilingual/afri_senti_classification.py Outdated
Comment thread mteb/tasks/classification/multilingual/afri_senti_classification.py Outdated
Kosei1227 and others added 4 commits April 19, 2026 13:43
Remove the Oromo-specific AfriSenti task changes so the file matches the earlier branch version until the dataset can be updated safely.

Made-with: Cursor
….v2 and remove old SIB200-14Classes stats

Made-with: Cursor
Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment

I believe this PR is good to merge. @Samoed, do you have anything to add?

Labels

new benchmark Issues related to adding a new benchmark

4 participants