Add Transformers v5 compatibility and remove pkg_resources #271

Open: daniazie wants to merge 13 commits into Unbabel:master from daniazie:master

Summary

  1. pkg_resources has been removed in Setuptools v82.x (see ModuleNotFoundError: No module named 'pkg_resources' #267).
  2. Transformers v5 implements separate backends for different tokenizers. The Bert, RemBert and XLMRoberta(XL) tokenizers now use TokenizersBackend (fast tokenizers), so they no longer come in separate fast and slow variants.
  3. build_inputs_with_special_tokens has also been removed from TokenizersBackend, which causes errors when concat_sequences is called.
  4. As highlighted by Add Transformers 5.x Compatibility #262, model outputs no longer include pooler_output unless the model is created with add_pooling_layer=True.

Solutions & Changes

  1. Replaced pkg_resources with importlib_metadata and packaging.version:
try:
    import importlib_metadata
    import packaging.version as parse_version

    comet_version = importlib_metadata.distribution(
        'unbabel-comet'
    ).version
    use_softmax = (
        parse_version.parse(comet_version)
        >= parse_version.parse('2.2.4')
        and hparams.get('layer_transformation') == 'sparsemax_patch'
    )
except Exception:
    # Fall back if the version lookup or parsing fails
    use_softmax = False
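  2. Used importlib_metadata and packaging.version to choose which tokenizer class to import, based on the installed Transformers version.
    Example:
import importlib_metadata
import packaging.version as packaging_version

transformers_version = importlib_metadata.distribution('transformers').version
if packaging_version.Version(transformers_version) >= packaging_version.Version(
    'v5.0.0rc0'
):
    # v5: BertTokenizer already uses the fast TokenizersBackend
    from transformers import BertTokenizer
else:
    # v4: keep using the dedicated fast tokenizer class
    from transformers import BertTokenizerFast as BertTokenizer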
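The version gate above could be factored into a small helper so that each encoder module does not repeat the lookup. This is only a sketch, not code from this PR: the names version_at_least and distribution_at_least are hypothetical, and it uses the standard-library importlib.metadata (Python 3.8+) rather than the importlib_metadata backport. It assumes packaging is installed.

```python
from importlib import metadata  # standard library since Python 3.8

from packaging.version import InvalidVersion, Version


def version_at_least(installed: str, minimum: str) -> bool:
    """Return True if `installed` is the same as or newer than `minimum`."""
    # Version() follows PEP 440, so pre-releases sort before the final
    # release and a leading "v" is accepted.
    return Version(installed) >= Version(minimum)


def distribution_at_least(dist_name: str, minimum: str) -> bool:
    """Check an installed distribution; missing or unparsable counts as False."""
    try:
        return version_at_least(metadata.version(dist_name), minimum)
    except (metadata.PackageNotFoundError, InvalidVersion):
        return False


print(version_at_least("4.57.1", "5.0.0rc0"))  # False
print(version_at_least("5.0.0", "v5.0.0rc0"))  # True
```

Catching only PackageNotFoundError and InvalidVersion keeps unrelated errors visible, which a bare except would hide.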
  3. Following the Transformers v4 documentation, added build_inputs_with_special_tokens as a method on each encoder class, and used hasattr to call the tokenizer's method when it exists and the encoder's method otherwise.
    In bert.py:
def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
    # BERT layout: [CLS] A [SEP], or [CLS] A [SEP] B [SEP] for a pair
    cls = [self.tokenizer.cls_token_id]
    sep = [self.tokenizer.sep_token_id]
    if token_ids_1 is None:
        return cls + token_ids_0 + sep
    return cls + token_ids_0 + sep + token_ids_1 + sep

In xlmr.py:

def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
    # XLM-R layout: <s> A </s>, or <s> A </s></s> B </s> for a pair
    cls = [self.tokenizer.cls_token_id]
    sep = [self.tokenizer.sep_token_id]

    if token_ids_1 is None:
        return cls + token_ids_0 + sep
    return cls + token_ids_0 + sep + sep + token_ids_1 + sep
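To make the difference between the two layouts concrete, here is a minimal standalone sketch. The function names bert_style and xlmr_style are hypothetical, and the token IDs are illustrative values rather than anything read from a real tokenizer.

```python
from typing import List, Optional


def bert_style(cls_id: int, sep_id: int, ids_0: List[int],
               ids_1: Optional[List[int]] = None) -> List[int]:
    """BERT-style layout: one separator between the two segments."""
    out = [cls_id] + ids_0 + [sep_id]
    if ids_1 is not None:
        out += ids_1 + [sep_id]
    return out


def xlmr_style(cls_id: int, sep_id: int, ids_0: List[int],
               ids_1: Optional[List[int]] = None) -> List[int]:
    """XLM-R-style layout: a doubled separator between the two segments."""
    out = [cls_id] + ids_0 + [sep_id]
    if ids_1 is not None:
        out += [sep_id] + ids_1 + [sep_id]
    return out


# Illustrative IDs only (assumed for the example)
print(bert_style(101, 102, [7, 8], [9]))  # [101, 7, 8, 102, 9, 102]
print(xlmr_style(0, 2, [7, 8], [9]))      # [0, 7, 8, 2, 2, 9, 2]
```

The only difference is the extra separator before the second segment, which is why each encoder class needs its own method.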

In base.py:

for j in range(1, len(inputs)):
    # [1:-1] drops the first and last IDs of each sequence before re-joining
    if hasattr(self.tokenizer, 'build_inputs_with_special_tokens'):
        new_sequence = self.tokenizer.build_inputs_with_special_tokens(
            new_sequence[1:-1], concat_input_ids[j][i][1:-1]
        )
    else:
        new_sequence = self.build_inputs_with_special_tokens(
            new_sequence[1:-1], concat_input_ids[j][i][1:-1]
        )
  • Note: There may be a better way to fix this, but this was the best I could think of without risking performance.
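One possible alternative to checking hasattr on every loop iteration is to resolve the builder once with getattr and a fallback. The sketch below is an assumption about how that could look, not code from this PR. The stub tokenizer classes, make_fallback, resolve_builder and concat_pair are all hypothetical names, and the token IDs are illustrative.

```python
from typing import Callable, List, Optional

Builder = Callable[..., List[int]]


class StubTokenizerWithBuilder:
    """Hypothetical stand-in for a tokenizer that still has the method."""
    cls_token_id = 101
    sep_token_id = 102

    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        out = [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
        if token_ids_1 is not None:
            out += token_ids_1 + [self.sep_token_id]
        return out


class StubTokenizerWithoutBuilder:
    """Hypothetical stand-in for a tokenizer that lacks the method."""
    cls_token_id = 101
    sep_token_id = 102


def make_fallback(tokenizer) -> Builder:
    """Build a BERT-style fallback, playing the role of the encoder's method."""
    def build(token_ids_0, token_ids_1=None):
        out = [tokenizer.cls_token_id] + token_ids_0 + [tokenizer.sep_token_id]
        if token_ids_1 is not None:
            out += token_ids_1 + [tokenizer.sep_token_id]
        return out
    return build


def resolve_builder(tokenizer, fallback: Builder) -> Builder:
    """Prefer the tokenizer's own method; otherwise use the fallback."""
    return getattr(tokenizer, "build_inputs_with_special_tokens", fallback)


def concat_pair(tokenizer, first: List[int], second: List[int]) -> List[int]:
    """Strip each sequence's outer tokens, then re-join them as a pair."""
    build = resolve_builder(tokenizer, make_fallback(tokenizer))
    return build(first[1:-1], second[1:-1])


for tok in (StubTokenizerWithBuilder(), StubTokenizerWithoutBuilder()):
    print(concat_pair(tok, [101, 7, 8, 102], [101, 9, 102]))
```

Both stubs produce the same joined sequence, so the calling code does not need to know which path was taken.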
  4. For the missing pooler_output, used the same approach as Add Transformers 5.x Compatibility #262.

Other Changes

Updated the dependency constraints to allow more flexible version ranges.

[tool.poetry.dependencies]
python = "^3.8.0"
sentencepiece = ">=0.2.0"
pandas = ">=1.4.1"
transformers = ">=4.17"
pytorch-lightning = ">=2.0.0"
jsonargparse = ">= 3.13"
torch = ">=1.6.0"
numpy = ">=1.20.0"
torchmetrics = ">=0.10.2"
sacrebleu = ">=2.0.0"
scipy = ">=1.5.4"
entmax = ">=1.1"
huggingface-hub = ">=0.19.3"
importlib-metadata = "^9.0.0"

[tool.poetry.dev-dependencies]
sphinx-markdown-tables = ">=0.0.15"
coverage = ">=5.5"
scikit-learn = ">=1.0"
protobuf = ">=4.24.4"
