Commit 281ca1b

Added codec: tokenize
1 parent 3d7f43d commit 281ca1b

3 files changed

Lines changed: 22 additions & 2 deletions

codext/common/dummy.py

Lines changed: 11 additions & 1 deletion
@@ -22,7 +22,7 @@ def code(input, errors="strict"):
 # important note: ^
 # using "{2}" here instead will break the codec
 # this is due to the fact the codext.__common__.generate_string_from_regex DOES NOT handle ASSERT_NOT (?!) and will
-# faill to generate a valid instance in lookup(...) when an encoding name is to be generated to get the CodecInfo
+# fail to generate a valid instance in lookup(...) when an encoding name is to be generated to get the CodecInfo


 def substitute(token, replacement):
@@ -45,3 +45,13 @@ def code(input, errors="strict"):
 strip_spaces = lambda i, e="strict": (i.replace(" ", ""), len(i))
 add("strip-spaces", strip_spaces, strip_spaces, guess=None)

+def tokenize(n):
+    tlen = int(n[8:].lstrip("-_"))
+    def code(input, errors="strict"):
+        l = len(input)
+        if tlen > l:
+            raise LookupError("unknown encoding: %s" % n)
+        return " ".join(input[i:i+tlen] for i in range(0, l, tlen)), l
+    return code
+add("tokenize", tokenize, tokenize, r"^(tokenize[-_]?[1-9][0-9]*)$", guess=None)
+
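
For readers who want to try the chunking logic outside of `codext`, here is a minimal standalone sketch of what the new `tokenize` codec appears to do. The helper names `parse_token_length` and `tokenize_text` are hypothetical and do not exist in the commit; the slicing and the `n[8:].lstrip("-_")` parsing are taken from the diff above.

```python
def parse_token_length(name):
    """Extract N from a codec name such as 'tokenize-2' or 'tokenize_10'."""
    # "tokenize" is 8 characters long, so name[8:] leaves the optional
    # separator followed by the digits.
    return int(name[8:].lstrip("-_"))


def tokenize_text(text, tlen):
    """Split text into consecutive chunks of tlen characters, joined by spaces."""
    return " ".join(text[i:i + tlen] for i in range(0, len(text), tlen))


print(tokenize_text("3132333435", parse_token_length("tokenize-2")))  # 31 32 33 34 35
print(tokenize_text("abcde", 2))  # ab cd e
```

Note that the last chunk may be shorter than N when the input length is not a multiple of N, as the second call shows.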

docs/manipulations.md

Lines changed: 9 additions & 1 deletion
@@ -43,11 +43,12 @@ These transformation functions are simple string transformations.

 **Codec** | **Conversions** | **Aliases** | **Comment**
 :---: | :---: | --- | ---
-`replace` | text <-> text with single-char replaced | |
+`replace` | text <-> text with multi-chars replaced | | parametrized with a _string_ and its _replacement_
 `reverse` | text <-> reversed text | |
 `reverse-words` | text <-> reversed words | | same as `reverse` but not on the whole text, only on the words (text split by whitespace)
 `strip-spaces` | text <-> all whitespaces stripped | |
 `substitute` | text <-> text with token substituted | |
+`tokenize` | text <-> text split in tokens of length N | | parametrized with _N_

 As in the previous section, these transformations have no interest while using them in Python but well while using `codext` from the terminal (see [*CLI tool*](cli.html)).

@@ -58,6 +59,13 @@ $ echo -en "test string" | codext encode reverse-words | codext encode reverse r
 string_test
 ```

+Another example:
+
+```sh
+$ echo -en "3132333435" | codext encode tokenize-2
+31 32 33 34 35
+```
+
 Or using encodings chaining:

 ```sh

tests/test_manual.py

Lines changed: 2 additions & 0 deletions
@@ -100,6 +100,8 @@ def test_codec_dummy_str_manips(self):
         self.assertEqual(codecs.decode(STR.replace("i", "1"), "replace-1i"), STR)
         self.assertEqual(codecs.encode(STR, "substitute-this/that"), STR.replace("this", "that"))
         self.assertEqual(codecs.decode(STR.replace("this", "that"), "substitute-that/this"), STR)
+        self.assertEqual(codecs.encode(STR, "tokenize-2"), "th is i s a te st")
+        self.assertRaises(LookupError, codecs.encode, STR, "tokenize-200")

     def test_codec_hash_functions(self):
         STR = b"This is a test string!"
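
The added assertions go through `codecs.encode`, so the parametrized name (`tokenize-2`, `tokenize-200`) has to be resolved at lookup time. The following is a rough, stdlib-only sketch of how such a name could be resolved with `codecs.register`. It is an assumption for illustration, not codext's `add` helper, whose internals are not shown in this diff. The names `_NAME` and `_search` are hypothetical.

```python
import codecs
import re

# Hypothetical pattern modelled on the one in the diff. "_" is accepted because
# recent Python versions normalize hyphens in codec names to underscores.
_NAME = re.compile(r"^tokenize[-_]?([1-9][0-9]*)$")


def _search(name):
    """Return a CodecInfo for names like 'tokenize-2', or None for other names."""
    match = _NAME.match(name)
    if match is None:
        return None
    tlen = int(match.group(1))

    def encode(text, errors="strict"):
        # As in the commit, a token length larger than the input is rejected.
        if tlen > len(text):
            raise LookupError("unknown encoding: %s" % name)
        chunks = (text[i:i + tlen] for i in range(0, len(text), tlen))
        return " ".join(chunks), len(text)

    # The commit passes the same callable for both directions; so does this sketch.
    return codecs.CodecInfo(encode, encode, name=name)


codecs.register(_search)

print(codecs.encode("3132333435", "tokenize-2"))  # 31 32 33 34 35
```

With this sketch, `codecs.encode("abc", "tokenize-200")` raises `LookupError`, which matches the behaviour the second new assertion checks for.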
