
Commit 7d45c9e

Merge pull request #591 from Blosc/string_tutorial
String tutorial
2 parents 1406e34 + c1d28ac commit 7d45c9e

7 files changed

Lines changed: 316 additions & 9 deletions


ROADMAP-TO-4.0.md

Lines changed: 23 additions & 0 deletions
@@ -21,3 +21,26 @@ The constructor for the `Table` object should take some parameters to specify pr
 * `.__iter__()` for easy and fast iteration over rows.
 * `.where()`: an iterator for querying with conditions that are evaluated with the internal compute engine.
 * `.index()` for indexing a column and getting better performance in queries (desirable, but optional for 4.0).
+
+In particular, it should try to mimic much of the functionality of data-querying libraries such as ``pandas`` (see [this blog](https://datapythonista.me/blog/whats-new-in-pandas-3) for much of the following). Hence, one should be able to filter rows of the `Table` by querying on multiple columns (accessed via `.` or perhaps ``__getitem__``), with conditions to select rows implemented via `.index` and `.where`, like so:
+
+```
+tbl.where((tbl.property_type == "hotel") & (tbl.country == "us"))
+```
+
+It should also be possible to modify the filtered ``Table`` in place, using an operation that acts only on the filtered elements (e.g. ``assign``):
+
+```
+tbl = tbl.where((tbl.property_type == "hotel") & (tbl.country == "us")).assign(max_people=tbl.max_people + tbl.max_children)
+```
+
+Secondly, it should be possible to write bespoke transformation functions which act row-wise and may then be applied to get results from the `Table` and/or modify the ``Table`` in place:
+
+```
+def myudf(row):
+    col = row.name_of_column
+    # do things with col
+    return result
+
+ans = tbl.apply(myudf, axis=1)
+```
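Since the roadmap explicitly aims to mimic pandas semantics, the proposed `Table` operations above can be grounded by showing how the same three operations (multi-column filter, in-place `assign`, row-wise apply) look in pandas today. The data below is hypothetical, using the column names from the roadmap examples:

```python
import pandas as pd

# Hypothetical data; column names taken from the roadmap examples.
df = pd.DataFrame(
    {
        "property_type": ["hotel", "house", "hotel"],
        "country": ["us", "us", "fr"],
        "max_people": [2, 4, 3],
        "max_children": [1, 2, 0],
    }
)

# Multi-column filtering (analogue of tbl.where(...)).
filtered = df[(df.property_type == "hotel") & (df.country == "us")]

# Derive a column from the filtered rows (analogue of .assign(...)).
filtered = filtered.assign(max_people=filtered.max_people + filtered.max_children)

# Row-wise user-defined function (analogue of tbl.apply(myudf, axis=1)).
ans = df.apply(lambda row: row.max_people + row.max_children, axis=1)
```

The open design question for the `Table` is how much of this API surface to reproduce versus simplify; the pandas version is only a reference point, not the planned implementation.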
Lines changed: 192 additions & 0 deletions
@@ -0,0 +1,192 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Working with arrays of strings in Blosc2\n",
"\n",
"Blosc2 provides support for arrays in which the elements are strings, either of bytes (``np.bytes_``, equivalent to ``np.dtype('S0')``) or of unicode characters (``np.str_``, equivalent to ``np.dtype('U0')``), with the typesize determined by the longest element in the array. That is"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Bytes array - dtype: |S3, typesize: 3\n",
"Unicode array - dtype: <U3, typesize: 12\n"
]
}
],
"source": [
"import numpy as np\n",
"\n",
"arr = np.array([b\"a23\", b\"89u\"])\n",
"print(f\"Bytes array - dtype: {arr.dtype}, typesize: {arr.dtype.itemsize}\")\n",
"arr = np.array([\"a23\", \"89u\"])\n",
"print(f\"Unicode array - dtype: {arr.dtype}, typesize: {arr.dtype.itemsize}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"since each unicode character is encoded in 4 bytes. This carries over to the ``blosc2.NDArray`` object. Indeed, such arrays, particularly those of unicode type, are highly compressible, since almost all of the bits encoding each item will be 0 (i.e. ``\\x00``), as can be seen by viewing the second array above as an array of bytestrings"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([b'a\\x00\\x00\\x002\\x00\\x00\\x003', b'8\\x00\\x00\\x009\\x00\\x00\\x00u'],\n",
"      dtype='|S12')"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"arr.view(\"S12\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"(The trailing ``\\x00`` bytes are suppressed for the last character.) Consequently, using Blosc2 can save you a lot of space (in memory or on disk) when working with arrays of strings, if one exploits the structure of the (unicode) strings correctly. Specifically, the fundamental building block of the array should be the byte, and not the element - in this way, using the shuffle filter groups the $N$ elements having $m$ characters of bytesize 4 into $4$ streams of $Nm$ bytes, so that the corresponding bytes for all characters are grouped together. For the array above, one transforms the array from (2 elements of $3 \\times 4 = 12$ bytes)\n",
"```\n",
"|a\\x00\\x00\\x002\\x00\\x00\\x003\\x00\\x00\\x00|8\\x00\\x00\\x009\\x00\\x00\\x00u\\x00\\x00\\x00|\n",
"```\n",
"to (4 streams of $2 \\times 3 = 6$ bytes)\n",
"```\n",
"|a2389u|\\x00\\x00\\x00\\x00\\x00\\x00|\\x00\\x00\\x00\\x00\\x00\\x00|\\x00\\x00\\x00\\x00\\x00\\x00|\n",
"```\n",
"For the example above, 3 of the bytes for each character are 0, and so by grouping these zeros together, it is more likely to have chunks composed entirely or almost entirely of zeros, which may then be readily compressed.\n",
"If one were instead to break up the array by elements, one would end up with $4m$ streams of $N$ bytes, i.e. ($4 \\times 3 = 12$ streams of $2$ bytes)\n",
"```\n",
"|a8|\\x00\\x00|\\x00\\x00|\\x00\\x00|29|\\x00\\x00|\\x00\\x00|\\x00\\x00|3u|\\x00\\x00|\\x00\\x00|\\x00\\x00|\n",
"```\n",
"which is not very compressible, since the informative bytes are fragmented by small groups of 0 bytes. This heuristic has been implemented as a default for Blosc2 string compression, so you can simply compress your arrays of strings and reap the benefits without having to worry about such intricacies. Check it out below:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"cratio forcing non-string defaults: 2417x\n",
"cratio allowing blosc2 to optimise: 3902x\n"
]
}
],
"source": [
"import blosc2\n",
"\n",
"N = int(1e5)\n",
"nparr = np.repeat(np.array([\"josé\", \"pepe\", \"francisco\"]), N)\n",
"cparams = blosc2.cparams_dflts\n",
"arr1 = blosc2.asarray(nparr, cparams=cparams)\n",
"print(f\"cratio forcing non-string defaults: {round(arr1.cratio)}x\")\n",
"arr1 = blosc2.asarray(nparr)\n",
"print(f\"cratio allowing blosc2 to optimise: {round(arr1.cratio)}x\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Operating on arrays of strings\n",
"\n",
"Blosc2 has two tightly enmeshed sides, compression and computation, and the same applies to arrays of strings. We have implemented a subset of useful functions for strings:\n",
"- comparison operations ``<, <=, ==, !=, >=, >``\n",
"- 2-argument functions ``contains, startswith, endswith``\n",
"- 1-argument functions ``lower, upper``\n",
"\n",
"Where possible, these are computed by the ``miniexpr`` backend: a highly optimised, fully compiled, multithreaded library that is the most complete expression of Blosc2's goal of fully vertically integrated decompression/computation/recompression, with optimal cache-hierarchy exploitation and fast compiled C code for as much of the pipeline as possible. Where this is not possible, a more robust path is used which still avoids memory overload for large arrays.\n",
"\n",
"The arguments may be scalars or arrays (``blosc2.NDArray`` or other types) of strings or bytes."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"for t in (\"bytes\", \"string\"):\n",
"    if t == \"bytes\":\n",
"        a1 = np.array([b\"abc\", b\"def\", b\"atErr\", b\"oot\", b\"zu\", b\"ab c\"])\n",
"        a2 = a2_blosc = b\"a\"\n",
"    else:\n",
"        a1 = np.array([\"abc\", \"def\", \"atErr\", \"oot\", \"zu\", \"ab c\"])\n",
"        a2 = a2_blosc = \"a\"\n",
"    a1_blosc = blosc2.asarray(a1)\n",
"    for func, npfunc in zip(\n",
"        (blosc2.startswith, blosc2.endswith, blosc2.contains),\n",
"        (np.char.startswith, np.char.endswith, lambda *args: np.char.find(*args) != -1),\n",
"        strict=True,\n",
"    ):\n",
"        expr_lazy = func(a1_blosc, a2_blosc)\n",
"        res_numpy = npfunc(a1, a2)\n",
"        assert expr_lazy.shape == res_numpy.shape\n",
"        assert expr_lazy.dtype == blosc2.bool_\n",
"        np.testing.assert_array_equal(expr_lazy[:], res_numpy)\n",
"\n",
"    np.testing.assert_array_equal((a1_blosc < a2_blosc)[:], a1 < a2)\n",
"    np.testing.assert_array_equal((a1_blosc <= a2_blosc)[:], a1 <= a2)\n",
"    np.testing.assert_array_equal((a1_blosc == a2_blosc)[:], a1 == a2)\n",
"    np.testing.assert_array_equal((a1_blosc != a2_blosc)[:], a1 != a2)\n",
"    np.testing.assert_array_equal((a1_blosc >= a2_blosc)[:], a1 >= a2)\n",
"    np.testing.assert_array_equal((a1_blosc > a2_blosc)[:], a1 > a2)\n",
"\n",
"    for func, npfunc in zip((blosc2.lower, blosc2.upper), (np.char.lower, np.char.upper), strict=True):\n",
"        expr_lazy = func(a1_blosc)\n",
"        res_numpy = npfunc(a1)\n",
"        assert expr_lazy.shape == res_numpy.shape\n",
"        np.testing.assert_array_equal(expr_lazy[:], res_numpy)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "blosc2env",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
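The byte-stream regrouping that the tutorial describes for the shuffle filter can be reproduced with plain NumPy. The sketch below (an illustration of the explanation, not Blosc2 code; it assumes a little-endian platform, where the code-point byte of each 4-byte character comes first) views the unicode array as raw bytes and regroups them by byte position, yielding the `|a2389u|` stream followed by three all-zero streams:

```python
import numpy as np

# View the unicode array as raw bytes and regroup them as the shuffle
# filter (typesize 4) would. Assumes a little-endian platform.
arr = np.array(["a23", "89u"])  # dtype '<U3': 12 bytes per element
raw = arr.view(np.uint8)        # 24 raw bytes in element order
streams = raw.reshape(-1, 4).T  # 4 streams, one per byte position

print(streams[0].tobytes())     # b'a2389u': all the informative bytes
print(bool(streams[1:].any()))  # False: the other 3 streams are all zeros
```

Grouping the zeros into long runs like this is exactly what makes the shuffled representation so friendly to the compressor.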

src/blosc2/__init__.py

Lines changed: 4 additions & 0 deletions
@@ -653,6 +653,7 @@ def _raise(exc):
     logical_not,
     logical_or,
     logical_xor,
+    lower,
     max,
     maximum,
     mean,
@@ -685,6 +686,7 @@ def _raise(exc):
     tan,
     tanh,
     trunc,
+    upper,
     var,
     where,
 )
@@ -856,6 +858,7 @@ def _raise(exc):
     "logical_not",
     "logical_or",
     "logical_xor",
+    "lower",
     "matmul",
     "matrix_transpose",
     "max",
@@ -927,6 +930,7 @@ def _raise(exc):
     "unpack_array",
     "unpack_array2",
     "unpack_tensor",
+    "upper",
     "validate_expr",
     "var",
     "vecdot",

src/blosc2/lazyexpr.py

Lines changed: 1 addition & 1 deletion
@@ -2822,7 +2822,7 @@ def result_type(
     # Follow NumPy rules for scalar-array operations
     # Create small arrays with the same dtypes and let NumPy's type promotion determine the result type
     arrs = [
-        (np.array(value).dtype if isinstance(value, str) else value)
+        (np.array(value).dtype if isinstance(value, (str, bytes)) else value)
         if (np.isscalar(value) or not hasattr(value, "dtype"))
         else np.array([0], dtype=_convert_dtype(value.dtype))
         for value in arrays_and_dtypes
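The one-line fix above widens the scalar check from `str` to `(str, bytes)`, so bytes scalars are also converted to a dtype before promotion. A small NumPy-only demonstration of why both kinds of scalar need the same treatment (this is background on NumPy's behavior, not the blosc2 code itself):

```python
import numpy as np

# NumPy maps both string and bytes scalars to fixed-width string dtypes,
# and type promotion must see those dtypes rather than the raw scalars.
print(np.array("abc").dtype)   # <U3
print(np.array(b"abc").dtype)  # |S3

# result_type then promotes string dtypes to the widest itemsize.
print(np.result_type(np.dtype("S3"), np.dtype("S5")))  # |S5
```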

src/blosc2/ndarray.py

Lines changed: 48 additions & 0 deletions
@@ -5048,6 +5048,54 @@ def endswith(
     return blosc2.LazyExpr(new_op=(a, "endswith", suffix))


+@_incomplete_lazyfunc
+def lower(a: str | blosc2.Array) -> NDArray:
+    """
+    Adapted from the NumPy documentation: https://numpy.org/doc/stable/reference/generated/numpy.char.lower.html
+
+    Return an array with the elements converted to lowercase.
+    Calls str.lower element-wise.
+    For 8-bit strings, this method is locale-dependent.
+
+    Parameters
+    ----------
+    a : blosc2.Array
+        Input array of bytes_ or str_ dtype.
+
+    Returns
+    -------
+    out: blosc2.Array, of bytes_ or str_ dtype
+        Has the same shape as ``a``.
+    """
+    return blosc2.LazyExpr(new_op=(a, "lower", None))
+
+
+@_incomplete_lazyfunc
+def upper(a: str | blosc2.Array) -> NDArray:
+    """
+    Adapted from the NumPy documentation: https://numpy.org/doc/stable/reference/generated/numpy.char.upper.html
+
+    Return an array with the elements converted to uppercase.
+    Calls str.upper element-wise.
+    For 8-bit strings, this method is locale-dependent.
+
+    Parameters
+    ----------
+    a : blosc2.Array
+        Input array of bytes_ or str_ dtype.
+
+    Returns
+    -------
+    out: blosc2.Array, of bytes_ or str_ dtype
+        Has the same shape as ``a``.
+    """
+    return blosc2.LazyExpr(new_op=(a, "upper", None))
+
+
 def lazywhere(value1=None, value2=None):
     """Decorator to apply a where condition to a LazyExpr."""
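The new `blosc2.lower` / `blosc2.upper` build lazy expressions; per the `utils.py` changes in this commit, their NumPy fallback path delegates to `np.char.lower` / `np.char.upper`. A minimal NumPy-only sketch of that underlying element-wise behavior:

```python
import numpy as np

# np.char.lower / np.char.upper are the NumPy functions that the new
# blosc2.lower / blosc2.upper mirror (and use on the fallback path).
a = np.array(["abc", "DeF", "atErr"])
print(np.char.lower(a))  # ['abc' 'def' 'aterr']
print(np.char.upper(a))  # ['ABC' 'DEF' 'ATERR']
```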

src/blosc2/utils.py

Lines changed: 13 additions & 8 deletions
@@ -54,6 +54,14 @@ def _string_startswith(a, b):
     return np.char.startswith(a, b)


+def _string_lower(a):
+    return np.char.lower(a)
+
+
+def _string_upper(a):
+    return np.char.upper(a)
+
+
 def _string_endswith(a, b):
     return np.char.endswith(a, b)

@@ -97,6 +105,8 @@ def _format_expr_scalar(value):
 safe_numpy_globals["contains"] = _string_contains
 safe_numpy_globals["startswith"] = _string_startswith
 safe_numpy_globals["endswith"] = _string_endswith
+safe_numpy_globals["upper"] = _string_upper
+safe_numpy_globals["lower"] = _string_lower


 elementwise_funcs = [
@@ -155,6 +165,7 @@ def _format_expr_scalar(value):
     "logical_not",
     "logical_or",
     "logical_xor",
+    "lower",
     "maximum",
     "minimum",
     "multiply",
@@ -178,6 +189,7 @@ def _format_expr_scalar(value):
     "tan",
     "tanh",
     "trunc",
+    "upper",
     "where",
 ]

@@ -931,13 +943,6 @@ def process_key(key, shape):
     return key, mask


-incomplete_lazyfunc_map = {
-    "contains": lambda *args: np.char.find(*args) != -1,
-    "startswith": lambda *args: np.char.startswith(*args),
-    "endswith": lambda *args: np.char.endswith(*args),
-} | safe_numpy_globals  # clip and logaddexp available in safe_numpy_globals
-
-
 def is_inside_ne_evaluate() -> bool:
     """
     Whether the current code is being executed from an ne_evaluate call
@@ -968,7 +973,7 @@ def filler(inputs_tuple, output, offset):

     def wrapper(*args, **kwargs):
         if is_inside_ne_evaluate():  # haven't been able to use miniexpr so use numpy
-            return incomplete_lazyfunc_map[func.__name__](*args, **kwargs)
+            return safe_numpy_globals[func.__name__](*args, **kwargs)
         return func(*args, **kwargs)

     return wrapper
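The last hunk simplifies the fallback dispatch: the redundant `incomplete_lazyfunc_map` is dropped and the wrapper now looks up the NumPy implementation directly in `safe_numpy_globals` by function name. A standalone sketch of the pattern (assumed, simplified structure; not the actual blosc2 code, and `is_inside_ne_evaluate` here is a stand-in for the real call-stack inspection):

```python
import numpy as np

# NumPy fallbacks, registered under the same names as the lazy functions.
safe_numpy_globals = {"lower": np.char.lower, "upper": np.char.upper}


def is_inside_ne_evaluate():
    return True  # stand-in: pretend we are inside an ne_evaluate call


def incomplete_lazyfunc(func):
    # Dispatch to the NumPy fallback inside ne_evaluate; otherwise run
    # the decorated (lazy) implementation.
    def wrapper(*args, **kwargs):
        if is_inside_ne_evaluate():
            return safe_numpy_globals[func.__name__](*args, **kwargs)
        return func(*args, **kwargs)

    return wrapper


@incomplete_lazyfunc
def lower(a):
    raise NotImplementedError("stand-in for the LazyExpr path")


print(lower(np.array(["AbC"])))  # ['abc'] via the NumPy fallback
```

Keying the fallback on `func.__name__` is what makes registering `_string_lower`/`_string_upper` in `safe_numpy_globals` sufficient to wire the new functions in.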
