
Commit 7d45c9e

Merge pull request #591 from Blosc/string_tutorial
String tutorial
2 parents 1406e34 + c1d28ac commit 7d45c9e

7 files changed

Lines changed: 316 additions & 9 deletions


ROADMAP-TO-4.0.md

Lines changed: 23 additions & 0 deletions
@@ -21,3 +21,26 @@ The constructor for the `Table` object should take some parameters to specify pr
 * `.__iter__()` for easy and fast iteration over rows.
 * `.where()`: an iterator for querying with conditions that are evaluated with the internal compute engine.
 * `.index()` for indexing a column and getting better performance in queries (desirable, but optional for 4.0).
+
+In particular, it should try to mimic much of the functionality of data-querying libraries such as ``pandas`` (see [this blog](https://datapythonista.me/blog/whats-new-in-pandas-3) for much of the following). Hence, one should be able to filter rows of the `Table` by querying on multiple columns (accessed via `.` or perhaps ``__getitem__``), with conditions to select rows implemented via `.index` and `.where`, like so:
+
+```
+tbl.where((tbl.property_type == "hotel") & (tbl.country == "us"))
+```
+
+It should also be possible to modify the filtered ``Table`` in place, using an operation that acts only on the filtered elements (e.g. ``assign``):
+
+```
+tbl = tbl.where((tbl.property_type == "hotel") & (tbl.country == "us")).assign(max_people=tbl.max_people + tbl.max_children)
+```
+
+Secondly, it should be possible to write bespoke transformation functions which act row-wise and may then be applied to get results from the `Table` and/or modify the ``Table`` in place:
+
+```
+def myudf(row):
+    col = row.name_of_column
+    # do things with col
+    return result
+
+ans = tbl.apply(myudf, axis=1)
+```
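Since the roadmap explicitly aims to mimic pandas semantics, the proposed `Table` operations above can be grounded by showing how the same three operations (multi-column filter, in-place `assign`, row-wise apply) look in pandas today. The data below is hypothetical, using the column names from the roadmap examples:

```python
import pandas as pd

# Hypothetical data; column names taken from the roadmap examples.
df = pd.DataFrame(
    {
        "property_type": ["hotel", "house", "hotel"],
        "country": ["us", "us", "fr"],
        "max_people": [2, 4, 3],
        "max_children": [1, 2, 0],
    }
)

# Multi-column filtering (analogue of tbl.where(...)).
filtered = df[(df.property_type == "hotel") & (df.country == "us")]

# Derive a column from the filtered rows (analogue of .assign(...)).
filtered = filtered.assign(max_people=filtered.max_people + filtered.max_children)

# Row-wise user-defined function (analogue of tbl.apply(myudf, axis=1)).
ans = df.apply(lambda row: row.max_people + row.max_children, axis=1)
```

The open design question for the `Table` is how much of this API surface to reproduce versus simplify; the pandas version is only a reference point, not the planned implementation.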
Lines changed: 192 additions & 0 deletions
@@ -0,0 +1,192 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Working with arrays of strings in Blosc2\n",
"\n",
"Blosc2 provides support for arrays in which the elements are strings, either of bytes (``np.bytes_``, equivalent to ``np.dtype('S0')``) or of unicode characters (``np.str_``, equivalent to ``np.dtype('U0')``), with the typesize determined by the longest element in the array. That is"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Bytes array - dtype: |S3, typesize: 3\n",
"Unicode array - dtype: <U3, typesize: 12\n"
]
}
],
"source": [
"import numpy as np\n",
"\n",
"arr = np.array([b\"a23\", b\"89u\"])\n",
"print(f\"Bytes array - dtype: {arr.dtype}, typesize: {arr.dtype.itemsize}\")\n",
"arr = np.array([\"a23\", \"89u\"])\n",
"print(f\"Unicode array - dtype: {arr.dtype}, typesize: {arr.dtype.itemsize}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"since each unicode character is encoded in 4 bytes. This carries over to the ``blosc2.NDArray`` object. Indeed, such arrays, particularly those of unicode type, are highly compressible, since almost all of the bits encoding each item will be 0 (i.e. ``\\x00``), as can be seen by viewing the second array above as an array of bytestrings"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([b'a\\x00\\x00\\x002\\x00\\x00\\x003', b'8\\x00\\x00\\x009\\x00\\x00\\x00u'],\n",
"      dtype='|S12')"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"arr.view(\"S12\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"(The trailing ``\\x00`` bytes are suppressed for the last character.) Consequently, using Blosc2 can save you a lot of space (in memory or on disk) when working with arrays of strings, if one exploits the structure of the (unicode) strings correctly. Specifically, the fundamental building block of the array should be the byte, and not the element - in this way, using the shuffle filter groups the $N$ elements having $m$ characters of bytesize 4 into $4$ streams of $Nm$ bytes, so that the corresponding bytes for all characters are grouped together. For the array above, one transforms the array from (2 elements of $3 \\times 4 = 12$ bytes)\n",
"```\n",
"|a\\x00\\x00\\x002\\x00\\x00\\x003\\x00\\x00\\x00|8\\x00\\x00\\x009\\x00\\x00\\x00u\\x00\\x00\\x00|\n",
"```\n",
"to (4 streams of $2 \\times 3 = 6$ bytes)\n",
"```\n",
"|a2389u|\\x00\\x00\\x00\\x00\\x00\\x00|\\x00\\x00\\x00\\x00\\x00\\x00|\\x00\\x00\\x00\\x00\\x00\\x00|\n",
"```\n",
"For the example above, 3 of the bytes for each character are 0, and so by grouping these zeros together, it is more likely to have chunks composed entirely or almost entirely of zeros, which may then be readily compressed.\n",
"If one were instead to break up the array by elements, one would end up with $4m$ streams of $N$ bytes, i.e. ($4 \\times 3 = 12$ streams of $2$ bytes)\n",
"```\n",
"|a8|\\x00\\x00|\\x00\\x00|\\x00\\x00|29|\\x00\\x00|\\x00\\x00|\\x00\\x00|3u|\\x00\\x00|\\x00\\x00|\\x00\\x00|\n",
"```\n",
"which is not very compressible, since the informative bytes are fragmented by small groups of 0 bytes. This heuristic has been implemented as a default for Blosc2 string compression, so you can simply compress your arrays of strings and reap the benefits without having to worry about such intricacies. Check it out below:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"cratio forcing non-string defaults: 2417x\n",
"cratio allowing blosc2 to optimise: 3902x\n"
]
}
],
"source": [
"import blosc2\n",
"\n",
"N = int(1e5)\n",
"nparr = np.repeat(np.array([\"josé\", \"pepe\", \"francisco\"]), N)\n",
"cparams = blosc2.cparams_dflts\n",
"arr1 = blosc2.asarray(nparr, cparams=cparams)\n",
"print(f\"cratio forcing non-string defaults: {round(arr1.cratio)}x\")\n",
"arr1 = blosc2.asarray(nparr)\n",
"print(f\"cratio allowing blosc2 to optimise: {round(arr1.cratio)}x\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Operating on arrays of strings\n",
"\n",
"Blosc2 has two tightly enmeshed sides, compression and computation, and the same applies to arrays of strings. We have implemented a subset of useful functions for strings:\n",
"- comparison operations ``<, <=, ==, !=, >=, >``\n",
"- 2-argument functions ``contains, startswith, endswith``\n",
"- 1-argument functions ``lower, upper``\n",
"\n",
"Where possible, these are computed by the ``miniexpr`` backend: a highly optimised, fully compiled, multithreaded library that is the most complete expression of Blosc2's goal of fully vertically integrated decompression/computation/recompression, with optimal cache-hierarchy exploitation and fast compiled C code for as much of the pipeline as possible. Where this is not possible, a more robust path is used which still avoids memory overload for large arrays.\n",
"\n",
"The arguments may be scalars or arrays (``blosc2.NDArray`` or other types) of strings or bytes."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"for t in (\"bytes\", \"string\"):\n",
"    if t == \"bytes\":\n",
"        a1 = np.array([b\"abc\", b\"def\", b\"atErr\", b\"oot\", b\"zu\", b\"ab c\"])\n",
"        a2 = a2_blosc = b\"a\"\n",
"    else:\n",
"        a1 = np.array([\"abc\", \"def\", \"atErr\", \"oot\", \"zu\", \"ab c\"])\n",
"        a2 = a2_blosc = \"a\"\n",
"    a1_blosc = blosc2.asarray(a1)\n",
"    for func, npfunc in zip(\n",
"        (blosc2.startswith, blosc2.endswith, blosc2.contains),\n",
"        (np.char.startswith, np.char.endswith, lambda *args: np.char.find(*args) != -1),\n",
"        strict=True,\n",
"    ):\n",
"        expr_lazy = func(a1_blosc, a2_blosc)\n",
"        res_numpy = npfunc(a1, a2)\n",
"        assert expr_lazy.shape == res_numpy.shape\n",
"        assert expr_lazy.dtype == blosc2.bool_\n",
"        np.testing.assert_array_equal(expr_lazy[:], res_numpy)\n",
"\n",
"    np.testing.assert_array_equal((a1_blosc < a2_blosc)[:], a1 < a2)\n",
"    np.testing.assert_array_equal((a1_blosc <= a2_blosc)[:], a1 <= a2)\n",
"    np.testing.assert_array_equal((a1_blosc == a2_blosc)[:], a1 == a2)\n",
"    np.testing.assert_array_equal((a1_blosc != a2_blosc)[:], a1 != a2)\n",
"    np.testing.assert_array_equal((a1_blosc >= a2_blosc)[:], a1 >= a2)\n",
"    np.testing.assert_array_equal((a1_blosc > a2_blosc)[:], a1 > a2)\n",
"\n",
"    for func, npfunc in zip((blosc2.lower, blosc2.upper), (np.char.lower, np.char.upper), strict=True):\n",
"        expr_lazy = func(a1_blosc)\n",
"        res_numpy = npfunc(a1)\n",
"        assert expr_lazy.shape == res_numpy.shape\n",
"        np.testing.assert_array_equal(expr_lazy[:], res_numpy)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "blosc2env",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
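The byte-stream regrouping that the tutorial describes for the shuffle filter can be reproduced with plain NumPy. The sketch below (an illustration of the explanation, not Blosc2 code; it assumes a little-endian platform, where the code-point byte of each 4-byte character comes first) views the unicode array as raw bytes and regroups them by byte position, yielding the `|a2389u|` stream followed by three all-zero streams:

```python
import numpy as np

# View the unicode array as raw bytes and regroup them as the shuffle
# filter (typesize 4) would. Assumes a little-endian platform.
arr = np.array(["a23", "89u"])  # dtype '<U3': 12 bytes per element
raw = arr.view(np.uint8)        # 24 raw bytes in element order
streams = raw.reshape(-1, 4).T  # 4 streams, one per byte position

print(streams[0].tobytes())     # b'a2389u': all the informative bytes
print(bool(streams[1:].any()))  # False: the other 3 streams are all zeros
```

Grouping the zeros into long runs like this is exactly what makes the shuffled representation so friendly to the compressor.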

src/blosc2/__init__.py

Lines changed: 4 additions & 0 deletions
@@ -653,6 +653,7 @@ def _raise(exc):
     logical_not,
     logical_or,
     logical_xor,
+    lower,
     max,
     maximum,
     mean,
@@ -685,6 +686,7 @@ def _raise(exc):
     tan,
     tanh,
     trunc,
+    upper,
     var,
     where,
 )
@@ -856,6 +858,7 @@ def _raise(exc):
     "logical_not",
     "logical_or",
     "logical_xor",
+    "lower",
     "matmul",
     "matrix_transpose",
     "max",
@@ -927,6 +930,7 @@ def _raise(exc):
     "unpack_array",
     "unpack_array2",
     "unpack_tensor",
+    "upper",
     "validate_expr",
     "var",
     "vecdot",

src/blosc2/lazyexpr.py

Lines changed: 1 addition & 1 deletion
@@ -2822,7 +2822,7 @@ def result_type(
     # Follow NumPy rules for scalar-array operations
     # Create small arrays with the same dtypes and let NumPy's type promotion determine the result type
     arrs = [
-        (np.array(value).dtype if isinstance(value, str) else value)
+        (np.array(value).dtype if isinstance(value, (str, bytes)) else value)
         if (np.isscalar(value) or not hasattr(value, "dtype"))
         else np.array([0], dtype=_convert_dtype(value.dtype))
         for value in arrays_and_dtypes
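The one-line fix above widens the scalar check from `str` to `(str, bytes)`, so bytes scalars are also converted to a dtype before promotion. A small NumPy-only demonstration of why both kinds of scalar need the same treatment (this is background on NumPy's behavior, not the blosc2 code itself):

```python
import numpy as np

# NumPy maps both string and bytes scalars to fixed-width string dtypes,
# and type promotion must see those dtypes rather than the raw scalars.
print(np.array("abc").dtype)   # <U3
print(np.array(b"abc").dtype)  # |S3

# result_type then promotes string dtypes to the widest itemsize.
print(np.result_type(np.dtype("S3"), np.dtype("S5")))  # |S5
```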

src/blosc2/ndarray.py

Lines changed: 48 additions & 0 deletions
@@ -5048,6 +5048,54 @@ def endswith(
     return blosc2.LazyExpr(new_op=(a, "endswith", suffix))


+@_incomplete_lazyfunc
+def lower(a: str | blosc2.Array) -> NDArray:
+    """
+    Adapted from the NumPy documentation: https://numpy.org/doc/stable/reference/generated/numpy.char.lower.html
+
+    Return an array with the elements converted to lowercase.
+    Calls str.lower element-wise.
+    For 8-bit strings, this method is locale-dependent.
+
+    Parameters
+    ----------
+    a : blosc2.Array
+        Input array of bytes_ or str_ dtype.
+
+    Returns
+    -------
+    out: blosc2.Array, of bytes_ or str_ dtype
+        Has the same shape as ``a``.
+    """
+    return blosc2.LazyExpr(new_op=(a, "lower", None))
+
+
+@_incomplete_lazyfunc
+def upper(a: str | blosc2.Array) -> NDArray:
+    """
+    Adapted from the NumPy documentation: https://numpy.org/doc/stable/reference/generated/numpy.char.upper.html
+
+    Return an array with the elements converted to uppercase.
+    Calls str.upper element-wise.
+    For 8-bit strings, this method is locale-dependent.
+
+    Parameters
+    ----------
+    a : blosc2.Array
+        Input array of bytes_ or str_ dtype.
+
+    Returns
+    -------
+    out: blosc2.Array, of bytes_ or str_ dtype
+        Has the same shape as ``a``.
+    """
+    return blosc2.LazyExpr(new_op=(a, "upper", None))
+
+
 def lazywhere(value1=None, value2=None):
     """Decorator to apply a where condition to a LazyExpr."""
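The new `blosc2.lower` / `blosc2.upper` build lazy expressions; per the `utils.py` changes in this commit, their NumPy fallback path delegates to `np.char.lower` / `np.char.upper`. A minimal NumPy-only sketch of that underlying element-wise behavior:

```python
import numpy as np

# np.char.lower / np.char.upper are the NumPy functions that the new
# blosc2.lower / blosc2.upper mirror (and use on the fallback path).
a = np.array(["abc", "DeF", "atErr"])
print(np.char.lower(a))  # ['abc' 'def' 'aterr']
print(np.char.upper(a))  # ['ABC' 'DEF' 'ATERR']
```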

src/blosc2/utils.py

Lines changed: 13 additions & 8 deletions
@@ -54,6 +54,14 @@ def _string_startswith(a, b):
     return np.char.startswith(a, b)


+def _string_lower(a):
+    return np.char.lower(a)
+
+
+def _string_upper(a):
+    return np.char.upper(a)
+
+
 def _string_endswith(a, b):
     return np.char.endswith(a, b)

@@ -97,6 +105,8 @@ def _format_expr_scalar(value):
 safe_numpy_globals["contains"] = _string_contains
 safe_numpy_globals["startswith"] = _string_startswith
 safe_numpy_globals["endswith"] = _string_endswith
+safe_numpy_globals["upper"] = _string_upper
+safe_numpy_globals["lower"] = _string_lower


 elementwise_funcs = [
@@ -155,6 +165,7 @@ def _format_expr_scalar(value):
     "logical_not",
     "logical_or",
     "logical_xor",
+    "lower",
     "maximum",
     "minimum",
     "multiply",
@@ -178,6 +189,7 @@ def _format_expr_scalar(value):
     "tan",
     "tanh",
     "trunc",
+    "upper",
     "where",
 ]

@@ -931,13 +943,6 @@ def process_key(key, shape):
     return key, mask


-incomplete_lazyfunc_map = {
-    "contains": lambda *args: np.char.find(*args) != -1,
-    "startswith": lambda *args: np.char.startswith(*args),
-    "endswith": lambda *args: np.char.endswith(*args),
-} | safe_numpy_globals  # clip and logaddexp available in safe_numpy_globals
-
-
 def is_inside_ne_evaluate() -> bool:
     """
     Whether the current code is being executed from an ne_evaluate call
@@ -968,7 +973,7 @@ def filler(inputs_tuple, output, offset):

     def wrapper(*args, **kwargs):
         if is_inside_ne_evaluate():  # haven't been able to use miniexpr so use numpy
-            return incomplete_lazyfunc_map[func.__name__](*args, **kwargs)
+            return safe_numpy_globals[func.__name__](*args, **kwargs)
         return func(*args, **kwargs)

     return wrapper
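The last hunk simplifies the fallback dispatch: the redundant `incomplete_lazyfunc_map` is dropped and the wrapper now looks up the NumPy implementation directly in `safe_numpy_globals` by function name. A standalone sketch of the pattern (assumed, simplified structure; not the actual blosc2 code, and `is_inside_ne_evaluate` here is a stand-in for the real call-stack inspection):

```python
import numpy as np

# NumPy fallbacks, registered under the same names as the lazy functions.
safe_numpy_globals = {"lower": np.char.lower, "upper": np.char.upper}


def is_inside_ne_evaluate():
    return True  # stand-in: pretend we are inside an ne_evaluate call


def incomplete_lazyfunc(func):
    # Dispatch to the NumPy fallback inside ne_evaluate; otherwise run
    # the decorated (lazy) implementation.
    def wrapper(*args, **kwargs):
        if is_inside_ne_evaluate():
            return safe_numpy_globals[func.__name__](*args, **kwargs)
        return func(*args, **kwargs)

    return wrapper


@incomplete_lazyfunc
def lower(a):
    raise NotImplementedError("stand-in for the LazyExpr path")


print(lower(np.array(["AbC"])))  # ['abc'] via the NumPy fallback
```

Keying the fallback on `func.__name__` is what makes registering `_string_lower`/`_string_upper` in `safe_numpy_globals` sufficient to wire the new functions in.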
