|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "metadata": {}, |
| 6 | + "source": [ |
| 7 | + "# Working with arrays of strings in Blosc2\n", |
| 8 | + "\n", |
| 9 | +    "Blosc2 supports arrays whose elements are strings, either byte strings (``np.bytes_``, equivalent to ``np.dtype('S0')``) or Unicode strings (``np.str_``, equivalent to ``np.dtype('U0')``), with the typesize determined by the longest element in the array. That is:"
| 10 | + ] |
| 11 | + }, |
| 12 | + { |
| 13 | + "cell_type": "code", |
| 14 | + "execution_count": 1, |
| 15 | + "metadata": {}, |
| 16 | + "outputs": [ |
| 17 | + { |
| 18 | + "name": "stdout", |
| 19 | + "output_type": "stream", |
| 20 | + "text": [ |
| 21 | + "Bytes array - dtype: |S3, typesize: 3\n", |
| 22 | + "Unicode array - dtype: <U3, typesize: 12\n" |
| 23 | + ] |
| 24 | + } |
| 25 | + ], |
| 26 | + "source": [ |
| 27 | + "import numpy as np\n", |
| 28 | + "\n", |
| 29 | + "arr = np.array([b\"a23\", b\"89u\"])\n", |
| 30 | + "print(f\"Bytes array - dtype: {arr.dtype}, typesize: {arr.dtype.itemsize}\")\n", |
| 31 | + "arr = np.array([\"a23\", \"89u\"])\n", |
| 32 | + "print(f\"Unicode array - dtype: {arr.dtype}, typesize: {arr.dtype.itemsize}\")" |
| 33 | + ] |
| 34 | + }, |
| 35 | + { |
| 36 | + "cell_type": "markdown", |
| 37 | + "metadata": {}, |
| 38 | + "source": [ |
| 39 | + "\n", |
| 40 | +    "since each Unicode character is encoded in 4 bytes. This carries over to the ``blosc2.NDArray`` object. Indeed, such arrays, particularly those of Unicode type, are highly compressible, since almost all of the bits encoding each item are 0 (i.e. ``\\x00``), as can be seen by viewing the second array above as an array of bytestrings:"
| 41 | + ] |
| 42 | + }, |
| 43 | + { |
| 44 | + "cell_type": "code", |
| 45 | + "execution_count": 2, |
| 46 | + "metadata": {}, |
| 47 | + "outputs": [ |
| 48 | + { |
| 49 | + "data": { |
| 50 | + "text/plain": [ |
| 51 | + "array([b'a\\x00\\x00\\x002\\x00\\x00\\x003', b'8\\x00\\x00\\x009\\x00\\x00\\x00u'],\n", |
| 52 | + " dtype='|S12')" |
| 53 | + ] |
| 54 | + }, |
| 55 | + "execution_count": 2, |
| 56 | + "metadata": {}, |
| 57 | + "output_type": "execute_result" |
| 58 | + } |
| 59 | + ], |
| 60 | + "source": [ |
| 61 | + "arr.view(\"S12\")" |
| 62 | + ] |
| 63 | + }, |
| 64 | + { |
| 65 | + "cell_type": "markdown", |
| 66 | + "metadata": {}, |
| 67 | + "source": [ |
| 68 | +    "(Note that the trailing ``\\x00`` bytes of the last character are suppressed in the display.) Consequently, Blosc2 can save a lot of space (in memory or on disk) when working with arrays of strings, provided the structure of the (Unicode) strings is exploited correctly. Specifically, the fundamental building block of the array should be the byte, not the element: in this way, the shuffle filter groups the $N$ elements of $m$ characters (4 bytes each) into $4$ streams of $Nm$ bytes, so that the corresponding bytes of all characters are grouped together. For the array above, this transforms the layout from (2 elements of $3 \\times 4 = 12$ bytes)\n",
| 69 | + "```\n", |
| 70 | + "|a\\x00\\x00\\x002\\x00\\x00\\x003\\x00\\x00\\x00|8\\x00\\x00\\x009\\x00\\x00\\x00u\\x00\\x00\\x00|\n", |
| 71 | + "```\n", |
| 72 | + "to (4 streams of $2 \\times 3 = 6$ bytes)\n", |
| 73 | + "```\n", |
| 74 | + "|a2389u|\\x00\\x00\\x00\\x00\\x00\\x00|\\x00\\x00\\x00\\x00\\x00\\x00|\\x00\\x00\\x00\\x00\\x00\\x00|\n", |
| 75 | + "```\n", |
| 76 | +    "In the example above, 3 of the 4 bytes of each character are 0, so grouping these zeros together makes it far more likely that chunks are composed entirely, or almost entirely, of zeros, which compress readily.\n",
| 77 | +    "If one were instead to shuffle at the level of whole elements, one would end up with $4m$ streams of $N$ bytes, i.e. ($4 \\times 3 = 12$ streams of $2$ bytes)\n",
| 78 | + "```\n", |
| 79 | + "|a8|\\x00\\x00|\\x00\\x00|\\x00\\x00|29|\\x00\\x00|\\x00\\x00|\\x00\\x00|3u|\\x00\\x00|\\x00\\x00|\\x00\\x00|\n", |
| 80 | + "```\n", |
| 81 | +    "which is not very compressible, since the informative bytes are fragmented by small groups of 0 bytes. This heuristic is implemented as the default for Blosc2 string compression, so you can simply compress your arrays of strings and reap the benefits without worrying about such intricacies. Check it out below:"
| 82 | + ] |
| 83 | + }, |
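| 84 | +   {
| 84 | +    "cell_type": "markdown",
| 84 | +    "metadata": {},
| 84 | +    "source": [
| 84 | +     "The byte-grouping described above can be sketched in plain NumPy. This is only an illustration of the idea, not Blosc2's actual implementation: viewing the Unicode array as raw bytes and transposing with a 4-byte typesize reproduces the 4 streams shown above (assuming a little-endian platform):"
| 84 | +    ]
| 84 | +   },
| 84 | +   {
| 84 | +    "cell_type": "code",
| 84 | +    "execution_count": null,
| 84 | +    "metadata": {},
| 84 | +    "outputs": [],
| 84 | +    "source": [
| 84 | +     "import numpy as np\n",
| 84 | +     "\n",
| 84 | +     "arr = np.array([\"a23\", \"89u\"])  # 2 elements x 3 chars x 4 bytes/char\n",
| 84 | +     "raw = arr.view(np.uint8)  # the 24 raw bytes (little-endian UCS-4)\n",
| 84 | +     "streams = raw.reshape(-1, 4).T  # shuffle with typesize 4: one row per byte position\n",
| 84 | +     "print(streams[0].tobytes())  # low bytes carry the ASCII codes\n",
| 84 | +     "print(streams[1:].any())  # the other 3 streams are all zeros"
| 84 | +    ]
| 84 | +   },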
| 84 | + { |
| 85 | + "cell_type": "code", |
| 86 | + "execution_count": 3, |
| 87 | + "metadata": {}, |
| 88 | + "outputs": [ |
| 89 | + { |
| 90 | + "name": "stdout", |
| 91 | + "output_type": "stream", |
| 92 | + "text": [ |
| 93 | + "cratio forcing non-string defaults: 2417x\n", |
| 94 | + "cratio allowing blosc2 to optimise: 3902x\n" |
| 95 | + ] |
| 96 | + } |
| 97 | + ], |
| 98 | + "source": [ |
| 99 | + "import blosc2\n", |
| 100 | + "\n", |
| 101 | + "N = int(1e5)\n", |
| 102 | + "nparr = np.repeat(np.array([\"josé\", \"pepe\", \"francisco\"]), N)\n", |
| 103 | + "cparams = blosc2.cparams_dflts\n", |
| 104 | + "arr1 = blosc2.asarray(nparr, cparams=cparams)\n", |
| 105 | + "print(f\"cratio forcing non-string defaults: {round(arr1.cratio)}x\")\n", |
| 106 | + "arr1 = blosc2.asarray(nparr)\n", |
| 107 | + "print(f\"cratio allowing blosc2 to optimise: {round(arr1.cratio)}x\")" |
| 108 | + ] |
| 109 | + }, |
| 110 | + { |
| 111 | + "cell_type": "markdown", |
| 112 | + "metadata": {}, |
| 113 | + "source": [ |
| 114 | + "## Operating on arrays of strings\n", |
| 115 | + "\n", |
| 116 | +    "Blosc2 has two tightly integrated sides, compression and computation, and arrays of strings benefit from both. We have implemented a subset of useful functions for strings:\n",
| 117 | + "- comparison operations ``<, <=, ==, !=, >=, >``\n", |
| 118 | + "- 2-argument functions ``contains, startswith, endswith``\n", |
| 119 | + "- 1-argument functions ``lower, upper``\n", |
| 120 | + "\n", |
| 121 | +    "Where possible, these are computed by the ``miniexpr`` backend: a highly optimised, fully compiled, multithreaded library that embodies Blosc2's goal of fully vertically integrated decompression/computation/recompression, with optimal exploitation of the cache hierarchy and fast compiled C code for as much of the pipeline as possible. Where this is not possible, a more robust path is used that still avoids memory overload for large arrays.\n",
| 122 | + "\n", |
| 123 | + "The arguments may be scalars or arrays (``blosc2.NDArray`` or other types) of strings or bytes. " |
| 124 | + ] |
| 125 | + }, |
| 126 | + { |
| 127 | + "cell_type": "code", |
| 128 | + "execution_count": 4, |
| 129 | + "metadata": {}, |
| 130 | + "outputs": [], |
| 131 | + "source": [ |
| 132 | + "for t in (\"bytes\", \"string\"):\n", |
| 133 | + " if t == \"bytes\":\n", |
| 134 | + " a1 = np.array([b\"abc\", b\"def\", b\"atErr\", b\"oot\", b\"zu\", b\"ab c\"])\n", |
| 135 | + " a2 = a2_blosc = b\"a\"\n", |
| 136 | + " else:\n", |
| 137 | + " a1 = np.array([\"abc\", \"def\", \"atErr\", \"oot\", \"zu\", \"ab c\"])\n", |
| 138 | + " a2 = a2_blosc = \"a\"\n", |
| 139 | + " a1_blosc = blosc2.asarray(a1)\n", |
| 140 | + " for func, npfunc in zip(\n", |
| 141 | + " (blosc2.startswith, blosc2.endswith, blosc2.contains),\n", |
| 142 | + " (np.char.startswith, np.char.endswith, lambda *args: np.char.find(*args) != -1),\n", |
| 143 | + " strict=True,\n", |
| 144 | + " ):\n", |
| 145 | + " expr_lazy = func(a1_blosc, a2_blosc)\n", |
| 146 | +    "        res_np = npfunc(a1, a2)  # NumPy reference result\n",
| 147 | +    "        assert expr_lazy.shape == res_np.shape\n",
| 148 | +    "        assert expr_lazy.dtype == blosc2.bool_\n",
| 149 | +    "        np.testing.assert_array_equal(expr_lazy[:], res_np)\n",
| 150 | + "\n", |
| 151 | + " np.testing.assert_array_equal((a1_blosc < a2_blosc)[:], a1 < a2)\n", |
| 152 | + " np.testing.assert_array_equal((a1_blosc <= a2_blosc)[:], a1 <= a2)\n", |
| 153 | + " np.testing.assert_array_equal((a1_blosc == a2_blosc)[:], a1 == a2)\n", |
| 154 | + " np.testing.assert_array_equal((a1_blosc != a2_blosc)[:], a1 != a2)\n", |
| 155 | + " np.testing.assert_array_equal((a1_blosc >= a2_blosc)[:], a1 >= a2)\n", |
| 156 | + " np.testing.assert_array_equal((a1_blosc > a2_blosc)[:], a1 > a2)\n", |
| 157 | + "\n", |
| 158 | + " for func, npfunc in zip((blosc2.lower, blosc2.upper), (np.char.lower, np.char.upper), strict=True):\n", |
| 159 | + " expr_lazy = func(a1_blosc)\n", |
| 160 | +    "        res_np = npfunc(a1)  # NumPy reference result\n",
| 161 | +    "        assert expr_lazy.shape == res_np.shape\n",
| 162 | +    "        np.testing.assert_array_equal(expr_lazy[:], res_np)"
| 163 | + ] |
| 164 | +   }
| 170 | + ], |
| 171 | + "metadata": { |
| 172 | + "kernelspec": { |
| 173 | + "display_name": "blosc2env", |
| 174 | + "language": "python", |
| 175 | + "name": "python3" |
| 176 | + }, |
| 177 | + "language_info": { |
| 178 | + "codemirror_mode": { |
| 179 | + "name": "ipython", |
| 180 | + "version": 3 |
| 181 | + }, |
| 182 | + "file_extension": ".py", |
| 183 | + "mimetype": "text/x-python", |
| 184 | + "name": "python", |
| 185 | + "nbconvert_exporter": "python", |
| 186 | + "pygments_lexer": "ipython3", |
| 187 | + "version": "3.13.7" |
| 188 | + } |
| 189 | + }, |
| 190 | + "nbformat": 4, |
| 191 | + "nbformat_minor": 2 |
| 192 | +} |