@@ -729,6 +729,9 @@ group (requires `lmdb <http://lmdb.readthedocs.io/>`_ to be installed)::
     >>> z[:] = 42
     >>> store.close()

+Distributed/cloud storage
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
 It is also possible to use distributed storage systems. The Dask project has
 implementations of the ``MutableMapping`` interface for Amazon S3 (`S3Map
 <http://s3fs.readthedocs.io/en/latest/api.html#s3fs.mapping.S3Map>`_), Hadoop
@@ -767,6 +770,141 @@ Here is an example using S3Map to read an array created previously::
     >>> z[:].tostring()
     b'Hello from the cloud!'

+Note that retrieving data from a remote service via the network can be significantly
+slower than retrieving data from a local file system, and will depend on network latency
+and bandwidth between the client and server systems. If you are experiencing poor
+performance, there are several things you can try. One option is to increase the array
+chunk size, which will reduce the number of chunks and thus reduce the number of network
+round-trips required to retrieve data for an array (and thus reduce the impact of network
+latency). Another option is to try to increase the compression ratio by changing
+compression options or trying a different compressor (which will reduce the impact of
+limited network bandwidth). As of version 2.2, Zarr also provides the
+:class:`zarr.storage.LRUStoreCache` which can be used to implement a local in-memory cache
+layer over a remote store. E.g.::
+
+    >>> s3 = s3fs.S3FileSystem(anon=True, client_kwargs=dict(region_name='eu-west-2'))
+    >>> store = s3fs.S3Map(root='zarr-demo/store', s3=s3, check=False)
+    >>> cache = zarr.LRUStoreCache(store, max_size=2**28)
+    >>> root = zarr.group(store=cache)
+    >>> z = root['foo/bar/baz']
+    >>> from timeit import timeit
+    >>> # first data access is relatively slow, retrieved from store
+    ... timeit('print(z[:].tostring())', number=1, globals=globals())  # doctest: +SKIP
+    b'Hello from the cloud!'
+    0.1081731989979744
+    >>> # second data access is faster, uses cache
+    ... timeit('print(z[:].tostring())', number=1, globals=globals())  # doctest: +SKIP
+    b'Hello from the cloud!'
+    0.0009490990014455747
+
+If you are still experiencing poor performance with distributed/cloud storage, please
+raise an issue on the GitHub issue tracker with any profiling data you can provide, as
+there may be opportunities to optimise further either within Zarr or within the mapping
+interface to the storage.
+
+.. _tutorial_copy:
+
+Copying/migrating data
+----------------------
+
+If you have some data in an HDF5 file and would like to copy some or all of it
+into a Zarr group, or vice versa, the :func:`zarr.convenience.copy` and
+:func:`zarr.convenience.copy_all` functions can be used. Here's an example
+copying a group named 'foo' from an HDF5 file to a Zarr group::
+
+    >>> import h5py
+    >>> import zarr
+    >>> import numpy as np
+    >>> source = h5py.File('data/example.h5', mode='w')
+    >>> foo = source.create_group('foo')
+    >>> baz = foo.create_dataset('bar/baz', data=np.arange(100), chunks=(50,))
+    >>> spam = source.create_dataset('spam', data=np.arange(100, 200), chunks=(30,))
+    >>> zarr.tree(source)
+    /
+     ├── foo
+     │   └── bar
+     │       └── baz (100,) int64
+     └── spam (100,) int64
+    >>> dest = zarr.open_group('data/example.zarr', mode='w')
+    >>> from sys import stdout
+    >>> zarr.copy(source['foo'], dest, log=stdout)
+    copy /foo
+    copy /foo/bar
+    copy /foo/bar/baz (100,) int64
+    all done: 3 copied, 0 skipped, 800 bytes copied
+    (3, 0, 800)
+    >>> dest.tree()  # N.B., no spam
+    /
+     └── foo
+         └── bar
+             └── baz (100,) int64
+    >>> source.close()
+
+If rather than copying a single group or dataset you would like to copy all
+groups and datasets, use :func:`zarr.convenience.copy_all`, e.g.::
+
+    >>> source = h5py.File('data/example.h5', mode='r')
+    >>> dest = zarr.open_group('data/example2.zarr', mode='w')
+    >>> zarr.copy_all(source, dest, log=stdout)
+    copy /foo
+    copy /foo/bar
+    copy /foo/bar/baz (100,) int64
+    copy /spam (100,) int64
+    all done: 4 copied, 0 skipped, 1,600 bytes copied
+    (4, 0, 1600)
+    >>> dest.tree()
+    /
+     ├── foo
+     │   └── bar
+     │       └── baz (100,) int64
+     └── spam (100,) int64
+
+If you need to copy data between two Zarr groups, the
+:func:`zarr.convenience.copy` and :func:`zarr.convenience.copy_all` functions can
+be used and provide the most flexibility. However, if you want to copy data
+in the most efficient way possible, without changing any configuration options,
+the :func:`zarr.convenience.copy_store` function can be used. This function
+copies data directly between the underlying stores, without any decompression or
+re-compression, and so should be faster. E.g.::
+
+    >>> import zarr
+    >>> import numpy as np
+    >>> store1 = zarr.DirectoryStore('data/example.zarr')
+    >>> root = zarr.group(store1, overwrite=True)
+    >>> baz = root.create_dataset('foo/bar/baz', data=np.arange(100), chunks=(50,))
+    >>> spam = root.create_dataset('spam', data=np.arange(100, 200), chunks=(30,))
+    >>> root.tree()
+    /
+     ├── foo
+     │   └── bar
+     │       └── baz (100,) int64
+     └── spam (100,) int64
+    >>> from sys import stdout
+    >>> store2 = zarr.ZipStore('data/example.zip', mode='w')
+    >>> zarr.copy_store(store1, store2, log=stdout)
+    copy .zgroup
+    copy foo/.zgroup
+    copy foo/bar/.zgroup
+    copy foo/bar/baz/.zarray
+    copy foo/bar/baz/0
+    copy foo/bar/baz/1
+    copy spam/.zarray
+    copy spam/0
+    copy spam/1
+    copy spam/2
+    copy spam/3
+    all done: 11 copied, 0 skipped, 1,138 bytes copied
+    (11, 0, 1138)
+    >>> new_root = zarr.group(store2)
+    >>> new_root.tree()
+    /
+     ├── foo
+     │   └── bar
+     │       └── baz (100,) int64
+     └── spam (100,) int64
+    >>> new_root['foo/bar/baz'][:]
+    array([ 0,  1,  2, ..., 97, 98, 99])
+    >>> store2.close()  # zip stores need to be closed

 .. _tutorial_strings:
