@@ -729,6 +729,9 @@ group (requires `lmdb <http://lmdb.readthedocs.io/>`_ to be installed)::
     >>> z[:] = 42
     >>> store.close()

+Distributed/cloud storage
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
 It is also possible to use distributed storage systems. The Dask project has
 implementations of the ``MutableMapping`` interface for Amazon S3 (`S3Map
 <http://s3fs.readthedocs.io/en/latest/api.html#s3fs.mapping.S3Map>`_), Hadoop
@@ -767,6 +770,37 @@ Here is an example using S3Map to read an array created previously::
     >>> z[:].tostring()
     b'Hello from the cloud!'

+Note that retrieving data from a remote service via the network can be significantly
+slower than retrieving data from a local file system, and will depend on the network
+latency and bandwidth between the client and server systems. If you are experiencing
+poor performance, there are several things you can try. One option is to increase the
+array chunk size, which will reduce the number of chunks and hence the number of
+network round-trips required to retrieve data for an array (reducing the impact of
+network latency). Another option is to try to increase the compression ratio by
+changing compression options or trying a different compressor (reducing the impact of
+limited network bandwidth). As of version 2.2, Zarr also provides the
+:class:`zarr.storage.LRUStoreCache` which can be used to implement a local in-memory
+cache layer over a remote store. E.g.::
+
+    >>> s3 = s3fs.S3FileSystem(anon=True, client_kwargs=dict(region_name='eu-west-2'))
+    >>> store = s3fs.S3Map(root='zarr-demo/store', s3=s3, check=False)
+    >>> cache = zarr.LRUStoreCache(store, max_size=2**28)
+    >>> root = zarr.group(store=cache)
+    >>> z = root['foo/bar/baz']
+    >>> from timeit import timeit
+    >>> # first data access is relatively slow, retrieved from store
+    ... timeit('print(z[:].tostring())', number=1, globals=globals()) # doctest: +SKIP
+    b'Hello from the cloud!'
+    0.1081731989979744
+    >>> # second data access is faster, uses cache
+    ... timeit('print(z[:].tostring())', number=1, globals=globals()) # doctest: +SKIP
+    b'Hello from the cloud!'
+    0.0009490990014455747
+
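To make the chunk-size advice above concrete, here is a back-of-the-envelope sketch of how the chunk shape affects the number of store objects, and hence the worst-case number of network requests needed to read a whole array. The shapes used are illustrative assumptions; in Zarr the chunk shape is set via the ``chunks`` argument when creating an array:

```python
import math

def n_chunks(shape, chunks):
    """Number of chunks (and hence store objects) for a chunked array."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))

shape = (10000, 10000)

# Small chunks: many store objects, so many network round-trips.
print(n_chunks(shape, (100, 100)))    # 10000

# Larger chunks: far fewer objects to fetch for the same data.
print(n_chunks(shape, (2000, 2000)))  # 25
```

Whether larger chunks actually help depends on the access pattern: reading small slices from an array with very large chunks can transfer more data than necessary, so it is worth measuring with a tool like ``timeit``.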
+If you are still experiencing poor performance with distributed/cloud storage, please
+raise an issue on the GitHub issue tracker with any profiling data you can provide, as
+there may be opportunities to optimise further either within Zarr or within the mapping
+interface to the storage.

 .. _tutorial_copy:
