Large Amounts of Massive Arrays (LAMA)

Data arrays contained within fields are stored and manipulated in a very memory-efficient manner, such that large numbers of fields may co-exist and be manipulated regardless of the size of their data arrays. The limiting factor on array size is not the amount of available physical memory but the amount of available disk space, which is generally far greater.

The basic functionality is:

  • Arrays larger than the available physical memory may be created.

    Arrays larger than a preset number of bytes are partitioned into smaller sub-arrays which are not kept in memory but stored on disk, either as part of the original file they were read from or as newly created temporary files, whichever is appropriate. Therefore data arrays larger than a preset size need never wholly exist in memory.

  • Large numbers of arrays which are in total larger than the available physical memory may co-exist.

    Large numbers of data arrays which are collectively larger than the available physical memory may co-exist, regardless of whether any individual array is in memory or stored on disk.

  • Arrays larger than the available physical memory may be operated on.

    Array operations (such as subsetting, assignment, arithmetic, comparison, collapses, etc.) are carried out on a partition-by-partition basis. When an operation traverses the partitions, a partition stored on disk is brought into memory, processed, and returned to disk before the next partition is processed.

  • The memory management does not need to be known at the API level.

    As the array’s metadata (such as size, shape, data type, number of dimensions, etc.) always reflects the data array in its entirety, regardless of how the array is partitioned and whether or not its partitions are in memory, fields and other variables containing data arrays may be used in the API as if they were normal, in-memory objects (like numpy arrays). Partitioning does carry a speed overhead, but this may be minimised for particular applications or hardware configurations.
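For example, a field f whose data array is partitioned, and possibly stored on disk, is used just as if its data were wholly in memory (a sketch, with hypothetical values):

>>> f.shape                 # metadata describes the array in its entirety
(12, 73, 96)
>>> f.size
84096
>>> g = f + 273.15          # carried out partition by partition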

Reading from files

When a field is read from a file, the data array is not realized in memory, however large or small it may be. Instead each partition refers to part of the original file on disk. Therefore reading even very large fields is initially very fast and uses up only a very small amount of memory.
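For example (a sketch, assuming the cf.read function and a hypothetical file name):

>>> f = cf.read('massive_data.nc')[0]   # fast, even for a very large array
>>> f.shape                             # metadata is available immediately
(12000, 73, 96)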

Copying

When a field is deep copied with its copy method or the copy.deepcopy function, the partitions of its data array are transferred to the new field as object identities and are not deep copied. Therefore copying even very large fields is initially very fast and uses up only a very small amount of memory.

The independence of the copied field is, however, preserved since each partition that is stored in memory (as opposed to on disk) is deep copied if and when the data in the new field is actually accessed, and then only if the partition’s data still exists in the original (or any other) field.
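For example (a sketch, assuming an existing field f):

>>> import copy
>>> g = f.copy()            # fast: array partitions are shared, not copied
>>> h = copy.deepcopy(f)    # equivalent to f.copy()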

Aggregation

When two fields are aggregated to form one, larger field, there is no need for either field’s data array partitions to be accessed, and therefore brought into memory if they were stored on disk. The resulting field recombines the two fields’ array partitions, as object identities, into the new, larger array. Therefore creating an aggregated field uses up only a very small amount of extra memory.

The independence of the new field is, however, preserved since each partition that is stored in memory (as opposed to on disk) is deep copied when the data in the new field is actually accessed, and then only if the partition’s data still exists in the original (or any other) field.
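For example (a sketch, assuming the cf.aggregate function and hypothetical file names):

>>> fl = cf.read('january.nc') + cf.read('february.nc')
>>> f = cf.aggregate(fl)[0]   # no partition data is brought into memory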

Subsetting

When a new field is created by subsetting a field, the new field is actually a LAMA deep copy of the original but with additional instructions on each of its data partitions to only use the part specified by the subset indices.

As with copying, creating subsetted fields is initially very fast and uses up only a very small amount of memory, with the added advantage that a deep copy of only the requested parts of the data array needs to be carried out at the time of data access, and then only if the partition’s data still exists in the original field.

When subsetting a field that has previously been subsetted but has not yet had its data accessed, the new subset merely updates the instructions on which parts of the array partitions to use. For example:

>>> f.shape
(12, 73, 96)
>>> g = f.subset[::-2, ...]
>>> h = g.subset[2:5, ...]

is equivalent to

>>> h = f.subset[7:2:-2, ...]

This is because g selects indices 11, 9, 7, 5, 3 and 1 of the first dimension of f, so h selects indices 7, 5 and 3, which is exactly the selection made by f.subset[7:2:-2, ...]. If all of the partitions of field f are stored on disk then, in both cases, so are all of the partitions of field h, and no data has been read from disk.

Speed and memory management

The creation of temporary files for the array partitions of large arrays, and the reading of data from files on disk, can create significant speed overheads (for example, recent tests show that writing a 100 megabyte array to disk can take on the order of 1 second), so it may be desirable to configure the maximum size of array which is kept in memory, and which therefore has fast access.

The data array memory management is configurable with the chunk size and free memory threshold parameters.

  • The chunk size parameter sets the size in bytes (default 104857600) above which field data arrays are partitioned into sub-arrays (each of which is smaller than the chunk size) and either retained in their original input file or stored in temporary files on disk, whichever is appropriate. It is found and set with the CHUNKSIZE function.

  • The free memory threshold parameter is the minimum amount of memory which should always be kept for temporary work space. If the amount of available memory goes below this threshold then all new field data arrays, even those smaller than the chunk size, will be stored on disk. If the free memory subsequently increases sufficiently then new field data arrays smaller than the chunk size will be allowed to remain in memory with no partitioning.

    The parameter’s value is in kibibytes (1 kibibyte = 1024 bytes), but it is set by specifying an integer amount (default 10) of chunk sizes to be kept free. This asymmetry is useful since this parameter should be set in the context of how many array partitions may need to be temporarily realized in memory at any one time. It is found and set with the FM_THRESHOLD function, but setting the chunk size with CHUNKSIZE will also update the free memory threshold.

>>> cf.CHUNKSIZE()
104857600
>>> cf.FM_THRESHOLD()
1024000.0                 # = 104857600 * 10 / 1024.
>>> cf.CHUNKSIZE(2**30)
>>> cf.CHUNKSIZE()
1073741824
>>> cf.FM_THRESHOLD()
10485760.0                # = 1073741824 * 10 / 1024.
>>> cf.FM_THRESHOLD(20)
>>> cf.FM_THRESHOLD()
20971520.0                # = 1073741824 * 20 / 1024.

The default number of chunk sizes to be kept free is set with the MINNCFM parameter in the CONSTANTS dictionary:

>>> cf.CONSTANTS['MINNCFM'] = 15

Setting MINNCFM in this way does not automatically update the free memory threshold, but the new value will be used to calculate a new free memory threshold when the chunk size is set with subsequent CHUNKSIZE calls.
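For example, continuing from the session above (a sketch, with illustrative values):

>>> cf.CONSTANTS['MINNCFM'] = 15
>>> cf.CHUNKSIZE(2**30)   # re-setting the chunk size recalculates the threshold
>>> cf.FM_THRESHOLD()
15728640.0                # = 1073741824 * 15 / 1024.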

Temporary files

The directory in which temporary files are stored is found and set with the TEMPDIR parameter in the CONSTANTS dictionary:

>>> cf.CONSTANTS['TEMPDIR']
'/tmp'
>>> cf.CONSTANTS['TEMPDIR'] = '/home/me/tmp'

The removal of temporary files which are no longer required works in the same way as Python’s automatic garbage collection.

When a partition’s data is stored in a temporary file, that file will only exist for as long as there are partitions referring to it. When no partitions require the file it will be deleted automatically.

When Python exits normally, all temporary files are deleted.
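The following is a minimal sketch of this idea in pure Python, and not the library’s actual implementation: each temporary file’s removal is tied to the lifetime of the object that owns it, so the file disappears when the last partition referring to it is garbage collected, or at normal interpreter exit.

import os
import tempfile
import weakref

class TempSubarray:
    """An on-disk sub-array whose file exists only while at least one
    partition holds a reference to this object (a sketch)."""

    def __init__(self, data_bytes, directory='/tmp'):
        fd, self.path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, 'wb') as fh:
            fh.write(data_bytes)
        # Remove the file when this object is garbage collected, or
        # when Python exits normally, whichever happens first.
        self._finalizer = weakref.finalize(self, os.remove, self.path)

In this sketch, partitions referring to the same temporary file would share a single TempSubarray instance, so the file persists exactly as long as at least one of them does.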

Note that changing the temporary file directory does not prevent temporary files in the original directory from being garbage collected.

Partitioning

To maximise looping efficiency, array partitioning preserves, as far as possible, the sizes of the faster varying (inner) dimensions in each of the sub-arrays.

Examples

If an array with shape (2, 4, 6) is partitioned into 2 partitions then both sub-arrays will have shape (1, 4, 6).

If the same array is partitioned into 4 partitions then all four sub-arrays will have shape (1, 2, 6).

If the same array is partitioned into 8 partitions then all eight sub-arrays will have shape (1, 2, 3).

If the same array is partitioned into 48 partitions then all forty-eight sub-arrays will have shape (1, 1, 1).
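The following is a minimal sketch of a rule that is consistent with these examples, and not necessarily the library’s actual algorithm: the dimensions are split in order from outermost to innermost, each split dividing a dimension by its smallest prime factor, so that the inner dimensions keep their full sizes for as long as possible.

def smallest_prime_factor(n):
    p = 2
    while p * p <= n:
        if n % p == 0:
            return p
        p += 1
    return n

def sub_array_shape(shape, n_partitions):
    """The shape of each sub-array when an array with the given shape
    is split into n_partitions sub-arrays, outer dimensions first."""
    sub = list(shape)
    n = 1
    while n < n_partitions:
        split_any = False
        for i in range(len(sub)):         # outermost dimension first
            if sub[i] > 1:
                p = smallest_prime_factor(sub[i])
                sub[i] //= p              # p divides sub[i] exactly
                n *= p
                split_any = True
                if n >= n_partitions:
                    break
        if not split_any:
            break                         # nothing left to split
    return tuple(sub)

This reproduces the four examples above:

>>> [sub_array_shape((2, 4, 6), n) for n in (2, 4, 8, 48)]
[(1, 4, 6), (1, 2, 6), (1, 2, 3), (1, 1, 1)]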
