cf.FieldList.collapse

FieldList.collapse(method, axes=None, squeeze=False, mtol=1, weights='auto', ddof=1, a=None, i=False, group=None, regroup=False, within_days=None, within_years=None, over_days=None, over_years=None, coordinate='mid_range', group_by='coords', **kwargs)

For each field, collapse axes of the field.

Collapsing an axis involves reducing its size with a given (typically statistical) method.

By default all axes with size greater than 1 are collapsed completely with the given method. For example, to find the minimum of the data array:

>>> g = f.collapse('min')

By default the calculations of means, standard deviations and variances use a combination of volume, area and linear weights based on the field’s metadata. For example, to find the mean of the data array, weighted where possible:

>>> g = f.collapse('mean')

Specific weights may be forced with the weights parameter. For example, to find the variance of the data array, weighting the X and Y axes by cell area, the T axis linearly and leaving all other axes unweighted:

>>> g = f.collapse('variance', weights=['area', 'T'])

A subset of the axes may be collapsed. For example, to find the mean over the time axis:

>>> f
[<CF Field: air_temperature(time(12), latitude(73), longitude(96)) K>]
>>> g = f.collapse('T: mean')
>>> g
[<CF Field: air_temperature(time(1), latitude(73), longitude(96)) K>]

For example, to find the maximum over the time and height axes:

>>> g = f.collapse('T: Z: max')

or, equivalently:

>>> g = f.collapse('max', axes=['T', 'Z'])

An ordered sequence of collapses over different (or the same) subsets of the axes may be specified. For example, to first find the mean over the time axis and subsequently the standard deviation over the latitude and longitude axes:

>>> g = f.collapse('T: mean area: sd')

or, equivalently, in two steps:

>>> g = f.collapse('mean', axes='T').collapse('sd', axes='area')

Grouped collapses are possible, whereby groups of elements along an axis are defined and each group is collapsed independently. The collapsed groups are concatenated so that the collapsed axis in the output field has a size equal to the number of groups. For example, to find the variance along the longitude axis within each group of size 10 degrees:

>>> g = f.collapse('longitude: variance', group=cf.Data(10, 'degrees'))

Climatological statistics (a type of grouped collapse) as defined by the CF conventions may be specified. For example, to collapse a time axis into multiannual means of calendar monthly minima:

>>> g = f.collapse('time: minimum within years T: mean over years',
...                 within_years=cf.M())
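
The two-stage idea behind this climatological collapse can be pictured in plain Python. This is only a sketch, not how cf carries out the calculation: monthly_minima and multiannual_means are hypothetical helpers, the dated sample values are made up, and weights and missing data are ignored.

```python
from collections import defaultdict
from datetime import date

def monthly_minima(samples):
    """'minimum within years' with one-month groups: min per (year, month)."""
    groups = defaultdict(list)
    for day, value in samples:
        groups[(day.year, day.month)].append(value)
    return {key: min(values) for key, values in groups.items()}

def multiannual_means(minima):
    """'mean over years': average the minima that share a calendar month."""
    groups = defaultdict(list)
    for (year, month), value in minima.items():
        groups[month].append(value)
    return {month: sum(v) / len(v) for month, v in sorted(groups.items())}

# Made-up (date, value) samples covering January and February of two years
samples = [
    (date(2000, 1, 10), 3.0), (date(2000, 1, 20), 1.0),
    (date(2000, 2, 10), 4.0), (date(2000, 2, 20), 6.0),
    (date(2001, 1, 10), 5.0), (date(2001, 1, 20), 7.0),
    (date(2001, 2, 10), 6.0), (date(2001, 2, 20), 8.0),
]
minima = monthly_minima(samples)          # four values, one per month of each year
climatology = multiannual_means(minima)   # two values, one per calendar month
print(climatology)  # {1: 3.0, 2: 5.0}
```

The time axis shrinks from eight elements to four after the first step and to two after the second, one per calendar month.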

In all collapses, missing data array elements are accounted for in the calculation.

The following collapse methods are available, over any subset of the axes:

Method Notes
Maximum The maximum of the values.
Minimum The minimum of the values.
Sum The sum of the values.
Mid-range The average of the maximum and the minimum of the values.
Range The absolute difference between the maximum and the minimum of the values.
Mean

The unweighted mean, \(m\), of \(N\) values \(x_i\) is

\[m=\frac{1}{N}\sum_{i=1}^{N} x_i\]

The weighted mean, \(\tilde{m}\), of \(N\) values \(x_i\) with corresponding weights \(w_i\) is

\[\tilde{m}=\frac{1}{\sum_{i=1}^{N} w_i} \sum_{i=1}^{N} w_i x_i\]
Standard deviation

The unweighted standard deviation, \(s\), of \(N\) values \(x_i\) with mean \(m\) and with \(N-ddof\) degrees of freedom (\(ddof\ge0\)) is

\[s=\sqrt{\frac{1}{N-ddof} \sum_{i=1}^{N} (x_i - m)^2}\]

The weighted standard deviation, \(\tilde{s}_N\), of \(N\) values \(x_i\) with corresponding weights \(w_i\), weighted mean \(\tilde{m}\) and with \(N\) degrees of freedom is

\[\tilde{s}_N=\sqrt{\frac{1} {\sum_{i=1}^{N} w_i} \sum_{i=1}^{N} w_i(x_i - \tilde{m})^2}\]

The weighted standard deviation, \(\tilde{s}\), of \(N\) values \(x_i\) with corresponding weights \(w_i\) and with \(N-ddof\) degrees of freedom \((ddof>0)\) is

\[\tilde{s}=\sqrt{ \frac{a \sum_{i=1}^{N} w_i}{a \sum_{i=1}^{N} w_i - ddof}} \tilde{s}_N\]

where \(a\) is the smallest positive number whose product with each weight is an integer. \(a \sum_{i=1}^{N} w_i\) is the size of a new sample created by each \(x_i\) having \(aw_i\) repeats. In practice, \(a\) may not exist or may be difficult to calculate, so \(a\) is either set to a predetermined value or an approximate value is calculated (see the a parameter for details).

Variance The variance is the square of the standard deviation.
Sample size The sample size, \(N\), as would be used for other statistical calculations.
Sum of weights The sum of sample weights, \(\sum_{i=1}^{N} w_i\), as would be used for other statistical calculations.
Sum of squares of weights The sum of squares of sample weights, \(\sum_{i=1}^{N} {w_i}^{2}\), as would be used for other statistical calculations.
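
The weighted mean and standard deviation formulas in the table can be checked with a few lines of plain Python. This is a minimal sketch under stated assumptions: weighted_mean and weighted_sd are illustrative names rather than part of cf, missing data are not handled, and a is passed in explicitly rather than estimated.

```python
import math
import statistics

def weighted_mean(x, w):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(wi * xi for wi, xi in zip(w, x)) / sum(w)

def weighted_sd(x, w, ddof=0, a=1):
    """Weighted standard deviation following the formulas in the table.

    With ddof=0 this is the N degrees of freedom form. With ddof>0 the
    variance is scaled by a*sum(w) / (a*sum(w) - ddof).
    """
    m = weighted_mean(x, w)
    sum_w = sum(w)
    variance = sum(wi * (xi - m) ** 2 for wi, xi in zip(w, x)) / sum_w
    if ddof > 0:
        variance *= a * sum_w / (a * sum_w - ddof)
    return math.sqrt(variance)

x = [1.0, 2.0, 4.0]
w = [1, 1, 2]
# With integer weights and a=1, the result should match the statistics of
# a sample in which each x_i is repeated w_i times.
repeated = [1.0, 2.0, 4.0, 4.0]
print(weighted_mean(x, w), statistics.mean(repeated))
print(weighted_sd(x, w), statistics.pstdev(repeated))
print(weighted_sd(x, w, ddof=1, a=1), statistics.stdev(repeated))
```

Comparing against the repeated sample illustrates the interpretation of a times the sum of weights as the size of a new sample.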

New in version 1.0.

See also

cell_area, weights, max, mean, mid_range, min, range, sample_size, sd, sum, var

Parameters:
method : str

Define the collapse method. All of the axes specified by the axes parameter are collapsed simultaneously by this method. The method is given by one of the following strings:

method Description
'max' or 'maximum' Maximum
'min' or 'minimum' Minimum
'sum' Sum
'mid_range' Mid-range
'range' Range
'mean' or 'average' or 'avg' Mean
'sd' or 'standard_deviation' Standard deviation
'var' or 'variance' Variance
'sample_size' Sample size
'sum_of_weights' Sum of weights
'sum_of_weights2' Sum of squares of weights

An alternative form is to provide a CF cell methods-like string. In this case an ordered sequence of collapses may be defined and both the collapse methods and their axes are provided. The axes are interpreted as for the axes parameter, which must not also be set. For example:

>>> g = f.collapse('time: max (interval 1 hr) X: Y: mean dim3: sd')

is equivalent to:

>>> g = f.collapse('max', axes='time')
>>> g = g.collapse('mean', axes=['X', 'Y'])
>>> g = g.collapse('sd', axes='dim3')    
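
To see how such a string breaks down into an ordered sequence of collapses, the following sketch splits it into (axes, method) steps. It is an illustration only: parse_cell_methods is a hypothetical helper, not cf's parser, and it assumes that words ending in a colon are axes, that parenthesised text can be discarded, and that any further words qualify the preceding method.

```python
import re

def parse_cell_methods(string):
    """Split a cell methods-like string into ordered (axes, method) steps."""
    # Drop parenthesised qualifiers such as "(interval 1 hr)"
    cleaned = re.sub(r"\([^)]*\)", " ", string)
    steps = []
    axes = []
    for token in cleaned.split():
        if token.endswith(":"):
            # "time:" names an axis for the next method
            axes.append(token[:-1])
        elif axes:
            # The first word after the axes is the method
            steps.append((axes, token))
            axes = []
        elif steps:
            # Further words ("within", "years") qualify the previous method
            previous_axes, method = steps[-1]
            steps[-1] = (previous_axes, method + " " + token)
        else:
            raise ValueError("no axes given before %r" % token)
    return steps

steps = parse_cell_methods("time: max (interval 1 hr) X: Y: mean dim3: sd")
print(steps)  # [(['time'], 'max'), (['X', 'Y'], 'mean'), (['dim3'], 'sd')]
```

Each step corresponds to one call in the equivalent three-call form above.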

Climatological collapses are carried out if a method string contains any of the modifiers 'within days', 'within years', 'over days' or 'over years'. For example, to collapse a time axis into multiannual means of calendar monthly minima:

>>> g = f.collapse('time: minimum within years T: mean over years',
...                 within_years=cf.M())

which is equivalent to:

>>> g = f.collapse('time: minimum within years', within_years=cf.M())
>>> g = g.collapse('mean over years', axes='T')
axes, kwargs : optional

The axes to be collapsed. The axes are those that would be selected by this call of the field’s axes method: f.axes(axes, **kwargs). See cf.Field.axes for details. If an axis has size 1 then it is ignored. By default all axes with size greater than 1 are collapsed. If axes has the special value 'area' then it is assumed that the X and Y axes are intended.

Example:

axes='area' is equivalent to axes=['X', 'Y']. axes=['area', 'Z'] is equivalent to axes=['X', 'Y', 'Z'].

weights : optional

Specify the weights for the collapse. The weights are those that would be returned by this call of the field’s weights method: f.weights(weights, components=True). By default weights is 'auto', meaning that a combination of volume, area and linear weights is created based on the field’s metadata. See cf.Field.weights for details.

Example:

To specify weights based on cell areas use weights='area'. To specify weights based on cell areas and linear height you could set weights=('area', 'Z').

squeeze : bool, optional

If True then size 1 collapsed axes are removed from the output data array. By default the axes which are collapsed are retained in the result’s data array.

mtol : number, optional

Set the fraction of input array elements which is allowed to contain missing data when contributing to an individual output array element. Where this fraction exceeds mtol, missing data is returned. The default is 1, meaning that a missing datum in the output array only occurs when its contributing input array elements are all missing data. A value of 0 means that a missing datum in the output array occurs whenever any of its contributing input array elements are missing data. Any intermediate value is permitted.

Example:

To ensure that an output array element is a missing datum if more than 25% of its input array elements are missing data: mtol=0.25.
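
The effect of mtol on a single output element can be sketched as follows. This is an assumption-laden illustration rather than cf's code: collapse_mean is a hypothetical helper, None stands in for a missing datum, and the mean is unweighted.

```python
def collapse_mean(values, mtol=1.0):
    """Mean of the non-missing values, or None if too many are missing."""
    present = [v for v in values if v is not None]
    missing_fraction = (len(values) - len(present)) / len(values)
    # An output with no contributing data is always missing; otherwise it
    # is missing only where the missing fraction exceeds mtol.
    if not present or missing_fraction > mtol:
        return None
    return sum(present) / len(present)

one_missing = [1.0, None, 3.0, 5.0]   # 25% missing
two_missing = [None, None, 3.0, 5.0]  # 50% missing

print(collapse_mean(one_missing))             # 3.0
print(collapse_mean(one_missing, mtol=0.25))  # 3.0
print(collapse_mean(one_missing, mtol=0))     # None
print(collapse_mean(two_missing, mtol=0.25))  # None
```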

ddof : number, optional

The delta degrees of freedom in the calculation of a standard deviation or variance. The number of degrees of freedom used in the calculation is (N-ddof) where N represents the number of non-missing elements. By default ddof is 1, meaning the standard deviation and variance of the population is estimated according to the usual formula with (N-1) in the denominator to avoid the bias caused by the use of the sample mean (Bessel’s correction).

a : optional

Specify the value of \(a\) in the calculation of a weighted standard deviation or variance when the ddof parameter is greater than 0. See the notes above for details. A value is required for each output array element, so a must be a single number or else a field which is broadcastable to the collapsed field. By default the calculation of each output array element uses an approximate value of a which is the smallest positive number whose products with the smallest and largest of the contributing weights, and their sum, are all integers. In this case, a positive number is considered to be an integer if its decimal part is sufficiently small (no greater than \(10^{-8}\) plus \(10^{-5}\) times its integer part).

Example:

To guarantee that \(\tilde{s}\) is exact when the weights for each output array element are collectively coprime integers: a=1.

Note:
  • The default approximation will never overestimate \(a\), so \(\tilde{s}\) will always be greater than or equal to its true value when \(a\) is not specified.
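
Why the default approximation can underestimate but never overestimate \(a\) may be easier to see with exact rational arithmetic. The sketch below is not cf's algorithm: smallest_a and approximate_a are hypothetical helpers, the weights are assumed to be positive integers or Fraction objects, and exact fractions replace the tolerance-based integer test described above.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def smallest_a(weights):
    """Smallest positive a such that a*w is an integer for every weight w."""
    fracs = [Fraction(w) for w in weights]
    numerator_gcd = reduce(gcd, (f.numerator for f in fracs))
    denominator_lcm = reduce(lambda p, q: p * q // gcd(p, q),
                             (f.denominator for f in fracs))
    return Fraction(denominator_lcm, numerator_gcd)

def approximate_a(weights):
    """As smallest_a, but using only the smallest weight, the largest
    weight and the sum of the weights."""
    fracs = [Fraction(w) for w in weights]
    return smallest_a([min(fracs), max(fracs), sum(fracs)])

weights = [2, 3, 4, 3]
print(smallest_a(weights))     # 1
print(approximate_a(weights))  # 1/2
```

Any value of a that makes every weight an integer also makes the smallest weight, the largest weight and their sum integers, so the approximate value cannot exceed the exact one.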
coordinate : str, optional

Set how the cell coordinate values for collapsed axes are defined. This has no effect on the cell bounds for the collapsed axes, which always represent the extrema of the input coordinates. Valid values are:

coordinate Description
'mid_range' An output coordinate is the average of the first and last input coordinate bounds (or the first and last coordinates if there are no bounds). This is the default.
'min' An output coordinate is the minimum of the input coordinates.
'max' An output coordinate is the maximum of the input coordinates.
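
The three options can be sketched for a single collapsed axis as follows. collapsed_coordinate is a hypothetical helper, and the sketch assumes that "first and last input coordinate bounds" means the lower bound of the first cell and the upper bound of the last cell.

```python
def collapsed_coordinate(coords, bounds=None, coordinate="mid_range"):
    """Coordinate value for an axis collapsed to size 1."""
    if coordinate == "min":
        return min(coords)
    if coordinate == "max":
        return max(coords)
    if coordinate == "mid_range":
        if bounds is not None:
            # Assumed reading: outer edges of the first and last cells
            first, last = bounds[0][0], bounds[-1][1]
        else:
            first, last = coords[0], coords[-1]
        return (first + last) / 2
    raise ValueError("unknown coordinate option: %r" % coordinate)

coords = [5.0, 15.0, 30.0]
bounds = [(0.0, 10.0), (10.0, 20.0), (20.0, 40.0)]

print(collapsed_coordinate(coords, bounds))         # 20.0
print(collapsed_coordinate(coords))                 # 17.5
print(collapsed_coordinate(coords, bounds, "min"))  # 5.0
print(collapsed_coordinate(coords, bounds, "max"))  # 30.0
```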
group : optional

Independently collapse groups of axis elements. Upon output, the results of the collapses are concatenated so that the output axis has a size equal to the number of groups. The group parameter defines how the elements are partitioned into groups, and may be one of:

  • A cf.Data defining the group size in terms of ranges of coordinate values. The first group starts at the first coordinate bound of the first axis element (or its coordinate if there are no bounds) and spans the defined group size. Each subsequent group immediately follows the preceding one. By default each group contains the consecutive run of elements whose coordinate values lie within the group limits (see the group_by parameter).

    Example:

    To define groups of 10 kilometres: group=cf.Data(10, 'km').

    Note:
    • By default each element will be in exactly one group (see the group_by parameter).
    • Groups may contain different numbers of elements.
    • If no units are specified then the units of the coordinates are assumed.
  • A cf.TimeDuration defining the group size in terms of calendar months and years or other time intervals. The first group starts at or before the first coordinate bound of the first axis element (or its coordinate if there are no bounds) and spans the defined group size. Each subsequent group immediately follows the preceding one. By default each group contains the consecutive run of elements whose coordinate values lie within the group limits (see the group_by parameter).

    Example:

    To define groups of 5 days, starting and ending at midnight on each day: group=cf.D(5) (see cf.D).

    Example:

    To define groups of 1 calendar month, starting and ending at day 16 of each month: group=cf.M(day=16) (see cf.M).

    Note:
    • By default each element will be in exactly one group (see the group_by parameter).
    • Groups may contain different numbers of elements.
    • The start of the first group may be before the first axis element, depending on the offset defined by the time duration. For example, if group=cf.Y(month=12) then the first group will start on the closest 1st December to the first axis element.
  • A (sequence of) cf.Query, each of which is a condition defining one or more groups. Each query selects elements whose coordinates satisfy its condition and from these elements multiple groups are created - one for each maximally consecutive run within these elements.

    Example:

    To define groups of the season MAM in each year: group=cf.mam() (see cf.mam).

    Example:

    To define groups of the seasons DJF and JJA in each year: group=[cf.jja(), cf.djf()]. To define groups for seasons DJF, MAM, JJA and SON in each year: group=cf.seasons() (see cf.djf, cf.jja and cf.seasons).

    Example:

    To define groups for longitude elements less than or equal to 90 degrees and greater than 90 degrees: group=[cf.le(90, 'degrees'), cf.gt(90, 'degrees')] (see cf.le and cf.gt).

    Note:
    • If a coordinate does not satisfy any of the conditions then its element will not be in a group.
    • Groups may contain different numbers of elements.
    • If no units are specified then the units of the coordinates are assumed.
    • If an element is selected by two or more queries then the latest one in the sequence defines which group it will be in.
  • An int defining the number of elements in each group. The first group starts with the first axis element and spans the defined number of consecutive elements. Each subsequent group immediately follows the preceding one.

    Example:

    To define groups of 5 elements: group=5.

    Note:
    • Each group has the defined number of elements, apart from the last group which may contain fewer elements.
  • A numpy.array of integers defining groups. The array must have the same length as the axis to be collapsed and its sequence of values corresponds to the axis elements. Each group contains the elements which correspond to a common non-negative integer value in the numpy array. Upon output, the collapsed axis is arranged in order of increasing group number.

    Example:

    For an axis of size 8, create two groups, the first containing the first and last elements and the second containing the 3rd, 4th and 5th elements, whilst ignoring the 2nd, 6th and 7th elements: group=numpy.array([0, -1, 4, 4, 4, -1, -2, 0]).

    Note:
    • The groups do not have to be in runs of consecutive elements; they may be scattered throughout the axis.
    • An element which corresponds to a negative integer in the array will not be in a group.
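
The integer array form of group can be pictured with a short sketch that uses the array from the example above. collapse_by_labels is a hypothetical helper working on plain lists, not a cf function, and the data values are made up.

```python
def collapse_by_labels(values, labels, method=sum):
    """Collapse the values that share each non-negative label.

    Elements with a negative label are left out of every group, and the
    output is arranged in order of increasing label.
    """
    if len(values) != len(labels):
        raise ValueError("labels must have the same length as the axis")
    groups = {}
    for value, label in zip(values, labels):
        if label < 0:
            continue
        groups.setdefault(label, []).append(value)
    return [method(groups[label]) for label in sorted(groups)]

values = [10, 20, 30, 40, 50, 60, 70, 80]
labels = [0, -1, 4, 4, 4, -1, -2, 0]

print(collapse_by_labels(values, labels))       # [90, 120]
print(collapse_by_labels(values, labels, max))  # [80, 50]
```

The axis of size 8 collapses to size 2: group 0 holds the first and last elements and group 4 holds the 3rd, 4th and 5th.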
group_by : str, optional

Specify how coordinates are assigned to the groups defined by the group, within_days or within_years parameter. Ignored unless one of these parameters is a cf.Data or cf.TimeDuration object. The group_by parameter may be one of:

  • 'coords'. This is the default. Each group contains the axis elements whose coordinate values lie within the group limits. Every element will be in a group.
  • 'bounds'. Each group contains the axis elements whose upper and lower coordinate bounds both lie within the group limits. Some elements may not be inside any group, either because the group limits do not coincide with coordinate bounds or because the group size is sufficiently small.
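
The difference between the two options can be sketched for groups of a fixed coordinate range. This is an illustration under assumptions that the text does not state: group_by_size is a hypothetical helper, coordinates and bounds increase along the axis, group limits are treated as half-open for 'coords', and a cell whose upper bound equals a group's upper limit counts as inside it for 'bounds'.

```python
def group_by_size(coords, bounds, size, group_by="coords"):
    """Return lists of element indices, one list per group."""
    start = bounds[0][0]  # first group starts at the first coordinate bound
    groups = {}
    for index, (coord, (lower, upper)) in enumerate(zip(coords, bounds)):
        if group_by == "coords":
            number = int((coord - start) // size)
        else:
            number = int((lower - start) // size)
            group_end = start + (number + 1) * size
            if upper > group_end:
                continue  # the cell straddles a group limit: no group
        groups.setdefault(number, []).append(index)
    return [groups[number] for number in sorted(groups)]

coords = [2.5, 7.5, 12.5, 17.5]
bounds = [(0.0, 5.0), (5.0, 10.0), (10.0, 15.0), (15.0, 20.0)]

print(group_by_size(coords, bounds, 7.5, "coords"))  # [[0], [1, 2], [3]]
print(group_by_size(coords, bounds, 7.5, "bounds"))  # [[0], [2], [3]]
```

With 'bounds', the second element belongs to no group because its cell crosses the limit at 7.5.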
regroup : bool, optional

For grouped collapses, return a numpy.array of integers which identifies the groups defined by the group parameter. The array is interpreted as for a numpy array value of the group parameter, and thus may subsequently be used by the group parameter in a separate collapse. For example:

>>> groups = f.collapse('time: mean', group=10, regroup=True)
>>> g = f.collapse('time: mean', group=groups)

is equivalent to:

>>> g = f.collapse('time: mean', group=10)
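
The equivalence can be imitated with plain lists. The helpers below are hypothetical, and the actual label values that cf returns are an assumption here; the sketch simply numbers the fixed-size groups from zero.

```python
def labels_for_group_size(axis_size, group_size):
    """Label each element with the number of the fixed-size group it is in."""
    return [index // group_size for index in range(axis_size)]

def mean_by_labels(values, labels):
    """Mean of the values sharing each non-negative label, in label order."""
    groups = {}
    for value, label in zip(values, labels):
        if label >= 0:
            groups.setdefault(label, []).append(value)
    return [sum(groups[k]) / len(groups[k]) for k in sorted(groups)]

def mean_by_size(values, group_size):
    """Mean of each consecutive run of group_size elements."""
    chunks = [values[i:i + group_size]
              for i in range(0, len(values), group_size)]
    return [sum(chunk) / len(chunk) for chunk in chunks]

values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
labels = labels_for_group_size(len(values), 3)

print(labels)                          # [0, 0, 0, 1, 1, 1, 2]
print(mean_by_labels(values, labels))  # [2.0, 5.0, 7.0]
print(mean_by_size(values, 3))         # [2.0, 5.0, 7.0]
```

The last group has fewer elements than the others, as noted for integer group sizes.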
within_days : optional

Independently collapse groups of reference-time axis elements for CF “within days” climatological statistics. Each group contains elements whose coordinates span a time interval of up to one day. Upon output, the results of the collapses are concatenated so that the output axis has a size equal to the number of groups.

Note:

For CF compliance, a “within days” collapse should be followed by an “over days” collapse.

The within_days parameter defines how the elements are partitioned into groups, and may be one of:

  • A cf.TimeDuration defining the group size in terms of a time interval of up to one day. The first group starts at or before the first coordinate bound of the first axis element (or its coordinate if there are no bounds) and spans the defined group size. Each subsequent group immediately follows the preceding one. By default each group contains the consecutive run of elements whose coordinate values lie within the group limits (see the group_by parameter).

    Example:

    To define groups of 6 hours, starting at 00:00, 06:00, 12:00 and 18:00: within_days=cf.h(6) (see cf.h).

    Example:

    To define groups of 1 day, starting at 06:00: within_days=cf.D(1, hour=6) (see cf.D).

    Note:
    • Groups may contain different numbers of elements.
    • The start of the first group may be before the first axis element, depending on the offset defined by the time duration. For example, if within_days=cf.D(hour=12) then the first group will start on the closest midday to the first axis element.
  • A (sequence of) cf.Query, each of which is a condition defining one or more groups. Each query selects elements whose coordinates satisfy its condition and from these elements multiple groups are created - one for each maximally consecutive run within these elements.

    Example:

    To define groups of 00:00 to 06:00 within each day, ignoring the rest of each day: within_days=cf.hour(cf.le(6)) (see cf.hour and cf.le).

    Example:

    To define groups of 00:00 to 06:00 and 18:00 to 24:00 within each day, ignoring the rest of each day: within_days=[cf.hour(cf.le(6)), cf.hour(cf.gt(18))] (see cf.gt, cf.hour and cf.le).

    Note:
    • Groups may contain different numbers of elements.
    • If no units are specified then the units of the coordinates are assumed.
    • If a coordinate does not satisfy any of the conditions then its element will not be in a group.
    • If an element is selected by two or more queries then the latest one in the sequence defines which group it will be in.
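
How a single query gives rise to one group per maximally consecutive run can be sketched with itertools.groupby. consecutive_runs is a hypothetical helper, an ordinary function stands in for the cf.Query condition, and the six-hourly time axis is made up.

```python
from datetime import datetime, timedelta
from itertools import groupby

def consecutive_runs(coords, condition):
    """Indices of each maximal run of adjacent elements meeting condition."""
    runs = []
    keyed = groupby(enumerate(coords), key=lambda pair: condition(pair[1]))
    for selected, items in keyed:
        if selected:
            runs.append([index for index, _ in items])
    return runs

# Two days of six-hourly times: 00:00, 06:00, 12:00 and 18:00 on each day
start = datetime(2000, 1, 1)
times = [start + timedelta(hours=6 * i) for i in range(8)]

# Stand-in for a condition like "hour less than or equal to 6"
runs = consecutive_runs(times, lambda t: t.hour <= 6)
print(runs)  # [[0, 1], [4, 5]]
```

Each day contributes its own group, and the elements at 12:00 and 18:00 are in no group.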
within_years : optional

Independently collapse groups of reference-time axis elements for CF “within years” climatological statistics. Each group contains elements whose coordinates span a time interval of up to one calendar year. Upon output, the results of the collapses are concatenated so that the output axis has a size equal to the number of groups.

Note:

For CF compliance, a “within years” collapse should be followed by an “over years” collapse.

The within_years parameter defines how the elements are partitioned into groups, and may be one of:

  • A cf.TimeDuration defining the group size in terms of a time interval of up to one calendar year. The first group starts at or before the first coordinate bound of the first axis element (or its coordinate if there are no bounds) and spans the defined group size. Each subsequent group immediately follows the preceding one. By default each group contains the consecutive run of elements whose coordinate values lie within the group limits (see the group_by parameter).

    Example:

    To define groups of 90 days: within_years=cf.D(90) (see cf.D).

    Example:

    To define groups of 3 calendar months, starting on the 15th of a month: within_years=cf.M(3, day=15) (see cf.M).

    Note:
    • Groups may contain different numbers of elements.
    • The start of the first group may be before the first axis element, depending on the offset defined by the time duration. For example, if within_years=cf.Y(month=12) then the first group will start on the closest 1st December to the first axis element.
  • A (sequence of) cf.Query, each of which is a condition defining one or more groups. Each query selects elements whose coordinates satisfy its condition and from these elements multiple groups are created - one for each maximally consecutive run within these elements.

    Example:

    To define groups for the season MAM within each year: within_years=cf.mam() (see cf.mam).

    Example:

    To define groups for February and for November to December within each year: within_years=[cf.month(2), cf.month(cf.ge(11))] (see cf.month and cf.ge).

    Note:
    • The first group may start outside of the range of coordinates (the start of the first group is controlled by parameters of the cf.TimeDuration).
    • If group boundaries do not coincide with coordinate bounds then some elements may not be inside any group.
    • If the group size is sufficiently small then some elements may not be inside any group.
    • Groups may contain different numbers of elements.
over_days : optional

Independently collapse groups of reference-time axis elements for CF “over days” climatological statistics. Each group contains elements whose coordinates are matching, in that their lower bounds have a common time of day but different dates of the year, and their upper bounds also have a common time of day but different dates of the year. Upon output, the results of the collapses are concatenated so that the output axis has a size equal to the number of groups.

Example:

An element with coordinate bounds {1999-12-31 06:00:00, 1999-12-31 18:00:00} matches an element with coordinate bounds {2000-01-01 06:00:00, 2000-01-01 18:00:00}.

Example:

An element with coordinate bounds {1999-12-31 00:00:00, 2000-01-01 00:00:00} matches an element with coordinate bounds {2000-01-01 00:00:00, 2000-01-02 00:00:00}.
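
The matching rule can be sketched by keying each element on the times of day of its two bounds. match_over_days is a hypothetical helper; it compares only the time of day, so the length of each cell is not checked.

```python
from collections import defaultdict
from datetime import datetime

def match_over_days(bounds):
    """Group element indices whose bounds share the same times of day."""
    groups = defaultdict(list)
    for index, (lower, upper) in enumerate(bounds):
        groups[(lower.time(), upper.time())].append(index)
    return list(groups.values())

bounds = [
    (datetime(1999, 12, 31, 6), datetime(1999, 12, 31, 18)),
    (datetime(1999, 12, 31, 18), datetime(2000, 1, 1, 6)),
    (datetime(2000, 1, 1, 6), datetime(2000, 1, 1, 18)),
]
matches = match_over_days(bounds)
print(matches)  # [[0, 2], [1]]
```

The two 06:00 to 18:00 cells from the first example fall into the same group, while the overnight cell forms a group of its own.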

Note:
  • A coordinate parameter value of 'min' is assumed, regardless of its given value.

  • A group_by parameter value of 'bounds' is assumed, regardless of its given value.

  • An “over days” collapse must be preceded by a “within days” collapse, as described by the CF conventions. If the field already contains sub-daily data, but does not have the “within days” cell methods flag then it may be added, for example, as follows (this example assumes that the appropriate cell method is the most recently applied, which need not be the case; see cf.CellMethods for details):

    >>> f.cell_methods[-1].within = 'days'
    

The over_days parameter defines how the elements are partitioned into groups, and may be one of:

  • None. This is the default. Each collection of matching elements forms a group.
  • A cf.TimeDuration defining the group size in terms of a time duration of at least one day. Multiple groups are created from each collection of matching elements - the first of which starts at or before the first coordinate bound of the first element and spans the defined group size. Each subsequent group immediately follows the preceding one. By default each group contains the matching elements whose coordinate values lie within the group limits (see the group_by parameter).

    Example:

    To define groups spanning 90 days: over_days=cf.D(90) or over_days=cf.h(2160) (see cf.D and cf.h).

    Example:

    To define groups spanning 3 calendar months, starting and ending at 06:00 in the first day of each month: over_days=cf.M(3, hour=6) (see cf.M).

    Note:
    • Groups may contain different numbers of elements.
    • The start of the first group may be before the first axis element, depending on the offset defined by the time duration. For example, if over_days=cf.M(day=15) then the first group will start on the closest 15th of a month to the first axis element.
  • A (sequence of) cf.Query, each of which is a condition defining one or more groups. Each query selects elements whose coordinates satisfy its condition and from these elements multiple groups are created - one for each subset of matching elements.

    Example:

    To define groups for January and for June to December, ignoring all other months: over_days=[cf.month(1), cf.month(cf.wi(6, 12))] (see cf.month and cf.wi).

    Note:
    • If a coordinate does not satisfy any of the conditions then its element will not be in a group.
    • Groups may contain different numbers of elements.
    • If an element is selected by two or more queries then the latest one in the sequence defines which group it will be in.
over_years : optional

Independently collapse groups of reference-time axis elements for CF “over years” climatological statistics. Each group contains elements whose coordinates are matching, in that their lower bounds have a common sub-annual date but different years, and their upper bounds also have a common sub-annual date but different years. Upon output, the results of the collapses are concatenated so that the output axis has a size equal to the number of groups.

Example:

An element with coordinate bounds {1999-06-01 06:00:00, 1999-09-01 06:00:00} matches an element with coordinate bounds {2000-06-01 06:00:00, 2000-09-01 06:00:00}.

Example:

An element with coordinate bounds {1999-12-01 00:00:00, 2000-12-01 00:00:00} matches an element with coordinate bounds {2000-12-01 00:00:00, 2001-12-01 00:00:00}.

Note:
  • A coordinate parameter value of 'min' is assumed, regardless of its given value.

  • A group_by parameter value of 'bounds' is assumed, regardless of its given value.

  • An “over years” collapse must be preceded by a “within years” or an “over days” collapse, as described by the CF conventions. If the field already contains sub-annual data, but does not have the “within years” or “over days” cell methods flag then it may be added, for example, as follows (this example assumes that the appropriate cell method is the most recently applied, which need not be the case; see cf.CellMethods for details):

    >>> f.cell_methods[-1].over = 'days'
    

The over_years parameter defines how the elements are partitioned into groups, and may be one of:

  • None. Each collection of matching elements forms a group. This is the default.
  • A cf.TimeDuration defining the group size in terms of a time interval of at least one calendar year. Multiple groups are created from each collection of matching elements - the first of which starts at or before the first coordinate bound of the first element and spans the defined group size. Each subsequent group immediately follows the preceding one. By default each group contains the matching elements whose coordinate values lie within the group limits (see the group_by parameter).

    Example:

    To define groups spanning 10 calendar years: over_years=cf.Y(10) or over_years=cf.M(120) (see cf.M and cf.Y).

    Example:

    To define groups spanning 5 calendar years, starting and ending at 06:00 on 01 December of each year: over_years=cf.Y(5, month=12, hour=6) (see cf.Y).

    Note:
    • Groups may contain different numbers of elements.
    • The start of the first group may be before the first axis element, depending on the offset defined by the time duration. For example, if over_years=cf.Y(month=12) then the first group will start on the closest 1st December to the first axis element.
  • A (sequence of) cf.Query, each of which is a condition defining one or more groups. Each query selects elements whose coordinates satisfy its condition and from these elements multiple groups are created - one for each subset of matching elements.

    Example:

    To define one group spanning 1981 to 1990 and another spanning 2001 to 2005: over_years=[cf.year(cf.wi(1981, 1990)), cf.year(cf.wi(2001, 2005))] (see cf.year and cf.wi).

    Note:
    • If a coordinate does not satisfy any of the conditions then its element will not be in a group.
    • Groups may contain different numbers of elements.
    • If an element is selected by two or more queries then the latest one in the sequence defines which group it will be in.
i : bool, optional

If True then update the field list in place. By default a new field list is created. In either case, a field list is returned.

Returns:
out : cf.FieldList or list

For each field, the collapsed field. If the regroup parameter is True then a numpy array is returned.

Examples:

Calculate the mean over the entire field (weighted where possible, since weights defaults to 'auto'):

>>> g = f.collapse('mean')

Five equivalent ways to calculate the mean over a CF latitude axis:

>>> g = f.collapse('latitude: mean')
>>> g = f.collapse('lat: avg')
>>> g = f.collapse('Y: average')
>>> g = f.collapse('mean', 'Y')
>>> g = f.collapse('mean', ['latitude'])

Three equivalent ways to calculate an area weighted mean over CF latitude and longitude axes:

>>> g = f.collapse('area: mean', weights='area')
>>> g = f.collapse('lat: lon: mean', weights='area')
>>> g = f.collapse('mean', axes=['Y', 'X'], weights='area')

Two equivalent ways to calculate a time weighted mean over CF latitude, longitude and time axes:

>>> g = f.collapse('X: Y: T: mean', weights='T')
>>> g = f.collapse('mean', axes=['T', 'Y', 'X'], weights='T')

Find how many non-missing elements in each group of a grouped collapse:

>>> f.collapse('latitude: sample_size', group=cf.Data(5, 'degrees'))