classify

Classifies data into categories, i.e. labels ranges of values.

var classes_1 ... classes_2N+1 (facet) classify

Arguments
label | type | description
var | variable | input data to be classified
classes | name and number set | alternating names and numbers, starting and ending with a name, so that there are N+1 names and N numbers (optional)
facet | string | name of new independent variable (name of var if omitted) (optional)
Returns
weights | variable | output. There is an additional grid consisting of the N+1 names, and the values are 0, 1, or missing depending on whether the data was between the values given in the classify number set. This variable is sometimes referred to as being in complete disjunctive form.

Description

classify is used to assign ranges of values from a variable into user-defined classes. Given a variable with a given range of values, the classify statement accepts a list of alternating class names and constants which define the boundaries between the classes within that range. As a result, a new grid composed of the defined classes is created, and the values from the input variable are transformed into flags of 0 (not a member of the class), 1 (is a member of the class), or NaN (not a number -- missing). This is best illustrated with an example.
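As a rough illustration of the flagging rule described above, separate from the Data Library examples that follow, here is a minimal Python sketch (not Ingrid code). The function name, the sample values, and the class names low/mid/high are made up for illustration. It assumes that a value equal to a boundary falls into the upper class; the Data Library's exact treatment of boundary values is not stated here.

```python
import math
from bisect import bisect_right


def classify(values, spec):
    """Flag each value by class.

    spec alternates names and numbers: name, number, name, ..., name,
    giving N+1 names and N boundaries.
    """
    names = list(spec[0::2])
    bounds = list(spec[1::2])
    if len(names) != len(bounds) + 1:
        raise ValueError("spec must start and end with a class name")
    if bounds != sorted(bounds):
        raise ValueError("boundaries must be in increasing order")

    flags = {name: [] for name in names}
    for v in values:
        if v is None or math.isnan(v):
            # Missing input stays missing in every class.
            for name in names:
                flags[name].append(math.nan)
            continue
        # Index of the class whose range contains v
        # (assumption: lower boundary inclusive).
        k = bisect_right(bounds, v)
        for i, name in enumerate(names):
            flags[name].append(1.0 if i == k else 0.0)
    return flags


# Made-up input values and class names, for illustration only.
values = [5.0, 10.0, float("nan"), 30.0]
result = classify(values, ["low", 10.0, "mid", 20.0, "high"])
for name, column in result.items():
    print(name, column)
```

Each input value yields one flag per class, so the output has one more dimension than the input.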

Examples

SOURCES .KAPLAN .Indices .NINO3 .avOS
T (Jan 1901) (Dec 1990) RANGE
T 3 boxAverage
[T]percentileover
{LaNina 0.2 Neutral 0.8 ElNino}classify


This example first takes non-overlapping 3-month seasonal averages of sea surface temperature anomalies (SSTA) from the NINO3 region of the equatorial Pacific Ocean over the period January 1901 to December 1990. This gives a single time series of seasonal sea surface temperature anomalies. The first time step is Jan-Mar 1901, the second is Apr-Jun 1901, and so on until Oct-Dec 1990.

Then, the SSTA values are converted into percentiles, from 0. to 1., using [T]percentileover. The most negative SSTA values in the distribution are assigned a value near zero, and the most positive values are assigned a value near one, with intermediate values ranging between these extremes.
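The percentile step can be pictured with a short Python sketch (not Ingrid code). The formula used, (rank + 0.5) / n, is an assumption chosen for illustration; the exact definition used by percentileover may differ. The SSTA values are made up.

```python
def percentile_ranks(series):
    """Map each value to a rank-based percentile between 0 and 1.

    Assumption: percentile = (rank + 0.5) / n, with rank 0 for the
    smallest value. Ties are ranked in order of appearance.
    """
    n = len(series)
    order = sorted(range(n), key=lambda i: series[i])
    ranks = [0.0] * n
    for position, i in enumerate(order):
        ranks[i] = (position + 0.5) / n
    return ranks


# Made-up seasonal SST anomalies, for illustration only.
ssta = [0.3, -1.2, 0.9, -0.1]
pct = percentile_ranks(ssta)
print(pct)
```

The most negative anomaly receives the smallest percentile and the most positive the largest, which is the property the classification boundaries rely on.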

The next line comprises the classify statement and its parameters. The class names and the boundaries between them are placed within the curly braces. Since the input variable is composed of percentiles that range between 0. and 1. the class boundaries should fall within this range. In this case, any values below 0.2 (in the lowest 20% of the distribution) are classified as 'LaNina', values between 0.2 and 0.8 are classified as 'Neutral', and values from 0.8 upward (in the upper 20% of the distribution) are classified as 'ElNino'.

Whereas the input variable was a function of time (T) only, after classify was applied the output variable became a function of both time (T) and class, which was named after the variable (avOS). So, for a given season and a given class, the output variable will have a value of either 0 (not a member of the class), 1 (is a member of the class), or NaN (missing). The Live Example Link below is followed by a link to a table from the same calculation, which will help to illustrate the meaning of the output.

Live Example Link

Table Link
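The layout of the output described above (one row per season, one column per class) can be sketched in Python as follows. This is not the Data Library's output: the percentile values are made up, and the sketch assumes that a value equal to a boundary belongs to the upper class.

```python
import math

CLASS_NAMES = ["LaNina", "Neutral", "ElNino"]
BOUNDARIES = [0.2, 0.8]


def class_flags(percentile):
    """Return one flag per class for a single time step."""
    if math.isnan(percentile):
        return [math.nan] * len(CLASS_NAMES)
    # Count how many boundaries the value has reached
    # (assumption: lower boundary inclusive).
    k = sum(percentile >= b for b in BOUNDARIES)
    return [1.0 if i == k else 0.0 for i in range(len(CLASS_NAMES))]


# Made-up percentiles, for illustration only.
seasons = ["Jan-Mar 1901", "Apr-Jun 1901", "Jul-Sep 1901", "Oct-Dec 1901"]
percentiles = [0.05, 0.50, float("nan"), 0.93]
table = [class_flags(p) for p in percentiles]

print(f"{'T':<14}" + "".join(f"{name:>9}" for name in CLASS_NAMES))
for season, row in zip(seasons, table):
    print(f"{season:<14}" + "".join(f"{flag:>9.0f}" for flag in row))
```

In every row without missing data exactly one class is flagged with 1, which is what "complete disjunctive form" means.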

classify makes it easy to calculate composites, such as long-term average seasonal precipitation conditioned upon the state of ENSO. The following example applies the classify statement to illustrate a relationship between ENSO and monsoon rainfall in India.

expert SOURCES .Indices .india .rainfall
SOURCES .KAPLAN .Indices .NINO3 .avOS
T (Oct 1901) (Dec 1990) RANGE
T 4 boxAverage
T 12 STEP
[T]percentileover
{LaNina 0.2 Neutral 0.8 ElNino}classify
T 4 shiftdatashort
[T]weighted-average
table: 1 :table


Here the June-September all-India rainfall index is averaged over time, using as weights the classification (0 or 1) of the October-January seasonal SSTAs as 'LaNina', 'Neutral', or 'ElNino'. The result is one long-term mean of June-September Indian monsoon rainfall for each state of ENSO later in the year (an ENSO-based composite of seasonal precipitation).

Live Example Link
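The compositing arithmetic can be sketched in Python (not Ingrid code). The sketch assumes the composite for each class is sum(weight * value) / sum(weight) over time, skipping missing data; whether [T]weighted-average handles missing values exactly this way is an assumption. The rainfall numbers and class flags are made up.

```python
import math


def weighted_average(values, weights):
    """Average values using 0/1 weights, ignoring missing entries."""
    numerator = 0.0
    denominator = 0.0
    for v, w in zip(values, weights):
        if math.isnan(v) or math.isnan(w):
            continue
        numerator += w * v
        denominator += w
    return numerator / denominator if denominator > 0 else math.nan


# Made-up seasonal rainfall totals (one per year), for illustration only.
rainfall = [800.0, 900.0, 1000.0, 700.0, 850.0]

# Made-up class flags for the same years, in complete disjunctive form.
flags = {
    "LaNina":  [0.0, 0.0, 1.0, 0.0, 0.0],
    "Neutral": [1.0, 1.0, 0.0, 0.0, 1.0],
    "ElNino":  [0.0, 0.0, 0.0, 1.0, 0.0],
}

composites = {name: weighted_average(rainfall, w) for name, w in flags.items()}
print(composites)
```

Because the weights are 0 or 1, each composite is simply the mean of the years that fall in that class.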

The following example is similar to the previous one except that instead of using a single precipitation time series for India it uses CMAP gridded precipitation values that vary in space over India. It produces composite maps of June-September 1979-2006 seasonal average precipitation according to ENSO state (LaNina, Neutral, and ElNino) over south Asia.

SOURCES .NOAA .NCEP .CPC .Merged_Analysis .monthly .v0703 .ver2 .prcp_est
X 60. 100. RANGEEDGES
Y 0 40 RANGEEDGES
T (Jun 1979) (Sep 2006) RANGE
T 4 runningAverage
T 12 STEP
SOURCES .KAPLAN .Indices .NINO3 .avOS
T (Oct 1979) (Dec 2006) RANGE
T 4 boxAverage
T 12 STEP
[T]percentileover
{LaNina 0.2 Neutral 0.8 ElNino}classify
T 4 shiftdatashort
[T]weighted-average

Live Example Link

Note that in the special case where a variable is in categorical form (i.e. it has integer values chosen from a list) and that list is given with the variable, the list of classes and transitions can be omitted. For example,

SOURCES .NASA .ISLSCP .GDSLAM .Hydrology-Soils .soils .texture classify

transforms the variable from categorical form (a 3D variable of integer values) to complete disjunctive form: a 4D variable of weights (0 or 1) in which the added dimension holds the list of possibilities defined with the original categorical dataset. The resulting dataset can now be regridded or factor-analyzed.

Live Example Link
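The conversion from categorical form to complete disjunctive form can be sketched in Python (not Ingrid code). The category codes and names below are hypothetical and are not the actual list supplied with the soil texture dataset; missing cells are represented here by None.

```python
import math


def to_disjunctive(codes, categories):
    """Turn integer category codes into one 0/1 weight list per category.

    codes: category code per grid cell (None where data is missing).
    categories: mapping from code to category name.
    """
    weights = {}
    for code, name in categories.items():
        weights[name] = [
            math.nan if c is None else (1.0 if c == code else 0.0)
            for c in codes
        ]
    return weights


# Hypothetical category list and codes, for illustration only.
texture_names = {1: "coarse", 2: "medium", 3: "fine"}
codes = [2, 1, None, 3, 2]
weights = to_disjunctive(codes, texture_names)
for name, column in weights.items():
    print(name, column)
```

Unlike the integer codes, the weights can be averaged meaningfully: averaging them over several cells gives the fraction of cells in each category, which is why this form is suitable for regridding.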


See also

categorical form: dominant_class
Categorization: distrib distrib1D distrib2D dominant_class
complete disjunctive form: dominant_class
Independent Variable Creation: :cressman grouptogrid invertontogrid shiftdata shiftdatashort toS :weaver