OpenViz supports a host
of aggregation functions that can analyze data
in records segregated into bins. OpenViz allows
the definition of record-binning schemes, ranging
from the simplest one-dimensional discrete bin
specification to the most complex three-dimensional
scheme with non-uniform continuous binning ranges.
The mechanism for defining bin-membership criteria
in OpenViz is called the AxisMap. The
wide range of AxisMaps that can be created within
OpenViz provides dynamic flexibility in defining
bins including:
• Discrete and continuous binning: Bin membership
can be defined based on unique, discrete data
values or on continuous ranges of data values.
• Non-uniform bin ranges: Bins can be defined by chopping
a continuous data space into a series of ranges.
These ranges can be of non-uniform size and can
be adjusted dynamically.
• Variety of data types: Bin membership criteria
need not be based on simple numerical data, but
can be based on a wider range of types including
string, date/time and currency values. In the
case of date/time values, the bins can be defined
using convenient intervals such as months or quarters
and can easily be made to omit weekends and holidays.
• Multi-dimensional analysis: As many as three different criteria
can be used to define bin membership, allowing
for 1-, 2- and 3-dimensional analyses.
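The binning options above can be illustrated with a minimal Python sketch. This is not the OpenViz AxisMap API; the names continuous_bin, assign_bins, axis_maps and price_edges are hypothetical, and the half-open range convention [low, high) is an assumption. It shows a two-dimensional scheme that combines one discrete criterion (a string value) with one continuous criterion split into non-uniform ranges.

```python
from bisect import bisect_right
from collections import defaultdict


def continuous_bin(value, edges):
    """Return the index i of the range [edges[i], edges[i+1]) holding value.

    Returns None when the value falls outside the outermost edges.
    The edges must be sorted but need not be evenly spaced.
    """
    if value < edges[0] or value >= edges[-1]:
        return None
    return bisect_right(edges, value) - 1


def assign_bins(records, axis_maps):
    """Group records by a key built from one criterion per dimension."""
    bins = defaultdict(list)
    for rec in records:
        key = tuple(axis_map(rec) for axis_map in axis_maps)
        if None not in key:  # skip records outside every range
            bins[key].append(rec)
    return dict(bins)


records = [
    {"region": "east", "price": 4.0},
    {"region": "west", "price": 12.5},
    {"region": "east", "price": 55.0},
    {"region": "east", "price": 9.99},
]

price_edges = [0, 10, 50, 100]  # non-uniform ranges: 0-10, 10-50, 50-100

axis_maps = [
    lambda r: r["region"],                              # discrete criterion
    lambda r: continuous_bin(r["price"], price_edges),  # continuous criterion
]

bins = assign_bins(records, axis_maps)
for key in sorted(bins):
    print(key, [r["price"] for r in bins[key]])
```

Changing price_edges and calling assign_bins again re-bins the same records, which is one simple way to model ranges that are adjusted dynamically. A third criterion in axis_maps would give a three-dimensional scheme.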
The versatile binning options outlined above support the
implementation of analytic techniques such as
cluster analysis and correspondence analysis, which require
bin membership criteria to be flexible and mutable.
(Cluster analysis provides tools for grouping
data items with increasing specificity; for example,
biologists classify man as an animal, a vertebrate,
an amniote, a mammal, and ultimately a primate.
Correspondence analysis portrays higher-level
similarities between data sets by generalizing
their lower-level values based on correspondence;
for example, five data points expressing the prevalence
of smoking per level of employees could be normalized
and generalized to show a similar contour in two
of those levels.)
The aggregation functions (also known as amalgamation
functions) that can be applied to the records
in each bin include:
• Sum:
Computes the sum of the values in a specified
column of the records in each bin. In addition
to the summing of raw data, this operation can
be applied to the results of analytical operations
to implement stacked generalization analyses.
(Stacked generalization is an extension of cross-validation,
and is a scheme for minimizing the error rate
in data generalization.)
• Mean:
Computes the arithmetic mean of the values in
a specified column of the records in each bin.
In addition to the averaging of raw data, this operation
can be applied to the results of analytical operations
to implement bagging or voting analyses. (Bagging
uses sampled data as a tool for predicting data
grouping.)
• Minimum:
Identifies the least of the values in a specified
column of the records in each bin.
• Maximum:
Identifies the greatest of the values in a specified
column of the records in each bin.
• Median:
Computes the median of the values in a specified
column of the records in each bin.
• Count:
Returns the number of records in each bin.
• First:
Identifies the first value encountered in a specified
column of the records in each bin.
• Last:
Identifies the last value encountered in a specified
column of the records in each bin.
• Standard Deviation: Computes the standard deviation
of the values in a specified column of the records
in each bin. (The standard deviation is a measure
of the dispersion of values in a data set and
is equal to the square root of the variance. The
variance is the average of the squares of the
amounts by which each value deviates from the
mean.)
• nth Percentile:
For any n, computes the data value corresponding
to the nth percentile of the values in a specified
column of the records in each bin. (The nth percentile
is a value below which lie n% of the data values.
So, for example, a test score in the 95th percentile
would be among the highest.)
• n% Confidence Limit: For any n, computes c, the data
value corresponding to the n% confidence limit
of the values in a specified column of the records
in each bin. (If c is the n% confidence limit
for a set of data, then there is an n% chance
that any given data value is less than c.)
• Mean Plus n Standard Deviations: For any n, computes
the sum of the arithmetic mean and n times the
standard deviation of the values in a specified
column of the records in each bin. These measurements
are useful for determining how a data set is distributed
around its mean value.
• Median Plus n Standard Deviations: For any n,
computes the sum of the median and n times the
standard deviation of the values in a specified
column of the records in each bin. Similarly to
the previous operation, this measurement is useful
for determining how a data set is distributed
around its median value.
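Most of the functions listed above can be sketched in a few lines of Python using only the standard library. This is an illustration, not the OpenViz implementation: the aggregate and percentile helpers are hypothetical names, the percentile uses the nearest-rank convention (an assumption, since the text does not say how values are interpolated), and the standard deviation follows the definition given above (variance as the average squared deviation, i.e. the population form). The confidence limit is left out because it depends on a distributional assumption the text does not state.

```python
import math
import statistics


def percentile(values, n):
    """Nearest-rank nth percentile (assumed convention)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(n / 100 * len(ordered)))
    return ordered[rank - 1]


def aggregate(values, n_sigma=1.0, pct=95):
    """Compute per-bin aggregates for one column of one bin's records."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    # Population standard deviation: square root of the average
    # squared deviation from the mean.
    sd = statistics.pstdev(values)
    return {
        "sum": sum(values),
        "mean": mean,
        "min": min(values),
        "max": max(values),
        "median": median,
        "count": len(values),
        "first": values[0],   # first value encountered in the bin
        "last": values[-1],   # last value encountered in the bin
        "std_dev": sd,
        "percentile": percentile(values, pct),
        "mean_plus_n_sd": mean + n_sigma * sd,
        "median_plus_n_sd": median + n_sigma * sd,
    }


# Hypothetical bins: bin label -> values of the chosen column, in record order.
column_by_bin = {
    "bin_a": [2, 4, 4, 4, 5, 5, 7, 9],
    "bin_b": [10, 20],
}

stats_by_bin = {label: aggregate(vals) for label, vals in column_by_bin.items()}
for label, stats in stats_by_bin.items():
    print(label, stats)
```

For bin_a the mean is 5 and the standard deviation is 2, so mean plus one standard deviation is 7 and median plus one standard deviation is 6.5.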
For data analyses based on a relational data model,
binning and aggregation functionality is implemented
within the ColumnDataToBins and BinStatistics
components, using binning schemes that are defined
using the AxisMap components.
For data analyses based on the hierarchical model,
the ColumnDataToTree and TableRollUp
components implement the aggregation functionality,
allowing you to summarize data values in child
nodes by assigning appropriate aggregated measures
to the parent. The TableRollUp component
allows for a drill-down analysis of multi-dimensional
hierarchical data sets. (Drill-down refers to
the exploration of successively more-detailed
data items; for example, of students who achieved
perfect scores, how many took the recommended
prerequisites? Of those who did, how many received
a grade of B or higher? Of those who did…
and so on.)
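The roll-up idea for hierarchical data can be sketched as follows. This is a minimal illustration under stated assumptions, not the ColumnDataToTree or TableRollUp interface: the Node class and the roll_up and drill_down functions are hypothetical, and the sample tree is invented purely for demonstration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    value: float = 0.0
    children: list = field(default_factory=list)


def roll_up(node, aggregate=sum):
    """Assign each parent the aggregate of its children's values.

    Leaves keep their own value; parents are overwritten bottom-up.
    """
    if node.children:
        node.value = aggregate(roll_up(child, aggregate) for child in node.children)
    return node.value


def drill_down(node, depth=0):
    """Return one indented line per node, from summary to detail."""
    lines = [f"{'  ' * depth}{node.name}: {node.value}"]
    for child in node.children:
        lines.extend(drill_down(child, depth + 1))
    return lines


# Invented sample hierarchy: a school, its classes, and per-student scores.
root = Node("school", children=[
    Node("class_a", children=[Node("student_1", 3.0), Node("student_2", 2.0)]),
    Node("class_b", children=[Node("student_3", 4.0)]),
])

total = roll_up(root)  # sum is the default aggregated measure
print("\n".join(drill_down(root)))
```

Passing a different function, such as max, as the aggregate argument rolls up a different measure over the same tree.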