Patent attributes
A system analyzes distributed data and groups variables in support of analytics. Policy parameter values that define thresholds are received. Each worker computing device is requested to perform a first computation, for each variable of a plurality of variables included in the distributed data, of a cardinality value and of a number of observation vectors having a non-missing value. In response to the first computation request, each worker computing device computes both values for each variable from the subset of the input dataset distributed to that worker computing device, reading each observation vector in the subset only once. Each variable is then assigned a category based on a comparison between the computed values and the policy parameter values.
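
The flow described above can be illustrated with a minimal Python sketch. It is not the claimed implementation. The function names (`worker_pass`, `merge_stats`, `assign_category`), the policy keys (`max_missing_rate`, `max_cardinality`), and the category labels are all hypothetical. The abstract does not state how per-worker results are combined, so the sketch assumes each worker returns its exact set of distinct values and a controller takes the union.

```python
from typing import Any, Dict, Iterable, List, Optional

Row = Dict[str, Optional[Any]]  # one observation vector; None marks a missing value


def worker_pass(rows: Iterable[Row], variables: List[str]) -> Dict[str, dict]:
    """Read each observation vector once and track, per variable,
    the non-missing count and the distinct values seen on this worker."""
    stats = {v: {"n": 0, "values": set()} for v in variables}
    for row in rows:  # single pass over the worker's subset
        for v in variables:
            x = row.get(v)
            if x is not None:
                stats[v]["n"] += 1
                stats[v]["values"].add(x)
    return stats


def merge_stats(worker_stats: List[Dict[str, dict]], variables: List[str]) -> Dict[str, dict]:
    """Combine per-worker results (assumed step): sum the counts, union the value sets."""
    merged = {}
    for v in variables:
        n = sum(s[v]["n"] for s in worker_stats)
        distinct = set().union(*(s[v]["values"] for s in worker_stats))
        merged[v] = {"n": n, "cardinality": len(distinct)}
    return merged


def assign_category(n: int, cardinality: int, total: int, policy: Dict[str, float]) -> str:
    """Compare computed values with hypothetical policy thresholds."""
    missing_rate = 1.0 - n / total if total else 1.0
    if missing_rate > policy["max_missing_rate"]:
        return "high_missing"
    if cardinality > policy["max_cardinality"]:
        return "high_cardinality"
    return "ok"


if __name__ == "__main__":
    variables = ["a", "b"]
    subset_1 = [{"a": 1, "b": None}, {"a": 2, "b": "x"}]
    subset_2 = [{"a": 1, "b": None}, {"a": 3, "b": None}]
    merged = merge_stats(
        [worker_pass(iter(subset_1), variables), worker_pass(iter(subset_2), variables)],
        variables,
    )
    policy = {"max_missing_rate": 0.5, "max_cardinality": 2}
    for v in variables:
        print(v, merged[v], assign_category(merged[v]["n"], merged[v]["cardinality"], 4, policy))
```

Shipping exact value sets keeps the example simple, but their size grows with cardinality. A production system might bound the set size or use an approximate distinct-count sketch instead, which is a design choice outside what the abstract specifies.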