Patent attributes
Systems and methods are described for determining clusters in large collections of content items. A fast cluster-identifying algorithm, such as a mean shift algorithm, can be used to find high-density areas of a feature space where certain less interesting content items might be clustered. Once these high-density clusters are located, a system can remove them and proceed to analyze the remaining data. Removing these clusters of featureless content items can greatly reduce the size of the collection and also enhance its overall quality. Labels can then be applied to clusters and, when a content item is received, classification algorithms can be used to assign an appropriate label to that content item.
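The flow described above can be illustrated with a minimal sketch. It is not the claimed implementation: it assumes two-dimensional feature vectors, a flat-kernel mean shift, and a hypothetical rule that treats any cluster holding more than half of the collection as a high-density cluster to remove. A nearest-centre lookup stands in for the unspecified classification algorithm. The names `blob`, `BANDWIDTH`, `MAX_SHARE`, and the `cluster_N` labels are illustrative only.

```python
import math


def blob(center, n, radius):
    """Deterministic toy data: n points on a small spiral around center."""
    cx, cy = center
    return [(cx + radius * (i + 1) / n * math.cos(i),
             cy + radius * (i + 1) / n * math.sin(i)) for i in range(n)]


def mean_shift_modes(points, bandwidth, max_iter=50, tol=1e-6):
    """Flat-kernel mean shift: move each point to the mean of its neighbours
    within `bandwidth` until it stops moving; return one mode per point."""
    modes = []
    for p in points:
        x = p
        for _ in range(max_iter):
            neighbours = [q for q in points if math.dist(x, q) <= bandwidth]
            if not neighbours:
                break
            new_x = tuple(sum(c) / len(neighbours) for c in zip(*neighbours))
            moved = math.dist(new_x, x)
            x = new_x
            if moved < tol:
                break
        modes.append(x)
    return modes


def group_by_mode(points, modes, merge_dist):
    """Points whose modes lie within merge_dist of each other share a cluster."""
    clusters = []  # list of (centre, members)
    for p, m in zip(points, modes):
        for centre, members in clusters:
            if math.dist(centre, m) <= merge_dist:
                members.append(p)
                break
        else:
            clusters.append((m, [p]))
    return clusters


# Toy collection: one large, tight group of "featureless" items
# and two smaller groups of interest.
points = (blob((0.0, 0.0), 60, 0.2)
          + blob((10.0, 0.0), 8, 0.5)
          + blob((0.0, 10.0), 8, 0.5))

BANDWIDTH = 2.0   # assumed kernel radius
MAX_SHARE = 0.5   # assumed cutoff: larger clusters are removed

modes = mean_shift_modes(points, BANDWIDTH)
clusters = group_by_mode(points, modes, merge_dist=BANDWIDTH / 2)

# Remove the high-density clusters and keep the rest for further analysis.
kept = [(centre, members) for centre, members in clusters
        if len(members) / len(points) <= MAX_SHARE]
remaining = [p for _, members in kept for p in members]

# Apply a (placeholder) label to each remaining cluster.
labels = {f"cluster_{i}": centre for i, (centre, _) in enumerate(kept)}


def classify(item):
    """Assign the label of the nearest cluster centre to a new item."""
    return min(labels, key=lambda name: math.dist(labels[name], item))
```

In this toy setup the large group near the origin is discarded and a new item such as `(9.5, 0.3)` receives the label of the cluster centred near `(10, 0)`. In practice the bandwidth and the removal criterion would have to be tuned to the feature space at hand.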