Patent attributes
Methods, systems, and computer-readable storage media for providing weighted vector representations of documents, with actions including: receiving text data, the text data including a plurality of documents, each document including a plurality of words; processing the text data to provide a plurality of word-vectors, each word-vector being based on a respective word of the plurality of words; determining a plurality of similarity scores based on the plurality of word-vectors, each similarity score representing a degree of similarity between word-vectors; grouping words of the plurality of words into clusters based on the plurality of similarity scores, each cluster including two or more words of the plurality of words; and providing a document representation for each document in the plurality of documents, each document representation including a feature vector, each feature corresponding to a cluster.
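
The sequence of actions recited above can be illustrated with a minimal sketch. It is not the claimed implementation. It assumes hand-assigned two-dimensional word-vectors (WORD_VECTORS), cosine similarity as the similarity score, a fixed threshold with single-link grouping as the clustering rule, and raw per-cluster word counts as the feature weights. None of these choices is specified by the abstract, and the names cosine, cluster_words, and document_features are hypothetical.

```python
import math
from itertools import combinations

# Hypothetical toy word-vectors; a real system would derive these from the text data.
WORD_VECTORS = {
    "cat": (0.9, 0.1),
    "dog": (0.8, 0.2),
    "car": (0.1, 0.9),
    "truck": (0.2, 0.8),
    "idea": (-0.7, 0.7),
}


def cosine(u, v):
    """Similarity score between two word-vectors (assumed: cosine similarity)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def cluster_words(vectors, threshold=0.95):
    """Group words whose pairwise similarity meets the threshold (single-link).

    Only clusters with two or more words are kept, mirroring the abstract.
    """
    parent = {w: w for w in vectors}

    def find(w):
        # Follow parent links to the cluster root, compressing the path on the way.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for a, b in combinations(vectors, 2):
        if cosine(vectors[a], vectors[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for w in vectors:
        groups.setdefault(find(w), []).append(w)
    return sorted(sorted(g) for g in groups.values() if len(g) >= 2)


def document_features(doc, clusters):
    """Feature vector with one entry per cluster (assumed weight: word count)."""
    words = doc.lower().split()
    return [sum(words.count(w) for w in cluster) for cluster in clusters]


clusters = cluster_words(WORD_VECTORS)
documents = ["the cat chased the dog", "a truck hit a car and a truck"]
features = [document_features(d, clusters) for d in documents]
print(clusters)  # [['car', 'truck'], ['cat', 'dog']]
print(features)  # [[0, 2], [3, 0]]
```

In this sketch "idea" is not similar enough to any other word, so it forms no cluster and contributes no feature. Each document is therefore described by two features, one per cluster, rather than by one feature per distinct word.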