Patent attributes
Methods, systems and computer program products for clustering pages into headline clusters are provided by collecting web data, identifying pages from the web data, tokenizing unique words in each page, recognizing unique entities in each page, detecting media links in each page, and constructing a plurality of vector representations of each page. A first dimension of each vector representation includes the unique words tokenized in each page, a second dimension of each vector representation includes the unique entities recognized in each page, and a third dimension of each vector representation includes the media links detected in each page. The vector representations are, in turn, clustered.