Patent attributes
Described herein is an approach for automatically determining the semantic relatedness of documents to semantic concepts. A first text mining analysis extracts a set of reference concepts from reference documents. A second text mining analysis extracts a set of test concepts from test documents that include a mixture of new concepts and reference concepts. An extended co-occurrence matrix is computed that indicates a frequency of co-occurrence (RCCF) of each new and each reference concept in the test documents with all other new and reference concepts. The extended co-occurrence matrix is used for computing a new concept relatedness score (NCRS) for the new concepts. A document similarity score (DSS) is computed for each of the test documents by aggregating, inter alia, the NCRS of each new concept with the RCCF of each reference concept. The DSS represents the semantic relatedness of the test document to the totality of the reference concepts.
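The pipeline described above can be illustrated with a minimal sketch. The abstract does not give formulas for RCCF, NCRS, or DSS, so every formula below is an assumption made for illustration only: co-occurrence is counted per document, RCCF is taken as a reference concept's total co-occurrence count divided by the number of test documents, NCRS as the share of a new concept's co-occurrences that involve reference concepts, and DSS as the mean of these per-concept scores within a document. The function names (`cooccurrence`, `rccf`, `ncrs`, `dss`) are hypothetical, and concept extraction by text mining is assumed to have already happened.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(docs):
    """Count, for each unordered concept pair, the documents containing both."""
    counts = Counter()
    for concepts in docs:
        for a, b in combinations(sorted(set(concepts)), 2):
            counts[(a, b)] += 1
    return counts


def pair_count(counts, a, b):
    """Look up the symmetric co-occurrence count of two concepts."""
    return counts[tuple(sorted((a, b)))]


def rccf(counts, ref_concept, vocabulary, n_docs):
    """Assumed RCCF: co-occurrences with all other concepts, per test document."""
    total = sum(pair_count(counts, ref_concept, c)
                for c in vocabulary if c != ref_concept)
    return total / n_docs


def ncrs(counts, new_concept, reference, vocabulary):
    """Assumed NCRS: share of a new concept's co-occurrences involving reference concepts."""
    total = sum(pair_count(counts, new_concept, c)
                for c in vocabulary if c != new_concept)
    with_ref = sum(pair_count(counts, new_concept, r) for r in reference)
    return with_ref / total if total else 0.0


def dss(doc, counts, reference, vocabulary, n_docs):
    """Assumed DSS: mean of RCCF (reference concepts) and NCRS (new concepts) in the document."""
    concepts = set(doc)
    if not concepts:
        return 0.0
    score = 0.0
    for c in concepts:
        if c in reference:
            score += rccf(counts, c, vocabulary, n_docs)
        else:
            score += ncrs(counts, c, reference, vocabulary)
    return score / len(concepts)


if __name__ == "__main__":
    # Toy data: concepts already extracted from each test document.
    reference = {"battery", "anode"}
    test_docs = [
        ["battery", "anode", "graphene"],
        ["battery", "graphene"],
        ["graphene", "soccer"],
        ["soccer", "stadium"],
    ]
    counts = cooccurrence(test_docs)
    vocabulary = {c for d in test_docs for c in d}
    for d in test_docs:
        print(d, round(dss(d, counts, reference, vocabulary, len(test_docs)), 3))
```

In this toy data, a new concept such as "graphene" that mostly co-occurs with reference concepts receives a high NCRS, so documents containing it score as related even when they mention few reference concepts directly. A document containing only new concepts that never co-occur with reference concepts scores zero.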