Patent attributes
An embodiment of the present invention provides a computer-based method for automatically identifying clusters of conceptually related documents in a collection of documents, including the following steps: generating a document-representation of each document in an abstract mathematical space; identifying a plurality of document clusters in the collection of documents based on a conceptual similarity between respective pairs of the document-representations, wherein each document cluster is associated with an exemplary document and a plurality of other documents; and identifying a non-intersecting document cluster from among the plurality of document clusters based on (i) a conceptual similarity between the document-representation of the exemplary document and the document-representation of each document in the non-intersecting document cluster and (ii) a conceptual dissimilarity between a cluster-representation of the non-intersecting document cluster and a cluster-representation of each other document cluster. Variants of the method enable creating a hierarchy of clusters and performing incremental updates of preexisting hierarchical structures.
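The steps recited above can be illustrated with a minimal sketch. It is not the claimed implementation, and several choices are assumptions made only for illustration: documents are already given as plain numeric vectors, conceptual similarity is approximated by cosine similarity, a cluster-representation is taken to be the centroid of its members, and the thresholds `min_sim` and `max_cluster_sim` as well as all function names are hypothetical. The explicit check for shared members is an added safeguard, not something the abstract states.

```python
import math


def cosine(u, v):
    """Cosine similarity, used here as a stand-in for conceptual similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Assumed cluster-representation: the component-wise mean of the members."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def candidate_clusters(vectors, min_sim):
    """Treat every document as a possible exemplar and collect the documents
    whose similarity to it is at least min_sim (criterion i)."""
    clusters = []
    for seed, seed_vec in enumerate(vectors):
        members = [i for i, v in enumerate(vectors) if cosine(seed_vec, v) >= min_sim]
        if len(members) > 1:  # an exemplar plus at least one other document
            clusters.append((seed, members))
    return clusters


def non_intersecting_clusters(vectors, min_sim=0.9, max_cluster_sim=0.5):
    """Greedily keep candidates, largest first, whose centroid is dissimilar to
    every kept centroid (criterion ii) and which share no document with them."""
    candidates = sorted(candidate_clusters(vectors, min_sim), key=lambda c: -len(c[1]))
    kept, assigned = [], set()
    for seed, members in candidates:
        if assigned.intersection(members):
            continue  # assumed safeguard: clusters must not share documents
        rep = centroid([vectors[i] for i in members])
        if all(cosine(rep, other) <= max_cluster_sim for _, _, other in kept):
            kept.append((seed, members, rep))
            assigned.update(members)
    return [(seed, members) for seed, members, _ in kept]


# Toy document-representations: three documents near one axis, two near the other.
docs = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0], [0.1, 0.9]]
print(non_intersecting_clusters(docs))
# prints [(0, [0, 1, 2]), (3, [3, 4])]
```

Each returned pair holds the index of the exemplary document and the indices of the documents in its cluster. The greedy, largest-first ordering is one simple way to pick among overlapping candidates; the abstract does not specify an ordering, and the hierarchical and incremental variants it mentions are not covered by this sketch.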