Patent attributes
In an example embodiment, a fuzzy join operation is performed by, for each pair of records, evaluating a first plurality of features for both records in the pair of records by calculating term frequency-inverse term frequency (TF-IDF) for each token of each field relevant to each feature and based on the calculated TF-IDF for each token of each field relevant to each feature, computing a similarity score based on the similarity function by adding a weight assigned to the TF-IDF for any token that appears in both records. Then a graph data structure is created, having a node for each record in the plurality of records and edges between each of the nodes, except, for each record pair having a similarity score that does not transgress a first threshold, causing no edge between the nodes for the record pair to appear in the graph data structure.