Patent attributes
A computing apparatus, including: a hardware platform including a processor circuit and a memory; and instructions encoded within the memory to instruct the processor circuit to: extract human readable text from a plurality of known websites, the known websites having known classifiers; apply a MinHash algorithm to respective human readable text of the known websites; generate a plurality of different locality sensitive hashing (LSH) indexes for the respective websites; extract human readable text from a test website; apply the MinHash algorithm to the human readable text of the test website to provide a MinHash of the test website; query the plurality of different LSH indexes with the MinHash of the test website; and according to a result of the query, assign a category to the test website, wherein the category matches a known category of at least one of the plurality of known websites found to have a containment with the test website above a threshold.
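The claimed steps can be illustrated with a minimal Python sketch. It is not the patented implementation, only one plausible reading under stated assumptions: text is tokenized into three-word shingles, the "plurality of different LSH indexes" is taken to mean several banded indexes with different band/row settings, and containment is estimated from the MinHash Jaccard estimate and the two set sizes. All names (`extract_text`, `build_indexes`, `classify`, `BAND_CONFIGS`) and the 0.5 threshold are hypothetical.

```python
import hashlib
from collections import defaultdict
from html.parser import HTMLParser

NUM_PERM = 64                               # signature length (assumed)
BAND_CONFIGS = [(64, 1), (32, 2), (16, 4)]  # (bands, rows) per index (assumed)

class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> content."""
    def __init__(self):
        super().__init__()
        self.parts, self._skip = [], 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def extract_text(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())

def shingles(text, k=3):
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash(items, num_perm=NUM_PERM):
    # One salted 64-bit hash per "permutation"; keep the minimum of each.
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(), "big")
            for s in items))
    return tuple(sig)

class LSHIndex:
    """Banded LSH: two signatures collide if any band matches exactly."""
    def __init__(self, bands, rows):
        self.bands, self.rows = bands, rows
        self.buckets = defaultdict(set)
    def _keys(self, sig):
        for b in range(self.bands):
            yield (b, sig[b * self.rows:(b + 1) * self.rows])
    def insert(self, name, sig):
        for key in self._keys(sig):
            self.buckets[key].add(name)
    def query(self, sig):
        hits = set()
        for key in self._keys(sig):
            hits |= self.buckets.get(key, set())
        return hits

def build_indexes(sites):
    """sites: {name: (html, category)} -> (list of LSH indexes, metadata)."""
    indexes = [LSHIndex(b, r) for b, r in BAND_CONFIGS]
    known = {}
    for name, (html, category) in sites.items():
        items = shingles(extract_text(html))
        sig = minhash(items)
        known[name] = (sig, len(items), category)
        for index in indexes:
            index.insert(name, sig)
    return indexes, known

def containment(sig_q, sig_x, n_q, n_x):
    # Estimate |Q ∩ X| / |Q| from the Jaccard estimate j and the set sizes.
    j = sum(a == b for a, b in zip(sig_q, sig_x)) / len(sig_q)
    return min(1.0, j * (n_q + n_x) / ((1 + j) * n_q))

def classify(html, indexes, known, threshold=0.5):
    """Return the category of the best candidate above threshold, else None."""
    items = shingles(extract_text(html))
    sig = minhash(items)
    candidates = set().union(*(index.query(sig) for index in indexes))
    best_category, best_score = None, 0.0
    for name in candidates:
        sig_x, n_x, category = known[name]
        score = containment(sig, sig_x, len(items), n_x)
        if score >= threshold and score > best_score:
            best_category, best_score = category, score
    return best_category
```

In this sketch, a test page whose text duplicates a known page receives that page's category, while a page sharing no shingles with any known page yields no candidates and is left uncategorized. A production system would more likely use an established library and a size-partitioned index scheme designed for containment queries.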