Patent 8041662 was granted by the United States Patent and Trademark Office and assigned to Microsoft in October 2011.
Character-based n-grams are derived from a domain name in order to classify that domain name into pre-established categories. The character-based n-grams of a domain name are mapped to a vector point in a multidimensional space, where the number of dimensions equals the number of distinct n-grams that can exist for an n-character combination. The relationship between the domain name's vector point and the vector points of other domain names is used to classify it. The classification system can use statistical methods that treat the relative frequencies of character-based n-grams in the various classifications as indicators. A dictionary of character-based n-grams can be derived from one or more domain names, with each n-gram associated with a probability indicating how likely it is to be found in a domain name of a given classification. That probability can then serve as an estimator of the classification of a new domain name containing the same character-based n-gram.
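The statistical approach described above can be illustrated with a minimal sketch. It is not the patented implementation: the choice of trigrams (n = 3), the add-one smoothing, the naive-Bayes-style log scoring, the function names (char_ngrams, build_dictionary, classify), and the toy training data are all assumptions made here for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(domain, n=3):
    """Return the overlapping character n-grams of a domain name."""
    name = domain.lower()
    return [name[i:i + n] for i in range(len(name) - n + 1)]


def build_dictionary(labeled_domains, n=3):
    """Count n-gram occurrences per category from (domain, category) pairs."""
    counts = defaultdict(Counter)
    for domain, category in labeled_domains:
        counts[category].update(char_ngrams(domain, n))
    return counts


def classify(counts, domain, n=3):
    """Pick the category whose n-gram relative frequencies best fit the domain.

    Each n-gram contributes the log of its smoothed relative frequency
    within the category (add-one smoothing is an assumption of this sketch).
    """
    vocabulary = set().union(*counts.values())
    best_category, best_score = None, -math.inf
    for category, grams in counts.items():
        total = sum(grams.values())
        score = sum(
            math.log((grams[g] + 1) / (total + len(vocabulary)))
            for g in char_ngrams(domain, n)
        )
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical training set; real categories and domains would differ.
TRAINING = [
    ("cheapshoes.com", "shopping"),
    ("shoestore.com", "shopping"),
    ("dailynews.com", "news"),
    ("worldnews.org", "news"),
]

dictionary = build_dictionary(TRAINING)
print(classify(dictionary, "newsdaily.net"))
```

Here the dictionary plays the role of the n-gram dictionary in the abstract, and the per-category relative frequency stands in for the probability associated with each n-gram. A vector-space variant would instead compare the n-gram count vector of a new domain against those of known domains.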