Patent attributes
Systems and methods are set forth for identifying key-words and key-phrases, collectively referred to as key-terms, in a document. A document is accessed and tokenized, each token corresponding to a word or phrase occurring within the document. Term frequencies of the terms of the tokens may be determined, and TF-IDF scores may be generated according to the term frequencies. Embedding vectors may be generated for the terms of the tokens, and a document embedding vector may be generated according to the embedding vectors of the tokens. A similarity score may be determined for each token according to the embedding vector of the token and the document embedding vector. Additionally, an overall score may be determined for each token according to the term of the token, its TF-IDF score, its similarity score, and the like. Terms from the highest-scoring tokens are selected as the key-terms for the document.
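The steps described above can be illustrated with a minimal Python sketch. It is not the claimed implementation, only one possible reading under stated assumptions: tokens are single lowercase words (phrases are omitted), the document embedding vector is taken as the mean of the token embedding vectors, similarity is cosine similarity, the IDF is a smoothed variant, and the overall score is a weighted sum controlled by a hypothetical `alpha` parameter. The function names, the toy `embeddings` table, and the reference `corpus` are all illustrative and do not come from the source.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (assumption: words only, no phrases)."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(doc_tokens, corpus_token_sets):
    """Relative term frequency times a smoothed inverse document frequency."""
    counts = Counter(doc_tokens)
    n_docs = len(corpus_token_sets)
    scores = {}
    for term, count in counts.items():
        df = sum(1 for tokens in corpus_token_sets if term in tokens)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # assumed smoothing
        scores[term] = (count / len(doc_tokens)) * idf
    return scores


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_vector(vectors):
    """Component-wise mean, used here as the document embedding (an assumption)."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def key_terms(doc, corpus, embeddings, top_k=3, alpha=0.5):
    """Rank terms by alpha * TF-IDF + (1 - alpha) * similarity to the document vector."""
    # Keep only tokens that have an embedding in the toy lookup table.
    tokens = [t for t in tokenize(doc) if t in embeddings]
    if not tokens:
        return []
    tfidf = tf_idf(tokens, [set(tokenize(d)) for d in corpus])
    doc_vec = mean_vector([embeddings[t] for t in tokens])
    overall = {
        term: alpha * score + (1 - alpha) * cosine(embeddings[term], doc_vec)
        for term, score in tfidf.items()
    }
    return sorted(overall, key=overall.get, reverse=True)[:top_k]


# Toy data, invented purely for illustration.
toy_embeddings = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "the": [0.0, 1.0]}
toy_doc = "the cat cat dog"
toy_corpus = [toy_doc, "the bird", "the fish"]

print(key_terms(toy_doc, toy_corpus, toy_embeddings, top_k=2))
```

In this sketch the TF-IDF score and the cosine similarity are on different scales, so a real system would likely normalize them before combining; the weighted sum is only one way to fold several signals into an overall score.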