Patent attributes
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for classifying documents. One of the methods includes obtaining a collection of training documents, the training documents including positive documents identified as being longform documents and negative documents identified as not being longform documents; extracting one or more features from the training documents, wherein the features represent lexical or textual content of the training documents; and generating a longform document classifier trained using feature instances extracted from the training documents, wherein the generated longform document classifier is trained such that input documents are classified as being longform documents or classified as not being longform documents.
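
The three steps named above (obtaining labeled training documents, extracting lexical or textual features, and generating a trained classifier) can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and does not come from the abstract: the three features (log word count, mean words per sentence, type-token ratio), the nearest-centroid learning rule, the function names `extract_features` and `train_longform_classifier`, and the toy training documents. The abstract does not name a particular feature set or learning algorithm, so any of these could be swapped for another.

```python
import math
import re


def extract_features(text):
    """Map a document to lexical/textual features (illustrative choices only)."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words)
    return [
        math.log1p(n_words),                 # overall length, log-scaled
        n_words / max(len(sentences), 1),    # mean words per sentence
        len(set(words)) / max(n_words, 1),   # type-token ratio
    ]


def _centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def train_longform_classifier(positive_docs, negative_docs):
    """Return a function that classifies text as longform (True) or not (False).

    Assumption: a nearest-centroid rule stands in for whatever learning
    algorithm an actual implementation would use.
    """
    pos = _centroid([extract_features(d) for d in positive_docs])
    neg = _centroid([extract_features(d) for d in negative_docs])

    def classify(text):
        x = extract_features(text)
        return math.dist(x, pos) < math.dist(x, neg)

    return classify


# Hypothetical toy training collection: positives are long, negatives are short.
positive = [
    "The committee reviewed the proposal in detail and recorded every objection raised. " * 40,
    "Researchers followed the river for months while documenting how each village adapted to the floods. " * 50,
]
negative = [
    "Sale ends today. Buy now!",
    "Meeting moved to noon.",
]

classify = train_longform_classifier(positive, negative)
```

Calling `classify(document_text)` on an input document then returns `True` when its features lie closer to the longform centroid than to the non-longform one. The features here are left unscaled for brevity; a fuller implementation would likely normalize them so that no single feature dominates the distance.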