Patent 11769072 was granted and assigned to Adobe Inc. on September, 2023 by the United States Patent and Trademark Office.
The structure of an untagged document can be derived using a predictive model that is trained in a supervised learning framework based on a corpus of tagged training documents. Analyzing the training documents results in a plurality of document part feature vectors, each of which correlates a category defining a document part (for example, “title” or “body paragraph”) with one or more feature-value pairs (for example, “font=Arial” or “alignment=centered”). Any suitable machine learning algorithm can be used to train the predictive model based on the document part feature vectors extracted from the training documents. Once the predictive model has been trained, it can receive feature-value pairs corresponding to a portion of an untagged document and make predictions with respect to the how that document part should be categorized. The predictive model can therefore generate tag metadata that defines a structure of the untagged document in an automated fashion.