Patent attributes
Methods, systems and computer program products for automatically generating structured training data based on an unstructured document are provided. Aspects include receiving an unstructured document and a corresponding structured document that includes labeled portions. Aspects also include generating a parsed document that has one or more extracted objects, each having a bounding box, by applying a parsing tool to the unstructured document. Aspects also include identifying one or more matching extracted objects by applying a matching algorithm to the structured document and the parsed document. Each matching extracted object is an extracted object of the parsed document that corresponds to a labeled portion of the structured document. Aspects also include, for each matching extracted object, annotating a region of the unstructured document that corresponds to the bounding box of the respective matching extracted object with a respective label of the corresponding labeled portion of the structured document.
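By way of non-limiting illustration, the matching and annotating aspects may be sketched as follows. The sketch assumes the parsing tool has already produced extracted objects, each with text and a bounding box, and that the structured document can be reduced to a mapping from label to value. The names `ExtractedObject`, `Annotation`, `similarity`, and `annotate`, the 0.85 threshold, and the sample invoice values are hypothetical. The abstract does not specify a matching algorithm; fuzzy string similarity from Python's standard `difflib` module is used here only as a stand-in.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class ExtractedObject:
    """One object emitted by a parsing tool (hypothetical shape)."""
    text: str
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates


@dataclass(frozen=True)
class Annotation:
    """A labeled region of the unstructured document."""
    label: str
    bbox: tuple
    text: str


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1] (assumed matching metric)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def annotate(labeled_portions, extracted, threshold=0.85):
    """Pair each labeled value with its best-matching extracted object.

    A labeled portion with no extracted object scoring at or above
    `threshold` is skipped rather than annotated.
    """
    annotations = []
    for label, value in labeled_portions.items():
        best = max(extracted, key=lambda obj: similarity(value, obj.text), default=None)
        if best is not None and similarity(value, best.text) >= threshold:
            # The annotated region is the bounding box of the matching object.
            annotations.append(Annotation(label, best.bbox, best.text))
    return annotations


# Structured document: label -> value (sample data, invented for illustration).
structured = {"invoice_number": "INV-1042", "total": "$318.50"}

# Parsed document: extracted objects with bounding boxes (sample data).
parsed = [
    ExtractedObject("Invoice", (40, 30, 120, 48)),
    ExtractedObject("INV-1042", (400, 30, 480, 48)),
    ExtractedObject("Total due", (40, 700, 110, 716)),
    ExtractedObject("$318.50", (400, 700, 470, 716)),
]

result = annotate(structured, parsed)
for ann in result:
    print(ann.label, ann.bbox, ann.text)
```

Each resulting annotation pairs a label from the structured document with a region of the unstructured document, which is the form of structured training data the abstract describes. A different matching algorithm could be substituted for `similarity` without changing the rest of the sketch.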