Patent attributes
Information contained in tables in a digitized document is extracted by retrieving table layout data describing bounding boxes, each of which is auto-generated by the system and/or (re)generated by a user on the digitized image of a sample document. A row template is used to identify a first table by automatically scanning the document. Upon detecting a possible row in the input image, a Row Possibility Confidence Value (RPCV) is generated that indicates the likelihood that the possible row corresponds to an actual row in the first table. The possible row is regarded as an actual row if the RPCV exceeds a predetermined threshold value. For repeated tables in a document, only the first table needs to be identified via bounding boxes. Related tables can also be linked so that linked data can be extracted to a structured file. In addition, only the primary column of a table header that exists and is readable is required to extract table values across columns.
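The thresholding step described above can be illustrated with a minimal Python sketch. The abstract does not say how the RPCV is computed, so the scoring rule here (the fraction of row-template columns that contain at least one detected token) is purely an assumption chosen for illustration. The names `Column`, `Token`, `rpcv`, `is_actual_row`, and the default threshold of 0.6 are likewise hypothetical and do not come from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Horizontal extent of one column bounding box in the row template (hypothetical)."""
    name: str
    x_min: float
    x_max: float


@dataclass(frozen=True)
class Token:
    """A piece of recognized text found on a candidate row (hypothetical)."""
    text: str
    x_min: float
    x_max: float


def rpcv(template: list[Column], tokens: list[Token]) -> float:
    """Assumed scoring rule: fraction of template columns that contain
    the horizontal center of at least one token."""
    if not template:
        return 0.0
    hits = 0
    for col in template:
        if any(col.x_min <= (t.x_min + t.x_max) / 2 <= col.x_max for t in tokens):
            hits += 1
    return hits / len(template)


def is_actual_row(template: list[Column], tokens: list[Token],
                  threshold: float = 0.6) -> bool:
    """A possible row is accepted only if its RPCV exceeds the threshold."""
    return rpcv(template, tokens) > threshold


# Illustrative template with four columns (made-up coordinates).
TEMPLATE = [
    Column("item", 0, 50),
    Column("qty", 50, 100),
    Column("price", 100, 150),
    Column("total", 150, 200),
]

# Candidate row with text in three of the four columns: RPCV = 0.75.
GOOD_ROW = [Token("Widget", 10, 30), Token("2", 60, 70), Token("9.99", 160, 180)]

# Candidate row with text in only one column: RPCV = 0.25.
WEAK_ROW = [Token("Page 3", 10, 40)]

if __name__ == "__main__":
    print(rpcv(TEMPLATE, GOOD_ROW), is_actual_row(TEMPLATE, GOOD_ROW))
    print(rpcv(TEMPLATE, WEAK_ROW), is_actual_row(TEMPLATE, WEAK_ROW))
```

With the default threshold, `GOOD_ROW` is accepted and `WEAK_ROW` is rejected. The comparison is strict, matching the wording "exceeds a predetermined threshold value", so a row whose score equals the threshold is not treated as an actual row.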