Patent attributes
Methods, devices, and non-transitory computer-readable storage media for extracting text from documents are disclosed. The method includes performing layout analysis on the document to identify a plurality of regions within a plurality of pages in the document. The method further includes identifying a table region from within the plurality of regions based on homogeneity between a plurality of textual lines in a page from the plurality of pages. The method includes identifying at least two rows and at least two columns within the table region. The method further includes identifying a plurality of cells within the table region based on the at least two rows and the at least two columns. The method includes extracting text from each of the plurality of cells.
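
The row, column, and cell steps can be illustrated with a minimal sketch. It is not the claimed method: it assumes that layout analysis and the homogeneity test have already isolated a candidate table region, and that the words in that region arrive with bounding coordinates. The `Word` type, the function names, and the `y_tol` and `min_gap` thresholds are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """Hypothetical input: one word inside a candidate table region."""
    text: str
    x0: float  # left edge
    x1: float  # right edge
    y: float   # vertical centre


def group_rows(words, y_tol=3.0):
    """Cluster words into rows: a word joins the current row when its
    vertical centre is within y_tol of the previous word's centre."""
    rows = []
    for w in sorted(words, key=lambda w: w.y):
        if rows and abs(rows[-1][-1].y - w.y) <= y_tol:
            rows[-1].append(w)
        else:
            rows.append([w])
    return rows


def find_columns(words, min_gap=10.0):
    """Merge horizontal word extents across all rows; a horizontal gap
    wider than min_gap is treated as a column boundary."""
    spans = []
    for w in sorted(words, key=lambda w: w.x0):
        if spans and w.x0 - spans[-1][1] <= min_gap:
            spans[-1][1] = max(spans[-1][1], w.x1)
        else:
            spans.append([w.x0, w.x1])
    return [tuple(s) for s in spans]


def extract_cells(words, y_tol=3.0, min_gap=10.0):
    """Return a row-major grid of cell text, or None when fewer than
    two rows or two columns are found (assumed rule for this sketch)."""
    rows = group_rows(words, y_tol)
    cols = find_columns(words, min_gap)
    if len(rows) < 2 or len(cols) < 2:
        return None
    grid = []
    for row in rows:
        cells = [[] for _ in cols]
        for w in sorted(row, key=lambda w: w.x0):
            centre = (w.x0 + w.x1) / 2
            # Assign the word to the column whose span contains its centre.
            for i, (c0, c1) in enumerate(cols):
                if c0 <= centre <= c1:
                    cells[i].append(w.text)
                    break
        grid.append([" ".join(c) for c in cells])
    return grid


sample = [
    Word("Name", 0, 30, 10), Word("Qty", 100, 120, 10),
    Word("Red", 0, 20, 30), Word("apple", 24, 50, 30), Word("3", 100, 106, 30),
]
table = extract_cells(sample)
```

With the sample words, `table` holds two rows of two cells each, and the two words "Red" and "apple" land in the same cell because the gap between them is below `min_gap`. A projection-gap heuristic like this breaks down for cells that span several columns or for skewed scans, which is one reason the sketch should not be read as the disclosed technique.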