Patent attributes
The disclosed embodiments relate to a tax-information assembly technique. The technique extracts tax information and associated context information from income-tax documents that are associated with an income-tax agency, where some of the income-tax documents include the same tax information in different document formats. Semantic and structural heuristics are first used to identify tax phrases in the extracted tax information. Additional tax phrases in the extracted tax information are then identified using a statistical identification technique. Next, relationships between the tax phrases and the additional tax phrases are determined, and the context information is used to consolidate the tax phrases and the additional tax phrases into a tax-information data structure.
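The sequence of operations described above can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the sample records, the seed patterns standing in for the semantic and structural heuristics, the bigram-frequency count standing in for the statistical identification technique, the co-occurrence rule standing in for relationship determination, and all function names are assumptions chosen only to make the flow concrete.

```python
import re
from collections import Counter

# Hypothetical extracted records: (tax information, context information).
# Form F1, line 7 appears twice because it was extracted from two formats.
RECORDS = [
    ("Enter your adjusted gross income", {"form": "F1", "line": "7", "format": "pdf"}),
    ("Enter your adjusted gross income", {"form": "F1", "line": "7", "format": "html"}),
    ("Subtract standard deduction from adjusted gross income",
     {"form": "F1", "line": "9", "format": "pdf"}),
    ("Standard deduction for dependent child", {"form": "F2", "line": "3", "format": "pdf"}),
    ("Credit for dependent child", {"form": "F2", "line": "5", "format": "html"}),
]

# Assumed stand-ins for the heuristics: a small list of seed patterns.
SEED_PATTERNS = [r"adjusted gross income", r"standard deduction"]
STOPWORDS = {"enter", "your", "from", "for", "subtract"}


def heuristic_phrases(text):
    """Return the seed phrases that match the text."""
    low = text.lower()
    return {m.group(0) for p in SEED_PATTERNS for m in re.finditer(p, low)}


def statistical_phrases(records, known, min_count=2):
    """Return frequent stopword-free bigrams not already covered by known phrases."""
    counts = Counter()
    for text, _ in records:
        words = re.findall(r"[a-z]+", text.lower())
        bigrams = {
            f"{a} {b}"
            for a, b in zip(words, words[1:])
            if a not in STOPWORDS and b not in STOPWORDS
        }
        counts.update(bigrams)
    return {
        phrase
        for phrase, n in counts.items()
        if n >= min_count and not any(phrase in k for k in known)
    }


def assemble(records):
    """Consolidate phrases into one structure keyed by phrase."""
    known = set()
    for text, _ in records:
        known |= heuristic_phrases(text)
    phrases = known | statistical_phrases(records, known)

    structure = {
        p: {"locations": set(), "formats": set(), "related": set()} for p in phrases
    }
    for text, ctx in records:
        low = text.lower()
        present = [p for p in phrases if p in low]
        for p in present:
            # Context information merges duplicates: the same (form, line)
            # seen in several formats yields a single location entry.
            structure[p]["locations"].add((ctx["form"], ctx["line"]))
            structure[p]["formats"].add(ctx["format"])
            # Assumed relationship rule: phrases in the same record are related.
            structure[p]["related"].update(q for q in present if q != p)
    return structure


if __name__ == "__main__":
    for phrase, entry in sorted(assemble(RECORDS).items()):
        print(phrase, entry)
```

In this sketch the phrase "dependent child" is not matched by any seed pattern and is picked up only by the frequency count, while the two copies of form F1, line 7 collapse into a single location with both formats recorded.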