Patent attributes
Digital content may be processed to determine a set of containers in the content. Each container may correspond to a particular text element of the digital content, such as a line of text on a page of a digital content file. Container data indicating values of base content properties for each container may be obtained. Derived content properties may be determined from the base content properties, and values of the derived content properties may be determined for each container. Multiple iterations of a clustering algorithm may be executed, where each iteration involves grouping the containers into a set of clusters by applying a particular distance function to the values of a particular set of base and/or derived properties for each container. The distance function and the set of properties utilized at each iteration may be configurable to obtain clusters that can be associated with particular semantic classifiers.
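The iterative scheme described above may be illustrated with a minimal sketch. The abstract does not specify the clustering algorithm, the properties, or the distance functions, so everything named here is an assumption made for illustration: the `Container` class, the property names `font_size`, `left_indent`, and `rel_font_size`, the threshold values, and the simple greedy threshold clustering that stands in for whatever algorithm an actual embodiment may use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

Distance = Callable[[Sequence[float], Sequence[float]], float]


@dataclass
class Container:
    """One text element (e.g. a line of text) with its property values."""
    text: str
    props: Dict[str, float]  # base properties; derived ones are added later


def add_derived(containers: List[Container]) -> None:
    """Hypothetical derived property: font size relative to the median size."""
    sizes = sorted(c.props["font_size"] for c in containers)
    median = sizes[len(sizes) // 2]
    for c in containers:
        c.props["rel_font_size"] = c.props["font_size"] / median


def manhattan(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a: Sequence[float], b: Sequence[float]) -> float:
    return max(abs(x - y) for x, y in zip(a, b))


def cluster(containers: List[Container], prop_names: Sequence[str],
            distance: Distance, threshold: float) -> List[List[Container]]:
    """Greedy clustering (an assumed stand-in): a container joins the first
    cluster whose first member is within `threshold`, else starts a new one."""
    clusters: List[List[Container]] = []
    for c in containers:
        vec = [c.props[p] for p in prop_names]
        for members in clusters:
            rep = [members[0].props[p] for p in prop_names]
            if distance(vec, rep) <= threshold:
                members.append(c)
                break
        else:
            clusters.append([c])
    return clusters


def run_passes(containers: List[Container],
               passes: Sequence[Tuple[Sequence[str], Distance, float]]
               ) -> List[List[List[Container]]]:
    """Run one clustering iteration per configured (properties, distance,
    threshold) triple and return the clusters produced by each iteration."""
    return [cluster(containers, names, dist, thr) for names, dist, thr in passes]


# Illustrative data: four lines of text with two base properties each.
containers = [
    Container("Chapter 1", {"font_size": 24.0, "left_indent": 0.0}),
    Container("It was a dark night.", {"font_size": 12.0, "left_indent": 0.0}),
    Container("The rain fell.", {"font_size": 12.0, "left_indent": 0.0}),
    Container("A quoted block", {"font_size": 12.0, "left_indent": 36.0}),
]
add_derived(containers)

# Each iteration is configured with its own property set and distance function.
passes = [
    (("rel_font_size",), manhattan, 0.1),  # may separate heading-like lines
    (("left_indent",), chebyshev, 2.0),    # may separate indented blocks
]
results = run_passes(containers, passes)

for (names, _, _), clusters in zip(passes, results):
    print(names, [[c.text for c in members] for members in clusters])
```

In this sketch the first iteration isolates the line with the larger relative font size, and the second isolates the indented line. Clusters of this kind could then be associated with semantic classifiers such as a heading or a block quotation, although the mapping from clusters to classifiers is not shown here.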