Patent attributes
A method and an apparatus for screening TB-scale incremental data. In the present application, the raw data is divided into a plurality of raw data blocks according to the memory capacity of the device, and the data is cleaned. A single-block index sorting algorithm is adopted to de-duplicate and sort the records within each data block without a dropping operation; once this is complete, the processed data blocks and a matrix hash index table are respectively generated and saved as the initial data. For subsequent incremental data, an inter-block index sorting algorithm is adopted: the processed data blocks and the matrix hash index table are loaded in turn, the data is preliminarily screened on the basis of the matrix hash index table, and an incremental binary search algorithm is used for fine screening. In this way, the indexing and de-duplication screening of all data are completed.
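
The two-stage screening described above can be illustrated with a minimal Python sketch. It is not the claimed implementation and rests on several assumptions that the abstract does not state: records are strings, cleaning means stripping whitespace and discarding empty records, the matrix hash index table is modeled as one small two-dimensional boolean matrix per block addressed by two bytes of a SHA-256 digest, and everything is held in memory rather than loaded from storage in turn. All function names and the constants ROWS and COLS are hypothetical.

```python
import bisect
import hashlib

ROWS, COLS = 256, 256  # hypothetical dimensions of the matrix hash index table


def cell(record):
    """Map a record to a (row, column) cell of the matrix (assumed scheme)."""
    digest = hashlib.sha256(record.encode("utf-8")).digest()
    return digest[0] % ROWS, digest[1] % COLS


def split_blocks(records, block_size):
    """Divide raw data into blocks sized to fit the device memory."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def build_block(records):
    """Clean, de-duplicate and sort one block, and build its index table."""
    cleaned = {r.strip() for r in records if r.strip()}
    block = sorted(cleaned)
    table = [[False] * COLS for _ in range(ROWS)]
    for record in block:
        row, col = cell(record)
        table[row][col] = True
    return block, table


def contains(block, table, record):
    """Preliminary screen on the table, then fine screen by binary search."""
    row, col = cell(record)
    if not table[row][col]:
        return False  # the cell is empty, so the record cannot be in the block
    pos = bisect.bisect_left(block, record)
    return pos < len(block) and block[pos] == record


def screen_increment(processed, increment):
    """Return the records of the increment absent from every processed block."""
    candidates, _ = build_block(increment)
    for block, table in processed:
        candidates = [r for r in candidates if not contains(block, table, r)]
    return candidates


raw = ["b", "a", "b", " c ", "", "d"]
processed = [build_block(chunk) for chunk in split_blocks(raw, 3)]
new_records = screen_increment(processed, ["d", "e", "a", "e", "f"])
print(new_records)  # ['e', 'f']
```

In this sketch the table lookup only rules records out: an occupied cell may be a hash collision, so the binary search on the sorted block makes the final decision.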