Patent attributes
A distributed system processes a publisher's streaming data. The distributed system comprises multiple workers and publisher data stores, each publisher data store being dedicated to one worker and one publisher. A sampling ratio (the fraction of data items to be stored in the publisher's data store) is selected by the publisher data store's worker based on historical information. At least two workers select different sampling ratios. Data items, each representing an interaction between an entity and the publisher, are received, and each data item is assigned to a worker for processing. A hash function is applied to the data item's identifier, yielding a key value that falls within the hash function's range. The scope of the publisher's data store equals the hash function's range multiplied by the sampling ratio of that data store. A data item whose key value falls within the scope of the publisher's data store is stored therein.
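
The storage decision described above can be illustrated with a minimal sketch. It is not the claimed implementation: the choice of SHA-256 truncated to 32 bits, the constant HASH_RANGE, and the names key_value, scope, should_store, and Worker are all assumptions made for illustration, and the sampling ratio is passed in directly rather than derived from historical information.

```python
import hashlib

# Assumed size of the hash function's range: keys fall in [0, 2**32).
HASH_RANGE = 2**32


def key_value(item_id: str) -> int:
    """Map a data item's identifier to a key value in [0, HASH_RANGE).

    Hypothetical choice: the first 4 bytes of a SHA-256 digest.
    """
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def scope(sampling_ratio: float) -> int:
    """Scope of a data store: hash range multiplied by its sampling ratio."""
    return int(HASH_RANGE * sampling_ratio)


def should_store(item_id: str, sampling_ratio: float) -> bool:
    """A data item is stored when its key value falls within the scope."""
    return key_value(item_id) < scope(sampling_ratio)


class Worker:
    """Hypothetical worker holding one data store per publisher."""

    def __init__(self, sampling_ratio: float) -> None:
        # In the described system the worker selects this ratio from
        # historical information; here it is simply supplied.
        self.sampling_ratio = sampling_ratio
        self.stores: dict[str, list[str]] = {}

    def process(self, publisher: str, item_id: str) -> bool:
        """Store the item in the publisher's data store if it is in scope."""
        if should_store(item_id, self.sampling_ratio):
            self.stores.setdefault(publisher, []).append(item_id)
            return True
        return False
```

Because the key value depends only on the identifier, a data item that falls within the scope at a lower sampling ratio also falls within the scope at any higher ratio, so two workers with different ratios make consistent decisions about the same identifier.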

