Patent attributes
A system, method, and machine-readable storage medium for determining an amount of unique data in a distributed storage system are provided. In some embodiments, a combined efficiency set for a first data set stored in the distributed storage system, such as at a volume, may be generated. The first data set may include a first subset of data and a second subset of data in the distributed storage system. Additionally, a set of efficiency sets for the first subset of data may be generated. A set difference based on the combined efficiency set and the set of efficiency sets may be computed. An amount of memory used for storing unique data of the second subset of data may be estimated based on the set difference. The unique data may be present in the second subset of data but absent from the first subset of data.