Patent attributes
In one example method, a hash space is identified that includes hashes that each point to a respective piece of data in a customer data set. A first hash is selected using a sub sample ratio, and the first hash is checked to determine whether it points to data previously observed in an associated backup stream. If the first hash points to data not previously observed in the associated backup stream, the number of data pieces to which the first hash points is recorded. If the first hash points to data previously observed in the associated backup stream, nothing is recorded for it and a second hash is selected using the sub sample ratio. The method can be repeated until the entire hash space has been sampled. The storage capacity required for the customer data set is then calculated using the sub sample ratio and the sum of the recorded numbers of data pieces.
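The sampling loop described above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the abstract does not say how a hash is selected or how the final figure is derived, so the sketch assumes that hashes are integers, that a sub sample ratio of 1/N means "select hashes whose value is divisible by N", that all data pieces share one fixed size, and that the estimate is the recorded sum scaled up by N. The names `estimate_capacity`, `sample_every`, and `piece_size_bytes` are hypothetical.

```python
from typing import Iterable, Set, Tuple


def estimate_capacity(
    stream: Iterable[Tuple[int, int]],
    sample_every: int,
    piece_size_bytes: int,
) -> int:
    """Estimate required storage from a sub-sampled hash space.

    stream           -- (hash, piece_count) pairs in backup-stream order
                        (assumed representation, not from the source)
    sample_every     -- N, for an assumed sub sample ratio of 1/N
    piece_size_bytes -- assumed fixed size of one data piece
    """
    if sample_every < 1:
        raise ValueError("sample_every must be at least 1")

    seen: Set[int] = set()   # sampled hashes already observed in the stream
    recorded_pieces = 0      # running sum of recorded piece counts

    for hash_value, piece_count in stream:
        # Assumed selection rule: keep roughly 1 in N hashes.
        if hash_value % sample_every != 0:
            continue
        # Data already observed in the stream: record nothing, move on.
        if hash_value in seen:
            continue
        # First observation: record how many pieces this hash points to.
        seen.add(hash_value)
        recorded_pieces += piece_count

    # Assumed scaling: invert the sub sample ratio, convert to bytes.
    return recorded_pieces * sample_every * piece_size_bytes


if __name__ == "__main__":
    demo_stream = [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (0, 1), (4, 1), (8, 2)]
    print(estimate_capacity(demo_stream, sample_every=4, piece_size_bytes=8192))
```

In the demo stream, hashes 0, 4, and 8 are sampled, and the repeated occurrences of 0 and 4 are skipped as previously observed, so 4 pieces are recorded and then scaled by the assumed factor of 4. Selecting by hash value rather than by position in the stream means every repeat of a sampled hash is also sampled, which is what lets the loop recognise it as a duplicate.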