A plurality of clustered files is received. A first file included in the plurality is selected and loaded into a suffix array. A chunk of a second file that is also present in the first file is located. A determination is made that the located chunk is present in a threshold number of additional files included in the plurality of clustered files. A signature is generated for the plurality of clustered files at least in part by using the chunk.
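
The steps above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the claimed implementation: the function names, the `min_len` and `threshold` parameters, the naive suffix-array construction, and the choice to return the shared chunk itself as the signature are all hypothetical. The sketch also assumes that the longest chunk shared by the first and second files is the one of interest.

```python
from typing import List, Optional


def build_suffix_array(data: bytes) -> List[int]:
    # Naive construction: sort suffix start offsets by the suffix bytes.
    # Adequate for a sketch; large inputs need a faster algorithm.
    return sorted(range(len(data)), key=lambda i: data[i:])


def contains(data: bytes, sa: List[int], chunk: bytes) -> bool:
    # Binary search for the first suffix whose prefix is >= chunk.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if data[sa[mid]:sa[mid] + len(chunk)] < chunk:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and data[sa[lo]:sa[lo] + len(chunk)] == chunk


def longest_common_chunk(first: bytes, sa: List[int], second: bytes,
                         min_len: int) -> Optional[bytes]:
    # Locate the longest chunk of `second` that also occurs in `first`.
    best = None
    for start in range(len(second) - min_len + 1):
        length = min_len
        if not contains(first, sa, second[start:start + length]):
            continue
        # Extend the match one byte at a time while it still occurs in `first`.
        while (start + length < len(second)
               and contains(first, sa, second[start:start + length + 1])):
            length += 1
        if best is None or length > len(best):
            best = second[start:start + length]
    return best


def generate_signature(files: List[bytes], min_len: int = 4,
                       threshold: int = 2) -> Optional[bytes]:
    # Assumption: the signature is simply the shared chunk itself.
    first = files[0]
    sa = build_suffix_array(first)
    for j in range(1, len(files)):
        chunk = longest_common_chunk(first, sa, files[j], min_len)
        if chunk is None:
            continue
        # Count the additional files (neither first nor second) with the chunk.
        others = files[1:j] + files[j + 1:]
        if sum(chunk in other for other in others) >= threshold:
            return chunk
    return None


cluster = [
    b"xxMALWAREPAYLOADyy",
    b"abMALWAREPAYLOADcd",
    b"12MALWAREPAYLOAD34",
    b"zzMALWAREPAYLOADqq",
]
print(generate_signature(cluster, min_len=4, threshold=2))  # b'MALWAREPAYLOAD'
```

In this sketch only the first file is indexed with a suffix array. The additional files are checked with a plain substring test, which keeps the example short at the cost of speed on large clusters.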