Patent attributes
Compressing files is disclosed. An input file to be compressed is first aligned. Aligning the file includes splitting the file into sequences that can be aligned. The result is a compression matrix, where each row of the matrix corresponds to part of the file. The compression matrix may also serve as a warm start if additional compression is desired. Compression may be performed in stages, where an initial compression matrix is generated in a first stage using larger letter sizes for alignment and then a second compression stage is performed using smaller letter sizes. A consensus sequence is determined from the compression matrix. Using the consensus sequence, pointer pairs are generated. Each pointer pair identifies a subsequence of the consensus matrix. The compressed file includes the pointer pairs and the consensus sequence.
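The flow described above (matrix, consensus sequence, pointer pairs) can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the disclosed method: the file is split into fixed-width rows instead of being aligned with gaps, the consensus is a per-column majority vote, any symbols missing from the consensus are appended so that decoding stays lossless, and pointer pairs are chosen by a greedy longest match. All names (`GAP`, `build_matrix`, `consensus`, `compress`, `decompress`) are hypothetical.

```python
from collections import Counter

GAP = "-"  # hypothetical padding symbol; assumed not to occur in the input


def build_matrix(data: str, width: int) -> list[str]:
    """Split the input into rows of `width` symbols, padding the last row."""
    rows = [data[i:i + width] for i in range(0, len(data), width)]
    return [row.ljust(width, GAP) for row in rows]


def consensus(matrix: list[str]) -> str:
    """Pick the most common non-gap symbol in each column of the matrix."""
    out = []
    for column in zip(*matrix):
        counts = Counter(sym for sym in column if sym != GAP)
        if counts:
            out.append(counts.most_common(1)[0][0])
    return "".join(out)


def compress(data: str, width: int) -> tuple[str, list[tuple[int, int]]]:
    """Return (consensus sequence, pointer pairs) for the input."""
    cons = consensus(build_matrix(data, width))
    # Assumption: append symbols the majority vote dropped, so every
    # input symbol can be referenced by some pointer pair.
    cons += "".join(sorted(set(data) - set(cons)))

    pairs = []
    pos = 0
    while pos < len(data):
        best_start, best_len = 0, 0
        # Greedy search for the longest consensus run matching data[pos:].
        for start in range(len(cons)):
            length = 0
            while (pos + length < len(data)
                   and start + length < len(cons)
                   and data[pos + length] == cons[start + length]):
                length += 1
            if length > best_len:
                best_start, best_len = start, length
        # Each pointer pair is a half-open (start, end) slice of the consensus.
        pairs.append((best_start, best_start + best_len))
        pos += best_len
    return cons, pairs


def decompress(cons: str, pairs: list[tuple[int, int]]) -> str:
    """Rebuild the input by concatenating the referenced consensus slices."""
    return "".join(cons[start:end] for start, end in pairs)


if __name__ == "__main__":
    original = "ABCDABCDABCEABCD"
    cons, pairs = compress(original, width=4)
    print(cons, pairs)
    assert decompress(cons, pairs) == original
```

In this sketch the compressed output is the consensus string together with the list of pointer pairs, mirroring the last sentence of the abstract. The staged compression with larger and then smaller letter sizes, and the warm start, are not modeled here.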