Patent attributes
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for clustering for parallel processing. One of the methods includes providing virtual machines with an interface to a shuffle service, the shuffle service executing external to the virtual machines. The method includes receiving data records through the interface, each data record having a key and a value. The method includes partitioning the data records, using the shuffle service, according to the respective keys. The method includes providing a part of the partitioned data records through the interface to the virtual machines, wherein data records having the same key are provided to the same virtual machine. Each of the virtual machines can execute on a host machine, and each of the virtual machines is a hardware virtualization of a machine.
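
The partitioning step described above can be illustrated with a minimal sketch. It is not the claimed implementation: the class name ShuffleService, the methods submit and fetch, and the use of a CRC32 hash modulo the number of virtual machines are all assumptions chosen for illustration. The abstract states only that records are partitioned by key and that records with the same key reach the same virtual machine. The sketch also runs in a single process, whereas the described service executes external to the virtual machines.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Record = Tuple[str, str]  # (key, value)


class ShuffleService:
    """Hypothetical in-process stand-in for a shuffle service."""

    def __init__(self, num_vms: int) -> None:
        if num_vms <= 0:
            raise ValueError("num_vms must be positive")
        self.num_vms = num_vms
        self._partitions: Dict[int, List[Record]] = defaultdict(list)

    def _partition_for(self, key: str) -> int:
        # Assumption: hash partitioning. CRC32 is stable across runs,
        # unlike Python's built-in hash() for strings.
        return zlib.crc32(key.encode("utf-8")) % self.num_vms

    def submit(self, record: Record) -> None:
        # Receive one key/value record through the interface.
        key, _ = record
        self._partitions[self._partition_for(key)].append(record)

    def fetch(self, vm_index: int) -> List[Record]:
        # Provide one VM with its part of the partitioned records.
        return list(self._partitions.get(vm_index, []))


service = ShuffleService(num_vms=3)
for rec in [("apple", "1"), ("pear", "2"), ("apple", "3"), ("fig", "4")]:
    service.submit(rec)

# Record which VMs received each key; each key should map to exactly one VM.
key_to_vms: Dict[str, Set[int]] = defaultdict(set)
for vm in range(service.num_vms):
    for key, _ in service.fetch(vm):
        key_to_vms[key].add(vm)
```

Because the partition index depends only on the key, both "apple" records land in the same partition regardless of arrival order, which is the property the abstract requires.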