Patent attributes
Data files in a distributed system sometimes becomes unavailable. A method for fault tolerance without data loss in a distributed file system includes allocating data nodes of the distributed file system among a plurality of compute groups, replicating a data file among a subset of the plurality of the compute groups such that the data file is located in at least two compute zones, wherein the first compute zone is isolated from the second compute zone, monitoring the accessibility of the data files, and causing a distributed task requiring data in the data file to be executed by a compute instance in the subset of the plurality of the compute groups. Upon detecting a failure in the accessibility of a data node with the data file, the task management node may redistribute the distributed task among other compute instances with access to any replica of the data file.