The present system uses quorum based aggregator failure detection in which a failed aggregator is detected and configured. Rather than repair and roll-up of all metrics for a period of time associated with the failed aggregator, only the specific metrics that were to be processed by the failed aggregator are repaired. Once the failed aggregator is identified, the time range for the downed aggregator and keys processed by the aggregator are identified. Keys for replica aggregators associated with the identified time ranges and key values are then pulled, provided to a batch processor, and processed. At cluster roll-up task completion, a time rollup task for cluster rollup is then started.