Resource Overview
Sailfish aims to improve disk subsystem performance for large-scale Map-Reduce computations. The work is based on the observation that bandwidth within a datacenter will increase substantially over the coming years. That increase can be harnessed to build network-wide data aggregation and thereby improve disk throughput: data is "group-committed" to disk by batching writes from multiple writers that run on different machines in the network.
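The group-commit idea can be illustrated with a minimal single-process sketch. It is not Sailfish's implementation: the class name `GroupCommitBuffer`, the `flush_threshold_bytes` parameter, and the in-memory sink are all assumptions made for illustration. Records arriving from several writers are held in memory and written out in one large sequential write once enough bytes have accumulated.

```python
import io


class GroupCommitBuffer:
    """Hypothetical aggregator: batches small records and flushes them together."""

    def __init__(self, sink, flush_threshold_bytes):
        self.sink = sink                          # any object with a write(bytes) method
        self.flush_threshold_bytes = flush_threshold_bytes
        self.pending = []                         # records waiting to be written
        self.pending_bytes = 0
        self.flush_count = 0                      # number of writes issued to the sink

    def append(self, payload):
        """Accept one record from any writer; flush once the batch is large enough."""
        self.pending.append(payload)
        self.pending_bytes += len(payload)
        if self.pending_bytes >= self.flush_threshold_bytes:
            self.flush()

    def flush(self):
        """Write all pending records to the sink in a single call."""
        if not self.pending:
            return
        self.sink.write(b"".join(self.pending))
        self.pending.clear()
        self.pending_bytes = 0
        self.flush_count += 1


# Four simulated writers each send one 3-byte record.
sink = io.BytesIO()
buf = GroupCommitBuffer(sink, flush_threshold_bytes=8)
for _writer_id in range(4):
    buf.append(b"abc")
buf.flush()  # drain whatever is left at the end
```

In this toy run the four records reach the sink in two writes rather than four. In a real deployment the writers would be on different machines and the sink would be a disk, which is where the batching would be expected to pay off.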
There are two key ideas in the Sailfish design:
1. I-files, a network-wide data aggregation abstraction
2. Use I-files to aggregate intermediate data (namely, map output) in a Map-Reduce computation. Gather statistics on the aggregated data to plan the reduce phase of execution.
Using I-files to transport map output helps improve disk performance when handling terabytes of intermediate data. Dynamically planning the reduce phase improves auto-tune/auto-scale functionality: Sailfish handles skew as well
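To make the statistics-driven planning concrete, here is a rough sketch of how byte counts gathered on aggregated map output might be used to size reduce tasks. The description above does not specify Sailfish's planner, so the function `plan_reduce_partitions`, the parameter `target_bytes_per_reducer`, and the greedy range-splitting strategy are all assumptions for illustration only.

```python
from collections import Counter


def plan_reduce_partitions(key_byte_counts, target_bytes_per_reducer):
    """Greedily split the sorted key space into contiguous ranges.

    Each range is closed as soon as adding the next key would exceed the
    target size, so a single heavy key ends up in a partition of its own.
    """
    partitions = []
    current_keys, current_bytes = [], 0
    for key in sorted(key_byte_counts):
        size = key_byte_counts[key]
        if current_keys and current_bytes + size > target_bytes_per_reducer:
            partitions.append(current_keys)
            current_keys, current_bytes = [], 0
        current_keys.append(key)
        current_bytes += size
    if current_keys:
        partitions.append(current_keys)
    return partitions


# Statistics gathered while map output is aggregated (hypothetical sample data).
stats = Counter()
map_output = [("a", b"xx"), ("b", b"xxxxxxxx"), ("c", b"x"), ("d", b"xx"), ("a", b"x")]
for key, value in map_output:
    stats[key] += len(value)

plan = plan_reduce_partitions(stats, target_bytes_per_reducer=8)
```

Because the number of partitions falls out of the observed data sizes instead of being fixed up front, a skewed key such as `"b"` gets its own reduce task while the small keys `"c"` and `"d"` share one.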