HAMAKE
Introduction
Most non-trivial data processing scenarios with Hadoop typically require more than one MapReduce job. Usually such processing is data-driven, with the data funneled through a sequence of jobs. The processing model can be presented in terms of dataflow programming: it can be expressed as a directed graph with datasets as nodes. Each edge indicates a dependency between two or more datasets and is associated with a processing instruction (a Hadoop MapReduce job, a Pig Latin script, or an external command) that produces one dataset from the others. Using fuzzy timestamps to detect when a dataset needs to be updated, we can calculate the sequence in which the tasks need to be executed to bring all datasets up to date. Jobs for updating independent datasets can be executed concurrently, taking advantage of your Hadoop cluster's full capacity. The dependency graph may even contain cycles, leading to dependency loops w
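To make the model above concrete, here is a minimal sketch in Java of the idea: datasets are graph nodes with timestamps, each task produces one dataset from its inputs, a dataset is considered stale when any input carries a newer timestamp, and tasks are ordered so that producers run before consumers. The Dataset and Task classes, the plan() method, and the staleness rule are illustrative assumptions for this sketch only; they are not hamake's actual API or implementation, and a real tool would apply a fuzziness tolerance to the timestamp comparison and schedule independent tasks concurrently.

import java.util.*;

public class DataflowSketch {

    // A dataset node, identified by name, with a last-modified timestamp.
    static final class Dataset {
        final String name;
        long timestamp;
        Dataset(String name, long timestamp) { this.name = name; this.timestamp = timestamp; }
    }

    // A task edge: produces one output dataset from one or more input datasets.
    static final class Task {
        final String command;          // e.g. a MapReduce job, a Pig script, or a shell command
        final List<Dataset> inputs;
        final Dataset output;
        Task(String command, List<Dataset> inputs, Dataset output) {
            this.command = command; this.inputs = inputs; this.output = output;
        }
        // The output is out of date if any input is newer. A real tool would use a
        // fuzzy comparison here to absorb clock skew between cluster nodes.
        boolean needsRun() {
            return inputs.stream().anyMatch(in -> in.timestamp > output.timestamp);
        }
    }

    // Order tasks so that the producer of a dataset runs before its consumers
    // (a simple depth-first topological sort over the dependency graph).
    static List<Task> plan(List<Task> tasks) {
        Map<String, Task> producerOf = new HashMap<>();
        for (Task t : tasks) producerOf.put(t.output.name, t);

        List<Task> ordered = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        for (Task t : tasks) visit(t, producerOf, visited, ordered);
        return ordered;
    }

    private static void visit(Task t, Map<String, Task> producerOf,
                              Set<String> visited, List<Task> ordered) {
        if (!visited.add(t.output.name)) return;   // already planned; also stops simple cycles
        for (Dataset in : t.inputs) {
            Task producer = producerOf.get(in.name);
            if (producer != null) visit(producer, producerOf, visited, ordered);
        }
        ordered.add(t);
    }

    public static void main(String[] args) {
        Dataset rawLogs = new Dataset("raw-logs", 300);
        Dataset cleaned = new Dataset("cleaned", 100);
        Dataset report  = new Dataset("report", 200);

        Task cleanTask  = new Task("mapreduce: clean-logs", List.of(rawLogs), cleaned);
        Task reportTask = new Task("pig: build-report", List.of(cleaned), report);

        for (Task t : plan(List.of(reportTask, cleanTask))) {
            if (t.needsRun()) {
                System.out.println("run: " + t.command);
                // Simulate execution: the output now carries a fresh timestamp, so
                // downstream tasks see that their inputs have changed.
                t.output.timestamp =
                        t.inputs.stream().mapToLong(d -> d.timestamp).max().orElse(0) + 1;
            } else {
                System.out.println("up to date: " + t.output.name);
            }
        }
    }
}

Running this sketch first rebuilds "cleaned" (its input "raw-logs" is newer) and then "report", illustrating how a single stale dataset cascades through the dependency graph; the design choice of keying everything on dataset timestamps rather than on job identities is what lets independent branches of the graph be updated in parallel.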