Our paper on debugging systems for data-intensive analytics got accepted to the ACM Symposium on Cloud Computing. The paper presents Newt, a scalable architecture for capturing and querying data lineage information, to find and resolve errors in processing pipelines.
Newt provides a ﬂexible instrumentation API that allows system developers to collect ﬁne-grain lineage from a range of data intensive scalable computing (DISC) architectures. Newt pairs this API with a scale-out, fault-tolerant lineage store and query engine.
Until the camera-ready version, take a look at the technical report here.