Distributed Data Flow Processing: Challenges and Opportunities


An automated analysis of a blood smear involves digitising the sample in the form of images and sending it over the network to the cloud. Further processing happens on the cloud which involves identifying various kinds of cells in them, estimating many of their physical properties and aggregating the data to arrive at approximations as accurate as possible. The raw images, thus, need to be processed for extracting objects of interest which are then fed to various machine learning models, primarily for classification. The output of these models is then aggregated and combined into meaningful metrics. Those metrics are then interpreted for any abnormalities. Finally, all the data is fed to a reporting UI. In this process, the data flows through a number of functions, which classify, transform, aggregate, interpret or simply output them. Some of these functions are independent and can run in parallel, some however need to wait until the output from other functions is available. Moreover, the logic of these functions and their behaviour can be separated, for example, a function might be responsible for running a model, but which model to run and with what parameters defines its behaviour. We can, therefore, write the logic of these functions and keep its behaviour configurable. The dependencies among the functions can be specified in form of a Directed Acyclic Graph (DAG). The process of analysing a blood smear then becomes taking the configuration and executing the DAG according to it. The underlying process is not very different for analysing a semen or urine sample, so the same abstraction can be used for them.

The analysis process mentioned above is not constant and will keep changing very frequently. We might want to compute new metrics from the existing ones or sub-classify the cells further or replace existing models with better ones. The challenge here is to facilitate the evolution of models as well as computation of the metrics, to an extent, without necessitating any change in the engine. Only the processing capabilities should be part of the engine and the behaviour should be completely configurable based on which kind of sample is being analysed.

The engine runs multiple analyses at a time, all of which should complete in a given amount of time. Although, different analyses can be run on separate nodes in parallel, the representation of the analysis as a Directed Acyclic Graph allows taking advantage of data and task parallelism and creates a scope for distributing even a single analysis on separate nodes. The next challenge is to make all of this processing real time and make the result of the analysis available to the end user as soon as it is submitted.