Data stream processing

I have a class of calculations that naturally take the structure of a graph. The graph is far from linear: there are several inputs, sink nodes, and nodes that require the results of several other nodes. The computation as a whole may also have several outputs. Cycles, however, never occur. Input nodes are updated (not necessarily one at a time), and their values flow through a (currently purely conceptual) graph. Nodes maintain state as inputs change, and the calculations must be performed sequentially with respect to the inputs.

Since I have to write such calculations quite often and was reluctant to write ad-hoc code every time, I tried to write a small library that simplifies adding these calculations by writing classes for the different vertices. My code, however, is rather inelegant, and it does not exploit the parallel structure of these calculations. Although each vertex is usually lightweight, the overall calculation can be quite complex and "wide." To make the problem even harder, the inputs to these calculations are very often updated in a loop. Fortunately, the problems are small enough that I can handle them on a single node.
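To make the kind of library described above concrete, here is a minimal sketch of one possible design: stateful vertices wired into a DAG, with input updates pushed through in topological order. All class and function names here are hypothetical, for illustration only; a real version would also track dirty nodes so that only affected vertices recompute.

```python
class Node:
    """A vertex that recomputes from the cached values of its inputs."""
    def __init__(self, func, *inputs):
        self.func = func            # pure function of the input values
        self.inputs = list(inputs)  # upstream nodes
        self.value = None           # cached state

class Input(Node):
    """A source vertex whose value is set externally."""
    def __init__(self, value=None):
        super().__init__(None)
        self.value = value

def topological_order(outputs):
    """Depth-first ordering: every vertex appears after all of its inputs."""
    order, seen = [], set()
    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for upstream in node.inputs:
            visit(upstream)
        order.append(node)
    for node in outputs:
        visit(node)
    return order

def propagate(outputs):
    """Push the current input values through the graph, sequentially."""
    for node in topological_order(outputs):
        if node.inputs:             # sources are already set
            node.value = node.func(*(n.value for n in node.inputs))

# Usage: two inputs feeding a diamond-shaped graph with one output.
a, b = Input(2), Input(3)
s = Node(lambda x, y: x + y, a, b)   # 2 + 3 = 5
p = Node(lambda x, y: x * y, a, b)   # 2 * 3 = 6
out = Node(lambda x, y: x - y, s, p)
propagate([out])                     # out.value is now 5 - 6 = -1
a.value = 10                         # update an input and re-run
propagate([out])                     # out.value is now 13 - 30 = -17
```

This recomputes the whole graph on every propagation, which is fine for small problems; exploiting the parallel structure would mean scheduling independent vertices of each topological "level" concurrently.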

Has anyone dealt with something like this? What ideas / approaches / tools would you recommend?

+8
stream dataflow control-flow-graph
1 answer

Apache Storm: reliable, real-time stream processing on distributed hardware

This sounds like a problem for which Apache Storm (Open Source) would be perfect: http://storm.apache.org/

Apache Storm is a real-time stream computation system that processes single tuples (data points) one at a time. Storm guarantees that each tuple is processed at least once. With Storm Trident you can abstract further over Storm and get exactly-once semantics.

Apache Storm is a free and open source distributed real-time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing.

My company and I have been working with Apache Storm for several years, and it is one of the most mature Big Data technologies, i.e., technologies that scale horizontally across cheap commodity hardware.

API and documentation

The main API is for Java, but there are adapters for Ruby, Python, JavaScript, and Perl. However, you can really use any language: http://storm.apache.org/about/multi-language.html

The documentation is good (although the JavaDoc could use more detail): http://storm.apache.org/documentation.html

The core concepts are spouts and bolts (= graph nodes)

[Figure: Apache Storm spouts and bolts]

Storm has spouts, from which data is read into a so-called topology. A topology is the graph you described. When new tuples arrive at the spouts, they are sent through the topology. Each of your graph's nodes would be a Storm bolt.
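To make the terminology concrete, here is a toy, single-process imitation of the spout/bolt model. This is not the real Storm API (in Java that would be classes like BaseRichSpout and BaseRichBolt, wired together with a TopologyBuilder); the sketch only mirrors the flow of tuples, and all names in it are illustrative.

```python
class SentenceSpout:
    """Spout: the source that emits tuples into the topology."""
    def __init__(self, sentences):
        self.sentences = sentences
    def emit(self):
        yield from self.sentences

class SplitBolt:
    """Stateless bolt: receives a tuple, emits derived tuples."""
    def process(self, sentence):
        yield from sentence.split()

class CountBolt:
    """Stateful bolt: keeps running word counts across tuples."""
    def __init__(self):
        self.counts = {}
    def process(self, word):
        self.counts[word] = self.counts.get(word, 0) + 1

# Wire the topology by hand: spout -> split bolt -> count bolt.
spout = SentenceSpout(["the quick fox", "the lazy dog"])
split, count = SplitBolt(), CountBolt()
for sentence in spout.emit():
    for word in split.process(sentence):
        count.process(word)
# count.counts now holds {"the": 2, "quick": 1, "fox": 1, "lazy": 1, "dog": 1}
```

In real Storm, the wiring loop at the bottom is replaced by the cluster: you declare which bolts subscribe to which streams, and Storm distributes the tuples, parallelizes the bolts, and replays tuples on failure.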

Use cases

Storm has many use cases: real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.

+2
