Storage in Apache Flink

After processing millions of events, where is the best place to store the results, assuming that much data is worth persisting? I saw a closed pull request mentioning Parquet formats, but is HDFS used by default? My main concern is where to save the data so that it is easy (and fast!) to retrieve later.

1 answer

Apache Flink is not tied to a specific storage system or storage format. The best place to store results computed by Flink depends on your use case.

  • Are you running a batch or a streaming job?
  • What do you want to do with the result?
  • Do you need batch (full scan), point, or continuous streaming access to the data?
  • What format does the data have? Flat structured (relational), nested, blob, ...?

Depending on the answers to these questions, you can choose from various storage systems, for example (a minimal Kafka sketch follows the list):

  • Apache HDFS for batch access (with different storage formats such as Parquet, ORC, or custom binary formats)
  • Apache Kafka if you want to access the data as a stream
  • Key-value stores such as Apache HBase and Apache Cassandra for point access to the data
  • A database such as MongoDB, MySQL, ...
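For the streaming case, here is a minimal sketch of writing results to Kafka with Flink's KafkaSink. The broker address ("localhost:9092") and topic name ("results") are placeholders, and the sketch assumes a recent Flink version with the flink-connector-kafka dependency on the classpath.

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
    import org.apache.flink.connector.kafka.sink.KafkaSink;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class KafkaResultSink {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Stand-in for the real computation; results are plain strings here.
            DataStream<String> results = env.fromElements("event-1", "event-2", "event-3");

            // Kafka sink: broker address and topic name are placeholders.
            KafkaSink<String> sink = KafkaSink.<String>builder()
                    .setBootstrapServers("localhost:9092")
                    .setRecordSerializer(
                            KafkaRecordSerializationSchema.builder()
                                    .setTopic("results")
                                    .setValueSerializationSchema(new SimpleStringSchema())
                                    .build())
                    .build();

            results.sinkTo(sink);
            env.execute("write-results-to-kafka");
        }
    }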

Flink provides connectors and OutputFormats for most of these systems (some via a wrapper around Hadoop OutputFormats). The "best" system depends on your use case.
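As one concrete example of file-based output, here is a minimal sketch using Flink's FileSink to write row-encoded string results to a directory. The hdfs:///tmp/results path is a placeholder (any Flink-supported file system URI would work), and the flink-connector-files dependency is assumed.

    import org.apache.flink.api.common.serialization.SimpleStringEncoder;
    import org.apache.flink.connector.file.sink.FileSink;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class FileResultSink {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // FileSink finalizes in-progress files on checkpoints, so enable checkpointing.
            env.enableCheckpointing(10_000);

            // Stand-in for the real computation.
            DataStream<String> results = env.fromElements("result-a", "result-b");

            // Row-encoded file sink; the output path is a placeholder.
            FileSink<String> sink = FileSink
                    .forRowFormat(new Path("hdfs:///tmp/results"), new SimpleStringEncoder<String>("UTF-8"))
                    .build();

            results.sinkTo(sink);
            env.execute("write-results-to-files");
        }
    }

Bulk formats such as Parquet can be plugged in through FileSink.forBulkFormat with an appropriate bulk writer factory instead of the row encoder used above.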

