Storage in Apache Flink

After processing millions of events, where is the best place to store the results, assuming that much data is worth persisting? I saw a closed pull request mentioning Parquet formats, but is HDFS used by default? My main concern is where to save the data so that it is easy (and fast!) to retrieve later.

1 answer

Apache Flink is not tied to a specific storage system or storage format. The best place to store results computed by Flink depends on your use case.

  • Are you running a batch or a streaming job?
  • What do you want to do with the result?
  • Do you need batch (full scan), point, or continuous streaming access to the data?
  • What format does the data have? Flat structured (relational), nested, blob, ...?

Depending on the answers to these questions, you can choose from various storage systems, for example (a minimal Kafka sketch follows the list):

  • Apache HDFS for batch access (with different storage formats such as Parquet, ORC, or custom binary formats)
  • Apache Kafka if you want to access the data as a stream
  • Key-value stores such as Apache HBase and Apache Cassandra for point access to the data
  • A database such as MongoDB, MySQL, ...
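For the streaming case, here is a minimal sketch of writing results to Kafka with Flink's KafkaSink. The broker address ("localhost:9092") and topic name ("results") are placeholders, and the sketch assumes a recent Flink version with the flink-connector-kafka dependency on the classpath.

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
    import org.apache.flink.connector.kafka.sink.KafkaSink;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class KafkaResultSink {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Stand-in for the real computation; results are plain strings here.
            DataStream<String> results = env.fromElements("event-1", "event-2", "event-3");

            // Kafka sink: broker address and topic name are placeholders.
            KafkaSink<String> sink = KafkaSink.<String>builder()
                    .setBootstrapServers("localhost:9092")
                    .setRecordSerializer(
                            KafkaRecordSerializationSchema.builder()
                                    .setTopic("results")
                                    .setValueSerializationSchema(new SimpleStringSchema())
                                    .build())
                    .build();

            results.sinkTo(sink);
            env.execute("write-results-to-kafka");
        }
    }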

Flink provides connectors and OutputFormats for most of these systems (some via a wrapper around Hadoop OutputFormats). The "best" system depends on your use case.
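As one concrete example of file-based output, here is a minimal sketch using Flink's FileSink to write row-encoded string results to a directory. The hdfs:///tmp/results path is a placeholder (any Flink-supported file system URI would work), and the flink-connector-files dependency is assumed.

    import org.apache.flink.api.common.serialization.SimpleStringEncoder;
    import org.apache.flink.connector.file.sink.FileSink;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class FileResultSink {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // FileSink finalizes in-progress files on checkpoints, so enable checkpointing.
            env.enableCheckpointing(10_000);

            // Stand-in for the real computation.
            DataStream<String> results = env.fromElements("result-a", "result-b");

            // Row-encoded file sink; the output path is a placeholder.
            FileSink<String> sink = FileSink
                    .forRowFormat(new Path("hdfs:///tmp/results"), new SimpleStringEncoder<String>("UTF-8"))
                    .build();

            results.sinkTo(sink);
            env.execute("write-results-to-files");
        }
    }

Bulk formats such as Parquet can be plugged in through FileSink.forBulkFormat with an appropriate bulk writer factory instead of the row encoder used above.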

