I ingest millions of small journal documents every week, and over them I run:
- ad-hoc data-mining requests
- combining, comparing, filtering, and computing values
- a lot of multi-pattern text search using Python (see the sketch after this list)
- all of these operations over the full set of millions of documents, sometimes every day
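Today the search-and-compute step is plain Python, roughly like this (a minimal sketch only; the documents are shown as dicts with hypothetical `text` and `value` fields):

```python
import re

def scan(documents, patterns, value_field="value"):
    """Multi-pattern text search plus a running aggregate over the matches."""
    compiled = [re.compile(p) for p in patterns]
    matched, total = 0, 0.0
    for doc in documents:
        # a document counts if any of the patterns hits its text
        if any(rx.search(doc["text"]) for rx in compiled):
            matched += 1
            total += float(doc.get(value_field, 0.0))
    return matched, total

if __name__ == "__main__":
    docs = [
        {"text": "connection timeout on node-3", "value": 120.0},
        {"text": "request served ok", "value": 8.0},
    ]
    print(scan(docs, [r"timeout", r"connection refused"]))  # -> (1, 120.0)
```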
My first thought was to put all the documents into HBase/HDFS and run Hadoop jobs that produce the statistics.
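As a Hadoop Streaming job, that batch route would look roughly like this (a sketch, assuming one JSON document per input line; the `level` and `duration_ms` fields are made-up examples):

```python
#!/usr/bin/env python
# --- mapper.py ---
import json
import sys

for line in sys.stdin:
    try:
        doc = json.loads(line)
    except ValueError:
        continue  # skip malformed lines
    # emit "level <TAB> duration" so the reducer can aggregate per level
    print("%s\t%s" % (doc.get("level", "UNKNOWN"), doc.get("duration_ms", 0)))
```

```python
#!/usr/bin/env python
# --- reducer.py ---
# Hadoop Streaming delivers the mapper output sorted by key, so a
# simple key-change loop is enough to compute count and average.
import sys

current, count, total = None, 0, 0.0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current:
        if current is not None:
            print("%s\tcount=%d\tavg=%.2f" % (current, count, total / count))
        current, count, total = key, 0, 0.0
    count += 1
    total += float(value)
if current is not None:
    print("%s\tcount=%d\tavg=%.2f" % (current, count, total / count))
```

Run with something like `hadoop jar hadoop-streaming.jar -input /data/journals -output /data/stats -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py` (paths here are placeholders).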
The problem with that batch route: some of the results need to be close to real time.
So, after some research, I discovered Elasticsearch, and now I'm thinking about moving all those millions of documents into it and using its query DSL to generate the statistics.
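Something like this is what I imagine (a sketch against the official Python client; the `journals` index and the `message`, `source`, `level`, and `duration_ms` fields are assumptions for illustration):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="journals",
    body={
        "size": 0,  # we only want the aggregations, not the matching hits
        "query": {
            "multi_match": {  # full-text search across several fields
                "query": "timeout error",
                "fields": ["message", "source"],
            }
        },
        "aggs": {
            "by_level": {
                "terms": {"field": "level"},  # one bucket per log level
                "aggs": {
                    # min/max/avg/sum of durations inside each bucket
                    "duration_stats": {"stats": {"field": "duration_ms"}}
                },
            }
        },
    },
)

for b in resp["aggregations"]["by_level"]["buckets"]:
    print(b["key"], b["doc_count"], b["duration_stats"]["avg"])
```

The `"size": 0` keeps the response down to the aggregation buckets, which is all a statistics job needs.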
Is that a good idea? Elasticsearch seems to handle millions or even billions of documents with ease.
Tags: hbase, hadoop, hdfs, elasticsearch, bigdata