Hadoop as a document repository database

We have a large document warehouse currently occupying 3 TB of space, growing by 1 TB every six months. The documents are currently stored on a Windows file system, which sometimes causes problems with access and search. We are considering moving to a Hadoop-based document repository. Is working with Hadoop a good idea here? Has anyone done something similar? What problems or technological obstacles might we run into?

hadoop
3 answers

Hadoop is better suited to batch processing than to low-latency data access. You should take a look at some of the NoSQL systems, such as document-oriented databases. It is difficult to answer without knowing more about your data.

The number one rule of NoSQL design is to define your query scenarios first. Once you really understand how you want to query the data, you can evaluate the various NoSQL solutions. The default unit of distribution is the key, so you need to make sure you can partition your data sensibly across your nodes; otherwise you will not actually get a horizontally scalable system, because all the work will still land on one node (though scatter-gather queries may be appropriate in some cases).
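To make the partitioning point concrete, here is a minimal sketch of hash-based key partitioning. The names (NUM_NODES, shard_for) are illustrative, not any particular database's API; real systems use consistent hashing or range partitioning, but the idea is the same: the key alone decides which node owns a document.

```python
# Sketch: hash-based partitioning. A document's shard key is hashed and
# the result picks a node. A well-chosen key spreads load evenly; a bad
# key (e.g. one constant value) would send every write to one node.
import hashlib

NUM_NODES = 4

def shard_for(doc_id: str) -> int:
    """Map a document id to a node by hashing the shard key."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_NODES

counts = [0] * NUM_NODES
for i in range(10_000):
    counts[shard_for(f"doc-{i}")] += 1

print(counts)  # roughly even split across the 4 nodes
```

If all your queries filter on a field that is not part of the key, every query has to fan out to all nodes, which is exactly the "all the work on one node / scatter-gather" trap described above.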

You also need to keep the CAP theorem in mind: most NoSQL databases are eventually consistent (CP or AP), while CA describes the traditional relational DBMS. This affects how you handle your data and how you design certain things; key generation, for example, can get tricky. Obviously this is all rather different from files in a folder.

Also remember that some systems, such as HBase, have no concept of secondary indexes (I imagine you have some file indexing set up in your current Windows FS document store). Any indexes have to be built and maintained by your application logic, and updates and deletes have to be managed accordingly. With Mongo you can create indexes on fields and query them relatively quickly, and there is also the option of integrating Solr with Mongo; so in Mongo you are not limited to querying by id the way you are in HBase, which is a column-family store (a Google BigTable-style database) where you essentially have nested key-value pairs.
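Here is what "indexes maintained by your application logic" means in practice for a key-value store like HBase. Plain dicts stand in for tables in this sketch, and all the names are made up; the point is that the application, not the database, must keep the index table in sync on every write and delete.

```python
# Sketch: an application-maintained secondary index over a key-value
# store. `docs` plays the role of the main table, `by_author` the
# hand-rolled index table mapping an indexed field back to row keys.
docs = {}        # row_key -> document
by_author = {}   # author -> set of row_keys (our secondary index)

def put(row_key, doc):
    old = docs.get(row_key)
    if old is not None:                      # un-index the old value first
        by_author[old["author"]].discard(row_key)
    docs[row_key] = doc
    by_author.setdefault(doc["author"], set()).add(row_key)

def delete(row_key):
    doc = docs.pop(row_key, None)
    if doc is not None:                      # keep the index consistent
        by_author[doc["author"]].discard(row_key)

put("r1", {"author": "alice", "title": "spec.pdf"})
put("r2", {"author": "bob", "title": "notes.doc"})
put("r1", {"author": "carol", "title": "spec-v2.pdf"})  # update re-indexes r1
delete("r2")

print(sorted(by_author["carol"]))  # ['r1']
```

Every code path that touches the data has to remember to do this bookkeeping, which is exactly the burden that Mongo's built-in field indexes (or a Solr integration) take off your hands.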

So, again, it comes down to your data: what you want to store, how you plan to store it, and, most importantly, how you want to access it. The Lily project looks very promising.

In my own work we ingest large amounts of data from the web and store, analyze, dedupe, parse, process, and update it. We do not use just one system but several, each best suited to its part of the job: different systems at different stages, because that gives us fast access where we need it, the ability to stream and analyze data in real time and, importantly, to track everything as it moves (data loss in a production system is a big deal). I use Hadoop, HBase, Hive, MongoDB, Solr, MySQL and even good old text files. Bear in mind that taking a system built on these technologies to production is rather more involved than installing Oracle on a server; some releases are not that stable, and you really need to test first. In the end it comes down to your business's appetite for risk and how critical your system is.

Another avenue that no one has mentioned so far is NewSQL, i.e. horizontally scalable RDBMSs... There are a few, such as MySQL Cluster (I think) and VoltDB, that might suit your case. But again, it depends on your data (are these binary document files, or text documents carrying information about products, accounts, instruments, or something else...).

Again, it comes down to understanding your data and your access patterns. NoSQL systems are also non-relational, and they are better suited to non-relational datasets. If your data is inherently relational and you need SQL-style query features, in particular things like Cartesian products (aka joins), then you may be better off sticking with Oracle and investing some time in indexing, sharding, and performance tuning.

My advice would be to play with several different systems. Take a look:

MongoDB - Document - CP

CouchDB - Document - AP

Cassandra - Column Family - Partition-tolerant and Available (AP)

VoltDB is a really good product: a relational database that is distributed, which may work well for your case (and may be the easier transition). They also appear to offer commercial support, which may be more appropriate for a production deployment (i.e. give the business users a sense of security).

In any case, this is my 2c. Playing with systems is the only way to find out what really works for your business.


HDFS does not look like the right solution. It is optimized for massive, parallel data processing and is not a general-purpose file system. In particular, it has the following limitations that make it a poor fit:
a) It is sensitive to the number of files. The practical limit should be around tens of millions of files.
b) Files are read-only and can be added but not edited. This is normal for processing analytical data, but may not meet your needs.
c) It has a single point of failure - namenode. Therefore, its reliability is limited.
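To put limitation (a) in perspective, here is a back-of-the-envelope calculation using the commonly cited rule of thumb that each file and block object costs the namenode roughly 150 bytes of heap. The figures below (file count, blocks per file, bytes per object) are rough assumptions for illustration, not measurements.

```python
# Sketch: why HDFS is sensitive to file count. Every file and block is
# held in the namenode's heap, at a rule-of-thumb ~150 bytes per object.
BYTES_PER_OBJECT = 150          # commonly cited namenode heap cost
files = 50_000_000              # "tens of millions" of small documents
blocks_per_file = 1             # small files fit in a single block

objects = files * (1 + blocks_per_file)   # one inode + one block per file
heap_gb = objects * BYTES_PER_OBJECT / 1024**3
print(f"~{heap_gb:.1f} GB of namenode heap")
```

That is heap on a single machine just for metadata, before any workload, which is why a document store with many small files hits the wall long before raw disk capacity does.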

If you need a system with comparable scalability that is not sensitive to the number of files, I would suggest OpenStack Swift. It also has no SPOF.


My suggestion is to buy NAS storage; EMC Isilon, for example, is a product you could consider.

Hadoop HDFS is not intended for file storage. It is storage for data processing (reports, analytics, ...).

NAS is for file sharing

SAN is more suitable for a database

http://www.slideshare.net/jabramo/emc-sanoverviewpresentation

Disclaimer: I am not affiliated with EMC, so you can consider any comparable product; I only used EMC as an example.

