Is HBase meaningful if it is not running in a distributed environment?

I am building a data index that will involve storing a large number of triplets of the form (document, term, weight). I will be storing up to several million such rows. Currently I am doing this in MySQL as a simple table, storing the document and term identifiers as string values rather than as foreign keys into other tables. I am rewriting the software and looking for better ways to store the data.

Looking at how HBase works, it seems like a good fit for the schema: instead of storing many triplets, I could map each document to {term => weight}.
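
To make that mapping concrete, here is a minimal sketch in Scala against the older HBase Java client (the pre-1.0 `HTable`/`Put.add` API; newer clients use `Connection` and `Put.addColumn`). The table name `documents` and column family `terms` are assumptions for illustration.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes

// Assumes a table "documents" with column family "terms" already exists.
val conf  = HBaseConfiguration.create()
val table = new HTable(conf, "documents")

// Row key = document id; one column per term, the cell value holds its weight.
val put = new Put(Bytes.toBytes("doc42"))
put.add(Bytes.toBytes("terms"), Bytes.toBytes("hbase"), Bytes.toBytes("0.37"))
put.add(Bytes.toBytes("terms"), Bytes.toBytes("mysql"), Bytes.toBytes("0.12"))
table.put(put)
table.close()
```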

I am doing this on a single node, so I do not need distribution, etc. Should I just stick with MySQL because it works, or would it be wise to try HBase? I see that Lucene uses it for full-text indexing (which is similar to what I am doing). My question is: how does a single HBase node compare with a single MySQL node? I am coming from Scala, so might a direct Java API have an edge over JDBC, with MySQL parsing every request, etc.?

My main concern is insertion speed, as that has been the bottleneck previously. After processing, I will probably put the data back into MySQL for interactive querying, because I need to do some calculations that are better done in MySQL.
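
For the MySQL side of that comparison, insert speed usually comes down to batching. A hedged JDBC sketch follows, with a made-up `triplets` table and credentials; `rewriteBatchedStatements=true` is a Connector/J option that rewrites the batch into multi-row INSERTs:

```scala
import java.sql.DriverManager

// Hypothetical schema: triplets(document VARCHAR, term VARCHAR, weight DOUBLE).
// Requires the MySQL Connector/J driver on the classpath.
val triplets = Seq(("doc42", "hbase", 0.37), ("doc42", "mysql", 0.12))

val conn = DriverManager.getConnection(
  "jdbc:mysql://localhost/index?rewriteBatchedStatements=true", "user", "pass")
conn.setAutoCommit(false) // one commit for the whole batch, not one per row
val stmt = conn.prepareStatement(
  "INSERT INTO triplets (document, term, weight) VALUES (?, ?, ?)")
for ((doc, term, weight) <- triplets) {
  stmt.setString(1, doc)
  stmt.setString(2, term)
  stmt.setDouble(3, weight)
  stmt.addBatch()
}
stmt.executeBatch()
conn.commit()
stmt.close()
conn.close()
```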

I will try to prototype both, but I am sure the community can give me some insight into this.

+4

2 answers

Use the right tool for the job.

There are many anti-RDBMS, or BASE, systems (Basically Available, Soft state, Eventually consistent), as opposed to ACID (Atomicity, Consistency, Isolation, Durability), to choose from here and here.

I have used traditional RDBMSs, and while you can store CLOBs/BLOBs in them, they do not have built-in indexes customized specifically for searching inside those objects.

You want to do most of the work (computing the weighted frequency for each tuple found) when inserting a document.
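
For example, a normalized term frequency can be computed once per document at insert time. A minimal sketch, with deliberately naive tokenization:

```scala
// Returns each term's share of the document's tokens as its weight.
def termWeights(text: String): Map[String, Double] = {
  val terms = text.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq
  val total = terms.size.toDouble
  terms.groupBy(identity).map { case (term, occurrences) =>
    term -> occurrences.size / total
  }
}

// termWeights("HBase or MySQL? HBase!")
//   => Map("hbase" -> 0.5, "or" -> 0.25, "mysql" -> 0.25)
```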

You may also want to do some work computing the usefulness of each (documentId, searchWord) pair after each search.

This way, your searches get better and better each time.

You also want to store a rank or weight for each search, as well as weighted ranks for similarity to other searches.

It is likely that some searches are far more frequent than others, and that users do not phrase their search query correctly even though they intend a common search.

Inserting a document should also trigger some change to the search-index weights.
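
A sketch of one way such a feedback loop might look; the `Feedback` type, the learning rate, and the update rule are all invented for illustration:

```scala
// Hypothetical relevance feedback for a (documentId, searchWord) pair.
case class Feedback(documentId: String, searchWord: String, clicked: Boolean)

// Nudge the stored weight toward 1.0 when the pair proved useful in a search,
// toward 0.0 when it did not (an exponential moving average of feedback).
def updatedWeight(current: Double, fb: Feedback, learningRate: Double = 0.05): Double = {
  val target = if (fb.clicked) 1.0 else 0.0
  current + learningRate * (target - current)
}
```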

The more I think about it, the more complex the solution becomes. You have to start with a good design first; the more factors your design anticipates, the better the outcome.

+1

MapReduce seems like a great way of generating the tuples. If you can get a Scala job into a jar file (I am not sure, since I have not used Scala before and I am a JVM n00b), it would be a simple matter to send it along and write a bit of wrapper to run it on a MapReduce cluster.
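
As a sketch of that wrapper idea: a Hadoop mapper can be written directly in Scala against the Java MapReduce API. This assumes hypothetical input lines of the form `docId<TAB>document text` and emits (term, "docId:weight") pairs:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Mapper

// One call per input line; emits the term as key and "docId:weight" as value,
// so a reducer can collect the postings for each term.
class TupleMapper extends Mapper[LongWritable, Text, Text, Text] {
  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, Text]#Context): Unit = {
    value.toString.split("\t", 2) match {
      case Array(docId, body) =>
        val terms = body.toLowerCase.split("\\W+").filter(_.nonEmpty)
        val total = terms.length.toDouble
        terms.groupBy(identity).foreach { case (term, occurrences) =>
          context.write(new Text(term), new Text(s"$docId:${occurrences.length / total}"))
        }
      case _ => // skip malformed lines
    }
  }
}
```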

As far as storing the tuples after you are done, you might also want to consider a document-oriented database like MongoDB, if you are just storing tuples.
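
A minimal sketch with the legacy MongoDB Java driver (the 2.x `MongoClient`/`BasicDBObject` API), assuming a local mongod and made-up database and collection names, one BSON document per tuple:

```scala
import com.mongodb.{BasicDBObject, MongoClient}

val mongo = new MongoClient("localhost")
val coll  = mongo.getDB("index").getCollection("tuples")

// One document per (document, term, weight) triplet; no schema to declare.
coll.insert(new BasicDBObject("document", "doc42")
  .append("term", "hbase")
  .append("weight", 0.37))
mongo.close()
```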

All in all, it sounds like you are doing something more statistical with the text... Have you considered simply using Lucene or Solr to do what you are doing, instead of rolling your own?
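
For reference, the indexing side of Lucene is quite small. A sketch against Lucene 5+ (where `IndexWriterConfig` takes only an analyzer), with a made-up index path; Lucene computes and stores the term statistics itself at index time:

```scala
import java.nio.file.Paths
import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.document.{Document, Field, StringField, TextField}
import org.apache.lucene.index.{IndexWriter, IndexWriterConfig}
import org.apache.lucene.store.FSDirectory

val dir    = FSDirectory.open(Paths.get("/tmp/index"))
val writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))

// StringField: stored, not tokenized (the id); TextField: tokenized and
// indexed with term frequencies, which replaces the hand-rolled triplets.
val doc = new Document()
doc.add(new StringField("id", "doc42", Field.Store.YES))
doc.add(new TextField("body", "the full text of the document", Field.Store.NO))
writer.addDocument(doc)
writer.close()
```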

+1
