Convert legacy EAV schema to Mongo or Couch

Let's say I have a legacy application which, for various reasons, the previous developers decided needed an arbitrarily flexible schema, so they reinvented the Entity-Attribute-Value model. In effect they were trying to build a document store, a job for which tools like Mongo or Couch would be a much better fit today, but which either weren't available or weren't known to the earlier teams.
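
To make the shape of the problem concrete, here is a minimal sketch (the table layout and field names are hypothetical, not our actual schema): each logical entity is scattered across many attribute/value rows, while a document store would hold the same data as a single document.

```python
# Hypothetical EAV rows, as (entity_id, attribute, value) tuples,
# the way a MySQL table like eav(entity_id, attribute, value) might store them.
eav_rows = [
    (42, "type",   "invoice"),
    (42, "number", "INV-2009-001"),
    (42, "total",  "149.95"),
    (42, "status", "paid"),
]

# The same entity expressed as a single document for Mongo or Couch.
invoice_doc = {
    "_id": "42",
    "type": "invoice",
    "number": "INV-2009-001",
    "total": "149.95",
    "status": "paid",
}
```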

To stay competitive, let's say we need more powerful ways to query and analyze the information in our system. Given the sheer number and variety of attributes, map/reduce seems like a better fit for our set of problems than gradually refactoring the system into a more relational schema.
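
To make that goal concrete, here is a minimal sketch of the kind of analysis I'd like to run once the data is document-shaped. It assumes the documents have already landed in a hypothetical Mongo collection (all connection details and names are placeholders), and it uses MongoDB's aggregation pipeline as a stand-in for a hand-written map/reduce job to count how many documents carry each attribute.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical connection string
docs = client["legacy_app"]["documents"]           # hypothetical database/collection

# Count how many documents carry each attribute, across all document types.
# $objectToArray turns each document into key/value pairs so the keys can be grouped.
pipeline = [
    {"$project": {"kv": {"$objectToArray": "$$ROOT"}}},
    {"$unwind": "$kv"},
    {"$group": {"_id": "$kv.k", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]
for row in docs.aggregate(pipeline):
    print(row["_id"], row["count"])
```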

The source database contains millions of documents, but only a small number of distinct document types, and the different types share some common attributes.

What is an effective strategy for moving from a massive EAV implementation, say in MySQL, to a document-oriented store like Mongo or Couch?
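
To show what I mean by a strategy, here is a minimal sketch of one obvious approach, assuming a hypothetical eav(entity_id, attribute, value) table and placeholder connection details: stream the triples ordered by entity, pivot each group into a document, and bulk-insert into Mongo in batches. I'm not claiming this is the right strategy, which is exactly why I'm asking.

```python
from itertools import groupby
from operator import itemgetter

import pymysql
from pymongo import MongoClient

# Hypothetical connection details and table name; adjust to the real schema.
mysql_conn = pymysql.connect(host="localhost", user="app",
                             password="secret", database="legacy")
documents = MongoClient("mongodb://localhost:27017")["legacy_app"]["documents"]

BATCH_SIZE = 1000

with mysql_conn.cursor(pymysql.cursors.SSCursor) as cursor:
    # Stream the EAV triples ordered by entity so each entity's rows arrive together.
    cursor.execute("SELECT entity_id, attribute, value FROM eav ORDER BY entity_id")
    batch = []
    for entity_id, rows in groupby(cursor, key=itemgetter(0)):
        # Pivot one entity's attribute/value rows into a single document.
        doc = {"_id": entity_id}
        for _, attribute, value in rows:
            doc[attribute] = value
        batch.append(doc)
        if len(batch) >= BATCH_SIZE:
            documents.insert_many(batch)
            batch = []
    if batch:
        documents.insert_many(batch)
```

One obvious wrinkle: if an attribute can have multiple values per entity, the pivot step would need to collect them into a list rather than overwrite.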

I can certainly imagine an approach to attacking this, but I'd really like to hear a case study or war story from someone who has already tackled this kind of problem.

What conversion strategies worked well? What lessons did you learn? What pitfalls should be avoided? How did you deal with legacy applications that still expect to interact with the existing database?

1 answer

My first use of Couch came after I had written web spiders in Ruby and Postgres (targeted crawling of mp3 blogs to build a recommendation engine).

The limits of the relational schema became painfully apparent when I tried to record ID3 metadata, audio descriptions, and so on, as well as detect overlaps and otherwise deduplicate. It worked, but it was slow. So slow that I started caching my JSON API strings as blob fields on the corresponding primary ActiveRecord objects.

I had a choice: dig in and learn Postgres performance tuning, or move to a horizontal approach. So I used Nutch and Hadoop to spider the web, with PipeMapper driving the page parsing in Ruby/Hpricot. That way I was able to reuse all my parser code, simply changing it from saving to a normalized database to saving JSON. I wrote a small library for working with JSON over REST endpoints, called CouchRest, which I used to store the Hpricot results in CouchDB.

For that project I just ran Couch on a single EC2 node, with a small 6-node Hadoop cluster populating it. It was only when I got around to building a viewing interface for the spidered data that I really got a feel for the query capabilities.

It turned out to be flexible, and especially well suited to OLTP applications, so I quickly started using it on all my projects, and eventually founded a technology company with two of its creators.
