Recommended Document Structure for CouchDB

We are currently considering a change from Postgres to CouchDB for a usage monitoring application. Some numbers:

About 2,000 connections polled every 5 minutes, for roughly 600,000 new rows per day. In Postgres, we store this data partitioned by day:

t_usage {service_id, timestamp, data_in, data_out}
t_usage_20100101 inherits t_usage.
t_usage_20100102 inherits t_usage, etc.

We write data with an optimized stored procedure that assumes the partition exists, creating it if necessary. Inserts are very fast.

For reading data, our use cases, in order of importance and current performance, are:
* Single service, single day of usage: good performance
* Multiple services, one month of usage: poor performance
* Single service, one month of usage: poor performance
* Multiple services, multiple months: very poor performance
* Multiple services, single day: good performance

This makes sense because the partitions are optimized for days, which is by far our most important use case. However, we are looking at ways to improve the secondary use cases.

We also often need to parameterize the query by hour, for example only including results between 08:00 and 18:00, so summary (rollup) tables are of limited use. (These parameters change frequently enough that building multiple rollup tables of the data is prohibitive.)

With that background, the first question is: is CouchDB suitable for this data? If it is, given the use cases above, how would you best model it as CouchDB documents? Some options I have put together so far, which we are in the process of benchmarking (_id and _rev excluded):

One document per connection per day

{
  "service_id": 555,
  "day": 20100101,
  "usage": {
    "1265248762": { "in": 584, "out": 11342 },
    "1265249062": { "in": 94, "out": 1242 }
  }
}

Approximately 60,000 new documents per month. Most writes would be updates to existing documents rather than new documents.

(Here the usage entries are keyed on the timestamp of the poll, and the values are bytes in/out.)
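(As an illustration of how I imagine reading this back, not a settled design: a map function over this layout could emit a composite key, so that a month of data for one service becomes a simple startkey/endkey range. The shape below is just my sketch.)

// Hypothetical map function for the per-day layout. Emitting a
// [service_id, day, poll_timestamp] key means one service's month is a
// single key range, and several services are a handful of such ranges.
function (doc) {
  if (doc.service_id && doc.day && doc.usage) {
    for (var ts in doc.usage) {
      emit([doc.service_id, doc.day, ts], doc.usage[ts]);
    }
  }
}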

One document per connection per month

{
  "service_id": 555,
  "month": 201001,
  "usage": {
    "1265248762": { "in": 584, "out": 11342 },
    "1265249062": { "in": 94, "out": 1242 }
  }
}

Approximately 2,000 new documents per month; existing documents require moderate updating.
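(Whatever layout we pick, I assume the actual totals would come out of a reduce along these lines, assuming the map emits { in, out } values as in the sketch above. Again, just my sketch.)

// Hypothetical reduce function: sums bytes in and out over whatever key range
// or group_level the query asks for. On rereduce the values are partial totals
// of the same shape, so the same loop applies.
function (keys, values, rereduce) {
  var totals = { "in": 0, "out": 0 };
  for (var i = 0; i < values.length; i++) {
    totals["in"] += values[i]["in"];
    totals["out"] += values[i]["out"];
  }
  return totals;
}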

One document per row of data collected

{ "service_id": 555, "timestamp": 1265248762, "in": 584, "out": 11342 }
{ "service_id": 555, "timestamp": 1265249062, "in": 94, "out": 1242 }

Approximately 15,000,000 new documents per month. Every data point is inserted as a new document. Inserts would be faster, but I have questions about how efficient this will be after a year or two with hundreds of millions of documents. The file I/O seems like it would be unacceptable (though I am the first to admit I do not fully understand the mechanics of it).
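(Sketching the same idea for this layout: the view has to derive day and hour itself. Keying on the hour lets a single day be cut down to 08:00-18:00 with a key range, though a whole month filtered by hour would still mean one range query per day, or a differently ordered key. Names below are illustrative.)

// Hypothetical map function for the one-document-per-poll layout.
function (doc) {
  if (doc.service_id && doc.timestamp) {
    var d = new Date(doc.timestamp * 1000);   // poll timestamps are unix seconds
    emit([doc.service_id, d.getUTCFullYear(), d.getUTCMonth() + 1,
          d.getUTCDate(), d.getUTCHours()],
         { "in": doc["in"], "out": doc["out"] });
  }
}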

I am trying to approach this in a document-oriented way, though breaking the RDBMS habit is difficult :) The fact that views can only be parameterized minimally also worries me a little. That said, which of the above would be most appropriate? Are there other structures I have not considered that would work better?
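(As far as I understand it, the parameterization views do offer is essentially key ranges plus grouping on the reduce. Assuming a design document and view named as in my sketches above, a month of one service would be fetched with something like the request below, so the composite keys end up doing most of the work.)

GET /usage/_design/usage/_view/by_service_day?startkey=[555,20100101]&endkey=[555,20100131,{}]&group_level=2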

Thanks in advance,

Jamie

couchdb data-modeling
1 answer

I do not think this is a terrible idea.

Let's take a look at your Connection/Month scenario.

Given that each record is ~40 bytes (and that's generous), and you get ~8,200 records per month, your final document size will be ~350K by the end of the month.

This means you will be reading and writing 2,000 of these ~350K documents every 5 minutes.

I/O-wise, that is less than 6 MB/s of reads plus writes averaged over the 5-minute window. That is fine today even on low-end hardware.
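(Spelling out that arithmetic, roughly; these are estimates, not measurements:)

// Back-of-envelope check of the figures above (approximate, not measured).
var recordBytes  = 40;                                   // generous size of one poll record
var recsPerMonth = 8200;                                 // ~288 polls/day over a month
var docKB        = recordBytes * recsPerMonth / 1024;    // ~320K, call it ~350K
var cycleMB      = 2000 * 350 / 1024;                    // ~700 MB touched per 5-minute cycle
var mbPerSec     = 2 * cycleMB / 300;                    // read + write: ~4.6 MB/s, under 6 MB/s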

However, there is another consideration. When you store one of these documents, Couch is going to evaluate its contents in order to build its views, so it will be parsing these 350K documents as they come in. My fear is that (last I checked, but it has been a while) Couch does not scale well across CPU cores, so this could easily pin the single CPU core that Couch will be using. I would hope Couch can read, parse, and process 2 MB/s, but I frankly don't know. For all its benefits, Erlang is not the fastest language when it comes to raw straight-line computation.

The final concern is keeping up with the database. At the end of the month you will be writing 700 MB every 5 minutes. With Couch's append-only architecture, that is 700 MB of new data every 5 minutes, which is over 8 GB per hour and roughly 200 GB after 24 hours.

After compaction, the database crunches back down to ~700 MB (for one month of data), but while that is going on the file will grow quite large, quite quickly.
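(Same sort of rough math for the append-only growth, again approximate:)

// Append-only growth at end-of-month document sizes (rough, before compaction).
var mbPerCycle = 700;                      // written per 5-minute cycle, from above
var gbPerHour  = mbPerCycle * 12 / 1024;   // ~8.2 GB of appended data per hour
var gbPerDay   = gbPerHour * 24;           // ~200 GB per day until compaction runs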

On the retrieval side, these large documents do not scare me. Loading a 350K JSON document is big, yes, but it is not that big, not on modern hardware. Avatars on bulletin boards are bigger than that. So anything you want to do regarding a month of activity for one connection will be pretty fast, I think. Across connections, obviously, the more you grab, the more expensive it gets (700 MB for all 2,000 connections). 700 MB is a real number that has real impact. On top of that, your process will need to be aggressive about discarding the data you do not care about, so it can throw away the chaff (unless you want to load 700 MB of heap into your reporting process).

Given these numbers, Connection/Day may be the best bet, since you can control the granularity a bit better. Frankly, though, I would go for the coarsest document you can, because I think that gives you the best value from the database: today it is head seeks and platter rotations that kill much of your I/O performance, while many drives stream data very well. Larger documents (assuming well-located data, and since Couch is regularly compacted this should not be a problem) stream more than they seek. Seeking in memory is "free" compared to seeking on disk.

Be sure to run your own tests on your own hardware, but take all of these considerations to heart.

EDIT:

After several experiments ...

Some interesting observations.

When importing large documents, CPU matters as much as I/O speed. This is due to the amount of marshalling and CPU consumed converting the JSON into the internal model for use by the views. With the large (350K) documents, my CPUs were heavily loaded (around 350%). In contrast, with the smaller documents they hummed along at about 200%, even though overall it was the same information, just chunked up differently.

For I/O, with the 350K documents I was getting 11 MB/s, but with the smaller documents it was only 8 MB/s.

Compaction turned out to be nearly I/O bound. It is hard for me to get good numbers on my I/O potential: a copy of a cached file pushes 40+ MB/s, while compaction ran at about 8 MB/s. But that is consistent with the raw loading (assuming Couch is moving things message by message). The CPU usage is lower because it is doing less processing (it is not interpreting the JSON payloads or rebuilding the views), plus it was a single CPU doing the work.

Finally, for reading, I tried dumping out the entire database. A single CPU was pegged for this, and my I/O was pretty low. I made a point of ensuring the CouchDB file was not actually cached; my machine has a lot of memory, so a lot of things end up cached. The raw dump through _all_docs was only about 1 MB/s, which is almost all seek and rotational delay rather than anything else. When I did the same with the large documents, the I/O hit 3 MB/s, which simply shows the streaming effect I mentioned as a benefit of larger documents.

It should also be noted that there are performance techniques mentioned on the Couch website that I did not follow; notably, I was using random document ids. Also, this was not done as a measure of Couch's absolute performance, but rather of where the load seems to end up. I found the large-vs-small document differences interesting.
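(For reference, the main tip I skipped was using roughly sequential rather than random ids, which keeps newly written documents clustered together in the id b-tree and the file smaller. Something as simple as a timestamp-prefixed id, sketched below with made-up names, would do.)

// Hypothetical monotonically increasing _id: zero-padded poll timestamp first,
// so documents written around the same time sort (and append) together.
function makeId(serviceId, pollTimestamp) {
  return ("0000000000" + pollTimestamp).slice(-10) + "-" + serviceId;
}
// makeId(555, 1265248762) -> "1265248762-555"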

Lastly, ultimate performance is not as important as simply performing well enough for your application on your hardware. As you mentioned, you are doing your own testing, and that is what really matters.

