Recommended Document Structure for CouchDB

We are currently considering a change from Postgres to CouchDB for a usage monitoring application. Some numbers:

About 2,000 connections polled every 5 minutes, for roughly 600,000 new rows per day. In Postgres, we store this data partitioned by day:

t_usage {service_id, timestamp, data_in, data_out}
t_usage_20100101 inherits t_usage.
t_usage_20100102 inherits t_usage, etc.

We write data with an optimized stored procedure that assumes the partition exists, creating it if necessary. Inserts are very fast.

For reading data, our use cases, in order of importance and current performance, are:
* Single service, single day of usage: good performance
* Multiple services, one month of usage: poor performance
* Single service, one month of usage: poor performance
* Multiple services, multiple months: very poor performance
* Multiple services, single day: good performance

This makes sense because the partitions are optimized for days, which is by far our most important use case. However, we are looking at ways to improve the secondary use cases.

We also often need to parameterize the query by hour, for example only including results between 08:00 and 18:00, so summary (rollup) tables are of limited use. (These parameters change frequently enough that building multiple rollup tables of the data is prohibitive.)

With that background, the first question is: is CouchDB suitable for this data? If it is, given the use cases above, how would you best model it as CouchDB documents? Some options I have put together so far, which we are in the process of benchmarking (_id and _rev excluded):

One document per connection per day

{
  "service_id": 555,
  "day": 20100101,
  "usage": {
    "1265248762": { "in": 584, "out": 11342 },
    "1265249062": { "in": 94, "out": 1242 }
  }
}

Approximately 60,000 new documents per month. Most writes would be updates to existing documents rather than new documents.

(Here the usage entries are keyed on the timestamp of the poll, and the values are bytes in/out.)
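(As an illustration of how I imagine reading this back, not a settled design: a map function over this layout could emit a composite key, so that a month of data for one service becomes a simple startkey/endkey range. The shape below is just my sketch.)

// Hypothetical map function for the per-day layout. Emitting a
// [service_id, day, poll_timestamp] key means one service's month is a
// single key range, and several services are a handful of such ranges.
function (doc) {
  if (doc.service_id && doc.day && doc.usage) {
    for (var ts in doc.usage) {
      emit([doc.service_id, doc.day, ts], doc.usage[ts]);
    }
  }
}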

One document per connection per month

{
  "service_id": 555,
  "month": 201001,
  "usage": {
    "1265248762": { "in": 584, "out": 11342 },
    "1265249062": { "in": 94, "out": 1242 }
  }
}

Approximately 2,000 new documents per month; existing documents require moderate updating.
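(Whatever layout we pick, I assume the actual totals would come out of a reduce along these lines, assuming the map emits { in, out } values as in the sketch above. Again, just my sketch.)

// Hypothetical reduce function: sums bytes in and out over whatever key range
// or group_level the query asks for. On rereduce the values are partial totals
// of the same shape, so the same loop applies.
function (keys, values, rereduce) {
  var totals = { "in": 0, "out": 0 };
  for (var i = 0; i < values.length; i++) {
    totals["in"] += values[i]["in"];
    totals["out"] += values[i]["out"];
  }
  return totals;
}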

One document per row of data collected

{ "service_id": 555, "timestamp": 1265248762, "in": 584, "out": 11342 }
{ "service_id": 555, "timestamp": 1265249062, "in": 94, "out": 1242 }

Approximately 15,000,000 new documents per month. Every data point is inserted as a new document. Inserts would be faster, but I have questions about how efficient this will be after a year or two with hundreds of millions of documents. The file I/O seems like it would be unacceptable (though I am the first to admit I do not fully understand the mechanics of it).
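(Sketching the same idea for this layout: the view has to derive day and hour itself. Keying on the hour lets a single day be cut down to 08:00-18:00 with a key range, though a whole month filtered by hour would still mean one range query per day, or a differently ordered key. Names below are illustrative.)

// Hypothetical map function for the one-document-per-poll layout.
function (doc) {
  if (doc.service_id && doc.timestamp) {
    var d = new Date(doc.timestamp * 1000);   // poll timestamps are unix seconds
    emit([doc.service_id, d.getUTCFullYear(), d.getUTCMonth() + 1,
          d.getUTCDate(), d.getUTCHours()],
         { "in": doc["in"], "out": doc["out"] });
  }
}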

I am trying to approach this in a document-oriented way, though breaking the RDBMS habit is difficult :) The fact that views can only be parameterized minimally also worries me a little. That said, which of the above would be most appropriate? Are there other structures I have not considered that would work better?
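(As far as I understand it, the parameterization views do offer is essentially key ranges plus grouping on the reduce. Assuming a design document and view named as in my sketches above, a month of one service would be fetched with something like the request below, so the composite keys end up doing most of the work.)

GET /usage/_design/usage/_view/by_service_day?startkey=[555,20100101]&endkey=[555,20100131,{}]&group_level=2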

Thanks in advance,

Jamie

couchdb data-modeling
1 answer

I do not think this is a terrible idea.

Let's take a look at your Connection/Month scenario.

Given that each record is ~40 bytes (and that's generous), and you get ~8,200 records per month, your final document size will be ~350K by the end of the month.

This means you will be reading and writing 2,000 of these ~350K documents every 5 minutes.

I/O-wise, that is less than 6 MB/s of reads plus writes averaged over the 5-minute window. That is fine today even on low-end hardware.
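(Spelling out that arithmetic, roughly; these are estimates, not measurements:)

// Back-of-envelope check of the figures above (approximate, not measured).
var recordBytes  = 40;                                   // generous size of one poll record
var recsPerMonth = 8200;                                 // ~288 polls/day over a month
var docKB        = recordBytes * recsPerMonth / 1024;    // ~320K, call it ~350K
var cycleMB      = 2000 * 350 / 1024;                    // ~700 MB touched per 5-minute cycle
var mbPerSec     = 2 * cycleMB / 300;                    // read + write: ~4.6 MB/s, under 6 MB/s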

However, there is another consideration. When you store one of these documents, Couch is going to evaluate its contents in order to build its views, so it will be parsing these 350K documents as they come in. My fear is that (last I checked, but it has been a while) Couch does not scale well across CPU cores, so this could easily pin the single CPU core that Couch will be using. I would hope Couch can read, parse, and process 2 MB/s, but I frankly don't know. For all its benefits, Erlang is not the fastest language when it comes to raw straight-line computation.

The final concern is keeping up with the database. At the end of the month you will be writing 700 MB every 5 minutes. With Couch's append-only architecture, that is 700 MB of new data every 5 minutes, which is over 8 GB per hour and roughly 200 GB after 24 hours.

After compaction, the database crunches back down to ~700 MB (for one month of data), but while that is going on the file will grow quite large, quite quickly.
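(Same sort of rough math for the append-only growth, again approximate:)

// Append-only growth at end-of-month document sizes (rough, before compaction).
var mbPerCycle = 700;                      // written per 5-minute cycle, from above
var gbPerHour  = mbPerCycle * 12 / 1024;   // ~8.2 GB of appended data per hour
var gbPerDay   = gbPerHour * 24;           // ~200 GB per day until compaction runs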

On the retrieval side, these large documents do not scare me. Loading a 350K JSON document is big, yes, but it is not that big, not on modern hardware. Avatars on bulletin boards are bigger than that. So anything you want to do regarding a month of activity for one connection will be pretty fast, I think. Across connections, obviously, the more you grab, the more expensive it gets (700 MB for all 2,000 connections). 700 MB is a real number that has real impact. On top of that, your process will need to be aggressive about discarding the data you do not care about, so it can throw away the chaff (unless you want to load 700 MB of heap into your reporting process).

Given these numbers, Connection/Day may be the best bet, since you can control the granularity a bit better. Frankly, though, I would go for the coarsest document you can, because I think that gives you the best value from the database: today it is head seeks and platter rotations that kill much of your I/O performance, while many drives stream data very well. Larger documents (assuming well-located data, and since Couch is regularly compacted this should not be a problem) stream more than they seek. Seeking in memory is "free" compared to seeking on disk.

Be sure to run your own tests on your own hardware, but take all of these considerations to heart.

EDIT:

After several experiments ...

Some interesting observations.

When importing large documents, CPU matters as much as I/O speed. This is due to the amount of marshalling and CPU consumed converting the JSON into the internal model for use by the views. With the large (350K) documents, my CPUs were heavily loaded (around 350%). In contrast, with the smaller documents they hummed along at about 200%, even though overall it was the same information, just chunked up differently.

For I/O, with the 350K documents I was getting 11 MB/s, but with the smaller documents it was only 8 MB/s.

Compaction turned out to be nearly I/O bound. It is hard for me to get good numbers on my I/O potential: a copy of a cached file pushes 40+ MB/s, while compaction ran at about 8 MB/s. But that is consistent with the raw loading (assuming Couch is moving things message by message). The CPU usage is lower because it is doing less processing (it is not interpreting the JSON payloads or rebuilding the views), plus it was a single CPU doing the work.

Finally, for reading, I tried dumping out the entire database. A single CPU was pegged for this, and my I/O was pretty low. I made a point of ensuring the CouchDB file was not actually cached; my machine has a lot of memory, so a lot of things end up cached. The raw dump through _all_docs was only about 1 MB/s, which is almost all seek and rotational delay rather than anything else. When I did the same with the large documents, the I/O hit 3 MB/s, which simply shows the streaming effect I mentioned as a benefit of larger documents.

It should also be noted that there are performance techniques mentioned on the Couch website that I did not follow; notably, I was using random document ids. Also, this was not done as a measure of Couch's absolute performance, but rather of where the load seems to end up. I found the large-vs-small document differences interesting.
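(For reference, the main tip I skipped was using roughly sequential rather than random ids, which keeps newly written documents clustered together in the id b-tree and the file smaller. Something as simple as a timestamp-prefixed id, sketched below with made-up names, would do.)

// Hypothetical monotonically increasing _id: zero-padded poll timestamp first,
// so documents written around the same time sort (and append) together.
function makeId(serviceId, pollTimestamp) {
  return ("0000000000" + pollTimestamp).slice(-10) + "-" + serviceId;
}
// makeId(555, 1265248762) -> "1265248762-555"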

Lastly, ultimate performance is not as important as simply performing well enough for your application on your hardware. As you mentioned, you are doing your own testing, and that is what really matters.

