How to store 7.3 billion rows of market data (optimized for reading)?

I have a dataset of 1-minute data for 1000 stocks since 1998, which totals about (2012-1998)*(365*24*60)*1000 = 7.3 billion rows.

Most (99.9%) of the time I will only perform read queries.

What is the best way to store this data in a database?

  • 1 large table with 7.3 billion rows?
  • 1000 tables (one for each stock symbol) with 7.3 million rows each?
  • any recommendation of a database engine? (I plan to use Amazon RDS MySQL)

I'm not used to dealing with datasets this big, so this is a great opportunity for me to learn. I will be very grateful for your help and advice.

Edit:

This is an example line:

'XX', 20041208, 938, 43.7444, 43.7541, 43.735, 43.7444, 35116.7, 1, 0, 0

Column 1 is the stock symbol, column 2 is the date, column 3 is the minute; the rest are the open-high-low-close prices, the volume, and 3 integer columns.

Most queries will be like "Give me AAPL prices between April 12, 2012 12:15 and April 13, 2012 12:52"

About hardware: I plan to use Amazon RDS, so I'm flexible about this

+68
database
Mar 22 '12 at 1:26
13 answers

Tell us about the queries and your hardware environment.

I would be very tempted to go NoSQL, using Hadoop or something similar, as long as you can take advantage of parallelism.

Update

Ok why?

First of all, note that I asked about the queries. You can't (and we certainly can't) answer these questions without knowing what the workload is like. (Coincidentally, I will have an article about this appearing soon, but I can't link it today.) But the scale of the problem makes me think about moving away from a Big Old Database, because:

  • My experience with similar systems suggests that access will be either big and sequential (computing some kind of time-series analysis) or very flexible data mining (OLAP). Sequential data can be handled better and faster sequentially; OLAP means computing lots and lots of indices, which will take either a lot of time or a lot of space.

  • If you are doing things that work effectively against a lot of data in an OLAP world, however, a column-oriented approach might be best.

  • If you want to make random queries, especially cross-comparisons, a Hadoop system may be effective. Why? Because

    • you can better exploit parallelism on relatively small commodity hardware.
    • you can also better implement high reliability and redundancy.
    • many of these problems naturally lend themselves to the MapReduce paradigm.

But the fact is, until we know about your workload, it is impossible to say anything definitive.

+23
Mar 22 '12 at 1:30

Databases are for situations where you have a large, complicated schema that is constantly changing. You have only one “table” with a handful of simple numeric fields. I would do it this way:

Write a C/C++ struct to hold the record format:

    struct StockPrice
    {
        char ticker_code[2];
        double stock_price;
        timespec when;
        // etc.
    };

Then calculate sizeof(StockPrice[N]), where N is the number of records. (On a 64-bit system.) It should only be a few hundred gigabytes, and it will fit on a $50 HDD.
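(As a rough check, assuming the struct above pads out to 32 bytes on a typical 64-bit system - 2 bytes of ticker plus padding, an 8-byte double, and a 16-byte timespec - that is 7.3 billion × 32 bytes ≈ 234 GB, i.e. indeed a few hundred gigabytes.)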

Then truncate a file to that size and mmap it into memory (on Linux; use CreateFileMapping on Windows):

    // pseudo-code
    file = open("my.data", WRITE_ONLY);
    truncate(file, sizeof(StockPrice[N]));
    void* p = mmap(file, WRITE_ONLY);

Cast the mmapped pointer to StockPrice* and make a pass over your data, filling out the array. Close the mmap, and now your data sits in one big binary array in a file that can be mmapped again later.

    StockPrice* stocks = (StockPrice*) p;
    for (size_t i = 0; i < N; i++) {
        stocks[i] = ParseNextStock(stock_indata_file);
    }
    close(file);

Now you can mmap it again read-only from any program, and your data will be readily available:

    file = open("my.data", READ_ONLY);
    StockPrice* stocks = (StockPrice*) mmap(file, READ_ONLY);
    // do stuff with stocks;

Now you can treat it just like an in-memory array of structs. You can create various kinds of index data structures depending on what your "queries" are. The kernel will handle swapping the data to/from disk transparently, so it is insanely fast.

If you expect a particular access pattern (for example, contiguous date ranges), it is best to sort the array in that order so it will hit the disk sequentially.
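Here is a minimal sketch of one such lookup (not part of the original answer): a range query over the mmapped array, assuming it is sorted by timestamp and reusing the StockPrice struct above; find_range and earlier are illustrative names.

    #include <algorithm>
    #include <cstddef>
    #include <ctime>
    #include <utility>

    // Comparator on the record timestamp (minute data, so seconds are enough).
    static bool earlier(const StockPrice& a, const StockPrice& b)
    {
        return a.when.tv_sec < b.when.tv_sec;
    }

    // Returns [first, last) pointers covering all records with
    // from <= when < to, assuming the mmapped array is sorted by `when`.
    std::pair<const StockPrice*, const StockPrice*>
    find_range(const StockPrice* stocks, std::size_t n, time_t from, time_t to)
    {
        StockPrice lo{}; lo.when.tv_sec = from;
        StockPrice hi{}; hi.when.tv_sec = to;
        const StockPrice* first = std::lower_bound(stocks, stocks + n, lo, earlier);
        const StockPrice* last  = std::lower_bound(stocks, stocks + n, hi, earlier);
        return { first, last };
    }

Per-ticker files or a small in-memory index of day offsets layer on top of the same idea.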

+35
Mar 31 '12 at 4:27

As far as I understand, HDF5 was designed with time-series storage of stock data as one potential application. Fellow Stackers have demonstrated that HDF5 is good for large amounts of data: chromosomes , physics .
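For a flavour of what that looks like, here is a minimal HDF5 C API sketch (not part of the original answer; the file name, dataset layout and session length are assumptions, and a real layout would likely add chunking and compression) that writes one day of 1-minute closing prices for one symbol:

    #include <hdf5.h>

    #define MINUTES_PER_DAY 390   /* assumed 6.5-hour trading session */

    int main(void)
    {
        double close_px[MINUTES_PER_DAY] = {0};   /* fill from your data feed */
        hsize_t dims[1] = { MINUTES_PER_DAY };

        hid_t file  = H5Fcreate("AAPL_20120412.h5", H5F_ACC_TRUNC,
                                H5P_DEFAULT, H5P_DEFAULT);
        hid_t space = H5Screate_simple(1, dims, NULL);
        hid_t dset  = H5Dcreate2(file, "/close", H5T_NATIVE_DOUBLE, space,
                                 H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, close_px);

        H5Dclose(dset);
        H5Sclose(space);
        H5Fclose(file);
        return 0;
    }

Compile and link against the HDF5 C library (for example with -lhdf5); the same file can then be read by any tool that speaks HDF5.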

+14
Mar 22 '12 at 1:37

Okay, so this is somewhat away from the other answers, but... it seems to me that if you have the data in a file system (perhaps one stock per file) with a fixed record size, you can get at the data really easily: given a query for a particular stock and time range, you can seek to the right place, fetch all the data you need (you will know exactly how many bytes), transform the data into the format you need (which could be very quick depending on your storage format), and you're away.

I don't know anything about Amazon storage, but if you don't have anything like direct file access, you could basically have blobs - you would need to balance large blobs (fewer records, but probably reading more data than you need each time) against small blobs (more records giving more overhead and probably more requests to get at them, but less useless data returned each time).

Then you add caching - I would suggest giving different servers different stocks to handle, for example - and you can pretty much just serve from memory. If you can afford enough memory on enough servers, bypass the load-on-demand part and just load all the files at startup. That would simplify things, at the cost of slower startup (which obviously impacts failover, unless you can afford to always have two servers for any particular stock, which would be helpful).

Note that you do not need to store the stock symbol, date or minute for each record, because they are implicit in the file you are loading and the position within the file. You should also consider what accuracy you need for each value and how to store that efficiently - you have given 6 significant figures in your question, which you could store in 20 bits. Potentially store three 20-bit integers in 64 bits of storage: read it as a long (or whatever your 64-bit integer type is) and use masking/shifting to get it back to three integers. You will need to know what scale to use, of course - which you could probably encode in the spare 4 bits if you cannot make it constant.
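As an illustration of the masking/shifting idea (a sketch, not part of the original answer; the scale encoding in the spare 4 bits is left out):

    #include <cstdint>

    // Pack three 20-bit fields (prices scaled to integers in 0..1048575)
    // into one 64-bit word; the top 4 bits stay free for a scale code.
    inline uint64_t pack3(uint32_t a, uint32_t b, uint32_t c)
    {
        return (uint64_t(a) & 0xFFFFF)
             | ((uint64_t(b) & 0xFFFFF) << 20)
             | ((uint64_t(c) & 0xFFFFF) << 40);
    }

    inline void unpack3(uint64_t w, uint32_t& a, uint32_t& b, uint32_t& c)
    {
        a = uint32_t(w & 0xFFFFF);
        b = uint32_t((w >> 20) & 0xFFFFF);
        c = uint32_t((w >> 40) & 0xFFFFF);
    }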

You did not say what the other three integer columns are, but if you could get away with 64 bits for those three as well, you could store a whole record in 16 bytes. That is only ~110 GB for the whole database, which is not really very much...

EDIT: Another thing to consider is that presumably stocks do not change over the weekend - or indeed overnight. If the stock market is only open 8 hours per day, 5 days per week, then you only need 40 values per week instead of 168. At that point you could end up with only about 28 GB of data in your files... which sounds a lot smaller than you probably originally thought. Having that much data in memory is very reasonable.
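(As a rough check, assuming the 16-byte packed record mentioned above: 1000 stocks × 14 years × 52 weeks × 40 hours × 60 minutes × 16 bytes ≈ 28 GB.)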

EDIT: I think I missed out the explanation of why this approach is a good fit here: you have a very predictable aspect for a significant part of your data - the stock ticker, date and time. By expressing the ticker once (as the filename) and leaving the date/time entirely implicit in the position of the data, you remove a whole bunch of work. It is a bit like the difference between a String[] and a Map<Integer, String> - knowing that your array index always starts at 0 and goes up in increments of 1 to the length of the array allows for quicker access and more efficient storage.
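To make the implicit-position idea concrete, here is a sketch (not part of the original answer) of turning a trading day and minute into a byte offset, assuming one file per ticker, the 16-byte packed records from above, and a fixed number of trading minutes per day; the constants and names are illustrative:

    #include <cstdint>

    constexpr uint64_t kMinutesPerDay = 390;  // assumed 6.5-hour trading session
    constexpr uint64_t kRecordSize    = 16;   // packed record size from above

    // Byte offset of a record inside a per-ticker file, given the trading-day
    // index (0 = first day in the file, weekends/holidays skipped) and the
    // minute within the session (0 = first minute after the open).
    constexpr uint64_t record_offset(uint64_t trading_day, uint64_t minute_of_session)
    {
        return (trading_day * kMinutesPerDay + minute_of_session) * kRecordSize;
    }

    // Example: the 136th minute of the 10th trading day in the file starts
    // at byte record_offset(9, 135); seek there and read 16 bytes.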

+14
Mar 24 '12 at 9:27

I have a dataset of 1-minute data for 1000 stocks [...] most (99.9%) of the time I will only perform read queries.

Storing time-based numerical data that is written once and read many times is a use case called a "time series". Other common time series are sensor data in the Internet of Things, server monitoring statistics, application events, etc.

This question was asked in 2012, and since then several database engines have been developing features specifically for managing time series. I have had great results with InfluxDB , which is open source, written in Go, and MIT-licensed.

InfluxDB is specifically optimized for storing and querying time-series data. Much more so than Cassandra , which is often touted as great for storing time series:

InfluxDB vs Cassandra query speed

The optimizations for time series involved certain tradeoffs. For example:

Updates to existing data are a rare occurrence and contentious updates never happen. Time series data is predominantly new data that is never updated.

Pro: Restricting access to updates allows for increased query and write performance.

Con: Update functionality is significantly restricted.

In open-source benchmarks ,

InfluxDB outperformed MongoDB in all three tests, with 21x greater write throughput, while using 84x less disk space, and delivering relatively equal performance when it came to query speed.

InfluxDB vs. MongoDB disk requirements and compression

Queries are also very simple. If your rows look like <symbol, timestamp, open, high, low, close, volume> , with InfluxDB you can store just that and then query it easily. For example, for the question's time range:

    SELECT open, close
    FROM market_data
    WHERE symbol = 'AAPL'
      AND time > '2012-04-12 12:15'
      AND time < '2012-04-13 12:52'

No IDs, no keys, and no joins. You can do many interesting aggregations . You do not have to vertically partition the table as in PostgreSQL , or contort your schema into arrays of seconds as in MongoDB . Also, InfluxDB compresses really well, while PostgreSQL will not be able to perform any compression on the type of data that you have .

+13
Sep 13 '16 at 17:00

Here is an attempt to create a market data server on top of Microsoft SQL Server 2012, which should be good for OLAP analysis; it is a free, open-source project:

http://github.com/kriasoft/market-data

+4
Dec 11 '12 at 18:43

Let me recommend that you take a look at Apache Solr , which I think would be ideal for your specific problem. Basically, you would first index your data (each row being a "document"). Solr is optimized for searching and natively supports range queries on dates. Your nominal query,

"Give me the prices of AAPL between April 12 2012 12:15 and April 13 2012 12:52"

would translate to something like:

 ?q=stock:AAPL AND date:[2012-04-12T12:15:00Z TO 2012-04-13T12:52:00Z] 

Assuming "stock" is the name of the stock, and "date" is the "datfield" created from the "date" and "minute" columns of your input data when indexing. Sol is incredibly flexible, and I really can't say enough good things about it. For example, if you need to maintain fields in the source data, you might find a way to dynamically create a "DateField" as part of a query (or filter).

+3
Mar 27 '12 at 15:37

Firstly, there are not 365 trading days in a year: with holidays and the 52 weekends (104 days) it is roughly 250 days, times the actual hours the market is open each day, as someone already said. Also, using the symbol as the primary key is not a good idea, since symbols change; use a numeric k_equity_id together with the symbol (char), since symbols can look like A or GAC-DB-B.TO. So your estimate of 7.3 billion rows is vastly over-calculated, since it is only about 1.7 million rows per symbol for 14 years. In your minute-level price table the key would be:

k_equity_id k_date k_minute

and for the EOD table (end of day, which will be queried 1000x more often than the minute data):

k_equity_id k_date

Secondly, do not store the per-minute OHLC data in the same DB table as the EOD (end of day) table, since anyone wanting to look at a PnF or line chart over a year has zero interest in the per-minute information.

+3
Aug 13 '13 at 19:54

I think any major RDBMS would handle this. At the atomic level, a single table with correct partitioning seems reasonable (partition based on your data usage if fixed - likely either symbol or date).

You can also look into building aggregate tables for faster access above the atomic level. For example, if your data is at day level but you often get data back at the week or even month level, then this can be pre-calculated in an aggregate table. In some databases this can be done through a cached view (the name varies between database solutions, but essentially it is a view on the atomic data that, once run, is cached/hardened into a fixed temporary table and queried for subsequent matching queries; this can be dropped at intervals to free up memory/disk space).

I guess we could help you more with some idea of your data usage.

+2
Mar 30 '12

You should benchmark the slow solutions against a simple, optimized in-memory model. Uncompressed, it fits in a 256 GB RAM server. A snapshot fits in 32 KB, and you just index it positionally by datetime and stock. Then you can make specialized snapshots, since the open of one is often equal to the close of the previous one.

[edit] Why do you think it makes sense to use a database at all (RDBMS or NoSQL)? This data does not change, and it fits in memory. This is not a use case where a DBMS can add value.

+2
Mar 30 '12 at 9:28

If you have the hardware, I recommend MySQL Cluster . You get the MySQL/RDBMS interface you are familiar with, plus fast, parallel writes. Reads will be slower than with regular MySQL due to network latency, but you have the advantage of being able to parallelize queries and reads because of how MySQL Cluster and the NDB storage engine work.

Make sure you have enough MySQL Cluster machines and enough memory/RAM for each of them, though - MySQL Cluster is a memory-heavy database architecture.

Or Redis , if you do not mind a key-value / NoSQL interface for your reads/writes. Make sure Redis has enough memory - it is super-fast for reads and writes, and you can do basic queries with it (non-RDBMS though), but it is also an in-memory database.

As others have said, more detailed information about the queries you will run will help.

+2
Mar 30 '12 at 21:15

If your use case is to simply read rows without aggregation, you can use an Aerospike cluster. It is an in-memory database with file-system support for persistence. It is also SSD-optimized.

If your use case needs aggregated data, go for a MongoDB cluster sharded by date range. You can group the data year-wise into shards.

+1
Sep 20 '16 at 12:28

You will want the data stored in a columnar table / database . Database systems like Vertica and Greenplum are columnar databases, and I believe SQL Server now allows columnar tables. These are extremely efficient for SELECTs from very large datasets. They are also efficient at importing large datasets.

MonetDB is a free columnar database.

+1
Sep 20 '16 at 18:06


