Scalable, fast, text-file-backed database engine?

I am dealing with large amounts of scientific data stored in tab-separated .tsv files. The typical operations are reading several large files, filtering out only certain columns or rows, joining with other data sources, adding calculated values, and writing the result as another .tsv.
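
For concreteness, here is a rough sketch of one such pass in C# (the file names, column positions and threshold are made up, just to illustrate the shape of the work):

```csharp
// Rough sketch of one per-file pass: filter rows, keep a couple of columns,
// add a calculated value, write another .tsv. Names and positions are hypothetical.
using System.IO;

class TsvPass
{
    static void Main()
    {
        using var writer = new StreamWriter("result.tsv");
        foreach (var line in File.ReadLines("measurements.tsv"))
        {
            var cols = line.Split('\t');
            // Keep only rows whose third column (a measured value) parses and exceeds a threshold.
            if (cols.Length < 3 || !double.TryParse(cols[2], out var value) || value <= 0.5)
                continue;
            // Emit the id, the value, and a derived column.
            writer.WriteLine(string.Join("\t", cols[0], cols[2], (value * value).ToString("G6")));
        }
    }
}
```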

Plain text is used for its robustness, longevity and self-documenting character. Storing the data in another format is not an option: it has to stay open and easy to process. And there is a lot of data (tens of terabytes), more than we can afford to copy into a relational database (we would have to buy twice the storage space).

Since I mostly do selects and joins, I realized that what I basically need is a database engine with .tsv-based backing storage. I do not care about transactions, since my data is all write-once-read-many. I need to process the data in place, without a major up-front step of converting it or cloning it into a secondary format.

Since there is a lot of data to query this way, it needs to be processed efficiently, exploiting caching and a grid of computers.

Does anyone know of a system that provides database-like capabilities while using tab-separated files as the backend? It seems to me that this is a very common problem that practically every scientist runs into one way or another.

+7
database csv large-data scientific-computing plaintext
7 answers

There is a lot of data (tens of terabytes), more than we can afford to copy into a relational database (we would have to buy twice the storage space).

You know your requirements better than any of us, but I would suggest you think about this some more. If you have 16-bit integers (0-65535) stored in a text file, the storage efficiency of the .tsv is roughly 33%: most 16-bit integers take 5 characters plus one separator, i.e. 6 bytes, whereas the native integer takes 2 bytes. For floating-point data the efficiency is even worse.
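
As a toy illustration of that ratio (the numbers below are simulated, not taken from your data), a few lines of C# comparing the text and binary footprints of a million random 16-bit integers:

```csharp
// Toy comparison of text vs. binary footprint for 16-bit integers.
using System;
using System.Linq;
using System.Text;

class StorageEfficiency
{
    static void Main()
    {
        var rng = new Random(42);
        ushort[] values = Enumerable.Range(0, 1_000_000)
                                    .Select(_ => (ushort)rng.Next(ushort.MaxValue + 1))
                                    .ToArray();

        // .tsv side: decimal digits plus one separator byte per value.
        long textBytes = values.Sum(v => (long)Encoding.ASCII.GetByteCount(v.ToString()) + 1);
        // Binary side: 2 bytes per value.
        long binaryBytes = values.LongLength * sizeof(ushort);

        Console.WriteLine($"text:   {textBytes} bytes");
        Console.WriteLine($"binary: {binaryBytes} bytes");
        Console.WriteLine($"storage efficiency of text: {100.0 * binaryBytes / textBytes:F0}%");
        // Prints roughly 34%: most 16-bit values need 5 digits + 1 separator = 6 bytes vs. 2 native bytes.
    }
}
```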

I would consider taking your existing data and, instead of storing it raw, processing it in the following two ways:

  • Keep it compressed in a well-known compression format (for example, gzip or bzip2) on your permanent archive media (backup servers, tape drives, etc.), so that you retain the advantages of the .tsv format (a small compression sketch follows this list).
  • Process it into a database with good storage efficiency. If the files have a fixed and rigid format (for example, column X is always a string, column Y is always a 16-bit integer), you are probably in good shape. Otherwise, a NoSQL database might be better (see Stefan's answer).
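
For the first point, a minimal sketch of the archival compression step using the standard GZipStream (file names are hypothetical):

```csharp
// Sketch: gzip a .tsv for the archive copy; the original stays plain text until you delete it.
using System.IO;
using System.IO.Compression;

class ArchiveTsv
{
    static void Main()
    {
        using FileStream source = File.OpenRead("measurements.tsv");
        using FileStream target = File.Create("measurements.tsv.gz");
        using var gzip = new GZipStream(target, CompressionLevel.Optimal);
        source.CopyTo(gzip);
        // Decompressing later (GZipStream in Decompress mode, or plain gunzip)
        // recovers the original .tsv byte for byte.
    }
}
```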

This gives you an auditable (but perhaps slowly accessible) archive with a low risk of data loss, and a quickly accessible database that does not have to worry about losing the source data, since you can always re-read it into the database from the archive.

You should be able to reduce your storage footprint, and you should not need twice the storage space, as you claim.

Indexing is going to be the hard part; you had better have a good idea of which subset of the data you need to be able to query efficiently.

+5

One of these NoSQL databases might work. I highly doubt any of them are configurable to sit on top of flat, delimited files, but you could look at one of the open-source projects and write your own database layer.

+2

Scalability begins at the point beyond tab-separated ASCII.

Just be practical rather than academic about it: convention frees your fingers as well as your mind.

+2

You can do this with LINQ to Objects if you are in a .NET environment: streaming / deferred execution, a functional programming model, and all the familiar SQL-style operators. Joins work in a streaming model, but one side of the join gets pulled into memory, so you want the large table streamed against a smaller table held in memory.
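
A rough sketch of that pattern (file names and column layout are invented for illustration): the big .tsv is streamed lazily with File.ReadLines, while the smaller lookup table is materialized in memory and used as the inner side of the join.

```csharp
// Sketch: streaming LINQ to Objects over a large .tsv, joined against a small table held in memory.
// File names and column positions are hypothetical; header rows are assumed absent.
using System.IO;
using System.Linq;

class LinqTsvJoin
{
    static void Main()
    {
        // Small side of the join: loaded eagerly into memory (sample id -> label).
        var sampleInfo = File.ReadLines("samples.tsv")
                             .Select(l => l.Split('\t'))
                             .ToDictionary(c => c[0], c => c[1]);

        // Large side: streamed lazily, never fully in memory.
        var results =
            from line in File.ReadLines("measurements.tsv")
            let cols = line.Split('\t')
            where double.Parse(cols[2]) > 0.5                 // row filter
            join kv in sampleInfo on cols[0] equals kv.Key    // inner side is hashed
            select string.Join("\t", cols[0], kv.Value, cols[2]);

        // WriteAllLines consumes the deferred query and streams the output .tsv.
        File.WriteAllLines("filtered.tsv", results);
    }
}
```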

The ease of composing queries and the ability to write your own expressions really shine in a scientific application.

LINQ over a delimited text file is a common LINQ demonstration; what you need is the ability to feed LINQ a tabular model. Google "LINQ text file" for some examples (e.g. see http://www.codeproject.com/KB/linq/Linq2CSV.aspx , http://www.thereforesystems.com/tutorial-reading-a-text-file-using-linq/ , etc.).

Expect a learning curve, but it is a good solution for your problem. One of the best treatments of the subject is Jon Skeet's C# in Depth; pick up the "MEAP" version from Manning for early access to the latest edition.

I have done this kind of work before with large mailing lists that needed to be cleaned, deduplicated and appended. You are invariably IO-bound. Try solid-state drives, particularly Intel's "E" series, which have very fast write performance, and RAID them in as parallel a configuration as possible. We also used grids, but I had to adjust the algorithms to multi-pass approaches that reduce the data at each pass.

Note: I agree with the other answers that stress loading into a database and indexing if the data is very regular. In that case you are basically doing ETL, which is a well-understood problem in the warehousing community. But if the data is ad hoc, with scientists simply dropping their results into a directory, you need agile / just-in-time transformations, and if most transformations are single-pass select ... where ... join, then you are approaching it the right way.

+1

I would upvote Jason's recommendation if I had the reputation. My only addition is that if you do not store it in a different format, such as the database Jason suggested, you pay the parsing cost on every operation, rather than just once when you initially process it.

+1

You can do this with VelocityDB. It is very fast at reading tab-separated data into C# objects and databases. The entire Wikipedia text is a 33 GB XML file; it takes 18 minutes to read that file, parse it into objects (one per Wikipedia topic) and store them in compact databases. A number of samples for reading tab-separated text files are included in the download.

+1

The question has already been answered, and I agree with most of the statements.

At our centre we have a standard talk that we give, "So you have 40 TB of data", because scientists find themselves in this situation all the time now. The talk is nominally about visualization, but primarily about managing large amounts of data for those who are new to it. The basic points we try to get across:

  • Plan your I/O:
    • Binary files
    • Large files, wherever possible
    • File formats that can be read in parallel and allow subregions to be extracted
    • Avoid zillions of files
    • Especially avoid zillions of files in a single directory
  • Data management must scale:
    • Include metadata for provenance (a sketch at the end of this answer)
      • Reduces the need to re-run computations
    • Intelligent data management
      • A hierarchy of data directories only if that will always work
    • Databases, or formats that allow metadata
  • Use scalable, automatable tools:
    • Parallel tools for large data sets - ParaView, VisIt, etc.
    • Scriptable tools - gnuplot, python, R, ParaView / VisIt ...
    • Scripts provide reproducibility!

We also cover quite a bit on large-scale I/O in general, as this is an increasingly common stumbling block for scientists.
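
To illustrate the provenance point above, a small hypothetical sketch: write a sidecar metadata file next to each derived .tsv recording which inputs and which command produced it, so the result can be checked or regenerated later.

```csharp
// Hypothetical sketch: record provenance in a sidecar file next to a derived .tsv,
// so the output can be regenerated (or trusted) later without guesswork.
using System;
using System.IO;

class Provenance
{
    static void WriteSidecar(string outputPath, string[] inputPaths, string command)
    {
        var lines = new[]
        {
            $"created\t{DateTime.UtcNow:O}",
            $"command\t{command}",
            $"inputs\t{string.Join(";", inputPaths)}",
        };
        // e.g. filtered.tsv.meta.tsv sits beside filtered.tsv
        File.WriteAllLines(outputPath + ".meta.tsv", lines);
    }

    static void Main()
    {
        WriteSidecar("filtered.tsv",
                     new[] { "measurements.tsv", "samples.tsv" },
                     "filter value > 0.5, join on sample id");
    }
}
```
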

+1
