I am dealing with large amounts of scientific data stored in tab-separated .tsv files. The typical operations are reading several large files, filtering out only certain columns or rows, joining with other data sources, adding calculated values, and writing the result out as another .tsv.
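For concreteness, a single job looks roughly like the sketch below (Python; the file name, column names, and the filter threshold are made up purely for illustration):

    import csv
    import math

    # Illustrative pipeline: stream one large .tsv, keep selected columns,
    # filter rows, add a calculated column, and write the result as a new .tsv.
    with open("measurements.tsv", newline="") as src, \
         open("filtered.tsv", "w", newline="") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        writer = csv.DictWriter(
            dst, delimiter="\t",
            fieldnames=["sample_id", "intensity", "log_intensity"])
        writer.writeheader()
        for row in reader:
            if float(row["intensity"]) > 100.0:        # row filter
                writer.writerow({
                    "sample_id": row["sample_id"],      # column selection
                    "intensity": row["intensity"],
                    # calculated value
                    "log_intensity": f'{math.log10(float(row["intensity"])):.4f}',
                })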
Plain text is used for its reliability, durability, and self-documenting nature. Storing the data in a different format is not an option; it has to stay open and easy to process. There is a lot of data (dozens of TB), and loading a copy into a relational database is not feasible (we would have to buy twice the storage space).
Since I mostly do selects and joins, I realized that what I basically need is a database engine with .tsv files as its backing store. I do not care about transactions, since my data is all write-once. I need to process the data in place, without an upfront conversion step that duplicates it.
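To illustrate the kind of join I mean, without importing anything into a database first, here is a hand-rolled streaming sketch (file names, the key column, and the inner-join semantics are assumptions for the example):

    import csv

    # Illustrative streaming hash join of two .tsv files on a shared key column.
    # The smaller file is indexed in memory; the larger one is streamed row by
    # row, so nothing is converted or duplicated on disk.
    def hash_join(big_path, small_path, key, out_path):
        with open(small_path, newline="") as small:
            small_reader = csv.DictReader(small, delimiter="\t")
            extra_cols = [c for c in small_reader.fieldnames if c != key]
            index = {row[key]: row for row in small_reader}

        with open(big_path, newline="") as big, \
             open(out_path, "w", newline="") as out:
            big_reader = csv.DictReader(big, delimiter="\t")
            writer = csv.DictWriter(
                out, delimiter="\t",
                fieldnames=big_reader.fieldnames + extra_cols)
            writer.writeheader()
            for row in big_reader:
                match = index.get(row[key])
                if match:                  # inner join: keep only matching rows
                    row.update({c: match[c] for c in extra_cols})
                    writer.writerow(row)

    # Hypothetical usage with invented file and column names:
    # hash_join("peptides.tsv", "proteins.tsv", key="protein_id",
    #           out_path="joined.tsv")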
Since there is a lot of data to query in this way, I need to process it efficiently using caching and a grid of computers.
Does anyone know of a system that would provide database-like capabilities on top of tab-separated files as the backend? This seems like a very common problem that pretty much every scientist runs into in one way or another.
database csv large-data scientific-computing plaintext
Roman zenka