Storage of large amounts of data: database or file system?

Let's say my application creates, saves and retrieves a very large number of records (tens of millions). Each record holds a variable amount of data: some records contain only a few bytes, such as an ID and title, while others may carry megabytes of additional data. The basic structure of every record is the same, and it is stored as XML.

Records are created and edited arbitrarily (most likely by appending rather than rewriting).

Does it make sense to store the records as separate files on the file system, keeping only the necessary indexes in the database, rather than saving everything in the database?

+7
database filesystems data-structures indexing database-design
7 answers

It depends on how you are going to use it. Databases can handle more records in a table than most people think, especially with the right indexing. On the other hand, if you are not going to use the functionality a relational database provides, there may not be much reason to use one.

Ok, enough generalization. Given that a database ultimately boils down to "files on disk" anyway, I would not worry too much about what the "right thing to do" is. If the main purpose of the database is simply to retrieve these files efficiently, I think it would be perfectly fine to keep the database records small and store file paths rather than the actual data, all the more so because your file system should already be quite efficient at retrieving data given a specific location.

In case you are interested, this is in fact a common storage pattern for search engines: the index stores the indexed fields together with a pointer to the full data on disk, rather than storing everything in the index.
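
A minimal sketch of that layout (PostgreSQL-flavoured SQL; table, column and index names are illustrative, not from the question):

    -- Small, indexable metadata lives in the database;
    -- the bulky XML lives as a file on disk.
    CREATE TABLE records (
        id         bigserial PRIMARY KEY,
        title      text NOT NULL,
        xml_path   text NOT NULL,      -- location of the record's XML file
        byte_size  bigint,
        updated_at timestamptz DEFAULT now()
    );

    CREATE INDEX records_title_idx ON records (title);

    -- Look up the file location via the index; the file system then
    -- does the heavy lifting of actually retrieving the data.
    SELECT xml_path FROM records WHERE title = 'some title';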

+4

I would definitely store the data in the file system, and a hash of the path in the database.

+3

Depending on your budget, MS SQL Server has what is called a "Primary XML Index" that can be created even over unstructured data. This lets you write XQuery to search within the column, and the database will help you out.
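
A rough T-SQL sketch of what that looks like (table, index and XPath names are made up for illustration):

    -- The table needs a clustered primary key before an XML index can be built.
    CREATE TABLE dbo.Records (
        Id   INT IDENTITY(1,1) PRIMARY KEY,
        Data XML NOT NULL
    );

    CREATE PRIMARY XML INDEX PXML_Records_Data ON dbo.Records (Data);

    -- Let the database do the searching with XQuery:
    SELECT Id
    FROM dbo.Records
    WHERE Data.exist('/record/header/id[text() = "12345"]') = 1;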

If there is any consistency to the data, or it can be fitted to a schema, this may be worth looking into.

May I recommend that if you have large amounts of binary data, such as images and so on, you strip these out and store them elsewhere, such as in the file system. Alternatively, if you are on SQL Server 2008, there is a type called FILESTREAM (cheers @Marc_s) that lets you index, store and secure all the files you write, and retrieve them through the NTFS APIs (i.e. fast block transfers), while still keeping them as columns in the database.
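
And a sketch of a FILESTREAM column, assuming FILESTREAM has already been enabled on the instance and the database has a FILESTREAM filegroup (names are illustrative):

    CREATE TABLE dbo.RecordBlobs (
        RecordId INT NOT NULL PRIMARY KEY,
        -- FILESTREAM tables require a unique ROWGUIDCOL column.
        RowGuid  UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
        -- Stored as files on NTFS and streamed via the Win32 file APIs,
        -- but still addressable as a column in the database.
        Payload  VARBINARY(MAX) FILESTREAM NULL
    );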

Having the data in a database also gives you a good layer of abstraction and scalability if your application has heavy requirements for searching through the XML data, meaning you do not have to build that yourself.

Just my 2c.

+1

At work, I often have to accumulate large sets of XML documents for later analysis. This is usually done by dropping them into a directory and analysing them with grep (or with ad-hoc Java programs and all the attendant XML factory/builder/wrapper/API paraphernalia).

One slow day I thought I'd try putting them into PostgreSQL instead. There were two features I wanted to try:

  • Automatic compression of large values (TOAST).
  • Expression indexes.

As for the first feature, the database ended up at less than half the size of the raw files. Doing a full-text search, i.e. a sequential scan of the table with WHERE data::text LIKE '%pattern%', was actually faster than running grep over the files. When you are dealing with several GB of XML, that alone makes the database worthwhile.
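
Something like the following captures the idea (a sketch with illustrative names, not the author's actual schema; values larger than roughly 2 kB are TOASTed automatically):

    CREATE TABLE docs (
        id   bigserial PRIMARY KEY,
        data xml NOT NULL
    );

    -- The brute-force "grep over the whole corpus" query:
    SELECT id
    FROM docs
    WHERE data::text LIKE '%pattern%';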

The second feature, indexing, takes a bit more work to maintain. There were a few specific elements I guessed it would be useful to index. An index on xpath('//tradeHeader/tradeId/text()', data) works, but duplicating that expression in every query can be a pain. I found it easier to add regular columns for some fields and keep them in sync with insert/update triggers.
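
A sketch of the trigger-synchronized column approach, continuing the hypothetical docs table above (PostgreSQL 11+ syntax for EXECUTE FUNCTION; the XPath matches the one quoted in the text):

    ALTER TABLE docs ADD COLUMN trade_id text;

    CREATE FUNCTION docs_sync_trade_id() RETURNS trigger AS $$
    BEGIN
        -- xpath() returns an array of matching nodes; keep the first, if any.
        NEW.trade_id := (xpath('//tradeHeader/tradeId/text()', NEW.data))[1]::text;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER docs_sync_trade_id_trg
        BEFORE INSERT OR UPDATE ON docs
        FOR EACH ROW EXECUTE FUNCTION docs_sync_trade_id();

    -- An ordinary b-tree index on the synchronized column.
    CREATE INDEX docs_trade_id_idx ON docs (trade_id);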

+1

A few considerations:

  • transaction management;
  • backup and restore.

Both are easier to handle with a database than with the file system, but perhaps the hardest part is keeping file system backups in sync with the database's redo (journal) logging. The more transactional your application is, the more these factors matter.

From your question it sounds as though you are not going to use the usual database functionality (relational integrity, joins). In that case you should seriously consider a third option: store your data in the file system and, instead of a database, use a file-oriented text search engine such as Solr (or Lucene), Sphinx, Autonomy, etc.

+1

I would use HDFS (the Hadoop Distributed File System) to store the data. The main idea is that you get high availability, scalability and replication. Any queries your application needs can be answered with MapReduce jobs, and the key fields can be stored in a distributed index on top of Hadoop using Katta.

Try googling these technologies.

+1

It depends on how you are going to use the data, as stated in an earlier answer.

Data in a database can be used to support many different queries and to feed the results into reports, forms, OLAP tools and much else. Appropriate indexing can speed up searches dramatically.

If you know SQL, and if the database is well designed, querying the data is simpler, faster and requires less code than doing the equivalent with files. But, as others have noted, you can hook your XML data up to SQL without moving it into the database.

Designing a good general-purpose schema is harder than many beginners think; there is a lot to learn, and it is not just a matter of driving one tool or another. And a poorly designed general-purpose schema can be harder to work with than flat files.

If you decide to go with a database, be prepared to make a significant investment. And make sure you get the benefits of this investment.

0
