How do databases handle data tables that cannot fit into memory?

Suppose you have a really large table, say several billion unordered rows, and you want to index it for fast searching. Or maybe you want to load it and store it on disk in order, with a clustered index. Obviously, once you get to data of this size, you have to stop assuming you can do things like sorting entirely in memory (well, not without spilling into virtual memory and thrashing heavily).

Can someone give me some tips on how databases handle this much data under the hood? I assume there are algorithms that use some form of smart disk caching to work through all the data, but I don't know where to start. Links are especially welcome. Maybe an advanced database tutorial?

+4
5 answers

Multiway merge sort is the keyword for sorting amounts of data that are too large to fit in memory.
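To make the idea concrete, here is a minimal sketch of an external (multiway) merge sort in Python; the chunk size, temp-file handling, and line-oriented input are my own illustrative assumptions, not anything from this thread:

```python
# Sketch of an external multiway merge sort: sort bounded chunks in memory,
# write them as sorted "runs" on disk, then stream-merge the runs.
import heapq
import itertools
import os
import tempfile

def external_sort(input_path, output_path, max_lines_in_memory=1_000_000):
    """Sort a huge line-oriented file using only bounded memory."""
    run_paths = []

    # Pass 1: read fixed-size chunks, sort each in memory, write sorted runs.
    with open(input_path) as src:
        while True:
            chunk = list(itertools.islice(src, max_lines_in_memory))
            if not chunk:
                break
            chunk.sort()
            fd, run_path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as run:
                run.writelines(chunk)
            run_paths.append(run_path)

    # Pass 2: k-way merge of the sorted runs; heapq.merge streams lazily,
    # so only one line per run is held in memory at a time.
    runs = [open(p) for p in run_paths]
    try:
        with open(output_path, "w") as out:
            out.writelines(heapq.merge(*runs))
    finally:
        for f in runs:
            f.close()
        for p in run_paths:
            os.remove(p)
```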

+6

As far as I know, most indexes use some form of B-tree, which does not need to keep everything in memory. You can simply put the tree nodes in a file and then seek to the various positions in the file. It can also be used for sorting.
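A rough sketch of that "nodes in a file" idea, with an assumed page size and node header layout (not anything from the answer); each node is a fixed-size page, so visiting it is just a seek plus a read:

```python
# Fixed-size "node" pages stored in a binary file, addressed by page number,
# the way a disk-based B-tree keeps only the nodes it is currently visiting
# in memory.
import struct

PAGE_SIZE = 4096                 # assumed page size
HEADER = struct.Struct("<ii")    # assumed node header: key count, is_leaf flag

def read_node(f, page_no):
    """Read one node page from an open binary file."""
    f.seek(page_no * PAGE_SIZE)
    page = f.read(PAGE_SIZE)
    num_keys, is_leaf = HEADER.unpack_from(page, 0)
    return num_keys, is_leaf, page

def write_node(f, page_no, num_keys, is_leaf, payload=b""):
    """Write one node page at a fixed offset, padded out to PAGE_SIZE."""
    page = HEADER.pack(num_keys, is_leaf) + payload
    f.seek(page_no * PAGE_SIZE)
    f.write(page.ljust(PAGE_SIZE, b"\x00"))
```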

+1

Are you creating a database engine?

Edit: I built a disk-based database system back in the mid-90s.

Fixed-size records are the easiest to work with, since the file offset that locates a record can simply be calculated as a multiple of the record size. I also had some variable-sized records.

My system needed to be optimized for reading. The data was actually stored on a CD-ROM, so it was read-only. I built binary search tree files for each column I wanted to search on. I took an open-source in-memory binary search tree and converted it to do random access into a disk file. Sorted reads from each index file were easy, and reading each data record from the main data file in indexed order was also easy. I did not need to sort in memory, and the system was faster than any of the RDBMS systems available on the client machines at the time.

For fixed-size records, an index can simply keep track of the record number. For variable-length records, the index just needs to store the offset in the file where the record begins, and each record must begin with a structure that gives its length.
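A small sketch of both schemes in Python (the record size and length-prefix layout are assumptions for illustration, not details from the original system):

```python
# Fixed-size records addressed by record number, and variable-length records
# addressed by file offset with a length prefix at the start of each record.
import struct

RECORD_SIZE = 64  # assumed fixed record size in bytes

def read_fixed_record(f, record_no):
    """The offset is simply a multiple of the record size."""
    f.seek(record_no * RECORD_SIZE)
    return f.read(RECORD_SIZE)

LEN_PREFIX = struct.Struct("<I")  # assumed 4-byte little-endian length header

def read_variable_record(f, offset):
    """The index stores the byte offset; the record starts with its length."""
    f.seek(offset)
    (length,) = LEN_PREFIX.unpack(f.read(LEN_PREFIX.size))
    return f.read(length)

def append_variable_record(f, data):
    """Append a record and return the offset an index entry would store."""
    f.seek(0, 2)              # seek to end of file
    offset = f.tell()
    f.write(LEN_PREFIX.pack(len(data)))
    f.write(data)
    return offset
```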

0

You need to partition the data set somehow and spread the partitions across the RAM of separate servers. Even a billion 32-bit ints is about 4 GB of RAM, and that is just your index.

For low-cardinality data, such as Gender (which has only two values, Male and Female), each index entry can be represented in less than a byte. For cases like that, Oracle uses a bitmap index.
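A toy illustration of a bitmap index in Python (the column values and rows are made up; real engines store and compress these bitmaps on disk rather than as Python ints):

```python
# One bitmap per distinct column value; bit i is set if row i has that value.
def build_bitmap_index(rows, column):
    bitmaps = {}
    for row_no, row in enumerate(rows):
        value = row[column]
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row_no)
    return bitmaps

rows = [{"gender": "M"}, {"gender": "F"}, {"gender": "F"}, {"gender": "M"}]
idx = build_bitmap_index(rows, "gender")

# Row numbers matching gender == 'F': bits 1 and 2 are set in idx["F"].
matches = [i for i in range(len(rows)) if idx["F"] >> i & 1]
print(matches)  # [1, 2]
```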

0

Hmm ... Interesting question.

I think most database management systems in use rely on the operating system's virtual-memory mechanisms, and when physical memory runs out, in-memory pages get swapped out to disk.
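If that is the mechanism, the easiest way to see it from user code is a memory-mapped file, where the kernel pages data in and out on demand; a sketch with an assumed file name and record size:

```python
# The mapped file looks like one big byte array, but only the pages actually
# touched are brought into physical memory; the OS evicts them under pressure.
import mmap

RECORD_SIZE = 64  # assumed record size

with open("table.dat", "rb") as f:          # hypothetical data file
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Random access to record 1_000_000 touches only the pages it needs.
        record = mm[1_000_000 * RECORD_SIZE : 1_000_001 * RECORD_SIZE]
```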

-1
