Mainting a sorted list that is larger than memory

I have a list of tuples.

[ "Bob": 3, "Alice": 2, "Jane": 1, ] 

As the number increases

  "Alice" += 2 

order should be maintained:

 [ "Alice": 4, "Bob": 3, "Jane": 1, ] 

When everything in memory there are fairly simple ways (some more or less) to effectively implement this. (using index, insertion-sorting, etc.) The question though is: What is the most promising approach when the list does not fit into memory .

Bonus question: what if even the index does not fit in memory?

How do you approach this?

+4
source share
9 answers

B + trees order several elements with a key. In this case, the key is the counter, and the element is the name of the person. The entire B + tree does not need to be inserted into memory - only search for the current node. You can set the maximum size of the nodes (and indirectly the depth of the tree) so that the node fits in memory. (In practice, nodes are usually much smaller than memory.)

Data elements are stored in tree leaves in so-called blocks. You can either save inline elements in the index, or save pointers to external storage. If the data is of regular size, this can provide an efficient search from files. In the example question, data elements can be single names, but it would be more efficient to store name blocks, all names in a block that has the same counter. Names inside each block can also be sorted. (Names in the blocks themselves can be organized as a B-tree.)

If the number of names becomes so large that the blocks of the B + tree become excessively large, the key can be made into a compound key, for example. (count, first letter). When searching in the tree, you only need to compare the counter to find all the names with this counter. When inserting or searching for a specific name with a given count, the full key can be compared to enable filtering by the name prefix.

Alternatively, instead of a compound key, data elements can point to offsets / blocks in an external file that contains name blocks that will contain a minimal B + tree.

If btree blocks are connected to each other, range requests can be effectively implemented by looking for the beginning of the range and then following block pointers to the next block until the end of the range is reached. This will allow you to effectively implement "find all names with a score from 10 to 20".

As other answers noted, RDBMS is a pre-packaged way of storing lists that don't fit into memory, but I hope this gives an idea of ​​the structures used to solve the problem.

+7
source

A relational database such as MySQL is specifically designed to store large amounts of data, the sum of which does not fit into memory, requests this large amount of data and even updates them in place.

For instance:

 CREATE TABLE `people` ( `name` VARCHAR(255), `count` INT ); INSERT INTO `people` VALUES ('Bob', 3), ('Alice', 2), ('Jane', 1); UPDATE `people` SET `count` = `count` + 2; 

After the UPDATE in a SELECT * FROM people ; query SELECT * FROM people ; will be shown:

  + ------- + ------- +
 |  name |  count |
 + ------- + ------- +
 |  Bob |  5 |
 |  Alice |  4 |
 |  Jane |  3 |
 + ------- + ------- +

You can keep the order of people in your table by adding a primary key auto-increment:

 CREATE TABLE `people` ( `id` INT UNSIGNED NOT NULL AUTO_INCREMENT, `name` VARCHAR(255), `count` INT, PRIMARY KEY(`id`) ); INSERT INTO `people` VALUES (DEFAULT, 'Bob', 3), (DEFAULT, 'Alice', 2), (DEFAULT, 'Jane', 1); 
+7
source

RDMS? Even flat file versions such as SQLite. Otherwise, this is a combination using lazy loading. Only store X-records in the memory in the upper Y-records and in the last Z which have been updated. Otherwise, the column table Key, Count, in which you start UPDATE, changing the values. An ordered list can be obtained with a simple SELECT ORDER BY.

+1
source

Read about B-trees and B + ribs. Moreover, the index can always be made small enough to fit into the memory.

+1
source

An interesting approach, unlike BTrees, is Judy Tree

+1
source

What you are looking for is from the basic algorithms for container classes, in particular, from the container class of the main list. Check out the stxxl library for some great examples from the main horizon and processing.

You can also see this related question.

+1
source

Regarding the “implementation details” associated with this “manually”, you can read about how database systems do this by searching for source documents for database design or by looking for course notes for database architecture.

I did a few searches and found a review article by G. Grafe entitled " Query Evaluation Techniques for Large Databases ". It covers somewhat all aspects of queries to large databases, but section 4 describes how “query scoring systems ... access the underlying data stored in a database.” In addition, Graefe's review was linked to the course page for CPS 216: Advanced Databases Systems at Duke, Fall 2001. Week 5 was on Physical Data Organization , which states that most commercial DBMSs organize disk data using N-shaped blocks storage models (NSM): records are stored from the beginning of each block, and at the end there is a "directory".

See also:

0
source

Of course, I know that I can use the database. This question was more about implementation details that solved this "manually."

Basically, you ask: "How does the database do?" In response to this, it uses a tree (for both data and index) and only stores part of the tree in memory at any given time.

As already mentioned, B-Trees are especially useful for this: since hard drives always read a fixed amount at a time ("sector size"), you can make each node sector size to maximize efficiency.

0
source

You do not indicate that you need to add or remove any items from the list, just save it.

If so, a simple approach with flat files - usually using mmap for convenience - will work and will be faster than a more general database.

You can use bsearch to find an item or maintain a set of slot samples with each value.

When you access an element, so that the part of the file in which it is located (think in terms of a “memory page”) automatically receives RAM in RAM, and the slot and neighboring slots are even copied to the L1 cache line.

You can perform an immediate comparison with your neighboring slots to see if the increment or decrement increases an element out of order; If so, you can use linear iteration (possibly supplemented with bsearch ) to find the first / last element with the appropriate counter, and then swap them.

File management is what the OS has built.

0
source

Source: https://habr.com/ru/post/1311434/


All Articles