How can insertion into a Data.Set be O(log n)?

While reading the Data.Set documentation, I saw that inserting an element into the tree is listed as O(log n). However, I would intuitively expect it to be O(n * log n) (or maybe O(n)?), since referential transparency requires a full copy of the previous tree, which is O(n).

I understand that, for example, (:) can be made O(1) instead of O(n), since the full list does not need to be copied; the new list can be optimized by the compiler into the first element plus a pointer to the old list (note that this is a compiler-level, not a language-level, optimization). However, inserting a value into a Data.Set involves rebalancing, which looks complicated enough to me that I doubt anything like the list optimization applies. I tried to read the paper referenced by the Data.Set documentation, but could not answer my question from it.

So: how can inserting an element into a binary tree be O(log n) in a (purely) functional language?

2 answers

There is no need to make a full copy of a Set to insert an element into it. Internally, the elements are stored in a tree, which means that you only need to create new nodes along the insertion path. Untouched nodes can be shared between the pre-insertion and post-insertion versions of the Set. And as Dietrich Epp pointed out, in a balanced tree the insertion path has length O(log n). (Sorry for omitting that important fact.)

Let's say your Tree type looks like this:

 data Tree a = Node a (Tree a) (Tree a) | Leaf 

... and let's say you have a Tree that looks like

 let t = Node 10 tl (Node 15 Leaf tr') 

... where tl and tr' are some named subtrees. Now say you want to insert 12 into this tree. Well, the result will look something like this:

 let t' = Node 10 tl (Node 15 (Node 12 Leaf Leaf) tr') 

The subtrees tl and tr' are shared between t and t', and you only needed to build 3 new Nodes, even though the size of t could be much larger than 3.
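To make the sharing concrete, here is a minimal sketch of path-copying insertion for the Tree type above (unbalanced, no rotations; the function name insert is my own choice, not Data.Set's internal one):

```haskell
-- The Tree type from the answer, with derived instances for testing.
data Tree a = Node a (Tree a) (Tree a) | Leaf
  deriving (Eq, Show)

-- Insert by rebuilding only the nodes on the search path.
-- The subtree on the other side of each comparison is reused as-is:
-- the new tree holds a pointer to it, not a copy.
insert :: Ord a => a -> Tree a -> Tree a
insert x Leaf = Node x Leaf Leaf
insert x t@(Node y l r)
  | x < y     = Node y (insert x l) r   -- r is shared, not copied
  | x > y     = Node y l (insert x r)   -- l is shared, not copied
  | otherwise = t                       -- already present: share the whole tree
```

With t = Node 10 Leaf (Node 15 Leaf Leaf), evaluating insert 12 t rebuilds exactly the two nodes on the path (10 and 15) plus one fresh leaf node for 12, matching the t' shown above.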


EDIT: Rebalancing

Regarding rebalancing, think of it this way (and note that I am not pretending to be rigorous here). Say you have an empty tree. Already balanced! Now say you insert an element. Already balanced! Now say you insert another element. Well, there is an odd one out now, so there is not much you can do.

Here is the hard part. Say you insert another element. This can go one of two ways: left or right; balanced or unbalanced. If it is unbalanced, you can clearly rotate the tree to balance it. If it is balanced: already balanced!

The important thing to note is that you are constantly rebalancing. It is not as if you had a messy tree, decided to insert an element, rebalanced before doing so, and then left a mess behind once the insertion was done.

Now say you keep inserting elements. The tree will become unbalanced, but not by much. And when that happens, first, you fix it immediately; second, the fix occurs along the insertion path, which has length O(log n) in a balanced tree. The rotations in the paper you linked to touch no more than three nodes per rotation, so you do O(3 * log n) work when rebalancing. That is still O(log n).
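As an illustration of "a rotation touches no more than three nodes", here is a sketch of a single left rotation of the kind used in AVL-style rebalancing (this is an assumed simplification, not the exact code from the paper):

```haskell
data Tree a = Node a (Tree a) (Tree a) | Leaf
  deriving (Eq, Show)

-- A single left rotation: only the nodes x and y are rebuilt;
-- the subtrees a, b and c are shared unchanged, however large they are.
rotateLeft :: Tree a -> Tree a
rotateLeft (Node x a (Node y b c)) = Node y (Node x a b) c
rotateLeft t = t  -- nothing to rotate
```

Because only a constant number of nodes is rebuilt per rotation, performing one rotation at each level of the insertion path keeps the total work at O(log n).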


To add extra emphasis to what dave4420 said in a comment, there are no compiler optimizations involved in (:) running in constant time. You could implement your own list data type and run it in a simple non-optimizing Haskell interpreter, and it would still be O(1).

A list is defined as a head element plus a list (or it is empty, in the base case). Here is a definition equivalent to the built-in lists:

 data List a = Nil | Cons a (List a) 

So if you have an element and a list, and you want to build a new list from them with Cons, that is just creating a new data structure directly from the arguments the constructor requires. There is no more need to even examine the tail list (let alone copy it) than there is to examine or copy a string when you do something like Person "Fred".
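A small sketch using the List type defined above (the names xs and ys are my own, for illustration):

```haskell
data List a = Nil | Cons a (List a)
  deriving (Eq, Show)

xs :: List Int
xs = Cons 2 (Cons 3 Nil)

-- Building ys allocates exactly one new Cons cell holding 1 and a
-- pointer to xs. The tail of ys *is* xs, not a copy of it, which is
-- why this is O(1) by the definition of the type, not by optimization.
ys :: List Int
ys = Cons 1 xs
</imports>
```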

You are simply mistaken when you say that this is a compiler optimization rather than a language-level one. This behavior follows directly from the language-level definition of the list type.

Similarly, for a tree defined as an element plus two trees (or an empty tree): when you insert an element into a non-empty tree, it must go into either the left or the right subtree. You need to construct a new version of that subtree containing the element, which means you need to build a new parent node containing the new subtree. But the other subtree does not need to be traversed at all; it can be placed into the new parent node as is. In a balanced tree, that is a full half of the tree that can be shared.

Applying this argument recursively should show you that no data elements actually need to be copied; only new parent nodes are needed along the path to the inserted element's final position. Each new node stores three things: an element (shared with the element reference in the original tree), an unmodified subtree (shared with the original tree), and a newly created subtree (which in turn shares almost all of its structure with the original tree). There will be O(log n) of those in a balanced tree.
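The "O(log n) new parent nodes" claim can be checked directly. Here is a sketch (the name insertCounted is hypothetical) of an insertion that also reports how many new nodes it built, which equals the length of the search path:

```haskell
data Tree a = Node a (Tree a) (Tree a) | Leaf
  deriving (Eq, Show)

-- Insert and count the freshly built Nodes. Everything not counted
-- here is shared with the original tree rather than copied.
insertCounted :: Ord a => a -> Tree a -> (Tree a, Int)
insertCounted x Leaf = (Node x Leaf Leaf, 1)
insertCounted x t@(Node y l r)
  | x < y     = let (l', n) = insertCounted x l in (Node y l' r, n + 1)
  | x > y     = let (r', n) = insertCounted x r in (Node y l r', n + 1)
  | otherwise = (t, 0)  -- element already present: nothing new built
```

For a balanced tree of n elements the count is bounded by the tree's height plus one, i.e. O(log n), regardless of n.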

