I see four basic approaches you can take. They apply not only to C++ but also to most other low-level languages such as C, which make uninitialized access possible but not permitted, and the last one even applies to safer, higher-level languages.
Ignore the standard, implement it in the usual way.
This is the one that makes the language lawyers craziest! Don't worry, though - the solutions that follow don't break any rules, so feel free to skip this one if you're a stickler for the rules.
The standard leaves the use of most uninitialized values undefined, and the few loopholes it allows (for example, copying one indeterminate value to another) don't really give you enough rope to actually implement what you want - even in C, which is slightly less restrictive (see, for example, this answer covering C11, which explains that while accessing an indeterminate value may not directly trigger UB, the result is also indeterminate, and indeed the value may appear to change from access to access).
So you just implement it anyway, knowing that most or all current compilers will simply compile it to the expected code, and accepting that your code is not standards-compliant.
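For reference, here is a minimal sketch of the kind of structure under discussion (the names and member layout are my guesses, chosen to line up with the is_member assembly shown later; exact details are assumptions). The point is that sparse is deliberately left uninitialized, which is exactly the non-conforming part:

    #include <cstddef>

    struct sparse_set {
        size_t  n = 0;    // number of elements currently in the set
        size_t* dense;    // dense[0..n) lists the members, in insertion order
        size_t* sparse;   // sparse[i] = position of i within dense, if present

        explicit sparse_set(size_t max_elem)
            : dense(new size_t[max_elem]),
              sparse(new size_t[max_elem]) {}   // both deliberately uninitialized

        // Reads sparse[i] even if it was never written: works in practice,
        // but is not sanctioned by the standard. A garbage sparse[i] either
        // fails the range check or points at a dense slot whose round-trip
        // check fails, so no false positives occur.
        bool is_member(size_t i) const {
            return sparse[i] < n && dense[sparse[i]] == i;
        }

        void add(size_t i) {
            if (!is_member(i)) {
                dense[n] = i;
                sparse[i] = n++;
            }
        }

        void clear() { n = 0; }   // O(1): old entries simply become stale again
    };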
At least in my testing, gcc, clang and icc all declined to take advantage of the invalid access to do anything crazy. Of course, the test is not exhaustive, and even if you could construct one, the behavior could change in a new compiler version.
It would be safest if the implementation of the methods that access uninitialized memory were compiled once, in a separate compilation unit - that makes it easy to verify that it does the right thing (just check the assembly once), and makes it nearly impossible (outside of LTO) for the compiler to do anything tricky, since it can't prove that the accessed values are uninitialized.
Still, this approach is theoretically unsafe, and you should check the compiled output very carefully and keep additional safeguards in place if you take it.
If you take this approach, tools like valgrind are likely to report an uninitialized read error.
Now, these tools work at the assembly level, and some uninitialized reads may be fine (see, for example, the note about standard libraries in the next paragraph), so they don't actually report an uninitialized read immediately; instead they use a variety of heuristics to determine whether invalid values are actually used. For example, they may hold off on reporting an error until they determine that the uninitialized value is used to decide the direction of a conditional branch, or in some other action that the heuristic can't track or undo. You may be able to get the compiler to emit code that reads uninitialized memory but is still safe according to this heuristic.
Most likely you won't manage that (since the logic here is fairly subtle, depending as it does on the relationship between the values in the two arrays), so you can instead use the suppression options in your tool of choice to hide the errors. For example, valgrind can suppress based on the stack trace - and in fact many such suppression entries are already used by default to hide false positives in various standard libraries.
Since this works based on stack traces, you will probably run into difficulty if the reads occur in inlined code, since the top of the stack will then be different for every call site. You can avoid this by making sure the function is not inlined.
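For example (the attribute spelling is compiler-specific; this is the gcc/clang form, and MSVC uses __declspec(noinline)):

    // Keep the uninitialized reads inside one non-inlined function, so a
    // single stack-trace-based suppression entry matches every caller.
    __attribute__((noinline))
    bool sparse_set::is_member(size_t i) {
        return sparse[i] < n && dense[sparse[i]] == i;
    }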
Use assembly
What is poorly defined in the standard is usually well-defined at the assembly level. It's part of why the compiler and standard library can often do things faster than you could achieve in C or C++: a libc routine written in assembly is already targeting a specific architecture and doesn't have to worry about the various caveats in the language specification that exist to make things run fast on varied hardware.
Normally, implementing any serious amount of code in assembly is costly, but here it's only a handful of routines, so it may be feasible depending on how many platforms you target. You don't even need to write the methods yourself - just compile the C++ version (or use godbolt) and copy the assembly. The is_member function, for example¹, looks like this:
    sparse_array::is_member(unsigned long):
            mov     rax, QWORD PTR [rdi+16]
            mov     rdx, QWORD PTR [rax+rsi*8]
            xor     eax, eax
            cmp     rdx, QWORD PTR [rdi]
            jnb     .L1
            mov     rax, QWORD PTR [rdi+8]
            cmp     QWORD PTR [rax+rdx*8], rsi
            sete    al
    .L1:
            ret
Rely on calloc magic
When you use calloc², you are explicitly asking for zeroed memory from the underlying allocator. Now, a merely correct version of calloc could simply call malloc and then zero the returned memory, but actual implementations rely on the fact that the OS-level memory allocation routines (sbrk and mmap, for the most part) return zeroed memory on any OS with protected memory (i.e., all the big ones), and so avoid zeroing the memory again.
In practice, for large allocations, this is typically satisfied by an anonymous mmap (or similar call) that maps in a special page of all zeros. When (and if) the memory is written, copy-on-write kicks in and assigns a fresh page. So allocating large zeroed regions can be free, since the OS needs to provide zeroed pages anyway.
In that case, implementing your sparse set on top of calloc can be just as fast as the nominally uninitialized version, while being safe and standards-compliant.
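As a sketch (assuming the same layout as the structure above), the only change is in construction. Strictly speaking, only sparse needs calloc, since dense is only ever read at positions that were already written:

    #include <cstdlib>

    sparse_set::sparse_set(size_t max_elem)
        : dense(static_cast<size_t*>(std::malloc(max_elem * sizeof(size_t)))),
          // Zero-filled: while n == 0, sparse[i] < n always fails, and after
          // insertions the dense[sparse[i]] == i round-trip still rejects
          // stale zeros, so is_member is correct with no UB at all.
          sparse(static_cast<size_t*>(std::calloc(max_elem, sizeof(size_t)))) {}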
Calloc Cautions
You should, of course, verify that your calloc behaves as expected. The optimized behavior usually only kicks in when your program allocates a lot of long-lived zeroed memory roughly "up front". That is, the typical logic of an optimized calloc is something like this:
    calloc(N):
        if (can satisfy a request for N bytes from
            already-allocated-then-freed memory)
            memset those bytes to zero and return them
        else
            ask the OS for memory, and return it directly,
            since it comes already zeroed
Basically, the malloc infrastructure (which also underlies new and friends) has a (possibly empty) pool of memory it has already requested from the OS, and generally tries to satisfy allocations from there first. This pool consists of memory left over from the last block requested from the OS but never handed out (e.g., because the user asked for 32 bytes but the allocator grabs chunks from the OS in 1 MB blocks, so plenty remains), as well as memory that was handed to the process but then returned via free or delete or whatever. The memory in this pool holds arbitrary values, and if a calloc can be satisfied from it, you don't get your magic, since the zeroing has to happen explicitly.
On the other hand, if the memory has to be allocated from the OS, you get the magic. So it depends on your use case: if you frequently create and destroy sparse_set objects, you will generally be drawing from the internal malloc pools and paying the zeroing cost. If you have long-lived sparse_set objects that take up a lot of memory, they were likely allocated by asking the OS, and you get the zeroing nearly for free.
The good news is that if you don't want to rely on the calloc behavior above (indeed, on your OS or with your allocator it might not even be optimized that way), you can usually replicate it by mmaping /dev/zero manually for your allocations. On OSes that offer it, this guarantees you get the "cheap" behavior.
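A POSIX sketch of what that might look like (details assumed; on systems without /dev/zero, an anonymous mapping via MAP_ANONYMOUS is the usual equivalent):

    #include <cstddef>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    // Returns `bytes` of lazily materialized, zero-filled memory: every page
    // aliases the kernel's zero page until first written (copy-on-write).
    void* alloc_zeroed(size_t bytes) {
        int fd = open("/dev/zero", O_RDWR);
        if (fd < 0) return nullptr;
        void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
        close(fd);   // the mapping remains valid after closing the descriptor
        return p == MAP_FAILED ? nullptr : p;
    }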
Use lazy initialization
For a completely platform-agnostic solution, you can simply use one more array that tracks the initialization state of the main array.
First, you choose some granule at which you will track initialization, and use a bitmap in which each bit tracks the initialization state of one granule of the sparse array.
For example, say you choose a granule of 4 elements, and the elements of your array are 4 bytes each (e.g., int32_t values): you need 1 bit to track every 4 elements * 4 bytes/element * 8 bits/byte = 128 bits, an overhead of less than 1%³ in allocated memory.
Now you simply check the corresponding bit in this bitmap before accessing sparse. This adds a small cost to accesses of the sparse array, but doesn't change the overall complexity, and the check is still quite fast.
For example, your is_member function now looks like:
    bool sparse_set::is_member(size_t i) {
        bool init = is_init[i >> INIT_SHIFT] & (1UL << (i & INIT_MASK));
        return init && sparse[i] < n && dense[sparse[i]] == i;
    }
The generated x86 assembly (gcc) now starts with:
            mov     rax, QWORD PTR [rdi+24]
            mov     rdx, rsi
            shr     rdx, 8
            mov     rdx, QWORD PTR [rax+rdx*8]
            xor     eax, eax
            bt      rdx, rsi
            jnc     .L2
            ...
    .L2:
            ret
That's everything associated with the bitmap check. It should all be quite fast (and often off the critical path, since it isn't part of the data flow).
In general, the cost of this approach depends on the density of your set and on which functions you call - is_member is about the worst case for it, since some functions (e.g., clear) aren't affected at all, and others (e.g., iterate) can batch the is_init check and perform it only once per INIT_COVERAGE elements (meaning the overhead would again be ~1% for the example values).
Sometimes this approach will be faster than the approach in the OP's link, especially when handling elements that aren't in the set - in that case the is_init check fails and short-circuits the remaining code, and your working set is then much smaller (256 times smaller, with the example granule size) than the sparse array itself, so you may greatly reduce misses to DRAM or the outer cache levels.
The granule size itself is an important tunable for this approach. Intuitively, a larger granule size pays a larger initialization cost when an element covered by the granule is first accessed, but saves memory and up-front is_init initialization cost. You can come up with a formula that finds the optimum in the simple case, but the behavior also depends on the "clustering" of values and other factors. Finally, it can even make sense to use a dynamic granule size to cover your bases under varied workloads, but that comes at the cost of variable shifts.
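To make the bookkeeping concrete, here is a hypothetical sketch of the insertion path (my code, not from the original answer, assuming an is_init word array added alongside sparse and dense). For simplicity it tracks one bit per element; the multi-element granule variant would additionally zero the granule's slice of sparse on first touch so that neighboring bits can share a granule:

    void sparse_set::add(size_t i) {
        bool init = is_init[i >> INIT_SHIFT] & (1UL << (i & INIT_MASK));
        if (init && sparse[i] < n && dense[sparse[i]] == i)
            return;                                  // already a member
        dense[n] = i;
        sparse[i] = n++;                             // sparse[i] is now defined
        is_init[i >> INIT_SHIFT] |= 1UL << (i & INIT_MASK);
    }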
A really lazy solution
It's worth noting that there is a similarity between the calloc and lazy-init solutions: both lazily initialize blocks of memory as needed, but the calloc solution tracks this implicitly in hardware, through MMU magic with page tables and the TLB, while the lazy-init solution does it in software, with a bitmap explicitly tracking which granules have been initialized.
The advantage of the hardware approach is that it is nearly free (for the "hit" case, anyway), since it piggybacks on the CPU's existing virtual-memory support to detect misses, while the advantage of the software approach is that it's portable and gives you precise control over the granule size, etc.
You can actually combine these approaches, making a lazy approach that uses no bitmap and doesn't even need the dense array: just allocate the sparse array with mmap as PROT_NONE, so you take a fault when reading from an unmapped page of the sparse array. You catch the fault and map in a page of the sparse array filled with a sentinel value indicating "not present" for every element.
This is the fastest of all for the "hot" case: you no longer need any of the ... && dense[sparse[i]] == i checks at all.
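Here's a hedged sketch of that combined scheme (all names and parameters are my assumptions; real code would also verify that the faulting address lies inside the sparse array before handling it):

    #include <csignal>
    #include <cstdint>
    #include <cstdio>
    #include <sys/mman.h>

    constexpr size_t MAX_ELEM = 1u << 20;   // hypothetical capacity
    constexpr size_t SENTINEL = SIZE_MAX;   // "not present" marker
    constexpr size_t PAGE_SZ  = 4096;

    static size_t* sparse;

    // First touch of an unmapped page: make it accessible, fill it with the
    // sentinel, and return, so the faulting load is re-executed and succeeds.
    static void on_fault(int, siginfo_t* info, void*) {
        auto addr = reinterpret_cast<uintptr_t>(info->si_addr);
        auto page = reinterpret_cast<size_t*>(addr & ~(PAGE_SZ - 1));
        mprotect(page, PAGE_SZ, PROT_READ | PROT_WRITE);
        for (size_t j = 0; j < PAGE_SZ / sizeof(size_t); j++)
            page[j] = SENTINEL;
    }

    int main() {
        // Reserve the sparse array with no access rights: any read faults.
        sparse = static_cast<size_t*>(mmap(nullptr, MAX_ELEM * sizeof(size_t),
                     PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));

        struct sigaction sa = {};
        sa.sa_sigaction = on_fault;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, nullptr);

        // Membership is now a single comparison per lookup.
        std::printf("is_member(123)? %d\n", sparse[123] != SENTINEL);
    }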
Disadvantages:
- Your code is now really non-portable, since you have to implement the fault-handling logic, which is usually platform-specific.
- You can't control the granule size: it is at page granularity (or some multiple thereof). If your set is very sparse (say, fewer than 1 in 4096 elements occupied) and uniformly distributed, you end up paying a large initialization cost, since you take a fault and initialize a full page of values for every element.
- Misses (i.e., non-inserting accesses for elements that don't exist) either need to map in a page even though no elements exist in that range, or they will be very slow (taking a fault) every time.
¹ This implementation has no "range check" - that is, it doesn't check whether i is less than MAX_ELEM - depending on your use case, you may want to check this. My implementation used a template parameter for MAX_ELEM, which may lead to slightly faster code, but also to more bloat, and you'd do fine just making the maximum size a class member.
² Really, the only requirement is that you use something that calls calloc under the covers or performs the equivalent zero-fill optimization, but based on my tests, more idiomatic C++ approaches like new int[size]() just do the allocation followed by a memset. gcc does optimize malloc followed by memset into calloc, but that doesn't help if you are trying to avoid the use of C routines in the first place!
³ Equivalently, you need 1 extra bit to track every 128 bits of the sparse array.