Implementing your own memcpy (size in bytes?)

I recently got an interview question where I had to implement memcpy. I have used memcpy a lot, so this did not seem like a difficult problem.

So, I started implementing a loop to copy one address at a time from pointer to pointer, something like this:

void memcpy(void* dest, void* src, int size) {
    char* d = (char*) dest; // cast first: a void* cannot be indexed in standard C
    char* s = (char*) src;
    for (int index = 0; index < size; index++) {
        d[index] = s[index];
    }
}

However, the interviewers interrupted, noting that the memcpy man page says it "copies n bytes from src to dest" (which I confirmed later). They then wanted me to iterate size / 4 times instead, and handle the remainder with a second loop over size % 4 (I assume they meant a 32-bit system?).

Well, that seemed strange, since I had used memcpy for many years without any problems and without ever multiplying or dividing the size by 4. When I got home, I opened gdb, copied a small "hello" string, and tried passing the size using both strlen() and constants to see where the copy starts and stops.

  char* src = "hello";
  char* dest = calloc(16, sizeof(char));
  int len = strlen(src);
  memcpy(dest, src, len); // both my version and the official version

I then examined src and dest carefully in gdb; both contained "hello\0".

So my question is: what am I not understanding about the factor of 4 (or "size in bytes")? And why does the documentation say "n bytes" when that does not seem to be the actual behavior? What am I missing here?

+8
c++ c
7 answers

They asked you to optimize your implementation by copying a 32-bit word at a time inside the loop instead of a byte at a time. This would require careful handling of boundary cases, such as size not being a multiple of 4, or dest or src not being aligned on a 4-byte boundary.

+12

As others have said, copying 4 bytes at a time is faster than copying 1 byte at a time. The interviewer wanted you to do something like this:

 #include <stdint.h> // for uint8_t / uint32_t

 void memcpy(void* dest, void* src, int size)
 {
     uint8_t *pdest = (uint8_t*) dest;
     uint8_t *psrc = (uint8_t*) src;
     int loops = size / sizeof(uint32_t);
     for (int index = 0; index < loops; ++index)
     {
         *((uint32_t*)pdest) = *((uint32_t*)psrc);
         pdest += sizeof(uint32_t);
         psrc += sizeof(uint32_t);
     }
     loops = size % sizeof(uint32_t);
     for (int index = 0; index < loops; ++index)
     {
         *pdest = *psrc;
         ++pdest;
         ++psrc;
     }
 }
+11

Your memcpy's logic is correct, and your interviewer was not asking you to change its behavior or add a constraint. Copying 4 bytes at a time is faster, but it becomes a problem if the size is not a multiple of 4. That is why the interviewer told you to use two loops: the first copies 4 bytes at a time, and the second copies one byte at a time (it will iterate at most 3 times).

Thus, the bulk of the copy is done with the fast 4-byte copy, but you are not limited to sizes that are a multiple of 4, because the second cleanup loop handles whatever is left over.

1st loop: copy a uint32_t at a time and advance the pointers by 4
2nd loop: copy a uint8_t at a time and advance the pointers by 1

+1

The interviewer was testing your knowledge of computer architecture and wanted you to optimize your algorithm. Memory works in words, not bytes. On a 32-bit system a word is usually 4 bytes, and it takes as long to read/write 1 byte as it does to read/write 1 word. The second loop handles the case where the number of bytes you want to copy is not evenly divisible by 4.

What you really want is 3 loops: two loops for the bytes near dest and near dest + size, for when either end is not word-aligned, and another loop for all the whole words in between.

You can optimize even further with architecture-specific instructions. If you are interested, check out this article: http://www.eetimes.com/design/embedded/4024961/Optimizing-Memcpy-improves-speed

+1

The interviewers asked you to perform a premature optimization for one reason or another. That is usually a bad idea.

It is true that a 32-bit machine will copy one 32-bit word faster than it will copy 4 x 1 bytes. But there is more to optimization than that.

There is a high likelihood that a 32-bit machine will cache your data, and then suddenly memory access patterns may matter much more than the CPU instructions used. Caches have their own alignment preferences: they may favor a simple loop, or they may favor 32-bit-aligned chunks. I am not an expert on this, so I avoid premature optimization and leave it to the compiler, which hopefully knows more about the cache than I do.

Then there is CPU branch prediction. This particular code is fairly deterministic, so it may not be an issue here, but as a rule: simple code gives better branch prediction than complex code.

In addition, there is division, which is slow on many CPU architectures. Depending on the amount of data to copy, the divisions can make memcpy noticeably slower.

To summarize: manual optimization is quite complex and requires deep knowledge of the CPU and hardware. You cannot and should not "optimize for a 32-bit processor" in general; you need to know the specifics. In most cases, the compiler will optimize the code far more effectively than you can. The library memcpy(), in particular, is often written in inline assembler tuned for a specific target.

+1

They wanted you to speed it up. A 32-bit processor can copy 32 bits in the time it takes to copy 8 bits, so if you have 4 bytes to copy, rather than moving them one at a time you can move them all at once.

0

Check this.

 void myMemCpy(void *dest, void *src, size_t n)
 {
     // Typecast src and dest addresses to (char *)
     char *csrc = (char *)src;
     char *cdest = (char *)dest;

     // Copy contents of src[] to dest[]
     for (size_t i = 0; i < n; i++)
         cdest[i] = csrc[i];
 }


0
