The built-in memcpy implementation tends to be heavily optimized for the target platform, so it will usually be faster than a naive for loop.
Typical optimizations include copying as much as possible at a time (not single bytes but whole words, or even more if the processor supports it), some degree of loop unrolling, etc. Of course, the best optimization strategy depends on the platform, so it is usually better to stick with the built-in function.
In most cases, it was also written by people with more experience than the person calling it.
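To make the difference concrete, here is a rough sketch of a byte-by-byte loop next to a hand-written copy that moves 8-byte words and is unrolled by two. The names copy_bytes and copy_words are made up for this illustration, and this is not how any particular libc implements memcpy; real implementations are usually tuned per CPU and often use vector instructions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Naive version: copies one byte per loop iteration. */
static void copy_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
}

/* Illustrative word-at-a-time version (hypothetical, not a libc routine).
 * Each iteration moves two 8-byte words, i.e. the loop is unrolled by two.
 * The fixed-size memcpy calls are only a portable way to load and store a
 * uint64_t without alignment or aliasing problems; optimizing compilers
 * typically turn each of them into a single move instruction. */
static void copy_words(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    const size_t w = sizeof(uint64_t);

    while (n >= 2 * w) {
        uint64_t a, b;
        memcpy(&a, s, w);       /* load first word   */
        memcpy(&b, s + w, w);   /* load second word  */
        memcpy(d, &a, w);       /* store first word  */
        memcpy(d + w, &b, w);   /* store second word */
        s += 2 * w;
        d += 2 * w;
        n -= 2 * w;
    }

    /* Copy whatever is left (fewer than 16 bytes) one byte at a time. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
}
```

Both functions produce the same result as memcpy(dst, src, n) for non-overlapping buffers. Whether copy_words actually beats copy_bytes depends on the compiler and flags, since modern compilers often vectorize the simple loop on their own, which is one more reason to just call memcpy.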