First and foremost, the C language has no speed. Speed is an attribute introduced by implementations of C. For example, the GCC compiler generates code that may differ in speed from the code generated by the Clang compiler, and both of them generate code that behaves the same as, yet runs very differently from, a program executed by the Cint or Ch interpreters. All of these are implementations of C. Some of them are slower than others, but the speed can't be attributed to C either way!
Section 6.3.2.1 of the C standard says:
Except when it is the operand of the sizeof operator, the _Alignof operator, or the unary & operator, or is a string literal used to initialize an array, an expression that has type "array of type" is converted to an expression with type "pointer to type" that points to the initial element of the array object and is not an lvalue.
This should be an indication that both *(ap+1) and a[1] in your code are pointer operations. The conversion takes place at compile time in Visual Studio, so it should have no impact at runtime.
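To make the conversion concrete, here is a minimal sketch (the array contents are made up; ap mirrors the pointer name used above) showing that an array name decays to a pointer to its first element in ordinary expressions, with sizeof and unary & as the notable exceptions:

    #include <stdio.h>

    int main(void)
    {
        int a[4] = {1, 2, 3, 4};

        /* In most expressions, "a" is converted to a pointer to its first element. */
        int *ap = a;                              /* equivalent to &a[0] */

        /* The conversion does not happen for sizeof or unary &, per 6.3.2.1. */
        printf("sizeof a  = %zu\n", sizeof a);    /* size of the whole array */
        printf("sizeof ap = %zu\n", sizeof ap);   /* size of a pointer */

        /* Both of these go through the same pointer arithmetic. */
        printf("a[1]      = %d\n", a[1]);
        printf("*(ap + 1) = %d\n", *(ap + 1));
        return 0;
    }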
Section 6.5.2.1, regarding "array subscripting", says:
One of the expressions shall have type "pointer to complete object type", the other expression shall have integer type, and the result has type "type".

This tells us that the array subscript operator is really a pointer operation ...
This confirms that ap[1] is indeed a pointer operation, as we surmised earlier. By run time, however, the array has already been converted to a pointer, so the performance should be identical.
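As a small illustration (the values are made up), the standard defines E1[E2] as (*((E1) + (E2))), so every spelling below denotes the same element and a compiler will typically emit identical code for all of them, which is exactly why the two forms should cost the same at run time:

    #include <stdio.h>

    int main(void)
    {
        int a[4] = {10, 20, 30, 40};
        int *ap = a;

        /* E1[E2] is defined as (*((E1) + (E2))), so all of these are the
           same pointer operation on the same element. */
        printf("%d %d %d %d\n", a[1], *(a + 1), ap[1], *(ap + 1));

        /* Because that definition is symmetric in E1 and E2, even this is valid C. */
        printf("%d\n", 1[a]);
        return 0;
    }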
... so why are they not identical?
What are the characteristics of the OS you are using? Isn't it a multitasking, multi-user OS? Suppose the OS happens to let the first loop complete without interruption, but then interrupts the second loop and switches control to another process. Wouldn't that interruption skew your experiment? How do you measure the frequency and duration of the interruptions caused by task switching? Note that this will differ from one OS to another, and the OS is part of the implementation.
What are the characteristics of the CPU you are using? Does it have its own fast internal cache for machine code? Suppose your entire first loop, timing mechanism and all, happens to fit nicely in the code cache, but the second loop gets cut off. Wouldn't that lead to a cache miss and a lengthy wait while your CPU fetches the rest of the code from RAM? How do you measure the delays caused by cache misses? Note that this will differ from one CPU to another, and the CPU is part of the implementation.
These questions should raise others, such as "Is this micro-optimization benchmark a significant or important problem?" The success of an optimization depends on the size and complexity of the problem. Find an important problem, solve it, profile the solution, optimize it and profile again. That way you can give meaningful figures for how much faster the optimized version is. Your boss will be much happier with you, provided you don't disclose that the optimization probably only matters for your implementation, as I mentioned earlier. I am sure you will find that the least of your worries will be array subscripting versus pointer dereferencing.
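If you still want numbers you can trust from a micro-benchmark like this, the sketch below may help. It is only an outline under assumptions of my own: the loop bodies sum_subscript and sum_pointer, the array data, and the constants N and RUNS are placeholders for whatever you are actually measuring. Interleaving and repeating the runs and keeping the best time makes it far less likely that a single context switch or cold cache decides the outcome.

    #include <stdio.h>
    #include <time.h>

    #define N    10000000
    #define RUNS 5

    static int data[N];   /* placeholder data set, zero-initialized */

    /* Placeholder loop bodies standing in for the two versions being compared. */
    static long sum_subscript(void)
    {
        long s = 0;
        for (int i = 0; i < N; i++)
            s += data[i];
        return s;
    }

    static long sum_pointer(void)
    {
        long s = 0;
        for (int *p = data; p < data + N; p++)
            s += *p;
        return s;
    }

    /* Time one call of fn and keep its result so the work is actually used. */
    static double time_one(long (*fn)(void), long *out)
    {
        clock_t start = clock();
        *out = fn();
        return (double)(clock() - start) / CLOCKS_PER_SEC;
    }

    int main(void)
    {
        double best_sub = 1e9, best_ptr = 1e9;
        long r1 = 0, r2 = 0;

        /* Interleave and repeat the measurements, keeping the best of several
           runs, so one context switch or cold cache does not decide the result. */
        for (int run = 0; run < RUNS; run++) {
            double t1 = time_one(sum_subscript, &r1);
            double t2 = time_one(sum_pointer, &r2);
            if (t1 < best_sub) best_sub = t1;
            if (t2 < best_ptr) best_ptr = t2;
        }

        printf("subscript: %f s (sum %ld)\n", best_sub, r1);
        printf("pointer:   %f s (sum %ld)\n", best_ptr, r2);
        return 0;
    }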