EDIT / REWRITE:
If you want books, Michael Abrash did good work in this area: Zen of Assembly Language, a number of magazine articles, the big Graphics Programming Black Book, etc. Much of what he was tuning for is no longer a problem; the problems have changed. What you will get out of it, though, is a feel for the kinds of things that cause bottlenecks and the kinds of ways to solve them. The most important thing is to time everything, and to understand how your timing measurements work so that you are not fooling yourself by measuring incorrectly. Time the different solutions, and try crazy, weird solutions too; you may find an optimization that you were not aware of and would not have understood until you exposed it.
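As a rough illustration of "time everything", here is a minimal Python sketch that times three candidate solutions to the same small task. The helper `best_time` and the three `sum_*` functions are made up for this example; the only library call is the standard `time.perf_counter`. It repeats each measurement and keeps the best run, on the assumption that the slower runs are mostly noise from the rest of the system.

```python
import time


def best_time(func, repeats=5, loops=1000):
    """Return the best observed per-call time in seconds.

    Each repeat times `loops` back-to-back calls so that the timer's own
    overhead and resolution matter less; the minimum over the repeats is
    kept because slower runs usually include unrelated system noise.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(loops):
            func()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed / loops)
    return best


# Three hypothetical solutions to the same task: sum the integers 0..999.
def sum_loop():
    total = 0
    for i in range(1000):
        total += i
    return total


def sum_builtin():
    return sum(range(1000))


def sum_formula():
    # The "weird" solution: no loop at all, just the closed form n*(n-1)/2.
    return 999 * 1000 // 2


if __name__ == "__main__":
    for candidate in (sum_loop, sum_builtin, sum_formula):
        per_call_us = best_time(candidate) * 1e6
        print(f"{candidate.__name__}: {per_call_us:.2f} us per call")
```

The actual numbers depend entirely on your machine, so none are shown here. The point is the habit: check that the candidates give the same answer, measure each one the same way, and know what your timer is really measuring.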
I have only just started reading it, but See MIPS Run (early / first edition) looks pretty good so far (note that ARM has overtaken MIPS as the processor market leader, so the MIPS and RISC hype is a bit dated). There are a number of textbooks, old and new, about MIPS. MIPS was designed for performance (at the expense of the software engineer in some ways).
The bottlenecks today fall into the categories of the processor itself, the I/O around it, and whatever is attached to that I/O. The insides of the processor chips themselves (for higher-end systems) run much faster than the I/O can handle, so you can only tune so far before you have to go off chip and wait forever. Getting from the train to your destination half a minute faster, when the train ride itself took 3 hours, is not necessarily a worthwhile optimization.
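The train example is essentially Amdahl's law: speeding up a small part of the total time barely changes the total. A small Python sketch of the arithmetic follows. The 3-hour ride and the 30 seconds saved come from the example above; the 10-minute walk from the train to the destination is a number I made up so there is something to compute with, and `overall_speedup` is just a name chosen for this sketch.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law.

    fraction      -- share of the total time spent in the part being optimized
    local_speedup -- how many times faster that part becomes
    """
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


ride_s = 3 * 60 * 60   # 3-hour train ride, from the example
walk_s = 10 * 60       # ASSUMPTION: a 10-minute walk after the train
saved_s = 30           # half a minute saved on the walk, from the example

total_s = ride_s + walk_s
fraction = walk_s / total_s                  # share of the trip that is walking
local_speedup = walk_s / (walk_s - saved_s)  # how much faster the walk got

trip_speedup = overall_speedup(fraction, local_speedup)

# Upper bound: even if the walk took no time at all, the ride still dominates.
limit = total_s / ride_s

if __name__ == "__main__":
    print(f"walk is {fraction:.1%} of the trip")
    print(f"whole trip is {trip_speedup:.4f}x faster")
    print(f"best possible from the walk alone: {limit:.4f}x")
```

Under those assumptions the whole trip gets well under one percent faster, and even eliminating the walk entirely would only buy a few percent. The same reasoning applies to tuning on-chip code when most of the time is spent waiting on I/O.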
It is all about learning the hardware. You can probably stay within the world of ones and zeros and not have to get into the actual electronics, but without really knowing the interfaces and internals you cannot do much performance tuning. You might rearrange or change a few instructions and get a little boost, but to make something several hundred times faster you need more than that. Learning a lot of different instruction sets (assembly languages) helps you get into the processors. I would recommend simulating HDL, for example the processors at opencores, to get a feel for how some folks do their designs and to get a firm handle on how to really squeeze the clock cycles out of a task. Processor knowledge is big, memory interfaces are a huge deal and need to be learned, and so do storage media (flash, hard disks, etc.), displays and graphics, networking, and all the types of interfaces between all of those things. Understanding all of this at the clock level, or as close to it as you can get, is what it takes.