Do not try to align sp by hand; instead, push one more register to get the alignment. For example, instead of
push {r3, r4, lr}
add another register to the list, so that the pushed size is a multiple of 8 bytes:
push {r1, r3, r4, lr}
This may look like an extra memory access, but in general caches work with units wider than the native word size, so the additional store usually costs little.
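The arithmetic behind the extra register can be sketched in C. This is only an illustration, assuming 4-byte core registers and the 8-byte alignment mentioned above; the helper name extra_regs_for_alignment and both constants are made up for this example and are not part of any toolchain.

```c
/* Assumption: each pushed core register occupies 4 bytes (32-bit ARM). */
#define REG_BYTES 4u
/* Assumption: the stack must stay aligned to 8 bytes, as in the text above. */
#define STACK_ALIGN 8u

/* Hypothetical helper: how many extra registers must be added to a push
   list of reg_count registers so that the total pushed size becomes a
   multiple of STACK_ALIGN. */
static unsigned extra_regs_for_alignment(unsigned reg_count)
{
    unsigned bytes = reg_count * REG_BYTES;      /* total size of the push */
    unsigned remainder = bytes % STACK_ALIGN;    /* bytes past the last boundary */

    if (remainder == 0) {
        return 0;                                /* already aligned */
    }
    return (STACK_ALIGN - remainder) / REG_BYTES; /* padding, in registers */
}
```

For the three-register list {r3, r4, lr} the helper returns 1, which corresponds to adding r1; for the four-register list {r1, r3, r4, lr} it returns 0. In practice this boils down to keeping the number of pushed registers even.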
One more note: you do not need to keep the stack correctly aligned if you are neither making external calls nor receiving them. So if you have a closed-box assembly routine that never calls out to the outside world and is never called from it, you can live with a misaligned stack, as long as it does not bite your own loads.