Latency, Throughput and Hazards

In this document: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0301g/DDI0301G_arm1176jzfs_r0p7_trm.pdf

on pages 21–25 (PDF p. 875), the throughput and latency figures for the VFP assembly instructions are given.

Are these numbers independent of vectors?

1: Let’s take FMULS, with a throughput of 1 and a latency of 8. Does this mean that I can start a new FMULS every cycle, as long as I do not use a register that is still being computed by a previous instruction? E.g.:

    FMULS s8, s16, s20
    FMULS s12, s21, s25

Will the two results come out in back-to-back cycles?

2: What happens if I have two FMULS instructions in a row, where one operand of the second depends on the result of the first?

    FMULS s8, s16, s20
    FMULS s12, s21, s8

Will the VFP wait 8 cycles before processing the second instruction?

3: What if we are in vector mode with 4 elements, and in the second FMULS instruction all input registers except one are available? What happens then?

4: sqrt and division: will a sqrt or division operation prevent any subsequent operation from starting for 19 cycles?

Thanks!


The answers to all your questions are in the document that you linked. You just need to read it carefully.

Are these numbers independent of vectors?

No. See, for example, Table 21-15 in the document that you linked. Note the latency of the short-vector FADDS.

Does this mean that I can start a new FMULS operation every cycle if it does not depend on an earlier result that is not yet available?

Yes, that is the definition of throughput.

What happens if I have two FMULS instructions in a row, where one operand depends on the previous calculation?

Execution will stall until the result of the first FMULS is available. See section 21.6, "Operation of the scoreboards", for details.
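
To make the difference between throughput and latency concrete, here is a minimal sketch of a toy in-order issue model in Python. It is not cycle-accurate for the VFP11: it ignores forwarding, write-after-write hazards and short vectors, and the function name `issue_cycles` is hypothetical. The default throughput of 1 and latency of 8 are the FMULS figures from the question.

```python
# Toy in-order issue model. An assumption-laden sketch, NOT a model of the
# real VFP11 pipeline (no forwarding, no WAW hazards, no short vectors).
def issue_cycles(instrs, throughput=1, latency=8):
    """instrs: list of (dest, src1, src2) register names.
    Returns the cycle in which each instruction issues."""
    ready = {}       # register -> cycle at which its new value is available
    cycles = []
    next_slot = 0    # earliest cycle the pipeline accepts another instruction
    for dest, *srcs in instrs:
        # Wait for the issue slot and for every source operand to be ready.
        start = max([next_slot] + [ready.get(s, 0) for s in srcs])
        cycles.append(start)
        ready[dest] = start + latency
        next_slot = start + throughput
    return cycles

independent = [("s8", "s16", "s20"), ("s12", "s21", "s25")]
dependent = [("s8", "s16", "s20"), ("s12", "s21", "s8")]

print(issue_cycles(independent))  # [0, 1]
print(issue_cycles(dependent))    # [0, 8]
```

In this simplified model the independent pair issues in consecutive cycles, while the dependent FMULS is held back until s8 is ready. The exact stall length on real hardware is given by the tables in the manual.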

What if we are in vector mode with 4 elements, and in the second FMULS instruction all input registers except one are available? What will happen?

It will stall. Same reference.

sqrt and division: will the sqrt or division operation prevent any subsequent operation from starting for 19 cycles?

No. See Section 21.10, "Parallel execution". An example is shown in Table 21-15, in which an independent FADDS is issued immediately after an FDIVS.
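
The idea that a long divide only blocks its own execution unit can be sketched with another toy model in Python. Everything here is an assumption made for illustration: the unit names "DS" and "FMAC", the function name `schedule`, the divide occupancy of 15 cycles and the FADDS latency of 8. Only the divide latency of 19 comes from the question. Check the manual for the real figures.

```python
# Toy model with one issue slot per cycle and separate execution units.
# Unit names and all timing numbers except the latency of 19 are assumptions.
def schedule(instrs):
    """instrs: list of (unit, dest, srcs, occupancy, latency).
    Returns the issue cycle of each instruction, in program order."""
    unit_free = {}   # unit -> first cycle at which it can accept work again
    ready = {}       # register -> cycle at which its value is available
    issue = []
    prev_issue = -1
    for unit, dest, srcs, occupancy, latency in instrs:
        start = max([prev_issue + 1, unit_free.get(unit, 0)]
                    + [ready.get(s, 0) for s in srcs])
        issue.append(start)
        unit_free[unit] = start + occupancy
        ready[dest] = start + latency
        prev_issue = start
    return issue

program = [
    ("DS",   "s0", ("s1", "s2"), 15, 19),  # FDIVS s0, s1, s2
    ("FMAC", "s3", ("s4", "s5"), 1, 8),    # FADDS s3, s4, s5 (independent)
    ("DS",   "s7", ("s8", "s9"), 15, 19),  # FDIVS s7, s8, s9 (same unit)
]
print(schedule(program))  # [0, 1, 15]
```

In this sketch the independent FADDS issues in the cycle right after the divide, whereas a second divide has to wait until the divide unit is free again.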

Note that writing short-vector VFP code that runs significantly faster than scalar code can be somewhat tricky for many types of computation (although not impossible). Even if you learn how to do it, the skill will be of dubious value, since NEON is evidently the new vector model for computation on ARM. Ultimately, you may be better served by ignoring short-vector operation for now and focusing on learning NEON for the future.
