In this document: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0301g/DDI0301G_arm1176jzfs_r0p7_trm.pdf
on page 21-25 (PDF page 875), the throughput and latency timings are given for the assembly instructions of the VFP unit.
Are those numbers independent of the vector length?
1: Let's take FMULS, which has a throughput of 1 and a latency of 8. Does that mean I can start a new FMULS operation in each cycle, as long as I don't use a register that is still being computed by a previous instruction? For example:
FMULS s8, s16, s20
FMULS s12, s21, s25
Will those execute right after each other?
2: What happens if I have two FMULS instructions in a row, where one argument of the second depends on the result of the first?
FMULS s8, s16, s20
FMULS s12, s21, s8
Will the VFP wait for 8 cycles before starting to process the second instruction?
3: What if we are in vector mode with 4 elements and, for the second FMULS instruction, all input registers but one are available? What will happen?
4: Sqrt and division: will a sqrt or division operation prevent any subsequent operation from being started for 19 cycles?
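To make questions 1 and 2 concrete, here is a small Python sketch of the naive model I have in mind. It is only my assumption about how throughput and latency might interact, not something taken from the TRM: the function name `issue_cycles`, the strictly in-order issue, and the rule that a result becomes usable exactly `latency` cycles after issue are all made up for illustration. Whether the real VFP behaves like this is exactly what I am asking.

```python
def issue_cycles(instrs, throughput=1, latency=8):
    """Toy in-order issue model (an assumption, not documented VFP behaviour).

    instrs: list of (dest, src1, src2) register names.
    Returns the cycle in which each instruction is assumed to issue.
    """
    ready = {}      # register name -> cycle at which its value is assumed usable
    issued = []     # issue cycle of each instruction, in program order
    next_free = 0   # earliest cycle the pipeline can accept another instruction
    for dest, *srcs in instrs:
        # Wait for the pipeline slot and for every source register to be ready.
        start = max([next_free] + [ready.get(r, 0) for r in srcs])
        issued.append(start)
        ready[dest] = start + latency
        next_free = start + throughput
    return issued


# Question 1: FMULS s8, s16, s20 / FMULS s12, s21, s25 (no dependency)
independent = [("s8", "s16", "s20"), ("s12", "s21", "s25")]
# Question 2: FMULS s8, s16, s20 / FMULS s12, s21, s8 (second reads s8)
dependent = [("s8", "s16", "s20"), ("s12", "s21", "s8")]

print(issue_cycles(independent))  # [0, 1]
print(issue_cycles(dependent))    # [0, 8]
```

Under this model the independent pair issues back to back, while the dependent pair stalls until cycle 8. It ignores vector mode and any result forwarding the hardware may do.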
Thanks!