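I replaced the linear algebra sections of my code with BLAS calls (DGEMV, DAXPY and DNRM2 only). It turned out that plain array assignment (`a = b`) is faster than DCOPY, and `2*a` is faster than DSCAL.
The answers are correct and the code runs without problems. However, when I compile it with `ifort CG.f90 -mkl`, the timings are: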
MKL_SET_DYNAMIC = TRUE; 140 seconds.
MKL_SET_DYNAMIC = FALSE, MKL_SET_NUM_THREADS=1; 70 seconds.
MKL_SET_DYNAMIC = FALSE, MKL_SET_NUM_THREADS=2; ~100 seconds.
A few points:
- I have 2 real cores and 2 virtual cores through hyperthreading. I am not trying to run 16 threads on a 2 core machine.
- Profiling yielded abstruse references to `M16_LAY_GAS16`, which after a lot of searching came down to `multpd` ASM. Nothing else useful came out of it (or maybe I didn't know where to look). FWIW, I used VTune.
- The problem size is not small. The timings above are for matrix sizes proportional to the size of my RAM (roughly 13k x 13k for my 4 GB system).
- `KMP_AFFINITY` maps one thread to one processor in the serial case and 2 threads to 2 processors in the parallel case.
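
To put the problem size in perspective, here is a rough back-of-the-envelope sketch of the matrix footprint. It assumes double precision (8 bytes per element, as the D-prefixed routines imply) and reads "4 GB" as 4 * 2**30 bytes; the function and variable names are only illustrative and are not taken from my code.

```python
# Rough memory footprint of the dense matrix passed to DGEMV.
# Assumptions: 8-byte doubles, and "4 GB" read as 4 * 2**30 bytes.

BYTES_PER_DOUBLE = 8


def dense_matrix_bytes(n: int) -> int:
    """Bytes needed to store an n x n matrix of doubles."""
    return n * n * BYTES_PER_DOUBLE


n = 13_000                      # "roughly 13k x 13k"
ram_bytes = 4 * 2**30           # assumed reading of "4 GB"
matrix_bytes = dense_matrix_bytes(n)
matrix_gib = matrix_bytes / 2**30
ram_fraction = matrix_bytes / ram_bytes

print(f"{n} x {n} doubles: {matrix_gib:.2f} GiB ({ram_fraction:.0%} of RAM)")
```

Under these assumptions the matrix alone takes about 1.26 GiB, roughly a third of the machine's memory, so every DGEMV call streams far more data than fits in cache.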
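My question is: if the dynamic setting is supposed to be optimal, why does it not set the number of threads to 1? If the same job gets done (in less time) on 1 thread, I do not necessarily need to use 2 threads.
Am I doing something wrong with Intel MKL, or misunderstanding something?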