Question

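I'm benchmarking software that runs 4x faster on an Intel 2670QM than my serial version when it uses all 8 of my logical threads. I would like some community feedback on my reading of the benchmark results.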

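When I use 4 threads on 4 cores I get a speedup of 4x; the entire algorithm is executed in parallel. This seems logical to me, since Amdahl's law predicts it. Windows Task Manager tells me I'm using 50% of the CPU.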

However, if I run the same software on all 8 threads, I again get a speedup of 4x and not 8x.

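If I have understood this correctly: my CPU has 4 cores, each running at 2.2 GHz, but with 8 logical threads the frequency is divided into 1.1 GHz each, while the rest of the components, such as the cache memory, stay the same? If that is the case, why does Task Manager claim that only 50% of my CPU is being used?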

    #define NumberOfFiles 8
    ...
    char startLetter = 'a';
    #pragma omp parallel for shared(startLetter)
    for (int f = 0; f < NumberOfFiles; f++) {
        ...
    }

I am not including the time spent on disk I/O. I am only interested in the time the STL calls take, not the disk I/O.

Answer 1

An i7-2670QM processor has 4 cores, but it can run 8 threads in parallel. That is, it has only 4 processing units (cores) but has hardware support for running 8 threads. At most four jobs execute on the cores at once; if one of them stalls, for example on a memory access, another thread can start executing on the free core very quickly and with very little penalty. Read more on Hyper-Threading. In reality there are few scenarios where Hyper-Threading gives a large performance gain. More modern processors handle Hyper-Threading better than older ones.

Your benchmark showed that the workload is CPU bound, i.e. there were few stalls in the pipeline that would have given Hyper-Threading an advantage. 50% CPU is correct, as the 4 cores are working and the 4 extra logical processors are not doing anything. Turn off Hyper-Threading in the BIOS and you will see 100% CPU.
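To see where the 50% figure comes from, it can help to look at how many logical processors the system reports. The following is a minimal sketch, assuming a C++11 compiler; `utilizationPercent` and `printLogicalProcessors` are hypothetical helpers introduced here for illustration, and the model (fully busy logical processors divided by all logical processors) is a simplification of what Task Manager displays.

```cpp
#include <iostream>
#include <thread>

// Hypothetical helper: overall utilization as a monitoring tool would show it
// when `busyLogical` of `totalLogical` logical processors are fully busy.
double utilizationPercent(unsigned busyLogical, unsigned totalLogical) {
    if (totalLogical == 0) return 0.0;  // avoid division by zero
    return 100.0 * busyLogical / totalLogical;
}

// Prints the logical processor count reported by the standard library.
// hardware_concurrency() is only a hint and may return 0 if unknown.
void printLogicalProcessors() {
    unsigned n = std::thread::hardware_concurrency();
    std::cout << "logical processors: " << n << '\n';
}
```

With Hyper-Threading enabled, four busy threads out of eight logical processors give `utilizationPercent(4, 8)`, i.e. 50. With Hyper-Threading disabled, the same four threads give `utilizationPercent(4, 4)`, i.e. 100.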

Answer 2

Here is a quick summary of Hyper-Threading.

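Thread switching is slow: execution has to stop, a bunch of values are copied out to memory, a bunch of values are copied from memory into the CPU, and then things start going again with the new thread.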

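This is where your 4 virtual cores come in. You have 4 cores, that is it, but what Hyper-Threading allows the CPU to do is keep 2 threads on a single core.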

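Only 1 thread can execute at a time, but when 1 thread needs to stop for a memory read, a disk access, or anything else that is going to take some time, the core can switch in the other thread and run it for a while. On older processors, the core basically just slept during that time.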

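So your quad core has 4 cores, each of which can do 1 thing at a time, but each can keep 2 threads on standby for whenever one of them has to wait on another part of the computer.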

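If your task uses a lot of memory as well as a lot of CPU, you should see a slight decrease in total execution time, but if you are almost entirely CPU bound you will be better off sticking with just 4 threads.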

Answer 3

The important piece of information to understand here is the difference between a physical and a logical thread.
If you have 4 physical cores on your CPU, you have the physical resources to execute 4 distinct threads of execution in parallel. So, if your threads do not have data contention, you can normally measure a 4x performance increase compared to the speed of a single thread.
I'm also assuming that the OS (or you :)) sets the thread affinity correctly, so that each thread runs on its own physical core.
When you enable HT (Hyper-Threading) on your CPU, the core frequency is not modified. :)
What happens is that part of the hardware pipeline (inside the core and around it: uncore, cache, etc.) is duplicated, but part of it is still shared between the logical threads. That's the reason why you do not measure an 8x performance increase. In my experience, with all logical cores enabled you can get a 1.5x to 1.7x performance improvement per physical core, depending on the code you are executing, cache usage (remember that the L1 cache is shared between the two logical cores of one physical core, for instance), thread affinity, and so on. Hope this helps.

Answer 4

HT is called SMT (Simultaneous MultiThreading) or HTT (HyperThreading Technology) in most BIOSes. The efficiency of HT depends on the so-called compute-to-fetch ratio, that is, how many in-core (register/cache) operations your code performs before it fetches from or stores to the slow main memory or I/O memory. For highly cache-efficient and CPU-bound codes, HT gives almost no noticeable performance increase. For more memory-bound codes, HT can really benefit the execution thanks to so-called "latency hiding". That's why most non-x86 server CPUs provide 4 (e.g. IBM POWER7) to 8 (e.g. UltraSPARC T4) hardware threads per core. These CPUs are usually used in database and transaction processing systems, where many concurrent memory-bound requests are serviced at once.

By the way, Amdahl's law states that the upper limit of the parallel speedup is one over the serial fraction of the code. Usually the serial fraction increases with the number of processing elements if there is communication or other synchronisation between the threads (possibly hidden in the runtime), although sometimes cache effects can lead to superlinear speedup and sometimes cache thrashing can reduce performance drastically.
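As a worked illustration of that bound: Amdahl's law gives the speedup on N processing elements as 1 / (s + (1 - s) / N), where s is the serial fraction. The sketch below uses hypothetical function names and assumes that s does not grow with N, which, as noted above, often does not hold in practice.

```cpp
#include <stdexcept>

// Amdahl's law: speedup on n processing elements for serial fraction s in [0, 1].
double amdahlSpeedup(double serialFraction, unsigned n) {
    if (serialFraction < 0.0 || serialFraction > 1.0 || n == 0)
        throw std::invalid_argument("need 0 <= s <= 1 and n >= 1");
    return 1.0 / (serialFraction + (1.0 - serialFraction) / n);
}

// Upper limit as n grows without bound: one over the serial fraction.
double amdahlLimit(double serialFraction) {
    if (serialFraction <= 0.0 || serialFraction > 1.0)
        throw std::invalid_argument("need 0 < s <= 1");
    return 1.0 / serialFraction;
}
```

For a fully parallel code (s = 0), `amdahlSpeedup(0.0, 4)` is 4, which matches the 4x measured on four cores. With s = 0.5, the speedup can never exceed `amdahlLimit(0.5)`, i.e. 2, no matter how many threads are used.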

Answer 5

Some actual numbers:

A CPU-intensive task on my i7 (counting from 1 to 1000000000, 16 times), averaged over 8 tests:

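Summary, worker threads / average ticks (ms):

1 / 26414
4 / 8923
8 / 6659
12 / 6592
16 / 6719
64 / 6811
128 / 6778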

Note that in the "using X threads" line in the reports below, X is one greater than the number of threads available to do the tasks: one thread submits the tasks and waits on a countdown-latch event for their completion. It processes none of the CPU-heavy tasks and uses no CPU.
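A setup of this kind (one submitting thread plus a pool of workers pulling CPU-heavy tasks) might be sketched roughly as follows. This is not the program that produced the reports. It is a minimal C++11 sketch in which `countTo`, `runBenchmark`, and all parameters are hypothetical names, and in which the submitting thread simply joins the workers instead of waiting on a countdown latch.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// CPU-bound task: count up to `limit`. `volatile` keeps the compiler
// from optimizing the loop away.
std::uint64_t countTo(std::uint64_t limit) {
    volatile std::uint64_t counter = 0;
    for (std::uint64_t i = 0; i < limit; ++i) counter = counter + 1;
    return counter;
}

// Runs `tasks` counting tasks on `workers` threads and returns the elapsed
// wall-clock time in milliseconds. `completed` receives the number of
// tasks that finished.
long long runBenchmark(unsigned workers, unsigned tasks,
                       std::uint64_t limit, unsigned& completed) {
    std::atomic<unsigned> next{0};  // index of the next task to claim
    std::atomic<unsigned> done{0};  // tasks finished so far
    auto start = std::chrono::steady_clock::now();

    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&] {
            // Each worker claims tasks until none are left.
            while (next.fetch_add(1) < tasks) {
                countTo(limit);
                done.fetch_add(1);
            }
        });
    }
    // The submitting thread does no counting; it only waits.
    for (auto& t : pool) t.join();

    auto stop = std::chrono::steady_clock::now();
    completed = done.load();
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
}
```

Calling `runBenchmark(n, 16, 1000000000, completed)` for n = 1, 4, 8 and so on would exercise the same pattern, though the absolute timings depend on the machine and the compiler settings.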

8 tests,
16 tasks,
counting to 1000000000,
using 2 threads:
Ticks: 26286
Ticks: 26380
Ticks: 26317
Ticks: 26474
Ticks: 26442
Ticks: 26426
Ticks: 26474
Ticks: 26520
Average: 26414 ms

8 tests,
16 tasks,
counting to 1000000000,
using 5 threads:
Ticks: 8799
Ticks: 9157
Ticks: 8829
Ticks: 9002
Ticks: 9173
Ticks: 8720
Ticks: 8830
Ticks: 8876
Average: 8923 ms

8 tests,
16 tasks,
counting to 1000000000,
using 9 threads:
Ticks: 6615
Ticks: 6583
Ticks: 6630
Ticks: 6599
Ticks: 6521
Ticks: 6895
Ticks: 6848
Ticks: 6583
Average: 6659 ms

8 tests,
16 tasks,
counting to 1000000000,
using 13 threads:
Ticks: 6661
Ticks: 6599
Ticks: 6552
Ticks: 6630
Ticks: 6583
Ticks: 6583
Ticks: 6568
Ticks: 6567
Average: 6592 ms

8 tests,
16 tasks,
counting to 1000000000,
using 17 threads:
Ticks: 6739
Ticks: 6864
Ticks: 6599
Ticks: 6693
Ticks: 6676
Ticks: 6864
Ticks: 6646
Ticks: 6677
Average: 6719 ms

8 tests,
16 tasks,
counting to 1000000000,
using 65 threads:
Ticks: 7223
Ticks: 6552
Ticks: 6879
Ticks: 6677
Ticks: 6833
Ticks: 6786
Ticks: 6739
Ticks: 6802
Average: 6811 ms

8 tests,
16 tasks,
counting to 1000000000,
using 129 threads:
Ticks: 6771
Ticks: 6677
Ticks: 6755
Ticks: 6692
Ticks: 6864
Ticks: 6817
Ticks: 6849
Ticks: 6801
Average: 6778 ms
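
To read the reports as speedups, the single-worker average can be divided by each of the other averages. The sketch below uses only the averages printed above, with the worker count taken as the reported thread count minus one; `speedup` and `printSpeedups` are hypothetical helpers.

```cpp
#include <cstdio>

// Speedup of a run relative to the baseline (both times in the same unit).
double speedup(double baselineMs, double runMs) {
    return baselineMs / runMs;
}

// Prints the speedup for each average reported above.
void printSpeedups() {
    const unsigned workers[] = {1, 4, 8, 12, 16, 64, 128};
    const double averageMs[] = {26414, 8923, 6659, 6592, 6719, 6811, 6778};
    for (int i = 0; i < 7; ++i) {
        std::printf("%3u workers: %.2fx\n", workers[i],
                    speedup(averageMs[0], averageMs[i]));
    }
}
```

With these numbers, 4 workers come out at roughly 3.0x and 8 workers at roughly 4.0x, after which adding more threads changes little.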
