Your code might appear to do parallel processing, but in practice it doesn't.
Basically, you are spending your time copying data around and queueing for memory allocator accesses.
Besides, you are using the unprotected `i` loop index as if nothing could happen to it, which will fill parts of the picture with random junk instead of the intended pixels.
As usual, C++ hides these sad facts under a thick layer of syntactic sugar.
But the biggest flaw of your code is the algorithm itself, as you may see if you read further.
Since this example seems to me like a textbook case of parallel processing, and since I have never seen an "educational" analysis of it, I will give it a try.
Functional analysis
You want to use all your CPU cores to crunch Mandelbrot set pixels. This is a case of perfectly parallelizable computation, since each pixel can be computed independently.
So basically, if you have N cores on your machine, you should have exactly one thread per core doing 1/N of the processing.
Unfortunately, splitting the input data so that each processor ends up doing 1/N of the needed processing is not as obvious as it might seem.
A given pixel can take from 0 to 255 iterations to compute. "Black" pixels are 255 times more costly than "white" ones.
So if you simply split your picture into N equal sub-surfaces, all of your processors will breeze through the "white" areas, but will crawl through the "black" ones. As a result, the slowest area's computation time will dominate, and parallelization will gain practically nothing.
In actual cases, this will not be as dramatic, but it still means a huge loss of computing power.
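To make the imbalance concrete, here is a minimal sketch of that naive static split (the approach to avoid); `render_row` is a hypothetical per-row renderer, not something from the OP's code:

    #include <thread>
    #include <vector>

    void render_row(int y, int width); // hypothetical per-row renderer

    // naive static split: thread k gets rows [k*H/N, (k+1)*H/N)
    void render_static(int width, int height, int n_threads)
    {
        std::vector<std::thread> pool;
        for (int k = 0; k != n_threads; k++)
        {
            int lo = k * height / n_threads;
            int hi = (k + 1) * height / n_threads;
            pool.emplace_back([=] {
                for (int y = lo; y != hi; y++)
                    render_row(y, width);
            });
        }
        for (std::thread& t : pool) t.join();
        // if one slice is mostly "black" (255 iterations per pixel),
        // its thread alone determines the total run time
    }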
Load balancing
To better balance the load, it is more efficient to split your picture in much smaller bits, and have each worker thread pick and compute the next available bit as soon as it is finished with the previous one.
That way, a worker processing "white" chunks will eventually finish its job and start picking "black" chunks to help its less fortunate siblings.
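Here is a minimal sketch of that pick-the-next-chunk idea, reduced to a bare atomic index (the actual queue I use, shown further down, adds a condition variable so workers can also sleep between pictures); `compute_chunk` is a hypothetical per-chunk renderer:

    #include <atomic>

    void compute_chunk(int chunk); // hypothetical per-chunk renderer

    std::atomic<int> next_chunk{0};

    void worker_loop(int total_chunks)
    {
        for (;;)
        {
            int chunk = next_chunk.fetch_add(1); // claim a chunk without a lock
            if (chunk >= total_chunks) return;   // no work left
            compute_chunk(chunk);                // exclusive access to this chunk
        }
    }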
Ideally, the chunks should be sorted by decreasing complexity, to avoid adding the linear cost of a big chunk to the total computation time.
Unfortunately, due to the chaotic nature of the Mandelbrot set, there is no practical way to predict the computation time of a given area.
If we decide the chunks will be horizontal slices of the picture, sorting them in natural `y` order is clearly suboptimal: if that particular area happens to be a "white" to "black" gradient, the costliest lines will be bunched at the end of the chunks list, and you will end up computing the cheapest bits first, which is the worst possible case for load balancing.
A possible solution is to shuffle the chunks in a butterfly pattern, so that the likelihood of having a "black" area concentrated in the end is small.
Another way would simply be to shuffle the input chunks at random.
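The random variant is nearly a one-liner with the standard library (a sketch; my proof of concept uses the butterfly order instead):

    #include <algorithm>
    #include <numeric>
    #include <random>
    #include <vector>

    // job indices 0 .. n-1 in random order
    std::vector<int> shuffled_jobs(int n)
    {
        std::vector<int> order(n);
        std::iota(order.begin(), order.end(), 0);
        std::shuffle(order.begin(), order.end(), std::mt19937{std::random_device{}()});
        return order;
    }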
Here are two outputs of my proof of concept implementation:
Jobs are executed in reverse order (job 39 is the first, job 0 is the last).
Each line decodes as follows:
t a-b : thread n°a on processor b
b : beginning time (since image computation start)
e : end time
d : duration (all times in milliseconds)
1) 40 jobs with butterfly ordering
job 0: t 1-1 b 162 e 174 d 12 // the 4 tasks finish within 5 ms from each other
job 1: t 0-0 b 156 e 176 d 20 //
job 2: t 2-2 b 155 e 173 d 18 //
job 3: t 3-3 b 154 e 174 d 20 //
job 4: t 1-1 b 141 e 162 d 21
job 5: t 2-2 b 137 e 155 d 18
job 6: t 0-0 b 136 e 156 d 20
job 7: t 3-3 b 133 e 154 d 21
job 8: t 1-1 b 117 e 141 d 24
job 9: t 0-0 b 116 e 136 d 20
job 10: t 2-2 b 115 e 137 d 22
job 11: t 3-3 b 113 e 133 d 20
job 12: t 0-0 b 99 e 116 d 17
job 13: t 1-1 b 99 e 117 d 18
job 14: t 2-2 b 96 e 115 d 19
job 15: t 3-3 b 95 e 113 d 18
job 16: t 0-0 b 83 e 99 d 16
job 17: t 3-3 b 80 e 95 d 15
job 18: t 2-2 b 77 e 96 d 19
job 19: t 1-1 b 72 e 99 d 27
job 20: t 3-3 b 69 e 80 d 11
job 21: t 0-0 b 68 e 83 d 15
job 22: t 2-2 b 63 e 77 d 14
job 23: t 1-1 b 56 e 72 d 16
job 24: t 3-3 b 54 e 69 d 15
job 25: t 0-0 b 53 e 68 d 15
job 26: t 2-2 b 48 e 63 d 15
job 27: t 0-0 b 41 e 53 d 12
job 28: t 3-3 b 40 e 54 d 14
job 29: t 1-1 b 36 e 56 d 20
job 30: t 3-3 b 29 e 40 d 11
job 31: t 2-2 b 29 e 48 d 19
job 32: t 0-0 b 23 e 41 d 18
job 33: t 1-1 b 18 e 36 d 18
job 34: t 2-2 b 16 e 29 d 13
job 35: t 3-3 b 15 e 29 d 14
job 36: t 2-2 b 0 e 16 d 16
job 37: t 3-3 b 0 e 15 d 15
job 38: t 1-1 b 0 e 18 d 18
job 39: t 0-0 b 0 e 23 d 23
You can see the load balancing at work in the last few small jobs: a thread that spent more time processing its own chunks simply picks fewer of them, so that all threads finish within a few milliseconds of each other.
2) 40 jobs in sequential order
job 0: t 2-2 b 157 e 180 d 23 // last thread lags 17 ms behind first
job 1: t 1-1 b 154 e 175 d 21
job 2: t 3-3 b 150 e 171 d 21
job 3: t 0-0 b 143 e 163 d 20 // 1st thread ends
job 4: t 2-2 b 137 e 157 d 20
job 5: t 1-1 b 135 e 154 d 19
job 6: t 3-3 b 130 e 150 d 20
job 7: t 0-0 b 123 e 143 d 20
job 8: t 2-2 b 115 e 137 d 22
job 9: t 1-1 b 112 e 135 d 23
job 10: t 3-3 b 112 e 130 d 18
job 11: t 0-0 b 105 e 123 d 18
job 12: t 3-3 b 95 e 112 d 17
job 13: t 2-2 b 95 e 115 d 20
job 14: t 1-1 b 94 e 112 d 18
job 15: t 0-0 b 90 e 105 d 15
job 16: t 3-3 b 78 e 95 d 17
job 17: t 2-2 b 77 e 95 d 18
job 18: t 1-1 b 74 e 94 d 20
job 19: t 0-0 b 69 e 90 d 21
job 20: t 3-3 b 60 e 78 d 18
job 21: t 2-2 b 59 e 77 d 18
job 22: t 1-1 b 57 e 74 d 17
job 23: t 0-0 b 55 e 69 d 14
job 24: t 3-3 b 45 e 60 d 15
job 25: t 2-2 b 45 e 59 d 14
job 26: t 1-1 b 43 e 57 d 14
job 27: t 0-0 b 43 e 55 d 12
job 28: t 2-2 b 30 e 45 d 15
job 29: t 3-3 b 30 e 45 d 15
job 30: t 0-0 b 27 e 43 d 16
job 31: t 1-1 b 24 e 43 d 19
job 32: t 2-2 b 13 e 30 d 17
job 33: t 3-3 b 12 e 30 d 18
job 34: t 0-0 b 11 e 27 d 16
job 35: t 1-1 b 11 e 24 d 13
job 36: t 2-2 b 0 e 13 d 13
job 37: t 3-3 b 0 e 12 d 12
job 38: t 1-1 b 0 e 11 d 11
job 39: t 0-0 b 0 e 11 d 11
Here the costly chunks tend to bunch together at the end of the queue, hence a visible performance loss.
3) only one job per core, with 1 to 4 cores activated
reported cores: 4
Master: start jobs 4 workers 1
job 0: t 0-0 b 410 e 590 d 180 // purely linear execution
job 1: t 0-0 b 255 e 409 d 154
job 2: t 0-0 b 127 e 255 d 128
job 3: t 0-0 b 0 e 127 d 127
Master: start jobs 4 workers 2 // gain factor : 1.6 out of theoretical 2
job 0: t 1-1 b 151 e 362 d 211
job 1: t 0-0 b 147 e 323 d 176
job 2: t 0-0 b 0 e 147 d 147
job 3: t 1-1 b 0 e 151 d 151
Master: start jobs 4 workers 3 // gain factor : 1.82 out of theoretical 3
job 0: t 0-0 b 142 e 324 d 182 // 4th packet is hurting the performance badly
job 1: t 2-2 b 0 e 158 d 158
job 2: t 1-1 b 0 e 160 d 160
job 3: t 0-0 b 0 e 142 d 142
Master: start jobs 4 workers 4 // gain factor : 3 out of theoretical 4
job 0: t 3-3 b 0 e 199 d 199 // finish at 199ms vs. 176 for butterfly 40, 13% loss
job 1: t 1-1 b 0 e 182 d 182 // 17 ms wasted
job 2: t 0-0 b 0 e 146 d 146 // 44 ms wasted
job 3: t 2-2 b 0 e 150 d 150 // 49 ms wasted
Here we get a 3x improvement while a better load balancing could have yielded a 3.5x.
And this is a very mild test case (you can see the computation times only vary by a factor of about 2, while they could theoretically vary by a factor of 255!).
Whatever the case, if you don't implement some kind of load balancing, all the shiny multiprocessor code you may write will still yield disappointingly poor performance.
Implementation
For the threads to work unhindered, they must be kept free from interference from the outside world.
One such interference is memory allocation. Each time you allocate so much as a byte, you will queue for exclusive access to the global memory allocator (and waste a bit of CPU on the allocation itself).
Also, creating worker threads for each picture computation is another waste of time and resources. The computation might be used to display the Mandelbrot set in an interactive application, so better have the workers permanently created and synchronized to compute successive images.
Lastly, there are the data copies. If you synchronize with the main program each time you are done computing a few points, you will again spend a good part of your time queueing for exclusive access to the result area. Besides, the useless copies of a sizeable amount of data will hurt performance even more.
The obvious solution is to dispense with the copies altogether and work on the original data.
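As an illustration, a possible "no copies" layout: one framebuffer allocated up front, each job holding a window straight into it (`job_window` and its fields are assumptions for the sketch, not my actual `job` class):

    #include <vector>

    struct job_window
    {
        int y_first, y_last;   // rows covered by this job
        unsigned char* pixels; // points straight into the shared framebuffer
    };

    std::vector<unsigned char> framebuffer(1280 * 1024); // allocated once, reused for every picture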
design
You must provide your workers with everything they need to work unhindered. To do so, you need to:
- determine the number of available cores on your system (see the sketch just after this list)
- pre-allocate all the memory needed
- give access to the list of image chunks to each of your workers
- launch exactly one thread per core and let them run free to do their job
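The first step at least has a portable C++11 answer (a minimal helper; my test program reports the count on its own, as the logs above show):

    #include <thread>

    int available_cores()
    {
        unsigned n = std::thread::hardware_concurrency(); // may report 0 when unknown
        return n ? int(n) : 1;                            // fall back to a single worker
    }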
job queue
There is no need for fancy no-wait or whatever gizmos, nor do we need to pay special attention to cache optimization.
Here again, the time needed to compute pixels dwarfs the inter-thread synchronization cost and cache efficiency problems.
Basically, the whole queue can be computed at the start of an image generation. The workers will only read from it: there will never be concurrent read/write accesses on this queue, so the more or less standard bits of code implementing job queues are suboptimal and over-complicated for the job at hand.
We only need two things:
- let the workers wait for a new batch of jobs
- let the master wait for a picture completion
Workers will wait until the queue length changes to a positive value.
They will then all wake up and start atomically decrementing the queue length. The current value of the queue length will provide each of them exclusive access to the associated job data (basically an area of the Mandelbrot set to compute, with an associated bitmap area to store the computed iteration values).
The same mechanism is used to terminate the workers: instead of finding a new batch of jobs, the poor workers will find a kill order.
The master waiting for a picture completion will be awoken by the worker that finishes processing the last job. This relies on an atomic counter of the number of jobs left to process.
This is how I implemented it:
#include <atomic>
#include <condition_variable>
#include <mutex>
using namespace std; // the snippets below assume this

class synchro {
    friend class mandelbrot_calculator;

    mutex lock;              // queue lock
    condition_variable work; // blocks workers waiting for jobs/termination
    condition_variable done; // blocks master waiting for completion
    int pending;             // number of jobs in the queue
    atomic_int active;       // number of unprocessed jobs
    bool kill;               // poison pill for workers termination

    synchro(void)
    {
        pending = 0;  // no job in queue
        kill = false; // workers shall live (for now :) )
    }

    int worker_start(void)
    {
        unique_lock<mutex> waiter(lock);
        while (!pending && !kill) work.wait(waiter);
        return kill
            ? -1         // worker should die
            : --pending; // index of the job to process
    }

    void worker_done(void)
    {
        if (!--active)         // atomic decrement (exclusive with other workers)
            done.notify_one(); // last job processed: wake up the master
    }

    void master_start(int jobs)
    {
        unique_lock<mutex> waiter(lock);
        pending = active = jobs;
        work.notify_all(); // wake up all workers to start jobs
    }

    void master_done(void)
    {
        unique_lock<mutex> waiter(lock);
        while (active) done.wait(waiter); // wait for workers to finish
    }

    void master_kill(void)
    {
        unique_lock<mutex> waiter(lock); // take the lock to avoid a lost wakeup
        kill = true;
        work.notify_all(); // wake up all workers (to die)
    }
};
And the class that puts it to work:
#include <functional>
#include <thread>
#include <vector>

class mandelbrot_calculator {
    int num_cores;
    int num_jobs;
    vector<thread> workers; // worker threads
    vector<job> jobs;       // job queue
    synchro sync;           // synchronization helper

public:
    mandelbrot_calculator(int num_cores, int num_jobs)
        : num_cores(num_cores)
        , num_jobs (num_jobs )
    {
        // worker thread body; id and core identify the thread
        // (core is where the platform-specific affinity setting goes)
        auto worker = [this](int id, int core)
        {
            for (;;)
            {
                int job = sync.worker_start(); // fetch next job
                if (job == -1) return;         // poison pill
                process(jobs[job]);            // we have exclusive access to this job
                sync.worker_done();            // signal end of picture to the master
            }
        };

        jobs.resize(num_jobs, job()); // computation windows
        workers.resize(num_cores);
        for (int i = 0; i != num_cores; i++)
            workers[i] = thread(worker, i, i % num_cores);
    }

    ~mandelbrot_calculator()
    {
        // kill the workers
        sync.master_kill();
        for (thread& worker : workers) worker.join();
    }

    void compute(const viewport& vp)
    {
        // prepare worker data
        function<void(int, int)> butterfly_jobs;
        butterfly_jobs = [&](int min, int max)
        // computes job windows in butterfly order
        {
            if (min > max) return;
            jobs[min].setup(vp, max, num_jobs);
            if (min == max) return;
            jobs[max].setup(vp, min, num_jobs);
            int mid = (min + max) / 2;
            butterfly_jobs(min + 1, mid);
            butterfly_jobs(mid + 1, max - 1);
        };
        butterfly_jobs(0, num_jobs - 1);

        // launch workers
        sync.master_start(num_jobs);

        // wait for completion
        sync.master_done();
    }
};
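For completeness, here is a hypothetical usage sketch; the `viewport` constructor arguments are assumptions, since that class is not shown here:

    int main()
    {
        int cores = (int)thread::hardware_concurrency();   // one worker thread per core
        mandelbrot_calculator calc(cores ? cores : 1, 40); // e.g. 40 job windows

        viewport vp(-2.5, -1.0, 1.0, 1.0); // hypothetical: area of the complex plane to render
        calc.compute(vp);                  // blocks until the picture is complete

        // successive frames reuse the same workers:
        // calc.compute(next_vp);
    }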
Testing the concept
This code works nicely on my 2 cores / 4 CPUs Intel i3 @ 3.1 GHz, compiled with Microsoft Visual Studio 2013.
I used a bit of the set that takes an average of 90 iterations per pixel, on a 1280x1024 pixels window.
The computation time is about 1.700s with only one worker and drops to 0.480s with 4 workers.
The maximal possible gain would be a factor 4. I get a factor 3.5. Not too bad.
I assume the difference is partly due to the processor architecture (the I3 has only two "real" cores).
Tampering with the scheduler
My program forces the threads to run on one core each (using SetThreadAffinityMask).
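For reference, this is the kind of helper that does the pinning (the `pin_to_core` wrapper is mine; `SetThreadAffinityMask` and `GetCurrentThread` are the actual Win32 calls):

    #include <windows.h>

    // pin the calling thread to one logical CPU (0-based index)
    void pin_to_core(int core)
    {
        SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1) << core);
    }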
If the scheduler is left free to allocate the tasks, the gain factor drops from 3.5 to 3.2.
This is significant, but on the other hand it shows the Win7 scheduler does a pretty decent job when left alone.
synchronization overhead
Running the algorithm on an all-"white" window (i.e. outside the ±2 area) gives a good idea of the system calls overhead.
It takes about 7 ms to compute this "white" area, compared with 480 ms for a representative area.
That is about 1.5%, for both the synchronization and the computation of the jobs. And that is with a queue of 1024 jobs being synchronized.
Utterly negligible, I would say. That might give food for thought to all the "no-wait" queue fanatics around.
optimizing iterations
The way iterations are done is a key factor for optimization.
After a few trials, I settled for this method:
static inline unsigned char mandelbrot_pixel(double x0, double y0)
{
    register double x = x0;
    register double y = y0;
    register double x2 = x * x;
    register double y2 = y * y;
    unsigned iteration = 0;
    const int max_iteration = 255;
    while (x2 + y2 < 4.0)
    {
        if (++iteration == max_iteration) break;
        y = 2 * x * y + y0;
        x = x2 - y2 + x0;
        x2 = x * x; // the squares are computed once per iteration and reused
        y2 = y * y; // both for the next point and for the escape test
    }
    return (unsigned char)iteration;
}
Net gain: +20% compared with the OP's method.
(the `register` directives don't make a bit of difference, they are just there for show)
killing the tasks after each computation
Keeping the workers alive instead of killing them after each picture saves about 5% of the computation time.
butterfly effect
In my test cases, the "butterfly" ordering does its job really well, yielding more than a 30% gain in extreme cases, and typically 10 to 15%, due to the "de-bunching" of the biggest requests.
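To visualize the pattern, here is a standalone sketch that replicates the `butterfly_jobs` recursion from `compute()` above and prints which window index lands in each queue slot:

    #include <functional>
    #include <iostream>
    #include <vector>

    int main()
    {
        const int num_jobs = 8;
        std::vector<int> slot(num_jobs); // slot[i] = window stored at queue position i

        std::function<void(int, int)> butterfly = [&](int min, int max)
        {
            if (min > max) return;
            slot[min] = max;   // stands in for jobs[min].setup(vp, max, num_jobs)
            if (min == max) return;
            slot[max] = min;
            int mid = (min + max) / 2;
            butterfly(min + 1, mid);
            butterfly(mid + 1, max - 1);
        };
        butterfly(0, num_jobs - 1);

        for (int w : slot) std::cout << w << ' ';
        std::cout << '\n';
        // prints: 7 3 2 1 6 5 4 0
        // workers pop jobs from the end of the queue, so the windows
        // are actually processed in the order 0 4 5 6 1 2 3 7
    }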