Your code might appear to do parallel processing, but in practice it doesn't.
Basically, you are spending your time copying data around and queueing for memory allocator accesses.
Besides, you are using the unprotected `i` loop index as if nothing could happen to it, which will fill parts of the picture with random junk instead of the intended pixels.
As usual, C++ hides these sad facts under a thick layer of syntactic sugar.
But the biggest flaw of your code is the algorithm itself, as you may see if you read further.
Since this example seems to me like a textbook case of parallel processing, and since I have never seen an "educational" analysis of it, I will give it a try.
Functional analysis
You want to use all your CPU cores to crunch Mandelbrot set pixels. This is a case of perfectly parallelizable computation, since each pixel can be computed independently.
So basically, if you have N cores on your machine, you should have exactly one thread per core doing 1/N of the processing.
Unfortunately, splitting the input data so that each processor ends up doing 1/N of the needed processing is not as obvious as it might seem.
A given pixel can take from 0 to 255 iterations to compute. "Black" pixels are 255 times more costly than "white" ones.
So if you simply split your picture into N equal sub-surfaces, all of your processors will breeze through the "white" areas, but will crawl through the "black" ones. As a result, the slowest area's computation time will dominate, and parallelization will gain practically nothing.
In actual cases, this will not be as dramatic, but it still means a huge loss of computing power.
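To make the imbalance concrete, here is a minimal sketch of that naive static split (the approach to avoid); `render_row` is a hypothetical per-row renderer, not something from the OP's code:

    #include <thread>
    #include <vector>

    void render_row(int y, int width); // hypothetical per-row renderer

    // naive static split: thread k gets rows [k*H/N, (k+1)*H/N)
    void render_static(int width, int height, int n_threads)
    {
        std::vector<std::thread> pool;
        for (int k = 0; k != n_threads; k++)
        {
            int lo = k * height / n_threads;
            int hi = (k + 1) * height / n_threads;
            pool.emplace_back([=] {
                for (int y = lo; y != hi; y++)
                    render_row(y, width);
            });
        }
        for (std::thread& t : pool) t.join();
        // if one slice is mostly "black" (255 iterations per pixel),
        // its thread alone determines the total run time
    }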
Load balancing
To better balance the load, it is more efficient to split your picture in much smaller bits, and have each worker thread pick and compute the next available bit as soon as it is finished with the previous one.
That way, a worker processing "white" chunks will eventually finish its job and start picking "black" chunks to help its less fortunate siblings.
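Here is a minimal sketch of that pick-the-next-chunk idea, reduced to a bare atomic index (the actual queue I use, shown further down, adds a condition variable so workers can also sleep between pictures); `compute_chunk` is a hypothetical per-chunk renderer:

    #include <atomic>

    void compute_chunk(int chunk); // hypothetical per-chunk renderer

    std::atomic<int> next_chunk{0};

    void worker_loop(int total_chunks)
    {
        for (;;)
        {
            int chunk = next_chunk.fetch_add(1); // claim a chunk without a lock
            if (chunk >= total_chunks) return;   // no work left
            compute_chunk(chunk);                // exclusive access to this chunk
        }
    }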
Ideally, the chunks should be sorted by decreasing complexity, to avoid adding the linear cost of a big chunk to the total computation time.
Unfortunately, due to the chaotic nature of the Mandelbrot set, there is no practical way to predict the computation time of a given area.
If we decide the chunks will be horizontal slices of the picture, sorting them in natural `y` order is clearly suboptimal: if that particular area happens to be a "white" to "black" gradient, the costliest lines will be bunched at the end of the chunks list, and you will end up computing the cheapest bits first, which is the worst possible case for load balancing.
A possible solution is to shuffle the chunks in a butterfly pattern, so that the likelihood of having a "black" area concentrated in the end is small.
Another way would simply be to shuffle the input chunks at random.
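The random variant is nearly a one-liner with the standard library (a sketch; my proof of concept uses the butterfly order instead):

    #include <algorithm>
    #include <numeric>
    #include <random>
    #include <vector>

    // job indices 0 .. n-1 in random order
    std::vector<int> shuffled_jobs(int n)
    {
        std::vector<int> order(n);
        std::iota(order.begin(), order.end(), 0);
        std::shuffle(order.begin(), order.end(), std::mt19937{std::random_device{}()});
        return order;
    }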
Here are two outputs of my proof of concept implementation:
Jobs are executed in reverse order (job 39 is the first, job 0 is the last).
Each line decodes as follows:
t a-b : thread n°a on processor b
b : beginning time (since image computation start)
e : end time
d : duration (all times in milliseconds)
1) 40 jobs with butterfly ordering
job 0: t 1-1 b 162 e 174 d 12 // the 4 tasks finish within 5 ms from each other
job 1: t 0-0 b 156 e 176 d 20 //
job 2: t 2-2 b 155 e 173 d 18 //
job 3: t 3-3 b 154 e 174 d 20 //
job 4: t 1-1 b 141 e 162 d 21
job 5: t 2-2 b 137 e 155 d 18
job 6: t 0-0 b 136 e 156 d 20
job 7: t 3-3 b 133 e 154 d 21
job 8: t 1-1 b 117 e 141 d 24
job 9: t 0-0 b 116 e 136 d 20
job 10: t 2-2 b 115 e 137 d 22
job 11: t 3-3 b 113 e 133 d 20
job 12: t 0-0 b 99 e 116 d 17
job 13: t 1-1 b 99 e 117 d 18
job 14: t 2-2 b 96 e 115 d 19
job 15: t 3-3 b 95 e 113 d 18
job 16: t 0-0 b 83 e 99 d 16
job 17: t 3-3 b 80 e 95 d 15
job 18: t 2-2 b 77 e 96 d 19
job 19: t 1-1 b 72 e 99 d 27
job 20: t 3-3 b 69 e 80 d 11
job 21: t 0-0 b 68 e 83 d 15
job 22: t 2-2 b 63 e 77 d 14
job 23: t 1-1 b 56 e 72 d 16
job 24: t 3-3 b 54 e 69 d 15
job 25: t 0-0 b 53 e 68 d 15
job 26: t 2-2 b 48 e 63 d 15
job 27: t 0-0 b 41 e 53 d 12
job 28: t 3-3 b 40 e 54 d 14
job 29: t 1-1 b 36 e 56 d 20
job 30: t 3-3 b 29 e 40 d 11
job 31: t 2-2 b 29 e 48 d 19
job 32: t 0-0 b 23 e 41 d 18
job 33: t 1-1 b 18 e 36 d 18
job 34: t 2-2 b 16 e 29 d 13
job 35: t 3-3 b 15 e 29 d 14
job 36: t 2-2 b 0 e 16 d 16
job 37: t 3-3 b 0 e 15 d 15
job 38: t 1-1 b 0 e 18 d 18
job 39: t 0-0 b 0 e 23 d 23
You can see the load balancing at work in the last few small jobs: a thread that spent more time processing its own chunks simply picks fewer of them, so that all threads finish within a few milliseconds of each other.
2) 40 jobs in sequential order
job 0: t 2-2 b 157 e 180 d 23 // last thread lags 17 ms behind first
job 1: t 1-1 b 154 e 175 d 21
job 2: t 3-3 b 150 e 171 d 21
job 3: t 0-0 b 143 e 163 d 20 // 1st thread ends
job 4: t 2-2 b 137 e 157 d 20
job 5: t 1-1 b 135 e 154 d 19
job 6: t 3-3 b 130 e 150 d 20
job 7: t 0-0 b 123 e 143 d 20
job 8: t 2-2 b 115 e 137 d 22
job 9: t 1-1 b 112 e 135 d 23
job 10: t 3-3 b 112 e 130 d 18
job 11: t 0-0 b 105 e 123 d 18
job 12: t 3-3 b 95 e 112 d 17
job 13: t 2-2 b 95 e 115 d 20
job 14: t 1-1 b 94 e 112 d 18
job 15: t 0-0 b 90 e 105 d 15
job 16: t 3-3 b 78 e 95 d 17
job 17: t 2-2 b 77 e 95 d 18
job 18: t 1-1 b 74 e 94 d 20
job 19: t 0-0 b 69 e 90 d 21
job 20: t 3-3 b 60 e 78 d 18
job 21: t 2-2 b 59 e 77 d 18
job 22: t 1-1 b 57 e 74 d 17
job 23: t 0-0 b 55 e 69 d 14
job 24: t 3-3 b 45 e 60 d 15
job 25: t 2-2 b 45 e 59 d 14
job 26: t 1-1 b 43 e 57 d 14
job 27: t 0-0 b 43 e 55 d 12
job 28: t 2-2 b 30 e 45 d 15
job 29: t 3-3 b 30 e 45 d 15
job 30: t 0-0 b 27 e 43 d 16
job 31: t 1-1 b 24 e 43 d 19
job 32: t 2-2 b 13 e 30 d 17
job 33: t 3-3 b 12 e 30 d 18
job 34: t 0-0 b 11 e 27 d 16
job 35: t 1-1 b 11 e 24 d 13
job 36: t 2-2 b 0 e 13 d 13
job 37: t 3-3 b 0 e 12 d 12
job 38: t 1-1 b 0 e 11 d 11
job 39: t 0-0 b 0 e 11 d 11
Here the costly chunks tend to bunch together at the end of the queue, hence a visible performance loss.
3) only one job per core, with 1 to 4 cores activated
reported cores: 4
Master: start jobs 4 workers 1
job 0: t 0-0 b 410 e 590 d 180 // purely linear execution
job 1: t 0-0 b 255 e 409 d 154
job 2: t 0-0 b 127 e 255 d 128
job 3: t 0-0 b 0 e 127 d 127
Master: start jobs 4 workers 2 // gain factor : 1.6 out of theoretical 2
job 0: t 1-1 b 151 e 362 d 211
job 1: t 0-0 b 147 e 323 d 176
job 2: t 0-0 b 0 e 147 d 147
job 3: t 1-1 b 0 e 151 d 151
Master: start jobs 4 workers 3 // gain factor : 1.82 out of theoretical 3
job 0: t 0-0 b 142 e 324 d 182 // 4th packet is hurting the performance badly
job 1: t 2-2 b 0 e 158 d 158
job 2: t 1-1 b 0 e 160 d 160
job 3: t 0-0 b 0 e 142 d 142
Master: start jobs 4 workers 4 // gain factor : 3 out of theoretical 4
job 0: t 3-3 b 0 e 199 d 199 // finish at 199ms vs. 176 for butterfly 40, 13% loss
job 1: t 1-1 b 0 e 182 d 182 // 17 ms wasted
job 2: t 0-0 b 0 e 146 d 146 // 44 ms wasted
job 3: t 2-2 b 0 e 150 d 150 // 49 ms wasted
Here we get a 3x improvement while a better load balancing could have yielded a 3.5x.
And this is a very mild test case (you can see the computation times only vary by a factor of about 2, while they could theoretically vary by a factor of 255!).
Whatever the case, if you don't implement some kind of load balancing, all the shiny multiprocessor code you may write will still yield disappointingly poor performance.
Implementation
For the threads to work unhindered, they must be kept free from interference from the outside world.
One such interference is memory allocation. Each time you allocate so much as a byte, you will queue for exclusive access to the global memory allocator (and waste a bit of CPU on the allocation itself).
Also, creating worker threads for each picture computation is another waste of time and resources. The computation might be used to display the Mandelbrot set in an interactive application, so better have the workers permanently created and synchronized to compute successive images.
Lastly, there are the data copies. If you synchronize with the main program each time you are done computing a few points, you will again spend a good part of your time queueing for exclusive access to the result area. Besides, the useless copies of a sizeable amount of data will hurt performance even more.
The obvious solution is to dispense with the copies altogether and work on the original data.
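As an illustration, a possible "no copies" layout: one framebuffer allocated up front, each job holding a window straight into it (`job_window` and its fields are assumptions for the sketch, not my actual `job` class):

    #include <vector>

    struct job_window
    {
        int y_first, y_last;   // rows covered by this job
        unsigned char* pixels; // points straight into the shared framebuffer
    };

    std::vector<unsigned char> framebuffer(1280 * 1024); // allocated once, reused for every picture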
design
You must provide your workers with everything they need to work unhindered. To do so, you need to:
- determine the number of available cores on your system (see the sketch just after this list)
- pre-allocate all the memory needed
- give access to the list of image chunks to each of your workers
- launch exactly one thread per core and let them run free to do their job
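The first step at least has a portable C++11 answer (a minimal helper; my test program reports the count on its own, as the logs above show):

    #include <thread>

    int available_cores()
    {
        unsigned n = std::thread::hardware_concurrency(); // may report 0 when unknown
        return n ? int(n) : 1;                            // fall back to a single worker
    }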
job queue
There is no need for fancy no-wait or whatever gizmos, nor do we need to pay special attention to cache optimization.
Here again, the time needed to compute pixels dwarfs the inter-thread synchronization cost and cache efficiency problems.
Basically, the whole queue can be computed at the start of an image generation. The workers will only read from it: there will never be concurrent read/write accesses on this queue, so the more or less standard bits of code implementing job queues are suboptimal and over-complicated for the job at hand.
We only need two things:
- let the workers wait for a new batch of jobs
- let the master wait for a picture completion
Workers will wait until the queue length changes to a positive value.
They will then all wake up and start atomically decrementing the queue length. The current value of the queue length will provide each of them exclusive access to the associated job data (basically an area of the Mandelbrot set to compute, with an associated bitmap area to store the computed iteration values).
The same mechanism is used to terminate the workers: instead of finding a new batch of jobs, the poor workers will find a kill order.
The master waiting for a picture completion will be awoken by the worker that finishes processing the last job. This relies on an atomic counter of the number of jobs left to process.
This is how I implemented it:
#include <atomic>
#include <condition_variable>
#include <mutex>
using namespace std; // the snippets below assume this

class synchro {
    friend class mandelbrot_calculator;

    mutex lock;              // queue lock
    condition_variable work; // blocks workers waiting for jobs/termination
    condition_variable done; // blocks master waiting for completion
    int pending;             // number of jobs in the queue
    atomic_int active;       // number of unprocessed jobs
    bool kill;               // poison pill for workers termination

    synchro(void)
    {
        pending = 0;  // no job in queue
        kill = false; // workers shall live (for now :) )
    }

    int worker_start(void)
    {
        unique_lock<mutex> waiter(lock);
        while (!pending && !kill) work.wait(waiter);
        return kill
            ? -1         // worker should die
            : --pending; // index of the job to process
    }

    void worker_done(void)
    {
        if (!--active)         // atomic decrement (exclusive with other workers)
            done.notify_one(); // last job processed: wake up the master
    }

    void master_start(int jobs)
    {
        unique_lock<mutex> waiter(lock);
        pending = active = jobs;
        work.notify_all(); // wake up all workers to start jobs
    }

    void master_done(void)
    {
        unique_lock<mutex> waiter(lock);
        while (active) done.wait(waiter); // wait for workers to finish
    }

    void master_kill(void)
    {
        unique_lock<mutex> waiter(lock); // take the lock to avoid a lost wakeup
        kill = true;
        work.notify_all(); // wake up all workers (to die)
    }
};
And the class that puts it to work:
#include <functional>
#include <thread>
#include <vector>

class mandelbrot_calculator {
    int num_cores;
    int num_jobs;
    vector<thread> workers; // worker threads
    vector<job> jobs;       // job queue
    synchro sync;           // synchronization helper

public:
    mandelbrot_calculator(int num_cores, int num_jobs)
        : num_cores(num_cores)
        , num_jobs (num_jobs )
    {
        // worker thread body; id and core identify the thread
        // (core is where the platform-specific affinity setting goes)
        auto worker = [this](int id, int core)
        {
            for (;;)
            {
                int job = sync.worker_start(); // fetch next job
                if (job == -1) return;         // poison pill
                process(jobs[job]);            // we have exclusive access to this job
                sync.worker_done();            // signal end of picture to the master
            }
        };

        jobs.resize(num_jobs, job()); // computation windows
        workers.resize(num_cores);
        for (int i = 0; i != num_cores; i++)
            workers[i] = thread(worker, i, i % num_cores);
    }

    ~mandelbrot_calculator()
    {
        // kill the workers
        sync.master_kill();
        for (thread& worker : workers) worker.join();
    }

    void compute(const viewport& vp)
    {
        // prepare worker data
        function<void(int, int)> butterfly_jobs;
        butterfly_jobs = [&](int min, int max)
        // computes job windows in butterfly order
        {
            if (min > max) return;
            jobs[min].setup(vp, max, num_jobs);
            if (min == max) return;
            jobs[max].setup(vp, min, num_jobs);
            int mid = (min + max) / 2;
            butterfly_jobs(min + 1, mid);
            butterfly_jobs(mid + 1, max - 1);
        };
        butterfly_jobs(0, num_jobs - 1);

        // launch workers
        sync.master_start(num_jobs);

        // wait for completion
        sync.master_done();
    }
};
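For completeness, here is a hypothetical usage sketch; the `viewport` constructor arguments are assumptions, since that class is not shown here:

    int main()
    {
        int cores = (int)thread::hardware_concurrency();   // one worker thread per core
        mandelbrot_calculator calc(cores ? cores : 1, 40); // e.g. 40 job windows

        viewport vp(-2.5, -1.0, 1.0, 1.0); // hypothetical: area of the complex plane to render
        calc.compute(vp);                  // blocks until the picture is complete

        // successive frames reuse the same workers:
        // calc.compute(next_vp);
    }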
Testing the concept
This code works nicely on my 2 cores / 4 CPUs Intel i3 @ 3.1 GHz, compiled with Microsoft Visual Studio 2013.
I used a bit of the set that takes an average of 90 iterations per pixel, on a 1280x1024 pixels window.
The computation time is about 1.700s with only one worker and drops to 0.480s with 4 workers.
The maximal possible gain would be a factor 4. I get a factor 3.5. Not too bad.
I assume the difference is partly due to the processor architecture (the I3 has only two "real" cores).
Tampering with the scheduler
My program forces the threads to run on one core each (using SetThreadAffinityMask).
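For reference, this is the kind of helper that does the pinning (the `pin_to_core` wrapper is mine; `SetThreadAffinityMask` and `GetCurrentThread` are the actual Win32 calls):

    #include <windows.h>

    // pin the calling thread to one logical CPU (0-based index)
    void pin_to_core(int core)
    {
        SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1) << core);
    }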
If the scheduler is left free to allocate the tasks, the gain factor drops from 3.5 to 3.2.
This is significant, but on the other hand it shows the Win7 scheduler does a pretty decent job when left alone.
synchronization overhead
Running the algorithm on an all-"white" window (i.e. outside the ±2 area) gives a good idea of the system calls overhead.
It takes about 7 ms to compute this "white" area, compared with 480 ms for a representative area.
That is about 1.5%, for both the synchronization and the computation of the jobs. And that is with a queue of 1024 jobs being synchronized.
Utterly negligible, I would say. That might give food for thought to all the "no-wait" queue fanatics around.
optimizing iterations
The way iterations are done is a key factor for optimization.
After a few trials, I settled for this method:
static inline unsigned char mandelbrot_pixel(double x0, double y0)
{
    register double x = x0;
    register double y = y0;
    register double x2 = x * x;
    register double y2 = y * y;
    unsigned iteration = 0;
    const int max_iteration = 255;
    while (x2 + y2 < 4.0)
    {
        if (++iteration == max_iteration) break;
        y = 2 * x * y + y0;
        x = x2 - y2 + x0;
        x2 = x * x; // the squares are computed once per iteration and reused
        y2 = y * y; // both for the next point and for the escape test
    }
    return (unsigned char)iteration;
}
Net gain: +20% compared with the OP's method.
(the `register` directives don't make a bit of difference, they are just there for show)
killing the tasks after each computation
Keeping the workers alive instead of killing them after each picture saves about 5% of the computation time.
butterfly effect
In my test cases, the "butterfly" ordering does its job really well, yielding more than a 30% gain in extreme cases, and typically 10 to 15%, due to the "de-bunching" of the biggest requests.
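To visualize the pattern, here is a standalone sketch that replicates the `butterfly_jobs` recursion from `compute()` above and prints which window index lands in each queue slot:

    #include <functional>
    #include <iostream>
    #include <vector>

    int main()
    {
        const int num_jobs = 8;
        std::vector<int> slot(num_jobs); // slot[i] = window stored at queue position i

        std::function<void(int, int)> butterfly = [&](int min, int max)
        {
            if (min > max) return;
            slot[min] = max;   // stands in for jobs[min].setup(vp, max, num_jobs)
            if (min == max) return;
            slot[max] = min;
            int mid = (min + max) / 2;
            butterfly(min + 1, mid);
            butterfly(mid + 1, max - 1);
        };
        butterfly(0, num_jobs - 1);

        for (int w : slot) std::cout << w << ' ';
        std::cout << '\n';
        // prints: 7 3 2 1 6 5 4 0
        // workers pop jobs from the end of the queue, so the windows
        // are actually processed in the order 0 4 5 6 1 2 3 7
    }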