Generating mandelbrot images in c++ using multithreading. No speedup?

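So I posted a question similar to this earlier, but I didn't post enough code to get the help I needed. Even if I went back and added that code now, I don't think it would be noticed, because the question is old and "answered". So here is my problem: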

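I'm trying to generate a section of the Mandelbrot fractal. I can generate it fine, but when I add more cores, no matter how large the problem size is, the extra threads produce no speedup. I am completely new to multithreading, and it's probably just something small I'm missing. Anyway, here are the functions that generate the fractal: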

void mandelbrot_all(std::vector<std::vector<int>>& pixels, int X, int Y, int numThreads) {
    using namespace std;

    vector<thread> threads (numThreads);
    int rowsPerThread = Y/numThreads;
    mutex m;

    for(int i=0; i<numThreads; i++) {
        threads[i] = thread ([&](){
            vector<int> row;
            for(int j=(i-1)*rowsPerThread; j<i*rowsPerThread; j++) {
                row = mandelbrot_row(j, X, Y);
                {
                    lock_guard<mutex> lock(m);
                    pixels[j] = row;
                }
            }
        });
    }
    for(int i=0; i<numThreads; i++) {
        threads[i].join();
    }
}

std::vector<int> mandelbrot_row(int rowNum, int topX, int topY) {
    std::vector<int> row (topX);
    for(int i=0; i<topX; i++) {
        row[i] = mandelbrotOne(i, rowNum, topX, topY);
    }
    return row;
}

int mandelbrotOne(int currX, int currY, int X, int Y) { //code adapted from http://en.wikipedia.org/wiki/Mandelbrot_set
    double x0 = convert(X, currX, true);
    double y0 = convert(Y, currY, false);
    double x = 0.0;
    double y = 0.0;
    double xtemp;
    int iteration = 0;
    int max_iteration = 255;
    while ( x*x + y*y < 2*2  &&  iteration < max_iteration) {
        xtemp = x*x - y*y + x0;
        y = 2*x*y + y0;
        x = xtemp;
        ++iteration;
    }
    return iteration;
}

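mandelbrot_all is passed a vector to hold the pixels, the maximum X and Y dimensions of that vector, and the number of threads to use, which is taken from the command line when the program is run. It attempts to split the work by row among multiple threads. Unfortunately, even if that is what it is doing, it is not making things any faster. If you need more details, feel free to ask and I'll do my best to provide them.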

Thanks in advance for the help.

Edit: reserved vectors in advance

Edit 2: ran this code with problem size 9600x7200 on a quad core laptop. It took an average of 36590000 cycles for one thread (over 5 runs) and 55142000 cycles for four threads.


Your code might appear to do parallel processing, but in practice it doesn't. Basically, you are spending your time copying data around and queueing for memory allocator accesses.

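Besides, you are using the unprotected `i` loop index as if there was nothing to it, which will feed your worker threads with random junk instead of beautiful slices of the image.

As usual, C++ is hiding these sad facts from you under a thick crust of syntactic sugar.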



Functional analysis

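You want to use all the CPU cores to crunch pixels of the Mandelbrot set. This is a perfect case of parallelizable computation, since each pixel can be computed independently.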



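Computing a given pixel can take anywhere from 0 to 255 iterations. A "black" pixel is up to 255 times more costly than a "white" one.

So if you simply split your picture into equal subsurfaces, chances are your processors will breeze through the "white" areas but crawl through the "black" ones. As a result, the slowest area will dominate the computation time, and parallelization will bring practically no gain.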
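To make the imbalance concrete, here is a small sketch that compares the finishing time of equal contiguous slices with the ideal even split. The helper names (`static_split_time`, `ideal_time`) and the per-row costs are made up for illustration; they are not part of the original code.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Time taken when `costs` (one entry per row, arbitrary units) is cut into
// `workers` contiguous slices of equal row count: the slowest slice decides.
// Hypothetical helper, for illustration only.
long static_split_time(const std::vector<long>& costs, int workers) {
    assert(workers > 0);
    const std::size_t rows = costs.size();
    long slowest = 0;
    for (int w = 0; w < workers; ++w) {
        const std::size_t first = rows * w / workers;
        const std::size_t last  = rows * (w + 1) / workers;
        const long slice = std::accumulate(costs.begin() + first,
                                           costs.begin() + last, 0L);
        slowest = std::max(slowest, slice);
    }
    return slowest;
}

// Lower bound on the finishing time if the work could be shared perfectly.
long ideal_time(const std::vector<long>& costs, int workers) {
    assert(workers > 0);
    const long total = std::accumulate(costs.begin(), costs.end(), 0L);
    return (total + workers - 1) / workers;
}
```

With six "white" rows of cost 1 followed by two "black" rows of cost 255 and four workers, the static split finishes at 510 units while the perfect share would be 129: three workers sit idle almost the whole time.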


Load balancing

To better balance the load, it is more efficient to split your picture in much smaller bits, and have each worker thread pick and compute the next available bit as soon as it is finished with the previous one. That way, a worker processing "white" chunks will eventually finish its job and start picking "black" chunks to help its less fortunate siblings.
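One simple way to let workers "pick the next available bit" is a shared atomic counter. The sketch below is an assumption-laden illustration, not the implementation described further down: `render_dynamic`, `escape_time`, the fixed viewport [-2,1] x [-1,1] and the choice of one row per chunk are all hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Escape-time count for one point, same loop shape as the question's code.
int escape_time(double x0, double y0, int max_iteration = 255) {
    double x = 0.0, y = 0.0;
    int iteration = 0;
    while (x * x + y * y < 4.0 && iteration < max_iteration) {
        const double xtemp = x * x - y * y + x0;
        y = 2 * x * y + y0;
        x = xtemp;
        ++iteration;
    }
    return iteration;
}

// Renders a width x height image of the region [-2,1] x [-1,1] (an assumed
// viewport) into a flat, pre-allocated buffer. Each row is one chunk; workers
// claim the next unclaimed row from a shared atomic counter.
std::vector<int> render_dynamic(int width, int height, int num_threads) {
    assert(width > 0 && height > 0 && num_threads > 0);
    std::vector<int> pixels(static_cast<std::size_t>(width) * height);
    std::atomic<int> next_row{0};

    auto worker = [&]() {
        for (;;) {
            const int row = next_row.fetch_add(1); // claim a row exclusively
            if (row >= height) return;             // nothing left to do
            const double y0 = -1.0 + 2.0 * row / height;
            for (int col = 0; col < width; ++col) {
                const double x0 = -2.0 + 3.0 * col / width;
                pixels[static_cast<std::size_t>(row) * width + col] =
                    escape_time(x0, y0);
            }
        }
    };

    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) threads.emplace_back(worker);
    for (std::thread& t : threads) t.join();
    return pixels;
}
```

No mutex is needed here: each row is claimed by exactly one thread, and threads write to disjoint parts of a buffer that was allocated before they started.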

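Ideally, the chunks should be sorted by decreasing complexity, to avoid adding the linear cost of a big chunk to the total computation time.

If we decide the chunks are horizontal slices of the picture, having them sorted in natural `y` order is clearly suboptimal. If that particular area is a "white to black" gradient, the most costly lines will all be bunched at the end of the chunk list, and you will end up computing the costliest bits last, which is the worst case for load balancing.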

A possible solution is to shuffle the chunks in a butterfly pattern, so that the likelihood of having a "black" area concentrated in the end is small.
Another way would simply be to shuffle input patterns at random.
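The random variant is the easier of the two to sketch. The function name `shuffled_chunk_order` and the fixed seed below are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>
#include <vector>

// Returns the chunk indices 0..num_chunks-1 in a pseudo-random order.
// A fixed seed keeps the order reproducible from run to run.
std::vector<int> shuffled_chunk_order(int num_chunks, unsigned seed = 12345u) {
    assert(num_chunks >= 0);
    std::vector<int> order(num_chunks);
    std::iota(order.begin(), order.end(), 0); // 0, 1, 2, ...
    std::mt19937 rng(seed);
    std::shuffle(order.begin(), order.end(), rng);
    return order;
}
```

Workers would then process chunk `order[k]` when they pick up queue entry `k`, so expensive neighbouring slices are unlikely to all land at the end of the queue.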


Jobs are executed in reverse order (job 39 is the first, job 0 is the last). Each line decodes as follows:

t a-b : thread n°a on processor b
b : beginning time (since image computation start)
e : end time
d : duration (all times in milliseconds)

1) 40 jobs with butterfly ordering

job  0: t 1-1  b 162 e 174 d  12 // the 4 tasks finish within 5 ms from each other
job  1: t 0-0  b 156 e 176 d  20 //
job  2: t 2-2  b 155 e 173 d  18 //
job  3: t 3-3  b 154 e 174 d  20 //
job  4: t 1-1  b 141 e 162 d  21
job  5: t 2-2  b 137 e 155 d  18
job  6: t 0-0  b 136 e 156 d  20
job  7: t 3-3  b 133 e 154 d  21
job  8: t 1-1  b 117 e 141 d  24
job  9: t 0-0  b 116 e 136 d  20
job 10: t 2-2  b 115 e 137 d  22
job 11: t 3-3  b 113 e 133 d  20
job 12: t 0-0  b  99 e 116 d  17
job 13: t 1-1  b  99 e 117 d  18
job 14: t 2-2  b  96 e 115 d  19
job 15: t 3-3  b  95 e 113 d  18
job 16: t 0-0  b  83 e  99 d  16
job 17: t 3-3  b  80 e  95 d  15
job 18: t 2-2  b  77 e  96 d  19
job 19: t 1-1  b  72 e  99 d  27
job 20: t 3-3  b  69 e  80 d  11
job 21: t 0-0  b  68 e  83 d  15
job 22: t 2-2  b  63 e  77 d  14
job 23: t 1-1  b  56 e  72 d  16
job 24: t 3-3  b  54 e  69 d  15
job 25: t 0-0  b  53 e  68 d  15
job 26: t 2-2  b  48 e  63 d  15
job 27: t 0-0  b  41 e  53 d  12
job 28: t 3-3  b  40 e  54 d  14
job 29: t 1-1  b  36 e  56 d  20
job 30: t 3-3  b  29 e  40 d  11
job 31: t 2-2  b  29 e  48 d  19
job 32: t 0-0  b  23 e  41 d  18
job 33: t 1-1  b  18 e  36 d  18
job 34: t 2-2  b  16 e  29 d  13
job 35: t 3-3  b  15 e  29 d  14
job 36: t 2-2  b   0 e  16 d  16
job 37: t 3-3  b   0 e  15 d  15
job 38: t 1-1  b   0 e  18 d  18
job 39: t 0-0  b   0 e  23 d  23


2) 40 jobs with linear ordering

job  0: t 2-2  b 157 e 180 d  23 // last thread lags 17 ms behind first
job  1: t 1-1  b 154 e 175 d  21
job  2: t 3-3  b 150 e 171 d  21
job  3: t 0-0  b 143 e 163 d  20 // 1st thread ends
job  4: t 2-2  b 137 e 157 d  20
job  5: t 1-1  b 135 e 154 d  19
job  6: t 3-3  b 130 e 150 d  20
job  7: t 0-0  b 123 e 143 d  20
job  8: t 2-2  b 115 e 137 d  22
job  9: t 1-1  b 112 e 135 d  23
job 10: t 3-3  b 112 e 130 d  18
job 11: t 0-0  b 105 e 123 d  18
job 12: t 3-3  b  95 e 112 d  17
job 13: t 2-2  b  95 e 115 d  20
job 14: t 1-1  b  94 e 112 d  18
job 15: t 0-0  b  90 e 105 d  15
job 16: t 3-3  b  78 e  95 d  17
job 17: t 2-2  b  77 e  95 d  18
job 18: t 1-1  b  74 e  94 d  20
job 19: t 0-0  b  69 e  90 d  21
job 20: t 3-3  b  60 e  78 d  18
job 21: t 2-2  b  59 e  77 d  18
job 22: t 1-1  b  57 e  74 d  17
job 23: t 0-0  b  55 e  69 d  14
job 24: t 3-3  b  45 e  60 d  15
job 25: t 2-2  b  45 e  59 d  14
job 26: t 1-1  b  43 e  57 d  14
job 27: t 0-0  b  43 e  55 d  12
job 28: t 2-2  b  30 e  45 d  15
job 29: t 3-3  b  30 e  45 d  15
job 30: t 0-0  b  27 e  43 d  16
job 31: t 1-1  b  24 e  43 d  19
job 32: t 2-2  b  13 e  30 d  17
job 33: t 3-3  b  12 e  30 d  18
job 34: t 0-0  b  11 e  27 d  16
job 35: t 1-1  b  11 e  24 d  13
job 36: t 2-2  b   0 e  13 d  13
job 37: t 3-3  b   0 e  12 d  12
job 38: t 1-1  b   0 e  11 d  11
job 39: t 0-0  b   0 e  11 d  11


3) only 1 job per core, with 1 to 4 cores activated

reported cores: 4
Master: start jobs 4 workers 1
job  0: t 0-0  b 410 e 590 d 180 // purely linear execution
job  1: t 0-0  b 255 e 409 d 154
job  2: t 0-0  b 127 e 255 d 128
job  3: t 0-0  b   0 e 127 d 127
Master: start jobs 4 workers 2   // gain factor : 1.6 out of theoretical 2
job  0: t 1-1  b 151 e 362 d 211 
job  1: t 0-0  b 147 e 323 d 176
job  2: t 0-0  b   0 e 147 d 147
job  3: t 1-1  b   0 e 151 d 151
Master: start jobs 4 workers 3   // gain factor : 1.82 out of theoretical 3
job  0: t 0-0  b 142 e 324 d 182 // 4th packet is hurting the performance badly
job  1: t 2-2  b   0 e 158 d 158
job  2: t 1-1  b   0 e 160 d 160
job  3: t 0-0  b   0 e 142 d 142
Master: start jobs 4 workers 4   // gain factor : 3 out of theoretical 4
job  0: t 3-3  b   0 e 199 d 199 // finish at 199ms vs. 176 for butterfly 40, 13% loss
job  1: t 1-1  b   0 e 182 d 182 // 17 ms wasted
job  2: t 0-0  b   0 e 146 d 146 // 44 ms wasted
job  3: t 2-2  b   0 e 150 d 150 // 49 ms wasted

Here we get a 3x improvement while a better load balancing could have yielded a 3.5x.
And this is a very mild test case (you can see the computation times only vary by a factor of about 2, while they could theoretically vary by a factor of 255!).




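One of these interferences is memory allocation. Each time you allocate even a few bytes, you queue for exclusive access to the global memory allocator (and waste a bit of CPU doing the allocation).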

Also, creating worker tasks for each picture computation is another waste of time and resources. The computation might be used to display the Mandelbrot set in an interactive application, so it is better to have the workers permanently created and synchronized to compute successive images.

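Lastly, there are the data copies. If you synchronize with the main program each time you are done computing a few points, you will again spend a good part of your time queuing for exclusive access to the result area. Besides, the useless copies of large amounts of data will hurt the performance even more.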



You must provide your workers with everything they need to work unimpeded. For that you need to:

  • determine the number of available cores on your system
  • pre-allocate all the memory needed
  • give each of your workers access to a list of image chunks
  • launch exactly one thread per core and let them run free to do their job
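The first three points of this list can be sketched as follows. The `chunk` type and the helpers `worker_count` and `make_chunks` are hypothetical names chosen for the example; the only library call relied on is `std::thread::hardware_concurrency()`, which is allowed to return 0 when the core count cannot be determined.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// One horizontal band of the image, described before any thread starts.
// Hypothetical type, for illustration only.
struct chunk {
    int first_row;
    int last_row; // exclusive
};

// Number of workers to launch: one per reported core, at least one.
unsigned worker_count() {
    const unsigned reported = std::thread::hardware_concurrency();
    return reported == 0 ? 1u : reported;
}

// Pre-computes the list of chunks covering `height` rows, so that workers
// only ever read from it.
std::vector<chunk> make_chunks(int height, int num_chunks) {
    assert(height > 0 && num_chunks > 0);
    std::vector<chunk> chunks;
    chunks.reserve(num_chunks); // single allocation, done up front
    for (int i = 0; i < num_chunks; ++i) {
        const long long h = height;
        const int first = static_cast<int>(h * i / num_chunks);
        const int last  = static_cast<int>(h * (i + 1) / num_chunks);
        chunks.push_back({first, last});
    }
    return chunks;
}
```

The result bitmap would be allocated once in the same way (for example a `std::vector` of `width * height` entries), and each worker would write only inside the rows of the chunk it is processing.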

job queue

There is no need for fancy no-wait or whatever gizmos, nor do we need to pay special attention to cache optimization.
Here again, the time needed to compute pixels dwarfs the inter-thread synchronization cost and cache efficiency problems.

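Basically, the queue can be computed as a whole at the start of an image generation. Workers only have to read jobs from it: there will never be concurrent read/write accesses on this queue, so the more or less standard bits of code floating around to implement job queues will be both suboptimal and too complex for the job at hand.

We need two synchronization points: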


  1. let the workers wait for a new batch of jobs
  2. let the master wait for a picture completion

Workers will wait until the queue length changes to a positive value. They will then all wake up and start atomically decrementing the queue length. The current value of the queue length will provide them exclusive access to the associated job data (basically an area of the Mandelbrot set to compute, with an associated bitmap area to store the computed iteration values).

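The same mechanism is used to terminate the workers. Instead of finding a new batch of jobs, the poor workers will find an order to terminate.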

The master waiting for a picture completion will be awoken by the worker that finishes processing the last job. This is based on an atomic counter of the number of jobs to process.

This is how I implemented it:

class synchro {
    friend class mandelbrot_calculator;

    mutex              lock;    // queue lock
    condition_variable work;    // blocks workers waiting for jobs/termination
    condition_variable done;    // blocks master waiting for completion
    int                pending; // number of jobs in the queue
    atomic_int         active;  // number of unprocessed jobs
    bool               kill;    // poison pill for workers termination

    synchro ()
    {
        pending = 0;  // no job in queue
        active = 0;   // no job being processed
        kill = false; // workers shall live (for now :) )
    }

    int worker_start(void)
    {
        unique_lock<mutex> waiter(lock);
        while (!pending && !kill) work.wait(waiter);
        return kill
            ? -1         // worker should die
            : --pending; // index of the job to process
    }

    void worker_done(void)
    {
        if (!--active) // atomic decrement (exclusive with other workers)
        {
            // taking the lock keeps the notification from slipping in
            // between the master's test of `active` and its wait
            unique_lock<mutex> waiter(lock);
            done.notify_one(); // last job processed: wakeup master
        }
    }

    void master_start(int jobs)
    {
        unique_lock<mutex> waiter(lock);
        pending = active = jobs;
        work.notify_all(); // wakeup all workers to start jobs
    }

    void master_done(void)
    {
        unique_lock<mutex> waiter(lock);
        while (active) done.wait(waiter); // wait for workers to finish
    }

    void master_kill(void)
    {
        unique_lock<mutex> waiter(lock); // `kill` is read under the same lock
        kill = true;
        work.notify_all(); // wakeup all workers (to die)
    }
};

// job, viewport and process() are not shown here
class mandelbrot_calculator {
    int      num_cores;
    int      num_jobs;
    vector<thread> workers; // worker threads
    vector<job> jobs;      // job queue
    synchro sync;          // synchronization helper

public:
    mandelbrot_calculator (int num_cores, int num_jobs)
        : num_cores(num_cores)
        , num_jobs (num_jobs )
    {
        // worker thread
        auto worker = [&]()
        {
            for (;;)
            {
                int job = sync.worker_start(); // fetch next job

                if (job == -1) return; // poison pill
                process (jobs[job]);   // we have exclusive access to this job

                sync.worker_done();    // signal end of picture to the master
            }
        };

        jobs.resize(num_jobs, job()); // computation windows
        workers.resize(num_cores);
        for (int i = 0; i != num_cores; i++)
            workers[i] = thread(worker);
    }

    ~mandelbrot_calculator ()
    {
        // kill the workers
        sync.master_kill();
        for (thread& worker : workers) worker.join();
    }

    void compute(const viewport & vp)
    {
        // prepare worker data
        function<void(int, int)> butterfly_jobs;
        butterfly_jobs = [&](int min, int max)
            // computes job windows in butterfly order
            {
                if (min > max) return;
                jobs[min].setup(vp, max, num_jobs);

                if (min == max) return;
                jobs[max].setup(vp, min, num_jobs);

                int mid = (min + max) / 2;
                butterfly_jobs(min + 1, mid    );
                butterfly_jobs(mid + 1, max - 1);
            };
        butterfly_jobs(0, num_jobs - 1);

        // launch workers
        sync.master_start(num_jobs);

        // wait for completion
        sync.master_done();
    }
};
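To see what order the recursion in compute() actually produces, the sketch below replays it on plain integers. `butterfly_order` is a hypothetical helper written for this illustration; it records, for each queue slot, which image window the recursion would assign to it.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Returns, for each queue slot, the index of the image window stored there,
// following the same recursion as butterfly_jobs in compute().
std::vector<int> butterfly_order(int num_jobs) {
    assert(num_jobs >= 0);
    std::vector<int> window(num_jobs, -1);
    std::function<void(int, int)> fill = [&](int min, int max) {
        if (min > max) return;
        window[min] = max;            // slot min receives window max
        if (min == max) return;
        window[max] = min;            // slot max receives window min
        const int mid = (min + max) / 2;
        fill(min + 1, mid);
        fill(mid + 1, max - 1);
    };
    fill(0, num_jobs - 1);
    return window;
}
```

For 8 jobs the slots hold windows 7, 3, 2, 1, 6, 5, 4, 0. Since workers take jobs from the last slot down to the first, the windows are computed in the order 0, 4, 5, 6, 1, 2, 3, 7, so neighbouring slices of the picture are spread across the queue instead of being processed back to back.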

Testing the concept

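This code works pretty well on my 2 cores / 4 CPUs Intel I3 @ 3.1 GHz, compiled with Microsoft Visual Studio 2013.

I use a bit of the set that has an average of 90 iterations/pixel, on a window of 1280x1024 pixels.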

The computation time is about 1.700s with only one worker and drops to 0.480s with 4 workers.
The maximal possible gain would be a factor 4. I get a factor 3.5. Not too bad.

I assume the difference is partly due to the processor architecture (the I3 has only two "real" cores).

Tampering with the scheduler

My program forces the threads to run on one core each (using MSDN SetThreadAffinityMask).
If the scheduler is left free to allocate the tasks, the gain factor drops from 3.5 to 3.2.


synchronization overhead



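It is about 1.5%, including the synchronization and the computation of the jobs, and that is with the work being synchronized over 1024 jobs.

Perfectly negligible, I would say. That might give food for thought to all the "no-wait" fanatics.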

optimizing iterations

The way iterations are done is a key factor for optimization.
After a few trials, I settled for this method:

static inline unsigned char mandelbrot_pixel(double x0, double y0)
{
    // note: `register` is only a hint, and the keyword was removed in C++17
    register double x = x0;
    register double y = y0;
    register double x2 = x * x;
    register double y2 = y * y;
    unsigned       iteration = 0;
    const int      max_iteration = 255;
    while (x2 + y2 < 4.0)
    {
        if (++iteration == max_iteration) break;
        y = 2 * x * y + y0;
        x = x2 - y2   + x0;
        x2 = x * x;
        y2 = y * y;
    }
    return (unsigned char)iteration;
}

Net gain: +20% compared with the OP's method

The `register` directives made no difference; they are only there for decoration.
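When rewriting the inner loop like this, it is worth checking it against a few points whose behaviour is known. The sketch below repeats the loop without `register` so that it also compiles under C++17; `check_known_points` is a hypothetical helper added for the example. Note that this version starts from z = c rather than z = 0, so its counts are one lower than those of the loop in the question for points that escape.

```cpp
#include <cassert>

// Same loop as mandelbrot_pixel above, without the `register` keyword.
static inline unsigned char mandelbrot_pixel(double x0, double y0) {
    double x = x0, y = y0;
    double x2 = x * x, y2 = y * y;
    unsigned iteration = 0;
    const unsigned max_iteration = 255;
    while (x2 + y2 < 4.0) {
        if (++iteration == max_iteration) break;
        y = 2 * x * y + y0;
        x = x2 - y2 + x0;
        x2 = x * x; // squares are computed once and reused
        y2 = y * y; // by both the test and the next step
    }
    return static_cast<unsigned char>(iteration);
}

// Asserts the counts for a few points with known behaviour.
void check_known_points() {
    assert(mandelbrot_pixel(0.0, 0.0) == 255);  // origin never escapes
    assert(mandelbrot_pixel(-1.0, 0.0) == 255); // period-2 cycle, never escapes
    assert(mandelbrot_pixel(2.0, 2.0) == 0);    // starts outside the radius-2 disc
    assert(mandelbrot_pixel(1.0, 0.0) == 1);    // escapes after one step
}
```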

killing the tasks after each computation


butterfly effect


The problem in your code is that all threads capture and access the same `i` variable. This creates a race condition, and the results are wildly incorrect.
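A minimal sketch of the fix is to capture the loop index by value, so that each thread keeps its own copy. The helper name `for_each_band` and its callback signature are assumptions made for this example. The band bounds are computed from `i` and `i + 1`, so the first band starts at row 0; in the question's code the range starts at `(i-1)*rowsPerThread`, which is a negative row for `i = 0`.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Runs work(first_row, last_row) on num_threads threads, one contiguous
// band of rows per thread. The loop index is captured BY VALUE, so each
// thread sees its own copy of i even after the loop has moved on.
template <typename RowRangeFn>
void for_each_band(int rows, int num_threads, RowRangeFn work) {
    assert(rows >= 0 && num_threads > 0);
    std::vector<std::thread> threads;
    threads.reserve(num_threads);
    for (int i = 0; i < num_threads; ++i) {
        threads.emplace_back([i, rows, num_threads, &work]() {
            const int first = rows * i / num_threads;       // band i starts here
            const int last  = rows * (i + 1) / num_threads; // and ends before here
            work(first, last);
        });
    }
    for (std::thread& t : threads) t.join();
}
```

Because the bands do not overlap, each thread can write its rows straight into the shared `pixels` vector without taking a mutex, provided the vector was sized before the threads were started.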


My Mandelbrot implementation is multithreaded. It splits the work into 1 to 64 slices. No locking is needed. Here is the worker thread...

// Worker thread for processing the Mandelbrot algorithm
DWORD WINAPI MandelbrotWorkerThread(LPVOID lpParam)
{
    // This is a copy of the structure from the paint procedure.
    // The address of this structure is passed with lpParam.
    typedef struct ThreadProcParameters
    {
        int       StartPixel;
        int       EndPixel;
        int       yMaxPixel;
        int       xMaxPixel;
        uint32_t* BitmapData;
        double    dxMin;
        double    dxMax;
        double    dyMin;
        double    dyMax;
    } THREADPROCPARAMETERS;

    THREADPROCPARAMETERS* P = (THREADPROCPARAMETERS*)lpParam;

    // Algorithm obtained from https://en.wikipedia.org/wiki/Mandelbrot_set.

    double x0, y0, x, y, xtemp;

    int iteration;

    // Loop for each pixel in the slice.
    for (int Pixel = P->StartPixel; Pixel < P->EndPixel; ++Pixel)
    {
        // Calculate the x and y coordinates of the pixel.
        int xPixel = Pixel % P->xMaxPixel;
        int yPixel = Pixel / P->xMaxPixel;

        // Calculate the real and imaginary coordinates of the point.
        x0 = (P->dxMax - P->dxMin) / P->xMaxPixel * xPixel + P->dxMin;
        y0 = (P->dyMax - P->dyMin) / P->yMaxPixel * yPixel + P->dyMin;

        // Initial values.
        x = 0.0;
        y = 0.0;
        iteration = 0;

        // Main Mandelbrot algorithm. Determine the number of iterations
        // that it takes each point to escape the distance of 2. The black
        // areas of the image represent the points that never escape. This
        // algorithm is supposed to be using complex arithmetic, but this
        // is a simplified separation of the real and imaginary parts of
        // the point's coordinate. This algorithm is described as the
        // naive "escape time algorithm" in the Wikipedia article noted.

        while (x * x + y * y <= 2.0 * 2.0 && iteration < max_iterations)
        {
            xtemp = x * x - y * y + x0;
            y = 2 * x * y + y0;
            x = xtemp;
            ++iteration;
        }

        // When we get here, we have a pixel and an iteration count.
        // Lookup the color in the spectrum of all colors and set the
        // pixel to that color. Note that we are only ever using 1000
        // of the 16777215 possible colors. Changing max_iterations uses
        // a different palette, but 1000 seems to be the best choice.
        // Note also that this bitmap is shared by all 64 threads, but
        // there is no concurrency conflict as each thread is assigned
        // a different region of the bitmap. The user has the option of
        // using the original RGB or the new and improved Log HSV system.

        if (!bUseHSV)
        {
            // The old RGB system.
            P->BitmapData[Pixel] = ReverseRGBBytes
            ((COLORREF)(-16777215.0 / max_iterations * iteration + 16777215.0));
        }
        else
        {
            // The new HSV system.
            sRGB rgb;
            sHSV hsv;
            hsv = mandelbrotHSV(iteration, max_iterations);
            rgb = hsv2rgb(hsv);
            P->BitmapData[Pixel] =
                (((int)(rgb.r * 255)))       +
                (((int)(rgb.g * 255)) <<  8) +
                (((int)(rgb.b * 255)) << 16   );
        }
    }

    // End of thread execution. The return value is available
    // to the invoking thread, but we don't presently use it.

    return 0;
}

and here is the thread dispatcher...

        // Parameters for each thread.
        typedef struct ThreadProcParameters
        {
            int       StartPixel;
            int       EndPixel;
            int       yMaxPixel;
            int       xMaxPixel;
            uint32_t* BitmapData;
            double    dxMin;
            double    dxMax;
            double    dyMin;
            double    dyMax;
        } THREADPROCPARAMETERS;

        // Allocate per thread parameter and handle arrays.
        THREADPROCPARAMETERS** pThreadProcParameters = new THREADPROCPARAMETERS*[Slices];
        HANDLE* phThreadArray = new HANDLE[Slices];

        // MaxPixel is the total pixel count among all threads.
        int MaxPixel = (rect.bottom - tm.tmHeight) * rect.right;
        int StartPixel, EndPixel, Slice;

        // Main thread dispatch loop. Walk the start and end pixel indices.
        for (StartPixel = 0, EndPixel = PixelStepSize, Slice = 0;
            (EndPixel <= MaxPixel) && (Slice < Slices);
            StartPixel += PixelStepSize, EndPixel = min(EndPixel + PixelStepSize, MaxPixel), ++Slice)
        {
            // Allocate the parameter structure for this thread.
            pThreadProcParameters[Slice] =
                (THREADPROCPARAMETERS*)HeapAlloc(GetProcessHeap(),
                    HEAP_ZERO_MEMORY, sizeof(THREADPROCPARAMETERS));
            if (pThreadProcParameters[Slice] == NULL) ExitProcess(2);

            // Initialize the parameters for this thread.
            pThreadProcParameters[Slice]->StartPixel = StartPixel;
            pThreadProcParameters[Slice]->EndPixel   = EndPixel;
            pThreadProcParameters[Slice]->yMaxPixel  = rect.bottom - tm.tmHeight; // Leave room for the status bar.
            pThreadProcParameters[Slice]->xMaxPixel  = rect.right;
            pThreadProcParameters[Slice]->BitmapData = BitmapData; // Bitmap is shared among all threads.
            pThreadProcParameters[Slice]->dxMin      = dxMin;
            pThreadProcParameters[Slice]->dxMax      = dxMax;
            pThreadProcParameters[Slice]->dyMin      = dyMin;
            pThreadProcParameters[Slice]->dyMax      = dyMax;

            // Create and launch this thread.
            phThreadArray[Slice] = CreateThread
                (NULL, 0, MandelbrotWorkerThread, pThreadProcParameters[Slice], 0, NULL);
            if (phThreadArray[Slice] == NULL)
                ExitProcess(3); // thread creation failed (exit code assumed)
        } // End of main thread dispatch loop.

        // Wait for all threads to terminate.
        WaitForMultipleObjects(Slices, phThreadArray, TRUE, INFINITE);

        // Deallocate the thread arrays and structures.
        for (Slice = 0; Slice < Slices; ++Slice)
            HeapFree(GetProcessHeap(), 0, pThreadProcParameters[Slice]);
        delete[] phThreadArray;
        delete[] pThreadProcParameters;

        // Refresh the image with the bitmap.
        SetDIBitsToDevice(hdc, 0, 0, rect.right, rect.bottom - tm.tmHeight,
            0, 0, 0, rect.bottom - tm.tmHeight, BitmapData, &dbmi, 0);


