English 中文(简体)
[OpenCL]nearest relatives using euclidean long
原标题:[OpenCL]nearest neighbour using euclidean distance

我在两组3D点之间找到最近邻。

近邻: 对于数据单中的每一点(x,y,z),我必须找到该模型中最接近的一点。 (Ax-Bx)^2 + (Ay-By)^2 + (Az-Bz)^2

Here what I ve done so far:

struct point {
int x;
int y;
int z;
};

__kernel void 
nearest_neighbour(__global struct point *model,
__global struct point *dataset,
__global int *nearest,
const unsigned int model_size)
{
    int g_dataset_id = get_global_id(0);

    int dmin = -1;
    int d, dx, dy, dz;

    for (int i=0; i<model_size; ++i) {
        dx = model[i].x - dataset[g_dataset_id].x;
        dx = dx * dx;

        dy = model[i].y - dataset[g_dataset_id].y;
        dy = dy * dy;

        dz = model[i].z - dataset[g_dataset_id].z;
        dz = dz * dz;

        d = dx + dy + dz;

        if(dmin == -1 || d < dmin)
        {
            nearest[g_dataset_id] = i;
            dmin = d;
        }
    }
}

The code seems to work, but I m sure that it can be optimized. I would like to know how can I take advantage of the local memory to make it better.

增 编

P.S. I know that there are other (better) methods to find nearest neighbour, like kd-tree, but for now I would like to do the easy one.

问题回答

The compiler is probably hoisting these loop-invariants for you, but to be sure it gets done, try this code which assigns them to temporaries named datum_x and so on. Also, initializing dmin to MAX_INT allows you to avoid the superfluous comparison with -1. Another approach is to unroll the first loop iteration (with i=0) in order to initialize dmin.

int dmin = MAX_INT;
int d, dx, dy, dz;
int datum_x, datum_y, datum_z;

datum_x = dataset[g_model_id].x;
datum_y = dataset[g_model_id].y;
datum_z = dataset[g_model_id].z;

for (int i=0; i<size_dataset; ++i) {
    dx = model[i].x - datum_x;
    dx = dx * dx;

    dy = model[i].y - datum_y;
    dy = dy * dy;

    dz = model[i].z - datum_z;
    dz = dz * dz;

    d = dx + dy + dz;

    if(d < dmin)
    {
        nearest[g_dataset_id] = i;
        dmin = d;
    }
}

Maybe a quick pre-filter can speed things up. Instead of calculating the squared distance immediately, you can first check if distance in all three coordinates are closer than dmin. So, you can replace your inner loop with

{
    dx = model[i].x - datum_x;
    if (abs(dx) >= dmin) continue;

    dy = model[i].y - datum_y;
    if (abs(dy) >= dmin) continue;

    dz = model[i].z - datum_z;
    if (abs(dz) >= dmin) continue;

    dx = dx * dx;    
    dy = dy * dy;
    dz = dz * dz;

    d = dx + dy + dz;

    if(d < dmin)
    {
        nearest[g_dataset_id] = i;
        dmin = d;
    }
}

我不敢确定,对<代码>abs()和if的附加呼吁将渗透到足够的数据点,但我认为,这是一个简单的改动,可以尝试。

对我来说,第一件事是Heath的建议。 每个工作项目都在同时查阅<代码>model[i]。 视汇编者的最佳程度而定,使每个工作项目能够从各阵列中获取不同内容可能更好。 夸大其词的一种方法是:

int datum_x, datum_y, datum_z;

datum_x = dataset[g_model_id].x;
datum_y = dataset[g_model_id].y;
datum_z = dataset[g_model_id].z;

for (int i=0; i<size_dataset; ++i) {
    j = (i + g_model_id) % size_dataset;  // i --> j by cyclic permutation

    dx = model[j].x - datum_x;
    dx = dx * x;

    dy = model[j].y - datum_y;
    dy = dy * dy;

    /* and so on... */
}

然而,在您的法典中查阅model[i]很可能是作为“广播”处理,在这种情况下,我的代码将放慢。

我确信,将许多时间用在<条码>上写现的斜体。 查阅全球记忆往往非常缓慢,因此,你在你与<条码>dmin = d

与此类似:

...
int dmin = MAX_INT;
int imin = 0;
...
for (...)
{
  ...
  if(d < dmin)
  {
    imin = i;
    dmin = d;
  }
}

nearest[g_dataset_id] = imin; //write to global memory only once

附加建议也可适用于产出指数:维持一个可变的<代码>_nearest_id,在排位之后仅写上。

我不是3个组成部分,而是试验4个病媒,并利用病媒操作。

在Eric Bainville建议之后 我试图从中删除点。 正如我所建议的那样,我在这里做了些什么:

__kernel void 
nearest_neighbour(__global float4 *model,
__global float4 *dataset,
__global unsigned int *nearest,
const unsigned int model_size)
{
    int g_dataset_id = get_global_id(0);

    float dmin = MAXFLOAT;
    float d;

    /* Ottimizzato per memoria locale */
    float4 local_xyz = dataset[g_dataset_id];
    float4 d_xyz;
    int imin;

    for (int i=0; i<model_size; ++i) {
        d_xyz = model[i] - local_xyz;

        d_xyz *= d_xyz;

        d = d_xyz.x + d_xyz.y + d_xyz.z;

        if(d < dmin)
        {
            imin = i;
            dmin = d;
        }
    }

    nearest[g_dataset_id] = imin; // Write only once in global memory
}

问题在于,这一版本比基于点块的低点。 或许是因为在结构中,I ve使用了过滤前装置:

dx = model[i].x - local_x;
dx = dx * dx;
if (dx >= dmin) continue;

dy = model[i].y - local_y;
dy = dy * dy;
if (dy >= dmin) continue;

dz = model[i].z - local_z;
dz = dz * dz;
if (dz >= dmin) continue;

d = dx + dy + dz;

I can t use that pre-filter width float4 version. In your opinion are there other optimization I can do on the float4 version?

感谢大家的宝贵建议

具体议题是”,我想知道,我如何利用当地记忆,使之更好>。

利用万国邮联在当地的记忆可以trick。 在处理数据之前,你需要花一些质量的时间,将SDK的编码样本和方案规划指南放在一起。

基本上,你利用当地记忆来收集全球数据的某些部分——如果是模型阵列——以便你能够从那里读取比从全球读取更快。 如果你想尝试的话,那就象这一伪装:

For each block of the model array {
    1) Read data from __global and write it to __local
    2) Barrier
    3) For each model datum in the __local cache,
       Read it and process it.
    4) Barrier
}

第3步基本上是你现在的空档,但只能处理模型数据的一个空白,而不是整个问题。

步骤2和4在你使用当地记忆时都是至关重要的。 各位必须协调你们工作小组的所有工作。 隔离墙迫使所有工作项目在隔离墙之前完成该守则,然后才允许在隔离墙之后继续实施该守则。 这就使得工作项目无法从当地记忆中读到数据,然后才由其他的读物者撰写。 我不记得隔离墙指令的辛迪加,但又在《开放式世界宣言》的文件中。

步骤1 您的每个工作项目都将与全球不同,并写给当地切身。

与此类似(强调这一点过于简单,未经测试!)

__local float4 modelcache[CACHESIZE];
int me = get_local_id(0);

for (int j = 0; j < model_size; j += CACHESIZE) {
    modelcache[me] = dataset[j+me];
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int i=0; i < CACHESIZE; ++i) {
        d_xyz = modelcache[i] - local_xyz;
        ... etc.
    }
    barrier(CLK_LOCAL_MEM_FENCE);
}

然后是设计问题:地方海滩如何大? 工作组的规模如何?

当地数据储存在工作组工作项目之间。 如果您的国别组合工作项目同时执行一些工作组,每个工作组都有自己的示范项目。

如果你使当地数据阵列太小,那么使用这些数据就没有什么好处。 如果你使工作组变得过于庞大,那么万国邮联就可以同时执行尽可能多的工作组,而实际上,你可能要慢得多。

最后,我必须指出,这一特定算法可能从当地记忆库中大大受益。 在你的方案中,所有工作项目都同时阅读同样的模式地点,大多数全球采购集团拥有专门优化的硬件,以便迅速做到这一点。





相关问题
Undefined reference

I m getting this linker error. I know a way around it, but it s bugging me because another part of the project s linking fine and it s designed almost identically. First, I have namespace LCD. Then I ...

C++ Equivalent of Tidy

Is there an equivalent to tidy for HTML code for C++? I have searched on the internet, but I find nothing but C++ wrappers for tidy, etc... I think the keyword tidy is what has me hung up. I am ...

Template Classes in C++ ... a required skill set?

I m new to C++ and am wondering how much time I should invest in learning how to implement template classes. Are they widely used in industry, or is this something I should move through quickly?

Print possible strings created from a Number

Given a 10 digit Telephone Number, we have to print all possible strings created from that. The mapping of the numbers is the one as exactly on a phone s keypad. i.e. for 1,0-> No Letter for 2->...

typedef ing STL wstring

Why is it when i do the following i get errors when relating to with wchar_t? namespace Foo { typedef std::wstring String; } Now i declare all my strings as Foo::String through out the program, ...

C# Marshal / Pinvoke CBitmap?

I cannot figure out how to marshal a C++ CBitmap to a C# Bitmap or Image class. My import looks like this: [DllImport(@"test.dll", CharSet = CharSet.Unicode)] public static extern IntPtr ...

Window iconification status via Xlib

Is it possible to check with the means of pure X11/Xlib only whether the given window is iconified/minimized, and, if it is, how?