English 中文(简体)
CUDA 优化问题
原标题:CUDA optimization question

这是一项简单的方案:

   void multiply(const int* v_in, const int* w_in, int n_v, int n_w, int* w_out)
   {
      for(int i=0; i<n_w; i++)
      {
         int sum=0;
         for(int j=0; j<n_v; j++)
            sum += (w_in[i]*v_in[j])>>1;
         w_out[i]=sum;
      }
   }

Presume n_v, n_w ~10^6。 显然,在《世界人权宣言》中至少有十种类似方式这样做,有不同的方式将(n_v*n_w)行动注入深层,并不会分享记忆。 理论上说,什么方式应该最快?

最佳回答

最简单:

   void multiply(const int* v_in, const int* w_in, int n_v, int n_w, int* w_out)
   {
      int *v = shared; // dynamic
      for(int i = block.rank; i < n_w; i += block.size)
      {
         int w = w_in[i]; // coalesced
         int sum=0;
         for(int j=0; j<n_v; j += block.size) { // assumption
            v[block.rank] = v_in[j+block.rank];
            __synch();
            for(int k = 0; k < block.size; ++k) 
                sum += (w*v[k])>>1;  // 
            __synch(); // ouch
         }
         w_out[i] = sum; // ditto
      }
   }
问题回答

暂无回答




相关问题
Optimizing a LAN server for a game

I m the network programmer on a school game project. We want to have up to 16 players at once on a LAN. I am using the Server-Client model and am creating a new thread per client that joins. ...

SQL Table Size And Query Performance

We have a number of items coming in from a web service; each item containing an unknown number of properties. We are storing them in a database with the following Schema. Items - ItemID - ...

Most optimized way to store crawler states?

I m currently writing a web crawler (using the python framework scrapy). Recently I had to implement a pause/resume system. The solution I implemented is of the simplest kind and, basically, stores ...

Do bitwise operations distribute over addition?

I m looking at an algorithm I m trying to optimize, and it s basically a lot of bit twiddling, followed by some additions in a tight feedback. If I could use carry-save addition for the adders, it ...

Improve INSERT-per-second performance of SQLite

Optimizing SQLite is tricky. Bulk-insert performance of a C application can vary from 85 inserts per second to over 96,000 inserts per second! Background: We are using SQLite as part of a desktop ...

Profiling Vim startup time

I’ve got a lot of plugins enabled when using Vim – I have collected plugins over the years. I’m a bit fed up with how long Vim takes to start now, so I’d like to profile its startup and see which of ...

Quick padding of a string in Delphi

I was trying to speed up a certain routine in an application, and my profiler, AQTime, identified one method in particular as a bottleneck. The method has been with us for years, and is part of a "...

热门标签