How to use the %%globaltimer register in CUDA?
  • Time: 2024-01-06 23:18:39
  • Tags: cuda

An answer to this question suggests using the %globaltimer register to measure elapsed time in a CUDA kernel. I decided to try it:

#define NS_PER_S 1000000000

__global__ void sleepKernel() {
    uint64_t start, end;
    uint64_t sleepTime = 5 * NS_PER_S;     // Sleep for 5 seconds

    if (threadIdx.x == 0) {
        // Record start time
        asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(start));

        // Sleep for 5 seconds
        __nanosleep(sleepTime);

        // Record end time
        asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(end));

        // Calculate and print the elapsed time in nanoseconds and milliseconds
        uint64_t elapsedNs = end - start;
        double elapsedMs = (double)elapsedNs / 1000000.0;
        printf("Slept for %llu nanoseconds (%.3f milliseconds)\n", elapsedNs, elapsedMs);
    }
}

But when I run this kernel, the output looks like:

slept for 73728 nanoseconds (0.074 milliseconds)
slept for 471040 nanoseconds (0.471 milliseconds)

Both are far less than 5 seconds. What am I missing?

Edit: doing:

    uint64_t sleepTime = 5 * (uint64_t)NS_PER_S;     // Sleep for 5 seconds

helped a bit (it prevents the integer overflow), but it is still not enough.
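
To make the overflow concrete, here is a minimal host-side sketch (plain C++ that nvcc should also accept). The helper names `sleepTimeNarrow` and `sleepTimeWide` are hypothetical and exist only for illustration. The original expression multiplies two signed `int` values, where overflow is undefined behavior, so this sketch uses `uint32_t` instead to get a well-defined wrap-around modulo 2^32.

```cpp
#include <cstdint>

// Assumption for illustration: a 32-bit unsigned version of NS_PER_S.
constexpr uint32_t NS_PER_S_32 = 1000000000u;

// Hypothetical helper: the multiplication happens in 32 bits, so the
// product wraps modulo 2^32 before it is widened to 64 bits.
uint64_t sleepTimeNarrow(uint32_t seconds) {
    uint32_t product = seconds * NS_PER_S_32;
    return product;
}

// Hypothetical helper: one operand is widened first, so the whole
// multiplication happens in 64 bits and nothing is lost.
uint64_t sleepTimeWide(uint32_t seconds) {
    return static_cast<uint64_t>(seconds) * NS_PER_S_32;
}
```

With `seconds == 5`, the narrow version yields 705032704 (about 0.7 s worth of nanoseconds) while the wide version yields 5000000000, which matches the effect of the `(uint64_t)` cast in the edit.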

Answer

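I think the source of the confusion is the use of __nanosleep(). If we start by reading the documentation and the related PTX documentation, we will see a few things worth noting: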

  1. The argument passed to it is unsigned, which is a 32-bit quantity. There is no way to closely express 5 seconds when the quantum is a nanosecond and the argument is unsigned. Passing a 64-bit value isn't formally available, and doesn't change this. But wait, there's more.
  2. The word "approximately" is used. In the PTX doc we see that the function it is based on actually offers a delay in the range of zero(!) to twice the value you specify. That's probably not what most people expect in a delay function.
  3. At least in PTX, the function cannot be used to provide a delay of more than 1 millisecond. That is nowhere close to 5 seconds, nor close to what you can ask for in a 32-bit quantity.

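Compilation and PTX analysis will show that the CUDA C++ intrinsic uses the PTX function (more or less directly), and therefore has the same characteristics. I won't be able to answer "why is it this way?" or "what is it useful for?" questions.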

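With that in mind, and since the title of your question at least is about %globaltimer rather than __nanosleep, we can readily verify that %globaltimer seems to work approximately as advertised: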

# cat t118.cu
#define NS_PER_S 1000000000ULL
#include <time.h>
#include <sys/time.h>
#include <iostream>
#include <cstdint>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start=0){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

__global__ void sleepKernel(uint64_t sleepTime = 5 *NS_PER_S) {
    uint64_t start, end;

    if (threadIdx.x == 0) {
        // Record start time
        asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(start));
        end = start;
        // Sleep for 5 seconds
        while (end < (start + sleepTime))
          asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(end));
    }
}


int main(){
        sleepKernel<<<1,1>>>(100);
        cudaDeviceSynchronize();
        unsigned long long dt = dtime_usec(0);
        sleepKernel<<<1,1>>>();
        cudaDeviceSynchronize();
        dt = dtime_usec(dt);
        std::cout << "elapsed: " << dt << "us" << std::endl;
}
# nvcc -o t118 t118.cu -arch=sm_89
# ./t118
elapsed: 5000023us
#
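
If the goal is a long delay that does not spin at full speed, the busy-wait loop above could be combined with short `__nanosleep()` calls. The following is only a sketch under stated assumptions: the helper names `readGlobalTimer` and `sleepAtLeast` are hypothetical, the 1 ms chunk size is chosen to stay within the PTX limit described earlier, and `__nanosleep()` needs a device of compute capability 7.0 or higher. It has not been run here, so no output is shown.

```cuda
#include <cstdint>
#include <cstdio>

// Hypothetical helper: read the %globaltimer register (nanoseconds).
__device__ __forceinline__ uint64_t readGlobalTimer() {
    uint64_t t;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));
    return t;
}

// Hypothetical helper: wait until at least `ns` nanoseconds have passed
// according to %globaltimer. Each __nanosleep request is at most 1 ms and
// only approximate, so %globaltimer decides when the loop ends.
__device__ void sleepAtLeast(uint64_t ns) {
    const uint64_t deadline = readGlobalTimer() + ns;
    while (readGlobalTimer() < deadline) {
        __nanosleep(1000000u);  // request roughly 1 ms per iteration
    }
}

__global__ void sleepKernel(uint64_t ns) {
    if (threadIdx.x == 0) {
        uint64_t start = readGlobalTimer();
        sleepAtLeast(ns);
        uint64_t end = readGlobalTimer();
        printf("Slept for %llu nanoseconds\n",
               (unsigned long long)(end - start));
    }
}

int main() {
    // 64-bit arithmetic throughout, so 5 seconds does not overflow.
    sleepKernel<<<1, 1>>>(5ULL * 1000000000ULL);
    cudaDeviceSynchronize();
    return 0;
}
```

Because the final sleep request can overshoot the deadline, the measured time may exceed the requested time by up to a couple of milliseconds; a pure busy-wait like the one above is the tighter option when precision matters.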

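Notes:

  1. I don't normally recommend PTX analysis as a way to understand behavior. Here, however, it is useful and sufficient to show the connection between the CUDA C++ intrinsic and the PTX function from which we can infer the behavior.

  2. I've already filed an internal bug (3608779) at NVIDIA to have the documentation for the CUDA C++ intrinsic (__nanosleep()) updated to better reflect what is discoverable in the PTX doc.

  3. The first kernel launch is a "warm-up" to absorb various CUDA start-up overheads. If you want to see what that means, take it out.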




