English 中文(简体)
CUDA 复发的怪异行为
原标题:Weird behaviour of CUDA recursion
  • 时间:2024-07-16 19:05:21
  •  标签:
  • cuda
  • nvidia
In the following minimal reproducible example, when the recursion in device_func is active, the __synchthreads() barrier is ignored, and when debugged, breakpoint 2 occurs before breakpoint 1. If the recursion is removed, it works as expected. How could this be? The code is compiled with nvcc -arch=sm_61 -G -g example.cu for an NVIDIA Quadro P600, using CUDA Toolkit 12.5. #include #include #include #include #include __device__ void device_func(uint32_t *octants, size_t const level) { if (threadIdx.x == 0) { octants[0] = 0; //! breakpoint 1 } __syncthreads(); printf("first octant: %d ", octants[0]); //! breakpoint 2 //! if used, breakpoint 2 occurs before breakpoint 1 if (level < 2) { device_func(octants, level + 1); }; } __global__ void kernel() { __shared__ uint32_t octants[9]; if (threadIdx.x == 0) { octants[0] = 99999; } __syncthreads(); device_func(octants, 0); return; } void checkCudaError(cudaError_t err, const char *msg) { if (err != cudaSuccess) { fprintf(stderr, "CUDA error at %s: %s ", msg, cudaGetErrorString(err)); exit(EXIT_FAILURE); } } int main() { kernel<<<1, 32>>>(); // Check for kernel launch errors checkCudaError(cudaGetLastError(), "Kernel launch"); // Wait for the kernel to finish executing checkCudaError(cudaDeviceSynchronize(), "Kernel execution"); return 0; } Here is the debug session: $ nvcc -o example -arch=sm_61 -G -g example.cu ptxas warning : Stack size for entry function _Z6kernelv cannot be statically determined $ cuda-gdb ./example NVIDIA (R) cuda-gdb 12.5 Portions Copyright (C) 2007-2024 NVIDIA Corporation Based on GNU gdb 13.2 Copyright (C) 2023 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This CUDA-GDB was configured as "x86_64-pc-linux-gnu". Type "show configuration" for configuration details. For bug reporting instructions, please see: . Find the CUDA-GDB manual and other documentation resources online at: . For help, type "help". Type "apropos word" to search for commands related to "word"... Reading symbols from ./example... (cuda-gdb) b 11 Breakpoint 1 at 0xabfc: file /home/marco/phase_2_barnes_hut/example.cu, line 20. (cuda-gdb) b 14 Note: breakpoint 1 also set at pc 0xabfc. Breakpoint 2 at 0xabfc: file /home/marco/phase_2_barnes_hut/example.cu, line 20. (cuda-gdb) r Starting program: /home/marco/phase_2_barnes_hut/example [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". [New Thread 0x7ffff6dff000 (LWP 1519)] [New Thread 0x7ffff5b21000 (LWP 1520)] [Detaching after fork from child process 1521] [New Thread 0x7ffff5320000 (LWP 1529)] [Thread 0x7ffff5320000 (LWP 1529) exited] [New Thread 0x7ffff5320000 (LWP 1530)] [New Thread 0x7fffe1fff000 (LWP 1532)] [Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (1,0,0), device 0, sm 0, warp 0, lane 1] CUDA thread hit Breakpoint 2, device_func (octants=0x7fffe5000000, level=0) at example.cu:14 14 printf("first octant: %d ", octants[0]); //! breakpoint 2 (cuda-gdb)
最佳回答
After filing a bug, I was told that this is expected for a Pascal GPU, since divergent threads are not supported. More info here.
问题回答

暂无回答




相关问题
Why won t OpenCV compile in NVCC?

I am trying to integrate CUDA and openCV in a project. Problem is openCV won t compile when NVCC is used, while a normal c++ project compiles just fine. This seems odd to me, as I thought NVCC ...

error in CUDA compilation

I m getting this error while trying to run sample codes in CUDA SDK. I have CUDA 2.3 and Visual studio 2008 LINK : fatal error LNK1181: cannot open input file cutil32D.lib Any pointers how to ...

CUDA Memory Allocation accessible for both host and device

I m trying to figure out a way to allocate a block of memory that is accessible by both the host (CPU) and device (GPU). Other than using cudaHostAlloc() function to allocate page-locked memory that ...

热门标签