We read every piece of feedback, and take your input very seriously.
To see all available qualifiers, see our documentation.
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cub::BlockReduceWarpReductions<__half, ...>
Performance
CUB
https://github.com/NVIDIA/cccl/actions/runs/9601285137/job/26479819287?pr=1879
20/58 Test #119: cub.cpp17.test.device_reduce.lid_2.types_3 ..........................***Failed 16.23 sec -- Using CCCL_TEST_MODE from env: compute-sanitizer-memcheck -- >> Running: compute-sanitizer --tool memcheck --check-device-heap yes --leak-check full --padding 512 --track-stream-ordered-races all --check-warpgroup-mma yes --require-cuda-init no --check-exit-code yes --error-exitcode 1 --nvtx true /home/coder/cccl/build/cuda12.4-gcc13/cub-cpp17/bin/cub.cpp17.test.device_reduce.lid_2.types_3 ========= COMPUTE-SANITIZER ========= Invalid __shared__ read of size 16 bytes ========= at __half cub::CUB_200500_600_700_800_NS::BlockReduceWarpReductions<__half, (int)256, (int)1, (int)1, (int)0>::ApplyWarpAggregates<(bool)1, cub::CUB_200500_600_700_800_NS::Min, (int)1>(T2, __half, int, cub::CUB_200500_600_700_800_NS::Int2Type<T3>)+0x6550 in /home/coder/cccl/cub/cub/block/specializations/block_reduce_warp_reductions.cuh:143 ========= by thread (0,0,0) in block (0,0,0) ========= Address 0x8 is misaligned ========= Device Frame:__half cub::CUB_200500_600_700_800_NS::BlockReduceWarpReductions<__half, (int)256, (int)1, (int)1, (int)0>::ApplyWarpAggregates<(bool)1, cub::CUB_200500_600_700_800_NS::Min>(T2, __half, int)+0x6550 in /home/coder/cccl/cub/cub/block/specializations/block_reduce_warp_reductions.cuh:192 ========= Device Frame:__half cub::CUB_200500_600_700_800_NS::BlockReduceWarpReductions<__half, (int)256, (int)1, (int)1, (int)0>::Reduce<(bool)1, cub::CUB_200500_600_700_800_NS::Min>(__half, int, T2)+0x6300 in /home/coder/cccl/cub/cub/block/specializations/block_reduce_warp_reductions.cuh:253 ========= Device Frame:__half cub::CUB_200500_600_700_800_NS::BlockReduce<__half, (int)256, (cub::CUB_200500_600_700_800_NS::BlockReduceAlgorithm)2, (int)1, (int)1, (int)0>::Reduce<cub::CUB_200500_600_700_800_NS::Min>(__half, T1)+0x62c0 in /home/coder/cccl/cub/cub/block/block_reduce.cuh:353 ========= Device Frame:__half cub::CUB_200500_600_700_800_NS::AgentReduce<cub::CUB_200500_600_700_800_NS::AgentReducePolicy<(int)256, (int)16, __half, (int)4, (cub::CUB_200500_600_700_800_NS::BlockReduceAlgorithm)2, (cub::CUB_200500_600_700_800_NS::CacheLoadModifier)5, cub::CUB_200500_600_700_800_NS::MemBoundScaling<(int)256, (int)16, __half>>, __half *, __half *, unsigned int, cub::CUB_200500_600_700_800_NS::Min, __half, cuda::std::__4::__identity>::ConsumeRange<(int)1>(cub::CUB_200500_600_700_800_NS::GridEvenShare<unsigned int> &, cub::CUB_200500_600_700_800_NS::Int2Type<T1>)+0x62c0 in /home/coder/cccl/cub/cub/agent/agent_reduce.cuh:371 ========= Device Frame:cub::CUB_200500_600_700_800_NS::AgentReduce<cub::CUB_200500_600_700_800_NS::AgentReducePolicy<(int)256, (int)16, __half, (int)4, (cub::CUB_200500_600_700_800_NS::BlockReduceAlgorithm)2, (cub::CUB_200500_600_700_800_NS::CacheLoadModifier)5, cub::CUB_200500_600_700_800_NS::MemBoundScaling<(int)256, (int)16, __half>>, __half *, __half *, unsigned int, cub::CUB_200500_600_700_800_NS::Min, __half, cuda::std::__4::__identity>::ConsumeTiles(cub::CUB_200500_600_700_800_NS::GridEvenShare<unsigned int> &)+0xf0 in /home/coder/cccl/cub/cub/agent/agent_reduce.cuh:399 ========= Device Frame:void cub::CUB_200500_600_700_800_NS::DeviceReduceKernel<cub::CUB_200500_600_700_800_NS::DeviceReducePolicy<__half, unsigned int, cub::CUB_200500_600_700_800_NS::Min>::Policy600, __half *, unsigned int, cub::CUB_200500_600_700_800_NS::Min, __half, cuda::std::__4::__identity>(T2, T5 *, T3, cub::CUB_200500_600_700_800_NS::GridEvenShare<T3>, T4, T6)+0x20 in /home/coder/cccl/cub/cub/device/dispatch/dispatch_reduce.cuh:193 ========= Saved host backtrace up to driver entry point at kernel launch time ========= Host Frame: [0x2f8c1d] ========= in /usr/lib/x86_64-linux-gnu/libcuda.so.1 ========= Host Frame:libcudart_static_9bbdd5ca4a7c063204936dc6c5907f47dcddbefb [0x12a65c] ========= in /home/coder/cccl/build/cuda12.4-gcc13/cub-cpp17/bin/cub.cpp17.test.device_reduce.lid_2.types_3 ========= Host Frame:cudaGraphLaunch [0x18331b] ========= in /home/coder/cccl/build/cuda12.4-gcc13/cub-cpp17/bin/cub.cpp17.test.device_reduce.lid_2.types_3 ========= Host Frame:void launch<device_min_invocable_t, __half*, __half*, int>(device_min_invocable_t, __half*, __half*, int) [clone .isra.0] [0x96e40] ========= in /home/coder/cccl/build/cuda12.4-gcc13/cub-cpp17/bin/cub.cpp17.test.device_reduce.lid_2.types_3 ========= Host Frame:void C_A_T_C_H_T_E_M_P_L_A_T_E_T_E_S_T_F_U_N_C_0<metal::list<type_pair<half_t, half_t> > >() [0xa7d28] ========= in /home/coder/cccl/build/cuda12.4-gcc13/cub-cpp17/bin/cub.cpp17.test.device_reduce.lid_2.types_3 ========= Host Frame:Catch::RunContext::invokeActiveTestCase() [0x398c2] ========= in /home/coder/cccl/build/cuda12.4-gcc13/cub-cpp17/bin/cub.cpp17.test.device_reduce.lid_2.types_3 ========= Host Frame:Catch::RunContext::runCurrentTest(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) [0x51d27] ========= in /home/coder/cccl/build/cuda12.4-gcc13/cub-cpp17/bin/cub.cpp17.test.device_reduce.lid_2.types_3 ========= Host Frame:Catch::RunContext::runTest(Catch::TestCase const&) [0x5aa0a] ========= in /home/coder/cccl/build/cuda12.4-gcc13/cub-cpp17/bin/cub.cpp17.test.device_reduce.lid_2.types_3 ========= Host Frame:Catch::Session::runInternal() [0x64e8f] ========= in /home/coder/cccl/build/cuda12.4-gcc13/cub-cpp17/bin/cub.cpp17.test.device_reduce.lid_2.types_3 ========= Host Frame:Catch::Session::run() [0x6547d] ========= in /home/coder/cccl/build/cuda12.4-gcc13/cub-cpp17/bin/cub.cpp17.test.device_reduce.lid_2.types_3 ========= Host Frame:main [0x2cf45] ========= in /home/coder/cccl/build/cuda12.4-gcc13/cub-cpp17/bin/cub.cpp17.test.device_reduce.lid_2.types_3 ========= Host Frame: [0x23a8f] ========= in /usr/lib/x86_64-linux-gnu/libc.so.6 ========= Host Frame:__libc_start_main [0x23b48] ========= in /usr/lib/x86_64-linux-gnu/libc.so.6 ========= Host Frame:_start [0x30614] ========= in /home/coder/cccl/build/cuda12.4-gcc13/cub-cpp17/bin/cub.cpp17.test.device_reduce.lid_2.types_3
Run the device_reduce tests under compute-sanitizer as shown in the above snippet.
compute-sanitizer
Compute sanitizer should not emit any diagnostics.
No response
The text was updated successfully, but these errors were encountered:
No branches or pull requests
Is this a duplicate?
Type of Bug
Performance
Component
CUB
Describe the bug
https://github.com/NVIDIA/cccl/actions/runs/9601285137/job/26479819287?pr=1879
How to Reproduce
Run the device_reduce tests under
compute-sanitizer
as shown in the above snippet.Expected behavior
Compute sanitizer should not emit any diagnostics.
Reproduction link
No response
Operating System
No response
nvidia-smi output
No response
NVCC version
No response
The text was updated successfully, but these errors were encountered: