Tag Info

Hot answers tagged cuda

14 votes

Accepted

CUDA Mandelbrot Kernel

Small issues: Avoid doing calculations that the CPU can do. For example, let the CPU calculate scale_factor and pass it in as an argument. Then you also don't need <...

G. Sliepen

69.3k

answered Apr 21 at 20:29

10 votes

Accepted

CUDA/NVRTC context switching function

Get rid of all the macros: Macros are problematic for various reasons, among them: the parameters are expanded in unexpected ways; they can confuse IDEs and debuggers; they are less readable. So avoid ...

G. Sliepen

69.3k

answered May 16 at 21:01

9 votes

Strongly-typed CUDA device memory

Consider making it look like a container: I think you already made something that is safe and relatively easy to use. However, the to_vector() member function is a ...

G. Sliepen

69.3k

answered Sep 20 at 19:37

7 votes

RAII Wrapper For Registering/Mapping CUDA Resources

Use std::runtime_error: You derive cuda_exception from std::exception, but now you have to ...

G. Sliepen

69.3k

answered Sep 6 at 16:06
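
The excerpt above is cut off, so the following is only a minimal sketch of what deriving from std::runtime_error might look like. std::runtime_error stores a message and implements what() itself. The integer error code, the constructor signature, and the code() accessor are assumptions made for illustration and are not taken from the reviewed code, which would likely use the CUDA runtime's own error type.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical sketch: the real cuda_exception from the reviewed code is not
// shown in the excerpt. A plain int stands in for the CUDA error code here.
class cuda_exception : public std::runtime_error {
public:
    cuda_exception(int code, const std::string& what_arg)
        // std::runtime_error keeps the message and provides what().
        : std::runtime_error(what_arg + " (error " + std::to_string(code) + ")"),
          code_(code) {}

    // Assumed accessor so callers can still inspect the numeric code.
    int code() const noexcept { return code_; }

private:
    int code_;
};
```

With this shape, existing handlers that catch std::runtime_error or std::exception also catch cuda_exception without extra code.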

5 votes

Summation over different determinants that are independently computed using CUDA

General Observations: This code isn't ready to be optimized, because it is not currently maintainable. Code optimization for performance is something that is done only when the code is production ready ...

pacmaninbw♦

26.1k

answered May 18, 2023 at 16:32

4 votes

Avoid --compiler-options in makefile using nvcc

There are two different problems. One is to pass the same options to gcc and to gcc via nvcc (the accepted solution). The other is to pass several options without repeating the long command. The ...

alfC

answered Feb 4, 2020 at 1:41

4 votes

Summation over different determinants that are independently computed using CUDA

"improving the efficiency of the code": I'll only address that goal. Move redundant calculations. Example: ...

chux

36.4k

answered May 22, 2023 at 5:51

4 votes

Summation over different determinants that are independently computed using CUDA

Disclaimer: I don't know CUDA coding very well, so I may miss some things. This second answer is in response to questions on my first answer from the Original Poster. The ...

pacmaninbw♦

26.1k

answered May 19, 2023 at 15:36

4 votes

Accepted

CUDA/C++ Host/Device Polymorphic Class Implementation

Nice! The code is generally rather good. I don’t have a lot to say about the language usage in the individual lines. The idea of using an abstract base to do different implementations selected at ...

JDługosz

11.7k

answered May 1, 2018 at 20:33

4 votes

Pytorch code running slow for Deep Q learning (Reinforcement Learning)

What I can do is give you some general suggestions: use a library like Numba or similar; try PyPy, which is a JIT compiler; if possible, use C or C++ modules. And here is the code with some ...

AsrtoMichi

answered May 25, 2024 at 15:15

3 votes

Accepted

Tracking total iterations in CUDA fractal renderer

Use __shared__: A single global atomic variable will be a bottleneck if it is written to frequently, so prefer a __shared__ ...

G. Sliepen

69.3k

answered Apr 1 at 20:34

3 votes

SYMGS implementation in CUDA

"Do you have any suggestions that can speed it up?" First, remove coding weaknesses: check against the expected value, not just one of the undesired values. ...

chux

36.4k

answered Apr 8, 2023 at 1:02

3 votes

A CUDA compatible Vector class

Separate concerns: Your class has too many responsibilities. It is: being a container like std::vector; doing CUDA memory management; adding row/column vector ...

G. Sliepen

69.3k

answered Mar 31, 2023 at 12:48

3 votes

Accepted

gpuIncreaseOne Function Implementation in CUDA

In your kernel, vectorIncreaseOne is using long double* types. According to the NVIDIA forum, there is no support for ...

PaulH

answered Jun 3 at 13:43

3 votes

CUDA matrix class

A few short comments: Initializing a matrix with a compile-time list of values (initializer list) does not seem to be very useful - as CUDA is used to process huge amounts of data, not a tiny number ...

einpoklum

2,099

answered Jun 16, 2019 at 7:44

2 votes

Accepted

Matrix Multiplication Implementation in CUDA C++ API with and without shared memory

matrix(int rows, int cols) { this->rows = rows; this->cols = cols; this->size = rows * cols; } 0_0 Did you actually mean ...

bipll

answered Apr 16, 2018 at 12:03

2 votes

Accepted

A sleep primitive for CUDA device-side code

If, as Robert Crovella suggests, the clock64() calls don't get optimized away, then this should be enough: ...

einpoklum

2,099

answered Oct 24, 2017 at 13:46

2 votes

Accepted

CUDA kernel to compare pairs of matrices

"... this is a little slow for such a simple operation. Any suggestions to improve?" Avoid work. Minor idea: defer computations until needed with re-ordered code. Only assign once. ...

chux

36.4k

answered Apr 17, 2023 at 16:19

2 votes

RAII Wrapper For CUDA Pointers

We define a helper macro CUDA_SAFE_CALL() but don't #undef it after use. Not clear whether this is intentional, but if our ...

Toby Speight

88.4k

answered Sep 26 at 12:33

2 votes

Speeding up Buddhabrot calculation in PyCuda

Why are you using binary and instead of logical and in write_pixel: ...

Calak

2,411

answered Oct 23, 2018 at 11:15

2 votes

I have a pytorch module that takes in some parameters and predicts the difference between one of it inputs and the target

Computing a discarded result: Please don't write code like this: def greet(name): 42 name + " is cool." print(f"Hello {name}!") Yes, ...

J_H

42.3k

answered Dec 6, 2024 at 3:33

1 vote

Accepted

Applying cointegration function from statsmodels on a large dataframe

Your code is short, clear, and time consuming. How to drastically improve the speed? You need to compute fewer figures. For N time series you perform O(N^2) cointegration tests. The OP does not ...

J_H

42.3k

answered Jun 2, 2023 at 18:19

1 vote

Accepted

CUDA-Kernel for a Dense-Sparse matrix multiplication

Thank you for offering this for review. I understand you're primarily interested in performance. But I confess I found the code a little on the opaque side and not quite ready to invite lots of folks ...

J_H

42.3k

answered Jan 13, 2023 at 19:37

1 vote

Matrix Multiplication Implementation in CUDA C++ API with and without shared memory

In addition to what bipll noted, the elements member is not initialized by the constructor, but left in a garbage state. At the very least, make it ...

JDługosz

11.7k

answered Apr 18, 2018 at 6:21

1 vote

Accepted

Mirrors number at borders of interval, untill it lays in the interval

__device__ mirror(int index , int lB, int uB) You've got an extra space before the first comma, and I think you're missing the function's return type entirely. ...

Quuxplusone

19.7k

answered Mar 22, 2019 at 4:01

1 vote

Speeding up Buddhabrot calculation in PyCuda

Packaging floats: I have implemented some selective sampling that I found on a forum, and I've changed the container for my complex numbers from a ...

maxb

1,582

answered Oct 25, 2018 at 8:28

1 vote

Calculating the distance between several spatial points

#define _SQR(a) ((a)*(a)) and #define _BLOCKSIZE 32: Identifiers beginning with an underscore followed by an uppercase letter are reserved to the implementation for any ...

Toby Speight

88.4k

answered Aug 31, 2021 at 7:57
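
One way to avoid both the reserved names and the macros quoted in the excerpt above is sketched below. This is not the answer's own code, which is truncated here; the names block_size and sqr are hypothetical replacements chosen for illustration.

```cpp
// Hypothetical replacements for the _SQR and _BLOCKSIZE macros quoted above.
// Neither name starts with an underscore followed by an uppercase letter,
// so they stay out of the implementation's reserved namespace.
constexpr int block_size = 32;

// A constexpr function template evaluates its argument exactly once,
// unlike the macro, which expands (a) twice.
template <typename T>
constexpr T sqr(T a) {
    return a * a;
}
```

For example, sqr(3) yields 9 and can be used in constant expressions, just like block_size.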

Only top scored, non community-wiki answers of a minimum length are eligible

questions tagged

cuda

cuda × 55
c++ × 34
performance × 22
c × 14
beginner × 7
matrix × 6
python × 5
fractals × 5
memory-management × 4
multithreading × 3
raii × 3
object-oriented × 2
random × 2
image × 2
primes × 2
pandas × 2
template × 2
c++20 × 2
opengl × 2
neural-network × 2
pytorch × 2
python-3.x × 1
algorithm × 1
array × 1
time-limit-exceeded × 1

