14 votes
Accepted

CUDA Mandelbrot Kernel

Small issues Avoid doing calculations that the CPU can do. For example, let the CPU calculate scale_factor and pass it in as an argument. Then you also don't need <...
G. Sliepen
  • 69.3k
10 votes
Accepted

CUDA/NVRTC context switching function

Get rid of all the macros. Macros are problematic for various reasons, among them: the parameters are expanded in unexpected ways; they can confuse IDEs and debuggers; they are less readable. So avoid ...
G. Sliepen
  • 69.3k
9 votes

Strongly-typed CUDA device memory

Consider making it look like a container I think you already made something that is safe and relatively easy to use. However, the to_vector() member function is a ...
G. Sliepen
  • 69.3k
7 votes

RAII Wrapper For Registering/Mapping CUDA Resources

Use std::runtime_error You derive cuda_exception from std::exception, but now you have to ...
G. Sliepen
  • 69.3k
5 votes

Summation over different determinants that are independently computed using CUDA

General Observations This code isn't ready to be optimized, because it is not currently maintainable. Code optimization for performance is something that is done only when the code is production ready ...
pacmaninbw
  • 26.1k
4 votes

Avoid --compiler-options in makefile using nvcc

There are two different problems. One is to pass the same options to gcc and to gcc via nvcc (the accepted solution). The other is to pass several options without repeating the long command. The ...
alfC
  • 198
4 votes

Summation over different determinants that are independently computed using CUDA

improving the efficiency of the code I'll only address that goal. Move redundant calculations Example: ...
chux
  • 36.4k
4 votes

Summation over different determinants that are independently computed using CUDA

Disclaimer I don't know CUDA coding very well so I may miss some things. This second answer is in response to questions on my first answer from the Original Poster. The ...
pacmaninbw
  • 26.1k
4 votes
Accepted

CUDA/C++ Host/Device Polymorphic Class Implementation

nice! The code is generally rather good. I don’t have a lot to say about the language usage in the individual lines. The idea of using an abstract base to do different implementations selected at ...
JDługosz
  • 11.7k
4 votes

Pytorch code running slow for Deep Q learning (Reinforcement Learning)

What I can do for you is give you some general suggestions: use a library like Numba or similar; try PyPy, which is a JIT compiler; if possible, use C or C++ modules. And here is the code with some ...
AsrtoMichi
3 votes
Accepted

Tracking total iterations in CUDA fractal renderer

Use __shared__ A single global atomic variable will be a bottleneck if it will be written to frequently, so prefer a __shared__ ...
G. Sliepen
  • 69.3k
3 votes

SYMGS implementation in CUDA

Do you have any suggestions that can speed it up? First, remove coding weaknesses Check against expected value, not just one of the undesired values. ...
chux
  • 36.4k
3 votes

A CUDA compatible Vector class

Separate concerns: Your class has too many responsibilities. It is: being a container like std::vector; doing CUDA memory management; adding row/column vector ...
G. Sliepen
  • 69.3k
3 votes
Accepted

gpuIncreaseOne Function Implementation in CUDA

In your kernel, vectorIncreaseOne is using long double* types. According to the NVIDIA forum, there is no support for ...
PaulH
  • 173
3 votes

CUDA matrix class

A few short comments: Initializing a matrix with a compile-time list of values (initializer list) does not seem to be very useful - as CUDA is used to process huge amounts of data, not a tiny number ...
einpoklum
  • 2,099
2 votes
Accepted

Matrix Multiplication Implementation in CUDA C++ API with and without shared memory

matrix(int rows, int cols) { this->rows = rows; this->cols = cols; this->size = rows * cols; } 0_0 Did you actually mean ...
bipll
  • 998
2 votes
Accepted

A sleep primitive for CUDA device-side code

If, as Robert Crovella suggests, the clock64() calls don't get optimized away, then this should be enough: ...
einpoklum
  • 2,099
2 votes
Accepted

CUDA kernel to compare pairs of matrices

... this is a little slow for such a simple operation. Any suggestions to improve? Avoid work Minor idea: defer computations until needed with re-ordered code. Only assign once. ...
chux
  • 36.4k
2 votes

RAII Wrapper For CUDA Pointers

We define a helper macro CUDA_SAFE_CALL() but don't #undef it after use. Not clear whether this is intentional, but if our ...
Toby Speight
  • 88.4k
2 votes

Speeding up Buddhabrot calculation in PyCuda

Why use a binary and instead of a logical and in write_pixel: ...
Calak
  • 2,411
2 votes

I have a pytorch module that takes in some parameters and predicts the difference between one of it inputs and the target

computing a discarded result Please don't write code like this: def greet(name): 42 name + " is cool." print(f"Hello {name}!") Yes, ...
J_H
  • 42.3k
1 vote
Accepted

Applying cointegration function from statsmodels on a large dataframe

Your code is short, clear, and time consuming. How to drastically improve the speed? You need to compute fewer figures. For N time series you perform O(N^2) cointegration tests. The OP does not ...
J_H
  • 42.3k
1 vote
Accepted

CUDA-Kernel for a Dense-Sparse matrix multiplication

Thank you for offering this for review. I understand you're primarily interested in performance. But I confess I found the code a little on the opaque side and not quite ready to invite lots of folks ...
J_H
  • 42.3k
1 vote

Matrix Multiplication Implementation in CUDA C++ API with and without shared memory

In addition to what bipll noted, the elements member is not initialized by the constructor, but left in a garbage state. At the very least, make it ...
JDługosz
  • 11.7k
1 vote
Accepted

Mirrors number at borders of interval, untill it lays in the interval

__device__ mirror(int index , int lB, int uB) You've got an extra space before the first comma, and I think you're missing the function's return type entirely. ...
Quuxplusone
  • 19.7k
1 vote

Speeding up Buddhabrot calculation in PyCuda

Packaging floats I have implemented some selective sampling that I found on a forum, and I've changed the container for my complex numbers from a ...
maxb
  • 1,582
1 vote

Calculating the distance between several spatial points

#define _SQR(a) ((a)*(a)) #define _BLOCKSIZE 32 Identifiers beginning with an underscore followed by uppercase letter are reserved to the implementation for any ...
Toby Speight
  • 88.4k

Only top scored, non community-wiki answers of a minimum length are eligible