14
votes
Accepted
CUDA Mandelbrot Kernel
Small issues
Avoid doing calculations that the CPU can do. For example, let the CPU calculate scale_factor and pass it in as an argument. Then you also don't need <...
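A minimal sketch of that suggestion, with the scale computed once on the host and passed in as a kernel argument; the kernel name and parameters here are illustrative, not the reviewed code:

__global__ void mandelbrot_kernel(int *iterations, int width, int height,
                                  float scale_factor)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float re = (x - width / 2) * scale_factor;    // no per-thread recomputation
    float im = (y - height / 2) * scale_factor;
    // ... iterate z = z*z + c as before and store the count ...
    iterations[y * width + x] = 0;                // placeholder for the real count
}

// Host side: compute once, pass in.
// float scale_factor = 4.0f / (width * zoom);
// mandelbrot_kernel<<<grid, block>>>(d_iterations, width, height, scale_factor);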
10
votes
Accepted
CUDA/NVRTC context switching function
Get rid of all the macros
Macros are problematic for various reasons, among them:
the parameters are expanded in unexpected ways
they can confuse IDEs and debuggers
they are less readable
So avoid ...
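As one possible shape of that advice, a plain inline function can replace a function-like error-checking macro entirely; this is a sketch under the assumption that the macros wrap CUDA API calls, not the reviewed code itself:

#include <stdexcept>
#include <cuda_runtime.h>

inline void check(cudaError_t err)
{
    // A real function: typed parameter, debuggable, and no surprising
    // expansion of its argument.
    if (err != cudaSuccess)
        throw std::runtime_error(cudaGetErrorString(err));
}

// Usage: check(cudaMalloc(&ptr, bytes)); instead of a CHECK-style macro.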
9
votes
Strongly-typed CUDA device memory
Consider making it look like a container
I think you already made something that is safe and relatively easy to use. However, the to_vector() member function is a ...
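A rough sketch of the container-like direction being described, assuming a typed wrapper around cudaMalloc/cudaFree; the names are my own, not the reviewed class:

#include <cstddef>
#include <stdexcept>
#include <vector>
#include <cuda_runtime.h>

template <typename T>
class device_buffer {
public:
    explicit device_buffer(std::size_t n) : n_(n) {
        if (cudaMalloc(reinterpret_cast<void**>(&ptr_), n * sizeof(T)) != cudaSuccess)
            throw std::runtime_error("cudaMalloc failed");
    }
    ~device_buffer() { cudaFree(ptr_); }
    device_buffer(const device_buffer&) = delete;
    device_buffer& operator=(const device_buffer&) = delete;

    std::size_t size() const { return n_; }
    T*          data()       { return ptr_; }

    std::vector<T> to_vector() const {            // copy back to the host
        std::vector<T> out(n_);
        cudaMemcpy(out.data(), ptr_, n_ * sizeof(T), cudaMemcpyDeviceToHost);
        return out;
    }

private:
    T*          ptr_ = nullptr;
    std::size_t n_   = 0;
};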
7
votes
RAII Wrapper For Registering/Mapping CUDA Resources
Use std::runtime_error
You derive cuda_exception from std::exception, but now you have to ...
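For context, the suggestion roughly amounts to something like this; the constructor signature is assumed rather than taken from the reviewed code:

#include <stdexcept>
#include <string>
#include <cuda_runtime.h>

class cuda_exception : public std::runtime_error {
public:
    explicit cuda_exception(cudaError_t err)
        : std::runtime_error(std::string("CUDA error: ") + cudaGetErrorString(err)) {}
};
// std::runtime_error already stores the message and implements what(),
// so no extra string member or what() override is needed.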
5
votes
Summation over different determinants that are independently computed using CUDA
General Observations
This code isn't ready to be optimized, because it is not currently maintainable. Code optimization for performance is something that is done only when the code is production ready ...
4
votes
Avoid --compiler-options in makefile using nvcc
There are two different problems. One is to pass the same options to gcc directly and to gcc via nvcc (the accepted solution). The other is to pass several options without repeating the long command.
The ...
4
votes
Summation over different determinants that are independently computed using CUDA
improving the efficiency of the code
I'll only address that goal.
Move redundant calculations
Example:
...
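The answer's own example is truncated above; as a generic illustration of the idea (names invented, not the reviewed code), hoist values that do not change inside the loop out of it:

__global__ void scale_rows(float *a, int n, int m, float alpha, float beta)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;

    float factor = alpha / beta;        // computed once per thread ...
    for (int j = 0; j < m; ++j)
        a[row * m + j] *= factor;       // ... not once per loop iteration
}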
4
votes
Summation over different determinants that are independently computed using CUDA
Disclaimer
I don't know CUDA coding very well so I may miss some things.
This second answer is in response to questions on my first answer from the Original Poster.
The ...
4
votes
Accepted
CUDA/C++ Host/Device Polymorphic Class Implementation
nice!
The code is generally rather good. I don’t have a lot to say about the language usage in the individual lines.
The idea of using an abstract base to do different implementations selected at ...
4
votes
Pytorch code running slow for Deep Q learning (Reinforcement Learning)
What I can do is give you some general suggestions:
Use a library like Numba or similar;
try PyPy, which is a JIT compiler;
if possible, use C or C++ modules.
And here is the code with some ...
3
votes
Accepted
Tracking total iterations in CUDA fractal renderer
Use __shared__
A single global atomic variable will be a bottleneck if it will be written to frequently, so prefer a __shared__ ...
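A minimal sketch of that pattern, assuming the kernel only needs a grand total (names are illustrative): each block accumulates into shared memory and issues a single global atomic.

__global__ void count_iterations(const int *per_pixel, int n,
                                 unsigned long long *total)
{
    __shared__ unsigned long long block_total;
    if (threadIdx.x == 0) block_total = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&block_total, (unsigned long long)per_pixel[i]);  // cheap, shared
    __syncthreads();

    if (threadIdx.x == 0)
        atomicAdd(total, block_total);    // one global atomic per block
}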
3
votes
SYMGS implementation in CUDA
Do you have any suggestions that can speed it up?
First, remove coding weaknesses
Check against expected value, not just one of the undesired values.
...
3
votes
A CUDA compatible Vector class
Separate concerns
Your class has too many responsibilities. It is:
Being a container like std::vector.
Doing CUDA memory management.
Adding row/column vector ...
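One possible shape of that separation, sketched with invented names: memory management in one small type, the container interface layered on top, and any row/column semantics kept out of both.

#include <cstddef>
#include <cuda_runtime.h>

template <typename T>
class device_memory {                     // concern 1: owning the allocation
public:
    explicit device_memory(std::size_t n) {
        cudaMalloc(reinterpret_cast<void**>(&ptr_), n * sizeof(T));
    }
    ~device_memory() { cudaFree(ptr_); }
    device_memory(const device_memory&) = delete;
    device_memory& operator=(const device_memory&) = delete;
    T* get() const { return ptr_; }
private:
    T* ptr_ = nullptr;
};

template <typename T>
class device_vector {                     // concern 2: container-style interface
public:
    explicit device_vector(std::size_t n) : mem_(n), n_(n) {}
    std::size_t size() const { return n_; }
    T*          data() const { return mem_.get(); }
private:
    device_memory<T> mem_;
    std::size_t      n_;
};
// Row/column-vector behaviour would become a third, separate layer.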
3
votes
Accepted
gpuIncreaseOne Function Implementation in CUDA
In your kernel, vectorIncreaseOne is using long double* types. According to the NVIDIA forum, there is no support for ...
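A sketch of the implied fix, assuming the values fit in double precision; the real parameter list of vectorIncreaseOne may differ:

__global__ void vectorIncreaseOne(double *v, int n)   // was: long double *
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] += 1.0;   // device code has no real long double; double is the widest type
}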
3
votes
CUDA matrix class
A few short comments:
Initializing a matrix with a compile-time list of values (initializer list) does not seem to be very useful - as CUDA is used to process huge amounts of data, not a tiny number ...
2
votes
Accepted
Matrix Multiplication Implementation in CUDA C++ API with and without shared memory
matrix(int rows, int cols) {
    this->rows = rows;
    this->cols = cols;
    this->size = rows * cols;
}
0_0
Did you actually mean
...
2
votes
Accepted
A sleep primitive for CUDA device-side code
If, as Robert Crovella suggests, the clock64() calls don't get optimized away, then this should be enough:
...
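The snippet itself is elided above; a busy-wait along those lines might look like this (my sketch, not necessarily the answer's exact code):

__device__ void sleep_cycles(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) {
        // spin; the repeated clock64() reads are the whole point
    }
}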
2
votes
Accepted
CUDA kernel to compare pairs of matrices
... this is a little slow for such a simple operation. Any suggestions to improve?
Avoid work
Minor idea: defer computations until needed with re-ordered code. Only assign once.
...
2
votes
RAII Wrapper For CUDA Pointers
We define a helper macro CUDA_SAFE_CALL() but don't #undef it after use. Not clear whether this is intentional, but if our ...
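For illustration, the tidy-up could look like this; the macro body here is a placeholder, not the reviewed one:

#include <stdexcept>
#include <cuda_runtime.h>

#define CUDA_SAFE_CALL(call)                                      \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess)                                  \
            throw std::runtime_error(cudaGetErrorString(err_));   \
    } while (0)

// ... the wrapper's functions use CUDA_SAFE_CALL(...) here ...

#undef CUDA_SAFE_CALL   // keep the helper from leaking into including files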
2
votes
Speeding up Buddhabrot calculation in PyCuda
Why use a bitwise and instead of a logical and in write_pixel:
...
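To make the distinction concrete (this is not the original write_pixel code): & always evaluates both operands, while && reads as a condition and short-circuits.

__device__ bool in_bounds(int x, int y, int w, int h)
{
    return (x >= 0 && x < w) && (y >= 0 && y < h);   // rather than & on boolean tests
}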
2
votes
I have a pytorch module that takes in some parameters and predicts the difference between one of it inputs and the target
computing a discarded result
Please don't write code like this:
def greet(name):
    42
    name + " is cool."
    print(f"Hello {name}!")
Yes, ...
1
vote
Accepted
Applying cointegration function from statsmodels on a large dataframe
Your code is short, clear, and time consuming.
How to drastically improve the speed?
You need to compute fewer figures.
For N time series you perform O(N^2) cointegration tests.
The OP does not ...
1
vote
Accepted
CUDA-Kernel for a Dense-Sparse matrix multiplication
Thank you for offering this for review.
I understand you're primarily interested in performance.
But I confess I found the code a little on the opaque side
and not quite ready to invite lots of folks ...
1
vote
Matrix Multiplication Implementation in CUDA C++ API with and without shared memory
In addition to what bipll noted, the elements member is not initialized by the constructor, but left in a garbage state. At the very least, make it ...
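Based only on the snippet quoted earlier, the minimal fix might look like this, initializing elements to a known value; the real element type and the answer's exact recommendation are not shown:

struct matrix {
    matrix(int rows, int cols)
        : rows(rows), cols(cols), size(rows * cols), elements(nullptr) {}

    int    rows, cols, size;
    float *elements;   // defined (null) until real storage is attached
};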
1
vote
Accepted
Mirrors number at borders of interval, untill it lays in the interval
__device__
mirror(int index , int lB, int uB)
You've got an extra space before the first comma, and I think you're missing the function's return type entirely. ...
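A corrected declaration would add the return type and drop the stray space; the body below is a plausible reflect-at-the-borders implementation for a closed interval [lB, uB], not the code under review:

__device__ int mirror(int index, int lB, int uB)
{
    while (index < lB || index > uB) {
        if (index < lB) index = 2 * lB - index;   // reflect at the lower border
        if (index > uB) index = 2 * uB - index;   // reflect at the upper border
    }
    return index;
}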
1
vote
Speeding up Buddhabrot calculation in PyCuda
Packaging floats
I have implemented some selective sampling that I found on a forum, and I've changed the container for my complex numbers from a ...
1
vote
Calculating the distance between several spatial points
#define _SQR(a) ((a)*(a))
#define _BLOCKSIZE 32
Identifiers beginning with an underscore followed by an uppercase letter are reserved to the implementation for any ...
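A sketch of the fix, with names of my own choosing: replace the reserved-identifier macros with a constexpr constant and a small function.

constexpr int block_size = 32;            // instead of #define _BLOCKSIZE 32

template <typename T>
__host__ __device__ constexpr T sqr(T a)  // instead of #define _SQR(a) ((a)*(a))
{
    return a * a;
}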
Related Tags
cuda × 55
c++ × 34
performance × 22
c × 14
beginner × 7
matrix × 6
python × 5
fractals × 5
memory-management × 4
multithreading × 3
raii × 3
object-oriented × 2
random × 2
image × 2
primes × 2
pandas × 2
template × 2
c++20 × 2
opengl × 2
neural-network × 2
pytorch × 2
python-3.x × 1
algorithm × 1
array × 1
time-limit-exceeded × 1