# gemm
Here are 41 public repositories matching this topic...
Tuned OpenCL BLAS
Updated May 17, 2022 - C++
Fast inference engine for Transformer models
Topics: deep-neural-networks, deep-learning, cpp, neon, machine-translation, openmp, parallel-computing, cuda, inference, avx, intrinsics, avx2, neural-machine-translation, opennmt, quantization, gemm, mkl, thrust, transformer-models, onednn
Updated May 19, 2022 - C++
BLISlab: A Sandbox for Optimizing GEMM
Updated Jun 17, 2021 - C
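BLISlab's exercises center on loop blocking for cache reuse. As a rough illustration of the blocking idea only (a sketch of mine, not code from the repository; `gemm_blocked` and the block size are hypothetical), the snippet below tiles the classic triple loop so each pair of tiles stays cache-resident:

```c
#include <stddef.h>

/* Cache-blocked GEMM sketch: C += A * B, all row-major n x n.
 * The block size is illustrative; real tunings (as in BLISlab)
 * are derived from cache and register geometry. */
#define BLOCK 64

void gemm_blocked(size_t n, const float *A, const float *B, float *C) {
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t kk = 0; kk < n; kk += BLOCK)
            for (size_t jj = 0; jj < n; jj += BLOCK)
                /* Multiply one BLOCK x BLOCK tile of A by one of B. */
                for (size_t i = ii; i < ii + BLOCK && i < n; i++)
                    for (size_t k = kk; k < kk + BLOCK && k < n; k++) {
                        float a = A[i * n + k];
                        for (size_t j = jj; j < jj + BLOCK && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```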
The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats and integers
Topics: deep-learning, assembler, parallel, openmp, jit, simd, matrix-multiplication, high-performance-computing, blas, convolution, tensor, compiler-optimization, gemm, runtime-cpu-detection
Updated Feb 26, 2021 - Nim
Stretching GPU performance for GEMMs and tensor contractions.
Topics: python, machine-learning, amd, gpu, assembly, opencl, dnn, matrix-multiplication, neural-networks, gpu-acceleration, blas, hip, gpu-computing, tensors, tensor-contraction, gemm, radeon, auto-tuning, radeon-open-compute
Updated May 19, 2022 - Python
DBCSR: Distributed Block Compressed Sparse Row matrix library
Topics: hpc, linear-algebra, mpi, cuda, matrix-multiplication, blas, sparse-matrix, cp2k, gemm, mkl, openmp-parallelization
Updated May 19, 2022 - Fortran
Optimizing SGEMM kernel functions on NVIDIA GPUs to close-to-cuBLAS performance.
Updated Nov 28, 2021 - Cuda
Code for benchmarking GPU performance based on cublasSgemm and cublasHgemm.
Updated Jul 7, 2017 - Cuda
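For context, here is a minimal sketch of how a benchmark of this kind typically times cublasSgemm with CUDA events. This is an assumption about the approach, not the repository's code; the helper `time_sgemm` is a hypothetical name. cuBLAS expects column-major storage, and error checking is omitted for brevity.

```c
#include <cublas_v2.h>
#include <cuda_runtime.h>

/* Time one n x n x n SGEMM with CUDA events.
 * d_A, d_B, d_C are device buffers of n*n floats. */
float time_sgemm(cublasHandle_t handle, int n,
                 const float *d_A, const float *d_B, float *d_C) {
    const float alpha = 1.0f, beta = 0.0f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    /* Column-major C = alpha*A*B + beta*C, no transposes. */
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, d_A, n, d_B, n, &beta, d_C, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;  /* milliseconds; 2*n^3 / (ms * 1e6) gives GFLOP/s */
}
```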
Specialized Parallel Linear Algebra, providing distributed GEMM functionality for specific matrix distributions with optional GPU acceleration.
Updated Mar 15, 2022 - C++
This repository targets performance optimization of the OpenCL GEMM function. It compares several libraries (clBLAS, CLBlast, MIOpenGEMM, Intel MKL on CPU, and cuBLAS on CUDA) across different matrix sizes, hardware vendors, and operating systems. Out-of-the-box x86_64 binaries are provided for MSVC, MinGW, and Linux (CentOS), ready to use.
Updated Mar 28, 2019 - C
Manually optimize the GEMM (GEneral Matrix Multiply) operation. There is a long way to go.
Updated Aug 22, 2021 - C++
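The GEMM contract being optimized in projects like this one is C = alpha*A*B + beta*C. As a baseline, a deliberately unoptimized reference (my sketch, not the repository's code; `gemm_ref` is a hypothetical name) looks like this:

```c
#include <stddef.h>

/* Reference GEMM: C = alpha * A * B + beta * C, row-major.
 * A is m x k, B is k x n, C is m x n. Deliberately unoptimized;
 * this is the starting point that blocking, packing, and
 * vectorization then improve on. */
void gemm_ref(size_t m, size_t n, size_t k, float alpha,
              const float *A, const float *B, float beta, float *C) {
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; p++)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
}
```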
Serial and parallel implementations of matrix multiplication
Updated Feb 19, 2021 - C++
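A common shape for the parallel variant is an OpenMP pragma over the outer loop, with each thread writing disjoint rows of C so no synchronization is needed. The sketch below is an assumption about the approach, not the repository's code; `matmul_omp` is a hypothetical name.

```c
#include <stddef.h>
#include <omp.h>

/* Parallel matrix multiply C = A * B, row-major n x n.
 * Rows of C are distributed across threads; each thread's
 * writes are disjoint, so the loop body needs no locking. */
void matmul_omp(size_t n, const float *A, const float *B, float *C) {
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++) {   /* signed index for OpenMP */
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < n; k++)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
    }
}
```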
Low Precision Arithmetic for Convolutional Neural Network Inference
Updated Oct 29, 2017 - C++
My experiments with convolution
Updated Jun 21, 2020 - C++
Fast Matrix Multiplication Implementation in the C programming language. This matrix multiplication algorithm is similar to what NumPy uses to compute dot products.
Updated Jun 6, 2021 - C
CUDA kernel functions
Updated May 17, 2022 - Cuda
DGEMM on KNL, achieving 75% of MKL performance.
Updated May 19, 2022 - C++