# gemm
Here are 41 public repositories matching this topic...
Tuned OpenCL BLAS
Updated May 17, 2022 - C++
Fast inference engine for Transformer models
Topics: deep-neural-networks, deep-learning, cpp, neon, machine-translation, openmp, parallel-computing, cuda, inference, avx, intrinsics, avx2, neural-machine-translation, opennmt, quantization, gemm, mkl, thrust, transformer-models, onednn
Updated May 19, 2022 - C++
BLISlab: A Sandbox for Optimizing GEMM
Updated Jun 17, 2021 - C
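BLISlab's exercises center on loop blocking for cache reuse. As a rough illustration of the blocking idea only (a sketch of mine, not code from the repository; `gemm_blocked` and the block size are hypothetical), the snippet below tiles the classic triple loop so each pair of tiles stays cache-resident:

```c
#include <stddef.h>

/* Cache-blocked GEMM sketch: C += A * B, all row-major n x n.
 * The block size is illustrative; real tunings (as in BLISlab)
 * are derived from cache and register geometry. */
#define BLOCK 64

void gemm_blocked(size_t n, const float *A, const float *B, float *C) {
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t kk = 0; kk < n; kk += BLOCK)
            for (size_t jj = 0; jj < n; jj += BLOCK)
                /* Multiply one BLOCK x BLOCK tile of A by one of B. */
                for (size_t i = ii; i < ii + BLOCK && i < n; i++)
                    for (size_t k = kk; k < kk + BLOCK && k < n; k++) {
                        float a = A[i * n + k];
                        for (size_t j = jj; j < jj + BLOCK && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```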
The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats and integers
Topics: deep-learning, assembler, parallel, openmp, jit, simd, matrix-multiplication, high-performance-computing, blas, convolution, tensor, compiler-optimization, gemm, runtime-cpu-detection
Updated Feb 26, 2021 - Nim
Stretching GPU performance for GEMMs and tensor contractions.
Topics: python, machine-learning, amd, gpu, assembly, opencl, dnn, matrix-multiplication, neural-networks, gpu-acceleration, blas, hip, gpu-computing, tensors, tensor-contraction, gemm, radeon, auto-tuning, radeon-open-compute
Updated May 19, 2022 - Python
DBCSR: Distributed Block Compressed Sparse Row matrix library
Topics: hpc, linear-algebra, mpi, cuda, matrix-multiplication, blas, sparse-matrix, cp2k, gemm, mkl, openmp-parallelization
Updated May 19, 2022 - Fortran
Optimizing SGEMM kernel functions on NVIDIA GPUs to close-to-cuBLAS performance.
Updated Nov 28, 2021 - Cuda
Code for benchmarking GPU performance based on cublasSgemm and cublasHgemm.
Updated Jul 7, 2017 - Cuda
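For context, here is a minimal sketch of how a benchmark of this kind typically times cublasSgemm with CUDA events. This is an assumption about the approach, not the repository's code; the helper `time_sgemm` is a hypothetical name. cuBLAS expects column-major storage, and error checking is omitted for brevity.

```c
#include <cublas_v2.h>
#include <cuda_runtime.h>

/* Time one n x n x n SGEMM with CUDA events.
 * d_A, d_B, d_C are device buffers of n*n floats. */
float time_sgemm(cublasHandle_t handle, int n,
                 const float *d_A, const float *d_B, float *d_C) {
    const float alpha = 1.0f, beta = 0.0f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    /* Column-major C = alpha*A*B + beta*C, no transposes. */
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, d_A, n, d_B, n, &beta, d_C, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;  /* milliseconds; 2*n^3 / (ms * 1e6) gives GFLOP/s */
}
```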
Specialized Parallel Linear Algebra, providing distributed GEMM functionality for specific matrix distributions with optional GPU acceleration.
Updated Mar 15, 2022 - C++
This repository targets performance optimization of the OpenCL GEMM function. It compares several libraries (clBLAS, CLBlast, MIOpenGEMM, Intel MKL on CPU, and cuBLAS on CUDA) across different matrix sizes, hardware vendors, and operating systems. Out-of-the-box x86_64 binaries are provided for MSVC, MinGW, and Linux (CentOS), ready to use.
Updated Mar 28, 2019 - C
Manually optimize the GEMM (GEneral Matrix Multiply) operation. There is a long way to go.
Updated Aug 22, 2021 - C++
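The GEMM contract being optimized in projects like this one is C = alpha*A*B + beta*C. As a baseline, a deliberately unoptimized reference (my sketch, not the repository's code; `gemm_ref` is a hypothetical name) looks like this:

```c
#include <stddef.h>

/* Reference GEMM: C = alpha * A * B + beta * C, row-major.
 * A is m x k, B is k x n, C is m x n. Deliberately unoptimized;
 * this is the starting point that blocking, packing, and
 * vectorization then improve on. */
void gemm_ref(size_t m, size_t n, size_t k, float alpha,
              const float *A, const float *B, float beta, float *C) {
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; p++)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
}
```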
Serial and parallel implementations of matrix multiplication
Updated Feb 19, 2021 - C++
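A common shape for the parallel variant is an OpenMP pragma over the outer loop, with each thread writing disjoint rows of C so no synchronization is needed. The sketch below is an assumption about the approach, not the repository's code; `matmul_omp` is a hypothetical name.

```c
#include <stddef.h>
#include <omp.h>

/* Parallel matrix multiply C = A * B, row-major n x n.
 * Rows of C are distributed across threads; each thread's
 * writes are disjoint, so the loop body needs no locking. */
void matmul_omp(size_t n, const float *A, const float *B, float *C) {
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++) {   /* signed index for OpenMP */
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < n; k++)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
    }
}
```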
Low Precision Arithmetic for Convolutional Neural Network Inference
Updated Oct 29, 2017 - C++
My experiments with convolution
Updated Jun 21, 2020 - C++
Fast Matrix Multiplication Implementation in the C programming language. This matrix multiplication algorithm is similar to what NumPy uses to compute dot products.
Updated Jun 6, 2021 - C
CUDA kernel functions
Updated May 17, 2022 - Cuda
DGEMM on KNL, achieving 75% of MKL performance.
Updated May 19, 2022 - C++