The Wayback Machine - https://web.archive.org/web/20251208083819/https://github.com/Bruce-Lee-LY
Pinned

  1. decoding_attention (Public)

    Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA using CUDA cores during the decoding stage of LLM inference.

    C++ · 46 stars · 4 forks

  2. flash_attention_inference (Public)

    Benchmarks the performance of the C++ interfaces of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.

    C++ · 43 stars · 6 forks

  3. cuda_hgemm (Public)

    Several optimization methods for half-precision general matrix multiplication (HGEMM) using tensor cores, via the WMMA API and MMA PTX instructions.

    Cuda · 505 stars · 87 forks

  4. cuda_hook (Public)

    Hooks CUDA-related dynamic libraries using automated code-generation tools.

    C · 172 stars · 44 forks

  5. cuda_hgemv (Public)

    Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.

    Cuda · 70 stars · 7 forks

  6. cutlass_gemm (Public)

    Multiple GEMM operators built with CUTLASS to support LLM inference.

    C++ · 20 stars · 2 forks