---
layout: default
title: Changelog
nav_order: 10
permalink: /CHANGELOG/
---

All notable changes to this project are documented here.
The format follows Keep a Changelog and the project aims to follow Semantic Versioning.
- Integer overflow risk in `verify.cuh` and `tensor_core_sgemm.cuh` for large matrices (use `size_t`)
- Command-line parsing now uses `strtol()` with proper error handling instead of `atoi()`
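
The two fixes above can be illustrated with a minimal sketch. It is not the project's actual code: `parse_dim` and `element_count` are hypothetical helper names, and the example assumes a platform with a 64-bit `size_t`.

```cpp
#include <cerrno>
#include <climits>
#include <cstddef>
#include <cstdlib>

// Hypothetical helper: parse a positive matrix dimension from a CLI argument.
// Unlike atoi(), strtol() lets us detect empty input, trailing junk, and overflow.
bool parse_dim(const char* arg, int* out) {
    if (arg == nullptr || *arg == '\0') return false;
    errno = 0;
    char* end = nullptr;
    long value = std::strtol(arg, &end, 10);
    if (end == arg || *end != '\0') return false;     // no digits, or trailing characters
    if (errno == ERANGE) return false;                // value does not fit in long
    if (value <= 0 || value > INT_MAX) return false;  // not a usable dimension
    *out = static_cast<int>(value);
    return true;
}

// Hypothetical helper: compute the element count in size_t so that
// rows * cols cannot wrap around a 32-bit int for large matrices.
std::size_t element_count(int rows, int cols) {
    return static_cast<std::size_t>(rows) * static_cast<std::size_t>(cols);
}
```

Widening each operand before the multiplication matters: casting only the product would still perform the multiply in `int` and overflow first.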
- Consolidated repository governance around `openspec/specs/`, updated agent instructions, and simplified documentation roles.
- Reworked README, GitHub Pages content, and supporting docs into clearer repository-entry and learning surfaces.
- Began pruning redundant release-history and engineering guidance artifacts in favor of fewer authoritative files.
- Duplicate `LICENSE` file (kept `LICENSE.md` with third-party info)
- Legacy `_bmad/` and `_bmad-output/` directories (replaced by OpenSpec)
- `.clang-tidy` configuration for static analysis
- Tensor Core WMMA SGEMM kernel with guarded FP32 fallback for unsupported dimensions
- Benchmark enhancements, including roofline data export and configurable warmup/benchmark iterations
- Google Test coverage for standard kernels, Tensor Core fast path, fallback behavior, and edge cases
- Bilingual documentation and a GitHub Pages documentation site
- Consolidated source code into `src/kernels/`, `src/utils/`, and `tests/`
- Adopted CMake as the primary build system while retaining the Makefile for quick local runs
- Expanded supported CUDA architecture targets to cover Volta through Hopper generation GPUs
- Tensor Core path memory management issues
- Double-buffer synchronization issues
- Grid dimension handling for non-square matrices
- Bank-conflict-free and double-buffer SGEMM kernels
- CUDA Events-based benchmark infrastructure
- Nsight-oriented profiling support
- Migrated from an earlier single-file layout to the current modular structure
- Standardized on CUDA 11.0+ and C++17
- Legacy single-file benchmark script
- SM 6.x support
- Initial naive and tiled SGEMM kernels
- Basic cuBLAS correctness verification
- First benchmark CLI