Commit de19259

refactor: move algorithms to sage-libs and add compatibility shim
1 parent ffc9fdf commit de19259

63 files changed

Lines changed: 435 additions & 4023 deletions

README.md

Lines changed: 15 additions & 244 deletions
@@ -1,250 +1,21 @@
 # SAGE-DB-Bench
 
-A benchmarking framework for streaming vector indexes
+Directory guide after the reorganization:
 
-## 1. One-Step Deployment
+- Quick start and full documentation: see [docs/README.md](docs/README.md)
+- Deployment and installation guide: see [docs/INSTALL.md](docs/INSTALL.md)
+- Directory structure and testing conventions: see [docs/STRUCTURE.md](docs/STRUCTURE.md)
+- Algorithm release and CI notes: see [docs/ALGORITHM_DEPLOYMENT.md](docs/ALGORITHM_DEPLOYMENT.md) and [docs/CICD_FIXES.md](docs/CICD_FIXES.md)
+- Scripts: in [scripts/](scripts/) (e.g. deploy, install, activate, test_local)
+- Container configuration: in [docker/](docker/) (Dockerfile, docker-compose.yml)
+- Test and tooling configuration: in [config/](config/) (pytest.ini, pre-commit, setup.cfg); `tests/conftest.py` takes effect automatically
 
-```bash
-# Clone the repository (including submodules)
-git clone --recursive https://github.com/intellistream/SAGE-DB-Bench.git
-cd SAGE-DB-Bench
+Core code and data:
 
-# Run the deployment script
-./deploy.sh
+- Benchmark framework and algorithm interfaces: bench/
+- Dataset management: datasets/
+- Third-party/submodule algorithm implementations: algorithms_impl/
+- Experiment configurations: runbooks/
+- Result export and tools: compute_gt.py, run_benchmark.py, export_results.py
 
-# Activate the virtual environment
-source sage-db-bench/bin/activate
-```
-
-**Deployment options:**
-```bash
-./deploy.sh --skip-system-deps   # Skip system dependency installation (use when dependencies are already installed)
-./deploy.sh --skip-build         # Skip the build (environment setup only)
-./deploy.sh --help               # Show help
-```
-
-## 2. Datasets
-
-### Supported Datasets
-
-| Dataset | Dimensions | Size | Description |
-|---------|------------|------|-------------|
-| sift | 128 | 1M | SIFT feature vectors |
-| glove | 100 | 1.2M | GloVe word vectors |
-| random-xs | 32 | 10K | Random data (for testing) |
-| random-s | 64 | 100K | Random data (small scale) |
-| random-m | 128 | 1M | Random data (medium scale) |
-
-### Downloading Datasets
-
-```bash
-python prepare_dataset.py --dataset sift
-python prepare_dataset.py --dataset glove
-```
-
-### Adding a New Dataset
-
-Add the following to `datasets/registry.py`:
-
-```python
-class MyDataset(Dataset):
-    def __init__(self):
-        self.nb = 100000          # number of base vectors
-        self.nq = 10000           # number of queries
-        self.d = 128              # vector dimensionality
-        self.dtype = 'float32'
-        self.basedir = 'raw_data/mydataset'
-
-    def prepare(self):
-        # Download or generate the data
-        pass
-
-    def get_data_in_range(self, start, end):
-        # Return the data in the range [start, end)
-        pass
-
-    def get_queries(self):
-        # Return the query vectors
-        pass
-
-    def distance(self):
-        return 'euclidean'  # or 'ip'
-
-# Register the dataset
-DATASETS['mydataset'] = lambda: MyDataset()
-```
-
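Note on the removed dataset section: the `MyDataset` template leaves every method as `pass`. The following is a minimal sketch of what a filled-in version might look like, not code from the repository. `Dataset` and `DATASETS` are local stand-ins for the names the README takes from `datasets/registry.py` (the real ones may carry more behavior), `TinyRandomDataset` and the `tiny-random` key are hypothetical, and plain Python lists stand in for whatever array type the framework actually expects.

```python
import random


class Dataset:
    """Stand-in for the base class in datasets/registry.py (assumption)."""


# Stand-in for the registry dict the README writes into (assumption).
DATASETS = {}


class TinyRandomDataset(Dataset):
    """Hypothetical in-memory dataset using the method names from the README."""

    def __init__(self, nb=1000, nq=10, d=8, seed=0):
        self.nb = nb                # number of base vectors
        self.nq = nq                # number of queries
        self.d = d                  # vector dimensionality
        self.dtype = "float32"
        self.basedir = "raw_data/tiny-random"  # illustrative only, nothing is written
        self._seed = seed
        self._base = None
        self._queries = None

    def prepare(self):
        # Generate the data instead of downloading it.
        rng = random.Random(self._seed)
        self._base = [[rng.random() for _ in range(self.d)] for _ in range(self.nb)]
        self._queries = [[rng.random() for _ in range(self.d)] for _ in range(self.nq)]

    def get_data_in_range(self, start, end):
        # Half-open range [start, end), as the README comment specifies.
        return self._base[start:end]

    def get_queries(self):
        return self._queries

    def distance(self):
        return "euclidean"


# Register a factory, mirroring the README's lambda-based registration.
DATASETS["tiny-random"] = lambda: TinyRandomDataset()

ds = DATASETS["tiny-random"]()
ds.prepare()
batch = ds.get_data_in_range(0, 100)
```

Registering a factory rather than an instance means nothing is generated or downloaded until a benchmark actually asks for the dataset.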
-## 3. Algorithms
-
-### Supported Algorithms
-
-| Algorithm | Type | Description |
-|-----------|------|-------------|
-| faiss_HNSW | Graph index | Faiss HNSW implementation |
-| faiss_HNSW_Optimized | Graph index | HNSW with Gorder optimization support |
-| faiss_IVFPQ | Quantization | Inverted file + product quantization |
-| diskann | Graph index | DiskANN |
-| vsag_hnsw | Graph index | VSAG HNSW |
-
-### Adding a New Algorithm
-
-1. Create a directory under `bench/algorithms/`:
-
-```
-bench/algorithms/my_algo/
-├── __init__.py
-├── my_algo.py
-└── config.yaml
-```
-
-2. Implement the algorithm interface (`my_algo.py`):
-
-```python
-from ..base import BaseStreamingANN
-
-class MyAlgorithm(BaseStreamingANN):
-    def __init__(self, metric, index_params):
-        self.metric = metric
-        self.name = "my_algo"
-        # Parse index_params
-
-    def setup(self, dtype, max_pts, ndim):
-        # Initialize the index
-        pass
-
-    def insert(self, X, ids):
-        # Insert vectors
-        pass
-
-    def delete(self, ids):
-        # Delete vectors
-        pass
-
-    def query(self, X, k):
-        # Query; returns (ids, distances)
-        pass
-
-    def set_query_arguments(self, query_args):
-        # Set query parameters (e.g. ef)
-        pass
-```
-
-3. Create the configuration file (`config.yaml`):
-
-```yaml
-sift:
-  my_algo:
-    module: benchmark_anns.bench.algorithms.my_algo.my_algo
-    constructor: MyAlgorithm
-    base-args: ["@metric"]
-    run-groups:
-      base:
-        args: |
-          [{"param1": 32, "param2": 100}]
-        query-args: |
-          [{"ef": 40}]
-```
-
-4. Export it in `__init__.py`:
-
-```python
-from .my_algo import MyAlgorithm
-__all__ = ['MyAlgorithm']
-```
-
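Note on the removed algorithm section: the `MyAlgorithm` template only names the five methods. As a rough illustration of the contract they imply, here is an exact brute-force index that supports insert, delete and top-k query. It is a sketch under assumptions rather than the framework's code: `BaseStreamingANN` is a local stand-in for the class imported from `..base`, `BruteForceANN` is a hypothetical name, only Euclidean distance is handled regardless of `metric`, and the `(ids, distances)` return value is modeled as two nested lists.

```python
import heapq
import math


class BaseStreamingANN:
    """Stand-in for the base class imported from ..base (assumption)."""


class BruteForceANN(BaseStreamingANN):
    """Hypothetical exact index: scans every stored vector on each query."""

    def __init__(self, metric, index_params):
        self.metric = metric  # stored only; this sketch always uses Euclidean distance
        self.name = "brute_force"
        self.index_params = dict(index_params)
        self._vectors = {}

    def setup(self, dtype, max_pts, ndim):
        # Start from an empty id -> vector map.
        self._ndim = ndim
        self._vectors = {}

    def insert(self, X, ids):
        for vec, vid in zip(X, ids):
            self._vectors[vid] = list(vec)

    def delete(self, ids):
        for vid in ids:
            self._vectors.pop(vid, None)  # ignore ids that were never inserted

    def set_query_arguments(self, query_args):
        self.query_args = dict(query_args)  # exact search has nothing to tune

    def query(self, X, k):
        all_ids, all_dists = [], []
        for q in X:
            scored = ((math.dist(q, v), vid) for vid, v in self._vectors.items())
            top = heapq.nsmallest(k, scored)  # k closest as (distance, id) pairs
            all_ids.append([vid for _, vid in top])
            all_dists.append([d for d, _ in top])
        return all_ids, all_dists


index = BruteForceANN("euclidean", {})
index.setup("float32", max_pts=10, ndim=2)
index.insert([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]], ids=[10, 11, 12])
index.delete([11])
ids, dists = index.query([[0.9, 0.0]], k=2)
```

An exact index like this is mainly useful as a correctness baseline: any approximate implementation of the same interface can be checked against its results on a small dataset.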
-## 4. Test Workflow
-
-### 4.1 Computing Ground Truth
-
-```bash
-python compute_gt.py \
-    --dataset sift \
-    --runbook_file runbooks/simple.yaml \
-    --gt_cmdline_tool ./DiskANN/build/apps/utils/compute_groundtruth
-```
-
-The generated ground-truth files are saved under `raw_data/{dataset}/{size}/{runbook}.yaml/`.
-
-### 4.2 Running the Benchmark
-
-```bash
-# Basic usage
-python run_benchmark.py \
-    --algorithm faiss_HNSW_Optimized \
-    --dataset sift \
-    --runbook runbooks/simple.yaml
-
-# Enable cache-miss measurement
-python run_benchmark.py \
-    --algorithm faiss_HNSW_Optimized \
-    --dataset sift \
-    --runbook runbooks/simple.yaml \
-    --enable-cache-profiling
-```
-
-### 4.3 Exporting Results
-
-```bash
-python export_results.py \
-    --dataset sift \
-    --algorithm faiss_HNSW_Optimized \
-    --runbook simple
-```
-
-The exported results include:
-- **recall**: recall for each batch
-- **query_qps**: query throughput
-- **query_latency_ms**: query latency
-- **cache_misses**: number of cache misses (if enabled)
-
-Result files are saved under `results/{dataset}/{algorithm}/`.
-
-## 5. Runbook Format
-
-```yaml
-sift:
-  max_pts: 1000000
-  1:
-    operation: "startHPC"
-  2:
-    operation: "initial"
-    start: 0
-    end: 50000
-  3:
-    operation: "batch_insert"
-    start: 50000
-    end: 100000
-    batchSize: 2500
-    eventRate: 10000
-  4:
-    operation: "waitPending"
-  5:
-    operation: "search"
-  6:
-    operation: "endHPC"
-```
-
-**Supported operations:**
-- `startHPC` / `endHPC`: start/stop the worker threads
-- `initial`: initial data load
-- `batch_insert`: batch insertion (queries run concurrently)
-- `batch_insert_delete`: batch insertion with deletions
-- `search`: standalone search operation
-- `waitPending`: wait for pending operations to complete
-
## 6. 目录结构
236-
237-
```
238-
SAGE-DB-Bench/
239-
├── bench/ # 测试框架核心
240-
│ └── algorithms/ # 算法实现
241-
├── datasets/ # 数据集管理
242-
├── algorithms_impl/ # C++ 算法库(Faiss, DiskANN 等)
243-
├── runbooks/ # 实验配置
244-
├── raw_data/ # 数据集文件
245-
├── results/ # 测试结果
246-
├── deploy.sh # 一键部署脚本
247-
├── compute_gt.py # 计算 Ground Truth
248-
├── run_benchmark.py # 运行测试
249-
└── export_results.py # 导出结果
250-
```
21+
运行 pytest 或 pre-commit 请参考 docs/STRUCTURE.md 中的最新路径与命令。

bench/algorithms/.gitignore

Lines changed: 0 additions & 20 deletions
This file was deleted.

bench/algorithms/__init__.py

Lines changed: 52 additions & 48 deletions
@@ -1,51 +1,55 @@
-"""
-Algorithms Module for Benchmark ANNS
+"""Compatibility shims: ANN implementations moved to sage.libs.annimplementations.
 
-This module provides algorithm interfaces and implementations.
-All algorithms are automatically discovered from subdirectories.
+This package now proxies imports to `sage.libs.annimplementations`.
 """
+from __future__ import annotations
+
+import importlib
+import sys
+from typing import Iterable
+
+_MIGRATED = [
+    "base",
+    "registry",
+    "candy_lshapg",
+    "candy_mnru",
+    "candy_sptag",
+    "cufe",
+    "diskann",
+    "faiss_HNSW",
+    "faiss_HNSW_Optimized",
+    "faiss_IVFPQ",
+    "faiss_NSW",
+    "faiss_fast_scan",
+    "faiss_lsh",
+    "faiss_onlinepq",
+    "faiss_pq",
+    "gti",
+    "ipdiskann",
+    "plsh",
+    "puck",
+    "pyanns",
+    "vsag_hnsw",
+]
+
+_BASE = "sage.libs.annimplementations"
+
+
+def _load(name: str):
+    module = importlib.import_module(f"{_BASE}.{name}")
+    sys.modules[f"{__name__}.{name}"] = module
+    return module
+
+
+for _name in _MIGRATED:
+    _load(_name)
+
+
+def __getattr__(name: str):  # pragma: no cover - compatibility path
+    if name in _MIGRATED:
+        return sys.modules[f"{__name__}.{name}"]
+    raise AttributeError(name)
+
 
-from .base import BaseANN, BaseStreamingANN, DummyStreamingANN
-from .registry import (
-    ALGORITHMS,
-    register_algorithm,
-    get_algorithm,
-    discover_algorithms,
-    auto_register_algorithms
-)
-
-# Try to import the various algorithm wrappers (backward compatibility - deprecated)
-try:
-    from .candy_wrapper import CANDYWrapper, get_candy_algorithm
-    __all_wrappers = ['CANDYWrapper', 'get_candy_algorithm']
-except ImportError:
-    __all_wrappers = []
-
-try:
-    from .faiss_wrapper import FaissWrapper
-    __all_wrappers.extend(['FaissWrapper'])
-except ImportError:
-    pass
-
-try:
-    from .diskann_wrapper import DiskANNWrapper
-    __all_wrappers.extend(['DiskANNWrapper'])
-except ImportError:
-    pass
-
-try:
-    from .puck_wrapper import PuckWrapper
-    __all_wrappers.extend(['PuckWrapper'])
-except ImportError:
-    pass
-
-__all__ = [
-    'BaseANN',
-    'BaseStreamingANN',
-    'DummyStreamingANN',
-    'ALGORITHMS',
-    'register_algorithm',
-    'get_algorithm',
-    'discover_algorithms',
-    'auto_register_algorithms',
-] + __all_wrappers
+def __dir__() -> Iterable[str]:  # pragma: no cover - introspection
+    return sorted(list(globals().keys()) + _MIGRATED)
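
Note on the shim: `_load` works by registering the already-imported new module under the old dotted name in `sys.modules`, so existing `bench.algorithms.<name>` imports resolve without touching the file system. The sketch below reproduces that aliasing in isolation. It is an illustration, not the commit's code: `newhome`, `oldpkg`, `algo` and `ANSWER` are made-up names, and both packages are built in memory with `types.ModuleType` so the example runs without `sage.libs` installed.

```python
import importlib
import sys
import types

# Throwaway "new home" package, standing in for sage.libs.annimplementations.
new_pkg = types.ModuleType("newhome")
new_pkg.__path__ = []  # an empty __path__ marks the module as a package
new_mod = types.ModuleType("newhome.algo")
new_mod.ANSWER = 42
new_pkg.algo = new_mod
sys.modules["newhome"] = new_pkg
sys.modules["newhome.algo"] = new_mod

# The legacy package whose import paths must keep working.
old_pkg = types.ModuleType("oldpkg")
old_pkg.__path__ = []
sys.modules["oldpkg"] = old_pkg


def load(name):
    """Same idea as the commit's _load: alias the new module under the old name."""
    module = importlib.import_module(f"newhome.{name}")
    sys.modules[f"oldpkg.{name}"] = module
    return module


load("algo")

# Both import styles now resolve through the alias in sys.modules.
legacy = importlib.import_module("oldpkg.algo")
from oldpkg.algo import ANSWER
```

Because the alias points at the same module object, `legacy.__name__` still reports `newhome.algo`. Code that compares module names or pickles classes by their module path will therefore see the new location, not the legacy one.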
