Contents

  • How the HNSW algorithm works
  • The implementations in hnswlib and faiss, and the differences between them
  • The practical effect of each HNSW parameter

How the HNSW Algorithm Works

Overview

  1. HNSW is broadly similar to a skip list. The difference is that each HNSW layer maintains links between nodes, so every layer is a graph; the similarity is that each lower layer holds more nodes and a larger graph, and layer 0 holds link information for all nodes.
  2. Both insertion and search descend from the top, finding an entry point at each layer (the point in that layer closest to the inserted/queried point). On reaching a layer, a candidate set of a specified size is gathered by following the entry point's links, and the specified number of points closest to the target are selected to form connections.
  3. Insertion additionally has to maintain the graph's current entry point.

The Algorithms in Detail

The paper suggests a set of good default parameters: mL = 1/ln(M), Mmax = M, Mmax0 = 2*M. Substituting these into the algorithms removes some unknowns and makes them easier to read; the hnswlib implementation uses exactly this parameter set.

The SEARCH-LAYER Algorithm

SEARCH-LAYER(q, ep, ef, lc)

Input: query element q, enter points ep, number of nearest to q elements to return ef, layer number lc

Output: ef closest neighbors to q

v ← ep // set of visited elements
C ← ep // set of candidates
W ← ep // dynamic list of found nearest neighbors
while │C│ > 0
    c ← extract nearest element from C to q
    f ← get furthest element from W to q
    if distance(c, q) > distance(f, q)
        break // all elements in W are evaluated
    for each e ∈ neighbourhood(c) at layer lc // update C and W
        if e ∉ v
            v ← v ⋃ e
            f ← get furthest element from W to q
            if distance(e, q) < distance(f, q) or │W│ < ef
                C ← C ⋃ e
                W ← W ⋃ e
                if │W│ > ef
                    remove furthest element from W to q
return W

Notes

  • ep can be either a single point or a set of points
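The pseudocode above can be sketched in Python with two heaps: a min-heap of candidates (closest popped first) and a max-heap of found neighbors (furthest popped first). This is a minimal illustration, not a real library's implementation; the function and argument names are my own.

```python
import heapq

def search_layer(q, eps, ef, graph, dist):
    """Best-first search on one layer, following the SEARCH-LAYER
    pseudocode. `graph` maps a node to its neighbors on this layer;
    `dist` is the distance function; `eps` is a list of entry points."""
    visited = set(eps)
    # C: min-heap keyed by distance to q, so the closest pops first.
    candidates = [(dist(q, e), e) for e in eps]
    heapq.heapify(candidates)
    # W: max-heap via negated distances, so the furthest pops first.
    found = [(-d, e) for d, e in candidates]
    heapq.heapify(found)
    while candidates:
        d_c, c = heapq.heappop(candidates)
        if d_c > -found[0][0]:
            break  # all elements in W are closer than any candidate
        for e in graph.get(c, ()):
            if e not in visited:
                visited.add(e)
                d_e = dist(q, e)
                if d_e < -found[0][0] or len(found) < ef:
                    heapq.heappush(candidates, (d_e, e))
                    heapq.heappush(found, (-d_e, e))
                    if len(found) > ef:
                        heapq.heappop(found)  # drop the furthest
    return sorted((-d, e) for d, e in found)  # (distance, node) pairs
```

On a toy 1-D graph 0–2–5–9 with query 4 and ef=2, the search walks from node 0 to the two closest nodes, 5 and 2.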

The Insert Algorithm

INSERT(hnsw, q, M, Mmax, efConstruction, mL)

Input: multilayer graph hnsw, new element q, number of established connections M, maximum number of connections for each element per layer Mmax, size of the dynamic candidate list efConstruction, normalization factor for level generation mL

Output: update hnsw inserting element q

W ← ∅ // list for the currently found nearest elements
ep ← get enter point for hnsw
L ← level of ep // top layer for hnsw
l ← ⌊-ln(unif(0..1))∙mL⌋ // new element's level
for lc ← L … l+1
    W ← SEARCH-LAYER(q, ep, ef=1, lc)
    ep ← get the nearest element from W to q
for lc ← min(L, l) … 0
    W ← SEARCH-LAYER(q, ep, efConstruction, lc)
    neighbors ← SELECT-NEIGHBORS(q, W, M, lc) // alg. 3 or alg. 4
    add bidirectional connections from neighbors to q at layer lc
    for each e ∈ neighbors // shrink connections if needed
        eConn ← neighbourhood(e) at layer lc
        if │eConn│ > Mmax // shrink connections of e (if lc = 0 then Mmax = Mmax0)
            eNewConn ← SELECT-NEIGHBORS(e, eConn, Mmax, lc) // alg. 3 or alg. 4
            set neighbourhood(e) at layer lc to eNewConn
    ep ← W
if l > L
    set enter point for hnsw to q

Updating the entry point

  • A newly inserted node becomes the new global entry point when its sampled level l exceeds the current top layer L
  • Layer 0 maintains up to 2*M links per node (Mmax0 = 2*M)
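The level line l = ⌊-ln(unif(0..1))∙mL⌋ samples from a geometric-like distribution: with the paper's suggested mL = 1/ln(M), a node reaches level ≥ 1 with probability exactly 1/M, so roughly (M-1)/M of all nodes live only on layer 0. A small sketch (function name is my own):

```python
import math
import random

def sample_level(M, rng):
    """Sample a new element's top layer, as in the INSERT line
    l = floor(-ln(unif(0..1)) * mL), with the paper's suggested
    mL = 1 / ln(M)."""
    mL = 1.0 / math.log(M)
    u = 1.0 - rng.random()  # uniform in (0, 1]; avoids log(0)
    return int(-math.log(u) * mL)
```

For M=16 about 15/16 ≈ 93.75% of sampled levels are 0, which is why the upper layers stay small.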

The Search Algorithm

K-NN-SEARCH(hnsw, q, K, ef)

Input: multilayer graph hnsw, query element q, number of nearest neighbors to return K, size of the dynamic candidate list ef

Output: K nearest elements to q

W ← ∅ // set for the current nearest elements
ep ← get enter point for hnsw
L ← level of ep // top layer for hnsw
for lc ← L … 1
    W ← SEARCH-LAYER(q, ep, ef=1, lc)
    ep ← get nearest element from W to q
W ← SEARCH-LAYER(q, ep, ef, lc=0)
return K nearest elements from W to q

Notes

  • From layer L down to layer 1, the search keeps only the single entry point closest to the query (ef=1). This assumes the dataset has some structure, so that the greedy descent lands close enough for the final layer-0 search to be of good quality; on a purely random dataset the results may be worse.

  • efSearch must be at least K (faiss clamps efSearch to a minimum of k).
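The whole top-down K-NN-SEARCH can be sketched on a toy layered graph. This is an illustration under assumed names (`layers[lc]` is the adjacency map of layer lc); the inner helper is the same best-first SEARCH-LAYER loop, and `max(ef, K)` mirrors the clamp faiss applies so ef is never below k.

```python
import heapq

def knn_search(q, layers, entry, top_level, K, ef, dist):
    """Top-down K-NN-SEARCH sketch: greedy descent with ef=1 on the
    upper layers, then a full search with ef candidates on layer 0."""
    def search_layer(eps, ef, graph):
        # Best-first search, as in SEARCH-LAYER (Algorithm 2).
        visited = set(eps)
        cand = [(dist(q, e), e) for e in eps]
        heapq.heapify(cand)
        found = [(-d, e) for d, e in cand]
        heapq.heapify(found)
        while cand:
            d_c, c = heapq.heappop(cand)
            if d_c > -found[0][0]:
                break
            for e in graph.get(c, ()):
                if e not in visited:
                    visited.add(e)
                    d_e = dist(q, e)
                    if d_e < -found[0][0] or len(found) < ef:
                        heapq.heappush(cand, (d_e, e))
                        heapq.heappush(found, (-d_e, e))
                        if len(found) > ef:
                            heapq.heappop(found)
        return sorted((-d, e) for d, e in found)

    ep = entry
    for lc in range(top_level, 0, -1):          # layers L .. 1, ef=1
        ep = search_layer([ep], 1, layers[lc])[0][1]
    w = search_layer([ep], max(ef, K), layers[0])  # full search at layer 0
    return [e for _, e in w[:K]]
```

On a two-layer 1-D example (sparse layer 1 with nodes 0 and 8, dense chain 0–2–4–6–8 on layer 0), querying 5 first jumps to 8 on layer 1, then finds 4 and 6 on layer 0.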

The SELECT-NEIGHBORS Algorithms

The simple variant

SELECT-NEIGHBORS-SIMPLE(q, C, M)

Input: base element q, candidate elements C, number of neighbors to return M

Output: M nearest elements to q

return M nearest elements from C to q
// in practice this is just a priority queue
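Since the simple variant just takes the M closest candidates, it is a one-liner over a priority queue; a sketch with illustrative names:

```python
import heapq

def select_neighbors_simple(q, candidates, M, dist):
    """SELECT-NEIGHBORS-SIMPLE: the M candidates closest to q.
    heapq.nsmallest plays the role of the priority queue."""
    return heapq.nsmallest(M, candidates, key=lambda c: dist(q, c))
```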

The heuristic variant

SELECT-NEIGHBORS-HEURISTIC(q, C, M, lc, extendCandidates, keepPrunedConnections)

Input: base element q, candidate elements C, number of neighbors to return M, layer number lc, flag indicating whether or not to extend candidate list extendCandidates, flag indicating whether or not to add discarded elements keepPrunedConnections

Output: M elements selected by the heuristic

R ← ∅
W ← C // working queue for the candidates
if extendCandidates // extend candidates by their neighbors
    for each e ∈ C
        for each eadj ∈ neighbourhood(e) at layer lc
            if eadj ∉ W
                W ← W ⋃ eadj
Wd ← ∅ // queue for the discarded candidates
while │W│ > 0 and │R│ < M
    e ← extract nearest element from W to q
    if e is closer to q compared to any element from R
        R ← R ⋃ e
    else
        Wd ← Wd ⋃ e
if keepPrunedConnections // add some of the discarded connections from Wd
    while │Wd│ > 0 and │R│ < M
        R ← R ⋃ extract nearest element from Wd to q
return R

Notes

  • The simple variant selects only from the candidates themselves; the heuristic (with extendCandidates) also draws from the candidates' neighbors, giving a larger selection pool that may be less sensitive to the dataset
  • hnswlib and faiss_hnsw both effectively use the first method
  • The heuristic is best suited to mid-dimensional data and highly clustered data
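The key line of the heuristic is the pruning test "e is closer to q compared to any element from R": a candidate is kept only if no already-selected neighbor lies between it and q, which spreads the links across different directions. A sketch of the selection loop (omitting the extendCandidates step; names are illustrative):

```python
import heapq

def select_neighbors_heuristic(q, candidates, M, dist, keep_pruned=True):
    """Sketch of SELECT-NEIGHBORS-HEURISTIC's main loop: keep a
    candidate only if it is closer to q than to every element already
    in R; optionally refill R from the nearest discarded candidates."""
    work = [(dist(q, c), c) for c in candidates]
    heapq.heapify(work)
    result, discarded = [], []
    while work and len(result) < M:
        d_cq, c = heapq.heappop(work)  # nearest remaining candidate
        if all(d_cq < dist(c, r) for r in result):
            result.append(c)
        else:
            discarded.append((d_cq, c))
    if keep_pruned:  # keepPrunedConnections: top up from the discards
        for d_cq, c in sorted(discarded):
            if len(result) >= M:
                break
            result.append(c)
    return result
```

In 1-D with q=0 and candidates {1, 1.2, 5}, the point 1.2 is pruned because it is closer to the already-selected 1 than to q; with keepPrunedConnections it is added back to fill R up to M.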

The Practical Effect of Each HNSW Parameter

  • dim: data dimensionality; higher dim means more work per distance computation
  • max_elements: total number of vectors
  • M: maximum number of links per vector; larger M uses more memory and makes construction and search slower (but more accurate)
  • efConstruction: candidate-list size used when inserting each vector (the search-time counterpart is efSearch); larger efConstruction slows construction but does not affect memory consumption

How to choose M

  • The paper states that a reasonable range of M is from 5 to 48

efConstruction

  • efConstruction has a significant effect on the speed/index-quality trade-off
  • The paper reports that on a 10M SIFT dataset, efConstruction=100 already reaches a good recall (0.95), and construction speed is reasonable when building with multiple threads
  • Raising efConstruction further brings diminishing returns while slowing construction considerably

efSearch

  • The ef value used at search time; both hnswlib and faiss_hnsw let it be set independently
  • A suitable efSearch is what guarantees recall
  • Higher is not always better: the marginal gains shrink while search speed suffers

Summary

  • See the paper for detailed performance numbers
  • The key metrics when benchmarking are distance computations and query time
  • Memory usage can be reduced by combining HNSW with PQ (product quantization)