Knowledge Base Chunk Management APIs

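The previous article covered document lifecycle management: deleting, updating, enabling, and disabling, all of which operate on a document as a whole. But a document is too coarse a unit. After automatic chunking, what actually takes part in retrieval is the Chunk. Users sometimes need to adjust one specific passage, and document-level operations cannot reach that level.

Automatic chunking processes an entire document with one fixed set of parameters. The result is acceptable for most content, but some passages always come out wrong. Two problems are common: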

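First, truncation: the cut falls in the middle of a complete description, so the first half lands in one Chunk and the second half in the next, each semantically incomplete.
Second, dilution: a key but short clause gets merged into the long passages around it, its meaning is drowned out at retrieval time, and it becomes hard to hit precisely.

Neither problem can be solved by tuning parameters. The parameters are global: settings that suit long passages do not suit short sentences, and settings that suit the bulk of the document do not suit its edge cases.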

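The most direct fix is to adjust the problematic Chunks by hand: merge two truncated pieces and edit them into one complete Chunk, split out a passage that has noise mixed in so it can be handled separately, or simply disable a Chunk whose quality is clearly poor. This costs far less than retuning the chunking strategy and rerunning the whole document. More importantly, edge cases remain even after retuning and rerunning, so manual intervention is a permanent fallback capability, not a temporary workaround.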

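That is why the Chunk management APIs exist: not so that users can hand-write a knowledge base from scratch, but to give them a precise point of intervention into the results of automatic chunking.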

KnowledgeChunkController exposes 6 endpoints, covering full CRUD for Chunks plus enable/disable control.

Overview of the Six Endpoints

First, an overview of the 6 endpoints along their key dimensions:

Endpoint	HTTP method	Path	Vector operation	Transaction style
Paged query	GET	`/knowledge-base/docs/{doc-id}/chunks`	None	No transaction
Create Chunk	POST	`/knowledge-base/docs/{doc-id}/chunks`	Write to vector store	`@Transactional`
Update content	PUT	`/knowledge-base/docs/{doc-id}/chunks/{chunk-id}`	Delete old, insert new	`@Transactional`
Delete Chunk	DELETE	`/knowledge-base/docs/{doc-id}/chunks/{chunk-id}`	Delete a single entry	`@Transactional`
Enable/disable one	PATCH	`/knowledge-base/docs/{doc-id}/chunks/{chunk-id}/enable`	Write or delete as needed	`@Transactional`
Batch enable/disable	PATCH	`/knowledge-base/docs/{doc-id}/chunks/batch-enable`	Targeted write or targeted delete	Programmatic transaction

pageQuery Endpoint in Detail

1. Endpoint and Implementation

@GetMapping("/knowledge-base/docs/{doc-id}/chunks")
public Result<IPage<KnowledgeChunkVO>> pageQuery(@PathVariable("doc-id") String docId,
                                                 @Validated KnowledgeChunkPageRequest requestParam) {
    return Results.success(knowledgeChunkService.pageQuery(docId, requestParam));
}

KnowledgeChunkPageRequest extends MyBatis Plus's Page (which carries current and size) and adds an enabled field for filtering by status.

Service-layer implementation:

@Override
public IPage<KnowledgeChunkVO> pageQuery(String docId, KnowledgeChunkPageRequest requestParam) {
    KnowledgeDocumentDO documentDO = documentMapper.selectById(docId);
    Assert.notNull(documentDO, () -> new ClientException("Document does not exist"));

    LambdaQueryWrapper<KnowledgeChunkDO> queryWrapper = new LambdaQueryWrapper<KnowledgeChunkDO>()
            .eq(KnowledgeChunkDO::getDocId, docId)
            .eq(requestParam.getEnabled() != null, KnowledgeChunkDO::getEnabled, requestParam.getEnabled())
            .orderByAsc(KnowledgeChunkDO::getChunkIndex);

    Page<KnowledgeChunkDO> page = new Page<>(requestParam.getCurrent(), requestParam.getSize());
    IPage<KnowledgeChunkDO> result = chunkMapper.selectPage(page, queryWrapper);
    return result.convert(each -> BeanUtil.toBean(each, KnowledgeChunkVO.class));
}

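2. Design Notes

2.1 Sorting by chunkIndex

chunkIndex records a Chunk's position in the source document (starting from 0). Chunks are generated in document order during chunking, and the management UI needs to list them in that same order so users can tell at a glance which part of the document a passage comes from and what surrounds it.

If the list were sorted by id or createTime, a manually added Chunk would land at the end even when it logically belongs in the middle, which makes the list confusing to read.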
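The difference between the two orderings can be sketched in plain Java. The Chunk record below is a hypothetical, simplified stand-in for KnowledgeChunkDO, and the sample data is invented for illustration; the sketch assumes Java 16 or later for records.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ChunkOrderingSketch {

    // Hypothetical stand-in for KnowledgeChunkDO; only the fields needed here.
    record Chunk(long id, int chunkIndex, String content) {}

    // Mirrors orderByAsc(KnowledgeChunkDO::getChunkIndex): document order.
    static List<Chunk> byChunkIndex(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingInt(Chunk::chunkIndex));
        return sorted;
    }

    // The alternative discussed above: creation order via ascending id.
    static List<Chunk> byId(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingLong(Chunk::id));
        return sorted;
    }

    // Invented data: the chunk with id 4 was added by hand with index 2.
    static List<Chunk> sample() {
        return List.of(
                new Chunk(1, 0, "intro"),
                new Chunk(2, 1, "section A"),
                new Chunk(3, 3, "section C"),
                new Chunk(4, 2, "section B, added by hand later"));
    }

    public static void main(String[] args) {
        byChunkIndex(sample())
                .forEach(c -> System.out.println(c.chunkIndex() + " " + c.content()));
    }
}
```

Sorted by chunkIndex, the hand-added chunk sits between sections A and C; sorted by id, it falls to the end of the list.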

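2.2 Practical Uses of the enabled Filter

When enabled is null, all Chunks are returned; 1 returns only enabled Chunks, and 0 returns only disabled ones.

In practice the filter serves a few scenarios: seeing which Chunks of a document currently take part in retrieval (pass 1); reviewing disabled content to decide whether it should be restored (pass 0); and omitting the parameter to see the complete chunk structure.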
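As a rough in-memory analogue of the conditional .eq(requestParam.getEnabled() != null, ...) clause, the three cases can be written as one nullable predicate. This is only a sketch of the semantics, not the query itself; the Chunk record is a hypothetical stand-in (Java 16+).

```java
import java.util.List;
import java.util.stream.Collectors;

final class EnabledFilterSketch {

    // Hypothetical stand-in for a chunk row: position plus enabled flag (1 or 0).
    record Chunk(int chunkIndex, int enabled) {}

    // null -> no status condition; 1 -> enabled only; 0 -> disabled only.
    static List<Chunk> filter(List<Chunk> chunks, Integer enabled) {
        return chunks.stream()
                .filter(c -> enabled == null || enabled == c.enabled())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Chunk> chunks = List.of(new Chunk(0, 1), new Chunk(1, 0), new Chunk(2, 1));
        System.out.println("all: " + filter(chunks, null).size());
        System.out.println("enabled: " + filter(chunks, 1).size());
        System.out.println("disabled: " + filter(chunks, 0).size());
    }
}
```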

create Endpoint in Detail

1. Endpoint Definition

@PostMapping("/knowledge-base/docs/{doc-id}/chunks")
public Result<KnowledgeChunkVO> create(@PathVariable("doc-id") String docId,
                                       @RequestBody KnowledgeChunkCreateRequest request) {
    return Results.success(knowledgeChunkService.create(docId, request));
}

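KnowledgeChunkCreateRequest has three fields: content (required); index (optional, sets the Chunk's sequence number, and when omitted the Chunk is appended to the end); and chunkId (never sent by the frontend, but used by the document operation endpoints).

2. Core Implementation Flow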

@Override
@Transactional(rollbackFor = Exception.class)
public KnowledgeChunkVO create(String docId, KnowledgeChunkCreateRequest requestParam) {
    KnowledgeDocumentDO documentDO = documentMapper.selectById(docId);
    Assert.notNull(documentDO, () -> new ClientException("Document does not exist"));
    if (DocumentStatus.RUNNING.getCode().equals(documentDO.getStatus())) {
        throw new ClientException("Document is being chunked; adding a Chunk is not supported right now");
    }
    if (!Integer.valueOf(1).equals(documentDO.getEnabled())) {
        throw new ClientException("Document is not enabled; adding a Chunk is not supported right now");
    }

    String content = requestParam.getContent();
    Assert.notBlank(content, () -> new ClientException("Chunk content must not be empty"));

    // Look up the current largest chunkIndex, used for auto-append
    KnowledgeChunkDO latest = chunkMapper.selectOne(
            Wrappers.lambdaQuery(KnowledgeChunkDO.class)
                    .eq(KnowledgeChunkDO::getDocId, docId)
                    .orderByDesc(KnowledgeChunkDO::getChunkIndex)
                    .last("LIMIT 1")
    );
    // Prefer the index given in the request; if absent, append to the end
    int chunkIndex = requestParam.getIndex() != null
            ? requestParam.getIndex()
            : (latest != null ? latest.getChunkIndex() + 1 : 0);

    String contentHash = SecureUtil.sha256(content);
    int charCount = content.length();
    KnowledgeBaseDO kbDO = knowledgeBaseMapper.selectById(documentDO.getKbId());
    String embeddingModel = kbDO.getEmbeddingModel();
    String collectionName = kbDO.getCollectionName();
    Integer tokenCount = resolveTokenCount(content);

    KnowledgeChunkDO chunkDO = KnowledgeChunkDO.builder()
            .id(requestParam.getChunkId())
            .kbId(documentDO.getKbId())
            .docId(docId)
            .chunkIndex(chunkIndex)
            .content(content)
            .contentHash(contentHash)
            .charCount(charCount)
            .tokenCount(tokenCount)
            .enabled(1)
            .createdBy(UserContext.getUsername())
            .updatedBy(UserContext.getUsername())
            .build();

    chunkMapper.insert(chunkDO);

    // Increment chunk_count
    documentMapper.update(Wrappers.lambdaUpdate(KnowledgeDocumentDO.class)
            .eq(KnowledgeDocumentDO::getId, docId)
            .setSql("chunk_count = chunk_count + 1"));

    // Synchronously write to the vector store
    syncChunkToVector(collectionName, docId, chunkDO, embeddingModel);

    return BeanUtil.toBean(chunkDO, KnowledgeChunkVO.class);
}

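2.1 Why the Document Must Be Enabled Before Adding a Chunk

There are three preconditions: the document exists → status != RUNNING → enabled = 1.

The first two are easy to understand; the third deserves a note. Disabling a document means it is temporarily offline and does not take part in retrieval. If adding a Chunk to a disabled document were allowed, the new Chunk would be written to the vector store immediately (create calls syncChunkToVector at the end) and would take part in retrieval even though its parent document is marked as disabled. The database would say the content is excluded from retrieval while the vector store says it can be retrieved, which is a contradiction.

Put more plainly: a document is usually disabled because something is wrong with its content and it needs to be taken offline as a whole, so adding content to it at that point serves no purpose.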

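2.2 chunkIndex: Manually Specified or Auto-Appended

index is an optional parameter. When creating a chunk, the frontend can specify a sequence number, or leave it empty so the system appends the Chunk to the end.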

// Prefer the index given in the request; if absent, append to the end
int chunkIndex = requestParam.getIndex() != null
        ? requestParam.getIndex()
        : (latest != null ? latest.getChunkIndex() + 1 : 0);

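For auto-append, the code looks up the largest chunkIndex under the current docId and adds 1 to get the new Chunk's sequence number. It uses max + 1 rather than COUNT(*) because Chunks can be deleted: the COUNT result then drops below the largest chunkIndex, and the index assigned to the appended Chunk collides with one already used.

For example, suppose a document originally had 10 Chunks (0 to 9) and Chunk 5 and Chunk 8 were deleted. COUNT returns 8, so the new Chunk would get chunkIndex 8. That reuses the number of the old Chunk 8 (which has been logically deleted, but in business terms the number is still taken, so reusing it invites confusion), and it also sorts the new Chunk before the surviving Chunk 9 instead of at the end. With max + 1 the new Chunk gets 10.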
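The arithmetic in this example can be checked with a few lines of plain Java. The two helper methods are hypothetical names introduced for the comparison; in the service itself the maximum comes from the selectOne query shown earlier.

```java
import java.util.List;

final class NextChunkIndexSketch {

    // COUNT(*)-style: next index = number of chunks that still exist.
    static int byCount(List<Integer> liveIndexes) {
        return liveIndexes.size();
    }

    // max + 1 style: next index = largest existing index plus one, or 0 when empty.
    static int byMaxPlusOne(List<Integer> liveIndexes) {
        return liveIndexes.stream().mapToInt(Integer::intValue).max().orElse(-1) + 1;
    }

    public static void main(String[] args) {
        // Indexes 0..9 with chunks 5 and 8 deleted.
        List<Integer> live = List.of(0, 1, 2, 3, 4, 6, 7, 9);
        System.out.println("count-based next index: " + byCount(live));
        System.out.println("max+1 next index: " + byMaxPlusOne(live));
    }
}
```

With the eight surviving indexes, the count-based approach yields 8 while max + 1 yields 10; for a document with no Chunks both yield 0.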

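Manual specification covers the case where the user finds that automatic chunking missed a passage and wants to insert a new Chunk between Chunk 3 and Chunk 4: they can set index to 3 (or any other value). Note that the sequence numbers of other Chunks are not adjusted automatically, because chunkIndex is only a display field used for sorting, carries no unique constraint, and may repeat. The paged query sorts by chunkIndex ascending, so Chunks that share a sequence number appear next to each other; their order relative to one another is not determined by this sort alone.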

2.3 What contentHash Is For

The hash is computed with a Hutool utility method in a single line:

String contentHash = SecureUtil.sha256(content);

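Why store a contentHash? Its main use is deciding whether an incremental update is needed in the scheduled-fetch scenario.

The project supports scheduled fetching for URL-type documents (ScheduleRefreshProcessor). Each time the scheduled task fires, it must determine whether the remote document has changed. After fetching the new content, it computes the SHA-256 and compares it with the contentHash stored last time. If they match, re-chunking and vectorization are skipped, which avoids running the full chunking + vectorization pipeline on every trigger.

At the Chunk level, contentHash also leaves room for future incremental chunking: when chunking is rerun, the hashes of new and old Chunks can be compared, Chunks whose content is unchanged can skip vectorization, and only the changed parts are processed. In the current code, the idempotency check in the update endpoint compares strings directly with newContent.equals(chunkDO.getContent()) and does not use the hash yet. Storing the hash still costs little: comparing two already-computed hashes means comparing 64 fixed-length characters rather than content that may run to thousands of characters. Hashing new content is itself linear in its length, so the saving applies when the hashes are already at hand.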
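The change check can be sketched with the JDK alone. The code below uses java.security.MessageDigest rather than Hutool, and UTF-8 encoding is an assumption made here; needsReprocessing is a hypothetical helper name, not a method from the project.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class ContentHashSketch {

    // SHA-256 of the content as a 64-character lowercase hex string.
    static String sha256Hex(String content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // Hypothetical helper: true when fetched content differs from what was stored.
    static boolean needsReprocessing(String storedHash, String fetchedContent) {
        return !sha256Hex(fetchedContent).equals(storedHash);
    }

    public static void main(String[] args) {
        String stored = sha256Hex("refund policy v1");
        System.out.println(needsReprocessing(stored, "refund policy v1"));
        System.out.println(needsReprocessing(stored, "refund policy v2"));
    }
}
```

When the stored hash matches, the expensive chunking and embedding steps can be skipped without keeping or comparing the old content itself.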

2.4 tokenCount: An Estimate, for Display Only

Integer tokenCount = resolveTokenCount(content);

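tokenCount here is an estimate, not an exact Token count. The actual implementation is HeuristicTokenCounterService, which applies heuristic rules: 1 Chinese character ≈ 1 Token, 4 English characters ≈ 1 Token, and 2 characters of any other kind ≈ 1 Token. No Tokenizer is called for exact tokenization, because different models use different Tokenizers and an exact count would mean little.

The field is currently used only for display in the admin console: it lets knowledge base administrators see the approximate Token size of each chunk in the Chunk list and judge which Chunks are too long or too short. It plays no part in any computation in the RAG retrieval pipeline.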
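A heuristic of this shape might look like the sketch below. It is not the source of HeuristicTokenCounterService: how characters are classified (Han script for Chinese, ASCII letters for English, everything else as "other") and the choice to round each bucket up are assumptions made for illustration.

```java
final class HeuristicTokenSketch {

    // Estimate: 1 Han character ~ 1 token, 4 ASCII letters ~ 1 token,
    // 2 characters of any other kind ~ 1 token. Each bucket is rounded up.
    static int estimateTokens(String text) {
        int han = 0;
        int latin = 0;
        int other = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
                han++;
            } else if ((cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z')) {
                latin++;
            } else {
                other++;
            }
        }
        return han + ceilDiv(latin, 4) + ceilDiv(other, 2);
    }

    // Integer division rounded up, for non-negative n.
    private static int ceilDiv(int n, int divisor) {
        return (n + divisor - 1) / divisor;
    }

    public static void main(String[] args) {
        System.out.println(estimateTokens("你好abcd12"));
    }
}
```

An estimate like this is cheap and model-independent, which fits a display-only field; it should not be used where a real token limit has to be enforced.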

2.5 Order of the Database Write and Vector Sync

After the database write succeeds, syncChunkToVector is called:

private void syncChunkToVector(String collectionName, String docId, KnowledgeChunkDO chunkDO, String embeddingModel) {
    List<Float> embedding = embedContent(chunkDO.getContent(), embeddingModel);
    float[] vector = toArray(embedding);

    VectorChunk chunk = VectorChunk.builder()
            .index(chunkDO.getChunkIndex())
            .content(chunkDO.getContent())
            .chunkId(String.valueOf(chunkDO.getId()))
            .embedding(vector)
            .build();
    vectorStoreService.indexDocumentChunks(collectionName, docId, List.of(chunk));
}

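The whole create method is annotated with @Transactional, so syncChunkToVector (an Embedding API call plus a vector store write) runs inside the transaction. This differs from batchEnable: vectorizing a single Chunk is quick (one API call taking a few hundred milliseconds), so keeping it inside the transaction is acceptable. A batch operation embeds many Chunks and the total time can grow long, and only then does the embedding need to move outside the transaction.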
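One consequence of running the vector sync as the last step inside the transaction can be shown with a toy model. The class below is plain Java with in-memory lists standing in for the chunk table and the vector store; it involves no Spring and no real transaction, and the manual undo only imitates what a rollback would do if the exception propagated out of the @Transactional method.

```java
import java.util.ArrayList;
import java.util.List;

final class WriteThenSyncSketch {

    final List<String> rows = new ArrayList<>();     // stands in for the chunk table
    final List<String> vectors = new ArrayList<>();  // stands in for the vector store

    // Toy version of "insert the row, then sync the vector as the last step".
    // Returns true on success, false when the simulated embedding call fails.
    boolean create(String content, boolean embeddingFails) {
        rows.add(content); // database write happens first
        try {
            if (embeddingFails) {
                throw new IllegalStateException("embedding API unavailable");
            }
            vectors.add(content); // vector write is the final step
            return true;
        } catch (RuntimeException e) {
            rows.remove(rows.size() - 1); // imitates the transaction rollback
            return false;
        }
    }

    public static void main(String[] args) {
        WriteThenSyncSketch store = new WriteThenSyncSketch();
        store.create("chunk that embeds fine", false);
        store.create("chunk whose embedding fails", true);
        System.out.println(store.rows.size() + " rows, " + store.vectors.size() + " vectors");
    }
}
```

Because the vector write comes last, a failure there leaves neither a row nor a vector behind. The cost is that the database transaction stays open for the duration of the embedding call, which is the trade-off the paragraph above describes.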
