|
1 | | -# Create Knowledge Base |
| 1 | +# Create a Knowledge Base |
2 | 2 |
|
3 | | -This section describes how to create and configure a knowledge base. |
| 3 | +This page describes how to create and configure a knowledge base and the main options available. |
| 4 | + |
| 5 | +## Define the Knowledge Base and Retrieval Method |
| 6 | + |
| 7 | +Go to **My Knowledge Base** → **Create New Knowledge Base**. |
| 8 | + |
| 9 | +In this step you define: |
| 10 | + |
| 11 | +- **Name** and **Description**: To identify and describe the knowledge base. |
| 12 | +- **Retrieval settings**: How the system finds and extracts relevant content from your documents when it receives a user query, so the LLM can use it. |
| 13 | + |
| 14 | +Three retrieval strategies are supported. See the table below and the sections that follow. |
| 15 | + |
| 16 | +| Strategy | Summary | |
| 17 | +|----------|---------| |
| 18 | +| **Vector retrieval** | Turns the question into a vector and compares it to document vectors to return the most similar chunks. | |
| 19 | +| **Full-text retrieval** | Builds a full-text index over documents and returns chunks that match the user’s keywords. | |
| 20 | +| **Hybrid retrieval** | Runs both full-text and vector retrieval, then merges and reranks the results. | |
| 21 | + |
| 22 | +### Vector retrieval |
| 23 | + |
| 24 | +**What it does**: Converts the user’s question into a query vector, compares it to document vectors by similarity, and returns the closest chunks. |
| 25 | + |
| 26 | +**Options**: |
| 27 | + |
| 28 | +| Option | Description | |
| 29 | +|--------|-------------| |
| 30 | +| **Rerank model** | Off by default. When on, reranks vector-retrieval results to improve accuracy of the chunks sent to the LLM. | |
| 31 | +| **TopK** | Number of most similar chunks to return. The system may adjust this based on the model’s context window. Default is 3; higher values return more chunks. | |
| 32 | +| **Score threshold** | Minimum similarity score; only chunks above this value are returned. Off by default; higher values return fewer chunks. | |
| 33 | + |
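The interaction between **TopK** and **Score threshold** described above can be sketched as follows. This is a minimal illustration and not the product's implementation: it assumes cosine similarity as the similarity measure, and the names `cosine` and `vector_retrieve` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def vector_retrieve(query_vec, chunks, top_k=3, score_threshold=None):
    """Return up to top_k (score, text) pairs, most similar first.

    chunks is a list of (text, vector) pairs. When score_threshold is set,
    only chunks scoring above it are kept (assumed behaviour).
    """
    scored = [(cosine(query_vec, vec), text) for text, vec in chunks]
    if score_threshold is not None:
        scored = [(s, t) for s, t in scored if s > score_threshold]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

chunks = [("a", [1, 0]), ("b", [0, 1]), ("c", [1, 1])]
top_two = vector_retrieve([1, 0], chunks, top_k=2)
strict = vector_retrieve([1, 0], chunks, top_k=3, score_threshold=0.8)
```

Raising `top_k` widens the result set, while raising `score_threshold` narrows it, which matches the option descriptions in the table.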
| 34 | +### Full-text retrieval |
| 35 | + |
| 36 | +**What it does**: Builds a full-text index so users can query by any word and get back chunks that contain those words. |
| 37 | + |
| 38 | +**Options**: |
| 39 | + |
| 40 | +| Option | Description | |
| 41 | +|--------|-------------| |
| 42 | +| **Rerank model** | Off by default. When on, reranks full-text results to improve chunk quality. | |
| 43 | +| **TopK** | Number of chunks to return. The system may adjust this based on the model’s context window. Default is 3. | |
| 44 | +| **Score threshold** | Only chunks with scores above this value are returned. Off by default; higher values return fewer chunks. | |
| 45 | + |
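As a rough illustration of keyword matching, the sketch below builds a tiny inverted index and ranks chunks by how many distinct query words they contain. This is a deliberate simplification under stated assumptions: production full-text engines typically use more refined scoring, and the helper names `tokenize`, `build_index`, and `full_text_retrieve` are hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def build_index(chunks):
    """Map each word to the set of chunk positions that contain it."""
    index = defaultdict(set)
    for chunk_id, text in enumerate(chunks):
        for word in tokenize(text):
            index[word].add(chunk_id)
    return index

def full_text_retrieve(query, chunks, index, top_k=3):
    """Return up to top_k chunks, ranked by number of distinct query words matched."""
    counts = defaultdict(int)
    for word in set(tokenize(query)):
        for chunk_id in index.get(word, ()):
            counts[chunk_id] += 1
    # Sort by match count (descending), then by position for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [chunks[chunk_id] for chunk_id, _ in ranked[:top_k]]

docs = ["reset your password", "billing and invoices", "password policy and billing"]
idx = build_index(docs)
best = full_text_retrieve("password billing", docs, idx, top_k=1)
```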
| 46 | +### Hybrid retrieval |
| 47 | + |
| 48 | +**What it does**: Runs both full-text and vector retrieval, then merges and reranks the results into a single set of chunks. |
| 49 | + |
| 50 | +**Options**: |
| 51 | + |
| 52 | +| Option | Description | |
| 53 | +|--------|-------------| |
| 54 | +| **Weight** | Balance between semantic (vector) and keyword (full-text) retrieval. A semantic weight of 1 means vector-only retrieval, which helps with paraphrasing and cross-language matching. A keyword weight of 1 means full-text-only retrieval, which suits exact terms and needs less compute. You can also set a custom mix for your use case. | |
| 55 | +| **Rerank model** | Off by default. When on, reranks hybrid results. | |
| 56 | +| **TopK** | Number of chunks to return. Default is 3. | |
| 57 | +| **Score threshold** | Only chunks above this score are returned. Off by default. | |
| 58 | + |
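One way to picture the **Weight** option is as a weighted sum of the two scores for each chunk. The sketch below assumes that both scores are already normalized to the range 0 to 1 and that the keyword weight is `1 - semantic_weight`; the actual merge formula is not stated on this page, and `hybrid_merge` is a hypothetical name.

```python
def hybrid_merge(vector_scores, keyword_scores, semantic_weight=0.7,
                 top_k=3, score_threshold=None):
    """Combine per-chunk scores from both retrievers into one ranked list.

    vector_scores and keyword_scores map chunk id -> score in [0, 1].
    A chunk missing from one retriever contributes 0 for that side.
    """
    if not 0.0 <= semantic_weight <= 1.0:
        raise ValueError("semantic_weight must be between 0 and 1")
    keyword_weight = 1.0 - semantic_weight
    chunk_ids = set(vector_scores) | set(keyword_scores)
    combined = {
        cid: semantic_weight * vector_scores.get(cid, 0.0)
        + keyword_weight * keyword_scores.get(cid, 0.0)
        for cid in chunk_ids
    }
    if score_threshold is not None:
        combined = {c: s for c, s in combined.items() if s > score_threshold}
    # Highest combined score first; ties broken by chunk id for determinism.
    ranked = sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]

vec = {"a": 0.9, "b": 0.2}
kw = {"b": 1.0, "c": 0.5}
semantic_only = [cid for cid, _ in hybrid_merge(vec, kw, semantic_weight=1.0)]
keyword_only = [cid for cid, _ in hybrid_merge(vec, kw, semantic_weight=0.0)]
```

With `semantic_weight=1.0` the ranking follows the vector scores alone, and with `semantic_weight=0.0` it follows the keyword scores alone, mirroring the two extremes described in the table.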
| 59 | +After setting the name, description, and retrieval settings, either click **Create empty knowledge base** to finish, or click **Create** to continue to the file upload step. |
| 60 | + |
| 61 | +## Upload Files |
| 62 | + |
| 63 | +- **Count and size**: Up to 5 files per upload; each file up to 10 MB. |
| 64 | +- **Formats**: DOCX, PPTX, HTML, PDF, MD, CSV, XLSX, VTT, JPG, PNG, TXT. |
| 65 | +- **Source**: Only local upload is supported. Web or other external sources are not supported. |
| 66 | + |
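If you prepare uploads with a script, a local pre-check against the limits above can catch problems early. The sketch below is illustrative only: `validate_upload` is a hypothetical helper, and it assumes that 1 MB means 1024 × 1024 bytes and that the format is determined by the file extension.

```python
from pathlib import PurePath

MAX_FILES = 5
MAX_BYTES = 10 * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes
ALLOWED = {"docx", "pptx", "html", "pdf", "md", "csv", "xlsx", "vtt", "jpg", "png", "txt"}

def validate_upload(files):
    """Check a batch of (file_name, size_in_bytes) pairs; return a list of error messages."""
    errors = []
    if len(files) > MAX_FILES:
        errors.append(f"too many files: {len(files)} > {MAX_FILES}")
    for name, size in files:
        extension = PurePath(name).suffix.lower().lstrip(".")
        if extension not in ALLOWED:
            errors.append(f"{name}: unsupported format")
        if size > MAX_BYTES:
            errors.append(f"{name}: larger than 10 MB")
    return errors

ok = validate_upload([("guide.pdf", 2_000_000), ("notes.MD", 1_500)])
bad = validate_upload([("tool.exe", 10), ("big.pdf", 11 * 1024 * 1024)])
```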
| 67 | +## Text and Chunking Settings |
| 68 | + |
| 69 | +This step preprocesses your content and splits it into **chunks**, which are the units used for retrieval. Chunk quality directly affects recall and answer quality. |
| 70 | + |
| 71 | +- **Chunk length**: For all formats except TXT, you can set the approximate length (in characters) per chunk; the system will split as close to that as possible. |
| 72 | +- **TXT-only options**: |
| 73 | + - **Chunk delimiter**: Characters (or string) that trigger a new chunk. Default is `\n\n` (paragraph breaks). |
| 74 | + - **Chunk overlap**: Number of characters shared between adjacent chunks. Some overlap preserves context across chunk boundaries and can improve recall; a typical value is 10–25% of the chunk length. |
| 75 | + |
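The TXT options above can be illustrated with a small splitter. This is a sketch under assumptions, not the product's algorithm: it assumes each delimiter starts a new chunk and that overlap is applied by prefixing a chunk with the last `overlap` characters of the preceding segment. The function name `split_txt` is hypothetical.

```python
def split_txt(text, delimiter="\n\n", overlap=0):
    """Split text on delimiter and carry `overlap` trailing characters forward.

    Empty or whitespace-only segments are dropped.
    """
    if not delimiter:
        raise ValueError("delimiter must be a non-empty string")
    if overlap < 0:
        raise ValueError("overlap must be zero or positive")
    segments = [s for s in text.split(delimiter) if s.strip()]
    chunks = []
    for i, segment in enumerate(segments):
        # Prefix with the tail of the previous segment to preserve context.
        prefix = segments[i - 1][-overlap:] if i > 0 and overlap > 0 else ""
        chunks.append(prefix + segment)
    return chunks

sample = "alpha\n\nbeta\n\ngamma"
plain = split_txt(sample)
overlapped = split_txt(sample, overlap=2)
```

Here `plain` contains the three paragraphs unchanged, while each later chunk in `overlapped` starts with the last two characters of the paragraph before it.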
| 76 | +## Finish Creation |
| 77 | + |
| 78 | +Confirm your upload and chunking settings, then wait for processing to complete. Once done, the knowledge base can be used in your apps for retrieval. |