| 1 | +--- |
| 2 | +title: "Vector Storage" |
| 3 | +weight: 7 |
| 4 | +type: docs |
| 5 | +aliases: |
| 6 | +- /append-table/vector-storage.html |
| 7 | +--- |
| 8 | +<!-- |
| 9 | +Licensed to the Apache Software Foundation (ASF) under one |
| 10 | +or more contributor license agreements. See the NOTICE file |
| 11 | +distributed with this work for additional information |
| 12 | +regarding copyright ownership. The ASF licenses this file |
| 13 | +to you under the Apache License, Version 2.0 (the |
| 14 | +"License"); you may not use this file except in compliance |
| 15 | +with the License. You may obtain a copy of the License at |
| 16 | + |
| 17 | + http://www.apache.org/licenses/LICENSE-2.0 |
| 18 | + |
| 19 | +Unless required by applicable law or agreed to in writing, |
| 20 | +software distributed under the License is distributed on an |
| 21 | +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| 22 | +KIND, either express or implied. See the License for the |
| 23 | +specific language governing permissions and limitations |
| 24 | +under the License. |
| 25 | +--> |
| 26 | + |
| 27 | +# Vector Storage |
| 28 | + |
| 29 | +## Overview |
| 30 | + |
| 31 | +As AI workloads grow, efficient vector storage has become increasingly important. |
| 32 | + |
| 33 | +Paimon provides storage optimized for vector data: a dedicated `VECTOR` data type and the option to store vector columns separately in their own file format. |
| 34 | + |
| 35 | +## Vector Data Type |
| 36 | + |
| 37 | +Vector data comes in many types, among which dense vectors are the most commonly used. They are typically expressed as fixed-length, densely packed arrays, generally without `null` elements. |
| 38 | + |
| 39 | +Paimon supports defining columns of type `VECTOR<t, n>`, which represents a fixed-length, dense vector column, where: |
| 40 | + - **`t`**: The element type of the vector. Currently supports seven primitive types: `BOOLEAN`, `TINYINT`, `SMALLINT`, `INT`, `BIGINT`, `FLOAT`, `DOUBLE`; |
| 41 | + - **`n`**: The vector dimension, must be a positive integer not exceeding `2,147,483,647`; |
| 42 | + - **`null constraint`**: A `VECTOR` column can be declared `NOT NULL`; otherwise it is nullable by default. A `VECTOR` value that is not `null`, however, must not contain `null` elements. |
| 43 | + |
| 44 | +Compared to variable-length arrays, these features make dense vectors more concise in storage and memory representation, with benefits including: |
| 45 | + - More natural semantic constraints, preventing mismatched lengths, `null` elements, and other anomalies at the data storage layer; |
| 46 | + - Better point-lookup performance, eliminating offset array storage and access; |
| 47 | + - Closer alignment with type representations in specialized vector engines, often avoiding memory copies and type conversions during queries. |
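
To make the offset-array point concrete, here is a minimal sketch in plain Java. It is not Paimon's storage code: the class and method names are hypothetical, and it only illustrates the general idea that a fixed dimension `n` lets a reader compute a row's position as `row * n`, whereas a variable-length array column has to store and consult an offsets array.

```java
import java.util.Arrays;

/** Illustrative only: hypothetical names, not Paimon's actual storage layout code. */
class DenseVectorLayoutSketch {

    /** Variable-length layout: a values buffer plus an offsets array with (rows + 1) entries. */
    static float[] readVariable(float[] values, int[] offsets, int row) {
        return Arrays.copyOfRange(values, offsets[row], offsets[row + 1]);
    }

    /** Fixed-length layout: row r always starts at r * dim, so no offsets are stored or read. */
    static float[] readFixed(float[] values, int dim, int row) {
        return Arrays.copyOfRange(values, row * dim, (row + 1) * dim);
    }

    public static void main(String[] args) {
        float[] values = {1f, 2f, 3f, 4f, 5f, 6f}; // two rows of dimension 3
        int[] offsets = {0, 3, 6};                 // only needed by the variable-length layout

        System.out.println(Arrays.toString(readVariable(values, offsets, 1))); // [4.0, 5.0, 6.0]
        System.out.println(Arrays.toString(readFixed(values, 3, 1)));          // [4.0, 5.0, 6.0]
    }
}
```

Both calls return the same row, but the fixed-length reader needs only the dimension, which is part of the column type rather than of the data.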
| 48 | + |
| 49 | +Example: Define a table with a `VECTOR` column using the Java API, write one row of data, and read it back. |
| 50 | +```java |
| 51 | +public class CreateTableWithVector { |
| 52 | + |
| 53 | + public static void main(String[] args) throws Exception { |
| 54 | + // Schema |
| 55 | + Schema.Builder schemaBuilder = Schema.newBuilder(); |
| 56 | + schemaBuilder.column("id", DataTypes.BIGINT()); |
| 57 | + schemaBuilder.column("embed", DataTypes.VECTOR(3, DataTypes.FLOAT())); |
| 58 | + schemaBuilder.option(CoreOptions.FILE_FORMAT.key(), "lance"); |
| 59 | + schemaBuilder.option(CoreOptions.FILE_COMPRESSION.key(), "none"); |
| 60 | + Schema schema = schemaBuilder.build(); |
| 61 | + |
| 62 | + // Create catalog |
| 63 | + String database = "default"; |
| 64 | + String tempPath = System.getProperty("java.io.tmpdir") + "/" + UUID.randomUUID(); |
| 65 | + Path warehouse = new Path(TraceableFileIO.SCHEME + "://" + tempPath); |
| 66 | + Identifier identifier = Identifier.create(database, "my_table"); |
| 67 | + try (Catalog catalog = CatalogFactory.createCatalog(CatalogContext.create(warehouse))) { |
| 68 | + |
| 69 | + // Create table |
| 70 | + catalog.createDatabase(database, true); |
| 71 | + catalog.createTable(identifier, schema, true); |
| 72 | + FileStoreTable table = (FileStoreTable) catalog.getTable(identifier); |
| 73 | + |
| 74 | + // Write data |
| 75 | + BatchWriteBuilder builder = table.newBatchWriteBuilder(); |
| 76 | + InternalVector vector = BinaryVector.fromPrimitiveArray(new float[] {1.0f, 2.0f, 3.0f}); |
| 77 | + try (BatchTableWrite batchTableWrite = builder.newWrite()) { |
| 78 | + try (BatchTableCommit commit = builder.newCommit()) { |
| 79 | + batchTableWrite.write(GenericRow.of(1L, vector)); |
| 80 | + commit.commit(batchTableWrite.prepareCommit()); |
| 81 | + } |
| 82 | + } |
| 83 | + |
| 84 | + // Read data |
| 85 | + ReadBuilder readBuilder = table.newReadBuilder(); |
| 86 | + TableScan.Plan plan = readBuilder.newScan().plan(); |
| 87 | + try (RecordReader<InternalRow> reader = readBuilder.newRead().createReader(plan)) { |
| 88 | + reader.forEachRemaining(row -> { |
| 89 | + float[] readVector = row.getVector(1).toFloatArray(); |
| 90 | + System.out.println(Arrays.toString(readVector)); |
| 91 | + }); |
| 92 | + } |
| 93 | + } |
| 94 | + } |
| 95 | +} |
| 96 | +``` |
| 97 | + |
| 98 | +**Notes**: |
| 99 | + - Columns of `VECTOR` type cannot be used as primary key columns, partition columns, or for sorting. |
| 100 | + |
| 101 | +## Engine-Level Representation |
| 102 | + |
| 103 | +Compute engines typically have no dedicated vector type. To support the `VECTOR` type in engine SQL, Paimon provides table options that convert an engine `ARRAY` column to a Paimon `VECTOR` column. |
| 104 | + |
| 105 | +Usage: |
| 106 | + - **`'vector-field'`**: Declare columns as `VECTOR` type, multiple columns separated by commas (`,`); |
| 107 | + - **`'field.{field-name}.vector-dim'`**: Declare the dimension of the vector column. |
| 108 | + |
| 109 | +Example: Define a table with a `VECTOR` column using Flink SQL. |
| 110 | +```sql |
| 111 | +CREATE TABLE IF NOT EXISTS ts_table ( |
| 112 | + id BIGINT, |
| 113 | + embed1 ARRAY<FLOAT>, |
| 114 | + embed2 ARRAY<FLOAT> |
| 115 | +) WITH ( |
| 116 | + 'file.format' = 'lance', |
| 117 | + 'vector-field' = 'embed1,embed2', |
| 118 | + 'field.embed1.vector-dim' = '128', |
| 119 | + 'field.embed2.vector-dim' = '768' |
| 120 | +); |
| 121 | +``` |
| 122 | + |
| 123 | +**Notes**: |
| 124 | + - When defining `vector-field` columns, you must provide the vector dimension; otherwise, the CREATE TABLE statement will fail; |
| 125 | + - Currently, only Flink SQL supports this configuration; support for other engines has not been implemented yet. |
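
The relationship between the two options can be sketched as a small validation routine. This is a hedged illustration in plain Java, not Paimon's implementation: the class `VectorOptionsSketch` and the method `vectorDims` are hypothetical, and trimming whitespace around column names is an assumption. Only the option keys `vector-field` and `field.{field-name}.vector-dim` come from the text above.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: hypothetical helper, not part of the Paimon API. */
class VectorOptionsSketch {

    /** Maps each column listed in 'vector-field' to its dimension; throws if one is missing or invalid. */
    static Map<String, Integer> vectorDims(Map<String, String> options) {
        Map<String, Integer> dims = new LinkedHashMap<>();
        String fields = options.get("vector-field");
        if (fields == null || fields.trim().isEmpty()) {
            return dims; // no vector columns declared
        }
        for (String raw : fields.split(",")) {
            String name = raw.trim(); // assumption: surrounding whitespace is ignored
            String key = "field." + name + ".vector-dim";
            String value = options.get(key);
            if (value == null) {
                throw new IllegalArgumentException("Missing option '" + key + "'");
            }
            int dim = Integer.parseInt(value.trim()); // NumberFormatException if not an integer
            if (dim <= 0) {
                throw new IllegalArgumentException("'" + key + "' must be a positive integer");
            }
            dims.put(name, dim);
        }
        return dims;
    }

    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>();
        options.put("vector-field", "embed1,embed2");
        options.put("field.embed1.vector-dim", "128");
        options.put("field.embed2.vector-dim", "768");
        System.out.println(vectorDims(options)); // {embed1=128, embed2=768}
    }
}
```

Removing `field.embed2.vector-dim` from the map makes the sketch throw, which mirrors the documented behavior that a `vector-field` column without a dimension causes the CREATE TABLE statement to fail.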
| 126 | + |
| 127 | +## Specify File Format for Vector |
| 128 | + |
| 129 | +When mapping the `VECTOR` type to the file format layer, the ideal storage format is `FixedSizeList`. Currently, this is only supported for certain file formats (such as `lance`) through the `paimon-arrow` integration. This means that to use the `VECTOR` type, you must select such a format via `file.format`, which applies to every column in the table. That choice may be a poor fit for scalar columns and multimodal (Blob) data. |
| 130 | + |
| 131 | +Therefore, Paimon provides a solution to store vector columns separately within Data Evolution tables. |
| 132 | + |
| 133 | +Layout: |
| 134 | +``` |
| 135 | +table/ |
| 136 | +├── bucket-0/ |
| 137 | +│ ├── data-uuid-0.parquet # Contains id, name columns |
| 138 | +│ ├── data-uuid-1.blob # Contains blob data |
| 139 | +│ ├── data-uuid-2.vector.lance # Contains vector data using lance format |
| 140 | +│ └── ... |
| 141 | +├── manifest/ |
| 142 | +├── schema/ |
| 143 | +└── snapshot/ |
| 144 | +``` |
| 145 | + |
| 146 | +Usage: |
| 147 | + - **`vector.file.format`**: Store `VECTOR` type columns separately in the specified file format; |
| 148 | + - **`vector.target-file-size`**: If stored separately, specifies the target file size for vector data, defaulting to `10 * 'target-file-size'`. |
| 149 | + |
| 150 | +Example: Store `VECTOR` columns separately using Flink SQL. |
| 151 | +```sql |
| 152 | +CREATE TABLE IF NOT EXISTS ts_table ( |
| 153 | + id BIGINT, |
| 154 | + embed ARRAY<FLOAT> |
| 155 | +) WITH ( |
| 156 | + 'file.format' = 'parquet', |
| 157 | + 'vector.file.format' = 'lance', |
| 158 | + 'vector-field' = 'embed', |
| 159 | + 'field.embed.vector-dim' = '128', |
| 160 | + 'row-tracking.enabled' = 'true', |
| 161 | + 'data-evolution.enabled' = 'true' |
| 162 | +); |
| 163 | +``` |
| 164 | + |
| 165 | +**Notes**: |
| 166 | + - If `vector.file.format` is the same as `file.format`, the data will be stored together and not separately; |
| 167 | + - Separate storage is only supported for append tables, not primary key tables, and requires both `row-tracking.enabled` and `data-evolution.enabled` to be set to `true`. |
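
The two rules above (when vector columns are actually stored separately, and how the vector target file size defaults) can be summarized in a short sketch. This is plain Java with hypothetical names, not Paimon's code; comparing format names with exact string equality and passing sizes as byte counts are assumptions made for illustration.

```java
/** Illustrative only: hypothetical helper, not part of the Paimon API. */
class VectorFileFormatSketch {

    /** Vector columns get their own files only if 'vector.file.format' is set and differs from 'file.format'. */
    static boolean storedSeparately(String fileFormat, String vectorFileFormat) {
        // assumption: format names are compared as exact strings
        return vectorFileFormat != null && !vectorFileFormat.equals(fileFormat);
    }

    /** 'vector.target-file-size' falls back to 10 * 'target-file-size' when not configured. */
    static long vectorTargetFileSize(Long configuredBytes, long targetFileSizeBytes) {
        return configuredBytes != null ? configuredBytes : 10L * targetFileSizeBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println(storedSeparately("parquet", "lance"));        // true
        System.out.println(storedSeparately("lance", "lance"));          // false
        System.out.println(vectorTargetFileSize(null, 128 * mb) / mb);   // 1280
    }
}
```

The 128 MB value in `main` is just an example input, not a statement about the table's default `target-file-size`.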