---
title: "PyPaimon Release 0.2.0"
type: release
version: pypaimon-0.2.0
weight: 91
---

# PyPaimon 0.2.0 Available

Dec 19, 2024 - Zelin Yu (yuzelin.yzl@gmail.com)

The Apache Paimon PMC officially announces the release of PyPaimon 0.2.0. Since 0.1.0 was never
released, this is the first published version.

## What is PyPaimon?

[PyPaimon](https://github.com/apache/paimon-python) is the Python SDK of Apache Paimon. It lets users
read data from Paimon tables with Python for data analysis, and write data back to Paimon tables.

## Version Overview

The first version of PyPaimon supports the following features:

1. Connect to a `Catalog`.
2. Get or create a table.
3. Batch read: filter and projection pushdown, and parallel reading of data into Apache Arrow, Pandas, DuckDB and Ray formats.
4. Batch write: insert into or overwrite a table with Apache Arrow and Pandas data.

The detailed documentation can be found at https://paimon.apache.org/docs/master/program-api/python-api/.

### Connect to Catalog

You can create a `Catalog` with options just like in SQL:

```python
from pypaimon.py4j import Catalog

catalog_options = {
    'warehouse': 'path/to/warehouse',
    'metastore': 'filesystem'
    # other options
}

catalog = Catalog.create(catalog_options)
```

You can connect to any `Catalog` implementation supported by the Java side. PyPaimon has built-in support for the `filesystem`, `jdbc` and `hive` catalogs.
If you want to connect to your own custom catalogs, you can add the dependency JARs as follows:

```python
import os
from pypaimon.py4j import constants

os.environ[constants.PYPAIMON_JAVA_CLASSPATH] = '/path/to/jars/*'
```

### Get or create table

You can get an existing table from the `Catalog` by its identifier:

```python
table = catalog.get_table('database_name.table_name')
```

You can also create a new table. The table fields are described by a `pyarrow.Schema`, and you can set primary keys,
partition keys, table options and a comment.

```python
import pyarrow as pa
from pypaimon import Schema

# field definitions
pa_schema = pa.schema([
    ('dt', pa.string()),
    ('hh', pa.string()),
    ('pk', pa.int64()),
    ('value', pa.string())
])
# table schema
schema = Schema(
    pa_schema=pa_schema,
    partition_keys=['dt', 'hh'],
    primary_keys=['dt', 'hh', 'pk'],
    options={'bucket': '2'},
    comment='my test table'
)

# create table
catalog.create_table(identifier='default.test_table', schema=schema, ignore_if_exists=False)
```

Then you can obtain read and write interfaces from the table.

## Batch Read

Assume that you already have the table `default.test_table` described in the previous section. Let's see how to read data from it.

```python
from pypaimon.py4j import Catalog

# set 'max-workers' (thread count) for parallel reading
catalog_options = {
    'warehouse': 'path/to/warehouse',
    'metastore': 'filesystem',
    'max-workers': '4'
}
catalog = Catalog.create(catalog_options)
table = catalog.get_table('default.test_table')

# use ReadBuilder to perform filter and projection pushdown
read_builder = table.new_read_builder()

# select partition: dt='2024-12-01', hh='12'
predicate_builder = read_builder.new_predicate_builder()
dt_predicate = predicate_builder.equal('dt', '2024-12-01')
hh_predicate = predicate_builder.equal('hh', '12')
partition_predicate = predicate_builder.and_([dt_predicate, hh_predicate])
read_builder = read_builder.with_filter(partition_predicate)

# select pk and value
read_builder = read_builder.with_projection(['pk', 'value'])

# plan splits
table_scan = read_builder.new_scan()
splits = table_scan.splits()

# read data into a pandas.DataFrame
table_read = read_builder.new_read()
df = table_read.to_pandas(splits)
```

Then you can do some analysis on the DataFrame with Python.
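
For instance, a grouping aggregation over the projected `pk` and `value` columns could look like this (the sample rows below are made up for illustration; in practice the DataFrame comes from reading the splits):

```python
import pandas as pd

# stand-in for the DataFrame returned by reading the splits;
# the real contents depend on your table (sample data only)
df = pd.DataFrame({
    'pk': [1, 2, 1, 3],
    'value': ['a', 'b', 'a', 'c'],
})

# count rows per primary key
counts = df.groupby('pk').size().sort_index()
print(counts.to_dict())  # prints {1: 2, 2: 1, 3: 1}
```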

## Batch Write

Assume that you already have the table `default.test_table` described in the previous section. Let's see how to write to or overwrite it.

First, assume that you have a DataFrame containing data for 2024-12-02, 12 o'clock, and you want to write it into the table.

```python
write_builder = table.new_batch_write_builder()
table_write = write_builder.new_write()
table_commit = write_builder.new_commit()

# you can write data many times before committing
dataframe = ...
table_write.write_pandas(dataframe)

commit_messages = table_write.prepare_commit()
table_commit.commit(commit_messages)

table_write.close()
table_commit.close()
```
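
As a hint of what `dataframe` could contain, here is a hypothetical Pandas DataFrame matching the table schema defined earlier (note that the partition columns `dt` and `hh` are strings; the rows are made up):

```python
import pandas as pd

# made-up rows for partition dt='2024-12-02', hh='12'
dataframe = pd.DataFrame({
    'dt': ['2024-12-02', '2024-12-02'],
    'hh': ['12', '12'],
    'pk': [1, 2],
    'value': ['apple', 'banana'],
})

# column names line up with the pyarrow schema: dt, hh, pk, value
print(list(dataframe.columns))  # prints ['dt', 'hh', 'pk', 'value']
```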

Now let's see how to overwrite the partition `dt='2024-12-02', hh='12'` with new data.
```python
write_builder = table.new_batch_write_builder()
# set the partition to overwrite
write_builder = write_builder.overwrite({'dt': '2024-12-02', 'hh': '12'})

table_write = write_builder.new_write()
table_commit = write_builder.new_commit()

# then write data
dataframe = ...
table_write.write_pandas(dataframe)

commit_messages = table_write.prepare_commit()
table_commit.commit(commit_messages)

table_write.close()
table_commit.close()
```

### Various data formats

PyPaimon supports reading data in the following formats: Pandas, Apache Arrow, DuckDB and Ray, and writing data in the following
formats: Pandas and Apache Arrow. Please refer to the [documentation](https://paimon.apache.org/docs/master/program-api/python-api/) for details.