[Feature][Transform] Support single/batch mode vectorization using Amazon Titan & cohere embedding model by SEZ9 · Pull Request #9120 · apache/seatunnel

SEZ9 · 2025-04-07T14:26:45Z

Purpose of this pull request

Does this PR introduce any user-facing change?

Description
Add support for the Amazon Titan model in the embedding model_provider configuration;
Implement batch inference in the embedding process, so that multiple rows are sent to the model API in a single request;
Detect whether each batch request succeeded and apply fault tolerance when it did not.
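
The batch-and-fall-back behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class BatchEmbeddingSketch, its methods, and the model function are hypothetical stand-ins, and the fallback policy (retry a failed batch one text at a time) is an assumption about what "fault tolerance" could mean here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch; none of these names come from the SeaTunnel code base. */
final class BatchEmbeddingSketch {

    /** Splits the inputs into consecutive chunks of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    /**
     * Embeds all texts batch by batch, preserving input order.
     * If a batch call throws, or returns a different number of vectors than
     * texts sent, that batch is retried one text at a time (assumed policy).
     */
    static List<float[]> embedAll(
            List<String> texts,
            int batchSize,
            Function<List<String>, List<float[]>> model) {
        List<float[]> vectors = new ArrayList<>(texts.size());
        for (List<String> batch : partition(texts, batchSize)) {
            List<float[]> result;
            try {
                result = model.apply(batch);
                // "Successful detection": the response must cover every input.
                if (result == null || result.size() != batch.size()) {
                    throw new IllegalStateException("incomplete batch response");
                }
            } catch (RuntimeException batchFailure) {
                // Fault tolerance: fall back to single-text requests.
                result = new ArrayList<>(batch.size());
                for (String text : batch) {
                    result.add(model.apply(Collections.singletonList(text)).get(0));
                }
            }
            vectors.addAll(result);
        }
        return vectors;
    }
}
```

In this sketch a failure that persists in single mode still propagates to the caller; whether the real transform retries, skips, or fails the job in that case is not stated in this PR description.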
Usage Scenario
When vectorizing text at large scale and storing the results in a vector database, users need the vectorization to be efficient and low-cost. For example:

User review analysis: millions or tens of millions of rows of data need to be sent for vectorization at one time.
Image search: users often vectorize hundreds of thousands or millions of images into the database for subsequent approximate vector retrieval.

How was this patch tested?

Check list

If your PR adds any new Jar binary package, please add a License Notice according to the
New License Guide
If necessary, please update the documentation to describe the new feature. https://github.com/apache/seatunnel/tree/dev/docs
If you are contributing the connector code, please check that the following files are updated:
1. Update plugin-mapping.properties and add new connector information in it
2. Update the pom file of seatunnel-dist
3. Add ci label in label-scope-conf
4. Add e2e testcase in seatunnel-e2e
5. Update connector plugin_config
Update the release-note.

init bedrock model files

init parameters and configuration

test complete

Copilot

Copilot reviewed 5 out of 6 changed files in this pull request and generated no comments.

Files not reviewed (1)

seatunnel-transforms-v2/pom.xml: Language not supported

hailin0 · 2025-04-07T14:30:34Z

https://github.com/apache/seatunnel/pull/9120/checks?check_run_id=40108495922

hailin0

Please update docs

https://github.com/apache/seatunnel/blob/dev/docs/en/transform-v2/embedding.md
https://github.com/apache/seatunnel/blob/dev/docs/zh/transform-v2/embedding.md

SEZ9 · 2025-04-07T15:42:40Z

Updated the docs, both en and cn.

corgy-w · 2025-04-08T01:30:59Z

Are the Amazon e2e tests missing?

corgy-w · 2025-04-08T01:34:42Z

Please update EmbeddingTransformFactory config

SEZ9 · 2025-04-08T03:59:52Z

Updated EmbeddingTransformFactory and added the Amazon model config.

SEZ9 · 2025-04-08T04:23:55Z

Updated the Amazon e2e tests in embedding_transform.conf.

corgy-w · 2025-04-08T07:16:14Z

+                .conditional(
+                        EmbeddingTransformConfig.MODEL_PROVIDER,
+                        ModelProvider.AMAZON,
+                        EmbeddingTransformConfig.API_KEY,
+                        EmbeddingTransformConfig.SECRET_KEY,
+                        EmbeddingTransformConfig.AWS_REGION,
+                        EmbeddingTransformConfig.MODEL,
+                        EmbeddingTransformConfig.DIMENSION)


Is region not here?

AWS region is a required parameter when calling the Amazon model.
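
To make the effect of the conditional rule in the diff above concrete, here is a rough plain-Java sketch of "these options become required only when the provider is AMAZON". It does not use SeaTunnel's option-rule API; the class ConditionalOptionSketch and the string keys (api_key, secret_key, region, model, dimension) are assumptions chosen for illustration, and the real key names are whatever EmbeddingTransformConfig defines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a conditional required-option check. */
final class ConditionalOptionSketch {

    // Assumed key names, mirroring the constants listed in the diff
    // (API_KEY, SECRET_KEY, AWS_REGION, MODEL, DIMENSION).
    static final List<String> AMAZON_REQUIRED =
            Arrays.asList("api_key", "secret_key", "region", "model", "dimension");

    /** Returns the required keys that are absent or blank; empty if the rule does not apply. */
    static List<String> missingOptions(Map<String, String> config) {
        List<String> missing = new ArrayList<>();
        // The rule only applies when the provider is AMAZON.
        if (!"AMAZON".equalsIgnoreCase(config.get("model_provider"))) {
            return missing;
        }
        for (String key : AMAZON_REQUIRED) {
            String value = config.get(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }
}
```

With such a check, a configuration that selects the Amazon provider but omits the region is rejected before any request is sent, which matches the point that the region is required when calling the Amazon model.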

SEZ9 · 2025-04-28T08:54:29Z

Hi @hailin0 @corgy-w @Hisoka-X. The Transform e2e test has passed. The earlier failure happened because the aws-sdk client in the e2e test was not shut down cleanly, which resulted in a timeout.
Please help me check whether this PR can be merged, thanks!

hailin0 · 2025-04-28T09:48:19Z

Waiting for CI to pass.

SEZ9 · 2025-06-02T13:11:27Z

Hi @hailin0 @corgy-w @Hisoka-X. All checks have passed now, please help me check whether this PR can be merged, thanks!

hailin0

LGTM


SEZ9 added 8 commits March 30, 2025 16:42

init bedrock model files

02a62e9

init bedrock model files

Merge branch 'apache:dev' into dev

2d06f2e

init parameters

b0350d3

init parameters and configuration

Merge branch 'dev' of https://github.com/SEZ9/seatunnel into dev

c75cad1

test complete

11f476b

test complete

fix type

4ee4dd5

fix type

af08bbe

fix typo

590cf92

github-actions bot added the Transform-v2 label Apr 7, 2025

hailin0 requested a review from Copilot April 7, 2025 14:29

Copilot AI reviewed Apr 7, 2025


hailin0 reviewed Apr 7, 2025


SEZ9 added 4 commits April 7, 2025 23:11

Merge branch 'apache:dev' into dev

88e616c

change link

6b4b3ff

Merge branch 'dev' of https://github.com/SEZ9/seatunnel into dev

e531857

trigger build

5f43a40

SEZ9 changed the title ~~[Feature][Transform] Support batch mode vectorization using Amazon Titan & cohere embedding mode~~ [Feature][Transform] Support single/batch mode vectorization using Amazon Titan & cohere embedding model Apr 7, 2025

update doc

72246c5

github-actions bot added the document label Apr 7, 2025

updated EmbeddingTransformFactory

6e1a424

add e2e transform amazon model

8d7b4e3

github-actions bot added the e2e label Apr 8, 2025

corgy-w reviewed Apr 8, 2025


Merge branch 'apache:dev' into dev

384db01

SEZ9 added 9 commits April 26, 2025 15:22

add support amazon endpoint , modified e2e mock test

aaf88da

update bedrock e2e URI

d76c740

change e2e only bedrock cohere

d847009

Update embedding_transform.conf

efbaa87

Update mock-embedding.json

0ab05c0

Update mock-embedding.json

c36114e

Update BedrockModel.java

ee121ac

Update TestEmbeddingIT.java

0298311

remove useIdleConnectionReaper

33aca70

SEZ9 added 6 commits May 7, 2025 12:12

Merge branch 'apache:dev' into dev

1e53c59

Merge branch 'apache:dev' into dev

7894179

Merge branch 'apache:dev' into dev

b9f8313

Merge branch 'apache:dev' into dev

6cd605d

Merge branch 'apache:dev' into dev

05cb544

Update Elasticsearch.md

fd6ae5e

github-actions bot added CI&CD and removed CI&CD labels May 30, 2025

hailin0 approved these changes Jun 3, 2025


github-actions bot added approved reviewed labels Jun 3, 2025

Hisoka-X approved these changes Jun 3, 2025


Hisoka-X merged commit 37d410c into apache:dev Jun 3, 2025
7 checks passed

dybyte pushed a commit to dybyte/seatunnel that referenced this pull request Jul 23, 2025

[Feature][Transform] Support single/batch mode vectorization using Am…

380f0ea

…azon Titan & cohere embedding model (apache#9120)
