Adding schema loader transform to cdc data generator #3483

Open
shreyakhajanchi wants to merge 5 commits into GoogleCloudPlatform:main from shreyakhajanchi:dg-schema-loader

@shreyakhajanchi
Contributor

@shreyakhajanchi shreyakhajanchi commented Mar 12, 2026

Add Schema Fetching Abstraction for CDC Data Generator

This PR introduces a flexible mechanism for fetching schemas from various sinks (Spanner, MySQL) to be used with the CDC Data Generator.

Key Features & Components:

  • SinkSchemaFetcher Interface: A common interface (com.google.cloud.teleport.v2.templates.sink.SinkSchemaFetcher) for abstracting the schema retrieval process from different database sinks.
  • Concrete Implementations:
    • SpannerSchemaFetcher: Fetches schema information from Cloud Spanner using the Spanner API and DDL information.
    • MySqlSchemaFetcher: Fetches schema information from MySQL using JDBC and MySqlInformationSchemaScanner.
  • TypeMapper Interface: An interface (com.google.cloud.teleport.v2.templates.sink.TypeMapper) to convert sink-specific data types into a common LogicalType used by the data generator.
    • Includes implementations for Spanner (SpannerTypeMapper) and MySQL (MySqlTypeMapper).
  • FetchSchemaFn: A Beam DoFn (com.google.cloud.teleport.v2.templates.dofn.FetchSchemaFn) responsible for orchestrating the schema fetching process using the appropriate SinkSchemaFetcher based on the configured SinkType.
  • SchemaLoader Transform: The SchemaLoader utilizes FetchSchemaFn to load the sink schema as a side input for the pipeline. It's designed to be extensible for other sink types.
  • Configuration: Sink connection details are expected in configuration files (e.g., JSON); the file paths are passed to the fetchers.
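
To make the fetcher abstraction above concrete, here is a minimal sketch of how a fetcher could be selected per sink type. Every name in it (SketchSchema, SketchSchemaFetcher, fetcherFor, the stub table names, the config path) is hypothetical and only stands in for the PR's SinkSchemaFetcher, FetchSchemaFn, and DataGeneratorSchema; the real signatures may differ.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sink kinds; the PR defines its own SinkType enum. */
enum SinkType { SPANNER, MYSQL }

/** Simplified stand-in for DataGeneratorSchema: table name -> column names. */
final class SketchSchema {
  final Map<String, List<String>> tables;

  SketchSchema(Map<String, List<String>> tables) {
    this.tables = tables;
  }
}

/** Assumed shape of the fetcher abstraction; not the PR's actual interface. */
interface SketchSchemaFetcher {
  SketchSchema fetch(String configPath);
}

final class SchemaFetcherSketch {
  /** Picks a fetcher for the configured sink type, the role FetchSchemaFn is described as playing. */
  static SketchSchemaFetcher fetcherFor(SinkType sinkType) {
    switch (sinkType) {
      case SPANNER:
        // A real fetcher would read the database DDL; this stub returns a fixed table.
        return configPath -> new SketchSchema(Map.of("Singers", List.of("SingerId", "Name")));
      case MYSQL:
        // A real fetcher would query MySQL over JDBC; this stub returns a fixed table.
        return configPath -> new SketchSchema(Map.of("singers", List.of("singer_id", "name")));
      default:
        throw new IllegalArgumentException("Unsupported sink type: " + sinkType);
    }
  }

  public static void main(String[] args) {
    // The path is a placeholder; the stubs above never open it.
    SketchSchema schema = fetcherFor(SinkType.MYSQL).fetch("/path/to/mysql-config.json");
    System.out.println(schema.tables);
  }
}
```

Keeping the dispatch in one place means a new sink type only needs a new enum value and one more fetcher implementation.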

Notes:

  • The current data type mappings in MySqlTypeMapper and SpannerTypeMapper cover a basic set of types; future PRs will extend them to more types.
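
As an illustration of what partial type coverage can look like, the sketch below maps a few MySQL column types to a small logical type enum and falls back to STRING for anything unmapped. The enum values, the chosen MySQL types, and the STRING fallback are all assumptions made for this example; they are not taken from the PR's MySqlTypeMapper or LogicalType.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical subset of logical types; the PR's LogicalType enum may differ. */
enum SketchLogicalType { STRING, INT64, FLOAT64, BOOLEAN, BYTES, TIMESTAMP }

final class MySqlTypeMapperSketch {
  // Base MySQL type name -> logical type. Deliberately incomplete.
  private static final Map<String, SketchLogicalType> BY_NAME = Map.of(
      "varchar", SketchLogicalType.STRING,
      "text", SketchLogicalType.STRING,
      "tinyint", SketchLogicalType.INT64,
      "int", SketchLogicalType.INT64,
      "bigint", SketchLogicalType.INT64,
      "double", SketchLogicalType.FLOAT64,
      "blob", SketchLogicalType.BYTES,
      "timestamp", SketchLogicalType.TIMESTAMP);

  static SketchLogicalType toLogicalType(String mysqlType) {
    String normalized = mysqlType.trim().toLowerCase(Locale.ROOT);
    // MySQL's BOOL and BOOLEAN are synonyms for TINYINT(1).
    if (normalized.startsWith("tinyint(1)")) {
      return SketchLogicalType.BOOLEAN;
    }
    // Reduce "varchar(255)" or "bigint unsigned" to the base type name.
    String base = normalized.split("[\\s(]", 2)[0];
    SketchLogicalType mapped = BY_NAME.get(base);
    // Assumed fallback for unmapped types; the PR may throw or handle these differently.
    return mapped != null ? mapped : SketchLogicalType.STRING;
  }
}
```

An explicit fallback keeps schema loading from failing on a type the mapper does not know yet, at the cost of generating less realistic data for those columns.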

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the CDC data generator by introducing a flexible and robust mechanism for loading database schemas directly from target sinks. This new framework, which currently supports Spanner and MySQL, allows the data generator to dynamically adapt to the schema of the destination database, ensuring more accurate and relevant synthetic data generation. The changes involve defining generic interfaces for schema fetching and type mapping, providing concrete implementations for supported databases, and integrating this functionality as a Beam transform for pipeline execution.

Highlights

  • Schema Loading Framework: Introduced a new, extensible framework for dynamically loading database schemas from various sinks, including SinkSchemaFetcher and TypeMapper interfaces.
  • MySQL Schema Integration: Added specific implementations for MySQL, including MySqlSchemaFetcher to retrieve schema details and MySqlTypeMapper to translate MySQL data types.
  • Spanner Schema Integration: Provided specific implementations for Spanner, including SpannerSchemaFetcher to retrieve schema details and SpannerTypeMapper to translate Spanner data types for both GoogleSQL and PostgreSQL dialects.
  • Data Model Relocation: Relocated core data model classes (e.g., DataGeneratorColumn, DataGeneratorSchema) from the test source set to the main source set, indicating their promotion to core application components.
  • Beam Transform for Schema Loading: Implemented a SchemaLoader Beam PTransform to integrate the schema fetching process directly into the data generation pipeline, allowing the schema to be loaded as a side input.
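
Since the Spanner mapper is described as handling both GoogleSQL and PostgreSQL dialects, a rough sketch of dialect-aware mapping follows. The class, enum, and method names are hypothetical, the type lists are intentionally partial, and the STRING fallback is an assumption; none of this is copied from the PR's SpannerTypeMapper.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical dialect marker for this sketch. */
enum DialectSketch { GOOGLE_STANDARD_SQL, POSTGRESQL }

/** Hypothetical subset of logical types; the PR's LogicalType enum may differ. */
enum SpannerLogical { STRING, INT64, FLOAT64, BOOLEAN, BYTES, TIMESTAMP }

final class SpannerTypeMapperSketch {
  // GoogleSQL type names, keyed in upper case.
  private static final Map<String, SpannerLogical> GOOGLE_SQL = Map.of(
      "STRING", SpannerLogical.STRING,
      "INT64", SpannerLogical.INT64,
      "FLOAT64", SpannerLogical.FLOAT64,
      "BOOL", SpannerLogical.BOOLEAN,
      "BYTES", SpannerLogical.BYTES,
      "TIMESTAMP", SpannerLogical.TIMESTAMP);

  // PostgreSQL-dialect type names, keyed in lower case.
  private static final Map<String, SpannerLogical> POSTGRES = Map.of(
      "character varying", SpannerLogical.STRING,
      "text", SpannerLogical.STRING,
      "bigint", SpannerLogical.INT64,
      "double precision", SpannerLogical.FLOAT64,
      "boolean", SpannerLogical.BOOLEAN,
      "bytea", SpannerLogical.BYTES,
      "timestamp with time zone", SpannerLogical.TIMESTAMP);

  static SpannerLogical toLogicalType(String spannerType, DialectSketch dialect) {
    // Drop a length suffix such as "(MAX)" or "(255)".
    String base = spannerType.replaceAll("\\(.*\\)", "").trim();
    SpannerLogical mapped = dialect == DialectSketch.POSTGRESQL
        ? POSTGRES.get(base.toLowerCase(Locale.ROOT))
        : GOOGLE_SQL.get(base.toUpperCase(Locale.ROOT));
    // Assumed fallback for unmapped types (ARRAY, JSON, NUMERIC, and so on).
    return mapped != null ? mapped : SpannerLogical.STRING;
  }
}
```

Separate lookup tables per dialect keep each type list easy to extend on its own.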
Changelog
  • v2/cdc-data-generator/src/main/java/com/google/cloud/teleport/v2/templates/CdcDataGeneratorOptions.java
    • Added CdcDataGeneratorOptions interface with SinkType enum to define available sink types.
  • v2/cdc-data-generator/src/main/java/com/google/cloud/teleport/v2/templates/model/DataGeneratorColumn.java
    • Renamed and moved DataGeneratorColumn from test to main source.
  • v2/cdc-data-generator/src/main/java/com/google/cloud/teleport/v2/templates/model/DataGeneratorForeignKey.java
    • Renamed and moved DataGeneratorForeignKey from test to main source.
  • v2/cdc-data-generator/src/main/java/com/google/cloud/teleport/v2/templates/model/DataGeneratorSchema.java
    • Renamed and moved DataGeneratorSchema from test to main source.
  • v2/cdc-data-generator/src/main/java/com/google/cloud/teleport/v2/templates/model/DataGeneratorTable.java
    • Renamed and moved DataGeneratorTable from test to main source.
  • v2/cdc-data-generator/src/main/java/com/google/cloud/teleport/v2/templates/model/DataGeneratorUniqueKey.java
    • Renamed and moved DataGeneratorUniqueKey from test to main source.
  • v2/cdc-data-generator/src/main/java/com/google/cloud/teleport/v2/templates/model/LogicalType.java
    • Renamed and moved LogicalType from test to main source.
  • v2/cdc-data-generator/src/main/java/com/google/cloud/teleport/v2/templates/model/package-info.java
    • Renamed and moved package-info.java for model classes from test to main source.
  • v2/cdc-data-generator/src/main/java/com/google/cloud/teleport/v2/templates/mysql/MySqlSchemaFetcher.java
    • Added MySqlSchemaFetcher to retrieve schema from MySQL databases.
  • v2/cdc-data-generator/src/main/java/com/google/cloud/teleport/v2/templates/mysql/MySqlTypeMapper.java
    • Added MySqlTypeMapper to map MySQL data types to logical types.
  • v2/cdc-data-generator/src/main/java/com/google/cloud/teleport/v2/templates/mysql/package-info.java
    • Added package-info for MySQL schema components.
  • v2/cdc-data-generator/src/main/java/com/google/cloud/teleport/v2/templates/sink/SinkSchemaFetcher.java
    • Added SinkSchemaFetcher interface for generic schema fetching.
  • v2/cdc-data-generator/src/main/java/com/google/cloud/teleport/v2/templates/sink/TypeMapper.java
    • Added TypeMapper interface for generic type mapping.
  • v2/cdc-data-generator/src/main/java/com/google/cloud/teleport/v2/templates/sink/package-info.java
    • Added package-info for sink interfaces.
  • v2/cdc-data-generator/src/main/java/com/google/cloud/teleport/v2/templates/spanner/SpannerSchemaFetcher.java
    • Added SpannerSchemaFetcher to retrieve schema from Spanner databases.
  • v2/cdc-data-generator/src/main/java/com/google/cloud/teleport/v2/templates/spanner/SpannerTypeMapper.java
    • Added SpannerTypeMapper to map Spanner data types to logical types.
  • v2/cdc-data-generator/src/main/java/com/google/cloud/teleport/v2/templates/spanner/package-info.java
    • Added package-info for Spanner schema components.
  • v2/cdc-data-generator/src/main/java/com/google/cloud/teleport/v2/templates/transforms/SchemaLoader.java
    • Added SchemaLoader Beam PTransform to load DataGeneratorSchema as a side input.
  • v2/cdc-data-generator/src/main/java/com/google/cloud/teleport/v2/templates/transforms/package-info.java
    • Added package-info for Beam transforms.
  • v2/cdc-data-generator/src/test/java/com/google/cloud/teleport/v2/templates/mysql/MySqlTypeMapperTest.java
    • Added unit tests for MySqlTypeMapper.
  • v2/cdc-data-generator/src/test/java/com/google/cloud/teleport/v2/templates/spanner/SpannerSchemaFetcherTest.java
    • Added unit tests for SpannerSchemaFetcher.
  • v2/cdc-data-generator/src/test/java/com/google/cloud/teleport/v2/templates/spanner/SpannerTypeMapperTest.java
    • Added unit tests for SpannerTypeMapper.
Activity
  • No specific activity or comments were provided in the pull request context.

@shreyakhajanchi shreyakhajanchi added the addition label (New feature or request) Mar 12, 2026
@codecov

codecov bot commented Mar 12, 2026

Codecov Report

❌ Patch coverage is 84.34164% with 44 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.38%. Comparing base (3c5c4b5) to head (cc36319).
⚠️ Report is 15 commits behind head on main.

Files with missing lines | Patch % | Lines
...eleport/v2/templates/mysql/MySqlSchemaFetcher.java | 81.31% | 8 Missing and 9 partials ⚠️
...ort/v2/templates/spanner/SpannerSchemaFetcher.java | 86.95% | 6 Missing and 6 partials ⚠️
...loud/teleport/v2/templates/dofn/FetchSchemaFn.java | 68.18% | 6 Missing and 1 partial ⚠️
...leport/v2/templates/spanner/SpannerTypeMapper.java | 89.47% | 2 Missing and 2 partials ⚠️
...teleport/v2/templates/transforms/SchemaLoader.java | 82.35% | 2 Missing and 1 partial ⚠️
...d/teleport/v2/templates/mysql/MySqlTypeMapper.java | 94.44% | 0 Missing and 1 partial ⚠️
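
The headline figure can be cross-checked against the per-file rows. The sketch below sums the missing and partial counts from the table and, assuming patch coverage is covered lines divided by total patch lines with partials counted as not covered, infers the total number of patch lines. The class and method names are made up for this check, and the inferred total is a derived estimate rather than a number stated in the report.

```java
final class PatchCoverageSketch {
  // {missing, partials} per file, copied from the report rows above.
  static final int[][] UNCOVERED = {{8, 9}, {6, 6}, {6, 1}, {2, 2}, {2, 1}, {0, 1}};

  /** Total lines without full coverage across the listed files. */
  static int totalUncovered() {
    int sum = 0;
    for (int[] row : UNCOVERED) {
      sum += row[0] + row[1];
    }
    return sum;
  }

  /** Patch size implied by a coverage percentage, assuming coverage = 1 - uncovered / total. */
  static long impliedPatchLines(double coveragePercent) {
    return Math.round(totalUncovered() / (1.0 - coveragePercent / 100.0));
  }

  public static void main(String[] args) {
    System.out.println(totalUncovered());             // 44, matching the headline count
    System.out.println(impliedPatchLines(84.34164));  // 281 under the stated assumption
  }
}
```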
Additional details and impacted files
@@              Coverage Diff              @@
##               main    #3483       +/-   ##
=============================================
+ Coverage     32.94%   52.38%   +19.44%     
- Complexity      461     5729     +5268     
=============================================
  Files           214     1053      +839     
  Lines         12719    63252    +50533     
  Branches       1240     6925     +5685     
=============================================
+ Hits           4190    33136    +28946     
- Misses         8196    27875    +19679     
- Partials        333     2241     +1908     
Components | Coverage Δ
spanner-templates | 72.21% <ø> (∅)
spanner-import-export | 68.93% <ø> (∅)
spanner-live-forward-migration | 80.49% <ø> (∅)
spanner-live-reverse-replication | 77.93% <ø> (∅)
spanner-bulk-migration | 89.18% <ø> (∅)
gcs-spanner-dv | 85.34% <ø> (∅)
Files with missing lines | Coverage Δ
...teleport/v2/templates/CdcDataGeneratorOptions.java | 100.00% <100.00%> (ø)
...leport/v2/templates/model/DataGeneratorColumn.java | 100.00% <ø> (ø)
...rt/v2/templates/model/DataGeneratorForeignKey.java | 100.00% <ø> (ø)
...leport/v2/templates/model/DataGeneratorSchema.java | 100.00% <ø> (ø)
...eleport/v2/templates/model/DataGeneratorTable.java | 100.00% <ø> (ø)
...ort/v2/templates/model/DataGeneratorUniqueKey.java | 100.00% <ø> (ø)
...cloud/teleport/v2/templates/model/LogicalType.java | 100.00% <ø> (ø)
...d/teleport/v2/templates/mysql/MySqlTypeMapper.java | 94.44% <94.44%> (ø)
...teleport/v2/templates/transforms/SchemaLoader.java | 82.35% <82.35%> (ø)
...leport/v2/templates/spanner/SpannerTypeMapper.java | 89.47% <89.47%> (ø)
... and 3 more

... and 848 files with indirect coverage changes

@shreyakhajanchi shreyakhajanchi marked this pull request as ready for review March 13, 2026 07:17
Labels

addition (New feature or request), size/XXL
