

Awesome-Text2GQL


Introduction

Awesome-Text2GQL is an AI-assisted framework for Text2GQL dataset construction. The framework supports translation, generalization, and generation of Text2GQL corpora and the corresponding database instances. With its assistance, users can construct a high-quality Text2GQL dataset for any graph query language (ISO-GQL, Cypher, Gremlin, SQL/PGQ, etc.) at a significantly lower cost than pure human annotation.


Key Features

  • Translation

    • Question translation from graph queries to questions in different natural languages, to accelerate question annotation.

    • Query translation between different graph query languages, to gather corpora from all existing graph query languages.

  • Generalization

    • Question generalization for producing natural language questions in different language styles but with the same semantic meaning, to increase the diversity of the corpus.

    • Query generalization for instantiating similar query patterns on different graph schemas, to increase the diversity of the corpus.

  • Generation

    • Schema generation for automatically generating complex graph database schemas from natural language domain descriptions.

    • Data generation for generating realistic simulation data that follows statistical distributions (power-law, long-tail, normal distribution, etc.).

    • Corpus generation for producing high-quality Question-Query pairs with complex queries (multi-hop, nested queries) through iterative enhancement strategies.

Demo: TuGraph-DB ChatBot

TuGraph-DB ChatBot is a demo that showcases an agent trained on a corpus generated by Awesome-Text2GQL. It interacts with TuGraph-DB, taking as input a description in Chinese of how you want to operate the database, such as querying or creating data. The ChatBot will then also execute the resulting Cypher for you.


Refer to demo for more information.

Quick Start

Environment Preparation

It is recommended to use poetry to manage your Python environment, though other tools may also work.

# install poetry and poetry shell
pip install poetry
pip install poetry-plugin-shell

# install environment
poetry install

# activate virtual environment
poetry shell

LLM Setup

Remote LLM Setup

To run Awesome-Text2GQL functions on remote LLMs, apply for an API-KEY before you start.

  1. Apply for an API-KEY

Awesome-Text2GQL's remote LLM client is based on the Qwen Inference Service provided by Aliyun; you can refer to Aliyun to apply for the API-KEY.

  2. Set the API-KEY via environment variables (recommended)
# replace YOUR_DASHSCOPE_API_KEY with your API-KEY
echo "export DASHSCOPE_API_KEY='YOUR_DASHSCOPE_API_KEY'" >> ~/.bashrc
source ~/.bashrc
echo $DASHSCOPE_API_KEY
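In application code, the key can then be read back from the environment before constructing a client. A minimal sketch (the helper name here is illustrative, not part of the framework's API):

```python
import os

def get_dashscope_api_key() -> str:
    """Read the DashScope API key from the environment, failing loudly if unset."""
    key = os.environ.get("DASHSCOPE_API_KEY")
    if not key:
        raise RuntimeError(
            "DASHSCOPE_API_KEY is not set; export it in your shell first."
        )
    return key
```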

Local LLM Setup

Awesome-Text2GQL's local LLM client is based on the transformers library. Use a model id from the HuggingFace model hub if you can access HuggingFace, or use the local file path where the LLM model is stored. Pass model_path when initializing the LLM client if you want to use a local LLM instead of a remote one.

Run Example

Generate Schema

python ./examples/generate_schema.py

This example shows how to use the Awesome-Text2GQL framework to generate a graph schema for use in corpus construction.

Generate Data

python ./examples/generate_data.py

This example shows how to use the Awesome-Text2GQL framework to generate data instances based on a given graph schema file.

Generate Corpus

This example shows how to use the Awesome-Text2GQL framework to generate a corpus. Before running it, ensure you have a running database instance and update the database connection and output configuration in examples/generate_corpus.py. Alternatively, you can import our provided test dataset into TuGraph. Download link: test dataset.

Example TuGraph import command:

# lgraph_import -u admin -p 73@TuGraph -c import_config.json --dir /var/lib/lgraph/data/ --overwrite true -v 3

Finally, run:

python ./examples/generate_corpus.py

When the script finishes, the generated corpus will be saved to the output directory specified in the script.

Cypher2GQL

python ./examples/cypher2gql.py

This example shows how to use the Awesome-Text2GQL framework to translate Neo4j's Text2Cypher corpus into a Text2GQL corpus, with queries aligned to ISO-GQL grammar.

Generalize Cypher Corpus

python ./examples/generalize_corpus_cypher.py

This example shows how to use the Awesome-Text2GQL framework to generalize from one Cypher corpus pair to construct a Text2GQL corpus dataset.

Generalize GQL Corpus

python ./examples/generalize_corpus_gql.py

This example shows how to use the Awesome-Text2GQL framework to generalize from one ISO-GQL corpus pair to construct a Text2GQL corpus dataset.

English to Chinese

python ./examples/english_to_chinese.py

This example shows how to use the Awesome-Text2GQL framework to translate an English question into a Chinese question with the same semantic meaning.

AST Printing

python ./examples/print_ast.py

This example shows how to use the Awesome-Text2GQL framework to print the AST of a query. Visualizing the AST is helpful for IR and other AST-related development.

Modules

Awesome-Text2GQL uses a Translator, a Generalizer, and a Generator to assist the entire process of Text2GQL dataset construction.

Translator

Translator supports multilingual translation for questions and translation between graph query languages for queries. Users can use the translator to translate existing corpora from different natural languages and graph query languages into the target natural language and graph query language.

Question Translator

The question translator can currently translate a query into a natural language question (English), given a similar query template and its corresponding question template. In the future, we will support multilingual translation of natural language questions.

from app.core.llm.llm_client import LlmClient
from app.core.translator.question_translator import QuestionTranslator

llm_client = LlmClient(model="qwen-plus-0723")
query_template="MATCH (n:Person)-[:HAS_CHILD*1]->(n) WHERE n.name = 'Vanessa Redgrave' RETURN n"
question_template="Who are Roy Redgrave's second generations?"
query_list = [
    "MATCH (n1:person)-[e1:acted_in]->{1,1}(n2:movie) WHERE n1.id = 'Neo' RETURN n2.`duration` AS `DURATION`",
    "MATCH (n1:person)-[e1:directed]->{1,1}(n2:movie) WHERE n1.name = 'MacQUeen' RETURN n2.id AS ID",
    "MATCH (n1:person)-[e1:produce]->{1,1}(n2:movie) WHERE n1.name = 'Hans' RETURN n2.rated AS RATED"
    ]

# translate query into question
question_translator = QuestionTranslator(llm_client=llm_client, chunk_size=5)
question_list = question_translator.translate(
    query_template=query_template,
    question_template=question_template,
    query_list = query_list
)

Query Translator


The query translator translates queries from one query language into another, e.g. Cypher to GQL. To achieve this, Awesome-Text2GQL designed and implemented a set of intermediate representations for commonly used graph query languages (ISO-GQL, Cypher, Gremlin, SQL/PGQ, etc.) and their dialects. With the AST visitor implementations, different graph query languages can be translated into the intermediate representation; with the query translator implementations, the intermediate representation can be translated into different graph query languages.

from app.impl.iso_gql.translator.iso_gql_query_translator import IsoGqlQueryTranslator as GQLTranslator
from app.impl.tugraph_cypher.ast_visitor.tugraph_cypher_query_visitor import TugraphCypherAstVisitor

query_visitor = TugraphCypherAstVisitor()
gql_translator = GQLTranslator()
cypher = "MATCH (n:Person)-[:HAS_CHILD*1]->(n) WHERE n.name = 'Vanessa Redgrave' RETURN n"

# translate cypher to gql
success, query_pattern = query_visitor.get_query_pattern(cypher)
if success:
    gql = gql_translator.translate(query_pattern)

Generalizer

Generalizer supports corpus generalization based on a given query template and question template. Users can use the generalizer to construct a large-scale corpus dataset across multiple database instances from a limited number of existing corpus templates.

Question Generalizer

The question generalizer generalizes a given natural language question into similar questions with different language styles, while semantic similarity is ensured against the given corresponding query. This generalization increases the linguistic diversity of the corpus to simulate real-world Text2GQL scenarios.

from app.core.generalizer.question_generalizer import QuestionGeneralizer
from app.core.llm.llm_client import LlmClient

llm_client = LlmClient(model="qwen-plus-0723")
question_generalizer = QuestionGeneralizer(llm_client)
corpus_pair_list = [
    [
        "MATCH (n:Person)-[:HAS_CHILD*1]->(n) WHERE n.name = 'Vanessa Redgrave' RETURN n",
        "Who are Roy Redgrave's second generations?"
    ]
]

# generalize question
generalized_corpus_pair_list = []
for corpus_pair in corpus_pair_list:
    query = corpus_pair[0]
    question = corpus_pair[1]
    generalized_question_list = question_generalizer.generalize(
        query=query,
        question=question
    )
    for generalized_question in generalized_question_list:
        generalized_corpus_pair_list.append((query, generalized_question))
    generalized_corpus_pair_list.append((query, question))

Query Generalizer


The query generalizer generalizes a given query into queries with a similar query pattern on a given schema. Using the intermediate representation for graph query languages, Awesome-Text2GQL translates a query into an intermediate query pattern, from which similar patterns can be constructed with different variables on different schemas. This generalization migrates existing query patterns onto new database instances efficiently.

from app.core.generalizer.query_generalizer import QueryGeneralizer
from app.impl.tugraph_cypher.ast_visitor.tugraph_cypher_query_visitor import TugraphCypherAstVisitor

db_id = "movie"
instance_path = "../app/impl/tugraph_cypher/generalizer/base/db_instance/movie"
query_visitor = TugraphCypherAstVisitor()
query_generalizer = QueryGeneralizer(db_id, instance_path)
query_template="MATCH (n {name: 'Carrie-Anne Moss'}) RETURN n.born AS born"

# generalize cypher query
query_list = query_generalizer.generalize_from_cypher(query_template=query_template)

from app.core.generalizer.query_generalizer import QueryGeneralizer
from app.impl.iso_gql.translator.iso_gql_query_translator import IsoGqlQueryTranslator as GQLTranslator
from app.impl.tugraph_cypher.ast_visitor.tugraph_cypher_query_visitor import TugraphCypherAstVisitor

db_id = "movie"
instance_path = "../app/impl/tugraph_cypher/generalizer/base/db_instance/movie"
query_generalizer = QueryGeneralizer(db_id, instance_path)
query_visitor = TugraphCypherAstVisitor()
gql_translator = GQLTranslator()
query_template="MATCH (n:Person)-[:HAS_CHILD*1]->(n) WHERE n.name = 'Vanessa Redgrave' RETURN n"

# generalize gql query
query_list = []
success, query_pattern = query_visitor.get_query_pattern(query_template)
if success:
    query_pattern_list = query_generalizer.generalize(query_pattern=query_pattern)
    for query_pattern in query_pattern_list:
        query = gql_translator.translate(query_pattern)
        query_list.append(query)

Generator

The framework implements a full-chain automated generation pipeline: Schema Generation → Graph Database Instance Construction → Complex Corpus Generation. This transforms the traditional "manual template design & semi-automatic corpus generation" model into a new paradigm of fully automated generation.

Architecture Diagram

The system consists of four highly cohesive, loosely coupled core modules:

  • Schema Generator: understands natural-language domain and subdomain descriptions and generates the corresponding graph schemas (nodes, edges, properties) for the Data Generator.
  • Data Generator: generates simulated node and edge data based on the schema, supporting large-scale construction of complex relationship networks.
  • Corpus Generator: uses an LLM to generate high-quality Question-Query pairs containing multi-hop and nested queries.
  • Validator: checks the correctness of the generated schema and the grammatical/semantic correctness of the generated queries.

Schema Generator

The Schema Generator module automatically creates complex graph database schemas from natural language domain descriptions. It introduces Domain and Subdomain concepts to enhance semantic hierarchy and implements a quantifiable graph structure generation strategy based on a 5-level complexity model.

Key Features:

  • Generates SchemaGraph format schemas that can be converted into TuGraph modeling files or adapted for other database engines
  • Supports quantitative control over schema complexity through predefined node and relationship ranges
  • Uses LLM to generate Schema Descriptions and corresponding Schema JSON
  • Ensures polymorphism through SchemaGraph class for future database adapters
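As a rough illustration of what a generated schema might look like, here is a minimal, hypothetical SchemaGraph-style structure with a naive validity check. The field names and layout are assumptions for illustration; the framework's actual format may differ:

```python
# Illustrative only: a tiny schema in the spirit of the SchemaGraph format.
movie_schema = {
    "nodes": [
        {"label": "Person", "properties": [{"name": "name", "type": "STRING"}]},
        {"label": "Movie", "properties": [{"name": "title", "type": "STRING"},
                                          {"name": "year", "type": "INT"}]},
    ],
    "edges": [
        {"label": "ACTED_IN", "src": "Person", "dst": "Movie", "properties": []},
    ],
}

def check_schema(schema: dict) -> bool:
    """Naive validity check: every edge endpoint must be a declared node label."""
    labels = {n["label"] for n in schema["nodes"]}
    return all(e["src"] in labels and e["dst"] in labels for e in schema["edges"])

print(check_schema(movie_schema))  # True
```

A Validator-style check like this is what catches LLM output that references undeclared labels before the schema reaches the Data Generator.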

Schema Generator Architecture

Data Generator

The Data Generator module creates realistic simulation data based on generated schemas, following real-world statistical distributions like power-law, long-tail, and normal distributions.

Key Features:

  • Generates node and edge CSV files with property constraints
  • Creates TuGraph-compatible import_config.json for batch importing via lgraph_import
  • Handles common import errors (type parsing failures, missing delimiters, null value handling)
  • Generated 488 CSV files containing ~3,716,332 rows of data during development
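The distribution-driven generation step can be sketched as follows; this is a simplified stand-in using Python's standard library, not the framework's actual generator:

```python
import csv
import io
import random

def sample_powerlaw_degrees(n_nodes: int, alpha: float = 2.5, seed: int = 42) -> list[int]:
    """Draw node degrees from a power-law-like distribution via the Pareto trick."""
    rng = random.Random(seed)
    # paretovariate(a) samples X >= 1 with P(X > x) = x^-a; truncate to integers.
    return [max(1, int(rng.paretovariate(alpha - 1))) for _ in range(n_nodes)]

def degrees_to_csv(degrees: list[int]) -> str:
    """Write (node_id, degree) rows, mimicking the node-CSV step before lgraph_import."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id", "degree"])
    for node_id, degree in enumerate(degrees):
        writer.writerow([node_id, degree])
    return buf.getvalue()
```

With a power-law exponent around 2.5, most nodes end up with degree 1 while a few hubs dominate, mirroring real-world graph skew.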

Data Generator Architecture

Corpus Generator

The Corpus Generator produces high-quality Question-Query pairs through a hierarchical generation strategy that balances complexity and diversity.

Key Features:

  • Layered generation strategy: first generates simple seed corpus, then complex corpus based on seeds
  • Real validation: all queries are executed and verified on actual graph databases
  • Context-aware generation: uses query execution results as context for LLM enhancement
  • Iterative enhancement: controllable iteration rounds for gradually increasing complexity
  • Generated 800+ high-quality corpus pairs across multiple domains
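The layered strategy above can be sketched with stubs standing in for the LLM client and the database executor (both stub functions here are hypothetical placeholders, not the framework's API):

```python
def enhance_corpus(seeds, llm, execute, rounds=2):
    """Layered generation sketch: start from simple seed pairs, then ask an
    LLM for more complex variants, keeping only those that actually execute."""
    corpus = list(seeds)
    frontier = list(seeds)
    for _ in range(rounds):
        next_frontier = []
        for question, query in frontier:
            candidate = llm(question, query)            # (question, query) pair
            if candidate and execute(candidate[1]):     # real validation step
                next_frontier.append(candidate)
        corpus.extend(next_frontier)
        frontier = next_frontier                        # iterate on the enhanced pairs
    return corpus

# Stubs standing in for a real LLM client and graph database:
def fake_llm(question, query):
    # Pretend the LLM adds one more hop to the query.
    return (question + " (and their movies)",
            query + " MATCH (n)-[:acted_in]->(m) RETURN m")

def fake_execute(query):
    # Pretend every MATCH query runs successfully on the database.
    return query.startswith("MATCH")
```

Because each round enhances the previous round's output, complexity grows gradually and every retained query has passed execution at least once.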

Corpus Generator Architecture

Experimental Results

We used the framework’s automated generation pipeline to construct multi-dimensional test datasets Geography_World, Movie_Movielens, Healthcare_Donor and training datasets Banking_Financial, Game Olympics, Retail_RetailWorld. They can be downloaded here: test set, training set.

Then, we conducted zero-shot experiments using three test database instances and their generated corpora. We evaluated Qwen-Plus, Qwen-8B Base, and Qwen-8B fine-tuned with LoRA on different datasets using four metrics: Grammar, Similarity, Google BLEU, and EA.

Test Set Schema Complexity Grammar Similarity Google BLEU EA
geography 5 83.7 85.8 64.3 16.28
geography_seeds 5 100 84.1 58.7 14.29
healthcare_donor 3 84.8 87.8 63.2 27.85
healthcare_donor_seeds 3 95.2 86.7 53.6 42.86
movie 3 86.5 88.3 63.4 14.86
movie_seeds 3 96.4 86.8 50.8 50.00

Table 1. Qwen-Plus Experiment Results on LLM-Synthesis Dataset

Test Set Schema Complexity Grammar Similarity Google BLEU EA
geography 5 0.884 0.856 0.564 26.74%
geography_seeds 5 0.929 0.861 0.551 21.43%
healthcare_donor 3 0.949 0.885 0.652 29.11%
healthcare_donor_seeds 3 1.000 0.861 0.510 42.86%
movie 3 0.946 0.877 0.598 6.76%
movie_seeds 3 0.893 0.864 0.512 39.29%

Table 2. Qwen-8B Base Experiment Results on LLM-Synthesis Dataset

Test Set Schema Complexity Grammar Similarity Google BLEU EA
geography 5 0.942 0.878 0.683 20.93%
geography_seeds 5 1.000 0.899 0.747 71.43%
healthcare_donor 3 0.987 0.870 0.578 27.85%
healthcare_donor_seeds 3 1.000 0.901 0.641 71.43%
movie 3 1.000 0.893 0.600 28.38%
movie_seeds 3 1.000 0.927 0.633 64.29%

Table 3. Qwen-8B Base fine-tuned with LoRA Experiment Results on LLM-Synthesis Dataset

Testing on LLM-synthesized datasets shows that the framework successfully generates complex corpora that challenge current LLMs, with execution accuracy below 30% on complex iterated corpora. Fine-tuning experiments demonstrate that models trained on framework-generated data show improved performance.

Development Guide

To make Awesome-Text2GQL support corpus construction for more types of graph query languages, we welcome contributions to the implementation of AST visitors, query translators, and schema parsers for new graph query languages. If you find that the current intermediate representation is not compatible enough with a new graph query language, we also welcome contributions to the intermediate representation itself.

Introduction to Intermediate Representation

The clause class is the core of Awesome-Text2GQL's intermediate representation for graph query languages. Currently the clause class has match clause, return clause, where clause, and with clause as subclasses. The design of the subclasses might be updated in the future for compatibility with more graph query languages.

  • Match Clause: the intermediate representation for pattern matching.

  • Return Clause: the intermediate representation for item return.

  • Where Clause: the intermediate representation for condition expression.

  • With Clause: the intermediate representation for variable control.
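To make the idea concrete, here is a toy clause hierarchy and renderer in the spirit of the intermediate representation; the framework's real clause classes carry much more structure, and these names are illustrative only:

```python
from dataclasses import dataclass

@dataclass
class MatchClause:
    pattern: str        # e.g. "(n:Person)-[:HAS_CHILD]->(m)"

@dataclass
class WhereClause:
    condition: str      # e.g. "n.name = 'Vanessa Redgrave'"

@dataclass
class ReturnClause:
    items: list         # e.g. ["n", "m.title"]

def render_cypher(clauses) -> str:
    """Lower a clause list back to query text; each target language
    would get its own dialect-specific renderer like this one."""
    parts = []
    for clause in clauses:
        if isinstance(clause, MatchClause):
            parts.append(f"MATCH {clause.pattern}")
        elif isinstance(clause, WhereClause):
            parts.append(f"WHERE {clause.condition}")
        elif isinstance(clause, ReturnClause):
            parts.append("RETURN " + ", ".join(clause.items))
    return " ".join(parts)
```

Because the clause list is language-neutral, the same list can be handed to a Cypher renderer or a GQL renderer without re-parsing the original query.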

Graph Query Language Implementation Guideline

Implement AST Visitor

The AST visitor class is a virtual class and should be implemented for each graph query language, e.g. a Cypher AST visitor or a GQL AST visitor. The implementation for each graph query language should be able to parse a given query, visit the abstract syntax tree, and then return the graph pattern (a list of clauses) of the given query as the intermediate representation for further translation or generalization. See app/impl/tugraph_cypher/ast_visitor/tugraph_cypher_ast_visitor.py as an example.
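A toy stand-in for that interface, using keyword splitting instead of a real parse tree (a real implementation walks the generated AST, and this naive version breaks on keywords inside string literals):

```python
import re

def get_query_pattern(query: str):
    """Split a query on top-level clause keywords and return
    (success, [(keyword, body), ...]) as a stand-in for the clause list."""
    tokens = re.split(r"\b(MATCH|WHERE|RETURN|WITH)\b", query.strip())
    # tokens[0] is the text before the first keyword; a valid query has none.
    if len(tokens) < 3 or tokens[0].strip():
        return False, []
    clauses = [(tokens[i], tokens[i + 1].strip())
               for i in range(1, len(tokens) - 1, 2)]
    return True, clauses
```

The (success, pattern) return shape mirrors the framework's own get_query_pattern call shown in the Query Translator example above.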

Implement Query Translator

The query translator class is a virtual class and should be implemented for each graph query language, e.g. a Cypher query translator or a GQL query translator. The implementation for each graph query language should provide a translate function that turns a list of clauses (the intermediate representation) into an actual query aligned with the grammar of the corresponding language, and a grammar check function that checks whether a query is grammatically correct. See app/impl/iso_gql/translator/iso_gql_query_translator.py as an example.
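A minimal sketch of the two required functions, operating on simple (keyword, body) tuples instead of real clause objects; a real translator maps IR nodes onto each language's grammar rather than joining strings:

```python
def translate(clauses) -> str:
    """Render (keyword, body) clause tuples as query text."""
    return " ".join(f"{keyword} {body}" for keyword, body in clauses)

def grammar_check(query: str) -> bool:
    """Naive placeholder check: the query starts with MATCH and
    single quotes are balanced. A real check runs the language's parser."""
    return query.startswith("MATCH") and query.count("'") % 2 == 0
```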

Implement Schema Parser

The schema parser class is a virtual class and should be implemented for each DBMS, e.g. a Neo4j schema parser or a TuGraph schema parser. The implementation for each DBMS should be able to parse the corresponding schema file, whether it is a set of queries or a JSON file, and then return an in-memory schema graph for query generalization. See app/impl/tugraph_cypher/schema/schema_parser.py as an example.
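A toy version of the idea for the JSON-file case, building an in-memory lookup from node labels to the edge labels that touch them (the field names are illustrative, not a real TuGraph or Neo4j schema format):

```python
import json

def parse_schema(schema_json: str) -> dict:
    """Parse a JSON schema string into an in-memory map:
    node label -> list of incident edge labels."""
    schema = json.loads(schema_json)
    graph = {node["label"]: [] for node in schema["nodes"]}
    for edge in schema["edges"]:
        graph[edge["src"]].append(edge["label"])
        graph[edge["dst"]].append(edge["label"])
    return graph
```

The query generalizer can then consult this map to pick only edge labels that are actually valid between two node labels on the target schema.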

Future Plan

Awesome-Text2GQL will continue to enhance the quality and diversity of generated corpora and improve the framework's usability and performance.

Contribution

Do the following steps before submitting your code.

Code Formatting

poetry run ruff format .

Code Checking

poetry run ruff check ./app ./examples --fix

If all checks pass, you can submit your code.

Submitting Code

Create a pull request, link it to a related issue, then wait for the project maintainers to review your changes and provide feedback. If your pull request is approved by a maintainer, we will merge it. For other details, refer to our contributing document.

Contributors Wall

Attention

This project is still under development, suggestions, issues or pull requests are welcome.

About

Fine-Tuning Dataset Auto-Generation for Graph Query Languages.
