Awesome-Text2GQL is an AI-assisted framework for Text2GQL dataset construction. The framework supports translation, generalization, and generation of Text2GQL corpora and the corresponding database instances. With its assistance, users can construct a high-quality Text2GQL dataset in any graph query language (ISO-GQL, Cypher, Gremlin, SQL/PGQ, etc.) at a significantly lower cost than purely manual annotation.
- Translation
  - Question translation from graph queries to questions in different natural languages to accelerate question annotation.
  - Query translation between different graph query languages to gather corpora from all existing graph query languages.
- Generalization
  - Question generalization to produce natural language questions in different language styles but with the same semantic meaning, increasing the diversity of the corpus.
  - Query generalization to instantiate similar query patterns on different graph schemas, increasing the diversity of the corpus.
- Generation
  - Schema generation to automatically generate complex graph database schemas from natural language domain descriptions.
  - Data generation to generate realistic simulation data that follows statistical distributions (power-law, long-tail, normal, etc.).
  - Corpus generation to produce high-quality Question-Query pairs with complex queries (multi-hop, nested) through iterative enhancement strategies.
TuGraph-DB ChatBot is a demo that demonstrates the effect of an agent trained on the corpus generated by Awesome-Text2GQL. It interacts with TuGraph-DB, taking a Chinese description of how you want to operate the database (such as querying or creating data) as input, and the ChatBot will also help you execute the Cypher.
Refer to demo for more information.
It is recommended to use poetry to manage your Python environment, though other tools may also work.
# install poetry and poetry shell
pip install poetry
pip install poetry-plugin-shell
# install environment
poetry install
# activate virtual environment
poetry shell
To run Awesome-Text2GQL functions based on remote LLMs, apply for an API-KEY before you start.
- Apply for an API-KEY
Awesome-Text2GQL's remote LLM client is based on the Qwen Inference Service provided by Aliyun; refer to Aliyun to apply for the API-KEY.
- Set API-KEY via environment variables (recommended)
# replace YOUR_DASHSCOPE_API_KEY with your API-KEY
echo "export DASHSCOPE_API_KEY='YOUR_DASHSCOPE_API_KEY'" >> ~/.bashrc
source ~/.bashrc
echo $DASHSCOPE_API_KEY
Awesome-Text2GQL's local LLM client is based on the transformers library. Use a model id from the HuggingFace model hub if you can access HuggingFace, or use the local file path where the LLM model is stored. Add model_path when initializing the LLM client if you want to use a local LLM instead of a remote LLM.
python ./examples/generate_schema.py
This example shows how to use Awesome-Text2GQL Framework to generate a graph schema for use in corpus construction.
python ./examples/generate_data.py
This example shows how to use Awesome-Text2GQL Framework to generate data instances based on a given graph schema file.
This example shows how to use the Awesome-Text2GQL framework to generate a corpus. Before running it, ensure you have a running database instance and update the database connection and output configuration in examples/generate_corpus.py. Alternatively, you can import our provided test dataset into TuGraph. Download link: test dataset.
Example TuGraph import command:
# lgraph_import -u admin -p 73@TuGraph -c import_config.json --dir /var/lib/lgraph/data/ --overwrite true -v 3
After that, run:
python ./examples/generate_corpus.py
When the script finishes, the generated corpus will be saved to the output directory specified in the script.
python ./examples/cypher2gql.py
This example shows how to use the Awesome-Text2GQL framework to translate neo4j's Text2Cypher corpus into a Text2GQL corpus with queries aligned to the ISO-GQL grammar.
python ./examples/generalize_corpus_cypher.py
This example shows how to use the Awesome-Text2GQL framework to generalize from one Cypher corpus pair to construct a Text2GQL corpus dataset.
python ./examples/generalize_corpus_gql.py
This example shows how to use the Awesome-Text2GQL framework to generalize from one ISO-GQL corpus pair to construct a Text2GQL corpus dataset.
python ./examples/english_to_chinese.py
This example shows how to use the Awesome-Text2GQL framework to translate an English question into a Chinese question with the same semantic meaning.
python ./examples/print_ast.py
This example shows how to use the Awesome-Text2GQL framework to print the AST of a query. Visualizing the AST is helpful for IR and other AST-related development.
Awesome-Text2GQL uses the Translator, Generalizer, and Generator to assist the entire process of Text2GQL dataset construction.
The Translator supports multilingual translation for question translation and multi-graph-query-language translation for query translation. Users can use the translator to translate existing corpora in different natural languages and graph query languages into the target natural language and graph query language.
The question translator currently has the ability to translate a query into a natural language question (English) given a similar query template and its corresponding question template. In the future, we will support multilingual translation of natural language questions.
from app.core.llm.llm_client import LlmClient
from app.core.translator.question_translator import QuestionTranslator
llm_client = LlmClient(model="qwen-plus-0723")
query_template = "MATCH (n:Person)-[:HAS_CHILD*1]->(n) WHERE n.name = 'Vanessa Redgrave' RETURN n"
question_template = "Who are Roy Redgrave's second generations?"
query_list = [
    "MATCH (n1:person)-[e1:acted_in]->{1,1}(n2:movie) WHERE n1.id = 'Neo' RETURN n2.`duration` AS `DURATION`",
    "MATCH (n1:person)-[e1:directed]->{1,1}(n2:movie) WHERE n1.name = 'MacQUeen' RETURN n2.id AS ID",
    "MATCH (n1:person)-[e1:produce]->{1,1}(n2:movie) WHERE n1.name = 'Hans' RETURN n2.rated AS RATED"
]
# translate query into question
question_translator = QuestionTranslator(llm_client=llm_client, chunk_size=5)
question_list = question_translator.translate(
    query_template=query_template,
    question_template=question_template,
    query_list=query_list
)

The query translator has the ability to translate queries in one graph query language into another, like Cypher to GQL. To achieve this, Awesome-Text2GQL designed and implemented a set of intermediate representations for commonly used graph query languages (ISO-GQL, Cypher, Gremlin, SQL/PGQ, etc.) and their dialects. With the AST visitor implementations, different graph query languages can be translated into the intermediate representation; with the query translator implementations, the intermediate representation can be translated into different graph query languages.
from app.impl.iso_gql.translator.iso_gql_query_translator import IsoGqlQueryTranslator as GQLTranslator
from app.impl.tugraph_cypher.ast_visitor.tugraph_cypher_query_visitor import TugraphCypherAstVisitor
query_visitor = TugraphCypherAstVisitor()
gql_translator = GQLTranslator()
cypher = "MATCH (n:Person)-[:HAS_CHILD*1]->(n) WHERE n.name = 'Vanessa Redgrave' RETURN n"
# translate cypher to gql
success, query_pattern = query_visitor.get_query_pattern(cypher)
if success:
    gql = gql_translator.translate(query_pattern)

The Generalizer supports corpus generalization based on a given query template and question template. Users can use the generalizer to construct a large-scale corpus dataset across multiple database instances from a limited number of existing corpus templates.
The question generalizer has the ability to generalize a given natural language question into similar questions with different language styles, while semantic similarity is ensured with the given corresponding query. This generalization aims to increase the linguistic diversity of the corpus to simulate real-world Text2GQL scenarios.
from app.core.generalizer.question_generalizer import QuestionGeneralizer
from app.core.llm.llm_client import LlmClient
llm_client = LlmClient(model="qwen-plus-0723")
question_generalizer = QuestionGeneralizer(llm_client)
corpus_pair_list = [
    [
        "MATCH (n:Person)-[:HAS_CHILD*1]->(n) WHERE n.name = 'Vanessa Redgrave' RETURN n",
        "Who are Roy Redgrave's second generations?"
    ]
]
# generalize question
generalized_corpus_pair_list = []
for corpus_pair in corpus_pair_list:
    query = corpus_pair[0]
    question = corpus_pair[1]
    generalized_question_list = question_generalizer.generalize(
        query=query,
        question=question
    )
    for generalized_question in generalized_question_list:
        generalized_corpus_pair_list.append((query, generalized_question))
    generalized_corpus_pair_list.append((query, question))

The query generalizer has the ability to generalize a given query into queries with a similar query pattern on the given schema. With the intermediate representation for graph query languages, Awesome-Text2GQL can translate a query into an intermediate query pattern, and similar query patterns can be constructed with different variables on different schemas. This generalization aims to migrate existing query patterns onto new database instances efficiently.
from app.core.generalizer.query_generalizer import QueryGeneralizer
from app.impl.tugraph_cypher.ast_visitor.tugraph_cypher_query_visitor import TugraphCypherAstVisitor
db_id = "movie"
instance_path = "../app/impl/tugraph_cypher/generalizer/base/db_instance/movie"
query_visitor = TugraphCypherAstVisitor()
query_generalizer = QueryGeneralizer(db_id, instance_path)
query_template = "MATCH (n {name: 'Carrie-Anne Moss'}) RETURN n.born AS born"
# generalize cypher query
query_list = query_generalizer.generalize_from_cypher(query_template=query_template)

from app.core.generalizer.query_generalizer import QueryGeneralizer
from app.impl.iso_gql.translator.iso_gql_query_translator import IsoGqlQueryTranslator as GQLTranslator
from app.impl.tugraph_cypher.ast_visitor.tugraph_cypher_query_visitor import TugraphCypherAstVisitor
db_id = "movie"
instance_path = "../app/impl/tugraph_cypher/generalizer/base/db_instance/movie"
query_generalizer = QueryGeneralizer(db_id, instance_path)
query_visitor = TugraphCypherAstVisitor()
gql_translator = GQLTranslator()
query_template = "MATCH (n:Person)-[:HAS_CHILD*1]->(n) WHERE n.name = 'Vanessa Redgrave' RETURN n"
# generalize gql query
query_list = []
success, query_pattern = query_visitor.get_query_pattern(query_template)
if success:
    query_pattern_list = query_generalizer.generalize(query_pattern=query_pattern)
    for query_pattern in query_pattern_list:
        query = gql_translator.translate(query_pattern)
        query_list.append(query)

The framework implements a full-chain automated generation pipeline: Schema Generation → Graph Database Instance Construction → Complex Corpus Generation. This transforms the traditional "manual template design & semi-automatic corpus generation" model into a new paradigm of fully automated generation.
The system consists of four highly cohesive, loosely coupled core modules:
| Module | Function Description |
|---|---|
| Schema Generator | Understands natural-language domain and subdomain descriptions and generates corresponding graph schemas (nodes, edges, properties) for the Data Generator |
| Data Generator | Generates simulated node and edge data based on Schema, supporting large-scale complex relationship network construction |
| Corpus Generator | Uses LLM to generate high-quality Question-Query pairs containing multi-hop queries, nested queries |
| Validator | Checks the correctness of generated Schema, and the grammatical/semantic correctness of Query |
The Schema Generator module automatically creates complex graph database schemas from natural language domain descriptions. It introduces Domain and Subdomain concepts to enhance semantic hierarchy and implements a quantifiable graph structure generation strategy based on a 5-level complexity model.
Key Features:
- Generates SchemaGraph format schemas that can be converted into TuGraph modeling files or adapted for other database engines
- Supports quantitative control over schema complexity through predefined node and relationship ranges
- Uses LLM to generate Schema Descriptions and corresponding Schema JSON
- Ensures polymorphism through SchemaGraph class for future database adapters
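As an illustration of what a generated schema might describe, the sketch below models nodes, edges, and properties as a plain Python dict. The field names (`nodes`, `edges`, `label`, `properties`, etc.) are assumptions for illustration only; the real SchemaGraph layout is defined by the framework and may differ.

```python
# Hypothetical illustration of a generated graph schema; the real
# SchemaGraph field names are defined by the framework and may differ.
movie_schema = {
    "nodes": [
        {"label": "person", "properties": [
            {"name": "id", "type": "STRING", "primary": True},
            {"name": "name", "type": "STRING"},
        ]},
        {"label": "movie", "properties": [
            {"name": "id", "type": "STRING", "primary": True},
            {"name": "duration", "type": "INT32"},
        ]},
    ],
    "edges": [
        {"label": "acted_in", "source": "person", "target": "movie"},
        {"label": "directed", "source": "person", "target": "movie"},
    ],
}
```

A structure like this can then be rendered into a TuGraph modeling file or adapted for other database engines.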
The Data Generator module creates realistic simulation data based on generated schemas, following real-world statistical distributions like power-law, long-tail, and normal distributions.
Key Features:
- Generates node and edge CSV files with property constraints
- Creates TuGraph-compatible import_config.json for batch importing via lgraph_import
- Handles common import errors (type parsing failures, missing delimiters, null value handling)
- Generated 488 CSV files containing ~3,716,332 rows of data during development
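The distribution-aware step can be sketched with only the standard library. This is a minimal illustration of power-law degree sampling and CSV edge output, not the framework's actual code; the file name, column layout, and sampling routine are assumptions.

```python
import csv
import random

random.seed(42)

def sample_power_law_degree(alpha: float = 2.5, max_degree: int = 1000) -> int:
    """Sample a node degree from a truncated power law P(k) ~ k^(-alpha), k >= 1,
    via inverse transform sampling."""
    u = random.random()
    k = (1 - u) ** (-1 / (alpha - 1))  # inverse CDF of the Pareto distribution
    return min(int(k), max_degree)

# Generate a small edge list where a few "hub" persons act in many movies.
person_ids = [f"p{i}" for i in range(100)]
movie_ids = [f"m{i}" for i in range(500)]

with open("acted_in.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["src", "dst"])  # column layout is illustrative
    for person in person_ids:
        degree = sample_power_law_degree()
        for movie in random.sample(movie_ids, min(degree, len(movie_ids))):
            writer.writerow([person, movie])
```

Most persons end up with one or two edges while a handful get dozens, which is the long-tailed shape the generated relationship networks are meant to follow.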
The Corpus Generator produces high-quality Question-Query pairs through a hierarchical generation strategy that balances complexity and diversity.
Key Features:
- Layered generation strategy: first generates simple seed corpus, then complex corpus based on seeds
- Real validation: all queries are executed and verified on actual graph databases
- Context-aware generation: uses query execution results as context for LLM enhancement
- Iterative enhancement: controllable iteration rounds for gradually increasing complexity
- Generated 800+ high-quality corpus pairs across multiple domains
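The layered strategy above can be sketched as a loop over enhancement rounds. The `enhance` function below is a purely illustrative stand-in for the LLM-backed step; in the real pipeline each produced query is also executed and validated on the database.

```python
# Illustrative sketch of the layered/iterative corpus generation strategy.
# `enhance` is a stand-in for the real LLM-backed enhancement step.
def enhance(question: str, query: str, round_idx: int) -> tuple[str, str]:
    # In the real pipeline an LLM rewrites the pair into a more complex one
    # (multi-hop, nested), using query execution results as context.
    return (
        f"{question} (enhanced round {round_idx})",
        query + "  // more complex variant",
    )

def generate_corpus(seed_pairs, rounds: int = 2):
    corpus = list(seed_pairs)          # round 0: simple seed corpus
    frontier = list(seed_pairs)
    for r in range(1, rounds + 1):     # controllable iteration rounds
        frontier = [enhance(q, c, r) for (q, c) in frontier]
        # Real pipeline: execute each query on the graph database here
        # and keep only validated pairs.
        corpus.extend(frontier)
    return corpus

seeds = [("Who acted in The Matrix?",
          "MATCH (p:person)-[:acted_in]->(m:movie {title: 'The Matrix'}) RETURN p")]
corpus = generate_corpus(seeds, rounds=2)  # 1 seed + 1 pair per round
```

Each round only builds on the previous frontier, so complexity increases gradually instead of being requested from the LLM in one shot.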
We used the framework’s automated generation pipeline to construct multi-dimensional test datasets Geography_World, Movie_Movielens, Healthcare_Donor and training datasets Banking_Financial, Game Olympics, Retail_RetailWorld. They can be downloaded here: test set, training set.
Then, we conducted zero-shot experiments using three test database instances and their generated corpora. We evaluated Qwen-Plus, Qwen-8B Base, and Qwen-8B fine-tuned with LoRA on different datasets using four metrics: Grammar, Similarity, Google BLEU, and EA.
| Test Set | Schema Complexity | Grammar | Similarity | Google BLEU | EA |
|---|---|---|---|---|---|
| geography | 5 | 83.7 | 85.8 | 64.3 | 16.28 |
| geography_seeds | 5 | 100 | 84.1 | 58.7 | 14.29 |
| healthcare_donor | 3 | 84.8 | 87.8 | 63.2 | 27.85 |
| healthcare_donor_seeds | 3 | 95.2 | 86.7 | 53.6 | 42.86 |
| movie | 3 | 86.5 | 88.3 | 63.4 | 14.86 |
| movie_seeds | 3 | 96.4 | 86.8 | 50.8 | 50.00 |
Table 1. Qwen-Plus Experiment Results on LLM-Synthesis Dataset
| Test Set | Schema Complexity | Grammar | Similarity | Google BLEU | EA |
|---|---|---|---|---|---|
| geography | 5 | 0.884 | 0.856 | 0.564 | 26.74% |
| geography_seeds | 5 | 0.929 | 0.861 | 0.551 | 21.43% |
| healthcare_donor | 3 | 0.949 | 0.885 | 0.652 | 29.11% |
| healthcare_donor_seeds | 3 | 1.000 | 0.861 | 0.510 | 42.86% |
| movie | 3 | 0.946 | 0.877 | 0.598 | 6.76% |
| movie_seeds | 3 | 0.893 | 0.864 | 0.512 | 39.29% |
Table 2. Qwen-8B Base Experiment Results on LLM-Synthesis Dataset
| Test Set | Schema Complexity | Grammar | Similarity | Google BLEU | EA |
|---|---|---|---|---|---|
| geography | 5 | 0.942 | 0.878 | 0.683 | 20.93% |
| geography_seeds | 5 | 1.000 | 0.899 | 0.747 | 71.43% |
| healthcare_donor | 3 | 0.987 | 0.870 | 0.578 | 27.85% |
| healthcare_donor_seeds | 3 | 1.000 | 0.901 | 0.641 | 71.43% |
| movie | 3 | 1.000 | 0.893 | 0.600 | 28.38% |
| movie_seeds | 3 | 1.000 | 0.927 | 0.633 | 64.29% |
Table 3. Qwen-8B Base fine-tuned with LoRA Experiment Results on LLM-Synthesis Dataset
Testing on LLM-synthesized datasets shows that the framework successfully generates complex corpora that challenge current LLMs, with execution accuracy below 30% on complex iterated corpora. Fine-tuning experiments demonstrate that models trained on framework-generated data show improved performance.
To make Awesome-Text2GQL support the corpus construction of more graph query languages, we welcome contributions to the implementations of the AST visitor, query translator, and schema parser for new graph query languages. If you find that the current intermediate representation is not compatible enough with a new graph query language, we also welcome contributions to the intermediate representation.
The clause class is the core of Awesome-Text2GQL's intermediate representation for graph query languages. Currently, the clause class has the match clause, return clause, where clause, and with clause as subclasses. The design of the subclasses might be updated in the future for compatibility with more graph query languages.
- Match Clause: the intermediate representation for pattern matching.
- Return Clause: the intermediate representation for item returning.
- Where Clause: the intermediate representation for condition expressions.
- With Clause: the intermediate representation for variable control.
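A minimal sketch of what such a clause hierarchy can look like; the framework's actual class definitions live under app/core and differ in detail, so the fields below are illustrative assumptions.

```python
# Minimal illustrative sketch of a clause-based intermediate representation;
# the framework's real clause classes differ in detail.
from dataclasses import dataclass, field

@dataclass
class Clause:
    """Base class of the intermediate representation."""

@dataclass
class MatchClause(Clause):
    pattern: str                      # e.g. "(n:Person)-[:HAS_CHILD]->(m)"

@dataclass
class WhereClause(Clause):
    condition: str                    # e.g. "n.name = 'Vanessa Redgrave'"

@dataclass
class ReturnClause(Clause):
    items: list = field(default_factory=list)

@dataclass
class WithClause(Clause):
    items: list = field(default_factory=list)

# A query pattern is simply an ordered list of clauses.
query_pattern = [
    MatchClause("(n:Person)-[:HAS_CHILD]->(n)"),
    WhereClause("n.name = 'Vanessa Redgrave'"),
    ReturnClause(["n"]),
]
```

Keeping the pattern as an ordered clause list is what lets one language's visitor output be fed directly into another language's translator.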
The AST visitor class is a virtual class and should be implemented for each graph query language, like a Cypher AST visitor or a GQL AST visitor. The implementation for each graph query language should be able to parse the given query, visit the abstract syntax tree, and then return the graph pattern (a list of clauses) of the given query as the intermediate representation for further translation or generalization. See app/impl/tugraph_cypher/ast_visitor/tugraph_cypher_ast_visitor.py as an example.
The query translator class is a virtual class and should be implemented for each graph query language, like a Cypher query translator or a GQL query translator. The implementation for each graph query language should implement the translate function to turn a list of clauses (the intermediate representation) into an actual query aligned with the grammar of the corresponding language, and implement the grammar check function to check whether a query is grammatically correct. See app/impl/iso_gql/translator/iso_gql_query_translator.py as an example.
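The two interfaces can be sketched together as abstract base classes with toy implementations; the class names, the `(keyword, body)` clause encoding, and the naive string splitting below are illustrative assumptions, while the real implementations in app/impl walk a full AST.

```python
# Illustrative sketch of the two virtual classes; the real classes in
# app/impl implement far more of the grammar.
from abc import ABC, abstractmethod

class AstVisitor(ABC):
    @abstractmethod
    def get_query_pattern(self, query: str):
        """Parse a query, walk its AST, return (success, clause list)."""

class QueryTranslator(ABC):
    @abstractmethod
    def translate(self, query_pattern) -> str:
        """Render a clause list into a query in the target language."""

    @abstractmethod
    def grammar_check(self, query: str) -> bool:
        """Return True if the query is grammatically correct."""

# Toy implementations: clauses are modeled as plain (keyword, body) pairs.
class ToyCypherVisitor(AstVisitor):
    def get_query_pattern(self, query: str):
        # Real visitors walk an ANTLR-style AST; this toy just splits on RETURN.
        head, sep, tail = query.partition(" RETURN ")
        if not sep:
            return False, []
        return True, [("MATCH", head.removeprefix("MATCH ")), ("RETURN", tail)]

class ToyGqlTranslator(QueryTranslator):
    def translate(self, query_pattern) -> str:
        return " ".join(f"{kw} {body}" for kw, body in query_pattern)

    def grammar_check(self, query: str) -> bool:
        return query.startswith("MATCH ")

visitor = ToyCypherVisitor()
translator = ToyGqlTranslator()
ok, pattern = visitor.get_query_pattern("MATCH (n:Person) RETURN n")
gql = translator.translate(pattern) if ok else ""
```

The round trip mirrors the cypher2gql example above: a visitor produces the intermediate pattern, and any translator implementation can render it back out.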
The schema parser class is a virtual class and should be implemented for each DBMS, like a neo4j schema parser or a TuGraph schema parser. The implementation for each DBMS should be able to parse the corresponding schema file, whether it is a set of queries or a JSON file, and then return an in-memory schema graph for query generalization. See app/impl/tugraph_cypher/schema/schema_parser.py as an example.
Awesome-Text2GQL will continue to enhance the quality and diversity of generated corpora and improve the framework's usability and performance.
Do the following steps before submitting your code.
poetry run ruff format .
poetry run ruff check ./app ./examples --fix
If all checks pass, you can submit your code.
Create a pull request, link it to a related issue, then wait for the project maintainers to review your changes and provide feedback. If your pull request is approved by our maintainers, we will merge it. For other details, refer to our contributing document.
This project is still under development; suggestions, issues, and pull requests are welcome.



