Each database is assigned a separate sub-folder under data/database, which contains:
- ${database_name}.json: the database schema file (see the schema format below);
- ${database_name}.sql: the SQL CREATE statements to build the database (automatically generated from the .json file);
- ${database_name}.duckdb: the DuckDB database file which stores the cell content.
💡 Note: please use the lowercase, underscore-separated (Pythonic) naming convention for your database.
👇🏻 Database folder structure:
data/database
├── ai_research
│ ├── ai_research.duckdb
│ ├── ai_research.json
│ └── ai_research.sql
├── emnlp_papers
│ ├── emnlp_papers.duckdb
│ ├── emnlp_papers.json
│ └── emnlp_papers.sql
└── openreview_papers
    ├── openreview_papers.duckdb
    ├── openreview_papers.json
    └── openreview_papers.sql

Given the input PDF(s) as well as the rules of how to parse them (detailed in Configuration for PDF Parsing below), we can use the script utils.database_utils to parse the PDFs and write multi-view content into the database.
Take the dataset airqa and the corresponding DB ai_research as an example:
- For a single PDF input, the input argument ${pdf_to_parse} below can be one of 4 types (a rough classification sketch is given after the note below):
  - PDF UUID, e.g., 0001a3be-2c07-51c1-81d3-f3a390874e92, in which case the metadata has already been fetched from scholar APIs and the raw PDF file has also been downloaded based on the pdf_path field;
  - Local PDF file path, e.g., ~/Downloads/2210.03629.pdf;
    - 👁️ Attention: if the basename of the local PDF file is a valid UUID string, it degenerates to the PDF UUID case.
  - Web URL of the PDF, e.g., https://aclanthology.org/2024.findings-emnlp.258.pdf;
  - Paper title, e.g., Attention is all you need.
$ python utils/database_utils.py --database ai_research --config_path configs/ai_research_config.json \
    --pdf_path ${pdf_to_parse} --on_conflict ignore

- For a list of PDFs, we can pass the .txt or .json file name as the input:
  - Each line (.txt) or each element (.json) in the ${pdf_file} is a string of any of the 4 types defined above.
  - The argument --on_conflict [raise|ignore|replace] can take three values to handle primary key conflicts when inserting new rows into the database.
$ python utils/database_utils.py --database ai_research --config_path configs/ai_research_config.json \
    --pdf_path ${pdf_file} --on_conflict ignore

🤗 Note: For all input types except PDF UUID, we resort to real-time scholar APIs during PDF parsing to obtain the metadata of the paper (e.g., the published conference and year; see Scholar APIs for available tools). Sadly, the scholar APIs may be unstable and fail to fetch the desired information. Therefore, it is strongly recommended to pre-fetch the metadata of each paper and use PDF UUIDs as input when processing abundant papers from an entire venue.
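The dispatch among these 4 input types happens inside utils.database_utils; as a rough, hypothetical sketch (the helper name and exact rules below are made up for illustration only), the classification could look like this:

import os
import uuid

def classify_pdf_input(pdf_to_parse: str) -> str:
    """Hypothetical helper: roughly classify the raw input into one of the 4 accepted forms."""
    def is_uuid(s: str) -> bool:
        try:
            uuid.UUID(s)
            return True
        except ValueError:
            return False

    if is_uuid(pdf_to_parse):
        return "pdf_uuid"
    if pdf_to_parse.startswith(("http://", "https://")):
        return "web_url"
    path = os.path.expanduser(pdf_to_parse)
    if path.lower().endswith(".pdf") and os.path.exists(path):
        # a local file whose basename is a valid UUID degenerates to the PDF UUID case
        stem = os.path.splitext(os.path.basename(path))[0]
        return "pdf_uuid" if is_uuid(stem) else "local_path"
    return "paper_title"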
- The database schema file ${database_name}.json is structured as follows:
{
"database_name": "which should also be the basename of the schema file",
"description": "A natural language description about this database",
"database_schema": [ // a List of table-columns dicts
{
"table": {
"table_name": "readable_name_for_this_table",
"description": "A natural language description about this table, e.g., what it contains and its functionality."
},
"columns": [
{
"column_name": "readable_name_for_this_column",
// refer to official doc: https://duckdb.org/docs/sql/data_types/overview, e.g., FLOAT, INTEGER[], MAP(INTEGER, VARCHAR)
"column_type": "upper_cased_data_type_string_of_DuckDB",
"description": "A natural language description about this column, e.g., what is it about.",
"encodable": "text" // optional, default to None, which means the column is not encodable. Can take the value from ['text', 'image']
},
{
... // other columns
}
],
"primary_keys": [
"column_name",
"composite_primary_key_column_name" // composite primary keys
],
"foreign_keys": [
// List of triplets, allow composite foreign keys, e.g., ["stuname", "student", "student_name"], [["stuname", "stuclass"], "student", ["student_name", "class_name"]]
[
"column_name_or_list_in_the_current_table",
"referenced_table_name",
"referenced_column_name_or_list"
],
... // other foreign keys
]
},
{
... // other tables
}
]
}

- For available data types, please refer to DuckDB Data Types. Here are some data types you can use in the JSON field column_type:
  - Basic Types:
    - BOOLEAN: boolean value, true/false;
    - INTEGER: int4;
    - FLOAT: float4;
    - DOUBLE: float8; please prefer FLOAT;
    - DATE: date type, containing year, month, and day, usually in the format YYYY-MM-DD, e.g., 2024-08-08;
    - TIME: time type, containing hour, minute, and second, usually in the format HH:MM:SS, e.g., 22:00:00;
    - DATETIME: including both DATE and TIME (alias of TIMESTAMP, either type is fine), usually in the format YYYY-MM-DD HH:MM:SS, e.g., 2024-08-08 22:00:00;
    - TIMESTAMPTZ: timestamp with time zone information, usually in the format YYYY-MM-DD HH:MM:SS±HH:MM, e.g., 2024-08-11 14:30:00+02:00 represents August 11, 2024, at 14:30 in a time zone that is 2 hours ahead of UTC;
    - VARCHAR: actually an alias of STRING, CHAR, and TEXT; please use VARCHAR for consistency;
    - UUID: only used for primary keys; can be converted or interpreted as VARCHAR.
  - Advanced Types:
    - structured data types such as ARRAY, LIST, MAP, STRUCT, and UNION;
    - when specifying these advanced column types, pay attention to the format when filling the column_type field, e.g., INTEGER[3] for ARRAY, INTEGER[] for LIST, and MAP(INTEGER, VARCHAR) for MAP.
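As a quick sanity check of these type strings, the following standalone snippet (using only the duckdb Python package; the table and column names are invented for the example) creates and queries a table with a few of them:

import duckdb

con = duckdb.connect()  # in-memory database; pass a file path for a persistent .duckdb file
con.execute("""
    CREATE TABLE demo_papers (
        paper_id   UUID PRIMARY KEY,
        title      VARCHAR,
        published  DATE,
        scores     FLOAT[],               -- LIST of FLOAT
        authors    MAP(INTEGER, VARCHAR)  -- MAP from author rank to author name
    )
""")
con.execute("""
    INSERT INTO demo_papers VALUES (
        uuid(), 'An Example Paper', DATE '2024-08-08',
        [4.5, 4.0, 5.0], MAP([1, 2], ['First Author', 'Second Author'])
    )
""")
print(con.sql("SELECT title, published, scores, authors FROM demo_papers").fetchall())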
💡 Note: we only support DuckDB currently. Other database types are left as future work.
When we run the PDF parsing script above for the first time (--from_scratch), we automatically get the .sql DDL file under the corresponding database folder. We can use free online tools (e.g., DrawSQL with database type PostgreSQL) to import this DDL file and obtain a visualization graph. An illustration of the ai_research database schema is shown below:
An Illustration of the Database Schema for AI Research Papers
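To make the automatic .json-to-.sql conversion more concrete, here is a stripped-down, hypothetical re-implementation of that step (it ignores foreign keys, descriptions, and other details handled by the real script):

import json

def schema_to_ddl(schema_path: str) -> str:
    """Sketch: turn a ${database_name}.json schema file into CREATE TABLE statements."""
    with open(schema_path, "r", encoding="utf-8") as f:
        schema = json.load(f)
    statements = []
    for entry in schema["database_schema"]:
        table_name = entry["table"]["table_name"]
        lines = [f'    {col["column_name"]} {col["column_type"]}' for col in entry["columns"]]
        primary_keys = entry.get("primary_keys", [])
        if primary_keys:
            lines.append(f'    PRIMARY KEY ({", ".join(primary_keys)})')
        statements.append(f"CREATE TABLE {table_name} (\n" + ",\n".join(lines) + "\n);")
    return "\n\n".join(statements)

print(schema_to_ddl("data/database/ai_research/ai_research.json"))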
To populate the database content given an input PDF, we may utilize various database- or domain-specific functions to extract certain cell values and then aggregate them into the database. We propose a DataPopulation framework to formalize this workflow. The entry point of this class is the function populate:
def populate(self,
input_pdf: str,
config: Dict[str, Any],
write_to_db: bool = True,
write_to_vs: bool = True,
on_conflict: str = 'replace',
verbose: bool = False
) -> None:
""" Given a raw input about the PDF (`input_pdf`), try to parse it according to the rules defined in `config` and insert values into the corresponding database and vectorstore based on the conflicting policy `on_conflict`.
If `write_to_db` is True, execute the INSERT SQL against the database;
If `write_to_vs` is True, also encode the new cell values into the vectorstore.
"""
    pass

We take a small testing database test_domain as an example to demonstrate how to formalize the essential config JSON dict:

- The database schema file is data/database/test_domain/test_domain.json;
- The configuration (config) is configs/test_domain_config.json.
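Programmatically, running the population with this config roughly corresponds to a call like the one below (the import path and the DataPopulation constructor argument are assumptions for illustration; check utils/database_utils.py for the actual interface):

import json
from utils.database_utils import DataPopulation  # assumed import path

with open("configs/test_domain_config.json", "r", encoding="utf-8") as f:
    config = json.load(f)

populator = DataPopulation(database="test_domain")  # assumed constructor argument
populator.populate(
    input_pdf="data/dataset/test_pdf.pdf",
    config=config,
    write_to_db=True,     # execute the INSERT SQL against the DuckDB database
    write_to_vs=True,     # also encode the new cell values into the vectorstore
    on_conflict="replace",
    verbose=True,
)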
This config dict contains three JSON keys, uuid, pipeline, and aggregation, where:
- Field pipeline defines how to get cell values for each column in a function pipeline;
- Field aggregation indicates how to aggregate the output of pipeline functions into row entries for each table;
- Field uuid tells how to get the unique UUID of the input PDF. This UUID will be passed to the vectorstore encoding module.
{
"pipeline": [
{
"function": "get_pdf_page_text",
"args": {
"deps": [
"input_pdf"
],
"kwargs": {
"generate_uuid": true,
"normalize_blank": true
}
}
},
{
"function": "get_text_summary",
"args": {
"deps": [
"get_pdf_page_text"
],
"kwargs": {
"key": "page_contents",
"max_length": 50,
"model": "gpt-4o-mini",
"temperature": 0.7
}
}
}
],
"aggregation": [
{
"function": "aggregate_test_domain_table_pdf_meta",
"table": "pdf_meta",
"columns": [
"pdf_id",
"pdf_name",
"pdf_path"
],
"args": {
"deps": [
"get_pdf_page_text"
],
"kwargs": {}
}
},
{
"function": "aggregate_test_domain_table_pdf_pages",
"table": "pdf_pages",
"args": {
"deps": [
"get_pdf_page_text",
"get_text_summary"
],
"kwargs": {}
}
}
],
"uuid": {
"function": "get_pdf_page_text",
"field": "pdf_id"
}
}

- Parse and Extract Cell Values: In the first function dict of the pipeline field above,
{
"function": "get_pdf_page_text",
"args": {
"deps": [
"input_pdf"
],
"kwargs": {
"generate_uuid": true,
"normalize_blank": true
}
}
}

Each pipeline dict contains three fields:
- Field function: str, required. It denotes the pipeline function name in the module utils.functions;
- Field args -> deps: List[str], optional. It denotes the input positional parameters of the pipeline function. For example, deps = ["input_pdf"] means we use exactly the input parameter input_pdf of the function populate as the first positional argument for the function get_pdf_page_text. As for the second pipeline function get_text_summary, deps = ["get_pdf_page_text"] means it takes the output of the first function get_pdf_page_text as its first positional input argument (a simplified sketch of this chaining mechanism is given right before the demo script below);
- Field args -> kwargs: Dict[str, Any], optional. It stores other keyword arguments for the current function.
- Aggregate and Insert Cell Values: Values of different columns may be processed in distinct pipeline functions. Thus, we need some instruction to put them together into a single table. This is exactly what the aggregation dict list does. For example,
{
"function": "aggregate_test_domain_table_pdf_meta",
"table": "pdf_meta",
"columns": [
"pdf_id",
"pdf_name",
"pdf_path"
],
"args": {
"deps": [
"get_pdf_page_text"
],
"kwargs": {}
}
}

Each aggregation dict follows almost the same format as a pipeline function:
- Field function: str, required. It denotes the aggregation function name in the module utils.functions;
- Field args -> deps: List[str], optional, for input-output dependencies or positional arguments;
- Field args -> kwargs: Dict[str, Any], optional, for keyword arguments of the current aggregation function;
- Field table: str, required. It denotes the table into which row values are inserted;
- Field columns: Optional[List[str]], optional. It represents the list of column names in table to insert entries into. If omitted, we insert values for all columns following the default column order of the specified table based on the database schema.
❗️ NOTE: deps can only reference pipeline function names, not aggregation functions.
- UUID: This special field specifies how to get the unique UUID of the input PDF. In the demonstration case above, we use the pdf_id field of the JSON output from the pipeline function get_pdf_page_text. The extracted UUID will be passed to the vectorstore encoding part to search for encodable cell values and use various encoding models to vectorize them.
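To illustrate how deps could be resolved (a simplified sketch only, not the repo's actual implementation), the population loop can cache each pipeline function's output under its name and look dependencies up either in that cache or from the special input_pdf argument:

from typing import Any, Dict, List
import utils.functions as functions  # pipeline (and aggregation) functions are defined here

def run_pipeline(input_pdf: str, pipeline_config: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Simplified sketch: execute each pipeline function with its resolved positional deps."""
    outputs: Dict[str, Any] = {"input_pdf": input_pdf}  # special key for the raw input
    for step in pipeline_config:
        func = getattr(functions, step["function"])
        deps = step.get("args", {}).get("deps", [])
        kwargs = step.get("args", {}).get("kwargs", {})
        # each dep is either "input_pdf" or the name of an earlier pipeline function
        args = [outputs[name] for name in deps]
        outputs[step["function"]] = func(*args, **kwargs)
    return outputs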
To test the config above, you can run this simple demo script:
python utils/database_utils.py --database test_domain --config_path configs/test_domain_config.json \
    --pdf_path data/dataset/test_pdf.pdf --on_conflict replace

🌟 Best Practice: Here are some tips on customizing your personal pipeline and aggregation functions:
- Both pipeline and aggregation functions should be defined in the Python module utils.functions. Remember to import them in the __init__.py file;
- The first pipeline function should take the raw input input_pdf as one of its positional arguments;
- 🥀 Suggestion: Consider using a JSON Dict[str, Any] as the output type for each pipeline function, so that it is easier to chain the functions;
- 🧠 Obligation: The output of aggregation functions MUST BE of type List[List[Any]], with each element being a row entry (List[Any]) to insert, so that the follow-up population can automatically create the INSERT SQL statements and execute them via duckdb.DuckDBPyConnection.executemany(). (An illustrative pair of custom functions is sketched after this list.)
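As a hedged illustration of these conventions (the function names and returned keys below are invented for the example; real ones live in utils.functions and your own schema), a custom pipeline/aggregation pair might look like:

import os
import uuid
from typing import Any, Dict, List

def get_pdf_file_stats(input_pdf: str, generate_uuid: bool = True) -> Dict[str, Any]:
    """Example pipeline function: returns a JSON-style dict so that later steps can chain on it."""
    return {
        "pdf_id": str(uuid.uuid5(uuid.NAMESPACE_URL, input_pdf)) if generate_uuid else None,
        "pdf_name": os.path.splitext(os.path.basename(input_pdf))[0],
        "pdf_path": input_pdf,
    }

def aggregate_example_table_pdf_meta(stats: Dict[str, Any]) -> List[List[Any]]:
    """Example aggregation function: the output MUST be List[List[Any]], one inner list per row,
    so the population code can feed it to duckdb.DuckDBPyConnection.executemany(), e.g.:
        conn.executemany("INSERT INTO pdf_meta (pdf_id, pdf_name, pdf_path) VALUES (?, ?, ?)", rows)
    """
    return [[stats["pdf_id"], stats["pdf_name"], stats["pdf_path"]]]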
During multi-view PDF parsing, we rely on:
- the third-party tool MinerU to perform the major PDF parsing (e.g., table recognition and formula detection), and
- LLMs/VLMs for PDF content refinement (both text and image modalities).
To speed up database population, we can improve efficiency in the following two steps.
- Firstly, we can enable GPU acceleration for MinerU by setting device-mode in the MinerU configuration magic-pdf.json to cuda.
- Besides, we can pre-process the PDFs with the official MinerU command magic-pdf -p pdf_filepath -o output_folder -m auto and cache the output results in the folder ${dataset_dir}/${dataset}/processed_data/.
  - Each processed PDF will be cached in processed_data/ as a separate folder with the same name as its base filename.
$ cat uuids.txt
data/dataset/airqa/papers/acl2023/a02e3a4b-1dfb-5f4b-b654-4855a1a7f7bf.pdf
data/dataset/airqa/papers/acl2023/a04766c4-db6f-58b8-867f-07385a5890e3.pdf
...
$ bash mineru.sh airqa uuids.txt
Then, during the PDF parsing in the function populate, the cached results will be automatically detected and utilized to accelerate the pipeline. For large quantities of papers, we can further launch multiple processes to run magic-pdf on different partitions of the PDFs.
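For instance, one simple way (among others) to partition the work is to read the same uuids.txt file and spawn several magic-pdf processes concurrently; the command and flags below follow the official MinerU usage shown above, while the worker count and output folder are assumptions you should adapt to your machine:

import subprocess
from concurrent.futures import ThreadPoolExecutor

OUTPUT_DIR = "data/dataset/airqa/processed_data"  # the cache folder used by the population pipeline

def run_mineru(pdf_path: str) -> int:
    # same command and flags as the official MinerU usage: magic-pdf -p <pdf> -o <folder> -m auto
    return subprocess.run(["magic-pdf", "-p", pdf_path, "-o", OUTPUT_DIR, "-m", "auto"]).returncode

with open("uuids.txt", "r", encoding="utf-8") as f:
    pdf_paths = [line.strip() for line in f if line.strip()]

# 4 concurrent magic-pdf processes; tune this to your GPU/CPU memory budget
with ThreadPoolExecutor(max_workers=4) as pool:
    return_codes = list(pool.map(run_mineru, pdf_paths))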
In the population pipeline, we constantly send HTTP requests to the LLM for summarization, which is slow and time-consuming. We therefore also provide two extra configs for summarizing with batch APIs, which is especially useful when you have a large number of papers to populate. Take the dataset airqa as an example:
⚡️ NOTE: Please pre-fetch all metadata into the folder metadata/ and pre-parse all PDFs with MinerU into the folder processed_data/ (see Parallel MinerU Parsing via GPU) before running the following scripts.
- Parallel Extraction: This script generates two files, text_batch.jsonl and image_batch.jsonl, following the standard format (see OpenAI Batch API). The default output path data/dataset/airqa/parallel/ can be changed in ai_research_pe_config.json.

  # batch_uuids.json contains a list of PDF UUIDs
  $ python utils/database_utils.py --database ai_research --pdf_path data/dataset/airqa/batch_uuids.json \
      --config_path configs/ai_research_pe_config.json

- Batch API Calls: Now, we can send the two files (text_batch.jsonl and image_batch.jsonl) to LLMs/VLMs that support batch inference (a hedged OpenAI Batch API sketch is given at the end of this subsection). Suppose we have obtained the result files text_results.jsonl and image_results.jsonl respectively.
- Parallel Filling: Next, we can fill the missing summaries into processed_data/ with the following command.

  # batch_uuids.json should be exactly the same file as in step 1
  python utils/database_utils.py --database ai_research --pdf_path data/dataset/airqa/batch_uuids.json \
      --config_path configs/ai_research_pf_config.json

Finally, we can now populate the database and complete the whole database population:
python utils/database_utils.py --database ai_research --pdf_path data/dataset/airqa/batch_uuids.json \
    --config_path configs/ai_research_config.json --on_conflict ignore
❗️ Attention: the Parallel Extraction and Parallel Filling steps should be conducted on the same server or laptop, because the hash value of the same LLM message may differ across OS platforms. Moreover, we found that the LLM batch API may fail to generate part of the results. Thus, it is suggested to maintain the HTTP connection to LLMs/VLMs even when performing the final population using ai_research_config.json.
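For step 2, if your provider is OpenAI, submitting text_batch.jsonl and later downloading text_results.jsonl could look roughly like the following (a sketch only; repeat it for image_batch.jsonl, and adapt the file paths and endpoint to your own provider):

from openai import OpenAI

client = OpenAI()

# 1. upload the request file produced by Parallel Extraction
batch_file = client.files.create(
    file=open("data/dataset/airqa/parallel/text_batch.jsonl", "rb"),
    purpose="batch",
)

# 2. create the batch job against the chat completions endpoint
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# 3. poll later; once the batch is completed, save the results for Parallel Filling
batch = client.batches.retrieve(batch.id)
if batch.status == "completed":
    content = client.files.content(batch.output_file_id)
    with open("text_results.jsonl", "wb") as f:
        f.write(content.read())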