Skip to content

Tool to scrap pdf of Environmental Product Declarations, OCR to transform in raw text and extract informations with LLM on json format.

Notifications You must be signed in to change notification settings

lan-ensad/Environmental_Product_Declarations

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

This repo contains an easy tool to scrap edp file from three sources.

Dependencies

pip install -r requirements.txt

Be sure to playwright install before launch scraping script.

Downloding PDF

Download files about Holcim but can be modified with any.

OCR sources

ocrisation.py will check the folder docs/ and generate raw text in ocr_output. You can adjust all parameters try to have the best raw file.

Resume with LLM

Use txt_to_json.py with an API key (Open AI in this exemple). It will generate a resume for each file in ocr_output folder. If the resume is not accurate, you can juste delete it, change some parameters (temperature or prompt for exemple) and relaunch the script. It will first check is the file in ocr_output already exist, if not it will send the API request.

Cleaning

It could be some unicode caracter artefact. Launch clean_unicode.py it will all json files in the folder_path to clean it.

Warning

Be sure to evaluate the cost of the API request before to proceed.

Work with json

ask_json.py you see a little exemple how you can use the repo with the github api.

About

Tool to scrap pdf of Environmental Product Declarations, OCR to transform in raw text and extract informations with LLM on json format.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages