This repo contains a simple tool to scrape EPD files from three sources.
pip install -r requirements.txt
Be sure to run playwright install before launching a scraping script.
- environdec/ (https://environdec.com/library): first run scrap_urls.py to refresh urls.json, then run download_pdf.py, which reads urls.json and refreshes the PDF files in docs/.
- holciumus/ (https://www.holcim.us/technical-specifications): just run download_pdf.py to refresh the PDF files in docs/.
- labelingsustainability/ (https://www.labelingsustainability.com/holcim-epds): just run download_pdf.py to refresh the PDF files in docs/.
The scripts download files about Holcim, but they can be modified to target anything else.
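As a rough illustration of the "read urls.json, write PDFs to docs/" step, here is a minimal sketch. It is not the repo's download_pdf.py: it assumes urls.json is a flat JSON list of URL strings, and the helper names (pdf_name, fetch_bytes, download_all) are hypothetical. The fetcher is passed in as a parameter so that it can be swapped for a Playwright-based one.

```python
import json
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def pdf_name(url: str) -> str:
    """Derive a local file name from the last path segment of the URL."""
    name = Path(urlparse(url).path).name or "document"
    return name if name.lower().endswith(".pdf") else name + ".pdf"


def fetch_bytes(url: str) -> bytes:
    """Default fetcher: a plain HTTP GET with the standard library."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def download_all(urls_file: str, docs_dir: str, fetch=fetch_bytes) -> list:
    """Download every URL listed in urls_file into docs_dir.

    Assumption: urls_file holds a JSON list of URL strings.
    """
    urls = json.loads(Path(urls_file).read_text(encoding="utf-8"))
    out_dir = Path(docs_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        target = out_dir / pdf_name(url)
        # Existing files are overwritten, which matches "refresh".
        target.write_bytes(fetch(url))
        saved.append(target)
    return saved
```

Calling download_all("urls.json", "docs") from inside a source folder would then refresh docs/ under the assumptions above.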
ocrisation.py checks the docs/ folder and generates raw text in ocr_output. You can adjust the parameters to get the best possible raw text.
Use txt_to_json.py with an API key (OpenAI in this example). It generates a summary for each file in the ocr_output folder. If a summary is not accurate, you can just delete it, change some parameters (temperature or prompt, for example) and relaunch the script. It first checks whether a summary already exists for each file in ocr_output; if not, it sends the API request.
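The "skip what already exists" behaviour can be sketched as follows. This is only an illustration, not the actual txt_to_json.py: the .txt/.json extensions and the output folder are assumptions, and summarize is a hypothetical callable standing in for the real API request.

```python
from pathlib import Path


def summarize_missing(ocr_dir: str, json_dir: str, summarize) -> list:
    """Call summarize(text) only for OCR files that have no summary yet.

    summarize is a hypothetical callable (text -> JSON string) that would
    wrap the API request in a real script.
    """
    out_dir = Path(json_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for txt_file in sorted(Path(ocr_dir).glob("*.txt")):
        target = out_dir / (txt_file.stem + ".json")
        if target.exists():
            # Summary already on disk: skip it, so no API request is sent.
            continue
        text = txt_file.read_text(encoding="utf-8")
        target.write_text(summarize(text), encoding="utf-8")
        created.append(target.name)
    return created
```

With this layout, deleting one bad summary and relaunching only pays for that single file.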
The output may contain Unicode character artifacts. Launch clean_unicode.py; it goes through all JSON files in folder_path and cleans them.
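One possible way to do this kind of cleanup is sketched below. It is an assumption about what "cleaning" means here, not the actual clean_unicode.py: strings are normalized to NFKC and control/format characters are dropped. The function names are hypothetical.

```python
import json
import unicodedata
from pathlib import Path


def clean_text(text: str) -> str:
    """Normalize to NFKC and drop control/format characters, keeping newlines and tabs."""
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(
        ch for ch in normalized
        if ch in "\n\t" or unicodedata.category(ch) not in ("Cc", "Cf")
    )


def clean_value(value):
    """Recursively clean every string inside a decoded JSON value."""
    if isinstance(value, str):
        return clean_text(value)
    if isinstance(value, list):
        return [clean_value(item) for item in value]
    if isinstance(value, dict):
        return {clean_text(k): clean_value(v) for k, v in value.items()}
    return value


def clean_folder(folder_path: str) -> int:
    """Rewrite each *.json file in folder_path with cleaned strings; return the file count."""
    count = 0
    for json_file in sorted(Path(folder_path).glob("*.json")):
        data = json.loads(json_file.read_text(encoding="utf-8"))
        json_file.write_text(
            json.dumps(clean_value(data), ensure_ascii=False, indent=2),
            encoding="utf-8",
        )
        count += 1
    return count
```

Note that dropping format characters also removes zero-width joiners, which may matter for some scripts and emoji; for Latin-script PDF text that trade-off is usually acceptable.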
Be sure to evaluate the cost of the API requests before proceeding.
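A very rough way to get an order of magnitude is sketched below. Everything in it is an assumption: the price is a parameter you must take from your provider's current pricing page, the 4-characters-per-token ratio is only a rule of thumb for English text, and prompt overhead and output tokens are not counted. The function name is hypothetical.

```python
from pathlib import Path


def estimate_cost(ocr_dir: str, usd_per_1k_tokens: float, chars_per_token: float = 4.0) -> dict:
    """Rough input-side estimate: total characters -> approximate tokens -> cost."""
    files = sorted(Path(ocr_dir).glob("*.txt"))
    total_chars = sum(
        len(f.read_text(encoding="utf-8", errors="ignore")) for f in files
    )
    approx_tokens = total_chars / chars_per_token
    return {
        "files": len(files),
        "chars": total_chars,
        "approx_tokens": approx_tokens,
        "approx_usd": approx_tokens / 1000 * usd_per_1k_tokens,
    }
```

Treat the result as a lower bound and test on a handful of files before processing the whole ocr_output folder.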
ask_json.py shows a small example of how you can use the repo with the GitHub API.