Installation

Go to your project directory (where this repository is cloned).
Go to the checkify package (where the ocr and data packages are located).
Download roberta-base.zip from here and extract it into the models package.
Download Tesseract from https://github.com/UB-Mannheim/tesseract/wiki
In .\checkify\ocr\ocr.py, replace the path in tess.pytesseract.tesseract_cmd = r"C:\\Program Files\\Tesseract-OCR\\tesseract.exe" with the path to your own tesseract.exe.
Install all the necessary libraries listed in pyproject.toml.
Run preparation.sh (make sure that you have nltk installed).
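
As an alternative to hard-coding the Tesseract path, the lookup could be made configurable. The following is a minimal sketch, not the code in ocr.py: the helper name resolve_tesseract_cmd and the TESSERACT_CMD environment variable are hypothetical, and it assumes ocr.py imports the library as `import pytesseract as tess`.

```python
import os
import shutil

# Default install location used in this README (Windows).
DEFAULT_TESSERACT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def resolve_tesseract_cmd(default=DEFAULT_TESSERACT):
    """Return a path to the tesseract executable.

    Lookup order: the TESSERACT_CMD environment variable (hypothetical name),
    then a `tesseract` binary found on PATH, then the given default.
    """
    from_env = os.environ.get("TESSERACT_CMD")
    if from_env:
        return from_env
    return shutil.which("tesseract") or default


try:
    # Assumption: ocr.py aliases the library as `tess`.
    import pytesseract as tess

    tess.pytesseract.tesseract_cmd = resolve_tesseract_cmd()
except ImportError:
    # pytesseract is not installed yet; install the project dependencies first.
    pass
```

With this in place, the path can be overridden per machine by setting TESSERACT_CMD before running main.py, without editing the source file.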

Test

python .\main.py check-contract --path=test_file.pdf

Description

This program adds an OCR layer on top of the roberta-base model by TheAtticusProject. The model was fine-tuned on contract documents that were manually annotated by law students. A detailed description of the CUAD dataset and the annotation process can be found here.

For further fine-tuning, new data can be annotated in SQuAD format. The training code can be found in the original repository.
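
To illustrate what a SQuAD-style annotation looks like, here is a minimal sketch that builds one record. The contract sentence, question wording, id, and the helper name make_squad_example are made up for illustration; the fields follow the general SQuAD 2.0 layout, and the exact fields expected by the original training code should be checked against that repository.

```python
import json


def make_squad_example(title, context, question, answer_text, qa_id):
    """Build one SQuAD-style document entry with a single question."""
    # SQuAD stores the character offset of the answer inside the context.
    start = context.find(answer_text)
    if start == -1:
        # Answer span not present: mark the question as unanswerable.
        answers, is_impossible = [], True
    else:
        answers = [{"text": answer_text, "answer_start": start}]
        is_impossible = False
    return {
        "title": title,
        "paragraphs": [
            {
                "context": context,
                "qas": [
                    {
                        "id": qa_id,
                        "question": question,
                        "answers": answers,
                        "is_impossible": is_impossible,
                    }
                ],
            }
        ],
    }


# Illustrative data only; not taken from CUAD.
context = "This Agreement shall be governed by the laws of the State of New York."
example = make_squad_example(
    title="sample_contract",
    context=context,
    question="Which state's law governs the contract?",
    answer_text="the laws of the State of New York",
    qa_id="sample_contract__governing_law",
)

dataset = {"version": "example-v1", "data": [example]}
print(json.dumps(dataset, indent=2))
```

Computing answer_start from the context, rather than typing it by hand, keeps the offset consistent with the answer text, which is what SQuAD-style training scripts typically rely on.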

eBrevia can be used for data annotation, as stated here under the Annotations section.

The prediction code was taken from this repository.
