- Go to your project directory (where this repository is cloned).
- Go to the `checkify` package (where the `ocr` and `data` packages are located).
- Download `roberta-base.zip` from here and extract it to the `models` package.
- Download `Tesseract` from [here](https://github.com/UB-Mannheim/tesseract/wiki).
- Substitute `tess.pytesseract.tesseract_cmd = r"C:\\Program Files\\Tesseract-OCR\\tesseract.exe"` in `.\checkify\ocr\ocr.py` with `path\to\your\tesseract.exe`.
- Install all the necessary libraries from the `toml` file.
- Run `preparation.sh` (make sure that you have `nltk` downloaded).

`python .\main.py check-contract --path=test_file.pdf`
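
The Tesseract path substitution described in the setup steps can also be done without hard-coding a path. The following is only a minimal sketch, not the code used in this repository: the helper name `resolve_tesseract_cmd` and the `TESSERACT_CMD` environment variable are assumptions made for illustration.

```python
import os
import shutil

# Default Windows install location mentioned in the setup steps; adjust as needed.
DEFAULT_TESSERACT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def resolve_tesseract_cmd(default: str = DEFAULT_TESSERACT) -> str:
    """Pick a path to the Tesseract executable.

    Lookup order:
    1. TESSERACT_CMD environment variable (hypothetical name, not used by the repo)
    2. a `tesseract` executable found on PATH
    3. the given default path
    """
    env_path = os.environ.get("TESSERACT_CMD")
    if env_path:
        return env_path
    found = shutil.which("tesseract")
    if found:
        return found
    return default


if __name__ == "__main__":
    # Print the path that would be handed to pytesseract.
    print(resolve_tesseract_cmd())
```

Under these assumptions, the line in `.\checkify\ocr\ocr.py` could become `tess.pytesseract.tesseract_cmd = resolve_tesseract_cmd()`, so that each machine only needs Tesseract on its `PATH` or the environment variable set.
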
This program adds an OCR layer on top of the roberta-base model by TheAtticusProject. The model was fine-tuned on contract documents that were manually annotated by law students. A detailed description of the CUAD dataset and the annotation process can be found here.

For further fine-tuning, new data can be annotated in SQuAD format. Code for training can be found in the original repository.

eBrevia can be used for data annotation, as stated here under the Annotations section.

The code for prediction was taken from this repository.
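
For reference, a SQuAD-style record for newly annotated data could be built roughly as follows. This is a sketch of the general SQuAD 2.0 JSON layout with made-up sample text; the helper `make_squad_example` is hypothetical, and the exact fields the training code expects should be checked against the original repository.

```python
import json


def make_squad_example(title, context, question, answer_text, qa_id):
    """Build one SQuAD-style article containing a single question."""
    # SQuAD stores the character offset of the answer inside the context.
    start = context.find(answer_text)
    if start == -1:
        raise ValueError("answer_text must appear verbatim in context")
    return {
        "title": title,
        "paragraphs": [
            {
                "context": context,
                "qas": [
                    {
                        "id": qa_id,
                        "question": question,
                        "answers": [
                            {"text": answer_text, "answer_start": start}
                        ],
                        "is_impossible": False,
                    }
                ],
            }
        ],
    }


if __name__ == "__main__":
    # Illustrative contract sentence, not taken from any dataset.
    context = (
        "This Agreement shall be governed by the laws of the State of New York."
    )
    example = make_squad_example(
        title="sample_contract",
        context=context,
        question="Which law governs this contract?",
        answer_text="the laws of the State of New York",
        qa_id="sample_contract__governing_law",
    )
    dataset = {"version": "v2.0", "data": [example]}
    print(json.dumps(dataset, indent=2))
```

Computing `answer_start` from the context, rather than typing it by hand, keeps the offset consistent with the answer text when annotations are edited.
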
