|
1 | | -# [Parser](https://parser.excoffierleonard.com) |
| 1 | +# Parser |
2 | 2 |
|
3 | | -REST API service in Rust that takes in any file and returns its parsed content. |
| 3 | +A Rust-based document parsing system that extracts text content from various file formats. |
4 | 4 |
|
5 | | -Multithreading was used to improve the performance of the service. The service is able to handle multiple requests concurrently. |
| 5 | +[Live Demo](https://parser.excoffierleonard.com) | [API Endpoint](https://parser.excoffierleonard.com/parse) |
6 | 6 |
|
7 | | -Demonstration URL: [https://parser.excoffierleonard.com](https://parser.excoffierleonard.com) |
| 7 | + |
8 | 8 |
|
9 | | -Demonstration Endpoint: [https://parser.excoffierleonard.com/parse](https://parser.excoffierleonard.com/parse) |
| 9 | +## 📚 Overview |
10 | 10 |
|
11 | | - |
| 11 | +Parser is a modular Rust project that provides comprehensive document parsing capabilities through multiple interfaces: |
| 12 | + |
| 13 | +- **Core library**: The foundation providing parsing functionality for various file formats |
| 14 | +- **CLI tool**: Command-line interface for quick file parsing |
| 15 | +- **Web API**: REST service for parsing files via HTTP requests |
| 16 | +- **Web UI**: Simple interface for testing the parser functionality |
| 17 | + |
| 18 | +## 📦 Project Structure |
| 19 | + |
| 20 | +The project is organized as a Rust workspace with multiple crates: |
12 | 21 |
|
13 | | -## 📚 Table of Contents |
14 | | - |
15 | | -- [Supported File Types](#-supported-file-types) |
16 | | -- [Prerequisites](#-prerequisites) |
17 | | -- [Configuration](#-configuration) |
18 | | -- [Deployment](#-deployment) |
19 | | -- [API Documentation](#-api-documentation) |
20 | | -- [Development](#-development) |
21 | | -- [License](#-license) |
22 | | - |
23 | | -## 📦 Supported File Types |
24 | | - |
25 | | -The API supports the following file formats: |
26 | | - |
27 | | -- PDF (`.pdf`) |
28 | | -- Word Documents (`.docx`) |
29 | | -- Excel Spreadsheets (`.xlsx`) |
30 | | -- PowerPoint Presentations (`.pptx`) |
31 | | -- All text-based files including but not limited to: |
32 | | - - Plain text (`.txt`) |
33 | | - - Source code files (`.rs`, `.py`, `.js`, `etc.`) |
34 | | - - Configuration files (`.json`, `.yaml`, `.toml`, `etc.`) |
35 | | - - Markup files (`.html`, `.md`, `.xml`) |
36 | | - - Data files (`.csv`, `.tsv`) |
37 | | - - Log files (`.log`) |
38 | | -- All image-based files (OCR) including but not limited to: |
39 | | - - Raster images (`.png`, `.jpg`, `.jpeg`, `.gif`, `.bmp`, `.webp`, `etc.`) |
40 | | - - Icon files (`.ico`) |
41 | | - - Animated images (`.gif`) |
| 22 | +- **parser-core**: The core parsing engine |
| 23 | +- **parser-cli**: Command-line interface |
| 24 | +- **parser-web**: Web API and frontend |
| 25 | +- **test-utils**: Shared testing utilities |
| 26 | + |
| 27 | +## 📄 Supported File Types |
| 28 | + |
| 29 | +- **Documents**: PDF (`.pdf`), Word (`.docx`), PowerPoint (`.pptx`), Excel (`.xlsx`) |
| 30 | +- **Text**: Plain text (`.txt`), CSV, JSON, YAML, source code, and other text-based formats |
| 31 | +- **Images**: PNG, JPEG, WebP, and other image formats with OCR (Optical Character Recognition) |
42 | 32 |
|
43 | 33 | The OCR functionality supports English and French languages. |
44 | 34 |
|
45 | | -## 🛠 Prerequisites |
| 35 | +## 🛠️ Getting Started |
46 | 36 |
|
47 | | -For local build: |
| 37 | +### Prerequisites |
48 | 38 |
|
49 | | -- [Rust](https://www.rust-lang.org/learn/get-started) |
50 | | -- Libraries (For Tessaract OCR): |
| 39 | +- [Rust](https://www.rust-lang.org/learn/get-started) (latest stable) |
| 40 | +- OCR Dependencies: |
51 | 41 | - Tesseract development libraries |
52 | 42 | - Leptonica development libraries |
53 | 43 | - Clang development libraries |
54 | | - - English Language Data |
55 | | - - French Language Data |
56 | 44 |
|
57 | | -### Installing Dependencies |
| 45 | +#### Installing OCR Dependencies |
58 | 46 |
|
59 | | -#### Debian/Ubuntu |
| 47 | +**Debian/Ubuntu:** |
60 | 48 |
|
61 | 49 | ```bash |
62 | 50 | sudo apt install libtesseract-dev libleptonica-dev libclang-dev |
63 | 51 | ``` |
64 | 52 |
|
65 | | -#### macOS |
| 53 | +**macOS:** |
66 | 54 |
|
67 | 55 | ```bash |
68 | 56 | brew install tesseract |
69 | 57 | ``` |
70 | 58 |
|
71 | | -#### Windows |
72 | | - |
| 59 | +**Windows:** |
73 | 60 | Follow the instructions at [Tesseract GitHub repository](https://github.com/tesseract-ocr/tesseract). |
74 | 61 |
|
75 | | -For deployment: |
| 62 | +### Building from Source |
76 | 63 |
|
77 | | -- [Docker](https://docs.docker.com/get-docker/) |
78 | | -- [Docker Compose](https://docs.docker.com/compose/install/) |
| 64 | +```bash |
| 65 | +# Build all crates |
| 66 | +cargo build |
79 | 67 |
|
80 | | -## ⚙ Configuration |
| 68 | +# Build in release mode |
| 69 | +cargo build --release |
| 70 | +``` |
| 71 | + |
| 72 | +### Using the CLI |
| 73 | + |
| 74 | +```bash |
| 75 | +# Run directly with cargo |
| 76 | +cargo run -p parser-cli -- path/to/file1.pdf path/to/file2.docx |
81 | 77 |
|
82 | | -The service can be configured using the following environment variables. |
| 78 | +# Or use the built binary |
| 79 | +./target/release/parser-cli path/to/file1.pdf path/to/file2.docx |
| 80 | +``` |
| 81 | + |
| 82 | +### Running the Web Server |
| 83 | + |
| 84 | +```bash |
| 85 | +# Run the web server |
| 86 | +cargo run -p parser-web |
83 | 87 |
|
84 | | -- `PARSER_APP_PORT`: _INT_, The port on which the program listens on. (default: 8080) |
85 | | -- `ENABLE_FILE_SERVING`: _BOOL_, Enable serving files for the frontend. (default: false, just the API is enabled) |
| 88 | +# With custom port |
| 89 | +PARSER_APP_PORT=9000 cargo run -p parser-web |
| 90 | + |
| 91 | +# With file serving enabled (for frontend) |
| 92 | +ENABLE_FILE_SERVING=true cargo run -p parser-web |
| 93 | +``` |
86 | 94 |
|
87 | 95 | ## 🚀 Deployment |
88 | 96 |
|
| 97 | +The easiest way to deploy the service is using Docker: |
| 98 | + |
89 | 99 | ```bash |
90 | 100 | curl -o compose.yaml https://raw.githubusercontent.com/excoffierleonard/parser/refs/heads/main/compose.yaml && \ |
91 | 101 | docker compose up -d |
92 | 102 | ``` |
93 | 103 |
|
94 | | -## 📖 API Documentation |
| 104 | +### Environment Variables |
95 | 105 |
|
96 | | -API documentation and examples are available in [docs/api.md](docs/api.md). |
| 106 | +- `PARSER_APP_PORT`: The port on which the web service listens (default: 8080) |
| 107 | +- `ENABLE_FILE_SERVING`: Enable serving frontend files (default: false) |
97 | 108 |
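As an illustration of the two environment variables listed above, the sketch below shows one way a service could read them with the documented defaults (port 8080, file serving off). The `Config` struct and the `parse_port`, `parse_flag`, and `load_config` helpers are hypothetical names introduced here, not taken from the parser-web crate. Falling back to the default when a value is missing or malformed is likewise an assumption, not confirmed behavior.

```rust
use std::env;

/// Hypothetical container for the two settings named in the README.
#[derive(Debug, PartialEq)]
struct Config {
    port: u16,
    enable_file_serving: bool,
}

/// Parse the port, falling back to the documented default of 8080
/// when the value is absent or not a valid u16 (assumed behavior).
fn parse_port(raw: Option<&str>) -> u16 {
    raw.and_then(|v| v.trim().parse().ok()).unwrap_or(8080)
}

/// Treat only "true" (case-insensitive) as enabled; anything else,
/// including an unset variable, keeps the documented default of false.
fn parse_flag(raw: Option<&str>) -> bool {
    raw.map(|v| v.trim().eq_ignore_ascii_case("true"))
        .unwrap_or(false)
}

/// Read both variables from the process environment.
fn load_config() -> Config {
    Config {
        port: parse_port(env::var("PARSER_APP_PORT").ok().as_deref()),
        enable_file_serving: parse_flag(env::var("ENABLE_FILE_SERVING").ok().as_deref()),
    }
}

fn main() {
    let config = load_config();
    println!("{config:?}");
}
```

Keeping the parsing in small functions that take `Option<&str>` lets the defaults be checked without touching the real process environment.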
|
98 | 109 | ## 🧪 Development |
99 | 110 |
|
100 | | -Useful commands for development: |
| 111 | +### Testing |
| 112 | + |
| 113 | +```bash |
| 114 | +# Run all tests |
| 115 | +cargo test --workspace |
| 116 | + |
| 117 | +# Run specific test |
| 118 | +cargo test test_name |
| 119 | +``` |
101 | 120 |
|
102 | | -- Full build: |
| 121 | +### Benchmarking |
103 | 122 |
|
104 | 123 | ```bash |
105 | | -chmod +x ./scripts/build.sh && \ |
106 | | -./scripts/build.sh |
| 124 | +# Run benchmarks |
| 125 | +cargo bench --workspace |
| 126 | + |
| 127 | +# Run benchmark script |
| 128 | +./scripts/benchmark.sh |
| 129 | +``` |
| 130 | + |
| 131 | +### Code Quality |
| 132 | + |
| 133 | +```bash |
| 134 | +# Run linter |
| 135 | +cargo clippy --workspace -- -D warnings |
| 136 | + |
| 137 | +# Format code |
| 138 | +cargo fmt --all |
107 | 139 | ``` |
108 | 140 |
|
109 | | -- Deployment tests: |
| 141 | +### Building with Scripts |
110 | 142 |
|
111 | 143 | ```bash |
112 | | -chmod +x ./scripts/deploy-tests.sh && \ |
| 144 | +# Full build script |
| 145 | +./scripts/build.sh |
| 146 | + |
| 147 | +# Deployment tests |
113 | 148 | ./scripts/deploy-tests.sh |
114 | 149 | ``` |
115 | 150 |
|