Commit def9ed5

Reworked main readme
1 parent edc779f commit def9ed5

1 file changed

Lines changed: 96 additions & 61 deletions

README.md

@@ -1,115 +1,150 @@
-# [Parser](https://parser.excoffierleonard.com)
+# Parser

-REST API service in Rust that takes in any file and returns its parsed content.
+A Rust-based document parsing system that extracts text content from various file formats.

-Multithreading was used to improve the performance of the service. The service is able to handle multiple requests concurrently.
+[Live Demo](https://parser.excoffierleonard.com) | [API Endpoint](https://parser.excoffierleonard.com/parse)

-Demonstration URL: [https://parser.excoffierleonard.com](https://parser.excoffierleonard.com)
+![Website Preview](website_preview.png)

-Demonstration Endpoint: [https://parser.excoffierleonard.com/parse](https://parser.excoffierleonard.com/parse)
+## 📚 Overview

-![Website Preview](website_preview.png)
+Parser is a modular Rust project that provides comprehensive document parsing capabilities through multiple interfaces:
+
+- **Core library**: The foundation providing parsing functionality for various file formats
+- **CLI tool**: Command-line interface for quick file parsing
+- **Web API**: REST service for parsing files via HTTP requests
+- **Web UI**: Simple interface for testing the parser functionality
+
+## 📦 Project Structure
+
+The project is organized as a Rust workspace with multiple crates:

-## 📚 Table of Contents
-
-- [Supported File Types](#-supported-file-types)
-- [Prerequisites](#-prerequisites)
-- [Configuration](#-configuration)
-- [Deployment](#-deployment)
-- [API Documentation](#-api-documentation)
-- [Development](#-development)
-- [License](#-license)
-
-## 📦 Supported File Types
-
-The API supports the following file formats:
-
-- PDF (`.pdf`)
-- Word Documents (`.docx`)
-- Excel Spreadsheets (`.xlsx`)
-- PowerPoint Presentations (`.pptx`)
-- All text-based files including but not limited to:
-  - Plain text (`.txt`)
-  - Source code files (`.rs`, `.py`, `.js`, `etc.`)
-  - Configuration files (`.json`, `.yaml`, `.toml`, `etc.`)
-  - Markup files (`.html`, `.md`, `.xml`)
-  - Data files (`.csv`, `.tsv`)
-  - Log files (`.log`)
-- All image-based files (OCR) including but not limited to:
-  - Raster images (`.png`, `.jpg`, `.jpeg`, `.gif`, `.bmp`, `.webp`, `etc.`)
-  - Icon files (`.ico`)
-  - Animated images (`.gif`)
+- **parser-core**: The core parsing engine
+- **parser-cli**: Command-line interface
+- **parser-web**: Web API and frontend
+- **test-utils**: Shared testing utilities
+
+## 📄 Supported File Types
+
+- **Documents**: PDF (`.pdf`), Word (`.docx`), PowerPoint (`.pptx`), Excel (`.xlsx`)
+- **Text**: Plain text (`.txt`), CSV, JSON, YAML, source code, and other text-based formats
+- **Images**: PNG, JPEG, WebP, and other image formats with OCR (Optical Character Recognition)

 The OCR functionality supports English and French languages.
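
Reviewer note: the three categories in the new "Supported File Types" list can be pictured as a simple dispatch on file extension. The sketch below is only an illustration under that assumption. `FileKind` and `classify_extension` are hypothetical names that do not come from this repository, the extension list is a small sample, and `parser-core` may well detect formats differently (for example by inspecting file contents).

```rust
use std::path::Path;

/// Hypothetical grouping that mirrors the three README categories.
#[derive(Debug, PartialEq, Eq)]
enum FileKind {
    Document,
    Text,
    Image,
    Unknown,
}

/// Classify a path by its extension, ignoring ASCII case.
/// Illustrative only: the real crate may not work this way.
fn classify_extension(path: &str) -> FileKind {
    // Extract the extension as a lowercase String, if there is one.
    let ext = Path::new(path)
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_ascii_lowercase());

    match ext.as_deref() {
        Some("pdf" | "docx" | "pptx" | "xlsx") => FileKind::Document,
        Some("txt" | "csv" | "json" | "yaml" | "rs" | "py" | "js") => FileKind::Text,
        Some("png" | "jpg" | "jpeg" | "webp") => FileKind::Image,
        _ => FileKind::Unknown,
    }
}
```

In this sketch a caller would route `FileKind::Image` inputs to OCR and the other known kinds to a format-specific text extractor. Anything `Unknown` is left for the caller to reject or to try as plain text.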

-## 🛠 Prerequisites
+## 🛠️ Getting Started

-For local build:
+### Prerequisites

-- [Rust](https://www.rust-lang.org/learn/get-started)
-- Libraries (For Tessaract OCR):
+- [Rust](https://www.rust-lang.org/learn/get-started) (latest stable)
+- OCR Dependencies:
   - Tesseract development libraries
   - Leptonica development libraries
   - Clang development libraries
-  - English Language Data
-  - French Language Data

-### Installing Dependencies
+#### Installing OCR Dependencies

-#### Debian/Ubuntu
+**Debian/Ubuntu:**

 ```bash
 sudo apt install libtesseract-dev libleptonica-dev libclang-dev
 ```

-#### macOS
+**macOS:**

 ```bash
 brew install tesseract
 ```

-#### Windows
-
+**Windows:**
 Follow the instructions at [Tesseract GitHub repository](https://github.com/tesseract-ocr/tesseract).

-For deployment:
+### Building from Source

-- [Docker](https://docs.docker.com/get-docker/)
-- [Docker Compose](https://docs.docker.com/compose/install/)
+```bash
+# Build all crates
+cargo build

-## ⚙ Configuration
+# Build in release mode
+cargo build --release
+```
+
+### Using the CLI
+
+```bash
+# Run directly with cargo
+cargo run -p parser-cli -- path/to/file1.pdf path/to/file2.docx

-The service can be configured using the following environment variables.
+# Or use the built binary
+./target/release/parser-cli path/to/file1.pdf path/to/file2.docx
+```
+
+### Running the Web Server
+
+```bash
+# Run the web server
+cargo run -p parser-web

-- `PARSER_APP_PORT`: _INT_, The port on which the program listens on. (default: 8080)
-- `ENABLE_FILE_SERVING`: _BOOL_, Enable serving files for the frontend. (default: false, just the API is enabled)
+# With custom port
+PARSER_APP_PORT=9000 cargo run -p parser-web
+
+# With file serving enabled (for frontend)
+ENABLE_FILE_SERVING=true cargo run -p parser-web
+```

 ## 🚀 Deployment

+The easiest way to deploy the service is using Docker:
+
 ```bash
 curl -o compose.yaml https://raw.githubusercontent.com/excoffierleonard/parser/refs/heads/main/compose.yaml && \
 docker compose up -d
 ```

-## 📖 API Documentation
+### Environment Variables

-API documentation and examples are available in [docs/api.md](docs/api.md).
+- `PARSER_APP_PORT`: The port on which the web service listens (default: 8080)
+- `ENABLE_FILE_SERVING`: Enable serving frontend files (default: false)
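
Reviewer note: the two variables and their documented defaults (8080 and false) could be read along the lines of the sketch below. This is an assumption about how such settings are typically loaded, not the code in `parser-web`. `Settings`, `settings_from`, and `settings_from_env` are hypothetical names, and falling back to the default on an unparsable value is a choice made for this example only (the real service might reject it instead).

```rust
use std::env;

/// Hypothetical settings struct; field names are illustrative.
#[derive(Debug, PartialEq, Eq)]
struct Settings {
    port: u16,
    enable_file_serving: bool,
}

/// Build settings from raw optional strings so the logic is easy to test.
/// Missing or unparsable values fall back to the documented defaults.
fn settings_from(port: Option<&str>, file_serving: Option<&str>) -> Settings {
    Settings {
        // Default port 8080, as stated in the README.
        port: port
            .and_then(|p| p.trim().parse::<u16>().ok())
            .unwrap_or(8080),
        // Only the literal "true" (any ASCII case) enables file serving here.
        enable_file_serving: file_serving
            .map(|v| v.trim().eq_ignore_ascii_case("true"))
            .unwrap_or(false),
    }
}

/// Read the two variables named in the README from the process environment.
fn settings_from_env() -> Settings {
    let port = env::var("PARSER_APP_PORT").ok();
    let file_serving = env::var("ENABLE_FILE_SERVING").ok();
    settings_from(port.as_deref(), file_serving.as_deref())
}
```

Keeping the parsing in `settings_from` separate from the environment lookup means the defaults can be checked without mutating process-wide state.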

 ## 🧪 Development

-Useful commands for development:
+### Testing
+
+```bash
+# Run all tests
+cargo test --workspace
+
+# Run specific test
+cargo test test_name
+```

-- Full build:
+### Benchmarking

 ```bash
-chmod +x ./scripts/build.sh && \
-./scripts/build.sh
+# Run benchmarks
+cargo bench --workspace
+
+# Run benchmark script
+./scripts/benchmark.sh
+```
+
+### Code Quality
+
+```bash
+# Run linter
+cargo clippy --workspace -- -D warnings
+
+# Format code
+cargo fmt --all
 ```

-- Deployment tests:
+### Building with Scripts

 ```bash
-chmod +x ./scripts/deploy-tests.sh && \
+# Full build script
+./scripts/build.sh
+
+# Deployment tests
 ./scripts/deploy-tests.sh
 ```
