A high-performance, enterprise-grade HTTP/HTTPS Data Loss Prevention (DLP) proxy designed to sanitize sensitive information before it reaches external LLM endpoints.
## 📘 Documentation

Full documentation is available at https://fabriziosalmi.github.io/aidlp/ (or locally via `npm run docs:dev`).
- Overview
- Features
- Architecture
- Prerequisites
- Installation
- Configuration
- Usage
- Observability
- Troubleshooting
- Contributing
- License
## Overview

The AI DLP Proxy acts as a secure gateway, intercepting traffic to LLM providers (such as OpenAI and Anthropic) and redacting sensitive data in real time using a hybrid approach of static rules and machine learning models.
## Features

- Hybrid Redaction Engine: Combines the speed of static keyword matching (FlashText) with the intelligence of NLP models (Presidio/spaCy) to detect PII, secrets, and custom terms.
- SSL/TLS Interception: Full support for HTTPS traffic inspection via the `mitmproxy` core.
- High Performance: Asynchronous ML processing with configurable models (`en_core_web_sm` for speed) ensures minimal latency impact.
- Enterprise Observability: Native Prometheus metrics (`/metrics`) and structured JSON logging for integration with Grafana/Loki.
- Fail-Closed Security: Requests are strictly blocked (HTTP 500) if the DLP engine encounters any failure, ensuring no data leakage.
- Scalable: Dockerized (multi-stage build, Python 3.12) and load-tested to handle 1000+ concurrent connections.
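The fail-closed behavior can be sketched in a few lines (the `scan` callback and the `(status, body)` return shape are illustrative, not the project's actual API):

```python
def handle_request(body, scan):
    # Fail closed: any DLP engine error blocks the request with HTTP 500
    # instead of forwarding potentially unredacted data upstream.
    try:
        return 200, scan(body)
    except Exception:
        return 500, "DLP engine failure: request blocked"

# A healthy engine passes the sanitized body through.
status, _ = handle_request("hello", lambda b: b)
print(status)  # -> 200

# A failing engine blocks the request entirely.
def broken_engine(_body):
    raise RuntimeError("model unavailable")

status, _ = handle_request("hello", broken_engine)
print(status)  # -> 500
```

The key design point is that the `except` branch never forwards the original body, so an engine outage can never leak data.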
## Architecture

The system is built on top of mitmproxy's robust core, extended with a custom Python addon (`DLPAddon`).

1. Interception: The proxy intercepts HTTP/HTTPS `POST` requests.
2. Analysis: The request body is passed to the `DLPEngine`.
   - Static Analysis: Checks against `terms.txt` for known secrets.
   - ML Analysis: Runs Named Entity Recognition (NER) to find PII (names, phone numbers, etc.).
3. Redaction: Sensitive tokens are replaced with `[REDACTED]`.
4. Forwarding: The sanitized request is sent to the upstream server.
5. Telemetry: Metrics and logs are emitted asynchronously.
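The static-analysis and redaction steps above can be sketched as follows (a plain string scan stands in for FlashText, and the function names are illustrative, not the project's actual API):

```python
REDACTION_TOKEN = "[REDACTED]"

def load_terms(lines):
    # One sensitive term per line, as in terms.txt; blank lines ignored.
    return [t.strip() for t in lines if t.strip()]

def redact(body: str, terms) -> str:
    # Replace every occurrence of each known secret with the redaction token.
    for term in terms:
        body = body.replace(term, REDACTION_TOKEN)
    return body

terms = load_terms(["super_secret_token", "internal_db_password", ""])
print(redact("My password is super_secret_token", terms))
# -> My password is [REDACTED]
```

The real engine layers NER on top of this static pass, but the replace-then-forward shape of the pipeline is the same.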
## Prerequisites

- Python: 3.12 or higher.
- Docker: 20.10+ (for containerized deployment).
- Memory: Minimum 2 GB RAM recommended for ML models.
## Installation

### Local (Poetry)

1. Clone the repository:

   ```bash
   git clone https://github.com/fabriziosalmi/aidlp.git
   cd aidlp
   ```

2. Create a virtual environment:

   ```bash
   python3 -m venv venv
   source venv/bin/activate
   ```

3. Install dependencies:

   ```bash
   pip install poetry
   poetry install
   poetry run python -m spacy download en_core_web_lg
   ```

4. Start the proxy:

   ```bash
   poetry run python src/cli.py start --port 8080
   ```

### Docker

1. Build and run:

   ```bash
   docker-compose up --build -d
   ```

2. Verify:

   ```bash
   curl -x http://localhost:8080 http://httpbin.org/ip
   ```
## Configuration

The proxy is configured via `config.yaml` and `terms.txt`.

### config.yaml

```yaml
proxy:
  port: 8080
  metrics_port: 9090
  ssl_bump: true

dlp:
  static_terms_file: "terms.txt"
  ml_enabled: true
  ml_threshold: 0.5
  nlp_model: "en_core_web_lg" # or "en_core_web_sm" for speed
  entities: ["PERSON", "PHONE_NUMBER"] # Optional: filter specific entities

secrets_provider:
  type: "file" # or "vault"
  vault:
    url: "http://localhost:8200"
    path: "aidlp/terms"
```

### terms.txt

Add one sensitive term per line. The proxy reloads this file automatically on restart (dynamic reload planned).

```text
super_secret_token
internal_db_password
```
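The optional `entities` filter can be illustrated with a small sketch (the function name and the `(entity_type, text)` pair shape are illustrative, not the project's actual API): when the list is set, only matching entity types are kept for redaction; when unset, everything detected is redacted.

```python
def filter_entities(detected, allowed=None):
    # detected: list of (entity_type, text) pairs from NER.
    # allowed: optional entity-type whitelist from config; None/empty means all.
    if not allowed:
        return detected
    return [(etype, text) for etype, text in detected if etype in allowed]

detected = [("PERSON", "Alice"), ("EMAIL", "a@b.com"), ("PHONE_NUMBER", "555-0100")]
print(filter_entities(detected, ["PERSON", "PHONE_NUMBER"]))
# -> [('PERSON', 'Alice'), ('PHONE_NUMBER', '555-0100')]
```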
## Usage

Configure your HTTP client or environment to use the proxy.

Example (cURL):

```bash
curl -x http://localhost:8080 \
  -X POST http://httpbin.org/post \
  -d "My password is super_secret_token"
```

Output:

```json
{
  "data": "My password is [REDACTED]"
}
```

## Observability

Prometheus metrics are available at `http://localhost:9090/metrics`:
- `dlp_requests_total`: Total requests processed.
- `dlp_redacted_total`: Requests containing sensitive data.
- `dlp_pii_detected_total`: Count of PII entities by type (e.g., `EMAIL`, `PHONE_NUMBER`).
- `dlp_token_usage_total`: Estimated token usage (input/output).
- `dlp_latency_seconds`: Histogram of processing time.
- `dlp_active_connections`: Current active connections.
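A minimal Prometheus scrape job for these metrics might look like the following (the job name is an example; the target assumes the default `metrics_port` from `config.yaml`):

```yaml
scrape_configs:
  - job_name: "aidlp"
    static_configs:
      - targets: ["localhost:9090"]
```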
Logs are output in structured JSON format to stdout, suitable for ingestion by Fluentd/Logstash.
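The structured logging format can be sketched with Python's standard `logging` module (the `JsonFormatter` class and field names are illustrative, not the project's actual logger):

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    # Render each log record as a single JSON object per line,
    # suitable for line-oriented collectors like Fluentd/Logstash.
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("aidlp")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("request redacted")
# prints: {"level": "INFO", "logger": "aidlp", "msg": "request redacted"}
```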
## Troubleshooting

### Proxy fails to start (port conflict)

- Cause: Port 8080 or 9090 is occupied.
- Fix: Change `port` or `metrics_port` in `config.yaml`.

### SSL/TLS certificate errors

- Cause: The client does not trust the `mitmproxy` CA.
- Fix: Install `~/.mitmproxy/mitmproxy-ca-cert.pem` into your system or browser trust store. For `curl`, use `-k` (insecure) for testing only.

### High latency

- Cause: ML model processing on CPU.
- Fix: Ensure you are running on a machine with AVX support. For production, consider GPU acceleration (future support).
## Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines on how to get started.
## License

This project is licensed under the MIT License; see the LICENSE file for details.