This system analyzes PDF files for malware using a hybrid approach that combines:
- Machine Learning Model (Python) → predicts malicious / benign
- PDF structure feature extraction using PDFBox (Java)
- LLM-based analysis generated by a local Ollama model
- Streamlit Frontend → uploads files and displays the HTML report
The system produces:
- ML prediction
- Confidence score
- Extracted PDF features
- AI-generated malware analysis report (HTML)
- Styled HTML report rendered inside the Streamlit UI
- 📄 PDF feature extraction (metadata, objects, scripts, encryption, page data)
- 🧠 ML-based malware prediction (`malicious`, `benign`)
- 🤖 LLM-generated HTML analysis via Ollama
- 🔐 Base64-encoded HTML response for safe transport inside JSON (an encoding, not encryption)
- 🎨 Streamlit UI with fixed light background for dark mode users
- ⚡ Optimized for < 15 second response time using lightweight local models
- 🧱 Modular, scalable, extensible architecture
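
The feature list mentions scripts and encryption among the extracted signals. The backend does this with PDFBox in Java; purely as a rough illustration of the idea, the sketch below counts a few well-known PDF name tokens directly in the raw bytes. The function name `count_pdf_keywords` and the keyword set are assumptions made for this example, not the project's actual feature set, and a plain byte scan like this misses tokens hidden inside compressed or obfuscated streams.

```python
import re

# Hypothetical keyword set for illustration; the real extractor's features may differ.
KEYWORDS = ("/JavaScript", "/JS", "/OpenAction", "/Encrypt", "/Page")


def count_pdf_keywords(data: bytes) -> dict:
    """Count occurrences of selected PDF name tokens and indirect objects."""
    counts = {}
    for name in KEYWORDS:
        # The negative lookahead stops "/Page" from also matching "/Pages".
        pattern = re.escape(name.encode("ascii")) + rb"(?![A-Za-z])"
        counts[name] = len(re.findall(pattern, data))
    # Indirect object headers look like "12 0 obj".
    counts["objects"] = len(re.findall(rb"\b\d+\s+\d+\s+obj\b", data))
    return counts
```

Counts like these could be passed to the ML model as numeric features, alongside whatever the Java extractor actually produces.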
[Streamlit UI]
│ Upload PDF
▼
[Spring Boot Backend]
├── PdfFeatureExtractor (PDFBox)
├── PythonPredictClient (WebClient → Python ML API)
├── AiAnalysisBuilderService (merges ML + metadata)
├── AiService (WebClient → Ollama LLM)
▼
[Local Ollama Model]
│ Returns HTML (Base64)
▼
[Streamlit → Base64 Decode → Display HTML Report]
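
The last two stages of the diagram hand the report over as Base64 text and decode it again in the UI. A minimal sketch of that round trip with Python's standard `base64` module follows; the function names `encode_report` and `decode_report` are hypothetical, and UTF-8 is assumed as the text encoding.

```python
import base64


def encode_report(html: str) -> str:
    """Encode an HTML report as Base64 text (UTF-8 assumed)."""
    return base64.b64encode(html.encode("utf-8")).decode("ascii")


def decode_report(encoded: str) -> str:
    """Decode a Base64 string back into HTML for rendering in the UI."""
    return base64.b64decode(encoded).decode("utf-8")
```

Base64 only makes the HTML safe to embed in a JSON string; the decoded markup is exactly what was encoded, so it should still be treated as untrusted model output.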
backend/
├── controller/
├── service/
│ ├── PdfFeatureExtractor.java
│ ├── PythonPredictClient.java
│ ├── AiAnalysisBuilderService.java
│ ├── AiService.java
├── dtos/
│ ├── AIAnalysisRequest.java
│ ├── PredictionResponse.java
├── config/
│ ├── WebFluxConfig.java
├── application.yml
frontend/
└── app.py (Streamlit UI)
| Component | Technology |
|---|---|
| ML Model | Python, Scikit-Learn, Pickle |
| LLM | Ollama (Llama, Mistral, Qwen, Phi3, Gemma) |
| Backend | Spring Boot (Java 21) |
| PDF Parsing | Apache PDFBox |
| Networking | Spring WebClient |
| Frontend | Streamlit |
| Encoding | Base64 |
| Output Format | HTML |
Download: https://ollama.com/download

Recommended fast models:
ollama pull phi3:3b
ollama pull qwen2.5:3b
ollama pull gemma3:latest

Run a model:
ollama run gemma3

git clone https://github.com/shrihari7396/pdf-malware-analysis.git
cd pdf-malware-analysis

Make sure Java 21 is installed.
server:
  port: 8081
python:
  predict:
    url: http://localhost:5000/predict
ai:
  analysis:
    url: http://localhost:8082/api/v1/ai/analyze

./mvnw spring-boot:run

Backend runs at:
http://localhost:8081
Run:
python app.py

ML API endpoint:
POST /predict
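
As an illustration of what the `/predict` handler might compute, the sketch below derives a label and a confidence score from a scikit-learn-style classifier. It assumes the model exposes `predict_proba` and `classes_` (the usual scikit-learn classifier convention); the function name `predict_with_confidence`, the feature vector layout, and the `StubModel` stand-in are hypothetical and are not taken from the project's code.

```python
from typing import Sequence


def predict_with_confidence(model, features: Sequence[float]) -> dict:
    """Return the most probable label and its probability for one sample.

    Assumes `model` follows the scikit-learn classifier convention:
    `predict_proba(X)` returns one row of class probabilities per sample,
    ordered like `model.classes_`.
    """
    probabilities = list(model.predict_proba([list(features)])[0])
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return {
        "prediction": str(model.classes_[best]),
        "confidence": float(probabilities[best]),
    }


class StubModel:
    """Stand-in for a pickled classifier so the sketch runs without scikit-learn."""

    classes_ = ["benign", "malicious"]

    def predict_proba(self, rows):
        # Toy rule: a positive first feature is treated as suspicious.
        return [[0.1, 0.9] if row[0] > 0 else [0.8, 0.2] for row in rows]
```

In the real service the stub would be replaced by the model loaded from its pickle file, and the returned dictionary would be serialized as the JSON reply.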
cd frontend
streamlit run app.py

Frontend opens at:
http://localhost:8501
POST http://localhost:8081/api/v1/scan
Content-Type: multipart/form-data
Upload a PDF file.
{
  "prediction": "malicious",
  "confidence": 0.998,
  "features": { ... },
  "explanation": { ... },
  "htmlAnalysis": "BASE64_STRING"
}

POST http://localhost:8082/api/v1/ai/analyze
{
  "prediction": "...",
  "confidence": 0.58,
  "features": { ... },
  "extractedText": "...",
  "metadata": { ... },
  "fileName": "sample.pdf",
  "fileSize": 12345
}

Base64-encoded HTML string.
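
To make the request shape concrete, here is a sketch that assembles the same JSON body using only the Python standard library. It builds the request object without sending it. The helper name `build_analysis_request` and the `Content-Type: application/json` header are assumptions; the field names and URL are the ones shown in this README.

```python
import json
import urllib.request

AI_ANALYZE_URL = "http://localhost:8082/api/v1/ai/analyze"


def build_analysis_request(prediction, confidence, features, extracted_text,
                           metadata, file_name, file_size, url=AI_ANALYZE_URL):
    """Build (but do not send) the POST request for the AI analysis endpoint."""
    payload = {
        "prediction": prediction,
        "confidence": confidence,
        "features": features,
        "extractedText": extracted_text,
        "metadata": metadata,
        "fileName": file_name,
        "fileSize": file_size,
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        # Assumed header: the body is JSON, so a JSON content type is the natural choice.
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; the reply body would then need Base64 decoding before display.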
Generate a clean HTML malware report inside a single <div>.
Return ONLY <div>...</div>.
Use inline CSS only.
Highlight malicious indicators in red and safe indicators in green.
Keep the report concise.
DATA:
FileName=%s
FileSize=%d
Prediction=%s
Confidence=%.2f
Features=%s
TextSummary=%s
Metadata=%s
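
The placeholders in this template (`%s`, `%d`, `%.2f`) are printf-style, so the same text can be filled in with Python's `%` operator as well as with Java's `String.format`. The sketch below shows one way to do that; `build_prompt` and the 500-character cap on the extracted text are assumptions for illustration, chosen to keep the prompt short for small local models.

```python
PROMPT_TEMPLATE = """Generate a clean HTML malware report inside a single <div>.
Return ONLY <div>...</div>.
Use inline CSS only.
Highlight malicious indicators in red and safe indicators in green.
Keep the report concise.
DATA:
FileName=%s
FileSize=%d
Prediction=%s
Confidence=%.2f
Features=%s
TextSummary=%s
Metadata=%s
"""

MAX_TEXT_CHARS = 500  # hypothetical cap, not a value from the project


def build_prompt(file_name, file_size, prediction, confidence,
                 features, text, metadata):
    """Fill the template; long extracted text is truncated to keep the prompt small."""
    summary = text[:MAX_TEXT_CHARS]
    return PROMPT_TEMPLATE % (
        file_name, file_size, prediction, confidence,
        features, summary, metadata,
    )
```

Note that `%.2f` rounds the confidence to two decimals, so a score of 0.998 appears in the prompt as 1.00.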
Fix for dark mode + display issue:
wrapped_html = f"""
<div style="background:white; color:black; padding:20px;">
{decoded_html}
</div>
"""
st.components.v1.html(wrapped_html, height=1200, scrolling=True)

- phi3:3b
- qwen2.5:3b
- mistral:7b
set OLLAMA_USE_CUDA=1

- ❌ Before: ~2 minutes
- ✅ After: 6–15 seconds

➡ Switch to smaller models + trim prompt
➡ Enable CUDA
➡ Wrap HTML with white background container
➡ Use correct import:
import org.springframework.beans.factory.annotation.Value;

➡ Add Bean:
@Bean
public WebClient.Builder webClientBuilder() {
    return WebClient.builder();
}

This system is:
- ✔ Fully modular
- ✔ Fast and optimized
- ✔ Local and privacy-preserving
- ✔ Produces professional HTML reports
- ✔ Scalable for enterprise cybersecurity workflows