🛡️ PDF Malware Detection System

Machine Learning + AI-Powered PDF Risk Analyzer

Spring Boot Backend | Local Ollama LLM | Streamlit Frontend

📌 Overview

This system analyzes PDF files for malware using a hybrid approach that combines:

Machine Learning Model (Python) → predicts malicious / benign
PDF structure feature extraction using PDFBox (Java)
LLM-based analysis generated by a local Ollama model
Streamlit Frontend → uploads files and displays the HTML report

The system produces:

ML prediction
Confidence score
Extracted PDF features
AI-generated malware analysis report (HTML)
Formatted report rendered inside the Streamlit UI

⭐ Features

📄 PDF feature extraction (metadata, objects, scripts, encryption, page data)
🧠 ML-based malware prediction (malicious, benign)
🤖 LLM-generated HTML analysis via Ollama
🔐 Base64-encoded HTML response, so the report travels safely inside the JSON payload (Base64 is an encoding, not encryption)
🎨 Streamlit UI with fixed light background for dark mode users
⚡ Optimized for < 15 second response time using lightweight local models
🧱 Modular, scalable, extendable architecture
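
To make the "PDF feature extraction" item above more concrete, here is a minimal sketch of one kind of structural feature: counting well-known PDF name tokens in the raw bytes. It is a simplified Python stand-in and not the project's PDFBox-based `PdfFeatureExtractor`; the function name `count_pdf_tokens` and the token list are assumptions chosen for illustration.

```python
# Hypothetical sketch: count a few well-known PDF name tokens in raw bytes.
# The token list is an assumption, not the feature set used by the project.
SUSPICIOUS_TOKENS = (
    b"/JavaScript",
    b"/JS",
    b"/OpenAction",
    b"/Launch",
    b"/EmbeddedFile",
    b"/Encrypt",
)


def count_pdf_tokens(pdf_bytes: bytes) -> dict:
    """Return {token name without the leading slash: number of occurrences}."""
    return {
        token.decode("ascii").lstrip("/"): pdf_bytes.count(token)
        for token in SUSPICIOUS_TOKENS
    }
```

A plain byte search like this misses names that are hex-escaped or stored inside compressed streams, which is one reason to use a real parser such as PDFBox for the actual extraction.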

🏗️ System Architecture


[Streamlit UI]
│  Upload PDF
▼
[Spring Boot Backend]
├── PdfFeatureExtractor (PDFBox)
├── PythonPredictClient (WebClient → Python ML API)
├── AiAnalysisBuilderService (merges ML + metadata)
├── AiService (WebClient → Ollama LLM)
▼
[Local Ollama Model]
│  Returns HTML (Base64)
▼
[Streamlit → Base64 Decode → Display HTML Report]
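
The data flow in the diagram above can be sketched as a single function that calls each stage in order. This is only an illustration in Python: the real backend is Java, and the function name `scan_pipeline` and the callable parameters are hypothetical stand-ins for the services named in the diagram.

```python
from typing import Callable, Tuple


def scan_pipeline(
    pdf_bytes: bytes,
    extract: Callable[[bytes], dict],               # stands in for PdfFeatureExtractor
    predict: Callable[[dict], Tuple[str, float]],   # stands in for PythonPredictClient
    analyze: Callable[[dict], str],                 # stands in for AiService (returns Base64 HTML)
) -> dict:
    """Run the stages in the order shown in the architecture diagram."""
    features = extract(pdf_bytes)
    prediction, confidence = predict(features)

    # Merge ML output and features into one request
    # (the role of AiAnalysisBuilderService in the diagram).
    ai_request = {
        "prediction": prediction,
        "confidence": confidence,
        "features": features,
    }
    html_b64 = analyze(ai_request)

    return {
        "prediction": prediction,
        "confidence": confidence,
        "features": features,
        "htmlAnalysis": html_b64,
    }
```

Passing the stages in as callables keeps the sketch testable without a running ML server or Ollama instance.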

📂 Project Structure


backend/
├── controller/
├── service/
│     ├── PdfFeatureExtractor.java
│     ├── PythonPredictClient.java
│     ├── AiAnalysisBuilderService.java
│     ├── AiService.java
├── dtos/
│     ├── AIAnalysisRequest.java
│     ├── PredictionResponse.java
├── config/
│     ├── WebFluxConfig.java
├── application.yml

frontend/
└── app.py (Streamlit UI)

⚙️ Technologies Used

| Component | Technology |
|---|---|
| ML Model | Python, Scikit-Learn, Pickle |
| LLM | Ollama (Llama, Mistral, Qwen, Phi3, Gemma) |
| Backend | Spring Boot (Java 21) |
| PDF Parsing | Apache PDFBox |
| Networking | Spring WebClient |
| Frontend | Streamlit |
| Encoding | Base64 |
| Output Format | HTML |

🚀 Setup & Installation

1️⃣ Install Ollama

Download: https://ollama.com/download

Recommended fast models:

ollama pull phi3:mini
ollama pull qwen2.5:3b
ollama pull gemma3:latest

Run a model:

ollama run gemma3

2️⃣ Clone Repository

git clone https://github.com/shrihari7396/pdf-malware-analysis.git
cd pdf-malware-analysis

3️⃣ Backend (Spring Boot)

Install Java 21

Make sure Java 21 is installed.

Configure `application.yml`

server:
  port: 8081

python:
  predict:
    url: http://localhost:5000/predict

ai:
  analysis:
    url: http://localhost:8082/api/v1/ai/analyze

Run Backend

./mvnw spring-boot:run

Backend runs at:

http://localhost:8081

4️⃣ ML Prediction Server (Python)

Run:

python app.py

ML API endpoint:

POST /predict
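
As a rough sketch of what a `/predict` handler might compute, the helper below turns class probabilities into the `prediction` and `confidence` fields used elsewhere in this README. It assumes a model object with a scikit-learn style `predict_proba` method; the function name `build_prediction` and the label order are assumptions, not the project's actual code.

```python
def build_prediction(model, feature_vector, labels=("benign", "malicious")):
    """Return the most probable label and its probability.

    `model` is any object exposing a scikit-learn style predict_proba(X).
    `labels` must follow the model's class order (model.classes_ in
    scikit-learn); the default order here is an assumption.
    """
    probabilities = model.predict_proba([feature_vector])[0]
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return {
        "prediction": labels[best],
        "confidence": float(probabilities[best]),
    }
```

If the model is loaded with `pickle`, load only files you trust, because unpickling can execute arbitrary code.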

5️⃣ Streamlit Frontend

cd frontend
streamlit run app.py

Frontend opens at:

http://localhost:8501

🔍 API Documentation

📌 1. `/scan` (Main Endpoint)

POST http://localhost:8081/api/v1/scan
Content-Type: multipart/form-data

Request:

Upload a PDF file.

Response (`PredictionResponse`):

{
  "prediction": "malicious",
  "confidence": 0.998,
  "features": { ... },
  "explanation": { ... },
  "htmlAnalysis": "BASE64_STRING"
}
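
A minimal client sketch for this endpoint, using only the Python standard library. The multipart field name `file` is an assumption (check the controller for the actual parameter name), and the helper names are hypothetical.

```python
import base64
import json
import os
import urllib.request
import uuid

SCAN_URL = "http://localhost:8081/api/v1/scan"  # endpoint documented above


def build_multipart(field_name, file_name, data, boundary=None):
    """Build a single-file multipart/form-data body and its Content-Type."""
    boundary = boundary or uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{file_name}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def scan_pdf(path, field_name="file"):  # "file" is an assumed field name
    """Upload a PDF to /scan and return the parsed JSON response."""
    with open(path, "rb") as handle:
        body, content_type = build_multipart(
            field_name, os.path.basename(path), handle.read()
        )
    request = urllib.request.Request(
        SCAN_URL,
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read().decode("utf-8"))


def decode_report(scan_response):
    """Decode the Base64 `htmlAnalysis` field into an HTML string."""
    return base64.b64decode(scan_response["htmlAnalysis"]).decode("utf-8")
```

With the backend running, `decode_report(scan_pdf("sample.pdf"))` would return the HTML report as text.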

📌 2. `/ai/analyze`

POST http://localhost:8082/api/v1/ai/analyze

Body (`AIAnalysisRequest`):

{
  "prediction": "...",
  "confidence": 0.58,
  "features": { ... },
  "extractedText": "...",
  "metadata": { ... },
  "fileName": "sample.pdf",
  "fileSize": 12345
}

Response:

Base64-encoded HTML string.
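
For callers written in Python, the request body above can be mirrored with a small dataclass. This is a sketch only: the field names follow the JSON shown above, while the class and its `to_json` method are a hypothetical Python counterpart of the Java `AIAnalysisRequest` DTO.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class AIAnalysisRequest:
    # Field names match the JSON keys documented above.
    prediction: str
    confidence: float
    features: dict
    extractedText: str
    metadata: dict
    fileName: str
    fileSize: int

    def to_json(self) -> str:
        """Serialize the request to the JSON body expected by /ai/analyze."""
        return json.dumps(asdict(self))
```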

🧠 Optimized LLM Prompt

Generate a clean HTML malware report inside a single <div>.
Return ONLY <div>...</div>.
Use inline CSS only.
Highlight malicious indicators in red and safe indicators in green.
Keep the report concise.

DATA:
FileName=%s
FileSize=%d
Prediction=%s
Confidence=%.2f
Features=%s
TextSummary=%s
Metadata=%s
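
The placeholders in the template above can be filled with ordinary `%` formatting. The sketch below does this in Python and builds a request body for Ollama's `POST /api/generate` endpoint. How the project's `AiService` actually calls Ollama is not shown in this README, so the helper names, the JSON serialization of `Features` and `Metadata`, and the default model are assumptions.

```python
import json

PROMPT_TEMPLATE = """Generate a clean HTML malware report inside a single <div>.
Return ONLY <div>...</div>.
Use inline CSS only.
Highlight malicious indicators in red and safe indicators in green.
Keep the report concise.

DATA:
FileName=%s
FileSize=%d
Prediction=%s
Confidence=%.2f
Features=%s
TextSummary=%s
Metadata=%s
"""


def build_prompt(file_name, file_size, prediction, confidence,
                 features, text_summary, metadata):
    """Fill the template; dicts are serialized as JSON (an assumption)."""
    return PROMPT_TEMPLATE % (
        file_name,
        file_size,
        prediction,
        confidence,
        json.dumps(features),
        text_summary,
        json.dumps(metadata),
    )


def build_ollama_payload(prompt, model="gemma3"):
    """Body for Ollama's /api/generate; stream=False requests one JSON reply."""
    return {"model": model, "prompt": prompt, "stream": False}
```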

🎨 Streamlit UI (Final Version)

Wrap the decoded report in a light container so it stays readable when Streamlit runs in dark mode:

wrapped_html = f"""
<div style="background:white; color:black; padding:20px;">
    {decoded_html}
</div>
"""
st.components.v1.html(wrapped_html, height=1200, scrolling=True)
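
The snippet above assumes `decoded_html` already exists. A slightly fuller sketch, including the Base64 decoding step, might look like the following; the helper names `wrap_report` and `show_report` are hypothetical.

```python
import base64


def wrap_report(html_b64: str) -> str:
    """Decode the Base64 report and wrap it in a light container."""
    decoded_html = base64.b64decode(html_b64).decode("utf-8")
    return (
        '<div style="background:white; color:black; padding:20px;">\n'
        f"    {decoded_html}\n"
        "</div>"
    )


def show_report(html_b64: str) -> None:
    """Render the wrapped report inside a Streamlit app."""
    # Imported here so wrap_report can be used and tested without Streamlit.
    import streamlit.components.v1 as components

    components.html(wrap_report(html_b64), height=1200, scrolling=True)
```

Keeping the decoding in a plain function makes it easy to unit-test separately from the UI.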

⚡ Performance Optimization

✔ Recommended fast local models:

phi3:mini
qwen2.5:3b
mistral:7b

✔ Reduce extracted text to 300–400 characters

✔ Use shorter prompts

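✔ Use the GPU:

Ollama uses a supported GPU automatically when the drivers are installed. Check that the loaded model is running on the GPU:

ollama ps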

⚡ Result:

❌ Before: ~2 minutes
✅ After: 6–15 seconds
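
The text-reduction step listed above can be as simple as collapsing whitespace and cutting the string at a fixed length. A minimal sketch, assuming a hypothetical helper name `trim_text` and a default limit of 400 characters:

```python
def trim_text(text: str, limit: int = 400) -> str:
    """Collapse runs of whitespace, then keep at most `limit` characters."""
    collapsed = " ".join(text.split())
    return collapsed[:limit]
```

Collapsing whitespace first means the character budget is spent on content rather than on line breaks and padding from the PDF text layer.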

🛠 Troubleshooting

❌ Slow inference (> 60s)

➡ Switch to a smaller model and trim the prompt
➡ Make sure Ollama is running on the GPU

❌ Dark-mode unreadable text in Streamlit

➡ Wrap HTML with white background container

❌ `@Value` injection issue

➡ Import Spring's `@Value`, not a same-named annotation from another library (for example `lombok.Value`):

import org.springframework.beans.factory.annotation.Value;

❌ WebClient is null

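➡ Add this bean to a `@Configuration` class (for example `WebFluxConfig.java`) and inject the builder where the client is created: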

@Bean
public WebClient.Builder webClientBuilder() {
    return WebClient.builder();
}

🏁 Conclusion

This system is:

✔ Fully modular
✔ Fast and optimized
✔ Local and privacy-preserving
✔ Produces professional HTML reports
✔ Scalable for enterprise cybersecurity workflows
