shrihari7396/Pdf-Malware-Detection-System

🛡️ PDF Malware Detection System

Machine Learning + AI-Powered PDF Risk Analyzer

Spring Boot Backend | Local Ollama LLM | Streamlit Frontend


📌 Overview

This system analyzes PDF files for malware using a hybrid approach that combines:

  1. Machine Learning Model (Python) → predicts malicious / benign
  2. PDF structure feature extraction using PDFBox (Java)
  3. AI LLM-based analysis generated via a local Ollama model
  4. Streamlit Frontend → uploads files and displays the HTML report

The system produces:

  • ML prediction
  • Confidence score
  • Extracted PDF features
  • AI-generated malware analysis report (HTML)
  • The formatted report rendered inside the Streamlit UI

⭐ Features

  • 📄 PDF feature extraction (metadata, objects, scripts, encryption, page data)
  • 🧠 ML-based malware prediction (malicious, benign)
  • 🤖 LLM-generated HTML analysis via Ollama
  • 🔐 Base64-encoded HTML response for safe transport inside JSON (Base64 is an encoding, not encryption)
  • 🎨 Streamlit UI with fixed light background for dark mode users
  • ⚡ Optimized for < 15 second response time using lightweight local models
  • 🧱 Modular, scalable, extendable architecture
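
The structural signals in the feature list (objects, scripts, encryption, page data) can be approximated without a full parser. The sketch below is not the project's PDFBox extractor; it swaps in a simpler raw-byte keyword count written in Python. The function name `count_pdf_keywords` and the keyword set are assumptions chosen for illustration.

```python
import re

# Hypothetical keyword set; the PDFBox-based extractor in the backend
# may compute different or richer features.
PATTERNS = {
    "objects": rb"\bobj\b",                    # "1 0 obj" headers, not "endobj"
    "pages": rb"/Page(?![A-Za-z])",            # /Page but not /Pages
    "javascript": rb"/JavaScript(?![A-Za-z])",
    "js": rb"/JS(?![A-Za-z])",
    "open_action": rb"/OpenAction(?![A-Za-z])",
    "encrypt": rb"/Encrypt(?![A-Za-z])",
}


def count_pdf_keywords(data: bytes) -> dict[str, int]:
    """Count occurrences of each structural keyword in the raw PDF bytes."""
    return {name: len(re.findall(pattern, data)) for name, pattern in PATTERNS.items()}
```

A raw-byte scan misses names hidden inside compressed streams or written with hex escapes, which is one reason a real parser such as PDFBox is the better choice for the actual feature extraction.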

🏗️ System Architecture


[Streamlit UI]
│  Upload PDF
▼
[Spring Boot Backend]
├── PdfFeatureExtractor (PDFBox)
├── PythonPredictClient (WebClient → Python ML API)
├── AiAnalysisBuilderService (merges ML + metadata)
├── AiService (WebClient → Ollama LLM)
▼
[Local Ollama Model]
│  Returns HTML (Base64)
▼
[Streamlit → Base64 Decode → Display HTML Report]
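
The call order in the diagram can be sketched as a single function. The real backend is Java and Spring Boot; this Python version only mirrors the sequence of steps, and `scan_pdf` plus its three callable parameters are hypothetical names standing in for PdfFeatureExtractor, PythonPredictClient, and AiService.

```python
from typing import Callable


def scan_pdf(
    pdf_bytes: bytes,
    extract_features: Callable[[bytes], dict],  # stands in for PdfFeatureExtractor
    predict: Callable[[dict], dict],            # stands in for PythonPredictClient
    analyze: Callable[[dict], str],             # stands in for AiService
) -> dict:
    """Run the three stages in order and merge their results."""
    features = extract_features(pdf_bytes)
    verdict = predict(features)  # expected to hold "prediction" and "confidence"
    # Merge ML output and PDF data into one request for the LLM stage.
    request = {**verdict, "features": features, "fileSize": len(pdf_bytes)}
    html_b64 = analyze(request)
    return {**verdict, "features": features, "htmlAnalysis": html_b64}
```

Passing the stages in as callables keeps each one replaceable, which matches the modular design the diagram describes.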


📂 Project Structure


backend/
├── controller/
├── service/
│     ├── PdfFeatureExtractor.java
│     ├── PythonPredictClient.java
│     ├── AiAnalysisBuilderService.java
│     ├── AiService.java
├── dtos/
│     ├── AIAnalysisRequest.java
│     ├── PredictionResponse.java
├── config/
│     ├── WebFluxConfig.java
├── application.yml

frontend/
└── app.py (Streamlit UI)


⚙️ Technologies Used

Component        Technology
ML Model         Python, Scikit-Learn, Pickle
LLM              Ollama (Llama, Mistral, Qwen, Phi3, Gemma)
Backend          Spring Boot (Java 21)
PDF Parsing      Apache PDFBox
Networking       Spring WebClient
Frontend         Streamlit
Encoding         Base64
Output Format    HTML

🚀 Setup & Installation

1️⃣ Install Ollama

Download: https://ollama.com/download

Recommended fast models:

ollama pull phi3:mini
ollama pull qwen2.5:3b
ollama pull gemma3:latest

Run a model:

ollama run gemma3

2️⃣ Clone Repository

git clone https://github.com/shrihari7396/pdf-malware-analysis.git
cd pdf-malware-analysis

3️⃣ Backend (Spring Boot)

Install Java 21

Make sure Java 21 is installed.

Configure application.yml

server:
  port: 8081

python:
  predict:
    url: http://localhost:5000/predict

ai:
  analysis:
    url: http://localhost:8082/api/v1/ai/analyze

Run Backend

./mvnw spring-boot:run

Backend runs at:

http://localhost:8081

4️⃣ ML Prediction Server (Python)

Run:

python app.py

ML API endpoint:

POST /predict
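
The core of a /predict handler might look like the sketch below. It assumes the pickled model is a scikit-learn classifier exposing `predict_proba` and `classes_`; the function name `predict_label`, the `feature_order` parameter, and the `StubModel` class are hypothetical and exist only so the example runs without a trained model.

```python
class StubModel:
    """Stand-in for a pickled scikit-learn classifier (assumption)."""

    classes_ = ["benign", "malicious"]

    def predict_proba(self, rows):
        # Toy rule: any positive first feature looks suspicious.
        return [[0.1, 0.9] if row[0] > 0 else [0.8, 0.2] for row in rows]


def predict_label(model, features: dict, feature_order: list[str]) -> dict:
    """Turn a feature dict into the prediction/confidence pair used by /scan."""
    # Classifiers expect a fixed column order, so build the row explicitly.
    row = [[float(features.get(name, 0)) for name in feature_order]]
    probabilities = model.predict_proba(row)[0]
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return {
        "prediction": str(model.classes_[best]),
        "confidence": float(probabilities[best]),
    }
```

In the real server the stub would be replaced by the model loaded with `pickle`, and the function would be called from whatever web framework app.py uses.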

5️⃣ Streamlit Frontend

cd frontend
streamlit run app.py

Frontend opens at:

http://localhost:8501
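
With four local services involved, a quick check that each port is accepting connections can save debugging time. This is a minimal sketch using the URLs from the setup steps; the names `SERVICES`, `endpoint_of`, `is_listening`, and `report` are hypothetical.

```python
import socket
from urllib.parse import urlparse

# URLs taken from application.yml and the setup steps in this README.
SERVICES = {
    "backend": "http://localhost:8081",
    "ml-api": "http://localhost:5000/predict",
    "ai-analysis": "http://localhost:8082/api/v1/ai/analyze",
    "frontend": "http://localhost:8501",
}


def endpoint_of(url: str) -> tuple[str, int]:
    """Extract (host, port) from a URL, falling back to the scheme default."""
    parsed = urlparse(url)
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    return parsed.hostname, port


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def report() -> None:
    """Print one status line per configured service."""
    for name, url in SERVICES.items():
        host, port = endpoint_of(url)
        state = "up" if is_listening(host, port) else "DOWN"
        print(f"{name:12} {host}:{port} {state}")
```

Calling `report()` after starting everything prints one line per service. A TCP check only shows that something is bound to the port, not that the endpoint behaves correctly.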

🔍 API Documentation

📌 1. /scan (Main Endpoint)

POST http://localhost:8081/api/v1/scan
Content-Type: multipart/form-data

Request:

Upload a PDF file.

Response (PredictionResponse):

{
  "prediction": "malicious",
  "confidence": 0.998,
  "features": { ... },
  "explanation": { ... },
  "htmlAnalysis": "BASE64_STRING"
}
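
For testing /scan without the Streamlit UI, the upload can be built with the Python standard library alone. This is a sketch under assumptions: the form field name `file` is a guess (the README does not state it), and `build_multipart` and `scan` are hypothetical helper names.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode one file as a multipart/form-data body; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def scan(path: str, url: str = "http://localhost:8081/api/v1/scan") -> dict:
    """Upload a PDF to /scan and return the parsed PredictionResponse."""
    with open(path, "rb") as fh:
        # "file" is an assumed field name; check the controller for the real one.
        body, content_type = build_multipart("file", os.path.basename(path), fh.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

With the backend running, `scan("sample.pdf")["prediction"]` would give the ML verdict for that file.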

📌 2. /ai/analyze

POST http://localhost:8082/api/v1/ai/analyze

Body (AIAnalysisRequest):

{
  "prediction": "...",
  "confidence": 0.58,
  "features": { ... },
  "extractedText": "...",
  "metadata": { ... },
  "fileName": "sample.pdf",
  "fileSize": 12345
}

Response:

Base64-encoded HTML string.
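
The request body above can be modeled as a small data class so that field names stay consistent between callers. This is a Python illustration of the JSON shape only; the Java DTO AIAnalysisRequest may differ in detail, and `AnalysisRequest` and `to_json` are hypothetical names.

```python
import json
from dataclasses import dataclass, field


@dataclass
class AnalysisRequest:
    """Python mirror of the /ai/analyze JSON body (illustrative only)."""

    prediction: str
    confidence: float
    file_name: str
    file_size: int
    extracted_text: str = ""
    features: dict = field(default_factory=dict)
    metadata: dict = field(default_factory=dict)

    def to_json(self) -> str:
        # Map snake_case attributes to the camelCase keys shown in the README.
        return json.dumps({
            "prediction": self.prediction,
            "confidence": self.confidence,
            "features": self.features,
            "extractedText": self.extracted_text,
            "metadata": self.metadata,
            "fileName": self.file_name,
            "fileSize": self.file_size,
        })
```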


🧠 Optimized LLM Prompt

Generate a clean HTML malware report inside a single <div>.
Return ONLY <div>...</div>.
Use inline CSS only.
Highlight malicious indicators in red and safe indicators in green.
Keep the report concise.

DATA:
FileName=%s
FileSize=%d
Prediction=%s
Confidence=%.2f
Features=%s
TextSummary=%s
Metadata=%s
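
The placeholders in the prompt use printf-style formatting, which Python's `%` operator also understands. The sketch below fills the template in Python to show what the final prompt looks like; the backend presumably does the equivalent in Java, and `build_prompt` is a hypothetical helper name.

```python
import json

PROMPT_TEMPLATE = """Generate a clean HTML malware report inside a single <div>.
Return ONLY <div>...</div>.
Use inline CSS only.
Highlight malicious indicators in red and safe indicators in green.
Keep the report concise.

DATA:
FileName=%s
FileSize=%d
Prediction=%s
Confidence=%.2f
Features=%s
TextSummary=%s
Metadata=%s
"""


def build_prompt(file_name: str, file_size: int, prediction: str, confidence: float,
                 features: dict, text_summary: str, metadata: dict) -> str:
    """Fill the template; dicts are serialized as compact, key-sorted JSON."""
    return PROMPT_TEMPLATE % (
        file_name,
        file_size,
        prediction,
        confidence,
        json.dumps(features, sort_keys=True),
        text_summary,
        json.dumps(metadata, sort_keys=True),
    )
```

Serializing the dictionaries with sorted keys keeps the prompt deterministic for the same input, which makes slow or odd model responses easier to reproduce.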

🎨 Streamlit UI (Final Version)

Fix for dark mode + display issue:

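import streamlit.components.v1 as components

# decoded_html is the Base64-decoded "htmlAnalysis" string from the /scan response.
wrapped_html = f"""
<div style="background:white; color:black; padding:20px;">
    {decoded_html}
</div>
"""
# Render in an iframe with a fixed light background so dark mode stays readable.
components.html(wrapped_html, height=1200, scrolling=True)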
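
The decoding step that produces `decoded_html` is not shown in the snippet. A minimal sketch, assuming the `htmlAnalysis` field holds standard Base64 of UTF-8 HTML, is below; `wrap_report` is a hypothetical helper name.

```python
import base64


def wrap_report(html_b64: str) -> str:
    """Decode the Base64 report and wrap it in a light-background container."""
    decoded_html = base64.b64decode(html_b64).decode("utf-8")
    return (
        '<div style="background:white; color:black; padding:20px;">\n'
        f"    {decoded_html}\n"
        "</div>"
    )
```

In app.py the result would be passed straight to the Streamlit HTML component, for example `wrap_report(result["htmlAnalysis"])`, where `result` is the parsed /scan response.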

⚡ Performance Optimization

✔ Recommended fast local models:

  • phi3:mini
  • qwen2.5:3b
  • mistral:7b

✔ Reduce extracted text to 300–400 characters

✔ Use shorter prompts

✔ Enable GPU (Windows cmd syntax shown; on Linux or macOS use export instead of set):

set OLLAMA_USE_CUDA=1

⚡ Result:

  • ❌ Before: ~2 minutes
  • ✅ After: 6–15 seconds
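
Trimming the extracted text to the 300–400 character range is simple to do before building the prompt. This is one possible sketch; `trim_text` and its default limit are assumptions, not the project's code.

```python
def trim_text(text: str, limit: int = 400) -> str:
    """Collapse whitespace and cut the text to at most `limit` characters."""
    collapsed = " ".join(text.split())
    if len(collapsed) <= limit:
        return collapsed
    cut = collapsed[:limit]
    # Prefer ending on a word boundary when the cut contains a space.
    head, sep, _ = cut.rpartition(" ")
    return head if sep else cut
```

Collapsing whitespace first matters for PDFs, where extracted text often contains long runs of newlines and spaces that would otherwise consume most of the character budget.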

🛠 Troubleshooting

❌ Slow inference (> 60s)

➡ Switch to smaller models and trim the prompt

➡ Enable CUDA

❌ Dark-mode unreadable text in Streamlit

➡ Wrap HTML with white background container

❌ @Value injection issue

➡ Use correct import:

import org.springframework.beans.factory.annotation.Value;

❌ WebClient is null

➡ Add a bean in a @Configuration class (for example config/WebFluxConfig.java):

@Bean
public WebClient.Builder webClientBuilder() {
    return WebClient.builder();
}

🏁 Conclusion

This system is:

✔ Fully modular

✔ Fast and optimized

✔ Local and privacy-preserving

✔ Able to produce professional HTML reports

✔ Scalable for enterprise cybersecurity workflows


About

AI-powered PDF malware detection system combining Machine Learning, PDF structural analysis, and local LLM-generated security reports using Spring Boot, Python, and Ollama.
