A scalable semantic search engine built with FastAPI. Users upload PDF documents, the system extracts and embeds their contents automatically, and queries run as semantic search with metadata-aware filtering across the stored documents.
The system follows a clean CSR (Controller-Service-Repository) architecture, supports tag-based filtering, and is designed to be extensible for multilingual embeddings.
PDF Upload
- Upload PDF files via API
- Automatic text extraction per page
- Intelligent chunking for semantic indexing
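The extraction and chunking steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: `extract_pages` and `chunk_words` are hypothetical names, and the word-window strategy with overlap is an assumed stand-in for whatever "intelligent chunking" the service really applies. Only `PdfReader`, `reader.pages`, and `page.extract_text()` come from pypdf.

```python
from typing import Iterator


def extract_pages(pdf_path: str) -> list[str]:
    """Return the text of each page. Requires the third-party pypdf package."""
    from pypdf import PdfReader  # lazy import so chunk_words works without pypdf

    reader = PdfReader(pdf_path)
    # Pages without extractable text (e.g. scanned images) become empty strings.
    return [page.extract_text() or "" for page in reader.pages]


def chunk_words(text: str, size: int = 200, overlap: int = 40) -> Iterator[str]:
    """Yield windows of `size` words; each shares `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    for start in range(0, len(words), step):
        yield " ".join(words[start:start + size])
        if start + size >= len(words):
            break  # this window already reached the end of the text
```

Overlapping windows keep a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of storing some words twice.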
Semantic Search
- Vector-based similarity search using embeddings
- Natural language queries (not keyword-only)
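To make "vector-based similarity search" concrete, here is a small pure-Python sketch of ranking stored chunks by cosine similarity to a query vector. The function names and the three-dimensional toy vectors are assumptions for illustration; in the real system the vectors would come from the embedding model and the ranking would be delegated to the vector database.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 3) -> list[tuple[str, float]]:
    """Return the k (chunk_id, score) pairs most similar to the query, best first."""
    scored = [(chunk_id, cosine_similarity(query_vec, vec)) for chunk_id, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy index: chunk id -> embedding vector (hypothetical values).
toy_index = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.6, 0.8, 0.0],
    "chunk-c": [0.0, 0.0, 1.0],
}
results = top_k([1.0, 0.0, 0.0], toy_index, k=2)
```

Because the score depends on direction rather than exact wording, a query can match a chunk that shares no keywords with it, provided the embedding model places the two close together.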
Tag Support
- Assign multiple tags to PDFs (e.g. AI, ML, transformers)
- Filter search results by tag
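Tag filtering can be pictured as a post-filter over search hits, as in the sketch below. `SearchHit`, `normalize_tag`, and `filter_by_tags` are hypothetical names, and case-insensitive matching plus the any/all switch are assumptions; the project may instead push the filter down into the vector database query.

```python
from dataclasses import dataclass, field


@dataclass
class SearchHit:
    chunk_id: str
    score: float
    tags: set[str] = field(default_factory=set)


def normalize_tag(tag: str) -> str:
    """Compare tags case-insensitively and ignore surrounding whitespace."""
    return tag.strip().lower()


def filter_by_tags(hits: list[SearchHit], required: list[str], match_all: bool = False) -> list[SearchHit]:
    """Keep hits carrying any (default) or all of the required tags."""
    wanted = {normalize_tag(t) for t in required}
    if not wanted:
        return hits  # no filter requested
    kept = []
    for hit in hits:
        have = {normalize_tag(t) for t in hit.tags}
        matches = wanted <= have if match_all else bool(wanted & have)
        if matches:
            kept.append(hit)
    return kept
```

For example, filtering on `["ml"]` keeps a hit tagged `{"AI", "ML"}`, while `match_all=True` with `["AI", "transformers"]` drops it because the second tag is missing.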
Multi-Language Ready
- Supports multilingual embedding models
- Language stored as metadata per document
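One way to store language as per-document metadata is sketched below. It uses a standard-library dataclass as a stand-in for the Pydantic models the project lists, and every name here (`DocumentMetadata`, `as_chunk_metadata`, the `page` key) is hypothetical. The two-letter language code check is also an assumption, not a documented rule of the API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentMetadata:
    filename: str
    language: str = "en"  # assumed default; two-letter code such as "en" or "ar"

    def __post_init__(self) -> None:
        code = self.language.strip().lower()
        if len(code) != 2 or not code.isalpha():
            raise ValueError(f"expected a two-letter language code, got {self.language!r}")
        # frozen dataclasses need object.__setattr__ to store the normalized value
        object.__setattr__(self, "language", code)

    def as_chunk_metadata(self, page: int) -> dict[str, object]:
        """Flat key/value metadata to attach to every chunk of this document."""
        return {"filename": self.filename, "language": self.language, "page": page}
```

Keeping the metadata flat (strings and integers only) makes it straightforward to pass to a vector store and to filter on language at query time.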
Clean Architecture (CSR)
- Controller layer (FastAPI routes)
- Service layer (business logic)
- Repository layer (data + vector DB)
- Client layer (embedding models)
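The layering above can be illustrated with a dependency-injected sketch in plain Python. All class and function names are hypothetical, the in-memory repository and the toy embedder are stand-ins for the real vector database and embedding model, and the controller is a plain function where the project would use a FastAPI route.

```python
from typing import Protocol


class Embedder(Protocol):  # client layer contract
    def embed(self, text: str) -> list[float]: ...


class ChunkRepository(Protocol):  # repository layer contract
    def add(self, chunk_id: str, vector: list[float], text: str) -> None: ...
    def count(self) -> int: ...


class LengthEmbedder:
    """Toy stand-in for a real embedding model: character and word counts."""

    def embed(self, text: str) -> list[float]:
        return [float(len(text)), float(len(text.split()))]


class InMemoryRepository:
    """Toy stand-in for the vector database."""

    def __init__(self) -> None:
        self._rows: dict[str, tuple[list[float], str]] = {}

    def add(self, chunk_id: str, vector: list[float], text: str) -> None:
        self._rows[chunk_id] = (vector, text)

    def count(self) -> int:
        return len(self._rows)


class EntryService:  # service layer: business logic only, no HTTP or storage details
    def __init__(self, embedder: Embedder, repo: ChunkRepository) -> None:
        self._embedder = embedder
        self._repo = repo

    def ingest(self, doc_id: str, chunks: list[str]) -> int:
        for i, chunk in enumerate(chunks):
            self._repo.add(f"{doc_id}:{i}", self._embedder.embed(chunk), chunk)
        return len(chunks)


def create_entry(service: EntryService, doc_id: str, chunks: list[str]) -> dict[str, object]:
    """Controller: translate the request into a service call and shape the response."""
    return {"doc_id": doc_id, "chunks_stored": service.ingest(doc_id, chunks)}
```

Because the service depends only on the two protocols, swapping the embedding model or the vector store means writing a new client or repository class without touching the controller or the business logic.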
Project Structure
text_embedding_system/
├── app
│   ├── main.py
│   ├── config.py
│   ├── models.py
│   ├── controllers
│   │   ├── entries.py
│   │   └── search.py
│   ├── services
│   │   ├── entry_service.py
│   │   └── search_service.py
│   ├── repository
│   │   └── dataset_repo.py
│   └── clients
│       ├── embedder_client.py
│       └── faiss_client.py
└── requirements.txt
Tech Stack
- Backend: FastAPI
- Language: Python 3.10+
- PDF Parsing: pypdf
- Vector Database: ChromaDB
- Embeddings: Sentence Transformers
- Validation: Pydantic
- Architecture: CSR Pattern
Author: Mostafa Abdelhamed