Description
This repository contains a prototype service for recognizing product “infomodels” and generating descriptions and summaries based on them.
The prototype was developed during the AI Product Hack hackathon by the ЭЯЙ team.
Citilink aims to provide its customers with as much product information (content) as possible.
A product page must contain complete information about the item, including technical specifications and a marketing description.
Manual completion is slow, while generative technologies can fill in product pages quickly and efficiently.
Filling the infomodel:
Based on a sample infomodel and a product name, create a filled infomodel with all available information.
Generating a description:
Based on the infomodel, compose a compelling, traffic-driving product description.
The prototype covers the following pipeline steps:
- Searching and ranking information sources
- Parsing HTML pages and PDF files found within them
- Extracting structured data in the form of an infomodel
- Exporting the infomodel to JSON
- Generating a description
- Generating a summary
Prerequisites:
- Docker installed on your machine (see the Docker installation guide)
- Git installed (see the Git installation guide)
- An API key and a catalog (folder) ID in Yandex Cloud for the YandexGPT API (see the instructions for obtaining an API key)
- An API key and a catalog (folder) ID in Yandex Cloud for the Yandex Search API (see the getting started guide and the instructions for obtaining an API key)
- Clone the repository:

  ```shell
  git clone https://github.com/PE51K/ai-product-hack
  ```
- Configure environment variables:
In the file env/env.api_key, you need to specify:
- YANDEX_GPT_MODEL_TYPE – the model type: yandexgpt
- YANDEX_GPT_CATALOG_ID – Yandex Cloud catalog ID
- YANDEX_GPT_API_KEY – API key
In the file env/env.yandex_search, you need to specify:
- YANDEX_SEARCH_BASE_LINK – Yandex Search API address
- YANDEX_SEARCH_FOLDER_ID – Yandex Cloud catalog ID
- YANDEX_SEARCH_API_KEY – API key
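Both env files are plain `KEY=VALUE` lists, so they can be loaded without extra dependencies; a minimal sketch (the `load_env_file` helper is hypothetical, not part of the repository):

```python
import os

def load_env_file(path: str) -> None:
    """Read KEY=VALUE lines from an env file into os.environ."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines, comments, and anything that is not KEY=VALUE
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip()

# Usage: load both files before starting the app
# load_env_file("env/env.api_key")
# load_env_file("env/env.yandex_search")
```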
- Open a terminal and go to the ai-product-hack project directory:

  ```shell
  cd path/to/ai-product-hack
  ```
- Build the Docker image:

  ```shell
  docker build -t my-streamlit-app .
  ```

- Run the container:

  ```shell
  docker run -p 8501:8501 my-streamlit-app
  ```
Your Streamlit application is now available at http://localhost:8501.
The Dockerfile works as follows:
- `FROM python:3.12.2`: uses the official Python 3.12.2 image as the base.
- `WORKDIR /app`: sets the working directory inside the container.
- `COPY requirements.txt .`: copies `requirements.txt` into the container.
- `RUN pip install --no-cache-dir -r requirements.txt`: installs dependencies.
- `COPY . .`: copies all project files into the container.
- `EXPOSE 8501`: opens port 8501 for access to the application.
- `CMD ["streamlit", "run", "src/main.py"]`: launches the Streamlit application.
To run locally without Docker:
- Clone the repository:

  ```shell
  git clone https://github.com/PE51K/ai-product-hack
  ```

- Navigate to the project directory:

  ```shell
  cd path/to/ai-product-hack
  ```

- Install dependencies:

  ```shell
  pip install -r requirements.txt
  ```
The prototype is available at http://158.160.168.3:8501.
All necessary user information is available in the Streamlit prototype interface.
Test data to check the Streamlit application is located in the test_data directory.
This task is divided into three subtasks:
- Searching and ranking relevant information sources
- Parsing text from each identified source
- Processing the extracted text and identifying product specifications
Stage 1: Acquiring resource links. Possible approaches:
- Querying a search engine API
- Searching through the main resource table and appending a specific resource URL
Stage 2: Ranking
- Via a table
- Classification via LLM
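The table-based ranking option can be sketched as a lookup in a domain trust table followed by a best-first sort; the domains and scores below are illustrative assumptions, not values from the project, and the `rank_links` helper is hypothetical:

```python
from typing import TypedDict
from urllib.parse import urlparse

class SourceLink(TypedDict):
    link: str
    confidence_rate: float  # from 0 to 1

# Hypothetical trust table: domain -> confidence (values are illustrative)
TRUSTED_DOMAINS = {
    "www.citilink.ru": 1.0,
    "market.yandex.ru": 0.8,
}

def rank_links(links: list[str], default_rate: float = 0.3) -> list[SourceLink]:
    """Assign a confidence from the trust table and sort best-first."""
    ranked = [
        SourceLink(
            link=url,
            confidence_rate=TRUSTED_DOMAINS.get(urlparse(url).netloc, default_rate),
        )
        for url in links
    ]
    return sorted(ranked, key=lambda s: s["confidence_rate"], reverse=True)
```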
Input data format (using TypedDict):

```python
from typing import TypedDict, Optional

class ProductInfo(TypedDict):
    brand_name: str
    model_name: str
    part_number: Optional[int]
```

Output data format:
```python
from typing import TypedDict

class SourceLink(TypedDict):
    link: str
    confidence_rate: float  # from 0 to 1

def get_source_links(product_info: ProductInfo) -> list[SourceLink]:
    ...
    return [source_link_1, source_link_2, ...]
```

Stage 1: Parsing. Possible outputs:
- Text from HTML
- Text from PDF found on the site
- Text from images on the site?
- Text from videos on the site?
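For the HTML case, visible text can be extracted with the standard library alone; a minimal sketch (the `TextExtractor` and `html_to_text` names are hypothetical; a production parser would likely use a dedicated library instead):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text outside of skipped tags
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```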
Input data format (using TypedDict):

```python
from typing import TypedDict

class SourceLink(TypedDict):
    link: str
    confidence_rate: float  # from 0 to 1

current_product_source_links: list[SourceLink] = get_source_links(...)
```

Output data format:
```python
from typing import TypedDict, Optional

class SourceLink(TypedDict):
    link: str
    confidence_rate: float  # from 0 to 1

class TextInfoFromSource(TypedDict):
    html_text: str
    pdf_texts: Optional[list[str]]  # there may be multiple PDFs on the site (if implemented)
    source: SourceLink

def get_product_texts_from_sources(product_source_links: list[SourceLink]) -> list[TextInfoFromSource]:
    ...
    return [text_info_from_source_1, text_info_from_source_2, ...]
```

Stage 1: Extract specific characteristics from the text. Possible approaches:
- Language model
- Split the text into batches
- Split the infomodel into batches
- Data preprocessing?
- Postprocessing the output?
- NER
Stage 2: Combine results from different sources. Possible algorithms:
- Maximum by confidence rating
- Maximum by the sum of confidence ratings for groups with identical values
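The two combining algorithms above can be sketched over `(value, confidence_rate)` pairs; the helper names are hypothetical:

```python
from collections import defaultdict
from typing import Any

def merge_by_max_confidence(values: list[tuple[Any, float]]) -> Any:
    """Pick the value whose single source has the highest confidence."""
    return max(values, key=lambda pair: pair[1])[0]

def merge_by_group_confidence_sum(values: list[tuple[Any, float]]) -> Any:
    """Group identical values and pick the group with the largest confidence sum."""
    totals: dict[Any, float] = defaultdict(float)
    for value, rate in values:
        totals[value] += rate
    return max(totals.items(), key=lambda item: item[1])[0]
```

Note the two strategies can disagree: one confident source may outrank a majority, or vice versa. For example, with `[("15.6", 0.9), ("14.0", 0.5), ("14.0", 0.5)]`, the first strategy picks `"15.6"` while the second picks `"14.0"`.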
Input data format for Stage 1 (using TypedDict):

```python
from typing import TypedDict, Optional

class SourceLink(TypedDict):
    link: str
    confidence_rate: float  # from 0 to 1

class TextInfoFromSource(TypedDict):
    html_text: str
    pdf_texts: Optional[list[str]]  # there may be multiple PDFs on the site (if implemented)
    source: SourceLink

current_product_texts_from_sources: list[TextInfoFromSource] = get_product_texts_from_sources(...)
```

Output data format for Stage 1:
```python
from typing import TypedDict, Optional

class TextInfoFromSource(TypedDict):
    html_text: str
    pdf_texts: Optional[list[str]]  # there may be multiple PDFs on the site (if implemented)
    source: SourceLink

class NotebookCharacteristics(TypedDict):
    diagonal_size: float
    ...
    source: SourceLink

def get_product_characteristics_from_sources(product_texts_from_sources: list[TextInfoFromSource]) -> list[NotebookCharacteristics]:
    ...
    return [notebook_characteristics_from_source_1, notebook_characteristics_from_source_2, ...]
```

Output data format for Stage 2:
```python
from typing import TypedDict, List, Union

class SourceLink(TypedDict):
    link: str
    confidence_rate: float  # from 0 to 1

class FinalNotebookCharacteristics(TypedDict):
    diagonal_size: float
    ...

def get_final_product_characteristics(product_characteristics_from_sources: List[Union[NotebookCharacteristics, TVCharacteristics, ...]]) -> Union[FinalNotebookCharacteristics, FinalTVCharacteristics, ...]:
    ...
```

Possible approaches:
- Using GPT API
- A local LLM (if confidentiality is a priority)
Input data format:

```python
from typing import TypedDict, List, Optional

class SourceLink(TypedDict):
    link: str
    confidence_rate: float  # from 0 to 1

class FinalNotebookCharacteristics(TypedDict):
    diagonal_size: float
    ...

class ProductInfo(TypedDict):
    brand_name: str
    model_name: str
    part_number: Optional[int]

input: tuple[List[SourceLink], FinalNotebookCharacteristics, ProductInfo] = ...
```
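For the GPT-API approach, the first step is composing a prompt from the final infomodel; a minimal sketch with an illustrative prompt template (the `build_description_prompt` helper and its wording are hypothetical, not the project's actual prompt):

```python
def build_description_prompt(product_info: dict, characteristics: dict) -> str:
    """Compose an LLM prompt asking for a marketing description.

    product_info: brand_name / model_name fields as defined above.
    characteristics: the final (merged) infomodel fields.
    """
    specs = "\n".join(f"- {name}: {value}" for name, value in characteristics.items())
    return (
        f"Write a compelling, traffic-driving product description for "
        f"{product_info['brand_name']} {product_info['model_name']}.\n"
        f"Use only these verified specifications:\n{specs}"
    )
```

Grounding the prompt in the merged characteristics (rather than the raw source texts) keeps the generated description consistent with the filled infomodel.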