Description
This repository contains a prototype service for recognizing product “infomodels” and generating descriptions and summaries based on them.
The prototype was developed during the AI Product Hack hackathon by the ЭЯЙ team.
Citilink aims to provide its customers with as much product information (content) as possible.
A product page must contain complete information about the item, including technical specifications and a marketing description.
Manual completion is slow, while generative technologies can fill in product pages quickly and efficiently.
Filling the infomodel:
Based on a sample infomodel and a product name, create a filled infomodel with all available information.
Generating a description:
Based on the infomodel, compose a compelling, traffic-driving product description.
The prototype covers the following pipeline steps:
- Searching and ranking information sources
- Parsing HTML pages and PDF files found within them
- Extracting structured data in the form of an infomodel
- Exporting the infomodel to JSON
- Generating a description
- Generating a summary
Prerequisites:
- Docker installed on your machine (see the Docker installation guide)
- Git installed (see the Git installation guide)
- An API key and a catalog (folder) ID in Yandex Cloud for the YandexGPT API (see the instructions for obtaining an API key)
- An API key and a catalog (folder) ID in Yandex Cloud for the Yandex Search API (see the getting started guide and the instructions for obtaining an API key)
- Clone the repository:

  ```shell
  git clone https://github.com/PE51K/ai-product-hack
  ```
- Configure environment variables:
In the file env/env.api_key, you need to specify:
- YANDEX_GPT_MODEL_TYPE – the model type: yandexgpt
- YANDEX_GPT_CATALOG_ID – Yandex Cloud catalog ID
- YANDEX_GPT_API_KEY – API key
In the file env/env.yandex_search, you need to specify:
- YANDEX_SEARCH_BASE_LINK – Yandex Search API address
- YANDEX_SEARCH_FOLDER_ID – Yandex Cloud catalog ID
- YANDEX_SEARCH_API_KEY – API key
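Both env files are plain `KEY=VALUE` lists, so they can be loaded without extra dependencies; a minimal sketch (the `load_env_file` helper is hypothetical, not part of the repository):

```python
import os

def load_env_file(path: str) -> None:
    """Read KEY=VALUE lines from an env file into os.environ."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines, comments, and anything that is not KEY=VALUE
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip()

# Usage: load both files before starting the app
# load_env_file("env/env.api_key")
# load_env_file("env/env.yandex_search")
```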
- Open a terminal and go to the ai-product-hack project directory:

  ```shell
  cd path/to/ai-product-hack
  ```
- Build the Docker image:

  ```shell
  docker build -t my-streamlit-app .
  ```

- Run the container:

  ```shell
  docker run -p 8501:8501 my-streamlit-app
  ```
Your Streamlit application is now available at http://localhost:8501.
The Dockerfile works as follows:
- `FROM python:3.12.2`: uses the official Python 3.12.2 image as the base.
- `WORKDIR /app`: sets the working directory inside the container.
- `COPY requirements.txt .`: copies `requirements.txt` into the container.
- `RUN pip install --no-cache-dir -r requirements.txt`: installs dependencies.
- `COPY . .`: copies all project files into the container.
- `EXPOSE 8501`: opens port 8501 for access to the application.
- `CMD ["streamlit", "run", "src/main.py"]`: launches the Streamlit application.
To run locally without Docker:
- Clone the repository:

  ```shell
  git clone https://github.com/PE51K/ai-product-hack
  ```

- Navigate to the project directory:

  ```shell
  cd path/to/ai-product-hack
  ```

- Install dependencies:

  ```shell
  pip install -r requirements.txt
  ```
The prototype is available at http://158.160.168.3:8501.
All necessary user information is available in the Streamlit prototype interface.
Test data to check the Streamlit application is located in the test_data directory.
This task is divided into three subtasks:
- Searching and ranking relevant information sources
- Parsing text from each identified source
- Processing the extracted text and identifying product specifications
Stage 1: Acquiring resource links. Possible approaches:
- Querying a search engine API
- Searching through the main resource table and appending a specific resource URL
Stage 2: Ranking
- Via a table
- Classification via LLM
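The table-based ranking option can be sketched as a lookup in a domain trust table followed by a best-first sort; the domains and scores below are illustrative assumptions, not values from the project, and the `rank_links` helper is hypothetical:

```python
from typing import TypedDict
from urllib.parse import urlparse

class SourceLink(TypedDict):
    link: str
    confidence_rate: float  # from 0 to 1

# Hypothetical trust table: domain -> confidence (values are illustrative)
TRUSTED_DOMAINS = {
    "www.citilink.ru": 1.0,
    "market.yandex.ru": 0.8,
}

def rank_links(links: list[str], default_rate: float = 0.3) -> list[SourceLink]:
    """Assign a confidence from the trust table and sort best-first."""
    ranked = [
        SourceLink(
            link=url,
            confidence_rate=TRUSTED_DOMAINS.get(urlparse(url).netloc, default_rate),
        )
        for url in links
    ]
    return sorted(ranked, key=lambda s: s["confidence_rate"], reverse=True)
```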
Input data format (using TypedDict):

```python
from typing import TypedDict, Optional

class ProductInfo(TypedDict):
    brand_name: str
    model_name: str
    part_number: Optional[int]
```

Output data format:
```python
from typing import TypedDict

class SourceLink(TypedDict):
    link: str
    confidence_rate: float  # from 0 to 1

def get_source_links(product_info: ProductInfo) -> list[SourceLink]:
    ...
    return [source_link_1, source_link_2, ...]
```

Stage 1: Parsing. Possible outputs:
- Text from HTML
- Text from PDF found on the site
- Text from images on the site?
- Text from videos on the site?
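For the HTML case, visible text can be extracted with the standard library alone; a minimal sketch (the `TextExtractor` and `html_to_text` names are hypothetical; a production parser would likely use a dedicated library instead):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text outside of skipped tags
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```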
Input data format (using TypedDict):

```python
from typing import TypedDict

class SourceLink(TypedDict):
    link: str
    confidence_rate: float  # from 0 to 1

current_product_source_links: list[SourceLink] = get_source_links(...)
```

Output data format:
```python
from typing import TypedDict, Optional

class SourceLink(TypedDict):
    link: str
    confidence_rate: float  # from 0 to 1

class TextInfoFromSource(TypedDict):
    html_text: str
    pdf_texts: Optional[list[str]]  # there may be multiple PDFs on the site (if implemented)
    source: SourceLink

def get_product_texts_from_sources(product_source_links: list[SourceLink]) -> list[TextInfoFromSource]:
    ...
    return [text_info_from_source_1, text_info_from_source_2, ...]
```

Stage 1: Extract specific characteristics from the text. Possible approaches:
- Language model
- Split the text into batches
- Split the infomodel into batches
- Data preprocessing?
- Postprocessing the output?
- NER
Stage 2: Combine results from different sources. Possible algorithms:
- Maximum by confidence rating
- Maximum by the sum of confidence ratings for groups with identical values
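The two combining algorithms above can be sketched over `(value, confidence_rate)` pairs; the helper names are hypothetical:

```python
from collections import defaultdict
from typing import Any

def merge_by_max_confidence(values: list[tuple[Any, float]]) -> Any:
    """Pick the value whose single source has the highest confidence."""
    return max(values, key=lambda pair: pair[1])[0]

def merge_by_group_confidence_sum(values: list[tuple[Any, float]]) -> Any:
    """Group identical values and pick the group with the largest confidence sum."""
    totals: dict[Any, float] = defaultdict(float)
    for value, rate in values:
        totals[value] += rate
    return max(totals.items(), key=lambda item: item[1])[0]
```

Note the two strategies can disagree: one confident source may outrank a majority, or vice versa. For example, with `[("15.6", 0.9), ("14.0", 0.5), ("14.0", 0.5)]`, the first strategy picks `"15.6"` while the second picks `"14.0"`.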
Input data format for Stage 1 (using TypedDict):

```python
from typing import TypedDict, Optional

class SourceLink(TypedDict):
    link: str
    confidence_rate: float  # from 0 to 1

class TextInfoFromSource(TypedDict):
    html_text: str
    pdf_texts: Optional[list[str]]  # there may be multiple PDFs on the site (if implemented)
    source: SourceLink

current_product_texts_from_sources: list[TextInfoFromSource] = get_product_texts_from_sources(...)
```

Output data format for Stage 1:
```python
from typing import TypedDict, Optional

class TextInfoFromSource(TypedDict):
    html_text: str
    pdf_texts: Optional[list[str]]  # there may be multiple PDFs on the site (if implemented)
    source: SourceLink

class NotebookCharacteristics(TypedDict):
    diagonal_size: float
    ...
    source: SourceLink

def get_product_characteristics_from_sources(product_texts_from_sources: list[TextInfoFromSource]) -> list[NotebookCharacteristics]:
    ...
    return [notebook_characteristics_from_source_1, notebook_characteristics_from_source_2, ...]
```

Output data format for Stage 2:
```python
from typing import TypedDict, List, Union

class SourceLink(TypedDict):
    link: str
    confidence_rate: float  # from 0 to 1

class FinalNotebookCharacteristics(TypedDict):
    diagonal_size: float
    ...

def get_final_product_characteristics(product_characteristics_from_sources: List[Union[NotebookCharacteristics, TVCharacteristics, ...]]) -> Union[FinalNotebookCharacteristics, FinalTVCharacteristics, ...]:
    ...
```

Possible approaches:
- Using GPT API
- A local LLM (if confidentiality is a priority)
Input data format:

```python
from typing import TypedDict, List, Optional

class SourceLink(TypedDict):
    link: str
    confidence_rate: float  # from 0 to 1

class FinalNotebookCharacteristics(TypedDict):
    diagonal_size: float
    ...

class ProductInfo(TypedDict):
    brand_name: str
    model_name: str
    part_number: Optional[int]

input: tuple[List[SourceLink], FinalNotebookCharacteristics, ProductInfo] = ...
```
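For the GPT-API approach, the first step is composing a prompt from the final infomodel; a minimal sketch with an illustrative prompt template (the `build_description_prompt` helper and its wording are hypothetical, not the project's actual prompt):

```python
def build_description_prompt(product_info: dict, characteristics: dict) -> str:
    """Compose an LLM prompt asking for a marketing description.

    product_info: brand_name / model_name fields as defined above.
    characteristics: the final (merged) infomodel fields.
    """
    specs = "\n".join(f"- {name}: {value}" for name, value in characteristics.items())
    return (
        f"Write a compelling, traffic-driving product description for "
        f"{product_info['brand_name']} {product_info['model_name']}.\n"
        f"Use only these verified specifications:\n{specs}"
    )
```

Grounding the prompt in the merged characteristics (rather than the raw source texts) keeps the generated description consistent with the filled infomodel.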