# AutoRia.com web-scraper

- Author: satojkee
There are 2 main scrapers:

- `autoria_scraper.core.scrapers.catalog.CatalogScraper` - this scraper is responsible for extracting *direct* links from the catalog.
  - Yields a collection of urls per stage.
  - Stores every collected url in a pool to avoid duplicates.
  - Stage size equals the `SCRAPER__BATCH_SIZE` value.
- `autoria_scraper.core.scrapers.direct.DirectScraper` - this one receives a collection of *direct* urls and extracts all necessary information from them.
  - Processes each link given on init and yields a collection of parsed entities (the collection may include `None` values).
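The two-stage flow described above can be sketched roughly as follows. This is a simplified, hypothetical illustration (the names `catalog_stage`, `direct_stage`, and the batching details are invented for the sketch; the real classes live in `autoria_scraper.core.scrapers` and differ):

```python
import asyncio

# Rough sketch of the two-stage pipeline: a catalog stage yields deduplicated
# batches of direct urls, a direct stage parses each batch concurrently.
# All names below are illustrative, not the project's actual API.

BATCH_SIZE = 3  # stands in for SCRAPER__BATCH_SIZE


async def catalog_stage(pages):
    """Yield deduplicated batches of direct urls, one batch per stage."""
    pool = set()  # every collected url is kept here to avoid duplicates
    batch = []
    for url in pages:
        if url in pool:
            continue
        pool.add(url)
        batch.append(url)
        if len(batch) == BATCH_SIZE:
            yield batch
            batch = []
    if batch:
        yield batch


async def direct_stage(urls):
    """Parse each direct url; a failed parse becomes None, as in DirectScraper."""
    async def parse(url):
        return {"url": url} if url.endswith(".html") else None
    return await asyncio.gather(*(parse(u) for u in urls))


async def main():
    pages = [
        "https://auto.ria.com/uk/auto_a_1.html",
        "https://auto.ria.com/uk/auto_b_2.html",
        "https://auto.ria.com/uk/auto_a_1.html",  # duplicate, dropped by the pool
        "https://auto.ria.com/uk/auto_c_3.html",
        "https://auto.ria.com/uk/broken",         # parses to None
    ]
    results = []
    async for batch in catalog_stage(pages):
        results.extend(await direct_stage(batch))
    return results


print(asyncio.run(main()))
```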
Links example:
- direct - https://auto.ria.com/uk/auto_mercedes_benz_sprinter_38472224.html
- catalog - https://auto.ria.com/uk/car/used/?page=30
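The two url shapes above can be told apart with a simple pattern check (a hypothetical helper written for this README, not part of the project):

```python
import re

# Hypothetical classifier for the two url shapes shown above.
DIRECT_RE = re.compile(r"^https://auto\.ria\.com/uk/auto_.+_\d+\.html$")
CATALOG_RE = re.compile(r"^https://auto\.ria\.com/uk/car/used/\?page=\d+$")


def classify(url: str) -> str:
    """Return 'direct', 'catalog', or 'unknown' for a given AutoRia url."""
    if DIRECT_RE.match(url):
        return "direct"
    if CATALOG_RE.match(url):
        return "catalog"
    return "unknown"


print(classify("https://auto.ria.com/uk/auto_mercedes_benz_sprinter_38472224.html"))  # direct
print(classify("https://auto.ria.com/uk/car/used/?page=30"))                          # catalog
```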
Cron jobs are managed by the `cron` Linux package. Define cron jobs in the `/scripts/start.sh` script file.

- `printenv` is used to expose the project environment variables to cron jobs; otherwise those variables won't be accessible by cron.
- Define `CRON__SCRAPER` and `CRON__PG_DUMP` in the same `.env` file used by docker-compose.
```bash
#!/bin/bash
{
    printenv
    echo "$CRON__SCRAPER root /usr/local/bin/python3 /app/main.py >> /var/log/scraper.log 2>&1"
    echo "$CRON__PG_DUMP root /app/scripts/dump.sh >> /var/log/pg_dump.log 2>&1"
} >> /etc/crontab
```
- Python 3.11
- Packages:
  - SQLAlchemy[async]==2.0.41
  - pydantic==2.11.7
  - pydantic-settings==2.10.1
  - asyncpg==0.30.0
  - aiohttp==3.12.13
  - beautifulsoup4==4.13.4
  - fake-useragent==2.2.0
  - lxml==5.4.0
Configuration is loaded from the `.env` file.
```dotenv
# PostgresDsn = "dialect+driver://user:password@host:port/dbname"
DATABASE__URL="postgresql+asyncpg://postgres:postgres@localhost:5432/autoria"

# Use this one for testing purposes, ! remove in production !
SCRAPER__PAGES_LIMIT="100"

# Required for `autoria_scraper.core.scrapers.catalog.CatalogScraper` as the root url to obtain listings
SCRAPER__ROOT_URL="https://auto.ria.com/uk/car/used/"

# Required for `autoria_scraper.core.scrapers.direct.DirectScraper` as the root url to collect sellers' phone numbers
SCRAPER__PHONE_URL="https://auto.ria.com/bff/final-page/public/auto/popUp/"

# Required for both scrapers, defines the amount of concurrent tasks (the higher this value is, the more network/RAM the application consumes)
SCRAPER__BATCH_SIZE="200"

# Replaces the `aiohttp` default request timeout value (300 -> 60), throws `TimeoutError` if exceeded
AIOHTTP__TIMEOUT="60"

# Number of retries for each `aiohttp` request (on failure)
AIOHTTP__ATTEMPTS_LIMIT="3"

# Delay between each `aiohttp` request reattempt (on failure, in seconds)
AIOHTTP__ATTEMPT_DELAY="2"

# Required for `autoria-postgres` and the `pg_dump` cron task
PG_USER="postgres"
PG_PASSWORD="postgres"
PG_DB="autoria"
PG_HOST="autoria-postgres" # postgres container name or remote host

# Cron schedule settings
# https://crontab.guru/
CRON__PG_DUMP="0 12 * * *"
CRON__SCRAPER="0 12 * * *"
```

The project uses the `pydantic_settings` package to manage configuration. `__` is used as the nesting delimiter.
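The effect of the `__` nesting delimiter can be illustrated with a small stdlib-only sketch: a variable like `DATABASE__URL` maps to a nested `database.url` setting. The project itself relies on `pydantic-settings` (its `env_nested_delimiter` option), not on this hand-rolled parser:

```python
# Stdlib-only illustration of `__` as a nesting delimiter:
# DATABASE__URL -> settings["database"]["url"].
# The real project uses pydantic-settings for this, not the function below.

def parse_nested_env(environ, delimiter="__"):
    """Split each env key on the delimiter and build a nested dict."""
    settings = {}
    for key, value in environ.items():
        node = settings
        *parents, leaf = key.lower().split(delimiter)
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return settings


env = {
    "DATABASE__URL": "postgresql+asyncpg://postgres:postgres@localhost:5432/autoria",
    "SCRAPER__BATCH_SIZE": "200",
    "AIOHTTP__TIMEOUT": "60",
}
print(parse_nested_env(env))
# e.g. {'database': {'url': ...}, 'scraper': {'batch_size': '200'}, 'aiohttp': {'timeout': '60'}}
```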
| Variable | Recommended value | Description |
|---|---|---|
| `SCRAPER__PAGES_LIMIT` | `100 - 500` | Use this option to test the web-scraper. Limits the total amount of pages for `CatalogScraper` (approx. 18k+ pages when unlimited). Remove for production. |
| `SCRAPER__ROOT_URL` | `https://auto.ria.com/uk/car/used/` | Constant! Base url (crucial to obtain direct links to the listed cars). |
| `SCRAPER__PHONE_URL` | `https://auto.ria.com/bff/final-page/public/auto/popUp/` | Constant! Used to dynamically obtain phone numbers. |
| `SCRAPER__BATCH_SIZE` | `200` | Amount of concurrent tasks (the higher this value is, the more network/RAM is consumed). |
| `AIOHTTP__ATTEMPTS_LIMIT` | `3` | Number of reattempts for `aiohttp` requests. |
| `AIOHTTP__TIMEOUT` | `60` | Timeout for `aiohttp` requests (in seconds); the `aiohttp` default is 60 * 5 = 300. |
| `AIOHTTP__ATTEMPT_DELAY` | `2` | Delay between each reattempt (in seconds). |
| `DATABASE__URL` | `postgresql+asyncpg://postgres:postgres@autoria-postgres:5432/autoria` | Database url: `postgresql+asyncpg://{PG_USER}:{PG_PASSWORD}@{PG_HOST}:5432/{PG_DB}`. |
| `PG_USER` | `postgres` | Database username (used by `autoria-postgres` and the `pg_dump` util). |
| `PG_PASSWORD` | `postgres` | Database password (used by `autoria-postgres` and the `pg_dump` util). |
| `PG_DB` | `autoria` | Database name (used by `autoria-postgres` and the `pg_dump` util). |
| `PG_HOST` | `autoria-postgres` | Database host (used by `autoria-postgres` and the `pg_dump` util; use the db container name: `autoria-postgres`). |
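The retry behaviour described by `AIOHTTP__ATTEMPTS_LIMIT` and `AIOHTTP__ATTEMPT_DELAY` can be approximated with a stdlib-only sketch (the helper name and wiring are hypothetical; the project applies this around `aiohttp` requests, which are omitted here):

```python
import asyncio

ATTEMPTS_LIMIT = 3    # stands in for AIOHTTP__ATTEMPTS_LIMIT
ATTEMPT_DELAY = 0.01  # stands in for AIOHTTP__ATTEMPT_DELAY (shortened for the demo)


async def fetch_with_retries(fetch):
    """Call `fetch` up to ATTEMPTS_LIMIT times, sleeping between failures."""
    last_error = None
    for attempt in range(ATTEMPTS_LIMIT):
        try:
            return await fetch()
        except Exception as exc:  # in the real scraper: aiohttp/timeout errors
            last_error = exc
            if attempt + 1 < ATTEMPTS_LIMIT:
                await asyncio.sleep(ATTEMPT_DELAY)
    raise last_error


async def main():
    calls = {"n": 0}

    async def flaky():
        calls["n"] += 1
        if calls["n"] < 3:
            raise TimeoutError("simulated timeout")
        return "<html>ok</html>"

    return await fetch_with_retries(flaky)


print(asyncio.run(main()))  # succeeds on the third attempt
```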
See `/examples/output_data_example.csv` for a 100-row example.
Don't forget to configure the necessary variables in the `.env` file.

```shell
docker-compose up -d
```