# AutoRia.com web-scraper

- Author: satojkee
There are 2 main scrapers:

- `autoria_scraper.core.scrapers.catalog.CatalogScraper` - this scraper is responsible for extracting *direct* links from the catalog.
  - Yields a collection of urls per stage.
  - Stores every collected url in a pool to avoid duplicates.
  - Stage size equals the `SCRAPER__BATCH_SIZE` value.
- `autoria_scraper.core.scrapers.direct.DirectScraper` - this one receives a collection of *direct* urls and extracts all necessary information from them.
  - Processes each link given on init and yields a collection of parsed entities (the collection may include `None` values).
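The two-stage flow described above can be sketched roughly as follows. This is a simplified, hypothetical illustration (the names `catalog_stage`, `direct_stage`, and the batching details are invented for the sketch; the real classes live in `autoria_scraper.core.scrapers` and differ):

```python
import asyncio

# Rough sketch of the two-stage pipeline: a catalog stage yields deduplicated
# batches of direct urls, a direct stage parses each batch concurrently.
# All names below are illustrative, not the project's actual API.

BATCH_SIZE = 3  # stands in for SCRAPER__BATCH_SIZE


async def catalog_stage(pages):
    """Yield deduplicated batches of direct urls, one batch per stage."""
    pool = set()  # every collected url is kept here to avoid duplicates
    batch = []
    for url in pages:
        if url in pool:
            continue
        pool.add(url)
        batch.append(url)
        if len(batch) == BATCH_SIZE:
            yield batch
            batch = []
    if batch:
        yield batch


async def direct_stage(urls):
    """Parse each direct url; a failed parse becomes None, as in DirectScraper."""
    async def parse(url):
        return {"url": url} if url.endswith(".html") else None
    return await asyncio.gather(*(parse(u) for u in urls))


async def main():
    pages = [
        "https://auto.ria.com/uk/auto_a_1.html",
        "https://auto.ria.com/uk/auto_b_2.html",
        "https://auto.ria.com/uk/auto_a_1.html",  # duplicate, dropped by the pool
        "https://auto.ria.com/uk/auto_c_3.html",
        "https://auto.ria.com/uk/broken",         # parses to None
    ]
    results = []
    async for batch in catalog_stage(pages):
        results.extend(await direct_stage(batch))
    return results


print(asyncio.run(main()))
```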
Links example:
- direct - https://auto.ria.com/uk/auto_mercedes_benz_sprinter_38472224.html
- catalog - https://auto.ria.com/uk/car/used/?page=30
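The two url shapes above can be told apart with a simple pattern check (a hypothetical helper written for this README, not part of the project):

```python
import re

# Hypothetical classifier for the two url shapes shown above.
DIRECT_RE = re.compile(r"^https://auto\.ria\.com/uk/auto_.+_\d+\.html$")
CATALOG_RE = re.compile(r"^https://auto\.ria\.com/uk/car/used/\?page=\d+$")


def classify(url: str) -> str:
    """Return 'direct', 'catalog', or 'unknown' for a given AutoRia url."""
    if DIRECT_RE.match(url):
        return "direct"
    if CATALOG_RE.match(url):
        return "catalog"
    return "unknown"


print(classify("https://auto.ria.com/uk/auto_mercedes_benz_sprinter_38472224.html"))  # direct
print(classify("https://auto.ria.com/uk/car/used/?page=30"))                          # catalog
```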
Cron jobs are managed by the `cron` Linux package. Define cron jobs in the `/scripts/start.sh` script file.

- `printenv` is used to expose the project environment variables to cron jobs; otherwise those variables won't be accessible by cron.
- Define `CRON__SCRAPER` and `CRON__PG_DUMP` in the same `.env` file used by docker-compose.
```bash
#!/bin/bash
{
    printenv
    echo "$CRON__SCRAPER root /usr/local/bin/python3 /app/main.py >> /var/log/scraper.log 2>&1"
    echo "$CRON__PG_DUMP root /app/scripts/dump.sh >> /var/log/pg_dump.log 2>&1"
} >> /etc/crontab
```
- Python 3.11
- Packages:
  - SQLAlchemy[async]==2.0.41
  - pydantic==2.11.7
  - pydantic-settings==2.10.1
  - asyncpg==0.30.0
  - aiohttp==3.12.13
  - beautifulsoup4==4.13.4
  - fake-useragent==2.2.0
  - lxml==5.4.0
Configuration is loaded from the `.env` file.
```dotenv
# PostgresDsn = "dialect+driver://user:password@host:port/dbname"
DATABASE__URL="postgresql+asyncpg://postgres:postgres@localhost:5432/autoria"

# Use this one for testing purposes, ! remove in production !
SCRAPER__PAGES_LIMIT="100"

# Required for `autoria_scraper.core.scrapers.catalog.CatalogScraper` as the root url to obtain listings
SCRAPER__ROOT_URL="https://auto.ria.com/uk/car/used/"

# Required for `autoria_scraper.core.scrapers.direct.DirectScraper` as the root url to collect sellers' phone numbers
SCRAPER__PHONE_URL="https://auto.ria.com/bff/final-page/public/auto/popUp/"

# Required for both scrapers, defines the amount of concurrent tasks (the higher this value is, the more network/RAM the application consumes)
SCRAPER__BATCH_SIZE="200"

# Replaces the `aiohttp` default request timeout value (300 -> 60), throws `TimeoutError` if exceeded
AIOHTTP__TIMEOUT="60"

# Number of retries for each `aiohttp` request (on failure)
AIOHTTP__ATTEMPTS_LIMIT="3"

# Delay between each `aiohttp` request reattempt (on failure, in seconds)
AIOHTTP__ATTEMPT_DELAY="2"

# Required for `autoria-postgres` and the `pg_dump` cron task
PG_USER="postgres"
PG_PASSWORD="postgres"
PG_DB="autoria"
PG_HOST="autoria-postgres" # postgres container name or remote host

# Cron schedule settings
# https://crontab.guru/
CRON__PG_DUMP="0 12 * * *"
CRON__SCRAPER="0 12 * * *"
```

The project uses the `pydantic_settings` package to manage configuration. `__` is used as the nesting delimiter.
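The effect of the `__` nesting delimiter can be illustrated with a small stdlib-only sketch: a variable like `DATABASE__URL` maps to a nested `database.url` setting. The project itself relies on `pydantic-settings` (its `env_nested_delimiter` option), not on this hand-rolled parser:

```python
# Stdlib-only illustration of `__` as a nesting delimiter:
# DATABASE__URL -> settings["database"]["url"].
# The real project uses pydantic-settings for this, not the function below.

def parse_nested_env(environ, delimiter="__"):
    """Split each env key on the delimiter and build a nested dict."""
    settings = {}
    for key, value in environ.items():
        node = settings
        *parents, leaf = key.lower().split(delimiter)
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return settings


env = {
    "DATABASE__URL": "postgresql+asyncpg://postgres:postgres@localhost:5432/autoria",
    "SCRAPER__BATCH_SIZE": "200",
    "AIOHTTP__TIMEOUT": "60",
}
print(parse_nested_env(env))
# e.g. {'database': {'url': ...}, 'scraper': {'batch_size': '200'}, 'aiohttp': {'timeout': '60'}}
```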
| Variable | Recommended value | Description |
|---|---|---|
| `SCRAPER__PAGES_LIMIT` | `100 - 500` | Use this option to test the web-scraper. Limits the total amount of pages for `CatalogScraper` (approx. 18k+ pages when unlimited). Remove for production. |
| `SCRAPER__ROOT_URL` | `https://auto.ria.com/uk/car/used/` | Constant! Base url (crucial to obtain direct links to the listed cars). |
| `SCRAPER__PHONE_URL` | `https://auto.ria.com/bff/final-page/public/auto/popUp/` | Constant! Used to dynamically obtain phone numbers. |
| `SCRAPER__BATCH_SIZE` | `200` | Amount of concurrent tasks (the higher this value is, the more network/RAM is consumed). |
| `AIOHTTP__ATTEMPTS_LIMIT` | `3` | Number of reattempts for `aiohttp` requests. |
| `AIOHTTP__TIMEOUT` | `60` | Timeout for `aiohttp` requests (in seconds); the `aiohttp` default is 60 * 5 = 300. |
| `AIOHTTP__ATTEMPT_DELAY` | `2` | Delay between each reattempt (in seconds). |
| `DATABASE__URL` | `postgresql+asyncpg://postgres:postgres@autoria-postgres:5432/autoria` | Database url: `postgresql+asyncpg://{PG_USER}:{PG_PASSWORD}@{PG_HOST}:5432/{PG_DB}`. |
| `PG_USER` | `postgres` | Database username (used by `autoria-postgres` and the `pg_dump` util). |
| `PG_PASSWORD` | `postgres` | Database password (used by `autoria-postgres` and the `pg_dump` util). |
| `PG_DB` | `autoria` | Database name (used by `autoria-postgres` and the `pg_dump` util). |
| `PG_HOST` | `autoria-postgres` | Database host (used by `autoria-postgres` and the `pg_dump` util; use the db container name: `autoria-postgres`). |
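The retry behaviour described by `AIOHTTP__ATTEMPTS_LIMIT` and `AIOHTTP__ATTEMPT_DELAY` can be approximated with a stdlib-only sketch (the helper name and wiring are hypothetical; the project applies this around `aiohttp` requests, which are omitted here):

```python
import asyncio

ATTEMPTS_LIMIT = 3    # stands in for AIOHTTP__ATTEMPTS_LIMIT
ATTEMPT_DELAY = 0.01  # stands in for AIOHTTP__ATTEMPT_DELAY (shortened for the demo)


async def fetch_with_retries(fetch):
    """Call `fetch` up to ATTEMPTS_LIMIT times, sleeping between failures."""
    last_error = None
    for attempt in range(ATTEMPTS_LIMIT):
        try:
            return await fetch()
        except Exception as exc:  # in the real scraper: aiohttp/timeout errors
            last_error = exc
            if attempt + 1 < ATTEMPTS_LIMIT:
                await asyncio.sleep(ATTEMPT_DELAY)
    raise last_error


async def main():
    calls = {"n": 0}

    async def flaky():
        calls["n"] += 1
        if calls["n"] < 3:
            raise TimeoutError("simulated timeout")
        return "<html>ok</html>"

    return await fetch_with_retries(flaky)


print(asyncio.run(main()))  # succeeds on the third attempt
```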
See `/examples/output_data_example.csv` for a 100-row example.
Don't forget to configure the necessary variables in the `.env` file.

```shell
docker-compose up -d
```