This repository runs the Impresso content-item ad classifier on rebuilt
newspaper JSONL files and writes a reduced JSONL output for downstream use. The
shared build machinery lives in the cookbook/ submodule; the task-specific
logic for this repository lives in the root Makefile, config, and
lib/cli_content_item_classification.py.
.
├── Makefile
├── configs/
│ └── config-content-item-classification-multilingual_v1-0-0.mk
├── lib/
│ └── cli_content_item_classification.py
├── cookbook/ # shared Impresso make cookbook submodule
├── dotenv.sample
├── Pipfile
└── requirements.txt
Input is expected to be rebuilt jsonl or jsonl.bz2 newspaper content. The
pipeline writes one output row per input content item:
- non-article rows keep only id and tp
- article rows keep id, tp, and ad_classification
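The row reduction described above can be sketched in a few lines of Python. This is only an illustration, not the code in lib/cli_content_item_classification.py: the function name reduce_row, the assumption that article rows are marked with tp == "ar", and the placeholder classification value are all hypothetical. The real structure of ad_classification is not described here.

```python
import json


def reduce_row(item, ad_classification=None, article_type="ar"):
    """Return the reduced output row for one rebuilt content item.

    Assumption: article rows are identified by tp == article_type; the
    default "ar" is a guess and may differ from the real pipeline.
    """
    row = {"id": item["id"], "tp": item["tp"]}
    if item["tp"] == article_type:
        # Only article rows carry the classifier result.
        row["ad_classification"] = ad_classification
    return row


if __name__ == "__main__":
    # Made-up input rows; real rebuilt items carry more fields.
    rows = [
        {"id": "example-0001", "tp": "img", "extra": "dropped"},
        {"id": "example-0002", "tp": "ar", "extra": "dropped"},
    ]
    for item in rows:
        # "PLACEHOLDER" stands in for whatever the classifier returns.
        print(json.dumps(reduce_row(item, ad_classification="PLACEHOLDER")))
```

Every input row produces exactly one output row, so the line count of the output should match the input.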
The current default run ID is derived from:
- process label: content-item-classification
- task: base
- model: multilingual
- run version: v1-0-0
GNU Make 4 or newer is required. On macOS, that usually means using Homebrew
gmake instead of the system make.
git clone --recursive <repo-url>
cd impresso-content-item-classification-cookbook
cp dotenv.sample .env
pipenv install
gmake setup
If you do not use pipenv, install from requirements.txt instead.
Show targets:
gmake help
Run one newspaper with the committed config:
gmake \
CFG=configs/config-content-item-classification-multilingual_v1-0-0.mk \
NEWSPAPER=BNL/actionfem \
newspaper
Run a collection:
gmake \
CFG=configs/config-content-item-classification-multilingual_v1-0-0.mk \
COLLECTION_JOBS=4 \
collection
The build first syncs rebuilt input data, then runs
lib/cli_content_item_classification.py, and finally uploads the output and log
back to S3.
Required environment variables go in .env:
SE_ACCESS_KEY=<YOUR VALUE>
SE_SECRET_KEY=<YOUR VALUE>
SE_HOST_URL=https://os.zhdk.cloud.switch.ch/
Useful runtime overrides:
- CFG: select a config file
- NEWSPAPER: process a single newspaper, default BNL/actionfem
- COLLECTION_JOBS: number of newspapers processed in parallel
- NEWSPAPER_JOBS: parallelism within one newspaper
- LOGGING_LEVEL: make and CLI logging verbosity
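Before starting a run, it can help to confirm that the three required variables in .env have been filled in. The following is a minimal sketch using only the Python standard library. It is not part of this repository, and the helper names parse_dotenv and missing_variables are hypothetical. It treats the <YOUR VALUE> placeholder from dotenv.sample as unset.

```python
import os

# The three variables the README lists as required in .env.
REQUIRED = ("SE_ACCESS_KEY", "SE_SECRET_KEY", "SE_HOST_URL")
PLACEHOLDER = "<YOUR VALUE>"


def parse_dotenv(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_variables(values, required=REQUIRED):
    """Return required names that are absent, empty, or still placeholders."""
    return [
        name
        for name in required
        if not values.get(name) or values[name] == PLACEHOLDER
    ]


if __name__ == "__main__":
    if os.path.exists(".env"):
        with open(".env", encoding="utf-8") as handle:
            missing = missing_variables(parse_dotenv(handle.read()))
        print("missing:", ", ".join(missing) if missing else "none")
    else:
        print(".env not found; copy dotenv.sample to .env first")
```

The parser is deliberately simple: it does not handle quoting or export prefixes, which is enough for the plain KEY=VALUE lines shown above.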
Task-specific defaults live in:
- configs/config-content-item-classification-multilingual_v1-0-0.mk
- cookbook/paths_content_item_classification.mk
- cookbook/processing_content_item_classification.mk