Automate the process of extracting documentation URLs and adding them as sources to NotebookLM notebooks using browser automation.
The scrape_add_links_nblm_script.py script automates URL extraction and notebook management:
- Adapted from https://github.com/sshnaidm/notebooklm/blob/master/automation/add_links_script.py
- Extracts URLs from documentation sites, with version support
- Combined workflows: runs extraction, authentication, and notebook loading in one command
- Smart resource combination: automatically merges scraped URLs with static CQA resources
- Adds URLs to NotebookLM with authentication management
- All-in-one solution for extraction and notebook loading
- URL Extraction: Scrape documentation hierarchies with smart version detection and support
- Combined Workflows: Run extract → login → add in a single command
- Static Resource Integration: Automatically includes `CQA_res.txt` static links (optional with `--skip-cqa`)
- Consistent File Handling: Always uses `urls.txt` for predictable behavior
- Version Support: Auto-detects versions in URLs, or specify versions such as 2.19, 2.20 (defaults to "latest")
- Authentication Management: Persistent Google login sessions
- Bulk URL Loading: Add multiple URLs to NotebookLM automatically
- Error Handling: Comprehensive error messages and recovery options
- YouTube & Website Support: Handles both content types
Smart Version Detection: The script automatically detects version numbers in URLs and handles them intelligently:

| URL Format | `--versions` Flag | Behavior |
| --- | --- | --- |
| https://docs.example.com/product/3.2 | Not specified | Uses detected version 3.2 |
| https://docs.example.com/product/3.2 | `--versions 2.21,latest` | Ignores detected 3.2, uses specified versions |
| https://docs.example.com/product | Not specified | Uses default `latest` |
| https://docs.example.com/product | `--versions 2.21,latest` | Uses specified versions |

Supported Version Patterns: `/latest`, `/3.2`, `/v3.2`, `/2.21.1`
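The detection-and-override behavior described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual implementation; the regex and the function name are assumptions:

```python
import re

# Matches trailing version segments such as /latest, /3.2, /v3.2, /2.21.1
VERSION_RE = re.compile(r"/(latest|v?\d+(?:\.\d+)*)/?$")

def resolve_versions(url, versions_flag=None):
    """Return (base_url, versions): an explicit --versions flag wins,
    then a version detected in the URL, then the default "latest"."""
    match = VERSION_RE.search(url)
    base_url = url[:match.start()] if match else url.rstrip("/")
    if versions_flag:                      # --versions overrides detection
        return base_url, versions_flag.split(",")
    if match:                              # version detected in the URL
        return base_url, [match.group(1)]
    return base_url, ["latest"]            # no version anywhere: default
```

For example, `resolve_versions("https://docs.example.com/product/3.2")` strips the trailing `/3.2` and uses it, while passing `versions_flag="2.21,latest"` ignores the detected version, matching the table above.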
Consistent File Handling: The script uses predictable file handling for easier workflows:
- Extraction: Always saves to `urls.txt` (unless `--toc-output` is specified)
- Notebook Mode: Always reads from `urls.txt` (unless `--links-file` is specified)
- Resource Combination: Always includes `CQA_res.txt` static resources (unless `--skip-cqa` is used)
- No Guessing: Clear, consistent behavior every time
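One plausible way to implement this merge step is sketched below. The helper name is hypothetical, and the sketch assumes each file holds one URL per line and that a missing `CQA_res.txt` is simply skipped:

```python
from pathlib import Path

def collect_urls(links_file="urls.txt", cqa_file="CQA_res.txt", skip_cqa=False):
    """Read one-URL-per-line files and merge them, preserving order
    and dropping blank lines and duplicates."""
    urls = []
    sources = [links_file] if skip_cqa else [links_file, cqa_file]
    for name in sources:
        path = Path(name)
        if not path.exists():
            continue
        for line in path.read_text().splitlines():
            line = line.strip()
            if line and line not in urls:
                urls.append(line)
    return urls
```

Deduplicating while preserving order means a URL present in both files is only added to the notebook once.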
- Python 3.7+
- A NotebookLM account
- OPTIONAL: If the project root (`add_links_notebook`) is different from the script directory, navigate to the location of the `.venv` directory:

```
# Navigate to project root (where .venv is located)
cd ../add_links_notebook
```

- Create a virtual environment if one does not exist:

```
# If using this for the first time, create a new virtual environment
python3 -m venv .venv
```

- Activate the virtual environment:

```
# Activate the existing virtual environment
source .venv/bin/activate
```

- Install the dependencies:

```
python3 -m pip install -r requirements.txt
```

- Install the browser binaries for Playwright:

```
python3 -m playwright install
```

- Check that Playwright is installed:

```
playwright --version
```

- Install the Chromium browser to automate authentication:

```
playwright install chromium
```

- OPTIONAL: If your project root is different from the script directory, navigate back to the script directory:

```
# Navigate back to script directory
cd ../add_scrapped_links_notebooklm
```

Note: Always ensure the virtual environment is active (you should see `(.venv)` in your terminal prompt) before running the script.
Full Combined Workflow (One Command)
```
# Extract URLs, authenticate, and add to notebook in one command
python3 scrape_add_links_nblm_script.py --extract-toc "https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed" --login --notebook "https://notebooklm.google.com/notebook/YOUR_NOTEBOOK_ID"
```

Add links other than docs.redhat.com links (One Command)

```
python3 scrape_add_links_nblm_script.py --login --notebook "https://notebooklm.google.com/notebook/YOUR_NOTEBOOK_ID" --links https://www.redhat.com/en/blog/red-hat-ai-inference-server-technical-deep-dive https://www.youtube.com/watch?v=b9BWbr_7xs8
```

- Extract URLs from the latest version or from specific versions of your documentation:

```
# Extract URLs from the latest version (saves to urls.txt)
python3 scrape_add_links_nblm_script.py --extract-toc "https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed"

# Extract URLs from specific versions
python3 scrape_add_links_nblm_script.py --extract-toc "https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed" --versions "latest,2.21,2.20"
```

- Authenticate with Google (first time only):
```
python3 scrape_add_links_nblm_script.py --login
```

- A browser window opens.
- Log in to Google manually. IMPORTANT: Close the browser window after logging in (this saves the session).
- Create a NotebookLM notebook:
  - Go to NotebookLM
  - Create a new notebook
  - Copy the notebook URL
  - Use the copied URL in the next step
- Add URLs to your notebook:

```
python3 scrape_add_links_nblm_script.py --notebook "https://notebooklm.google.com/notebook/YOUR_NOTEBOOK_ID"
```

Note: This automatically combines `urls.txt` (scraped URLs) and `CQA_res.txt` (static resources). Use `--skip-cqa` to exclude `CQA_res.txt`.

Files used:
- `urls.txt`: Primary file for scraped URLs (overwritten with each extraction)
- `CQA_res.txt`: Static CQA resources (always included automatically)
- Combined: The script automatically merges both files when adding to a notebook
The command-line options include:

Help:
- `--help`: Lists all available options

Extraction Mode:
- `--extract-toc URL`: Documentation URL to scrape (with or without a version)
- `--toc-output FILE`: Output file for extracted links (default: `urls.txt`)
- `--versions LIST`: Comma-separated versions (default: detected version or `latest`)

Notebook Mode:
- `--notebook URL`: NotebookLM notebook URL
- `--links-file FILE`: Links file (default: `urls.txt`; always includes `CQA_res.txt`)
- `--links URL [URL...]`: Individual URLs to add
- `--skip-cqa`: Skip including `CQA_res.txt` when using file-based links

Authentication:
- `--login`: Run the authentication process
- `--profile-path PATH`: Browser profile directory (default: `~/.browser_automation`)

Combined Workflows: You can combine any of the three main operations in a single command:
- `--extract-toc` + `--notebook`: Extract, then add
- `--login` + `--notebook`: Log in, then add
- `--extract-toc` + `--login` + `--notebook`: Full workflow
Run any of the following commands based on your use case:

```
# Full workflow (extract → login → add):
python3 scrape_add_links_nblm_script.py --extract-toc URL --login --notebook NOTEBOOK_URL

# Extract then add (uses urls.txt automatically):
python3 scrape_add_links_nblm_script.py --extract-toc URL --notebook NOTEBOOK_URL

# Login then add (uses existing urls.txt):
python3 scrape_add_links_nblm_script.py --login --notebook NOTEBOOK_URL
```

Extract with the default version (latest):

```
python3 scrape_add_links_nblm_script.py --extract-toc "BASE_URL"
```

Note: Always saves to `urls.txt` unless `--toc-output` is specified.
Extract with specific versions:

```
python3 scrape_add_links_nblm_script.py --extract-toc "BASE_URL" --versions "2.21,2.22,latest"
```

Smart version detection (URLs with versions):

```
# URL contains version - uses detected version (3.2)
python3 scrape_add_links_nblm_script.py --extract-toc "https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/3.2"

# URL contains version but override with --versions flag
python3 scrape_add_links_nblm_script.py --extract-toc "https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/3.2" --versions "latest,2.21"
```

Note: The script automatically detects and strips the version from the URL, then uses the detected version or the specified versions.

Extract with a custom output file:

```
python3 scrape_add_links_nblm_script.py --extract-toc "BASE_URL" --toc-output custom_file.txt
```

Authenticate:

```
python3 scrape_add_links_nblm_script.py --login
```

Use default files (Recommended):
```
python3 scrape_add_links_nblm_script.py --notebook "NOTEBOOK_URL"
```

This command automatically combines:
- `urls.txt` (scraped URLs)
- `CQA_res.txt` (static CQA resources)

Use a custom links file:

```
python3 scrape_add_links_nblm_script.py --notebook "NOTEBOOK_URL" --links-file custom_links.txt
```

Note: Still includes `CQA_res.txt` automatically unless the `--skip-cqa` parameter is used.

Add individual URLs:

```
python3 scrape_add_links_nblm_script.py --notebook "NOTEBOOK_URL" --links "https://example.com" "https://youtube.com/watch?v=xyz"
```

Skip CQA resources (use only extracted/custom links):

```
# Use only extracted URLs (skip CQA_res.txt)
python3 scrape_add_links_nblm_script.py --notebook "NOTEBOOK_URL" --skip-cqa

# Use only custom file (skip CQA_res.txt)
python3 scrape_add_links_nblm_script.py --notebook "NOTEBOOK_URL" --links-file custom.txt --skip-cqa

# Full workflow with skip CQA
python3 scrape_add_links_nblm_script.py --extract-toc "BASE_URL" --notebook "NOTEBOOK_URL" --skip-cqa
```

Combined workflow examples:

```
# One command to do everything
python3 scrape_add_links_nblm_script.py \
  --extract-toc "https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed" \
  --login \
  --notebook "https://notebooklm.google.com/notebook/abc123"

# Extract AI Inference Server docs (overwrites urls.txt)
python3 scrape_add_links_nblm_script.py \
  --extract-toc "https://docs.redhat.com/en/documentation/red_hat_ai_inference_server" \
  --notebook "https://notebooklm.google.com/notebook/abc123"

# Extract from multiple versions
python3 scrape_add_links_nblm_script.py \
  --extract-toc "https://docs.example.com/product" \
  --versions "v1.0,v2.0,latest" \
  --notebook "https://notebooklm.google.com/notebook/xyz789"

# URL with version - automatically uses 3.2 (no --versions needed)
python3 scrape_add_links_nblm_script.py \
  --extract-toc "https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/3.2" \
  --notebook "https://notebooklm.google.com/notebook/abc123"

# URL with version but override to get multiple versions
python3 scrape_add_links_nblm_script.py \
  --extract-toc "https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/3.2" \
  --versions "3.2,3.1,latest" \
  --notebook "https://notebooklm.google.com/notebook/abc123"

# Extract and add only Red Hat AI Inference Server docs (no CQA_res.txt)
python3 scrape_add_links_nblm_script.py \
  --extract-toc "https://docs.redhat.com/en/documentation/red_hat_ai_inference_server" \
  --notebook "https://notebooklm.google.com/notebook/abc123" \
  --skip-cqa

# Use only custom links file (no CQA_res.txt)
python3 scrape_add_links_nblm_script.py \
  --notebook "https://notebooklm.google.com/notebook/abc123" \
  --links-file my_custom_links.txt \
  --skip-cqa
```

Error: "Executable doesn't exist at .../Chromium.app/Contents/MacOS/Chromium"
Solution: Install the Playwright browsers:
```
# Navigate to project root
cd /Users/dobrenna/Documents/NLP_college/sandbox/add_links_notebook

# Activate virtual environment
source .venv/bin/activate

# Install Chromium browser
playwright install chromium

# Return to script directory
cd notebooklm/automation/add_scrapped_links_notebooklm
```

Error: "ModuleNotFoundError" or missing packages
Solutions:
- Ensure the virtual environment is activated: `source .venv/bin/activate`
- Check that you are in the right directory: you should see `(.venv)` in the prompt
- Verify the Playwright installation: `playwright --version`

Error: "ProcessSingleton" errors
Solution: Clear the browser lock files:

```
rm -f ~/.browser_automation/SingletonLock ~/.browser_automation/SingletonCookie ~/.browser_automation/SingletonSocket
```

Error: "Main links file not found - urls.txt"
Solutions:
- Run `--extract-toc` first to create `urls.txt`
- Specify `--links-file` with an existing file
- Provide `--links` with individual URLs

Error: "Could not find Add button" (all links fail)
Solution: Re-run the login process:

```
python3 scrape_add_links_nblm_script.py --login
```

Error: 404 errors during extraction
Solutions:
- Verify that the base URL is correct
- Check whether the versions exist (try "latest" first)
- Ensure the documentation site is accessible

Error: Script times out clicking buttons
Solutions:
- Clear the browser lock files (see above)
- Ensure you closed the browser window after logging in
- Try running the login step again
- Virtual Environment: Always activate your virtual environment (`source .venv/bin/activate`) before running scripts
- Browser Installation: One-time setup with `playwright install chromium`
- Authentication: The login session is saved in the `~/.browser_automation` directory
- File Consistency: Always uses `urls.txt` for extracted URLs, for predictable behavior
- Resource Integration: Automatically includes static CQA resources from `CQA_res.txt` (use `--skip-cqa` to exclude)
- Rate Limiting: The script waits 3 seconds between URLs to avoid overwhelming NotebookLM
- Browser: Uses Chromium in visible mode so you can watch progress
- Content Types: Supports both website URLs and YouTube videos
- File Format: All URL files should have one URL per line
- Combined Workflows: Extraction, authentication, and notebook addition can run in a single command
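The rate-limited loading loop noted above can be pictured as follows. This is only a sketch: `add_source_to_notebook` stands in for the script's browser-automation step and is a hypothetical callable, not part of the real script's API:

```python
import time

def add_urls(urls, add_source_to_notebook, delay=3.0):
    """Add each URL in turn, waiting `delay` seconds between URLs to
    avoid overwhelming NotebookLM. Returns the URLs that failed."""
    failed = []
    for i, url in enumerate(urls):
        try:
            add_source_to_notebook(url)   # e.g. drives the notebook UI
        except Exception:
            failed.append(url)            # keep going; report at the end
        if i < len(urls) - 1:
            time.sleep(delay)
    return failed
```

Collecting failures instead of aborting matches the "comprehensive error messages and recovery options" behavior: one bad URL does not stop the rest of the batch.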