commercetools/gdrive_cleanup

Google Drive Cleanup & AI Readiness

A Python application for IT Ops to archive stale Google Drive documents and structure Drive for AI agent access. Each phase is an independent service with its own test suite.

Prerequisites

  • Python 3.11+
  • A Google Cloud Platform project with billing enabled
  • A service account with domain-wide delegation configured
  • Google Workspace admin access

Required APIs (enable in GCP Console)

  • Google Drive API v3
  • Google Drive Labels API v2
  • Admin SDK Reports API
  • Google Sheets API v4
  • BigQuery API

Service Account Setup

  1. Create a service account in your GCP project
  2. Download the JSON key file and place it at config/service-account-key.json
  3. In Google Workspace Admin Console, go to Security > API Controls > Domain-wide Delegation
  4. Add the service account's client ID with these scopes:
    • https://www.googleapis.com/auth/drive
    • https://www.googleapis.com/auth/drive.labels
    • https://www.googleapis.com/auth/admin.reports.audit.readonly
    • https://www.googleapis.com/auth/spreadsheets
    • https://www.googleapis.com/auth/bigquery
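With delegation configured, the application can impersonate a Workspace admin using the key file and scopes above. A minimal sketch of that flow, assuming the `google-auth` package is installed (the function name and its deferred import are illustrative, not the project's actual `src/common/auth.py`):

```python
# The five scopes registered for domain-wide delegation.
SCOPES = [
    "https://www.googleapis.com/auth/drive",
    "https://www.googleapis.com/auth/drive.labels",
    "https://www.googleapis.com/auth/admin.reports.audit.readonly",
    "https://www.googleapis.com/auth/spreadsheets",
    "https://www.googleapis.com/auth/bigquery",
]


def delegated_credentials(key_path: str, subject: str):
    """Load the service account key and impersonate a Workspace user."""
    # Assumes google-auth is installed; imported lazily so the scope
    # list above can be inspected without it.
    from google.oauth2 import service_account

    creds = service_account.Credentials.from_service_account_file(
        key_path, scopes=SCOPES
    )
    # with_subject() returns a copy of the credentials that acts as
    # the given admin user via domain-wide delegation.
    return creds.with_subject(subject)
```

The `subject` is the admin email configured for impersonation (see Configuration below).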

Installation

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Configuration

Copy the sample config and fill in your values:

cp config/config.yaml config/config.local.yaml

Edit config/config.local.yaml with your:

  • GCP project ID
  • Service account key path
  • Admin email for impersonation
  • Department names, codes, lead emails, and group emails
  • Archive cutoff date

Important: The application starts in dry_run: true mode by default. Set dry_run: false only when ready to execute real changes.
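A hypothetical shape of the resulting `config.local.yaml`, covering the values listed above. Key names here are illustrative; the sample `config/config.yaml` is authoritative:

```yaml
# Illustrative only — check config/config.yaml for the actual key names.
gcp_project_id: my-project
service_account_key: config/service-account-key.json
admin_email: admin@example.com
departments:
  - name: Finance
    code: FIN
    lead_email: finance-lead@example.com
    group_email: finance@example.com
archive_cutoff: "2022-01-01"
dry_run: true   # default; set to false only when ready for real changes
```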

Usage

Each phase is a standalone CLI command. Run them in order:

# Phase 0: Validate connectivity, create BigQuery tables, create Drive Labels
python cli.py -c config/config.local.yaml phase0

# Phase 1: Crawl all drives, identify archive candidates, generate review sheets
python cli.py -c config/config.local.yaml phase1

# Phase 2: Process department review responses, apply dispositions
python cli.py -c config/config.local.yaml phase2 --sheet-ids '{"Finance": "spreadsheet_id_here"}'

# Phase 3: Create archive drives, move files, apply labels, lock drives
python cli.py -c config/config.local.yaml phase3

# Phase 4: Run AI analysis, generate per-owner action reports
python cli.py -c config/config.local.yaml phase4

# Phase 4 (analysis only, no reports):
python cli.py -c config/config.local.yaml phase4 --analyze-only

# Phase 4 (deploy folder structures only):
python cli.py -c config/config.local.yaml phase4 --deploy-folders

# Phase 5: Generate access manifest for AI agents
python cli.py -c config/config.local.yaml phase5

# Phase 6: Run governance checks (archive sweep, permission audit, AI readiness scores)
python cli.py -c config/config.local.yaml phase6

# Phase 6 (dashboard data only):
python cli.py -c config/config.local.yaml phase6 --dashboard

Add -v for verbose/debug logging:

python cli.py -v -c config/config.local.yaml phase1
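The `--sheet-ids` argument for Phase 2 takes a JSON object mapping department names to review-spreadsheet IDs. A small sketch of how such a value can be validated before use (the helper name is hypothetical, not the CLI's actual parser):

```python
import json


def parse_sheet_ids(raw: str) -> dict:
    """Parse a --sheet-ids value like '{"Finance": "spreadsheet_id_here"}'."""
    mapping = json.loads(raw)
    if not isinstance(mapping, dict):
        raise ValueError("--sheet-ids must be a JSON object")
    # Every key is a department name, every value a spreadsheet ID string.
    for dept, sheet_id in mapping.items():
        if not isinstance(sheet_id, str) or not sheet_id:
            raise ValueError(f"invalid spreadsheet ID for {dept!r}")
    return mapping
```

Note the single quotes around the JSON in the shell example above: they keep the inner double quotes intact.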

Project Structure

gdrive_cleanup/
├── cli.py                          # CLI entrypoint (click)
├── config/
│   └── config.yaml                 # Sample configuration
├── src/
│   ├── common/                     # Shared clients and utilities
│   │   ├── auth.py                 # Service account authentication
│   │   ├── bigquery_client.py      # BigQuery wrapper + table schemas
│   │   ├── config.py               # YAML config loader
│   │   ├── drive_client.py         # Drive API v3 wrapper
│   │   ├── labels_client.py        # Drive Labels API v2 wrapper
│   │   └── sheets_client.py        # Google Sheets API wrapper
│   ├── phase0_infrastructure/      # Service account validation, BQ setup, label taxonomy
│   ├── phase1_discovery/           # Domain-wide file crawl, path reconstruction, archive ID
│   ├── phase2_review/              # Department review processing, disposition labeling
│   ├── phase3_archive/             # Archive drive creation, file migration, drive locking
│   ├── phase4_analysis/            # AI-driven file analysis, per-owner action reports
│   ├── phase5_manifest/            # Access manifest generation for AI agents
│   └── phase6_governance/          # Ongoing governance checks and AI readiness scoring
├── tests/
│   ├── conftest.py                 # Shared fixtures and sample data
│   ├── test_common.py              # Tests for config, auth, rate limiting
│   ├── test_phase0.py              # Tests for infrastructure setup
│   ├── test_phase1.py              # Tests for discovery and inventory
│   ├── test_phase2.py              # Tests for department review
│   ├── test_phase3.py              # Tests for archive execution
│   ├── test_phase4.py              # Tests for AI analysis and owner reports
│   ├── test_phase5.py              # Tests for access manifest
│   └── test_phase6.py              # Tests for ongoing governance
├── requirements.txt
├── pyproject.toml
├── execution_plan.md               # Detailed phase-by-phase plan
├── phase_flow_diagram.md           # Mermaid flowchart of all phases
└── overview.md                     # Original requirements and standards

Running Tests

# Run all tests
pytest

# Run tests for a specific phase
pytest tests/test_phase1.py

# Run with verbose output
pytest -v

# Run a specific test class
pytest tests/test_phase4.py::TestFindDuplicates
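New phase tests can follow the same class-per-feature pattern. An illustrative sketch of what a duplicate-detection test might look like — the class name echoes `TestFindDuplicates` above, but the data and grouping logic here are hypothetical, not the project's actual implementation:

```python
# Illustrative test shape — fixture data and logic are hypothetical.
class TestFindDuplicates:
    def test_identical_checksums_grouped(self):
        files = [
            {"id": "a", "md5Checksum": "x"},
            {"id": "b", "md5Checksum": "x"},
            {"id": "c", "md5Checksum": "y"},
        ]
        # Group file IDs by checksum, keep only groups with >1 member.
        groups = {}
        for f in files:
            groups.setdefault(f["md5Checksum"], []).append(f["id"])
        duplicates = {k: v for k, v in groups.items() if len(v) > 1}
        assert duplicates == {"x": ["a", "b"]}
```

(`md5Checksum` is a real Drive API v3 file field, which is presumably what duplicate detection keys on.)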

Phases Overview

| Phase | Service | What It Does |
|-------|---------|--------------|
| 0 | InfrastructureService | Validates API connectivity, provisions the BigQuery dataset/tables, creates the Drive Label taxonomy |
| 1 | DiscoveryService | Crawls all Shared Drives, reconstructs folder paths, flags protected documents, identifies archive candidates, generates review sheets for department leads |
| 2 | ReviewService | Reads department lead responses from review sheets, applies lifecycle labels (Active / Archive-Candidate / Under-Review), generates an executive summary |
| 3 | ArchiveService | Creates archive Shared Drives, recreates folder structures, moves files individually (an API limitation), applies archive labels, locks drives to read-only, validates the migration |
| 4 | AnalysisService | Runs AI-powered analysis (duplicates, naming issues, permission anomalies, depth violations, orphans, stale files, label inference) and generates per-owner action reports. Owners take action, not IT. |
| 5 | ManifestService | Builds a machine-readable access manifest (drive memberships, folder trees, file permissions, labels) for AI agents. Exports to BigQuery and JSON. Analyzes permission simplification opportunities. |
| 6 | GovernanceService | Quarterly archive sweeps, monthly permission audits, folder depth enforcement, stale file alerts, per-department AI readiness scoring, dashboard data generation |

Dry Run Mode

All destructive operations (creating drives, moving files, applying labels, deleting permissions) check config.dry_run before executing. When dry_run: true:

  • File moves are logged but not executed
  • Drive creation is skipped
  • Labels are not applied
  • Folder structures are planned but not created
  • All analysis and reporting runs normally

Review the logs in dry run mode before switching to dry_run: false.
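The guard pattern behind this behavior can be sketched as follows. Names are illustrative, not the project's actual implementation; the Drive call shown is the standard v3 `files.update` move:

```python
import logging

log = logging.getLogger("gdrive_cleanup")


def move_file(drive, file_id: str, dest_folder_id: str, dry_run: bool) -> bool:
    """Move a file, or only log the intended move when dry_run is set."""
    if dry_run:
        # Logged but not executed — safe to run repeatedly.
        log.info("[dry-run] would move %s -> %s", file_id, dest_folder_id)
        return False
    # Real Drive API v3 call: reparent the file into the destination folder.
    drive.files().update(
        fileId=file_id,
        addParents=dest_folder_id,
        supportsAllDrives=True,
    ).execute()
    return True
```

Because every destructive helper returns early under `dry_run`, analysis and reporting above it run unchanged.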

Key Design Decisions

  • Drive Labels over file renaming as the primary metadata strategy — labels are machine-queryable, survive file operations, and don't require retroactive renaming
  • modifiedTime as primary archive signal — the Admin Reports API only provides 180 days of view history, making modifiedTime the reliable cross-domain signal
  • File-by-file migration — the Drive API cannot move folders between Shared Drives, only individual files
  • BigQuery as canonical store — handles large file inventories, supports SQL analysis, connects to dashboards
  • Phase 4 is owner-led — AI identifies issues and generates reports, but file owners decide what to do. IT does not move, rename, or reorganize anyone's files
  • Every phase is independently runnable — phases can be re-run, run in isolation, or run on a schedule
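The `modifiedTime` signal maps directly onto a Drive v3 `files.list` query. A small sketch of building one from the configured archive cutoff (the helper name is illustrative):

```python
from datetime import datetime, timezone


def archive_candidate_query(cutoff: datetime) -> str:
    """Build a Drive API v3 `q` expression for files not modified since cutoff."""
    # Drive query comparisons expect RFC 3339 timestamps.
    stamp = cutoff.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")
    return f"modifiedTime < '{stamp}' and trashed = false"
```

For example, a cutoff of 2022-01-01 UTC yields `modifiedTime < '2022-01-01T00:00:00' and trashed = false`, which can be passed as the `q` parameter when listing files.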

IT Ops Implementation Notes

  1. Start with Phase 0 to validate everything is wired up correctly
  2. Run Phase 1 in dry run first to see the inventory size and review the data
  3. Allow 10 business days for department review in Phase 2
  4. Run Phase 3 during off-hours to minimize disruption (file moves are user-visible)
  5. Phase 4 can be re-run periodically as owners make progress
  6. Phase 5 manifest should be scheduled weekly via Cloud Scheduler + Cloud Functions
  7. Phase 6 governance checks should be scheduled quarterly/monthly as appropriate

About

A one-off application that structures and implements an org-wide cleanup of Google Drive with department-level suggestions.
