# gdrive_cleanup

A Python application for IT Ops to archive stale Google Drive documents and structure Drive for AI agent access. Each phase is an independent service with its own test suite.
## Prerequisites

- Python 3.11+
- A Google Cloud Platform project with billing enabled
- A service account with domain-wide delegation configured
- Google Workspace admin access

## APIs Used

- Google Drive API v3
- Google Drive Labels API v2
- Admin SDK Reports API
- Google Sheets API v4
- BigQuery API
## Setup

- Create a service account in your GCP project
- Download the JSON key file and place it at `config/service-account-key.json`
- In Google Workspace Admin Console, go to Security > API Controls > Domain-wide Delegation
- Add the service account's client ID with these scopes:

```
https://www.googleapis.com/auth/drive
https://www.googleapis.com/auth/drive.labels
https://www.googleapis.com/auth/admin.reports.audit.readonly
https://www.googleapis.com/auth/spreadsheets
https://www.googleapis.com/auth/bigquery
```
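As a sketch of the delegated-credential flow this setup enables, using the `google-auth` package (the function name here is an illustrative assumption, not the project's actual `auth.py` code):

```python
# Illustrative sketch of domain-wide delegation with google-auth.
# SCOPES mirrors the list registered in the Admin Console above.
SCOPES = [
    "https://www.googleapis.com/auth/drive",
    "https://www.googleapis.com/auth/drive.labels",
    "https://www.googleapis.com/auth/admin.reports.audit.readonly",
    "https://www.googleapis.com/auth/spreadsheets",
    "https://www.googleapis.com/auth/bigquery",
]

def build_delegated_credentials(key_path: str, admin_email: str):
    """Load the service-account key and impersonate a Workspace admin."""
    # Imported lazily so this module loads even without google-auth installed.
    from google.oauth2 import service_account

    creds = service_account.Credentials.from_service_account_file(
        key_path, scopes=SCOPES
    )
    # with_subject() is what makes this domain-wide delegation: API calls
    # act as admin_email rather than as the service account itself.
    return creds.with_subject(admin_email)
```

The returned credentials can be passed to any Google API client builder.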
## Installation

```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

## Configuration

Copy the sample config and fill in your values:

```bash
cp config/config.yaml config/config.local.yaml
```

Edit `config/config.local.yaml` with your:
- GCP project ID
- Service account key path
- Admin email for impersonation
- Department names, codes, lead emails, and group emails
- Archive cutoff date
Important: The application starts in dry_run: true mode by default. Set dry_run: false only when ready to execute real changes.
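The resulting `config.local.yaml` might look like the following. Every key name here is an illustrative assumption; check the sample `config/config.yaml` for the actual schema:

```yaml
# Hypothetical shape of config.local.yaml -- field names are assumptions,
# not the project's actual schema.
gcp_project_id: my-project-123
service_account_key: config/service-account-key.json
admin_email: admin@example.com
departments:
  - name: Finance
    code: FIN
    lead_email: finance-lead@example.com
    group_email: finance@example.com
archive_cutoff_date: "2023-01-01"
dry_run: true   # keep true until you have reviewed the dry-run logs
```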
## Usage

Each phase is a standalone CLI command. Run them in order:
```bash
# Phase 0: Validate connectivity, create BigQuery tables, create Drive Labels
python cli.py -c config/config.local.yaml phase0

# Phase 1: Crawl all drives, identify archive candidates, generate review sheets
python cli.py -c config/config.local.yaml phase1

# Phase 2: Process department review responses, apply dispositions
python cli.py -c config/config.local.yaml phase2 --sheet-ids '{"Finance": "spreadsheet_id_here"}'

# Phase 3: Create archive drives, move files, apply labels, lock drives
python cli.py -c config/config.local.yaml phase3

# Phase 4: Run AI analysis, generate per-owner action reports
python cli.py -c config/config.local.yaml phase4

# Phase 4 (analysis only, no reports):
python cli.py -c config/config.local.yaml phase4 --analyze-only

# Phase 4 (deploy folder structures only):
python cli.py -c config/config.local.yaml phase4 --deploy-folders

# Phase 5: Generate access manifest for AI agents
python cli.py -c config/config.local.yaml phase5

# Phase 6: Run governance checks (archive sweep, permission audit, AI readiness scores)
python cli.py -c config/config.local.yaml phase6

# Phase 6 (dashboard data only):
python cli.py -c config/config.local.yaml phase6 --dashboard
```

Add `-v` for verbose/debug logging:
```bash
python cli.py -v -c config/config.local.yaml phase1
```

## Project Structure

```
gdrive_cleanup/
├── cli.py                       # CLI entrypoint (click)
├── config/
│   └── config.yaml              # Sample configuration
├── src/
│   ├── common/                  # Shared clients and utilities
│   │   ├── auth.py              # Service account authentication
│   │   ├── bigquery_client.py   # BigQuery wrapper + table schemas
│   │   ├── config.py            # YAML config loader
│   │   ├── drive_client.py      # Drive API v3 wrapper
│   │   ├── labels_client.py     # Drive Labels API v2 wrapper
│   │   └── sheets_client.py     # Google Sheets API wrapper
│   ├── phase0_infrastructure/   # Service account validation, BQ setup, label taxonomy
│   ├── phase1_discovery/        # Domain-wide file crawl, path reconstruction, archive ID
│   ├── phase2_review/           # Department review processing, disposition labeling
│   ├── phase3_archive/          # Archive drive creation, file migration, drive locking
│   ├── phase4_analysis/         # AI-driven file analysis, per-owner action reports
│   ├── phase5_manifest/         # Access manifest generation for AI agents
│   └── phase6_governance/       # Ongoing governance checks and AI readiness scoring
├── tests/
│   ├── conftest.py              # Shared fixtures and sample data
│   ├── test_common.py           # Tests for config, auth, rate limiting
│   ├── test_phase0.py           # Tests for infrastructure setup
│   ├── test_phase1.py           # Tests for discovery and inventory
│   ├── test_phase2.py           # Tests for department review
│   ├── test_phase3.py           # Tests for archive execution
│   ├── test_phase4.py           # Tests for AI analysis and owner reports
│   ├── test_phase5.py           # Tests for access manifest
│   └── test_phase6.py           # Tests for ongoing governance
├── requirements.txt
├── pyproject.toml
├── execution_plan.md            # Detailed phase-by-phase plan
├── phase_flow_diagram.md        # Mermaid flowchart of all phases
└── overview.md                  # Original requirements and standards
```
## Testing

```bash
# Run all tests
pytest

# Run tests for a specific phase
pytest tests/test_phase1.py

# Run with verbose output
pytest -v

# Run a specific test class
pytest tests/test_phase4.py::TestFindDuplicates
```

## Phases

| Phase | Service | What It Does |
|---|---|---|
| 0 | InfrastructureService | Validates API connectivity, provisions BigQuery dataset/tables, creates the Drive Label taxonomy |
| 1 | DiscoveryService | Crawls all Shared Drives, reconstructs folder paths, flags protected documents, identifies archive candidates, generates review sheets for department leads |
| 2 | ReviewService | Reads department lead responses from review sheets, applies lifecycle labels (Active/Archive-Candidate/Under-Review), generates executive summary |
| 3 | ArchiveService | Creates archive Shared Drives, recreates folder structures, moves files individually (API limitation), applies archive labels, locks drives to read-only, validates migration |
| 4 | AnalysisService | Runs AI-powered analysis (duplicates, naming issues, permission anomalies, depth violations, orphans, stale files, label inference), generates per-owner action reports. Owners take action, not IT. |
| 5 | ManifestService | Builds a machine-readable access manifest (drive memberships, folder trees, file permissions, labels) for AI agents. Exports to BigQuery and JSON. Analyzes permission simplification opportunities. |
| 6 | GovernanceService | Quarterly archive sweeps, monthly permission audits, folder depth enforcement, stale file alerts, AI readiness scoring per department, dashboard data generation |
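The Phase 5 manifest schema is not documented here; as a purely hypothetical sketch of what one JSON entry might contain (every field name below is an assumption, not ManifestService's actual output):

```python
import json

# Hypothetical shape of one access-manifest entry (Phase 5). The real
# schema is defined by ManifestService; all field names are assumptions.
manifest_entry = {
    "drive_id": "0ABCdefGHIjkl",
    "drive_name": "Finance - Archive 2023",
    "members": [{"email": "ai-agent@example.com", "role": "reader"}],
    "folders": [{"path": "/Reports/2023", "folder_id": "1XyZfolderid"}],
    "labels": {"lifecycle": "Archived"},
}

# The manifest is exported as JSON (and to BigQuery), so entries must
# round-trip cleanly through serialization.
serialized = json.dumps(manifest_entry)
assert json.loads(serialized) == manifest_entry
```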
## Dry Run Mode

All destructive operations (creating drives, moving files, applying labels, deleting permissions) check `config.dry_run` before executing. When `dry_run: true`:
- File moves are logged but not executed
- Drive creation is skipped
- Labels are not applied
- Folder structures are planned but not created
- All analysis and reporting runs normally
Review the logs in dry run mode before switching to `dry_run: false`.
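The guard pattern described above can be sketched as follows (a minimal illustration, assuming a `googleapiclient`-style Drive service; the function name and signature are assumptions, not the project's actual code):

```python
# Illustrative dry-run guard for a destructive operation.
import logging

logger = logging.getLogger("gdrive_cleanup")

def move_file(drive_service, file_id: str, target_folder_id: str,
              dry_run: bool = True) -> bool:
    """Move a file into a target folder, or only log the intended move.

    Returns True when the move was executed, False when it was skipped
    because dry_run is set.
    """
    if dry_run:
        # Logged but not executed, matching the dry-run behavior above.
        logger.info("[DRY RUN] would move %s -> %s", file_id, target_folder_id)
        return False
    # Drive API v3 moves a file by updating its parents.
    drive_service.files().update(
        fileId=file_id,
        addParents=target_folder_id,
        supportsAllDrives=True,
    ).execute()
    return True
```

Defaulting `dry_run` to `True` means forgetting to pass the flag fails safe.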
## Design Decisions

- Drive Labels over file renaming as the primary metadata strategy — labels are machine-queryable, survive file operations, and don't require retroactive renaming
- `modifiedTime` as primary archive signal — the Admin Reports API only provides 180 days of view history, making `modifiedTime` the reliable cross-domain signal
- File-by-file migration — the Drive API cannot move folders between Shared Drives, only individual files
- BigQuery as canonical store — handles large file inventories, supports SQL analysis, connects to dashboards
- Phase 4 is owner-led — AI identifies issues and generates reports, but file owners decide what to do. IT does not move, rename, or reorganize anyone's files
- Every phase is independently runnable — phases can be re-run, run in isolation, or run on a schedule
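The `modifiedTime` cutoff decision above reduces to a simple comparison. A minimal sketch (the function name is an assumption; Drive returns `modifiedTime` as an RFC 3339 timestamp):

```python
from datetime import datetime, timezone

# Example cutoff matching the archive_cutoff_date config value.
CUTOFF = datetime(2024, 1, 1, tzinfo=timezone.utc)

def is_archive_candidate(modified_time: str, cutoff: datetime) -> bool:
    """Flag a file as stale when its Drive modifiedTime predates the cutoff.

    modified_time is an RFC 3339 string as returned by the Drive API,
    e.g. "2022-05-01T10:00:00.000Z".
    """
    # fromisoformat on older Pythons does not accept the trailing "Z".
    mtime = datetime.fromisoformat(modified_time.replace("Z", "+00:00"))
    return mtime < cutoff
```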
## Operational Notes

- Start with Phase 0 to validate everything is wired up correctly
- Run Phase 1 in dry run first to see the inventory size and review the data
- Allow 10 business days for department review in Phase 2
- Run Phase 3 during off-hours to minimize disruption (file moves are user-visible)
- Phase 4 can be re-run periodically as owners make progress
- Phase 5 manifest should be scheduled weekly via Cloud Scheduler + Cloud Functions
- Phase 6 governance checks should be scheduled quarterly/monthly as appropriate