commercetools/gdrive_cleanup

Google Drive Cleanup & AI Readiness

A Python application for IT Ops to archive stale Google Drive documents and structure Drive for AI agent access. Each phase is an independent service with its own test suite.

Prerequisites

  • Python 3.11+
  • A Google Cloud Platform project with billing enabled
  • A service account with domain-wide delegation configured
  • Google Workspace admin access

Required APIs (enable in GCP Console)

  • Google Drive API v3
  • Google Drive Labels API v2
  • Admin SDK Reports API
  • Google Sheets API v4
  • BigQuery API

Service Account Setup

  1. Create a service account in your GCP project
  2. Download the JSON key file and place it at config/service-account-key.json
  3. In Google Workspace Admin Console, go to Security > API Controls > Domain-wide Delegation
  4. Add the service account's client ID with these scopes:
    • https://www.googleapis.com/auth/drive
    • https://www.googleapis.com/auth/drive.labels
    • https://www.googleapis.com/auth/admin.reports.audit.readonly
    • https://www.googleapis.com/auth/spreadsheets
    • https://www.googleapis.com/auth/bigquery
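With delegation configured, the application can impersonate a Workspace admin using the key file and scopes above. A minimal sketch of that flow, assuming the `google-auth` package is installed (the function name and its deferred import are illustrative, not the project's actual `src/common/auth.py`):

```python
# The five scopes registered for domain-wide delegation.
SCOPES = [
    "https://www.googleapis.com/auth/drive",
    "https://www.googleapis.com/auth/drive.labels",
    "https://www.googleapis.com/auth/admin.reports.audit.readonly",
    "https://www.googleapis.com/auth/spreadsheets",
    "https://www.googleapis.com/auth/bigquery",
]


def delegated_credentials(key_path: str, subject: str):
    """Load the service account key and impersonate a Workspace user."""
    # Assumes google-auth is installed; imported lazily so the scope
    # list above can be inspected without it.
    from google.oauth2 import service_account

    creds = service_account.Credentials.from_service_account_file(
        key_path, scopes=SCOPES
    )
    # with_subject() returns a copy of the credentials that acts as
    # the given admin user via domain-wide delegation.
    return creds.with_subject(subject)
```

The `subject` is the admin email configured for impersonation (see Configuration below).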

Installation

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Configuration

Copy the sample config and fill in your values:

cp config/config.yaml config/config.local.yaml

Edit config/config.local.yaml with your:

  • GCP project ID
  • Service account key path
  • Admin email for impersonation
  • Department names, codes, lead emails, and group emails
  • Archive cutoff date

Important: The application starts in dry_run: true mode by default. Set dry_run: false only when ready to execute real changes.
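A hypothetical shape of the resulting `config.local.yaml`, covering the values listed above. Key names here are illustrative; the sample `config/config.yaml` is authoritative:

```yaml
# Illustrative only — check config/config.yaml for the actual key names.
gcp_project_id: my-project
service_account_key: config/service-account-key.json
admin_email: admin@example.com
departments:
  - name: Finance
    code: FIN
    lead_email: finance-lead@example.com
    group_email: finance@example.com
archive_cutoff: "2022-01-01"
dry_run: true   # default; set to false only when ready for real changes
```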

Usage

Each phase is a standalone CLI command. Run them in order:

# Phase 0: Validate connectivity, create BigQuery tables, create Drive Labels
python cli.py -c config/config.local.yaml phase0

# Phase 1: Crawl all drives, identify archive candidates, generate review sheets
python cli.py -c config/config.local.yaml phase1

# Phase 2: Process department review responses, apply dispositions
python cli.py -c config/config.local.yaml phase2 --sheet-ids '{"Finance": "spreadsheet_id_here"}'

# Phase 3: Create archive drives, move files, apply labels, lock drives
python cli.py -c config/config.local.yaml phase3

# Phase 4: Run AI analysis, generate per-owner action reports
python cli.py -c config/config.local.yaml phase4

# Phase 4 (analysis only, no reports):
python cli.py -c config/config.local.yaml phase4 --analyze-only

# Phase 4 (deploy folder structures only):
python cli.py -c config/config.local.yaml phase4 --deploy-folders

# Phase 5: Generate access manifest for AI agents
python cli.py -c config/config.local.yaml phase5

# Phase 6: Run governance checks (archive sweep, permission audit, AI readiness scores)
python cli.py -c config/config.local.yaml phase6

# Phase 6 (dashboard data only):
python cli.py -c config/config.local.yaml phase6 --dashboard

Add -v for verbose/debug logging:

python cli.py -v -c config/config.local.yaml phase1
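The `--sheet-ids` argument for Phase 2 takes a JSON object mapping department names to review-spreadsheet IDs. A small sketch of how such a value can be validated before use (the helper name is hypothetical, not the CLI's actual parser):

```python
import json


def parse_sheet_ids(raw: str) -> dict:
    """Parse a --sheet-ids value like '{"Finance": "spreadsheet_id_here"}'."""
    mapping = json.loads(raw)
    if not isinstance(mapping, dict):
        raise ValueError("--sheet-ids must be a JSON object")
    # Every key is a department name, every value a spreadsheet ID string.
    for dept, sheet_id in mapping.items():
        if not isinstance(sheet_id, str) or not sheet_id:
            raise ValueError(f"invalid spreadsheet ID for {dept!r}")
    return mapping
```

Note the single quotes around the JSON in the shell example above: they keep the inner double quotes intact.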

Project Structure

gdrive_cleanup/
├── cli.py                          # CLI entrypoint (click)
├── config/
│   └── config.yaml                 # Sample configuration
├── src/
│   ├── common/                     # Shared clients and utilities
│   │   ├── auth.py                 # Service account authentication
│   │   ├── bigquery_client.py      # BigQuery wrapper + table schemas
│   │   ├── config.py               # YAML config loader
│   │   ├── drive_client.py         # Drive API v3 wrapper
│   │   ├── labels_client.py        # Drive Labels API v2 wrapper
│   │   └── sheets_client.py        # Google Sheets API wrapper
│   ├── phase0_infrastructure/      # Service account validation, BQ setup, label taxonomy
│   ├── phase1_discovery/           # Domain-wide file crawl, path reconstruction, archive ID
│   ├── phase2_review/              # Department review processing, disposition labeling
│   ├── phase3_archive/             # Archive drive creation, file migration, drive locking
│   ├── phase4_analysis/            # AI-driven file analysis, per-owner action reports
│   ├── phase5_manifest/            # Access manifest generation for AI agents
│   └── phase6_governance/          # Ongoing governance checks and AI readiness scoring
├── tests/
│   ├── conftest.py                 # Shared fixtures and sample data
│   ├── test_common.py              # Tests for config, auth, rate limiting
│   ├── test_phase0.py              # Tests for infrastructure setup
│   ├── test_phase1.py              # Tests for discovery and inventory
│   ├── test_phase2.py              # Tests for department review
│   ├── test_phase3.py              # Tests for archive execution
│   ├── test_phase4.py              # Tests for AI analysis and owner reports
│   ├── test_phase5.py              # Tests for access manifest
│   └── test_phase6.py              # Tests for ongoing governance
├── requirements.txt
├── pyproject.toml
├── execution_plan.md               # Detailed phase-by-phase plan
├── phase_flow_diagram.md           # Mermaid flowchart of all phases
└── overview.md                     # Original requirements and standards

Running Tests

# Run all tests
pytest

# Run tests for a specific phase
pytest tests/test_phase1.py

# Run with verbose output
pytest -v

# Run a specific test class
pytest tests/test_phase4.py::TestFindDuplicates
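New phase tests can follow the same class-per-feature pattern. An illustrative sketch of what a duplicate-detection test might look like — the class name echoes `TestFindDuplicates` above, but the data and grouping logic here are hypothetical, not the project's actual implementation:

```python
# Illustrative test shape — fixture data and logic are hypothetical.
class TestFindDuplicates:
    def test_identical_checksums_grouped(self):
        files = [
            {"id": "a", "md5Checksum": "x"},
            {"id": "b", "md5Checksum": "x"},
            {"id": "c", "md5Checksum": "y"},
        ]
        # Group file IDs by checksum, keep only groups with >1 member.
        groups = {}
        for f in files:
            groups.setdefault(f["md5Checksum"], []).append(f["id"])
        duplicates = {k: v for k, v in groups.items() if len(v) > 1}
        assert duplicates == {"x": ["a", "b"]}
```

(`md5Checksum` is a real Drive API v3 file field, which is presumably what duplicate detection keys on.)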

Phases Overview

| Phase | Service | What It Does |
|-------|---------|--------------|
| 0 | InfrastructureService | Validates API connectivity, provisions the BigQuery dataset/tables, creates the Drive Label taxonomy |
| 1 | DiscoveryService | Crawls all Shared Drives, reconstructs folder paths, flags protected documents, identifies archive candidates, generates review sheets for department leads |
| 2 | ReviewService | Reads department lead responses from review sheets, applies lifecycle labels (Active / Archive-Candidate / Under-Review), generates an executive summary |
| 3 | ArchiveService | Creates archive Shared Drives, recreates folder structures, moves files individually (an API limitation), applies archive labels, locks drives to read-only, validates the migration |
| 4 | AnalysisService | Runs AI-powered analysis (duplicates, naming issues, permission anomalies, depth violations, orphans, stale files, label inference) and generates per-owner action reports. Owners take action, not IT. |
| 5 | ManifestService | Builds a machine-readable access manifest (drive memberships, folder trees, file permissions, labels) for AI agents. Exports to BigQuery and JSON. Analyzes permission simplification opportunities. |
| 6 | GovernanceService | Quarterly archive sweeps, monthly permission audits, folder depth enforcement, stale file alerts, per-department AI readiness scoring, dashboard data generation |

Dry Run Mode

All destructive operations (creating drives, moving files, applying labels, deleting permissions) check config.dry_run before executing. When dry_run: true:

  • File moves are logged but not executed
  • Drive creation is skipped
  • Labels are not applied
  • Folder structures are planned but not created
  • All analysis and reporting runs normally

Review the logs in dry run mode before switching to dry_run: false.
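The guard pattern behind this behavior can be sketched as follows. Names are illustrative, not the project's actual implementation; the Drive call shown is the standard v3 `files.update` move:

```python
import logging

log = logging.getLogger("gdrive_cleanup")


def move_file(drive, file_id: str, dest_folder_id: str, dry_run: bool) -> bool:
    """Move a file, or only log the intended move when dry_run is set."""
    if dry_run:
        # Logged but not executed — safe to run repeatedly.
        log.info("[dry-run] would move %s -> %s", file_id, dest_folder_id)
        return False
    # Real Drive API v3 call: reparent the file into the destination folder.
    drive.files().update(
        fileId=file_id,
        addParents=dest_folder_id,
        supportsAllDrives=True,
    ).execute()
    return True
```

Because every destructive helper returns early under `dry_run`, analysis and reporting above it run unchanged.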

Key Design Decisions

  • Drive Labels over file renaming as the primary metadata strategy — labels are machine-queryable, survive file operations, and don't require retroactive renaming
  • modifiedTime as primary archive signal — the Admin Reports API only provides 180 days of view history, making modifiedTime the reliable cross-domain signal
  • File-by-file migration — the Drive API cannot move folders between Shared Drives, only individual files
  • BigQuery as canonical store — handles large file inventories, supports SQL analysis, connects to dashboards
  • Phase 4 is owner-led — AI identifies issues and generates reports, but file owners decide what to do. IT does not move, rename, or reorganize anyone's files
  • Every phase is independently runnable — phases can be re-run, run in isolation, or run on a schedule
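The `modifiedTime` signal maps directly onto a Drive v3 `files.list` query. A small sketch of building one from the configured archive cutoff (the helper name is illustrative):

```python
from datetime import datetime, timezone


def archive_candidate_query(cutoff: datetime) -> str:
    """Build a Drive API v3 `q` expression for files not modified since cutoff."""
    # Drive query comparisons expect RFC 3339 timestamps.
    stamp = cutoff.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")
    return f"modifiedTime < '{stamp}' and trashed = false"
```

For example, a cutoff of 2022-01-01 UTC yields `modifiedTime < '2022-01-01T00:00:00' and trashed = false`, which can be passed as the `q` parameter when listing files.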

IT Ops Implementation Notes

  1. Start with Phase 0 to validate everything is wired up correctly
  2. Run Phase 1 in dry run first to see the inventory size and review the data
  3. Allow 10 business days for department review in Phase 2
  4. Run Phase 3 during off-hours to minimize disruption (file moves are user-visible)
  5. Phase 4 can be re-run periodically as owners make progress
  6. Phase 5 manifest should be scheduled weekly via Cloud Scheduler + Cloud Functions
  7. Phase 6 governance checks should be scheduled quarterly/monthly as appropriate

About

A one-off application that structures and implements an org-wide cleanup of Google Drive with department-level suggestions.
