Google Drive Cleanup & AI Readiness — Execution Plan

Prepared: March 2026 Deliverable: Python CLI application (cli.py) with per-phase services, test coverage, and dry-run safety Objectives:

Archive and clean up stale documents across the organization's Google Drive
Structure Drive so that calling AI agents (Claude, Gemini, etc.) can reliably understand access, permissions, and content

Deliverable Summary

The execution plan is implemented as a runnable Python application in this repository. Each phase is an independent service with matching test coverage, orchestrated by a CLI entrypoint.

Component	Location	Purpose
CLI entrypoint	`cli.py`	`python cli.py phase0`, `phase1`, ..., `phase6`
Configuration	`config/config.yaml`	All settings: GCP project, departments, cutoff dates, dry-run flag
Shared clients	`src/common/`	Drive API, BigQuery, Labels API, Sheets API wrappers and auth
Phase services	`src/phase0_infrastructure/` through `src/phase6_governance/`	One service class per phase
Test suite	`tests/test_phase0.py` through `tests/test_phase6.py` + `test_common.py`	73 tests covering all phases
README	`README.md`	Full setup, usage, and implementation guide for IT Ops
Mermaid diagram	`phase_flow_diagram.md`	Visual flow of all phases and their steps

Setup: pip install -r requirements.txt then configure config/config.yaml with your GCP project and service account details. All destructive operations are gated behind dry_run: true by default.

Critical API Findings That Shape This Plan

Before diving into phases, several Google API realities should inform every decision:

Capability	What It Means for Us
Drive API v3 `files.list` with `corpora=domain`	We can query every file in the domain programmatically — no need for per-user crawling for shared drives
`modifiedTime` is reliable; `viewedByMeTime` is per-user only	We can filter on `modifiedTime` globally, but "last viewed" requires either Admin Reports API (180-day max) or impersonation via a service account to check each user's `viewedByMeTime`
Admin Reports API: 180-day lookback limit	We cannot pull "last viewed" data older than 6 months from the Reports API. For files older than that, `modifiedTime` must be the primary signal
Folders cannot be moved between Shared Drives via API	Archive migration must be file-by-file. The script must recreate folder structures in archive drives, then move individual files
Drive Labels API (v2) is GA	We can programmatically create label taxonomies (department, document type, sensitivity, archive status) and apply them to files at scale. Labels are searchable and enforceable via DLP
`files.modifyLabels` supports atomic batch operations	Label application can be done in bulk — one API call can add/update multiple label fields on a file
Gemini can federate Drive data without copying	A well-labeled, well-structured Drive is directly queryable by Gemini Enterprise. Labels and consistent naming dramatically improve retrieval accuracy
Permissions API exposes full access graph	`permissions.list` on any file returns type, role, email, and inheritance — we can build a complete access map for AI agents

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                    Google Workspace Domain                      │
│                                                                 │
│  ┌──────────────┐   ┌──────────────────┐   ┌────────────────┐  │
│  │ Service Acct  │──▶│ Drive API v3     │──▶│ File Inventory │  │
│  │ (Domain-Wide  │   │ Labels API v2    │   │ (BigQuery or   │  │
│  │  Delegation)  │   │ Admin Reports    │   │  Google Sheet) │  │
│  └──────────────┘   │ Permissions API  │   └────────────────┘  │
│                     └──────────────────┘                        │
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                  Drive Label Taxonomy                     │   │
│  │  ┌────────────┐ ┌────────────┐ ┌───────────────────────┐ │   │
│  │  │ Department │ │ Doc Type   │ │ Lifecycle Status      │ │   │
│  │  │ (Selection)│ │ (Selection)│ │ (Active/Archive/      │ │   │
│  │  │            │ │            │ │  Review/Protected)    │ │   │
│  │  └────────────┘ └────────────┘ └───────────────────────┘ │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │              Access Manifest (JSON/BigQuery)              │   │
│  │  Per-file: id, name, path, labels, permissions[], owner  │   │
│  │  → Queryable by AI agents via API or exported snapshot    │   │
│  └──────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘

Phase 0 — Infrastructure Setup

Goal: Stand up the service account, label taxonomy, and inventory pipeline before touching any files.

0.1 Service Account & Domain-Wide Delegation

Step	Detail
Create a GCP project	`gdrive-cleanup-prod` (or use existing IT project)
Create a service account	`[email protected]`
Enable APIs	Drive API v3, Drive Labels API v2, Admin SDK Reports API, Google Sheets API
Configure domain-wide delegation	In Admin Console → Security → API Controls → Domain-wide Delegation. Grant scopes: `https://www.googleapis.com/auth/drive`, `https://www.googleapis.com/auth/drive.labels`, `https://www.googleapis.com/auth/admin.reports.audit.readonly`
Test access	Impersonate an IT admin user, list files from a test shared drive

0.2 Create Drive Label Taxonomy

Using the Drive Labels API, create the following labels. These will be applied to files throughout the process and serve as the primary metadata layer for both human navigation and AI agent queries.

Label: Document-Classification

Field	Type	Values
`department`	Selection	CORP, FIN, LGL, DSBI, SLS, MKT, CST, PRD, ENG, HR, ITS, XFN
`doc-type`	Selection	RPT, POL, CTR, MTG, PLN, TPL, FIN, PRE, SPE, TRN, VND, CMP, CRE, DAT, REF
`lifecycle-status`	Selection	`Active`, `Archive-Candidate`, `Under-Review`, `Archived`, `Protected`
`sensitivity`	Selection	`Public`, `Internal`, `Confidential`, `Restricted`
`archive-date`	Date	(set when file is archived)
`reviewed-by`	User	(set when a department lead reviews)

Why labels instead of (or in addition to) file naming:

Labels are machine-queryable ('labels/Document-Classification' in labels in Drive search)
Labels survive file renames, copies, and moves
Labels can be enforced via Google Workspace DLP rules
AI agents can read labels via API without parsing file names
Labels don't require retroactive renaming of thousands of existing files

0.3 Choose Inventory Storage

Option	When to Use
Google Sheets	< 50,000 files. Simple, visible to department leads, easy to share
BigQuery	> 50,000 files. Supports SQL queries, scales to millions, integrates with Looker dashboards
Both	Use BigQuery as master, export filtered views to Sheets per department for review

Recommendation: Use BigQuery as the canonical store with per-department Sheet exports for the review workflow.

Phase 1 — Discovery & Inventory

Goal: Build a complete inventory of every file in the domain with metadata to support archiving decisions.

1.1 Full Domain File Crawl

Write a Python script (or Apps Script) using the service account to:

List all Shared Drives via drives.list()
For each Shared Drive, recursively list all files via files.list() with:
- corpora=drive, driveId=<id>, includeItemsFromAllDrives=true, supportsAllDrives=true
- fields: id, name, mimeType, modifiedTime, createdTime, owners, parents, size, shared, sharingUser, permissions(emailAddress, role, type, inherited), driveId, webViewLink
- q: no filter (we want everything)
For My Drive files (optional but recommended): Use domain-wide delegation to impersonate each user and list their My Drive files. This catches files not in Shared Drives.
Store results in BigQuery table file_inventory with schema:

file_id         STRING
name            STRING
mime_type       STRING
modified_time   TIMESTAMP
created_time    TIMESTAMP
owner_email     STRING
drive_id        STRING
drive_name      STRING
parent_path     STRING    -- reconstructed folder path
size_bytes      INT64
permissions     JSON      -- array of {email, role, type}
labels          JSON      -- any existing labels
web_link        STRING

1.2 Reconstruct Folder Paths

The Drive API returns parents (immediate parent ID only). To get full paths:

Build a parent-child lookup map from all folder entries
Walk up the tree from each file to the drive root
Store the reconstructed path as parent_path (e.g., Finance/01_Financial-Internal/General-Ledger)

This path reconstruction is essential for AI agents to understand where files live without traversing the API themselves.

1.3 Identify Archive Candidates

Run a query against the inventory:

SELECT *
FROM file_inventory
WHERE modified_time < '2024-09-01'
  AND lifecycle_label != 'Protected'
  AND mime_type != 'application/vnd.google-apps.folder'
ORDER BY drive_name, parent_path, name

1.4 Flag Protected Document Types

Use a combination of:

Keyword matching on file names and paths (e.g., contract, NDA, policy, SOP, bylaws, patent)
MIME type filtering (exclude Google Sites, Apps Script projects, etc. that shouldn't be moved)
Drive Labels — if any files already have classification labels, honor them
Manual list — department leads can pre-submit a list of file IDs that are protected

Apply the label lifecycle-status = Protected to all flagged files.

1.5 Generate Department Review Packets

For each department, export a Google Sheet with:

File name, path, last modified date, owner, current permissions, web link
A column for the department lead to mark: Keep / OK to Archive / Needs Review
Pre-filtered to only show that department's archive candidates
Share with the department lead with comment access

Phase 2 — Department Review & Approval

Goal: Get explicit sign-off from department leads before moving anything.

2.1 Kickoff Communication

Send each department lead:

Their review Sheet (from 1.5)
A brief explainer doc on what archiving means (read-only, still accessible, reversible)
Deadline: 10 business days
Escalation: files not reviewed default to "OK to Archive" (must be stated in the kickoff email and approved by executive sponsor)

2.2 Process Responses

After the review window:

Read back each department's Sheet
For files marked Keep: apply label lifecycle-status = Active
For files marked Needs Review: apply label lifecycle-status = Under-Review, schedule follow-up
For files marked OK to Archive (or unmarked): apply label lifecycle-status = Archive-Candidate
Update the BigQuery inventory with final dispositions

2.3 Executive Summary

Generate a summary report:

Total files scanned
Files approved for archive (count, total size)
Files retained
Files needing follow-up
Per-department breakdown

Phase 3 — Archive Execution

Goal: Move approved files into archive Shared Drives, preserving structure and access.

3.1 Create Archive Shared Drives

Using the Drive API:

for dept in departments:
    drive_metadata = {'name': f'{dept} — Archive (Pre-2025)'}
    request_id = str(uuid.uuid4())
    archive_drive = service.drives().create(
        requestId=request_id,
        body=drive_metadata
    ).execute()

Set permissions:

IT Ops: organizer role
Department members: reader role (view-only)
Department lead: fileOrganizer role (can organize but not delete)

3.2 Recreate Folder Structures

Since folders cannot be moved between Shared Drives via API:

For each archive candidate, look up its parent_path
Create the folder tree in the archive drive (if not already created)
Cache folder IDs to avoid duplicate creation

def ensure_folder_path(service, drive_id, path_parts, folder_cache):
    current_parent = drive_id
    for part in path_parts:
        cache_key = f"{current_parent}/{part}"
        if cache_key in folder_cache:
            current_parent = folder_cache[cache_key]
        else:
            folder = service.files().create(body={
                'name': part,
                'mimeType': 'application/vnd.google-apps.folder',
                'parents': [current_parent]
            }, supportsAllDrives=True, fields='id').execute()
            folder_cache[cache_key] = folder['id']
            current_parent = folder['id']
    return current_parent

3.3 Move Files

For each archive candidate:

service.files().update(
    fileId=file_id,
    addParents=archive_folder_id,
    removeParents=original_parent_id,
    supportsAllDrives=True
).execute()

Rate limiting: Drive API has a quota of ~12,000 queries per minute per user. Implement exponential backoff and batch into chunks of 1,000 files with pauses.

Logging: Record every move in a migration_log BigQuery table:

file_id, file_name, source_drive, source_path,
dest_drive, dest_path, moved_at, status, error

3.4 Apply Archive Labels

After moving, update each file's label:

service.files().modifyLabels(
    fileId=file_id,
    body={
        'labelModifications': [{
            'labelId': CLASSIFICATION_LABEL_ID,
            'fieldModifications': [
                {'fieldId': 'lifecycle-status', 'setSelectionValues': ['Archived']},
                {'fieldId': 'archive-date', 'setDateValues': [{'day': 3, 'month': 3, 'year': 2026}]}
            ]
        }]
    }
).execute()

3.5 Lock Archive Drives

After migration completes:

Remove write access for all non-IT users
Set archive drives to reader for department members
Confirm via permissions.list that no one has write access

3.6 Post-Archive Validation

Compare BigQuery migration_log against the archive candidate list — flag any mismatches
Spot-check 5% of files per department (verify correct location, accessible, labels applied)
Send department leads a confirmation Sheet with links to their archived files

Phase 4 — AI-Driven Analysis & Owner-Led Reorganization

Goal: Use AI to analyze the state of active files and generate per-owner action reports. File owners decide what to do and take the action themselves — IT does not move, rename, or reorganize anyone's files.

4.1 Folder Structure Deployment

For each department Shared Drive, create the standardized folder tree from the overview (e.g., 01_Governance, 02_Strategy-Planning, etc.) using the Drive API. This gives owners a clear target structure to move their files into — but the moves are theirs to make.

4.2 AI-Powered File Analysis

Using the BigQuery inventory from Phase 1, run automated analysis to generate per-owner findings. The AI identifies issues; it does not act on them.

Analysis categories:

Finding Type	What the AI Flags
Misplaced files	Files that appear to belong in a different folder based on name, content, and label vs. current location
Duplicate or near-duplicate files	Files with the same or very similar names, sizes, and content hashes in multiple locations
Naming issues	Files with ambiguous names, special characters, excessively long names, or names that don't match their content
Suggested labels	Inferred department, doc-type, and sensitivity labels based on file path, name, and content analysis
Permission anomalies	Files shared more broadly than their siblings, external shares on internal docs, stale individual shares for people no longer at the company
Folder depth violations	Files nested more than 4 levels deep
Orphaned files	Files sitting at the root of a drive or in an unstructured location
Stale active files	Files not archived but not modified in 6+ months — may need attention

4.3 Generate Per-Owner Action Reports

For each file owner, generate a personalized Google Sheet delivered via email:

Your Files to Review — every file they own, grouped by finding type
Each row includes: file name, current path, suggested action (move to X, rename to Y, apply label Z, review sharing), and a direct link to the file
A column for the owner to mark: Done / Skip / Disagree
Instructions are clear: "These are suggestions. You decide what to do. Only you will move or rename your files."

4.4 Self-Service Tools for Owners

Provide lightweight tools so owners can act on findings easily:

File Name Generator — a Google Apps Script sidebar or Google Form where owners paste a description and get back a compliant file name
Label Applier — a simple UI (Apps Script add-on) that lets owners apply Drive Labels to their files without needing to understand the label taxonomy
Suggested Folder Finder — given a file, suggests which folder in the new structure it belongs in

4.5 Track Completion (Not Enforce)

IT monitors progress through automated dashboards, not mandates:

Track how many flagged files have been addressed per department (via label changes, moves, or owner marking Done)
Surface completion rates to department leads — peer visibility drives action better than enforcement
Escalation path exists (executive sponsor) but is a last resort, not the default
Rerun the AI analysis periodically to show improvement over time

Phase 5 — Access Manifest for AI Agents

Goal: Create a machine-readable access and content map that AI agents can consume to understand what's in Drive and who can access it.

This is the key differentiator from a standard cleanup project. The access manifest enables any AI agent (Claude, Gemini, custom) to answer questions like:

"What files does the Finance team have access to?"
"Who can see the Q4 board deck?"
"Where are active contracts stored?"
"What's the latest version of the remote work policy?"

5.1 Access Manifest Schema

Generate a JSON manifest (stored in BigQuery and exported as needed):

{
  "generated_at": "2026-03-15T00:00:00Z",
  "drives": [
    {
      "drive_id": "0ABcd...",
      "drive_name": "Finance",
      "drive_type": "shared_drive",
      "members": [
        {"email": "[email protected]", "role": "organizer"},
        {"email": "[email protected]", "role": "fileOrganizer"}
      ],
      "folder_tree": {
        "name": "Finance",
        "children": [
          {
            "name": "01_Financial-Internal",
            "children": [
              {"name": "General-Ledger", "file_count": 42},
              {"name": "Accounts-Payable", "file_count": 118}
            ]
          }
        ]
      }
    }
  ],
  "files": [
    {
      "file_id": "1aBcD...",
      "name": "2025-03_FIN_RPT_Monthly-Revenue-Analysis_FINAL.pdf",
      "path": "Finance/01_Financial-Internal/Monthly-Reports/2025-03",
      "mime_type": "application/pdf",
      "labels": {
        "department": "FIN",
        "doc_type": "RPT",
        "lifecycle_status": "Active",
        "sensitivity": "Confidential"
      },
      "permissions": [
        {"email": "[email protected]", "role": "writer", "type": "user"},
        {"email": "[email protected]", "role": "reader", "type": "group"}
      ],
      "owner": "[email protected]",
      "modified_time": "2025-03-15T14:30:00Z",
      "web_link": "https://docs.google.com/..."
    }
  ]
}

5.2 Manifest Generation Pipeline

A scheduled script (Cloud Function or Cloud Run job) that runs weekly:

Crawl all Shared Drives via drives.list() + files.list()
Pull permissions via permissions.list() for each drive and files with non-inherited permissions
Pull labels via files.list() with includeLabels=<labelId> field
Reconstruct folder trees
Write to BigQuery access_manifest dataset
Export a snapshot JSON to a designated GCS bucket or Drive folder

5.3 AI Agent Integration Patterns

Pattern	How It Works
Direct API access	Give the AI agent's service account `reader` access to relevant drives. Agent uses Drive API to search, list, and read files on demand. Best for real-time queries.
Manifest-based	Agent reads the weekly JSON manifest to answer access and structure questions without needing Drive API credentials. Best for security-constrained environments.
BigQuery MCP	Agent queries BigQuery via MCP (Model Context Protocol) server to answer questions about the file inventory. Best for complex analytical queries.
Hybrid	Agent uses manifest for structure/access questions, Drive API for content retrieval. Recommended approach.

5.4 Permission Simplification (Improves AI Comprehension)

AI agents struggle with complex, overlapping permission models. Simplify:

Consolidate to group-based access: Replace individual user permissions with Google Group memberships where possible
Minimize "anyone with the link" shares: Audit and remove or convert to domain-restricted
Use Shared Drive membership as the primary access control: Avoid file-level permission overrides
Document exception permissions: Any file with permissions different from its parent should be flagged and justified

5.5 Content Index for AI Retrieval

For AI agents that need to understand file content (not just metadata):

Leverage Google Workspace's built-in Gemini integration for internal queries
For external AI agents (Claude): generate a content summary index
- Use Drive API to export/download files
- Extract text (Google Docs → export as plain text, PDFs → text extraction)
- Generate per-file summaries (title, 2-3 sentence description, key entities)
- Store summaries in the manifest alongside metadata
- This avoids giving the AI agent blanket read access to all files

Phase 6 — Ongoing Governance (Continuous)

Goal: Prevent Drive from drifting back into chaos.

6.1 Automated Policies

Policy	Implementation
New file labeling	Google Workspace DLP rule: warn users when saving files without a `Document-Classification` label
Quarterly archive sweep	Scheduled Cloud Function runs the archive candidate query every quarter, generates review packets
Permission audit	Monthly script checks for overly broad sharing, reports to IT
Folder depth enforcement	Weekly script flags any folder > 4 levels deep
Stale file alerts	Quarterly report of files not modified in 12+ months, sent to department leads

6.2 Dashboards

Build a Looker dashboard (or Google Sheets dashboard) showing:

Total files by department, type, and lifecycle status
Archive rate over time
Permission health score (% of files using group-based access)
Label coverage (% of files with complete labels)
Storage usage by department

6.3 AI Readiness Score

Track a composite score per department:

Metric	Weight	Measurement
Label coverage	30%	% of files with all label fields populated
Naming compliance	20%	% of files matching the naming convention
Folder depth compliance	15%	% of folders ≤ 4 levels
Permission simplicity	20%	% of files using only inherited group permissions
Content freshness	15%	% of active files modified within 12 months

Tooling (Implemented)

Tool	Implementation	Location
Python + Google Client Libraries	CLI app with per-phase services, rate limiting, dry-run safety	`cli.py`, `src/`
Drive API v3 wrapper	Shared drives, file listing, moves, permissions, folder creation	`src/common/drive_client.py`
Drive Labels API v2 wrapper	Label taxonomy creation, label application, field modifications	`src/common/labels_client.py`
BigQuery client	Dataset/table provisioning, inventory storage, SQL queries	`src/common/bigquery_client.py`
Google Sheets client	Review sheet generation, owner action reports, sharing	`src/common/sheets_client.py`
Google Cloud Functions	Phase 5 manifest and Phase 6 governance are designed to be deployed as scheduled Cloud Functions	`src/phase5_manifest/`, `src/phase6_governance/`
Looker / Looker Studio	Dashboard data is generated by `GovernanceService.generate_dashboard_data()` and stored in BigQuery	`src/phase6_governance/service.py`

Risk Register

Risk	Likelihood	Impact	Mitigation
Department leads don't respond within review window	Medium	High	Escalation path to executive sponsor; default-to-archive policy (with advance notice)
Rate limiting during large migrations	High	Medium	Exponential backoff, batch processing, run migrations during off-hours
Files with complex ownership (personal Drive, external shares)	Medium	Medium	Handle My Drive files separately; flag external shares for manual review
Label taxonomy needs revision after initial deployment	Medium	Low	Labels API supports schema updates; plan for v2 of the taxonomy
AI agents can't handle permission complexity	Medium	High	Permission simplification in Phase 5.4; group-based access model
Loss of file version history during moves	Low	Medium	Moves within Google Drive preserve version history; document this for stakeholders

Appendix A: Key API Endpoints Reference

# Shared Drives
GET  /drive/v3/drives                           — List all shared drives
POST /drive/v3/drives                           — Create a shared drive

# Files
GET  /drive/v3/files?corpora=domain             — List all domain files
GET  /drive/v3/files/{fileId}                   — Get file metadata
PATCH /drive/v3/files/{fileId}                  — Move file (addParents/removeParents)
POST /drive/v3/files/{fileId}/modifyLabels      — Apply/update labels

# Permissions
GET  /drive/v3/files/{fileId}/permissions       — List file permissions
POST /drive/v3/files/{fileId}/permissions       — Add permission

# Labels
GET  /drivelabels/v2/labels                     — List label definitions
POST /drivelabels/v2/labels                     — Create a label

# Admin Reports
GET  /admin/reports/v1/activity/users/all/applications/drive  — Drive activity audit

Appendix B: Estimated API Quota Usage

Operation	Estimated Call Count	Quota Impact
Initial file inventory (100K files)	~200 API calls (500 files/page)	Minimal
Permission retrieval (100K files)	~100K calls	~8 minutes at 12K/min quota
Label application (50K files)	~50K calls	~4 minutes at 12K/min quota
Archive file moves (30K files)	~30K calls	~2.5 minutes at 12K/min quota
Weekly manifest refresh	~200 calls	Minimal

Note: Actual quotas depend on your Google Workspace edition and any custom quota increases.

Appendix C: Differences from Original Overview

Original Overview	This Plan	Rationale
File naming as primary organization method	Drive Labels as primary, naming as secondary	Labels are machine-queryable, survive renames, and don't require retroactive renaming of existing files
No mention of AI access manifest	Dedicated Phase 5 for manifest	Core objective #2 requires a structured, machine-readable view of Drive
Google Apps Script or Admin Reports for inventory	Python + service account + BigQuery	More robust for large-scale operations; Apps Script has 6-minute execution limits
Manual checklist-driven process	Automated pipeline with scheduled jobs	Reduces IT ops burden for ongoing governance
Archive based on last modified AND last viewed	Archive primarily on `modifiedTime` with `viewedByMeTime` as supplementary signal	Admin Reports API only provides 180 days of view history; `modifiedTime` is the reliable cross-domain signal
Move folders to archive drives	Recreate folder structure, move files individually	Drive API cannot move folders between Shared Drives
No permission simplification	Explicit permission cleanup phase	Critical for AI agent comprehension of access model

FilesExpand file tree

execution_plan.md

Latest commit

History