Skip to content

Latest commit

 

History

History
610 lines (467 loc) · 29.6 KB

File metadata and controls

610 lines (467 loc) · 29.6 KB

Google Drive Cleanup & AI Readiness — Execution Plan

Prepared: March 2026 Deliverable: Python CLI application (cli.py) with per-phase services, test coverage, and dry-run safety Objectives:

  1. Archive and clean up stale documents across the organization's Google Drive
  2. Structure Drive so that calling AI agents (Claude, Gemini, etc.) can reliably understand access, permissions, and content

Deliverable Summary

The execution plan is implemented as a runnable Python application in this repository. Each phase is an independent service with matching test coverage, orchestrated by a CLI entrypoint.

Component Location Purpose
CLI entrypoint cli.py python cli.py phase0, phase1, ..., phase6
Configuration config/config.yaml All settings: GCP project, departments, cutoff dates, dry-run flag
Shared clients src/common/ Drive API, BigQuery, Labels API, Sheets API wrappers and auth
Phase services src/phase0_infrastructure/ through src/phase6_governance/ One service class per phase
Test suite tests/test_phase0.py through tests/test_phase6.py + test_common.py 73 tests covering all phases
README README.md Full setup, usage, and implementation guide for IT Ops
Mermaid diagram phase_flow_diagram.md Visual flow of all phases and their steps

Setup: pip install -r requirements.txt then configure config/config.yaml with your GCP project and service account details. All destructive operations are gated behind dry_run: true by default.


Critical API Findings That Shape This Plan

Before diving into phases, several Google API realities should inform every decision:

Capability What It Means for Us
Drive API v3 files.list with corpora=domain We can query every file in the domain programmatically — no need for per-user crawling for shared drives
modifiedTime is reliable; viewedByMeTime is per-user only We can filter on modifiedTime globally, but "last viewed" requires either Admin Reports API (180-day max) or impersonation via a service account to check each user's viewedByMeTime
Admin Reports API: 180-day lookback limit We cannot pull "last viewed" data older than 6 months from the Reports API. For files older than that, modifiedTime must be the primary signal
Folders cannot be moved between Shared Drives via API Archive migration must be file-by-file. The script must recreate folder structures in archive drives, then move individual files
Drive Labels API (v2) is GA We can programmatically create label taxonomies (department, document type, sensitivity, archive status) and apply them to files at scale. Labels are searchable and enforceable via DLP
files.modifyLabels supports atomic batch operations Label application can be done in bulk — one API call can add/update multiple label fields on a file
Gemini can federate Drive data without copying A well-labeled, well-structured Drive is directly queryable by Gemini Enterprise. Labels and consistent naming dramatically improve retrieval accuracy
Permissions API exposes full access graph permissions.list on any file returns type, role, email, and inheritance — we can build a complete access map for AI agents

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                    Google Workspace Domain                      │
│                                                                 │
│  ┌──────────────┐   ┌──────────────────┐   ┌────────────────┐  │
│  │ Service Acct  │──▶│ Drive API v3     │──▶│ File Inventory │  │
│  │ (Domain-Wide  │   │ Labels API v2    │   │ (BigQuery or   │  │
│  │  Delegation)  │   │ Admin Reports    │   │  Google Sheet) │  │
│  └──────────────┘   │ Permissions API  │   └────────────────┘  │
│                     └──────────────────┘                        │
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                  Drive Label Taxonomy                     │   │
│  │  ┌────────────┐ ┌────────────┐ ┌───────────────────────┐ │   │
│  │  │ Department │ │ Doc Type   │ │ Lifecycle Status      │ │   │
│  │  │ (Selection)│ │ (Selection)│ │ (Active/Archive/      │ │   │
│  │  │            │ │            │ │  Review/Protected)    │ │   │
│  │  └────────────┘ └────────────┘ └───────────────────────┘ │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │              Access Manifest (JSON/BigQuery)              │   │
│  │  Per-file: id, name, path, labels, permissions[], owner  │   │
│  │  → Queryable by AI agents via API or exported snapshot    │   │
│  └──────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘

Phase 0 — Infrastructure Setup

Goal: Stand up the service account, label taxonomy, and inventory pipeline before touching any files.

0.1 Service Account & Domain-Wide Delegation

Step Detail
Create a GCP project gdrive-cleanup-prod (or use existing IT project)
Create a service account [email protected]
Enable APIs Drive API v3, Drive Labels API v2, Admin SDK Reports API, Google Sheets API
Configure domain-wide delegation In Admin Console → Security → API Controls → Domain-wide Delegation. Grant scopes: https://www.googleapis.com/auth/drive, https://www.googleapis.com/auth/drive.labels, https://www.googleapis.com/auth/admin.reports.audit.readonly
Test access Impersonate an IT admin user, list files from a test shared drive

0.2 Create Drive Label Taxonomy

Using the Drive Labels API, create the following labels. These will be applied to files throughout the process and serve as the primary metadata layer for both human navigation and AI agent queries.

Label: Document-Classification

Field Type Values
department Selection CORP, FIN, LGL, DSBI, SLS, MKT, CST, PRD, ENG, HR, ITS, XFN
doc-type Selection RPT, POL, CTR, MTG, PLN, TPL, FIN, PRE, SPE, TRN, VND, CMP, CRE, DAT, REF
lifecycle-status Selection Active, Archive-Candidate, Under-Review, Archived, Protected
sensitivity Selection Public, Internal, Confidential, Restricted
archive-date Date (set when file is archived)
reviewed-by User (set when a department lead reviews)

Why labels instead of (or in addition to) file naming:

  • Labels are machine-queryable ('labels/Document-Classification' in labels in Drive search)
  • Labels survive file renames, copies, and moves
  • Labels can be enforced via Google Workspace DLP rules
  • AI agents can read labels via API without parsing file names
  • Labels don't require retroactive renaming of thousands of existing files

0.3 Choose Inventory Storage

Option When to Use
Google Sheets < 50,000 files. Simple, visible to department leads, easy to share
BigQuery > 50,000 files. Supports SQL queries, scales to millions, integrates with Looker dashboards
Both Use BigQuery as master, export filtered views to Sheets per department for review

Recommendation: Use BigQuery as the canonical store with per-department Sheet exports for the review workflow.


Phase 1 — Discovery & Inventory

Goal: Build a complete inventory of every file in the domain with metadata to support archiving decisions.

1.1 Full Domain File Crawl

Write a Python script (or Apps Script) using the service account to:

  1. List all Shared Drives via drives.list()
  2. For each Shared Drive, recursively list all files via files.list() with:
    • corpora=drive, driveId=<id>, includeItemsFromAllDrives=true, supportsAllDrives=true
    • fields: id, name, mimeType, modifiedTime, createdTime, owners, parents, size, shared, sharingUser, permissions(emailAddress, role, type, inherited), driveId, webViewLink
    • q: no filter (we want everything)
  3. For My Drive files (optional but recommended): Use domain-wide delegation to impersonate each user and list their My Drive files. This catches files not in Shared Drives.
  4. Store results in BigQuery table file_inventory with schema:
file_id         STRING
name            STRING
mime_type       STRING
modified_time   TIMESTAMP
created_time    TIMESTAMP
owner_email     STRING
drive_id        STRING
drive_name      STRING
parent_path     STRING    -- reconstructed folder path
size_bytes      INT64
permissions     JSON      -- array of {email, role, type}
labels          JSON      -- any existing labels
web_link        STRING

1.2 Reconstruct Folder Paths

The Drive API returns parents (immediate parent ID only). To get full paths:

  • Build a parent-child lookup map from all folder entries
  • Walk up the tree from each file to the drive root
  • Store the reconstructed path as parent_path (e.g., Finance/01_Financial-Internal/General-Ledger)

This path reconstruction is essential for AI agents to understand where files live without traversing the API themselves.

1.3 Identify Archive Candidates

Run a query against the inventory:

SELECT *
FROM file_inventory
WHERE modified_time < '2024-09-01'
  AND lifecycle_label != 'Protected'
  AND mime_type != 'application/vnd.google-apps.folder'
ORDER BY drive_name, parent_path, name

1.4 Flag Protected Document Types

Use a combination of:

  • Keyword matching on file names and paths (e.g., contract, NDA, policy, SOP, bylaws, patent)
  • MIME type filtering (exclude Google Sites, Apps Script projects, etc. that shouldn't be moved)
  • Drive Labels — if any files already have classification labels, honor them
  • Manual list — department leads can pre-submit a list of file IDs that are protected

Apply the label lifecycle-status = Protected to all flagged files.

1.5 Generate Department Review Packets

For each department, export a Google Sheet with:

  • File name, path, last modified date, owner, current permissions, web link
  • A column for the department lead to mark: Keep / OK to Archive / Needs Review
  • Pre-filtered to only show that department's archive candidates
  • Share with the department lead with comment access

Phase 2 — Department Review & Approval

Goal: Get explicit sign-off from department leads before moving anything.

2.1 Kickoff Communication

Send each department lead:

  • Their review Sheet (from 1.5)
  • A brief explainer doc on what archiving means (read-only, still accessible, reversible)
  • Deadline: 10 business days
  • Escalation: files not reviewed default to "OK to Archive" (must be stated in the kickoff email and approved by executive sponsor)

2.2 Process Responses

After the review window:

  1. Read back each department's Sheet
  2. For files marked Keep: apply label lifecycle-status = Active
  3. For files marked Needs Review: apply label lifecycle-status = Under-Review, schedule follow-up
  4. For files marked OK to Archive (or unmarked): apply label lifecycle-status = Archive-Candidate
  5. Update the BigQuery inventory with final dispositions

2.3 Executive Summary

Generate a summary report:

  • Total files scanned
  • Files approved for archive (count, total size)
  • Files retained
  • Files needing follow-up
  • Per-department breakdown

Phase 3 — Archive Execution

Goal: Move approved files into archive Shared Drives, preserving structure and access.

3.1 Create Archive Shared Drives

Using the Drive API:

for dept in departments:
    drive_metadata = {'name': f'{dept} — Archive (Pre-2025)'}
    request_id = str(uuid.uuid4())
    archive_drive = service.drives().create(
        requestId=request_id,
        body=drive_metadata
    ).execute()

Set permissions:

  • IT Ops: organizer role
  • Department members: reader role (view-only)
  • Department lead: fileOrganizer role (can organize but not delete)

3.2 Recreate Folder Structures

Since folders cannot be moved between Shared Drives via API:

  1. For each archive candidate, look up its parent_path
  2. Create the folder tree in the archive drive (if not already created)
  3. Cache folder IDs to avoid duplicate creation
def ensure_folder_path(service, drive_id, path_parts, folder_cache):
    current_parent = drive_id
    for part in path_parts:
        cache_key = f"{current_parent}/{part}"
        if cache_key in folder_cache:
            current_parent = folder_cache[cache_key]
        else:
            folder = service.files().create(body={
                'name': part,
                'mimeType': 'application/vnd.google-apps.folder',
                'parents': [current_parent]
            }, supportsAllDrives=True, fields='id').execute()
            folder_cache[cache_key] = folder['id']
            current_parent = folder['id']
    return current_parent

3.3 Move Files

For each archive candidate:

service.files().update(
    fileId=file_id,
    addParents=archive_folder_id,
    removeParents=original_parent_id,
    supportsAllDrives=True
).execute()

Rate limiting: Drive API has a quota of ~12,000 queries per minute per user. Implement exponential backoff and batch into chunks of 1,000 files with pauses.

Logging: Record every move in a migration_log BigQuery table:

file_id, file_name, source_drive, source_path,
dest_drive, dest_path, moved_at, status, error

3.4 Apply Archive Labels

After moving, update each file's label:

service.files().modifyLabels(
    fileId=file_id,
    body={
        'labelModifications': [{
            'labelId': CLASSIFICATION_LABEL_ID,
            'fieldModifications': [
                {'fieldId': 'lifecycle-status', 'setSelectionValues': ['Archived']},
                {'fieldId': 'archive-date', 'setDateValues': [{'day': 3, 'month': 3, 'year': 2026}]}
            ]
        }]
    }
).execute()

3.5 Lock Archive Drives

After migration completes:

  • Remove write access for all non-IT users
  • Set archive drives to reader for department members
  • Confirm via permissions.list that no one has write access

3.6 Post-Archive Validation

  • Compare BigQuery migration_log against the archive candidate list — flag any mismatches
  • Spot-check 5% of files per department (verify correct location, accessible, labels applied)
  • Send department leads a confirmation Sheet with links to their archived files

Phase 4 — AI-Driven Analysis & Owner-Led Reorganization

Goal: Use AI to analyze the state of active files and generate per-owner action reports. File owners decide what to do and take the action themselves — IT does not move, rename, or reorganize anyone's files.

4.1 Folder Structure Deployment

For each department Shared Drive, create the standardized folder tree from the overview (e.g., 01_Governance, 02_Strategy-Planning, etc.) using the Drive API. This gives owners a clear target structure to move their files into — but the moves are theirs to make.

4.2 AI-Powered File Analysis

Using the BigQuery inventory from Phase 1, run automated analysis to generate per-owner findings. The AI identifies issues; it does not act on them.

Analysis categories:

Finding Type What the AI Flags
Misplaced files Files that appear to belong in a different folder based on name, content, and label vs. current location
Duplicate or near-duplicate files Files with the same or very similar names, sizes, and content hashes in multiple locations
Naming issues Files with ambiguous names, special characters, excessively long names, or names that don't match their content
Suggested labels Inferred department, doc-type, and sensitivity labels based on file path, name, and content analysis
Permission anomalies Files shared more broadly than their siblings, external shares on internal docs, stale individual shares for people no longer at the company
Folder depth violations Files nested more than 4 levels deep
Orphaned files Files sitting at the root of a drive or in an unstructured location
Stale active files Files not archived but not modified in 6+ months — may need attention

4.3 Generate Per-Owner Action Reports

For each file owner, generate a personalized Google Sheet delivered via email:

  • Your Files to Review — every file they own, grouped by finding type
  • Each row includes: file name, current path, suggested action (move to X, rename to Y, apply label Z, review sharing), and a direct link to the file
  • A column for the owner to mark: Done / Skip / Disagree
  • Instructions are clear: "These are suggestions. You decide what to do. Only you will move or rename your files."

4.4 Self-Service Tools for Owners

Provide lightweight tools so owners can act on findings easily:

  • File Name Generator — a Google Apps Script sidebar or Google Form where owners paste a description and get back a compliant file name
  • Label Applier — a simple UI (Apps Script add-on) that lets owners apply Drive Labels to their files without needing to understand the label taxonomy
  • Suggested Folder Finder — given a file, suggests which folder in the new structure it belongs in

4.5 Track Completion (Not Enforce)

IT monitors progress through automated dashboards, not mandates:

  • Track how many flagged files have been addressed per department (via label changes, moves, or owner marking Done)
  • Surface completion rates to department leads — peer visibility drives action better than enforcement
  • Escalation path exists (executive sponsor) but is a last resort, not the default
  • Rerun the AI analysis periodically to show improvement over time

Phase 5 — Access Manifest for AI Agents

Goal: Create a machine-readable access and content map that AI agents can consume to understand what's in Drive and who can access it.

This is the key differentiator from a standard cleanup project. The access manifest enables any AI agent (Claude, Gemini, custom) to answer questions like:

  • "What files does the Finance team have access to?"
  • "Who can see the Q4 board deck?"
  • "Where are active contracts stored?"
  • "What's the latest version of the remote work policy?"

5.1 Access Manifest Schema

Generate a JSON manifest (stored in BigQuery and exported as needed):

{
  "generated_at": "2026-03-15T00:00:00Z",
  "drives": [
    {
      "drive_id": "0ABcd...",
      "drive_name": "Finance",
      "drive_type": "shared_drive",
      "members": [
        {"email": "[email protected]", "role": "organizer"},
        {"email": "[email protected]", "role": "fileOrganizer"}
      ],
      "folder_tree": {
        "name": "Finance",
        "children": [
          {
            "name": "01_Financial-Internal",
            "children": [
              {"name": "General-Ledger", "file_count": 42},
              {"name": "Accounts-Payable", "file_count": 118}
            ]
          }
        ]
      }
    }
  ],
  "files": [
    {
      "file_id": "1aBcD...",
      "name": "2025-03_FIN_RPT_Monthly-Revenue-Analysis_FINAL.pdf",
      "path": "Finance/01_Financial-Internal/Monthly-Reports/2025-03",
      "mime_type": "application/pdf",
      "labels": {
        "department": "FIN",
        "doc_type": "RPT",
        "lifecycle_status": "Active",
        "sensitivity": "Confidential"
      },
      "permissions": [
        {"email": "[email protected]", "role": "writer", "type": "user"},
        {"email": "[email protected]", "role": "reader", "type": "group"}
      ],
      "owner": "[email protected]",
      "modified_time": "2025-03-15T14:30:00Z",
      "web_link": "https://docs.google.com/..."
    }
  ]
}

5.2 Manifest Generation Pipeline

A scheduled script (Cloud Function or Cloud Run job) that runs weekly:

  1. Crawl all Shared Drives via drives.list() + files.list()
  2. Pull permissions via permissions.list() for each drive and files with non-inherited permissions
  3. Pull labels via files.list() with includeLabels=<labelId> field
  4. Reconstruct folder trees
  5. Write to BigQuery access_manifest dataset
  6. Export a snapshot JSON to a designated GCS bucket or Drive folder

5.3 AI Agent Integration Patterns

Pattern How It Works
Direct API access Give the AI agent's service account reader access to relevant drives. Agent uses Drive API to search, list, and read files on demand. Best for real-time queries.
Manifest-based Agent reads the weekly JSON manifest to answer access and structure questions without needing Drive API credentials. Best for security-constrained environments.
BigQuery MCP Agent queries BigQuery via MCP (Model Context Protocol) server to answer questions about the file inventory. Best for complex analytical queries.
Hybrid Agent uses manifest for structure/access questions, Drive API for content retrieval. Recommended approach.

5.4 Permission Simplification (Improves AI Comprehension)

AI agents struggle with complex, overlapping permission models. Simplify:

  • Consolidate to group-based access: Replace individual user permissions with Google Group memberships where possible
  • Minimize "anyone with the link" shares: Audit and remove or convert to domain-restricted
  • Use Shared Drive membership as the primary access control: Avoid file-level permission overrides
  • Document exception permissions: Any file with permissions different from its parent should be flagged and justified

5.5 Content Index for AI Retrieval

For AI agents that need to understand file content (not just metadata):

  • Leverage Google Workspace's built-in Gemini integration for internal queries
  • For external AI agents (Claude): generate a content summary index
    • Use Drive API to export/download files
    • Extract text (Google Docs → export as plain text, PDFs → text extraction)
    • Generate per-file summaries (title, 2-3 sentence description, key entities)
    • Store summaries in the manifest alongside metadata
    • This avoids giving the AI agent blanket read access to all files

Phase 6 — Ongoing Governance (Continuous)

Goal: Prevent Drive from drifting back into chaos.

6.1 Automated Policies

Policy Implementation
New file labeling Google Workspace DLP rule: warn users when saving files without a Document-Classification label
Quarterly archive sweep Scheduled Cloud Function runs the archive candidate query every quarter, generates review packets
Permission audit Monthly script checks for overly broad sharing, reports to IT
Folder depth enforcement Weekly script flags any folder > 4 levels deep
Stale file alerts Quarterly report of files not modified in 12+ months, sent to department leads

6.2 Dashboards

Build a Looker dashboard (or Google Sheets dashboard) showing:

  • Total files by department, type, and lifecycle status
  • Archive rate over time
  • Permission health score (% of files using group-based access)
  • Label coverage (% of files with complete labels)
  • Storage usage by department

6.3 AI Readiness Score

Track a composite score per department:

Metric Weight Measurement
Label coverage 30% % of files with all label fields populated
Naming compliance 20% % of files matching the naming convention
Folder depth compliance 15% % of folders ≤ 4 levels
Permission simplicity 20% % of files using only inherited group permissions
Content freshness 15% % of active files modified within 12 months

Tooling (Implemented)

Tool Implementation Location
Python + Google Client Libraries CLI app with per-phase services, rate limiting, dry-run safety cli.py, src/
Drive API v3 wrapper Shared drives, file listing, moves, permissions, folder creation src/common/drive_client.py
Drive Labels API v2 wrapper Label taxonomy creation, label application, field modifications src/common/labels_client.py
BigQuery client Dataset/table provisioning, inventory storage, SQL queries src/common/bigquery_client.py
Google Sheets client Review sheet generation, owner action reports, sharing src/common/sheets_client.py
Google Cloud Functions Phase 5 manifest and Phase 6 governance are designed to be deployed as scheduled Cloud Functions src/phase5_manifest/, src/phase6_governance/
Looker / Looker Studio Dashboard data is generated by GovernanceService.generate_dashboard_data() and stored in BigQuery src/phase6_governance/service.py

Risk Register

Risk Likelihood Impact Mitigation
Department leads don't respond within review window Medium High Escalation path to executive sponsor; default-to-archive policy (with advance notice)
Rate limiting during large migrations High Medium Exponential backoff, batch processing, run migrations during off-hours
Files with complex ownership (personal Drive, external shares) Medium Medium Handle My Drive files separately; flag external shares for manual review
Label taxonomy needs revision after initial deployment Medium Low Labels API supports schema updates; plan for v2 of the taxonomy
AI agents can't handle permission complexity Medium High Permission simplification in Phase 5.4; group-based access model
Loss of file version history during moves Low Medium Moves within Google Drive preserve version history; document this for stakeholders

Appendix A: Key API Endpoints Reference

# Shared Drives
GET  /drive/v3/drives                           — List all shared drives
POST /drive/v3/drives                           — Create a shared drive

# Files
GET  /drive/v3/files?corpora=domain             — List all domain files
GET  /drive/v3/files/{fileId}                   — Get file metadata
PATCH /drive/v3/files/{fileId}                  — Move file (addParents/removeParents)
POST /drive/v3/files/{fileId}/modifyLabels      — Apply/update labels

# Permissions
GET  /drive/v3/files/{fileId}/permissions       — List file permissions
POST /drive/v3/files/{fileId}/permissions       — Add permission

# Labels
GET  /drivelabels/v2/labels                     — List label definitions
POST /drivelabels/v2/labels                     — Create a label

# Admin Reports
GET  /admin/reports/v1/activity/users/all/applications/drive  — Drive activity audit

Appendix B: Estimated API Quota Usage

Operation Estimated Call Count Quota Impact
Initial file inventory (100K files) ~200 API calls (500 files/page) Minimal
Permission retrieval (100K files) ~100K calls ~8 minutes at 12K/min quota
Label application (50K files) ~50K calls ~4 minutes at 12K/min quota
Archive file moves (30K files) ~30K calls ~2.5 minutes at 12K/min quota
Weekly manifest refresh ~200 calls Minimal

Note: Actual quotas depend on your Google Workspace edition and any custom quota increases.

Appendix C: Differences from Original Overview

Original Overview This Plan Rationale
File naming as primary organization method Drive Labels as primary, naming as secondary Labels are machine-queryable, survive renames, and don't require retroactive renaming of existing files
No mention of AI access manifest Dedicated Phase 5 for manifest Core objective #2 requires a structured, machine-readable view of Drive
Google Apps Script or Admin Reports for inventory Python + service account + BigQuery More robust for large-scale operations; Apps Script has 6-minute execution limits
Manual checklist-driven process Automated pipeline with scheduled jobs Reduces IT ops burden for ongoing governance
Archive based on last modified AND last viewed Archive primarily on modifiedTime with viewedByMeTime as supplementary signal Admin Reports API only provides 180 days of view history; modifiedTime is the reliable cross-domain signal
Move folders to archive drives Recreate folder structure, move files individually Drive API cannot move folders between Shared Drives
No permission simplification Explicit permission cleanup phase Critical for AI agent comprehension of access model