Prepared: March 2026
Deliverable: Python CLI application (cli.py) with per-phase services, test coverage, and dry-run safety
Objectives:
- Archive and clean up stale documents across the organization's Google Drive
- Structure Drive so that calling AI agents (Claude, Gemini, etc.) can reliably understand access, permissions, and content
The execution plan is implemented as a runnable Python application in this repository. Each phase is an independent service with matching test coverage, orchestrated by a CLI entrypoint.
| Component | Location | Purpose |
|---|---|---|
| CLI entrypoint | cli.py |
python cli.py phase0, phase1, ..., phase6 |
| Configuration | config/config.yaml |
All settings: GCP project, departments, cutoff dates, dry-run flag |
| Shared clients | src/common/ |
Drive API, BigQuery, Labels API, Sheets API wrappers and auth |
| Phase services | src/phase0_infrastructure/ through src/phase6_governance/ |
One service class per phase |
| Test suite | tests/test_phase0.py through tests/test_phase6.py + test_common.py |
73 tests covering all phases |
| README | README.md |
Full setup, usage, and implementation guide for IT Ops |
| Mermaid diagram | phase_flow_diagram.md |
Visual flow of all phases and their steps |
Setup: pip install -r requirements.txt then configure config/config.yaml with your GCP project and service account details. All destructive operations are gated behind dry_run: true by default.
Before diving into phases, several Google API realities should inform every decision:
| Capability | What It Means for Us |
|---|---|
Drive API v3 files.list with corpora=domain |
We can query every file in the domain programmatically — no need for per-user crawling for shared drives |
modifiedTime is reliable; viewedByMeTime is per-user only |
We can filter on modifiedTime globally, but "last viewed" requires either Admin Reports API (180-day max) or impersonation via a service account to check each user's viewedByMeTime |
| Admin Reports API: 180-day lookback limit | We cannot pull "last viewed" data older than 6 months from the Reports API. For files older than that, modifiedTime must be the primary signal |
| Folders cannot be moved between Shared Drives via API | Archive migration must be file-by-file. The script must recreate folder structures in archive drives, then move individual files |
| Drive Labels API (v2) is GA | We can programmatically create label taxonomies (department, document type, sensitivity, archive status) and apply them to files at scale. Labels are searchable and enforceable via DLP |
files.modifyLabels supports atomic batch operations |
Label application can be done in bulk — one API call can add/update multiple label fields on a file |
| Gemini can federate Drive data without copying | A well-labeled, well-structured Drive is directly queryable by Gemini Enterprise. Labels and consistent naming dramatically improve retrieval accuracy |
| Permissions API exposes full access graph | permissions.list on any file returns type, role, email, and inheritance — we can build a complete access map for AI agents |
┌─────────────────────────────────────────────────────────────────┐
│ Google Workspace Domain │
│ │
│ ┌──────────────┐ ┌──────────────────┐ ┌────────────────┐ │
│ │ Service Acct │──▶│ Drive API v3 │──▶│ File Inventory │ │
│ │ (Domain-Wide │ │ Labels API v2 │ │ (BigQuery or │ │
│ │ Delegation) │ │ Admin Reports │ │ Google Sheet) │ │
│ └──────────────┘ │ Permissions API │ └────────────────┘ │
│ └──────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Drive Label Taxonomy │ │
│ │ ┌────────────┐ ┌────────────┐ ┌───────────────────────┐ │ │
│ │ │ Department │ │ Doc Type │ │ Lifecycle Status │ │ │
│ │ │ (Selection)│ │ (Selection)│ │ (Active/Archive/ │ │ │
│ │ │ │ │ │ │ Review/Protected) │ │ │
│ │ └────────────┘ └────────────┘ └───────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Access Manifest (JSON/BigQuery) │ │
│ │ Per-file: id, name, path, labels, permissions[], owner │ │
│ │ → Queryable by AI agents via API or exported snapshot │ │
│ └──────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Goal: Stand up the service account, label taxonomy, and inventory pipeline before touching any files.
| Step | Detail |
|---|---|
| Create a GCP project | gdrive-cleanup-prod (or use existing IT project) |
| Create a service account | [email protected] |
| Enable APIs | Drive API v3, Drive Labels API v2, Admin SDK Reports API, Google Sheets API |
| Configure domain-wide delegation | In Admin Console → Security → API Controls → Domain-wide Delegation. Grant scopes: https://www.googleapis.com/auth/drive, https://www.googleapis.com/auth/drive.labels, https://www.googleapis.com/auth/admin.reports.audit.readonly |
| Test access | Impersonate an IT admin user, list files from a test shared drive |
Using the Drive Labels API, create the following labels. These will be applied to files throughout the process and serve as the primary metadata layer for both human navigation and AI agent queries.
Label: Document-Classification
| Field | Type | Values |
|---|---|---|
department |
Selection | CORP, FIN, LGL, DSBI, SLS, MKT, CST, PRD, ENG, HR, ITS, XFN |
doc-type |
Selection | RPT, POL, CTR, MTG, PLN, TPL, FIN, PRE, SPE, TRN, VND, CMP, CRE, DAT, REF |
lifecycle-status |
Selection | Active, Archive-Candidate, Under-Review, Archived, Protected |
sensitivity |
Selection | Public, Internal, Confidential, Restricted |
archive-date |
Date | (set when file is archived) |
reviewed-by |
User | (set when a department lead reviews) |
Why labels instead of (or in addition to) file naming:
- Labels are machine-queryable (
'labels/Document-Classification' in labelsin Drive search) - Labels survive file renames, copies, and moves
- Labels can be enforced via Google Workspace DLP rules
- AI agents can read labels via API without parsing file names
- Labels don't require retroactive renaming of thousands of existing files
| Option | When to Use |
|---|---|
| Google Sheets | < 50,000 files. Simple, visible to department leads, easy to share |
| BigQuery | > 50,000 files. Supports SQL queries, scales to millions, integrates with Looker dashboards |
| Both | Use BigQuery as master, export filtered views to Sheets per department for review |
Recommendation: Use BigQuery as the canonical store with per-department Sheet exports for the review workflow.
Goal: Build a complete inventory of every file in the domain with metadata to support archiving decisions.
Write a Python script (or Apps Script) using the service account to:
- List all Shared Drives via
drives.list() - For each Shared Drive, recursively list all files via
files.list()with:corpora=drive,driveId=<id>,includeItemsFromAllDrives=true,supportsAllDrives=truefields:id, name, mimeType, modifiedTime, createdTime, owners, parents, size, shared, sharingUser, permissions(emailAddress, role, type, inherited), driveId, webViewLinkq: no filter (we want everything)
- For My Drive files (optional but recommended): Use domain-wide delegation to impersonate each user and list their My Drive files. This catches files not in Shared Drives.
- Store results in BigQuery table
file_inventorywith schema:
file_id STRING
name STRING
mime_type STRING
modified_time TIMESTAMP
created_time TIMESTAMP
owner_email STRING
drive_id STRING
drive_name STRING
parent_path STRING -- reconstructed folder path
size_bytes INT64
permissions JSON -- array of {email, role, type}
labels JSON -- any existing labels
web_link STRING
The Drive API returns parents (immediate parent ID only). To get full paths:
- Build a parent-child lookup map from all folder entries
- Walk up the tree from each file to the drive root
- Store the reconstructed path as
parent_path(e.g.,Finance/01_Financial-Internal/General-Ledger)
This path reconstruction is essential for AI agents to understand where files live without traversing the API themselves.
Run a query against the inventory:
SELECT *
FROM file_inventory
WHERE modified_time < '2024-09-01'
AND lifecycle_label != 'Protected'
AND mime_type != 'application/vnd.google-apps.folder'
ORDER BY drive_name, parent_path, nameUse a combination of:
- Keyword matching on file names and paths (e.g.,
contract,NDA,policy,SOP,bylaws,patent) - MIME type filtering (exclude Google Sites, Apps Script projects, etc. that shouldn't be moved)
- Drive Labels — if any files already have classification labels, honor them
- Manual list — department leads can pre-submit a list of file IDs that are protected
Apply the label lifecycle-status = Protected to all flagged files.
For each department, export a Google Sheet with:
- File name, path, last modified date, owner, current permissions, web link
- A column for the department lead to mark:
Keep/OK to Archive/Needs Review - Pre-filtered to only show that department's archive candidates
- Share with the department lead with comment access
Goal: Get explicit sign-off from department leads before moving anything.
Send each department lead:
- Their review Sheet (from 1.5)
- A brief explainer doc on what archiving means (read-only, still accessible, reversible)
- Deadline: 10 business days
- Escalation: files not reviewed default to "OK to Archive" (must be stated in the kickoff email and approved by executive sponsor)
After the review window:
- Read back each department's Sheet
- For files marked
Keep: apply labellifecycle-status = Active - For files marked
Needs Review: apply labellifecycle-status = Under-Review, schedule follow-up - For files marked
OK to Archive(or unmarked): apply labellifecycle-status = Archive-Candidate - Update the BigQuery inventory with final dispositions
Generate a summary report:
- Total files scanned
- Files approved for archive (count, total size)
- Files retained
- Files needing follow-up
- Per-department breakdown
Goal: Move approved files into archive Shared Drives, preserving structure and access.
Using the Drive API:
for dept in departments:
drive_metadata = {'name': f'{dept} — Archive (Pre-2025)'}
request_id = str(uuid.uuid4())
archive_drive = service.drives().create(
requestId=request_id,
body=drive_metadata
).execute()Set permissions:
- IT Ops:
organizerrole - Department members:
readerrole (view-only) - Department lead:
fileOrganizerrole (can organize but not delete)
Since folders cannot be moved between Shared Drives via API:
- For each archive candidate, look up its
parent_path - Create the folder tree in the archive drive (if not already created)
- Cache folder IDs to avoid duplicate creation
def ensure_folder_path(service, drive_id, path_parts, folder_cache):
current_parent = drive_id
for part in path_parts:
cache_key = f"{current_parent}/{part}"
if cache_key in folder_cache:
current_parent = folder_cache[cache_key]
else:
folder = service.files().create(body={
'name': part,
'mimeType': 'application/vnd.google-apps.folder',
'parents': [current_parent]
}, supportsAllDrives=True, fields='id').execute()
folder_cache[cache_key] = folder['id']
current_parent = folder['id']
return current_parentFor each archive candidate:
service.files().update(
fileId=file_id,
addParents=archive_folder_id,
removeParents=original_parent_id,
supportsAllDrives=True
).execute()Rate limiting: Drive API has a quota of ~12,000 queries per minute per user. Implement exponential backoff and batch into chunks of 1,000 files with pauses.
Logging: Record every move in a migration_log BigQuery table:
file_id, file_name, source_drive, source_path,
dest_drive, dest_path, moved_at, status, error
After moving, update each file's label:
service.files().modifyLabels(
fileId=file_id,
body={
'labelModifications': [{
'labelId': CLASSIFICATION_LABEL_ID,
'fieldModifications': [
{'fieldId': 'lifecycle-status', 'setSelectionValues': ['Archived']},
{'fieldId': 'archive-date', 'setDateValues': [{'day': 3, 'month': 3, 'year': 2026}]}
]
}]
}
).execute()After migration completes:
- Remove write access for all non-IT users
- Set archive drives to
readerfor department members - Confirm via
permissions.listthat no one has write access
- Compare BigQuery
migration_logagainst the archive candidate list — flag any mismatches - Spot-check 5% of files per department (verify correct location, accessible, labels applied)
- Send department leads a confirmation Sheet with links to their archived files
Goal: Use AI to analyze the state of active files and generate per-owner action reports. File owners decide what to do and take the action themselves — IT does not move, rename, or reorganize anyone's files.
For each department Shared Drive, create the standardized folder tree from the overview (e.g., 01_Governance, 02_Strategy-Planning, etc.) using the Drive API. This gives owners a clear target structure to move their files into — but the moves are theirs to make.
Using the BigQuery inventory from Phase 1, run automated analysis to generate per-owner findings. The AI identifies issues; it does not act on them.
Analysis categories:
| Finding Type | What the AI Flags |
|---|---|
| Misplaced files | Files that appear to belong in a different folder based on name, content, and label vs. current location |
| Duplicate or near-duplicate files | Files with the same or very similar names, sizes, and content hashes in multiple locations |
| Naming issues | Files with ambiguous names, special characters, excessively long names, or names that don't match their content |
| Suggested labels | Inferred department, doc-type, and sensitivity labels based on file path, name, and content analysis |
| Permission anomalies | Files shared more broadly than their siblings, external shares on internal docs, stale individual shares for people no longer at the company |
| Folder depth violations | Files nested more than 4 levels deep |
| Orphaned files | Files sitting at the root of a drive or in an unstructured location |
| Stale active files | Files not archived but not modified in 6+ months — may need attention |
For each file owner, generate a personalized Google Sheet delivered via email:
- Your Files to Review — every file they own, grouped by finding type
- Each row includes: file name, current path, suggested action (move to X, rename to Y, apply label Z, review sharing), and a direct link to the file
- A column for the owner to mark:
Done/Skip/Disagree - Instructions are clear: "These are suggestions. You decide what to do. Only you will move or rename your files."
Provide lightweight tools so owners can act on findings easily:
- File Name Generator — a Google Apps Script sidebar or Google Form where owners paste a description and get back a compliant file name
- Label Applier — a simple UI (Apps Script add-on) that lets owners apply Drive Labels to their files without needing to understand the label taxonomy
- Suggested Folder Finder — given a file, suggests which folder in the new structure it belongs in
IT monitors progress through automated dashboards, not mandates:
- Track how many flagged files have been addressed per department (via label changes, moves, or owner marking
Done) - Surface completion rates to department leads — peer visibility drives action better than enforcement
- Escalation path exists (executive sponsor) but is a last resort, not the default
- Rerun the AI analysis periodically to show improvement over time
Goal: Create a machine-readable access and content map that AI agents can consume to understand what's in Drive and who can access it.
This is the key differentiator from a standard cleanup project. The access manifest enables any AI agent (Claude, Gemini, custom) to answer questions like:
- "What files does the Finance team have access to?"
- "Who can see the Q4 board deck?"
- "Where are active contracts stored?"
- "What's the latest version of the remote work policy?"
Generate a JSON manifest (stored in BigQuery and exported as needed):
{
"generated_at": "2026-03-15T00:00:00Z",
"drives": [
{
"drive_id": "0ABcd...",
"drive_name": "Finance",
"drive_type": "shared_drive",
"members": [
{"email": "[email protected]", "role": "organizer"},
{"email": "[email protected]", "role": "fileOrganizer"}
],
"folder_tree": {
"name": "Finance",
"children": [
{
"name": "01_Financial-Internal",
"children": [
{"name": "General-Ledger", "file_count": 42},
{"name": "Accounts-Payable", "file_count": 118}
]
}
]
}
}
],
"files": [
{
"file_id": "1aBcD...",
"name": "2025-03_FIN_RPT_Monthly-Revenue-Analysis_FINAL.pdf",
"path": "Finance/01_Financial-Internal/Monthly-Reports/2025-03",
"mime_type": "application/pdf",
"labels": {
"department": "FIN",
"doc_type": "RPT",
"lifecycle_status": "Active",
"sensitivity": "Confidential"
},
"permissions": [
{"email": "[email protected]", "role": "writer", "type": "user"},
{"email": "[email protected]", "role": "reader", "type": "group"}
],
"owner": "[email protected]",
"modified_time": "2025-03-15T14:30:00Z",
"web_link": "https://docs.google.com/..."
}
]
}A scheduled script (Cloud Function or Cloud Run job) that runs weekly:
- Crawl all Shared Drives via
drives.list()+files.list() - Pull permissions via
permissions.list()for each drive and files with non-inherited permissions - Pull labels via
files.list()withincludeLabels=<labelId>field - Reconstruct folder trees
- Write to BigQuery
access_manifestdataset - Export a snapshot JSON to a designated GCS bucket or Drive folder
| Pattern | How It Works |
|---|---|
| Direct API access | Give the AI agent's service account reader access to relevant drives. Agent uses Drive API to search, list, and read files on demand. Best for real-time queries. |
| Manifest-based | Agent reads the weekly JSON manifest to answer access and structure questions without needing Drive API credentials. Best for security-constrained environments. |
| BigQuery MCP | Agent queries BigQuery via MCP (Model Context Protocol) server to answer questions about the file inventory. Best for complex analytical queries. |
| Hybrid | Agent uses manifest for structure/access questions, Drive API for content retrieval. Recommended approach. |
AI agents struggle with complex, overlapping permission models. Simplify:
- Consolidate to group-based access: Replace individual user permissions with Google Group memberships where possible
- Minimize "anyone with the link" shares: Audit and remove or convert to domain-restricted
- Use Shared Drive membership as the primary access control: Avoid file-level permission overrides
- Document exception permissions: Any file with permissions different from its parent should be flagged and justified
For AI agents that need to understand file content (not just metadata):
- Leverage Google Workspace's built-in Gemini integration for internal queries
- For external AI agents (Claude): generate a content summary index
- Use Drive API to export/download files
- Extract text (Google Docs → export as plain text, PDFs → text extraction)
- Generate per-file summaries (title, 2-3 sentence description, key entities)
- Store summaries in the manifest alongside metadata
- This avoids giving the AI agent blanket read access to all files
Goal: Prevent Drive from drifting back into chaos.
| Policy | Implementation |
|---|---|
| New file labeling | Google Workspace DLP rule: warn users when saving files without a Document-Classification label |
| Quarterly archive sweep | Scheduled Cloud Function runs the archive candidate query every quarter, generates review packets |
| Permission audit | Monthly script checks for overly broad sharing, reports to IT |
| Folder depth enforcement | Weekly script flags any folder > 4 levels deep |
| Stale file alerts | Quarterly report of files not modified in 12+ months, sent to department leads |
Build a Looker dashboard (or Google Sheets dashboard) showing:
- Total files by department, type, and lifecycle status
- Archive rate over time
- Permission health score (% of files using group-based access)
- Label coverage (% of files with complete labels)
- Storage usage by department
Track a composite score per department:
| Metric | Weight | Measurement |
|---|---|---|
| Label coverage | 30% | % of files with all label fields populated |
| Naming compliance | 20% | % of files matching the naming convention |
| Folder depth compliance | 15% | % of folders ≤ 4 levels |
| Permission simplicity | 20% | % of files using only inherited group permissions |
| Content freshness | 15% | % of active files modified within 12 months |
| Tool | Implementation | Location |
|---|---|---|
| Python + Google Client Libraries | CLI app with per-phase services, rate limiting, dry-run safety | cli.py, src/ |
| Drive API v3 wrapper | Shared drives, file listing, moves, permissions, folder creation | src/common/drive_client.py |
| Drive Labels API v2 wrapper | Label taxonomy creation, label application, field modifications | src/common/labels_client.py |
| BigQuery client | Dataset/table provisioning, inventory storage, SQL queries | src/common/bigquery_client.py |
| Google Sheets client | Review sheet generation, owner action reports, sharing | src/common/sheets_client.py |
| Google Cloud Functions | Phase 5 manifest and Phase 6 governance are designed to be deployed as scheduled Cloud Functions | src/phase5_manifest/, src/phase6_governance/ |
| Looker / Looker Studio | Dashboard data is generated by GovernanceService.generate_dashboard_data() and stored in BigQuery |
src/phase6_governance/service.py |
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Department leads don't respond within review window | Medium | High | Escalation path to executive sponsor; default-to-archive policy (with advance notice) |
| Rate limiting during large migrations | High | Medium | Exponential backoff, batch processing, run migrations during off-hours |
| Files with complex ownership (personal Drive, external shares) | Medium | Medium | Handle My Drive files separately; flag external shares for manual review |
| Label taxonomy needs revision after initial deployment | Medium | Low | Labels API supports schema updates; plan for v2 of the taxonomy |
| AI agents can't handle permission complexity | Medium | High | Permission simplification in Phase 5.4; group-based access model |
| Loss of file version history during moves | Low | Medium | Moves within Google Drive preserve version history; document this for stakeholders |
# Shared Drives
GET /drive/v3/drives — List all shared drives
POST /drive/v3/drives — Create a shared drive
# Files
GET /drive/v3/files?corpora=domain — List all domain files
GET /drive/v3/files/{fileId} — Get file metadata
PATCH /drive/v3/files/{fileId} — Move file (addParents/removeParents)
POST /drive/v3/files/{fileId}/modifyLabels — Apply/update labels
# Permissions
GET /drive/v3/files/{fileId}/permissions — List file permissions
POST /drive/v3/files/{fileId}/permissions — Add permission
# Labels
GET /drivelabels/v2/labels — List label definitions
POST /drivelabels/v2/labels — Create a label
# Admin Reports
GET /admin/reports/v1/activity/users/all/applications/drive — Drive activity audit
| Operation | Estimated Call Count | Quota Impact |
|---|---|---|
| Initial file inventory (100K files) | ~200 API calls (500 files/page) | Minimal |
| Permission retrieval (100K files) | ~100K calls | ~8 minutes at 12K/min quota |
| Label application (50K files) | ~50K calls | ~4 minutes at 12K/min quota |
| Archive file moves (30K files) | ~30K calls | ~2.5 minutes at 12K/min quota |
| Weekly manifest refresh | ~200 calls | Minimal |
Note: Actual quotas depend on your Google Workspace edition and any custom quota increases.
| Original Overview | This Plan | Rationale |
|---|---|---|
| File naming as primary organization method | Drive Labels as primary, naming as secondary | Labels are machine-queryable, survive renames, and don't require retroactive renaming of existing files |
| No mention of AI access manifest | Dedicated Phase 5 for manifest | Core objective #2 requires a structured, machine-readable view of Drive |
| Google Apps Script or Admin Reports for inventory | Python + service account + BigQuery | More robust for large-scale operations; Apps Script has 6-minute execution limits |
| Manual checklist-driven process | Automated pipeline with scheduled jobs | Reduces IT ops burden for ongoing governance |
| Archive based on last modified AND last viewed | Archive primarily on modifiedTime with viewedByMeTime as supplementary signal |
Admin Reports API only provides 180 days of view history; modifiedTime is the reliable cross-domain signal |
| Move folders to archive drives | Recreate folder structure, move files individually | Drive API cannot move folders between Shared Drives |
| No permission simplification | Explicit permission cleanup phase | Critical for AI agent comprehension of access model |