- Executive Summary
- Content-Addressable Naming Scheme
- Structured Logging Format
- Blob Storage Organization
- Indexing and Metadata
- Implementation Roadmap
- Example Usage
This document describes the architecture for a comprehensive logging and artifact indexing system for Ascend. The system enables:
- Content-addressable naming: Jobs, logs, and artifacts are named using a deterministic scheme that enables filtering by user, project, date range, dependencies, and other metadata
- Structured logging: All logs follow a consistent format that is both human-readable and machine-parsable (JSON Lines format)
- Search and retrieval: Users can search and retrieve logs and artifacts from previous runs using intuitive CLI commands
- Bulk operations: Administrators can search, retrieve, and delete logs/jobs/artifacts in bulk using query filters
- Project-scoped storage: All logs and artifacts are stored in blob storage with per-project isolation
- Deterministic naming: Job IDs incorporate metadata for easy filtering without requiring a database
- Human-readable: Logs are viewable in terminal with rich formatting while remaining machine-parsable
- Efficient storage: Logs and artifacts use blob storage with lifecycle policies for automatic archival
- Fast retrieval: Metadata is encoded in path structure for quick filtering without full scanning
- Privacy-aware: User namespaces provide isolation while allowing admin oversight
Job IDs follow a structured format that encodes metadata for filtering and discovery:
{timestamp}-{user}-{project}-{dep_hash}-{run_hash}
Components:
- `timestamp`: ISO 8601 format `YYYYMMDD-HHMMSS` (UTC) for time-based filtering
- `user`: Username (sanitized, lowercase, max 16 chars)
- `project`: Project name (sanitized, lowercase, max 16 chars, default: "default")
- `dep_hash`: First 8 characters of the dependency set hash (from `DependencySet.calculate_hash()`)
- `run_hash`: First 8 characters of SHA256(function_name + args_hash + random_nonce)
Example:
20251204-143022-alice-frauddetect-a3f5d2c8-7b9e4f1a
This ID tells us:
- Created on 2025-12-04 at 14:30:22 UTC
- By user "alice"
- In project "frauddetect"
- With dependency hash a3f5d2c8 (can correlate with image tags)
- Unique run identifier 7b9e4f1a
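Assembling an ID under this scheme takes only a few lines. A minimal sketch, assuming `args_hash` is precomputed elsewhere (the function names here are illustrative, not Ascend's actual API):

```python
import hashlib
import re
import secrets
from datetime import datetime, timezone

def _sanitize(name: str, max_len: int = 16) -> str:
    """Lowercase, strip to alphanumerics, truncate — per the component rules above."""
    return re.sub(r"[^a-z0-9]", "", name.lower())[:max_len]

def make_job_id(user: str, project: str, dep_hash: str,
                function_name: str, args_hash: str) -> str:
    """Build {timestamp}-{user}-{project}-{dep_hash}-{run_hash}."""
    ts = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    nonce = secrets.token_hex(8)  # random nonce guards against run_hash collisions
    run_hash = hashlib.sha256(
        (function_name + args_hash + nonce).encode()
    ).hexdigest()[:8]
    return f"{ts}-{_sanitize(user)}-{_sanitize(project)}-{dep_hash[:8]}-{run_hash}"
```

Because user and project names are sanitized to alphanumerics, dashes only ever separate fields, which keeps the ID trivially parsable.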
Artifacts follow a similar pattern with type suffixes:
{job_id}.{artifact_type}.{extension}
Examples:
20251204-143022-alice-frauddetect-a3f5d2c8-7b9e4f1a.log.jsonl
20251204-143022-alice-frauddetect-a3f5d2c8-7b9e4f1a.result.pkl
20251204-143022-alice-frauddetect-a3f5d2c8-7b9e4f1a.model.pkl
20251204-143022-alice-frauddetect-a3f5d2c8-7b9e4f1a.plot.png
Image tags already use dependency hashing (format: `{user}-{hash}`). The `dep_hash` in job IDs matches the first 8 characters of the 12-character image hash, enabling correlation:
Image tag: alice-a3f5d2c8e9b1 (12 chars)
Job ID contains: a3f5d2c8 (first 8 chars match)
This allows queries like "show me all jobs using this Docker image".
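That correlation check is a simple prefix comparison. A sketch, under the assumption that usernames are sanitized to alphanumerics so dashes only separate fields (helper names are illustrative):

```python
def dep_hash_from_job_id(job_id: str) -> str:
    """Extract the dep_hash field (position 4 once the timestamp splits in two)."""
    return job_id.split("-")[4]

def image_matches_job(image_tag: str, job_id: str) -> bool:
    """True if the job's 8-char dep_hash prefixes the image tag's 12-char hash."""
    image_hash = image_tag.rsplit("-", 1)[-1]
    return image_hash.startswith(dep_hash_from_job_id(job_id))
```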
All logs use JSON Lines format - one JSON object per line. This format is:
- Human-readable: Can be viewed with `cat`, `less`, or `tail -f`
- Machine-parsable: Each line is valid JSON for easy processing
- Streamable: Can process logs line-by-line without loading entire file
- Standard: Wide tool support (jq, LogStash, Fluentd, etc.)
{
"timestamp": "2025-12-04T14:30:22.123456Z",
"level": "INFO",
"logger": "ascend.runner",
"job_id": "20251204-143022-alice-frauddetect-a3f5d2c8-7b9e4f1a",
"user": "alice",
"project": "frauddetect",
"message": "Starting function execution",
"context": {
"function_name": "train_model",
"node_type": "gpu_small",
"pod_name": "ascend-20251204-143022-alice-frauddetect-a3f5d2c8-7b9e4f1a-xyz",
"namespace": "ascend-users-alice"
},
"extra": {}
}

Required fields:
- `timestamp`: ISO 8601 with microseconds (UTC)
- `level`: One of DEBUG, INFO, WARNING, ERROR, CRITICAL
- `logger`: Logger name (hierarchical, e.g., "ascend.runner.executor")
- `job_id`: Full job ID for correlation
- `user`: Username
- `project`: Project name
- `message`: Human-readable log message
Optional fields:
- `context`: Structured context (function name, pod info, etc.)
- `extra`: Additional metadata specific to the log entry
- `exception`: Exception details (type, message, stacktrace) for ERROR/CRITICAL
- `duration_ms`: Duration in milliseconds for timing logs
- `metrics`: Performance metrics (memory usage, GPU utilization, etc.)
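Emitting records in this schema needs nothing beyond the standard library. A minimal sketch (the helper name is illustrative, not part of Ascend's API):

```python
import json
from datetime import datetime, timezone

def format_log_entry(level, logger, job_id, user, project, message,
                     context=None, **extra):
    """Build one JSON Lines record with the required fields described above.
    The caller appends a newline and writes the result to the log file."""
    entry = {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ"),
        "level": level,
        "logger": logger,
        "job_id": job_id,
        "user": user,
        "project": project,
        "message": message,
        "context": context or {},
        "extra": extra,
    }
    return json.dumps(entry)
```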
| Level | Usage | Examples |
|---|---|---|
| DEBUG | Detailed debugging information | Function arguments, internal state |
| INFO | Normal operational events | "Job started", "Dependencies installed" |
| WARNING | Non-critical issues | "Package not found, using fallback" |
| ERROR | Error events that don't stop execution | "Failed to upload artifact, retrying" |
| CRITICAL | Fatal errors requiring job termination | "Out of memory", "Function crashed" |
{"timestamp":"2025-12-04T14:30:22.001Z","level":"INFO","logger":"ascend.runner","job_id":"20251204-143022-alice-frauddetect-a3f5d2c8-7b9e4f1a","user":"alice","project":"frauddetect","message":"Job started","context":{"pod_name":"ascend-...-xyz","namespace":"ascend-users-alice"}}
{"timestamp":"2025-12-04T14:30:23.456Z","level":"INFO","logger":"ascend.runner.deps","job_id":"20251204-143022-alice-frauddetect-a3f5d2c8-7b9e4f1a","user":"alice","project":"frauddetect","message":"Installing 5 dependencies","context":{"packages":["numpy","pandas","scikit-learn","torch","transformers"]}}

All logs and artifacts are stored in Azure Blob Storage with the following hierarchy:
<storage-account>/
└── ascend-data/
├── projects/
│ └── {project-name}/ # Per-project isolation
│ ├── users/
│ │ └── {username}/ # Per-user within project
│ │ ├── jobs/
│ │ │ └── {job-id}/
│ │ │ ├── package.pkl # Serialized function + args
│ │ │ ├── metadata.json # Job metadata (see schema below)
│ │ │ ├── job.log.jsonl # Structured logs
│ │ │ ├── result.pkl # Execution result
│ │ │ └── artifacts/ # User-generated artifacts
│ │ │ ├── model.pkl
│ │ │ ├── plot.png
│ │ │ └── metrics.json
│ │ └── images/
│ │ └── {dep-hash}/
│ │ ├── requirements.txt
│ │ └── Dockerfile
│ └── logs/
│ └── by-date/
│ └── {YYYY}/{MM}/{DD}/
│ └── {job-id}.log.jsonl # Symlink or copy for date-based lookup
└── shared/
└── base-images/ # Shared base images (not user-specific)
Each job has a metadata.json file with comprehensive information:
{
"job_id": "20251204-143022-alice-frauddetect-a3f5d2c8-7b9e4f1a",
"created_at": "2025-12-04T14:30:22.000Z",
"updated_at": "2025-12-04T14:35:23.456Z",
"status": "completed",
"user": "alice",
"project": "frauddetect",
"function_name": "train_model",
"config": {
"cpu": "4",
"memory": "8Gi",
"timeout": 3600,
"node_type": "gpu_small"
},
"dependencies": {
"hash": "a3f5d2c8e9b1",
"python_version": "3.11",
"packages": [
"numpy==1.24.0",
"pandas==2.0.0",
"scikit-learn==1.3.0",
"torch==2.0.0",
"transformers==4.30.0"
],
"system_packages": [],
"use_gpu": true
},
"execution": {
"start_time": "2025-12-04T14:30:46.001Z",
"end_time": "2025-12-04T14:35:22.123Z",
"duration_seconds": 276.122,
"pod_name": "ascend-20251204-143022-alice-frauddetect-a3f5d2c8-7b9e4f1a-xyz",
"namespace": "ascend-users-alice",
"node_name": "aks-gpu-12345-vmss000001",
"exit_code": 0
},
"resources": {
"cpu_requested": "4",
"memory_requested": "8Gi",
"gpu_requested": "1",
"peak_memory_mb": 4096,
"avg_cpu_percent": 65,
"avg_gpu_utilization_percent": 85
},
"artifacts": [
{
"name": "model.pkl",
"type": "model",
"size_bytes": 104857600,
"created_at": "2025-12-04T14:35:21.000Z"
},
{
"name": "plot.png",
"type": "visualization",
"size_bytes": 524288,
"created_at": "2025-12-04T14:35:22.000Z"
}
],
"tags": {
"experiment": "baseline",
"model_type": "random_forest",
"dataset_version": "v2.1"
}
}

By Job ID (primary):
projects/{project}/users/{user}/jobs/{job-id}/
By Date (secondary index):
projects/{project}/logs/by-date/{YYYY}/{MM}/{DD}/{job-id}.log.jsonl
By User (list all user jobs):
projects/{project}/users/{user}/jobs/
Blob storage lifecycle policies for cost optimization:
| Storage Tier | Age | Access Pattern | Cost/GB |
|---|---|---|---|
| Hot | 0-30d | Frequent | $0.018 |
| Cool | 30-90d | Infrequent | $0.010 |
| Archive | 90d+ | Rare | $0.002 |
| Delete | 365d+ | None | $0.000 |
Exceptions:
- Tagged jobs with `retain=true` are never deleted
- Admins can configure retention per-project
- Critical artifacts (e.g., production models) can have custom retention
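As a sketch, an Azure Storage lifecycle management rule implementing the tiers above might look like the following (field names follow Azure's policy schema; the rule name, prefix, and exact thresholds are assumptions to be adapted):

```json
{
  "rules": [
    {
      "name": "ascend-log-tiering",
      "enabled": true,
      "type": "Lifecycle",
      "definition": {
        "filters": {
          "blobTypes": ["blockBlob"],
          "prefixMatch": ["ascend-data/projects/"]
        },
        "actions": {
          "baseBlob": {
            "tierToCool": { "daysAfterModificationGreaterThan": 30 },
            "tierToArchive": { "daysAfterModificationGreaterThan": 90 },
            "delete": { "daysAfterModificationGreaterThan": 365 }
          }
        }
      }
    }
  ]
}
```

Note that lifecycle rules alone cannot honor the `retain=true` exception; retained jobs would need a separate prefix or blob index tag excluded from this rule.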
The content-addressable naming scheme enables queries without a database by using blob storage listing with prefix filters.
Query patterns:
# List all jobs by user alice in project frauddetect
prefix = "projects/frauddetect/users/alice/jobs/"
# List all jobs in date range (Dec 1-7, 2025)
prefix = "projects/{project}/logs/by-date/2025/12/"
start_date = "01"
end_date = "07"
# List all jobs using specific dependency hash
# Scan jobs, filter by dep_hash in job_id
pattern = "*-*-*-a3f5d2c8-*"

For complex queries across projects, a lightweight search index can be maintained:
Azure Search Index:
{
"job_id": "string",
"created_at": "datetime",
"user": "string",
"project": "string",
"function_name": "string",
"status": "string",
"duration_seconds": "float",
"node_type": "string",
"dep_hash": "string",
"tags": ["string"],
"python_version": "string",
"packages": ["string"]
}

Enables queries like:
- "Find all failed jobs in the last week"
- "Show jobs using torch>=2.0.0"
- "List all GPU jobs by cost"
The index is populated asynchronously by a background process that watches blob storage.
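Even without the search index, the dependency-hash query pattern shown earlier reduces to a wildcard match over listed job IDs. A stdlib-only sketch (the function name is illustrative):

```python
import fnmatch

def jobs_using_dep_hash(job_ids, dep_hash):
    """Filter job IDs whose embedded dep_hash field matches, using the
    *-*-*-{dep_hash}-* wildcard pattern from the query examples above."""
    pattern = f"*-*-*-{dep_hash[:8]}-*"
    return [jid for jid in job_ids if fnmatch.fnmatch(jid, pattern)]
```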
Users can manage and inspect their jobs via the ascend jobs CLI group:
# List recent jobs (default: last 20)
uv run ascend jobs list
uv run ascend jobs list --status running
uv run ascend jobs list --status failed --limit 5
uv run ascend jobs list --project myproject
# Show detailed status for a specific job
uv run ascend jobs status <job-id>
# Cancel a running job (deletes K8s job and marks metadata as cancelled)
uv run ascend jobs cancel <job-id>
# View stored logs for a completed job
uv run ascend jobs logs <job-id>
# Stream live logs from a running job
uv run ascend jobs logs <job-id> --follow

ascend jobs list reads job metadata from blob storage, then filters and
sorts by creation time. The --status filter supports: queued, running,
completed, failed, cancelled.
ascend jobs status displays full job details including resource config,
dependency info, and — for running jobs — live K8s pod status.
ascend jobs cancel deletes the K8s Job resource (with background
propagation so pods are cleaned up) and updates the metadata to cancelled.
ascend jobs logs reads stored log files (job.log.jsonl) from blob
storage. With --follow, it streams live logs from the pod via the K8s API.
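The filter-and-sort step behind `ascend jobs list` can be sketched with nothing but the per-job `metadata.json` files (the helper name and the blob-map input shape are illustrative):

```python
import json

def filter_jobs(metadata_blobs, status=None, limit=20):
    """metadata_blobs maps blob path -> JSON text of a job's metadata.json.
    Returns up to `limit` jobs, newest first, optionally filtered by status."""
    jobs = [json.loads(text) for text in metadata_blobs.values()]
    if status is not None:
        jobs = [j for j in jobs if j.get("status") == status]
    # ISO 8601 timestamps sort correctly as strings
    jobs.sort(key=lambda j: j["created_at"], reverse=True)
    return jobs[:limit]
```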
- Namespace Isolation: Jobs, logs, and artifacts are isolated by user namespace
- RBAC: Users can only access their own data; admins can access all data within a project
- Audit Logging: All admin operations (especially deletions) are logged
- Lazy Loading: Don't download full logs unless requested
- Pagination: List commands return paginated results
- Streaming: Use streaming for large file downloads
- Caching: Cache frequently accessed metadata locally
- Parallel Operations: Bulk operations should use parallelism
- Retry Logic: Failed log uploads should be retried
- Partial Uploads: If artifact upload fails, job still succeeds
- Graceful Degradation: Missing metadata shouldn't break log viewing
- Data Integrity: Use checksums for artifact downloads
- Compression: Compress logs before archiving
- Lifecycle Policies: Automatically move old data to cheaper storage tiers
- Deduplication: Shared dependencies should not be duplicated
- Selective Retention: Allow users to mark important jobs for longer retention
This architecture provides a comprehensive logging and artifact management system that:
- ✅ Enables efficient search and retrieval without requiring a database
- ✅ Provides human-readable logs while maintaining machine parseability
- ✅ Supports both user self-service and admin bulk operations
- ✅ Optimizes storage costs through lifecycle policies
- ✅ Maintains security and privacy through namespace isolation
- ✅ Scales efficiently with growing data volumes
The content-addressable naming scheme and structured logging format form a solid foundation that can evolve with additional features while remaining backward compatible.