Date: 2026-01-31
Status: Implementation complete, testing in progress
Command to run: poetry run spinlock train-vqvae --config configs/vqvae/baseline_vqvae_variable_length.yaml
Implemented variable-length trajectory support for VQ-VAE training, enabling the model to train on trajectories of different lengths (T ∈ [16, 32, 64, 128, 256]) with adaptive pyramid level selection. This allows the model to learn scale-invariant representations for operator discovery and meta-learning.
The original VQ-VAE pipeline assumed fixed-length temporal trajectories (T=500). The initial variable-length implementation encountered a critical architectural mismatch:
- Initial error: training failed with "Variable-length mode requires temporal features [N, T, D], got shape torch.Size([5688, 338])"
  - Root cause: the pipeline pre-encoded temporal features to [N, 338] during loading
  - Variable-length mode needs raw [N, T, D] sequences for length sampling during training
- Secondary error: after skipping pre-encoding, training failed with "All features have zero variance!"
  - Root cause: category discovery (clustering) needs encoded features
  - Keeping raw features broke clustering
1. Feature Loading (_load_features() in train_vqvae.py)
When variable-length is enabled for temporal family:
```python
# STEP 1: Encode at full length for category discovery
temporal_encoder = get_encoder(...)
encoded_for_clustering = temporal_encoder(raw_features)  # [N, 320]

# STEP 2: Store raw features for variable-length training
raw_temporal_features = family_features  # [N, T, D]

# STEP 3: Store encoder for runtime encoding
self._temporal_encoder = temporal_encoder

# Use encoded features for clustering
all_features.append(encoded_for_clustering)
```

2. Dataloader Creation (_create_data_loaders() in train_vqvae.py)
```python
# Pass raw temporal features to dataset (for length sampling)
dataset_features = raw_temporal_features  # [N, T, D]

# Store pre-encoded initial features separately
self._encoded_initial_features = initial_features  # [N, D_initial]

# Dataset creates masks and samples lengths
```

3. Trainer (VQVAETrainer in trainer.py)
Added variable-length support parameters:
- temporal_encoder: encoder to apply at runtime
- temporal_encoder_output_dim: output dimension (320D)
- encoded_initial_features: pre-encoded initial features to concatenate
Modified training loop:
```python
# In train_epoch() and validate():
if self.temporal_encoder is not None:
    # Extract mask and length from batch
    mask = batch["mask"]      # [B, T]
    length = batch["length"]  # [B]

    # Encode temporal with variable lengths
    encoded_temporal = self.temporal_encoder(features, mask=mask, lengths=length)  # [B, 320]

    # Concatenate with pre-encoded initial features
    if self.encoded_initial_features_tensor is not None:
        features = torch.cat([initial_batch, encoded_temporal], dim=1)
    else:
        features = encoded_temporal

    # Pass to VQ-VAE
    outputs = self.model(features)
```

Raw temporal [N, T=256, D]
→ Encode at full length
→ [N, 320] encoded
→ Clustering/category discovery
→ 14 categories (4 initial + 3 P0 + 4 P1 + ... + 8 P3)
Raw temporal [N, T=256, D]
→ Dataset: Sample length T_i ∈ {16, 32, 64, 128, 256}
→ Create mask [N, T] (True×T_i, False×(T-T_i))
→ Training loop: Encode with mask
→ [N, 320] encoded (adaptive pyramid levels)
→ Concatenate with initial [N, 128]
→ [N, 448] total features
→ VQ-VAE forward pass
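The per-batch flow above can be sketched as a dataset that samples a length bin and builds a boolean mask per sample. This is a minimal illustration, not the project's actual implementation: the class name and constructor are assumptions, only the `features`/`mask`/`length` batch keys come from the document.

```python
import torch
from torch.utils.data import Dataset

class VariableLengthTemporalDataset(Dataset):
    """Hypothetical sketch: sample a trajectory length from fixed bins and
    return a boolean validity mask alongside the raw [T, D] features."""

    def __init__(self, features, length_bins=(16, 32, 64, 128, 256), seed=0):
        self.features = features              # [N, T, D] raw temporal features
        self.length_bins = list(length_bins)
        self.rng = torch.Generator().manual_seed(seed)

    def __len__(self):
        return self.features.shape[0]

    def __getitem__(self, idx):
        T = self.features.shape[1]
        # fixed_bins strategy: pick one of the configured length bins
        bin_idx = torch.randint(len(self.length_bins), (1,), generator=self.rng).item()
        T_i = min(self.length_bins[bin_idx], T)
        # Boolean mask: True for the first T_i steps, False for the padding tail
        mask = torch.zeros(T, dtype=torch.bool)
        mask[:T_i] = True
        return {"features": self.features[idx], "mask": mask, "length": T_i}
```

The trainer then encodes `features` under `mask` at runtime, as shown in the training-loop snippet earlier.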
File: configs/vqvae/baseline_vqvae_variable_length.yaml
Key settings:
```yaml
families:
  temporal:
    encoder: PyramidTemporalEncoder
    encoder_params:
      level_dims: [32, 64, 96, 128]        # Per-level output (total 320D)
      downsample_factors: [1, 2, 4, 8]     # Pyramid factors
    variable_length:
      enabled: true                        # ENABLED for variable-length mode
      min_timesteps: 16                    # Minimum: 2^4
      max_timesteps: 256                   # Maximum: 2^8
      sampling_strategy: "fixed_bins"
      length_bins: [16, 32, 64, 128, 256]  # Powers of 2
      adaptive_pyramid: true               # Auto-skip invalid levels
      mask_downsample_method: "ceil"       # Conservative masking

training:
  batch_size: 1024
  num_epochs: 100      # Variable-length: ~400-500 hours (17-21 days)
  warmup_epochs: 20
```
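The `mask_downsample_method: "ceil"` setting can be illustrated with a small helper. This is a sketch under assumptions: the function name and signature are hypothetical, only the "conservative masking" behavior is from the config above.

```python
import torch

def downsample_mask_ceil(mask: torch.Tensor, factor: int) -> torch.Tensor:
    """Conservative ('ceil') mask downsampling sketch: a coarse timestep is
    valid if ANY of the fine timesteps it covers is valid, so a partially
    covered window is kept. Assumes mask is [B, T] with T divisible by factor."""
    B, T = mask.shape
    return mask.view(B, T // factor, factor).any(dim=-1)
```

For example, a T=16 mask with 10 valid steps downsampled by factor 4 yields ceil(10/4) = 3 valid coarse steps, never fewer.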
- src/spinlock/cli/train_vqvae.py
  - Added variable-length detection in _load_features()
  - Dual encoding: encode for clustering, keep raw for training
  - Store temporal encoder and initial features
  - Modified _create_data_loaders() to use raw temporal features
  - Modified _create_trainer() to pass temporal encoder
- src/spinlock/encoding/training/trainer.py
  - Added parameters: temporal_encoder, temporal_encoder_output_dim, encoded_initial_features
  - Modified train_epoch(): encode temporal at runtime, concatenate with initial
  - Modified validate(): same runtime encoding logic
- src/spinlock/encoding/trajectory_length_sampler.py (NEW)
  - TrajectoryLengthSampler with strategies: uniform, geometric, fixed_bins
  - LengthCurriculumScheduler for gradual min→max transition
  - create_mask() utility for boolean masks
- src/spinlock/encoding/temporal_pyramid.py (MODIFIED)
  - Added get_valid_levels() for adaptive pyramid level selection
  - Added adaptive parameter and min_pyramid_length
  - Returns (levels, level_masks, valid_factors) tuple
- src/spinlock/encoding/encoders/pyramid_temporal.py (MODIFIED)
  - Added variable_length_config parameter
  - Modified forward() to accept mask and lengths
  - Added _pad_to_full_dim() for zero-padding missing levels
  - Returns (embeddings, mask_info) when mask provided
- src/spinlock/encoding/training/losses.py (MODIFIED)
  - Added sample weighting by valid fraction
  - Modified reconstruction_loss() to handle mask_info
- src/spinlock/encoding/variable_length_utils.py (NEW)
  - parse_variable_length_config() - extract and validate config
  - create_length_sampler() - instantiate sampler
  - augment_dataset_with_lengths() - wrap dataset with length sampling
  - extract_mask_info_from_batch() - extract mask metadata
- src/spinlock/encoding/training/data_utils.py (NEW - refactored from train_vqvae.py)
  - FeatureDataset class
  - create_train_val_dataloaders() with variable-length support
- configs/vqvae/baseline_vqvae_variable_length.yaml (NEW)
  - Full configuration with variable_length enabled
  - Powers-of-2 length bins for clean pyramid alignment
- docs/vqvae/temporal-pyramid.md (UPDATED)
  - Added "Variable-Length Trajectory Support" section (~200 lines)
  - Documented length sampling, adaptive pyramid, masking
- README.md (UPDATED)
  - Added variable-length callout in VQ-VAE section
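The length-curriculum idea mentioned for trajectory_length_sampler.py (gradual min→max transition) could look roughly like this. The class name and `create_mask()`-style bins come from the file list; the linear schedule and every method signature here are assumptions for illustration.

```python
import random

class LengthCurriculumScheduler:
    """Hypothetical sketch: early epochs sample only the shortest length bin;
    the set of allowed bins grows linearly until all bins are available."""

    def __init__(self, bins=(16, 32, 64, 128, 256), warmup_epochs=20):
        self.bins = list(bins)
        self.warmup_epochs = warmup_epochs

    def allowed_bins(self, epoch: int):
        # Fraction of the curriculum completed, clamped to [0, 1]
        frac = min(1.0, epoch / max(1, self.warmup_epochs))
        n = max(1, round(frac * len(self.bins)))
        return self.bins[:n]

    def sample(self, epoch: int, rng=random) -> int:
        return rng.choice(self.allowed_bins(epoch))
```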
- ✅ "Variable-length mode requires temporal features [N, T, D], got shape torch.Size([5688, 338])"
  - Fixed by skipping pre-encoding for temporal features
- ✅ "All features have zero variance!"
  - Fixed by dual encoding (encode for clustering, keep raw for training)
- ✅ "The size of tensor a (466) must match the size of tensor b (486) at non-singleton dimension 1"
  - Root cause: feature cleaning created 338D features during category discovery, but training passed 358D uncleaned features
  - Fixed by:
    - Storing feature_mask as an instance variable
    - Extracting the initial portion of the mask and applying it to initial_features_only
    - Extracting the temporal portion of the mask and applying it during training
    - Correctly concatenating cleaned initial + cleaned temporal → 338D
- ✅ Training working successfully!
- ✅ Feature cleaning applied correctly in variable-length mode
- ✅ Temporal feature mask: 300/320 features kept
- ✅ Training progressing normally (Epoch 1: loss=47.14 → Epoch 47: loss=10.17)
⚠️ torch.compile disabled for variable-length mode (runtime temporal encoding incompatible)
```
Loading features from datasets/100k_baseline_dev.h5...
initial: Hybrid mode - 38D manual + raw ICs (6320, 3, 64, 64)
temporal: Variable-length mode detected - dual encoding strategy
  Encoding at full length (T=256) for category discovery...
  Encoded for clustering: (6320, 320)
  Keeping raw features for variable-length training: (6320, 256, 345)
Proceeding with encoded features for category discovery

DISCOVERED CATEGORY STRUCTURE
Number of categories: 14
...

Variable-length mode: Using raw temporal features for runtime encoding:
  Raw temporal shape: (6320, 256, 345)
  Pre-encoded initial features shape: (6320, 142)
  These will be concatenated with encoded temporal features during training

Variable-length mode enabled:
  TrajectoryLengthSampler(min_T=16, max_T=256, strategy='fixed_bins', bins=[16, 32, 64, 128, 256])

Epoch 1/100: [Training starts]
```
- Clustering needs encoded features: Category discovery requires fixed-size embeddings
- Training needs raw features: Variable-length sampling requires [N, T, D] sequences
- Solution: Encode once at full length for clustering, keep raw for training
- Pyramid factors [1, 2, 4, 8] divide cleanly into powers of 2
- T=16: levels [16, 8, 4, 2] - all valid
- T=32: levels [32, 16, 8, 4] - all valid
- T=64: levels [64, 32, 16, 8] - all valid
- No rounding, integer boundaries at every level
- For T=16 with factors [1, 2, 4, 8]: all 4 levels valid (16/8 = 2 ≥ 1)
- For T=4 with factors [1, 2, 4, 8]: only [1, 2, 4] valid (skip 8×)
- Output zero-padded to full dimension (320D) when levels skipped
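The level-selection and zero-padding rules above can be sketched as two small functions. `get_valid_levels` and `_pad_to_full_dim` are names from the file list, but the bodies and signatures here are assumptions modeled on the T=16 / T=4 examples, not the project's actual code.

```python
import torch

def get_valid_levels(T: int, factors=(1, 2, 4, 8), min_pyramid_length: int = 1):
    """Keep a pyramid level only if the downsampled trajectory still has at
    least min_pyramid_length timesteps (assumed rule)."""
    return [f for f in factors if T // f >= min_pyramid_length]

def pad_to_full_dim(level_embeddings, valid_factors, factors=(1, 2, 4, 8),
                    level_dims=(32, 64, 96, 128)):
    """Zero-pad skipped levels so the output is always the full 320-D
    concatenation (modeled on _pad_to_full_dim; signature is hypothetical)."""
    out, i = [], 0
    for f, dim in zip(factors, level_dims):
        if f in valid_factors:
            out.append(level_embeddings[i])
            i += 1
        else:
            # Level was skipped for this trajectory length: emit zeros
            out.append(torch.zeros(level_embeddings[0].shape[0], dim))
    return torch.cat(out, dim=1)
```

With T=4, the 8× level is skipped and its 128-D slot is zero-filled, so downstream layers always see a 320-D vector.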
- Prevents short trajectories from dominating gradients
- Weight = (num_valid / max_valid)
- T=64 batch: weight = 64/256 = 0.25
- T=256 batch: weight = 256/256 = 1.00
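The weighting rule above amounts to scaling each sample's loss by its valid fraction. A minimal sketch, assuming per-sample losses are already reduced over valid timesteps (function names are illustrative):

```python
import torch

def length_sample_weights(lengths: torch.Tensor, max_length: int = 256) -> torch.Tensor:
    """Weight = valid fraction (num_valid / max_valid), so short trajectories
    contribute proportionally less to the gradient."""
    return lengths.to(torch.float32) / max_length

def weighted_mean(per_sample_loss: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    """Weighted average of per-sample losses over the batch."""
    return (per_sample_loss * weights).sum() / weights.sum()
```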
- 849514b - "fix: implement variable-length trajectory support for VQ-VAE training"
  - Modified _load_features() to skip encoding for temporal
  - Modified _create_data_loaders() to use raw temporal
  - Modified VQVAETrainer for runtime encoding
- 44620f1 - "fix: dual encoding strategy for variable-length"
  - Encode at full length for clustering
  - Keep raw for training
  - Fixes "zero variance" error
- c981790 - "fix: apply feature cleaning correctly in variable-length mode"
  - Store feature_mask and extract initial/temporal portions
  - Apply masks during training to match the model's expected dimensions
  - Fixes dimension mismatch (466D vs 486D) error
- 3750426 - "feat: implement chunked HDF5 loading to prevent OOM on large datasets"
  - Chunked HDF5 loading with 400MB target per chunk
  - Chunked encoding to avoid full tensor conversion
  - Memory reduction: 15-20 GB → ~5-6 GB (66% reduction)
  - Configuration: hdf5_chunk_size (auto/manual), encode_chunk_size (5000)
  - Backward compatible: only activates for large + variable-length + 3D features
- Monitor Training
  - Watch for successful epoch completion
  - Check loss convergence
  - Verify codebook utilization (~18% expected)
- Validation
  - After a few epochs, check reconstruction quality
  - Verify variable-length batches work correctly
  - Check that different length bins are sampled
- Optimization (if needed)
  - If training is too slow, reduce batch_size or use fewer pyramid levels
  - If memory issues arise, reduce max_timesteps or model capacity
  - If convergence is poor, adjust commitment_cost or learning_rate
- Early Stopping
  - Training will stop early if validation loss plateaus
  - Early stopping patience: 30 epochs (reduced from 100 proportionally)
  - Min delta: 0.01
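The plateau rule above (patience 30, min delta 0.01) can be expressed as a tiny tracker. A minimal sketch, assuming "improvement" means the validation loss drops by at least min_delta below the best seen so far:

```python
class EarlyStopping:
    """Stop when validation loss has not improved by at least min_delta
    for `patience` consecutive epochs."""

    def __init__(self, patience: int = 30, min_delta: float = 0.01):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.stale = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; returns True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.stale = 0
        else:
            self.stale += 1
        return self.stale >= self.patience
```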
Error: Out of memory during feature loading
- ✅ Fixed: chunked HDF5 loading now prevents this (commit 3750426)
- If issues persist: set hdf5_chunk_size: 500 for smaller chunks
- If issues persist: set encode_chunk_size: 2000 for smaller encoding chunks
Error: Out of memory during training
- Reduce batch_size from 1024 to 512
- Reduce max_timesteps from 256 to 128
- Reduce model capacity (group_embedding_dim, group_hidden_dim)
Error: Slow training
- Reduce num_epochs from 100 to 50
- Check that torch.compile is enabled and working (note: it is disabled in variable-length mode)
- Verify GPU utilization (should be >80%)
Error: Poor reconstruction quality
- Increase num_epochs for more training
- Adjust commitment_cost (try the 0.15-0.30 range)
- Check that the orthogonality target is being met
Error: Encoder-related failures
- Check that temporal_encoder is on correct device (GPU)
- Verify mask and length tensors have correct shapes
- Check that encoded_initial_features matches batch size
- Check that length sampling is working (inspect batch["length"])
- Verify adaptive pyramid is selecting correct levels
- Check that sample weighting is applied correctly
- Monitor codebook utilization per category
┌─────────────────────────────────────────────────────────────────────┐
│ FEATURE LOADING (Once) │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Raw Temporal [N, T=256, D=345] │
│ │ │
│ ├──→ Encode at full length ──→ [N, 320] ──→ Clustering │
│ │ │
│ └──→ Store raw [N, T, D] ──────────────────────────────────┼──┐
│ │ │
│ Initial [N, C, H, W] │ │
│ │ │ │
│ └──→ Extract manual ──→ [N, 38] ──→ Concat with CNN later │ │
│ │ │
└─────────────────────────────────────────────────────────────────────┘ │
│
┌─────────────────────────────────────────────────────────────────────┐ │
│ TRAINING LOOP (Every Batch) │ │
├─────────────────────────────────────────────────────────────────────┤ │
│ │ │
│ Dataset __getitem__: │◀─┘
│ 1. Sample length T_i ∈ {16, 32, 64, 128, 256} │
│ 2. Create mask [T]: [True×T_i, False×(T-T_i)] │
│ 3. Return: features=[T, D], mask=[T], length=T_i │
│ │
│ Training loop (trainer.py): │
│ 1. features = batch["features"] # [B, T, D] raw │
│ 2. mask = batch["mask"] # [B, T] boolean │
│ 3. length = batch["length"] # [B] actual lengths │
│ │
│ 4. # Encode with variable lengths │
│ encoded_temporal = temporal_encoder( │
│ features, │
│ mask=mask, │
│ lengths=length │
│ ) # [B, 320] with adaptive pyramid levels │
│ │
│ 5. # Concatenate with initial features │
│ features = concat([initial_batch, encoded_temporal], dim=1) │
│ # [B, 142+320] = [B, 462] │
│ │
│ 6. # Forward pass │
│ outputs = model(features) │
│ │
│ 7. # Compute loss (with sample weighting) │
│ loss = reconstruction_loss(outputs, targets, mask_info) │
│ │
└─────────────────────────────────────────────────────────────────────┘
- Plan document: Plan file in previous conversation
- Temporal pyramid docs: docs/vqvae/temporal-pyramid.md
- Example config: configs/vqvae/baseline_vqvae_variable_length.yaml
- Test files: tests/encoding/test_variable_length_*.py
- Original implementation: all core components in src/spinlock/encoding/
- Encoder not found errors → check src/spinlock/encoding/encoders/__init__.py registry
- Mask shape errors → check src/spinlock/encoding/temporal_pyramid.py mask downsampling
- Concatenation errors → check src/spinlock/encoding/training/trainer.py feature concatenation
- Loss computation errors → check src/spinlock/encoding/training/losses.py sample weighting
- Dataset errors → check src/spinlock/encoding/training/data_utils.py and variable_length_utils.py
END OF RESUME DOCUMENT