This repository contains the code and resources used in the project to understand how transformer models solve recursive problems.
- I'm Jasper Lee; I worked on this project in Fall 2025. This repo was originally the code for the paper "Transformer-Based Models Are Not Yet Perfect At Learning to Emulate Structural Recursion". The original code was never publicly released, and it was quite messy, with lots of dead code and little documentation. (Zory Zhang, one of the authors of the paper, shared it with me. He also shared another codebase, `Binary_successor_experiment-main/`, but I felt it was less relevant, so I mostly just used this one.) A lot of that messiness is still in the repo, but I hope to at least clear up some of the confusion with these notes.
- This section is my only contribution to the README; the rest is from the original codebase.
- The project I worked on was follow-up work to the original paper. The goal of the work is to find methods that augment LLMs so that they are better at emulating structural recursion.
- Most of my changes to the original codebase are marked with a `# jlee` comment, and most of the TODOs I wrote are marked `TODO: jlee:`.
- I didn't delete much code from the original codebase, even code that looked dead, since I wasn't sure what would end up being useful.
- All the scripts for running the code are found in `scripts/`. The only scripts I used were `scripts/train_transformer_bin_succ_jlee.sh` (copied from `scripts/train_transformer_bin_succ.sh`) and `scripts/train_transformer_traversal_jlee.sh` and `scripts/train_transformer_traversal_jlee_trace.sh` (both copied from `scripts/train_transformer_traversal.sh`).
- I will focus on `scripts/train_transformer_traversal_jlee_trace.sh` now, as that is the most recent script file I worked on, though the others are similar.
- The variables that I usually adjust between runs are:
  - `MIN_TRAINING_LENGTH`
  - `MAX_TRAINING_LENGTH`
  - `MIN_TESTING_LENGTH`
  - `MAX_TESTING_LENGTH`
  - `LR`
  - `TRAIN_EPOCHS`
  - `TASK`
  - `EXP_NAME` (this is just for organizing the logs)
  - `MODE`
  - `LOAD_CHECKPOINT`
- I might temporarily set `EVAL_FREQUENCY` to 1 for testing purposes.
- It's not separated out as an environment variable, but `--padding_mode` was changed from the usual `random_pad_left` to `default` because the `random_pad_left` code didn't work with differing input and output lengths.
- More info on `MODE` and `LOAD_CHECKPOINT`:
  - These were added by me to add functionality for evaluation runs and for resuming training runs. The environment variables may not yet be listed in the rest of the script files.
  - `MODE="train"` starts a new training run; `LOAD_CHECKPOINT` is ignored.
  - `MODE="train_resume"` resumes a training run from the checkpoint specified in `LOAD_CHECKPOINT`. Note that the specified checkpoint only loads the model weights; the training parameters are still decided by the script file. Also note that `TRAIN_EPOCHS` specifies the total number of epochs of training, including any previous training runs.
  - `MODE="test_only"` runs an evaluation on the checkpoint in `LOAD_CHECKPOINT`. This is similar to the mid-training evaluation runs (controlled by `EVAL_FREQUENCY`) but will likely evaluate on a larger sample of the test dataset (usually the entire dataset).
  - `MODE="test_only_full_steps"` runs a special type of evaluation on the checkpoint in `LOAD_CHECKPOINT`, meant specifically for `TASK`s with "unroll" in the name. For each test set example, this mode iterates the LLM, calling it repeatedly, first on the input, then on its previous (most recent) output, until it gives a final output that we check against the ground truth. This mode is not optimized and runs extremely slowly.
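The iterate-on-its-own-output idea behind `test_only_full_steps` can be sketched as follows. This is a minimal illustration, not the repo's actual evaluation code: the name `run_full_steps`, the `model_step` callable, and the fixed-point stopping rule are all assumptions made here for clarity.

```python
def run_full_steps(model_step, tree_input, max_iters=64):
    """Repeatedly apply the model to its own most recent output.

    `model_step` stands in for one forward pass of the trained model.
    Here we assume (hypothetically) that iteration stops when the output
    stops changing, i.e. when the model reaches a fixed point.
    """
    current = tree_input
    for _ in range(max_iters):
        nxt = model_step(current)
        if nxt == current:  # fixed point: treat as the final answer
            return nxt
        current = nxt
    return current  # give up after max_iters to avoid looping forever

# Toy usage: a "model" that appends one token per call until length 3.
out = run_full_steps(lambda s: s if len(s) >= 3 else s + "x", "x")
```

The final `out` is then what gets compared against the ground-truth traversal; calling the model once per step for every example is what makes this mode so slow.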
- I will now trace a run of `scripts/train_transformer_traversal_jlee_trace.sh` to give a sense of the codebase structure.
  - Assume that `MODE="train"`.
  - `scripts/train_transformer_traversal_jlee_trace.sh` calls `transformer_main.py`, which parses the arguments, creates a new `TransformerEncoderDecoderTrainer`, and calls `run_experiment()` on it.
  - The constructor of `TransformerEncoderDecoderTrainer`, in `trainers/transformer_trainer.py`, generates its training and testing data by calling a method in `tasks/task.py`, which in turn calls `tasks/traversal.py`, which in turn calls `tasks/bin_tree_gen/__init__.py`.
  - `run_experiment()` is also in `trainers/transformer_trainer.py`, along with the majority of the training (and evaluation) code.
  - During training, checkpoints are saved in `runs/` and TensorBoard logging data is saved in `tensorboard_result/`. (Note that the original codebase also had `runs_v3/` and `tensorboard_result_v3/` folders, but I'm not sure about their significance.)
- I will now briefly discuss my progress on the project.
- I spent a lot of time figuring out the code and replicating the results of the previous paper; I was able to get quite close.
- I created a new traced format (`traversal_preorder_trace` or `traversal_inorder_trace`) where the input is the tree (as usual) and the output is the entire trace of the traversal. I initially tried a node-by-node trace, but these were too long, so I switched to a layer-by-layer trace, which was more tractable. However, it still couldn't fit depth-6 trees without running out of CUDA memory, only depth-5 trees.
  - `traversal_preorder_trace` gets up to about 0.98 test accuracy in 500 epochs (though it's already really close at 300 epochs), while `traversal_inorder_trace` only reaches about 0.66 test accuracy in 2000 epochs.
- I then created a new unrolled format (`traversal_inorder_unroll`) that unrolls the traces, including each single step of each trace as a separate training/testing pair. In theory, training an LLM on this and then looping it repeatedly on the input tree should eventually result in the correct traversal. However, it only reached about 0.73 test accuracy for inorder traversal. (I didn't run unrolling with preorder traversal, as it is too easy.) I didn't try looping it in evaluation (`MODE="test_only_full_steps"`) since the accuracy was too low.
- It's unclear what the next steps should be for this project. A good place to start would be getting better accuracy on any of the inorder traversal tasks. We also looked at several different architectures and frameworks for giving LLMs the ability to do recursion.
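The unrolling idea can be illustrated with a small sketch. The function name and the string trace representation here are hypothetical; the real `traversal_inorder_unroll` data uses the repo's own token format.

```python
def unroll_trace(trace):
    """Split a full trace [s0, s1, ..., sn] into per-step (input, output) pairs.

    Mirrors the idea behind traversal_inorder_unroll: each single step of
    the trace becomes its own training/testing example, and a model trained
    on these pairs can then be looped on its own output at evaluation time.
    """
    return [(trace[i], trace[i + 1]) for i in range(len(trace) - 1)]

# Hypothetical trace: tree input, two intermediate states, final traversal.
pairs = unroll_trace(["tree", "step1", "step2", "traversal"])
```

Each pair trains the model to advance the trace by exactly one step, which is why looping it at evaluation time should, in principle, reproduce the full traversal.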
- If needed, I can provide more detailed numbers and graphs for any of the experiments discussed above (including sharing the checkpoint and log folders).
- Feel free to contact me at [email protected] if you have any questions!
The repository is structured as follows:
- `/tasks`: Includes the tasks that we have been experimenting with. There is a basic `Task` class in `task.py`; the rest of the classes inherit from it. For the binary successor task, we will be using `BinarSuccessorGeneratorV2` in `/tasks/sucessors.py`.
- `/trainers`: Includes trainers and trainer utilities. For the sake of facilitating experiments, there is one trainer for min_gpt and one for the standard encoder-decoder transformer, which are largely similar with some differences. `transformer_trainer.py` includes the abstract trainer class and the trainer for encoder-decoder transformers; `min_gpt_trainer.py` includes the trainer for min-gpt.
- `/models`: Includes the transformer model implementation and the min_gpt hook. We followed The Annotated Transformer for the transformer implementation.
- `/min_gpt`: An adapted version of min_gpt for our project.
- `/analyze`: Includes some notebooks (which are works in progress).
To train the model for the binary-successor task, navigate to `/scripts` and run `train_transformer_bin_succ.sh` with appropriate values for each argument. We use `bin_basic_v2` for experiments in natural order, and `bin_rev_v2` for experiments in reversed order.
The script calls `transformer_main.py` and configures the trainer to use. Note that in the current implementation of the binary successor task (v2), we have hard-coded the length ranges and number of samples for testing. Therefore, you may want to set `MAXINT`, `TRAIN_MAX_NUM`, and `TEST_MAX_NUM` to the same value: the maximum number for training. (We generate train samples from 2 to `TRAIN_MAX_NUM`.)
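A consistent configuration per the note above might look like the following sketch. The concrete numbers are illustrative, not values from the repo, and the inclusive upper bound on the sample range is an assumption.

```python
# Illustrative values only; the point is that all three variables match,
# since the v2 task hard-codes the testing length ranges and sample counts.
MAXINT = 1000
TRAIN_MAX_NUM = 1000
TEST_MAX_NUM = 1000

# Training samples run from 2 up to TRAIN_MAX_NUM (assumed inclusive here).
train_samples = list(range(2, TRAIN_MAX_NUM + 1))
```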
This project is licensed under the MIT License. See the LICENSE file for details.