
Illinois-Theorem-Provers/recursive-nn


Understanding model behaviors when learning to solve recursive problems.

This repository contains the code and resources used in the project to understand how transformer models solve recursive problems.

Table of Contents

Jasper Lee's Notes

  • I'm Jasper Lee; I worked on this project in Fall 2025. This repo was originally the code for the paper "Transformer-Based Models Are Not Yet Perfect At Learning to Emulate Structural Recursion". The original code was never publicly released, and it was quite messy, with lots of dead code and little documentation. (Zory Zhang, one of the authors of the paper, shared it with me. He also shared another codebase Binary_successor_experiment-main/ but I felt it was less relevant so I mostly just used this one.) A lot of that messiness is still in the repo, but I hope to at least clear some of the confusion with these notes.
  • This section is my only contribution to the README; the rest is from the original codebase.
  • The project I worked on was follow-up work to the original paper. The goal of the work is to find methods that augment LLMs so that they are better at emulating structural recursion.
  • Most of my changes to the original codebase are marked with a # jlee comment, and most of the TODOs I wrote are marked TODO: jlee:.
  • I didn't delete much code, even code that looked like dead code, from the original codebase (since I wasn't sure what code would end up being useful).
  • All the scripts for running the code are found in scripts/. The only scripts I used were scripts/train_transformer_bin_succ_jlee.sh (copied from scripts/train_transformer_bin_succ.sh) and scripts/train_transformer_traversal_jlee.sh and scripts/train_transformer_traversal_jlee_trace.sh (copied from scripts/train_transformer_traversal.sh).
  • I will focus on scripts/train_transformer_traversal_jlee_trace.sh now, as that is the most recent script file I worked on, though the others are similar.
    • The variables that I usually adjust between runs are:
      • MIN_TRAINING_LENGTH
      • MAX_TRAINING_LENGTH
      • MIN_TESTING_LENGTH
      • MAX_TESTING_LENGTH
      • LR
      • TRAIN_EPOCHS
      • TASK
      • EXP_NAME (this is just for organizing the logs)
      • MODE
      • LOAD_CHECKPOINT
    • I might temporarily set EVAL_FREQUENCY to 1 for testing purposes.
    • It's not separated out as an environment variable, but --padding_mode was changed from the usual random_pad_left to default because the random_pad_left code didn't work with differing input and output lengths.
    • More info on MODE and LOAD_CHECKPOINT:
      • I added these to support evaluation-only runs and resuming training runs. These environment variables may not be listed yet in the rest of the script files.
      • MODE="train" is for starting a new training run; LOAD_CHECKPOINT is ignored.
      • MODE="train_resume" is for resuming a training run from the checkpoint specified in LOAD_CHECKPOINT. Note that the specified checkpoint only loads the model weights; the training parameters are still decided by the script file. Also note that TRAIN_EPOCHS specifies the total number of epochs of training, including any previous training runs.
      • MODE="test_only" runs an evaluation run on the checkpoint in LOAD_CHECKPOINT. This is similar to the mid-training evaluation runs (controlled by EVAL_FREQUENCY) but will likely evaluate on a larger sample of the test dataset (usually the entire dataset).
      • MODE="test_only_full_steps" runs a special type of evaluation run on the checkpoint in LOAD_CHECKPOINT, meant specifically for TASKs with "unroll" in the name. For each test set example, this mode calls the LLM repeatedly, first on the input and then on its most recent output, until it gives a final output that we check against the ground truth. This mode is not optimized and runs extremely slowly.
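The iteration in "test_only_full_steps" can be pictured roughly as follows. This is a minimal sketch only: the real loop lives in trainers/transformer_trainer.py, and the names `iterate_until_fixed_point`, `model_step`, and `max_iters` here are hypothetical, as is the fixed-point stopping criterion.

```python
# Hypothetical sketch of the MODE="test_only_full_steps" loop: feed the model
# its own most recent output until it stops changing (or a cap is hit).
def iterate_until_fixed_point(model_step, tree_input, max_iters=64):
    out = tree_input
    for _ in range(max_iters):
        nxt = model_step(out)   # one unroll step of the traversal
        if nxt == out:          # assumed stopping criterion: output is stable
            break
        out = nxt
    return out                  # final output, compared against ground truth
```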
  • I will now trace a run of scripts/train_transformer_traversal_jlee_trace.sh to give a sense of the codebase structure.
    • Assume that MODE="train".
    • scripts/train_transformer_traversal_jlee_trace.sh calls transformer_main.py, which parses the arguments, creates a new TransformerEncoderDecoderTrainer and calls run_experiment() on it.
    • The constructor of TransformerEncoderDecoderTrainer, in trainers/transformer_trainer.py, generates its training and testing data by calling a method in tasks/task.py, which in turn calls tasks/traversal.py, which in turn calls tasks/bin_tree_gen/__init__.py.
    • run_experiment() is also in trainers/transformer_trainer.py, along with the majority of the training (and evaluation) code.
    • During training, checkpoints are saved in runs/ and TensorBoard logging data is saved in tensorboard_result/. (Note that the original codebase also had runs_v3/ and tensorboard_result_v3/ folders but I'm not sure about their significance.)
  • I will now briefly discuss my progress on the project.
    • I spent a lot of time figuring out the code and replicating the results of the previous paper; I was able to get quite close.
    • I created a new traced format (traversal_preorder_trace or traversal_inorder_trace) where the input is the tree (as usual) and the output is the entire trace of the traversal. I initially tried a node-by-node trace, but those traces were too long, so I switched to a more tractable layer-by-layer trace. Even so, depth-6 trees ran out of CUDA memory; only trees up to depth 5 fit.
    • traversal_preorder_trace gets up to about 0.98 test accuracy in 500 epochs (though it's already really close at 300 epochs) while traversal_inorder_trace only reaches about 0.66 test accuracy in 2000 epochs.
    • I then created a new unrolled format (traversal_inorder_unroll) that unrolls the traces, including each single step of each trace as a separate training/testing pair. In theory, training an LLM on this and then looping it repeatedly on the input tree should eventually result in the correct traversal. However, it only ended with about 0.73 test accuracy for inorder traversal. (I didn't run unrolling with preorder traversal, as it is too easy.) I didn't try looping it in evaluation (MODE="test_only_full_steps") since accuracy was too low.
    • It's unclear what the next steps should be for this project. A good place to start would be getting better accuracy on any of the inorder traversal tasks. We also looked at several different architectures and frameworks for giving LLMs the ability to do recursion.
    • If needed, I can provide more detailed numbers and graphs for any of the experiments discussed above (including sharing the checkpoint and log folders).
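For reference, the traversals being emulated and the unrolling idea discussed above can be sketched as follows. This is an illustration only: the repo's actual trace format is defined in tasks/traversal.py (and is layer-by-layer, not shown here), and `unroll_pairs` is a hypothetical name for the idea of splitting a trace into per-step training pairs.

```python
# Standard preorder/inorder traversals of a binary tree, represented here as
# nested tuples (value, left, right), with None for an empty subtree.
def preorder(t):
    if t is None:
        return []
    v, l, r = t
    return [v] + preorder(l) + preorder(r)

def inorder(t):
    if t is None:
        return []
    v, l, r = t
    return inorder(l) + [v] + inorder(r)

# The unrolled format turns a full trace [s0, s1, ..., sn] into one
# (input, output) pair per step; looping a model trained on these pairs
# should, in principle, reproduce the whole trace.
def unroll_pairs(trace):
    return [(trace[i], trace[i + 1]) for i in range(len(trace) - 1)]
```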
  • Feel free to contact me at [email protected] if you have any questions!

Code Structure

The repository is structured as follows:

  • /tasks: Includes the tasks that we have been experimenting with. There is a basic Task class in task.py; the rest of the classes inherit from it. For the binary successor task, we use BinarSuccessorGeneratorV2 in /tasks/sucessors.py.
  • /trainers: Includes the trainers and their utilities. To facilitate experiments, there is one trainer for min_gpt and one for the standard encoder-decoder transformer; they are largely similar, with some differences. transformer_trainer.py includes the abstract trainer class and the trainer for encoder-decoder transformers; min_gpt_trainer.py includes the trainer for min-gpt.
  • /models: Includes the transformer model implementation and the min_gpt hook. We followed The Annotated Transformer for the transformer implementation.
  • /min_gpt: An adapted version of min_gpt for our project.
  • /analyze: Includes some notebooks (which are work-in-progress).

How to Run the Code

To train the model for the binary-successor task, navigate to /scripts and run train_transformer_bin_succ.sh with appropriate values for each argument. We use bin_basic_v2 for experiments in natural order and bin_rev_v2 for experiments in reversed order.

The script will call transformer_main.py and configure the trainer to use. Note that in the current implementation of the binary successor task (v2), the length ranges and number of samples for testing are hard-coded. Therefore, you may want to set MAXINT, TRAIN_MAX_NUM, and TEST_MAX_NUM to the same value, namely the maximum number used for training. (We generate train samples from 2 to TRAIN_MAX_NUM.)
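For intuition, a single binary-successor example pairs a number's binary representation with that of its successor. The sketch below is an illustration only (the actual generator is BinarSuccessorGeneratorV2 in /tasks/sucessors.py; `bin_successor_pair` is a hypothetical helper, and the exact tokenization used by the repo may differ):

```python
def bin_successor_pair(n, reverse=False):
    """Illustrative (input, target) pair: n and n + 1 in binary."""
    src, tgt = format(n, "b"), format(n + 1, "b")
    if reverse:  # bin_rev_v2-style: least-significant bit first
        src, tgt = src[::-1], tgt[::-1]
    return src, tgt
```

For example, `bin_successor_pair(3)` gives `("11", "100")`, and with `reverse=True` it gives `("11", "001")`.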

License

This project is licensed under the MIT License. See the LICENSE file for details.
