Figure inspired by ExpeL.
In this repository we release the code and data accompanying our paper "From Knowledge to Noise: CTIM-Rover and the Pitfalls of Episodic Memory in Software Engineering Agents".
With CTIM-Rover, a Software Engineering agent forked from AutoCodeRover, we investigate whether the experiential learning approach to episodic memory introduced in ExpeL generalizes to the more challenging software engineering domain. We find that this approach, even with modifications, does not currently generalize to software engineering and in fact degrades performance compared to the baseline AutoCodeRover. As the likely culprit, we identify noisy CTIM items that lead the agent to suboptimal initial repository exploration decisions. For details on our findings, we refer the reader to our paper.
In the ctim-rover-results directory we release the cross-task-instance memory (CTIM) used for our evaluations as ruleset.json and repo_ruleset.json. These were created using o1. In the same folder we also release all other CTIMs we experimented with during our project, along with logs and cost metadata for the CTIM construction process.
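As a minimal sketch of how one might inspect the released CTIM files (assuming they are plain JSON; the schema and helper function below are illustrative, not part of the repository):

```python
import json
from pathlib import Path

def summarize_ctim(path):
    """Load a CTIM JSON file and list its top-level keys (or item count)."""
    with open(path) as f:
        ctim = json.load(f)
    if isinstance(ctim, dict):
        return sorted(ctim.keys())
    return [f"list of {len(ctim)} items"]

# Stand-in file for illustration; in practice, point this at
# ctim-rover-results/ruleset.json or repo_ruleset.json.
Path("ruleset.json").write_text(json.dumps({"general": ["rule A"], "repo": {}}))
print(summarize_ctim("ruleset.json"))
```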
The runs_stored.zip archive contains the trajectories of all evaluation runs referred to in our project, as well as preliminary evaluations with CTIMs constructed from SWE-bench Lite trajectories. In run_output_train_full.zip we release the full results of our training run with self-reflection, including all SWE-bench Verified trajectories collected this way.
The app directory contains CTIM-Rover, our modified version of AutoCodeRover, with options for using:
- general-level CTIM
- repository-level CTIM
- exemplar-based in-context learning
In the ctim-rover-scripts folder we release our knowledge distillation scripts and our Reflexion-based training/data collection pipeline, as well as notebooks for data exploration, evaluation, and dataset splitting.
For the dataset creation process we use the annotations released with SWE-bench Verified and the leaderboard standings on this benchmark. Due to GitHub file size limitations we do not bundle the leaderboard data with this repository, but it can be downloaded from the SWE-bench website. The AutoCodeRover SWE-bench Lite trajectories mentioned in the paper are taken from the acr-results/acr-val-only folder.
If you use this work, please cite our paper:
@inproceedings{lindenbauer-etal-2025-knowledge,
title = "From Knowledge to Noise: {CTIM}-Rover and the Pitfalls of Episodic Memory in Software Engineering Agents",
author = "Lindenbauer, Tobias and
Groh, Georg and
Schuetze, Hinrich",
editor = "Kamalloo, Ehsan and
Gontier, Nicolas and
Lu, Xing Han and
Dziri, Nouha and
Murty, Shikhar and
Lacoste, Alexandre",
booktitle = "Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.realm-1.30/",
doi = "10.18653/v1/2025.realm-1.30",
pages = "411--427",
ISBN = "979-8-89176-264-0",
abstract = "We introduce CTIM-Rover, an AI agent for Software Engineering (SE) built on top of AutoCodeRover (Zhang et al., 2024) that extends agentic reasoning frameworks with an episodic memory, more specifically, a general and repository-level Cross-Task-Instance Memory (CTIM). While existing open-source SE agents mostly rely on ReAct (Yao et al., 2023b), Reflexion (Shinn et al., 2023), or Code-Act (Wang et al., 2024), all of these reasoning and planning frameworks inefficiently discard their long-term memory after a single task instance. As repository-level understanding is pivotal for identifying all locations requiring a patch for fixing a bug, we hypothesize that SE is particularly well positioned to benefit from CTIM. For this, we build on the Experiential Learning (EL) approach ExpeL (Zhao et al., 2024), proposing a Mixture-Of-Experts (MoEs) inspired approach to create both a general-purpose and repository-level CTIM . We find that CTIM-Rover does not outperform AutoCodeRover in any configuration and thus conclude that neither ExpeL nor DoT-Bank (Lingam et al., 2024) scale to real-world SE problems. Our analysis indicates noise introduced by distracting CTIM items or exemplar trajectories as the likely source of the performance degradation."
}