Figure inspired by ExpeL.
In this repository we release the code and data accompanying our paper "From Knowledge to Noise: CTIM-Rover and the Pitfalls of Episodic Memory in Software Engineering Agents".
With CTIM-Rover, a Software Engineering agent forked from AutoCodeRover, we investigate whether the experiential learning approach to episodic memory introduced in ExpeL generalizes to the more challenging software engineering domain. We find that this approach, even with modifications, does not currently generalize to software engineering and in fact degrades performance compared to the baseline AutoCodeRover. As the likely culprit, we identify noisy CTIM items that lead the agent to suboptimal initial repository exploration decisions. For details on our findings, we refer the reader to our paper.
In the ctim-rover-results directory we release the cross-task-instance memory (CTIM) used for our evaluations as ruleset.json and repo_ruleset.json. These were created using o1. In the same folder we also release all other CTIMs we experimented with during our project, along with logs and cost metadata for the CTIM construction process.
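As a minimal sketch of how one might inspect the released CTIM files (assuming they are plain JSON; the schema and helper function below are illustrative, not part of the repository):

```python
import json
from pathlib import Path

def summarize_ctim(path):
    """Load a CTIM JSON file and list its top-level keys (or item count)."""
    with open(path) as f:
        ctim = json.load(f)
    if isinstance(ctim, dict):
        return sorted(ctim.keys())
    return [f"list of {len(ctim)} items"]

# Stand-in file for illustration; in practice, point this at
# ctim-rover-results/ruleset.json or repo_ruleset.json.
Path("ruleset.json").write_text(json.dumps({"general": ["rule A"], "repo": {}}))
print(summarize_ctim("ruleset.json"))
```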
The runs_stored.zip archive contains the trajectories of all evaluation runs referred to in our project, as well as preliminary evaluations with CTIMs constructed from SWE-bench Lite trajectories. In run_output_train_full.zip we release the full results of our training run with self-reflection, including all SWE-bench Verified trajectories collected this way.
The app directory contains CTIM-Rover, our modified version of AutoCodeRover, with options for using:
- general-level CTIM
- repository-level CTIM
- exemplar-based in-context learning
In the ctim-rover-scripts folder we release our knowledge distillation scripts and our Reflexion-based training/data collection pipeline, as well as notebooks for data exploration, evaluation, and dataset splitting.
For the dataset creation process we use the annotations released with SWE-bench Verified and the leaderboard standings on this benchmark. Due to GitHub file size limitations we do not bundle the leaderboard data with this repository, but it can be downloaded from the SWE-bench website. The AutoCodeRover SWE-bench Lite trajectories mentioned in the paper are taken from the acr-results/acr-val-only folder.
If you use this work, please cite our paper:
@inproceedings{lindenbauer-etal-2025-knowledge,
title = "From Knowledge to Noise: {CTIM}-Rover and the Pitfalls of Episodic Memory in Software Engineering Agents",
author = "Lindenbauer, Tobias and
Groh, Georg and
Schuetze, Hinrich",
editor = "Kamalloo, Ehsan and
Gontier, Nicolas and
Lu, Xing Han and
Dziri, Nouha and
Murty, Shikhar and
Lacoste, Alexandre",
booktitle = "Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.realm-1.30/",
doi = "10.18653/v1/2025.realm-1.30",
pages = "411--427",
ISBN = "979-8-89176-264-0",
abstract = "We introduce CTIM-Rover, an AI agent for Software Engineering (SE) built on top of AutoCodeRover (Zhang et al., 2024) that extends agentic reasoning frameworks with an episodic memory, more specifically, a general and repository-level Cross-Task-Instance Memory (CTIM). While existing open-source SE agents mostly rely on ReAct (Yao et al., 2023b), Reflexion (Shinn et al., 2023), or Code-Act (Wang et al., 2024), all of these reasoning and planning frameworks inefficiently discard their long-term memory after a single task instance. As repository-level understanding is pivotal for identifying all locations requiring a patch for fixing a bug, we hypothesize that SE is particularly well positioned to benefit from CTIM. For this, we build on the Experiential Learning (EL) approach ExpeL (Zhao et al., 2024), proposing a Mixture-Of-Experts (MoEs) inspired approach to create both a general-purpose and repository-level CTIM . We find that CTIM-Rover does not outperform AutoCodeRover in any configuration and thus conclude that neither ExpeL nor DoT-Bank (Lingam et al., 2024) scale to real-world SE problems. Our analysis indicates noise introduced by distracting CTIM items or exemplar trajectories as the likely source of the performance degradation."
}