The Codebase Flattener is a Python script that flattens a directory structure by copying files from a source directory and its subdirectories into a target directory, grouping them into folders of a specified size. It provides control over which directories and files to include in the flattening process and automatically ignores files and directories specified in the .gitignore file.
- Recursively traverses a source directory and its subdirectories.
- Copies files to a target directory, grouping them into folders of a specified size.
- Flattens the directory structure by replacing path separators with double dashes (`--`).
- Allows specifying an include list of directories and files to flatten via a JSON configuration file.
- Automatically ignores files and directories specified in the `.gitignore` file.
- Estimates token counts using NLTK and TikToken tokenizers.
- Provides user feedback and error handling.
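
To make the renaming and grouping behaviour concrete, here is a minimal sketch of how a relative path could be turned into a flat file name and assigned to a group folder. The function names and the `group_N` folder naming are assumptions made for illustration; the script's real naming scheme may differ.

```python
def flat_name(relative_path):
    """Replace path separators with double dashes.

    Example: "src/utils/io.py" becomes "src--utils--io.py".
    """
    # Normalise Windows separators first so both styles are treated alike.
    parts = relative_path.replace("\\", "/").split("/")
    return "--".join(part for part in parts if part)


def group_folder(index, files_per_folder=100):
    """Return a folder name for the file at a zero-based position.

    The "group_N" naming is hypothetical, chosen only for this sketch.
    """
    return f"group_{index // files_per_folder + 1}"


def plan_copies(relative_paths, files_per_folder=100):
    """Map each source path to "<group folder>/<flattened name>"."""
    return {
        path: f"{group_folder(i, files_per_folder)}/{flat_name(path)}"
        for i, path in enumerate(sorted(relative_paths))
    }
```

With `files_per_folder=2`, the first two files (in sorted order) land in `group_1` and the third in `group_2`.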
- Python 3.6 or higher
- NLTK library
- TikToken library
- Clone the repository:
  `git clone https://github.com/Agentic-Insights/codebase-flattener`
- Install the required dependencies:
  `pip install -r requirements.txt`
- Create a `flatten-config.json` file in the source directory with the following format:
  `{ "include": [ "path/to/include1", "path/to/include2" ] }`
  Replace `path/to/include1` and `path/to/include2` with the paths of directories or files you want to include in the flattening process.
- Run the script:
  `python flatten.py path/to/source/directory --target_dir path/to/target/directory`
  - Replace `path/to/source/directory` with the path to the directory you want to flatten.
  - Optionally, specify the target directory using the `--target_dir` flag. If not provided, the script will create a directory named `{source_directory}_flat` in the current directory.
- The script will flatten the specified directory structure, copying files to the target directory in separate folders based on the specified number of files per folder (default is 100).
- If the `flatten-config.json` file is not found or the include list is empty, the script will not proceed with flattening unless the `--force` flag is provided.
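
The start-up checks described in these steps could look roughly like the sketch below: read the include list, decide whether to proceed, and derive the default target directory. The helper names are hypothetical and the actual script may organise this logic differently.

```python
import json
from pathlib import Path

CONFIG_NAME = "flatten-config.json"


def load_include_list(source_dir):
    """Return the "include" list from the config file, or [] if it is missing."""
    config_path = Path(source_dir) / CONFIG_NAME
    if not config_path.is_file():
        return []
    with config_path.open(encoding="utf-8") as fh:
        config = json.load(fh)
    include = config.get("include", [])
    # Treat a malformed "include" value the same as an empty list.
    return include if isinstance(include, list) else []


def should_proceed(include, force=False):
    """Flatten only if something is included, unless --force was given."""
    return bool(include) or force


def default_target_dir(source_dir):
    """Build "{source_directory}_flat" in the current working directory."""
    return Path.cwd() / f"{Path(source_dir).resolve().name}_flat"
```

For a source directory named `myproject`, `default_target_dir` yields a `myproject_flat` path under the current working directory.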
The script allows for configuration through the flatten-config.json file placed in the source directory. The JSON file should have the following structure:
{
"include": [
"path/to/include1",
"path/to/include2",
"path/to/directory/*",
"**/*.txt"
]
}

- `include`: An array of file paths, directory paths, or patterns to include in the flattening process. If not specified or empty, the script will not proceed with flattening unless the `--force` flag is provided.
The include array supports the following types of entries:
- File paths: Specify the relative path to a specific file to include it in the flattening process. For example, `"path/to/file.txt"`.
- Directory paths: Specify the relative path to a directory to include all its files and subdirectories in the flattening process. For example, `"path/to/directory"`.
- Glob patterns: Use wildcard characters to match multiple files or directories. For example, `"path/to/directory/*"` will include all files in the specified directory, and `"**/*.txt"` will include all files with a `.txt` extension in any directory.
The script interprets the entries in the include array similarly to the patterns used in .gitignore files. It supports the following wildcard characters:
- `*`: Matches any number of characters except path separators (`/` or `\`).
- `**`: Matches any number of characters, including path separators, to match files in nested directories.
- `?`: Matches any single character.
- `[abc]`: Matches any single character within the specified set (in this case, `a`, `b`, or `c`).
- `[!abc]`: Matches any single character not within the specified set.
Note that the script automatically ignores files and directories specified in the .gitignore file, so there is no need to duplicate those entries in the flatten-config.json file.
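
One way to picture these wildcard rules is as a translation from an include entry to a regular expression. The following is a sketch under stated assumptions, not the script's actual matcher: `pattern_to_regex` and `is_included` are hypothetical names, `**/` is assumed to also match zero directories, and `?` is assumed not to cross a path separator (as in `.gitignore`).

```python
import re


def pattern_to_regex(pattern):
    """Translate an include pattern into a compiled regular expression."""
    i, out = 0, []
    while i < len(pattern):
        ch = pattern[i]
        if ch == "*":
            if pattern[i:i + 3] == "**/":
                out.append("(?:.*/)?")  # zero or more whole directories
                i += 3
                continue
            if pattern[i:i + 2] == "**":
                out.append(".*")  # anything, including separators
                i += 2
                continue
            out.append("[^/]*")  # anything except a separator
        elif ch == "?":
            out.append("[^/]")  # assumption: one non-separator character
        elif ch == "[":
            end = pattern.find("]", i + 1)
            body = pattern[i + 1:end] if end > i + 1 else ""
            if body in ("", "!"):
                out.append(re.escape(ch))  # unterminated or empty set: literal "["
            else:
                if body.startswith("!"):
                    body = "^" + body[1:]  # [!abc] negates the set
                out.append("[" + body + "]")
                i = end
        else:
            out.append(re.escape(ch))
        i += 1
    return re.compile("".join(out) + r"\Z")


def is_included(path, include):
    """Check a relative path against every entry of the include list."""
    path = path.replace("\\", "/")
    for pattern in include:
        pattern = pattern.rstrip("/")
        if pattern_to_regex(pattern).match(path):
            return True
        # A plain directory entry includes everything beneath it.
        if path.startswith(pattern + "/"):
            return True
    return False
```

Under these assumptions, `"tests/*.py"` matches `tests/test_a.py` but not `tests/unit/test_a.py`, while `"docs/**/*.md"` matches Markdown files at any depth below `docs`.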
Examples of valid include entries:
{
"include": [
"src/main.py",
"src/utils",
"tests/*.py",
"docs/**/*.md"
]
}

- `"src/main.py"`: Includes the specific file `main.py` in the `src` directory.
- `"src/utils"`: Includes all files and subdirectories in the `src/utils` directory.
- `"tests/*.py"`: Includes all files with a `.py` extension in the `tests` directory.
- `"docs/**/*.md"`: Includes all files with a `.md` extension in the `docs` directory and its subdirectories.
- `--tokenizer`: Specify the tokenizer to use for estimating the token count. Available options: `nltk`, `tiktoken`. If not provided, the script will use an average of both tokenizers.
- `--force`: Force flattening when the configuration file is missing or the include list is empty.
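
As an illustration of how these options fit together, the sketch below declares the documented flags with `argparse` and shows the averaging rule for the token estimate. The positional argument name `source_dir` and the helper `combine_estimates` are assumptions for this example; the two counts are passed in as plain numbers rather than computed with NLTK or TikToken.

```python
import argparse


def build_parser():
    """Declare the command-line options described in this README."""
    parser = argparse.ArgumentParser(description="Flatten a directory structure.")
    # "source_dir" is a hypothetical name for the positional source path.
    parser.add_argument("source_dir", help="directory to flatten")
    parser.add_argument("--target_dir", default=None,
                        help="output directory (default: {source_directory}_flat)")
    parser.add_argument("--tokenizer", choices=["nltk", "tiktoken"], default=None,
                        help="tokenizer for the estimate (default: average of both)")
    parser.add_argument("--force", action="store_true",
                        help="flatten even without a config file or include list")
    return parser


def combine_estimates(nltk_count, tiktoken_count, tokenizer=None):
    """Pick one tokenizer's count, or average both when none is selected."""
    if tokenizer == "nltk":
        return nltk_count
    if tokenizer == "tiktoken":
        return tiktoken_count
    return (nltk_count + tiktoken_count) / 2
```

For example, `build_parser().parse_args(["proj", "--force"])` leaves `tokenizer` as `None`, in which case `combine_estimates` returns the mean of the two counts.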
This project is licensed under the MIT License.