
Add Custom Text File Character-Level Training Support #679

Open
Adolfo-GM wants to merge 5 commits into karpathy:master from Adolfo-GM:data-textfile


@Adolfo-GM

Added support for training a character-level GPT model on custom text files (textfile_char dataset).

Training

python train.py config/train_textfile_char.py --device=cpu --compile=False --eval_iters=20 --log_interval=1 --block_size=256 --batch_size=12 --n_layer=6 --n_head=8 --n_embd=128 --max_iters=2000 --lr_decay_iters=2000 --dropout=0.0

Final Results:

Train Loss: 0.0281
Val Loss: 5.3117
Parameters: 1.19M
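
The contents of config/train_textfile_char.py are not shown in this description. As a rough sketch only, the overrides passed on the training command line above could be written as a nanoGPT-style config file, which is a plain Python file of variable assignments. The out_dir value is taken from the sampling command and the dataset name from the PR summary; everything else mirrors the flags above, and the real file may differ.

```python
# Hypothetical sketch of config/train_textfile_char.py (not the PR's actual file).
# Values mirror the command-line overrides used in the training command.

out_dir = "out-textfile-char"  # same directory that sample.py reads from
dataset = "textfile_char"      # resolves to data/textfile_char/

# Hardware and logging
device = "cpu"
compile = False                # shadows the builtin, as nanoGPT configs do
eval_iters = 20
log_interval = 1

# Model size
block_size = 256
batch_size = 12
n_layer = 6
n_head = 8
n_embd = 128                   # must be divisible by n_head
dropout = 0.0

# Schedule
max_iters = 2000
lr_decay_iters = 2000          # decay over the full run
```

With a file like this in place, the command would shrink to python train.py config/train_textfile_char.py, with any remaining flags still overridable on the command line.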

Sampling

python sample.py --out_dir=out-textfile-char --device=cpu --start="<u> hi" --num_samples=3

Sample Output:

<u> hi> teat's ikire tis as conony es
<a> thereeawives y
<a> lwou touldod ald I he to mar?
<a> gelony mes! dou nonilpih?
<a> he helll! t g alpeo hice? wout blprikire?
<u> it'tthe mesth?
<a> me I any they ane thaly ave blyoing hang memeng! nnin anerey atou cals t fut cer seru blo oto oomio? athind hee.

Added files:

config/
└── train_textfile_char.py

data/textfile_char/
├── data.txt
├── prepare.py
└── readme.md

The script prepares text files for character-level language modeling by reading .txt files, encoding the data, and saving the training and validation datasets along with metadata.
Added readme for character-level dataset preparation.
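
The preparation step described above (read text, encode characters, save train and validation data plus metadata) can be sketched as follows. This is not the PR's prepare.py, whose source is not shown here. The function names, the 90/10 split, and the output file names train.bin, val.bin, and meta.pkl are assumptions modeled on nanoGPT's existing character-level dataset. The sketch uses the standard-library array module to write 16-bit token ids, where the repository's own scripts use NumPy.

```python
import pickle
from array import array
from pathlib import Path


def build_vocab(text):
    """Map each distinct character to an integer id (sorted for determinism)."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for i, ch in enumerate(chars)}
    return stoi, itos


def encode(text, stoi):
    """Turn a string into a list of integer token ids."""
    return [stoi[ch] for ch in text]


def prepare(text, out_dir, train_fraction=0.9):
    """Encode text, split it, and write train.bin, val.bin, and meta.pkl."""
    stoi, itos = build_vocab(text)
    ids = encode(text, stoi)
    split = int(len(ids) * train_fraction)  # assumed 90/10 split

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, chunk in (("train.bin", ids[:split]), ("val.bin", ids[split:])):
        with open(out / name, "wb") as f:
            array("H", chunk).tofile(f)  # "H" = unsigned 16-bit integers

    # Metadata lets the sampling script map ids back to characters.
    meta = {"vocab_size": len(stoi), "stoi": stoi, "itos": itos}
    with open(out / "meta.pkl", "wb") as f:
        pickle.dump(meta, f)
    return meta


if __name__ == "__main__":
    # Hypothetical usage: data.txt sits next to this script.
    here = Path(__file__).parent
    raw = (here / "data.txt").read_text(encoding="utf-8")
    info = prepare(raw, here)
    print(f"vocab size: {info['vocab_size']}, characters: {len(raw)}")
```

Splitting by position rather than shuffling keeps contiguous text in each split, which matters because training samples are drawn as contiguous windows of block_size characters.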
gkielian added a commit to gkielian/ReaLLMASIC_nanogpt that referenced this pull request Feb 21, 2026
…_scripts

Add phi format comparison scripts