Modernize training on Windows: venv/pyenv, UTF-8, corpus dirs, optional WordNet vocab, text cleaning, checkpoint id_to_char #6
Conversation
Updated all the dependencies to build the env. I also changed the torch command to the whl URL, as instructed on pytorch.org, and moved off Anaconda's conda.
… text cleaning, checkpoint id_to_char for generate, UTF-8 IO, and Windows/pyenv docs. Made-with: Cursor
Hello, I have received your email and will handle it as soon as possible, thank you~
Code Review
This pull request introduces support for multi-corpus training by allowing the --text-path argument to point to a directory of text files, which are then cycled through per epoch. It also adds a feature to use a character vocabulary derived from WordNet lemmas and ensures that character mappings are preserved in checkpoint metadata across all supported frameworks (TensorFlow, Chainer, MXNet, PyTorch, and Keras). Additionally, the PR provides a new Windows-specific setup guide using pyenv-win and requirements.txt to handle legacy dependency constraints. Review feedback focuses on ensuring consistent implementation of the vocabulary metadata feature across all frameworks and fixing a path resolution bug in the TensorFlow model restoration logic.
```python
from utils import (batch_generator, corpus_for_training_epoch, encode_text,
                   generate_seed, ID2CHAR, list_training_text_files, main,
                   make_dirs, resolve_seed_text_file, sample_from_probs,
                   VOCAB_SIZE)
```
To support the new checkpoint vocabulary metadata feature consistently across frameworks, please import pop_vocab_metadata_from_model_args and id_to_char_list_from_globals from utils.
```diff
-from utils import (batch_generator, corpus_for_training_epoch, encode_text,
-                   generate_seed, ID2CHAR, list_training_text_files, main,
-                   make_dirs, resolve_seed_text_file, sample_from_probs,
-                   VOCAB_SIZE)
+from utils import (batch_generator, corpus_for_training_epoch, encode_text,
+                   generate_seed, id_to_char_list_from_globals, ID2CHAR,
+                   list_training_text_files, main, make_dirs,
+                   pop_vocab_metadata_from_model_args, resolve_seed_text_file,
+                   sample_from_probs, VOCAB_SIZE)
```
| with open("{}.json".format(checkpoint_path), encoding="utf-8") as f: | ||
| model_args = json.load(f) |
Call pop_vocab_metadata_from_model_args to ensure that any saved character mapping in the checkpoint is applied to the global state before building the network. This is necessary for the model to correctly decode text when a non-default vocabulary was used during training.
```diff
 with open("{}.json".format(checkpoint_path), encoding="utf-8") as f:
     model_args = json.load(f)
+pop_vocab_metadata_from_model_args(model_args)
```
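For reviewers unfamiliar with the two new helpers, here is a minimal sketch of what they might look like inside utils.py. This is an illustration of the intended behavior, not necessarily the PR's exact code; it assumes utils keeps module-level ID2CHAR / CHAR2ID dicts and a VOCAB_SIZE int:

```python
# Hypothetical sketch of the utils helpers named above; the PR's actual
# implementation may differ. Assumes module-level ID2CHAR / CHAR2ID dicts
# and a VOCAB_SIZE int in utils.py.
def id_to_char_list_from_globals():
    """Serialize the current id -> char table as a JSON-friendly list."""
    return [ID2CHAR[i] for i in range(VOCAB_SIZE)]


def pop_vocab_metadata_from_model_args(model_args):
    """Strip "id_to_char" from loaded checkpoint args and apply it globally.

    Keeps the metadata out of the network constructor kwargs while making
    decoding use the same alphabet the checkpoint was trained with.
    """
    global VOCAB_SIZE
    id_to_char = model_args.pop("id_to_char", None)
    if id_to_char is None:
        if model_args.get("vocab_size", VOCAB_SIZE) != VOCAB_SIZE:
            raise ValueError(
                "checkpoint vocab_size does not match the process vocabulary "
                "and no id_to_char mapping is saved in the checkpoint")
        return
    ID2CHAR.clear()
    ID2CHAR.update(enumerate(id_to_char))
    CHAR2ID.clear()
    CHAR2ID.update((c, i) for i, c in ID2CHAR.items())
    VOCAB_SIZE = len(id_to_char)
```

Mutating the dicts in place (rather than rebinding them) matters here: the framework scripts do `from utils import ID2CHAR`, which copies the reference at import time, so a rebound global in utils would never be seen by the callers.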
```python
with open("{}.json".format(args.checkpoint_path), "w", encoding="utf-8") as f:
    json.dump(model.predictor.args, f, indent=2)
```
Include the id_to_char mapping in the checkpoint JSON to ensure the model can be correctly loaded even if the default vocabulary changes or a custom alphabet was used.
```diff
 with open("{}.json".format(args.checkpoint_path), "w", encoding="utf-8") as f:
-    json.dump(model.predictor.args, f, indent=2)
+    payload = dict(model.predictor.args)
+    payload["id_to_char"] = id_to_char_list_from_globals()
+    json.dump(payload, f, indent=2, ensure_ascii=False)
```
```python
from utils import (batch_generator, corpus_for_training_epoch, encode_text,
                   generate_seed, ID2CHAR, list_training_text_files, main,
                   make_dirs, resolve_seed_text_file, sample_from_probs,
                   VOCAB_SIZE)
```
Import the new vocabulary utility functions to support saving and loading character mappings in checkpoints, ensuring consistency with the TensorFlow implementation.
```diff
-from utils import (batch_generator, corpus_for_training_epoch, encode_text,
-                   generate_seed, ID2CHAR, list_training_text_files, main,
-                   make_dirs, resolve_seed_text_file, sample_from_probs,
-                   VOCAB_SIZE)
+from utils import (batch_generator, corpus_for_training_epoch, encode_text,
+                   generate_seed, id_to_char_list_from_globals, ID2CHAR,
+                   list_training_text_files, main, make_dirs,
+                   pop_vocab_metadata_from_model_args, resolve_seed_text_file,
+                   sample_from_probs, VOCAB_SIZE)
```
```python
with open("{}.json".format(checkpoint_path), "w", encoding="utf-8") as f:
    json.dump(self.args, f, indent=2)
```
Save the character mapping in the checkpoint JSON sidecar to allow the model to be loaded correctly regardless of the environment's default vocabulary.
```diff
 with open("{}.json".format(checkpoint_path), "w", encoding="utf-8") as f:
-    json.dump(self.args, f, indent=2)
+    payload = dict(self.args)
+    payload["id_to_char"] = id_to_char_list_from_globals()
+    json.dump(payload, f, indent=2, ensure_ascii=False)
```
```python
with open("{}.json".format(checkpoint_path), encoding="utf-8") as f:
    model_args = json.load(f)
```
Apply saved vocabulary metadata when loading the model to ensure the character mapping is synchronized with the checkpoint.
```diff
 with open("{}.json".format(checkpoint_path), encoding="utf-8") as f:
     model_args = json.load(f)
+pop_vocab_metadata_from_model_args(model_args)
```
```python
from utils import (batch_generator, corpus_for_training_epoch, encode_text,
                   generate_seed, ID2CHAR, list_training_text_files, main,
                   make_dirs, resolve_seed_text_file, VOCAB_SIZE)
```
Import vocabulary utilities for checkpoint compatibility and to handle custom character sets.
```diff
-from utils import (batch_generator, corpus_for_training_epoch, encode_text,
-                   generate_seed, ID2CHAR, list_training_text_files, main,
-                   make_dirs, resolve_seed_text_file, VOCAB_SIZE)
+from utils import (batch_generator, corpus_for_training_epoch, encode_text,
+                   generate_seed, id_to_char_list_from_globals, ID2CHAR,
+                   list_training_text_files, main, make_dirs,
+                   pop_vocab_metadata_from_model_args, resolve_seed_text_file,
+                   VOCAB_SIZE)
```
```diff
 if args.restore:
     load_path = args.checkpoint_path if args.restore is True else args.restore
-    with open("{}.json".format(args.checkpoint_path)) as f:
+    with open("{}.json".format(args.checkpoint_path), encoding="utf-8") as f:
```
When restoring from a specific path (other than the default checkpoint path), the model configuration JSON should be loaded from that same path to ensure consistency between weights and hyperparameters.
```diff
-    with open("{}.json".format(args.checkpoint_path), encoding="utf-8") as f:
+    with open("{}.json".format(load_path), encoding="utf-8") as f:
```
I LOVE this repo.
This PR refreshes the project so it is easier to run on current machines without conda, keeps training Unicode-safe on Windows, and adds a few optional quality-of-life features for real book corpora.
Setup & dependencies
Document pyenv-win + Python 3.7 + venv and add a requirements.txt-based install path (TensorFlow 1.15 / Keras / PyTorch 1.1 / etc., aligned with what runs on Windows + py3.7); a sketch of the pins follows this list.
Add .python-version (3.7.9) and ignore .venv/; optional checkpoints/.gitkeep (checkpoints remain gitignored).
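For context, the pinned install might look roughly like the sketch below. Only the TensorFlow 1.15 and PyTorch 1.1 pins come from this PR's description; the remaining entries are placeholders for the frameworks the repo supports:

```text
# requirements.txt sketch (illustrative; see the PR diff for the exact pins)
tensorflow==1.15.*
torch==1.1.0    # needs the wheel index from pytorch.org on Windows, e.g.
                # pip install -r requirements.txt -f https://download.pytorch.org/whl/torch_stable.html
keras
chainer
mxnet
nltk            # only needed for --wordnet-char-vocab
```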
Training data
--text-path may point at a directory: training cycles one .txt file per epoch (sorted order); see the sketch after this list.
Optional --clean-book-text: strips common plaintext noise (page-number lines, Roman chapter lines, light OCR/normalization) via text_cleaning.py.
All corpus / sidecar JSON reads use encoding="utf-8" so UTF-8 texts work on Windows.
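A minimal sketch of the per-epoch cycling and the kind of noise --clean-book-text targets. The names list_training_text_files and corpus_for_training_epoch match the utils imports shown earlier; the bodies here, and the cleaning regexes, are illustrative rather than the PR's exact code:

```python
# Illustrative bodies for the corpus helpers imported above; the PR's
# implementations may differ in detail.
import glob
import os
import re


def list_training_text_files(text_path):
    """Return .txt files in sorted order if text_path is a directory."""
    if os.path.isdir(text_path):
        return sorted(glob.glob(os.path.join(text_path, "*.txt")))
    return [text_path]


def corpus_for_training_epoch(files, epoch):
    """Cycle through the corpus files, one per epoch."""
    path = files[epoch % len(files)]
    with open(path, encoding="utf-8") as f:  # UTF-8 so Windows reads cleanly
        return f.read()


# Examples of line patterns --clean-book-text strips (see text_cleaning.py):
PAGE_NUMBER_LINE = re.compile(r"^\s*\d{1,4}\s*$")          # bare page numbers
ROMAN_CHAPTER_LINE = re.compile(r"^\s*[IVXLCDM]+\.?\s*$")  # "XIV." chapter heads
```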
Vocabulary & generation
Optional --wordnet-char-vocab: smaller character alphabet derived from WordNet lemma strings (requires nltk; cached inventory under data/); see the sketch after this list.
Checkpoints save id_to_char in model.ckpt.json; loaders apply it so generate decodes with the same id→char table as training (avoids garbage output when the alphabet is non-default).
Clear error when checkpoint vocab_size does not match the process vocabulary and no id_to_char is present.
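A sketch of how the WordNet-derived alphabet could be built and cached. The helper name and cache filename here are hypothetical, and nltk's wordnet corpus must be downloaded before first use:

```python
# Hypothetical sketch of the --wordnet-char-vocab inventory; the actual
# helper name and cache filename in the PR may differ.
import json
import os

from nltk.corpus import wordnet  # requires: python -m nltk.downloader wordnet


def wordnet_char_vocab(cache_path=os.path.join("data", "wordnet_chars.json")):
    """Return the sorted set of characters occurring in WordNet lemma names."""
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            return json.load(f)
    chars = set()
    for lemma in wordnet.all_lemma_names():
        chars.update(lemma)
    vocab = sorted(chars)
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(vocab, f, ensure_ascii=False)
    return vocab
```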
TensorFlow
Ensure the graph is built with the same vocab_size as the training args (embedding/softmax consistent with saved hyperparameters).
Docs
README: Windows/pyenv/venv workflow, corrected CLI examples (--checkpoint-path / --text-path), and notes on restore/generate.
Notes for reviewers
Intended runtime remains Python 3.7 and TensorFlow 1.x APIs (tf.contrib, etc.); this is a compatibility/maintenance PR, not a TF2 migration.
cioran/ sample texts are included for local experiments; happy to drop or move them if you prefer not to ship corpus files in-tree.