Modernize training on Windows: venv/pyenv, UTF-8, corpus dirs, optional WordNet vocab, text cleaning, checkpoint id_to_char #6

Open
jordiaphane wants to merge 4 commits into yxtay:master from jordiaphane:cioran-training

Conversation

@jordiaphane

I LOVE this repo.

This PR refreshes the project so it is easier to run on current machines without conda, keeps training Unicode-safe on Windows, and adds a few optional quality-of-life features for real book corpora.

Setup & dependencies

- Document pyenv-win + Python 3.7 + venv and add a requirements.txt-based install path (TensorFlow 1.15 / Keras / PyTorch 1.1 / etc., aligned with what runs on Windows + py3.7); a rough pin set is sketched after this list.
- Add .python-version (3.7.9) and ignore .venv/; optional checkpoints/.gitkeep (checkpoints themselves remain gitignored).
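
A requirements.txt along these lines is the rough shape of that install path; the pins below are illustrative only, and the file committed in this PR is the authoritative list:

# requirements.txt (illustrative pins, not the exact file from this PR)
tensorflow==1.15.0
keras==2.2.4
torch==1.1.0      # on Windows, installed from the wheel URL listed on pytorch.org
mxnet==1.5.0
chainer==6.1.0
nltk==3.4.5       # only needed for --wordnet-char-vocab
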
Training data

- --text-path may point at a directory: training cycles through one .txt file per epoch, in sorted order (see the sketch after this list).
- Optional --clean-book-text: strips common plaintext noise (page-number lines, Roman-numeral chapter lines, light OCR cleanup/normalization) via text_cleaning.py.
- All corpus and sidecar-JSON reads use encoding="utf-8" so UTF-8 texts work on Windows.
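
A minimal sketch of how the per-epoch cycling and the optional cleaning could fit together. The names list_training_text_files and corpus_for_training_epoch match the helpers referenced in the review below, but their bodies and signatures here are illustrative, not the PR's actual implementation; clean_book_text is a hypothetical name standing in for the logic shipped in text_cleaning.py:

import os
import re


def list_training_text_files(text_path):
    """Return the .txt files to train on; a single file becomes a one-element corpus."""
    if os.path.isdir(text_path):
        return sorted(os.path.join(text_path, name)
                      for name in os.listdir(text_path)
                      if name.endswith(".txt"))
    return [text_path]


def clean_book_text(text):
    """Strip common plaintext noise: bare page-number lines and Roman-numeral chapter headings."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if re.fullmatch(r"\d+", stripped):            # bare page numbers
            continue
        if re.fullmatch(r"[IVXLCDM]+\.?", stripped):  # Roman-numeral chapter lines
            continue
        kept.append(line)
    return "\n".join(kept)


def corpus_for_training_epoch(text_files, epoch, clean=False):
    """Cycle through the corpus one file per epoch, in sorted order, optionally cleaning it."""
    path = text_files[epoch % len(text_files)]
    with open(path, encoding="utf-8") as f:  # explicit UTF-8 so the Windows default codepage is never used
        text = f.read()
    return clean_book_text(text) if clean else text

With that shape, epoch 0 trains on the first file in sorted order, epoch 1 on the second, and so on, wrapping around once every file has been used.
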
Vocabulary & generation

- Optional --wordnet-char-vocab: a smaller character alphabet derived from WordNet lemma strings (requires nltk; the inventory is cached under data/).
- Checkpoints save id_to_char in model.ckpt.json; loaders apply it so generate decodes with the same id→char table used during training, avoiding garbage output when the alphabet is non-default (sketched below).
- Clear error when the checkpoint's vocab_size does not match the process vocabulary and no id_to_char is present.
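
Roughly, the id_to_char round-trip looks like the sketch below. ID2CHAR is the existing utils global, id_to_char_list_from_globals and pop_vocab_metadata_from_model_args are the helpers this PR adds (their bodies here are illustrative), and CHAR2ID is assumed to be the name of the inverse lookup table:

import utils  # module holding the global character tables (ID2CHAR / CHAR2ID assumed to be dicts)


def id_to_char_list_from_globals():
    """Serialize the current id -> char table as a JSON-friendly list indexed by id."""
    return [utils.ID2CHAR[i] for i in range(len(utils.ID2CHAR))]


def pop_vocab_metadata_from_model_args(model_args):
    """Apply an id_to_char table saved in a checkpoint sidecar, or fail loudly on a mismatch."""
    id_to_char = model_args.pop("id_to_char", None)
    if id_to_char is not None:
        # mutate the shared tables in place so modules that did `from utils import ID2CHAR`
        # see the restored mapping as well
        utils.ID2CHAR.clear()
        utils.ID2CHAR.update(enumerate(id_to_char))
        utils.CHAR2ID.clear()
        utils.CHAR2ID.update({ch: i for i, ch in enumerate(id_to_char)})
    elif model_args.get("vocab_size", len(utils.ID2CHAR)) != len(utils.ID2CHAR):
        raise ValueError(
            "checkpoint vocab_size={} does not match the process vocabulary ({}) "
            "and no id_to_char table was saved".format(
                model_args["vocab_size"], len(utils.ID2CHAR)))

Training writes id_to_char_list_from_globals() into model.ckpt.json, and generate calls pop_vocab_metadata_from_model_args on the loaded sidecar before building the network, so decoding uses exactly the alphabet the checkpoint was trained with.
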
TensorFlow

- Ensure the graph is built with the same vocab_size as the saved training args, so the embedding and softmax layers stay consistent with the checkpoint's hyperparameters.
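
On restore this amounts to loading the sidecar, applying the vocabulary metadata, and then sizing the graph from the saved hyperparameters; a rough sketch, where restore_model_args and build_infer_graph are hypothetical stand-ins for the repo's actual functions:

import json

from utils import pop_vocab_metadata_from_model_args  # helper added by this PR (see sketch above)


def restore_model_args(checkpoint_path):
    """Load saved hyperparameters so the graph is rebuilt with the training-time vocab_size."""
    with open("{}.json".format(checkpoint_path), encoding="utf-8") as f:
        model_args = json.load(f)
    pop_vocab_metadata_from_model_args(model_args)
    return model_args


# hypothetical usage: size the embedding and softmax from the checkpoint, not the default alphabet
# model_args = restore_model_args(args.checkpoint_path)
# infer_graph = build_infer_graph(**model_args)  # model_args includes the saved vocab_size
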
Docs

- README: Windows/pyenv/venv workflow, corrected CLI examples (--checkpoint-path / --text-path), and notes on restore/generate.
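
For example, a training run now points --text-path at a corpus file or directory and --checkpoint-path at where model.ckpt (and its .json sidecar) should live, roughly python tf_model.py train --text-path=data/cioran --checkpoint-path=checkpoints/cioran/model.ckpt, with generation invoked against the same --checkpoint-path; the invocation shape and paths here are illustrative, and the updated README has the exact commands.
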
Notes for reviewers

- Intended runtime remains Python 3.7 and the TensorFlow 1.x APIs (tf.contrib, etc.); this is a compatibility/maintenance PR, not a TF2 migration.
- cioran/ sample texts are included for local experiments; happy to drop or move them if you prefer not to ship corpus files in-tree.

jordiaphane and others added 3 commits May 17, 2019 03:23
the dependencies to build the env
i also changed the torch command
to the whl url as instructed on pytorch.org
also changed conda from anaconda, updated all
… text cleaning, checkpoint id_to_char for generate, UTF-8 IO, and Windows/pyenv docs.

Made-with: Cursor
@find-fun

find-fun commented Apr 22, 2026 via email

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces support for multi-corpus training by allowing the --text-path argument to point to a directory of text files, which are then cycled through per epoch. It also adds a feature to use a character vocabulary derived from WordNet lemmas and ensures that character mappings are preserved in checkpoint metadata across all supported frameworks (TensorFlow, Chainer, MXNet, PyTorch, and Keras). Additionally, the PR provides a new Windows-specific setup guide using pyenv-win and requirements.txt to handle legacy dependency constraints. Review feedback focuses on ensuring consistent implementation of the vocabulary metadata feature across all frameworks and fixing a path resolution bug in the TensorFlow model restoration logic.

Comment thread chainer_model.py
Comment on lines +14 to +17
from utils import (batch_generator, corpus_for_training_epoch, encode_text,
                   generate_seed, ID2CHAR, list_training_text_files, main,
                   make_dirs, resolve_seed_text_file, sample_from_probs,
                   VOCAB_SIZE)

medium

To support the new checkpoint vocabulary metadata feature consistently across frameworks, please import pop_vocab_metadata_from_model_args and id_to_char_list_from_globals from utils.

Suggested change
-from utils import (batch_generator, corpus_for_training_epoch, encode_text,
-                   generate_seed, ID2CHAR, list_training_text_files, main,
-                   make_dirs, resolve_seed_text_file, sample_from_probs,
-                   VOCAB_SIZE)
+from utils import (batch_generator, corpus_for_training_epoch, encode_text,
+                   generate_seed, id_to_char_list_from_globals, ID2CHAR,
+                   list_training_text_files, main, make_dirs,
+                   pop_vocab_metadata_from_model_args, resolve_seed_text_file,
+                   sample_from_probs, VOCAB_SIZE)

Comment thread chainer_model.py
Comment on lines +80 to 81
with open("{}.json".format(checkpoint_path), encoding="utf-8") as f:
model_args = json.load(f)

medium

Call pop_vocab_metadata_from_model_args to ensure that any saved character mapping in the checkpoint is applied to the global state before building the network. This is necessary for the model to correctly decode text when a non-default vocabulary was used during training.

Suggested change
with open("{}.json".format(checkpoint_path), encoding="utf-8") as f:
model_args = json.load(f)
with open("{}.json".format(checkpoint_path), encoding="utf-8") as f:
model_args = json.load(f)
pop_vocab_metadata_from_model_args(model_args)

Comment thread chainer_model.py
Comment on lines +232 to 233
with open("{}.json".format(args.checkpoint_path), "w", encoding="utf-8") as f:
json.dump(model.predictor.args, f, indent=2)

medium

Include the id_to_char mapping in the checkpoint JSON to ensure the model can be correctly loaded even if the default vocabulary changes or a custom alphabet was used.

Suggested change
with open("{}.json".format(args.checkpoint_path), "w", encoding="utf-8") as f:
json.dump(model.predictor.args, f, indent=2)
with open("{}.json".format(args.checkpoint_path), "w", encoding="utf-8") as f:
payload = dict(model.predictor.args)
payload["id_to_char"] = id_to_char_list_from_globals()
json.dump(payload, f, indent=2, ensure_ascii=False)

Comment thread mxnet_model.py
Comment on lines +13 to +16
from utils import (batch_generator, corpus_for_training_epoch, encode_text,
                   generate_seed, ID2CHAR, list_training_text_files, main,
                   make_dirs, resolve_seed_text_file, sample_from_probs,
                   VOCAB_SIZE)

medium

Import the new vocabulary utility functions to support saving and loading character mappings in checkpoints, ensuring consistency with the TensorFlow implementation.

Suggested change
-from utils import (batch_generator, corpus_for_training_epoch, encode_text,
-                   generate_seed, ID2CHAR, list_training_text_files, main,
-                   make_dirs, resolve_seed_text_file, sample_from_probs,
-                   VOCAB_SIZE)
+from utils import (batch_generator, corpus_for_training_epoch, encode_text,
+                   generate_seed, id_to_char_list_from_globals, ID2CHAR,
+                   list_training_text_files, main, make_dirs,
+                   pop_vocab_metadata_from_model_args, resolve_seed_text_file,
+                   sample_from_probs, VOCAB_SIZE)

Comment thread mxnet_model.py
Comment on lines +63 to 64
with open("{}.json".format(checkpoint_path), "w", encoding="utf-8") as f:
json.dump(self.args, f, indent=2)

medium

Save the character mapping in the checkpoint JSON sidecar to allow the model to be loaded correctly regardless of the environment's default vocabulary.

Suggested change
with open("{}.json".format(checkpoint_path), "w", encoding="utf-8") as f:
json.dump(self.args, f, indent=2)
with open("{}.json".format(checkpoint_path), "w", encoding="utf-8") as f:
payload = dict(self.args)
payload["id_to_char"] = id_to_char_list_from_globals()
json.dump(payload, f, indent=2, ensure_ascii=False)

Comment thread mxnet_model.py
Comment on lines +73 to 74
with open("{}.json".format(checkpoint_path), encoding="utf-8") as f:
model_args = json.load(f)

medium

Apply saved vocabulary metadata when loading the model to ensure the character mapping is synchronized with the checkpoint.

Suggested change
with open("{}.json".format(checkpoint_path), encoding="utf-8") as f:
model_args = json.load(f)
with open("{}.json".format(checkpoint_path), encoding="utf-8") as f:
model_args = json.load(f)
pop_vocab_metadata_from_model_args(model_args)

Comment thread pytorch_model.py
Comment on lines +11 to +13
from utils import (batch_generator, corpus_for_training_epoch, encode_text,
                   generate_seed, ID2CHAR, list_training_text_files, main,
                   make_dirs, resolve_seed_text_file, VOCAB_SIZE)

medium

Import vocabulary utilities for checkpoint compatibility and to handle custom character sets.

Suggested change
-from utils import (batch_generator, corpus_for_training_epoch, encode_text,
-                   generate_seed, ID2CHAR, list_training_text_files, main,
-                   make_dirs, resolve_seed_text_file, VOCAB_SIZE)
+from utils import (batch_generator, corpus_for_training_epoch, encode_text,
+                   generate_seed, id_to_char_list_from_globals, ID2CHAR,
+                   list_training_text_files, main, make_dirs,
+                   pop_vocab_metadata_from_model_args, resolve_seed_text_file,
+                   VOCAB_SIZE)

Comment thread tf_model.py
if args.restore:
    load_path = args.checkpoint_path if args.restore is True else args.restore
-    with open("{}.json".format(args.checkpoint_path)) as f:
+    with open("{}.json".format(args.checkpoint_path), encoding="utf-8") as f:

medium

When restoring from a specific path (other than the default checkpoint path), the model configuration JSON should be loaded from that same path to ensure consistency between weights and hyperparameters.

Suggested change
with open("{}.json".format(args.checkpoint_path), encoding="utf-8") as f:
with open("{}.json".format(load_path), encoding="utf-8") as f:
