Modernize training on Windows: venv/pyenv, UTF-8, corpus dirs, optional WordNet vocab, text cleaning, checkpoint id_to_char #6
Conversation
Updated all the dependencies to build the env. I also changed the torch command to the whl URL, as instructed on pytorch.org, and moved off Anaconda's conda.
… text cleaning, checkpoint id_to_char for generate, UTF-8 IO, and Windows/pyenv docs. Made-with: Cursor
Hello, I have received your email and will handle it as soon as possible, thank you~
Code Review
This pull request introduces support for multi-corpus training by allowing the --text-path argument to point to a directory of text files, which are then cycled through per epoch. It also adds a feature to use a character vocabulary derived from WordNet lemmas and ensures that character mappings are preserved in checkpoint metadata across all supported frameworks (TensorFlow, Chainer, MXNet, PyTorch, and Keras). Additionally, the PR provides a new Windows-specific setup guide using pyenv-win and requirements.txt to handle legacy dependency constraints. Review feedback focuses on ensuring consistent implementation of the vocabulary metadata feature across all frameworks and fixing a path resolution bug in the TensorFlow model restoration logic.
```python
from utils import (batch_generator, corpus_for_training_epoch, encode_text,
                   generate_seed, ID2CHAR, list_training_text_files, main,
                   make_dirs, resolve_seed_text_file, sample_from_probs,
                   VOCAB_SIZE)
```
To support the new checkpoint vocabulary metadata feature consistently across frameworks, please import pop_vocab_metadata_from_model_args and id_to_char_list_from_globals from utils.
```diff
-from utils import (batch_generator, corpus_for_training_epoch, encode_text,
-                   generate_seed, ID2CHAR, list_training_text_files, main,
-                   make_dirs, resolve_seed_text_file, sample_from_probs,
-                   VOCAB_SIZE)
+from utils import (batch_generator, corpus_for_training_epoch, encode_text,
+                   generate_seed, id_to_char_list_from_globals, ID2CHAR,
+                   list_training_text_files, main, make_dirs,
+                   pop_vocab_metadata_from_model_args, resolve_seed_text_file,
+                   sample_from_probs, VOCAB_SIZE)
```
| with open("{}.json".format(checkpoint_path), encoding="utf-8") as f: | ||
| model_args = json.load(f) |
Call pop_vocab_metadata_from_model_args to ensure that any saved character mapping in the checkpoint is applied to the global state before building the network. This is necessary for the model to correctly decode text when a non-default vocabulary was used during training.
```diff
 with open("{}.json".format(checkpoint_path), encoding="utf-8") as f:
     model_args = json.load(f)
+pop_vocab_metadata_from_model_args(model_args)
```
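For reviewers unfamiliar with the two new helpers, here is a minimal sketch of what they might look like inside utils.py. This is an illustration of the intended behavior, not necessarily the PR's exact code; it assumes utils keeps module-level ID2CHAR / CHAR2ID dicts and a VOCAB_SIZE int:

```python
# Hypothetical sketch of the utils helpers named above; the PR's actual
# implementation may differ. Assumes module-level ID2CHAR / CHAR2ID dicts
# and a VOCAB_SIZE int in utils.py.
def id_to_char_list_from_globals():
    """Serialize the current id -> char table as a JSON-friendly list."""
    return [ID2CHAR[i] for i in range(VOCAB_SIZE)]


def pop_vocab_metadata_from_model_args(model_args):
    """Strip "id_to_char" from loaded checkpoint args and apply it globally.

    Keeps the metadata out of the network constructor kwargs while making
    decoding use the same alphabet the checkpoint was trained with.
    """
    global VOCAB_SIZE
    id_to_char = model_args.pop("id_to_char", None)
    if id_to_char is None:
        if model_args.get("vocab_size", VOCAB_SIZE) != VOCAB_SIZE:
            raise ValueError(
                "checkpoint vocab_size does not match the process vocabulary "
                "and no id_to_char mapping is saved in the checkpoint")
        return
    ID2CHAR.clear()
    ID2CHAR.update(enumerate(id_to_char))
    CHAR2ID.clear()
    CHAR2ID.update((c, i) for i, c in ID2CHAR.items())
    VOCAB_SIZE = len(id_to_char)
```

Mutating the dicts in place (rather than rebinding them) matters here: the framework scripts do `from utils import ID2CHAR`, which copies the reference at import time, so a rebound global in utils would never be seen by the callers.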
```python
with open("{}.json".format(args.checkpoint_path), "w", encoding="utf-8") as f:
    json.dump(model.predictor.args, f, indent=2)
```
Include the id_to_char mapping in the checkpoint JSON to ensure the model can be correctly loaded even if the default vocabulary changes or a custom alphabet was used.
```diff
 with open("{}.json".format(args.checkpoint_path), "w", encoding="utf-8") as f:
-    json.dump(model.predictor.args, f, indent=2)
+    payload = dict(model.predictor.args)
+    payload["id_to_char"] = id_to_char_list_from_globals()
+    json.dump(payload, f, indent=2, ensure_ascii=False)
```
```python
from utils import (batch_generator, corpus_for_training_epoch, encode_text,
                   generate_seed, ID2CHAR, list_training_text_files, main,
                   make_dirs, resolve_seed_text_file, sample_from_probs,
                   VOCAB_SIZE)
```
Import the new vocabulary utility functions to support saving and loading character mappings in checkpoints, ensuring consistency with the TensorFlow implementation.
```diff
-from utils import (batch_generator, corpus_for_training_epoch, encode_text,
-                   generate_seed, ID2CHAR, list_training_text_files, main,
-                   make_dirs, resolve_seed_text_file, sample_from_probs,
-                   VOCAB_SIZE)
+from utils import (batch_generator, corpus_for_training_epoch, encode_text,
+                   generate_seed, id_to_char_list_from_globals, ID2CHAR,
+                   list_training_text_files, main, make_dirs,
+                   pop_vocab_metadata_from_model_args, resolve_seed_text_file,
+                   sample_from_probs, VOCAB_SIZE)
```
```python
with open("{}.json".format(checkpoint_path), "w", encoding="utf-8") as f:
    json.dump(self.args, f, indent=2)
```
Save the character mapping in the checkpoint JSON sidecar to allow the model to be loaded correctly regardless of the environment's default vocabulary.
```diff
 with open("{}.json".format(checkpoint_path), "w", encoding="utf-8") as f:
-    json.dump(self.args, f, indent=2)
+    payload = dict(self.args)
+    payload["id_to_char"] = id_to_char_list_from_globals()
+    json.dump(payload, f, indent=2, ensure_ascii=False)
```
```python
with open("{}.json".format(checkpoint_path), encoding="utf-8") as f:
    model_args = json.load(f)
```
Apply saved vocabulary metadata when loading the model to ensure the character mapping is synchronized with the checkpoint.
```diff
 with open("{}.json".format(checkpoint_path), encoding="utf-8") as f:
     model_args = json.load(f)
+pop_vocab_metadata_from_model_args(model_args)
```
```python
from utils import (batch_generator, corpus_for_training_epoch, encode_text,
                   generate_seed, ID2CHAR, list_training_text_files, main,
                   make_dirs, resolve_seed_text_file, VOCAB_SIZE)
```
Import vocabulary utilities for checkpoint compatibility and to handle custom character sets.
```diff
-from utils import (batch_generator, corpus_for_training_epoch, encode_text,
-                   generate_seed, ID2CHAR, list_training_text_files, main,
-                   make_dirs, resolve_seed_text_file, VOCAB_SIZE)
+from utils import (batch_generator, corpus_for_training_epoch, encode_text,
+                   generate_seed, id_to_char_list_from_globals, ID2CHAR,
+                   list_training_text_files, main, make_dirs,
+                   pop_vocab_metadata_from_model_args, resolve_seed_text_file,
+                   VOCAB_SIZE)
```
```diff
 if args.restore:
     load_path = args.checkpoint_path if args.restore is True else args.restore
-    with open("{}.json".format(args.checkpoint_path)) as f:
+    with open("{}.json".format(args.checkpoint_path), encoding="utf-8") as f:
```
When restoring from a specific path (other than the default checkpoint path), the model configuration JSON should be loaded from that same path to ensure consistency between weights and hyperparameters.
```diff
-    with open("{}.json".format(args.checkpoint_path), encoding="utf-8") as f:
+    with open("{}.json".format(load_path), encoding="utf-8") as f:
```
I LOVE this repo.
This PR refreshes the project so it is easier to run on current machines without conda, keeps training Unicode-safe on Windows, and adds a few optional quality-of-life features for real book corpora.
Setup & dependencies
Document pyenv-win + Python 3.7 + venv and add a requirements.txt-based install path (TensorFlow 1.15 / Keras / PyTorch 1.1 / etc., aligned with what runs on Windows + py3.7); a sketch of the pins follows this list.
Add .python-version (3.7.9) and ignore .venv/; optional checkpoints/.gitkeep (checkpoints remain gitignored).
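For context, the pinned install might look roughly like the sketch below. Only the TensorFlow 1.15 and PyTorch 1.1 pins come from this PR's description; the remaining entries are placeholders for the frameworks the repo supports:

```text
# requirements.txt sketch (illustrative; see the PR diff for the exact pins)
tensorflow==1.15.*
torch==1.1.0    # needs the wheel index from pytorch.org on Windows, e.g.
                # pip install -r requirements.txt -f https://download.pytorch.org/whl/torch_stable.html
keras
chainer
mxnet
nltk            # only needed for --wordnet-char-vocab
```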
Training data
--text-path may point at a directory: training cycles one .txt file per epoch (sorted order); see the sketch after this list.
Optional --clean-book-text: strips common plaintext noise (page-number lines, Roman chapter lines, light OCR/normalization) via text_cleaning.py.
All corpus / sidecar JSON reads use encoding="utf-8" so UTF-8 texts work on Windows.
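A minimal sketch of the per-epoch cycling and the kind of noise --clean-book-text targets. The names list_training_text_files and corpus_for_training_epoch match the utils imports shown earlier; the bodies here, and the cleaning regexes, are illustrative rather than the PR's exact code:

```python
# Illustrative bodies for the corpus helpers imported above; the PR's
# implementations may differ in detail.
import glob
import os
import re


def list_training_text_files(text_path):
    """Return .txt files in sorted order if text_path is a directory."""
    if os.path.isdir(text_path):
        return sorted(glob.glob(os.path.join(text_path, "*.txt")))
    return [text_path]


def corpus_for_training_epoch(files, epoch):
    """Cycle through the corpus files, one per epoch."""
    path = files[epoch % len(files)]
    with open(path, encoding="utf-8") as f:  # UTF-8 so Windows reads cleanly
        return f.read()


# Examples of line patterns --clean-book-text strips (see text_cleaning.py):
PAGE_NUMBER_LINE = re.compile(r"^\s*\d{1,4}\s*$")          # bare page numbers
ROMAN_CHAPTER_LINE = re.compile(r"^\s*[IVXLCDM]+\.?\s*$")  # "XIV." chapter heads
```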
Vocabulary & generation
Optional --wordnet-char-vocab: smaller character alphabet derived from WordNet lemma strings (requires nltk; cached inventory under data/); see the sketch after this list.
Checkpoints save id_to_char in model.ckpt.json; loaders apply it so generate decodes with the same id→char table as training (avoids garbage output when the alphabet is non-default).
Clear error when checkpoint vocab_size does not match the process vocabulary and no id_to_char is present.
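A sketch of how the WordNet-derived alphabet could be built and cached. The helper name and cache filename here are hypothetical, and nltk's wordnet corpus must be downloaded before first use:

```python
# Hypothetical sketch of the --wordnet-char-vocab inventory; the actual
# helper name and cache filename in the PR may differ.
import json
import os

from nltk.corpus import wordnet  # requires: python -m nltk.downloader wordnet


def wordnet_char_vocab(cache_path=os.path.join("data", "wordnet_chars.json")):
    """Return the sorted set of characters occurring in WordNet lemma names."""
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            return json.load(f)
    chars = set()
    for lemma in wordnet.all_lemma_names():
        chars.update(lemma)
    vocab = sorted(chars)
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(vocab, f, ensure_ascii=False)
    return vocab
```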
TensorFlow
Ensure the graph is built with the same vocab_size as the training args (embedding/softmax consistent with saved hyperparameters).
Docs
README: Windows/pyenv/venv workflow, corrected CLI examples (--checkpoint-path / --text-path), and notes on restore/generate.
Notes for reviewers
Intended runtime remains Python 3.7 and TensorFlow 1.x APIs (tf.contrib, etc.); this is a compatibility/maintenance PR, not a TF2 migration.
cioran/ sample texts are included for local experiments; happy to drop or move them if you prefer not to ship corpus files in-tree.