
merge upstream#2

Open
francislabountyjr wants to merge 711 commits into serp-ai:main from dottxt-ai:main
Conversation

@francislabountyjr
Member

No description provided.

RobinPicard and others added 30 commits June 18, 2025 17:09
On top of the missing links highlighted in #1622, there were a number of
broken links in the README. This PR makes sure all links lead to pages
that actually exist. In some cases a link does not lead to the exact
section it should, but that is a problem to fix in the documentation in
a separate PR. In the meantime, each link at least leads to a related
page from which the user should be able to find their way.
RobinPicard and others added 30 commits March 2, 2026 16:27
Per the XDG Base Directory Specification, $XDG_CACHE_HOME already
defines the base cache directory (typically ~/.cache). The previous
code joined it with ".cache" again, producing a redundant path
like ~/.cache/.cache/outlines instead of ~/.cache/outlines.

Signed-off-by: JiangNan <[email protected]>
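The corrected lookup can be sketched as follows (a minimal sketch; `outlines_cache_dir` is a hypothetical helper name, not the actual function in the codebase):

```python
import os
from pathlib import Path

def outlines_cache_dir() -> Path:
    """Resolve the cache directory per the XDG Base Directory Specification."""
    # $XDG_CACHE_HOME is itself the base cache directory (default ~/.cache),
    # so we must not append another ".cache" component to it.
    xdg_cache_home = os.environ.get("XDG_CACHE_HOME")
    base = Path(xdg_cache_home) if xdg_cache_home else Path.home() / ".cache"
    return base / "outlines"
```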
Bug 1 – Attention mask incorrectly masks EOS tokens:
LlamaCppTokenizer.encode() used pad_token_id (== eos_token_id) to
generate the attention mask. Any EOS token appearing in the middle of a
real prompt was masked as padding (attention=0). Since batch encoding is
not supported, there is never any real padding, so every token should
always be attended to.

Bug 2 – Fallback vocab extraction truncates long token pieces:
The fallback path (no HF tokenizer) used a fixed 32-byte ctypes buffer.
If llama_token_to_piece reports a piece longer than 32 bytes, the piece
was silently truncated. Two tokens that share the first 32 bytes but
differ afterwards would collide in the vocabulary dict, losing an entry.
Now retry with a correctly-sized buffer when truncation is detected.

Also hoists the llama_model_get_vocab call out of the per-token loop
to avoid redundant work.

Fixes #1819
Cover the retry branch when llama_token_to_piece returns n > buffer
size (32 bytes) and verify attention mask is all-ones even when the
prompt contains EOS tokens.

Signed-off-by: Giulio Leone <[email protected]>
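The truncation-retry pattern for Bug 2 can be sketched as below. `decode_fn` is a stand-in for the `llama_token_to_piece` binding (the real ctypes signature differs); this is an illustrative sketch, not the actual implementation:

```python
import ctypes

def token_to_piece(decode_fn, token_id: int, initial_size: int = 32) -> bytes:
    # decode_fn(token_id, buf, size) writes up to `size` bytes into `buf`
    # and returns the full piece length (mimicking llama_token_to_piece).
    buf = ctypes.create_string_buffer(initial_size)
    n = decode_fn(token_id, buf, initial_size)
    if n > initial_size:
        # The piece was silently truncated; retry with a buffer that is
        # exactly large enough to hold the whole piece.
        buf = ctypes.create_string_buffer(n)
        n = decode_fn(token_id, buf, n)
    return buf.raw[:n]
```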
…piece

Addresses review feedback from @kudos07: llama_token_to_piece can return
negative values as error codes. These are now skipped during vocabulary
construction instead of producing invalid token entries.

Added test_negative_n_skips_invalid_token to cover this path.
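The skip-on-error behavior can be illustrated with a simplified vocabulary-building loop (a sketch with a stand-in `token_to_piece` callable returning `(n, piece)`, where a negative `n` is a llama.cpp error code; the real code works through ctypes):

```python
def build_fallback_vocab(token_ids, token_to_piece):
    # token_to_piece(token_id) -> (n, piece): stand-in for the ctypes call.
    vocab = {}
    for token_id in token_ids:
        n, piece = token_to_piece(token_id)
        if n < 0:
            # Negative return values are error codes; skip the token
            # instead of inserting an invalid entry.
            continue
        vocab[piece] = token_id
    return vocab
```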
The compound 'or all(...)' clause only differed from the first check
when an empty-string key mapped to the error token id, which would
itself be a bug. Replaced with the simpler direct check.
Add _ensure_json_quoted() helper that wraps bare String terms in
double-quote delimiters when they appear inside container types
(List, Tuple, Dict). This ensures Literal/Enum string values produce
valid JSON regex patterns.

Includes 16 unit tests covering:
- String/Alternatives/passthrough quoting
- List, Tuple (fixed and variadic), Dict containers
- Non-string types unchanged (int, bool)
- Nested alternatives, special characters, single-variant literals
Add 14 comprehensive tests that verify the full pipeline
(python_types_to_terms → to_regex → re.fullmatch) for:
- List/Tuple/Dict with Literal and Enum string values
- Mixed string+int Literals (only strings quoted)
- Empty string literals
- Special characters (spaces, hyphens)
- Standalone Literal unaffected (no quotes)
- Negative assertions (bare words correctly rejected)
Apply reviewer suggestion to avoid creating a Sequence for a simple
quoted string — String(f'"{term.value}"') is cleaner and produces
the same regex output.
All tests now check for String('"..."') instead of Sequence wrapping,
matching the simplified implementation suggested by @RobinPicard.
`SPIECE_UNDERLINE` is imported from `transformers.file_utils`, which is
an internal module that may be removed or relocated in future versions.

Add try/except fallback that inlines the constant (U+2581 "▁") when the
import fails, making outlines resilient to transformers refactors.

Fixes #1829
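The fallback described above takes roughly this shape (a sketch of the try/except pattern; U+2581 is the SentencePiece word-boundary marker):

```python
try:
    # transformers.file_utils is an internal module that may be removed
    # or relocated in future transformers releases.
    from transformers.file_utils import SPIECE_UNDERLINE
except ImportError:
    # Inline the constant so outlines stays resilient to transformers
    # refactors: U+2581 LOWER ONE EIGHTH BLOCK, "▁".
    SPIECE_UNDERLINE = "\u2581"
```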
Cover the except-ImportError branch in both llamacpp.py and
transformers.py convert_token_to_string() by mocking the import to fail.
…tring

In `create_outlines_core_vocabulary`, the assignment
`formatted_vocab[token_as_str] = [token_id]` overwrites previous entries
when multiple BPE tokens decode to the same string via `token_to_str`.
For Mistral-7B, this silently drops 17.2% of token IDs.

Replace with `formatted_vocab.setdefault(token_as_str, []).append(token_id)`
to accumulate all token IDs for each decoded string.

Fixes #1830

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
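The fix can be illustrated in isolation (hypothetical `format_vocabulary` helper; the real code lives in `create_outlines_core_vocabulary`):

```python
def format_vocabulary(decoded_tokens):
    # decoded_tokens: iterable of (token_id, token_as_str) pairs. Several
    # BPE token ids can decode to the same string; setdefault + append
    # accumulates all of them instead of keeping only the last assignment.
    formatted_vocab = {}
    for token_id, token_as_str in decoded_tokens:
        formatted_vocab.setdefault(token_as_str, []).append(token_id)
    return formatted_vocab
```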
- Rename client import from Dottxt to DotTxt
- Replace model_name + model_revision with single model identifier
- Replace client.json() with client.generate(input=, response_format=)
- Add AsyncDottxt class wrapping AsyncDotTxt client
- Single from_dottxt dispatching on client type (DotTxt vs AsyncDotTxt)
- Add explicit ValueError when model identifier is missing
- Export AsyncDottxt, add to AsyncBlackBoxModel
- Update client import to DotTxt/AsyncDotTxt
- Replace list_models() with models.list() and model.id
- Replace model_name + model_revision fixtures with single model_name
- Add async model fixtures and async test coverage
- Add missing model validation tests
- Fix test for wrong inference parameters to use model with identifier set
- Rewrite dottxt.md to reflect new SDK: single model param, async support,
  inference arguments, and API key request link
- Add Dottxt to API Support row in README
- Add API key request link at the top of the README
- Add Dottxt API and schema audit call-to-action sections to docs homepage
Removed duplicate early access request link from README.
