
merge upstream#2

Open
francislabountyjr wants to merge 711 commits into serp-ai:main from dottxt-ai:main
Conversation

@francislabountyjr
Member

No description provided.

RobinPicard and others added 30 commits June 18, 2025 17:09
On top of the missing links highlighted in #1622, there were a number of
broken links in the README. This PR makes sure all links lead to pages
that actually exist. In some cases a link does not lead to the exact
section it should, but that is a problem to fix in the documentation in
a separate PR. In the meantime, each link at least leads to a related
page from which the user should be able to find their way.
RobinPicard and others added 30 commits March 2, 2026 16:27
Per the XDG Base Directory Specification, $XDG_CACHE_HOME already
defines the base cache directory (typically ~/.cache). The previous
code joined it with ".cache" again, producing a redundant path
like ~/.cache/.cache/outlines instead of ~/.cache/outlines.

Signed-off-by: JiangNan <[email protected]>
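The corrected lookup can be sketched as follows (a minimal sketch; `outlines_cache_dir` is a hypothetical helper name, not the actual function in the codebase):

```python
import os
from pathlib import Path

def outlines_cache_dir() -> Path:
    """Resolve the cache directory per the XDG Base Directory Specification."""
    # $XDG_CACHE_HOME is itself the base cache directory (default ~/.cache),
    # so we must not append another ".cache" component to it.
    xdg_cache_home = os.environ.get("XDG_CACHE_HOME")
    base = Path(xdg_cache_home) if xdg_cache_home else Path.home() / ".cache"
    return base / "outlines"
```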
Bug 1 – Attention mask incorrectly masks EOS tokens:
LlamaCppTokenizer.encode() used pad_token_id (== eos_token_id) to
generate the attention mask. Any EOS token appearing in the middle of a
real prompt was masked as padding (attention=0). Since batch encoding is
not supported, there is never any real padding, so every token should
always be attended to.

Bug 2 – Fallback vocab extraction truncates long token pieces:
The fallback path (no HF tokenizer) used a fixed 32-byte ctypes buffer.
If llama_token_to_piece reports a piece longer than 32 bytes, the piece
was silently truncated. Two tokens that share the first 32 bytes but
differ afterwards would collide in the vocabulary dict, losing an entry.
Now retry with a correctly-sized buffer when truncation is detected.

Also hoists the llama_model_get_vocab call out of the per-token loop
to avoid redundant work.

Fixes #1819
Cover the retry branch when llama_token_to_piece returns n > buffer
size (32 bytes) and verify attention mask is all-ones even when the
prompt contains EOS tokens.

Signed-off-by: Giulio Leone <[email protected]>
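The truncation-retry pattern for Bug 2 can be sketched as below. `decode_fn` is a stand-in for the `llama_token_to_piece` binding (the real ctypes signature differs); this is an illustrative sketch, not the actual implementation:

```python
import ctypes

def token_to_piece(decode_fn, token_id: int, initial_size: int = 32) -> bytes:
    # decode_fn(token_id, buf, size) writes up to `size` bytes into `buf`
    # and returns the full piece length (mimicking llama_token_to_piece).
    buf = ctypes.create_string_buffer(initial_size)
    n = decode_fn(token_id, buf, initial_size)
    if n > initial_size:
        # The piece was silently truncated; retry with a buffer that is
        # exactly large enough to hold the whole piece.
        buf = ctypes.create_string_buffer(n)
        n = decode_fn(token_id, buf, n)
    return buf.raw[:n]
```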
…piece

Addresses review feedback from @kudos07: llama_token_to_piece can return
negative values as error codes. These are now skipped during vocabulary
construction instead of producing invalid token entries.

Added test_negative_n_skips_invalid_token to cover this path.
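The skip-on-error behavior can be illustrated with a simplified vocabulary-building loop (a sketch with a stand-in `token_to_piece` callable returning `(n, piece)`, where a negative `n` is a llama.cpp error code; the real code works through ctypes):

```python
def build_fallback_vocab(token_ids, token_to_piece):
    # token_to_piece(token_id) -> (n, piece): stand-in for the ctypes call.
    vocab = {}
    for token_id in token_ids:
        n, piece = token_to_piece(token_id)
        if n < 0:
            # Negative return values are error codes; skip the token
            # instead of inserting an invalid entry.
            continue
        vocab[piece] = token_id
    return vocab
```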
The compound 'or all(...)' clause only differed from the first check
when an empty-string key mapped to the error token id, which would
itself be a bug. Replaced with the simpler direct check.
Add _ensure_json_quoted() helper that wraps bare String terms in
double-quote delimiters when they appear inside container types
(List, Tuple, Dict). This ensures Literal/Enum string values produce
valid JSON regex patterns.

Includes 16 unit tests covering:
- String/Alternatives/passthrough quoting
- List, Tuple (fixed and variadic), Dict containers
- Non-string types unchanged (int, bool)
- Nested alternatives, special characters, single-variant literals
Add 14 comprehensive tests that verify the full pipeline
(python_types_to_terms → to_regex → re.fullmatch) for:
- List/Tuple/Dict with Literal and Enum string values
- Mixed string+int Literals (only strings quoted)
- Empty string literals
- Special characters (spaces, hyphens)
- Standalone Literal unaffected (no quotes)
- Negative assertions (bare words correctly rejected)
Apply reviewer suggestion to avoid creating a Sequence for a simple
quoted string — String(f'"{term.value}"') is cleaner and produces
the same regex output.
All tests now check for String('"..."') instead of Sequence wrapping,
matching the simplified implementation suggested by @RobinPicard.
`SPIECE_UNDERLINE` is imported from `transformers.file_utils`, which is
an internal module that may be removed or relocated in future versions.

Add try/except fallback that inlines the constant (U+2581 "▁") when the
import fails, making outlines resilient to transformers refactors.

Fixes #1829
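The fallback described above takes roughly this shape (a sketch of the try/except pattern; U+2581 is the SentencePiece word-boundary marker):

```python
try:
    # transformers.file_utils is an internal module that may be removed
    # or relocated in future transformers releases.
    from transformers.file_utils import SPIECE_UNDERLINE
except ImportError:
    # Inline the constant so outlines stays resilient to transformers
    # refactors: U+2581 LOWER ONE EIGHTH BLOCK, "▁".
    SPIECE_UNDERLINE = "\u2581"
```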
Cover the except-ImportError branch in both llamacpp.py and
transformers.py convert_token_to_string() by mocking the import to fail.
…tring

In `create_outlines_core_vocabulary`, the assignment
`formatted_vocab[token_as_str] = [token_id]` overwrites previous entries
when multiple BPE tokens decode to the same string via `token_to_str`.
For Mistral-7B, this silently drops 17.2% of token IDs.

Replace with `formatted_vocab.setdefault(token_as_str, []).append(token_id)`
to accumulate all token IDs for each decoded string.

Fixes #1830

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
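The fix can be illustrated in isolation (hypothetical `format_vocabulary` helper; the real code lives in `create_outlines_core_vocabulary`):

```python
def format_vocabulary(decoded_tokens):
    # decoded_tokens: iterable of (token_id, token_as_str) pairs. Several
    # BPE token ids can decode to the same string; setdefault + append
    # accumulates all of them instead of keeping only the last assignment.
    formatted_vocab = {}
    for token_id, token_as_str in decoded_tokens:
        formatted_vocab.setdefault(token_as_str, []).append(token_id)
    return formatted_vocab
```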
- Rename client import from Dottxt to DotTxt
- Replace model_name + model_revision with single model identifier
- Replace client.json() with client.generate(input=, response_format=)
- Add AsyncDottxt class wrapping AsyncDotTxt client
- Single from_dottxt dispatching on client type (DotTxt vs AsyncDotTxt)
- Add explicit ValueError when model identifier is missing
- Export AsyncDottxt, add to AsyncBlackBoxModel
- Update client import to DotTxt/AsyncDotTxt
- Replace list_models() with models.list() and model.id
- Replace model_name + model_revision fixtures with single model_name
- Add async model fixtures and async test coverage
- Add missing model validation tests
- Fix test for wrong inference parameters to use model with identifier set
- Rewrite dottxt.md to reflect new SDK: single model param, async support,
  inference arguments, and API key request link
- Add Dottxt to API Support row in README
- Add API key request link at the top of the README
- Add Dottxt API and schema audit call-to-action sections to docs homepage
Removed duplicate early access request link from README.
