When reading from connections (compressed files, URLs, etc.), comment lines that span chunk boundaries cause parsing failures (more specifically, indexing failures). The same file reads correctly when uncompressed (memory-mapped), or often just with a different (usually larger) chunk size.
Connection-based reading processes data in fixed-size chunks. Two related issues arise when comments intersect chunk boundaries:
1. Comment detected, but terminating newline in next chunk
Chunk 1: [a,b\n1,2\n#] <- comment detected at "#"
Chunk 2: [comment\n3,4\n...] <- but newline is here
skip_rest_of_line() can't find \n, returns end-of-buffer. State incorrectly becomes RECORD_START, and "comment\n" in the next chunk is parsed as data.
2. Multi-char comment split across chunks
Chunk 1: [...data\n#] <- only first "#" visible
Chunk 2: [#rest...\n] <- second "#" here
With comment "##" and only 1 byte visible, we can't determine if it's a comment or data.
Why Memory-Mapped Files Work
For multi-threaded file reading, chunk boundaries are aligned to newlines via find_next_newline(). Comment prefixes appear at line starts, so they're never split.
Connection reading has arbitrary boundaries (wherever the buffer fills), causing the problem.
Prior Art in vroom: CRLF Boundary Fix
5fc54e6 fixed an analogous problem with \r\n spanning chunks (#331). From that commit message:
"The file would read fine until a line ending happened to fall on the connection buffer boundary, then the rest of the file would have a garbled index."
That was a much simpler problem with a relatively straightforward solution, but its general description of the failure mode applies here too.
Proposed Solution: Pending Token Buffer
Leverage the existing double-buffer structure. When bytes at the end of a chunk can't be fully evaluated:
Detect "limbo" bytes - partial comment prefix, or comment detected but no newline found
Copy to next buffer - prepend limbo bytes to the start of the other buffer
Read after them - fill the rest of the buffer from the connection
Exclude from write - don't write limbo bytes to temp file (they'll be written with next chunk)
Chunk N in buf[0]: [...data...][#] <- can't decide, copy "#"
↓
Chunk N+1 in buf[1]: [#][...new data from connection...]
^
read starts here, after copied byte
This is the standard "pending token buffer" pattern used by streaming parsers (SAX, simdjson, etc.).
Advantages:
No new parser states needed
Bytes carry their own state (no "going backwards")
This whole investigation was inspired by poking at tidyverse/readr#1523.
Minimal Reproduction
Created on 2026-01-22 with reprex v2.1.1
Open Questions
Tests
I developed two tests during my exploration that will be useful (both currently fail):