[BUG] Regression: .md files parsed as empty in parse_page_without_llm mode — content destroyed since ~Feb 17-20, 2026
⚠️ Regression — Previously Working
This is a regression, not a new feature request. The same .md files, uploaded to the same pipeline with the same configuration, were parsed correctly on Feb 16, 2026 but return empty content as of Feb 21, 2026.
Summary
Markdown (.md) files uploaded to a LlamaCloud Index configured with parse_page_without_llm (Fast Mode) are parsed as empty. The "Parsed" view shows only --- separators with no content between them. The "Chunked" view produces text nodes with empty strings.
Renaming the identical file from .md to .txt immediately fixes the issue — the content is parsed correctly. This confirms the bug is in file-type routing, not in the content itself.
Impact
- All .md files uploaded after the regression window (~Feb 17-20) are affected
- Indexes built on .md files contain zero retrievable content
- RAG pipelines silently return empty results with no error or warning
- Users have no indication their index is broken unless they manually inspect parsed output
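Because the failure is silent, a small local check on exported parsed output can flag it. The sketch below assumes the text shown in the Parsed File tab has been saved to a local file (how it gets exported is left open here). `has_parsed_content` is a hypothetical helper, not part of any LlamaCloud tooling. It treats the output as empty when every line is blank or equal to a separator.

```shell
# Hypothetical helper: succeeds (exit 0) if the parsed text has at least one
# line that is neither blank nor a page separator.
# Usage: has_parsed_content FILE [SEPARATOR]
has_parsed_content() {
  local parsed_file="$1"
  local separator="${2:-===SECTION_BREAK===}"
  awk -v sep="$separator" '
    # Skip blank lines, the default "---" separator, and the configured one.
    NF == 0 || $0 == "---" || $0 == (sep "") { next }
    # Any other line counts as real content; stop at the first one.
    { found = 1; exit }
    END { exit !found }
  ' < "$parsed_file"
}
```

A possible use is `has_parsed_content parsed.txt || echo "parsed output is empty" >&2`, where `parsed.txt` is whatever file the parsed text was saved to.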
Steps to Reproduce
- Create a LlamaCloud Index with a pipeline configured as:
  - parse_mode: parse_page_without_llm
  - Any page_separator (tested with ===SECTION_BREAK===)
- Upload a .md file containing YAML frontmatter + markdown body (example below)
- Wait for parsing to complete
- Inspect the file in the LlamaCloud UI:
  - Raw File tab → content is fully visible ✅
  - Parsed File tab → only --- separators, no content ❌
  - Chunked File tab → text nodes contain empty strings ❌
Minimal Reproduction File
test_file.md (any .md with YAML frontmatter):
---
company: TEST
date: 2024-01-01
type: report
---

# Test Document

This is a test paragraph with actual content.

## Section Two

More content here that should be indexed and retrievable.
Confirm It's File-Extension-Specific
- Rename the same file to test_file.txt
- Upload to the same pipeline
- Inspect → content is fully parsed and chunked correctly ✅
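The comparison above can be prepared with a short script so that both uploads are guaranteed to be byte-identical. This is only a sketch: the `repro` directory name is arbitrary, and the `#` and `##` heading markers are an assumption about the original file, since this report shows only the rendered text.

```shell
# Write the minimal reproduction file and an identical copy with a .txt extension.
mkdir -p repro
cat > repro/test_file.md <<'EOF'
---
company: TEST
date: 2024-01-01
type: report
---

# Test Document

This is a test paragraph with actual content.

## Section Two

More content here that should be indexed and retrievable.
EOF
cp repro/test_file.md repro/test_file.txt

# cmp exits 0 only if the two files are byte-identical.
cmp repro/test_file.md repro/test_file.txt && echo "identical content, different extension"
```

Uploading both files to the same pipeline then isolates the extension as the only variable.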
Expected Behavior
.md files should be treated as plain text (same as .txt) and passed through to chunking with their full content preserved. This was the behavior prior to ~Feb 17, 2026.
Actual Behavior
.md files are routed through the PDF/document parsing pipeline (parse_page_without_llm), which:
- Searches for "layered text" (a PDF concept that does not exist in plain .md)
- Interprets --- from YAML frontmatter as page boundaries
- Returns only the --- separators with no content between them
Evidence of Regression
| Date | Action | Result |
|---|---|---|
| Feb 16, 2026 | Uploaded .md files to pipeline a9d2a289-... | ✅ Parsed correctly, full content in chunks |
| Feb 21, 2026 | Re-uploaded identical .md files to same pipeline | ❌ Only --- separators, empty chunks |
| Feb 21, 2026 | Renamed same .md → .txt, uploaded to same pipeline | ✅ Parsed correctly |
No configuration changes were made to the pipeline between these dates. The pipeline config was verified via API:
{
"pipeline_id": "a9d2a289-f0c7-46fd-8924-37a25fd2fabf",
"parse_mode": "parse_page_without_llm",
"page_separator": "===SECTION_BREAK===",
"tier": null,
"created_at": "2026-02-13T09:48:55.650014Z"
}
Root Cause Analysis
The parse_page_without_llm documentation states:
"Extracts only the layered text from the file - designed for PDFs. Does not return markdown. The default page separator is \n---\n."
Before the regression: .md files were correctly identified as plain text and bypassed the PDF parsing pipeline (treated the same as .txt).
After the regression: .md files are now being routed through the PDF parsing path, where the "layered text" extraction finds nothing and the --- YAML frontmatter delimiters are misinterpreted as page boundaries.
Environment
- LlamaCloud tier: Starter
- Pipeline schema: v1
- Pipeline created: 2026-02-13
- Regression window: Between Feb 16 and Feb 21, 2026 (likely a backend deployment)
- Affected file types: .md (confirmed), potentially .markdown and other plain text variants
- Not affected: .txt, .pdf
Workaround
Rename .md files to .txt before uploading. Content and parsing behave correctly with the .txt extension.
# Batch rename workaround (the -e test skips the unmatched glob when no .md files exist)
for f in *.md; do
  [ -e "$f" ] && mv -- "$f" "${f%.md}.txt"
done
Note: This workaround forces users to lose .md extension metadata and breaks any automation that relies on file type detection.
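If the original .md files need to stay in place, one variation on the workaround is to upload renamed copies from a separate staging directory. The sketch below is an assumption about how that might be organized; `stage_md_as_txt` and the directory names are hypothetical.

```shell
# Hypothetical helper: copy every .md file in SRC_DIR to DEST_DIR with a .txt
# extension, leaving the originals untouched.
# Usage: stage_md_as_txt SRC_DIR DEST_DIR
stage_md_as_txt() {
  local src_dir="$1"
  local dest_dir="$2"
  local f name
  mkdir -p -- "$dest_dir"
  for f in "$src_dir"/*.md; do
    [ -e "$f" ] || continue      # the glob matched nothing
    name="${f##*/}"              # strip the directory part
    cp -- "$f" "$dest_dir/${name%.md}.txt"
  done
}
```

For example, `stage_md_as_txt ./docs ./upload_staging` would leave `./docs` unchanged and fill `./upload_staging` with .txt copies to upload.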
Suggested Fix
Add .md (and .markdown) to the list of file extensions that receive plain text passthrough treatment (same path as .txt), bypassing the parse_page_without_llm PDF extraction pipeline. This was the implicit behavior before the regression.
Alternatively, when parse_page_without_llm encounters a non-PDF file, it should fall back to plain text extraction rather than silently returning empty content.
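To make the two suggestions concrete, the routing rule could look roughly like the following. This is purely illustrative: the backend's actual routing code is not visible from the outside, and `choose_parse_path` and the path names it prints are hypothetical.

```shell
# Hypothetical routing rule: decide the extraction path from the file extension.
# Prints one of: plain_text, pdf_text_layer, plain_text_fallback
choose_parse_path() {
  local name
  # Lowercase the name so README.MD and readme.md are treated alike.
  name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$name" in
    *.txt|*.md|*.markdown) echo plain_text ;;          # passthrough (first suggestion)
    *.pdf)                 echo pdf_text_layer ;;      # layered-text extraction
    *)                     echo plain_text_fallback ;; # avoid silent empty output (second suggestion)
  esac
}
```

With this rule, `choose_parse_path notes.md` prints `plain_text`, so a Markdown file would never reach the PDF text-layer path.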