
[BUG] Regression: .md files parsed as empty in parse_page_without_llm mode — content destroyed since ~Feb 17-20, 2026 #1122

@dragosdehelean

⚠️ Regression — Previously Working

This is a regression, not a new feature request. The same .md files, uploaded to the same pipeline with the same configuration, were parsed correctly on Feb 16, 2026 but return empty content as of Feb 21, 2026.

Summary

Markdown (.md) files uploaded to a LlamaCloud Index configured with parse_page_without_llm (Fast Mode) are parsed as empty. The "Parsed" view shows only --- separators with no content between them. The "Chunked" view produces text nodes with empty strings.

Renaming the identical file from .md to .txt immediately fixes the issue: the content is parsed correctly. The failure therefore depends on the file extension, which points to file-type routing rather than the content itself.

Impact

  • All .md files uploaded after the regression window (~Feb 17-20) are affected
  • Indexes built on .md files contain zero retrievable content
  • RAG pipelines silently return empty results with no error or warning
  • Users have no indication their index is broken unless they manually inspect parsed output
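
Because the failure is silent, a small guard over the parsed or chunked text can surface it before an index is trusted. The following is a minimal sketch, assuming the chunk texts have already been retrieved as plain Python strings by whatever means your pipeline provides. `looks_empty` and `empty_ratio` are hypothetical helper names, not part of any LlamaCloud API.

```python
from __future__ import annotations

import re

# Text consisting only of whitespace and dashes, e.g. the bare "---"
# separators that the Parsed view shows in this bug.
_SEPARATOR_ONLY = re.compile(r"[\s\-]*")


def looks_empty(text: str) -> bool:
    """Return True if a chunk is blank or contains only dash separators."""
    return _SEPARATOR_ONLY.fullmatch(text) is not None


def empty_ratio(chunks: list[str]) -> float:
    """Fraction of chunks without content; no chunks at all counts as 1.0."""
    if not chunks:
        return 1.0
    return sum(looks_empty(chunk) for chunk in chunks) / len(chunks)
```

A pipeline could refuse to publish an index when `empty_ratio(...)` is close to 1.0. Note that this check only recognizes dash-based separators; a custom page_separator would need its own pattern.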

Steps to Reproduce

  1. Create a LlamaCloud Index with a pipeline configured as:
    • parse_mode: parse_page_without_llm
    • Any page_separator (tested with ===SECTION_BREAK===)
  2. Upload a .md file containing YAML frontmatter + markdown body (example below)
  3. Wait for parsing to complete
  4. Inspect the file in the LlamaCloud UI:
    • Raw File tab → content is fully visible ✅
    • Parsed File tab → only --- separators, no content ❌
    • Chunked File tab → text nodes contain empty strings ❌

Minimal Reproduction File

test_file.md (any .md with YAML frontmatter):

---
company: TEST
date: 2024-01-01
type: report
---

Test Document

This is a test paragraph with actual content.

Section Two

More content here that should be indexed and retrievable.

Confirm It's File-Extension-Specific

  1. Rename the same file to test_file.txt
  2. Upload to the same pipeline
  3. Inspect → content is fully parsed and chunked correctly
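
To rule out any difference between the two uploads other than the extension, the pair of files can be generated from one string and compared by hash. This is a minimal sketch using only the Python standard library; `write_pair` and `sha256` are illustrative helper names, and the upload itself is left to the UI or whichever client you use.

```python
from __future__ import annotations

import hashlib
from pathlib import Path

# Same content as the minimal reproduction file above.
CONTENT = """---
company: TEST
date: 2024-01-01
type: report
---

Test Document

This is a test paragraph with actual content.

Section Two

More content here that should be indexed and retrievable.
"""


def write_pair(directory: Path) -> tuple[Path, Path]:
    """Write identical bytes to test_file.md and test_file.txt."""
    directory.mkdir(parents=True, exist_ok=True)
    md_path = directory / "test_file.md"
    txt_path = directory / "test_file.txt"
    for path in (md_path, txt_path):
        path.write_bytes(CONTENT.encode("utf-8"))
    return md_path, txt_path


def sha256(path: Path) -> str:
    """Hex digest of the file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()
```

If `sha256(md_path) == sha256(txt_path)` holds and only the .txt upload yields parsed content, the extension is the only remaining variable.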

Expected Behavior

.md files should be treated as plain text (same as .txt) and passed through to chunking with their full content preserved. This was the behavior prior to ~Feb 17, 2026.

Actual Behavior

Based on the parsed output, .md files appear to be routed through the PDF/document parsing path of parse_page_without_llm, which:

  1. Searches for "layered text" (a PDF concept — does not exist in plain .md)
  2. Interprets --- from YAML frontmatter as page boundaries
  3. Returns only the --- separators with no content between them
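
One way such output could arise is shown below. This is an illustration only, not a description of the actual backend: it assumes a page-oriented extractor that finds no text on any page and then joins the pages with the documented default separator. `join_pages` and the three empty pages are hypothetical.

```python
# Default page separator as quoted in the parse_page_without_llm docs.
DEFAULT_PAGE_SEPARATOR = "\n---\n"


def join_pages(page_texts, separator=DEFAULT_PAGE_SEPARATOR):
    """Join per-page text the way a page-based parser might."""
    return separator.join(page_texts)


# Hypothetical result of an extractor that found no layered text on any page.
extracted_pages = ["", "", ""]
parsed = join_pages(extracted_pages)

# Whatever remains once dashes and whitespace are removed is the real content.
remaining = parsed.replace("-", "").strip()
```

Here `parsed` holds nothing but separators and `remaining` is the empty string, which matches the symptom in the Parsed File tab.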

Evidence of Regression

Date | Action | Result
Feb 16, 2026 | Uploaded .md files to pipeline a9d2a289-... | ✅ Parsed correctly, full content in chunks
Feb 21, 2026 | Re-uploaded identical .md files to same pipeline | ❌ Only --- separators, empty chunks
Feb 21, 2026 | Renamed same .md → .txt, uploaded to same pipeline | ✅ Parsed correctly

No configuration changes were made to the pipeline between these dates. The pipeline config was verified via API:

{
  "pipeline_id": "a9d2a289-f0c7-46fd-8924-37a25fd2fabf",
  "parse_mode": "parse_page_without_llm",
  "page_separator": "===SECTION_BREAK===",
  "tier": null,
  "created_at": "2026-02-13T09:48:55.650014Z"
}

Root Cause Analysis

The parse_page_without_llm documentation states:

"Extracts only the layered text from the file — designed for PDFs. Does not return markdown. The default page separator is \n---\n."

Before the regression: .md files were correctly identified as plain text and bypassed the PDF parsing pipeline (treated same as .txt).

After the regression: .md files are now being routed through the PDF parsing path, where the "layered text" extraction finds nothing and the --- YAML frontmatter delimiters are misinterpreted as page boundaries.

Environment

  • LlamaCloud tier: Starter
  • Pipeline schema: v1
  • Pipeline created: 2026-02-13
  • Regression window: Between Feb 16 and Feb 21, 2026 (likely a backend deployment)
  • Affected file types: .md (confirmed), potentially .markdown and other plain text variants
  • Not affected: .txt, .pdf

Workaround

Rename .md files to .txt before uploading. Content and parsing behave correctly with the .txt extension.

# Batch rename workaround (does nothing when no .md files match)
for f in *.md; do [ -e "$f" ] && mv -- "$f" "${f%.md}.txt"; done

Note: This workaround forces users to lose .md extension metadata and breaks any automation that relies on file type detection.
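
A variant that leaves the original .md files untouched is to copy them into a separate staging directory under a .txt name and upload from there. This is a minimal sketch using only the standard library; `stage_as_txt` and the directory layout are assumptions for illustration.

```python
from __future__ import annotations

import shutil
from pathlib import Path


def stage_as_txt(source_dir: Path, staging_dir: Path) -> list[Path]:
    """Copy every *.md in source_dir to staging_dir with a .txt suffix.

    The originals are not modified. Raises FileExistsError rather than
    overwriting a file that is already staged.
    """
    staging_dir.mkdir(parents=True, exist_ok=True)
    staged = []
    for md_path in sorted(source_dir.glob("*.md")):
        target = staging_dir / (md_path.stem + ".txt")
        if target.exists():
            raise FileExistsError(f"refusing to overwrite {target}")
        shutil.copy2(md_path, target)  # copies content and file metadata
        staged.append(target)
    return staged
```

Automation that relies on the .md extension keeps working against the source directory, while only the staged copies are uploaded.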

Suggested Fix

Add .md (and .markdown) to the list of file extensions that receive plain text passthrough treatment (same path as .txt), bypassing the parse_page_without_llm PDF extraction pipeline. This was the implicit behavior before the regression.

Alternatively, when parse_page_without_llm encounters a non-PDF file, it should fall back to plain text extraction rather than silently returning empty content.
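
Both suggestions can be sketched together as follows. This is a hypothetical illustration, not the actual LlamaCloud code: `extract_text`, `PLAIN_TEXT_EXTENSIONS`, and the `layered_text_extractor` callable are invented names standing in for whatever the backend really uses.

```python
from pathlib import Path

# Extensions that should bypass the PDF-oriented extraction entirely.
PLAIN_TEXT_EXTENSIONS = {".txt", ".md", ".markdown"}


def extract_text(filename: str, raw: bytes, layered_text_extractor) -> str:
    """Route a file to plain-text passthrough or to layered-text extraction.

    layered_text_extractor is any callable taking the raw bytes and
    returning the extracted text (a stand-in for the PDF path).
    """
    suffix = Path(filename).suffix.lower()

    # Suggested fix 1: plain-text passthrough for .md and .markdown, same as .txt.
    if suffix in PLAIN_TEXT_EXTENSIONS:
        return raw.decode("utf-8", errors="replace")

    text = layered_text_extractor(raw)

    # Suggested fix 2: a non-PDF file that yields nothing falls back to
    # plain-text decoding instead of silently returning empty content.
    if not text.strip() and suffix != ".pdf":
        return raw.decode("utf-8", errors="replace")
    return text
```

A real implementation would probably restrict the fallback to text-like types, since decoding arbitrary binary formats as UTF-8 produces noise, and would log a warning whenever the fallback fires.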

Labels: bug
