[BUG] Regression: .md files parsed as empty in parse_page_without_llm mode — content destroyed since ~Feb 17-20, 2026

<html><body>
<html><head></head><body><h1>[BUG] Regression: <code>.md</code> files parsed as empty in <code>parse_page_without_llm</code> mode — content destroyed since ~Feb 17-20, 2026</h1>
<h2>⚠️ Regression — Previously Working</h2>
<p>This is a <strong>regression</strong>, not a new feature request. The same <code>.md</code> files, uploaded to the same pipeline with the same configuration, were parsed correctly on <strong>Feb 16, 2026</strong> but return <strong>empty content</strong> as of <strong>Feb 21, 2026</strong>.</p>
<h2>Summary</h2>
<p>Markdown (<code>.md</code>) files uploaded to a LlamaCloud Index configured with <code>parse_page_without_llm</code> (Fast Mode) are parsed as empty. The <strong>Parsed File</strong> tab shows only <code>---</code> separators with no content between them, and the <strong>Chunked File</strong> tab shows text nodes containing empty strings.</p>
<p><strong>Renaming the identical file from <code>.md</code> to <code>.txt</code> immediately fixes the issue</strong>: the content is parsed correctly. This shows the failure depends on the file extension rather than on the content, which points to file-type routing.</p>
<h2>Impact</h2>
<ul>
<li><strong>All <code>.md</code> files</strong> uploaded after the regression window (~Feb 17-20) are affected</li>
<li>Indexes built on <code>.md</code> files contain <strong>zero retrievable content</strong></li>
<li>RAG pipelines silently return empty results with <strong>no error or warning</strong></li>
<li>Users have no indication their index is broken unless they manually inspect parsed output</li>
</ul>
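<p>Because the failure is silent, a small guard on the consumer side can surface it early. The following is a minimal sketch, assuming chunk texts are available as plain strings; <code>find_empty_chunks</code> is a hypothetical helper, not part of any LlamaCloud client.</p>

```python
from typing import Iterable, List


def find_empty_chunks(chunks: Iterable[str]) -> List[int]:
    """Return the positions of chunks that carry no retrievable text.

    A chunk counts as empty when nothing is left after ignoring blank lines
    and bare "---" separator lines, which is the symptom reported here.
    """
    empty = []
    for position, text in enumerate(chunks):
        meaningful = [
            line for line in text.splitlines()
            if line.strip() and line.strip() != "---"
        ]
        if not meaningful:
            empty.append(position)
    return empty


# Hypothetical chunk texts, standing in for whatever the index returns.
healthy = ["# Test Document", "This is a test paragraph with actual content."]
broken = ["", "---\n\n---", "   "]

print(find_empty_chunks(healthy))  # []
print(find_empty_chunks(broken))   # [0, 1, 2]
```

<p>Running a check like this after each upload would turn the silent failure into an explicit one.</p>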
<h2>Steps to Reproduce</h2>
<ol>
<li>Create a LlamaCloud Index with a pipeline configured as:
<ul>
<li><code>parse_mode</code>: <code>parse_page_without_llm</code></li>
<li>Any <code>page_separator</code> (tested with <code>===SECTION_BREAK===</code>)</li>
</ul>
</li>
<li>Upload a <code>.md</code> file containing YAML frontmatter + markdown body (example below)</li>
<li>Wait for parsing to complete</li>
<li>Inspect the file in the LlamaCloud UI:
<ul>
<li><strong>Raw File</strong> tab → content is fully visible ✅</li>
<li><strong>Parsed File</strong> tab → only <code>---</code> separators, no content ❌</li>
<li><strong>Chunked File</strong> tab → text nodes contain empty strings ❌</li>
</ul>
</li>
</ol>
<h3>Minimal Reproduction File</h3>
<p><code>test_file.md</code> (any <code>.md</code> with YAML frontmatter):</p>
<pre><code class="language-markdown">---
company: TEST
date: 2024-01-01
type: report
---

# Test Document

This is a test paragraph with actual content.

## Section Two

More content here that should be indexed and retrievable.
</code></pre>
<h3>Confirm It's File-Extension-Specific</h3>
<ol start="5">
<li>Rename the same file to <code>test_file.txt</code></li>
<li>Upload to the same pipeline</li>
<li>Inspect → <strong>content is fully parsed and chunked correctly</strong> ✅</li>
</ol>
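<p>To rule out accidental differences between the two uploads, both files can be generated from one string. This is a sketch only; <code>write_repro_pair</code> is a hypothetical helper, and the upload itself is left to the UI or whichever client you use.</p>

```python
import tempfile
from pathlib import Path

# Same content as the minimal reproduction file in this issue.
REPRO_TEXT = """---
company: TEST
date: 2024-01-01
type: report
---

# Test Document

This is a test paragraph with actual content.

## Section Two

More content here that should be indexed and retrievable.
"""


def write_repro_pair(directory: Path, stem: str = "test_file") -> dict:
    """Write identical text to stem.md and stem.txt; return the paths keyed by extension."""
    directory.mkdir(parents=True, exist_ok=True)
    paths = {}
    for extension in (".md", ".txt"):
        path = directory / f"{stem}{extension}"
        path.write_text(REPRO_TEXT, encoding="utf-8")
        paths[extension] = path
    return paths


with tempfile.TemporaryDirectory() as tmp:
    pair = write_repro_pair(Path(tmp))
    # The two files differ only in their extension.
    identical = pair[".md"].read_bytes() == pair[".txt"].read_bytes()
    print(sorted(p.name for p in pair.values()), identical)
```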
<h2>Expected Behavior</h2>
<p><code>.md</code> files should be treated as plain text (same as <code>.txt</code>) and passed through to chunking with their full content preserved. This was the behavior prior to ~Feb 17, 2026.</p>
<h2>Actual Behavior</h2>
<p>Judging from the parsed output, <code>.md</code> files now appear to be routed through the PDF/document parsing path of <code>parse_page_without_llm</code>, which:</p>
<ol>
<li>Searches for "layered text" (a PDF concept — does not exist in plain <code>.md</code>)</li>
<li>Interprets <code>---</code> from YAML frontmatter as page boundaries</li>
<li>Returns only the <code>---</code> separators with no content between them</li>
</ol>
<h2>Evidence of Regression</h2>

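<table>
<thead>
<tr><th>Date</th><th>Action</th><th>Result</th></tr>
</thead>
<tbody>
<tr><td>Feb 16, 2026</td><td>Uploaded <code>.md</code> files to pipeline <code>a9d2a289-...</code></td><td>✅ Parsed correctly, full content in chunks</td></tr>
<tr><td>Feb 21, 2026</td><td>Re-uploaded identical <code>.md</code> files to same pipeline</td><td>❌ Only <code>---</code> separators, empty chunks</td></tr>
<tr><td>Feb 21, 2026</td><td>Renamed same <code>.md</code> → <code>.txt</code>, uploaded to same pipeline</td><td>✅ Parsed correctly</td></tr>
</tbody>
</table>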


<p><strong>No configuration changes</strong> were made to the pipeline between these dates. The pipeline config was verified via API:</p>
<pre><code class="language-json">{
  "pipeline_id": "a9d2a289-f0c7-46fd-8924-37a25fd2fabf",
  "parse_mode": "parse_page_without_llm",
  "page_separator": "===SECTION_BREAK===",
  "tier": null,
  "created_at": "2026-02-13T09:48:55.650014Z"
}
</code></pre>
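<p>For anyone wanting to repeat the "no configuration changes" check, comparing a saved copy of the config against the current one is enough. The sketch below assumes both are already loaded as dictionaries; how they are fetched is out of scope, and <code>config_drift</code> and the two variable names are hypothetical.</p>

```python
from typing import Any, Dict, Tuple


def config_drift(before: Dict[str, Any], after: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Map each key whose value differs to an (old, new) pair.

    A missing key is reported as None, so it cannot be told apart from an
    explicit null such as "tier" in the config shown in this issue.
    """
    drift = {}
    for key in sorted(set(before) | set(after)):
        old, new = before.get(key), after.get(key)
        if old != new:
            drift[key] = (old, new)
    return drift


# Illustrative values copied from the config in this issue.
saved_snapshot = {
    "parse_mode": "parse_page_without_llm",
    "page_separator": "===SECTION_BREAK===",
    "tier": None,
}
current_config = dict(saved_snapshot)  # unchanged, as reported

print(config_drift(saved_snapshot, current_config))  # {}
```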
<h2>Root Cause Analysis</h2>
<p>The <code>parse_page_without_llm</code> documentation states:</p>
<blockquote>
<p><em>"Extracts only the layered text from the file — designed for PDFs. Does not return markdown. The default page separator is <code>\n---\n</code>."</em></p>
</blockquote>
<p><strong>Before the regression:</strong> <code>.md</code> files were apparently identified as plain text and bypassed the PDF parsing pipeline (treated the same as <code>.txt</code>).</p>
<p><strong>After the regression:</strong> <code>.md</code> files appear to be routed through the PDF parsing path, where the "layered text" extraction finds nothing and the <code>---</code> YAML frontmatter delimiters are misinterpreted as page boundaries.</p>
<h2>Environment</h2>
<ul>
<li><strong>LlamaCloud tier:</strong> Starter</li>
<li><strong>Pipeline schema:</strong> v1</li>
<li><strong>Pipeline created:</strong> 2026-02-13</li>
<li><strong>Regression window:</strong> Between Feb 16 and Feb 21, 2026 (likely a backend deployment)</li>
<li><strong>Affected file types:</strong> <code>.md</code> (confirmed), potentially <code>.markdown</code> and other plain text variants</li>
<li><strong>Not affected:</strong> <code>.txt</code>, <code>.pdf</code></li>
</ul>
<h2>Workaround</h2>
<p>Rename <code>.md</code> files to <code>.txt</code> before uploading. With the <code>.txt</code> extension, the same content is parsed and chunked correctly.</p>
<pre><code class="language-bash"># Batch rename workaround: give each .md file in the current directory a .txt extension
for f in *.md; do [ -e "$f" ] || continue; mv -- "$f" "${f%.md}.txt"; done
</code></pre>
<p><strong>Note:</strong> This workaround discards the <code>.md</code> extension, so any automation that relies on file-type detection breaks.</p>
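<p>A gentler variant of the workaround copies the files into a separate upload folder instead of renaming them, so the <code>.md</code> originals stay intact. This is a minimal sketch; <code>stage_markdown_as_txt</code> and the directory names are hypothetical.</p>

```python
import shutil
import tempfile
from pathlib import Path


def stage_markdown_as_txt(source_dir: Path, staging_dir: Path) -> dict:
    """Copy every .md file in source_dir into staging_dir with a .txt extension.

    Returns a mapping of staged file name to original file name, so the
    original extension can be restored or recorded as metadata later.
    """
    staging_dir.mkdir(parents=True, exist_ok=True)
    mapping = {}
    for md_path in sorted(source_dir.glob("*.md")):
        staged = staging_dir / (md_path.stem + ".txt")
        if staged.exists():
            # Refuse to overwrite an unrelated file that happens to share the name.
            raise FileExistsError(f"{staged} already exists")
        shutil.copyfile(md_path, staged)
        mapping[staged.name] = md_path.name
    return mapping


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "docs"
    source.mkdir()
    (source / "report.md").write_text("# Report\n", encoding="utf-8")
    staged = stage_markdown_as_txt(source, Path(tmp) / "upload")
    print(staged)  # {'report.txt': 'report.md'}
```

<p>The returned mapping is what lets downstream automation recover the original file type.</p>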
<h2>Suggested Fix</h2>
<p>Add <code>.md</code> (and <code>.markdown</code>) to the list of file extensions that receive <strong>plain text passthrough</strong> treatment (same path as <code>.txt</code>), bypassing the <code>parse_page_without_llm</code> PDF extraction pipeline. This was the implicit behavior before the regression.</p>
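<p>The extension-based passthrough could look roughly like the sketch below. It is an illustration of the idea only, not LlamaCloud's actual routing code: the route names, the extension set, and <code>choose_route</code> are all assumptions.</p>

```python
from pathlib import Path

# Assumed set of extensions that should skip PDF-style extraction.
PLAIN_TEXT_EXTENSIONS = {".txt", ".md", ".markdown"}


def choose_route(filename: str) -> str:
    """Pick a (hypothetical) parsing route from the file extension, case-insensitively."""
    suffix = Path(filename).suffix.lower()
    if suffix in PLAIN_TEXT_EXTENSIONS:
        return "plain_text_passthrough"
    return "layered_text_extraction"


for name in ("notes.md", "README.MARKDOWN", "notes.txt", "scan.pdf"):
    print(name, "->", choose_route(name))
```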
<p>Alternatively, when <code>parse_page_without_llm</code> encounters a non-PDF file, it should fall back to plain text extraction rather than silently returning empty content.</p></body></html>
</body>
</html>

[BUG] Regression: .md files parsed as empty in parse_page_without_llm mode — content destroyed since ~Feb 17-20, 2026 #1122
