feat: v2-only SWE-Lancer tier list on GitHub Pages #89

Merged
laiso merged 7 commits into main from
devin/1775115148-gh-pages-results
Apr 2, 2026

@devin-ai-integration
Contributor

@devin-ai-integration devin-ai-integration bot commented Apr 2, 2026

Summary

Adds a static site under docs/ to display v2 SWE-Lancer benchmark results on GitHub Pages, with per-agent detail pages including OGP metadata. V1 (Exercism) results are completely excluded. Closes #88.

What's included:

  • docs/index.html — Client-side dashboard with three tabs: Tier List (tiermaker-style horizontal rows), Historical Runs (sorted by date), and Task Breakdown (per-task pass/fail for the 5 SWE-Lancer tasks). Only v2 entries are displayed.
  • scripts/build-results-pages.ts — Bun script that reads public/data/leaderboard.json, filters to v2 entries only, and generates one HTML page per agent/model under docs/results/ with OGP metadata (og:title, og:description, twitter:card).
  • .github/workflows/gh-pages.yml: Unified workflow (merged the former swelancer-pages.yml into this single file). Builds both the benchmark result pages and the SWE-Lancer task browser, then deploys docs/ via actions/deploy-pages@v4. Triggers on pushes to main that touch relevant paths, plus workflow_dispatch.
  • Adds build:results-pages npm script. Generated docs/results/ and docs/data/leaderboard.json are gitignored (CI rebuilds).
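
As a rough illustration of the per-agent OGP metadata described above, a build step might emit tags along these lines. This is a sketch under stated assumptions, not the contents of scripts/build-results-pages.ts: the AgentSummary shape, the ogpTags helper, the title and description wording, and the twitter:card value "summary" are all hypothetical. Only the three tag names (og:title, og:description, twitter:card) come from the description.

```typescript
// Hypothetical summary shape; the real leaderboard.json schema may differ.
interface AgentSummary {
  agent: string;
  model: string;
  tier: string;
  solved: number;
  total: number;
}

// Escape the characters that are unsafe in HTML text and attribute values.
function escapeHtml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Build the three meta tags named in the PR description for one agent page.
function ogpTags(e: AgentSummary): string {
  const title = `${e.agent} (${e.model}): Tier ${e.tier}`;
  const description = `Solved ${e.solved}/${e.total} SWE-Lancer v2 tasks`;
  return [
    `<meta property="og:title" content="${escapeHtml(title)}">`,
    `<meta property="og:description" content="${escapeHtml(description)}">`,
    `<meta name="twitter:card" content="summary">`,
  ].join("\n");
}
```

Every interpolated value goes through escapeHtml before it reaches a content="..." attribute, which is the same concern the review feedback on og:description raised.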

Updates since last revision

  • Expanded sample data (16 agents): Added 10 more dummy v2 agents to leaderboard.json so every tier (S/A/B/C/D/F) has 2–4 agents. Total: cursor, claude, devin (S); codex, gemini, opencode (A); cline, goose, kimi, qwen (B); aider, cursor (C); copilot, gemini (D); windsurf, opencode (F). All data is fabricated for demo purposes.
  • Agent icons: Downloaded 64×64 PNG icons for all agents via Google's favicon service, stored in docs/assets/icons/. Displayed at 32×32 in agent cards with graceful fallback (onerror hides broken images).
  • isV2Entry() consistency fix (scripts/build-results-pages.ts): Changed if (entry.tier) → if (entry.tier?.tier) to match the frontend's check, preventing the build script from generating orphan result pages that the frontend never links to.
  • esc() quote escaping fix (docs/index.html): Replaced the textContent → innerHTML trick with explicit manual escaping of all 5 characters (&, <, >, ", '). The old implementation didn't escape quotes, which could allow attribute injection when esc() output is placed inside alt="..." or src="..." attributes.
  • tierClass() allowlist (scripts/build-results-pages.ts): Added an allowlist check so only known tier letters (S/A/B/C/D/F) produce a CSS class, preventing class-name injection.
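
The two hardening fixes above can be sketched roughly as follows. This is a minimal illustration, not the code in docs/index.html or scripts/build-results-pages.ts: the function names esc and tierClass and the five escaped characters come from the description, while the lookup table, the KNOWN_TIERS set, and the tier-s style class names are assumptions.

```typescript
// Entity for each of the five characters named in the fix.
const HTML_ESCAPES: Record<string, string> = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
};

// Escape for both text and attribute contexts; quotes are included so the
// result is safe inside alt="..." or src="..." values.
function esc(value: unknown): string {
  return String(value ?? "").replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch] ?? ch);
}

// Only known tier letters yield a class name; anything else yields "".
const KNOWN_TIERS = new Set(["S", "A", "B", "C", "D", "F"]);

function tierClass(tier: string): string {
  // "tier-s", "tier-a", ... are hypothetical class names.
  return KNOWN_TIERS.has(tier) ? `tier-${tier.toLowerCase()}` : "";
}
```

A textContent-based escape converts only &, < and >, which is sufficient for element text but not for quoted attribute values; that is the gap the explicit five-character version closes.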

Previous updates (still applied)

  • Tier list maker layout: Horizontal tier rows (S → F) with colored labels and agent cards, matching tiermaker.com style.
  • Vercel-style dark theme: Black background (#000), dark surfaces (#111, #1a1a1a), light text (#ededed), underline-style tabs. Tier row colors: S (#ff7f7f), A (#ffbf7f), B (#ffdf7f), C (#ffff7f), D (#bfff7f), F (#7fffff).
  • V2-only: All v1 Exercism data filtered out client-side (isV2Entry()) and at build time.
  • XSS hardening: tierBadge() validates tier values against an allowlist. Build script uses escapeHtml() + class allowlist.
  • Workflow consolidation: Deleted swelancer-pages.yml; merged into gh-pages.yml.
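
To make the v2 filter concrete, the check might look like the sketch below. The entry.tier?.tier expression is taken from the description; the LeaderboardEntry and TierInfo shapes and the sample entries are assumptions made for illustration, and the real leaderboard.json schema may differ.

```typescript
// Hypothetical shapes for illustration only.
interface TierInfo {
  tier?: string;
}

interface LeaderboardEntry {
  agent: string;
  tier?: TierInfo | null;
}

// An entry counts as v2 only when it carries an actual tier letter.
// Checking just `entry.tier` would also accept an empty tier object.
function isV2Entry(entry: LeaderboardEntry): boolean {
  return Boolean(entry.tier?.tier);
}

// Made-up sample entries, not real benchmark data.
const sampleEntries: LeaderboardEntry[] = [
  { agent: "with-tier", tier: { tier: "S" } },
  { agent: "empty-tier-object", tier: {} },
  { agent: "no-tier" },
];

const v2Agents = sampleEntries.filter(isV2Entry).map((e) => e.agent);
```

Using the same predicate in the client and in the build script keeps the set of generated pages equal to the set of pages the dashboard links to.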

Review & Testing Checklist for Human

  • All 16 agents use fabricated data — The leaderboard.json entries are entirely made up for demo/layout purposes. Decide whether to keep them for the initial deploy or strip before merge. Real results will overwrite them when update-leaderboard.ts runs after actual v2 benchmarks.
  • Agent icons may not match actual logos — Icons were sourced from Google's favicon service using agent website domains. Verify the icons are reasonable representations (especially for agents whose websites may not have distinctive favicons).
  • Tier color inconsistency between pages: index.html uses tiermaker-style colors (S = #ff7f7f red, F = #7fffff cyan) while individual result pages in build-results-pages.ts use a different palette (S = #c9a000 gold, F = #f85149 red). Verify this is acceptable or unify.
  • V2 detection logic is duplicated in docs/index.html (client JS), scripts/build-results-pages.ts, and src/config/constants.ts. The three isV2Entry / tier-threshold implementations must stay in sync.
  • Enable GitHub Pages — Settings → Pages → Source: "GitHub Actions". The renamed workflow (gh-pages.yml, formerly swelancer-pages.yml) may require re-selecting the source.

Suggested test plan: Serve docs/ locally (python3 -m http.server 8080 --directory docs), verify all 3 tabs render with 16 agents distributed across all 6 tiers, click agent cards to confirm detail pages load with correct OGP tags, and resize to check mobile responsiveness.

Notes

  • Typecheck (bun run typecheck) and unit tests (bun test ./src — 90 pass) both pass with no regressions.
  • Build script generates 16 v2 result pages from the sample data (skips 16 v1 entries).

Link to Devin session: https://app.devin.ai/sessions/6a49c671c925468887611e948335d9af
Requested by: @laiso


- Add docs/index.html with leaderboard, historical runs, and task breakdown tabs
- Add scripts/build-results-pages.ts to generate per-agent result pages with OGP metadata
- Add .github/workflows/gh-pages.yml to deploy docs/ to GitHub Pages on leaderboard updates
- Add build:results-pages script to package.json

Closes #88
@devin-ai-integration
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

devin-ai-integration[bot]

This comment was marked as resolved.

Address Devin Review feedback: escape tier in body content and
og:description attribute, whitelist tier CSS class names.
devin-ai-integration[bot]

This comment was marked as resolved.

- Filter leaderboard to show only v2 SWE-Lancer results (5 tasks)
- Add tier legend (S/A/B/C/D/F) and sort by tier rank
- Fix XSS: add tier allowlist in tierBadge() function
- Merge swelancer-pages.yml into unified gh-pages.yml
- Delete redundant swelancer-pages.yml workflow
- Update build script to skip non-v2 entries
- Update page title/description to reference SWE-Lancer v2
@devin-ai-integration devin-ai-integration bot changed the title from "feat: add GitHub Pages benchmark results display" to "feat: v2-only SWE-Lancer tier list on GitHub Pages" Apr 2, 2026
@devin-ai-integration
Contributor Author

Testing Results — v2-only SWE-Lancer Tier List

Tested locally at http://localhost:8080 serving docs/. All 5 tests passed.

v2-only Page Rendering (5/5 passed)
  • v2 branding — PASSED. Header shows "SWE-Lancer v2 benchmark". No "Exercism" or "TypeScript workloads" references.
  • Tier legend — PASSED. S 5/5, A 4/5, B 3/5, C 2/5, D 1/5, F 0/5 all visible with distinct colored badges.
  • Tier List tab empty state — PASSED. Shows "No v2 benchmark results yet." with --dataset v2 guidance. Zero v1 entries visible (all 23 filtered out).
  • Historical Runs tab — PASSED. Shows "No v2 results yet."
  • Task Breakdown tab — PASSED. Shows "No v2 task-level data available."
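
The tier legend above (S 5/5 down to F 0/5) implies a direct mapping from solved count to tier. A minimal sketch of that mapping, plus a rank helper for sorting by tier, is shown below. The names tierForSolved, tierRank, TIERS_BEST_FIRST and TOTAL_TASKS are hypothetical, and the actual thresholds in src/config/constants.ts may be defined differently.

```typescript
type Tier = "S" | "A" | "B" | "C" | "D" | "F";

// Best tier first, matching the S -> F row order of the tier list.
const TIERS_BEST_FIRST: readonly Tier[] = ["S", "A", "B", "C", "D", "F"];
const TOTAL_TASKS = 5;

// Map a solved count (0..5) to a tier per the legend: 5 -> S, 4 -> A, ... 0 -> F.
function tierForSolved(solved: number): Tier {
  if (!Number.isInteger(solved) || solved < 0 || solved > TOTAL_TASKS) {
    throw new RangeError(`solved must be an integer in 0..${TOTAL_TASKS}`);
  }
  // The fallback is unreachable after the range check; it only satisfies
  // strict index-access type checking.
  return TIERS_BEST_FIRST[TOTAL_TASKS - solved] ?? "F";
}

// Lower rank sorts first (S = 0, F = 5).
function tierRank(tier: Tier): number {
  return TIERS_BEST_FIRST.indexOf(tier);
}
```

Sorting entries with (a, b) => tierRank(a.tier) - tierRank(b.tier) would then give the S-to-F ordering used by the tier rows.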
Build Script v2 Filtering
$ bun scripts/build-results-pages.ts
Generated 0 v2 result pages (skipped 23 non-v2 entries)
Screenshots (images not preserved)
  • Tier List (empty state)
  • Historical Runs: shows "No v2 results yet."
  • Task Breakdown: shows "No v2 task-level data available."

Devin session

…2 data

- Redesign leaderboard with horizontal tier rows (S/A/B/C/D/F) and agent cards
- Vercel-style dark theme: black background, dark surfaces, light text
- Tier colors matching tier list maker aesthetic (S: red to F: cyan)
- Add 6 sample v2 agents: cursor(S), codex(A), cline(B), aider(C), copilot(D), windsurf(F)
- Underline-style tab navigation
- Agent cards show name, model, solved count, avg time
- Download favicons for cursor, codex, cline, aider, copilot, windsurf
- Display 32x32 icons above agent name in tier list cards
- Graceful fallback (hide icon) if image fails to load
devin-ai-integration[bot]

This comment was marked as resolved.

…ote escaping

- Add 10 more dummy v2 agents to leaderboard.json (16 total) to populate
  all tier rows (S/A/B/C/D/F) with multiple agents each
- Download icons for new agents (claude, devin, gemini, opencode, goose,
  kimi, qwen) from Google favicon service
- Fix isV2Entry() in build script to check entry.tier?.tier (not just
  entry.tier) to match frontend logic — prevents generating orphan pages
- Fix esc() in frontend to escape quotes (" and ') in addition to <>&,
  preventing attribute injection via alt/src attributes
Contributor Author

@devin-ai-integration devin-ai-integration bot left a comment

Devin Review found 2 new potential issues.

View 12 additional findings in Devin Review.

docs/index.html Outdated
if (e.data.results[i].exercise === task) { found = e.data.results[i]; break; }
}
if (!found) {
html += '<td style="color:var(--text-muted)">-</td>';
Contributor Author

🟡 Undefined CSS variable --text-muted used in index.html breakdown table

The renderBreakdown function at docs/index.html:499 uses color:var(--text-muted) in an inline style, but the page's CSS (:root at docs/index.html:13-24) defines only --text-secondary, not --text-muted. Because the variable is undefined and no fallback is given, the declaration is invalid at computed-value time, so color falls back to its inherited value: the dash takes the parent's color (--text: #ededed, near-white) instead of appearing in a muted gray as intended. The generated result pages in scripts/build-results-pages.ts:191 do define --text-muted: #8b949e, suggesting that --text-secondary was the intended variable in index.html.

Suggested change
- html += '<td style="color:var(--text-muted)">-</td>';
+ html += '<td style="color:var(--text-secondary)">-</td>';

Contributor Author

Fixed in 68a1670 — changed var(--text-muted) to var(--text-secondary) which is the correct variable defined in this page's :root.

…text-secondary

- Escape agentName via esc() before building iconPath to prevent
  attribute injection in img src attribute
- Replace undefined --text-muted with --text-secondary in breakdown
  table for missing task cells
@laiso laiso merged commit 9df2912 into main Apr 2, 2026
2 checks passed

Development

Successfully merging this pull request may close these issues.

GH Pages benchmark results display
