
Telescopetest-io: add AI content filtering #144

Open
Judyzc wants to merge 10 commits into cloudflare:main from Judyzc:temp-ai

Conversation

Contributor

@Judyzc Judyzc commented Feb 20, 2026

Related to #143. This PR sets up AI content filtering for telescopetest.io, as described in the issue.

  • Added a new content_rating column to the D1 tests metadata table. Auto-generated a migration file (0002) for this with Prisma, following the README.
  • Added Workers AI bindings for a text model (https://developers.cloudflare.com/workers-ai/models/llama-guard-3-8b/) and an image model (https://developers.cloudflare.com/workers-ai/models/llama-3.2-11b-vision-instruct/). These are used in lib/ai/ai-content-rater.ts, which adds the function rateUrlContent().
  • rateUrlContent() is called in the upload.ts POST endpoint with waitUntil() and always returns either SAFE or UNSAFE. If the AI content check is interrupted by a user refresh, it is retried by the telescopetest.io/results/[testId] page, which now polls (via the GET endpoint tests/[testId]/rating) and blocks displaying results until a rating is available.
  • To view unsafe content locally (development env), set ENABLE_AI_RATING=false in a .dev.vars file as described in the README; unsafe content is then displayed on the /results page with a flag.
  • Tested on staging.
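The "always returns either SAFE or UNSAFE" behavior could look something like the sketch below. The helper name parseGuardResponse and the exact response handling are assumptions for illustration, not code from this PR; Llama Guard 3 replies with "safe", or "unsafe" followed by a category code on the next line.

```typescript
// Hypothetical helper for lib/ai/ai-content-rater.ts: map Llama Guard's raw
// text output to the content_rating values stored in D1.
type ContentRating = 'SAFE' | 'UNSAFE';

function parseGuardResponse(raw: string): ContentRating {
  // Llama Guard 3 outputs "safe" or "unsafe\nS<n>" (violation category on line 2).
  const verdict = raw.trim().split('\n')[0].trim().toLowerCase();
  // Fail closed: anything other than an explicit "safe" is treated as UNSAFE,
  // so malformed or empty model output never lets content through.
  return verdict === 'safe' ? 'SAFE' : 'UNSAFE';
}
```

Failing closed on unexpected output keeps the invariant that every test row eventually gets one of the two ratings, which the /results polling relies on.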

QUESTION/REQUEST:

  • Any good way to test the quality/accuracy of this AI content filter? I've been using movies but I'm not sure if there's a better way.

misc:

  • Fixed the 'name' field on the results list so it no longer cuts off letters.

@Judyzc Judyzc changed the title Temp ai telescopetest-io: add AI content filtering Feb 20, 2026
@Judyzc Judyzc marked this pull request as ready for review February 20, 2026 19:10
@Judyzc Judyzc requested a review from a team February 20, 2026 19:10
@Judyzc Judyzc marked this pull request as draft February 20, 2026 19:11
@Judyzc Judyzc marked this pull request as ready for review February 20, 2026 19:32
Comment on lines +31 to +40
.replace(/<(script|style|noscript|head|template)[\s\S]*?<\/\1>/gi, '')
.replace(/<[^>]+>/g, ' ')
.replace(/&amp;/g, '&')
.replace(/&lt;/g, '<')
.replace(/&gt;/g, '>')
.replace(/&quot;/g, '"')
.replace(/&#39;/g, "'")
.replace(/&nbsp;/g, ' ')
.replace(/&[a-z]+;/gi, ' ')
.replace(/\s+/g, ' ')
Contributor

Hmm, I think this sort of replacement won't work across newlines, and it omits valid escaped text.

We should most likely be parsing the HTML and extracting the text nodes from the parsed document (e.g. via https://developer.mozilla.org/en-US/docs/Web/API/DOMParser/parseFromString).

Also, do we need to extract text at all? Assuming the content scanner is an LLM capable of sifting through structured documents, it could probably be passed the HTML document as-is and make a determination on the content.

Contributor Author

I think the [\s\S] part of the regex allows it to work across newlines, shown here and through testing.

I don't think DOMParser works in Cloudflare Workers, explained here, though I might be wrong. Cloudflare has its own HTMLRewriter tool I could use, but that adds streaming complexity. There's also the third-party library linkedom I could try; what are your thoughts?

As for needing to extract text at all: the LLM seems to expect conversation-like strings (https://developers.cloudflare.com/workers-ai/models/llama-guard-3-8b/), so I haven't actually tested with just the HTML document. I can probably try that too, though.
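If the thread settles on keeping a regex pass (rather than moving to HTMLRewriter or linkedom), a slightly more robust version might also strip HTML comments and decode numeric character references, which the current chain misses. This is a hypothetical stopgap sketch, not the PR's actual code, and still not a real HTML parser:

```typescript
// Hypothetical stopgap: strip tags and decode common entities from fetched
// HTML. A real parser (HTMLRewriter, linkedom) would handle edge cases
// -- e.g. the full named-entity table -- that this sketch does not.
function htmlToText(html: string): string {
  const named: Record<string, string> = {
    amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", nbsp: ' ',
  };
  return html
    // Drop non-content subtrees; [\s\S] lets the match span newlines.
    .replace(/<(script|style|noscript|head|template)\b[\s\S]*?<\/\1>/gi, ' ')
    .replace(/<!--[\s\S]*?-->/g, ' ')   // HTML comments
    .replace(/<[^>]+>/g, ' ')           // remaining tags
    // Numeric character references, hex then decimal.
    .replace(/&#x([0-9a-f]+);/gi, (_: string, h: string) =>
      String.fromCodePoint(parseInt(h, 16)))
    .replace(/&#(\d+);/g, (_: string, d: string) =>
      String.fromCodePoint(Number(d)))
    // Known named entities; unknown ones become whitespace.
    .replace(/&([a-z]+);/gi, (m: string, name: string) =>
      named[name.toLowerCase()] ?? ' ')
    .replace(/\s+/g, ' ')
    .trim();
}
```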

.join(' ')
.replace(/\s+/g, ' ')
.trim()
.slice(0, 4000);
Contributor

We're intentionally only scanning the first ~4k characters?

Contributor Author

We can probably increase this to ~100,000 chars, since the model can take 131,072 tokens of context.
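The sizing trade-off above can be sketched as a back-of-envelope calculation. The ~4 characters/token heuristic and the reserved-token figure here are assumptions (actual ratios depend on the tokenizer and the guard prompt template), not values from this PR:

```typescript
// Rough input budget for the text model. All figures except the context
// window are assumptions for illustration.
const MODEL_CONTEXT_TOKENS = 131_072; // llama-guard-3-8b context window
const RESERVED_TOKENS = 1_024;        // prompt template + response headroom (assumed)
const CHARS_PER_TOKEN = 4;            // common heuristic for English text, not exact

const MAX_INPUT_CHARS =
  (MODEL_CONTEXT_TOKENS - RESERVED_TOKENS) * CHARS_PER_TOKEN;
// ~520k chars of headroom, so a ~100k cap stays comfortably conservative.
```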

@Judyzc Judyzc changed the title telescopetest-io: add AI content filtering Telescopetest-io: add AI content filtering Feb 24, 2026
signal: AbortSignal.timeout(10_000),
});
const html = await response.text();
return html
Member

@Judyzc did you have any success with sending HTML to the agent here?

@sergeychernyshev sergeychernyshev added the ticket This label indicates that internal ticket was created to track it. label Mar 2, 2026