Summary
Add an AI-powered document transcription integration that uses multimodal models (Gemini, Claude Vision) to extract text from uploaded PDF files. This is the smart extraction option for complex documents, scanned PDFs, and documents requiring OCR.
For simple text-based PDFs, see the companion issue for local CLI-based extraction.
Problem
Currently, ActionContext only provides:

    data ActionContext = ActionContext
      { secretStore :: SecretStore
      , providerRegistry :: Map Text ValidatedOAuth2ProviderConfig
      }

This means integrations cannot access uploaded file content. For document processing workflows, the integration needs to:
- Receive a `FileRef` from the triggering event
- Retrieve the file bytes from the blob store
- Send to multimodal AI for intelligent extraction
- Emit a command with the extracted text
Proposed Solution
1. Extend ActionContext with File Access

    data ActionContext = ActionContext
      { secretStore :: SecretStore
      , providerRegistry :: Map Text ValidatedOAuth2ProviderConfig
      , fileAccess :: Maybe FileAccessContext -- NEW
      }

    data FileAccessContext = FileAccessContext
      { retrieveFile :: FileRef -> Task FileAccessError Bytes
        -- ^ Retrieve file content by FileRef (validates ownership internally)
      , getFileMetadata :: FileRef -> Task FileAccessError FileMetadata
        -- ^ Get file metadata (filename, content type, size) without retrieving bytes
      }

2. Create Integration.Ai.TranscribePdf
Following the OpenRouter pattern (Jess/Nick personas):
Jess's API (User):

    import Integration qualified
    import Integration.Ai.TranscribePdf qualified as AiTranscribe

    proposalIntegrations :: ProposalEntity -> ProposalEvent -> Integration.Outbound
    proposalIntegrations entity event = case event of
      ProposalPdfUploaded e -> Integration.batch
        [ Integration.outbound AiTranscribe.Request
            { fileRef = e.file
            , model = "google/gemini-pro-1.5" -- multimodal model
            , config = AiTranscribe.defaultConfig
                { extractionMode = AiTranscribe.FullText
                , language = Just "en"
                }
            , onSuccess = \result -> RecordTranscription
                { proposalId = e.proposalId
                , transcribedText = result.text
                }
            , onError = \err -> TranscriptionFailed
                { proposalId = e.proposalId
                , error = err
                }
            }
        ]

Nick's API (Integration Developer):

    module Integration.Ai.TranscribePdf where

    data ExtractionMode
      = FullText           -- Extract all text preserving structure
      | Summary            -- Generate a summary
      | Structured Schema  -- Extract into structured format (JSON)
      deriving (Show, Eq, Generic)

    data Config = Config
      { extractionMode :: ExtractionMode
      , language :: Maybe Text      -- Language hint for better accuracy
      , maxPages :: Maybe Int       -- Limit pages to process
      , systemPrompt :: Maybe Text  -- Custom extraction instructions
      }

    data Request command = Request
      { fileRef :: FileRef
      , model :: Text  -- "google/gemini-pro-1.5", "anthropic/claude-3.5-sonnet"
      , config :: Config
      , onSuccess :: TranscriptionResult -> command
      , onError :: Text -> command
      }

    data TranscriptionResult = TranscriptionResult
      { text :: Text
      , pageCount :: Int
      , confidence :: Maybe Float
      }

3. Implementation
The integration:
- Retrieves PDF bytes via `ctx.fileAccess.retrieveFile`
- Converts to base64 for multimodal API
- Builds prompt based on `extractionMode`
- Calls OpenRouter with multimodal model
- Parses response and emits result command

    instance Integration.ToAction (Request command) where
      toAction request = Integration.action \ctx -> do
        -- 1. Get file bytes
        fileBytes <- getFileBytes ctx request.fileRef
        -- 2. Build multimodal request
        let base64Pdf = Bytes.toBase64 fileBytes
        let prompt = buildExtractionPrompt request.config
        -- 3. Call via OpenRouter (piggyback pattern)
        let openRouterRequest = OpenRouter.Request
              { messages =
                  [ Message.system prompt
                  , Message.userWithAttachment "Extract text from this PDF" base64Pdf "application/pdf"
                  ]
              , model = request.model
              , ...
              }
        -- 4. Transform response
        ...

4. Supported Models
| Model | Best For | Notes |
|---|---|---|
| google/gemini-pro-1.5 | General documents | Good balance of speed/accuracy |
| anthropic/claude-3.5-sonnet | Complex layouts | Requires vision API in OpenRouter |
| openai/gpt-4o | Mixed content | Good for documents with images |
Use Cases
- Scanned documents - OCR required, AI handles it
- Complex layouts - Tables, multi-column, forms
- Handwritten notes - AI can interpret handwriting
- Mixed content - Documents with images, charts, diagrams
- Summarization - Extract key points instead of full text
Acceptance Criteria
- `ActionContext` extended with `FileAccessContext`
- `FileAccessContext` populated in `Application.run` when file uploads enabled
- New `Integration.Ai.TranscribePdf` module in `nhintegrations`
- Follows Jess/Nick persona pattern
- Multiple extraction modes (FullText, Summary, Structured)
- Configurable model selection
- Error handling (file not found, API failure, timeout)
- Documentation with examples
- Tests
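
To make the error-handling criterion concrete, the following is a rough sketch of how the failure cases (file not found, API failure, timeout) and the optional `fileAccess` field might be handled. Everything here is an assumption for illustration: `FileRef`, `Bytes`, and `Task` are simplified stand-ins for the framework types, the `FileAccessError` constructors and the `TranscribeError` type are hypothetical, and `getFileBytes` takes the `Maybe FileAccessContext` directly rather than the full `ActionContext`.

```haskell
-- Simplified stand-ins (assumptions): the real FileRef, Bytes, and Task
-- types live in the framework and are not shown in this issue.
newtype FileRef = FileRef String deriving (Show, Eq)
type Bytes = String
type Task err a = IO (Either err a)

-- Hypothetical constructors; the issue names FileAccessError but not its cases.
data FileAccessError
  = FileNotFound FileRef
  | NotOwner FileRef
  deriving (Show, Eq)

newtype FileAccessContext = FileAccessContext
  { retrieveFile :: FileRef -> Task FileAccessError Bytes
  }

-- Hypothetical error type covering the failure cases in the acceptance criteria.
data TranscribeError
  = FileAccessDisabled
  | FileAccessFailed FileAccessError
  | ApiFailure String
  | Timeout
  deriving (Show, Eq)

-- Resolve the optional file access context, then fetch the bytes.
-- A missing context becomes an explicit error instead of a crash.
getFileBytes :: Maybe FileAccessContext -> FileRef -> Task TranscribeError Bytes
getFileBytes Nothing _ = pure (Left FileAccessDisabled)
getFileBytes (Just access) ref = do
  result <- retrieveFile access ref
  pure (either (Left . FileAccessFailed) Right result)

-- Render an error as the text passed to the request's onError callback.
renderError :: TranscribeError -> String
renderError err = case err of
  FileAccessDisabled -> "File access is not enabled for this application"
  FileAccessFailed (FileNotFound (FileRef ref)) -> "File not found: " ++ ref
  FileAccessFailed (NotOwner (FileRef ref)) -> "Access denied for file: " ++ ref
  ApiFailure message -> "Transcription API call failed: " ++ message
  Timeout -> "Transcription timed out"
```

Keeping a structured error internally and rendering it to text only at the `onError` boundary would let tests assert on specific failure cases while leaving the proposed `onError :: Text -> command` signature unchanged.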
Related
- Companion issue: Local PDF text extraction (CLI-based, no AI)
- May benefit from #342, "Add declarative file upload configuration (like PostgresEventStore)"