Add AI-powered PDF transcription integration (multimodal) #343

@NickSeagull

Summary

Add an AI-powered document transcription integration that uses multimodal models (Gemini, Claude Vision) to extract text from uploaded PDF files. This is the higher-capability extraction option, intended for documents with complex layouts and for scanned PDFs that require OCR.

For simple text-based PDFs, see the companion issue for local CLI-based extraction.

Problem

Currently, ActionContext only provides:

data ActionContext = ActionContext
  { secretStore :: SecretStore
  , providerRegistry :: Map Text ValidatedOAuth2ProviderConfig
  }

This means integrations cannot access uploaded file content. For document processing workflows, the integration needs to:

  1. Receive a FileRef from the triggering event
  2. Retrieve the file bytes from the blob store
  3. Send the bytes to a multimodal model for extraction
  4. Emit a command with the extracted text

Proposed Solution

1. Extend ActionContext with File Access

data ActionContext = ActionContext
  { secretStore :: SecretStore
  , providerRegistry :: Map Text ValidatedOAuth2ProviderConfig
  , fileAccess :: Maybe FileAccessContext  -- NEW
  }

data FileAccessContext = FileAccessContext
  { retrieveFile :: FileRef -> Task FileAccessError Bytes
  -- ^ Retrieve file content by FileRef (validates ownership internally)
  , getFileMetadata :: FileRef -> Task FileAccessError FileMetadata
  -- ^ Get file metadata (filename, content type, size) without retrieving bytes
  }
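
For tests, a FileAccessContext of this shape could be backed by an in-memory map. The sketch below is a minimal stand-in, not the proposed implementation: Task is approximated as IO (Either err a), and FileRef, Bytes, FileMetadata, the FileAccessError constructors, and inMemoryFileAccess are all assumptions, since this issue does not define them. It also skips the ownership validation that the real retrieveFile is expected to perform.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString as BS
import qualified Data.Map.Strict as Map
import Data.Text (Text)

-- Stand-ins for types the issue references but does not define (assumptions).
type Task err a = IO (Either err a)
type Bytes = BS.ByteString

newtype FileRef = FileRef Text
  deriving (Show, Eq, Ord)

-- Hypothetical error cases; the real FileAccessError may differ.
data FileAccessError
  = FileNotFound FileRef
  | AccessDenied FileRef
  deriving (Show, Eq)

-- Hypothetical metadata record (filename, content type, size).
data FileMetadata = FileMetadata
  { filename :: Text
  , contentType :: Text
  , sizeBytes :: Int
  }
  deriving (Show, Eq)

data FileAccessContext = FileAccessContext
  { retrieveFile :: FileRef -> Task FileAccessError Bytes
  , getFileMetadata :: FileRef -> Task FileAccessError FileMetadata
  }

-- | Test double backed by a Map. Unknown refs yield FileNotFound;
-- no ownership check is performed, so AccessDenied is never returned.
inMemoryFileAccess :: Map.Map FileRef (FileMetadata, Bytes) -> FileAccessContext
inMemoryFileAccess files =
  FileAccessContext
    { retrieveFile = \ref -> pure (fmap snd (lookupFile ref))
    , getFileMetadata = \ref -> pure (fmap fst (lookupFile ref))
    }
  where
    lookupFile ref = maybe (Left (FileNotFound ref)) Right (Map.lookup ref files)
```

With this loaded in GHCi, retrieveFile (inMemoryFileAccess files) ref returns Right with the stored bytes for a known FileRef and Left (FileNotFound ref) otherwise.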

2. Create Integration.Ai.TranscribePdf

Following the OpenRouter pattern (Jess/Nick personas):

Jess's API (User):

import Integration qualified
import Integration.Ai.TranscribePdf qualified as AiTranscribe

proposalIntegrations :: ProposalEntity -> ProposalEvent -> Integration.Outbound
proposalIntegrations entity event = case event of
  ProposalPdfUploaded e -> Integration.batch
    [ Integration.outbound AiTranscribe.Request
        { fileRef = e.file
        , model = "google/gemini-pro-1.5"  -- multimodal model
        , config = AiTranscribe.defaultConfig
            { extractionMode = AiTranscribe.FullText
            , language = Just "en"
            }
        , onSuccess = \result -> RecordTranscription
            { proposalId = e.proposalId
            , transcribedText = result.text
            }
        , onError = \err -> TranscriptionFailed
            { proposalId = e.proposalId
            , error = err
            }
        }
    ]

Nick's API (Integration Developer):

module Integration.Ai.TranscribePdf where

data ExtractionMode
  = FullText           -- Extract all text preserving structure
  | Summary            -- Generate a summary
  | Structured Schema  -- Extract into structured format (JSON)
  deriving (Show, Eq, Generic)

data Config = Config
  { extractionMode :: ExtractionMode
  , language :: Maybe Text        -- Language hint for better accuracy
  , maxPages :: Maybe Int         -- Limit pages to process
  , systemPrompt :: Maybe Text    -- Custom extraction instructions
  }

data Request command = Request
  { fileRef :: FileRef
  , model :: Text                 -- "google/gemini-pro-1.5", "anthropic/claude-3.5-sonnet"
  , config :: Config
  , onSuccess :: TranscriptionResult -> command
  , onError :: Text -> command
  }

data TranscriptionResult = TranscriptionResult
  { text :: Text
  , pageCount :: Int
  , confidence :: Maybe Float
  }
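
Jess's example updates AiTranscribe.defaultConfig, which the module above does not define. One possible definition is sketched here, assuming that full-text extraction with no hints is the desired default. Schema is a placeholder newtype, because the issue leaves that type undefined.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)

-- Placeholder: the issue does not say what Schema looks like (assumption).
newtype Schema = Schema Text
  deriving (Show, Eq)

data ExtractionMode
  = FullText
  | Summary
  | Structured Schema
  deriving (Show, Eq)

data Config = Config
  { extractionMode :: ExtractionMode
  , language :: Maybe Text
  , maxPages :: Maybe Int
  , systemPrompt :: Maybe Text
  }
  deriving (Show, Eq)

-- | Assumed defaults: extract everything, no language hint,
-- no page limit, no custom instructions.
defaultConfig :: Config
defaultConfig =
  Config
    { extractionMode = FullText
    , language = Nothing
    , maxPages = Nothing
    , systemPrompt = Nothing
    }
```

Callers then override only the fields they care about with record update syntax, as in defaultConfig { language = Just "en" }.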

3. Implementation

The integration:

  1. Retrieves PDF bytes via ctx.fileAccess.retrieveFile
  2. Converts to base64 for multimodal API
  3. Builds prompt based on extractionMode
  4. Calls OpenRouter with multimodal model
  5. Parses response and emits result command

instance Integration.ToAction (Request command) where
  toAction request = Integration.action \ctx -> do
    -- 1. Get file bytes
    fileBytes <- getFileBytes ctx request.fileRef
    
    -- 2. Build multimodal request
    let base64Pdf = Bytes.toBase64 fileBytes
    let prompt = buildExtractionPrompt request.config
    
    -- 3. Call via OpenRouter (piggyback pattern)
    let openRouterRequest = OpenRouter.Request
          { messages = 
              [ Message.system prompt
              , Message.userWithAttachment "Extract text from this PDF" base64Pdf "application/pdf"
              ]
          , model = request.model
          , ...
          }
    
    -- 4. Transform response
    ...
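
The instance above calls buildExtractionPrompt without defining it. A minimal sketch of what it could look like follows. The prompt wording, the modeInstruction helper, the Schema placeholder, and the rule that a custom systemPrompt replaces the mode-specific instruction are all assumptions, not decisions recorded in this issue.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Maybe (catMaybes)
import Data.Text (Text)
import qualified Data.Text as Text

-- Placeholder: a JSON schema carried as text (assumption).
newtype Schema = Schema Text

data ExtractionMode
  = FullText
  | Summary
  | Structured Schema

data Config = Config
  { extractionMode :: ExtractionMode
  , language :: Maybe Text
  , maxPages :: Maybe Int
  , systemPrompt :: Maybe Text
  }

-- | Builds the system prompt, one instruction per line.
-- A custom systemPrompt takes the place of the mode instruction;
-- language and page-limit hints are appended when present.
buildExtractionPrompt :: Config -> Text
buildExtractionPrompt config =
  Text.intercalate "\n" (catMaybes [Just instruction, languageHint, pageHint])
  where
    instruction = case systemPrompt config of
      Just custom -> custom
      Nothing -> modeInstruction (extractionMode config)
    languageHint =
      fmap (\lang -> "The document language is: " <> lang) (language config)
    pageHint =
      fmap
        (\n -> "Process only the first " <> Text.pack (show n) <> " pages.")
        (maxPages config)

-- | Hypothetical wording for each extraction mode.
modeInstruction :: ExtractionMode -> Text
modeInstruction mode = case mode of
  FullText -> "Extract all text from the document, preserving its structure."
  Summary -> "Summarize the key points of the document."
  Structured (Schema schema) ->
    "Extract the document into JSON matching this schema:\n" <> schema
```

Keeping the builder pure makes each extraction mode testable without calling OpenRouter.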

4. Supported Models

Model                        Best For           Notes
google/gemini-pro-1.5        General documents  Good balance of speed/accuracy
anthropic/claude-3.5-sonnet  Complex layouts    Requires vision API in OpenRouter
openai/gpt-4o                Mixed content      Good for documents with images

Use Cases

  • Scanned documents - OCR required, AI handles it
  • Complex layouts - Tables, multi-column, forms
  • Handwritten notes - AI can interpret handwriting
  • Mixed content - Documents with images, charts, diagrams
  • Summarization - Extract key points instead of full text

Acceptance Criteria

  • ActionContext extended with FileAccessContext
  • FileAccessContext populated in Application.run when file uploads enabled
  • New Integration.Ai.TranscribePdf module in nhintegrations
  • Follows Jess/Nick persona pattern
  • Multiple extraction modes (FullText, Summary, Structured)
  • Configurable model selection
  • Error handling (file not found, API failure, timeout)
  • Documentation with examples
  • Tests
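
For the error-handling criterion, one option is to model failures as a sum type internally and render them to Text only at the onError boundary, since Request.onError takes Text. The sketch below is an assumption throughout: TranscriptionError, its constructors, describeError, and requireFileAccess are hypothetical names, and the message wording is illustrative.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import qualified Data.Text as Text

-- Hypothetical failure cases covering the criteria listed above,
-- plus the case where ActionContext.fileAccess is Nothing.
data TranscriptionError
  = FileAccessDisabled
  | FileNotFound Text -- textual form of the FileRef
  | ApiFailure Int Text -- HTTP status and response body
  | TimedOut Int -- seconds waited
  deriving (Show, Eq)

-- | Renders an error as the Text that would be passed to onError.
describeError :: TranscriptionError -> Text
describeError err = case err of
  FileAccessDisabled ->
    "File access is not enabled for this application"
  FileNotFound ref ->
    "File not found: " <> ref
  ApiFailure status body ->
    "AI provider request failed (HTTP " <> Text.pack (show status) <> "): " <> body
  TimedOut seconds ->
    "Transcription timed out after " <> Text.pack (show seconds) <> "s"

-- | Turns the optional fileAccess field into an explicit error.
requireFileAccess :: Maybe access -> Either TranscriptionError access
requireFileAccess maybeAccess = case maybeAccess of
  Nothing -> Left FileAccessDisabled
  Just access -> Right access
```

A typed error keeps tests precise (they can match on constructors) while the user-facing command still receives a plain message.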
