Vector Search Retrieval Script

This script provides a Python interface for retrieving and searching documents from a vector database using the Activeloop API.

Overview

retrieve_with_filters.py enables you to:

  • Perform semantic search using text queries
  • Apply filters to narrow down results
  • Handle paginated results automatically
  • Query vector embeddings with custom tensor names

Prerequisites

  • Python 3.7+
  • Required packages: requests, pydantic
  • Activeloop API token

Installation

pip install requests pydantic

Configuration

Set your Activeloop token as an environment variable:

export ACTIVELOOP_TOKEN="your-token-here"

By default, the script connects to http://0.0.0.0:8080/api. To use a different endpoint, modify the BASE_URL variable in the script.
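A minimal sketch of how this configuration might be loaded (the `BASE_URL` default is taken from the text above; the helper name `get_auth_headers` is an illustration, not necessarily what the script calls it):

```python
import os

# Default endpoint from the docs; edit this constant to use a different server.
BASE_URL = "http://0.0.0.0:8080/api"

def get_auth_headers() -> dict:
    """Build request headers from the ACTIVELOOP_TOKEN environment variable."""
    token = os.environ.get("ACTIVELOOP_TOKEN", "")
    if not token:
        raise RuntimeError("ACTIVELOOP_TOKEN is not set")
    return {"Authorization": f"Bearer {token}"}
```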

Usage

Basic Search

from retrieve_with_filters import retrieve_docs, Search

# Define your search parameters
search_data = Search(
    text="Gordon Ramsay",
    embedding_tensor="title_emb",
    top_k=10
)

# Execute the search
org_id = "your-org-id"
ds_name = "your-dataset-name"
docs = retrieve_docs(org_id, ds_name, search_data)

# Access results
print(f"Found {docs.total_results} results")
print(docs.results)

Search with Filters

The script supports three types of filters:

1. Boolean Filters

Filter by boolean fields:

search_data = Search(
    text="cooking video",
    embedding_tensor="title_emb",
    top_k=10,
    filters={
        'made_for_kids': {'value': True}
    }
)

2. List Filters

Filter by specific values:

search_data = Search(
    text="recipes",
    embedding_tensor="title_emb",
    top_k=10,
    filters={
        'video_id': {'value': ['tqST9EcunHg', 'NzUgvM3BBQs']}
    }
)

3. JSON Content Filters

Filter by content within JSON fields:

search_data = Search(
    text="family recipes",
    embedding_tensor="title_emb",
    top_k=10,
    filters={
        "power_keywords": {
            "content": 'couple'
        }
    }
)

Pagination and Continuation Tokens

Important: top_k > 50 Behavior

When top_k is greater than 50, the API will return results with a continuation token. You must use this token to retrieve the remaining results in subsequent requests.

Handling Pagination

# Request more than 50 results
search_data = Search(
    text="Gordon Ramsay",
    embedding_tensor="title_emb",
    top_k=100  # This will trigger pagination
)

# Retrieve all pages of results
all_results = []
docs = retrieve_docs(org_id, ds_name, search_data)
all_results.extend(docs.results)
print(f"Retrieved {len(docs.results)} results in first batch")

# Continue fetching if continuation token is present
while docs.continuation_token:
    print(f"Fetching more results... (continuation token: {docs.continuation_token[:20]}...)")
    search_data.continuation_token = docs.continuation_token
    docs = retrieve_docs(org_id, ds_name, search_data)
    all_results.extend(docs.results)
    print(f"Retrieved {len(docs.results)} more results")

print(f"Total results retrieved: {len(all_results)}")

Pagination Best Practices

  1. Always check for continuation_token: Even if you request fewer than 50 results, filters might affect pagination behavior
  2. Preserve search parameters: When using continuation tokens, keep the original search parameters (text, filters, etc.) the same
  3. Handle network interruptions: Store continuation tokens if you need to resume interrupted searches
  4. Monitor result counts: Use has_more flag to determine if additional results are available
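Best practice 3 can be sketched as a resumable loop that persists the latest continuation token after each page, so an interrupted search can pick up where it left off. The checkpoint-file helpers below are hypothetical illustrations; `fetch` stands in for any callable shaped like `retrieve_docs`:

```python
import json
from pathlib import Path
from typing import Optional

TOKEN_FILE = Path("search_checkpoint.json")  # hypothetical checkpoint location

def save_checkpoint(token: str) -> None:
    """Persist the latest continuation token to disk."""
    TOKEN_FILE.write_text(json.dumps({"continuation_token": token}))

def load_checkpoint() -> Optional[str]:
    """Return a previously saved continuation token, if any."""
    if TOKEN_FILE.exists():
        return json.loads(TOKEN_FILE.read_text()).get("continuation_token")
    return None

def resume_search(fetch, search_data):
    """Drain all pages, checkpointing after each one.

    `fetch` is any callable taking the search object and returning a
    response with .results and .continuation_token attributes.
    """
    results = []
    token = load_checkpoint()
    if token:
        search_data.continuation_token = token  # resume where we left off
    while True:
        docs = fetch(search_data)
        results.extend(docs.results)
        if not docs.continuation_token:
            break
        search_data.continuation_token = docs.continuation_token
        save_checkpoint(docs.continuation_token)
    if TOKEN_FILE.exists():
        TOKEN_FILE.unlink()  # search completed; clear the checkpoint
    return results
```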

Example: Pagination with Progress Tracking

def retrieve_all_results(org_id: str, ds_name: str, search_data: Search):
    """Retrieve all results across multiple pages with progress tracking"""
    all_results = []
    page = 1

    docs = retrieve_docs(org_id, ds_name, search_data)
    all_results.extend(docs.results)
    print(f"Page {page}: Retrieved {len(docs.results)} results")

    while docs.continuation_token:
        page += 1
        search_data.continuation_token = docs.continuation_token
        docs = retrieve_docs(org_id, ds_name, search_data)
        all_results.extend(docs.results)
        print(f"Page {page}: Retrieved {len(docs.results)} results")

        if not docs.has_more:
            break

    print(f"\nTotal: {len(all_results)} results across {page} pages")
    return all_results

# Usage
search_data = Search(
    text="Gordon Ramsay",
    embedding_tensor="title_emb",
    top_k=150  # Will require multiple pages
)
results = retrieve_all_results(org_id, ds_name, search_data)

Search Parameters

Search Class

| Parameter | Type | Default | Description |
|---|---|---|---|
| text | str | "" | Search query text |
| embedding_tensor | Optional[str] | "" | Name of the embedding tensor to search |
| inv_text | str | "" | Inverse text query (exclusion) |
| inv_tensor | str | "" | Inverse tensor name |
| filters | Optional[Dict] | None | Filtering criteria |
| top_k | int | 10 | Number of results to return. Note: if > 50, results will be paginated |
| continuation_token | Optional[str] | None | Token for retrieving the next page of results |

SearchResponse Class

| Field | Type | Description |
|---|---|---|
| results | Any | Search results from the API |
| continuation_token | Optional[str] | Token to fetch the next page (present when more results are available) |
| total_results | Optional[int] | Total number of results matching the query |
| has_more | bool | Whether more results are available for retrieval |
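For reference, the two tables above correspond to Pydantic models along these lines. This is a sketch reconstructed from the documented fields and defaults, not the script's verbatim source:

```python
from typing import Any, Dict, Optional
from pydantic import BaseModel

class Search(BaseModel):
    text: str = ""
    embedding_tensor: Optional[str] = ""
    inv_text: str = ""
    inv_tensor: str = ""
    filters: Optional[Dict] = None
    top_k: int = 10  # values > 50 are paginated by the API
    continuation_token: Optional[str] = None

class SearchResponse(BaseModel):
    results: Any
    continuation_token: Optional[str] = None
    total_results: Optional[int] = None
    has_more: bool = False
```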

Error Handling

The script includes comprehensive error handling:

  • ConnectionError: Raised when the server is unreachable
  • Timeout: Raised when requests exceed 30 seconds (configurable)
  • RequestException: Raised for other HTTP errors

All errors include timing information and response details when available.

Example Error Handling

from requests.exceptions import RequestException

try:
    docs = retrieve_docs(org_id, ds_name, search_data)
except RequestException as e:
    print(f"Search failed: {e}")
    # Handle error appropriately (retry, log, alert, etc.)

Complete Example

from retrieve_with_filters import retrieve_docs, Search, SearchResponse

def main():
    # Configuration
    org_id = "your-org-id"
    ds_name = "your-dataset-name"

    # Create search query with filters
    search_data = Search(
        text="Gordon Ramsay cooking",
        embedding_tensor="title_emb",
        top_k=75,  # Will require pagination
        filters={
            'made_for_kids': {'value': False},
            'views': {'value': [100000, 1000000]}  # List filter: match these specific view counts
        }
    )

    # Retrieve all results
    all_results = []
    page_count = 0

    try:
        docs = retrieve_docs(org_id, ds_name, search_data)
        all_results.extend(docs.results)
        page_count += 1

        # Handle pagination
        while docs.continuation_token:
            search_data.continuation_token = docs.continuation_token
            docs = retrieve_docs(org_id, ds_name, search_data)
            all_results.extend(docs.results)
            page_count += 1

        print(f"Successfully retrieved {len(all_results)} results across {page_count} pages")

        # Process results
        for i, result in enumerate(all_results, 1):
            print(f"{i}. {result}")

    except Exception as e:
        print(f"Error during search: {e}")
        return None

    return all_results

if __name__ == "__main__":
    main()

API Endpoint

The script uses the following endpoint:

POST /api/vector-search/search/{org_id}/{ds_name}

Request body includes search parameters as JSON.
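For debugging outside the script, the same call can be made directly with requests. The endpoint path and payload shape are taken from this page; the host, org, dataset, and field values are placeholders:

```python
import os
import requests

org_id, ds_name = "your-org-id", "your-dataset-name"
url = f"http://0.0.0.0:8080/api/vector-search/search/{org_id}/{ds_name}"

payload = {
    "text": "Gordon Ramsay",
    "embedding_tensor": "title_emb",
    "top_k": 10,
}
headers = {"Authorization": f"Bearer {os.environ.get('ACTIVELOOP_TOKEN', '')}"}

# Uncomment to send the request against a running server:
# response = requests.post(url, json=payload, headers=headers, timeout=30)
# response.raise_for_status()
# print(response.json())
```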

Troubleshooting

Connection Issues

  • Verify the BASE_URL is correct and the server is running
  • Check that your ACTIVELOOP_TOKEN is valid
  • Ensure network connectivity to the API endpoint

Authentication Errors

  • Confirm your token has proper permissions for the organization and dataset
  • Verify the token is correctly exported in your environment
  • Check token format (should be "Bearer {token}")

Pagination Issues

  • Getting incomplete results: Ensure you're checking for and using continuation tokens
  • top_k > 50 not returning all results: You must iterate through all pages using continuation tokens
  • Stale continuation token: Tokens may expire; restart the search if you receive an error

Empty Results

  • Check that the dataset name and organization ID are correct
  • Verify the embedding_tensor name matches your dataset schema
  • Review your filter criteria to ensure they're not too restrictive
  • Confirm the dataset contains data matching your query

Performance Issues

  • Consider reducing top_k if you don't need all results at once
  • Use filters to narrow down results before retrieving large result sets
  • Implement caching for repeated queries
  • Monitor timeout settings (default: 30 seconds)

Security Notes

  • Never commit your ACTIVELOOP_TOKEN to version control
  • Use environment variables or secrets management for credentials
  • Remove debug print statements that expose credentials before deploying to production
  • Consider using HTTPS endpoints for production deployments
  • Implement rate limiting if making frequent API calls
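The last note above can be implemented client-side with a minimum-interval wrapper around each API call. This is a purely illustrative sketch, not part of the script:

```python
import time

class RateLimiter:
    """Enforce a minimum interval between successive calls."""

    def __init__(self, calls_per_second: float = 5.0):
        self.min_interval = 1.0 / calls_per_second
        self._last_call = 0.0

    def wait(self) -> None:
        """Block until at least min_interval has passed since the last call."""
        elapsed = time.monotonic() - self._last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_call = time.monotonic()
```

Call `limiter.wait()` immediately before each `retrieve_docs` invocation.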

Performance Tips

  1. Optimize top_k: Request only the number of results you need
  2. Use filters effectively: Apply filters to reduce result set size before retrieval
  3. Batch processing: Process results as they come in each page rather than waiting for all pages
  4. Connection pooling: Reuse HTTP connections for multiple requests
  5. Async requests: Consider using async libraries for concurrent searches
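Tip 4 can be sketched by keeping a single requests.Session for the process, which reuses TCP connections across calls; the retry settings below are illustrative defaults, not values from the script:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(retries: int = 3, backoff: float = 0.5) -> requests.Session:
    """Session with connection pooling and retries on transient errors."""
    session = requests.Session()
    retry = Retry(total=retries, backoff_factor=backoff,
                  status_forcelist=[429, 500, 502, 503, 504])
    adapter = HTTPAdapter(max_retries=retry,
                          pool_connections=10, pool_maxsize=10)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```

Create the session once and pass it (or `session.post`) wherever the script currently calls `requests.post`.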

Version 1.0

  • Initial release with basic search functionality
  • Filter support (boolean, list, JSON content)
  • Pagination with continuation tokens
  • Comprehensive error handling