
Conversation

@kinglisky (Contributor) commented Jan 28, 2026

Description

Important

  1. Make sure you have read our contribution guidelines
  2. Ensure there is an associated issue and you have been assigned to it
  3. Use the correct syntax to link this PR: [Feature] Dataset-scoped API Keys for Fine-grained Access Control #31409

Summary

This PR introduces dataset-scoped API keys that are bound to specific datasets, providing finer-grained access control compared to existing tenant-level keys.

Resolves #31409

Problem

Currently, Dataset API Keys are tenant-level only (app_id=NULL), which means:

  • A single key can access all datasets within the tenant
  • No isolation between different datasets
  • Over-privileged access for scenarios requiring single-dataset access

Solution

Introduce dataset-scoped API keys that coexist with tenant-level keys:

| Key Type | app_id Value | Access Scope |
| --- | --- | --- |
| Tenant-level | NULL | All datasets in tenant |
| Dataset-scoped | <dataset_id> | Only the bound dataset |

Changes

  1. Permission validation (controllers/service_api/wraps.py)

    • Added a dataset-scoped key permission check in the validate_dataset_token decorator (a sketch of this check follows the list below)
    • Dataset-scoped keys can only access their bound dataset (403 Forbidden otherwise)
    • Dataset-scoped keys cannot access tenant-level endpoints (e.g., list all datasets)
  2. API key management (controllers/console/apikey.py)

    • Changed DatasetApiKeyListResource.resource_id_field to use app_id for storing dataset_id
    • Changed DatasetApiKeyResource.resource_id_field to use app_id
  3. Tenant-level key listing (controllers/console/datasets/datasets.py)

    • Added app_id.is_(None) filter to ensure tenant-level endpoint only returns tenant-level keys
  4. Unit tests (tests/unit_tests/controllers/service_api/test_dataset_token_validation.py)

    • Added 5 test cases covering permission validation scenarios
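
To make the new permission rules concrete, here is a minimal sketch of the kind of check described in item 1 above. The helper name check_dataset_scope, the argument names, and the use of werkzeug's Forbidden are illustrative assumptions; the actual code in controllers/service_api/wraps.py may be structured differently.

```python
from werkzeug.exceptions import Forbidden


def check_dataset_scope(api_token, dataset_id: str | None) -> None:
    """Enforce the binding between a dataset-scoped key and its dataset."""
    # Tenant-level keys (app_id is NULL) keep their existing behaviour.
    if api_token.app_id is None:
        return
    # Dataset-scoped keys cannot call tenant-level endpoints such as
    # "list all datasets", where no dataset_id appears in the request path.
    if dataset_id is None:
        raise Forbidden("This API key is scoped to a single dataset.")
    # Dataset-scoped keys may only access the dataset they are bound to.
    if str(api_token.app_id) != str(dataset_id):
        raise Forbidden("This API key cannot access the requested dataset.")
```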

Design Principles

  • No database migration: reuse the existing app_id field to store the dataset_id (see the illustrative example after this list)
  • Backward compatible: existing tenant-level keys continue to work unchanged
  • Minimally invasive: only the necessary code paths are modified
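
To illustrate the app_id reuse, the two key types would differ only in that one field. This is not an excerpt from the codebase: the import path and the tenant_id/dataset_id variables are assumed to be in scope.

```python
from models.model import ApiToken  # import path assumed

# Tenant-level key: app_id stays NULL, so it can reach every dataset in the tenant.
tenant_key = ApiToken(
    type="dataset",
    tenant_id=tenant_id,
    app_id=None,
    token="ds-...",  # placeholder; real tokens are generated with the ds- prefix
)

# Dataset-scoped key: the existing app_id column now stores the bound dataset's id.
dataset_key = ApiToken(
    type="dataset",
    tenant_id=tenant_id,
    app_id=dataset_id,
    token="ds-...",
)
```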

Screenshots

N/A - This is a backend API change with no UI modifications.

Checklist

  • This change requires a documentation update, included: Dify Document
  • I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
  • I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
  • I've updated the documentation accordingly.
  • I ran make lint and make type-check (backend) and cd web && npx lint-staged (frontend) to appease the lint gods

   - Add permission check in validate_dataset_token decorator
     - Dataset-scoped keys can only access their bound dataset (403 otherwise)
     - Dataset-scoped keys cannot access tenant-level endpoints
   - Change DatasetApiKeyListResource to use app_id field for storing dataset_id
   - Add app_id.is_(None) filter to tenant-level key list endpoint
   - Add unit tests for permission validation logic
@dosubot bot added the size:L label (This PR changes 100-499 lines, ignoring generated files) on Jan 28, 2026
@gemini-code-assist bot commented

Summary of Changes

Hello @kinglisky, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the security and flexibility of API access by introducing dataset-scoped API keys. Previously, all dataset API keys were tenant-level, granting broad access to all datasets within a tenant. The new implementation allows keys to be tied to a single dataset, preventing over-privileged access and enabling more precise control over data access. This change reuses existing database structures and maintains backward compatibility for existing tenant-level keys.

Highlights

  • Dataset-Scoped API Keys: Introduced a new capability to create API keys that are specifically bound to individual datasets, enhancing granular access control.
  • Permission Validation Logic: Implemented new validation rules in validate_dataset_token to enforce that dataset-scoped keys can only access their designated dataset and cannot access tenant-level endpoints.
  • API Key Management Update: Modified apikey.py to utilize the existing app_id field for storing dataset_id for dataset-scoped keys, ensuring backward compatibility without database migrations.
  • Tenant-Level Key Filtering: Updated the API key listing endpoint in datasets.py to explicitly filter for app_id IS NULL, ensuring that only tenant-level keys are displayed for tenant-level requests.
  • Comprehensive Unit Tests: Added a new test file with 5 distinct test cases to thoroughly verify the correct behavior of the new dataset-scoped API key validation and permission checks.


@gemini-code-assist bot left a comment

Code Review

This pull request introduces dataset-scoped API keys, a great feature for finer-grained access control. While the core permission validation logic in the validate_dataset_token decorator is sound, significant security issues have been identified. Repurposing the app_id column without a database migration leads to existing keys becoming unmanageable ('ghost keys'), and inconsistent key counting logic in the tenant-level key creation endpoint allows for a denial-of-service scenario where scoped keys can block the creation of tenant-level keys. Additionally, there are minor areas for improvement in the new test code to enhance clarity.

resource_model = Dataset
resource_id_field = "dataset_id"
token_prefix = "ds-"
resource_id_field = "app_id"
security-medium

The PR changes the resource_id_field for dataset API keys from dataset_id to app_id to reuse the existing column and avoid a database migration. However, as noted in the PR description, no database migration is performed. This creates a security issue where any existing dataset API keys (which likely have their associated dataset ID stored in the dataset_id column) will no longer be returned by the list endpoint or be deletable by the delete endpoint, as these endpoints now filter by the app_id column. These 'ghost' keys remain active and valid for authentication but cannot be managed or revoked by administrators through the API or console, leading to a broken access control and revocation mechanism.
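
If that risk turns out to be real, a one-off data migration would make the legacy keys manageable again. The sketch below is illustrative only: the api_tokens table name and the legacy dataset_id column are assumptions taken from the comment above, not verified against the current schema.

```python
from alembic import op
import sqlalchemy as sa


def upgrade() -> None:
    # Backfill: copy the legacy dataset binding into app_id so the list and
    # delete endpoints (which now filter on app_id) can see these keys again.
    op.execute(
        sa.text(
            "UPDATE api_tokens "
            "SET app_id = dataset_id "
            "WHERE type = 'dataset' AND dataset_id IS NOT NULL AND app_id IS NULL"
        )
    )
```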

select(ApiToken).where(
ApiToken.type == self.resource_type,
ApiToken.tenant_id == current_tenant_id,
ApiToken.app_id.is_(None),
security-medium

While the get method has been correctly updated to filter for tenant-level keys using ApiToken.app_id.is_(None), the post method in the same DatasetApiKeyApi class (lines 770-774) still counts all keys of type 'dataset' for the tenant when enforcing the max_keys limit. Since dataset-scoped keys also use the 'dataset' type, they are included in this count. This allows a user with edit permissions (who can create scoped keys) to reach the limit and effectively block administrators from creating any tenant-level dataset API keys, resulting in a denial of service on a critical management function and a privilege inconsistency.
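
A possible fix, mirroring the select() style of the snippet quoted above, is to restrict the max_keys count to tenant-level keys; the surrounding variable names in DatasetApiKeyApi.post are assumed here rather than taken from the diff.

```python
current_key_count = db.session.scalar(
    select(func.count(ApiToken.id)).where(
        ApiToken.type == self.resource_type,
        ApiToken.tenant_id == current_tenant_id,
        ApiToken.app_id.is_(None),  # exclude dataset-scoped keys from the quota
    )
)
```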

# Setup mock database queries
mock_db.session.query.return_value.where.return_value.first.return_value = mock_dataset
query_chain = mock_db.session.query.return_value.where.return_value
query_chain.where.return_value.where.return_value.where.return_value.one_or_none.return_value = (
medium

This line is redundant because the side_effect set on line 82 for the same mock object (...first) will override this return_value. You can safely remove this line to improve test clarity.

# Setup mock database queries
mock_db.session.query.return_value.where.return_value.first.return_value = mock_dataset
query_chain = mock_db.session.query.return_value.where.return_value
query_chain.where.return_value.where.return_value.where.return_value.one_or_none.return_value = (
medium

This line is redundant because the side_effect set on line 112 for the same mock object (...first) will override this return_value. You can safely remove this line to improve test clarity.

@crazywoola self-assigned this Jan 29, 2026
