Skip to content

[Azure] Add service principal and DefaultAzureCredential support#9123

Open
zpoint wants to merge 3 commits intoskypilot-org:masterfrom
zpoint:feature/azure-services-account
Open

[Azure] Add service principal and DefaultAzureCredential support#9123
zpoint wants to merge 3 commits intoskypilot-org:masterfrom
zpoint:feature/azure-services-account

Conversation

@zpoint
Copy link
Collaborator

@zpoint zpoint commented Mar 18, 2026

Summary

  • Add is_non_cli_credential() and get_credential() to sky/adaptors/azure.py — hybrid credential factory that preserves the existing CLI fast path while adding DefaultAzureCredential for SP/cert/workload identity/managed identity
  • Update sky/clouds/azure.py to skip CLI-specific checks (token cache file, az --version) when SP env vars are detected, return empty file mounts, and use client ID as identity
  • Update assign_storage_account_iam_role() to handle service principals (looks up SP object ID via Graph API, uses principalType: ServicePrincipal)

How it works

Auth Mode Local (provisioning) Remote VM
az login (existing) AzureCliCredential(timeout=30) fast path ~/.azure/ files mounted
SP env vars (new) DefaultAzureCredentialEnvironmentCredential No files mounted, VM uses MSI
Managed Identity N/A DefaultAzureCredentialManagedIdentityCredential

Test plan

  • 14 new unit tests in tests/unit_tests/test_azure_utils.py covering all credential paths
  • pytest tests/unit_tests/test_azure_utils.py — 23 tests pass
  • format.sh — pylint 10.00/10, mypy clean

Manual verification

# SP auth
export AZURE_CLIENT_ID="..." AZURE_CLIENT_SECRET="..." AZURE_TENANT_ID="..." AZURE_SUBSCRIPTION_ID="..."
sky check  # Should show Azure enabled

# CLI auth (backward compat)
unset AZURE_CLIENT_ID AZURE_CLIENT_SECRET AZURE_TENANT_ID AZURE_SUBSCRIPTION_ID
sky check  # Should still work via az login

🤖 Generated with Claude Code

Support Azure authentication via service principal (client secret,
certificate), workload identity, and managed identity in addition
to the existing `az login` CLI flow. Uses a hybrid credential
resolution: CLI fast path when token cache exists, otherwise
DefaultAzureCredential for all other methods.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances Azure authentication capabilities by integrating Service Principal and DefaultAzureCredential support. The changes enable users to authenticate with Azure using a wider range of methods, making the system more adaptable for automated environments and various identity management strategies, while maintaining backward compatibility with existing Azure CLI setups.

Highlights

  • Azure Authentication Flexibility: Introduced support for Azure Service Principals and DefaultAzureCredential, allowing for more flexible authentication methods beyond just the Azure CLI. This enables integration with various identity types like client secrets, certificates, workload identity, and managed identity.
  • Hybrid Credential Management: Implemented a hybrid credential factory in sky/adaptors/azure.py that prioritizes the existing Azure CLI fast path (using AzureCliCredential with a timeout workaround) while falling back to DefaultAzureCredential for other authentication scenarios.
  • Enhanced Azure Cloud Integration: Updated sky/clouds/azure.py to conditionally skip CLI-specific checks when Service Principal environment variables are detected. It also adjusts credential file mounts and user identity retrieval to align with the chosen authentication method.
  • Service Principal IAM Role Assignment: Modified the assign_storage_account_iam_role() function to correctly handle Service Principals by looking up their object ID via the Graph API and using principalType: ServicePrincipal for role assignments.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for Azure Service Principal and DefaultAzureCredential, which is a valuable enhancement for non-interactive authentication. The implementation across sky/adaptors/azure.py and sky/clouds/azure.py is solid, and the addition of extensive unit tests is commendable. I have identified a couple of areas for improvement: a minor code duplication in sky/adaptors/azure.py that could be refactored, and a couple of unit tests in tests/unit_tests/test_azure_utils.py that don't correctly test the intended functionality. My detailed comments and suggestions are provided below.

Comment on lines +211 to +216
mock_identity = mock.MagicMock()
with mock.patch.dict('sys.modules', {'azure.identity': mock_identity}):
from azure import identity
identity.AzureCliCredential(process_timeout=30)
mock_identity.AzureCliCredential.assert_called_with(
process_timeout=30)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This test doesn't seem to be testing the azure.get_credential() function. It directly calls identity.AzureCliCredential() and then asserts that the mock was called, which is a tautology. The test should call azure.get_credential() to verify its behavior.

        mock_identity = mock.MagicMock()
        with mock.patch.dict('sys.modules', {'azure.identity': mock_identity}):
            azure.get_credential()
            mock_identity.AzureCliCredential.assert_called_with(
                process_timeout=30)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. The tests now call azure.get_credential() instead of directly calling the mock. See updated test_cli_fast_path.

Comment on lines +221 to +227
mock_identity = mock.MagicMock()
with mock.patch.dict('sys.modules', {'azure.identity': mock_identity}):
from azure import identity
identity.DefaultAzureCredential(exclude_cli_credential=True,
exclude_powershell_credential=True)
mock_identity.DefaultAzureCredential.assert_called_with(
exclude_cli_credential=True, exclude_powershell_credential=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Similar to test_cli_fast_path, this test doesn't call the function under test, azure.get_credential(). It directly calls identity.DefaultAzureCredential() and asserts the mock was called. To correctly test the logic in azure.get_credential(), you should call it within the test.

        mock_identity = mock.MagicMock()
        with mock.patch.dict('sys.modules', {'azure.identity': mock_identity}):
            azure.get_credential()
            mock_identity.DefaultAzureCredential.assert_called_with(
                exclude_cli_credential=True, exclude_powershell_credential=True)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Same as above — test_non_cli_default_credential now calls azure.get_credential() instead of directly calling the mock.

Comment on lines +392 to +460
if is_non_cli_credential():
# For service principals, look up the SP's object ID via Graph API
# using the application (client) ID.
graph_client = get_client('graph')

async def get_sp_object_id() -> str:
from msgraph.generated.service_principals.service_principals_request_builder import (
ServicePrincipalsRequestBuilder) # pylint: disable=line-too-long
client_id = os.environ['AZURE_CLIENT_ID']
query_params = (ServicePrincipalsRequestBuilder.
ServicePrincipalsRequestBuilderGetQueryParameters(
filter=f'appId eq \'{client_id}\'',
select=['id'],
))
config = (ServicePrincipalsRequestBuilder.
ServicePrincipalsRequestBuilderGetRequestConfiguration(
query_parameters=query_params,))
result = await graph_client.service_principals.get(
request_configuration=config)
if not result or not result.value:
raise sky_exceptions.StorageBucketCreateError(
'Service principal not found in Azure AD. '
'Ensure the service principal exists and you have '
'Graph API permissions.')
return str(result.value[0].id)

try:
try:
loop = asyncio.get_running_loop()
except RuntimeError:
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
object_id = loop.run_until_complete(get_sp_object_id())
except sky_exceptions.StorageBucketCreateError:
raise
except Exception as e: # pylint: disable=broad-except
with ux_utils.print_exception_no_traceback():
raise sky_exceptions.StorageBucketCreateError(
'Service principal lacks Graph API permissions to '
'auto-assign storage role. Pre-assign '
'\'Storage Blob Data Owner\' role to your service '
'principal.\n'
f'Details: '
f'{common_utils.format_exception(e, use_bracket=True)}')
principal_type = 'ServicePrincipal'
else:
graph_client = get_client('graph')

# Obtaining user's object ID to assign role.
# Reference: https://github.com/Azure/azure-sdk-for-python/issues/35573 # pylint: disable=line-too-long
async def get_object_id_async() -> str:
httpx_logger = logging.getLogger('httpx')
original_level = httpx_logger.getEffectiveLevel()
# silencing the INFO level response log from httpx request
httpx_logger.setLevel(logging.WARNING)
user = await graph_client.users.with_url(
'https://graph.microsoft.com/v1.0/me').get()
httpx_logger.setLevel(original_level)
return str(user.additional_data['id'])

# Create a new event loop if none exists
try:
loop = asyncio.get_running_loop()
except RuntimeError:
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)

object_id = loop.run_until_complete(get_object_id_async())
principal_type = 'User'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic for getting or creating an asyncio event loop is duplicated in both the if and else blocks (lines 419-423 and 453-457). To improve code clarity and reduce redundancy, you could move this logic to before the if is_non_cli_credential(): condition.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Hoisted the event loop creation and graph_client = get_client('graph') above the if is_non_cli_credential(): branch to remove duplication.

from azure import identity

# Fast path: preserve existing CLI behavior with timeout workaround
if os.path.isfile(os.path.expanduser('~/.azure/msal_token_cache.json')):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR comment says the server account has higher priority than the CLI, but in the code it seems to be reversed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Fixed — reversed the priority so that service principal (via is_non_cli_credential() / AZURE_CLIENT_ID env var) takes priority over CLI credentials. This is consistent with how AWS and GCP work: env-var-based service account credentials always take priority over personal CLI credentials.

# using the application (client) ID.
graph_client = get_client('graph')

async def get_sp_object_id() -> str:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to use async here? The question is: we only use FastAPI to serve the Sky server, and this part doesn't seem to be part of the endpoint, so is blocking a concern here? We should follow the project's code pattern rather than reuse existing code if the existing code doesn't follow it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The async usage here is forced by the msgraph SDK — graph_client.service_principals.get() and graph_client.users.with_url(...).get() return coroutines, so we need loop.run_until_complete() to call them from sync code. The existing else branch (user identity path, pre-existing code) already uses the same async pattern for the same reason. Unfortunately the msgraph SDK doesn't offer a sync API alternative.

zpoint and others added 2 commits March 23, 2026 17:33
…event loop

- Reverse credential priority so service principal (AZURE_CLIENT_ID env var)
  takes precedence over CLI, consistent with AWS/GCP behavior
- Fix test_cli_fast_path and test_non_cli_default_credential to actually call
  azure.get_credential() instead of directly calling mocks (tautology)
- Hoist asyncio event loop creation and graph_client above the
  if/else branch to remove duplication

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@zpoint zpoint marked this pull request as ready for review March 23, 2026 10:10
@zpoint zpoint requested a review from Michaelvll March 23, 2026 10:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant