[Azure] Add service principal and DefaultAzureCredential support#9123
[Azure] Add service principal and DefaultAzureCredential support#9123zpoint wants to merge 3 commits intoskypilot-org:masterfrom
Conversation
Support Azure authentication via service principal (client secret, certificate), workload identity, and managed identity in addition to the existing `az login` CLI flow. Uses a hybrid credential resolution: CLI fast path when token cache exists, otherwise DefaultAzureCredential for all other methods. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances Azure authentication capabilities by integrating Service Principal and Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces support for Azure Service Principal and DefaultAzureCredential, which is a valuable enhancement for non-interactive authentication. The implementation across sky/adaptors/azure.py and sky/clouds/azure.py is solid, and the addition of extensive unit tests is commendable. I have identified a couple of areas for improvement: a minor code duplication in sky/adaptors/azure.py that could be refactored, and a couple of unit tests in tests/unit_tests/test_azure_utils.py that don't correctly test the intended functionality. My detailed comments and suggestions are provided below.
| mock_identity = mock.MagicMock() | ||
| with mock.patch.dict('sys.modules', {'azure.identity': mock_identity}): | ||
| from azure import identity | ||
| identity.AzureCliCredential(process_timeout=30) | ||
| mock_identity.AzureCliCredential.assert_called_with( | ||
| process_timeout=30) |
There was a problem hiding this comment.
This test doesn't seem to be testing the azure.get_credential() function. It directly calls identity.AzureCliCredential() and then asserts that the mock was called, which is a tautology. The test should call azure.get_credential() to verify its behavior.
mock_identity = mock.MagicMock()
with mock.patch.dict('sys.modules', {'azure.identity': mock_identity}):
azure.get_credential()
mock_identity.AzureCliCredential.assert_called_with(
process_timeout=30)There was a problem hiding this comment.
Fixed. The tests now call azure.get_credential() instead of directly calling the mock. See updated test_cli_fast_path.
| mock_identity = mock.MagicMock() | ||
| with mock.patch.dict('sys.modules', {'azure.identity': mock_identity}): | ||
| from azure import identity | ||
| identity.DefaultAzureCredential(exclude_cli_credential=True, | ||
| exclude_powershell_credential=True) | ||
| mock_identity.DefaultAzureCredential.assert_called_with( | ||
| exclude_cli_credential=True, exclude_powershell_credential=True) |
There was a problem hiding this comment.
Similar to test_cli_fast_path, this test doesn't call the function under test, azure.get_credential(). It directly calls identity.DefaultAzureCredential() and asserts the mock was called. To correctly test the logic in azure.get_credential(), you should call it within the test.
mock_identity = mock.MagicMock()
with mock.patch.dict('sys.modules', {'azure.identity': mock_identity}):
azure.get_credential()
mock_identity.DefaultAzureCredential.assert_called_with(
exclude_cli_credential=True, exclude_powershell_credential=True)There was a problem hiding this comment.
Fixed. Same as above — test_non_cli_default_credential now calls azure.get_credential() instead of directly calling the mock.
| if is_non_cli_credential(): | ||
| # For service principals, look up the SP's object ID via Graph API | ||
| # using the application (client) ID. | ||
| graph_client = get_client('graph') | ||
|
|
||
| async def get_sp_object_id() -> str: | ||
| from msgraph.generated.service_principals.service_principals_request_builder import ( | ||
| ServicePrincipalsRequestBuilder) # pylint: disable=line-too-long | ||
| client_id = os.environ['AZURE_CLIENT_ID'] | ||
| query_params = (ServicePrincipalsRequestBuilder. | ||
| ServicePrincipalsRequestBuilderGetQueryParameters( | ||
| filter=f'appId eq \'{client_id}\'', | ||
| select=['id'], | ||
| )) | ||
| config = (ServicePrincipalsRequestBuilder. | ||
| ServicePrincipalsRequestBuilderGetRequestConfiguration( | ||
| query_parameters=query_params,)) | ||
| result = await graph_client.service_principals.get( | ||
| request_configuration=config) | ||
| if not result or not result.value: | ||
| raise sky_exceptions.StorageBucketCreateError( | ||
| 'Service principal not found in Azure AD. ' | ||
| 'Ensure the service principal exists and you have ' | ||
| 'Graph API permissions.') | ||
| return str(result.value[0].id) | ||
|
|
||
| try: | ||
| try: | ||
| loop = asyncio.get_running_loop() | ||
| except RuntimeError: | ||
| loop = asyncio.new_event_loop() | ||
| asyncio.set_event_loop(loop) | ||
| object_id = loop.run_until_complete(get_sp_object_id()) | ||
| except sky_exceptions.StorageBucketCreateError: | ||
| raise | ||
| except Exception as e: # pylint: disable=broad-except | ||
| with ux_utils.print_exception_no_traceback(): | ||
| raise sky_exceptions.StorageBucketCreateError( | ||
| 'Service principal lacks Graph API permissions to ' | ||
| 'auto-assign storage role. Pre-assign ' | ||
| '\'Storage Blob Data Owner\' role to your service ' | ||
| 'principal.\n' | ||
| f'Details: ' | ||
| f'{common_utils.format_exception(e, use_bracket=True)}') | ||
| principal_type = 'ServicePrincipal' | ||
| else: | ||
| graph_client = get_client('graph') | ||
|
|
||
| # Obtaining user's object ID to assign role. | ||
| # Reference: https://github.com/Azure/azure-sdk-for-python/issues/35573 # pylint: disable=line-too-long | ||
| async def get_object_id_async() -> str: | ||
| httpx_logger = logging.getLogger('httpx') | ||
| original_level = httpx_logger.getEffectiveLevel() | ||
| # silencing the INFO level response log from httpx request | ||
| httpx_logger.setLevel(logging.WARNING) | ||
| user = await graph_client.users.with_url( | ||
| 'https://graph.microsoft.com/v1.0/me').get() | ||
| httpx_logger.setLevel(original_level) | ||
| return str(user.additional_data['id']) | ||
|
|
||
| # Create a new event loop if none exists | ||
| try: | ||
| loop = asyncio.get_running_loop() | ||
| except RuntimeError: | ||
| loop = asyncio.new_event_loop() | ||
| asyncio.set_event_loop(loop) | ||
|
|
||
| object_id = loop.run_until_complete(get_object_id_async()) | ||
| principal_type = 'User' |
There was a problem hiding this comment.
Fixed. Hoisted the event loop creation and graph_client = get_client('graph') above the if is_non_cli_credential(): branch to remove duplication.
sky/adaptors/azure.py
Outdated
| from azure import identity | ||
|
|
||
| # Fast path: preserve existing CLI behavior with timeout workaround | ||
| if os.path.isfile(os.path.expanduser('~/.azure/msal_token_cache.json')): |
There was a problem hiding this comment.
The PR comment says the server account has higher priority than the CLI, but in the code it seems to be reversed?
There was a problem hiding this comment.
Good catch. Fixed — reversed the priority so that service principal (via is_non_cli_credential() / AZURE_CLIENT_ID env var) takes priority over CLI credentials. This is consistent with how AWS and GCP work: env-var-based service account credentials always take priority over personal CLI credentials.
| # using the application (client) ID. | ||
| graph_client = get_client('graph') | ||
|
|
||
| async def get_sp_object_id() -> str: |
There was a problem hiding this comment.
Do we need to use async here? The question is: we only use FastAPI to serve the Sky server, and this part doesn't seem to be part of the endpoint, so is blocking a concern here? We should follow the project's code pattern rather than reuse existing code if the existing code doesn't follow it.
There was a problem hiding this comment.
The async usage here is forced by the msgraph SDK — graph_client.service_principals.get() and graph_client.users.with_url(...).get() return coroutines, so we need loop.run_until_complete() to call them from sync code. The existing else branch (user identity path, pre-existing code) already uses the same async pattern for the same reason. Unfortunately the msgraph SDK doesn't offer a sync API alternative.
…event loop - Reverse credential priority so service principal (AZURE_CLIENT_ID env var) takes precedence over CLI, consistent with AWS/GCP behavior - Fix test_cli_fast_path and test_non_cli_default_credential to actually call azure.get_credential() instead of directly calling mocks (tautology) - Hoist asyncio event loop creation and graph_client above the if/else branch to remove duplication Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Summary
is_non_cli_credential()andget_credential()tosky/adaptors/azure.py— hybrid credential factory that preserves the existing CLI fast path while addingDefaultAzureCredentialfor SP/cert/workload identity/managed identitysky/clouds/azure.pyto skip CLI-specific checks (token cache file,az --version) when SP env vars are detected, return empty file mounts, and use client ID as identityassign_storage_account_iam_role()to handle service principals (looks up SP object ID via Graph API, usesprincipalType: ServicePrincipal)How it works
az login(existing)AzureCliCredential(timeout=30)fast path~/.azure/files mountedDefaultAzureCredential→EnvironmentCredentialDefaultAzureCredential→ManagedIdentityCredentialTest plan
tests/unit_tests/test_azure_utils.pycovering all credential pathspytest tests/unit_tests/test_azure_utils.py— 23 tests passformat.sh— pylint 10.00/10, mypy cleanManual verification
🤖 Generated with Claude Code