Conversation
Summary of ChangesHello @SiswoHandoko, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly expands the agent's capabilities by enabling it to interact with local desktop environments on both macOS and Windows. This allows the agent to perform tasks directly on the user's operating system, moving beyond browser-only interactions. The changes involve integrating new computer control classes, adapting the agent's core logic for desktop-specific instructions and tool usage, and updating the project's documentation and dependencies to support this new functionality. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
|
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
There was a problem hiding this comment.
Code Review
This pull request introduces support for desktop automation on both macOS and Windows by adding desktop and desktop_win environments, involving new DesktopComputer classes and agent logic updates. However, this introduces a critical security vulnerability: Remote Code Execution (RCE) via prompt injection, as the agent is given excessive permissions to control the host system's GUI based on untrusted user input without proper safeguards. Beyond this critical security concern, the changes also present maintainability issues due to significant code duplication between platform-specific computer classes, areas for code cleanup in the agent logic, and a minor documentation inconsistency in the README.
| system_instruction = None | ||
| if self._environment == "desktop": | ||
| system_instruction = ( | ||
| "You are controlling a local desktop GUI (macOS). " | ||
| "Do not type into the current active app unless you have explicitly focused the correct input field. " | ||
| "To open apps, first open Spotlight with Command+Space using key_combination, " | ||
| "then type the app name and press Enter without clicking in other windows. " | ||
| "Prefer keyboard shortcuts to switch apps instead of typing into arbitrary windows." | ||
| ) | ||
| search_intent = any( | ||
| token in self._query.lower() | ||
| for token in ("find", "search", "look for", "locate") | ||
| ) | ||
| if search_intent: | ||
| system_instruction += ( | ||
| " If the user asks to find or search for something, open Spotlight first." | ||
| ) | ||
| elif self._environment == "desktop_win": | ||
| system_instruction = ( | ||
| "You are controlling a local desktop GUI (Windows). " | ||
| "Do not type into the current active app unless you have explicitly focused the correct input field. " | ||
| "To open apps, first open Start/Search with Win or Win+S using key_combination, " | ||
| "then type the app name and press Enter without clicking in other windows. " | ||
| "Prefer keyboard shortcuts to switch apps instead of typing into arbitrary windows." | ||
| ) | ||
| search_intent = any( | ||
| token in self._query.lower() | ||
| for token in ("find", "search", "look for", "locate") | ||
| ) | ||
| if search_intent: | ||
| system_instruction += ( | ||
| " If the user asks to find or search for something, open Start/Search first." | ||
| ) |
There was a problem hiding this comment.
The introduction of desktop automation capabilities (macOS and Windows) via pyautogui creates a critical security risk. The agent's actions, driven by the query parameter (untrusted user input), allow for Remote Code Execution (RCE) via prompt injection, potentially enabling an attacker to take control of the host system. This is exacerbated by duplicated logic for handling desktop and desktop_win environments, specifically the search_intent check and system_instruction appending, which could be refactored for better maintainability and to simplify the application of security mitigations.
To mitigate the RCE risk, consider:
- Implementing mandatory human-in-the-loop (HITL) confirmation for sensitive actions.
- Running the agent in a highly restricted, sandboxed environment.
- Implementing a strict allow-list of permitted actions.
- Sanitizing and validating all LLM-generated output before execution.
Additionally, refactoring the duplicated logic for desktop and desktop_win environments would improve maintainability and reduce redundancy.
| def type_text_at( | ||
| self, | ||
| x: int, | ||
| y: int, | ||
| text: str, | ||
| press_enter: bool = False, | ||
| clear_before_typing: bool = True, | ||
| ) -> EnvState: | ||
| if self._spotlight_pending: | ||
| self._spotlight_pending = False | ||
| else: | ||
| pyautogui.click(x, y) | ||
| if clear_before_typing: | ||
| if sys.platform == "darwin": | ||
| pyautogui.hotkey("command", "a") | ||
| else: | ||
| pyautogui.hotkey("ctrl", "a") | ||
| pyautogui.press("backspace") | ||
| pyautogui.write(text) | ||
| if press_enter: | ||
| pyautogui.press("enter") | ||
| return self.current_state() | ||
|
|
||
| def scroll_document( | ||
| self, direction: Literal["up", "down", "left", "right"] | ||
| ) -> EnvState: | ||
| scroll_amount = self._screen_size[1] // 2 | ||
| if direction == "up": | ||
| pyautogui.scroll(scroll_amount) | ||
| elif direction == "down": | ||
| pyautogui.scroll(-scroll_amount) | ||
| elif direction == "left": | ||
| pyautogui.hscroll(-scroll_amount) | ||
| elif direction == "right": | ||
| pyautogui.hscroll(scroll_amount) | ||
| else: | ||
| raise ValueError("Unsupported direction: ", direction) | ||
| return self.current_state() | ||
|
|
||
| def scroll_at( | ||
| self, | ||
| x: int, | ||
| y: int, | ||
| direction: Literal["up", "down", "left", "right"], | ||
| magnitude: int = 800, | ||
| ) -> EnvState: | ||
| pyautogui.moveTo(x, y) | ||
| if direction == "up": | ||
| pyautogui.scroll(magnitude) | ||
| elif direction == "down": | ||
| pyautogui.scroll(-magnitude) | ||
| elif direction == "left": | ||
| pyautogui.hscroll(-magnitude) | ||
| elif direction == "right": | ||
| pyautogui.hscroll(magnitude) | ||
| else: | ||
| raise ValueError("Unsupported direction: ", direction) | ||
| return self.current_state() | ||
|
|
||
| def wait_5_seconds(self) -> EnvState: | ||
| time.sleep(5) | ||
| return self.current_state() | ||
|
|
||
| def go_back(self) -> EnvState: | ||
| if sys.platform == "darwin": | ||
| pyautogui.hotkey("command", "[") | ||
| else: | ||
| pyautogui.hotkey("alt", "left") | ||
| return self.current_state() | ||
|
|
||
| def go_forward(self) -> EnvState: | ||
| if sys.platform == "darwin": | ||
| pyautogui.hotkey("command", "]") | ||
| else: | ||
| pyautogui.hotkey("alt", "right") | ||
| return self.current_state() | ||
|
|
||
| def search(self) -> EnvState: | ||
| return self.navigate(self._search_engine_url) | ||
|
|
||
| def navigate(self, url: str) -> EnvState: | ||
| normalized_url = url | ||
| if not normalized_url.startswith(("http://", "https://")): | ||
| normalized_url = "https://" + normalized_url | ||
| if sys.platform == "darwin": | ||
| pyautogui.hotkey("command", "l") | ||
| else: | ||
| pyautogui.hotkey("ctrl", "l") | ||
| pyautogui.write(normalized_url) | ||
| pyautogui.press("enter") | ||
| self._current_url = normalized_url | ||
| time.sleep(1) | ||
| return self.current_state() | ||
|
|
||
| def key_combination(self, keys: list[str]) -> EnvState: | ||
| normalized_keys = [self._normalize_key(key) for key in keys] | ||
| if len(normalized_keys) == 1: | ||
| pyautogui.press(normalized_keys[0]) | ||
| else: | ||
| pyautogui.hotkey(*normalized_keys) | ||
| if sys.platform == "darwin" and normalized_keys == ["command", "space"]: | ||
| self._spotlight_pending = True | ||
| time.sleep(0.2) | ||
| return self.current_state() | ||
|
|
||
| def drag_and_drop( | ||
| self, x: int, y: int, destination_x: int, destination_y: int | ||
| ) -> EnvState: | ||
| pyautogui.moveTo(x, y) | ||
| pyautogui.dragTo(destination_x, destination_y, button="left") | ||
| return self.current_state() |
There was a problem hiding this comment.
The DesktopComputer class provides the LLM with excessive permissions by allowing it to perform OS-level input automation via pyautogui. This includes typing arbitrary text (type_text_at), navigating to arbitrary URLs (navigate), and pressing any key combination (key_combination). When combined with an LLM that processes untrusted user input, this tool can be abused to compromise the host system.
Consider restricting the tool's capabilities to the minimum necessary for the intended task and ensuring that all actions are performed in a secure, isolated environment.
| def type_text_at( | ||
| self, | ||
| x: int, | ||
| y: int, | ||
| text: str, | ||
| press_enter: bool = False, | ||
| clear_before_typing: bool = True, | ||
| ) -> EnvState: | ||
| if self._launcher_pending: | ||
| self._launcher_pending = False | ||
| else: | ||
| pyautogui.click(x, y) | ||
| if clear_before_typing: | ||
| pyautogui.hotkey("ctrl", "a") | ||
| pyautogui.press("backspace") | ||
| pyautogui.write(text) | ||
| if press_enter: | ||
| pyautogui.press("enter") | ||
| return self.current_state() | ||
|
|
||
| def scroll_document( | ||
| self, direction: Literal["up", "down", "left", "right"] | ||
| ) -> EnvState: | ||
| scroll_amount = self._screen_size[1] // 2 | ||
| if direction == "up": | ||
| pyautogui.scroll(scroll_amount) | ||
| elif direction == "down": | ||
| pyautogui.scroll(-scroll_amount) | ||
| elif direction == "left": | ||
| pyautogui.hscroll(-scroll_amount) | ||
| elif direction == "right": | ||
| pyautogui.hscroll(scroll_amount) | ||
| else: | ||
| raise ValueError("Unsupported direction: ", direction) | ||
| return self.current_state() | ||
|
|
||
| def scroll_at( | ||
| self, | ||
| x: int, | ||
| y: int, | ||
| direction: Literal["up", "down", "left", "right"], | ||
| magnitude: int = 800, | ||
| ) -> EnvState: | ||
| pyautogui.moveTo(x, y) | ||
| if direction == "up": | ||
| pyautogui.scroll(magnitude) | ||
| elif direction == "down": | ||
| pyautogui.scroll(-magnitude) | ||
| elif direction == "left": | ||
| pyautogui.hscroll(-magnitude) | ||
| elif direction == "right": | ||
| pyautogui.hscroll(magnitude) | ||
| else: | ||
| raise ValueError("Unsupported direction: ", direction) | ||
| return self.current_state() | ||
|
|
||
| def wait_5_seconds(self) -> EnvState: | ||
| time.sleep(5) | ||
| return self.current_state() | ||
|
|
||
| def go_back(self) -> EnvState: | ||
| pyautogui.hotkey("alt", "left") | ||
| return self.current_state() | ||
|
|
||
| def go_forward(self) -> EnvState: | ||
| pyautogui.hotkey("alt", "right") | ||
| return self.current_state() | ||
|
|
||
| def search(self) -> EnvState: | ||
| return self.navigate(self._search_engine_url) | ||
|
|
||
| def navigate(self, url: str) -> EnvState: | ||
| normalized_url = url | ||
| if not normalized_url.startswith(("http://", "https://")): | ||
| normalized_url = "https://" + normalized_url | ||
| pyautogui.hotkey("ctrl", "l") | ||
| pyautogui.write(normalized_url) | ||
| pyautogui.press("enter") | ||
| self._current_url = normalized_url | ||
| time.sleep(1) | ||
| return self.current_state() | ||
|
|
||
| def key_combination(self, keys: list[str]) -> EnvState: | ||
| normalized_keys = [self._normalize_key(key) for key in keys] | ||
| if len(normalized_keys) == 1: | ||
| pyautogui.press(normalized_keys[0]) | ||
| else: | ||
| pyautogui.hotkey(*normalized_keys) | ||
| if normalized_keys in (["win"], ["win", "s"]): | ||
| self._launcher_pending = True | ||
| time.sleep(0.2) | ||
| return self.current_state() | ||
|
|
||
| def drag_and_drop( | ||
| self, x: int, y: int, destination_x: int, destination_y: int | ||
| ) -> EnvState: | ||
| pyautogui.moveTo(x, y) | ||
| pyautogui.dragTo(destination_x, destination_y, button="left") | ||
| return self.current_state() |
There was a problem hiding this comment.
The DesktopComputer class for Windows provides the LLM with excessive permissions via pyautogui, similar to the macOS version. This tool can be abused via prompt injection to perform malicious actions on the host Windows system.
Consider implementing strict safeguards and running the agent in a restricted environment.
| from ..computer import Computer, EnvState | ||
|
|
||
|
|
||
| class DesktopComputer(Computer): |
There was a problem hiding this comment.
This file is almost an exact copy of computers/desktop/desktop.py. This significant code duplication will make future maintenance difficult, as changes will need to be applied in two places.
A better approach would be to create a common base class, say BaseDesktopComputer, that contains all the shared logic. Then, DesktopComputer (for macOS) and DesktopWindowsComputer can inherit from this base class and override only the methods that have platform-specific implementations (like _normalize_key, go_back, navigate, etc.). This will greatly improve code reuse and maintainability.
| | Argument | Description | Required | Default | Supported Environment(s) | | ||
| | ------------------- | -------------------------------------------------------------------------------------------------------------------------------------- | -------- | ----------------------------------------- | ------------------------ | | ||
| | `--query` | The natural language query for the browser agent to execute. | Yes | N/A | All | | ||
| | `--env` | The computer use environment to use. Must be one of the following: `playwright`, `browserbase`, `desktop`, or `desktop_win` | No | N/A | All | |
There was a problem hiding this comment.
The documentation for the --env argument is missing its default value. The table shows "N/A", but the code in main.py specifies default="playwright". To improve clarity for users, the default value should be documented here.
| | `--env` | The computer use environment to use. Must be one of the following: `playwright`, `browserbase`, `desktop`, or `desktop_win` | No | N/A | All | | |
| | `--env` | The computer use environment to use. Must be one of the following: `playwright`, `browserbase`, `desktop`, or `desktop_win` | No | `playwright` | All | |
References
- In documentation, command-line arguments that have a default value should be marked as not required and their default value should be specified.
| from ..computer import Computer, EnvState | ||
|
|
||
|
|
||
| class DesktopComputer(Computer): |
There was a problem hiding this comment.
The class is named DesktopComputer, which is the same as the class in computers/desktop/desktop.py. This is confusing and can lead to errors. Since it's imported as DesktopWindowsComputer in computers/__init__.py, it would be clearer and more consistent to rename the class itself to DesktopWindowsComputer.
| class DesktopComputer(Computer): | |
| class DesktopWindowsComputer(Computer): |
No description provided.