Amazon Bedrock AgentCore Browser Adds OS-Level Actions

By Artūras Malašauskas, May 05, 2026
AWS extends AgentCore Browser with OS-level mouse, keyboard, and screenshot capabilities to handle native dialogs and system prompts beyond DOM automation.

Amazon Web Services has expanded Amazon Bedrock AgentCore Browser with OS-level interaction capabilities, addressing a fundamental limitation in browser automation. The update enables AI agents to interact with native operating system elements that sit outside the browser's DOM layer, including print dialogs, security prompts, and keyboard shortcuts.

Browser automation tools like Playwright and Chrome DevTools Protocol (CDP) work well for standard web tasks. They navigate pages, fill forms, and extract content from the web layer. But the web layer has a hard boundary. Anything the operating system renders—native dialogs, certificate choosers, context menus—sits outside the DOM entirely. CDP cannot see it, and Playwright cannot interact with it.

When a web application calls window.print() and a system print dialog appears, Playwright has no DOM to interact with. When a workflow requires a keyboard shortcut or a right-click context menu, CDP has no mechanism to issue those commands at the OS level. These scenarios tend to surface in production, not in testing environments where web content is predictable enough to validate against.

The challenge compounds for vision-enabled agents. A common architecture captures a screenshot, sends it to a model, receives coordinates or instructions, and executes. This loop works well for web content, but breaks the moment native UI appears. The screenshot captures it, the model reasons about it, and then there's nothing to act with. The agent sees exactly what to do and has no way to do it.

OS Level Actions for AgentCore Browser unblocks these scenarios by exposing direct OS control through the InvokeBrowser API. Agents can now interact with content visible on the screen, not only what's accessible through the browser's web layer. By combining full-desktop screenshots with mouse and keyboard control at the OS level, agents can observe native UI, reason about it, and act on it within the same session.

The feature is available by default on all browser instances in all 14 AWS Regions where Amazon Bedrock AgentCore Browser is available. This includes US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Frankfurt), Europe (Ireland), Europe (London), Europe (Paris), Europe (Stockholm), Asia Pacific (Mumbai), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Asia Pacific (Seoul), and Canada (Central).

OS Level Actions are organized into three categories: mouse control, keyboard input, and visual capture. The system supports eight distinct actions with specific fields and constraints. Mouse actions cover clicking, moving, dragging, and scrolling. Keyboard actions handle typing text, pressing individual keys, and executing key combinations. The screenshot action captures the full OS desktop, not just the browser viewport.

Mouse actions work with coordinate-based positioning. The mouseClick action defaults to the current cursor position with a left button single click if coordinates are omitted. This is useful when a prior mouseMove has already positioned the cursor. mouseDrag requires four coordinates—start and end positions. mouseScroll accepts a position and delta values for both axes. A right-click context menu, for example, is a single mouseClick with button set to RIGHT at the target coordinates.
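The mouse actions described above can be sketched as small payload builders. This is only an illustration: the action names (mouseClick, mouseDrag, mouseScroll) and the RIGHT button value come from the article, while the field names (x, y, button, clickCount, startX, deltaX and so on) and the helper functions themselves are assumptions that should be checked against the InvokeBrowser API reference.

```python
from typing import Any, Dict, Optional

# ASSUMPTION: all field names below are illustrative guesses, not confirmed
# API fields. Only the action names and the RIGHT button value are from the article.


def mouse_click(x: Optional[int] = None, y: Optional[int] = None,
                button: str = "LEFT", click_count: int = 1) -> Dict[str, Any]:
    """Build a mouseClick action. Omitting x and y means "click where the cursor is"."""
    if (x is None) != (y is None):
        raise ValueError("provide both x and y, or neither")
    body: Dict[str, Any] = {"button": button, "clickCount": click_count}
    if x is not None:
        body.update({"x": x, "y": y})
    return {"mouseClick": body}


def mouse_drag(start_x: int, start_y: int, end_x: int, end_y: int) -> Dict[str, Any]:
    """Build a mouseDrag action from its four required coordinates."""
    return {"mouseDrag": {"startX": start_x, "startY": start_y,
                          "endX": end_x, "endY": end_y}}


def mouse_scroll(x: int, y: int, delta_x: int = 0, delta_y: int = 0) -> Dict[str, Any]:
    """Build a mouseScroll action with a position and a delta for each axis."""
    return {"mouseScroll": {"x": x, "y": y, "deltaX": delta_x, "deltaY": delta_y}}


# A right-click context menu is a single mouseClick with button set to RIGHT.
context_menu = mouse_click(400, 300, button="RIGHT")
```

Keeping each builder as a plain function that returns a dictionary makes the payloads easy to log and to unit test before any request is sent.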

Keyboard actions cover three levels of input. keyType is for typing text and handles strings up to 10,000 characters. keyPress is for individual keys, including ones that need to be pressed several times in a row, such as tab to advance through form fields or escape to dismiss a modal. keyShortcut is for combinations: pass an array of key names and AgentCore presses them simultaneously. Key names must be lowercase.
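A minimal sketch of the three keyboard actions follows. The action names, the 10,000-character limit and the lowercase rule come from the article; the field names (text, key, presses, keys) and the helper functions are assumptions made for illustration.

```python
from typing import Any, Dict, List

MAX_KEY_TYPE_CHARS = 10_000  # keyType limit stated in the article


def key_type(text: str) -> Dict[str, Any]:
    """Build a keyType action, rejecting text over the documented length limit."""
    if len(text) > MAX_KEY_TYPE_CHARS:
        raise ValueError(f"keyType text exceeds {MAX_KEY_TYPE_CHARS} characters")
    return {"keyType": {"text": text}}  # ASSUMPTION: field name "text"


def key_press(key: str, presses: int = 1) -> Dict[str, Any]:
    """Build a keyPress action; the key name is lowercased before sending."""
    if presses < 1:
        raise ValueError("presses must be at least 1")
    # ASSUMPTION: field names "key" and "presses"
    return {"keyPress": {"key": key.lower(), "presses": presses}}


def key_shortcut(keys: List[str]) -> Dict[str, Any]:
    """Build a keyShortcut action from key names that are pressed together."""
    if not keys:
        raise ValueError("keyShortcut needs at least one key")
    return {"keyShortcut": {"keys": [k.lower() for k in keys]}}  # ASSUMPTION: "keys"


# Example: open the print dialog, then tab three times through its controls.
open_print = key_shortcut(["Ctrl", "P"])
advance = key_press("Tab", presses=3)
```

Lowercasing on the client side is a defensive choice here, since the article notes that key names must be lowercase.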

The expected interaction pattern is an action-screenshot-reaction loop. The agent takes an action, captures a screenshot to observe the current state of the screen, and then decides the next action based on what it sees. This loop allows the agent to react to dynamic UI, including native dialogs and OS prompts that might appear mid-workflow. Each call carries exactly one action and returns a SUCCESS or FAILED status.
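The loop can be sketched independently of any SDK by injecting the two moving parts. In this hedged example, invoke stands in for whatever sends one action to InvokeBrowser, and decide stands in for the model that looks at a screenshot and picks the next step. Both callables, the run_loop name and the "status" result field are assumptions; only the one-action-per-call rule and the SUCCESS and FAILED values come from the article.

```python
from typing import Any, Callable, Dict, List, Optional

Action = Dict[str, Any]
Result = Dict[str, Any]


def run_loop(invoke: Callable[[Action], Result],
             decide: Callable[[Result], Optional[Action]],
             first_action: Action,
             max_steps: int = 20) -> List[Action]:
    """Act, capture a screenshot, then let `decide` choose the next action.

    Stops when `decide` returns None or after `max_steps` actions.
    Raises RuntimeError if any call does not report SUCCESS.
    """
    history: List[Action] = []
    action = first_action
    for _ in range(max_steps):
        result = invoke(action)                 # exactly one action per call
        history.append(action)
        if result.get("status") != "SUCCESS":   # ASSUMPTION: field name "status"
            raise RuntimeError(f"action failed: {action}")

        shot = invoke({"screenshot": {}})       # observe the full desktop
        if shot.get("status") != "SUCCESS":
            raise RuntimeError("screenshot failed")

        next_action = decide(shot)              # model reasons about what it sees
        if next_action is None:
            break
        action = next_action
    return history
```

The max_steps cap is a design choice rather than anything the service requires: it keeps an agent that never converges from issuing actions indefinitely.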

AWS documentation lists constraints that developers must account for. Text input is ASCII-only, so non-ASCII characters are skipped during input. Key names are not validated, so the API returns SUCCESS even if you provide an unrecognized key name. The default viewport size is 1456×819 pixels, which can be changed when starting a session using the viewPort parameter.
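Because the service skips non-ASCII characters and does not reject unknown key names, it can help to check input on the client before sending it. The sketch below is one way to do that. The KNOWN_KEYS set is a hypothetical, deliberately incomplete allowlist; the real list of accepted key names has to come from the AWS documentation.

```python
from typing import Iterable, List, Tuple

# HYPOTHETICAL allowlist for illustration only; replace with the documented key names.
KNOWN_KEYS = frozenset({"enter", "tab", "escape", "ctrl", "alt", "shift",
                        "a", "c", "p", "v"})


def split_non_ascii(text: str) -> Tuple[str, List[str]]:
    """Return the ASCII-only text plus the characters that would be skipped."""
    kept = "".join(ch for ch in text if ch.isascii())
    dropped = [ch for ch in text if not ch.isascii()]
    return kept, dropped


def unknown_keys(keys: Iterable[str]) -> List[str]:
    """Return key names missing from the allowlist (including non-lowercase names)."""
    return [k for k in keys if k not in KNOWN_KEYS]


# Surface the problem locally instead of relying on a SUCCESS status.
kept, dropped = split_non_ascii("Malašauskas")
suspicious = unknown_keys(["ctrl", "Print"])
```

In this example, dropped contains the single character "š" and suspicious contains "Print", so the caller can warn or abort before a call that would otherwise report SUCCESS.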

Key use cases include automated testing with system dialog handling, document management workflows, complex UI interactions with right-click menus, and vision-based AI agents that require complete browser environment visibility. The feature serves AI agent developers, test automation engineers, and organizations building LLM-powered web interaction tools.

Some context menu items might not function as expected because of the virtualized environment in which the browser session runs. This is a practical limitation: the virtualized desktop adds failure modes that a locally run browser does not have, so developers will need to test edge cases thoroughly before deploying to production.

The InvokeBrowser API follows the same pattern as InvokeCodeInterpreter: a single unified operation with action-type dispatch. You send a request with exactly one action, and receive a corresponding result. The active session is identified using the x-amzn-browser-session-id header, which ties each OS-level action to the correct browser session.
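As a rough illustration of the request shape, the sketch below assembles the header and a JSON body for one action. The header name is taken from the article. The build_request helper, the "action" wrapper key in the body and the example session ID are assumptions, and the endpoint path and request signing are left out because the article does not describe them.

```python
import json
from typing import Any, Dict

SESSION_HEADER = "x-amzn-browser-session-id"  # header name given in the article


def build_request(session_id: str, action: Dict[str, Any]) -> Dict[str, Any]:
    """Assemble headers and a JSON body for a single OS-level action."""
    if len(action) != 1:
        raise ValueError("each request must carry exactly one action")
    return {
        "headers": {
            SESSION_HEADER: session_id,
            "Content-Type": "application/json",
        },
        # ASSUMPTION: the body wraps the action under an "action" key.
        "body": json.dumps({"action": action}),
    }


# Example with a made-up session ID.
request = build_request("session-123", {"screenshot": {}})
```

Rejecting payloads with zero or several actions on the client mirrors the one-action-per-request rule and avoids a round trip for a request that cannot be valid.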

Technical implementation requires developers to understand the dual interaction model. WebSocket-based automation uses CDP for standard browser tasks. OS-level actions use a REST API for operating system-level interactions. This complements CDP by handling scenarios where browser-level automation is insufficient. The two systems work together, not as replacements.

Whether this solves production problems remains the open question. The feature addresses documented gaps in browser automation, but virtualized environments introduce their own failure modes. Developers will need to validate workflows against real-world OS configurations, not just test environments. The gap between a controlled test setup and a production deployment with varying permissions and security policies is substantial, and it has troubled browser automation users for years.

Amazon Bedrock AgentCore Browser now provides a more complete automation surface. The OS-level actions fill a gap that has existed since browser automation tools first emerged. But filling the gap doesn't guarantee smooth execution. The practical realities of driving a virtualized desktop (load times, coordinate precision, permission boundaries) will determine whether this works in practice.

Artūras Malašauskas is an AI Systems Integrator with 20+ years of production-grade web engineering experience. He has designed, shipped, and scaled enterprise Python/PHP systems for logistics, SaaS, and public-sector clients. For the past year, he has focused exclusively on AI integrations: deploying open-source LLMs, building generative media pipelines (image, audio, video), and engineering multi-agent workflows for real production environments. His standard: reproducibility, security, cost-efficient inference, and no vaporware. He documents and evaluates emerging AI tooling, separating verified capabilities from marketing noise. Technical editor at: muza-ai.eu, ai-verslas.lt, ai-naujinos.lt