The virtualized worker - Giving the ghost a body and securing it!
AI is evolving from chatbot to worker by combining reasoning tokens, disposable sandboxed VMs, and computer-use vision—delivering finished files instead of advice. The ghost finally has a body.
Some brief notes:
We spend so much time talking about agents that never sleep, systems that scale infinitely, and work that runs 24/7.
But you’re not a Firecracker microVM. You don’t spin up in 125 milliseconds. You need rest.
So consider this your official Human Interrupt. The swarm can wait. The schemas can wait. The future will still be there.
Over this week and the next, I won't post on my usual Tuesday and Thursday schedule. Instead, for this week (Christmas 2025) and next week (New Year 2026), I'll post on Tuesday and Friday. Thanks for understanding!
Whatever this season means to you, I hope it holds warmth, good food, and people worth being present with.
And a disclaimer:
What follows is written at the end of 2025, considering these topics for 2026 as if looking back. The technologies are real. The timeline is a projection—but a projection grounded in what shipped in 2024 and 2025.
TL;DR
For years, AI was a brain trapped in a jar. You’d ask it to book a flight, and it would hand you a list of airlines to check yourself. You’d ask it to fix your code, and it would spit out a script you had to run manually. The AI could describe work but couldn’t do work. We were stuck playing clipboard for our chatbots.
That era is ending.
The breakthrough came from fusing three technologies: Thinking Tokens that force AI to plan before acting, Firecracker microVMs that spin up disposable computers in 125 milliseconds, and Computer Use capabilities that let AI see screens and click buttons like a human.
The result? AI that doesn’t just talk about spreadsheets—it opens them, processes them, and delivers finished files. Need to audit 10,000 contracts? Spin up 10,000 sandboxed agents. Ten minutes and $40 later, you have your report.
We’re no longer accepting polished-sounding “glossy soup.” We’re demanding functional artifacts. The ghost finally has a body—and it’s ready to work.
The Itch: The “Brain in a Jar” Problem
For the first few years of the AI boom, we suffered from a peculiar limitation. We had built machines capable of extraordinary linguistic and analytical feats, but we had trapped them in a text box.
We had created a Brain in a Jar.
You felt this frustration every time you tried to get real work done in 2024. You would ask an AI to “Book me a flight to London.”
It would reply: “I can’t browse the live web to book tickets, but here is a list of airlines you might check.”
You would ask it to “Analyze this messy Excel file.”
It would reply with a block of Python code and say: “Run this on your machine.”
The AI could describe the work, but it couldn’t do the work. It lacked Agency. It had no hands. It had no eyes. It had no computer. We were stuck in the “Text-in/Text-out” paradigm. If the task required clicking a button, logging into a portal, or running a simulation, the human still had to do it. The AI was just a very smart, very annoying backseat driver.
This created the “Copy-Paste Crisis.”
Engineers became glorified clipboards, moving code from the Chatbot to the Terminal, seeing it fail, pasting the error back into the Chatbot, and repeating the cycle. We realized that as long as the human had to be the runtime environment for the AI, the AI wasn’t saving us time—it was just changing the type of frustration we endured.
To make the leap from “Chatbot” to “Worker,” we needed to give the ghost a body. But giving an AI access to a real computer—your laptop, your servers—was a security nightmare. One hallucination could delete your database. One “rm -rf” command could wipe your startup.
The solution wasn’t to trust the AI. It was to quarantine it.
The Deep Dive: Building the Synthetic Employee
In 2026, we can solve the “Brain in a Jar” problem by converging three distinct technologies into a single stack: Thinking (The Plan), Virtualization (The Body), and Computer Use (The Hands).
Layer 1: The “Thinking” Loop (The Cognitive Pause)
Before an agent can act, it must plan. In the early days, models were “System 1” thinkers (fast, intuitive, and prone to snap judgments)—they reacted instantly. You typed a prompt, and they began generating the response immediately. This was fatal for complex, agentic tasks. If an agent starts clicking buttons before it understands the interface, or starts writing code before it understands the dependencies, it breaks things.
The fix came from Reasoning Tokens, popularized by OpenAI's o1 and later ChatGPT 5.2.
When you give a command to a modern agent, it doesn’t move immediately. It enters a “Thinking State.”
The Prompt: “Log into the supplier portal, find the invoices from May, and reconcile them with our internal ledger.”
The Hidden Thought:
I need to locate the 'Invoices' tab. The date format is likely DD/MM/YYYY. I will need to download PDFs. I should verify the download finished before parsing. If the login fails, I need to check if the 2FA token is stale.
This isn’t just a pause; it is a probabilistic self-audit. By generating hidden reasoning tokens, the model explicitly decomposes the request into atomic steps. It simulates the execution path in its “mind” before taking action in the real world.
Crucially, this mechanism reduces tool-use errors. In 2024, agents would often hallucinate tool parameters (e.g., calling a weather tool with location="Paris, TX" instead of Paris, France). With reasoning tokens, the model catches itself: "Wait, the user context implies Europe. I should specify the country code."[2]
While the planning itself can still contain errors, this internal deliberation creates a critical intervention window. It transforms the agent from a chaotic improviser into a methodical planner, drastically reducing the “copy-paste” disasters of the past.
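The plan-then-act pattern described above can be reduced to a simple control loop: decompose first, audit the plan against the tools the runtime actually exposes, and only then execute. Here is a minimal sketch; the `plan` function is a stub standing in for the model's hidden reasoning, and the tool names are illustrative, not a real API.

```python
# Minimal "plan, then act" sketch. plan() stands in for the LLM's hidden
# Thinking State; in a real agent it would be a model call returning
# structured steps. Tool names are hypothetical.

from dataclasses import dataclass

@dataclass
class Step:
    tool: str   # which tool the agent wants to invoke
    args: dict  # parameters for the tool call

# Tools the runtime actually exposes (assumed for this sketch).
AVAILABLE_TOOLS = {"browser.open", "browser.click", "files.download"}

def plan(task: str) -> list[Step]:
    """Stand-in for the hidden reasoning: decompose before acting."""
    return [
        Step("browser.open", {"url": "https://portal.example.com/login"}),
        Step("files.download", {"pattern": "invoice_2025-05_*.pdf"}),
    ]

def validate(steps: list[Step]) -> list[str]:
    """The self-audit, reduced to a deterministic check: reject any step
    that names a tool the runtime does not expose."""
    return [s.tool for s in steps if s.tool not in AVAILABLE_TOOLS]

steps = plan("Reconcile May invoices")
bad = validate(steps)
print("plan length:", len(steps), "| invalid tools:", bad)
# → plan length: 2 | invalid tools: []
```

The point of validating before executing is exactly the "intervention window" described above: a hallucinated tool call is caught as data, not discovered as a crashed browser session.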
Layer 2: The Virtualized Runtime (The Disposable Body)
Once the agent has a plan, it needs a place to execute it.
We couldn’t give agents access to our real computers. Docker containers were a popular stopgap, but they were too slow (seconds to boot) and, more importantly, they shared the host kernel—meaning a sophisticated agent could potentially escape the container and infect the host.
So, companies like Manus and infrastructure providers like E2B pioneered the Agentic Runtime using Firecracker microVMs.[3][4]
Think of this as a “Disposable Laptop.”
When you assign a task to an agent, the system spins up a brand-new, sterile Linux computer in the cloud. It doesn’t take minutes; it takes about 125 milliseconds. That is faster than the blink of a human eye.
This computer has a browser, a file system, a code editor, and a terminal. The agent “logs in” to this virtual machine and gets to work.
Why “Burning it Down” is a Feature
The superpower of this architecture is Ephemerality.
If the agent accidentally downloads a virus, deletes the system files, or creates an infinite loop that crashes the CPU, it doesn’t matter. The VM is isolated. The moment the task is done—or failed—we burn the VM down. The malware is vaporized along with the virtual machine.
We call this “Epistemic Isolation.” The agent can learn and explore, but its mistakes are contained within a disposable reality. It never touches your corporate network. It never touches your laptop.
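The lifecycle is the whole trick: create, use, destroy, even when the task inside fails. This sketch models it locally with a temporary directory standing in for the microVM; a real runtime (E2B, a Firecracker orchestrator) exposes the same create-and-burn-down shape, just against actual VMs.

```python
# The "burn it down" pattern, sketched with a temp directory standing in
# for the disposable microVM. The lifecycle shape (create, use, destroy
# even on failure) is the point; the local filesystem is just a stand-in.

import os
import shutil
import tempfile
from contextlib import contextmanager

@contextmanager
def disposable_sandbox():
    workdir = tempfile.mkdtemp(prefix="agent-vm-")
    try:
        yield workdir  # the agent does its work here
    finally:
        # Burn it down: everything the agent wrote (or broke) is vaporized,
        # whether the task succeeded or raised.
        shutil.rmtree(workdir, ignore_errors=True)

with disposable_sandbox() as vm:
    with open(os.path.join(vm, "scratch.txt"), "w") as f:
        f.write("malware, infinite loops, deleted files: all contained")
    survived = vm

print("sandbox still exists:", os.path.isdir(survived))
# → sandbox still exists: False
```

Because teardown lives in a `finally` block, a crashed or malicious task cannot skip it—which is the software equivalent of "the malware is vaporized along with the virtual machine."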
Enterprise-Grade Security: The “Egress” Problem
Of course, a sandbox isn’t useful if it’s a prison. The agent needs to talk to the internet to get data. But how do you prevent it from sending your data to a hacker?
Modern runtimes enforce strict Egress Filtering.
The Allowlist: The agent can only connect to domains explicitly whitelisted in the Scope of Work (e.g., api.salesforce.com, github.com). All other traffic is dropped by the hypervisor.

Credential Injection: We solved the authentication problem with just-in-time (JIT) credential injection.[5] The agent never sees your permanent API keys. It requests access, and a proxy injects an ephemeral OAuth token into the request headers as it leaves the sandbox. If the agent tries to print the key to the console, it sees nothing. This ensures that even if the agent is “jailbroken,” it has no secrets to steal.
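Both controls can be sketched as a single egress proxy function: check the destination against the allowlist, then attach an ephemeral token as the request leaves. All names here are hypothetical; in production this logic lives at the hypervisor or network layer, outside the agent's process, precisely so the agent cannot read or bypass it.

```python
# Sketch of egress filtering + JIT credential injection. Hypothetical
# names; real implementations sit in the hypervisor/proxy, not inside
# the sandboxed agent's own process.

from urllib.parse import urlparse

ALLOWLIST = {"api.salesforce.com", "github.com"}

def mint_ephemeral_token(domain: str) -> str:
    """Stand-in for an OAuth token exchange. The permanent key never
    enters the sandbox, so there is nothing for the agent to leak."""
    return f"eph-token-for-{domain}"

def egress_proxy(url: str, headers: dict) -> dict:
    host = urlparse(url).hostname
    if host not in ALLOWLIST:
        raise PermissionError(f"egress to {host} is blocked")
    # Inject the credential only as the request leaves the sandbox.
    return {**headers, "Authorization": f"Bearer {mint_ephemeral_token(host)}"}

ok = egress_proxy("https://api.salesforce.com/v1/invoices", {})

try:
    egress_proxy("https://evil.example.net/exfil", {})
    blocked = False
except PermissionError:
    blocked = True

print("authorized:", "Authorization" in ok, "| exfil blocked:", blocked)
# → authorized: True | exfil blocked: True
```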
Layer 3: Computer Use (The Hands)
This is where the magic happens.
In the old days, bots could only use APIs. If a website didn’t have an API, the bot was blind. It couldn’t book a flight on a legacy airline site; it couldn’t use the company’s 15-year-old legacy system.
But the real world runs on legacy software, clunky systems, and visual interfaces.
Anthropic solved this with “Computer Use” capabilities.[6] They trained models not just to predict text, but to predict pixels and coordinates.
The agent takes a screenshot of the virtual screen inside its Firecracker VM. It analyzes the UI. It decides: “Move mouse to (x:500, y:200). Click left button. Type ‘Admin’.”
Suddenly, the “API Bottleneck” vanished.
Legacy Software: Need to extract data from a Windows 95 app running in a VDI? The agent can “see” it and click the buttons.
Visual Debugging: Need to fix a CSS bug on a website? The agent spins up a browser, renders the page, looks at the screenshot, notices the button is misaligned, changes the code, refreshes, and verifies the fix visually.
Complex Workflows: Need to download a CSV from a cloud application, upload it to your BI system, and take a screenshot of the dashboard? The agent navigates the dropdown menus just like a human.
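Underneath all three workflows is the same observe-decide-act loop: screenshot the virtual display, ask a vision model for the next action, dispatch it to the VM's input layer, repeat. The sketch below stubs out the model and the input layer; the action schema is illustrative, loosely shaped like the coordinate-based actions the article describes.

```python
# The observe-decide-act loop of "Computer Use", with the vision model
# and the VM's input layer stubbed out. The action dict schema is an
# assumption for illustration, not a specific vendor API.

def vision_model(screenshot: bytes, goal: str) -> dict:
    """Stub: a real model would locate UI elements in the pixels and
    return the next action toward the goal."""
    return {"action": "click", "x": 500, "y": 200}

def take_screenshot() -> bytes:
    return b"\x89PNG..."  # placeholder frame from the virtual display

log = []  # stands in for the VM's mouse/keyboard driver

def dispatch(action: dict) -> None:
    if action["action"] == "click":
        log.append(f"click({action['x']}, {action['y']})")
    elif action["action"] == "type":
        log.append(f"type({action['text']!r})")

action = vision_model(take_screenshot(), "log into the admin panel")
dispatch(action)
dispatch({"action": "type", "text": "Admin"})
print(log)
# → ["click(500, 200)", "type('Admin')"]
```

Because the loop only ever sees pixels and emits coordinates, it works identically against a modern web app, a Windows 95 form in a VDI, or a rendered CSS bug—no API required.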
The Resolution: From “Glossy Soup” to “Functional Artifacts”
So, what does this look like in practice?
It means we stopped accepting “Glossy Soup”—that polished-sounding, grammatically perfect, but chemically inert text that chatbots used to generate.
We started demanding Functional Artifacts.[7]
A Functional Artifact is a file that works. It is the difference between a recipe and a meal. It is a deliverable that has been executed and verified inside the sandbox before it ever reaches you.
Scenario A: The Procurement Audit (Administrative)
The Old Way: You ask ChatGPT, “How do I audit these invoices?” It gives you a generic checklist (“Check the dates, check the totals”). You still have to open the 500 files yourself.
The New Way (Manus): You upload 500 PDF invoices and a link to your ERP system.
Thinking: The agent plans: “I need to OCR the PDFs, extract the ‘Total’ field, and match it against the ERP row. If the OCR fails, I will try a vision model.”
Virtualization: It spins up a sandbox. It opens a browser and logs into the ERP using injected credentials.
Execution: It downloads the ERP report as Excel. It writes a Python script to parse the PDFs. It runs the script.
Self-Correction: The script fails on Invoice #42 because the format is different. The agent reads the error, rewrites the regex parser, and re-runs the script. It finds 3 discrepancies.
The Artifact: It doesn’t chat with you. It delivers a Discrepancy_Report.xlsx with the 3 errors highlighted in red, ready for your signature.
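The self-correction step in this scenario has a simple shape worth seeing as code: run the parser, read the failure, broaden the approach, retry. In this sketch the "rewrite" is a hard-coded fallback regex; a real agent would generate the fix by feeding the error message back to the model. The invoice strings and patterns are invented for illustration.

```python
# Sketch of the self-correction loop from the audit scenario: parse,
# read the error, patch the parser, re-run. The fallback pattern is
# hard-coded here; a real agent would derive it from the error message.

import re

# Invoice #2 uses a different format than the rest (invented examples).
invoices = ["Total: 1,200.00", "TOTAL - 950.50"]

def parse(text: str, patterns: list[str]) -> float:
    for p in patterns:
        m = re.search(p, text)
        if m:
            return float(m.group(1).replace(",", ""))
    raise ValueError(f"no pattern matched: {text!r}")

patterns = [r"Total: ([\d,\.]+)"]
totals = []
for inv in invoices:
    try:
        totals.append(parse(inv, patterns))
    except ValueError:
        # Self-correction: broaden the parser after reading the error,
        # then retry the same invoice instead of giving up.
        patterns.append(r"TOTAL - ([\d,\.]+)")
        totals.append(parse(inv, patterns))

print(totals)
# → [1200.0, 950.5]
```

The crucial property is that the failure stays inside the loop: the human never sees Invoice #42's format quirk, only the finished report.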
Scenario B: The Market Researcher (Analysis)
The Old Way: You ask for “Trends in EV batteries.” The bot hallucinates a few citations and gives you a summary that sounds plausible but cites non-existent papers.
The New Way: You commission a “Deep Research” task.
Thinking: “I need to find the latest IEA report and the Q3 earnings of Tesla and BYD. I need to verify the data sources.”
Virtualization: The agent opens a browser in the sandbox. It navigates to the IEA website. It uses a secure, injected proxy to access the premium Bloomberg Terminal data (without ever seeing the login/password).
Computer Use: It downloads the PDFs. It doesn’t just “read” the text; it takes screenshots of the charts. It uses Python to extract the raw data tables from the images into a CSV. It generates a new graph comparing the datasets using matplotlib.

The Artifact: A 20-slide PowerPoint deck with the actual charts embedded, source links in the footer, and the raw CSVs in the appendix.
The New Definition of Value
We are no longer impressed by an AI that can talk about code or talk about spreadsheets.
We demand an AI that can open the app, do the work, and save the file.
This changes the economics of labor. We are moving toward “Tokenized Labor.”
Need to audit 10,000 contracts?
Don’t hire 100 lawyers. Don’t wait three weeks.
Spin up 10,000 Firecracker microVMs. Give each one a contract. Let them read, extract, and verify.
Ten minutes later, spin them all down.
Cost: $40. Time: 10 minutes.
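The back-of-envelope behind those numbers is worth making explicit. The per-contract token count and the per-token price below are illustrative assumptions chosen to match the article's $40 figure, not published pricing; the structural point is that cost scales with tokens while wall-clock time stays flat, because every VM runs in parallel.

```python
# Back-of-envelope for the fan-out economics. The token count and price
# are illustrative assumptions, not real vendor pricing.

contracts = 10_000
tokens_per_contract = 8_000        # read + extract + verify (assumed)
price_per_million_tokens = 0.50    # cheap "nano"-tier model rate (assumed)

compute_cost = (contracts * tokens_per_contract / 1_000_000
                * price_per_million_tokens)

# All 10,000 microVMs run concurrently, so wall-clock time is the
# duration of ONE contract audit, not the sum of all of them.
wall_clock_minutes = 10

print(f"cost: ${compute_cost:.0f}, time: {wall_clock_minutes} min")
# → cost: $40, time: 10 min
```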
The “Virtualized Worker” is safe because it works in a sandbox. It is smart because it uses “Thinking Tokens” to plan. And it is useful because it has “Computer Use” hands to interact with the messy, human-centric world we actually live in.
But having a workforce of virtual employees brings a new problem. If you have 1,000 agents spinning up 1,000 sandboxes, who manages them? Who ensures they are working together and not just creating chaos? Who approves the high-stakes decisions?
In the final article, we will ascend to the “Manager Layer.” We will explore Swarm Orchestration, the protocols for Human-in-the-Loop, and how to run a company where the org chart is made of software.
Deep Dive: Connecting the Dots
To understand the full stack of the “Synthetic Employee,” read these foundational pieces:
The Brain: Before you give an agent a computer, you must give it a memory. The Million-Token Lie (019) explains the “Tiered Memory” architecture that prevents the agent from getting lost.
The Shift: Why did we stop chatting? The Conversational Fallacy (018) explains the pivot from “Consultant” to “Worker.”
The Cost: Running VMs and Reasoning models is expensive. The Two-Brain Architecture (016) explains how to use “Nano” models to keep costs down while using “Frontier” models for planning.
Peace. Stay curious! End of transmission.
References
OpenAI (2025). “Learning to Reason with LLMs.” OpenAI Research. (Detailing the RL training behind the reasoning tokens).
CoThink (2025). “Reasoning Models Reduce Verbosity and Error Rates.” arXiv:2505.22017. (Empirical evidence on error reduction via planning).
Agache, A., et al. (2020). “Firecracker: Lightweight Virtualization for Serverless Applications.” USENIX NSDI. (The foundational architecture of microVMs).
E2B (2025). “Sandboxed Cloud Environments for AI Agents.” E2B Documentation. (Orchestrating agent runtimes at scale).
Aembit (2025). “Identity Security for Non-Human Workloads.” Aembit Blog. (Patterns for JIT credential injection).
Anthropic (2024). “Developing a Computer Use Model.” Anthropic Research. (The shift to pixel-based UI interaction).
Meske, C., et al. (2025). “Vibe Coding: Programming through Dialogue and Visualization.” arXiv preprint. (On the shift from text-based coding to intent-driven artifact generation).

