Engineering

Why CLI-First Agent Tooling Often Beats Giant Function Catalogs

A decision-support framework for OpenClaw operators choosing between CLI/filesystem surfaces and large function-calling catalogs for long-running agent work.

CoClaw Editorial Team

OpenClaw Team

Mar 17, 2026 • 8 min read

Function calling is easy to admire because it looks like good software design. CLI-first tooling is easier to underestimate because it looks messy. For long-running agents, that instinct is often backward. The hard problem is usually not calling a capability. It is recovering, inspecting, resuming, and handing off work after the agent has already been running for a while.

In mid-March 2026, a high-engagement LocalLLaMA thread pushed this argument into the open. Xiaozhe Yao’s companion essay, “Why I don’t use function calling for my agents”, made the case bluntly: once agents have to operate across real systems, function wrappers often hide too much of the context that actually matters. Parameters live in URLs, auth state, files, session history, and side effects the wrapper does not expose cleanly.

That does not make function calling bad. It makes it incomplete.

For OpenClaw operators, the practical question is not “API or CLI?” It is:

Which surface should be the center of gravity for this workflow?

My judgment is simple: CLI-first, text-protocol, and filesystem-oriented surfaces usually win when the job is long-running, inspectable, and recoverable. Structured function tools still matter, but they are often the wrong place to anchor the whole system.

Why giant function catalogs look attractive

Function catalogs earn their popularity honestly.

They give operators and model providers a clean story:

  • the tool has a name,
  • the arguments have a schema,
  • the output shape can be constrained,
  • auth and remote API details can stay hidden behind the wrapper,
  • the blast radius can be narrowed to one approved action.

That is real value. OpenAI’s function-calling guidance explicitly leans into this benefit: with Structured Outputs enabled, models can be forced to return arguments that match a JSON Schema. If your problem is “fill out this exact contract and call this exact remote system,” function calling is elegant.

For narrow actions, elegance matters.

Examples:

  • create a ticket in one system with fixed required fields,
  • send a templated notification,
  • look up a specific record from a service you do not want to expose more broadly,
  • trigger a bounded workflow with an approval check in front of it.

If the work is transactional, schemas and wrapper boundaries are a strength.
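As a concrete sketch, here is roughly what such a narrow, transactional tool can look like in the OpenAI-style function-calling shape. The tool name and fields are hypothetical; the point is the contract, not the specific system:

```python
# Illustrative only: a narrow, schema-constrained "create ticket" tool
# in the OpenAI chat-completions function-calling shape. The tool name,
# fields, and target system are hypothetical.
create_ticket_tool = {
    "type": "function",
    "function": {
        "name": "create_ticket",
        "description": "Create a ticket in the tracker with fixed required fields.",
        "strict": True,  # with Structured Outputs, arguments must match this schema
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "priority": {"type": "string", "enum": ["low", "medium", "high"]},
                "body": {"type": "string"},
            },
            "required": ["title", "priority", "body"],
            "additionalProperties": False,
        },
    },
}
```

Everything the model is allowed to decide is named, typed, and bounded. For a single transactional verb, that is exactly what you want.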

Where function catalogs break down operationally

The trouble starts when agents stop being single-turn callers and start looking like workers.

1. They hide the state that makes recovery possible

Yao’s essay points at the core issue: many real actions need more than declared arguments. They also need hidden URL state, auth context, working-directory assumptions, existing files, or the outcome of previous steps.

That is manageable in a normal application because the developer controls the whole call chain. Long-running agents are different. They fail mid-run, get interrupted, switch models, hit approval boundaries, and return hours later. When the important state lives inside wrapper code instead of in workspace artifacts, recovery becomes guesswork.

The operational question is not “Could the model call the function once?” It is “Could a human or a second agent understand what happened and resume without replaying the entire conversation?”

2. They do not compose as naturally as files and shell surfaces

Large tool catalogs often look expressive while actually being brittle. Every missing step turns into another wrapper:

  • searchTickets
  • getTicket
  • updateTicket
  • commentOnTicket
  • summarizeTicketHistory

That works until the workflow changes shape.

By contrast, CLI and file interfaces already assume composition. One command writes output. Another command reads it. A file becomes the handoff. A log becomes the audit trail. A directory becomes the shared workspace. The substrate is small, but the workflow space is large.
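The handoff pattern is easy to make concrete. A minimal Python sketch, with hypothetical step names and a hypothetical `workspace/` directory:

```python
# Minimal sketch of file-mediated composition: one step writes a durable
# artifact, the next step reads it. Neither step needs to know about the
# other's internals, and the artifact survives interruption.
from pathlib import Path

workspace = Path("workspace")  # hypothetical shared working directory
workspace.mkdir(exist_ok=True)

def search_step() -> Path:
    """First step: write results to a durable artifact, not to memory."""
    out = workspace / "search_results.txt"
    out.write_text("ticket-101\nticket-204\n")
    return out

def summarize_step(artifact: Path) -> str:
    """Second step: consume the artifact, not the first step's state."""
    ids = artifact.read_text().splitlines()
    return f"{len(ids)} tickets found: {', '.join(ids)}"

summary = summarize_step(search_step())
```

If the run dies between the two steps, `search_results.txt` is still sitting in the workspace for a human, a retry, or a second agent to pick up.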

This is not only operator taste. The CodeAct paper argues for a related point at the agent level: expressing actions as code in a unified textual environment can outperform approaches that rely on many specialized action schemas.

3. They make human inspection worse right when stakes rise

In real OpenClaw use, people inspect runs when something went wrong:

  • the agent stalled,
  • the environment drifted,
  • the API rejected a payload,
  • a partial result exists but no one trusts it yet.

Files, logs, and shell output are boring, but boring is good here. Humans can read them. Diff them. Archive them. Attach them to a handoff. Re-run the last command with one change.

Wrapper-heavy tool flows often hide this operational surface behind “tool call succeeded” or “tool call failed.” That is fine for a demo. It is weak for a six-hour task with two retries, one approval pause, and a handoff to another worker.

4. They turn policy sprawl into tooling sprawl

OpenClaw already forces a healthy distinction between tool policy, sandboxing, elevated mode, and host exec approvals. That is a hint, not a nuisance. Operational control works better when the surface area stays understandable.

If you expose fifty thin wrappers for what is really one shell-capable workflow, you have not reduced complexity; you have only moved it into:

  • larger allowlists,
  • more approval decisions,
  • more compatibility assumptions,
  • more hidden translation layers,
  • more places for the model to choose the wrong “correct” tool.

That is one reason giant catalogs age badly in agent systems. Their local neatness becomes global mess.

Why CLI, files, logs, and explicit artifacts age better

CLI-first does not win because terminals are nostalgic. It wins because the operational objects are already visible.

| What the operator needs | CLI/filesystem-oriented surface | Giant function-catalog surface |
| --- | --- | --- |
| Resume after interruption | Re-open files, logs, and working directory state | Reconstruct wrapper-specific state and prior tool calls |
| Compose new workflows | Chain commands, scripts, and file artifacts | Add or revise wrappers for each new pathway |
| Inspect partial work | Read outputs directly and rerun targeted steps | Depend on tool-specific debug surfaces |
| Hand off to another agent or human | Share durable artifacts in the workspace | Share conversational memory or opaque call traces |
| Keep the system legible | Small substrate, large behavior space | Large surface area, fragmented behavior |

Three properties matter most.

Recovery

Durable artifacts survive model swaps, agent restarts, and human intervention. A task board file, a diff, a test log, or a generated report gives the next actor something stable to inspect.

That is exactly why /guides/openclaw-multi-agent-routing pushes shared workspace artifacts over hidden conversational memory. Handoffs work better when the receiving worker can see a file, not just trust that the last worker “knows what happened.”

Composability

The filesystem is the original coordination protocol. Commands can stay narrow because files and text streams carry the work between them. You do not need a new wrapper every time the order of operations changes.

This is especially important for agentic coding, research, and ops work, where the useful sequence is often discovered during execution rather than designed in advance.

Operator legibility

When humans have to intervene, the best system is the one that makes the evidence obvious:

  • what the agent read,
  • what it wrote,
  • what command failed,
  • what artifact is now trustworthy,
  • what still needs review.

Legibility is not just a nice UX property. It is how teams keep autonomy from collapsing into mystery.

This is not anti-API dogma

CLI-first is a center-of-gravity argument, not a manifesto against structured tools.

Structured tools still outperform CLI patterns when the job is one of these:

Exact remote side effects

If the agent needs to create a CRM record, start a billing workflow, or call an internal service with strong auth and required fields, a typed tool is usually better than shelling out through a generic CLI wrapper.

High-trust permission envelopes

Sometimes you want the model to have access to one verb and nothing else. A tightly scoped function wrapper is often easier to review than broad shell access.

That is why /guides/custom-tools still matters. Optional tools and explicit allowlists are useful when the permission boundary itself is the product.

Validation-heavy inputs

Form-like workflows with narrow required fields benefit from schemas. Here, the model should not be discovering the interface by reading files or CLI help text. It should satisfy a contract.
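A minimal sketch of what contract-first checking looks like before the call is made. In practice a JSON Schema validator would do this work; the field names here are invented:

```python
# Sketch of contract-first input checking for a form-like workflow.
# A real system would use a JSON Schema validator; the required fields
# below are hypothetical.
REQUIRED = {"customer_id", "plan", "start_date"}

def validate_billing_request(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the contract is satisfied."""
    problems = [f"missing required field: {f}" for f in sorted(REQUIRED - payload.keys())]
    problems += [f"unexpected field: {f}" for f in sorted(payload.keys() - REQUIRED)]
    return problems

ok = validate_billing_request(
    {"customer_id": "c-42", "plan": "pro", "start_date": "2026-03-17"}
)
bad = validate_billing_request({"customer_id": "c-42"})
```

The model's job here is to produce a payload that passes the check, not to explore an environment. That is the schema-shaped sweet spot.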

Stable APIs in unstable local environments

Sometimes the local shell is the fragile part. If your Windows service PATH, PTY behavior, or host runtime is unreliable, a remote structured action can be more dependable than another layer of CLI or script orchestration. That is one reason to keep /guides/self-hosted-ai-api-compatibility-matrix in view: “tool support” is not one capability, and compatibility claims drift quickly.

How OpenClaw operators should choose the tradeoff

The easiest way to get this wrong is to ask which interface looks cleaner to a developer. Ask which one makes the run more recoverable.

I recommend five rules.

1. Make the default substrate boring

For long-running work, give OpenClaw a small set of durable surfaces:

  • workspace files,
  • logs,
  • shell commands,
  • patch/edit paths,
  • explicit result artifacts.

That gives the agent one environment it can keep re-entering instead of a maze of wrappers that all encode state differently.

2. Use function tools for verbs, not for whole operating systems

Good function tools expose narrow, high-value remote actions.

Bad function catalogs try to replace the entire shell, editor, and filesystem with hundreds of pseudo-actions. That is usually where discoverability and maintenance collapse.

3. Force long tasks to leave artifacts behind

If a workflow can span minutes or hours, design it so the agent must emit evidence outside the chat:

  • a plan file,
  • a report,
  • a patch,
  • a status JSON,
  • a run log.

This is the difference between “the model probably did the work” and “the system contains the work.”
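One minimal way to sketch that rule in Python. The paths and field names are illustrative, not an OpenClaw interface:

```python
# Sketch of "the system contains the work": a long task records every
# step in a run log and a machine-readable status file, so recovery
# never depends on the chat transcript. Paths and fields are illustrative.
import json
import time
from pathlib import Path

run_dir = Path("run-artifacts")
run_dir.mkdir(exist_ok=True)

def record_step(step: str, state: str) -> None:
    """Append to the human-readable log and overwrite the current status."""
    with (run_dir / "run.log").open("a") as log:
        log.write(f"{time.strftime('%Y-%m-%dT%H:%M:%S')} {step}: {state}\n")
    (run_dir / "status.json").write_text(
        json.dumps({"last_step": step, "state": state}, indent=2)
    )

record_step("fetch-sources", "done")
record_step("apply-patch", "in-progress")

status = json.loads((run_dir / "status.json").read_text())
```

If the run stops here, `status.json` tells the next actor exactly where to resume, and `run.log` tells a human what already happened.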

4. Design approvals around real power, not around brand names

If a binary can interpret code, spawn subcommands, or read arbitrary files, treat it like a real power surface. Do not smuggle it in through fake-safe wrappers.

That is the practical lesson in /guides/openclaw-exec-approvals-and-safe-bins: approval posture should reflect actual capability, not cosmetic packaging.

5. Treat structured tool support as a separate compatibility claim

Do not assume that an “OpenAI-compatible” or “tool-calling-capable” backend will behave cleanly once the agent starts doing multi-step work. Tool payload support, tool-result continuation, streaming, and later-turn durability are separate questions.

If your architecture depends on structured tools, verify that layer explicitly. If your architecture depends on shell/filesystem work, verify that layer explicitly. Do not let marketing labels stand in for runtime proof.
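A hedged sketch of what "verify that layer explicitly" can mean for structured tools: construct the minimal OpenAI-style tool-calling request and check whether the response actually contains structured tool calls rather than prose. The model and tool names are placeholders, and a real probe would send this payload to the live endpoint:

```python
# Sketch of a capability probe for an "OpenAI-compatible" backend.
# This builds the minimal tool-calling request and checks the response
# shape; actually POSTing it to the backend is what proves support.
# Model name and tool are placeholders.
probe_request = {
    "model": "local-model",  # placeholder
    "messages": [{"role": "user", "content": "What time is it in UTC?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_utc_time",
            "description": "Return the current UTC time.",
            "parameters": {"type": "object", "properties": {}},
        },
    }],
    "tool_choice": "auto",
}

def looks_like_tool_call(response: dict) -> bool:
    """Pass only if the backend emitted a structured tool call."""
    message = response.get("choices", [{}])[0].get("message", {})
    return bool(message.get("tool_calls"))
```

A backend that claims tool support but answers in plain text fails this probe, which is exactly the kind of drift the compatibility matrix tracks. Repeat the check on a later turn with a tool result in the history before trusting multi-step work.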

The real decision frame

The wrong question is:

Which interface is more elegant?

The better questions are:

  1. What happens when the run is interrupted halfway through?
  2. What artifact tells me what the agent actually did?
  3. Can another agent or human pick up the work without transcript archaeology?
  4. Does the approval model match the real blast radius?
  5. Will this surface still make sense after the workflow changes shape?

If those questions dominate, CLI-first usually wins.

If the job is narrow, transactional, and schema-stable, function tools usually win.

That is the tradeoff OpenClaw operators should keep in view: use structured tools as precise instruments, but let files, logs, and shell-native artifacts carry the long-running operational load.

Function catalogs are useful peripherals. They are rarely the best operating system for agent work.

Verification & references

  • Reviewed by: CoClaw Editorial Team
  • Last reviewed: March 17, 2026
