danielhuber.dev@proton.me Sunday, April 5, 2026

Structured Task Planning for LLM Agents: PDDL via MCP Tool Calls

How to expose formal PDDL planning operations as LLM tool calls through MCP, giving agents a structured, verifiable planning substrate for complex multi-step tasks.


March 9, 2026

LLM agents are good at reasoning in prose but poor at guaranteeing that a sequence of actions is actually executable in a given environment. Formal planning languages like PDDL (Planning Domain Definition Language) have solved structured task planning for decades—but they sit entirely outside the token stream. Bridging the two worlds by exposing PDDL operations as MCP tool calls gives agents a verifiable, state-tracked planning substrate without forcing the model to “think” in a formal language it was never trained on.

Why Prose Planning Fails for Complex Tasks

When an LLM reasons through a multi-step task entirely in natural language, it has no enforced notion of state. It can “pick up the key” in step 3 and “put the key down” in step 4, then “pick up the key again” in step 7 without ever noticing the contradiction—because the context window holds text, not a world model. This is fine for short, loosely coupled tasks, but it breaks down for logistics, workflow automation, or anything with hard preconditions and resource constraints.

The classic alternative—classical planning with PDDL—works the other way: it is extremely precise about state but requires a complete, hand-authored domain model, and it has no ability to recover gracefully when the real world diverges from that model. Neither pure approach is satisfying for production agents that must handle open-ended instructions in partially specified environments.

PDDL as a Simulation Substrate

The insight behind combining the two is that PDDL can act as a simulation layer rather than a complete planner. Instead of running a full offline planner to produce a finished plan, you expose the PDDL engine interactively: the LLM can query the current state, ask which actions are applicable, attempt an action, observe whether it succeeded and what state changed, and then decide what to do next.

This is exactly the kind of structured environment that makes ReAct-style agents powerful. The model is still doing the high-level reasoning—deciding which action to try, interpreting failures, replanning—but the PDDL engine is the source of truth for what is legal. The agent cannot accidentally apply an action whose preconditions are unmet; the engine simply rejects it and returns the failure reason.

Note

The key design principle: use PDDL as a constraint oracle, not a planner. The LLM proposes; the PDDL engine validates. This separates semantic reasoning (what should I do?) from mechanical correctness (can I do it right now?).
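A minimal sketch of this propose-then-validate contract, using an invented key-and-door domain (a real engine would parse a PDDL domain file rather than hard-code actions, and action/predicate names here are purely illustrative):

```python
# Toy PDDL-style engine acting as a constraint oracle: it never plans,
# it only checks preconditions and applies effects.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset  # predicates that must hold before applying
    add_effects: frozenset    # predicates made true
    del_effects: frozenset    # predicates made false

@dataclass
class SimulationEngine:
    state: set
    actions: list = field(default_factory=list)

    def applicable_actions(self):
        return [a.name for a in self.actions if a.preconditions <= self.state]

    def apply(self, action_name):
        action = next((a for a in self.actions if a.name == action_name), None)
        if action is None:
            return {"ok": False, "error": f"unknown action: {action_name}"}
        missing = action.preconditions - self.state
        if missing:
            # Reject instead of silently mutating: the LLM proposes,
            # the engine validates.
            return {"ok": False, "error": f"unmet preconditions: {sorted(missing)}"}
        self.state = (self.state - set(action.del_effects)) | set(action.add_effects)
        return {"ok": True, "state": sorted(self.state)}

engine = SimulationEngine(
    state={"(at door)", "(key-on-table)"},
    actions=[
        Action("pick-up-key", frozenset({"(key-on-table)"}),
               frozenset({"(holding key)"}), frozenset({"(key-on-table)"})),
        Action("open-door", frozenset({"(holding key)", "(at door)"}),
               frozenset({"(door-open)"}), frozenset()),
    ],
)
print(engine.apply("open-door"))    # rejected: (holding key) is not yet true
print(engine.apply("pick-up-key"))  # succeeds, key now held
print(engine.apply("open-door"))    # now succeeds
```

The contradiction from the earlier example is now impossible: applying "open-door" before "pick-up-key" is mechanically rejected, regardless of how plausible it sounded in prose.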

Exposing Planning Operations as MCP Tool Calls

The Model Context Protocol (MCP) provides a clean interface for this pattern. Each planning operation becomes a named tool that the LLM can invoke during its reasoning loop:

Tools registered with the MCP server:

- get_state()                → returns current PDDL state as structured JSON
- get_applicable_actions()   → lists actions whose preconditions are currently met
- apply_action(action, args) → attempts the action; returns new state or error
- get_goal_status()          → checks whether any goal condition is satisfied
- reset()                    → resets the simulation to the initial state

With these tools in its schema, a standard tool-calling LLM can run an interactive search policy: inspect the state, enumerate options, pick one, observe the result, and repeat. This is essentially a tree search guided by language-model intuition, with the PDDL engine pruning illegal branches automatically.

┌─────────────────────────────────────────────────────────┐
│                     LLM Agent Loop                      │
│                                                         │
│  1. get_state()  ──────────────────────────────────┐   │
│                                                    ▼   │
│  2. get_applicable_actions() ◄── current state ────┤   │
│                                                    │   │
│  3. LLM selects action (semantic reasoning)        │   │
│                                                    │   │
│  4. apply_action(chosen_action)                    │   │
│         │                                          │   │
│         ├── success → new state ───────────────────┘   │
│         └── failure → error message → replan           │
│                                                         │
│  5. get_goal_status() → done? → exit : repeat          │
└─────────────┬───────────────────────────────────────────┘
              │ MCP tool calls

┌──────────────────────────┐
│  PDDL Simulation Engine  │
│  - Domain model          │
│  - Current world state   │
│  - Precondition checker  │
│  - Effect applicator     │
└──────────────────────────┘
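The loop in the diagram can be sketched as runnable Python, with the LLM's selection step (step 3) replaced by a trivial first-applicable-action policy so the example is self-contained; the toy box-pushing domain is invented:

```python
# Steps 1-5 of the agent loop over a toy domain. A real agent would have
# an LLM make the choice at step 3 and would handle dead ends (no
# applicable actions) explicitly.
state = {"(robot-at a)", "(box-at a)"}
GOAL = {"(box-at b)"}
ACTIONS = {  # name -> (preconditions, add effects, delete effects)
    "push-box-a-b": ({"(robot-at a)", "(box-at a)"},
                     {"(robot-at b)", "(box-at b)"},
                     {"(robot-at a)", "(box-at a)"}),
    "move-b-a": ({"(robot-at b)"}, {"(robot-at a)"}, {"(robot-at b)"}),
}

def get_applicable_actions():
    return [n for n, (pre, _, _) in ACTIONS.items() if pre <= state]

def apply_action(name):
    pre, add, delete = ACTIONS[name]
    missing = pre - state
    if missing:
        return {"ok": False, "error": f"unmet preconditions: {sorted(missing)}"}
    state.difference_update(delete)
    state.update(add)
    return {"ok": True, "state": sorted(state)}

def get_goal_status():
    return GOAL <= state

trace = []
for _ in range(10):                    # bounded loop over steps 1-5
    if get_goal_status():              # 5. goal reached -> exit
        break
    options = get_applicable_actions() # 1-2. inspect state, enumerate
    chosen = options[0]                # 3. placeholder for LLM choice
    trace.append((chosen, apply_action(chosen)))  # 4. apply, observe
print(trace)
```

Swapping the placeholder policy for an LLM tool-call turn changes nothing structurally: the model sees the same JSON the engine returns here and emits the same kind of choice.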

Engineering Considerations

Domain authoring overhead. PDDL domains still need to be written. For narrow, well-defined task categories—helpdesk ticket routing, supply chain operations, CI/CD pipelines—this is tractable. For truly open-ended agents, you either need a very broad domain or a mechanism to extend it dynamically. One practical approach is to start with a coarse domain that covers the major object types and actions in your system, then add predicates incrementally as edge cases surface.

State representation at the boundary. The PDDL state is symbolic (predicates and objects), while the LLM operates in natural language. The MCP server needs to serialize state into a format the model can reason about—structured JSON works well. Equally important is deserializing the model’s action choice back into valid PDDL syntax; providing a strict schema for apply_action arguments avoids a whole class of parse errors.
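One way to enforce that strict schema at the boundary is a hand-rolled validator (a production server might use JSON Schema or Pydantic instead; the action names, arities, and ground-action syntax below are invented for illustration):

```python
# Deserialize the model's apply_action payload into validated PDDL
# ground-action syntax, rejecting malformed input before it reaches
# the engine.
import json
import re

KNOWN_ACTIONS = {"pick-up": 1, "move": 2}   # action name -> arity (toy)
NAME_RE = re.compile(r"^[a-z][a-z0-9-]*$")  # lowercase PDDL-ish identifiers

def parse_action_call(raw: str) -> str:
    payload = json.loads(raw)
    if set(payload) != {"action", "args"}:
        raise ValueError("payload must have exactly 'action' and 'args'")
    action, args = payload["action"], payload["args"]
    if action not in KNOWN_ACTIONS:
        raise ValueError(f"unknown action: {action}")
    if not isinstance(args, list) or len(args) != KNOWN_ACTIONS[action]:
        raise ValueError(f"{action} expects {KNOWN_ACTIONS[action]} argument(s)")
    for a in args:
        if not (isinstance(a, str) and NAME_RE.match(a)):
            raise ValueError(f"invalid object name: {a!r}")
    # Re-serialize into a PDDL ground action for the engine.
    return f"({action} {' '.join(args)})"

print(parse_action_call('{"action": "pick-up", "args": ["key"]}'))
```

Every rejection here is a parse error the engine never has to see, and the error messages double as corrective feedback for the model's next attempt.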

Partial observability. Real environments are not fully observable. You can model this by having get_state() return only the predicates the agent can currently “see,” forcing the model to reason under uncertainty—much closer to real deployment conditions than a fully observable simulation.
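A simple sketch of this filtering, assuming an invented convention where visibility is determined by the agent's current room (real visibility rules would be domain-specific):

```python
# get_state() under partial observability: only predicates mentioning
# the agent's current location are exposed to the model.
FULL_STATE = {
    "(agent-at room1)",
    "(key-at room1)",
    "(box-at room2)",        # hidden: the agent is not in room2
    "(door room1 room2)",
}

def visible_state(state, location):
    return sorted(p for p in state if location in p)

print(visible_state(FULL_STATE, "room1"))
```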

Tip

For debugging, log every apply_action call and its result. This gives you a complete, deterministic trace of every state transition the agent caused—far easier to audit than a raw token log.
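One lightweight way to capture that trace is to wrap the tool handler in a logging decorator (the engine here is stubbed; only the wrapping pattern is the point):

```python
# Append every apply_action call and its result to a structured trace.
import time

trace = []

def apply_action(action):  # stub engine standing in for the real one
    return {"ok": False, "error": f"unmet preconditions for {action}"}

def traced(fn):
    def wrapper(action):
        result = fn(action)
        trace.append({"t": time.time(), "action": action, "result": result})
        return result
    return wrapper

apply_action = traced(apply_action)
apply_action("(open-door)")
print(trace[-1])
```

Because every state transition flows through this one chokepoint, the trace is a complete, replayable record of what the agent actually did.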

Failure handling. When apply_action returns a precondition failure, the model receives an explicit error message naming the unmet predicate. This is more actionable than a vague “I couldn’t do that.” Prompt the model to treat these failures as first-class signals: “The action failed because (holding key) is not true. What do I need to do first?”
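The translation from engine rejection to first-class prompt feedback can be as small as a formatter like this (the message template follows the article's example; the error shape is illustrative):

```python
# Turn a precondition failure into an actionable feedback message
# appended to the model's context.
def failure_feedback(action, missing_predicates):
    preds = ", ".join(missing_predicates)
    return (f"The action {action} failed because {preds} is not true. "
            "What do I need to do first?")

print(failure_feedback("(open-door)", ["(holding key)"]))
```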

When to Use This Pattern

This architecture is worth adopting when your agent operates in a domain with well-defined resources, locations, or workflow states—anything that maps cleanly onto PDDL objects and predicates. Manufacturing scheduling, data pipeline orchestration, multi-step form completion, and structured game-playing are natural fits. It is less appropriate for tasks that are fundamentally open-ended or where the action space cannot be enumerated in advance.

The broader principle generalizes beyond PDDL specifically: any formal state machine or constraint checker can serve as the validation layer, with MCP providing the tool-call bridge. PDDL happens to be a mature, expressive formalism with good tooling, but the same pattern applies to finite-state automata, typed workflow graphs, or domain-specific rule engines. The LLM brings flexible reasoning; the formal substrate brings correctness guarantees. Neither alone is sufficient for robust production planning.

Tags: research, planning, tool-use, mcp, pddl, task-planning, agent-architecture

This article is an AI-generated summary. Read the original paper: Agentic LLM Planning via Step-Wise PDDL Simulation: An Empirical Characterisation.