Lambda Just Learned to Remember: Why Durable Execution Changes Everything

Gautam Singh | Feb 26, 2026

The Problem That Wouldn't Go Away

Here's a story every serverless developer knows by heart.

You build a workflow. Maybe it processes a document: extract text, call an LLM for entity extraction, enrich from a database, generate a summary, email the result. Five steps. Each one depends on the last. Each LLM call costs real money and takes 10-30 seconds.

You deploy it as a single Lambda function. It works beautifully in testing.

Then production happens.

The LLM times out at step three. Lambda retries. But "retry" means starting from scratch. Steps one and two run again. The LLM gets called again. You pay again. If you're unlucky and the timeout happens three times in a row, you've now spent 3x the cost and 3x the time for a single document.

The instinctive fix is to add manual checkpointing. After each step, write the result to DynamoDB. On failure, load the last checkpoint and resume. Sounds reasonable until you build it. Your clean 50-line workflow balloons into 200+ lines of state management boilerplate: try/except blocks, state serialization, idempotency checks, custom retry logic. And that's for one workflow. Multiply by twenty across your system and you've built a homegrown orchestration framework nobody wants to maintain.
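Even a compressed sketch makes the ceremony visible. The names below (`checkpoint_table`, `checkpointed`, `run_workflow`) are hypothetical, and a plain dict stands in for DynamoDB; a production version would also need serialization, idempotency keys, retry policy, and TTL cleanup.

```python
# Sketch of the manual-checkpointing pattern. A dict stands in for a
# DynamoDB table keyed by (doc_id, step_name); all names are illustrative.
checkpoint_table = {}

def checkpointed(doc_id, step_name, fn):
    key = (doc_id, step_name)
    if key in checkpoint_table:          # resume: skip completed work
        return checkpoint_table[key]
    result = fn()                        # may raise; caller handles retry
    checkpoint_table[key] = result       # persist before moving on
    return result

def run_workflow(doc_id):
    text = checkpointed(doc_id, "extract", lambda: f"text:{doc_id}")
    summary = checkpointed(doc_id, "summarize", lambda: f"summary of {text}")
    return summary
```

Even this toy version roughly doubles the line count of the business logic, and it still has no backoff, no concurrent-writer protection, and no cleanup.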

The other option? Step Functions. Define your workflow in JSON using Amazon States Language. It works, and it's battle-tested. But now your logic lives in a JSON state machine separate from your application code. Testing requires deploying to AWS. Adding a step means editing JSON, not Python. For workflows that are 90% Lambda code with a bit of orchestration, it feels like using a sledgehammer to hang a picture frame.

What teams really wanted was something simpler: what if Lambda itself could just remember where it left off?


Enter Durable Execution

That's exactly what AWS delivered. Lambda Durable Functions extend the familiar Lambda programming model with three core capabilities:

Checkpointing. Wrap any operation in context.step() and its result is automatically persisted. If your function fails and Lambda retries it, the SDK replays from the beginning but skips every completed step by returning its stored result. No re-execution. No re-cost.

Suspension. Call context.wait() and your function pauses. The Lambda invocation terminates. You pay zero compute during the wait. When the wait expires (or an external event arrives), Lambda re-invokes your function, replays the completed steps in milliseconds, and continues from where it paused.

Parallel fan-out. Use context.map() to execute operations concurrently. Each branch checkpoints independently. If branch three of five fails, only branch three retries. The other four results are already safe.

The mental model is elegant. Think of your function as a recipe with numbered steps, and the SDK as a notebook that records the result of each step. Every time the kitchen fires up (every Lambda invocation), the chef starts reading from step one. But when they reach a step that's already in the notebook, they skip it and use the recorded result. When they reach a step that's not in the notebook, they execute it for real, record the result, and continue.

That's it. That's the whole model.
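The notebook model can be simulated in a few lines of plain Python. This is a sketch of the replay semantics only; the real SDK's internals (durable storage, versioning, concurrency) are far more involved.

```python
# Toy replay engine: `notebook` plays the role of the durable step log,
# and `executions` records which steps actually ran (vs. replayed).
notebook = {}
executions = []

def step(name, fn):
    if name in notebook:              # already in the notebook: replay
        return notebook[name]
    executions.append(name)           # not recorded yet: execute for real
    notebook[name] = fn()
    return notebook[name]

def handler(fail_at=None):
    a = step("one", lambda: 1)
    if fail_at == "two":
        raise RuntimeError("transient failure")
    b = step("two", lambda: a + 1)
    return a + b

try:
    handler(fail_at="two")            # first invocation fails after step one
except RuntimeError:
    pass
result = handler()                    # retry replays step one, runs step two
```

After both invocations, step "one" has executed exactly once even though the handler ran twice from the top. That is the entire trick.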


What It Looks Like in Code

Let's revisit our five-step document processor. Here's the durable version:

```python
from aws_durable_execution import durable_execution

@durable_execution
def handler(event, context):
    doc_id = event["document_id"]

    # Step 1: Extract text (checkpointed automatically)
    text = context.step("extract_text",
        lambda: extract_text(doc_id))

    # Step 2: Entity extraction (auto-retry with backoff)
    entities = context.step("extract_entities",
        lambda: call_bedrock(f"Extract entities: {text}"),
        retry={"max_attempts": 3, "backoff_rate": 2})

    # Step 3: Enrich from database
    enriched = context.step("enrich",
        lambda: enrich_from_database(entities))

    # Step 4: Generate summary (auto-retry)
    summary = context.step("summarize",
        lambda: call_bedrock(f"Summarize: {enriched}"),
        retry={"max_attempts": 3, "backoff_rate": 2})

    # Step 5: Email the result
    context.step("notify",
        lambda: send_email(event["user_email"], summary))

    return {"status": "complete", "doc_id": doc_id}
```

Same five steps. Same business logic. But now if step four fails, steps one through three are skipped on retry. The LLM call from step two is not repeated. Its cost is not re-incurred. And you didn't write a single line of state management code.

Compare that to the 200+ lines of DynamoDB checkpointing boilerplate, or the seven files of Step Functions configuration. The difference in developer experience is hard to overstate.


The Timing Isn't Coincidental

You might be wondering: Lambda has been around since 2014. Why did AWS wait eleven years to add durability?

Because the workloads that desperately need it didn't exist until now.

The explosion of GenAI agents in 2025 created an entirely new class of problem. Consider a typical AI agent workflow: receive a query, call an LLM to decompose it into sub-tasks (30 seconds, $0.05), fan out five parallel research tasks against different APIs (2 minutes, $0.25 total), synthesize results with another LLM call (45 seconds, $0.08), then wait for a human to approve the report before sending it.

That last part is the kicker. The human approval might take hours. Or days. Traditional Lambda can't hold a connection open for hours. You'd need to break the workflow apart, store state externally, and wire up a callback mechanism. With durable functions, it's one line:

```python
approval = context.wait(
    duration=Duration.from_days(7),
    name="await_approval"
)
```

Zero compute charges during the wait. The function resumes seamlessly when the approver clicks a button.
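The suspend/resume semantics can also be sketched in plain Python. Raising `Suspended` stands in for the invocation terminating; the real service re-invokes the function later with zero compute in between. All names here are illustrative, not the SDK's.

```python
import time

# Toy model of suspension: a recorded wait with a deadline.
notebook = {}

class Suspended(Exception):
    pass

def step(name, fn):
    if name not in notebook:
        notebook[name] = fn()
    return notebook[name]

def wait(name, deadline):
    if notebook.get(name):            # wait already satisfied on a prior run
        return
    if time.time() >= deadline:
        notebook[name] = True         # deadline passed: record and continue
        return
    raise Suspended(name)             # pause: no compute until re-invocation

def handler(deadline):
    plan = step("plan", lambda: "draft report")
    wait("await_approval", deadline)
    return {"plan": plan, "approved": True}

deadline = time.time() + 0.05
try:
    handler(deadline)                 # first invocation suspends at the wait
except Suspended:
    pass
time.sleep(0.06)                      # wall-clock time passes, nothing billed
result = handler(deadline)            # replays "plan" instantly, then resumes
```

The second invocation re-reads the notebook, skips "plan" in microseconds, finds the wait satisfied, and carries on as if nothing had happened.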

If you need market validation that durable execution has arrived, look at Temporal. The leading durable execution platform raised $300 million at a $5 billion valuation in early 2026. Their customer list includes OpenAI (AI agent workflows), Netflix (media processing), and JPMorgan (financial orchestration). AWS launching native durable execution inside Lambda is the clearest signal yet: this pattern has gone mainstream.

Temporal validated the concept. AWS commoditized it.


The One Rule You Can't Break

There's a catch, and it's an important one. Because your function replays from the beginning on every invocation, your code must be deterministic. The sequence of context.step() calls must be identical every time.

This means no random values outside of steps. No time.time() calls between steps. No conditional branches based on data that changes between invocations.

The fix is simple: wrap all non-deterministic logic inside steps. A step result is recorded once and replayed consistently forever after.

```python
import time
import uuid

# BAD: timestamp changes on every replay
timestamp = time.time()
context.step("process", lambda: do_work(timestamp))

# GOOD: timestamp captured inside a step, stable across replays
metadata = context.step("init", lambda: {
    "timestamp": time.time(),
    "request_id": str(uuid.uuid4())
})
context.step("process", lambda: do_work(metadata))
```

Once you internalize this rule, everything else flows naturally. Step names must be unique and stable. Results must be serializable. Side effects belong inside steps. That's the full contract.
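The serializability part of the contract is easy to check up front. Assuming the SDK persists step results in a JSON-compatible form (an assumption; the article doesn't specify the wire format), a hypothetical pre-flight guard might look like:

```python
import json

# Hypothetical pre-flight check for step results, assuming JSON-compatible
# persistence. Anything json.dumps rejects (sets, open file handles,
# arbitrary objects) would fail to checkpoint.
def assert_serializable(value):
    json.dumps(value)   # raises TypeError if the value cannot round-trip
    return value

assert_serializable({"doc_id": "abc", "entities": ["Acme", "2024"]})  # fine
try:
    assert_serializable({"entities": {"Acme"}})   # a set is not JSON
    ok = False
except TypeError:
    ok = True
```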


When to Use What

Durable Lambda doesn't replace Step Functions. It complements them. Here's the quick decision framework I use with my teams:

Use Durable Lambda when your workflow is 90%+ Lambda code, when you want to test locally with standard unit tests, when you're building AI agent pipelines, or when you need long-running workflows (up to a year) with code-first logic.

Use Step Functions when you're orchestrating multiple AWS services (S3, MediaConvert, SES, DynamoDB), when you need a visual workflow designer for cross-team collaboration, or when your workflow is primarily service-to-service coordination rather than custom code.

Use both when you need macro-level orchestration between AWS services (Step Functions) with complex micro-logic within individual steps (Durable Lambda).

The key insight: Step Functions orchestrates services. Durable Lambda orchestrates code. Pick the tool that matches your workflow's center of gravity.