# Temporal Time-Skipping: The Clock You Didn't Know Existed


I spent hours debugging a workflow that worked perfectly in production but exploded in the test environment. The workflow ran fine. The activities completed. The signals were correct. But every single test failed with a `TimeoutError`: the workflow would just... die, before my test code even had a chance to interact with it.

The culprit was a clock I didn't know existed.

This post is about the mental model that finally made Temporal's time-skipping test server make sense to me. If you've ever been confused by `WorkflowEnvironment.start_time_skipping()`, or had tests fail mysteriously with timeouts that don't happen in production, this is for you.

---

## What is Temporal? (The 30-Second Version)

Temporal is a workflow orchestration engine. You define your business logic as a "workflow" — a sequence of steps — and Temporal takes care of running it reliably. If a step fails, Temporal retries it. If your server crashes, Temporal picks up where it left off. It handles all the ugly stuff: retries, timeouts, state persistence, distributed coordination.

The important thing for this post: Temporal runs your workflow inside its own runtime environment. Your workflow code doesn't just execute like a normal Python function. It runs _inside Temporal_, and Temporal manages when things happen.

That distinction — "runs inside Temporal" — is the root of everything that follows.

---

## Temporal is a Timekeeper

Here's the insight that changed everything for me: **Temporal doesn't skip your code. It only controls its own clock.**

I had been thinking about Temporal as some kind of execution engine that speeds up my code. It's not. Temporal is a **timekeeper**. Think of it like a kernel — a central dispatcher that keeps a timetable of scheduled events.

When your workflow calls `workflow.sleep(3600)` (sleep for one hour), Temporal doesn't somehow make your Python code run faster. What it does is add an entry to its internal timetable:

```
"Wake up workflow ABC at current_time + 3600 seconds"
```

When your workflow starts an activity with a 30-second timeout, Temporal adds:

```
"If activity XYZ hasn't returned by current_time + 30 seconds, fail it"
```

When your workflow has an execution timeout of 10 minutes:

```
"If workflow ABC isn't done by current_time + 600 seconds, kill it"
```

Sleeps, timeouts, activity deadlines, heartbeat intervals — from Temporal's perspective, they're all the same thing. They're entries in a timetable. Each one says: "fire event X at time T." The only difference is what event gets fired — resume the workflow, fail an activity, kill the whole thing. But the mechanism is identical: a timestamp and an action.
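To make the "timestamp and an action" idea concrete, here is a minimal toy model of such a timetable. It is not Temporal's implementation; the `Entry` and `Timetable` names are made up for illustration, and the three entries mirror the examples above.

```python
from __future__ import annotations

import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Entry:
    fire_at: float                      # timestamp on the server's clock, in seconds
    action: str = field(compare=False)  # what happens when that time arrives


class Timetable:
    """Toy timetable: a min-heap of entries ordered by timestamp."""

    def __init__(self) -> None:
        self.now = 0.0
        self._entries: list[Entry] = []

    def schedule(self, delay: float, action: str) -> None:
        heapq.heappush(self._entries, Entry(self.now + delay, action))

    def next_entry(self) -> Entry | None:
        return self._entries[0] if self._entries else None


timetable = Timetable()
timetable.schedule(3600, "resume workflow ABC")   # a one-hour sleep
timetable.schedule(30, "fail activity XYZ")       # an activity timeout
timetable.schedule(600, "time out workflow ABC")  # an execution timeout

print(timetable.next_entry().action)  # fail activity XYZ (the earliest deadline)
```

A sleep and a timeout end up in the same heap. Only the action string differs.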

This is the foundation for understanding time-skipping.

---

## Time-Skipping: Fast-Forwarding Idle Time

Temporal's Python SDK gives you two test environments:

- **`WorkflowEnvironment.start_time_skipping()`**: downloads a lightweight test server binary and runs it as a local process (no Docker needed). This server has a virtual clock that can jump forward.
- **`WorkflowEnvironment.start_local()`**: runs a local Temporal dev server with a real clock, so timers take as long as they would in production.

The time-skipping server does one clever thing: **when nothing is happening, it fast-forwards its clock to the next scheduled event.**

Imagine your workflow does this:

```python
await workflow.sleep(3600)   # sleep 1 hour
await do_some_activity()
await workflow.sleep(7200)   # sleep 2 hours
```

In production, this takes 3 hours of wall-clock time (plus however long the activity takes). With time-skipping, and with the test awaiting the workflow's result, here's what happens:

1. Workflow calls `sleep(3600)`. Temporal adds to its timetable: "resume at now + 3600s."
2. Nothing else is pending. The server jumps its clock forward 3600 seconds instantly.
3. Timer fires. Workflow resumes. Activity starts.
4. Activity is running — the server **does not skip**, because it's waiting for a real result.
5. Activity completes. Workflow calls `sleep(7200)`. Timetable entry added.
6. Nothing pending. Server jumps forward 7200 seconds.
7. Timer fires. Workflow finishes.

Total real time: however long the activity took (maybe milliseconds if it's a mock). The 3 hours of sleeping? Gone. Skipped.
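The seven steps above can be traced with a toy virtual clock. This is a sketch of the idea only, not how the test server is built; `SkippingClock` and `run_workflow` are hypothetical names, and the activity is reduced to a log line because real work is exactly the part that does not get skipped.

```python
from __future__ import annotations

import heapq


class SkippingClock:
    """Toy virtual clock: when idle, jump straight to the next timer."""

    def __init__(self) -> None:
        self.now = 0.0                  # virtual seconds since start
        self._timers: list[float] = []  # min-heap of wake-up times

    def start_timer(self, seconds: float) -> None:
        heapq.heappush(self._timers, self.now + seconds)

    def skip_to_next_timer(self) -> bool:
        """Advance `now` to the earliest timer; return False if none is pending."""
        if not self._timers:
            return False  # nothing on the timetable, so nothing to skip
        self.now = heapq.heappop(self._timers)
        return True


def run_workflow(clock: SkippingClock) -> list[str]:
    log = []
    clock.start_timer(3600)       # sleep 1 hour
    clock.skip_to_next_timer()    # idle, so the clock jumps
    log.append(f"woke at {clock.now:.0f}s")
    log.append("activity ran")    # real work takes real time, no skipping
    clock.start_timer(7200)       # sleep 2 hours
    clock.skip_to_next_timer()
    log.append(f"woke at {clock.now:.0f}s")
    return log


print(run_workflow(SkippingClock()))
# ['woke at 3600s', 'activity ran', 'woke at 10800s']
```

The virtual clock ends at 10800 seconds (3 hours), while the script itself finishes almost instantly.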

This leads to a really important corollary that solidified my understanding:

**If your code never sleeps, never waits, and never sets timeouts, time-skipping gives you zero speedup.**

Think about it. Time-skipping only fast-forwards idle time on Temporal's clock. If there's no idle time — no timers, no sleeps, no timeouts — there's nothing to skip. Your workflow would run at the exact same speed with time-skipping as without it.

Of course, that's a theoretical extreme. Real workflows almost always have timeouts, retry intervals, and sleeps. But the principle is clarifying: time-skipping is not about making your code faster. It's about eliminating wait time between events on Temporal's timetable.

---

## The Handle: Your Remote Control

When you start a workflow, Temporal gives you back a **handle**. Think of it like a restaurant ticket — you placed your order (started the workflow), and now you have a ticket to interact with it while it's being prepared.

```python
handle = await client.start_workflow(
    MyWorkflow.run,
    inputs,
    id="order-123",
    task_queue="kitchen",
)
```

The workflow is now running independently inside Temporal's runtime. The handle is your remote control. It has a few buttons:

- **`handle.result()`** — "Call me when my food is ready." You sit on the phone, waiting, until the workflow finishes and gives you the result.
- **`handle.query()`** — "Hey, how's my order coming?" A quick question. You get an answer immediately and hang up.
- **`handle.signal()`** — "Actually, add extra cheese." You send a message to the running workflow. It gets delivered immediately.
- **`handle.cancel()`** — "Cancel my order."

The handle is just `temporalio.client.WorkflowHandle` — nothing magical. But understanding what each button _does_ to the time-skipping server is where it gets interesting.

---

## `result()` vs `query()` — The Trap

This is where my intuition was completely backwards, and getting it right is what finally unblocked me.

**`handle.result()`** is a phone call where you say: _"Don't hang up until my food is ready."_ You're blocking. You're waiting. And critically, you're telling the time-skipping server: **"I'm waiting for this workflow to finish."** The server hears that and thinks: "They want the result. Let me help by fast-forwarding to when it's done."

**`handle.query()`** is a phone call where you say: _"Is my food ready? No? Okay, bye."_ One shot. You get the current state, and then you're done. The server has no reason to fast-forward anything — you didn't ask it to wait for anything.

Here's the part that tripped me up: **`result()` is event-based (wait for the "done" event), and `query()` is polling (check the current state and return immediately).** You'd think the event-based approach is the "better" one — and in normal programming, it usually is. But in time-skipping mode, the polling approach is safer, because it doesn't give Temporal permission to mess with its clock.

| Operation         | What it says to Temporal       | Triggers time-skipping?                     |
| ----------------- | ------------------------------ | ------------------------------------------- |
| `handle.result()` | "Wake me up when it's done"    | **Yes** — server tries to _make_ it be done |
| `handle.query()`  | "What's the status right now?" | **No** — server just answers                |
| `handle.signal()` | "Deliver this message"         | **No** — immediate delivery                 |

That last column is the whole story. `result()` gives the server license to fast-forward. `query()` doesn't.
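Here is the table as a toy model, assuming a server whose clock only moves while someone is blocked on `result()`. `ToyServer` is a made-up class for illustration and has nothing to do with the real `WorkflowHandle` API beyond borrowing the method names.

```python
from __future__ import annotations


class ToyServer:
    """Toy model: the clock only moves while a caller is waiting on result()."""

    def __init__(self, timers: dict[str, float]) -> None:
        self.now = 0.0
        self.timers = dict(timers)  # event name -> virtual time it fires

    def query(self) -> dict:
        # Passive read: report the current state and leave the clock alone.
        return {"now": self.now, "pending": sorted(self.timers)}

    def result(self) -> str:
        # Blocking wait: skip to each pending event in time order until done.
        while self.timers:
            name = min(self.timers, key=self.timers.get)
            self.now = self.timers.pop(name)
            if name == "workflow complete":
                return "done"
        raise RuntimeError("workflow can never finish")


server = ToyServer({"timer": 3600.0, "workflow complete": 3600.5})
server.query()
server.query()
print(server.now)       # 0.0 (queries never move the clock)
print(server.result())  # done
print(server.now)       # 3600.5 (result() skipped straight through the wait)
```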

---

## The Race Condition: Death by Time Travel

Now you have all the pieces. Let me show you how they combine to create a very confusing bug.

### The Setup

Our workflow has steps that need user feedback. After a step runs (say, analyzing a document), the workflow pauses and waits for the user to review the output and click "Continue." In production, this might take minutes or hours — the user is reading, thinking, editing.

In the code, this waiting looks like:

```python
# backend/src/genai/temporal/workflow_executor.py, line 149
await workflow.wait_condition(lambda: waiter.signal_received)
```

This tells Temporal: "Pause here until `signal_received` becomes `True`." There's no timeout — it waits as long as it takes. In production, that's fine. The user eventually clicks Continue, a signal is sent, `signal_received` flips to `True`, and the workflow resumes.

### The Exam Analogy

Imagine you're a teacher proctoring an exam with a 2-hour time limit. You have a clock on the wall.

**Production (real clock):** A student raises their hand. "I need my calculator from my locker." You wait. Someone brings it. The student finishes the exam. The clock says 45 minutes passed. No problem.

**Time-skipping (magic clock):** A student raises their hand. "I need my calculator from my locker." You look around the room. Nobody is actively writing. Nothing is happening. So you spin the magic clock forward — 30 minutes, 1 hour, 1.5 hours, 2 hours. "Time's up! Exam over!" The student fails.

The person bringing the calculator walks in 0.1 real seconds later. But the clock already says 2 hours. Too late.

### What Actually Happened

Here's the exact sequence in our tests:

**Real time 0.00s:** Test starts the workflow. Mock activities return instantly (they're fakes, with no real I/O).

**Real time ~0.02s:** All activities are done. Workflow enters `wait_condition()`, waiting for a feedback signal. At this moment, Temporal's timetable has **no work pending**: no activities running, no timers set. Just a workflow sitting at a `wait_condition`, with a workflow execution timeout 600 seconds away.

**Real time ~0.02s:** Test calls `handle.result()`: "Tell me when the workflow finishes."

The time-skipping server hears this and thinks: _"The client wants the result. Let me check what's pending... No activities. No timers. Nothing to wait for except a `wait_condition` I can't satisfy. But there IS a workflow execution timeout at now + 600 seconds. Let me jump there."_

**Server clock jumps:** 0s → 600s.

Workflow execution timeout fires. Workflow dies with `TimeoutError`.

**Real time ~0.03s:** `handle.result()` fails with an error.

Meanwhile, the test had a polling loop that was _supposed_ to query for pending feedback steps and send signals. That loop was scheduled to run after `asyncio.sleep(0.1)` — at real time 0.1 seconds. But the workflow is already dead. The signals arrive at a corpse.

The whole thing happened in ~30 milliseconds of real time. The server just... jumped to the end.
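The same sequence, boiled down to a toy model. `ToyWorkflow` is a hypothetical stand-in, not Temporal code: it has one signal it is waiting for and a 600-second execution timeout, and calling `result()` before the signal arrives jumps the clock to that timeout.

```python
class ToyWorkflow:
    """Toy model of a workflow blocked on a signal, with an execution timeout."""

    EXECUTION_TIMEOUT = 600.0  # virtual seconds, as in the sequence above

    def __init__(self) -> None:
        self.now = 0.0
        self.signal_received = False
        self.timed_out = False

    def signal(self) -> None:
        if not self.timed_out:  # a signal to a dead workflow changes nothing
            self.signal_received = True

    def result(self) -> str:
        if self.signal_received:
            return "completed"
        # Only the timeout is left on the timetable, so the clock jumps to it.
        self.now = self.EXECUTION_TIMEOUT
        self.timed_out = True
        raise TimeoutError("workflow execution timed out")


# Buggy order: ask for the result first, send the signal afterwards.
buggy = ToyWorkflow()
try:
    buggy.result()
except TimeoutError:
    buggy.signal()  # too late
print(buggy.now, buggy.signal_received)  # 600.0 False

# Working order: signal first, then ask for the result.
fixed = ToyWorkflow()
fixed.signal()
print(fixed.result(), fixed.now)  # completed 0.0
```

The only difference between the two runs is the order of the calls, which is why the bug looked like a race condition.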

### The Fix

The fix is almost anticlimactic once you understand the problem. Don't call `handle.result()` while the workflow is waiting for signals. Instead, use `handle.query()` to poll, send signals when needed, and only call `result()` after confirming the workflow is already done:

```python
# backend/tests/integration/genai/test_temporal_workflow.py, lines 49-77
import asyncio

# JurorAnalysisWorkflow, StepFeedbackSignal, and WorkflowStatus are project types.


async def _run_workflow_with_feedback(handle):
    while True:
        await asyncio.sleep(0.1)    # real-time sleep in the test process

        # query() doesn't trigger time-skipping — clock stays frozen
        progress = await handle.query(JurorAnalysisWorkflow.get_progress)

        # Send signals for any steps waiting for feedback
        for step_id in progress.pending_feedback_steps:
            await handle.signal(
                JurorAnalysisWorkflow.submit_step_feedback,
                StepFeedbackSignal(release_step_id=step_id),
            )

        # Only call result() AFTER the workflow reports it's done
        if progress.status == WorkflowStatus.COMPLETED:
            return await handle.result()  # returns instantly — nothing to skip
```

Why this works:

1. **`query()` doesn't touch the clock.** The workflow stays frozen at its `wait_condition`. No fast-forwarding.
2. **Our polling loop runs in real time.** `asyncio.sleep(0.1)` is a real sleep in the test process — Temporal doesn't control it.
3. **Signals are delivered immediately** regardless of what Temporal's clock says.
4. **By the time we call `result()`, the workflow is already complete.** There's nothing to fast-forward to. It returns instantly.

The workflow's clock never jumps because we never gave the server permission to jump it.
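One caveat with the loop above: if the workflow fails instead of completing, `while True` never exits and the test hangs. A possible hardening is to bound the polling with a real-time deadline. The `poll_until` helper below is my own hypothetical sketch, not part of the Temporal SDK; it assumes the query, signal, and status check are wrapped in a `step` coroutine that returns `None` until the workflow is done.

```python
import asyncio
import time
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def poll_until(
    step: Callable[[], Awaitable[Optional[T]]],
    *,
    timeout: float = 10.0,
    interval: float = 0.1,
) -> T:
    """Call `step` every `interval` real seconds until it returns a non-None value.

    Raises asyncio.TimeoutError after `timeout` real seconds, so a workflow that
    fails or stalls makes the test fail fast instead of hanging.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        outcome = await step()
        if outcome is not None:
            return outcome
        await asyncio.sleep(interval)  # real sleep in the test process
    raise asyncio.TimeoutError(f"no result after {timeout} real seconds")
```

In the test, `step` would run one iteration of the loop body: query progress, send any pending signals, and return `await handle.result()` once the status is `COMPLETED`, or `None` otherwise.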

---

## Takeaways

A few rules of thumb I'm keeping in my back pocket:

**The mental model:**

- Temporal is a timekeeper, not an execution engine. It manages a timetable of timers, timeouts, and activity deadlines.
- Time-skipping fast-forwards idle time on Temporal's clock. It doesn't speed up your code.
- If there's nothing on the timetable to skip to, there's nothing to skip.

**The practical rules:**

- Never `await handle.result()` on a workflow that's waiting for external signals. It gives Temporal permission to fast-forward, and the workflow will die before your signals arrive.
- Use `handle.query()` to poll workflow state. Queries are passive reads — they don't affect the clock.
- Signals work immediately in any time mode. They don't care what the server clock says.
- Only call `handle.result()` after you've confirmed the workflow is done via `query()`.

**When to use which test environment:**

- **`start_time_skipping()`**: great for workflows that are self-contained (just timers and activities, no external interaction). Also works for signal-based workflows _if_ you use the query-based polling pattern.
- **`start_local()`**: safer for workflows that require signals/queries during execution, since it runs in real time. But slower, because timers actually wait.

We kept `start_time_skipping()` because the query-based pattern works correctly, and we get free speedup on any timer-based operations (like activity retry intervals and jitter sleeps).
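There is also a middle ground worth knowing about. The Python SDK's `WorkflowEnvironment` has an `auto_time_skipping_disabled()` context manager that suppresses automatic skipping for results awaited inside it. I haven't swapped our tests over to it, so treat the following as a sketch under that assumption. `result_in_real_time` is a hypothetical helper name, and `env` and `handle` are typed as `Any` only so the snippet has no import-time dependency.

```python
from typing import Any


async def result_in_real_time(env: Any, handle: Any) -> Any:
    """Await a workflow result without letting the test server skip time.

    `env` is expected to be a temporalio.testing.WorkflowEnvironment and
    `handle` a temporalio.client.WorkflowHandle.
    """
    # Automatic skipping is suppressed only for results awaited inside this block.
    with env.auto_time_skipping_disabled():
        return await handle.result()
```

With this, another task could keep querying and signaling while `result()` waits in real time. The trade-off is that any timers that fire during that wait also run in real time.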

---

_Disclaimer: Written by Human, improved using AI where applicable._