# Temporal Time-Skipping: The Clock You Didn't Know Existed

I spent hours debugging a workflow that worked perfectly in production but exploded in the test environment. The workflow ran fine. The activities completed. The signals were correct. But every single test failed with a `TimeoutError`: the workflow would just... die, before my test code even had a chance to interact with it.

The culprit was a clock I didn't know existed.

This post is about the mental model that finally made Temporal's time-skipping test server make sense to me. If you've ever been confused by `WorkflowEnvironment.start_time_skipping()`, or had tests fail mysteriously with timeouts that don't happen in production, this is for you.

---

## What is Temporal? (The 30-Second Version)

Temporal is a workflow orchestration engine. You define your business logic as a "workflow" (a sequence of steps), and Temporal takes care of running it reliably. If a step fails, Temporal retries it. If your server crashes, Temporal picks up where it left off. It handles all the ugly stuff: retries, timeouts, state persistence, distributed coordination.

The important thing for this post: Temporal runs your workflow inside its own runtime environment. Your workflow code doesn't just execute like a normal Python function. It runs _inside Temporal_, and Temporal manages when things happen.

That distinction, "runs inside Temporal", is the root of everything that follows.

---

## Temporal is a Timekeeper

Here's the insight that changed everything for me: **Temporal doesn't skip your code. It only controls its own clock.**

I had been thinking about Temporal as some kind of execution engine that speeds up my code. It's not. Temporal is a **timekeeper**. Think of it like a kernel: a central dispatcher that keeps a timetable of scheduled events.

When your workflow calls `workflow.sleep(3600)` (sleep for one hour), Temporal doesn't somehow make your Python code run faster.
What it does is add an entry to its internal timetable:

```
"Wake up workflow ABC at current_time + 3600 seconds"
```

When your workflow starts an activity with a 30-second timeout, Temporal adds:

```
"If activity XYZ hasn't returned by current_time + 30 seconds, fail it"
```

When your workflow has an execution timeout of 10 minutes:

```
"If workflow ABC isn't done by current_time + 600 seconds, kill it"
```

Sleeps, timeouts, activity deadlines, heartbeat intervals: from Temporal's perspective, they're all the same thing. They're entries in a timetable. Each one says: "fire event X at time T." The only difference is what event gets fired: resume the workflow, fail an activity, kill the whole thing. But the mechanism is identical: a timestamp and an action.

This is the foundation for understanding time-skipping.

---

## Time-Skipping: Fast-Forwarding Idle Time

Temporal's Python SDK gives you two test environments:

- **`WorkflowEnvironment.start_time_skipping()`**: downloads a lightweight test server binary and runs it as a local process (no Docker needed). This server has a virtual clock that can jump forward.
- **`WorkflowEnvironment.start_local()`**: runs a full Temporal server with a real clock, just like production.

The time-skipping server does one clever thing: **when nothing is happening, it fast-forwards its clock to the next scheduled event.**

Imagine your workflow does this:

```python
await workflow.sleep(3600)  # sleep 1 hour
await do_some_activity()
await workflow.sleep(7200)  # sleep 2 hours
```

In production, this takes 3 hours of wall-clock time (plus however long the activity takes). With time-skipping, here's what happens:

1. Workflow calls `sleep(3600)`. Temporal adds to its timetable: "resume at now + 3600s."
2. Nothing else is pending. The server jumps its clock forward 3600 seconds instantly.
3. Timer fires. Workflow resumes. Activity starts.
4. Activity is running. The server **does not skip**, because it's waiting for a real result.
5. Activity completes. Workflow calls `sleep(7200)`. Timetable entry added.
6. Nothing pending. Server jumps forward 7200 seconds.
7. Timer fires. Workflow finishes.

Total real time: however long the activity took (maybe milliseconds if it's a mock). The 3 hours of sleeping? Gone. Skipped.

This leads to a really important corollary that solidified my understanding:

**If your code never sleeps, never waits, and never sets timeouts, time-skipping gives you zero speedup.**

Think about it. Time-skipping only fast-forwards idle time on Temporal's clock. If there's no idle time (no timers, no sleeps, no timeouts), there's nothing to skip. Your workflow would run at the exact same speed with time-skipping as without it.

Of course, that's a theoretical extreme. Real workflows almost always have timeouts, retry intervals, and sleeps. But the principle is clarifying: time-skipping is not about making your code faster. It's about eliminating wait time between events on Temporal's timetable.

---

## The Handle: Your Remote Control

When you start a workflow, Temporal gives you back a **handle**. Think of it like a restaurant ticket: you placed your order (started the workflow), and now you have a ticket to interact with it while it's being prepared.

```python
handle = await client.start_workflow(
    MyWorkflow.run,
    inputs,
    id="order-123",
    task_queue="kitchen",
)
```

The workflow is now running independently inside Temporal's runtime. The handle is your remote control. It has a few buttons:

- **`handle.result()`**: "Call me when my food is ready." You sit on the phone, waiting, until the workflow finishes and gives you the result.
- **`handle.query()`**: "Hey, how's my order coming?" A quick question. You get an answer immediately and hang up.
- **`handle.signal()`**: "Actually, add extra cheese." You send a message to the running workflow. It gets delivered immediately.
- **`handle.cancel()`**: "Cancel my order."

The handle is just `temporalio.client.WorkflowHandle`, nothing magical. But understanding what each button _does_ to the time-skipping server is where it gets interesting.

---

## `result()` vs `query()`: The Trap

This is where my intuition was completely backwards, and getting it right is what finally unblocked me.

**`handle.result()`** is a phone call where you say: _"Don't hang up until my food is ready."_ You're blocking. You're waiting. And critically, you're telling the time-skipping server: **"I'm waiting for this workflow to finish."** The server hears that and thinks: "They want the result. Let me help by fast-forwarding to when it's done."

**`handle.query()`** is a phone call where you say: _"Is my food ready? No? Okay, bye."_ One shot. You get the current state, and then you're done. The server has no reason to fast-forward anything, because you didn't ask it to wait for anything.

Here's the part that tripped me up: **`result()` is event-based (wait for the "done" event), and `query()` is polling (check the current state and return immediately).** You'd think the event-based approach is the "better" one, and in normal programming it usually is. But in time-skipping mode, the polling approach is safer, because it doesn't give Temporal permission to mess with its clock.

| Operation         | What it says to Temporal       | Triggers time-skipping?                    |
| ----------------- | ------------------------------ | ------------------------------------------ |
| `handle.result()` | "Wake me up when it's done"    | **Yes**: server tries to _make_ it be done |
| `handle.query()`  | "What's the status right now?" | **No**: server just answers                |
| `handle.signal()` | "Deliver this message"         | **No**: immediate delivery                 |

That middle column is the whole story. `result()` gives the server license to fast-forward. `query()` doesn't.

---

## The Race Condition: Death by Time Travel

Now you have all the pieces. Let me show you how they combine to create a very confusing bug.
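
Before the real thing, here is a toy model of the mechanism in plain Python. It is a sketch under my own simplifying assumptions, not Temporal's implementation: `ToyServer`, `schedule`, `idle_tick`, and the `skipping_unlocked` flag are all hypothetical names, and the model ignores real time passing while skipping is locked. It shows only the one idea that matters: an awaited `result()` lets an idle server jump to the next timetable entry, while queries leave the clock alone.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class ToyServer:
    """Toy stand-in for a time-skipping test server (hypothetical, not Temporal code)."""

    now: float = 0.0                 # virtual clock, in seconds
    skipping_unlocked: bool = False  # True only while a client awaits result()
    timetable: list = field(default_factory=list)  # min-heap of (fire_at, event)
    fired: list = field(default_factory=list)      # events that fired, in order

    def schedule(self, delay: float, event: str) -> None:
        """Add a timetable entry: fire `event` at now + delay."""
        heapq.heappush(self.timetable, (self.now + delay, event))

    def idle_tick(self) -> None:
        """Nothing is running. Jump to the next entry only if skipping is unlocked."""
        if self.skipping_unlocked and self.timetable:
            self.now, event = heapq.heappop(self.timetable)
            self.fired.append(event)


def await_result_first() -> ToyServer:
    """Block on result() while the workflow is parked waiting for a signal."""
    server = ToyServer()
    server.schedule(600, "workflow execution timeout")  # the only timetable entry
    server.skipping_unlocked = True  # result() is being awaited
    server.idle_tick()               # workflow is idle, so the server jumps
    return server


def poll_with_queries() -> ToyServer:
    """Only query the workflow; skipping stays locked the whole time."""
    server = ToyServer()
    server.schedule(600, "workflow execution timeout")
    for _ in range(3):               # three query-style polls
        server.idle_tick()           # idle, but locked: no jump
    return server


if __name__ == "__main__":
    for scenario in (await_result_first, poll_with_queries):
        server = scenario()
        print(scenario.__name__, server.now, server.fired)
```

In the first scenario the virtual clock lands on 600 seconds with the timeout already fired. In the second it never moves, which leaves room for a signal to arrive before anything expires.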
### The Setup

Our workflow has steps that need user feedback. After a step runs (say, analyzing a document), the workflow pauses and waits for the user to review the output and click "Continue." In production, this might take minutes or hours while the user is reading, thinking, editing.

In the code, this waiting looks like:

```python
# backend/src/genai/temporal/workflow_executor.py, line 149
await workflow.wait_condition(lambda: waiter.signal_received)
```

This tells Temporal: "Pause here until `signal_received` becomes `True`." There's no timeout, so it waits as long as it takes. In production, that's fine. The user eventually clicks Continue, a signal is sent, `signal_received` flips to `True`, and the workflow resumes.

### The Exam Analogy

Imagine you're a teacher proctoring an exam with a 2-hour time limit. You have a clock on the wall.

**Production (real clock):** A student raises their hand. "I need my calculator from my locker." You wait. Someone brings it. The student finishes the exam. The clock says 45 minutes passed. No problem.

**Time-skipping (magic clock):** A student raises their hand. "I need my calculator from my locker." You look around the room. Nobody is actively writing. Nothing is happening. So you spin the magic clock forward: 30 minutes, 1 hour, 1.5 hours, 2 hours. "Time's up! Exam over!" The student fails.

The person bringing the calculator walks in 0.1 real seconds later. But the clock already says 2 hours. Too late.

### What Actually Happened

Here's the exact sequence in our tests:

**Real time 0.00s:** Test starts the workflow. Mock activities return instantly (they're fakes, with no real I/O).

**Real time ~0.02s:** All activities are done. Workflow enters `wait_condition()`, waiting for a feedback signal. At this moment, Temporal's timetable has **nothing pending**: no activities running, no timers set. Just a workflow sitting at a `wait_condition`.

**Real time ~0.02s:** Test calls `handle.result()`: "Tell me when the workflow finishes."

The time-skipping server hears this and thinks: _"The client wants the result. Let me check what's pending... No activities. No timers. Nothing to wait for except a `wait_condition` I can't satisfy. But there IS a workflow execution timeout at now + 600 seconds. Let me jump there."_

**Server clock jumps:** 0s → 600s. Workflow execution timeout fires. Workflow dies with `TimeoutError`.

**Real time ~0.03s:** `handle.result()` fails with the timeout error.

Meanwhile, the test had a polling loop that was _supposed_ to query for pending feedback steps and send signals. That loop was scheduled to run after `asyncio.sleep(0.1)`, at real time 0.1 seconds. But the workflow is already dead. The signals arrive at a corpse.

The whole thing happened in ~30 milliseconds of real time. The server just... jumped to the end.

### The Fix

The fix is almost anticlimactic once you understand the problem. Don't call `handle.result()` while the workflow is waiting for signals. Instead, use `handle.query()` to poll, send signals when needed, and only call `result()` after confirming the workflow is already done:

```python
# backend/tests/integration/genai/test_temporal_workflow.py, lines 49-77
import asyncio


async def _run_workflow_with_feedback(handle):
    while True:
        await asyncio.sleep(0.1)  # real-time sleep in the test process

        # query() doesn't unlock time-skipping, so the clock never jumps
        progress = await handle.query(JurorAnalysisWorkflow.get_progress)

        # Send signals for any steps waiting for feedback
        for step_id in progress.pending_feedback_steps:
            await handle.signal(
                JurorAnalysisWorkflow.submit_step_feedback,
                StepFeedbackSignal(release_step_id=step_id),
            )

        # Only call result() AFTER the workflow reports it's done
        if progress.status == WorkflowStatus.COMPLETED:
            return await handle.result()  # returns instantly; nothing to skip
```

Why this works:

1. **`query()` doesn't touch the clock.** The workflow stays parked at its `wait_condition`. No fast-forwarding.
2. **Our polling loop runs in real time.** `asyncio.sleep(0.1)` is a real sleep in the test process, and Temporal doesn't control it.
3. **Signals are delivered immediately** regardless of what Temporal's clock says.
4. **By the time we call `result()`, the workflow is already complete.** There's nothing to fast-forward to. It returns instantly.

The workflow's clock never jumps because we never gave the server permission to jump it.

---

## Takeaways

A few rules of thumb I'm keeping in my back pocket:

**The mental model:**

- Temporal is a timekeeper, not an execution engine. It manages a timetable of timers, timeouts, and activity deadlines.
- Time-skipping fast-forwards idle time on Temporal's clock. It doesn't speed up your code.
- If there's nothing on the timetable to skip to, there's nothing to skip.

**The practical rules:**

- Never `await handle.result()` on a workflow that's waiting for external signals. It gives Temporal permission to fast-forward, and the workflow will die before your signals arrive.
- Use `handle.query()` to poll workflow state. Queries are passive reads, so they don't affect the clock.
- Signals work immediately in any time mode. They don't care what the server clock says.
- Only call `handle.result()` after you've confirmed the workflow is done via `query()`.

**When to use which test environment:**

- **`start_time_skipping()`**: great for workflows that are self-contained (just timers and activities, no external interaction). Also works for signal-based workflows _if_ you use the query-based polling pattern.
- **`start_local()`**: safer for workflows that require signals/queries during execution, since it runs in real time. But slower, because timers actually wait.

We kept `start_time_skipping()` because the query-based pattern works correctly, and we get free speedup on any timer-based operations (like activity retry intervals and jitter sleeps).

---

_Disclaimer: Written by Human, improved using AI where applicable._