Query Suspend and Resume

Reading notes about the paper “Query Suspend and Resume”.

Why we need query suspend and resume besides checkpoint?

It is common to handle mixed workloads which usually consist of high-priority and low-priority, especially in multi-tenant environments. Then, it is important for DBMS to be able to quickly suspend long-running, low-priority queries when high-priority queries arrive.
Running quereis sometimes need to be migrated to other resources, and in that case these queries must release control (to be suspended) quickly.
The infrastructure or servers, which run enterprise comptuing systems, will reboot regularly due to the benefits of software rejuvenation. Software Rejuvenation: the concept of periodically stopping the running software, cleaning its internal state through garbage collection, defragmentation, flushing operating system kernel tables and reinitializing internal data structures, and restarting it. So, such reboot won’t aware any running queries, and if no efficient suspend and resume mechanism, we may lose significant computing progress.
Similar to point 3, DBMS system need to be maintained reguarly, and query suspend can provide more flexibility in scheduling jobs, such as, quickly suspend it and preempt resources then resume it when the resources are available.

Assumptions

The resumed query uses the same plan and operator-memory allocations as the suspended one (i.e., we ignore the possibility of re-optimizing the query on resume).
Both the suspended and resumed query plans see the exactly same physical database state, including the order of tuples already processed.
A query plan is executed as a single thread of computation in an iterator-based fashion (e.g., no exchange operators). So, it is not for multi-threads execution or parallism.
The DBMS do not always know in advance when or whether a suspend request will arrive.

Method

Terminology

Checkpoints: A checkpoint, $Ckpt$, for an operator $O$ at time $t$ contains all the information necessary to later restore $O$’s execution state as of time $t$.

Contracts: A contract is an agreement between a parent operator $P$ and a child operator $Q$. Let $r_1 , \cdots , r_n$ be the first-to-last sequence of tuples output by $Q$ to $P$ if they run to completion without being suspended. When a contract $Ctr$ is established between $P$ and $Q$ just before tuple $r_i$ is output by $Q$, $Q$ agrees to be able to regenerate, at any later point in time when $Ctr$ is enforced by $P$, tuples $r_i, \cdots, r_n$ in order.

Contract Graph: For each query, a runtime in-memory data structure called contract graph is maintained. A contract graph $G$ is a directed acyclic graph with checkpoints as nodes and contracts as edges. Suppose that a parent $P$ creates a checkpoint $Ckpt_1$ and at the same time establishes a contract $Ctr$ with its child $Q$, while $Q$ signs $Ctr$ and determines that it would rely on its own checkpoint $Ckpt_2$ to fulfill $Ctr$. At this point, we would record in $G$ a directed edge $Ctr$ from $Ckpt_1$ to $Ckpt_2$.

Suspend Plan: A suspend plan specifies one of two possible actions, DumpState or GoBack, for each operator.

DumpState: One of the two suspend plan choice for each operator. If Suspend()

For the root query operator $O_r$, $DumpState$ writes its heap state to disk, records its disk location in SuspendedQuery, and adds its current control state—needed to resume from exactly the suspend point—to SuspendedQuery. For the other operator $P$ (i.e., when Suspend(Ctr) is called on $p$), $P$ writes its heap state to disk, records its disk location in SuspendedQuery, adds the content of $Ctr$ (the control state recorded earlier when signing $Ctr$) to SuspendedQuery. This information will allow $P$ to resume execution from the point when contract Ctr was signed.

GoBack: One of the two suspend plan choice for each operator.

Execute Phase

Stateful operators will run proactive checkpointing. This means each stateful operator $O$ in a plan will create checkpoints proactively at points during execution when $O$’s heap state is minimal, namely, at $O$’s minimal-heap-state points.

Stateless operators will run reactive checkpointing. stateless filters are always at a minimal-heap-state point. Hence, stateless operators perform reactive checkpointing, where checkpointing is done only on demand, e.g., on a suspend request.

To make sure not only the operator $O$ but also all its children operators in the plan can resume and produce correct input to operator $O$ after the point where $Ckpt$ was created, Contracts, as defined above, is necessary.

If $Q$ is a stateful operator, then $Q$ saves its current control state (control state is small and include, for example, the cursor position for a NLJ) when it signs $Ctr$. Thus, the control state, $Q$’s latest checkpoint, and the contracts with $Q$’s children that $Q$ established at that time) should be sufficient for $Q$ to fulfill $Ctr$.
If $Q$ is a stateless operator, in order to sign $Ctr$, $Q$ creates a reactive checkpoint and in turn establishes a contract with its children.
Whenever the parent operator creates a checkpoint at time $t$, it has to establish contracts with its children at $t$. Thus, an operator’s contract is always associated with a parent operator’s checkpoint. To fulfill a contract signed at $t$, a child operator relies on the last checkpoint of its own created before or at $t$—either a proactive checkpoint for a stateful child or a reactive checkpoint for a stateless child.

Conceptually, the child first restores its execution state to this checkpoint, rolls forward to $t$, and starts producing output tuples for its parent in fulfillment of the contract.

Suspend Phase

When the DBMS wants to suspend a query $Q$, it raises a suspend exception in the thread that runs $Q$. Then, query enters suspend phase, and first task is to choose a suspend plan by go through all the operators in the query plan and determine whether each of them will take DumpState or GoBack. Finally, $O_r$ calls Suspend() on each of its children so that the next GetNext() call (on resume) will retrieve the tuple following the last one received by $O_r$.

Resume Phase

The SuspendedQuery (the query that suspended before) data structure is read back from disk.
The query plan is recreated by instantiating all operators, and the Resume() method is invoked on the root operator.
Resume() gets called on all operators, causing the plan to get ready to produce the tuple immediately after the last one produced before suspend.
When the root operator receives the call, it first calls Resume() on its children to get them into the correct state. It then looks up the SuspendedQuery structure and either loads its internal state from disk (in case of DumpState) or calls GetNext() appropriately on its children (in case of GoBack) to recompute its internal state.

Reference:

Badrish Chandramouli, Christopher N. Bond, Shivnath Babu, and Jun Yang. “Query suspend and resume.” ACM International Conference on Management of Data (SIGMOD), pp. 557-568. 2007.

Categories

Tags