BlinkDB[1] uses two key ideas: (1) an adaptive optimization framework that builds and maintains a set of multi-dimensional stratified samples from original data over time, and (2) a dynamic sample selection strategy that selects an appropriately sized sample based on a query’s accuracy or response time requirements.

BlinkDB is evaluated against the well-known TPC-H benchmark and a real-world analytic workload derived from Conviva Inc., a company that manages video distribution over the Internet. The authors' experiments on a 100-node cluster show that BlinkDB can answer queries on up to 17 TB of data in less than 2 seconds (over 200x faster than Hive), within an error of 2-10%.

Problem & Contribution

Approximate Query Processing (AQP): how to quickly process large amounts of data by trading result accuracy for response time and space.

Existing AQP techniques make different trade-offs between efficiency and the generality of the queries they support. (1) At one end of the spectrum, existing sampling and sketch based solutions exhibit low space and time complexity but with strong assumptions about the query workload (e.g., future queries and aggregation functions used in queries are known or largely predictable); (2) At the other end of the spectrum, systems like online aggregation (OLA) make fewer assumptions about the query workload, at the expense of highly variable performance.

BlinkDB is a distributed sampling-based approximate query processing system that strives to achieve a better balance between error and latency. BlinkDB allows users to pose SQL-based aggregation queries over stored data, along with response time or error bound constraints. In this way, BlinkDB answers queries over huge data sets quickly when users explicitly indicate their error tolerance, or spends more time to achieve better accuracy when users care more about the error.

Build and maintain the stratified samples

Assumption

BlinkDB assumes that the sets of columns used by queries in WHERE, GROUP BY, and HAVING clauses are stable over time. These sets of columns are called “query column sets” or QCSs.

Sample creation

| Notation | Description |
| --- | --- |
| $T$ | fact (original) table |
| $Q$ | a query |
| $t$ | a time bound for query $Q$ |
| $e$ | an error bound for query $Q$ |
| $n$ | the estimated number of rows that can be accessed in time $t$ |
| $\phi$ | the QCS for $Q$, a set of columns in $T$ |
| $x$ | a $\vert \phi \vert$-tuple of values for a column set $\phi$, for example `(Chicago, IL)` for $\phi =$ (City, State) |
| $D(\phi)$ | the set of all unique $x$-values for $\phi$ in $T$ |
| $T_x, S_x$ | the rows in $T$ (or in a subset $S \subseteq T$) that have the values $x$ on $\phi$ ($\phi$ is implicit) |
| $S(\phi, K)$ | stratified sample associated with $\phi$, in which the frequency of every group $x$ under $\phi$ is capped at $K$ |
| $\Delta(\phi, M)$ | the number of groups in $T$ under $\phi$ having size less than $M$ (a measure of the sparsity of $T$) |

The sample creation module creates stratified samples on the most frequently used QCSs to ensure efficient execution for queries on rare values. The stratified samples are created as follows:

  1. A query $Q$ specifying a table $T$, a QCS $\phi$, and either a response time bound $t$ or an error bound $e$ is given.
    • $t$ determines the maximum sample size $n$ on which the query can operate; this $n$ is also the optimal sample size, since larger samples produce better statistical results.
    • Similarly, $e$ determines the minimum sample size that will satisfy the error bound; any larger sample would be suboptimal because it would take longer than necessary.
  2. $T_x = \lbrace r: r \in T \text{ and } r \text{ takes values } x \text{ on columns } \phi \rbrace$.
    • There are $\vert D(\phi) \vert$ such groups $T_x$ of rows in $T$ under $\phi$.
  3. Since computing an aggregate value over each full group $T_x$ is expensive, the aggregates are instead computed over a sample $S \subseteq T$ with $\vert S \vert = n$ rows; $S_x$ denotes the sampled rows from $T_x$.

  4. The sampling strategy is as follows (a concrete code sketch appears after this list):
    • Compute group counts: to each $x \in \lbrace x_0, \cdots, x_{\vert D(\phi) \vert - 1} \rbrace$, assign a count, forming a $\vert D(\phi) \vert$-vector of counts $N_n^{*} = N(\text{max}\lbrace n': \Vert N(n') \Vert_1 \leq n \rbrace)$.
    • These counts cap the number of rows taken from each group at $K = \lfloor \frac{n'}{\vert D(\phi) \vert} \rfloor$; a group with fewer than $K$ rows contributes all of its rows.
    • For each $x$, sample $N_{n,x}^{*}$ rows (the entry of $N_n^{*}$ for group $x$) uniformly at random without replacement from $T_x$, forming the sample $S_x$.
  5. For queries that share the same QCS $\phi$ but require a different $n$ (the number of rows needed) due to varying $t$ and $e$, BlinkDB uses a simple nesting strategy: $S_n \subseteq S_{n^{max}}$ for any $n \leq n^{max}$.

  6. Assume a query $Q$ that needs to read $n$ rows in total to satisfy its error bound or execution time constraint. Let $n_x$ be the number of rows read from $S_x$ to compute the answer. Since the rows are distributed randomly among the blocks (data blocks, e.g., in HDFS), it is enough for $Q$ to read any subset of the blocks comprising $S_x$, as long as these blocks contain at least $n_x$ records. In the paper's example, $Q$ reads only blocks $B_{41}$ and $B_{42}$, as these blocks contain enough records to compute the required answer.

  7. Taking three factors into account, namely (1) the “sparsity” of the data, (2) workload characteristics, and (3) the storage cost of samples, the choice of which samples to create can be formulated as an optimization problem.
    • Sparsity of the data: the sparsity function is defined as the number of groups whose size in $T$ is less than some number $M$: $\Delta(\phi, M) = \vert \lbrace x \in D(\phi): \vert T_x \vert < M \rbrace \vert$.
    • Workload characteristics: a query has QCS $q_j$ with some (unknown) probability $p_j$, i.e., QCSs are drawn from a Multinomial $(p_1, p_2, \cdots)$ distribution. Each $p_j$ is estimated from the frequency of queries with QCS $q_j$ among past queries.
    • Storage cost: $\vert S(\phi, K) \vert$ is the storage cost (in rows) of building a stratified sample on a set of columns $\phi$.
  8. The optimization problem statement:
    • Given: overall storage capacity budget (in rows) is $\mathbb{C}$.
    • Objective: select $\beta$ column sets from among $m$ possible QCSs, say $\phi_{i_1}, \cdots, \phi_{i_\beta}$, which can best answer our queries, while satisfying $\sum_{k=1}^\beta \vert S(\phi_{i_k}, K) \vert \leq \mathbb{C}$. (A simplified, illustrative selection sketch also appears after this list.)
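
To make steps 2-5 concrete, here is a minimal single-machine Python sketch of the capped stratified-sampling strategy. It assumes the table fits in memory as a list of dicts and searches the per-group cap $K$ directly (which determines the same count vector as searching over $n'$); all function and variable names are illustrative, not BlinkDB's actual API.

```python
import random
from collections import defaultdict

def group_by_qcs(table, qcs):
    """Partition T into groups T_x, keyed by the tuple of values x on the QCS columns."""
    groups = defaultdict(list)
    for row in table:
        groups[tuple(row[c] for c in qcs)].append(row)
    return groups

def stratified_sample(table, qcs, n):
    """Build S(phi, K): take at most K rows from every group, where K is the
    largest per-group cap whose total row count stays within the budget n."""
    groups = group_by_qcs(table, qcs)

    def total_rows(cap):  # ||N(n')||_1 when the per-group cap is `cap`
        return sum(min(cap, len(rows)) for rows in groups.values())

    largest_group = max((len(rows) for rows in groups.values()), default=0)
    k = 0
    while k < largest_group and total_rows(k + 1) <= n:
        k += 1

    # For each group x, draw min(K, |T_x|) rows uniformly at random without replacement.
    return {x: random.sample(rows, min(k, len(rows))) for x, rows in groups.items()}
```

For example, `stratified_sample(rows, ('City',), n=100_000)` would build a sample for the QCS $\phi =$ (City): rare cities contribute all of their rows while popular cities are capped at $K$, so queries that filter on rare cities still see enough data.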
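
Step 8's budgeted selection can be illustrated with a toy greedy heuristic. This is not the paper's formulation (BlinkDB solves the problem as an optimization program over all candidate QCSs); the benefit score below is purely an assumption made for illustration, and all names are hypothetical.

```python
def choose_qcs(candidates, budget_rows):
    """Greedily pick QCSs to build samples for, subject to the storage budget C.
    `candidates` is a list of dicts with (hypothetical) keys:
      'phi'          - the QCS, a tuple of column names
      'p'            - estimated probability that a future query uses this QCS
      'storage_rows' - |S(phi, K)|, the rows needed to store its stratified sample
      'sparsity'     - Delta(phi, M), the number of rare groups the sample protects
    """
    # Assumed benefit score: favour frequent QCSs that cover many rare groups per stored row.
    def score(c):
        return c['p'] * (1 + c['sparsity']) / c['storage_rows']

    chosen, used = [], 0
    for c in sorted(candidates, key=score, reverse=True):
        if used + c['storage_rows'] <= budget_rows:
            chosen.append(c)
            used += c['storage_rows']
    return chosen
```

The greedy loop enforces only the constraint $\sum_{k} \vert S(\phi_{i_k}, K) \vert \leq \mathbb{C}$ from step 8; BlinkDB's actual objective weighs the three factors from step 7 jointly.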

Sample selection during runtime

Given a query $Q$, the goal is to select one (or more) sample(s) at run-time that meet the specified time or error constraints and then compute answers over them.

In order to pick the best possible plan, BlinkDB’s run-time dynamic sample selection strategy involves executing the query on small subsamples of one or more stored samples and gathering statistics about the query’s selectivity, complexity, and the underlying distribution of its inputs.

Selecting the sample

Choosing an appropriate sample for a query primarily depends on the set of columns $q_j$ that occur in its WHERE and/or GROUP BY clauses and the physical distribution of data in the cluster (i.e., disk vs. memory).

Selecting the right sample/size

Once a set of candidate samples is decided, BlinkDB needs to select one particular sample (built on some QCS $\phi_i$) and pick an appropriately sized subsample of it based on the query’s response time or error constraints. This is accomplished by constructing an Error-Latency Profile (ELP), which characterizes the rate at which the error decreases (and the query response time increases) with increasing sample size. The ELP is built simply by running the query on smaller subsamples to estimate the selectivity and then projecting latency and error for larger samples.
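
Below is a minimal sketch of how an ELP-style projection might work. It assumes the aggregate behaves like a sample mean, so the standard error shrinks roughly as $1/\sqrt{n}$, and that latency grows roughly linearly with the number of rows scanned; `run_on_subsample` and the other names are hypothetical, not BlinkDB's API.

```python
import math

def build_elp(run_on_subsample, pilot_rows, candidate_sizes):
    """run_on_subsample(k) runs the query on k rows and returns
    (observed_latency_seconds, observed_standard_error)."""
    pilot_latency, pilot_error = run_on_subsample(pilot_rows)
    profile = []
    for n in candidate_sizes:
        latency = pilot_latency * n / pilot_rows            # assume ~linear in rows scanned
        error = pilot_error * math.sqrt(pilot_rows / n)     # assume ~1/sqrt(n) error decay
        profile.append((n, latency, error))
    return profile

def pick_size(profile, time_bound=None, error_bound=None):
    """For a time bound, take the largest size whose projected latency fits
    (more rows -> lower error); for an error bound, take the smallest size
    whose projected error already satisfies it (fewer rows -> lower latency)."""
    if time_bound is not None:
        feasible = [n for n, latency, _ in profile if latency <= time_bound]
        return max(feasible) if feasible else None
    if error_bound is not None:
        feasible = [n for n, _, error in profile if error <= error_bound]
        return min(feasible) if feasible else None
    return None
```

This mirrors the reasoning in step 1 of sample creation: a time bound is met by the largest affordable subsample, while an error bound is met by the smallest sufficient one.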

Reference:

  1. Agarwal, Sameer, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, and Ion Stoica. “BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data.” EuroSys, 2013.