\documentclass[english]{scrartcl}
%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{Using the USL Package to Analyze System Scalability}
%\VignetteDepends{usl}
%\VignetteKeyword{usl}
%\VignetteKeyword{scalability}
%\VignettePackage{usl}

\subject{Using the usl package}
\title{Analyze System Scalability in R with the Universal Scalability Law}
\author{Stefan M\"oding}

% Fonts, encoding & language support
\usepackage{fourier}
\usepackage[scaled]{helvet}
\usepackage[scaled=0.76]{beramono}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{babel}

\usepackage{amsmath}
\usepackage{microtype}
\usepackage{nicefrac}
\usepackage[all]{nowidow}
\usepackage[hang,marginal]{footmisc}
\usepackage[super]{nth}

% Cross references
\usepackage[pdfusetitle,hidelinks]{hyperref}
\usepackage[all]{hypcap}
\usepackage{bookmark}
\usepackage{cleveref}

\hypersetup{pdfkeywords={Scalability,Universal Scalability Law,USL,R}}

% Layout
\KOMAoption{DIV}{11}
\KOMAoption{toc}{bib}

% Define shortcut commands
\newcommand{\R}{{\normalfont\textsf{R}}}
\newcommand{\usl}{{\normalfont\textsf{usl}}}
\newcommand{\kbd}[1]{{\normalfont\texttt{#1}}}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
<<echo=FALSE>>=
library(knitr)
opts_knit$set(progress=FALSE, verbose=FALSE)

knit_hooks$set(small.mar=function(before, options, envir) {
if (before && options$fig.show != 'none')
  par(mar=c(5.1, 4.1, 1.1, 1.1), family='Helvetica', ps=11)
})

opts_chunk$set(prompt=TRUE, comment=NA, tidy=FALSE)
opts_chunk$set(out.width='\\textwidth', small.mar=TRUE)
opts_chunk$set(fig.width=7, fig.height=3.6)
opts_chunk$set(fig.align='center', fig.pos='htbp', fig.path='usl-')

options(prompt='R> ', scipen=4, digits=4, width=78)
options(digits.secs=3, show.signif.stars=TRUE)
options(str=strOptions(strict.width='cut'))
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}

\maketitle

\begin{abstract}\noindent
The Universal Scalability Law is used to quantify the scalability of hardware
or software systems. It uses sparse measurements from an existing system to
predict the throughput for different loads and can be used to learn more
about the scalability limitations of the system. This document introduces the
\usl{} package for \R{} and shows how easily it can be used to perform the
relevant calculations.
\end{abstract}

\tableofcontents


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Version}

This document describes version \Sexpr{packageVersion("usl")} of the \usl{}
package.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}

Every system architect faces the challenge to deliver an application system
that meets the requirements. A critical point during the design is the
scalability of the system.

Informally scalability can be defined as the ability to support a growing
amount of work. A system is said to scale if it handles the changing demand
or hardware environment in a reasonable efficient and practical way.

Scalability can have two facets with respect to a computer system. On the one
hand, there is software scalability where the focus is about how the system
behaves when the demand increases, i.e., when more users are using it or more
requests need to be handled. On the other hand, there is hardware scalability
where the behavior of an application system running on larger hardware
configurations is investigated.

The Universal Scalability Law (USL) has been developed by Dr.~Neil~J.~Gunther
to allow the quantification of scalability for the purpose of capacity
planning. It provides an analytic model for the scalability of a computer
system.

A comprehensive introduction to the Universal Scalability Law including the
mathematical grounding has been published by Dr.~Gunther (cf. \cite{Gun07}).


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Background}

Dr.~Gunther shows how the scalability of every computer system can be
described by a common rational function. This function is \emph{universal} in
the sense that it does not assume any specific type of software, hardware or
system architecture.

\Cref{eqn:usl2} illustrates the original Universal Scalability Law where
$C(N) = X(N) / X(1)$ is the relative capacity given by the ratio of the
measured throughput $X(N)$ for load $N$ to the throughput $X(1)$ for load $1$.

\begin{equation}\label{eqn:usl2}
C(N) = \frac{N}{1 + \alpha (N - 1) + \beta N (N - 1)}
\end{equation}

The denominator consists of three terms that all have a specific physical
interpretation:

\begin{labeling}{Concurrency:}
\item[Concurrency:] The first term models linear scalability that would exist
  if the different parts of the system (processors, threads \dots) could work
  without any interference caused by their interaction.
\item[Contention:] The second term of the denominator refers to the
  contention between different parts of the system. Most common are issues
  caused by serialization or queueing effects.
\item[Coherency:] The last term represents the delay induced by keeping the
  system in a coherent and consistent state. This is necessary when writable
  data is shared in different parts of the system. Predominant factors for
  such a delay are caches implemented in software and hardware.
\end{labeling}

In other words: $\alpha$ and $\beta$ represent two concrete physical issues
that limit the achievable speedup for parallel execution. Note that the
contention and coherency terms grow linearly respectively quadratically with
$N$. As a consequence their influence becomes larger with an increasing $N$.

Due to the quadratic characteristic of the coherency term there will be a
point where the throughput of the system will start to go retrograde, i.e.,
will start to decrease with further increasing load.

Dr.~Gunther proved that \cref{eqn:usl2} is reduced to Amdahl's Law for
$\beta = 0$ (cf. \cite{Gun07}). Therefore the Universal Scalability Law can be
seen as a generalization of Amdahl's Law for speedup in parallel computing.

We can solve this nonlinear equation to estimate the coefficients $\alpha$ and
$\beta$ using a sparse set of measurements for the throughput $X_i$ at
different loads $N_i$. Initially the Universal Scalability Law normalized the
measurements using the throughput $X(1)$ for load $1$ as the point of reference.
This placed an additional burden on the performance analyst who needed to either
measure the value or find a smart way to estimate it.

It was then discovered that a third coefficient $\gamma$ can be added to the
formula to represent the missing point of reference (cf. \cite{Gun:usl3}).
\Cref{eqn:usl3} shows the updated Universal Scalability Law using three
parameters.

\begin{equation}\label{eqn:usl3}
X(N) = \frac{\gamma N}{1 + \alpha (N - 1) + \beta N (N - 1)}
\end{equation}

Nonlinear regression will solve this equation and find best-fit values for
the three parameters $\alpha$, $\beta$ and $\gamma$.

The \usl{} package has been created to subsume the computation into one
simple function call. This greatly reduces the manual work that previously
was needed to perform the scalability analysis.

Initially the coefficients are called $\sigma$ and $\kappa$ when hardware
scalability is evaluated but $\alpha$ and $\beta$ when software scalability
is analyzed. Up to version 1.8.0 the \usl{} package always used \texttt{sigma}
and \texttt{kappa} as coefficients. This has been changed starting with the
2.0.0 release of the \usl{} package. Now the package will also use
\texttt{alpha} and \texttt{beta} as coefficient names to follow Dr.~Gunther's
nomenclature. Additionally the \texttt{gamma} parameter was added to implement
the Universal Scalability Law with three parameters.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Examples of Scalability Analysis}

The following sections present some examples of how the \usl{} package can be
used when performing a scalability analysis. They also explain typical
function calls and their arguments.

\subsection{Case Study: Hardware Scalability}

The \usl{} package contains a demo dataset with benchmark measurements from a
ray tracer software\footnote{\url{http://sourceforge.net/projects/brlcad/}}.
The data was gathered on an SGI~Origin~2000 with 64 R12000 processors running
at 300~MHz.

A number of reference images with different levels of complexity were
computed for the benchmark. The measurements contain the average number of
calculated ray-geometry intersections per second for the number of used
processors.

It is important to note that with changing hardware configurations the
relative number of \emph{homogeneous} application processes per processor is
to be held constant. So when $k$ application processes were used for the $N$
processor benchmark then $2k$ processes must be used to get the result for
$2N$ processors.

Start the analysis by loading the \usl{} package and look at the supplied
dataset.

<<>>=
library(usl)
data(raytracer)
raytracer
@

The data shows the throughput for different hardware configurations covering
the available range from one to 64 processors. We can easily see that the
benefit for switching from one processor to four processors is much larger
than the gain for upgrading from 48 to 64 processors.

Create a simple scatterplot to visualize the raw data.

<<'rtplot1', fig.show='hide'>>=
plot(throughput ~ processors, data = raytracer)
@

\Cref{fig:rtplot2} shows the throughput of the system for the different
number of processors. This plot is a typical example for the effects of
\emph{diminishing returns}\label{dimret}, because it clearly shows how the
benefit of adding more processors to the system gets smaller for higher
numbers of processors.

<<'rtplot2', echo=FALSE, fig.cap='Measured throughput of a ray tracing software in relation to the number of available processors'>>=
<<rtplot1>>
@

Our next step builds the USL model from the dataset. The \kbd{usl()} function
creates an S4 object that encapsulates the computation.

The first argument is a formula with a symbolic description of the model we
want to analyze. In this case we would like to analyze how the ``throughput''
changes with regard to the number of ``processors'' in the system. The second
argument is the dataset with the measured values. Note how this call matches
the syntax of the \kbd{plot()} function.

<<>>=
usl.model <- usl(throughput ~ processors, data = raytracer)
@

The model object can be investigated with the \kbd{summary()} function.

<<>>=
summary(usl.model)
@

The output of the \kbd{summary()} function shows different types of
information.

\begin{itemize}
\item First of all it includes the call we used to create the model.
\item The efficiency tells us something about the ratio of useful work that
  is performed per processor. It is obvious that two processors might be able
  to handle twice the work of one processor but not more. Calculating the
  ratio of the workload per processor should usually be less or equal to $1$.
  But often the maximum of the distribution is not exactly $1$. This is caused
  by the regression model when the expected value differs slightly from the
  measured value.
\item We are performing a regression on the data to calculate the
  coefficients and therefore we determine the residuals for the fitted
  values. The distribution of the residuals is also given as part of the
  summary.
\item The coefficients $\alpha$ and $\beta$ are the result that we are
  essentially interested in. They tell us the magnitude of the contention and
  coherency effects within the system.
\item The third coefficient $\gamma$ estimates the throughput for a single
  processor. In this case we actually have a measurement for the single
  processor case but for a real-world analysis this value if often unknown or
  can't be measured. The difference between the value of $\gamma$ and the
  measurement is also caused by the regression. The difference should be small
  if the regression finds a reasonable model.
\item The residual standard error estimates how well the model fits the data.
  We can see that the difference between the model prediction and the measured
  values is typically within $\Sexpr{signif(sigma(usl.model), 3)}$ ray-tracing
  operations per second.
\item Finally the scalability bounds are printed. These bounds state important
  limits of the scalability as defined by the model. We will look at the
  meaning of these values shortly.
\end{itemize}

The function \kbd{efficiency()} extracts the efficiency values from the model
and allows us to have a closer look at the specific efficiencies of the
different processor configurations.

<<>>=
efficiency(usl.model)
@

A bar plot is useful to visually compare the decreasing efficiencies for the
configurations with an increasing number of processors. \Cref{fig:rtbarplot}
shows the output diagram.

<<'rtbarplot', fig.cap='Rate of efficiency per processor for different numbers of processors running the ray tracing software'>>=
barplot(efficiency(usl.model), ylab = "efficiency / processor", xlab = "processors")
@

The efficiency can be used for a first validation and sanity check of the
measured values. Values larger than $1.0$ usually need a closer
investigation. It is also suspicious if the efficiency gets bigger when the
load increases.

The model coefficients alone can be retrieved with the \kbd{coef()} function.

<<>>=
coef(usl.model)
@

We can see that the model estimates the contention parameter $\alpha$ to be
about $\Sexpr{signif(100 * coef(usl.model)[['alpha']], 2)}$ percent and the
coherency parameter $\beta$ to be $\Sexpr{signif(coef(usl.model)[['beta']], 2)}$.

With $\beta = 0$ the model indicates no coherency penalty for the system. In
this case the throughput will asymptotically approach an upper bound as the
number of processors is increased. The summary output above indicates a limit
of $\Sexpr{signif(summary(usl.model)$limit, 3)}$ as the upper bound for the
maximal achievable throughput. There is no way to increase the throughput
beyond that number, no matter how many processors are available whatsoever.

For $\beta = 0$ the predicted throughput never reaches a maximum. Theoretically
the throughput could always be increased by adding more processors. At the
same time it is limited as stated in the previous paragraph. So the improvement
becomes infinitesimal for more and more processors.

To get an impression of the scalability function we can use the \kbd{plot()}
function and create a combined graph with the original data as dots and the
calculated scalability function as a solid line. The scalability bounds are
added as dotted lines to the output. \Cref{fig:rtplot3} has the result of that
plot.

<<'rtplot3', fig.cap='Throughput of a ray tracing software using different numbers of processors'>>=
plot(throughput ~ processors, data = raytracer, pch = 16, ylim = c(0, 400))
plot(usl.model, add = TRUE, bounds = TRUE)
@

<<'bounds', echo=FALSE>>=
Xroof <- usl.model$limit
Nopt  <- usl.model$optimal[1]
Xopt  <- usl.model$optimal[2]
@

The plot has dotted lines for the upper limit defined by Amdahl (here for a
throughput of $\Sexpr{signif(Xroof, 3)}$) and the linear scalability we strive
to see in all our applications. Unfortunately the throughput falls away from
the linear path rather quickly.

Optimal scalability is reached at $\Sexpr{signif(Nopt, 3)}$ processors
performing $\Sexpr{signif(Xopt, 3)}$ ray-tracing operations per second. The
optimum is defined as the point on the x-axis below the intersection of the
linear scalability bound and the scalability limit defined by Amdahl's Law.

Running a system below this point will underutilize already purchased capacity.
But pushing the system above this point leads to \emph{diminishing returns}.
The point of optimal scalability is also given in the output of the
\kbd{summary()} function as shown before.

The confidence intervals for the model coefficients are returned by calling the
\kbd{confint()} function.

<<>>=
confint(usl.model, level = 0.95)
@

The confidence interval shows for a given confidence level the expected range
that the true value lies in. We can see that the true $\beta$ might in fact not
really be zero. One reason is that all measurements contain measurement and
observation errors. The other reason is that a very small $\beta$ might only
cause a measurable effect when a much higher number of processors are used.

SGI marketed the Origin~2000 with up to 128 processors. Let's assume that
going from 64 to 128 processors does not introduce any additional limitations
to the system architecture. Then we can use the existing model and forecast
the system throughput for other numbers like 96 and 128 processors using the
\kbd{predict()} function.

<<>>=
predict(usl.model, data.frame(processors = c(96, 128)))
@

We can see from the prediction that the throughput will reach
$\Sexpr{signif(predict(usl.model, data.frame(processors = c(128)))[[1]], 4)}$
for 128 processors.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Case Study: Software Scalability}

In this section we will perform an analysis of a SPEC benchmark. A
SPARCcenter~2000 with 16 CPUs was used in October 1994 for the SDM91
benchmark\footnote{\url{http://www.spec.org/osg/sdm91/results/results.html}}.
The benchmark simulates a number of users working on a UNIX server (editing
files, compiling \dots) and measures the number of script executions per
hour.

First, select the demo dataset with the data from the SPEC SDM91 benchmark.

<<>>=
library(usl)
data(specsdm91)
specsdm91
@

The data provides the measurements made during the benchmark. The column
``load'' shows the number of virtual users that were simulated by the
benchmark and the column ``throughput'' has the measured number of script
executions per hour for that load.

Next we create the USL model for this dataset by calling the \kbd{usl()}
function. Again we specify a symbolic description of the model and the
dataset with the measurements. But this time we choose a different method for
the analysis.

<<>>=
usl.model <- usl(throughput ~ load, specsdm91, method = "nls")
@

There are currently three possible values for the \texttt{method} parameter:

\begin{labeling}{\texttt{default}:}
\item[\texttt{nls}:] This method uses the \kbd{nls()} function of the stats
  package for a nonlinear regression model. It estimates the coefficients
  $\alpha$, $\beta$ and $\gamma$. The nonlinear regression uses constraints
  for its parameters which means the ``port'' algorithm is used internally
  to solve the model. So all restrictions of the ``port'' algorithm apply.
\item[\texttt{nlxb}:] A nonlinear regression model is also used in this case.
  But instead of the \kbd{nls()} function it uses the \kbd{nlxb()} function
  from the nlsr package (cf. \cite{R:nlsr}). It is expected to be more robust
  than the \texttt{nls} method.
\item[\texttt{default}:] The default method traditionally used a transformation
  into a \nth{2} degree polynomial and required the data to be normalized. This
  was the original algorithm for the two-parameter Universal Scalability Law
  (cf. chapter 5.2.3 of \cite{Gun07}). Starting with version 2.0.0 of the
  \usl{} package this algorithm is no longer available. The \kbd{nlxb()}
  function is used instead.
\end{labeling}

We also use the \kbd{summary()} function to look at the details for the
analysis.

<<>>=
summary(usl.model)
@

Looking at the coefficients we notice that $\alpha$ is about
\Sexpr{signif(coef(usl.model)[['alpha']], 2)} and $\beta$ is about
\Sexpr{signif(coef(usl.model)[['beta']], 1)}. The parameter $\alpha$ indicates
that about \Sexpr{signif(100*coef(usl.model)[['alpha']], 2)} percent of the
execution time is strictly serial. As a rule of thumb we can say that $3$
percent is an excellent value for a reasonable complex application. Note that
this serial fraction is also recognized in Amdahl's Law.

We hypothesize that a proposed change to the system --- maybe a redesign of
the cache architecture or the elimination of a point to point communication
--- could reduce $\beta$ by half. Now we want to predict how the scalability of
the system would change.

We can calculate the point of maximum scalability for the current system and
for the hypothetical system with the \kbd{peak.scalability()} function.

<<>>=
peak.scalability(usl.model)
peak.scalability(usl.model, beta = 0.00005)
@

The function accepts the optional arguments \texttt{alpha} and
\texttt{beta}. They are useful to do a what-if analysis. Setting these
parameters override the calculated model parameters and show how the system
would behave with a different contention or coherency coefficient.

In this case we learn that the point of peak scalability would move from
around \Sexpr{signif(peak.scalability(usl.model), 3)} to about
\Sexpr{signif(peak.scalability(usl.model, beta=0.00005), 3)} virtual users if
we would be able to actually build the system with the assumed optimization.

Both calculated scalability functions can be plotted using the \kbd{plot()}
or \kbd{curve()} functions. The following commands create a graph of the
original data points and the derived scalability functions. To completely
include the scalability of the hypothetical system, we have to increase the
range of the plotted values with the first command.

<<'spplot1', fig.show='hide'>>=
plot(specsdm91, pch = 16, ylim = c(0,2500))
plot(usl.model, add = TRUE)

# Create function cache.scale to perform calculations with the model
cache.scale <- scalability(usl.model, beta = 0.00005)
curve(cache.scale, lty = 2, add = TRUE)
@

We used the function \kbd{scalability()} here. It is a higher order function
returning a function and not just a single value. The return value is assigned
to define \kbd{cache.scale}. That makes it possible to use the \kbd{curve()}
function to plot the values over the specific range.

\Cref{fig:spplot2} shows the measured throughput in scripts per hour for a
given load, i.e., the number of simulated users. The solid line indicates the
derived USL model while the dashed line resembles our hypothetical system
using the proposed optimization.

<<'spplot2', echo=FALSE, fig.cap='The result of the SPEC SDM91 benchmark for a SPARCcenter 2000 (dots) together with the calculated scalability function (solid line) and a hypothetical scalability function (dashed line)'>>=
<<spplot1>>
@

From the figure we can see that the scalability really peaks at one point.
Increasing the load beyond that point leads to retrograde behavior, i.e., the
throughput decreases again. As we have calculated earlier, the measured
system will reach this point sooner than the hypothetical system.

We can combine the \kbd{cache.scale()} (defined before) and
\kbd{peak.scalability()} functions to get the predicted throughput values for
the peak values.

<<>>=
scalability(usl.model)(peak.scalability(usl.model))

# Use cache.scale function defined before
cache.scale(peak.scalability(usl.model, beta = 0.00005))
@

This illustrates how the Universal Scalability Law can help to decide if the
system currently is more limited by contention or by coherency issues and
also what impact a proposed change would have.

The \kbd{predict()} function can also be used to calculate a confidence bands
for the scalability function at a specified level. To get a smoother graph it
is advisable to predict the values for a higher number of points. Let's start
by creating a data frame with the required load values.

<<>>=
load <- with(specsdm91, expand.grid(load = seq(min(load), max(load))))
@

We use the data frame to determine the fitted values and also the upper and
lower confidence bounds at the requested level. The result will be a matrix
with column names \kbd{fit} for the fitted values, \kbd{lwr} for the lower
and \kbd{upr} for the upper bounds.

<<>>=
fit <- predict(usl.model, newdata = load, interval = "confidence", level = 0.95)
@

The matrix is used to define the coordinates of a polygon containing the area
between the lower and the upper bounds. The polygon connects the points of
the lower bounds from lower to higher values and then back using the points
of the upper bounds.

<<>>=
usl.polygon <- matrix(c(load[, 1], rev(load[, 1]), fit[, 'lwr'], rev(fit[, 'upr'])),
                       nrow = 2 * nrow(load))
@

The plot is composed from multiple single plots. The first plot initializes
the canvas and creates the axis. Then the polygon is plotted using a gray
area. In the next step the measured values are added as points. Finally a
solid line is plotted to indicate the fitted scalability function. See
\cref{fig:ciplot1} for the entire plot.

<<'ciplot1', fig.cap='The result of the SPEC SDM91 benchmark with confidence bands for the scalability function at the 95\\% level'>>=
# Create empty plot (define canvas size, axis, ...)
plot(specsdm91, xlab = names(specsdm91)[1], ylab = names(specsdm91)[2],
      ylim = c(0, 2000), type = "n")

# Plot gray polygon indicating the confidence interval
polygon(usl.polygon, border = NA, col = "gray")

# Plot the measured throughput
points(specsdm91, pch = 16)

# Plot the fit
lines(load[, 1], fit[, 'fit'])
@

Another way to illustrate the impact of the parameters $\alpha$ and $\beta$
on the scalability is by looking at the achievable speedup when a fixed load
is parallelized. A naive estimation would be that doubling the degree of
parallelization should cut the execution time in halve.

Unfortunately it doesn't work this way. In general there is a range where
doubling the parallelization will actually improve the execution time. But
the improvement will get smaller and smaller when the degree of parallelism
is increased further. This is also an effect of \emph{diminishing returns} as
already seen in \cref{dimret}. The real execution time is in fact the sum of
the ideal execution time and the overhead for dealing with contention and
coherency delays.

The total execution time of a parallelized workload depends on the degree of
parallelism $p$ and the coefficients $\alpha$ and $\beta$ of the associated
USL model.

The magnitude of the three components --- given as fractions of the serial
execution time $T_1$ --- that account for the total execution time of the
parallelized workload can be calculated as follows
(cf. \cite{DBLP:journals/corr/abs-0808-1431} eqn. 26).

\begin{align}
T_{ideal}      &= \frac{1}{p} T_1\label{eqn:ideal}
\end{align}
\begin{align}
T_{contention} &= \alpha \left(\frac{p-1}{p}\right) T_1\label{eqn:contention}
\end{align}
\begin{align}
T_{coherency}  &= \beta \frac{1}{2} (p-1) T_1\label{eqn:coherency}
\end{align}

The function \kbd{overhead()} can be used to calculate the correspondent
fractions for a given model. The function has the same interface as the
\kbd{predict()} function. Calling it with only the model as argument will
calculate the overhead for the fitted values. It can also be called with a
data frame as second argument. Then the data frame will be used to determine
the values for the calculation.

Let's use our current model to calculate the overhead for a load of $10$,
$20$, $100$ and $200$ simulated users. We create a data frame with the number
of users and use the \kbd{overhead()} function to estimate the overhead.

<<>>=
load <- data.frame(load = c(10, 20, 100, 200))
ovhd <- overhead(usl.model, newdata = load)
ovhd
@

We can see that the ideal execution time for running $10$ jobs in parallel is
$\nicefrac{1}{10}$ of the execution time of running the jobs unparallelized.
To get the total fraction we have to add the overhead for contention
($\Sexpr{100*signif(ovhd[1,'contention'], 2)}\%$) and for coherency delays
($\Sexpr{100*signif(ovhd[1,'coherency'], 2)}\%$). This gives a total of
$\Sexpr{100*signif(sum(ovhd[1,]), 4)}\%$. So with $10$ jobs in parallel we
are only about $\Sexpr{signif(1/sum(ovhd[1,]), 2)}$ times faster than running
the same workload in a serial way.

\Cref{eqn:contention} shows that the percentage of time spent on dealing with
contention will converge to the value of $\alpha$. \Cref{eqn:coherency}
explains that coherency delays will grow beyond any limit if the degree of
parallelism is large enough. This corresponds to the observation that adding
more parallelism will sometimes make performance worse.

A stacked barplot can be used to visualize how the different effects change
with an increasing degree of parallelism. Note that the result matrix must be
transposed to match the format needed for the \kbd{barplot()} command.

<<'ovplot1', fig.cap='Decomposition of the execution time for parallelized workloads of the SPECSDM91 benchmark. The time is measured as a fraction of the time needed for serial execution of the workload.'>>=
barplot(height = t(ovhd), names.arg = load[, 1],
         xlab = names(load), legend.text = TRUE)
@

\Cref{fig:ovplot1} shows the resulting plot. It clearly shows the decrease in
ideal execution time when the degree of parallelism is increased. It also
shows how initially almost only contention contributes to the total execution
time. For higher degrees of parallelism the impact of coherency delays grows.
Note how the difference in ideal execution time between $100$ and $200$
parallel jobs effectively has no effect on the total execution time.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Case Study: Multi-valued Data}

It is very common to use multi-valued data for a scalability analysis. These
measurements are often taken from a live system and may include many
different data points for similar load values. This could be the result of a
non-homogeneous workload and an analyst has to decide how to take that into
account. But for a production system there is usually no feasible way to
create a homogeneous workload.

The following data shows a subset of performance data gathered from an Oracle
database system providing a login service for multiple web applications. For
the analysis we focus on only two of the available metrics:

\begin{labeling}{\texttt{txn\_rate}:}
\item[\texttt{txn\_rate}:] The average number of processed database
  transactions. This metric is given as transactions per second.
\item[\texttt{db\_time}:] The average time spent inside the database either
  working on a CPU or waiting for resources (I/O, locks, buffers \dots). The
  time is expressed as seconds per second, so two sessions working for
  exactly one quarter of a second each will contribute a total of half a
  second for that second. Oracle has coined the term \emph{Average Active
  Sessions} (AAS) for this metric.
\end{labeling}

Let's have a look at the first couple of data points in our data set. For
each time interval of two minutes there is a corresponding value for the
average database time per seconds and for the average number of transactions
per second in this interval.

<<>>=
data(oracledb)
head(subset(oracledb, select = c(timestamp, db_time, txn_rate)))
@

A naive approach would be a plot of the data as a time series (see
\cref{fig:oraplot1}). This plot shows the familiar pattern of an OLTP
application that is mostly used during office hours. Unfortunately this type
of plot is pretty much useless when performing a scalability analysis.

<<'oraplot1', echo=FALSE, fig.cap='Transaction rates of an Oracle database system during the day of January 19th, 2012'>>=
plot(txn_rate ~ timestamp, oracledb, pch = 20, xlab = "Time of day", ylab = "Txn / sec")
@

The Universal Scalability Law correlates a throughput with a load. In this
case the throughput is clearly given by the transaction rate and the database
time is a taken as the load metric. The definition above states that the
total time spent --- either running on a CPU or waiting --- is a measurement
for the average number of active sessions. So we use that to express the load
on the database system.

As usual, we call the \kbd{usl()} function to carry out the analysis. See
\cref{fig:orausl1} for the scatterplot of the data including the plot of the
estimated scalability function.

<<'orausl1', fig.cap='Relationship between the transaction rate and the number of average active sessions in an Oracle database system'>>=
plot(txn_rate ~ db_time, oracledb,
      xlab = "Average active sessions", ylab = "Txn / sec")

usl.oracle <- usl(txn_rate ~ db_time, oracledb)
plot(usl.oracle, add = TRUE)
@

Now we can retrieve the coefficients for this model.

<<>>=
coef(usl.oracle)
@

Here our $\alpha$ is about an order of magnitude bigger than what we have
seen in the previous sections. This indicates a major issue with some kind of
serialization or queueing that severely limits the scalability. In fact it is
so bad that the impact is already visible with only a few active sessions
working at the same time: according to the model the peak throughput is
reached at about $\Sexpr{signif(peak.scalability(usl.oracle), 2)}$ sessions.

<<>>=
peak.scalability(usl.oracle)
@

The confidence interval for $\alpha$ confirms that there is only a small
uncertainty about the magnitude of the calculated coefficients.

<<>>=
confint(usl.oracle)
@

This analysis shows how we can use some of the metrics provided by a live
Oracle database system to learn about the scalability. Note that neither the
Oracle software nor the application needed any additional instrumentation to
collect this data. Also the analysis was done without any internal knowledge
about the way the application was using the database.

% Looking at the plot of the scalability function in \cref{fig:orausl1}, we
% could argue that the estimated function seems to be too pessimistic. There
% are many data points showing a much higher transaction rate than the
% scalability function suggests. Numerically this is caused by using the
% \emph{least squares} method which considers all data points equally.

% Let's estimate the scalability when only the alleged \emph{good} data
% points are taken into account. Here is a constructive description to
% isolate the points that make up the upper boundary of the comprising area.

% \begin{enumerate}
% \item Determine the convex hull of the data points
% \item Locate the data points with the smallest and the largest abscissa
% \item Find the straight line passing through these two points
% \item Use only the data points on the convex hull that are located on or above
%   this line
% \end{enumerate}

% <<'orausl2', echo=FALSE, fig.cap='Using the convex hull to set the boundary between upper and lower data points'>>=
% p <- with(oracledb, data.frame(db_time, txn_rate))

% # Plot all points
% plot(txn_rate ~ db_time, p,
%      xlab = "Average active sessions", ylab = "Transactions per second")

% # Calculate convex hull
% idx <- chull(cbind(p$db_time, p$txn_rate))
% pts <- p[idx, ]

% # Get point with smallest abscissa
% p1 <- which.min(pts$db_time)
% x1 <- pts[p1, 1]
% y1 <- pts[p1, 2]

% # Get point with largest abscissa
% p2 <- which.max(pts$db_time)
% x2 <- pts[p2, 1]
% y2 <- pts[p2, 2]

% # Slope–intercept equation: y=mx+b
% m <- (y2 - y1) / (x2 - x1)
% b <- (-x1 * m) + y1

% # Plot points on convex hull
% points(pts, pch = 16)

% # Remove points below the line and sort ascending by db_time
% top <- subset(pts, txn_rate >= (m * db_time) + b)
% top <- top[order(top$db_time), ]

% lines(top$db_time, top$txn_rate, lty = 1)

% abline(b, m, lty = 2)
% @

% \Cref{fig:orausl2} uses solid black points to mark the points of the convex
% hull. The dashed line, defined by the coordinates
% $\left(\Sexpr{signif(x1,2)},\Sexpr{signif(y1,2)}\right)$ and
% $\left(\Sexpr{signif(x2,2)},\Sexpr{signif(y2,2)}\right)$, is the boundary
% between the lower and the upper part.

% We will now use only the data points on the upper part of the convex hull
% for an additional analysis.

% <<>>=
% usl.oracle2 <- usl(txn_rate ~ db_time, top, method = "nlxb")

% coef(usl.oracle2)
% confint(usl.oracle2)
% @

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\bibliographystyle{plain}
\bibliography{usl}

\end{document}