% !TEX encoding = UTF-8
% !TEX spellcheck = en_US
% !TEX root = ../paper.tex
\chapter{Main explanation?}
\section{First definition of the SDG}
\label{sec:first-def-sdg}
The system dependence graph (SDG) is a program representation used for program
slicing that was first proposed by Horwitz, Reps and Binkley~\cite{HorwitzRB88}.
It builds upon the existing control flow graph (CFG), defining dependencies
between vertices of the CFG, and building a program dependence graph (PDG),
which represents them. The SDG is then built by assembling the different PDGs
(each representing a method of the program), linking each method call to its
corresponding definition. Because each graph is built from the previous one,
new constructs can be added to the CFG without the need to alter the algorithm
that converts the CFG to the PDG and then to the SDG. The only modifications
that may be required are the redefinition of a dependency or the addition of
new kinds of dependence.

The language covered by the initial proposal was a simple one, featuring
procedures with modifiable parameters and basic instructions, including calls to
procedures, variable assignments, arithmetic and logic operators and conditional
instructions (branches and loops): the basic features of an imperative
programming language. The control flow graph was as simple as the programs
themselves, with each graph representing one procedure. The instructions of the
program are represented as vertices of the graph and are split into two
categories: statements, which have no effect on the control flow (assignments,
procedure calls), and predicates, whose execution may lead to one of multiple
(traditionally two) paths (conditional instructions). Statements are
connected sequentially to the next instruction. Predicates have two outgoing
edges, each connected to the first statement that should be executed according
to the result of evaluating the conditional expression in the guard of the
predicate.
\begin{definition}[Control Flow Graph~\cite{???}]
A \emph{control flow graph} $G$ of a program $P$ is a tuple $\langle N, E \rangle$, where $N$ is a set of nodes, composed of a method's statements and two special nodes, ``Start'' and ``End'', and $E$ is a set of edges of the form $e = \left(n_1, n_2\right)$, each a directed edge from $n_1$ to $n_2$.
\end{definition}
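As an illustration of this definition, the CFG of procedure \texttt{f} from
Figure~\ref{fig:sdg-loop} can be written out explicitly. This is only a sketch:
the names $w$, $s$ and $p$ are shorthand introduced here, and the
parameter-passing assignments described later in this section are left out.
\begin{example}[A CFG written as a tuple]
Let $w$ stand for \texttt{while (x > y)}, $s$ for \texttt{x = x - 1} and $p$ for
\texttt{print(x)}. The CFG of \texttt{f} is then $\langle N, E \rangle$ with
\[ N = \{\mathrm{Start}, w, s, p, \mathrm{End}\} \]
\[ E = \{(\mathrm{Start}, w), (w, s), (s, w), (w, p), (p, \mathrm{End})\} \]
The predicate $w$ is the only vertex with two outgoing edges: $(w, s)$, taken
when the guard holds, and $(w, p)$, taken when it does not. Every other vertex
except ``End'' has exactly one successor.
\end{example}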
To build the PDG and then the SDG, some dependencies must be extracted from the CFG, which are defined as follows:
\begin{definition}[Postdominance]
Vertex $b$ \textit{postdominates} vertex $a$ if and only if $a \neq b$ and $b$ is on every path from $a$ to the ``End'' vertex.
\end{definition}
\begin{definition}[Control dependency]
\label{def:ctrl-dep}
Vertex $b$ is \textit{control dependent} on vertex $a$ ($a \ctrldep b$) if and only if $b$ postdominates one but not all of $a$'s successors. It follows that a vertex with only one successor cannot be the source of control dependence.
\end{definition}
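The two definitions above can be checked by hand on a small graph. The sketch
below uses the loop of procedure \texttt{f} from Figure~\ref{fig:sdg-loop}, with
$w$, $s$ and $p$ as shorthand (introduced here) for \texttt{while (x > y)},
\texttt{x = x - 1} and \texttt{print(x)}, and without the parameter-passing
assignments. It also assumes that, when Definition~\ref{def:ctrl-dep} is
applied, a successor counts among the vertices that postdominate it; this
reading is an assumption of the sketch, not something the definitions state.
\begin{example}[Postdominance and control dependence in a loop]
The edges of the CFG are $(\mathrm{Start}, w)$, $(w, s)$, $(s, w)$, $(w, p)$ and
$(p, \mathrm{End})$. Writing $pd(v)$ for the set of vertices that postdominate
$v$:
\[ pd(\mathrm{Start}) = \{w, p, \mathrm{End}\} \qquad pd(w) = \{p, \mathrm{End}\} \]
\[ pd(s) = \{w, p, \mathrm{End}\} \qquad pd(p) = \{\mathrm{End}\} \]
The only vertex with more than one successor is $w$, whose successors are $s$
and $p$. Under the assumption above, $s$ is postdominated by
$\{s\} \cup pd(s) = \{s, w, p, \mathrm{End}\}$ and $p$ by
$\{p\} \cup pd(p) = \{p, \mathrm{End}\}$. The vertices that appear in exactly
one of these two sets are $s$ and $w$, so $w \ctrldep s$ and $w \ctrldep w$.
Both $p$ and ``End'' postdominate every successor of $w$, so neither is control
dependent on it.
\end{example}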
\begin{definition}[Data dependency]
Vertex $b$ is \textit{data dependent} on vertex $a$ ($a \datadep b$) if and only if $a$ may define a variable $x$, $b$ may use $x$ and there is an $x$-definition-free path from $a$ to $b$.\footnote{The initial definition of data dependency was further split into in-loop data dependencies and the rest, but the difference is not relevant for computing the slices in the SDG.}
\end{definition}
It should be noted that variable definitions and uses can be computed for each
statement independently, analyzing the procedures called by it if necessary. In
general, any instruction uses all variables that appear in it, save for the
left-hand side of assignments. Similarly, no instruction defines variables,
except those in the left-hand side of assignments. The variables used and
defined by a procedure call are those used and defined by its body.
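The rules above can be applied by hand to procedure \texttt{f} from
Figure~\ref{fig:sdg-loop}. The following is a sketch under stated assumptions:
the set names $\mathit{DEF}$ and $\mathit{USE}$ and the vertex names $w$, $s$
and $p$ are introduced here, ``Start'' is taken as the point where the
parameters $x$ and $y$ receive their values, and \texttt{print} is treated as a
primitive that only uses its argument.
\begin{example}[Definitions, uses and data dependence]
Let $w$, $s$ and $p$ stand for \texttt{while (x > y)}, \texttt{x = x - 1} and
\texttt{print(x)}. Then
\[ \mathit{DEF}(\mathrm{Start}) = \{x, y\} \qquad \mathit{USE}(\mathrm{Start}) = \emptyset \]
\[ \mathit{DEF}(w) = \emptyset \qquad \mathit{USE}(w) = \{x, y\} \]
\[ \mathit{DEF}(s) = \{x\} \qquad \mathit{USE}(s) = \{x\} \]
\[ \mathit{DEF}(p) = \emptyset \qquad \mathit{USE}(p) = \{x\} \]
The definitions made at ``Start'' reach $w$ directly, $s$ along the path
$\mathrm{Start}, w, s$ and $p$ along the path $\mathrm{Start}, w, p$; since $w$
defines nothing, these paths are $x$-definition free. This gives
$\mathrm{Start} \datadep w$ (for $x$ and $y$), $\mathrm{Start} \datadep s$ and
$\mathrm{Start} \datadep p$ (for $x$). The definition of $x$ in $s$ reaches $w$
directly, $s$ itself along $s, w, s$ and $p$ along $s, w, p$, which gives
$s \datadep w$, $s \datadep s$ and $s \datadep p$.
\end{example}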
With the data and control dependencies, the PDG may be built by replacing the
edges of the CFG with data and control dependence edges. The former tend to be
drawn as thin solid lines and the latter as thick solid lines. In the
examples, data dependencies are drawn as thin solid red lines.
The organization of the vertices of the PDG tends to resemble a tree graph, with
the ``Start'' node in the position of the root (at the top), and the ``End''
node typically omitted. The control dependence edges structure the tree
vertically. In the case that a vertex is control dependent on multiple vertices,
it will be placed one level below the lowest source of control dependency. With
a programming language this simple, cyclical control dependencies do not appear,
but should they do so in further sections, the instructions are sorted top to
bottom in the order they appear in the program. Horizontally, the vertices are
sorted by their order in the program, left to right, in order to make the graph
more readable. Data dependency edges are placed without reordering the nodes of
the graph. In the examples given, edges like $a \datadep a$ or $b \ctrldep b$
may be omitted, as they are not relevant for later use of the graph. Note
that the location of the vertices is irrelevant for the slicing algorithm,
and the aforementioned sorting rules are just for consistency with previous
papers on the topic and to ease the visualization of programs.
Finally, the SDG is built from the combination of all the PDGs that compose the
program. Each call vertex is connected to the ``Start'' of the corresponding
procedure. All edges that connect PDGs are represented with dashed lines.
\begin{figure}
\begin{minipage}{0.3\linewidth}
\begin{lstlisting}
proc main() {
  a = 10;
  b = 20;
  f(a, b);
  print(a);
}
proc f(x, y) {
  while (x > y) {
    x = x - 1;
  }
  print(x);
}
\end{lstlisting}
\end{minipage}
\begin{minipage}{0.6\linewidth}
\includegraphics[width=0.3\linewidth]{img/cfgsimple}
\includegraphics[width=0.65\linewidth]{img/cfgsimple2}
\end{minipage}
\includegraphics[width=0.5\linewidth]{img/pdgsimple}
\includegraphics[width=0.49\linewidth]{img/pdgsimple2}
\includegraphics[width=0.6\linewidth]{img/sdgsimple}
\includegraphics[width=0.4\linewidth]{img/legendsimple}
\caption{A simple program with its CFGs (top right), PDGs (center) and SDG (bottom).}
\label{fig:sdg-loop}
\end{figure}
\subsubsection{Procedures and data dependencies}
The only thing left to explain before introducing more constructs into the
language is the passing of parameters. Most programming languages accept a
variable number of input parameters and one output parameter. In the case of
input parameters passed by reference, or constructs such as structs or classes,
modifying a field of a parameter may modify the original variable. In order to
deal with everything related to parameter passing, including global variables,
class fields, etc., a small extension must be made to the CFG and PDG.
In the CFG, the ``Start'' and ``End'' nodes contain a list of assignments,
inputting and outputting respectively the appropriate values, as can be seen in
the example. Consequently, every vertex that contains a procedure or function
call must pack and unpack the arguments. For every variable $x$ that is used in a
procedure, every call to it must be preceded by $x_{in} = x$, and the
procedure's ``Start'' vertex must contain $x = x_{in}$. The opposite happens
when a variable must be ``outputted''\carlos{replace}: before the ``End'' node,
the value must be packed ($x_{out} = x$), and after each call, the value must be
assigned to the corresponding variable ($x = x_{out}$). Parameters may be
assigned as $par^i_{in} = expr_i$ (where $i$ is the index of the parameter in
the procedure definition, $par^i$ is the name of the parameter and $expr_i$ is
the expression in the $i^{th}$ position in the procedure call) in the call
vertex, and parameters whose modifications inside the procedure are passed back
to the calling procedure must be extracted as $var = par^i_{out}$ (where $var$
is the name of the variable ---passed by reference--- in the calling
procedure).\carlos{What if object/struct passed by value?} In addition, an
extra edge, called a summary edge, is added to the SDG; it represents the
dependence of an output of the call on its inputs. This allows the
algorithm to know those dependencies without traversing the corresponding
function.
All these additions are added as extra lines in the ``Start'', ``End'' and
calling vertices. When building the PDG, all additions (variable assignments)
are split into their own vertices, each control dependent on the vertex it was
split from. Data dependencies no longer flow through the call vertex, but
through the appropriate child, which minimizes the size of the slice produced.
As an example, Figure~\ref{fig:sdg-loop} shows the three stages of a program,
from CFG to SDG. The construction of the CFG is straightforward, save for the
packing and unpacking of variables in the start, end and call vertices. In the
PDG, the statements are split, and control and data dependencies replace the
control flow edges. Finally, both PDGs are linked via call and parameter (input
and output) edges, forming the SDG. Summary edges are placed according to the
data and control flow of the method call, and the graph is complete.
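The scheme can be traced on the call \texttt{f(a, b)} of
Figure~\ref{fig:sdg-loop}. The following is a sketch rather than a transcription
of the figure: it assumes that only parameters modified by the procedure are
passed back, so no $y_{out}$ is generated, and the figure may differ in detail.
\begin{example}[Packing and unpacking the arguments of a call]
For the call \texttt{f(a, b)} in \texttt{main}:
\begin{itemize}
\item the call vertex packs the arguments as $x_{in} = a$ and $y_{in} = b$;
\item the ``Start'' vertex of \texttt{f} unpacks them as $x = x_{in}$ and
	$y = y_{in}$;
\item before the ``End'' vertex of \texttt{f}, the modified parameter is packed
	as $x_{out} = x$;
\item after the call, \texttt{main} unpacks it as $a = x_{out}$.
\end{itemize}
In the PDGs, each of these assignments becomes its own vertex. In the SDG, a
call edge joins the call vertex to the ``Start'' of \texttt{f}, parameter-input
edges join $x_{in} = a$ to $x = x_{in}$ and $y_{in} = b$ to $y = y_{in}$, and a
parameter-output edge joins $x_{out} = x$ to $a = x_{out}$. The value of $x$ at
the end of \texttt{f} depends on its initial value and, through the loop guard,
on $y$, so summary edges go from both $x_{in} = a$ and $y_{in} = b$ to
$a = x_{out}$.
\end{example}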
\section{Unconditional control flow}
Even though the initial definition of the SDG was useful to compute slices, the
language it covered fell short of the typical languages of the 1980s, which
included (in one form or another) unconditional control flow. Therefore, one of
the first additions contributed to the algorithm to build system dependence
graphs was the inclusion of unconditional jumps, such as ``break'',
``continue'', ``goto'' and ``return'' statements (or any other equivalent). A
naive representation would be to treat them the same as any other statement, but
with the outgoing edge landing in the corresponding instruction (outside the
loop, at the loop condition, at the method's end, etc.). An alternative
approach is to represent the instruction as an edge, not a vertex, connecting
the previous statement with the next to be executed. Both of these approaches
fail to generate a control dependence from the unconditional jump, as the
definition of control dependence (see Definition~\ref{def:ctrl-dep}) requires a
vertex to have more than one successor for it to be a source of control
dependence. Two solutions stem from here: the first is to redefine control
dependency in order to reflect the real effect of these instructions, as some
authors~\cite{DanBHHKL11} have tried to do, and the second is to alter the
creation of the SDG to ``create'' those dependencies, which is the most widely
used solution.
The most popular approach was proposed by Ball and Horwitz~\cite{BalH93}, and
represents each unconditional jump as a \textsl{pseudo--predicate}. The true
edge leads to the next instruction to be executed, and the false edge is a
non-executable or \textit{dummy} edge, connected to the instruction that would
be executed were the unconditional jump a \textit{nop}. The consequence of this
solution is that every instruction placed after the unconditional jump is
control dependent on the jump, as can be seen in Figure~\ref{fig:break-graphs}.
In the example, when slicing with respect to variable $a$ on line 5, every
statement would be included, save for the final \texttt{System.out.println(a)}.
Line 4 is not strictly necessary in this example (in the context of weak
slicing), but is included nonetheless. In the original paper, the
transformation is proved to be complete, but not correct, as for some examples
the slice includes more unconditional jumps than would be strictly necessary,
even for weak slicing. Ball and Horwitz theorize that a more correct approach
would be possible, were it not for the requirement that slices be a subset of
the statements of the program, in the same order as in the original.
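The effect of the dummy edge can be verified by hand on the program of
Figure~\ref{fig:break-graphs}. The following is a sketch: the vertex names are
shorthand introduced here, and, as an assumption when applying
Definition~\ref{def:ctrl-dep}, a successor counts among the vertices that
postdominate it.
\begin{example}[Control dependence created by a pseudo--predicate]
Let $c$, $i$, $j$, $u$ and $o$ stand for \texttt{while (a > 0)},
\texttt{if (a > 10)}, \texttt{break}, \texttt{a++} and
\texttt{System.out.println(a)}. The jump $j$ has two outgoing edges: the true
edge $(j, o)$, which is the one actually taken, and the dummy false edge
$(j, u)$, which leads to the instruction that would follow if $j$ were a
\textit{nop}. The remaining edges relevant here are $(c, i)$, $(c, o)$,
$(i, j)$, $(i, u)$, $(u, c)$ and $(o, \mathrm{End})$.

The vertices on every path from $o$ to ``End'' are $o$ and ``End''; those on
every path from $u$ to ``End'' are $u$, $c$, $o$ and ``End''. The vertices that
postdominate exactly one successor of $j$ are therefore $u$ and $c$, which gives
$j \ctrldep u$ and $j \ctrldep c$. Without the dummy edge, $j$ would have a
single successor and, by Definition~\ref{def:ctrl-dep}, could not be the source
of any control dependence.
\end{example}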
\begin{figure}
\centering
\begin{minipage}{0.3\linewidth}
\begin{lstlisting}
static void f() {
  int a = 1;
  while (a > 0) {
    if (a > 10) break;
    a++;
  }
  System.out.println(a);
}
\end{lstlisting}
\end{minipage}
\begin{minipage}{0.6\linewidth}
\includegraphics[width=0.4\linewidth]{img/breakcfg}
\includegraphics[width=0.59\linewidth]{img/breakpdg}
\end{minipage}
\caption{A program with unconditional control flow, its CFG (center) and PDG (right).}
\label{fig:break-graphs}
\end{figure}
\section{Exceptions}
As seen in Section~\ref{sec:intro-exception}, exception handling in Java adds
two constructs: the \texttt{throw} and the \texttt{try-catch} statements. The
first one resembles an unconditional control flow statement, with a destination
that is unknown at compile time. The exception will be caught by a
\texttt{catch} of the corresponding type or a supertype, if one exists.
Otherwise, it will terminate the corresponding thread (or, in single-threaded
programs, stop the Java Virtual Machine). The second stops the exceptional
control flow conditionally, based on the dynamic type of the exception thrown.
Both introduce challenges that must be solved.
\subsection{\texttt{throw} statement}
The \texttt{throw} statement represents two elements at the same time: an
unconditional jump and an erroneous exit from its method. The first one has
been extensively covered and solved, but the second requires a small addition
to the CFG: instead of having a single ``End'' node, it is split in two
(``Normal exit'' and ``Error exit''). The ``End'' node cannot be removed,
however, as most slicing algorithms require the CFG to have only one sink node.
Therefore, all nodes that were connected to the ``End'' now lead to ``Normal
exit'', the true outgoing edge of every \texttt{throw} statement connects to
``Error exit'', and both exit nodes converge on the ``End'' node.

\texttt{throw} statements in Java take a single value, whose type is a subtype
of \texttt{Throwable}; that value is used to stop the propagation of the
exception, and can be handled as a returned value. This treatment of
\texttt{throw} statements only modifies the structure of the CFG, without
altering any other algorithm or the basic definitions of control and data
dependencies, which makes it very easy to incorporate into any existing slicing
software that follows the general model described.
\begin{example}[CFG of an uncaught \texttt{throw} statement] \ \\
\begin{minipage}{0.69\linewidth}
\begin{lstlisting}
double f(int x) {
  if (x < 0)
    throw new RuntimeException();
  return Math.sqrt(x);
}
\end{lstlisting}
By analyzing the CFG, we can see that both exits are control dependent on the \texttt{throw}
statement; data dependencies present no special case in this example.
\end{minipage}
\begin{minipage}{0.3\linewidth}
\includegraphics[width=\linewidth]{img/throw-example-cfg}
\end{minipage}
\end{example}
\subsection{\texttt{try-catch} statement}
The \texttt{try-catch-finally} statement is the only way to stop an exception once it is thrown,
filtering by type, or otherwise letting it propagate further up the call stack. On top of that,
\texttt{finally} helps guarantee consistency, executing in any case (even when an exception is
left uncaught, the program returns or an exception occurs in a \texttt{catch} block). The main
problem with this construct is that \texttt{catch} blocks are not always needed by the slice, but
removing them may make the compilation fail (because a \texttt{try} block is left with neither a
\texttt{catch} nor a \texttt{finally} block) or modify the execution in unexpected ways that are
not always accounted for in slicing software.

The \texttt{try} block is normally represented as a pseudo--predicate, connected to the
first statement inside it and to the first instruction after the whole \texttt{try-catch-finally}
construct. Inside the \texttt{try} there can be four distinct sources of exceptions:
\begin{description}
\item[Method calls.] If an exception is thrown inside a method and it is not caught, it will
	surface inside the \texttt{try} block. As \textit{checked} exceptions must be declared
	explicitly, method declarations may be consulted to see whether a method call may
	throw any exceptions. On this front, polymorphism and inheritance present no problem, as
	an overriding method may not declare checked exceptions beyond those declared by the
	method it overrides. If \textit{unchecked} exceptions are also considered, all method calls
	must be included, as any of them can trigger at the very least a \texttt{StackOverflowError}.
\item[\texttt{throw} statements.] The least common, but simplest, source, as it is treated as a
	\texttt{throw} inside a method.
\item[Implicit unchecked exceptions.] If \textit{unchecked} exceptions are considered, many
	common expressions may throw an exception. The most common cases are calling
	a method or accessing a field of a \texttt{null} object (\texttt{NullPointerException}),
	accessing an invalid index of an array (\texttt{ArrayIndexOutOfBoundsException}), dividing
	an integer by 0 (\texttt{ArithmeticException}) and casting to an incompatible type
	(\texttt{ClassCastException}), among many others. On top of that, the user may create new
	types that inherit from \texttt{RuntimeException}, but those may only be explicitly thrown.
	Their inclusion in program slicing, and therefore in the method's CFG, generates extra
	dependencies that make the slices produced bigger.
\item[Errors.] May be generated at any point in the execution of the program, but they normally
	signal a situation from which it may be impossible to recover, such as an internal JVM error.
	In general, most programs do not consider these to be ``catch-able''.
\end{description}
All exception sources are treated in a similar fashion: the statement that may throw an exception
is treated as a predicate, with the true edge connected to the next instruction were the statement
to execute without raising exceptions; and the false edge connected to the \texttt{catch} node.
\carlos{CATCH Representation doesn't matter, it is similar to a switch but checking against types.
The difference exists where there exists the possibility of not catching the exception;
which is semantically possible to define. When a \texttt{catch (Throwable e)} is declared,
it is impossible for the exception to exit the method; therefore the control dependency must
be redefined.}
The filter for exceptions in Java's \texttt{catch} blocks is a type (or multiple types since
Java 7), with a class that encompasses all possible exceptions (\texttt{Throwable}), which acts
as a catch--all.

In the literature there exist two alternatives to represent \texttt{catch}: one mimics a static
switch statement, placing all the \texttt{catch} block headers at the same height, all hanging
from the exception--throwing statement, and the other mimics a dynamic switch or a chain of \texttt{if}
statements. The option chosen affects how control dependencies should be computed, as the different
structures generate different control dependencies by default.
\begin{description}
\item[Switch representation.] There exists no relation between different \texttt{catch} blocks,
each exception--throwing statement is connected through an edge labeled false to each
of the \texttt{catch} blocks that could be entered. Each \texttt{catch} block is a
pseudo--statement, with its true edge connected to the end of the \texttt{try} and the
As an example, a \texttt{1 / 0} expression may be connected to \texttt{ArithmeticException},
\texttt{RuntimeException}, \texttt{Exception} or \texttt{Throwable}.
If any exception may not be caught, there exists a connection to the ``Error exit'' of the method.
\item[If-else representation.] Each exception--throwing statement is connected to the first
\texttt{catch} block. Each \texttt{catch} block is represented as a predicate, with the true
edge connected to the first statement inside the \texttt{catch} block, and the false edge
to the next \texttt{catch} block, until the last one. The last one will be a pseudo--predicate
connected to the first statement after the \texttt{try} if it is a catch--all type or to the
``Error exit'' if it isn't.
\end{description}
\begin{example}[Catches.]\ \\
\begin{minipage}{0.49\linewidth}
\begin{lstlisting}
try {
f();
} catch (CheckedException e) {
} catch (UncheckedException e) {
} catch (Throwable e) {
}
\end{lstlisting}
\end{minipage}
\begin{minipage}{0.49\linewidth}
\carlos{missing figures with 4 alternatives: if-else (with catch--all and without) and switch (same two)}
% \includegraphics[0.5\linewidth]{img/catch1}
% \includegraphics[0.5\linewidth]{img/catch2}
% \includegraphics[0.5\linewidth]{img/catch3}
% \includegraphics[0.5\linewidth]{img/catch4}
\end{minipage}
\end{example}
Regardless of the approach, when there exists a catch--all block, there is no dependency generated
from the \texttt{catch}, as all of them will lead to the next instruction. However, this means that
if no data is output from the \texttt{try} or \texttt{catch} block, the catches will not be picked
up by the slicing algorithm, which may alter the results unexpectedly. If this problem arises, the
simple and obvious solution is to add artificial edges to force the inclusion of all \texttt{catch}
blocks. This adds instructions to the slice, lowering its score when evaluated against benchmarks,
but the added blocks are completely innocuous, as they just stop the exception without running any
extra instruction.

A more precise alternative exists, but it slows down the process of creating a slice from an SDG.
The \texttt{catch} block is only strictly needed if an exception that it catches may be thrown and
an instruction after the \texttt{try-catch} block should be executed; in any other case the \texttt{catch}
block is irrelevant and should not be included. However, this change requires analyzing the inclusion
of \texttt{catch} blocks after the two--pass algorithm has completed, slowing it down. Each
approach trades time for accuracy or vice versa, but the trade--off is small enough to be negligible.
Regarding \textit{unchecked} exceptions, an extra layer of analysis should be performed to tag statements
with the possible exceptions they may throw. On top of that, methods must be analyzed and tagged
accordingly. The worst case is that of inaccessible methods: they may throw any \texttt{RuntimeException}
and, as their source code is unavailable, they must be marked as capable of throwing it. This results in
a graph where each instruction is dependent on the proper execution of the previous statement, save
for simple statements that may not generate exceptions. The trade--off here is between completeness and
correctness, with the inclusion of \textit{unchecked} exceptions increasing both the completeness and the
slice size, reducing correctness. A possible solution would be to only consider user--generated exceptions
or to assume that library methods may never throw an unchecked exception. Another would be a new slicing
variation that annotates methods or limits the unchecked exceptions to be considered.
Regarding the \texttt{finally} block, most approaches treat it properly, representing it twice: once
for the case where there is no active exception and once for the case where it executes with
an exception active. An exception could also be thrown here, but that would be represented normally.
% vim: set noexpandtab:ts=2:sw=2:wrap