edit introduction and remove todos

This commit is contained in:
Carlos Galindo 2019-10-21 17:08:29 +00:00
parent 064b2a322f
commit 998d943573
6 changed files with 136 additions and 81 deletions

View file

@ -1,6 +1,6 @@
all: paper.pdf
pdf: images
pdf: paper.pdf
images:
$(MAKE) -C img

View file

@ -4,6 +4,7 @@
\chapter{Main explanation?}
\section{First definition of the SDG}
\label{sec:first-def-sdg}
The system dependence graph (SDG) is a method for program slicing that was first proposed by Horwitz, Reps and Binkley \cite{HorwitzRB88}. It builds upon the existing control flow graph (CFG), defining dependencies between vertices of the CFG, and building a program dependence graph (PDG), which represents them. The system dependence graph (SDG) is then built from the assembly of the different PDGs (each representing a method of the program), linking each method call to its corresponding definition. Because each graph is built from the previous one, new constructs can be added to the CFG, without the need to alter the algorithm that converts CFG to PDG and then to SDG. The only modification possible is the redefinition of a dependency or the addition of new kinds of dependence.
@ -66,7 +67,7 @@ Finally, the SDG is built from the combination of all the PDGs that compose the
The only thing left to explain before introducing more constructs into the language is the passing of parameters. Most programming languages accept a variable number of input parameters and one output parameter. In the case of input parameters passed by reference, or constructs such as structs or classes, modifying a field of a parameter may modify the original variable. In order to deal with everything related to parameter passing, including global variables, class fields, etc. there is a small extension to be made to the CFG and PDG.
In the CFG, the ``Start'' and ``End'' nodes contain a list of assignments, inputting and outputting respectively the appropriate values, as can be seen in the example. Consequently, every vertex that contains a procedure or function call packs and unpacks the arguments. For every variable $x$ that is used in a procedure, every call to it must be preceded by $x_{in} = x$, and the procedure's ``Start'' vertex must contain $x = x_{in}$. The opposite happens when a variable must be ``outputted''\todo{replace}: before the ``End'' node, the value must be packed ($x_{out} = x$), and after each call, the value must be assigned to the corresponding variable ($x = x_{out}$). Parameters may be assigned as $par^i_{in} = expr_i$ (where $i$ is the index of the parameter in the procedure definition, $par^i$ is the name of the parameter and $expr_i$ is the expression in the $i^{th}$ position in the procedure call) in the call vertex, and parameters whose modifications inside the procedure are passed back to the calling procedure must be extracted as $var = par^i_{out}$ (where $var$ is the name of the variable ---passed by reference--- in the calling procedure).\todo{What if object/struct passed by value?} As an addition, in the SDG, an extra edge is added (summary edge), which represents the dependencies of the output variables on the inputs. This allows the algorithm to know the dependencies without traversing the corresponding function.
In the CFG, the ``Start'' and ``End'' nodes contain a list of assignments, inputting and outputting respectively the appropriate values, as can be seen in the example. Consequently, every vertex that contains a procedure or function call packs and unpacks the arguments. For every variable $x$ that is used in a procedure, every call to it must be preceded by $x_{in} = x$, and the procedure's ``Start'' vertex must contain $x = x_{in}$. The opposite happens when a variable must be ``outputted''\carlos{replace}: before the ``End'' node, the value must be packed ($x_{out} = x$), and after each call, the value must be assigned to the corresponding variable ($x = x_{out}$). Parameters may be assigned as $par^i_{in} = expr_i$ (where $i$ is the index of the parameter in the procedure definition, $par^i$ is the name of the parameter and $expr_i$ is the expression in the $i^{th}$ position in the procedure call) in the call vertex, and parameters whose modifications inside the procedure are passed back to the calling procedure must be extracted as $var = par^i_{out}$ (where $var$ is the name of the variable ---passed by reference--- in the calling procedure).\carlos{What if object/struct passed by value?} As an addition, in the SDG, an extra edge is added (summary edge), which represents the dependencies of the output variables on the inputs. This allows the algorithm to know the dependencies without traversing the corresponding function.
All these additions are added as extra lines in the ``Start'', ``End'' and calling vertices.
When building the PDG, all additions (variable assignments) are split into their own vertices, and are control dependent on the vertex they were split from.
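The packing and unpacking described above can be illustrated with a minimal Java sketch. It assumes a single integer variable $x$ that is modified by the procedure and passed back; the class name, the static fields \texttt{xIn} and \texttt{xOut} (standing in for $x_{in}$ and $x_{out}$) and the procedure \texttt{increment} are hypothetical names chosen for illustration, not part of the original proposal.

```java
class ParameterPassingSketch {
    // Hypothetical transfer variables standing in for x_in and x_out.
    static int xIn;
    static int xOut;

    // Procedure whose only variable x is modified and passed back.
    static void increment() {
        int x = xIn;   // "Start" vertex: x = x_in
        x = x + 1;     // body of the procedure
        xOut = x;      // before the "End" vertex: x_out = x
    }

    // Call site: the value is packed before the call and unpacked after it.
    static int callIncrement(int x) {
        xIn = x;       // call vertex, before the call: x_in = x
        increment();
        x = xOut;      // call vertex, after the call: x = x_out
        return x;
    }

    public static void main(String[] args) {
        System.out.println(callIncrement(41)); // prints 42
    }
}
```

Each of the four commented assignments corresponds to one of the extra lines added to the ``Start'', ``End'' and calling vertices, and would become its own vertex in the PDG.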
@ -84,7 +85,7 @@ Therefore, one of the first additions contributed to the algorithm to build syst
A naive representation would be to treat them the same as any other statement, but with the outgoing edge landing in the corresponding instruction (outside the loop, at the loop condition, at the method's end, etc.).
An alternative approach is to represent the instruction as an edge, not a vertex, connecting the previous statement with the next to be executed.
Both of these approaches fail to generate a control dependence from the unconditional jump, as the definition of control dependence (see Definition~\ref{def:ctrl-dep}) requires a vertex to have more than one successor for it to be possible to be a source of control dependence.
A possible ---but difficult--- solution would be to redefine control dependence, as some\todo{citation-needed} have done.
A possible ---but difficult--- solution would be to redefine control dependence, as some\carlos{citation-needed} have done.
The most popular solution was proposed by Ball and Horwitz\cite{BalH93}, and represents unconditional jumps as a predicate.
The true edge would lead to the next instruction to be executed, and the false edge would be a non-executable or \textit{dummy} edge, connected to the instruction that would be executed were the unconditional jump a \textit{nop}.
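As a rough illustration of this representation, the following Java sketch stores a CFG whose edges carry an executable flag, and models a \texttt{break} as a predicate with one executable and one dummy edge. The class \texttt{JumpCfgSketch}, its methods and the numbered example program are assumptions made for this sketch; they are not taken from Ball and Horwitz's paper.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class JumpCfgSketch {
    // A CFG edge; dummy edges are marked as non-executable.
    static final class Edge {
        final int target;
        final boolean executable;
        Edge(int target, boolean executable) {
            this.target = target;
            this.executable = executable;
        }
    }

    private final Map<Integer, List<Edge>> edges = new HashMap<>();

    void addEdge(int from, int to, boolean executable) {
        edges.computeIfAbsent(from, k -> new ArrayList<>()).add(new Edge(to, executable));
    }

    // Successors of a node, optionally counting non-executable edges.
    List<Integer> successors(int node, boolean includeNonExecutable) {
        List<Integer> result = new ArrayList<>();
        for (Edge e : edges.getOrDefault(node, new ArrayList<>())) {
            if (e.executable || includeNonExecutable) {
                result.add(e.target);
            }
        }
        return result;
    }

    // 1: while (x > 0) { 2: if (x == 5) 3: break; 4: x--; } 5: print(x);
    static JumpCfgSketch example() {
        JumpCfgSketch cfg = new JumpCfgSketch();
        cfg.addEdge(1, 2, true);
        cfg.addEdge(1, 5, true);
        cfg.addEdge(2, 3, true);
        cfg.addEdge(2, 4, true);
        cfg.addEdge(3, 5, true);   // true edge: where the break actually jumps
        cfg.addEdge(3, 4, false);  // dummy edge: next statement if break were a nop
        cfg.addEdge(4, 1, true);
        return cfg;
    }

    public static void main(String[] args) {
        JumpCfgSketch cfg = example();
        System.out.println(cfg.successors(3, true));  // prints [5, 4]
        System.out.println(cfg.successors(3, false)); // prints [5]
    }
}
```

With the dummy edge included, the \texttt{break} vertex has two successors, so it satisfies the requirement for being a source of control dependence; ignoring dummy edges recovers the real execution paths.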

View file

@ -67,7 +67,7 @@ of the program (i.e., for any input data).
when running $S$ is a prefix of the values produced when running $P$.
\end{definition}
Both definitions (\ref{def:strong-slice} and~\ref{def:weak:slice}) are
Both definitions (\ref{def:strong-slice} and~\ref{def:weak-slice}) are
used throughout the literature, with some cases favoring the first and some the
second. Though the definitions come from the corresponding citations, the naming
was first used in a control dependency analysis by Danicic~\cite{DanBHHKL11},
@ -98,75 +98,95 @@ included in the slice and the program are behaving in a different way.
\caption{Execution logs of different slices and their original program.}
\end{table}
The most efficient and broadly used data structure for slicing is the system
dependence graph (SDG), first introduced by Horwitz, Reps and Binkley
\cite{HorwitzRB88}. It represents the statements of a program as vertices, and
their dependencies as directed edges. Method calls are connected to method
definitions, and so are the corresponding input and output parameters. SDGs show
two different kinds of dependencies: \textsl{data} and \textsl{control}. The
first one connects nodes that write to variables (i.e., they \emph{define} their
value) to the nodes that use (or \textsl{may} use) the value, and it is often
represented as a dashed\todo{check} line.
Control dependencies are used to represent which nodes have control over the
execution of others (conditional jumps and loops, mainly), and its
representation is often a solid line. In order to obtain a slice of a program,
its SDG must be built ($\mathcal{O}(n^2)$) from the source code.
Then a two pass search ($\mathcal{O}(n)$ each) is performed to obtain the slice.
The SDG can be reused to obtain a different slice of the same program (with a
different criterion or kind \carlos{change word} of slice).
The efficiency derives from the linear cost of the search on the SDG, so most
modifications modify the complexity of the SDG's construction, but try to keep
the slice process linear.
Program slicing is a language--agnostic tool, but the original proposal by
Weiser~\cite{Wei81} covers a simple imperative programming language.
Since then, the literature has been expanded by dozens of authors, who have
described and implemented slicing for more complex structures, such as
uncontrolled control flow~\cite{HorwitzRB88}, global variables~\cite{???},
exception handling~\cite{AllH03}; and for other programming paradigms, such as
object-oriented languages~\cite{???} or functional languages~\cite{???}.
\carlos{More can be added; the corresponding citations are missing.}
The SDG is built in 3 stages, each resulting in a different graph:
\subsection{The System Dependence Graph (SDG)}
\begin{description}
\item[CFG] The control flow graph is the representation of the control
dependencies in a method of a program. Every statement has an edge from
itself to every statement that can immediately follow. This means that
most will only have one outgoing edge, and conditional jumps and loops
will have two. The graph starts in a ``Begin'' or ``Start'' node, and
ends in an ``End'' node, to which the last statement and all return
statements are connected. It is created directly from the source code,
without any need for data dependency analysis.
\item[PDG] The program dependence graph is the result of restructuring and
adding data dependencies to a CFG. All statements are placed below and
connected to a ``Begin'' node, except those which are inside a loop or
conditional block. Then data dependencies are added (red or dashed
edges), adding an edge between two nodes if there is a data dependency.
\todo{add definitions?}
\item[SDG] Finally, the system dependence graph is the interconnection of
each method's PDG. When a call is made, the input arguments are passed
to subnodes of the call, and the result is obtained in another subnode.
There is an edge from the call to the beginning of the corresponding
method, and an extra type of edge exists: \textsl{summary edges}, which
summarize the data dependencies between input and output variables.
\end{description}
There exist multiple approaches to compute a slice from a given program and
criterion, but the most efficient and broadly used data structure is the System
Dependence Graph (SDG), first introduced by Horwitz, Reps and
Binkley~\cite{HorwitzRB88}. It is computed from the program's statements, and
once built, a slicing criterion is chosen, the graph traversed using a specific
algorithm, and the slice obtained. Its efficiency resides in the fact that for
multiple slices that share the same program, the graph must only be built once.
On top of that, building the graph has a complexity of $\mathcal{O}(n^2)$ with
respect to the number of statements in a program, but the traversal is linear
with respect to the number of nodes in the graph (each corresponding to a
statement).
The SDG is a directed graph, and as such it has vertices or nodes, each
representing an instruction in the program ---barring some auxiliary nodes
introduced by some approaches--- and directed edges, which represent the
dependencies among nodes. Those edges represent various kinds of dependencies
---control, data, calls, parameter passing, summary--- which will be defined in
section~\ref{sec:first-def-sdg}.
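A minimal Java sketch of such a graph and of a slice obtained by traversing it backwards is given here. It makes several simplifying assumptions: it models a single method (the dependencies a PDG would hold for a small \texttt{multiply} method), only control and data edges, and a plain backward reachability search instead of the two-pass traversal required for interprocedural slicing. The class name, the node numbering and the choice to attach the parameters $x$ and $y$ to the enter node are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class DependenceGraphSketch {
    enum Kind { CONTROL, DATA }

    // A dependence edge, stored at its target: "target depends on source".
    static final class Edge {
        final int source;
        final Kind kind;
        Edge(int source, Kind kind) {
            this.source = source;
            this.kind = kind;
        }
    }

    private final Map<Integer, List<Edge>> incoming = new HashMap<>();

    void addDependence(int source, int target, Kind kind) {
        incoming.computeIfAbsent(target, k -> new ArrayList<>()).add(new Edge(source, kind));
    }

    // Collects every node the criterion transitively depends on.
    // Each node is expanded once, so the cost is linear in the graph size.
    Set<Integer> backwardSlice(int criterion) {
        Set<Integer> slice = new TreeSet<>();
        Deque<Integer> pending = new ArrayDeque<>();
        pending.push(criterion);
        while (!pending.isEmpty()) {
            int node = pending.pop();
            if (!slice.add(node)) {
                continue; // already visited
            }
            for (Edge e : incoming.getOrDefault(node, new ArrayList<>())) {
                pending.push(e.source);
            }
        }
        return slice;
    }

    // 0: enter  1: result = 0  2: while (x > 0)  3: result += y
    // 4: x--    5: println(result)               6: return result
    static DependenceGraphSketch multiplyExample() {
        DependenceGraphSketch g = new DependenceGraphSketch();
        for (int n : new int[] {1, 2, 5, 6}) g.addDependence(0, n, Kind.CONTROL);
        for (int n : new int[] {2, 3, 4}) g.addDependence(2, n, Kind.CONTROL);
        for (int n : new int[] {2, 4}) g.addDependence(0, n, Kind.DATA); // parameter x
        g.addDependence(0, 3, Kind.DATA);                                // parameter y
        for (int n : new int[] {2, 4}) g.addDependence(4, n, Kind.DATA); // x after x--
        for (int n : new int[] {3, 5, 6}) {
            g.addDependence(1, n, Kind.DATA); // result = 0
            g.addDependence(3, n, Kind.DATA); // result += y
        }
        return g;
    }

    public static void main(String[] args) {
        System.out.println(multiplyExample().backwardSlice(4)); // prints [0, 2, 4]
    }
}
```

Slicing on the statement \texttt{x--} keeps only the enter node, the loop condition and the decrement itself, since none of them depends on \texttt{result}. Building the graph once and calling \texttt{backwardSlice} with different criteria mirrors the reuse described above.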
To create the SDG, first a \textsl{control flow graph} is built for each method
in the program, then its control and data dependencies are computed, resulting
in the \textsl{program dependence graph}. Finally, all the graphs from every
method are joined into the SDG. This process will be explained at greater
length in section~\ref{sec:first-def-sdg}.
%TODO: marked for removal --- this process is repeated later in ref{sec:first-def-sdg}
%\begin{description}
%\item[CFG] The control flow graph is the representation of the control
%dependencies in a method of a program. Every statement has an edge from
%itself to every statement that can immediately follow. This means that
%most will only have one outgoing edge, and conditional jumps and loops
%will have two. The graph starts in a ``Begin'' or ``Start'' node, and
%ends in an ``End'' node, to which the last statement and all return
%statements are connected. It is created directly from the source code,
%without any need for data dependency analysis.
%\item[PDG] The program dependence graph is the result of restructuring and
%adding data dependencies to a CFG. All statements are placed below and
%connected to a ``Begin'' node, except those which are inside a loop or
%conditional block. Then data dependencies are added (red or dashed
%edges), adding an edge between two nodes if there is a data dependency.
%\item[SDG] Finally, the system dependence graph is the interconnection of
%each method's PDG. When a call is made, the input arguments are passed
%to subnodes of the call, and the result is obtained in another subnode.
%There is an edge from the call to the beginning of the corresponding
%method, and an extra type of edge exists: \textsl{summary edges}, which
%summarize the data dependencies between input and output variables.
%\end{description}
An example is provided in figure~\ref{fig:basic-graphs}, where a simple
multiplication program is converted to CFG, then PDG and finally SDG. For
simplicity only the CFG and PDG of \texttt{multiply} are shown. Control
simplicity, only the CFG and PDG of \texttt{multiply} are shown. Control
dependencies are black, data dependencies red and summary edges blue.
\begin{figure}
\centering
% \lstinputlisting[firstline=8, lastline=16]{./dot/simple.java}
\includegraphics[width=0.5\linewidth]{img/multiplycfg}
\begin{minipage}{0.4\linewidth}
\begin{lstlisting}
int multiply(int x, int y) {
int result = 0;
while (x > 0) {
result += y;
x--;
}
System.out.println(result);
return result;
}
\end{lstlisting}
\end{minipage}
\begin{minipage}{0.59\linewidth}
\includegraphics[width=\linewidth]{img/multiplycfg}
\end{minipage}
\includegraphics[width=\linewidth]{img/multiplypdg}
\includegraphics[width=\linewidth]{img/multiplysdg}
\caption{A simple multiplication program, its CFG, PDG and SDG}
\label{fig:basic-graphs}
\end{figure}
The original proposal by Weiser\cite{Wei81} covers only the simplest features of an imperative
programming language. The various iterations\todo{cite} until reaching the
SDG\todo{cite} have added other elements, such as return statements\todo{cite},
global variables\todo{cite}, object oriented features\todo{cite} and finally
exception handling\cite{AllH03}.
\subsection{Metrics}
There are 5 metrics considered when evaluating a slicing algorithm:
There are four relevant metrics considered when evaluating a slicing algorithm:
\begin{description}
\item[Completeness] The solution includes all the statements that affect the
@ -193,15 +213,6 @@ There are 5 metrics considered when evaluating a slicing algorithm:
from the aforementioned schema show a wider variation in speed.
\end{description}
\subsection{Program slicing as a debugging technique}
Program slicing is first and foremost a debugging technique, having each
variation a different purpose:
\begin{description}
\item[Backward static]
\end{description}
\section{Exception handling in Java}
\label{sec:intro-exception}
@ -253,15 +264,60 @@ consists of the following elements:
\section{Exception handling in other programming languages}
In almost all programming languages, errors exist, and must be dealt with.
Java's exception system is a common one among object-oriented programming
languages, but not the only one,
In almost all programming languages, errors can appear (either through the
developer, the user or the system's fault), and must be dealt with.
Most of the popular object-oriented languages feature some kind of error system,
normally very similar to Java's exceptions. In this section, we will perform a
small survey on the most popular programming languages. The ``most popular''
list has been obtained from the Stack Overflow 2019 Developer
Survey\footnotemark ($>5\%$ usage in the industry). The languages and their
usage in the industry are shown in Figure~\ref{fig:languages}.
small survey of the error-handling techniques used in the most popular
programming languages. The language list has been extracted from a survey
performed by the programming Q\&A website Stack
Overflow\footnote{\url{https://stackoverflow.com}}. The survey contains a
question about the technologies used by professional developers in their work,
and from that list we have extracted those languages with more than $5\%$ usage
in the industry. Table~\ref{tab:popular-languages} shows the list and its
source.
\begin{table}
\begin{minipage}{0.6\linewidth}
\centering
\begin{tabular}{r | r }
\textbf{Language} & $\%$ usage \\ \hline
JavaScript & 69.7 \\ \hline
HTML/CSS & 63.1 \\ \hline
SQL & 56.5 \\ \hline
Python & 39.4 \\ \hline
Java & 39.2 \\ \hline
Bash/Shell/PowerShell & 37.9 \\ \hline
C\# & 31.9 \\ \hline
PHP & 25.8 \\ \hline
TypeScript & 23.5 \\ \hline
C++ & 20.4 \\ \hline
\end{tabular}
\end{minipage}
\begin{minipage}{0.39\linewidth}
\begin{tabular}{r | r }
\textbf{Language} & $\%$ usage \\ \hline
C & 17.3 \\ \hline
Ruby & 8.9 \\ \hline
Go & 8.8 \\ \hline
Swift & 6.8 \\ \hline
Kotlin & 6.6 \\ \hline
R & 5.6 \\ \hline
VBA & 5.5 \\ \hline
Objective-C & 5.2 \\ \hline
Assembly & 5.0 \\ \hline
\end{tabular}
\end{minipage}
% The caption has a weird structure due to the fact that there's a footnote
% inside of it.
\caption[Commonly used programming languages]{The most commonly used
programming languages by professional developers\protect\footnotemark}
\label{tab:popular-languages}
\end{table}
\footnotetext{Data from \url{https://insights.stackoverflow.com/survey/2019/\#technology-\_-programming-scripting-and-markup-languages}}
Most of them feature an exception system similar to the one appearing in Java,
while others (bash, assembly, VBA, C) have no built-in method, but allow
\carlos{todo}. Some
@ -271,8 +327,6 @@ check if the exception is of a given set of types for the catching mechanism
exceptions ---either by catching the type from which all exceptions inherit or
by providing no condition to check.
\footnotetext{\url{https://insights.stackoverflow.com/survey/2019/\#technology-\_-programming-scripting-and-markup-languages}}
Go doesn't have an exception system per se, but a simple one can be built by
using the keywords ``panic'' (throw an exception with a value associated),
``defer'' (finally, run even when a panic is activated) and ``recover''
@ -284,4 +338,4 @@ acting as a finally. The panic can only be stopped via the ``recover''
instruction, which obtains the value associated with the panic. Then, the
exception
% vim: set noexpandtab:ts=2:sw=2:wrap
% vim: set noexpandtab:tabstop=2:sw=2:wrap

View file

@ -72,12 +72,12 @@
\maketitle
\begin{abstract}
This must be filled \todo{complete}
\carlos{to be completed}
\end{abstract}
\selectlanguage{spanish}
\begin{abstract}
A completar \todo{completar}
\carlos{por completar}
\end{abstract}
\selectlanguage{english}

View file

@ -43,11 +43,11 @@ This is the process used to build the Program Dependence Graph.
\footnotetext{A problem presents itself here, as some exceptions may be able to trigger different catch blocks, due to the sequential nature of catches and polymorphism in Java. A way to fix this is to make catch blocks behave as a switch.}. %TODO
\item[Step 3 (compute dependences):] For each node in the CFG, compute the control and data dependencies. Non-executable edges are only included when computing control dependencies.\\
\todo{put inside definition}
\carlos{put inside definition}
A node $a$ is \textsl{control dependent} on node $b$ iff $a$ post-dominates one but not all of $b$'s successors.\\
A node $a$ is \textsl{data dependent} on node $b$ iff $b$ defines or may define a variable $x$, $a$ uses or may use $x$, and there is an $x$-definition-free path in the CFG from $b$ to $a$.\\
\item[Step 4 (convert each CFG into a PDG):] each node of the CFG is one node of the PDG, with two exceptions. The first are the \textsl{enter}, \textsl{exit} and method call nodes, where the variable input and output assignments are split and placed as control-dependent on their original node. The second is the \textsl{exit} node, which is to be removed (the control dependencies from \textsl{exit} to the variable outputs are transferred to the \textsl{enter} node). Then all the dependencies computed in the previous step are drawn.
\item[Step 5 (connect PDGs to form an SDG):] each method call to $M$ must be connected to the \textsl{enter} node in $M$'s PDG, as a control dependence. Each variable input from the method call is connected to a variable input of the method definition via a data dependence. Each variable output from the method definition is connected to the variable output of the method call via a data dependence. Each method exit is connected \todo{complete}.
\item[Step 5 (connect PDGs to form an SDG):] each method call to $M$ must be connected to the \textsl{enter} node in $M$'s PDG, as a control dependence. Each variable input from the method call is connected to a variable input of the method definition via a data dependence. Each variable output from the method definition is connected to the variable output of the method call via a data dependence. Each method exit is connected \carlos{complete}.
\end{description}
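The definition of control dependence used in Step 3 can be sketched in Java as follows. This is a minimal illustration under stated assumptions: post-domination is taken to be reflexive, every node is assumed to reach the exit node, post-dominator sets are computed with a simple iterative fixpoint rather than an efficient algorithm, and no extra edge from \textsl{enter} to \textsl{exit} is added, so nothing appears as control dependent on \textsl{enter}. The class name, method names and the example CFG are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class ControlDependenceSketch {
    // Example CFG; successors[n] lists the successors of node n.
    // 0: enter  1: result = 0  2: while (x > 0)  3: result += y
    // 4: x--    5: println(result)  6: return result  7: exit
    static final int[][] MULTIPLY_CFG = {
        {1}, {2}, {3, 5}, {4}, {2}, {6}, {7}, {}
    };

    // pdom.get(n) is the set of nodes that post-dominate n (including n).
    static List<Set<Integer>> postDominators(int[][] successors, int exit) {
        int n = successors.length;
        List<Set<Integer>> pdom = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            Set<Integer> initial = new TreeSet<>();
            if (i == exit) {
                initial.add(exit);
            } else {
                for (int j = 0; j < n; j++) initial.add(j); // start from "all nodes"
            }
            pdom.add(initial);
        }
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int i = 0; i < n; i++) {
                if (i == exit) continue;
                // Intersect the post-dominators of all successors, then add i.
                Set<Integer> updated = null;
                for (int s : successors[i]) {
                    if (updated == null) updated = new TreeSet<>(pdom.get(s));
                    else updated.retainAll(pdom.get(s));
                }
                if (updated == null) updated = new TreeSet<>();
                updated.add(i);
                if (!updated.equals(pdom.get(i))) {
                    pdom.set(i, updated);
                    changed = true;
                }
            }
        }
        return pdom;
    }

    // Nodes a such that a post-dominates one but not all of b's successors.
    static Set<Integer> controlDependentsOf(int b, int[][] successors, int exit) {
        List<Set<Integer>> pdom = postDominators(successors, exit);
        Set<Integer> result = new TreeSet<>();
        for (int a = 0; a < successors.length; a++) {
            int postDominated = 0;
            for (int s : successors[b]) {
                if (pdom.get(s).contains(a)) postDominated++;
            }
            if (postDominated > 0 && postDominated < successors[b].length) {
                result.add(a);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(controlDependentsOf(2, MULTIPLY_CFG, 7)); // prints [2, 3, 4]
    }
}
```

In the example, the loop body (nodes 3 and 4) and the loop condition itself are control dependent on the loop condition, while the statements after the loop are not, because they post-dominate both of its successors.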
\begin{itemize}

View file

@ -3,7 +3,7 @@
% !TeX root = paper.tex
\chapter{State of the art}
Slicing was proposed\cite{Wei81} and improved until the proposal of the current system (the SDG) \todo{(citation)}. Specifically in the context of exceptions, multiple approaches have been attempted, with varying degrees of success. There exist commercial solutions for various programming languages: \todo{name them and link}.
Slicing was proposed\cite{Wei81} and improved until the proposal of the current system (the SDG) \carlos{(citation)}. Specifically in the context of exceptions, multiple approaches have been attempted, with varying degrees of success. There exist commercial solutions for various programming languages: \carlos{name them and link}.
In the realm of academia, there exists no definitive solution. One of the most relevant initial proposals is \cite{AllH03}, although it is not the first one\cite{SinH98,SinHR99} to target Java specifically.
It uses the existing proposals for \textsl{return}, \textsl{goto} and other unconditional jumps to model the behavior of \textsl{throw} statements. Control flow inside \textsl{try-catch-finally} statements is simulated, both for explicit \textsl{throw} and those nested inside a method call. The base algorithm is presented, and then the proposal is detailed as changes. Unchecked exceptions are considered but regarded as ``worthless'' to include, due to the increase in size of the slices, which reduces their effectiveness as a debugging tool. This is due to the number of unchecked exceptions embedded in normal Java instructions, such as \texttt{NullPointerException} in any instance field or method access, \texttt{IndexOutOfBoundsException} in array accesses and countless others. On top of that, handling \textsl{unchecked} exceptions opens the problem of calling an API for which there is no analyzable source code, either because the module was compiled beforehand or because it is part of a distributed system. The first should not be an obstacle, as class files can be easily decompiled. The only information that may be lost is variable names and comments, which don't affect a slice's precision, only its readability.
@ -12,7 +12,7 @@ Chang and Jo\cite{JoC04} present an alternative to the CFG by computing exceptio
Jiang et al.\cite{JiaZSJ06} describe a solution specific for the exception system in C++, which differs from Java's implementation of exceptions. They reuse the idea of non-executable edges in \textsl{throw} nodes, and introduce handling \textsl{catch} nodes as a switch, each trying to catch the exception before deferring onto the next \textsl{catch} or propagating it to the calling method. Their proposal is centered around the IECFG (Improved Exception Control-Flow Graph), which propagates control dependencies onto the PDG and then the SDG. Finally, in their SDG, each normal and exceptional return and their data output are connected to all \textsl{catch} statements where the data may have arrived, which is fine for the example they propose, but could be inefficient if the method has many different call nodes.
Others\cite{PraMB11} have worked specifically on the C++ exception framework. \todo{remove or expand}.
Others\cite{PraMB11} have worked specifically on the C++ exception framework. \carlos{remove or expand}.
Finally, Hao\cite{JieS11} introduced an Object-Oriented System Dependence Graph with exception handling (EOSDG), which represents a generic object-oriented language with exception handling capabilities. Its broadness allows for the EOSDG to fit into both Java and C++. It uses concepts from Jiang\cite{JiaZSJ06}, such as cascading \textsl{catch} statements, while adding explicit support for virtual calls, polymorphism and inheritance.