diff --git a/Makefile b/Makefile index c290f85..02762e8 100644 --- a/Makefile +++ b/Makefile @@ -12,5 +12,6 @@ paper.pdf: paper.tex images pdflatex -synctex=1 -interaction=nonstopmode paper.tex clean: + rm -f Secciones/*.aux rm -f *.toc *.aux *.bbl *.blg *.fls *.out *.log *.synctex.gz $(MAKE) -C img clean diff --git a/Secciones/background.tex b/Secciones/background.tex new file mode 100644 index 0000000..e0c593f --- /dev/null +++ b/Secciones/background.tex @@ -0,0 +1,463 @@ +% !TEX encoding = UTF-8 +% !TEX spellcheck = en_US +% !TEX root = ../paper.tex + +\chapter{Background} +\label{cha:background} + +\section{Program slicing} +\textsl{Program slicing} \cite{Wei81,Sil12} is a debugging technique that +answers the question: ``which parts of a program affect a given statement and +variable?'' The statement and the variable are the basic input to create a slice +and are called the \textsl{slicing criterion}. The criterion can be more +complex, as different slicing techniques may require additional pieces of input. +The \textsl{slice} of a program is the list of statements from the original +program ---which constitutes a valid program---, whose execution will result in +the same values for the variable (selected in the slicing criterion) being read +by a debugger in the selected statement. +There exist two fundamental dimensions along which the problem of slicing can be +proposed: +\begin{itemize} + \item \textsl{Static} or \textsl{dynamic}: slicing can be performed + statically or dynamically. + \textsl{Static slicing} \cite{Wei81} is a slice which considers all + possible executions of the program, only taking into account the + semantics of the programming language. + In contrast, \textsl{dynamic slicing} \cite{KorL88} limits the slice to + the statements present in an execution log. The slicing criterion is + expanded to include a position in the log that corresponds to one + instance of the selected statement, making it much more specific. 
It may
+ help find a bug related to nondeterministic behavior (such as a random
+ or pseudo-random number generator), but must be recomputed for each case
+ being analyzed.
+ \item \textsl{Backward} or \textsl{forward}: \textsl{backward slicing}
+ \cite{Wei81} is the more commonly used, because it looks at the statements
+ that affect the slicing criterion. In contrast, \textsl{forward slicing}
+ \cite{BerC85} computes the statements that are affected by the slicing
+ criterion. There also exists a mixed approach called \textsl{chopping}
+ \cite{JacR94}, which is used to find all statements that affect or are
+ affected by the slicing criterion.
+\end{itemize}
+
+Since program slicing was first defined, its most widespread form has
+been \textsl{static backward slicing}, which obtains the list of statements that
+affect the value of a variable in a given statement, in all possible executions
+of the program (i.e., for any input data).
+\begin{definition}[Strong static backward slice \cite{Wei81,HorwitzRB88}]
+ \label{def:strong-slice}
+ \carlos{The exact correct citation still needs to be checked.}
+ Given a program $P$ and a slicing criterion $C = \langle s,v \rangle$, where
+ $s$ is a statement and $v$ is a set of variables in $P$ (the variables may
+ or may not be used in $s$), $S$ is the \textsl{strong slice} of $P$ with
+ respect to $C$ if $S$ has the following properties:
+ \begin{enumerate}
+ \item $S$ is an executable program.
+ \item $S \subseteq P$, or $S$ is the result of removing code from $P$.
+ \item For any input $I$, the values produced on each execution of $s$
+ for each of the variables in $v$ are the same when executing $S$ as
+ when executing $P$.
+ \label{enum:exact-output}
+ \end{enumerate}
+\end{definition}
+
+\begin{definition}[Weak static backward slice \cite{RepY89}]
+ \label{def:weak-slice}
+ \carlos{Check the citation and write this formally.}
+ Given a program $P$ and a slicing criterion $C = \langle s,v \rangle$, where
+ $s$ is a statement and $v$ is a set of variables in $P$ (the variables may
+ or may not be used in $s$), $S$ is the \textsl{weak slice} of $P$ with
+ respect to $C$ if $S$ has the following properties:
+ \begin{enumerate}
+ \item $S$ is an executable program.
+ \item $S \subseteq P$, or $S$ is the result of removing code from $P$.
+ \item For any input $I$, the values produced on each execution of $s$
+ for each of the variables in $v$ when executing $P$ are a prefix of
+ those produced while executing $S$ ---which means that the slice
+ may continue producing values, but the first values produced always
+ match those of the original program.
+ \end{enumerate}
+\end{definition}
+
+Both definitions (\ref{def:strong-slice} and~\ref{def:weak-slice}) are
+used throughout the literature, with some works favoring the first and some the
+second. Though the definitions come from the corresponding citations, the naming
+was first used in a control dependency analysis by Danicic~\cite{DanBHHKL11},
+where slices which produce the same output as the original are named
+\textsl{strong}, and those where the original is a prefix of the slice,
+\textsl{weak} \carlos{It could be argued that the weak slice is enough
+for debugging, since if an error shows up in the original, it will also show up in the sliced program}.
+
+\begin{example}[Strong, weak and incorrect slices]
+ In table~\ref{tab:slice-weak} we can observe examples for the various
+ definitions. Each row shows the values produced by the execution of a
+ program or one of its slices. The first is the original, which computes
+ $3!$. Slice A produces exactly the same values and is therefore a
+ strong slice.
Slice B is correct but continues producing values after the
+ original stops ---a weak slice. It fits the relaxed definition but not
+ the strong one. Slice C is incorrect, as the values differ from the original.
+ Some data or control dependency has not been included in the slice and the
+ program is behaving in a different way.
+\end{example}
+
+\begin{table}
+ \centering
+ \begin{tabular}{r | r | r | r | r | r }
+ Iteration & \textbf{1} & \textbf{2} & \textbf{3} & \textbf{4} & \textbf{5} \\ \hline
+ Original & 1 & 2 & 6 & - & - \\ \hline
+ Slice A & 1 & 2 & 6 & - & - \\ \hline
+ Slice B & 1 & 2 & 6 & 24 & 120 \\ \hline
+ Slice C & 1 & 1 & 3 & 5 & 8 \\
+ \end{tabular}
+ \caption{Execution logs of different slices and their original program.}
+ \label{tab:slice-weak}
+\end{table}
+
+Program slicing is a language--agnostic tool, but the original proposal by
+Weiser~\cite{Wei81} covers a simple imperative programming language.
+Since then, the literature has been expanded by dozens of authors, who have
+described and implemented slicing for more complex structures, such as
+unconditional control flow~\cite{HorwitzRB88}, global variables~\cite{???},
+exception handling~\cite{AllH03}; and for other programming paradigms, such as
+object-oriented languages~\cite{???} or functional languages~\cite{???}.
+\carlos{More could be listed; the corresponding citations are missing.}
+
+\subsection{The System Dependence Graph (SDG)}
+
+There exist multiple approaches to compute a slice from a given program and
+criterion, but the most efficient and broadly used data structure is the System
+Dependence Graph (SDG), first introduced by Horwitz, Reps and
+Binkley~\cite{HorwitzRB88}. It is computed from the program's statements, and
+once built, a slicing criterion is chosen, the graph traversed using a specific
+algorithm, and the slice obtained. Its efficiency resides in the fact that for
+multiple slices that share the same program, the graph must only be built once.
+On top of that, building the graph has a complexity of $\mathcal{O}(n^2)$ with +respect to the number of statements in a program, but the traversal is linear +with respect to the number of nodes in the graph (each corresponding to a +statement). + +The SDG is a directed graph, and as such it has vertices or nodes, each +representing an instruction in the program ---barring some auxiliary nodes +introduced by some approaches--- and directed edges, which represent the +dependencies among nodes. Those edges represent various kinds of dependencies +---control, data, calls, parameter passing, summary--- which will be defined in +section~\ref{sec:first-def-sdg}. + +To create the SDG, first a \textsl{control flow graph} is built for each method +in the program, then its control and data dependencies are computed, resulting +in the \textsl{program dependence graph}. Finally, all the graphs from every +method are joined into the SDG. This process will be explained at greater +lengths in section~\ref{sec:first-def-sdg}. +%TODO: marked for removal --- this process is repeated later in ref{sec:first-deg-sdg} +%\begin{description} + %\item[CFG] The control flow graph is the representation of the control + %dependencies in a method of a program. Every statement has an edge from + %itself to every statement that can immediately follow. This means that + %most will only have one outgoing edge, and conditional jumps and loops + %will have two. The graph starts in a ``Begin'' or ``Start'' node, and + %ends in an ``End'' node, to which the last statement and all return + %statements are connected. It is created directly from the source code, + %without any need for data dependency analysis. + %\item[PDG] The program dependence graph is the result of restructuring and + %adding data dependencies to a CFG. All statements are placed below and + %connected to a ``Begin'' node, except those which are inside a loop or + %conditional block. 
Then data dependencies are added (red or dashed + %edges), adding an edge between two nodes if there is a data dependency. + %\item[SDG] Finally, the system dependence graph is the interconnection of + %each method's PDG. When a call is made, the input arguments are passed + %to subnodes of the call, and the result is obtained in another subnode. + %There is an edge from the call to the beginning of the corresponding + %method, and an extra type of edge exists: \textsl{summary edges}, which + %summarize the data dependencies between input and output variables. +%\end{description} +An example is provided in figure~\ref{fig:basic-graphs}, where a simple +multiplication program is converted to CFG, then PDG and finally SDG. For +simplicity, only the CFG and PDG of \texttt{multiply} are shown. Control +dependencies are black, data dependencies red and summary edges blue. + +\begin{figure} + \centering + \begin{minipage}{0.4\linewidth} + \begin{lstlisting} + int multiply(int x, int y) { + int result = 0; + while (x > 0) { + result += y; + x--; + } + System.out.println(result); + return result; + } + \end{lstlisting} + \end{minipage} + \begin{minipage}{0.59\linewidth} + \includegraphics[width=\linewidth]{img/multiplycfg} + \end{minipage} + \includegraphics[width=\linewidth]{img/multiplypdg} + \includegraphics[width=\linewidth]{img/multiplysdg} + \caption{A simple multiplication program, its CFG, PDG and SDG} + \label{fig:basic-graphs} +\end{figure} + +\subsection{Metrics} + +There are four relevant metrics considered when evaluating a slicing algorithm: + +\begin{description} + \item[Completeness.] The solution includes all the statements that affect + the slice. This is the most important feature, and almost all + publications achieve at least completeness. Trivial completeness is + easily achievable, as simple as including the whole program in the + slice. + \item[Correctness.] The solution excludes all statements that don't affect + the slice. 
Most solutions are complete, but the degree of correctness is
+ what sets them apart, as smaller slices will not execute unnecessary
+ code to compute the values, decreasing the execution time.
+ \item[Features covered.] Which features or language a slicing algorithm
+ covers. Different approaches to slicing cover different programming
+ languages and even paradigms. There are slicing techniques (published or
+ commercially available) for most popular programming languages, from C++
+ to Erlang. Some slicing techniques only cover a subset of the targeted
+ language, and as such are less useful for commercial applications, but
+ can be a stepping stone in the betterment of the field.
+ \item[Speed.] Speed of graph generation and slice creation. As previously
+ stated, slicing is a two-step process: build a graph and traverse it.
+ The traversal is linear in most proposals, with small variations. Graph
+ generation tends to take longer and show higher variance, but it is not as
+ relevant, because it is only done once (per program being analyzed). As
+ such, this is the least important metric. Only proposals that deviate
+ from the aforementioned schema show a wider variation in speed.
+\end{description}
+
+\subsection{Program slicing as a debugging technique}
+
+Program slicing is first and foremost a debugging technique, with each
+variation serving a different purpose:
+
+\begin{description}
+ \item[Backward static.] Used to obtain the lines that affect a statement,
+ normally used on a line which outputs an incorrect value, to narrow down
+ the source of the bug.
+ \item[Forward static.] Used to obtain the lines affected by a statement,
+ for example to identify dead code or to check the effects a line has on the rest
+ of the program.
+ \item[Chopping static.] Obtains both the statements affected by and the
+ statements that affect the selected statement.
+ \item[Dynamic.]
Can be combined with any of the previous variations, and
+ limits the slice to an execution log, only including statements that
+ have run in a specific execution. The slice produced is much smaller and
+ more useful.
+ \item[Quasi--static.] Some input values are given, and some are left
+ unspecified: the result is a slice between the small dynamic slice and
+ the general but bigger static slice. It can be especially useful when
+ debugging a set of function calls which have a specific static input for
+ some parameters, and variable input for others.
+ \item[Simultaneous.] Similar to dynamic slicing, but considers multiple
+ executions instead of only one. Similarly to quasi--static slicing, it
+ can offer a slightly bigger slice while keeping the scope focused on the
+ source of the bug.
+ \carlos{to be completed}
+\end{description}
+
+\section{Exception handling in Java}
+\label{sec:intro-exception}
+
+Exception handling is common in most modern programming languages. In Java, it
+consists of the following elements:
+\begin{description}
+ \item[Throwable.] A class that encompasses all the exceptions or errors
+ that may be thrown. Child classes are \texttt{Exception} for most errors
+ and \texttt{Error} for internal errors in the Java Virtual Machine.
+ Exceptions can be classified in two categories: \textsl{unchecked}
+ (those inheriting from \texttt{RuntimeException} or \texttt{Error}) and
+ \textsl{checked} (the rest). The first may be thrown anywhere, whereas
+ the second, if thrown, must be caught or declared in the method header.
+ \item[throw.] A statement that activates an exception, altering the normal
+ control-flow of the method. If the statement is inside a \textsl{try}
+ block with a \textsl{catch} clause for its type or any supertype, the
+ control flow will continue in the first statement of such clause.
+ Otherwise, the method is exited and the check performed again, until + either the exception is caught or the last method in the stack + (\textsl{main}) is popped, and the execution of the program ends + abruptly. + \item[try.] This statement is followed by a block of statements and by one + or more \textsl{catch} clauses. All exceptions thrown in the statements + contained or any methods called will be processed by the list of + catches. Optionally, after the \textsl{catch} clauses a \textsl{finally} + block may appear. + \item[catch.] Contains two elements: a variable declaration (the type must + be an exception) and a block of statements to be executed when an + exception of the corresponding type (or a subtype) is thrown. + \textsl{catch} clauses are processed sequentially, and if any matches + the type of the thrown exception, its block is executed, and the rest + are ignored. Variable declarations may be of multiple types + \texttt{(T1|T2 exc)}, when two unrelated types of exception must be + caught and the same code executed for both. When there is an inheritance + relationship, the parent suffices.\footnotemark + \item[finally.] Contains a block of statements that will always be executed + if the \textsl{try} is entered. It is used to tidy up, for example + closing I/O streams. The \textsl{finally} can be reached in two ways: + with an exception pending (thrown in \textsl{try} and not captured by + any \textsl{catch} or thrown inside a \textsl{catch}) or without it + (when the \textsl{try} or \textsl{catch} block end successfully). After + the last instruction of the block is executed, if there is an exception + pending, control will be passed to the corresponding \textsl{catch} or + the program will end. Otherwise, the execution continues in the next + statement after the \textsl{try-catch-finally} block. 
+\end{description}
+
+\footnotetext{Introduced in Java 7, see \url{https://docs.oracle.com/javase/7/docs/technotes/guides/language/catch-multiple.html} for more details.}
+
+\subsection{Exception handling in other programming languages}
+
+In almost all programming languages, errors can appear (either through the
+developer, the user or the system's fault), and must be dealt with. Most of the
+popular object-oriented languages feature some kind of error system, normally
+very similar to Java's exceptions. In this section, we will perform a small
+survey of the error-handling techniques used in the most popular programming
+languages. The language list has been extracted from a survey performed by the
+programming Q\&A website Stack
+Overflow\footnote{\url{https://stackoverflow.com}}. The survey contains a
+question about the technologies used by professional developers in their work,
+and from that list we have extracted those languages with more than $5\%$ usage
+in the industry. Table~\ref{tab:popular-languages} shows the list and its
+source. Except for Bash, Assembly, VBA, C and Go, the languages shown
+feature an exception system similar to the one appearing in Java.
+ +\begin{table} + \begin{minipage}{0.6\linewidth} + \centering + \begin{tabular}{r | r } + \textbf{Language} & $\%$ usage \\ \hline + JavaScript & 69.7 \\ \hline + HTML/CSS & 63.1 \\ \hline + SQL & 56.5 \\ \hline + Python & 39.4 \\ \hline + Java & 39.2 \\ \hline + Bash/Shell/PowerShell & 37.9 \\ \hline + C\# & 31.9 \\ \hline + PHP & 25.8 \\ \hline + TypeScript & 23.5 \\ \hline + C++ & 20.4 \\ \hline + \end{tabular} + \end{minipage} + \begin{minipage}{0.39\linewidth} + \begin{tabular}{r | r } + \textbf{Language} & $\%$ usage \\ \hline + C & 17.3 \\ \hline + Ruby & 8.9 \\ \hline + Go & 8.8 \\ \hline + Swift & 6.8 \\ \hline + Kotlin & 6.6 \\ \hline + R & 5.6 \\ \hline + VBA & 5.5 \\ \hline + Objective-C & 5.2 \\ \hline + Assembly & 5.0 \\ \hline + \end{tabular} + \end{minipage} + % The caption has a weird structure due to the fact that there's a footnote + % inside of it. + \caption[Commonly used programming languages]{The most commonly used + programming languages by professional developers\protect\footnotemark} + \label{tab:popular-languages} +\end{table} + +\footnotetext{Data from \url{https://insights.stackoverflow.com/survey/2019/\#technology-\_-programming-scripting-and-markup-languages}} + +The exception systems that are similar to Java are mostly all the same, +featuring a \texttt{throw} statement (\texttt{raise} in Python), try-catching +structure and most include a finally block that may be appended to try blocks. +The difference resides in the value passed by the exception, which in languages +that feature inheritance it is a class descending from a generic error or +exception, and in languages without it, it is an arbitrary value (e.g. +JavaScript, TypeScript). 
In object--oriented programming, the filtering is
+performed by checking whether the exception is a subtype of the exception being
+caught (Java, C++, C\#, PowerShell\footnotemark, etc.); and in languages with
+arbitrary exception values, a boolean condition is specified, and the first
+catch block that fulfills its condition is activated, following a pattern
+similar to that of \texttt{switch} statements (e.g. JavaScript). In both cases
+there exists a way to indicate that all exceptions should be caught, regardless
+of type and content.
+
+\footnotetext{Only since version 2.0, released with Windows 7.}
+
+On the other hand, the remaining languages offer a variety of systems that
+emulate or replace exception handling:
+
+\begin{description} % bash, vba, C and Go exceptions explained
+ \item[Bash] The popular Bourne Again SHell features no exception system, apart
+ from the user's ability to parse the return code from the last statement
+ executed. Traps can also be used to capture erroneous states and tidy up all
+ files and environment variables before exiting the program. Traps allow the
+ programmer to react to a user or system--sent signal, or an exit run from
+ within the Bash environment. When a trap is activated, its code runs, and the
+ signal does not go on to stop the program. This doesn't replace a fully
+ featured exception system, but \texttt{bash} programs tend to be small in
+ size, with programmers preferring the efficiency of C or the conveniences of
+ other high--level languages when the task requires it.
+ \item[VBA] Visual Basic for Applications is a scripting programming language
+ based on Visual Basic that is integrated into Microsoft Office to automate
+ small tasks, such as generating documents from templates, making advanced
+ computations that are impossible or slower with spreadsheet functions, etc.
+ The only error--handling mechanism it has is the directive \texttt{On Error
+ $x$}, where $x$ can be \texttt{GoTo 0} ---lets the error crash the program---,
+ \texttt{Resume Next} ---continues the execution as if nothing had happened--- or
+ \texttt{GoTo} followed by a label in the program ---the execution jumps to the label in case of
+ error. The directive can be set and reset multiple times, therefore creating
+ artificial \texttt{try-catch} blocks, but there is no possibility of
+ attaching a value to the error, lowering its usefulness.
+ \item[C] In C, errors can also be handled via return values, but some of the
+ library facilities it features can be used to create a simple exception system.
+ \texttt{setjmp} and \texttt{longjmp} (declared in \texttt{setjmp.h}) set up and
+ perform inter--function jumps. The first saves the calling environment
+ in a buffer, and the second returns to the position where the buffer was
+ saved, discarding the stack frames created since then and restoring the
+ saved environment. Then, the execution continues from the evaluation of
+ \texttt{setjmp}, which returns the second argument passed to
+ \texttt{longjmp}.
+ \begin{example}[User-built exception system in C] \ \\
+ \label{fig:exceptions-c}
+ \begin{minipage}{0.5\linewidth}
+ \begin{lstlisting}[language=C]
+ int main() {
+     if (!setjmp(ref)) {
+         res = safe_sqrt(x, ref);
+     } else {
+         // Handle error
+         printf /* ... */
+     }
+ }
+ \end{lstlisting}
+ \end{minipage}
+ \begin{minipage}{0.49\linewidth}
+ \begin{lstlisting}[language=C]
+ double safe_sqrt(double x, jmp_buf ref) {
+     if (x < 0)
+         longjmp(ref, 1);
+     return /* ... */;
+ }
+ \end{lstlisting}
+ \end{minipage}
+ In the \texttt{main} function, line 2 will be executed twice: first when
+ it is normally reached ---returning 0--- and the second when line 3 in
+ \texttt{safe\_sqrt} is run, returning the second argument of \texttt{longjmp},
+ and therefore entering the else block in the \texttt{main} function. Here \texttt{ref} is a buffer of type \texttt{jmp\_buf}.
+ \end{example}
+ \item[Go] The programming language Go is the odd one out in this section, being a
+ modern programming language without exceptions, though it is an intentional
+ design decision made by its authors\footnotemark. The argument made was that
+ exception handling systems introduce abnormal control--flow and complicate
+ code analysis and clean code generation, as the paths that
+ the code may follow are not clear. Instead, Go allows functions to return multiple values,
+ with the second value typically being of an error type. The error is
+ checked before the value, and acted upon. Additionally, Go also features a
+ simple panic system, built on \texttt{panic} ---throws an
+ exception with a value associated---, \texttt{defer} ---runs after the
+ function has ended or when a \texttt{panic} has been activated--- and
+ \texttt{recover} ---stops the panic state and retrieves its value. The
+ \texttt{defer} statement doubles as catch and finally, and multiple
+ instances can be accumulated. When appropriate, they will run in LIFO order
+ (Last In--First Out).
+\end{description}
+
+\footnotetext{\url{https://golang.org/doc/faq\#exceptions}}
+
+% vim: set noexpandtab:tabstop=2:shiftwidth=2:softtabstop=2:wrap
diff --git a/Secciones/incremental_slicing.tex b/Secciones/incremental_slicing.tex
new file mode 100644
index 0000000..22932d3
--- /dev/null
+++ b/Secciones/incremental_slicing.tex
@@ -0,0 +1,394 @@
+% !TEX encoding = UTF-8
+% !TEX spellcheck = en_US
+% !TEX root = ../paper.tex
+\chapter{Main explanation?}
+
+\section{First definition of the SDG}
+\label{sec:first-def-sdg}
+
+The system dependence graph (SDG) is a method for program slicing that was first
+proposed by Horwitz, Reps and Binkley \cite{HorwitzRB88}. It builds upon the
+existing control flow graph (CFG), defining dependencies between vertices of the
+CFG, and building a program dependence graph (PDG), which represents them.
The
+system dependence graph (SDG) is then built from the assembly of the different
+PDGs (each representing a method of the program), linking each method call to
+its corresponding definition. Because each graph is built from the previous one,
+new constructs can be added to the CFG without the need to alter the
+algorithm that converts CFG to PDG and then to SDG. The only modification
+possible is the redefinition of a dependency or the addition of new kinds of
+dependence.
+
+The language covered by the initial proposal was a simple one, featuring
+procedures with modifiable parameters and basic instructions, including calls to
+procedures, variable assignments, arithmetic and logic operators and conditional
+instructions (branches and loops): the basic features of an imperative
+programming language. The control flow graph was as simple as the programs
+themselves, with each graph representing one procedure. The instructions of the
+program are represented as vertices of the graph and are split into two
+categories: statements, which have no effect on the control flow (assignments,
+procedure calls) and predicates, whose execution may lead to one of multiple
+---though traditionally two--- paths (conditional instructions). Statements are
+connected sequentially to the next instruction. Predicates have two outgoing
+edges, each connected to the first statement that should be executed, according
+to the result of evaluating the conditional expression in the guard of the
+predicate.
+
+\begin{definition}[Control Flow Graph~\cite{???}]
+ A \emph{control flow graph} $G$ of a program $P$ is a tuple $\langle N, E \rangle$, where $N$ is a set of nodes, composed of a method's statements and two special nodes, ``Start'' and ``End''.
$E$ is a set of edges of the form $e = \left(n_1, n_2\right)$, each a directed edge from $n_1$ to $n_2$.
+\end{definition}
+
+To build the PDG and then the SDG, some dependencies must be extracted from the CFG, which are defined as follows:
+
+\begin{definition}[Postdominance]
+ Vertex $b$ \textit{postdominates} vertex $a$ if and only if $a \neq b$ and $b$ is on every path from $a$ to the ``End'' vertex.
+\end{definition}
+
+\begin{definition}[Control dependency]
+ \label{def:ctrl-dep}
+ Vertex $b$ is \textit{control dependent} on vertex $a$ ($a \ctrldep b$) if and only if $b$ postdominates one but not all of $a$'s successors. It follows that a vertex with only one successor cannot be the source of control dependence.
+\end{definition}
+
+\begin{definition}[Data dependency]
+ Vertex $b$ is \textit{data dependent} on vertex $a$ ($a \datadep b$) if and only if $a$ may define a variable $x$, $b$ may use $x$ and there is an $x$-definition free path from $a$ to $b$.\footnote{The initial definition of data dependency was further split into in-loop data dependencies and the rest, but the difference is not relevant for computing the slices in the SDG.}
+\end{definition}
+
+It should be noted that variable definitions and uses can be computed for each
+statement independently, analyzing the procedures called by it if necessary. In
+general, any instruction uses all variables that appear in it, save for the
+left-hand side of assignments. Similarly, no instruction defines variables,
+except those in the left-hand side of assignments. The variables used and
+defined by a procedure call are those used and defined by its body.
+
+With the data and control dependencies, the PDG may be built, by replacing the
+edges from the CFG by data and control dependence edges. The first tends to be
+represented as a thin solid line, and the latter as a thick solid line. In the
+examples, data dependencies will be thin solid red lines.
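The definitions above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the construction used by any particular slicer: the class and method names are hypothetical, vertices are plain integers, and postdominance is taken reflexively when it is applied to the successors of $a$ (a vertex counts as postdominating itself), which is the reading under which a loop body comes out as control dependent on its loop condition. The CFG used is a hand-written encoding of a procedure with a single \texttt{while} loop followed by a \texttt{print}.

```java
import java.util.*;

/** Sketch: postdominators and control dependence on a small CFG (names are hypothetical). */
class ControlDependenceSketch {

    /** Reflexive postdominator sets, computed by fixed-point iteration over the CFG. */
    static List<Set<Integer>> postDominators(List<List<Integer>> succ, int end) {
        int n = succ.size();
        List<Set<Integer>> pdom = new ArrayList<>();
        for (int v = 0; v < n; v++) {
            Set<Integer> init = new TreeSet<>();
            if (v == end) {
                init.add(end);                              // End is only postdominated by itself
            } else {
                for (int u = 0; u < n; u++) init.add(u);    // start from "every vertex"
            }
            pdom.add(init);
        }
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int v = 0; v < n; v++) {
                if (v == end) continue;
                // pdom(v) = {v} union the intersection of pdom(s) for every successor s
                Set<Integer> next = null;
                for (int s : succ.get(v)) {
                    if (next == null) next = new TreeSet<>(pdom.get(s));
                    else next.retainAll(pdom.get(s));
                }
                if (next == null) next = new TreeSet<>();
                next.add(v);
                if (!next.equals(pdom.get(v))) {
                    pdom.set(v, next);
                    changed = true;
                }
            }
        }
        return pdom;
    }

    /** Maps each vertex a to the vertices that are control dependent on it. */
    static Map<Integer, Set<Integer>> controlDependence(List<List<Integer>> succ, int end) {
        List<Set<Integer>> pdom = postDominators(succ, end);
        Map<Integer, Set<Integer>> deps = new TreeMap<>();
        for (int a = 0; a < succ.size(); a++) {
            List<Integer> successors = succ.get(a);
            for (int b = 0; b < succ.size(); b++) {
                int hits = 0;
                for (int s : successors) {
                    if (pdom.get(s).contains(b)) hits++;
                }
                // b postdominates one but not all of a's successors
                if (hits > 0 && hits < successors.size()) {
                    deps.computeIfAbsent(a, k -> new TreeSet<>()).add(b);
                }
            }
        }
        return deps;
    }

    /** CFG: 0 Start, 1 while (x > y), 2 x = x - 1, 3 print(x), 4 End. */
    static List<List<Integer>> exampleCfg() {
        return List.of(List.of(1), List.of(2, 3), List.of(1), List.of(4), List.of());
    }

    public static void main(String[] args) {
        System.out.println(controlDependence(exampleCfg(), 4)); // prints {1=[1, 2]}
    }
}
```

In this encoding only the loop condition (vertex 1) is a source of control dependence: the loop body and the condition itself depend on it, while the vertices with a single successor contribute nothing, as the definition predicts.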
+
+The organization of the vertices of the PDG tends to resemble a tree graph, with
+the ``Start'' node in the position of the root (at the top), and the ``End''
+node typically omitted. The control dependence edges structure the tree
+vertically. In the case that a vertex is control dependent on multiple vertices,
+it will be placed one level below the lowest source of control dependency. With
+a programming language this simple, cyclical control dependencies do not appear,
+but should they do so in further sections, the instructions are sorted top to
+bottom in the order they appear in the program. Horizontally, the vertices are
+sorted by their order in the program, left to right, in order to make the graph
+more readable. Data dependency edges are placed without reordering the nodes of
+the graph. In the examples given, edges like $a \datadep a$ or $b \ctrldep b$
+may be omitted, as they are not relevant for later use of the graph. Please
+note that the location of the vertices is irrelevant for the slicing algorithm,
+and the aforementioned sorting rules are just for consistency with previous
+papers on the topic and to ease the visualization of programs.
+
+Finally, the SDG is built from the combination of all the PDGs that compose the
+program. Each call vertex is connected to the ``Start'' of the corresponding
+procedure. All edges that connect PDGs are represented with dashed lines.
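To illustrate why the placement of vertices does not matter to the slicing algorithm, the following Java sketch computes a backward slice over a single PDG as plain backward reachability along dependence edges. It is a simplified, intraprocedural illustration only: the traversal of a full SDG across procedure calls is not covered, the class and method names are hypothetical, and the dependence edges were written by hand for the \texttt{multiply} method of figure~\ref{fig:basic-graphs} (with self-dependencies on the loop condition omitted), so they are an assumption rather than the output of a tool.

```java
import java.util.*;

/** Sketch: backward slicing over one PDG as backward reachability (names are hypothetical). */
class PdgSliceSketch {

    /**
     * dependsOn.get(v) lists the vertices that v is control or data dependent on.
     * The slice is every vertex reachable from the criterion by following those edges backwards.
     */
    static Set<Integer> backwardSlice(Map<Integer, List<Integer>> dependsOn, int criterion) {
        Set<Integer> slice = new TreeSet<>();
        Deque<Integer> work = new ArrayDeque<>();
        work.push(criterion);
        while (!work.isEmpty()) {
            int v = work.pop();
            if (!slice.add(v)) continue;                    // already visited
            for (int source : dependsOn.getOrDefault(v, List.of())) {
                work.push(source);
            }
        }
        return slice;
    }

    /**
     * Hand-built PDG of multiply:
     * 0 Start, 1 result = 0, 2 while (x > 0), 3 result += y,
     * 4 x--, 5 println(result), 6 return result.
     */
    static Map<Integer, List<Integer>> multiplyPdg() {
        Map<Integer, List<Integer>> d = new HashMap<>();
        d.put(1, List.of(0));             // control: Start
        d.put(2, List.of(0, 4));          // control: Start; data (x): Start, x--
        d.put(3, List.of(2, 0, 1, 3));    // control: while; data: y, result
        d.put(4, List.of(2, 0, 4));       // control: while; data (x): Start, x--
        d.put(5, List.of(0, 1, 3));       // control: Start; data (result)
        d.put(6, List.of(0, 1, 3));       // control: Start; data (result)
        return d;
    }

    public static void main(String[] args) {
        System.out.println(backwardSlice(multiplyPdg(), 4)); // prints [0, 2, 4]
    }
}
```

Slicing on \texttt{x--} keeps only the ``Start'' vertex, the loop condition and the decrement, leaving out every statement that handles \texttt{result}; each vertex is visited at most once, whatever its position in the drawing.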
+
+\begin{figure}
+\begin{minipage}{0.3\linewidth}
+ \begin{lstlisting}
+ proc main() {
+     a = 10;
+     b = 20;
+     f(a, b);
+     print(a);
+ }
+
+ proc f(x, y) {
+     while (x > y) {
+         x = x - 1;
+     }
+     print(x);
+ }
+ \end{lstlisting}
+\end{minipage}
+\begin{minipage}{0.6\linewidth}
+ \includegraphics[width=0.3\linewidth]{img/cfgsimple}
+ \includegraphics[width=0.65\linewidth]{img/cfgsimple2}
+\end{minipage}
+\includegraphics[width=0.5\linewidth]{img/pdgsimple}
+\includegraphics[width=0.49\linewidth]{img/pdgsimple2}
+\includegraphics[width=0.6\linewidth]{img/sdgsimple}
+\includegraphics[width=0.4\linewidth]{img/legendsimple}
+\caption{A simple program with its CFGs (top right), PDGs (center) and SDG (bottom).}
+\label{fig:sdg-loop}
+\end{figure}
+
+\subsubsection{Procedures and data dependencies}
+
+The only thing left to explain before introducing more constructs into the
+language is the passing of parameters. Most programming languages accept a
+variable number of input parameters and one output parameter. In the case of
+input parameters passed by reference, or constructs such as structs or classes,
+modifying a field of a parameter may modify the original variable. In order to
+deal with everything related to parameter passing, including global variables,
+class fields, etc., a small extension must be made to the CFG and PDG.
+
+In the CFG, the ``Start'' and ``End'' nodes contain a list of assignments,
+inputting and outputting respectively the appropriate values, as can be seen in
+the example. Consequently, every vertex that contains a procedure or function
+call must pack and unpack the arguments. For every variable $x$ that is used in a
+procedure, every call to it must be preceded by $x_{in} = x$, and the
+procedure's ``Start'' vertex must contain $x = x_{in}$.
The opposite happens
+when a variable must be ``outputted''\carlos{replace}: before the ``End'' node,
+the value must be packed ($x_{out} = x$), and after each call, the value must be
+assigned to the corresponding variable ($x = x_{out}$). Parameters may be
+assigned as $par^i_{in} = expr_i$ (where $i$ is the index of the parameter in
+the procedure definition, $par^i$ is the name of the parameter and $expr_i$ is
+the expression in the $i^{th}$ position in the procedure call) in the call
+vertex, and parameters whose modifications inside the procedure are passed back
+to the calling procedure must be extracted as $var = par^i_{out}$ (where $var$
+is the name of the variable ---passed by reference--- in the calling
+procedure).\carlos{What if object/struct passed by value?} In addition, an
+extra edge (the summary edge) is added to the SDG, which represents the
+dependencies that the outputs have on the input variables. This allows the
+algorithm to know the dependencies without traversing the corresponding
+function.
+
+All these additions are included as extra lines in the ``Start'', ``End'' and
+calling vertices. When building the PDG, all additions (variable assignments)
+are split into their own vertices, and are control dependent on the vertex they
+were split from. Data dependencies no longer flow through the call vertex, but
+through the appropriate child, which minimizes the size of the slice produced.
+As an example, Figure~\ref{fig:sdg-loop} shows the three stages of a program,
+from CFG to SDG.
+The construction of the CFG is straightforward, save for the packing and
+unpacking of variables in the start, end and call vertices. In the PDG, the
+statements are split, and control and data dependencies replace the control flow
+edges. Finally, both PDGs are linked via call and parameter (input and output)
+edges, forming the SDG. Summary edges are placed according to the data and
+control flow of the method call, and the graph is complete.
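
The packing and unpacking of parameters described above can be sketched in Java.
This is only an illustrative sketch under stated assumptions: the class
\texttt{ParamPacking} and its four methods are hypothetical names that do not
belong to any slicing library, assignments are modelled as plain strings, and
every argument whose parameter is modified is assumed to be a plain variable
passed by reference.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not from the paper or any library): generates the
// extra assignments placed in the call, "Start" and "End" vertices.
class ParamPacking {

    // Call vertex, before the call: par_in = expr for each parameter.
    static List<String> callInputs(List<String> params, List<String> args) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < params.size(); i++) {
            out.add(params.get(i) + "_in = " + args.get(i));
        }
        return out;
    }

    // "Start" vertex of the callee: par = par_in for each parameter.
    static List<String> startUnpack(List<String> params) {
        List<String> out = new ArrayList<>();
        for (String p : params) {
            out.add(p + " = " + p + "_in");
        }
        return out;
    }

    // Before the "End" vertex: par_out = par for each modified parameter.
    static List<String> endPack(List<String> params, Set<String> modified) {
        List<String> out = new ArrayList<>();
        for (String p : params) {
            if (modified.contains(p)) {
                out.add(p + "_out = " + p);
            }
        }
        return out;
    }

    // Call vertex, after the call: var = par_out for each modified parameter.
    // Assumption: the matching argument is a plain variable name.
    static List<String> callOutputs(List<String> params, List<String> args,
                                    Set<String> modified) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < params.size(); i++) {
            if (modified.contains(params.get(i))) {
                out.add(args.get(i) + " = " + params.get(i) + "_out");
            }
        }
        return out;
    }

    public static void main(String[] unused) {
        // The call f(a, b) to proc f(x, y), assuming only x is modified.
        List<String> params = List.of("x", "y");
        List<String> args = List.of("a", "b");
        Set<String> modified = Set.of("x");
        System.out.println(callInputs(params, args));
        System.out.println(startUnpack(params));
        System.out.println(endPack(params, modified));
        System.out.println(callOutputs(params, args, modified));
    }
}
```

Under these assumptions, the call \texttt{f(a, b)} yields
\texttt{x\_in = a} and \texttt{y\_in = b} before the call and
\texttt{a = x\_out} after it. In the PDG, each generated assignment would
become its own vertex.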
+
+\section{Unconditional control flow}
+
+Even though the initial definition of the SDG was useful to compute slices, the
+language covered was not enough for the typical language of the 1980s, which
+included (in one form or another) unconditional control flow. Therefore, one of
+the first additions contributed to the algorithm to build system dependence
+graphs was the inclusion of unconditional jumps, such as ``break'',
+``continue'', ``goto'' and ``return'' statements (or any other equivalent). A
+naive representation would be to treat them the same as any other statement, but
+with the outgoing edge landing in the corresponding instruction (outside the
+loop, at the loop condition, at the method's end, etc.). An alternative
+approach is to represent the instruction as an edge, not a vertex, connecting
+the previous statement with the next to be executed. Both of these approaches
+fail to generate a control dependence from the unconditional jump, as the
+definition of control dependence (see Definition~\ref{def:ctrl-dep}) requires a
+vertex to have more than one successor for it to be a possible source of
+control dependence. From here, there stem two approaches: the first would be to
+redefine control dependency, in order to reflect the real effect of these
+instructions ---as some authors~\cite{DanBHHKL11} have tried to do--- and the
+second would be to alter the creation of the SDG to ``create'' those
+dependencies, which is the most widely--used solution.
+
+The most popular approach was proposed by Ball and Horwitz~\cite{BalH93}, and
+represents unconditional jumps as a \textsl{pseudo--predicate}. The true edge
+would lead to the next instruction to be executed, and the false edge would be
+a non-executable or \textit{dummy} edge, connected to the instruction that would
+be executed were the unconditional jump a \textit{nop}.
The consequence of this
+solution is that every instruction placed after the unconditional jump is
+control dependent on the jump, as can be seen in Figure~\ref{fig:break-graphs}.
+In the example, when slicing with respect to variable $a$ on line 5, every
+statement would be included, save for ``print(a)''. Line 4 is not strictly
+necessary in this example ---in the context of weak slicing---, but is included
+nonetheless. In the original paper, the transformation is proved to be
+complete, but not correct, as for some examples, the slice includes more
+unconditional jumps than would be strictly necessary, even for weak slicing.
+Ball and Horwitz theorize that a more correct approach would be possible, if it
+weren't for the limitation of slices to be a subset of statements of the
+program, in the same order as in the original.
+
+\begin{figure}
+\centering
+\begin{minipage}{0.3\linewidth}
+  \begin{lstlisting}
+static void f() {
+  int a = 1;
+  while (a > 0) {
+    if (a > 10) break;
+    a++;
+  }
+  System.out.println(a);
+}
+  \end{lstlisting}
+\end{minipage}
+\begin{minipage}{0.6\linewidth}
+  \includegraphics[width=0.4\linewidth]{img/breakcfg}
+  \includegraphics[width=0.59\linewidth]{img/breakpdg}
+\end{minipage}
+\caption{A program with unconditional control flow, its CFG (center) and PDG (right).}
+\label{fig:break-graphs}
+\end{figure}
+
+\section{Exceptions}
+
+As seen in Section~\ref{sec:intro-exception}, exception handling in Java adds
+two constructs: the \texttt{throw} and the \texttt{try-catch} statements. The
+first one resembles an unconditional control flow statement, with an unknown (at
+compile time) destination. The exception will be caught by a \texttt{catch} of
+the corresponding type or a supertype ---if it exists. Otherwise, it will crash
+the corresponding thread (or in single-threaded programs, stop the Java Virtual
+Machine). The second stops the exceptional control flow conditionally, based on
+the dynamic type of the exception thrown.
Both introduce challenges that must
+be solved.
+
+\subsection{\texttt{throw} statement}
+
+The \texttt{throw} statement represents two elements at the same time: an
+unconditional jump and an erroneous exit from its method. The first one has
+been extensively covered and solved, but the second requires a small addition
+to the CFG: instead of having a single ``End'' node, it will be split in two
+---normal and error exit---, though the ``End'' cannot be removed, as a restriction
+of most slicing algorithms is that the CFG have only one sink node. Therefore all
+nodes that connected to the ``End'' will now lead to ``Normal exit'', all throw
+statements' true outgoing edges will connect to the ``Error exit'', and both exit
+nodes will converge on the ``End'' node.
+
+\texttt{throw} statements in Java take a single value, a subtype of \texttt{Throwable},
+and that value is used to stop the propagation of the exception, so it can be handled
+as a returned value. This treatment of \texttt{throw} statements only modifies the
+structure of the CFG, without altering any other algorithm, nor the basic definitions
+for control and data dependencies, making it very easy to incorporate into any existing
+slicing software solution that follows the general model described.
+
+\begin{example}[CFG of an uncaught \texttt{throw} statement] \ \\
+  \begin{minipage}{0.69\linewidth}
+    \begin{lstlisting}
+    double f(int x) {
+      if (x < 0)
+        throw new RuntimeException();
+      return Math.sqrt(x);
+    }
+    \end{lstlisting}
+    By analyzing the CFG, we can see that both exits are control dependent on the \texttt{throw}
+    statement; data dependencies present no special case in this example.
+  \end{minipage}
+  \begin{minipage}{0.3\linewidth}
+    \includegraphics[width=\linewidth]{img/throw-example-cfg}
+  \end{minipage}
+\end{example}
+
+\subsection{\texttt{try-catch} statement}
+
+The \texttt{try-catch-finally} statement is the only way to stop an exception once it's thrown,
+filtering by type, or otherwise letting it propagate further up the call stack. On top of that,
+\texttt{finally} helps guarantee consistency, executing in any case (even when an exception is
+left uncaught, the program returns or an exception occurs in a \texttt{catch} block). The main
+problem with this construct is that \texttt{catch} blocks are not always necessary, but their
+absence may make the compilation fail ---because a \texttt{try} block has no \texttt{catch} or
+\texttt{finally} block---, or modify the execution in unexpected ways that are not always accounted
+for in slicing software.
+
+The \texttt{try} block is normally represented as a pseudo--predicate, connected to the
+first statement inside it and to the end of the first instruction after the whole \texttt{try-catch-finally}
+construct. Inside the \texttt{try} there can be four distinct sources of exceptions:
+
+\begin{description}
+  \item[Method calls.] If an exception is thrown inside a method and it is not caught, it will
+    surface inside the \texttt{try} block. As \textit{checked} exceptions must be declared
+    explicitly, method declarations may be consulted to see if a method call may or may not
+    throw any exceptions. On this front, polymorphism and inheritance present no problem, as
+    inherited methods may not modify the signature ---which includes the exceptions that may
+    be thrown. If \textit{unchecked} exceptions are also considered, all method calls shall
+    be included, as any can trigger at the very least a \texttt{StackOverflowError}.
+  \item[\texttt{throw} statements.] The least common, but simplest, as it is treated as a
+    \texttt{throw} inside a method.
+  \item[Implicit unchecked exceptions.] If \textit{unchecked} exceptions are considered, many
+    common expressions may throw an exception, with the most common ones being calling
+    a method or accessing a field of a \texttt{null} object (\texttt{NullPointerException}),
+    accessing an invalid index on an array (\texttt{ArrayIndexOutOfBoundsException}), dividing
+    an integer by 0 (\texttt{ArithmeticException}), trying to cast to an incompatible type
+    (\texttt{ClassCastException}) and many others. On top of that, the user may create new
+    types that inherit from \texttt{RuntimeException}, but those may only be explicitly thrown.
+    Their inclusion in program slicing and therefore in the method's CFG generates extra
+    dependencies that make the slices produced bigger.
+  \item[Errors.] May be generated at any point in the execution of the program, but they normally
+    signal a situation from which it may be impossible to recover, such as an internal JVM error.
+    In general, most programs do not consider these to be ``catch-able''.
+\end{description}
+
+All exception sources are treated in a similar fashion: the statement that may throw an exception
+is treated as a predicate, with the true edge connected to the next instruction were the statement
+to execute without raising exceptions; and the false edge connected to the \texttt{catch} node.
+
+\carlos{CATCH Representation doesn't matter, it is similar to a switch but checking against types.
+  The difference exists where there exists the possibility of not catching the exception;
+  which is semantically possible to define. When a \texttt{catch (Throwable e)} is declared,
+  it is impossible for the exception to exit the method; therefore the control dependency must
+  be redefined.}
+
+The filter for exceptions in Java's \texttt{catch} blocks is a type (or multiple types since
+Java 7), with a class that encompasses all possible exceptions (\texttt{Throwable}), which acts
+as a catch--all.
+In the literature there exist two alternatives to represent \texttt{catch}: one mimics a static
+switch statement, placing all the \texttt{catch} block headers at the same height, all hanging
+from the exception-throwing statement, and the other mimics a dynamic switch or a chain of \texttt{if}
+statements. The option chosen affects how control dependencies should be computed, as the different
+structures generate different control dependencies by default.
+
+\begin{description}
+  \item[Switch representation.] There exists no relation between different \texttt{catch} blocks;
+    each exception--throwing statement is connected through an edge labeled false to each
+    of the \texttt{catch} blocks that could be entered. Each \texttt{catch} block is a
+    pseudo--statement, with its true edge connected to the end of the \texttt{try} and the
+    As an example, a \texttt{1 / 0} expression may be connected to \texttt{ArithmeticException},
+    \texttt{RuntimeException}, \texttt{Exception} or \texttt{Throwable}.
+    If any exception may not be caught, there exists a connection to the ``Error exit'' of the method.
+  \item[If-else representation.] Each exception--throwing statement is connected to the first
+    \texttt{catch} block. Each \texttt{catch} block is represented as a predicate, with the true
+    edge connected to the first statement inside the \texttt{catch} block, and the false edge
+    to the next \texttt{catch} block, until the last one. The last one will be a pseudo--predicate
+    connected to the first statement after the \texttt{try} if it is a catch--all type or to the
+    ``Error exit'' if it isn't.
+\end{description}
+
+\begin{example}[Catches.]\ \\
+  \begin{minipage}{0.49\linewidth}
+    \begin{lstlisting}
+    try {
+      f();
+    } catch (CheckedException e) {
+    } catch (UncheckedException e) {
+    } catch (Throwable e) {
+    }
+    \end{lstlisting}
+  \end{minipage}
+  \begin{minipage}{0.49\linewidth}
+    \carlos{missing figures with 4 alternatives: if-else (with catch--all and without) and switch (same two)}
+%    \includegraphics[0.5\linewidth]{img/catch1}
+%    \includegraphics[0.5\linewidth]{img/catch2}
+%    \includegraphics[0.5\linewidth]{img/catch3}
+%    \includegraphics[0.5\linewidth]{img/catch4}
+  \end{minipage}
+\end{example}
+
+Regardless of the approach, when there exists a catch--all block, there is no dependency generated
+from the \texttt{catch}, as all of them will lead to the next instruction. However, this means that
+if no data is outputted from the \texttt{try} or \texttt{catch} block, the catches will not be picked
+up by the slicing algorithm, which may alter the results unexpectedly. If this problem arises, the
+simple and obvious solution would be to add artificial edges to force the inclusion of all \texttt{catch}
+blocks. This adds instructions to the slice ---lowering its score when evaluating against benchmarks---
+but they are completely innocuous, as they just stop the exception, without running any extra instruction.
+
+Another alternative exists, though it slows down the process of creating a slice from an SDG.
+The \texttt{catch} block is only strictly needed if an exception that it catches may be thrown and
+an instruction after the \texttt{try-catch} block should be executed; in any other case the \texttt{catch}
+block is irrelevant and should not be included. However, this change requires analyzing the inclusion
+of \texttt{catch} blocks after the two--pass algorithm has completed, slowing it down. In any case, each
+approach trades time for accuracy and vice--versa, but the trade--off is small enough to be negligible.
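
The if-else representation of \texttt{catch} blocks described earlier in this
section can be illustrated with a short Java sketch. It is a simplification
under stated assumptions: the class \texttt{CatchChain}, the method
\texttt{resolve} and the constant \texttt{ERROR\_EXIT} are hypothetical names,
and the chain is resolved for one concrete exception type, whereas a static
slicer would have to consider every type a statement may throw.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical sketch (not from the paper): follows the chain of catch
// headers the way the if-else representation links them.
class CatchChain {
    static final String ERROR_EXIT = "Error exit";

    // Each catch header acts as a predicate: the "true" edge enters the
    // block, the "false" edge moves on to the next header. If no header
    // matches, control reaches the method's "Error exit".
    static String resolve(List<Class<? extends Throwable>> catches,
                          Class<? extends Throwable> thrown) {
        for (Class<? extends Throwable> header : catches) {
            if (header.isAssignableFrom(thrown)) {
                return header.getSimpleName();
            }
        }
        return ERROR_EXIT;
    }

    public static void main(String[] args) {
        List<Class<? extends Throwable>> chain =
            List.of(IOException.class, RuntimeException.class);
        // ArithmeticException is a subtype of RuntimeException.
        System.out.println(resolve(chain, ArithmeticException.class));
        // An Error matches neither header, so it leaves the method.
        System.out.println(resolve(chain, StackOverflowError.class));
    }
}
```

Appending \texttt{Throwable.class} to the list turns the last header into a
catch--all, after which \texttt{resolve} can no longer return
\texttt{ERROR\_EXIT}.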
+
+Regarding \textit{unchecked} exceptions, an extra layer of analysis should be performed to tag statements
+with the possible exceptions they may throw. On top of that, methods must be analyzed and tagged
+accordingly. The worst case is that of inaccessible methods, which may throw any \texttt{RuntimeException},
+but with the source code unavailable, they must be marked as capable of throwing it. This results in
+a graph where each instruction is dependent on the proper execution of the previous statement; save
+for simple statements that may not generate exceptions. The trade--off here is between completeness and
+correctness, with the inclusion of \textit{unchecked} exceptions increasing both the completeness and the
+slice size, reducing correctness. A possible solution would be to only consider user--generated exceptions
+or assume that library methods may never throw an unchecked exception. Another option would be a new
+slicing variation that annotates methods or limits the unchecked exceptions to be considered.
+
+Regarding the \texttt{finally} block, most approaches treat it properly, representing it twice: once
+for the case where there is no active exception and another one for the case where it executes with
+an exception active. An exception could also be thrown here, but that would be represented normally.
+
+% vim: set noexpandtab:ts=2:sw=2:wrap
diff --git a/Secciones/motivation.tex b/Secciones/motivation.tex
new file mode 100644
index 0000000..067a5cc
--- /dev/null
+++ b/Secciones/motivation.tex
@@ -0,0 +1,138 @@
+% !TEX encoding = UTF-8
+% !TEX spellcheck = en_US
+% !TEX root = ../paper.tex
+
+\chapter{Introduction}
+\label{cha:introduction}
+\section{Motivation}
+\label{sec:motivation}
+
+Program slicing~\cite{Wei81} is a debugging technique which, given a line of
+code and a variable of a program, simplifies the program so that the only parts
+left of it are those that affect the value of the selected variable.
+
+\begin{example}[Program slicing in a simple method]
+  If the following program is sliced on line 5 (variable \texttt{x}), the
+  result would be the program on the right, with the \texttt{if} block
+  skipped, as it doesn't affect the value of \texttt{x}.
+  \label{exa:program-slicing}
+  \begin{center}
+    \begin{minipage}{0.49\linewidth}
+      \begin{lstlisting}[stepnumber=1]
+void f(int x) {
+  if (x < 0)
+    System.err.println(x);
+  x++;
+  System.out.println(x);
+}
+      \end{lstlisting}
+    \end{minipage}
+    \begin{minipage}{0.49\linewidth}
+      \begin{lstlisting}[stepnumber=1]
+void f(int x) {
+
+
+  x++;
+  System.out.println(x);
+}
+      \end{lstlisting}
+    \end{minipage}
+  \end{center}
+\end{example}
+
+A slice is an executable program whose execution will produce the same values
+for the specified line and variable as the original program, and it is used to
+facilitate debugging of large and complex programs, where the data flow may not
+be easily understandable.
+
+Though it may seem a really powerful technique, the whole Java language is not
+completely covered by it, and that makes it difficult to apply in practical
+settings. An area that has been investigated, yet doesn't have a definitive
+solution, is exception handling. Example~\ref{exa:program-slicing2}
+demonstrates how, even using the latest developments in program
+slicing~\cite{Allen03}, the sliced version doesn't include the catch block, and
+therefore doesn't produce a correct slice.
+
+\begin{example}[Program slicing with exceptions]
+  If the following program is sliced on line 17, variable \texttt{x}, the
+  slice is incomplete, as it lacks the \texttt{catch} block from lines 4-6.
+  \label{exa:program-slicing2}
+  \begin{center}
+    \begin{minipage}{0.49\linewidth}
+      \begin{lstlisting}[stepnumber=1]
+void f(int x) {
+  try {
+    g(x);
+  } catch (RuntimeException e) {
+    System.err.println("Error");
+  }
+
+  System.out.println("g() was ok");
+
+  g(x);
+}
+
+void g(int x) {
+  if (x < 0) {
+    throw new RuntimeException();
+  }
+  System.out.println(x);
+}
+      \end{lstlisting}
+    \end{minipage}
+    \begin{minipage}{0.49\linewidth}
+      \begin{lstlisting}[stepnumber=1]
+void f(int x) {
+  try {
+    g(x);
+  }
+
+
+
+
+
+  g(x);
+}
+
+void g(int x) {
+  if (x < 0) {
+    throw new RuntimeException();
+  }
+  System.out.println(x);
+}
+      \end{lstlisting}
+    \end{minipage}
+  \end{center}
+\end{example}
+
+As big a problem as this one is, it doesn't occur in all cases, because of how
+\texttt{catch} blocks are generally treated when slicing. Generally, two kinds
+of dependencies among statements are analyzed: control (whether another line
+executes or not may depend on the result of this line) and data (the inputs for
+another line may change depending on the result of this line).
+
+The problem described doesn't occur when, inside the \texttt{try} block, there
+exist outgoing data dependencies, but it does when there aren't, creating
+problems for structures with side effects such as a write action to a file or
+database, or a network request whose result isn't used outside the \texttt{try}.
+As most slicing tools ignore side effects and focus exclusively on the code,
+some \texttt{catch} blocks are erroneously removed, which leads to incomplete
+slices that end with an error that is normally caught.
+
+\section{Contributions}
+
+The main contribution of this paper is a complete solution for program slicing
+in the presence of exception handling constructs for Java; but in the process we
+will present a history of program slicing, specifically those changes that
+have affected exception handling.
Furthermore, we provide a summary of the
+different contributions each author has made to the field.
+
+The rest of the paper is structured as follows: chapter~\ref{cha:background}
+summarizes the theoretical background required, chapter~\ref{cha:state-art}
+provides a bird's eye view of the current state of the art,
+chapter~\ref{cha:solution} provides a step-by-step description of the problems
+found with the state of the art and the solutions proposed, and
+chapter~\ref{cha:conclusion} summarizes the paper and provides avenues of future
+work.
+
+% vim: set noexpandtab:tabstop=2:shiftwidth=2:softtabstop=2:wrap
diff --git a/Secciones/solution.tex b/Secciones/solution.tex
new file mode 100644
index 0000000..9a05208
--- /dev/null
+++ b/Secciones/solution.tex
@@ -0,0 +1,113 @@
+% !TEX encoding = UTF-8
+% !TEX spellcheck = en_US
+% !TEX root = ../paper.tex
+\chapter{Proposed solution}
+\label{cha:solution}
+
+This solution is an extension of Allen's~\cite{AllH03}, with some modifications to solve the problem found. Before starting, we need to split all instructions into three categories:
+
+\begin{description}
+  \item[statement] non-branching instruction, e.g. an assignment or method call.
+  \item[predicate] conditional branch, e.g. if statements and loops.
+  \item[pseudo-predicate] unconditional jump, e.g. break, continue, return, goto and throw instructions.
+\end{description}
+
+Pseudo-predicates have been previously used to model unconditional jumps with a counter-intuitive reasoning: if the pseudo-predicate were not there, the statement following it would be executed; therefore, that statement is control dependent on the pseudo-predicate.
Going back to the definition of control dependency, one could argue that the real control dependency is on the conditional branch that led to the
+
+\begin{figure}
+\centering
+\begin{lstlisting}
+if (a) {
+  return a;
+}
+print(a);
+\end{lstlisting}
+\begin{lstlisting}
+if (a) {
+
+}
+print(a);
+\end{lstlisting}
+\caption{Example of pseudo-predicate control dependencies}
+\end{figure}
+
+This is the process used to build the Program Dependence Graph.
+
+\begin{description}
+  \item[Step 1 (static analysis):] Identify for each instruction the variables read and defined. Each method is annotated with the global variables that it accesses or modifies.
+  \item[Step 2 (build CFGs):] Build a CFG for each method of the program. The start of all methods is a vertex labeled \textsl{enter}, which also contains the assignments for parameters and global variables used (\texttt{var = var\_in}). The \textsl{enter} node is connected to the first instruction of the method. In a similar fashion, all methods end in an \textsl{exit} vertex with the corresponding output variables. There exists one \textsl{normal exit} to which the last instruction and all return instructions are connected. If the method can throw any exceptions, there exists one \textsl{error exit} for each type of exception that may be thrown. The normal and erroneous exits are connected to the \textsl{exit} node.
+
+  Every normal statement is connected to the subsequent one by an unlabeled edge. Predicates have two outgoing edges, labeled \textsl{true} and \textsl{false}. Pseudo-predicates also have two outgoing edges. The \textsl{true} edge is connected to the destination of the jump (\textsl{normal exit} in the case of return, the beginning or end of the loop in the case of continue and break, etc.). The \textsl{false} edge is a non-executable edge, marked with a dashed line, and it is connected to the next instruction that would be executed if the pseudo-predicate were a \textsl{nop}.
+
+  Nodes that represent a call to a method $M$ include the transfer of parameters and variables that may be read or written to, then the execution of the call, and finally the extraction of modified variables. Call nodes are an exception to the previous paragraph, as they can have an unlimited number of outgoing edges. Each outgoing edge lands on a pseudo-predicate which indicates if the execution was correct or an exception was raised. The executable edge of each pseudo-predicate will lead to the next instruction to be executed, whereas the non-executable one will lead to the end of the try-catch block. All call nodes can lead to a \textsl{normal return} node, which is linked to the next instruction, and one error node for each type of exception that may be thrown. The erroneous returns are labeled \textsl{catch ExType}, and lead to the first instruction in the corresponding catch block\footnotemark. Any exception that may not be caught will lead to the erroneous exit node of the method it's in. See the example for more details.
+
+  \footnotetext{A problem presents itself here, as some exceptions may be able to trigger different catch blocks, due to the sequential nature of catches and polymorphism in Java. A way to fix this is to make catch blocks behave as a switch.} %TODO
+
+  \item[Step 3 (compute dependences):] For each node in the CFG, compute the control and data dependencies. Non-executable edges are only included when computing control dependencies.\\
+  \carlos{put inside definition}
+  A node $a$ is \textsl{control dependent} on node $b$ iff $a$ post-dominates one but not all of $b$'s successors.\\
+  A node $a$ is \textsl{data dependent} on node $b$ iff $b$ defines or may define a variable $x$, $a$ uses or may use $x$, and there is an $x$-definition-free path in the CFG from $b$ to $a$.\\
+  \item[Step 4 (convert each CFG into a PDG):] each node of the CFG is one node of the PDG, with two exceptions.
The first are the \textsl{enter}, \textsl{exit} and method call nodes, where the variable input and output assignments are split and placed as control-dependent on their original node. The second is the \textsl{exit} node, which is to be removed (the control dependencies from \textsl{exit} to the variable outputs are transferred to the \textsl{enter} node). Then all the dependencies computed in the previous step are drawn.
+  \item[Step 5 (connect PDGs to form a SDG):] each method call to $M$ must be connected to the \textsl{enter} node in $M$'s PDG, as a control dependence. Each variable input from the method call is connected to a variable input of the method definition via a data dependence. Each variable output from the method definition is connected to the variable output of the method call via a data dependence. Each method exit is connected \carlos{complete}.
+\end{description}
+
+\begin{itemize}
+  \item An extra type of control dependency represented by an ``exception edge''. It will represent the need to include a \textsl{catch} clause when an exception can be thrown. It is represented with a dotted line (the dashed line is reserved for data dependencies). These edges have a special characteristic: when one is traversed, only ``exception edges'' may be traversed from the new nodes included in the slice. If the same node is reached by another kind of edge, the restriction is lifted. The behavior is documented in Algorithm~\ref{alg:2pass}, where changes from the original algorithm are \underline{underlined}.
+  \item Add an extra ``exception edge'' from each ``exit with exception of type T'' node, where the type of the exception is \texttt{T}, to all the corresponding ``\texttt{throw e}'', such that \texttt{e} is or inherits from \texttt{T}.
+  \item Add an extra ``exception edge'' from each catch statement to every statement that can throw that error.
+  \item The exception edges will only be placed when the method or the try-catch statement are loop-carrier\footnote{Loop-carrier, when referring to a statement, is the property that in a CFG for the complete program, the node representing the statement is part of a loop, meaning that it could be executed again once it is executed.}.
+\end{itemize}
+
+\begin{algorithm} % generate slice
+\caption{Two-pass algorithm to obtain a backward static slice with exceptions}
+\label{alg:2pass}
+\begin{algorithmic}[1]
+  \REQUIRE SDG $\mathcal{G}$ representing program P. $\mathcal{G} = \{\mathcal{S}, \mathcal{E}\}$, where $\mathcal{S}$ is a set of states (some are statements) connected by a set of edges $\mathcal{E}$. Each edge is a triplet composed of the type of edge (control, data or \underline{exception} dependency, summary, param-in, param-out), the source and the destination of the edge.
+  \REQUIRE A slicing criterion, composed of a statement $s \in \mathcal{S}$ and a variable $v$.
+  \ENSURE $\mathcal{S}' \subseteq \mathcal{S}$, representing the slice of P according to the criterion provided.
+ + \medskip + \COMMENT{First pass (do not traverse output parameter edges).} + \STATE{$\mathcal{S}' \Leftarrow \emptyset$ (slice), $\mathcal{Q}\Leftarrow\{s\}$ (queue), $\mathcal{S}\Leftarrow \mathcal{S} - \{s\}$ (not visited), $\mathcal{R}\Leftarrow \emptyset$ (only visited via exception edge)} + \WHILE{$\mathcal{Q} \neq \emptyset$} + \STATE{$a \in \mathcal{Q}$} \COMMENT{Select an element from $\mathcal{Q}$} + \STATE{$\mathcal{Q} \Leftarrow \mathcal{Q} - \{a\}$} + \STATE{$\mathcal{S}' \Leftarrow \mathcal{S}' + \{a\}$} + \FORALL{$\mathcal{A}$ in $\{\{type, origin, a\} \in \mathcal{E}\}$} + \IF{$type \neq$ param-out \AND ($origin \notin \mathcal{S}'$ \OR ($origin \in \mathcal{R}$ \AND $a \notin \mathcal{R}$))} \label{line:param-out} + \IF{\underline{$a \in \mathcal{R}$}} + \IF{\underline{$type =$ exception}} + \STATE{\underline{$\mathcal{Q} \Leftarrow \mathcal{Q} + \{origin\}$}} + \STATE{\underline{$\mathcal{R} \Leftarrow \mathcal{R} + \{origin\}$}} + \ENDIF + \ELSE + \STATE{$\mathcal{Q} \Leftarrow \mathcal{Q} + \{origin\}$} + \ENDIF + \ENDIF + \ENDFOR + \ENDWHILE + \\ + \medskip + \COMMENT{Second pass (very similar, do not traverse input parameter edges).} + \STATE $\mathcal{Q} \Leftarrow \mathcal{S}'$ + \WHILE{$\mathcal{Q} \neq \emptyset$} + \STATE{$a \in \mathcal{Q}$} \COMMENT{Select an element from $\mathcal{Q}$} + \STATE{$\mathcal{Q} \Leftarrow \mathcal{Q} - \{a\}$} + \STATE{$\mathcal{S}' \Leftarrow \mathcal{S}' + \{a\}$} + \FORALL{$\mathcal{A}$ in $\{\{type, origin, a\} \in \mathcal{E}\}$} + \IF{$type \neq$ param-in \AND ($origin \notin \mathcal{S}'$ \OR ($origin \in \mathcal{R}$ \AND $a \notin \mathcal{R}$))} + \IF{\underline{$a \in \mathcal{R}$}} + \IF{\underline{$type =$ exception}} + \STATE{\underline{$\mathcal{Q} \Leftarrow \mathcal{Q} + \{origin\}$}} + \STATE{\underline{$\mathcal{R} \Leftarrow \mathcal{R} + \{origin\}$}} + \ENDIF + \ELSE + \STATE{$\mathcal{Q} \Leftarrow \mathcal{Q} + \{origin\}$} + \ENDIF + \ENDIF + \ENDFOR + \ENDWHILE 
+\end{algorithmic}
+\end{algorithm}
+
+% vim: set noexpandtab:ts=2:sw=2:wrap
diff --git a/Secciones/state_of_the_art.tex b/Secciones/state_of_the_art.tex
new file mode 100644
index 0000000..5ceec0c
--- /dev/null
+++ b/Secciones/state_of_the_art.tex
@@ -0,0 +1,70 @@
+% !TEX encoding = UTF-8
+% !TEX spellcheck = en_US
+% !TEX root = ../paper.tex
+\chapter{State of the art}
+\label{cha:state-art}
+
+Slicing was proposed~\cite{Wei81} and improved until the proposal of the current system (the SDG) \carlos{(citation)}. Specifically in the context of exceptions, multiple approaches have been attempted, with varying degrees of success. There exist commercial solutions for various programming languages: \carlos{name them and link}.
+In the realm of academia, there exists no definite solution. One of the most relevant initial proposals is that of Allen~\cite{AllH03}, although it was not the first one~\cite{SinH98,SinHR99} to target Java specifically.
+
+It uses the existing proposals for \textsl{return}, \textsl{goto} and other unconditional jumps to model the behavior of \textsl{throw} statements. Control flow inside \textsl{try-catch-finally} statements is simulated, both for explicit \textsl{throw} and those nested inside a method call. The base algorithm is presented, and then the proposal is detailed as changes. Unchecked exceptions are considered but regarded as ``worthless'' to include, due to the increase in size of the slices, which reduces their effectiveness as a debugging tool. This is due to the number of unchecked exceptions embedded in normal Java instructions, such as \texttt{NullPointerException} in any instance field access or method call, \texttt{IndexOutOfBoundsException} in array accesses and countless others. On top of that, handling \textsl{unchecked} exceptions opens the problem of calling an API for which there is no analyzable source code, either because the module was compiled beforehand or because it is part of a distributed system.
The former should not be an obstacle, as class files can be easily decompiled. The only information that may be lost is variable names and comments, which do not affect a slice's precision, only its readability. + +Chang and Jo \cite{JoC04} present an alternative to the CFG by computing exception-induced control flow separately from the traditional control flow computation, but go no further into the ramifications it entails for the PDG and the SDG. + +Jiang et al.~\cite{JiaZSJ06} describe a solution specific to the exception system in C++, which differs from Java's implementation of exceptions. They reuse the idea of non-executable edges in \textsl{throw} nodes, and introduce handling \textsl{catch} nodes as a switch, each trying to catch the exception before deferring to the next \textsl{catch} or propagating it to the calling method. Their proposal is centered around the IECFG (Improved Exception Control-Flow Graph), which propagates control dependencies onto the PDG and then the SDG. Finally, in their SDG, each normal and exceptional return and their data output are connected to all \textsl{catch} statements where the data may have arrived, which is fine for the example they propose, but could be inefficient if the method has many different call nodes. + +Others \cite{PraMB11} have worked specifically on the C++ exception framework. \carlos{remove or expand}. + +Finally, Hao \cite{JieS11} introduced an Object-Oriented System Dependence Graph with exception handling (EOSDG), which represents a generic object-oriented language with exception handling capabilities. Its broadness allows the EOSDG to fit both Java and C++. It uses concepts from Jiang \cite{JiaZSJ06}, such as cascading \textsl{catch} statements, while adding explicit support for virtual calls, polymorphism and inheritance. + +% TODO UNCOMPLETE + +\hrulefill +\marginnote{Alternative explanation of \cite{AllH03}, with counterexample. 
Maybe should move the counterexample backwards.} + +In her paper, Horwitz suggests treating exceptions in the following way: +\begin{itemize} + \item Statements are divided into normal statements, predicates (loops and conditional blocks) and pseudo-predicates (return and throw statements). Normal statements have only one successor in the CFG; predicates have two (one when the condition is true and another when it is false); pseudo-predicates also have two, but the one labeled ``false'' is non-executable. The non-executable edge connects to the statement that would be executed if the unconditional jump were replaced by a ``nop''. + \item \textsl{try-catch-finally} blocks are treated differently, but the treatment produces fewer dependencies than needed. Each catch block is control-dependent on any statement that may throw the corresponding exception. The +\end{itemize} + +\begin{lstlisting}[title=Example] +void main() { + int x = 0; + while (true) { + try { + f(x); + } catch (ExceptionA e) { + x--; + } catch (ExceptionB e) { + System.err.println(x); + } catch (ExceptionC e) { + System.out.println(x); + } + System.out.println(x); + } +} + +void f(int x) throws ExceptionB, ExceptionC { + x--; + if (x > 10) + throw new ExceptionA(); + else if (x == 0) + throw new ExceptionB(); + else if (x > 0) + throw new ExceptionC(); + x++; + System.out.println(x); +} + +static class ExceptionA extends ExceptionC {} +static class ExceptionB extends Exception {} +static class ExceptionC extends Exception {} +\end{lstlisting} + +In this example, we can explore all the errors found with the current state of the art. + +The first problem found is the lack of \texttt{catch} statements in the slice, as no edge is drawn from the catch. Some of the catch blocks will be included via data dependencies, but others may not be reached, even though they are still necessary if the slice includes anything after a caught exception. 
+Therefore, an extra control dependency must be introduced, in order to always include a ``catch'' statement in the slice if the ``throw'' statement is in the slice. In the example, only the catch statement from line 20 will be included, and if ExceptionC or ExceptionB were thrown, they would not be caught. That would not be a problem if the function $f$ was not executed again, but it is, making the slice incorrect. + +% vim: set noexpandtab:ts=2:sw=2:wrap diff --git a/img/breakcfg.pdf b/img/breakcfg.pdf index fd0cc26..323c21b 100644 Binary files a/img/breakcfg.pdf and b/img/breakcfg.pdf differ diff --git a/img/breakpdg.pdf b/img/breakpdg.pdf index 3693dea..56966e8 100644 Binary files a/img/breakpdg.pdf and b/img/breakpdg.pdf differ diff --git a/img/cfgsimple.pdf b/img/cfgsimple.pdf index 26634e9..dec3048 100644 Binary files a/img/cfgsimple.pdf and b/img/cfgsimple.pdf differ diff --git a/img/cfgsimple2.pdf b/img/cfgsimple2.pdf index 7abdbf7..8a79181 100644 Binary files a/img/cfgsimple2.pdf and b/img/cfgsimple2.pdf differ diff --git a/img/legendsimple.pdf b/img/legendsimple.pdf index 20d1a7d..f1f12e2 100644 Binary files a/img/legendsimple.pdf and b/img/legendsimple.pdf differ diff --git a/img/multiplycfg.pdf b/img/multiplycfg.pdf index c8d90c8..080316e 100644 Binary files a/img/multiplycfg.pdf and b/img/multiplycfg.pdf differ diff --git a/img/multiplypdg.pdf b/img/multiplypdg.pdf index feff5f0..681ae19 100644 Binary files a/img/multiplypdg.pdf and b/img/multiplypdg.pdf differ diff --git a/img/multiplysdg.pdf b/img/multiplysdg.pdf index d205eff..e20f335 100644 Binary files a/img/multiplysdg.pdf and b/img/multiplysdg.pdf differ diff --git a/img/pdgsimple.pdf b/img/pdgsimple.pdf index 5da9a37..df3230d 100644 Binary files a/img/pdgsimple.pdf and b/img/pdgsimple.pdf differ diff --git a/img/pdgsimple2.pdf b/img/pdgsimple2.pdf index 2152030..2690aea 100644 Binary files a/img/pdgsimple2.pdf and b/img/pdgsimple2.pdf differ diff --git a/img/sdgsimple.pdf b/img/sdgsimple.pdf 
index c9b7f31..7d07cd0 100644 Binary files a/img/sdgsimple.pdf and b/img/sdgsimple.pdf differ diff --git a/img/throw-example-cfg.dot b/img/throw-example-cfg.dot new file mode 100644 index 0000000..ed74a15 --- /dev/null +++ b/img/throw-example-cfg.dot @@ -0,0 +1,8 @@ +digraph g { + Start [shape=box]; + End [shape=box]; + Start -> End [style=dashed]; + Start -> "if (x < 0)" -> "throw" -> "Error exit" -> End; + "throw" -> "return Math.sqrt(x)" [style=dashed]; + "if (x < 0)" -> "return Math.sqrt(x)" -> "Normal exit" -> End; +} \ No newline at end of file diff --git a/img/throw-example-cfg.pdf b/img/throw-example-cfg.pdf new file mode 100644 index 0000000..5c6459c Binary files /dev/null and b/img/throw-example-cfg.pdf differ diff --git a/paper.tex b/paper.tex index 537507a..ef65892 100644 --- a/paper.tex +++ b/paper.tex @@ -37,9 +37,8 @@ \newcommand{\done}[1]{} \newcommand{\doubt}[1]{} \newcommand{\josep}[1]{} - \newcommand{\david}[1]{} + \newcommand{\carlos}[1]{} \newcommand{\sergio}[1]{} - \newcommand{\tama}[1]{} \else % Working \definecolor{ignoreColor}{rgb}{1,0.5,0} @@ -86,11 +85,11 @@ \tableofcontents -\include{motivation} -\include{introduction} -\include{incremental_slicing} -\include{state_of_the_art} -\include{solution} +\include{Secciones/motivation} +\include{Secciones/background} +\include{Secciones/incremental_slicing} +\include{Secciones/state_of_the_art} +\include{Secciones/solution} \bibliographystyle{plain} \bibliography{../../../../../../Biblio/biblio.bib}