% !TEX encoding = UTF-8
% !TEX spellcheck = en_GB
% !TEX root = ../paper.tex
\chapter{Main explanation?}
\label{cha:incremental}
\carlos{Review if we want to call nodes ``Enter'' and ``Exit'' or ``Start'' and ``End'' (I'd prefer the first one).}
\sergio{Enter or Entry?}
\josep{This is not our decision to make: use the same word that Horwitz uses in the SDG paper.}
\section{First definition of the SDG}
\label{sec:first-def-sdg}
The system dependence graph (SDG) is \deleted{a method}\added{the main data structure for program representation used in the}\deleted{for} program slicing\added{ area. It}\deleted{that} was first
proposed by Horwitz, Reps and Binkley \cite{HorwitzRB88}\added{ and, since then, many approaches have based their models on it}. It builds upon the
existing control flow graph (CFG), defining dependencies between vertices of the
CFG, and building a program dependence graph (PDG), which represents them.\sergio{Do we spell out the acronyms and their meaning again (CFG, PDG)? They have already been introduced earlier.} The
\deleted{system dependence graph (}SDG\deleted{)} is then built from the assembly of the different
PDGs (each representing a method of the program), linking each method call to
its corresponding definition. Because each graph is built from the previous one,
new constructs can be added to the CFG without the need to alter the
algorithm that converts \added{each} CFG to PDG and then to \added{the final} SDG. The only modification
possible is the redefinition of a\added{n already defined} dependency or the addition of new kinds of
dependence.

The language covered by the initial proposal \deleted{was}\added{is}\sergio{Everything in the present tense or everything in the past tense?} a simple one, featuring
procedures with modifiable parameters and basic instructions, including calls to
procedures, variable assignments, arithmetic and logic operators and conditional
instructions (branches and loops)\deleted{:}\added{, i.e.,}\sergio{I am not sure whether ``i.e.,'' reads well here.} the basic features of an imperative
programming language. The \deleted{control flow graph was}\added{CFGs are} as simple as the programs
themselves, with each graph representing one procedure. The instructions of the
program are represented as vertices of the graph and are split into two
categories: statements, which have no effect on the control flow (\added{e.g., }assignments,
procedure calls) and predicates, whose execution may lead to one of multiple
(though traditionally two) \added{different paths} (\added{e.g., }conditional instructions). \deleted{S}\added{While s}tatements are
connected sequentially to the next instruction\deleted{. P}\added{, p}redicates have two outgoing
edges, each\added{ of them} connected to the first statement that should be executed\deleted{,} according
to the result of evaluating the conditional expression in the guard of the
predicate.

\begin{definition}[Control Flow Graph \carlos{add original citation}]
\label{def:cfg}
A \emph{control flow graph} $G$ of a program\sergio{program or method?} $P$ is a directed graph, represented as a tuple $\langle N, E \rangle$, where $N$ is a set of nodes \josep{such that for each statement $s$ in $P$ there is a node in $N$ labeled with $s$ and there are two special nodes...}, composed of a method's\sergio{method or program?} statements plus two special nodes, ``Start'' and ``End''; and $E$ is a set of edges of the form $e = \left(n_1, n_2\right) \mid n_1, n_2 \in N$.
\josep{This is a definition. It cannot contain opinions or vague content. Either you define that Start and End are nodes or you do not, but do not describe what others have done inside a definition. I would remove what follows.}
Most algorithms\added{, in order} to generate the SDG\added{,} mandate the ``Start'' node to be the only source and \added{the} ``End'' \added{node} to be the only sink in the graph. \carlos{Is it necessary to define source and sink in the context of a graph?}\josep{Remove it.}

\josep{From here}Edges are created according to the possible execution paths that exist; each statement is connected to any statement that may immediately follow it. Formally, \josep{up to here, move it out of the definition and use it to explain the definition. It makes no sense to say something informal inside a definition and then, still inside the definition, say ``formally''. By definition, EVERYTHING in it must be formal.}an edge $e = (n_1, n_2)$ exists if and only if there exists an execution of the program where $n_2$ is executed immediately after $n_1$. \josep{Again, you cannot say ``in general''. Either define that they are evaluated or that they are not, but do not state what is usually done. You are defining here.}In general, expressions are not evaluated\added{ when generating the CFG}; so a\deleted{n \texttt{if}}\added{ conditional} instruction \added{will have}\deleted{has} two outgoing edges \added{regardless of the condition value being} \deleted{even if the condition is} always true or false, e.g.\added{,} \texttt{1 == 0}.
\end{definition}
To build the PDG and then the SDG, two dependencies are derived directly from the structure of the CFG: data and control dependence. \sergio{But first, we need to define the concept of postdominance in a graph, which is necessary in the definition of control dependency:}\sergio{I am not fully convinced, but consider whether to add something here or leave it as it is.}

\begin{definition}[Postdominance \carlos{add original citation?}]
\label{def:postdominance}
\josep{Let $C = (N,E)$ be a CFG.}
Vertex $b\josep{\in N}$ \textit{postdominates} vertex $a\josep{\in N}$ if and only if $b$ is on every path from $a$ to the ``End'' vertex.
\end{definition}
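As a worked illustration of Definition~\ref{def:postdominance}, consider a small CFG that is assumed here only for the sake of the example: the loop \texttt{while (x > y) x = x - 1;} followed by \texttt{print(x)}. Its nodes are named $S$ (``Start''), $w$ (the loop guard), $a$ (the assignment), $p$ (the \texttt{print}) and $E$ (``End''); these names and the notation $\mathit{pdom}$ are not part of the original proposal. The edges are
\[
  \{(S, w),\ (w, a),\ (a, w),\ (w, p),\ (p, E)\}.
\]
Writing $\mathit{pdom}(n)$ for the set of vertices other than $n$ that postdominate $n$, every path from $S$, $a$ or $w$ to $E$ must leave the loop through $w$ and then traverse $p$, so
\[
  \mathit{pdom}(S) = \mathit{pdom}(a) = \{w, p, E\}, \qquad
  \mathit{pdom}(w) = \{p, E\}, \qquad
  \mathit{pdom}(p) = \{E\}.
\]
In particular, $a$ does not postdominate $w$, because the path $w, p, E$ reaches ``End'' without executing the assignment.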
\begin{definition}[Control dependency\sergio{dependency or dependence?} \carlos{add original citation}]
\label{def:ctrl-dep}
\josep{Let $C = (N,E)$ be a CFG.}
Vertex $b\josep{\in N}$ is \textit{control dependent} on vertex $a\josep{\in N}$ ($a \ctrldep b$) if and only if $b$ postdominates one but not all of $a$'s successors. \josep{What follows is actually a lemma. It does not need to be stated as a lemma, but it should be moved after the definition.}It follows that a vertex with only one successor cannot be the source of control dependence.
\end{definition}
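Definitions~\ref{def:postdominance} and~\ref{def:ctrl-dep} can be turned into a small executable sketch. The following Java fragment is only an illustration under stated assumptions, not the construction used by any of the cited works: the class \texttt{ControlDependenceSketch}, its method names and the representation of the CFG as a map from node names to successor lists are hypothetical. Every node must appear as a key of the map, and every node other than ``End'' is assumed to have at least one successor. Postdominators are computed with a simple fixed--point iteration, and control dependence is then read directly from the definition.

\begin{lstlisting}
import java.util.*;

final class ControlDependenceSketch {
    // Reflexive postdominator sets, computed as the greatest fixed point of
    // pdom(n) = {n} + intersection of pdom(s) over the successors s of n.
    static Map<String, Set<String>> postDominators(
            Map<String, List<String>> succ, String end) {
        Map<String, Set<String>> pdom = new HashMap<>();
        for (String n : succ.keySet()) {
            // "End" starts (and stays) at {End}; the rest start at "all nodes".
            Set<String> init = n.equals(end) ? Set.of(end) : succ.keySet();
            pdom.put(n, new HashSet<>(init));
        }
        boolean changed = true;
        while (changed) {
            changed = false;
            for (String n : succ.keySet()) {
                if (n.equals(end)) continue;
                Set<String> next = new HashSet<>(succ.keySet());
                for (String s : succ.get(n)) next.retainAll(pdom.get(s));
                next.add(n);
                if (!next.equals(pdom.get(n))) {
                    pdom.put(n, next);
                    changed = true;
                }
            }
        }
        return pdom;
    }

    // Pairs "a -> b" where b postdominates one but not all successors of a.
    static Set<String> controlDependences(
            Map<String, List<String>> succ, String end) {
        Map<String, Set<String>> pdom = postDominators(succ, end);
        Set<String> deps = new TreeSet<>();
        for (String a : succ.keySet()) {
            List<String> successors = succ.get(a);
            for (String b : succ.keySet()) {
                long hits = successors.stream()
                        .filter(s -> pdom.get(s).contains(b)).count();
                if (hits > 0 && hits < successors.size()) {
                    deps.add(a + " -> " + b);
                }
            }
        }
        return deps;
    }

    public static void main(String[] args) {
        // Assumed CFG of: while (x > y) { x = x - 1; } print(x);
        Map<String, List<String>> succ = new LinkedHashMap<>();
        succ.put("Start", List.of("while"));
        succ.put("while", List.of("assign", "print"));
        succ.put("assign", List.of("while"));
        succ.put("print", List.of("End"));
        succ.put("End", List.of());
        System.out.println(controlDependences(succ, "End"));
    }
}
\end{lstlisting}

For this loop the sketch reports that the assignment is control dependent on the loop guard. Applied literally, the definition also makes the guard control dependent on itself, since it postdominates the loop body but not the statement after the loop; a PDG construction may choose to discard such self--dependences.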
\begin{definition}[Data dependency\sergio{dependency or dependence?} \carlos{add original citation}]
\label{def:data-dep}
\josep{Let $C = (N,E)$ be a CFG.}
Vertex $b\josep{\in N}$ is \textit{data dependent} on vertex $a\josep{\in N}$ ($a \datadep b$) if and only if $a$ may define a variable $x$, $b$ may use $x$ and there exists a \carlos{could it be ``an''??} $x$-definition-free path from $a$ to $b$.

Data dependency was originally defined as flow dependency, and split into loop and non--loop related dependencies\josep{I think the term is loop-carried. I believe it appears in Frank Tip's paper.}, but that distinction is no longer useful to compute program slices \sergio{Who said that it is no longer useful? Is it worth citing?}. \josep{It is useful in program slicing, but not in debugging.}
It should be noted that variable definitions and uses can be computed for each statement independently, analysing the procedures called by it if necessary. The variables used and defined by a procedure call are those used and defined by its body.
\end{definition}
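Definition~\ref{def:data-dep} can likewise be sketched as a forward search over the CFG. The Java fragment below is a minimal illustration, not the algorithm of the cited works: the class \texttt{DataDependenceSketch}, the maps \texttt{defs} and \texttt{uses} (the variables each node may define or use) and the choice of treating the parameters as defined at ``Start'' are assumptions made for the example. For every variable $x$ defined at a node $a$, the search walks forward from $a$ and stops each path at the first node that redefines $x$.

\begin{lstlisting}
import java.util.*;

final class DataDependenceSketch {
    // Pairs "a -> b" such that a may define x, b may use x, and some path
    // from a to b has no intermediate node that defines x.
    static Set<String> dataDependences(Map<String, List<String>> succ,
                                       Map<String, Set<String>> defs,
                                       Map<String, Set<String>> uses) {
        Set<String> deps = new TreeSet<>();
        for (String a : succ.keySet()) {
            for (String x : defs.getOrDefault(a, Set.of())) {
                Deque<String> work = new ArrayDeque<>(succ.get(a));
                Set<String> seen = new HashSet<>();
                while (!work.isEmpty()) {
                    String b = work.pop();
                    if (!seen.add(b)) continue;
                    if (uses.getOrDefault(b, Set.of()).contains(x)) {
                        deps.add(a + " -> " + b);
                    }
                    // A node that redefines x ends the x-definition-free path.
                    if (!defs.getOrDefault(b, Set.of()).contains(x)) {
                        work.addAll(succ.get(b));
                    }
                }
            }
        }
        return deps;
    }

    public static void main(String[] args) {
        // Assumed CFG of: proc f(x, y) { while (x > y) { x = x - 1; } print(x); }
        Map<String, List<String>> succ = new LinkedHashMap<>();
        succ.put("Start", List.of("while"));
        succ.put("while", List.of("assign", "print"));
        succ.put("assign", List.of("while"));
        succ.put("print", List.of("End"));
        succ.put("End", List.of());
        Map<String, Set<String>> defs = Map.of(
                "Start", Set.of("x", "y"), "assign", Set.of("x"));
        Map<String, Set<String>> uses = Map.of(
                "while", Set.of("x", "y"), "assign", Set.of("x"),
                "print", Set.of("x"));
        System.out.println(dataDependences(succ, defs, uses));
    }
}
\end{lstlisting}

On this example the sketch finds six dependences: ``Start'' reaches the guard, the assignment and the \texttt{print}, and the assignment reaches the guard, itself (through the loop) and the \texttt{print}.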
With the data and control dependencies, the PDG may be built by replacing the
edges from the CFG by data and control dependence edges. The former tends to be
represented as a thin solid line, and the latter as a thick solid line. In the
examples, \added{data and control dependencies are represented by thin solid red and black lines respectively}\deleted{data dependencies will be thin solid red lines}.

\begin{definition}[Program dependence graph]
\label{def:pdg}
\josep{Given a program $P$,} The \textsl{program dependence graph} (PDG) \josep{associated with $P$} is a directed graph (and originally a tree\sergio{???}\josep{historical remarks do not belong in a definition}) represented by \josep{a triple $\langle N, E_c, E_d \rangle$ where $N$ is...} three elements: a set of nodes $N$, a set of control edges $E_c$ and a set of data edges $E_d$. \sergio{$PDG = \langle N, E_c, E_d \rangle$}

Given a method $M$ and its CFG $C = \langle N, E \rangle$, the PDG of $M$ is $P = \langle N', E_c, E_d \rangle$, where
% $$E_c = \{ (a, b) | a, b \in N' \wedge a \ctrldep b\}$$
\begin{enumerate}
\vspace{-1em}
\item $N' = N \setminus \{\textnormal{End}\}$ \vspace{-1em}
\item $(a, b) \in E_c \iff a, b \in N' \wedge a \ctrldep b ~ \wedge \not\exists c \in N' ~.~ a \ctrldep c \wedge c \ctrldep b$ \vspace{-1em}
\item $(a, b) \in E_d \iff a, b \in N' \wedge a \datadep b$
\end{enumerate}
The set of nodes corresponds to the set of nodes of the CFG\josep{Which CFG? A definition cannot take for granted that a CFG exists.}, excluding the ``End'' node.

Both sets of edges are built as follows\josep{:}. There is a control edge between two nodes $n_1$ and $n_2$ if and only if $n_1 \ctrldep n_2$\sergio{Remember to avoid generating edges in order to prevent transitivity. Decide whether we define control arc as a separate definition.}, and a data edge between $n_1$ and $n_2$ if and only if $n_1 \datadep n_2$. Additionally, if a node $n$ does not have any incoming control edges, it has a ``default'' control edge $e = (\textnormal{Start},n)$; so that ``Start'' is the only source node of the graph.

Note: \josep{A definition cannot contain notes. This goes outside.}the most common graphical representation is a tree--like structure based on the control edges, with nodes sorted left to right according to their position in the original program. Data edges do not affect the structure, so that the graph is easily readable.
\end{definition}
\sergio{I think the definitions of the CFG and the PDG should make it clearer that there are several per program (one per function), so that this last sentence makes more sense.}
Finally, the SDG is built from the combination of all the PDGs that compose the
program.
\begin{definition}[System dependence graph]
\label{def:sdg}
Given a program $P$ composed of a set of methods $M = \{m_0, \ldots, m_n\}$ and their associated PDGs (each method $m_i$ has a PDG $G_{PDG}^i = \langle N^i, E_c^i, E_d^i \rangle$), the \textit{system dependence graph} (SDG) of $P$ is a graph $G = \langle N', E'_c, E'_d, E_{fc}, E_s \rangle$ where $N' = \bigcup_{i=0}^n N^i$, $E'_c = \bigcup_{i=0}^n E_c^i$, $E'_d = \bigcup_{i=0}^n E_d^i$, and $E_{fc}$ and $E_s$ are the sets of function call and summary edges, respectively.
\josep{Fix this definition in the same way as the one for the PDG. Right now it is completely informal. It should be defined on top of the PDG. That is, an SDG is the appropriate connection of several PDGs, one per method. Define only what is new: call arcs, parameter-in arcs, parameter-out arcs and summary arcs.}
The \textsl{system dependence graph} (SDG) is a directed graph that represents the control and data dependencies of a whole program. It has three kinds of edges: control, data and function call. The graph is built combining multiple PDGs, with the ``Start'' nodes labeled after the function they begin. There exists one function call edge between each node containing one or more calls and each of the ``Start'' node\josep{s} of the method called. In a programming language where the function call is ambiguous (e.g., with pointers or polymorphism), there exists one edge leading to every possible function called.\sergio{This definition has ended up very informal, hasn't it? What happened to $E_c,~E_d,~E_{fc},$ the nodes of the PDG...?}
\end{definition}
\begin{example}[Creation of an SDG from a simple program]
Given the program shown below (left), the control flow graphs for both methods are shown on the right: \\
\begin{minipage}{0.2\linewidth}
\begin{lstlisting}
proc main() {
  a = 10;
  b = 20;
  f(a, b);
}
proc f(x, y) {
  while (x > y) {
    x = x - 1;
  }
  print(x);
}
\end{lstlisting}
\end{minipage}
\begin{minipage}{0.79\linewidth}
\includegraphics[width=0.6\linewidth]{img/cfgsimple}
\end{minipage}
\sergio{Centre the figure; there is too much empty space on the right.}

Then, control and data dependencies are computed, arranging the nodes in the \josep{corresponding} PDG\josep{s (see the two PDGs inside the two squares below)}\sergio{FigureRef missing}. Finally, the two graphs are connected with summary edges\sergio{With what? At this point the reader does not know what these are or what they are for. These should rather be function call edges, and if that is the black edge that goes from f(a,b) to Start f(), it should have a different colour to tell it apart.} to create the SDG:
\begin{center}
\includegraphics[width=0.8\linewidth]{img/sdgsimple}
\end{center}
\end{example}
\subsubsection{Function calls and data dependencies}
\carlos{Vocabulary: when is it appropriate to use method, function and procedure????}\sergio{Good question. I think it is hierarchical: method includes function and procedure, and the latter two are disjoint, right?} \josep{No. ``Method'' implies object orientation. If you are talking about a particular language (e.g., Java), you must use the vocabulary of that language (e.g., method). If you are speaking in general and want a word that subsumes them all, I have seen two ways of doing it: (1) use ``routine'' (or another word, e.g., method) the first time, with a footnote stating that in the rest of the article routine refers to method/function/procedure/predicate; (2) write method/function/procedure/predicate, separated by slashes. In this thesis it seems more appropriate to talk about methods, with a footnote on first use saying that we will talk about methods but all the developments are equally applicable to functions and procedures.}

In the original definition of the SDG, there was special handling of data dependencies when calling functions, as it was considered that parameters were passed by value, and global variables did not exist. \carlos{Name and cite paper that introduced it} solves this issue by splitting function calls and function \added{definitions} into multiple nodes. This proposal solved \josep{the problem}everything\sergio{Does it solve everything?} related to parameter passing: by value, by reference, complex variables such as structs or objects and return values.

To that end, the following modifications are made to the different graphs:
\begin{description}
\item[CFG.] In each CFG, global variables read or modified and parameters are added to the label of the ``Start'' node in assignments of the form $par = par_{in}$ for each parameter and $x = x_{in}$ for global variables. Similarly, global variables and parameters modified are added to the label of the ``End'' node as \added{assignments of the form} $x_{out} = x$. \added{From now on, we will refer to the described assignments as input and output information respectively.} \sergio{\{}The parameters are only passed back if the value set by the called method can be read by the caller\sergio{\} I do not understand what this sentence refers to.}. Finally, in method calls the same values must be packed and unpacked: each statement containing a function call is relabeled to contain \added{its related} input (of the form $par_{in} = \textnormal{exp}$ for parameters or $x_{in} = x$ for global variables) and output (always of the form $x = x_{out}$) \added{information}. \sergio{Is there no parameter\_out? I assume, then, that there is no passing by value?}
\item[PDG.] Each node \added{augmented with input or output information}\deleted{modified} in the CFG is \added{now} split into multiple nodes: the original \deleted{label}\added{node} \added{(Start, End or function call)} is the main node and each assignment \added{contained in the input and output information} is represented as a new node, which is control--dependent on the main one. Visually, \added{new nodes coming from the input information}\deleted{input is} \added{are} placed on the left and \added{the ones coming from the output information}\deleted{output} on the right; with parameters sorted accordingly.
\item[SDG.] Three kinds of edges are introduced: parameter input (param--in), parameter output (param--out) and summary edges. Parameter input edges are placed between each method call's input node and the corresponding method definition input node. Parameter output edges are placed between each method definition's output node and the corresponding method call output node. Summary edges are placed between the input and output nodes of a method call, according to the dependencies inside the method definition: if there is a path from an input node to an output node of the method definition, there is a dependence between them, and a summary edge is placed between the corresponding two nodes in every call to that method.\sergio{I have the feeling that the explanation of what a summary edge is comes somewhat late and should perhaps be in some previous definition. Let Josep give his opinion.}\josep{Indeed, it comes late. These dependences cannot be defined after defining the SDG, because then what the formal definition defines is not an SDG (only a part of it), and whenever you mention the SDG from then on everything will be incomplete. Definitions are sacred, so there are two solutions: (1) explain these three kinds of edges before the definition of the SDG, so that they can be formally defined in it, or (2) postpone the formal definition of the SDG until here (so that they can be included). Or anything else that makes the SDG well defined.}

Note: \deleted{parameter input and output}\added{param-in and param-out} edges are separated because the traversal algorithm traverses them only sometimes (the output edges are excluded in the first pass and the input edges in the second).\sergio{It is delicate to mention the passes without having said anything about the slicing algorithm; readers who do not know slicing will be completely lost here. Consider removing this note.}\josep{Postpone this note until you discuss the slicing algorithm. At that point you can say that the distinction between parameter-in and parameter-out exists precisely so that there can be two passes. There it will make sense and will clarify; here it confuses. ;-)}
\end{description}
\begin{example}[Variable packing and unpacking]
Let \josep{An excellent Beatles song, but better to start like this: Let $f(x, y)$ be a function with... ;-)} $f(x, y)$ be a function with two integer parameters \added{which\josep{that} modifies the argument passed in its second parameter}, and $f(a + b, c)$ a call to it, with parameters passed by reference if possible. The label of the method call node in the CFG would be ``\texttt{x\_in = a + b, y\_in = c, f(a + b, c)\josep{???}, c = y\_out}''; method $f$ would have \texttt{x = x\_in, y = y\_in} in the ``Start'' node and \texttt{y\_out = y} in the ``End'' node. The relevant section of the SDG would be: \josep{This whole paragraph and the figure that follows cannot be understood. It must be rewritten and explained more carefully, step by step. This is supposed to be the example of the section, the one that clears up what $x_{in}$ is, etc., and how the SDG works. However, rather than clarifying, it confuses (it clarifies nothing to a reader who does not know slicing). In fact, to make it understandable, once the graph is built it would be good to extend the example a little, explaining how the dependences make what is inside the called method depend (following the edges) on what is in the calling method (or at least on the arguments of the call). This requires some explanatory text.}
\begin{center}
\includegraphics[width=0.5\linewidth]{img/parameter-passing}
\end{center}
\end{example}
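The packing and unpacking of the call node in the previous example can be sketched in a few lines of Java. The fragment is a simplified illustration under stated assumptions: the class \texttt{CallLabelSketch} and its method are hypothetical, global variables are left out, and an argument is considered to be passable by reference only when it is a plain identifier (which is why \texttt{a + b} produces no output assignment).

\begin{lstlisting}
import java.util.*;

final class CallLabelSketch {
    // Builds the parts of the label of a call node: one "p_in = arg"
    // per parameter, the call itself, and one "arg = p_out" for each
    // modified parameter whose argument is a plain variable.
    static List<String> callLabel(String callee, List<String> formals,
                                  List<String> actuals, Set<String> modified) {
        List<String> label = new ArrayList<>();
        for (int i = 0; i < formals.size(); i++) {
            label.add(formals.get(i) + "_in = " + actuals.get(i));
        }
        label.add(callee + "(" + String.join(", ", actuals) + ")");
        for (int i = 0; i < formals.size(); i++) {
            boolean isVariable = actuals.get(i).matches("[A-Za-z_][A-Za-z0-9_]*");
            if (modified.contains(formals.get(i)) && isVariable) {
                label.add(actuals.get(i) + " = " + formals.get(i) + "_out");
            }
        }
        return label;
    }

    public static void main(String[] args) {
        // The call f(a + b, c), where f modifies its second parameter y.
        System.out.println(callLabel("f", List.of("x", "y"),
                List.of("a + b", "c"), Set.of("y")));
    }
}
\end{lstlisting}

For the call $f(a + b, c)$ the sketch yields the same four parts as the label given in the example: the two input assignments, the call and the single output assignment \texttt{c = y\_out}.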
\sergio{This figure would be nicer as a progression, if time allows, so that it would be almost self-explanatory: CFG $\rightarrow$ PDG $\rightarrow$ SDG. The current one would be the SDG; the others would have little more than a node and a label.}
\section{Unconditional control flow}
Even though the initial definition of the SDG was \deleted{useful}\added{adequate} to compute slices, the
language covered was too limited for the typical languages of the 1980s, which
included (in one form or another) unconditional control flow. Therefore, one of
the first \added{proposed upgrades}\deleted{additions contributed} to the algorithm to build \deleted{system dependence
graphs}\added{SDGs} was the inclusion of unconditional jumps, such as ``break'',
``continue'', ``goto'' and ``return'' statements (or any other equivalent). A
naive representation would be to treat them the same as any other statement, but
with the outgoing edge landing in the corresponding instruction (outside the
loop, at the loop condition, at the method's end, etc.).
An alternative approach is to represent the instruction as an edge, not a vertex, connecting the previous statement with the next to be executed. \sergio{I would merge the two previous proposals (naive and alternative) into one sentence instead of separating them, because after reading the first one I was already bothered that we did not say who proposed it or why it was not useful.}
Both of these approaches fail to generate a control dependence from the unconditional jump, as the definition of control dependence (see Definition~\ref{def:ctrl-dep}) requires a vertex to have more than one successor to be a source of control dependence.
From here, two approaches stem: the first is to
redefine control dependency in order to reflect the real effect of these
instructions, as some authors~\cite{DanBHHKL11} have tried to do, and the
second is to alter the creation of the SDG to ``create'' those
dependencies, which is the most widely used solution.
This second approach was proposed by Ball and Horwitz~\cite{BalH93}, and classifies instructions into three separate categories:
\begin{description}
\item[Statement.] Any instruction that is not a conditional or unconditional jump. \josep{\deleted{It has one outgoing edge in the CFG, to the next instruction that follows it in the program.}\added{Those nodes that represent a statement in the CFG have one outgoing edge pointing to the next instruction that follows it in the program.}}
\item[Predicate.] Any conditional jump instruction, such as \texttt{while}, \texttt{until}, \texttt{do-while}, \texttt{if}, etc. \josep{\deleted{It has two outgoing edges, labeled \textit{true} and \textit{false}; leading to the corresponding instructions.}\added{In the CFG, those nodes representing predicates have two outgoing edges, labeled \textit{true} and \textit{false}, leading to the corresponding instructions.}}
\item[Pseudo--predicates.] Unconditional jumps (e.g., \texttt{break}, \texttt{goto}, \texttt{continue}, \texttt{return}) are treated like predicates, with the difference that the outgoing edge labeled \textit{false} is marked as non--executable\josep{ (because there is no possible execution in which such an edge would be traversed,\deleted{, and there is no possible execution where such edge would be possible,} according to the definition of the CFG, see Definition~\ref{def:cfg})}. Originally the edges had a specific reasoning backing them up: the \textit{true} edge leads to the jump's destination and the \textit{false} one, to the instruction that would be executed if the unconditional jump was removed, or converted into a \texttt{no op}\sergio{no op or no-op?} (a blank operation that performs no change to the program's state). \sergio{\{}This specific behavior is used with unconditional jumps, but no longer applies to pseudo--predicates, as more instructions have used this category as a means of ``artificially'' \carlos{bad word choice} generating control dependencies.\sergio{\} Do not go down this path: when this was defined, the creation of artificial nodes was not contemplated. Remove the ``originally''; right now we are at ``originally''.}
\end{description}
\carlos{Pseudo--statements now have been introduced and are used to generate all control edges (for now just the Start method to the End).}\josep{I do not understand this CCC.}
As a consequence of this classification, every instruction after an unconditional jump $j$ is control--dependent (either directly or indirectly) on $j$ and the structure containing it (\josep{a predicate such as }a conditional statement or a loop), as can be seen in the following example.
\begin{figure}
\centering
\begin{minipage}{0.3\linewidth}
\begin{lstlisting}
static void f() {
  int a = 1;
  while (a > 0) {
    if (a > 10) break;
    a++;
  }
  System.out.println(a);
}
\end{lstlisting}
\end{minipage}
\begin{minipage}{0.6\linewidth}
\includegraphics[width=0.4\linewidth]{img/breakcfg}
\includegraphics[width=0.59\linewidth]{img/breakpdg}
\end{minipage}
\caption{A program with unconditional control flow (left), its CFG (center) and PDG (right).}
\label{fig:break-graphs}
\end{figure}
\begin{example}[Control dependencies generated by unconditional instructions]
\label{exa:unconditional}
Figure~\ref{fig:break-graphs} showcases a small program with a \texttt{break} statement, its CFG and PDG with a slice in grey. The slicing criterion (line 5, variable $a$) is control dependent on both the unconditional jump and its surrounding conditional instruction (both on line 4), even though it is not necessary to include it\sergio{What does this ``it'' refer to?} (in the context of weak slicing).
Note: the ``Start'' node $S$ is also categorized as a pseudo--predicate, with the \textit{false} edge connected to the ``End'' node, therefore generating a dependence from $S$ to all the nodes inside the method. This removes the need to handle $S$ as a special case when converting a CFG to a PDG, but weakens the interpretation of non--executable edges as leading to the ``instruction that would be executed if the node was absent or a no--op''.
\end{example}
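The effect of the non--executable edge can be checked by hand on the program of Figure~\ref{fig:break-graphs}. The node names used here are introduced only for this illustration, and the non--executable edge that leaves ``Start'' is ignored for brevity. Let $w$ be the loop guard, $i$ the \texttt{if}, $b$ the \texttt{break}, $c$ the statement \texttt{a++} and $p$ the call to \texttt{println}. The executable edges that involve these nodes are
\[
  (w, i),\ (w, p),\ (i, b),\ (i, c),\ (b, p),\ (c, w),\ (p, \textnormal{End}).
\]
Treating $b$ as a pseudo--predicate adds the non--executable edge $(b, c)$, so the successors of $b$ become $\{p, c\}$. Applying Definition~\ref{def:ctrl-dep}, $c$ trivially postdominates the successor $c$ but not the successor $p$ (the path $p, \textnormal{End}$ avoids it), hence
\[
  b \ctrldep c .
\]
In the same way, $c$ postdominates the successor $c$ of $i$ but not its successor $b$ (the path $b, p, \textnormal{End}$ avoids it), hence $i \ctrldep c$. Without the edge $(b, c)$, $b$ would have a single successor and the first dependence could not exist.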
The original paper~\cite{BalH93} does prove its completeness, but disproves its correctness by providing a counter--example similar to example~\ref{exa:nested-unconditional}. This proof affects both weak and strong slicing, so improvements can be made on this proposal. The authors postulate that a more correct approach would be achievable if the slice's restriction of being a subset of instructions were lifted.
\begin{example}[Nested unconditional jumps]
\label{exa:nested-unconditional}
In the case of nested unconditional jumps where both jump to the same destination, only one of them (the outermost one) is needed. Figure~\ref{fig:nested-unconditional} showcases the problem, with the minimal slice \carlos{have not defined this yet} in grey, and the algorithmically computed slice in light blue. Specifically, lines 3 and 5 are included unnecessarily.
\begin{figure}
\begin{minipage}{0.15\linewidth}
\begin{lstlisting}
while (X) {
  if (Y) {
    if (Z) {
      A;
      break;
    }
    B;
    break;
  }
  C;
}
D;
\end{lstlisting}
\end{minipage}
\begin{minipage}{0.84\linewidth}
\includegraphics[width=0.4\linewidth]{img/nested-unconditional-cfg}
\includegraphics[width=0.59\linewidth]{img/nested-unconditional-pdg}
\end{minipage}
\caption{A program with nested unconditional control flow (left), its CFG (center) and PDG (right).}
\label{fig:nested-unconditional}
\end{figure}
\end{example}
\carlos{Add proposals to fix both problems showcased.}
\section{Exceptions}
\sergio{I think we have not yet said that our target language is Java; this would be a good moment to do so.}
Exception handling was first tackled in the context of Java program slicing by Sinha et al. \cite{SinH98}, with later contributions by Allen and Horwitz~\cite{AllH03}. There exist contributions for other programming languages, which will be explored later (chapter~\ref{cha:state-art}) \deleted{and other small contributions}. \sergio{I would perhaps reorder these sentences to go from the general to the specific: first say that there are many contributions, which we will see in chapter~\ref{cha:state-art}, and then that we will focus on the approaches that tackle the problem for Java, where the most relevant proposals are: such and such.} The following section will explain the treatment of the different elements of exception handling in Java program slicing.
As seen in section~\ref{sec:intro-exception}, exception handling in Java adds
two constructs: \texttt{throw} and \texttt{try-catch}. Structurally, the
first one resembles an unconditional control flow statement carrying a value ---like \texttt{return} statements--- but its destination is not fixed, as it depends on the dynamic typing of the value.
If there is a compatible \texttt{catch} block, execution will continue inside it, otherwise the method exits with the \deleted{corresponding value as the }error \added{as returned value}.
The same process is repeated in the method that called the current one, until either the call stack is emptied or the exception is successfully caught.
\deleted{If}\added{Eventually, if} the exception is not caught \deleted{at all}\added{by any stacked method}, the program exits with an error (except in multi--threaded programs, in which case only the corresponding thread is terminated).
The \texttt{try-catch} statement can be compared to a \texttt{switch} which compares types (with \texttt{instanceof}) instead of constants (with \texttt{==} and \texttt{Object\#equals(Object)} \sergio{Is this notation mandatory, or can we say ``... and the \texttt{equals} operands''?}). Both structures require special handling to place the proper dependencies, so that slices are complete and as correct as \deleted{can be}\added{possible}.
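The comparison between \texttt{try-catch} and a type--based \texttt{switch} can be made concrete with a short Java sketch. The class \texttt{CatchDispatchSketch} and its two methods are hypothetical names chosen for this illustration; the exception classes are the standard ones from \texttt{java.lang}. Both methods select a branch according to the dynamic type of the value, checking the alternatives in the order in which they are written.

\begin{lstlisting}
final class CatchDispatchSketch {
    // Branch selection performed by the catch clauses of a try statement.
    static String viaTryCatch(RuntimeException e) {
        try {
            throw e;
        } catch (IllegalArgumentException ex) {
            return "illegal argument";
        } catch (IllegalStateException ex) {
            return "illegal state";
        } catch (RuntimeException ex) {
            return "other";
        }
    }

    // The same selection written as a chain of instanceof tests.
    static String viaInstanceof(RuntimeException e) {
        if (e instanceof IllegalArgumentException) {
            return "illegal argument";
        } else if (e instanceof IllegalStateException) {
            return "illegal state";
        } else {
            return "other";
        }
    }
}
\end{lstlisting}

A \texttt{NumberFormatException}, for instance, is routed to the first branch by both methods, because it is a subclass of \texttt{IllegalArgumentException}: the destination of the jump depends on the dynamic type of the thrown value, not on the static type of the expression.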
\subsection{\texttt{throw} statement}
The \texttt{throw} statement combines two elements in one instruction: an unconditional jump with a value attached and a switch to an ``exception mode'', in which the usual execution order of statements is disregarded. The first one has been extensively covered and solved, as it is equivalent to the \texttt{return} instruction; but the second one requires a small addition to the CFG: there must be an alternative control flow, where the path of the exception is shown. For now\sergio{This sounds very Spanish, doesn't it? So far?}, without including \texttt{try-catch} structures, any exception thrown will exit its method with an error; so a new ``Error end'' node is needed.\sergio{This sentence does not convince me; see how this sounds to you (although I am not fully convinced by it) $\rightarrow$ So far, without including \texttt{try-catch} structures, any exception thrown would activate the mentioned ``exception mode'' and leave its method with an error state. Hence, in order to represent this behaviour, a different exit point (represented with a node called ``Error end'') needs to be defined.} \deleted{T}\added{Consequently, t}he pre-existing ``End'' node is renamed \added{as} ``Normal end'', \deleted{but now the}\added{leaving the} CFG \deleted{has}\added{with} two distinct sink nodes, which is forbidden in most slicing algorithms. To solve that problem, a general ``End'' node is created, with both normal and \deleted{exit}\added{error} ends connected to it, making it the only sink in the graph.
In order to properly accommodate a method's output variables (global variables or parameters passed by reference that have been modified), variable unpacking is added to the ``Error exit'' node, same as the ``Exit''\sergio{Exit? End? We have quite a mess with this.} node in previous examples. This change constitutes an increase in precision, as now the outputted variables are differentiated\deleted{; f}\added{. F}or example\added{,} a slice which only requires the error exit may include fewer variable modifications than one which includes both.
This treatment of \texttt{throw} statements only modifies the structure of the CFG, without altering the other graphs, the traversal algorithm, or the basic definitions for control and data dependencies. That fact makes it easy to incorporate into any existing program slicer that follows the general model described. Example~\ref{exa:throw} showcases the new exit nodes and the treatment of the \texttt{throw}\sergio{ statement?} as if it were an unconditional jump whose destination is the ``Error exit''.
\begin{example}[CFG of an uncaught \texttt{throw} statement]
Consider the simple Java method on the \deleted{right}\added{left} of figure~\ref{fig:throw}, which computes the square root of its argument if it is non-negative and throws a \texttt{RuntimeException} otherwise. The CFG in the centre illustrates the treatment of \texttt{throw}, ``normal exit'' and ``error exit'' as pseudo-statements, and the PDG on the right describes the control dependencies generated from the \texttt{throw} statement to the following instructions and exit nodes.
\label{exa:throw}
\begin{figure}[h]
\begin{minipage}{0.3\linewidth}
\begin{lstlisting}
double f(int x) {
    if (x < 0)
        throw new RuntimeException();
    return Math.sqrt(x);
}
\end{lstlisting}
\end{minipage}
\begin{minipage}{0.69\linewidth}
\includegraphics[width=\linewidth]{img/throw-example-cfg}
\end{minipage}
\caption{A simple program with a \texttt{throw} statement \added{(left)}, its CFG (centre) and its PDG (\deleted{left}\added{right}).}
\label{fig:throw}
\end{figure}
\end{example}
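
The shape of the CFG described above can also be sketched in code. The following is a minimal illustration, not the construction used by any particular slicer: the class \texttt{ThrowCfgSketch} and its helper methods are hypothetical, nodes are identified by plain labels, and only executable edges are modelled. It encodes the method \texttt{f} of figure~\ref{fig:throw} and checks that the general exit node is the only sink of the graph.
\begin{lstlisting}
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ThrowCfgSketch {
    // Adjacency list: node label -> labels of its successors.
    static Map<String, List<String>> buildCfg() {
        Map<String, List<String>> cfg = new LinkedHashMap<>();
        edge(cfg, "Enter", "if (x < 0)");
        edge(cfg, "if (x < 0)", "throw new RuntimeException()");
        edge(cfg, "if (x < 0)", "return Math.sqrt(x)");
        // The throw behaves as a jump to the error exit of the method.
        edge(cfg, "throw new RuntimeException()", "Error exit");
        edge(cfg, "return Math.sqrt(x)", "Normal exit");
        // Both exits are joined so that the graph keeps a single sink.
        edge(cfg, "Normal exit", "Exit");
        edge(cfg, "Error exit", "Exit");
        return cfg;
    }

    // Adds an edge, registering both endpoints as nodes of the graph.
    static void edge(Map<String, List<String>> cfg, String from, String to) {
        cfg.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
        cfg.computeIfAbsent(to, k -> new ArrayList<>());
    }

    // Returns the nodes without outgoing edges.
    static List<String> sinks(Map<String, List<String>> cfg) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : cfg.entrySet())
            if (entry.getValue().isEmpty()) result.add(entry.getKey());
        return result;
    }

    public static void main(String[] args) {
        System.out.println(sinks(buildCfg()));
    }
}
\end{lstlisting}
Removing the two edges that reach the general exit node would leave two sinks, which is the situation that the additional node is meant to avoid.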
\subsection{\texttt{try-catch-finally} statement}
The \texttt{try-catch} statement is the only way to stop an exception once it is thrown.
It filters \added{each} exception by its type, letting those which do not match any of the \texttt{catch} blocks propagate to \deleted{another}\added{an external} \texttt{try-catch}\deleted{ surrounding it}\added{ block} or \deleted{outside the method,} to the previous \deleted{one}\added{method} in the call stack.
On top of that, the \texttt{finally} block helps programmers guarantee code execution. It can be used instead of, or in conjunction with, \texttt{catch} blocks.
The code placed inside a \texttt{finally} block is guaranteed to run if the \texttt{try} block has been entered.
This holds true whether the \texttt{try} block exits correctly, an exception is caught, an exception is left uncaught or an exception is caught and another one is thrown while handling it (within its \texttt{catch} block).
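
The four situations listed above can be observed with a short sketch. The class \texttt{FinallyDemo} and its \texttt{scenario} parameter are hypothetical names introduced only for this illustration; the method records which blocks are executed in each case.
\begin{lstlisting}
import java.util.ArrayList;
import java.util.List;

public class FinallyDemo {
    // Scenario 0: normal exit; 1: exception caught; 2: exception left uncaught
    // by the inner try; 3: exception caught and another thrown while handling it.
    static List<String> run(int scenario) {
        List<String> trace = new ArrayList<>();
        try {
            try {
                trace.add("try");
                if (scenario == 1) throw new IllegalStateException("caught");
                if (scenario == 2) throw new UnsupportedOperationException("uncaught");
                if (scenario == 3) throw new IllegalStateException("rethrown");
            } catch (IllegalStateException e) {
                trace.add("catch");
                if (scenario == 3) throw new RuntimeException("thrown while handling");
            } finally {
                trace.add("finally");
            }
        } catch (RuntimeException e) {
            // Exceptions that escape the inner try-catch-finally end up here.
            trace.add("propagated");
        }
        return trace;
    }

    public static void main(String[] args) {
        for (int scenario = 0; scenario <= 3; scenario++)
            System.out.println(scenario + ": " + run(scenario));
    }
}
\end{lstlisting}
In every scenario the trace contains \texttt{finally}, and in the last two it appears before \texttt{propagated}: the block runs before the exception leaves the inner structure.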
\carlos{This would be useful to explain that the new dependencies introduced by the non-executable edges are not ``normal'' control dependencies, but ``presence'' dependencies. As opposed to traditional control dependence, where $a \ctrldep b$ if and only if the number of times $b$ is executed is dependent on the \textit{execution} of $a$ (e.g. conditional blocks and loops), these new control dependencies exist if and only if the number of times $b$ is executed is dependent on the \textit{presence} or \textit{absence} of $a$, which introduces a meta-problem. In the case of exceptions, it is easy to grasp that the absence of a catch block alters the results of an execution. The same happens with unconditional jumps: the absence of a \texttt{break} modifies the flow of the program, but its execution does not control anything. A differentiation seems appropriate, even if only as subcategories of control dependence: execution control dependence and presence control dependence.}
The main problem when including \texttt{try-catch} blocks in program slicing is that \texttt{catch} blocks are not always strictly necessary for the slice (less so for weak slices), but introduce new styles of control dependence \sergio{Is this discussed later, these ``new styles''? If so, remember to add a forward reference saying where. I imagine it is what your comment on presence control dependence describes.}, which must be properly mapped to the SDG. The absence of \texttt{catch} blocks may also be a problem for compilation, as Java requires at least one \texttt{catch} or \texttt{finally} block to accompany each \texttt{try} block; though that could be fixed after generating the slice, if it is required that the slice be \sergio{be or to be?} executable.
A typical\sergio{The typical one, or the one from Horwitz's proposal? If it is Horwitz's, say that they do it this way, since we have already said it is the most important work to date for Java.} representation of the \texttt{try} block is as a pseudo-predicate, connected to the first statement inside it and to the instruction that follows the \texttt{try} block.
This generates control dependencies from the \texttt{try} node to each of the instructions it contains.
\carlos{This is not really a ``control'' dependency, could be replaced by the definition of structural dependence.}\sergio{Absolutely, but to say this we need to define structural dependence, which I imagine will be in section 4.}
Inside the \texttt{try} there can be four distinct sources of exceptions:
\begin{description}
\item[Method calls.] If an exception is thrown inside a method and it is not caught, it will
surface inside the \texttt{try} block.
As \textit{checked} exceptions must be declared explicitly, method declarations may be consulted to see if a method call may or may not throw any exceptions.
On this front, polymorphism and inheritance present no problem, as an overriding method cannot declare checked exceptions that are not covered by those declared by the method it overrides.
If \textit{unchecked} exceptions are also considered, method calls could be analysed to know which exceptions may be thrown, or the documentation could be checked automatically for the Javadoc tag \texttt{@throws} to know which ones are thrown.
\sergio{In case \textit{unchecked} exceptions would be also considered, a further analysis must be done }
\item[\texttt{throw} statements.] The least common, but simplest, source, as it is treated as a \texttt{throw} inside a method. The type of the exception may be obvious, as most \carlos{this is a weird claim to make without backup} exceptions are built and thrown in the same instruction; but it also may be hidden: e.g., \texttt{throw (Exception) o} where \texttt{o} is a variable of type \texttt{Object}.
\item[Implicit unchecked exceptions.] If \textit{unchecked} exceptions are considered, many
common expressions may throw an exception, with the most common ones being trying to call
a method or accessing a field of a \texttt{null} object (\texttt{NullPointerException}),
accessing an invalid index on an array (\texttt{ArrayIndexOutOfBoundsException}), dividing
an integer by 0 (\texttt{ArithmeticException}), trying to cast to an incompatible type
(\texttt{ClassCastException}) and many others. On top of that, the user may create new
types that inherit from \texttt{RuntimeException}, but those may only be explicitly thrown.
Their inclusion in program slicing and therefore in the method's CFG generates extra
dependencies that make the slices produced bigger.
\item[Errors.] May be generated at any point in the execution of the program, but they normally
signal a situation from which it may be impossible to recover, such as an internal JVM error.
In general, most programs will not attempt to catch them, and can be excluded in order to simplify implicit unchecked exceptions (any instruction at any moment may throw an Error).
\end{description}
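
As an illustration of the implicit unchecked exceptions listed above, the following sketch triggers each of them without any explicit \texttt{throw}. The class \texttt{ImplicitExceptions} and the helper \texttt{thrownBy} are hypothetical names used only for this example.
\begin{lstlisting}
public class ImplicitExceptions {
    // Runs the action and reports the simple class name of the unchecked
    // exception it raises, or "none" if it completes normally.
    static String thrownBy(Runnable action) {
        try {
            action.run();
            return "none";
        } catch (RuntimeException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        String text = null;
        int[] array = new int[1];
        int zero = 0;
        Object object = "not an Integer";

        // Method call on a null reference.
        System.out.println(thrownBy(() -> text.length()));
        // Access to an invalid array index.
        System.out.println(thrownBy(() -> { int v = array[1]; }));
        // Integer division by zero.
        System.out.println(thrownBy(() -> { int q = 1 / zero; }));
        // Cast to an incompatible type.
        System.out.println(thrownBy(() -> { Integer i = (Integer) object; }));
        // A simple statement that cannot raise an exception.
        System.out.println(thrownBy(() -> { int sum = 1 + zero; }));
    }
}
\end{lstlisting}
None of these statements contains a \texttt{throw}, which is why considering unchecked exceptions requires almost every expression to be treated as a potential exception source.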
All exception sources are treated very similarly: the statement that may throw an exception is treated as a predicate, with the true edge connected to the instruction that would follow if the statement executed without raising exceptions, and the false edge connected to every \texttt{catch} node that may be compatible with the exception thrown.
The case of method calls that may throw exceptions is slightly different, as there may be variables to unpack, both in the case of a normal or erroneous exit. To that end, nodes containing method calls have an unbounded number of outgoing edges: one leads to a node labelled ``normal return'', after which the variables produced by any normal exit of the method are unpacked; and all the others lead to each \texttt{catch} that may catch the exception thrown. Each \texttt{catch} must then unpack the variables produced by the erroneous exits of the method.
The ``normal return'' node is itself a pseudo-statement; with the \textit{true} edge leading to the following instruction and the \textit{false} one to the first common instruction between all the paths of length $\ge 1$ that start from the method call ---which translates to the instruction that follows the \texttt{try} block if all possible exceptions thrown by the method are caught or the ``Exit'' node if there are some left uncaught.
\deleted{Carlos: CATCH Representation doesn't matter, it is similar to a switch but checking against types.
The difference exists where there exists the chance of not catching the exception;
which is semantically possible to define. When a \texttt{catch (Throwable e)} is declared,
it is impossible for the exception to exit the method; therefore the control dependency must
be redefined.}
\deleted{The filter for exceptions in Java's \texttt{catch} blocks is a type (or multiple types since
Java 7), with a class that encompasses all possible exceptions (\texttt{Throwable}), which acts
as a catch-all.
In the literature there exist two alternatives to represent \texttt{catch}: one mimics a static
switch statement, placing all the \texttt{catch} block headers at the same height, all hanging
from the exception-throwing statement, and the other mimics a dynamic switch or a chain of \texttt{if}
statements. The option chosen affects how control dependencies should be computed, as the different
structures generate different control dependencies by default.}
\deleted{\begin{description}
\item[Switch representation.] There exists no relation between different \texttt{catch} blocks,
each exception-throwing statement is connected through an edge labelled false to each
of the \texttt{catch} blocks that could be entered. Each \texttt{catch} block is a
pseudo-statement, with its true edge connected to the end of the \texttt{try} and the
As an example, a \texttt{1 / 0} expression may be connected to \texttt{ArithmeticException},
\texttt{RuntimeException}, \texttt{Exception} or \texttt{Throwable}.
If any exception may not be caught, there exists a connection to the ``Error exit'' of the method.
\item[If-else representation.] Each exception-throwing statement is connected to the first
\texttt{catch} block. Each \texttt{catch} block is represented as a predicate, with the true
edge connected to the first statement inside the \texttt{catch} block, and the false edge
to the next \texttt{catch} block, until the last one. The last one will be a pseudo-predicate
connected to the first statement after the \texttt{try} if it is a catch-all type or to the
``Error exit'' if it \added{is not}\deleted{isn't}.
\end{description}}
\begin{example}[Catches.]
Consider the segment of Java code in figure~\ref{fig:try-catch}, which includes some statements that do not use data (X, Y and Z), and a method call to \texttt{f} that uses \texttt{x} and \texttt{y}, two global variables. \texttt{f} may throw an exception, so it has been placed inside a \texttt{try-catch} structure, with a statement in the \texttt{catch} that logs the error when it occurs. Additionally, when \texttt{f} exits without an error, only \texttt{x} is modified; but when an error occurs, only \texttt{y} is modified.
Note how the pseudo-statements (``normal return'', ``catch'' and ``try'') create control dependencies through their \textit{true} and \textit{false} edges. The statements placed after the function call, inside the \texttt{catch} block and inside the \texttt{try} block are, respectively, control dependent on the aforementioned nodes. Finally, consider the statement \texttt{Z}, which is not dependent on any part of the \texttt{try-catch} block, as all exceptions that may be thrown are caught: it will execute regardless of the path taken inside the \texttt{try} block. \carlos{Consider critiquing the result, saying that despite the last sentence, statements can be removed (the catch) so that the dependencies are no longer the same.}
\begin{figure}[h]
\begin{minipage}{0.35\linewidth}
\begin{lstlisting}
try {
X;
f();
Y;
} catch (Exception e) {
System.out.println("error");
}
Z;
\end{lstlisting}
\end{minipage}
\begin{minipage}{0.64\linewidth}
\includegraphics[width=\linewidth]{img/try-catch-example}
\end{minipage}
\caption{A simple example of the representation of \texttt{try-catch} structures and method calls that may throw exceptions.}
\label{fig:try-catch}
\end{figure}
\end{example}
\carlos{From here to the end of the chapter, delete / move to solution chapter}
Regardless of the approach, when there exists a catch-all block, there is no dependency generated
from the \texttt{catch}, as all of them will lead to the next instruction. However, this means that
if no data is outputted from the \texttt{try} or \texttt{catch} block, the catches will not be picked
up by the slicing algorithm, which may alter the results unexpectedly. If this problem arises, the
simple and obvious solution would be to add artificial edges to force the inclusion of all \texttt{catch}
blocks. This adds instructions to the slice, lowering its score when evaluating against benchmarks,
but the added blocks are completely innocuous, as they just stop the exception without running any extra instruction.
Another alternative exists, but it slows down the process of creating a slice from an SDG.
The \texttt{catch} block is only strictly needed if an exception that it catches may be thrown and
an instruction after the \texttt{try-catch} block should be executed; in any other case the \texttt{catch}
block is irrelevant and should not be included. However, this change requires analysing the inclusion
of \texttt{catch} blocks after the two-pass algorithm has completed, slowing it down. In any case, each
approach trades time for accuracy and vice versa, but the trade-off is small enough to be negligible.
Regarding \textit{unchecked} exceptions, an extra layer of analysis should be performed to tag statements
with the possible exceptions they may throw. On top of that, methods must be analysed and tagged
accordingly. The worst case is that of inaccessible methods, which may throw any \texttt{RuntimeException}:
with the source code unavailable, they must be marked as capable of throwing it. This results in
a graph where each instruction is dependent on the proper execution of the previous statement, save
for simple statements that may not generate exceptions. The trade-off here is between completeness and
correctness, with the inclusion of \textit{unchecked} exceptions increasing both the completeness and the
slice size, reducing correctness. A possible solution would be to only consider user-generated exceptions
or assume that library methods may never throw an unchecked exception. Another option would be a new slicing
variation that annotates methods or limits the unchecked exceptions to be considered.
Regarding the \texttt{finally} block, most approaches treat it properly, representing it twice: once
for the case where there is no active exception and once for the case where it executes with
an exception active. An exception could also be thrown here, but that would be represented normally.
% vim: set noexpandtab:ts=2:sw=2:wrap