Chapter 3 reviewed

This commit is contained in:
Sergio Pérez 2019-12-04 16:21:21 +00:00
parent cf1594fbd4
commit 19c27d6b88
1 changed files with 48 additions and 39 deletions

View File

@ -10,82 +10,86 @@
\section{First definition of the SDG}
\label{sec:first-def-sdg}
The system dependence graph (SDG) is a method for program slicing that was first
proposed by Horwitz, Reps and Blinkey \cite{HorwitzRB88}. It builds upon the
The system dependence graph (SDG) is \deleted{a method}\added{the main data structure for program representation used in the}\deleted{for} program slicing\added{ area. It}\deleted{that} was first
proposed by Horwitz, Reps and Blinkey \cite{HorwitzRB88}\added{ and, since then, many approaches have based their models on it}. It builds upon the
existing control flow graph (CFG), defining dependencies between vertices of the
CFG, and building a program dependence graph (PDG), which represents them. The
system dependence graph (SDG) is then built from the assembly of the different
CFG, and building a program dependence graph (PDG), which represents them.\sergio{Volvemos a poner las siglas y su significado?CFG?PDG? ya se han puesto antes} The
\deleted{system dependence graph (}SDG\deleted{)} is then built from the assembly of the different
PDGs (each representing a method of the program), linking each method call to
its corresponding definition. Because each graph is built from the previous one,
new constructs can be added with to the CFG, without the need to alter the
algorithm that converts CFG to PDG and then to SDG. The only modification
possible is the redefinition of a dependency or the addition of new kinds of
algorithm that converts \added{each} CFG to PDG and then to \added{the final} SDG. The only modification
possible is the redefinition of a\added{n already defined} dependency or the addition of new kinds of
dependence.
The language covered by the initial proposal was a simple one, featuring
The language covered by the initial proposal \deleted{was}\added{is}\sergio{todo en presente o todo en pasado} a simple one, featuring
procedures with modifiable parameters and basic instructions, including calls to
procedures, variable assignments, arithmetic and logic operators and conditional
instructions (branches and loops): the basic features of an imperative
programming language. The control flow graph was as simple as the programs
instructions (branches and loops)\deleted{:}\added{, i.e.,}\sergio{no se si i.e., queda bien aqui :/} the basic features of an imperative
programming language. The \deleted{control flow graph was}\added{CFGs are} as simple as the programs
themselves, with each graph representing one procedure. The instructions of the
program are represented as vertices of the graph and are split into two
categories: statements, which have no effect on the control flow (assignments,
categories: statements, which have no effect on the control flow (\added{e.g., }assignments,
procedure calls) and predicates, whose execution may lead to one of multiple
---though traditionally two--- paths (conditional instructions). Statements are
connected sequentially to the next instruction. Predicates have two outgoing
edges, each connected to the first statement that should be executed, according
---though traditionally two--- \added{different paths} (\added{e.g., }conditional instructions). \deleted{S}\added{While s}tatements are
connected sequentially to the next instruction\deleted{. P}\added{, on the contrary, p}redicates have two outgoing
edges, each\added{ of them} connected to the first statement that should be executed\deleted{,} according
to the result of evaluating the conditional expression in the guard of the
predicate.
\begin{definition}[Control Flow Graph \carlos{add original citation}]
\label{def:cfg}
A \emph{control flow graph} $G$ of a program $P$ is a directed graph, represented as a tuple $\langle N, E \rangle$, where $N$ is a set of nodes, composed of a method's statements plus two special nodes, ``Start'' and ``End''; and $E$ is a set of edges of the form $e = \left(n_1, n_2\right) | n_1, n_2 \in N$. Most algorithms to generate the SDG mandate the ``Start'' node to be the only source and ``End'' to be the only sink in the graph. \carlos{Is it necessary to define source and sink in the context of a graph?}.
Edges are created according to the possible execution paths that exist; each statement is connected to any statement that may immediately follow it. Formally, an edge $e = (n_1, n_2)$ exists if and only if there exists an execution of the program where $n_2$ is executed immediately after $n_1$. In general, expressions are not evaluated; so an \texttt{if} instruction has two outgoing edges even if the condition is always true or false, e.g. \texttt{1 == 0}.
\label{def:cfg}
A \emph{control flow graph} $G$ of a program\sergio{program o method?} $P$ is a directed graph, represented as a tuple $\langle N, E \rangle$, where $N$ is a set of nodes, composed of a method's\sergio{method o program?} statements plus two special nodes, ``Start'' and ``End''; and $E$ is a set of edges of the form $e = \left(n_1, n_2\right) |~n_1, n_2 \in N$. Most algorithms\added{, in order} to generate the SDG\added{,} mandate the ``Start'' node to be the only source and \added{the} ``End'' \added{node} to be the only sink in the graph. \carlos{Is it necessary to define source and sink in the context of a graph?}.
Edges are created according to the possible execution paths that exist; each statement is connected to any statement that may immediately follow it. Formally, an edge $e = (n_1, n_2)$ exists if and only if there exists an execution of the program where $n_2$ is executed immediately after $n_1$. In general, expressions are not evaluated \added{when generating the CFG}; so a\deleted{n \texttt{if}}\added{ conditional} instruction \added{will have}\deleted{has} two outgoing edges \added{regardless the condition value being} \deleted{even if the condition is} always true or false, e.g.\added{,} \texttt{1 == 0}.
\end{definition}
To build the PDG and then the SDG, there are two dependencies based directly on the CFG's structure: data and control dependence.
To build the PDG and then the SDG, there are two dependencies based directly on the CFG's structure: data and control dependence. \sergio{But first, we need to define the concept of postdominance in a graph necessary in the definition of control dependency:}\sergio{no me convence mucho pero plantearse si poner algo aqui o dejarlo como esta.}
\begin{definition}[Postdominance \carlos{add original citation?}]
\label{def:postdominance}
Vertex $b$ \textit{postdominates} vertex $a$ if and only if $b$ is on every path from $a$ to the ``End'' vertex.
\end{definition}
\begin{definition}[Control dependency \carlos{add original citation}]
\begin{definition}[Control dependency\sergio{dependency o dependence?} \carlos{add original citation}]
\label{def:ctrl-dep}
Vertex $b$ is \textit{control dependent} on vertex $a$ ($a \ctrldep b$) if and only if $b$ postdominates one but not all of $a$'s successors. It follows that a vertex with only one successor cannot be the source of control dependence.
\end{definition}
\begin{definition}[Data dependency \carlos{add original citation}]
\begin{definition}[Data dependency\sergio{dependency o dependence?} \carlos{add original citation}]
\label{def:data-dep}
Vertex $b$ is \textit{data dependent} on vertex $a$ ($a \datadep b$) if and only if $a$ may define a variable $x$, $b$ may use $x$ and there exists a \carlos{could it be ``an''??} $x$-definition free path from $a$ to $b$.
Data dependency was originally defined as flow dependency, and split into loop and non--loop related dependencies, but that distinction is no longer useful to compute program slices.
Data dependency was originally defined as flow dependency, and split into loop and non--loop related dependencies, but that distinction is no longer useful to compute program slices \sergio{Quien dijo que ya no es util? Vale la pena citarlo?}.
It should be noted that variable definitions and uses can be computed for each statement independently, analyzing the procedures called by it if necessary. The variables used and defined by a procedure call are those used and defined by its body.
\end{definition}
With the data and control dependencies, the PDG may be built by replacing the
edges from the CFG by data and control dependence edges. The first tends to be
represented as a thin solid line, and the latter as a thick solid line. In the
examples, data dependencies will be thin solid red lines.
examples, \added{data and control dependencies are represented by thin solid red and black lines respectively}\deleted{data dependencies will be thin solid red lines}.
\begin{definition}[Program dependence graph]
\label{def:pdg}
The \textsl{program dependence graph} (PDG) is a directed graph (and originally a tree) represented by three elements: a set of nodes $N$, a set of control edges $E_c$ and a set of data edges $E_d$.
The \textsl{program dependence graph} (PDG) is a directed graph (and originally a tree\sergio{???}) represented by three elements: a set of nodes $N$, a set of control edges $E_c$ and a set of data edges $E_d$. \sergio{$PDG = \langle N, E_c, E_d \rangle$}
The set of nodes corresponds to the set of nodes of the CFG, excluding the ``End'' node.
Both sets of edges are built as follows. There is a control edge between two nodes $n_1$ and $n_2$ if and only if $n_1 \ctrldep n_2$, and a data edge between $n_1$ and $n_2$ if and only if $n_1 \datadep n_2$. Additionally, if a node $n$ does not have any incoming control edges, it has a ``default'' control edge $e = (\textnormal{Start},n)$; so that ``Start'' is the only source node of the graph.
Both sets of edges are built as follows. There is a control edge between two nodes $n_1$ and $n_2$ if and only if $n_1 \ctrldep n_2$\sergio{acordarse de lo de evitar la generacion de arcos para prevenir la transitividad. Decidir si definimos Control arc como ua definicion aparte.}, and a data edge between $n_1$ and $n_2$ if and only if $n_1 \datadep n_2$. Additionally, if a node $n$ does not have any incoming control edges, it has a ``default'' control edge $e = (\textnormal{Start},n)$; so that ``Start'' is the only source node of the graph.
Note: the most common graphical representation is a tree--like structure based on the control edges, and nodes sorted left to right according to their position on the original program. Data edges do not affect the structure, so that the graph is easily readable.
\end{definition}
\sergio{creo que en la definicion de CFG y PDG tiene que quedar mas claro que hay varios por programa (uno por funcion), para que esta ultima frase cobre mas sentido.}
Finally, the SDG is built from the combination of all the PDGs that compose the
program.
\begin{definition}[System dependence graph]
\label{def:sdg}
The \textsl{system dependence graph} (SDG) is a directed graph that represents the control and data dependencies of a whole program. It has three kinds of edges: control, data and function call. The graph is built combining multiple PDGs, with the ``Start'' nodes labeled after the function they begin. There exists one function call edge between each node containing one or more calls and each of the ``Start'' node of the method called. In a programming language where the function call is ambiguous (e.g. with pointers or polymorphism), there exists one edge leading to every possible function called.
The \textsl{system dependence graph} (SDG) is a directed graph that represents the control and data dependencies of a whole program. It has three kinds of edges: control, data and function call. The graph is built combining multiple PDGs, with the ``Start'' nodes labeled after the function they begin. There exists one function call edge between each node containing one or more calls and each of the ``Start'' node of the method called. In a programming language where the function call is ambiguous (e.g. with pointers or polymorphism), there exists one edge leading to every possible function called.\sergio{Esta definicion ha quedado muy informal no? Donde han quedado los $E_c,~E_d,~E_{fc},$ Nodes del PDG...?}
\end{definition}
\begin{example}[Creation of a SDG from a simple program]
@ -109,8 +113,9 @@ proc f(x, y) {
\begin{minipage}{0.79\linewidth}
\includegraphics[width=0.6\linewidth]{img/cfgsimple}
\end{minipage}
\sergio{Centrar la figura, sobra mucho espacio a la derecha}
Then, control and data dependencies are computed, arranging the nodes in the PDG. Finally, the two graphs are connected with summary edges to create the SDG:
Then, control and data dependencies are computed, arranging the nodes in the PDG \sergio{FigureRef missing}. Finally, the two graphs are connected with summary edges\sergio{with que? esto no se sabe aun ni lo que es ni para que sirve. En todo caso function call edges, y si ese es el negro que va de f(a,b) a Start f() para diferenciarlo deberia ser de otro color} to create the SDG:
\begin{center}
\includegraphics[width=0.8\linewidth]{img/sdgsimple}
@ -119,39 +124,43 @@ proc f(x, y) {
\subsubsection{Function calls and data dependencies}
\carlos{Vocabulary: when is appropriate the use of method, function and procedure????}
\carlos{Vocabulary: when is appropriate the use of method, function and procedure????}\sergio{buena pregunta, yo creo que es jerarquico, method incluye function y procedure y los dos ultimos son disjuntos entre si no?}
In the original definition of the SDG, there was special handling of data dependencies when calling functions, as it was considered that parameters were passed by value, and global variables did not exist. \carlos{Name and cite paper that introduced it} solves this issue by splitting function calls and function into multiple nodes. This proposal solved everything related to parameter passing: by value, by reference, complex variables such as structs or objects and return values.
In the original definition of the SDG, there was special handling of data dependencies when calling functions, as it was considered that parameters were passed by value, and global variables did not exist. \carlos{Name and cite paper that introduced it} solves this issue by splitting function calls and function \added{definitions} into multiple nodes. This proposal solved everything\sergio{lo resuelve todo?} related to parameter passing: by value, by reference, complex variables such as structs or objects and return values.
To such end, the following modifications are made to the different graphs:
\begin{description}
\item[CFG.] In each CFG, global variables read or modified and parameters are added to the label of the ``Start'' node in assignments of the form $par = par_{in}$ for each parameter and $x = x_{in}$ for global variables. Similarly, global variables and parameters modified are added to the label of the ``End'' node as $x_{out} = x$. The parameters are only passed back if the value set by the called method can be read by the callee. Finally, in method calls the same values must be packed and unpacked: each statement containing a function called is relabeled to contain input (of the form $par_{in} = \textnormal{exp}$ for parameters or $x_{in} = x$ for global variables) and output (always of the form $x = x_{out}$).
\item[PDG.] Each node modified in the CFG is split into multiple nodes: the original label is the main node and each assignment is represented as a new node, which is control--dependent on the main one. Visually, input is placed on the left and output on the right; with parameters sorted accordingly.
\item[SDG.] Three kinds of edges are introduced: parameter input (param--in), parameter output (param--out) and summary edges. Parameter input edges are placed between each method call's input node and the corresponding method definition input node. Parameter output edges are placed between each method definition's output node and the corresponding method call output node. Summary edges are placed between the input and output nodes of a method call, according to the dependencies inside the method definition: if there is a path from an input node to an output node, that shows a dependence and a summary method is placed in all method calls between those two nodes.
\item[CFG.] In each CFG, global variables read or modified and parameters are added to the label of the ``Start'' node in assignments of the form $par = par_{in}$ for each parameter and $x = x_{in}$ for global variables. Similarly, global variables and parameters modified are added to the label of the ``End'' node as \added{assignments of the form} $x_{out} = x$. \added{From now on, we will refer to the described assignments as input and output information respectively.} \sergio{\{}The parameters are only passed back if the value set by the called method can be read by the callee\sergio{\} no entiendo a que se refiere esta frase}. Finally, in method calls the same values must be packed and unpacked: each statement containing a function called is relabeled to contain \added{its related} input (of the form $par_{in} = \textnormal{exp}$ for parameters or $x_{in} = x$ for global variables) and output (always of the form $x = x_{out}$) \added{information}. \sergio{no hay parameter\_out? asumo entonces que no hay paso por valor?}
\item[PDG.] Each node \added{augmented with input or output information}\deleted{modified} in the CFG is \added{now} split into multiple nodes: the original \deleted{label}\added{node} \added{(Start, End or function call)} is the main node and each assignment \added{contained in the input and output information} is represented as a new node, which is control--dependent on the main one. Visually, \added{new nodes coming from the input information}\deleted{input is} \added{are} placed on the left and \added{the ones coming from the output information}\deleted{output} on the right; with parameters sorted accordingly.
\item[SDG.] Three kinds of edges are introduced: parameter input (param--in), parameter output (param--out) and summary edges. Parameter input edges are placed between each method call's input node and the corresponding method definition input node. Parameter output edges are placed between each method definition's output node and the corresponding method call output node. Summary edges are placed between the input and output nodes of a method call, according to the dependencies inside the method definition: if there is a path from an input node to an output node, that shows a dependence and a summary method is placed in all method calls between those two nodes.\sergio{Tengo la sensacion de que la explicacion de que es un summary llega algo tarde y tal vez deberia estar en alguna definicion previa. Que opine Josep que piensa}
Note: parameter input and output edges are separated because the traversal algorithm traverses them only sometimes (the output edges are excluded in the first pass and the input edges in the second).
Note: \deleted{parameter input and output}\added{param-in and param-out} edges are separated because the traversal algorithm traverses them only sometimes (the output edges are excluded in the first pass and the input edges in the second).\sergio{delicado mencionar lo de las pasadas sin haber hablado antes de nada del algoritmo de slicing, a los que no sepan de slicing se les quedara el ojete frio aqui. Plantearse quitar esta nota.}
\end{description}
\begin{example}[Variable packing and unpacking]
Let it be a function $f(x, y)$ with two integer parameters, and a call $f(a + b, c)$, with parameters passed by reference if possible. The label of the method call node in the CFG would be ``\texttt{x\_in = a + b, y\_in = c, f(a + b, c), c = y\_out}''; method $f$ would have \texttt{x = x\_in, y = y\_in} in the ``Start'' node and \texttt{y\_out = y} in the ``End'' node. The relevant section of the SDG would be:
Let it be a function $f(x, y)$ with two integer parameters \added{which modifies the argument passed in its second parameter}, and a call $f(a + b, c)$, with parameters passed by reference if possible. The label of the method call node in the CFG would be ``\texttt{x\_in = a + b, y\_in = c, f(a + b, c), c = y\_out}''; method $f$ would have \texttt{x = x\_in, y = y\_in} in the ``Start'' node and \texttt{y\_out = y} in the ``End'' node. The relevant section of the SDG would be:
\begin{center}
\includegraphics[width=0.5\linewidth]{img/parameter-passing}
\end{center}
\end{example}
\sergio{Esta figura molaria mas evolutiva si diera tiempo, asi seria casi autoexplicativa: CFG $\rightarrow$ PDG $\rightarrow$ SDG. La actual seria el SDG, las otras tendrian poco mas que un nodo y una etiqueta.}
\section{Unconditional control flow}
Even though the initial definition of the SDG was useful to compute slices, the
Even though the initial definition of the SDG was \deleted{useful}\added{adequate} to compute slices, the
language covered was not enough for the typical language of the 1980s, which
included (in one form or another) unconditional control flow. Therefore, one of
the first additions contributed to the algorithm to build system dependence
graphs was the inclusion of unconditional jumps, such as ``break'',
the first \added{proposed upgrades}\deleted{additions contributed} to the algorithm to build \deleted{system dependence
graphs}\added{SDGs} was the inclusion of unconditional jumps, such as ``break'',
``continue'', ``goto'' and ``return'' statements (or any other equivalent). A
naive representation would be to treat them the same as any other statement, but
with the outgoing edge landing in the corresponding instruction (outside the
loop, at the loop condition, at the method's end, etc.).
An alternative approach is to represent the instruction as an edge, not a vertex, connecting the previous statement with the next to be executed. Both of these approaches fail to generate a control dependence from the unconditional jump, as the definition of control dependence (see definition~\ref{def:ctrl-dep}) requires a vertex to have more than one successor for it to be possible to be a source of control dependence.
An alternative approach is to represent the instruction as an edge, not a vertex, connecting the previous statement with the next to be executed. \sergio{Juntaria las 2 propuestas anteriores (naive y alternative) en 1 frase, no las separaria, porque despues de leer la primera ya me he mosqueado porque no deciamos ni quien la hacia ni por que no era util.}
Both of these approaches fail to generate a control dependence from the unconditional jump, as the definition of control dependence (see definition~\ref{def:ctrl-dep}) requires a vertex to have more than one successor for it to be possible to be a source of control dependence.
From here, there stem two approaches: the first would be to
redefine control dependency, in order to reflect the real effect of these
instructions ---as some authors~\cite{DanBHHKL11} have tried to do--- and the
@ -163,7 +172,7 @@ The most popular approach was proposed by Ball and Horwitz~\cite{BalH93}, classi
\begin{description}
\item[Statement.] Any instruction that is not a conditional or unconditional jump. It has one outgoing edge in the CFG, to the next instruction that follows it in the program.
\item[Predicate.] Any conditional jump instruction, such as \texttt{while}, \texttt{until}, \texttt{do-while}, \texttt{if}, etc. It has two outgoing edges, labeled \textit{true} and \textit{false}; leading to the corresponding instructions.
\item[Pseudo--predicates.] Unconditional jumps (e.g. \texttt{break}, \texttt{goto}, \texttt{continue}, \texttt{return}); are like predicates, with the difference that the outgoing edge labeled \textit{false} is marked as non--executable, and there is no possible execution where such edge would be possible, according to the definition of the CFG (as seen in definition~\ref{def:cfg}). Originally the edges had a specific reasoning backing them up: the \textit{true} edge leads to the jump's destination and the \textit{false} one, to the instruction that would be executed if the unconditional jump was removed, or converted into a \texttt{no op} (a blank operation that performs no change to the program's state). This specific behavior is used with unconditional jumps, but no longer applies to pseudo--predicates, as more instructions have used this category as means of ``artificially'' \carlos{bad word choice} generating control dependencies.
\item[Pseudo--predicates.] Unconditional jumps (e.g. \texttt{break}, \texttt{goto}, \texttt{continue}, \texttt{return}); are like predicates, with the difference that the outgoing edge labeled \textit{false} is marked as non--executable, and there is no possible execution where such edge would be possible, according to the definition of the CFG (as seen in \sergio{definition o Definition?}definition~\ref{def:cfg}). Originally the edges had a specific reasoning backing them up: the \textit{true} edge leads to the jump's destination and the \textit{false} one, to the instruction that would be executed if the unconditional jump was removed, or converted into a \texttt{no op}\sergio{no op o no-op?} (a blank operation that performs no change to the program's state). \sergio{\{}This specific behavior is used with unconditional jumps, but no longer applies to pseudo--predicates, as more instructions have used this category as means of ``artificially'' \carlos{bad word choice} generating control dependencies.\sergio{\}No entrar en este jardin, cuando se definio esto no se contemplaba la creacion de nodos artificiales. -Quita el originally, ahora es originally.}
\end{description}
As a consequence of this classification, every instruction after an unconditional jump $j$ is control--dependent (either directly or indirectly) on $j$ and the structure containing it (a conditional statement or a loop), as can be seen in the following example.
@ -192,7 +201,7 @@ static void f() {
\begin{example}[Control dependencies generated by unconditional instructions]
\label{exa:unconditional}
Figure~\ref{fig:break-graphs} showcases a small program with a \texttt{break} statement, its CFG and PDG with a slice in gray. The slicing criterion (line 5, variable $a$) is control dependent on both the unconditional jump and its surrounding conditional instruction (both on line 4); even though it is not necessary to include it (in the context of weak slicing).
Figure~\ref{fig:break-graphs} showcases a small program with a \texttt{break} statement, its CFG and PDG with a slice in grey. The slicing criterion (line 5, variable $a$) is control dependent on both the unconditional jump and its surrounding conditional instruction (both on line 4); even though it is not necessary to include it\sergio{a quien se refiere este it?} (in the context of weak slicing).
Note: the ``Start'' node $S$ is also categorized as a pseudo--statement, with the \textit{false} edge connected to the ``End'' node, therefore generating a dependence from $S$ to all the nodes inside the method. This removes the need to handle $S$ with a special case when converting a CFG to a PDG, but lowers the explainability of non--executable edges as leading to the ``instruction that would be executed if the node was absent or a no--op''.
\end{example}
@ -201,7 +210,7 @@ The original paper~\cite{BalH93} does prove its completeness, but disproves its
\begin{example}[Nested unconditional jumps]
\label{exa:nested-unconditional}
In the case of nested unconditional jumps where both jump to the same destination, only one of them (the out--most one) is needed. Figure~\ref{fig:nested-unconditional} showcases the problem, with the minimal slice \carlos{have not defined this yet} in gray, and the algorithmically computed slice in light blue. Specifically, lines 3 and 5 are included unnecessarily.
In the case of nested unconditional jumps where both jump to the same destination, only one of them (the out--most one) is needed. Figure~\ref{fig:nested-unconditional} showcases the problem, with the minimal slice \carlos{have not defined this yet} in grey, and the algorithmically computed slice in light blue. Specifically, lines 3 and 5 are included unnecessarily.
\begin{figure}
\begin{minipage}{0.15\linewidth}