Data-Flow Guided Slicing

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

We propose a flow-insensitive analysis that prunes portions of code that are irrelevant to a specified set of data-flow paths. Our approach is fast and scalable, and it can generate a certificate for auditing the computed result. We have implemented our technique in a tool called DSlicer and applied it to a set of 10,600 real-world Android applications. The results are conclusive: the program code can be reduced by 36% on average with respect to a specified set of data-leak paths.

💡 Research Summary

The paper introduces a novel, flow‑insensitive program slicing technique that targets only the preservation of specified data‑flow paths, discarding all code that cannot affect those paths. The authors call this approach “Data‑Flow Guided Slicing” and implement it in a tool named DSlicer.

Motivation
Static analyses that aim to detect data leaks or other security‑relevant properties often have to process the entire code base, even though only a small fraction of the program actually participates in the data‑flow of interest. This unnecessary processing leads to high computational cost and limits scalability, especially for large Android applications. The authors therefore propose a lightweight pre‑processing step that removes irrelevant code while guaranteeing that no specified source‑to‑sink path is lost.

Program Representation – Assignment Graph
The core of the technique is a compact, flow‑insensitive representation called an assignment graph. The original program (Java/Android bytecode) is first translated into a Jimple‑like three‑address intermediate representation. Each instruction is then mapped to nodes and edges according to a fixed set of translation rules (Table 1 in the paper). Key aspects include:

Variables and constants become nodes, and a simple assignment becomes an edge from the assigned value to the target variable.
Unary and binary operations are modeled as edges from each operand to the result variable, ignoring the actual computation.
Array accesses are conservatively treated as whole‑array assignments because precise index analysis would be costly.
Object fields are represented by a single node per field per class (e.g., C.f). All instances of class C share this node, providing a sound over‑approximation of aliasing.
Method calls are expressed by edges from the caller’s actual arguments to the callee’s formal parameters, and from the callee’s special return node r back to the caller’s receiving variable.
Returns are modeled as an edge from the returned variable to the callee’s r.
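The translation rules listed above can be sketched in a few lines of Python. This is a minimal illustration, not DSlicer's actual code: the class name `AssignmentGraph`, its method names, and the `method:variable` naming scheme for nodes are assumptions made here for readability. Only the `C.f` field naming and the return node `r` come from the summary.

```python
class AssignmentGraph:
    """Flow-insensitive graph: an edge a -> b means data may flow from a to b."""

    def __init__(self):
        self.succ = {}  # node name -> set of successor node names

    def add_edge(self, src, dst):
        self.succ.setdefault(src, set()).add(dst)
        self.succ.setdefault(dst, set())  # register dst as a node too

    def assign(self, dst, src):
        # dst = src
        self.add_edge(src, dst)

    def operation(self, dst, *operands):
        # dst = op(operands...): one edge per operand, the operator is ignored
        for op in operands:
            self.add_edge(op, dst)

    def array_store(self, array, value):
        # array[i] = value: treated as an assignment to the whole array
        self.add_edge(value, array)

    def array_load(self, dst, array):
        # dst = array[i]: the index is ignored
        self.add_edge(array, dst)

    def field_store(self, cls, field, value):
        # obj.field = value: one shared node "C.f" for all instances of C
        self.add_edge(value, f"{cls}.{field}")

    def field_load(self, dst, cls, field):
        # dst = obj.field
        self.add_edge(f"{cls}.{field}", dst)

    def call(self, callee, actuals, formals, result=None):
        # actual arguments flow into formal parameters; the callee's
        # return node "r" flows into the receiving variable, if any
        for actual, formal in zip(actuals, formals):
            self.add_edge(actual, f"{callee}:{formal}")
        if result is not None:
            self.add_edge(f"{callee}:r", result)

    def ret(self, callee, value):
        # return value: flows into the callee's return node "r"
        self.add_edge(value, f"{callee}:r")


# Hypothetical method body: x = getId(); y = x + k; this.f = y; z = other.f; send(z)
g = AssignmentGraph()
g.call("getId", [], [], result="main:x")
g.operation("main:y", "main:x", "main:k")
g.field_store("C", "f", "main:y")
g.field_load("main:z", "C", "f")
g.call("send", ["main:z"], ["p"])
```

Because statement order and branch conditions are never recorded, the same graph results no matter how the five statements are ordered, which is what makes the representation flow-insensitive.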

Dynamic dispatch is handled by constructing an over‑approximate call graph using class‑hierarchy based type estimation, ensuring soundness in the presence of polymorphism.
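As a rough illustration of how a class hierarchy can over-approximate dispatch targets, the sketch below resolves a virtual call to every implementation that some subtype of the receiver's declared type could run. It follows the classic class-hierarchy-analysis idea; the summary does not say that DSlicer does exactly this, and the function name and the `parent` and `declares` inputs are hypothetical.

```python
def resolve_targets(parent, declares, static_type, method):
    """Return every 'Class.method' a receiver of static_type might dispatch to.

    parent:   class name -> direct superclass name (None for a root class)
    declares: class name -> set of method names declared in that class
    """
    targets = set()
    for cls in parent:
        # Keep only classes that are static_type itself or one of its subclasses.
        c = cls
        while c is not None and c != static_type:
            c = parent.get(c)
        if c is None:
            continue
        # Walk up from cls to the nearest class that declares the method.
        d = cls
        while d is not None and method not in declares.get(d, ()):
            d = parent.get(d)
        if d is not None:
            targets.add(f"{d}.{method}")
    return targets


# Hypothetical hierarchy: A and B extend Base, C extends A; only Base and B declare send.
parent = {"Base": None, "A": "Base", "B": "Base", "C": "A"}
declares = {"Base": {"send"}, "B": {"send"}}

base_targets = resolve_targets(parent, declares, "Base", "send")
a_targets = resolve_targets(parent, declares, "A", "send")
```

In an assignment graph, a call site would then receive argument and return edges for every target in the returned set, so no feasible callee is missed, at the cost of some spurious edges.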

Slice Computation – ComputeSlice Algorithm
Given an assignment graph G, a set of source identifiers SR (e.g., APIs that read private data) and a set of sink identifiers SK (e.g., APIs that transmit data), the algorithm proceeds in three phases:

Forward marking – All source nodes are marked with “+”. The marking is propagated forward along outgoing edges until a fixed point is reached. After this phase, every node reachable from any source carries a “+”.
Backward marking – All sink nodes are marked with “‑”. The marking is propagated backward, but only through nodes that already have a “+”. This ensures that only nodes that are both reachable from a source and can reach a sink are considered.
Result extraction – Nodes that have both “+” and “‑” belong to at least one source‑to‑sink data‑flow path. Any method that references at least one of these nodes (either via a local variable, a parameter, or a field access) is retained in the slice; all other methods are deemed irrelevant and can be safely removed.

The algorithm runs in linear time with respect to the number of graph edges, and it is deliberately conservative: a method is kept if any of its symbols participates in a relevant flow.
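The three phases can be sketched as two worklist traversals followed by a set intersection. This is a minimal reading of the description above, not the authors' implementation: the names `propagate`, `compute_slice`, and `referenced_by`, and the dictionary encoding of the graph, are assumptions.

```python
from collections import deque


def propagate(start, neighbours, allowed=None):
    """Mark everything reachable from start, optionally staying inside allowed."""
    marked = set()
    queue = deque()
    for n in start:
        if n not in marked and (allowed is None or n in allowed):
            marked.add(n)
            queue.append(n)
    while queue:
        n = queue.popleft()
        for m in neighbours.get(n, ()):
            if m not in marked and (allowed is None or m in allowed):
                marked.add(m)
                queue.append(m)
    return marked


def compute_slice(succ, sources, sinks, referenced_by):
    """succ: node -> set of successors; referenced_by: node -> set of methods."""
    # Build the reversed graph for the backward phase.
    pred = {}
    for a, successors in succ.items():
        for b in successors:
            pred.setdefault(b, set()).add(a)
    plus = propagate(sources, succ)                # phase 1: forward "+" marks
    minus = propagate(sinks, pred, allowed=plus)   # phase 2: backward "-" marks through "+" nodes
    relevant = plus & minus                        # phase 3: nodes carrying both marks
    kept = set()
    for node in relevant:
        kept |= referenced_by.get(node, set())
    return relevant, kept


# Hypothetical graph: getId is a source and send is a sink.
succ = {
    "getId:r": {"main:x"},
    "main:x": {"main:y", "debug:d"},   # debug:d is tainted but never reaches a sink
    "main:y": {"C.f"},
    "C.f": {"send:p"},
    "helper:a": {"helper:b"},          # unrelated to any source
}
referenced_by = {
    "getId:r": {"getId"}, "main:x": {"main"}, "main:y": {"main"},
    "C.f": {"main", "upload"}, "send:p": {"send"},
    "debug:d": {"debug"}, "helper:a": {"helper"}, "helper:b": {"helper"},
}
relevant, kept = compute_slice(succ, {"getId:r"}, {"send:p"}, referenced_by)
```

In this sketch a sink that carries no "+" mark is simply never marked "‑". That does not change the result, since only nodes with both marks are extracted. Each traversal visits every edge at most once, which matches the linear bound stated above.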

Certification
To address trustworthiness, DSlicer produces a two‑part certificate:

Translation certificate – The full assignment graph together with evidence that each original instruction was translated according to the prescribed rules.
Analysis certificate – The marking (+, ‑) for every node, together with a proof that the forward and backward propagation obey the algorithm’s logical constraints.

Both certificates can be independently verified by a lightweight checker supplied with the tool, allowing third parties to audit the slicing result without re‑running the full analysis.
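To give a feel for why checking can be cheaper than re-running the analysis, the sketch below validates a claimed marking in a single pass over the edges. The certificate format and the exact constraints used by DSlicer's checker are not given in this summary, so the four closure rules here are an assumed reading of "the forward and backward propagation obey the algorithm's logical constraints", and the function name is hypothetical.

```python
def check_analysis_certificate(succ, sources, sinks, plus, minus):
    """Return a list of violated constraints; an empty list means the marking is accepted."""
    errors = []
    # Rule 1: every source carries "+".
    for s in sources:
        if s not in plus:
            errors.append(f"source {s} lacks '+'")
    for a, successors in succ.items():
        for b in successors:
            # Rule 2: "+" is closed under forward edges.
            if a in plus and b not in plus:
                errors.append(f"'+' not propagated along {a} -> {b}")
            # Rule 3: "-" is closed under backward edges through "+" nodes.
            if b in minus and a in plus and a not in minus:
                errors.append(f"'-' not propagated back along {a} -> {b}")
    # Rule 4: every sink reachable from a source carries "-".
    for k in sinks:
        if k in plus and k not in minus:
            errors.append(f"sink {k} has '+' but lacks '-'")
    return errors


# Hypothetical graph and two claimed markings.
succ = {"src": {"a"}, "a": {"b", "dead"}, "b": {"sink"}, "other": {"sink"}}
good = check_analysis_certificate(
    succ, {"src"}, {"sink"},
    plus={"src", "a", "b", "dead", "sink"},
    minus={"src", "a", "b", "sink"},
)
bad = check_analysis_certificate(
    succ, {"src"}, {"sink"},
    plus={"src", "a", "b", "dead", "sink"},
    minus={"b", "sink"},  # the marks on "src" and "a" were dropped
)
```

These rules only establish that no relevant node was left unmarked, which is the property needed for a slice that loses no source-to-sink path. They do not show that the marking is the smallest possible one.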

Implementation and Empirical Evaluation
DSlicer is written in Python and leverages Androguard for APK parsing and decompilation. The user supplies an APK and a configuration file listing source and sink APIs (the authors used the default set from FlowDroid). DSlicer outputs a list of relevant/irrelevant methods and can optionally generate a new, trimmed APK.

The authors evaluated DSlicer on 10,600 real Android applications collected from various sources, including Google Play. Key findings:

Runtime – The analysis scales linearly with the number of methods. The average runtime is about 5 seconds per app; the worst case (≈49,000 methods) took 1,682 seconds.
Code reduction – On average, 36 % of methods were identified as irrelevant and could be removed. For most apps, the reduction ranged between 15 % and 65 %, stabilizing around 30 %–40 % for larger applications.
Scalability – The tool handled apps with up to 49,146 methods without failure, demonstrating suitability for large‑scale mobile code bases.

Related Work
The paper situates its contribution among classic program slicing (Weiser), dynamic slicing, and modern data‑flow analyses for Android (FlowDroid, TaintDroid, Amandroid, etc.). It distinguishes itself by focusing solely on preserving data‑flow paths rather than full program semantics, and by providing a certifiable, flow‑insensitive slicing method that works on object‑oriented code with dynamic dispatch.

Conclusions and Future Directions
Data‑Flow Guided Slicing offers a practical, low‑overhead preprocessing step that can shrink the code base passed to downstream security analyses by roughly a third on average (36 % of methods in the evaluation), improving both performance and precision. The certification mechanism enhances trust, making the approach attractive for security‑critical pipelines. Future work could explore finer‑grained (variable‑level) slicing, integration with other static analyses (e.g., vulnerability detection), and broader adoption of the certification format as a standard for static analysis results.
