Automated Generation of Accurate Privacy Captions From Android Source Code Using Large Language Models

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv paper.

Privacy captions are short sentences that succinctly describe what personal information is used, how it is used, and why, within an app. These captions can be utilized in various notice formats, such as privacy policies, app rationales, and app store descriptions. However, inaccurate captions may mislead users and expose developers to regulatory fines. Existing approaches to generating privacy notices or just privacy captions include using questionnaires, templates, static analysis, or machine learning. However, these approaches either rely heavily on developers’ inputs and thus strain their efforts, use limited source code context, leading to the incomplete capture of app privacy behaviors, or depend on potentially inaccurate privacy policies as a source for creating notices. In this work, we address these limitations by developing Privacy Caption Generator (PCapGen), an approach that: i) automatically identifies and extracts large and precise source code context that implements privacy behaviors in an app, ii) uses a Large Language Model (LLM) to describe coarse- and fine-grained privacy behaviors, and iii) generates accurate, concise, and complete privacy captions to describe the privacy behaviors of the app. Our evaluation shows that PCapGen generates more concise, complete, and accurate privacy captions than the baseline approach. Furthermore, privacy experts choose PCapGen captions at least 71% of the time, whereas LLMs-as-judge prefer PCapGen captions at least 76% of the time, indicating strong performance of our approach.


💡 Research Summary

The paper introduces PCapGen (Privacy Caption Generator), a novel framework that automatically produces accurate, concise, and complete privacy captions for Android applications by leveraging large language models (LLMs) together with static and taint analysis. Privacy captions are short natural‑language sentences that describe what personal data an app collects, how it is used, and why it is needed. These captions are the building blocks of many privacy notices, including privacy policies, in‑app rationales, and app‑store descriptions. Existing approaches either require substantial developer effort (questionnaires, manual annotations), rely on limited code context (few methods in a call graph), or depend on potentially inaccurate privacy policies as a source of truth, leading to incomplete or misleading captions.

PCapGen addresses these shortcomings through three tightly coupled components.

  1. Identifier – A heuristics‑enhanced taint analysis built on FlowDroid. While FlowDroid traditionally matches data flows between a fixed list of source and sink APIs, PCapGen extends this list with heuristics that discover new source methods (e.g., newly added Android APIs) and unknown sink methods, thereby capturing a broader set of personal‑information flows. The output is a set of taint paths that trace the movement of personal data through classes, methods, and statements.

  2. Extractor – Static analysis extracts the full source code of every class and method participating in the identified taint paths. An LLM is then employed to pinpoint the exact statements that handle personal data, providing a fine‑grained view of the code. This step yields a “large source‑code context” that spans multiple files, methods, and intra‑method statements, far beyond the three‑method limit of prior work.

  3. Generator – Using In‑Context Learning (ICL), the extracted code context is fed to an LLM which synthesizes a natural‑language description of the privacy behavior. The authors experiment with three state‑of‑the‑art LLMs: GPT‑4 (gpt‑4‑0613), Claude Opus 4 (claude‑opus‑4‑20250514), and DeepSeek (deepseek‑r1‑distill‑llama‑70b). Prompt engineering includes several example captions to guide the model toward the desired style (concise, accurate, and complete).
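
The three stages above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the keyword lists, the `TaintPath` type, and the helper names `extend_api_list` and `build_prompt` are hypothetical, the name-matching heuristic is an assumed stand-in for the paper's heuristics (which this summary does not spell out), and the FlowDroid run and the LLM call are omitted entirely. The sketch only shows how a fixed source list might be widened and how extracted statements might be placed into a few-shot (ICL) prompt.

```python
from dataclasses import dataclass

# Hypothetical keyword hints; the paper's real heuristics are not given here.
SOURCE_HINTS = ("location", "contact", "deviceid", "camera", "microphone")
SINK_HINTS = ("send", "write", "post", "upload", "log")


@dataclass(frozen=True)
class TaintPath:
    """One flow of personal data, as a taint analysis might report it."""
    source: str        # method that reads personal data
    sink: str          # method where the data leaves the app
    statements: tuple  # code statements along the path, in order


def looks_like(method_name: str, hints: tuple) -> bool:
    """Assumed heuristic: a method matches if its name contains a hint."""
    name = method_name.lower()
    return any(hint in name for hint in hints)


def extend_api_list(known: set, candidates: list, hints: tuple) -> set:
    """Return the known API list plus every candidate matching the hints."""
    return known | {m for m in candidates if looks_like(m, hints)}


def build_prompt(examples: list, path: TaintPath) -> str:
    """Assemble a few-shot prompt: instruction, worked examples, then the query."""
    parts = ["Describe what personal information the code uses, how, and why."]
    for code, caption in examples:
        parts.append(f"Code:\n{code}\nCaption: {caption}")
    query_code = "\n".join(path.statements)
    parts.append(f"Code:\n{query_code}\nCaption:")
    return "\n\n".join(parts)
```

For example, `extend_api_list({"getLastKnownLocation"}, ["getCurrentLocation", "toString"], SOURCE_HINTS)` adds `getCurrentLocation` and ignores `toString`. The string returned by `build_prompt` ends with an open `Caption:` slot that a model would be asked to complete.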

The evaluation is organized around four research questions (RQs).

  • RQ 1 compares the three LLM configurations on three quality dimensions—accuracy, conciseness, and completeness. Claude‑based PCapGen (PCapGen_Claude) consistently outperforms the GPT‑4 and DeepSeek variants across all metrics.
  • RQ 2 measures semantic similarity between PCapGen captions and developer‑crafted baseline captions, showing a high degree of overlap, which indicates that the system does not drift into unrelated phrasing.
  • RQ 3 employs “LLM‑as‑judge” models to re‑evaluate the captions. All three judges rate PCapGen_Claude captions higher than the baselines, and they prefer PCapGen captions at least 76 % of the time.
  • RQ 4 involves human privacy experts who assess the same set of captions. Although the difference does not reach statistical significance, experts still favor PCapGen_Claude captions at least 71 % of the time, suggesting that the automatically generated text is perceived as at least as trustworthy as the baseline captions.
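
To make the preference figures in RQ 3 and RQ 4 concrete, the following Python sketch computes a preference rate from pairwise choices and an exact two-sided sign test against a 50/50 split. The summary does not say which statistical test the authors used, so the sign test is an assumption, and the function names and toy data are hypothetical.

```python
from math import comb


def preference_rate(choices: list, system: str = "pcapgen") -> float:
    """Fraction of pairwise comparisons in which `system` was chosen."""
    return sum(choice == system for choice in choices) / len(choices)


def sign_test_p_value(wins: int, n: int) -> float:
    """Exact two-sided binomial test of `wins` out of `n` against p = 0.5."""
    k = max(wins, n - wins)
    # Probability of a result at least as lopsided as k in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy data, not taken from the paper: 5 of 7 comparisons favor PCapGen.
toy_choices = ["pcapgen"] * 5 + ["baseline"] * 2
rate = preference_rate(toy_choices)
p_value = sign_test_p_value(toy_choices.count("pcapgen"), len(toy_choices))
```

On this toy input the preference rate is 5/7, roughly 71 %, while the p-value is about 0.45. This shows how a clear majority preference can still fall short of statistical significance when the number of comparisons is small.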

Key contributions of the work are:

  1. A hybrid analysis‑LLM pipeline that combines traditional static/taint analysis with LLMs for both fine‑grained statement identification and high‑level natural‑language synthesis.
  2. An extended heuristics‑based taint analysis that scales beyond fixed source‑sink lists, enabling the capture of previously unseen data‑flow patterns in modern Android apps.
  3. A comprehensive evaluation methodology that includes both automated LLM judges and domain experts, providing a robust benchmark for future privacy‑notice generation research.
  4. A public release of a curated dataset containing Android code snippets annotated with privacy captions, along with the scripts and a custom annotation tool used in the user study, facilitating reproducibility and further exploration.

Overall, PCapGen demonstrates that accurate privacy captions can be generated automatically without any developer input, dramatically reducing the burden on small teams or solo developers while improving compliance with privacy regulations and marketplace policies. Future directions include extending the approach to iOS or cross‑platform codebases, refining the heuristics for even finer taint detection, and integrating the generated captions directly into app‑store pipelines for real‑time compliance checking.

