Designing and Comparing RPQ Semantics


Modern property graph database query languages such as Cypher, PGQL, GSQL, and the standard GQL draw inspiration from the formalism of regular path queries (RPQs). In order to output walks explicitly, they depart from the classical and well-studied homomorphism semantics. However, it then becomes difficult to present results to users because RPQs may match infinitely many walks. The aforementioned languages use ad-hoc criteria to select a finite subset of those matches. For instance, Cypher uses trail semantics, discarding walks with repeated edges; PGQL and GSQL use shortest walk semantics, retaining only the walks of minimal length among all matched walks; and GQL allows users to choose from several semantics. Even though there is academic research on these semantics, it focuses almost exclusively on evaluation efficiency. In an attempt to better understand, choose and design RPQ semantics, we present a framework to categorize and compare them according to other criteria. We formalize several possible properties, pertaining to the study of RPQ semantics seen as mathematical functions mapping a database and a query to a finite set of walks. We show that some properties are mutually exclusive, or cannot be met. We also give several new RPQ semantics as examples. Some of them may provide ideas for the design of new semantics for future graph database query languages.


💡 Research Summary

The paper “Designing and Comparing RPQ Semantics” addresses a practical problem that arises in modern graph database query languages such as Cypher, PGQL, GSQL, and the emerging standard GQL. Regular path queries (RPQs) are regular expressions over edge labels, and the classical homomorphism semantics returns only pairs of source and target vertices. Real‑world systems, however, need to present the actual matching walks to users, and a single query may match infinitely many of them. Consequently, languages adopt ad‑hoc restrictions to keep the output finite: Cypher uses trail semantics (no repeated edges), PGQL and GSQL use shortest‑walk semantics (only walks of minimal length), and GQL lets users choose among several options. Existing academic work has focused almost exclusively on the efficiency of evaluating these semantics, leaving a gap in understanding the design space of RPQ semantics themselves.
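
To make the "infinitely many walks" problem concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not code from the paper: the toy database `EDGES`, the helper `matching_walks`, the single-character labels, and the use of a Python regular expression in place of an RPQ are all hypothetical choices. Because the loop on `u` can be taken any number of times, the number of walks matching `a*b` grows with the length bound and never stabilizes.

```python
import re

# Hypothetical toy database: (source, label, target) triples.
# Vertex u has a self-loop labelled "a" and one "b"-edge to v.
EDGES = [("u", "a", "u"), ("u", "b", "v")]

def matching_walks(edges, pattern, max_len):
    """Return all walks with at most max_len edges whose label word matches pattern.

    A walk is represented as (start_vertex, tuple_of_edge_indices).
    Labels are assumed to be single characters so that a Python regex
    can stand in for the regular expression of an RPQ.
    """
    vertices = sorted({s for s, _, _ in edges} | {t for _, _, t in edges})
    frontier = [(v, v, ()) for v in vertices]  # (start, current vertex, edges used)
    found = []
    for _ in range(max_len + 1):
        extended = []
        for start, current, walk in frontier:
            word = "".join(edges[i][1] for i in walk)
            if re.fullmatch(pattern, word):
                found.append((start, walk))
            # Extend the walk by every edge leaving the current vertex.
            for i, (src, _label, tgt) in enumerate(edges):
                if src == current:
                    extended.append((start, tgt, walk + (i,)))
        frontier = extended
    return found

# The count keeps growing as the bound is raised.
print([len(matching_walks(EDGES, "a*b", n)) for n in (1, 2, 4, 8)])  # [1, 2, 4, 8]
```

Any semantics that must return a finite answer therefore has to discard some of these matches, which is the design question the framework studies.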

The authors propose a formal framework that treats an RPQ semantics as a function S that, given a database D and a regular expression R, returns a finite subset of the matching walks Matches(D,R). They distinguish two broad families:

  1. Filter‑based semantics: a predicate f_S : Walks → {⊤,⊥} decides independently for each walk whether it belongs to the result. Trail semantics (T_r) and acyclic semantics (A_c) are canonical examples. Such semantics act as post‑filters on the set of matches and do not require global knowledge of the whole match set.

  2. Global‑minimization semantics: a partial order ≤_S is defined on walks (e.g., by length, by edge‑distinctness) and the semantics returns all minimal elements of Matches(D,R) with respect to this order. Shortest‑walk semantics (Sh) and shortest‑trail semantics (ShT) fall into this category. These semantics need to inspect the entire match set to determine minimality.
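
The difference between the two families can be sketched in a few lines of Python. This is a hedged illustration, not the paper's formalization: the database `EDGES`, the hand-picked finite sample `MATCHES`, and all function names are hypothetical, and the sketch assumes that walks are only compared when they share the same endpoints. The filter inspects one walk at a time, while the minimization step needs the whole match set.

```python
from typing import Callable, Dict, List, Tuple

Edge = Tuple[str, str, str]   # (source, label, target)
Walk = Tuple[str, ...]        # sequence of edge ids (non-empty, for simplicity)

# Hypothetical database and a finite sample of the walks matching a+.
EDGES: Dict[str, Edge] = {
    "e1": ("u", "a", "v"),
    "e2": ("v", "a", "u"),
    "e3": ("u", "a", "w"),
}
MATCHES: List[Walk] = [("e1",), ("e1", "e2", "e1"), ("e3",), ("e1", "e2", "e3")]

def endpoints(walk: Walk) -> Tuple[str, str]:
    return EDGES[walk[0]][0], EDGES[walk[-1]][2]

def filter_semantics(matches: List[Walk], keep: Callable[[Walk], bool]) -> List[Walk]:
    """Filter-based: each walk is accepted or rejected on its own."""
    return [w for w in matches if keep(w)]

def is_trail(walk: Walk) -> bool:
    """Trail predicate: no edge is used twice."""
    return len(set(walk)) == len(walk)

def minimal_semantics(matches: List[Walk],
                      strictly_better: Callable[[Walk, Walk], bool]) -> List[Walk]:
    """Global minimization: keep a walk only if no other match beats it."""
    return [w for w in matches if not any(strictly_better(o, w) for o in matches)]

def shorter_same_endpoints(a: Walk, b: Walk) -> bool:
    """Assumed order for shortest-walk: compare lengths between the same endpoints."""
    return endpoints(a) == endpoints(b) and len(a) < len(b)

trails = filter_semantics(MATCHES, is_trail)
shortest = minimal_semantics(MATCHES, shorter_same_endpoints)
print(trails)    # [('e1',), ('e3',), ('e1', 'e2', 'e3')]
print(shortest)  # [('e1',), ('e3',)]
```

Note that `is_trail` never looks at `MATCHES`, whereas `minimal_semantics` compares every walk against all others, which mirrors the "post-filter" versus "global knowledge" distinction drawn above.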

To evaluate and compare semantics, the paper formalizes a suite of desirable properties:

  • Monotonicity: adding edges to the database never removes previously returned walks.
  • Continuity: results converge when the database is built incrementally.
  • Composability: applying two semantics sequentially yields the same result as a single combined semantics.
  • Coverage: the returned set “covers” the space of matches, e.g., for every pair of vertices that have a matching walk, at least one such walk appears in the result.
  • Symmetry, Closure under rational operators, and others.
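
As one example of how such a property can be checked on a concrete instance, the sketch below tests the coverage property as worded in the list above. It is an illustration only: `EDGES`, `endpoint_pairs`, and `has_coverage` are hypothetical names, and the match set is written out by hand. The database is a single self-loop and the query is `aa`, whose only match uses the same edge twice, so trail semantics returns nothing and misses the pair (u, u).

```python
from typing import Dict, List, Set, Tuple

Walk = Tuple[str, ...]  # sequence of edge ids

# Hypothetical database: one self-loop e on vertex u, labelled "a".
EDGES: Dict[str, Tuple[str, str, str]] = {"e": ("u", "a", "u")}

def endpoint_pairs(walks: List[Walk]) -> Set[Tuple[str, str]]:
    """Set of (source, target) pairs connected by at least one of the walks."""
    return {(EDGES[w[0]][0], EDGES[w[-1]][2]) for w in walks}

def has_coverage(matches: List[Walk], result: List[Walk]) -> bool:
    """Coverage: every endpoint pair with a match is represented in the result."""
    return endpoint_pairs(matches) <= endpoint_pairs(result)

# The only walk matching the query "aa" traverses e twice.
matches: List[Walk] = [("e", "e")]
trails = [w for w in matches if len(set(w)) == len(w)]  # trail filter

print(trails)                          # []
print(has_coverage(matches, trails))   # False
print(has_coverage(matches, matches))  # True
```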

Through a series of impossibility theorems, the authors show that many of these properties are mutually exclusive. For instance, no semantics can be both monotone and a global‑minimization semantics; a monotone semantics cannot simultaneously guarantee full coverage under the shortest‑walk criterion. Likewise, filter‑based semantics cannot be closed under rational operators because they lack the ability to reason about concatenations that generate new walks.
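
The tension between monotonicity and minimization can be observed on a tiny instance. The following Python sketch is not a proof of any theorem from the paper; it only shows one counterexample for shortest-walk semantics, under the assumptions that every edge carries the label `a`, that the query is `a+` (so every non-empty walk matches), and that shortest walks are selected per endpoint pair. The names `all_walks`, `shortest`, `D`, and `D_plus` are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Walk = Tuple[str, ...]               # sequence of edge ids
Edges = Dict[str, Tuple[str, str]]   # edge id -> (source, target); all labels are "a"

def all_walks(edges: Edges) -> List[Walk]:
    """All non-empty walks; terminates only because the example graphs are acyclic."""
    queue = deque((e,) for e in edges)
    walks = []
    while queue:
        walk = queue.popleft()
        walks.append(walk)
        end = edges[walk[-1]][1]
        queue.extend(walk + (e,) for e, (src, _tgt) in edges.items() if src == end)
    return walks

def shortest(edges: Edges) -> Set[Walk]:
    """Shortest-walk answer to a+: minimal-length walks for each endpoint pair."""
    matches = all_walks(edges)
    pair = lambda w: (edges[w[0]][0], edges[w[-1]][1])
    best: Dict[Tuple[str, str], int] = {}
    for w in matches:
        best[pair(w)] = min(best.get(pair(w), len(w)), len(w))
    return {w for w in matches if len(w) == best[pair(w)]}

D: Edges = {"e1": ("u", "v"), "e2": ("v", "w")}
D_plus: Edges = {**D, "e3": ("u", "w")}   # same database with one extra edge

before, after = shortest(D), shortest(D_plus)
print(("e1", "e2") in before)  # True
print(("e1", "e2") in after)   # False: the new edge e3 is a shorter u-to-w walk
print(before <= after)         # False, so monotonicity fails on this instance
```

Adding the single edge `e3` removes the previously returned walk `(e1, e2)`, which is exactly the behaviour that monotonicity, as stated above, forbids.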

The paper introduces several new semantics to illustrate the design space:

  • ShV (Shortest‑Vertex‑distinct): returns shortest walks that never repeat a vertex.
  • ShC (Shortest‑Cycle‑free): returns shortest walks that contain no cycles.
  • ShV‑C (Shortest‑Vertex‑distinct‑with‑Coverage): combines ShV with a coverage guarantee, ensuring each reachable vertex pair is represented.
  • Additional variants such as ShV‑S (adding symmetry) are also defined.

Each new semantics is analyzed with respect to the previously defined properties, and a comprehensive table (located in the appendix) summarizes which properties each semantics satisfies or violates.

The computational complexity of evaluating these semantics is revisited. The authors confirm known results: trail semantics is NP‑hard, shortest‑walk semantics is polynomial‑time, and acyclic semantics is PSPACE‑complete. For the newly proposed semantics, they show that ShV remains polynomial, while ShV‑C becomes NP‑hard due to the coverage constraint. These findings reinforce the intuition that adding global constraints typically raises evaluation difficulty.

In the related‑work section, the authors compare their approach to SPARQL property paths (which ultimately use endpoint semantics), to GQL’s “SIMPLE” keyword (essentially acyclic semantics), and to recent proposals such as “simple run” and “binding trail” semantics. They argue that while prior work has examined specific semantics, none has provided a systematic property‑based taxonomy.

The conclusion emphasizes that the choice of RPQ semantics should be guided not only by algorithmic efficiency but also by user‑centric criteria such as result interpretability, safety (e.g., avoiding edge repetition), and completeness. The presented property matrix offers language designers a concrete decision‑making tool: depending on the application’s priorities (e.g., guaranteeing a shortest path versus guaranteeing no edge duplication), one can select an appropriate semantics or even design a new one that balances the trade‑offs. Future work is suggested on dynamic semantics that can adapt to changing workloads, on visualisation‑aware semantics, and on integrating the property framework into query optimisers and language specifications.

