arXiv:1111.0422v1 [cs.CC] 2 Nov 2011
Inclusion of Unambiguous #REs is NP-Hard
Pekka Kilpeläinen
University of Kuopio
Department of Computer Science
Pekka.Kilpelainen@cs.uku.fi
May 27, 2004
Abstract
We show that testing inclusion between languages represented by
regular expressions with numerical occurrence indicators (#REs) is
NP-hard, even if the expressions satisfy the “unambiguity” requirement
imposed on XML Schema content model expressions.
1 Proof of the result
We have seen before [3] that testing for inclusion and overlap of languages
represented by #REs is NP-hard. Testing for overlap was shown to be hard also
for expressions that satisfy the XML requirement of “unambiguity”. On
the other hand, the NP-hardness proof of #RE inclusion used ambiguous
expressions. Here we show that unambiguity does not make the testing of
inclusion essentially easier. The proof is based on a polynomial-time Turing
reduction [1, Chap. 5] from PARTITION, which is one of the best-known
NP-complete problems [2, 1].
Theorem 1.1 The #RE inclusion problem is NP-hard, also for unambiguous #REs.
Proof. Let a set $A = \{a_1, \ldots, a_k\}$ and a positive integer weight $w(a)$ of each
$a \in A$ form an instance of PARTITION. The problem is to decide whether
$A$ can be split in two equal-weight subsets $A'$ and $A - A'$, that is, whether

\[
\sum_{a \in A'} w(a) = \sum_{a \in A - A'} w(a) \tag{1}
\]

holds for some $A' \subseteq A$. Notice that (1) can hold only if the total weight of
the set $A$ is even. Therefore we can assume that $\sum_{a \in A} w(a) = 2n$ for some
positive integer $n$, which means that (1) holds if and only if

\[
\sum_{a \in A'} w(a) = n \tag{2}
\]

for some $A' \subseteq A$.
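Condition (2) can be tested directly on small instances. The following is a minimal Python sketch, not part of the original note; the function name `has_equal_partition` is hypothetical, and the weights are assumed to be given as a list of positive integers.

```python
from typing import Sequence


def has_equal_partition(weights: Sequence[int]) -> bool:
    """Decide PARTITION by dynamic programming over reachable subset sums."""
    total = sum(weights)
    if total % 2 != 0:
        return False  # condition (1) requires an even total weight
    n = total // 2
    reachable = {0}  # sums achievable by some subset of the items seen so far
    for w in weights:
        reachable |= {s + w for s in reachable if s + w <= n}
    return n in reachable  # condition (2): some subset weighs exactly n
```

The running time of this sketch is polynomial in the value of n, not in the number of bits needed to write the weights, so it does not conflict with the NP-completeness of PARTITION.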
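For brevity, denote the weight $w(a_i)$ of an item $a_i \in A$ by $w_i$.
Now form the following two #REs over the alphabet $\Sigma = \{a_0, a_1, \ldots, a_k\}$:

\[
E_1 = a_0^{n+1..n+1}\,(a_1^{w_1..w_1} \mid \epsilon)(a_2^{w_2..w_2} \mid \epsilon) \cdots (a_k^{w_k..w_k} \mid \epsilon)
\]

\[
E_2 = \bigl((a_0 \mid a_1 \mid \cdots \mid a_k)^{n+1..2n}\bigr)^{1..2}
\]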
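The construction of the two expressions is mechanical and takes time linear in k. As an illustration only, the sketch below emits them as Python `re` patterns. The encoding is an assumption of this sketch and not the notation of the note: the symbols a_0, ..., a_k are mapped to the letters 'a', 'b', ..., the occurrence indicator m..n is written as the quantifier `{m,n}`, and the name `build_expressions` is hypothetical.

```python
import string


def build_expressions(weights):
    """Return (E1, E2) as Python `re` patterns for a PARTITION instance.

    Assumption: symbols a_0..a_k are encoded as the letters 'a', 'b', ...
    and the #RE operator e^{m..n} is written as the regex quantifier {m,n}.
    """
    total = sum(weights)
    if total % 2 != 0 or len(weights) + 1 > len(string.ascii_lowercase):
        raise ValueError("need an even total weight and at most 25 items")
    n = total // 2
    symbols = string.ascii_lowercase[: len(weights) + 1]
    # E1 = a_0^{n+1..n+1} (a_1^{w_1..w_1}|eps) ... (a_k^{w_k..w_k}|eps)
    e1 = f"{symbols[0]}{{{n + 1},{n + 1}}}" + "".join(
        f"(?:{sym}{{{w},{w}}}|)" for sym, w in zip(symbols[1:], weights)
    )
    # E2 = ((a_0|a_1|...|a_k)^{n+1..2n})^{1..2}
    e2 = f"(?:(?:{'|'.join(symbols)}){{{n + 1},{2 * n}}}){{1,2}}"
    return e1, e2
```

For the weights 1, 2, 3 (so n = 3) this yields the patterns `a{4,4}(?:b{1,1}|)(?:c{2,2}|)(?:d{3,3}|)` and `(?:(?:a|b|c|d){4,6}){1,2}`.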
Notice that both expressions are trivially unambiguous since each symbol
of $\Sigma$ appears exactly once in each of them. Expression $E_1$ describes words
of the form $a_0^{n+1}u$, where the length of the suffix $u$ equals the total weight
of some subset of $A$. Therefore $L(E_1) \subseteq \{v \in \Sigma^* \mid n+1 \le |v| \le 3n+1\}$.
Obviously $E_1$ accepts a word of length $2n+1$ if and only if a partition that
satisfies (2) exists. Expression $E_2$, on the other hand, rejects all words of
length $2n+1$:
\[
L(E_2) = \bigcup_{i=n+1}^{2n} \Sigma^i \;\cup \bigcup_{i=2n+2}^{4n} \Sigma^i
       = \{v \in \Sigma^* \mid n+1 \le |v| \le 4n,\ |v| \ne 2n+1\}
\]
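Since $n \ge 1$, every word of $L(E_1)$ has length at most $3n+1 \le 4n$, so a word
of $L(E_1)$ can fall outside $L(E_2)$ only if its length is $2n+1$.
Now $L(E_1) \subseteq L(E_2)$ holds iff $E_1$ does not accept any word of length $2n+1$,
which holds if and only if no partition which satisfies (1) exists.
□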
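As a small worked instance of the reduction (an illustration added here, with weights chosen arbitrarily), take k = 3 and weights 1, 2, 3, so that the total weight is 6 and n = 3:

```latex
\begin{align*}
E_1 &= a_0^{4..4}\,(a_1^{1..1} \mid \epsilon)(a_2^{2..2} \mid \epsilon)(a_3^{3..3} \mid \epsilon),
  & E_2 &= \bigl((a_0 \mid a_1 \mid a_2 \mid a_3)^{4..6}\bigr)^{1..2}, \\
\{\,|v| : v \in L(E_1)\,\} &= \{4, 5, \ldots, 10\},
  & \{\,|v| : v \in L(E_2)\,\} &= \{4, 5, 6\} \cup \{8, 9, \ldots, 12\}.
\end{align*}
```

The word $a_0^4 a_3^3$ has length $7 = 2n+1$ and lies in $L(E_1)$ but not in $L(E_2)$, so the inclusion fails. It corresponds to the subset $A' = \{a_3\}$ of weight 3, which is indeed one half of an equal-weight partition.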
So, a polynomial-time algorithm for testing the inclusion of unambiguous
#REs would imply P = NP, which is considered most unlikely.
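The equivalence established in the proof can be checked by brute force on small instances. The sketch below is an illustration under stated assumptions, not an algorithm from the note: symbols are encoded as the letters 'a', 'b', ..., E2 is written as a Python `re` pattern using a character class in place of the alternation, and L(E1), which is finite, is enumerated word by word. The names `inclusion_holds` and `has_partition` are hypothetical.

```python
import re
import string
from itertools import combinations, product


def inclusion_holds(weights):
    """Check L(E1) <= L(E2) by enumerating the finite language L(E1)."""
    n = sum(weights) // 2  # assumes the total weight is even
    symbols = string.ascii_lowercase[: len(weights) + 1]
    # E2 = ((a_0|...|a_k)^{n+1..2n})^{1..2}, with the alternation as a class
    e2 = re.compile(f"(?:[{symbols}]{{{n + 1},{2 * n}}}){{1,2}}")
    # Each word of L(E1) is a_0^{n+1} followed by the blocks a_i^{w_i}
    # of the items that are kept.
    for choice in product([False, True], repeat=len(weights)):
        word = symbols[0] * (n + 1) + "".join(
            sym * w for sym, w, keep in zip(symbols[1:], weights, choice) if keep
        )
        if e2.fullmatch(word) is None:
            return False
    return True


def has_partition(weights):
    """Brute-force test of condition (2): some subset has weight n."""
    n = sum(weights) // 2
    return any(
        sum(subset) == n
        for r in range(len(weights) + 1)
        for subset in combinations(weights, r)
    )


# Inclusion should hold exactly when no equal-weight partition exists.
for weights in product(range(1, 5), repeat=3):
    if sum(weights) % 2 == 0:
        assert inclusion_holds(weights) == (not has_partition(weights))
```

Both functions enumerate all 2^k subsets, so the check is exponential in k; it only confirms the reduction on small inputs and says nothing about efficient inclusion testing.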
References
[1] M.R. Garey and D.S. Johnson. Computers and Intractability. W.H.
Freeman and Company, New York, 1979.
[2] R.M. Karp. Reducibility among combinatorial problems. In R.E. Miller
and J.W. Thatcher, editors, Complexity of Computer Computations,
pages 85–103. Plenum Press, New York, 1972.
[3] P. Kilpeläinen and R. Tuhkanen. Regular expressions with numerical
occurrence indicators—preliminary results. In Proc. of the Eighth
Symposium on Programming Languages and Software Tools, pages 163–173.
University of Kuopio, Department of Computer Science, 2003.