Appendix D. Regular expressions
The regular expressions (regex for short) used in searches and
segmentation rules are those supported by Java. For more specific
information, consult the Java regex documentation for the
java.util.regex.Pattern class. See additional references and examples
below.
Note
This chapter is intended for advanced users, who need to define
their own variants of segmentation rules or devise more complex and
powerful key search items.
Table D.1. Regex - Flags
The construct |
... matches the following |
(?i) |
Enables case-insensitive matching (by default, the pattern is
case-sensitive). |
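The effect of the (?i) flag can be checked directly against Java's regex engine. The following is a minimal sketch, not part of the application itself; the class name FlagDemo, the helper found, and the sample text are hypothetical and chosen only for illustration.

```java
import java.util.regex.Pattern;

public class FlagDemo {
    // Returns true if the regex matches anywhere in the text.
    public static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "Segmentation Rules";
        // Without the flag the match is case-sensitive.
        System.out.println(found("rules", text));      // false
        // With (?i) at the start, letter case is ignored.
        System.out.println(found("(?i)rules", text));  // true
    }
}
```

By default, (?i) in Java ignores case for ASCII letters only; adding the u flag, as in (?iu), extends this to other Unicode letters.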
Table D.2. Regex - Character
The construct |
... matches the following |
x |
The character x itself, unless x is one of the special constructs listed in the rows and tables that follow |
\uhhhh |
The character with hexadecimal value 0xhhhh |
\t |
The tab character ('\u0009') |
\n |
The newline (line feed) character ('\u000A') |
\r |
The carriage-return character ('\u000D') |
\f |
The form-feed character ('\u000C') |
\a |
The alert (bell) character ('\u0007') |
\e |
The escape character ('\u001B') |
\cx |
The control character corresponding to x |
\0n |
The character with octal value 0n (0 <= n <= 7) |
\0nn |
The character with octal value 0nn (0 <= n <= 7) |
\0mnn |
The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7) |
\xhh |
The character with hexadecimal value 0xhh |
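Several of the constructs above are different spellings of the same character. The following minimal sketch shows five ways to match a tab; the class name EscapeDemo and the helper found are hypothetical. Note that inside Java source code each backslash of the regex has to be doubled, whereas in a search field or segmentation rule the regex is typed with single backslashes.

```java
import java.util.regex.Pattern;

public class EscapeDemo {
    // Returns true if the regex matches anywhere in the text.
    public static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "name\tvalue";  // contains one real tab character

        System.out.println(found("\\t", text));      // true: tab escape
        System.out.println(found("\\u0009", text));  // true: four hex digits
        System.out.println(found("\\x09", text));    // true: two hex digits
        System.out.println(found("\\011", text));    // true: octal 11 is decimal 9
        System.out.println(found("\\cI", text));     // true: Ctrl-I is the tab
        System.out.println(found("\\n", text));      // false: no line feed here
    }
}
```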
Table D.3. Regex - Quotation
The construct |
...matches the following |
\ |
Nothing, but quotes the following character. This is required
if you would like to enter any of the meta characters
!$()*+.<>?[\]^{|} to match as themselves. |
\\ |
For example, this is the backslash character |
\Q |
Nothing, but quotes all characters until \E |
\E |
Nothing, but ends quoting started by \Q |
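Quoting matters most when the text to find contains meta characters itself. The sketch below contrasts an unquoted dot with the two quoting styles from the table; the class name QuoteDemo and the helper found are hypothetical, and the backslashes are doubled only because the patterns appear inside Java string literals.

```java
import java.util.regex.Pattern;

public class QuoteDemo {
    // Returns true if the regex matches anywhere in the text.
    public static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // An unquoted dot matches any character, so "145" is accepted.
        System.out.println(found("1.5", "145"));          // true
        // A quoted dot matches only a literal dot.
        System.out.println(found("1\\.5", "145"));        // false
        System.out.println(found("1\\.5", "1.5"));        // true
        // \Q...\E quotes a whole run of meta characters at once.
        System.out.println(found("\\Q(a+b)*\\E", "x = (a+b)*c"));  // true
        System.out.println(found("\\Q1.5\\E", "145"));    // false
        // The regex \\ matches a single backslash in the text.
        System.out.println(found("\\\\", "C:\\temp"));    // true
    }
}
```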
Table D.4. Regex - Classes for Unicode blocks and categories
The construct |
...matches the following |
\p{InGreek} |
A character in the Greek block (simple block) |
\p{Lu} |
An uppercase letter (simple category) |
\p{Sc} |
A currency symbol |
\P{InGreek} |
Any character except one in the Greek block (negation) |
[\p{L}&&[^\p{Lu}]] |
Any letter except an uppercase letter (subtraction) |
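The Unicode classes can be tried out on single characters. This is a minimal sketch under the assumption that each pattern is tested against a one-character string; the class name UnicodeClassDemo and the helper matches are hypothetical.

```java
import java.util.regex.Pattern;

public class UnicodeClassDemo {
    // Returns true if the regex matches the entire text.
    public static boolean matches(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        String alpha = "\u03b1";  // Greek small letter alpha

        System.out.println(matches("\\p{InGreek}", alpha));  // true
        System.out.println(matches("\\P{InGreek}", alpha));  // false
        System.out.println(matches("\\P{InGreek}", "a"));    // true
        System.out.println(matches("\\p{Lu}", "A"));         // true
        System.out.println(matches("\\p{Lu}", "a"));         // false
        System.out.println(matches("\\p{Sc}", "$"));         // true
        // Subtraction: any letter that is not an uppercase letter.
        System.out.println(matches("[\\p{L}&&[^\\p{Lu}]]", "a"));  // true
        System.out.println(matches("[\\p{L}&&[^\\p{Lu}]]", "A"));  // false
    }
}
```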
Table D.5. Regex - Character classes
The construct |
...matches the following |
[abc] |
a, b, or c (simple class) |
[^abc] |
Any character except a, b, or c (negation) |
[a-zA-Z] |
a through z or A through Z, inclusive (range) |
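One way to see what a character class covers is to replace every match with a visible marker. The sketch below does this with "#"; the class name CharClassDemo and the helper mask are hypothetical names used only for this illustration.

```java
public class CharClassDemo {
    // Replaces every match of the regex with '#', leaving other text intact.
    public static String mask(String regex, String text) {
        return text.replaceAll(regex, "#");
    }

    public static void main(String[] args) {
        // Simple class: only a, b, and c are matched.
        System.out.println(mask("[abc]", "cabbage"));    // #####ge
        // Negation: everything except a, b, and c is matched.
        System.out.println(mask("[^abc]", "cabbage"));   // cabba##
        // Range: letters are matched, digits and the hyphen are not.
        System.out.println(mask("[a-zA-Z]", "R2-D2"));   // #2-#2
    }
}
```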
Table D.6. Regex - Predefined character classes
The construct |
...matches the following |
. |
Any character (except for line terminators) |
\d |
A digit: [0-9] |
\D |
A non-digit: [^0-9] |
\s |
A whitespace character: [ \t\n\x0B\f\r] |
\S |
A non-whitespace character: [^\s] |
\w |
A word character: [a-zA-Z_0-9] |
\W |
A non-word character: [^\w] |
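The predefined classes can be explored the same way, by masking every matched character with "#". This is an illustrative sketch only; the class name PredefinedClassDemo and the helper mask are hypothetical, and the doubled backslashes are required by Java string literals, not by the regex itself.

```java
public class PredefinedClassDemo {
    // Replaces every match of the regex with '#', leaving other text intact.
    public static String mask(String regex, String text) {
        return text.replaceAll(regex, "#");
    }

    public static void main(String[] args) {
        System.out.println(mask("\\d", "Room 42b"));   // Room ##b
        System.out.println(mask("\\D", "Room 42b"));   // #####42#
        System.out.println(mask("\\s", "a b\tc"));     // a#b#c
        // The underscore counts as a word character; the apostrophe does not.
        System.out.println(mask("\\w", "it's_ok!"));   // ##'####!
        System.out.println(mask("\\W", "it's_ok!"));   // it#s_ok#
        // The dot skips the line terminator between a and b.
        System.out.println(mask(".", "a\nb").equals("#\n#"));  // true
    }
}
```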
Table D.7. Regex - Boundary matchers
The construct |
...matches the following |
^ |
The beginning of a line |
$ |
The end of a line |
\b |
A word boundary |
\B |
A non-word boundary |
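Boundary matchers consume no characters; they only constrain where a match may start or end. The following sketch illustrates this with plain Java. The class name BoundaryDemo and the helper found are hypothetical. In Java's engine, ^ and $ anchor to the start and end of the whole input unless multiline mode, written (?m), is switched on; whether the text being searched ever spans several lines depends on the application, so that part should be read as an assumption.

```java
import java.util.regex.Pattern;

public class BoundaryDemo {
    // Returns true if the regex matches anywhere in the text.
    public static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "concatenate cat";
        // \b requires a word boundary on both sides: only the last word matches.
        System.out.println(text.replaceAll("\\bcat\\b", "dog"));  // concatenate dog
        // \B requires the opposite: only the "cat" inside the long word matches.
        System.out.println(text.replaceAll("\\Bcat\\B", "dog"));  // condogenate cat

        // ^ and $ anchor the match to the beginning and the end.
        System.out.println(found("^\\d+", "apples 42"));  // false
        System.out.println(found("\\d+$", "apples 42"));  // true

        // With (?m), ^ also matches right after a line terminator.
        System.out.println(found("^b", "a\nb"));      // false
        System.out.println(found("(?m)^b", "a\nb"));  // true
    }
}
```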
Table D.8. Regex - Greedy quantifiers
The construct |
...matches the following |
X? |
X, once or not at all |
X* |
X, zero or more times |
X+ |
X, one or more times |
Note
Greedy quantifiers match as much as they can. For example, a+
matches the aaa in aaabbb.
Table D.9. Regex - Reluctant (non-greedy) quantifiers
The construct |
...matches the following |
X?? |
X, once or not at all |
X*? |
X, zero or more times |
X+? |
X, one or more times |
Note
Non-greedy quantifiers match as little as they can. For example,
a+? matches only the first a in aaabbb.
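The difference between greedy and reluctant quantifiers is easiest to see by printing the first match of each on the same text. This is a minimal sketch; the class name QuantifierDemo, the helper firstMatch, and the tagged sample string are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierDemo {
    // Returns the text of the first match, or null if there is none.
    public static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Greedy: takes all three a's.
        System.out.println(firstMatch("a+", "aaabbb"));   // aaa
        // Reluctant: stops after the first a.
        System.out.println(firstMatch("a+?", "aaabbb"));  // a

        // A common practical case: matching a single tag.
        String tagged = "<b>bold</b>";
        System.out.println(firstMatch("<.+>", tagged));   // <b>bold</b>
        System.out.println(firstMatch("<.+?>", tagged));  // <b>
    }
}
```

The reluctant form is usually the safer choice when a pattern is meant to stop at the nearest closing delimiter rather than the last one in the text.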
Table D.10. Regex - Logical operators
The construct |
...matches the following |
XY |
X followed by Y |
X|Y |
Either X or Y |
(XY) |
XY as a single group |
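Grouping changes what the alternation and the quantifiers apply to. The sketch below shows this with whole-string matches; the class name LogicalDemo and the helper matches are hypothetical and serve only as an illustration.

```java
import java.util.regex.Pattern;

public class LogicalDemo {
    // Returns true if the regex matches the entire text.
    public static boolean matches(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        // Without a group, | splits the whole pattern: "ab" or "cd".
        System.out.println(matches("ab|cd", "ab"));      // true
        System.out.println(matches("ab|cd", "acd"));     // false

        // With a group, | only chooses between b and c.
        System.out.println(matches("a(b|c)d", "acd"));   // true
        System.out.println(matches("a(b|c)d", "ab"));    // false

        // A quantifier after a group repeats the whole group.
        System.out.println(matches("(ab)+", "ababab"));  // true
        // Without the group, + repeats only the b.
        System.out.println(matches("ab+", "ababab"));    // false
        System.out.println(matches("ab+", "abbb"));      // true
    }
}
```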