Characters are things you can type. Operators are things in a regular expression that match one or more characters. You compose regular expressions from operators, which in turn you specify using one or more characters.
Most characters represent what we call the match-self operator, i.e., they match themselves; we call these characters ordinary. Other characters represent either all or parts of fancier operators; e.g., `.' represents what we call the match-any-character operator (which, no surprise, matches (almost) any character); we call these characters special. Two different things determine what characters represent what operators:
1. The regular expression syntax your program has told the Regex library to recognize, and
2. the context of the character in the regular expression.
In the following sections, we describe these things in more detail.
In any particular syntax for regular expressions, some characters are always special, others are sometimes special, and others are never special. The particular syntax that Regex recognizes for a given regular expression depends on the value in the `syntax' field of the pattern buffer of that regular expression.
You get a pattern buffer by compiling a regular expression. See section GNU Pattern Buffers, and section POSIX Pattern Buffers, for more information on pattern buffers. See section GNU Regular Expression Compiling, section POSIX Regular Expression Compiling, and section BSD Regular Expression Compiling, for more information on compiling.
Regex considers the value of the `syntax' field to be a collection of bits; we refer to these bits as syntax bits. In most cases, they affect what characters represent what operators. We describe the meanings of the operators to which we refer in section Common Operators, section GNU Operators, and section GNU Emacs Operators.
For reference, here is the complete list of syntax bits, in alphabetical order:
RE_BACKSLASH_ESCAPE_IN_LISTS
If this bit is set, then `\' inside a list (see section List Operators ([ ... ] and [^ ... ])) quotes (makes ordinary, if it's special) the following character; if this bit isn't set, then `\' is an ordinary character inside lists. (See section The Backslash Character, for what `\' does outside of lists.)
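RE_BK_PLUS_QM
If this bit is set, then `\+' represents the match-one-or-more operator and `\?' represents the match-zero-or-one operator; if this bit isn't set, then `+' represents the match-one-or-more operator and `?' represents the match-zero-or-one operator. This bit is irrelevant if RE_LIMITED_OPS is set.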
RE_CHAR_CLASSES
If this bit is set, then you can use character classes in lists; if this bit isn't set, then you can't.
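RE_CONTEXT_INDEP_ANCHORS
If this bit is set, then `^' and `$' are special anywhere outside a list; if this bit isn't set, then these characters are special only in certain contexts. See section The Match-beginning-of-line Operator (^), and section The Match-end-of-line Operator ($).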
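RE_CONTEXT_INDEP_OPS
If this bit is set, then certain characters are special anywhere outside a list; if this bit isn't set, then those characters are special only in some contexts and are ordinary elsewhere. Specifically, if this bit isn't set, then `*' and (if the syntax bit RE_LIMITED_OPS isn't set) `+' and `?' (or `\+' and `\?', depending on the syntax bit RE_BK_PLUS_QM) represent repetition operators only if they're not first in a regular expression or just after an open-group or alternation operator. The same holds for `{' (or `\{', depending on the syntax bit RE_NO_BK_BRACES) if it is the beginning of a valid interval and the syntax bit RE_INTERVALS is set.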
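RE_CONTEXT_INVALID_OPS
If this bit is set, then repetition and alternation operators can't be in certain positions within a regular expression. Specifically, the regular expression is invalid if it has a repetition operator first in the regular expression or just after a match-beginning-of-line, open-group, or alternation operator; or an alternation operator first or last in the regular expression, just before a match-end-of-line operator, or just after an alternation or open-group operator. If this bit isn't set, then you can put the characters representing the repetition and alternation operators anywhere in a regular expression. Whether or not they will in fact be operators in certain positions depends on other syntax bits.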
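RE_DOT_NEWLINE
If this bit is set, then the match-any-character operator matches a newline; if this bit isn't set, then it doesn't.
RE_DOT_NOT_NULL
If this bit is set, then the match-any-character operator doesn't match a null character; if this bit isn't set, then it does.
RE_INTERVALS
If this bit is set, then Regex recognizes interval operators; if this bit isn't set, then it doesn't.
RE_LIMITED_OPS
If this bit is set, then Regex doesn't recognize the match-one-or-more, match-zero-or-one, or alternation operators; if this bit isn't set, then it does.
RE_NEWLINE_ALT
If this bit is set, then newline represents the alternation operator; if this bit isn't set, then newline is ordinary.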
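RE_NO_BK_BRACES
If this bit is set, then `{' represents the open-interval operator and `}' represents the close-interval operator; if this bit isn't set, then `\{' represents the open-interval operator and `\}' represents the close-interval operator. This bit is relevant only if RE_INTERVALS is set.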
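RE_NO_BK_PARENS
If this bit is set, then `(' represents the open-group operator and `)' represents the close-group operator; if this bit isn't set, then `\(' represents the open-group operator and `\)' represents the close-group operator.
RE_NO_BK_REFS
If this bit is set, then Regex doesn't recognize `\' followed by a digit as the back-reference operator; if this bit isn't set, then it does.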
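RE_NO_BK_VBAR
If this bit is set, then `|' represents the alternation operator; if this bit isn't set, then `\|' represents the alternation operator. This bit is irrelevant if RE_LIMITED_OPS is set.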
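RE_NO_EMPTY_RANGES
If this bit is set, then a regular expression with a range whose ending point collates lower than its starting point is invalid; if this bit isn't set, then Regex considers such a range to be empty.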
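RE_UNMATCHED_RIGHT_PAREN_ORD
If this bit is set and the regular expression has no matching open-group operator, then Regex considers what would otherwise be a close-group operator (based on how RE_NO_BK_PARENS is set) to match `)'.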
If you're programming with Regex, you can set the `syntax' field of a pattern buffer (see section GNU Pattern Buffers, and section POSIX Pattern Buffers) either to an arbitrary combination of syntax bits (see section Syntax Bits) or else to one of the configurations defined by Regex. These configurations define the syntaxes used by certain programs (GNU Emacs, POSIX Awk, traditional Awk, Grep, and Egrep) in addition to syntaxes for POSIX basic and extended regular expressions.
The predefined syntaxes, taken directly from `regex.h', are:
    #define RE_SYNTAX_EMACS 0

    #define RE_SYNTAX_AWK \
      (RE_BACKSLASH_ESCAPE_IN_LISTS | RE_DOT_NOT_NULL \
       | RE_NO_BK_PARENS | RE_NO_BK_REFS \
       | RE_NO_BK_VBAR | RE_NO_EMPTY_RANGES \
       | RE_UNMATCHED_RIGHT_PAREN_ORD)

    #define RE_SYNTAX_POSIX_AWK \
      (RE_SYNTAX_POSIX_EXTENDED | RE_BACKSLASH_ESCAPE_IN_LISTS)

    #define RE_SYNTAX_GREP \
      (RE_BK_PLUS_QM | RE_CHAR_CLASSES \
       | RE_HAT_LISTS_NOT_NEWLINE | RE_INTERVALS \
       | RE_NEWLINE_ALT)

    #define RE_SYNTAX_EGREP \
      (RE_CHAR_CLASSES | RE_CONTEXT_INDEP_ANCHORS \
       | RE_CONTEXT_INDEP_OPS | RE_HAT_LISTS_NOT_NEWLINE \
       | RE_NEWLINE_ALT | RE_NO_BK_PARENS \
       | RE_NO_BK_VBAR)

    #define RE_SYNTAX_POSIX_EGREP \
      (RE_SYNTAX_EGREP | RE_INTERVALS | RE_NO_BK_BRACES)

    /* P1003.2/D11.2, section 4.20.7.1, lines 5078ff. */
    #define RE_SYNTAX_ED RE_SYNTAX_POSIX_BASIC

    #define RE_SYNTAX_SED RE_SYNTAX_POSIX_BASIC

    /* Syntax bits common to both basic and extended POSIX regex syntax. */
    #define _RE_SYNTAX_POSIX_COMMON \
      (RE_CHAR_CLASSES | RE_DOT_NEWLINE | RE_DOT_NOT_NULL \
       | RE_INTERVALS | RE_NO_EMPTY_RANGES)

    #define RE_SYNTAX_POSIX_BASIC \
      (_RE_SYNTAX_POSIX_COMMON | RE_BK_PLUS_QM)

    /* Differs from ..._POSIX_BASIC only in that RE_BK_PLUS_QM becomes
       RE_LIMITED_OPS, i.e., \? \+ \| are not recognized. Actually, this
       isn't minimal, since other operators, such as \`, aren't disabled. */
    #define RE_SYNTAX_POSIX_MINIMAL_BASIC \
      (_RE_SYNTAX_POSIX_COMMON | RE_LIMITED_OPS)

    #define RE_SYNTAX_POSIX_EXTENDED \
      (_RE_SYNTAX_POSIX_COMMON | RE_CONTEXT_INDEP_ANCHORS \
       | RE_CONTEXT_INDEP_OPS | RE_NO_BK_BRACES \
       | RE_NO_BK_PARENS | RE_NO_BK_VBAR \
       | RE_UNMATCHED_RIGHT_PAREN_ORD)

    /* Differs from ..._POSIX_EXTENDED in that RE_CONTEXT_INVALID_OPS
       replaces RE_CONTEXT_INDEP_OPS and RE_NO_BK_REFS is added. */
    #define RE_SYNTAX_POSIX_MINIMAL_EXTENDED \
      (_RE_SYNTAX_POSIX_COMMON | RE_CONTEXT_INDEP_ANCHORS \
       | RE_CONTEXT_INVALID_OPS | RE_NO_BK_BRACES \
       | RE_NO_BK_PARENS | RE_NO_BK_REFS \
       | RE_NO_BK_VBAR | RE_UNMATCHED_RIGHT_PAREN_ORD)
POSIX generalizes the notion of a character to that of a collating element. It defines a collating element to be "a sequence of one or more bytes defined in the current collating sequence as a unit of collation."
This generalizes the notion of a character in two ways. First, a single character can map into two or more collating elements. For example, the German `ß' ("es-zet") collates as the collating element `s' followed by another collating element `s'. Second, two or more characters can map into one collating element. For example, the Spanish `ll' collates after `l' and before `m'.
Since POSIX's "collating element" preserves the essential idea of a "character," we use the latter, more familiar, term in this document.
The `\' character has one of four different meanings, depending on the context in which you use it and what syntax bits are set (see section Syntax Bits). It can: 1) stand for itself, 2) quote the next character, 3) introduce an operator, or 4) do nothing.
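1. It stands for itself inside a list (see section List Operators ([ ... ] and [^ ... ])) if the syntax bit RE_BACKSLASH_ESCAPE_IN_LISTS is not set. For example, `[\]' would match `\'.
2. It quotes (makes ordinary, if it's special) the next character when you use it either outside a list, or inside a list when the syntax bit RE_BACKSLASH_ESCAPE_IN_LISTS is set.
3. It introduces an operator when followed by certain ordinary characters, sometimes only when certain syntax bits are set. See the cases RE_BK_PLUS_QM, RE_NO_BK_BRACES, RE_NO_BK_VBAR, RE_NO_BK_PARENS, and RE_NO_BK_REFS in section Syntax Bits. Also:
   - `\b' represents the match-word-boundary operator (see section The Match-word-boundary Operator (\b)).
   - `\B' represents the match-within-word operator (see section The Match-within-word Operator (\B)).
   - `\<' represents the match-beginning-of-word operator (see section The Match-beginning-of-word Operator (\<)).
   - `\>' represents the match-end-of-word operator (see section The Match-end-of-word Operator (\>)).
   - `\w' represents the match-word-constituent operator (see section The Match-word-constituent Operator (\w)).
   - `\W' represents the match-non-word-constituent operator (see section The Match-non-word-constituent Operator (\W)).
   - If Regex was compiled with the C preprocessor symbol emacs defined, then `\sclass' represents the match-syntactic-class operator and `\Sclass' represents the match-not-syntactic-class operator (see section Syntactic Class Operators).
4. In all other cases, Regex ignores `\'. For example, `\n' matches `n'.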