Lexical Analysis ================ Encoding -------- All Nim source files are in the UTF-8 encoding (or its ASCII subset). Other encodings are not supported. Any of the standard platform line termination sequences can be used - the Unix form using ASCII LF (linefeed), the Windows form using the ASCII sequence CR LF (return followed by linefeed), or the old Macintosh form using the ASCII CR (return) character. All of these forms can be used equally, regardless of platform. Indentation ----------- Nim's standard grammar describes an `indentation sensitive`:idx: language. This means that all the control structures are recognized by indentation. Indentation consists only of spaces; tabulators are not allowed. The indentation handling is implemented as follows: The lexer annotates the following token with the preceding number of spaces; indentation is not a separate token. This trick allows parsing of Nim with only 1 token of lookahead. The parser uses a stack of indentation levels: the stack consists of integers counting the spaces. The indentation information is queried at strategic places in the parser but ignored otherwise: The pseudo terminal ``IND{>}`` denotes an indentation that consists of more spaces than the entry at the top of the stack; ``IND{=}`` an indentation that has the same number of spaces. ``DED`` is another pseudo terminal that describes the *action* of popping a value from the stack, ``IND{>}`` then implies to push onto the stack. With this notation we can now easily define the core of the grammar: A block of statements (simplified example):: ifStmt = 'if' expr ':' stmt (IND{=} 'elif' expr ':' stmt)* (IND{=} 'else' ':' stmt)? simpleStmt = ifStmt / ... stmt = IND{>} stmt ^+ IND{=} DED # list of statements / simpleStmt # or a simple statement Comments -------- Comments start anywhere outside a string or character literal with the hash character ``#``. Comments consist of a concatenation of `comment pieces`:idx:. A comment piece starts with ``#`` and runs until the end of the line. The end of line characters belong to the piece. If the next line only consists of a comment piece with no other tokens between it and the preceding one, it does not start a new comment: .. code-block:: nim i = 0 # This is a single comment over multiple lines. # The scanner merges these two pieces. # The comment continues here. `Documentation comments`:idx: are comments that start with two ``##``. Documentation comments are tokens; they are only allowed at certain places in the input file as they belong to the syntax tree! Multiline comments ------------------ Starting with version 0.13.0 of the language Nim supports multiline comments. They look like: .. code-block:: nim #[Comment here. Multiple lines are not a problem.]# Multiline comments support nesting: .. code-block:: nim #[ #[ Multiline comment in already commented out code. ]# proc p[T](x: T) = discard ]# Multiline documentation comments also exist and support nesting too: .. code-block:: nim proc foo = ##[Long documentation comment here. ]## Identifiers & Keywords ---------------------- Identifiers in Nim can be any string of letters, digits and underscores, beginning with a letter. Two immediate following underscores ``__`` are not allowed:: letter ::= 'A'..'Z' | 'a'..'z' | '\x80'..'\xff' digit ::= '0'..'9' IDENTIFIER ::= letter ( ['_'] (letter | digit) )* Currently any Unicode character with an ordinal value > 127 (non ASCII) is classified as a ``letter`` and may thus be part of an identifier but later versions of the language may assign some Unicode characters to belong to the operator characters instead. The following keywords are reserved and cannot be used as identifiers: .. code-block:: nim :file: ../keywords.txt Some keywords are unused; they are reserved for future developments of the language. Identifier equality ------------------- Two identifiers are considered equal if the following algorithm returns true: .. code-block:: nim proc sameIdentifier(a, b: string): bool = a[0] == b[0] and a.replace(re"_|–", "").toLower == b.replace(re"_|–", "").toLower That means only the first letters are compared in a case sensitive manner. Other letters are compared case insensitively and underscores and en-dash (Unicode point U+2013) are ignored. This rather unorthodox way to do identifier comparisons is called `partial case insensitivity`:idx: and has some advantages over the conventional case sensitivity: It allows programmers to mostly use their own preferred spelling style, be it humpStyle, snake_style or dash–style and libraries written by different programmers cannot use incompatible conventions. A Nim-aware editor or IDE can show the identifiers as preferred. Another advantage is that it frees the programmer from remembering the exact spelling of an identifier. The exception with respect to the first letter allows common code like ``var foo: Foo`` to be parsed unambiguously. Historically, Nim was a fully `style-insensitive`:idx: language. This meant that it was not case-sensitive and underscores were ignored and there was no even a distinction between ``foo`` and ``Foo``. String literals --------------- Terminal symbol in the grammar: ``STR_LIT``. String literals can be delimited by matching double quotes, and can contain the following `escape sequences`:idx:\ : ================== =================================================== Escape sequence Meaning ================== =================================================== ``\n`` `newline`:idx: ``\r``, ``\c`` `carriage return`:idx: ``\l`` `line feed`:idx: ``\f`` `form feed`:idx: ``\t`` `tabulator`:idx: ``\v`` `vertical tabulator`:idx: ``\\`` `backslash`:idx: ``\"`` `quotation mark`:idx: ``\'`` `apostrophe`:idx: ``\`` '0'..'9'+ `character with decimal value d`:idx:; all decimal digits directly following are used for the character ``\a`` `alert`:idx: ``\b`` `backspace`:idx: ``\e`` `escape`:idx: `[ESC]`:idx: ``\x`` HH `character with hex value HH`:idx:; exactly two hex digits are allowed ================== =================================================== Strings in Nim may contain any 8-bit value, even embedded zeros. However some operations may interpret the first binary zero as a terminator. Triple quoted string literals ----------------------------- Terminal symbol in the grammar: ``TRIPLESTR_LIT``. String literals can also be delimited by three double quotes ``"""`` ... ``"""``. Literals in this form may run for several lines, may contain ``"`` and do not interpret any escape sequences. For convenience, when the opening ``"""`` is followed by a newline (there may be whitespace between the opening ``"""`` and the newline), the newline (and the preceding whitespace) is not included in the string. The ending of the string literal is defined by the pattern ``"""[^"]``, so this: .. code-block:: nim """"long string within quotes"""" Produces:: "long string within quotes" Raw string literals ------------------- Terminal symbol in the grammar: ``RSTR_LIT``. There are also raw string literals that are preceded with the letter ``r`` (or ``R``) and are delimited by matching double quotes (just like ordinary string literals) and do not interpret the escape sequences. This is especially convenient for regular expressions or Windows paths: .. code-block:: nim var f = openFile(r"C:\texts\text.txt") # a raw string, so ``\t`` is no tab To produce a single ``"`` within a raw string literal, it has to be doubled: .. code-block:: nim r"a""b" Produces:: a"b ``r""""`` is not possible with this notation, because the three leading quotes introduce a triple quoted string literal. ``r"""`` is the same as ``"""`` since triple quoted string literals do not interpret escape sequences either. Generalized raw string literals ------------------------------- Terminal symbols in the grammar: ``GENERALIZED_STR_LIT``, ``GENERALIZED_TRIPLESTR_LIT``. The construct ``identifier"string literal"`` (without whitespace between the identifier and the opening quotation mark) is a generalized raw string literal. It is a shortcut for the construct ``identifier(r"string literal")``, so it denotes a procedure call with a raw string literal as its only argument. Generalized raw string literals are especially convenient for embedding mini languages directly into Nim (for example regular expressions). The construct ``identifier"""string literal"""`` exists too. It is a shortcut for ``identifier("""string literal""")``. Character literals ------------------ Character literals are enclosed in single quotes ``''`` and can contain the same escape sequences as strings - with one exception: `newline`:idx: (``\n``) is not allowed as it may be wider than one character (often it is the pair CR/LF for example). Here are the valid `escape sequences`:idx: for character literals: ================== =================================================== Escape sequence Meaning ================== =================================================== ``\r``, ``\c`` `carriage return`:idx: ``\l`` `line feed`:idx: ``\f`` `form feed`:idx: ``\t`` `tabulator`:idx: ``\v`` `vertical tabulator`:idx: ``\\`` `backslash`:idx: ``\"`` `quotation mark`:idx: ``\'`` `apostrophe`:idx: ``\`` '0'..'9'+ `character with decimal value d`:idx:; all decimal digits directly following are used for the character ``\a`` `alert`:idx: ``\b`` `backspace`:idx: ``\e`` `escape`:idx: `[ESC]`:idx: ``\x`` HH `character with hex value HH`:idx:; exactly two hex digits are allowed ================== =================================================== A character is not an Unicode character but a single byte. The reason for this is efficiency: for the overwhelming majority of use-cases, the resulting programs will still handle UTF-8 properly as UTF-8 was specially designed for this. Another reason is that Nim can thus support ``array[char, int]`` or ``set[char]`` efficiently as many algorithms rely on this feature. The `Rune` type is used for Unicode characters, it can represent any Unicode character. ``Rune`` is declared in the `unicode module `_. Numerical constants ------------------- Numerical constants are of a single type and have the form:: hexdigit = digit | 'A'..'F' | 'a'..'f' octdigit = '0'..'7' bindigit = '0'..'1' HEX_LIT = '0' ('x' | 'X' ) hexdigit ( ['_'] hexdigit )* DEC_LIT = digit ( ['_'] digit )* OCT_LIT = '0' ('o' | 'c' | 'C') octdigit ( ['_'] octdigit )* BIN_LIT = '0' ('b' | 'B' ) bindigit ( ['_'] bindigit )* INT_LIT = HEX_LIT | DEC_LIT | OCT_LIT | BIN_LIT INT8_LIT = INT_LIT ['\''] ('i' | 'I') '8' INT16_LIT = INT_LIT ['\''] ('i' | 'I') '16' INT32_LIT = INT_LIT ['\''] ('i' | 'I') '32' INT64_LIT = INT_LIT ['\''] ('i' | 'I') '64' UINT_LIT = INT_LIT ['\''] ('u' | 'U') UINT8_LIT = INT_LIT ['\''] ('u' | 'U') '8' UINT16_LIT = INT_LIT ['\''] ('u' | 'U') '16' UINT32_LIT = INT_LIT ['\''] ('u' | 'U') '32' UINT64_LIT = INT_LIT ['\''] ('u' | 'U') '64' exponent = ('e' | 'E' ) ['+' | '-'] digit ( ['_'] digit )* FLOAT_LIT = digit (['_'] digit)* (('.' (['_'] digit)* [exponent]) |exponent) FLOAT32_SUFFIX = ('f' | 'F') ['32'] FLOAT32_LIT = HEX_LIT '\'' FLOAT32_SUFFIX | (FLOAT_LIT | DEC_LIT | OCT_LIT | BIN_LIT) ['\''] FLOAT32_SUFFIX FLOAT64_SUFFIX = ( ('f' | 'F') '64' ) | 'd' | 'D' FLOAT64_LIT = HEX_LIT '\'' FLOAT64_SUFFIX | (FLOAT_LIT | DEC_LIT | OCT_LIT | BIN_LIT) ['\''] FLOAT64_SUFFIX As can be seen in the productions, numerical constants can contain underscores for readability. Integer and floating point literals may be given in decimal (no prefix), binary (prefix ``0b``), octal (prefix ``0o`` or ``0c``) and hexadecimal (prefix ``0x``) notation. There exists a literal for each numerical type that is defined. The suffix starting with an apostrophe ('\'') is called a `type suffix`:idx:. Literals without a type suffix are of the type ``int``, unless the literal contains a dot or ``E|e`` in which case it is of type ``float``. For notational convenience the apostrophe of a type suffix is optional if it is not ambiguous (only hexadecimal floating point literals with a type suffix can be ambiguous). The type suffixes are: ================= ========================= Type Suffix Resulting type of literal ================= ========================= ``'i8`` int8 ``'i16`` int16 ``'i32`` int32 ``'i64`` int64 ``'u`` uint ``'u8`` uint8 ``'u16`` uint16 ``'u32`` uint32 ``'u64`` uint64 ``'f`` float32 ``'d`` float64 ``'f32`` float32 ``'f64`` float64 ``'f128`` float128 ================= ========================= Floating point literals may also be in binary, octal or hexadecimal notation: ``0B0_10001110100_0000101001000111101011101111111011000101001101001001'f64`` is approximately 1.72826e35 according to the IEEE floating point standard. Literals are bounds checked so that they fit the datatype. Non base-10 literals are used mainly for flags and bit pattern representations, therefore bounds checking is done on bit width, not value range. If the literal fits in the bit width of the datatype, it is accepted. Hence: 0b10000000'u8 == 0x80'u8 == 128, but, 0b10000000'i8 == 0x80'i8 == -1 instead of causing an overflow error. Operators --------- Nim allows user defined operators. An operator is any combination of the following characters:: = + - * / < > @ $ ~ & % | ! ? ^ . : \ These keywords are also operators: ``and or not xor shl shr div mod in notin is isnot of``. `=`:tok:, `:`:tok:, `::`:tok: are not available as general operators; they are used for other notational purposes. ``*:`` is as a special case treated as the two tokens `*`:tok: and `:`:tok: (to support ``var v*: T``). Other tokens ------------ The following strings denote other tokens:: ` ( ) { } [ ] , ; [. .] {. .} (. .) The `slice`:idx: operator `..`:tok: takes precedence over other tokens that contain a dot: `{..}`:tok: are the three tokens `{`:tok:, `..`:tok:, `}`:tok: and not the two tokens `{.`:tok:, `.}`:tok:.