Here, by category, is a list of all the settings in CongoCC that can be set at the top of a grammar file.
Options relating to File/Class/Package Naming
By default, the tool follows naming conventions that you might as well use. For example, if your grammar lies in a file named Foo.ccc, then the tool will generate FooParser.java and FooLexer.java based on the filename. You can override that default naming using the options PARSER_CLASS and LEXER_CLASS respectively. You might prefer to use BASE_NAME. If you set:
BASE_NAME=Foo;
at the top of the grammar, then you set the parser class to FooParser and the lexer class to FooLexer in one fell swoop. If you set:
BASE_NAME="";
then the parser class is Parser and the lexer class is Lexer, which some people prefer.
Unlike its predecessor, CongoCC does not generate any code into the *default* or *unnamed* package. Instead, it puts the parser and lexer code into a package that you can specify via the PARSER_PACKAGE setting. And, assuming you have tree-building turned on, it generates the various parse tree nodes in a package that can be specified via the NODE_PACKAGE setting. If you don't set either of those, the packages are named based on the parser class name. So, if your parser class name is FooParser, it will create a package called fooparser, and the node package (assuming it is also unspecified) will be fooparser.ast. All of that is generated relative to the location of the grammar file, unless you override that with BASE_SRC_DIR (which most people will do!).
BASE_SRC_DIR is either an absolute directory on the file system or, more likely, a directory relative to where the grammar file is. If you don't specify it, that is the same as saying:
BASE_SRC_DIR=".";
Something like:
BASE_SRC_DIR="../../build/generated-code";
would be pretty typical. Note also that BASE_SRC_DIR is one of the handful of settings that can also be set on the command line, via the -d option. If it is set both in the grammar and on the command line, the command-line setting takes precedence.
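Put together, the naming settings above might look like the following at the top of a grammar. This is only a sketch: the package names are hypothetical, and writing the package values as quoted strings is an assumption made by analogy with BASE_SRC_DIR.

```
// Parser class becomes FooParser, lexer class becomes FooLexer.
BASE_NAME=Foo;
// Hypothetical package for the generated parser and lexer.
PARSER_PACKAGE="com.example.foo";
// Hypothetical package for the generated parse tree nodes.
NODE_PACKAGE="com.example.foo.ast";
// Output root, relative to the location of the grammar file.
BASE_SRC_DIR="../../build/generated-code";
```

With a header like this and no -d on the command line, the generated code would be expected to land under ../../build/generated-code relative to the grammar file.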
Options relating to Tree Building
By default, tree-building is on. You can turn it off via:
TREE_BUILDING_ENABLED=false;
If you want tree-building to be enabled, but to be off by default, you can use:
TREE_BUILDING_DEFAULT=false;
In CongoCC (as in JavaCC 21, but not legacy JavaCC), tokens are considered the terminal nodes in the parse tree and are added to the tree. You can disable that via:
TOKENS_ARE_NODES=false;
The default tree-building pattern is that a production creates a node if there are two (or more) nodes on the stack when the production exits. That is called “smart node creation” and is usually what one wants, though not always. You can turn it off via:
SMART_NODE_CREATION=false;
The tree-building machinery will then create a new node for a production whenever one or more nodes were created within it. If you don't want any nodes created by default, you can use:
NODE_DEFAULT_VOID=true;
or just:
NODE_DEFAULT_VOID;
for short. (All these boolean-valued settings can be set to true by just writing them with no value.)
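As a sketch, a grammar that keeps tree-building on but wants a node for every production and no token nodes could combine the settings above as follows. Whether this particular combination suits a given grammar is an assumption; the comments only restate the behavior described in this section.

```
// Tree building stays enabled (the default), so no TREE_BUILDING_ENABLED line is needed.

// Do not add tokens to the tree as terminal nodes.
TOKENS_ARE_NODES=false;

// Create a node for a production whenever it produced one or more nodes,
// rather than only when two or more nodes are on the stack.
SMART_NODE_CREATION=false;
```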
Options Relating to Niggling Whitespace Issues
By default, CongoCC normalizes newlines to a lone line-feed character, i.e. it converts CR or CR-LF to LF (\n). If you want to preserve the newlines as they were in the input, you can write:
PRESERVE_LINE_ENDINGS;
at the top of your grammar. By default, hard tab characters are left as they are in the input and also, for error reporting purposes, are treated as one horizontal space. This is not usually what you want. If you write:
TAB_SIZE=4;
at the top of the grammar, the tabs are converted to spaces on the basis of the tab stops being at 4-space intervals. (Or whatever interval you want, of course.) If you really want to preserve the original tab characters, you can also write:
PRESERVE_TABS;
If you want to ensure that the input to the parser ends with a final newline (i.e. it gets tacked on if it's not there) you can use:
ENSURE_FINAL_EOL;
If you want to ensure that the input to a parser ends with some specific string, possibly a control character, you can use the more general:
TERMINATING_STRING="\u001A";
In the above case, the hex 1A is tacked on if it is not there. That is a CTRL-Z control character. But the terminating string can be anything you specify. By the way, ENSURE_FINAL_EOL is exactly the same thing as TERMINATING_STRING="\n".
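For instance, a grammar for a line-oriented format might use a header like the one below. It is a sketch, not a recommendation: the tab size of 8 is an arbitrary choice, and the example in the comment assumes conventional tab-stop arithmetic (a tab advances to the next multiple of TAB_SIZE), which is how the description of tab stops above is read here.

```
// Keep CR and CR-LF line endings exactly as they appear in the input.
PRESERVE_LINE_ENDINGS;

// Expand tabs to spaces with tab stops every 8 columns.
// Under the usual tab-stop rule, "ab<TAB>c" would become "ab", six spaces, then "c".
TAB_SIZE=8;

// Append "\n" if the input does not already end with it.
ENSURE_FINAL_EOL;
```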
Settings relating to Lexical Processing
If you want certain token types to be inactive by default (though presumably turned on at key spots) you can use the DEACTIVATE_TOKENS setting.
DEACTIVATE_TOKENS=LPAREN,RPAREN;
means that, by default, those tokens are inactive. You can use:
DEFAULT_LEXICAL_STATE=JAVA;
to specify that the parser, by default, starts in the JAVA lexical state. If this setting is unused, the default lexical state is taken to be one named DEFAULT.
You can use the EXTRA_TOKENS setting to specify extra token types that are not defined with regular expressions in the lexical grammar. This can be particularly useful in token hook routines, in particular for generating synthetic tokens. For example, this is how the synthetic INDENT and DEDENT tokens in Python are handled. You will see the line:
EXTRA_TOKENS=INDENT,DEDENT;
which defines these token types, except they are nowhere to be found in the lexical grammar!
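The three lexical settings above can appear together in one header. In this sketch the lexical state name and the token names are simply the ones used as examples in this section; a real grammar would substitute its own.

```
// Start lexing in the JAVA lexical state instead of one named DEFAULT.
DEFAULT_LEXICAL_STATE=JAVA;

// LPAREN and RPAREN are inactive by default.
DEACTIVATE_TOKENS=LPAREN,RPAREN;

// Declare token types that have no regular expression in the lexical grammar,
// for example synthetic tokens produced by a token hook routine.
EXTRA_TOKENS=INDENT,DEDENT;
```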
Settings related to Token class generation
One feature that is sometimes useful is token chaining. You can insert a synthetic token into the chain of tokens. This is actually a very tricky, error-prone usage pattern that is off by default. And we recommend that you only turn it on if you really need it! If you examine the included Python grammar, you will see this in use. You turn it on via:
TOKEN_CHAINING;
Conversely, you can turn on the MINIMAL_TOKEN setting to generate a minimal token. With that turned on, Token does not have an image field. It has a getImage() method that uses its beginOffset/endOffset to get its string image on demand, but there is no setImage() method. There are some (possibly tricky) coding patterns that involve setting a token's String image to a different value from what was read in. But if you don't need that, you can use MINIMAL_TOKEN, which also reduces the memory footprint of your token objects.
Settings related to generating a fault-tolerant parser
You can set
FAULT_TOLERANT;
at the top of your grammar to turn on the experimental support for building a fault-tolerant parser. It is off by default.