These are the options I use for the Stanford tokenizer to preprocess my data for parsing with the MATE Parser:
normalizeParentheses=false, normalizeOtherBrackets=false, untokenizable=allKeep, escapeForwardSlashAsterisk=false
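As a minimal sketch of how this option string could be put together in code: the class `TokenizerOptions` and its methods are hypothetical helpers of mine, not part of the Stanford tools, and they only build the comma-separated key=value string (joined here without spaces). Passing it to the tokenizer is shown only as a comment, on the assumption that the Stanford jar is on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the Stanford tools: it only builds the
// comma-separated "key=value" option string listed above.
final class TokenizerOptions {

    // LinkedHashMap keeps the options in insertion order.
    static Map<String, String> mateOptions() {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("normalizeParentheses", "false");
        options.put("normalizeOtherBrackets", "false");
        options.put("untokenizable", "allKeep");
        options.put("escapeForwardSlashAsterisk", "false");
        return options;
    }

    // Join the entries as key=value pairs separated by commas.
    static String toOptionString(Map<String, String> options) {
        return options.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        String optionString = toOptionString(mateOptions());
        System.out.println(optionString);
        // Assumption: with the Stanford jar available, this string would be
        // handed to the tokenizer, for example via
        // PTBTokenizer.factory(new CoreLabelTokenFactory(), optionString)
    }
}
```

Keeping the options in a map makes it easy to switch a single setting without retyping the whole string.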
The documentation explains these options as follows:
- normalizeParentheses: Whether to map round parentheses to -LRB-, -RRB-, as in the Penn Treebank
- normalizeOtherBrackets: Whether to map other common bracket characters to -LCB-, -LRB-, -RCB-, -RRB-, roughly as in the Penn Treebank
- untokenizable: What to do with untokenizable characters (ones not known to the tokenizer). Six options combining whether to log a warning for none, the first, or all, and whether to delete them or to include them as single character tokens in the output: noneDelete, firstDelete, allDelete, noneKeep, firstKeep, allKeep. The default is “firstDelete”.
- escapeForwardSlashAsterisk: Whether to put a backslash escape in front of / and * as the old PTB3 WSJ does for some reason (something to do with Lisp readers??).
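To make the effect of two of these switches concrete, here is a toy illustration of the rewriting that normalizeParentheses and escapeForwardSlashAsterisk describe. It is not the Stanford implementation: `PtbEscapingDemo` and `render` are hypothetical names, and the sketch only mimics the mappings named in the documentation text above (round parentheses to -LRB-/-RRB-, a backslash in front of / and *).

```java
// Toy illustration only: this is NOT the Stanford tokenizer code.
final class PtbEscapingDemo {

    // Returns the token as it would look with the given switches turned on or off.
    static String render(String token, boolean normalizeParentheses,
                         boolean escapeForwardSlashAsterisk) {
        if (normalizeParentheses) {
            if (token.equals("(")) return "-LRB-";
            if (token.equals(")")) return "-RRB-";
        }
        if (escapeForwardSlashAsterisk) {
            // Put a backslash in front of every / and *.
            token = token.replace("/", "\\/").replace("*", "\\*");
        }
        return token;
    }

    public static void main(String[] args) {
        String[] tokens = {"(", "and/or", ")", "*"};
        for (String t : tokens) {
            // Left column: both switches on; right column: both off (my setting).
            System.out.println(render(t, true, true) + "\t" + render(t, false, false));
        }
    }
}
```

With both switches set to false, as in my configuration, the tokens reach the parser exactly as they appear in the input text.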