Stanford Tokenizer options for MATE Parser

These are the Stanford tokenizer options I use to preprocess my data before parsing it with the MATE Parser:

normalizeParentheses=false,
normalizeOtherBrackets=false,
untokenizable=allKeep,
escapeForwardSlashAsterisk=false
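
The tokenizer takes these settings as one comma-separated string with no spaces. The following is a minimal sketch of how that string and a matching command line could be assembled. It assumes the tokenizer is run through the `edu.stanford.nlp.process.PTBTokenizer` class with its `-options` flag; the jar name `stanford-corenlp.jar`, the file name `input.txt`, and the helper names are placeholders of mine, so check them against your own installation.

```python
import shlex

# The four settings from the list above, in the same order.
OPTIONS = {
    "normalizeParentheses": "false",
    "normalizeOtherBrackets": "false",
    "untokenizable": "allKeep",
    "escapeForwardSlashAsterisk": "false",
}


def options_string(options):
    """Join key=value pairs with commas and no spaces in between."""
    return ",".join(f"{key}={value}" for key, value in options.items())


def tokenizer_command(input_path, classpath="stanford-corenlp.jar"):
    """Build (but do not run) a tokenizer command line.

    The classpath default and the input path are placeholders (assumptions),
    not values taken from the Stanford documentation.
    """
    return [
        "java", "-cp", classpath,
        "edu.stanford.nlp.process.PTBTokenizer",
        "-options", options_string(OPTIONS),
        input_path,
    ]


if __name__ == "__main__":
    # Print a shell-ready version of the command for copy and paste.
    print(shlex.join(tokenizer_command("input.txt")))
```

Keeping the settings in a dictionary makes it easy to change one value without retyping the whole string.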

The tokenizer documentation explains these options as follows:

  • normalizeParentheses: Whether to map round parentheses to -LRB-, -RRB-, as in the Penn Treebank
  • normalizeOtherBrackets: Whether to map other common bracket characters to -LCB-, -LRB-, -RCB-, -RRB-, roughly as in the Penn Treebank
  • untokenizable: What to do with untokenizable characters (ones not known to the tokenizer). Six options combining whether to log a warning for none, the first, or all, and whether to delete them or to include them as single character tokens in the output: noneDelete, firstDelete, allDelete, noneKeep, firstKeep, allKeep. The default is “firstDelete”.
  • escapeForwardSlashAsterisk: Whether to put a backslash escape in front of / and * as the old PTB3 WSJ does for some reason (something to do with Lisp readers??).
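
The six `untokenizable` values are just the combinations of two independent choices: which warnings to log (none, first, all) and what to do with the character (delete, keep). As a small illustration of that naming scheme, here is a sketch that splits a value into its two parts. The function `describe_untokenizable` is a hypothetical helper of mine, not part of the Stanford tokenizer.

```python
import re

# Value names follow the pattern <which warnings to log><what to do>.
_PATTERN = re.compile(r"^(none|first|all)(Delete|Keep)$")


def describe_untokenizable(value):
    """Split an untokenizable setting into its warning policy and action."""
    match = _PATTERN.match(value)
    if match is None:
        raise ValueError(f"unknown untokenizable setting: {value!r}")
    warn, action = match.groups()
    # keep=True means the character is emitted as a single-character token.
    return {"warn": warn, "keep": action == "Keep"}


if __name__ == "__main__":
    for name in ("firstDelete", "allKeep"):
        print(name, describe_untokenizable(name))
```

Read this way, allKeep (the value I use) logs a warning for every unknown character and keeps each one as its own token, while the default firstDelete warns only once and drops them.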