If you do sentiment analysis at the document level, there are huge amounts of data annotated with star ratings available on Amazon and similar sites. In theory. In practice, to get this data you need to crawl the Amazon pages, download them, and parse the HTML to extract the individual reviews. And this would be the n-th time somebody has written a script to do that. So, to save you from wasting your time, Andrea Esuli kindly offers some scripts that download Amazon reviews and convert them to a CSV file. Thank you! You can find them on Andrea Esuli’s web page.
normalizeParentheses=false, normalizeOtherBrackets=false, untokenizable=allKeep, escapeForwardSlashAsterisk=false
This is the explanation of the options from the documentation:
- normalizeParentheses: Whether to map round parentheses to -LRB-, -RRB-, as in the Penn Treebank
- normalizeOtherBrackets: Whether to map other common bracket characters to -LCB-, -LRB-, -RCB-, -RRB-, roughly as in the Penn Treebank
- untokenizable: What to do with untokenizable characters (ones not known to the tokenizer). Six options combining whether to log a warning for none, the first, or all, and whether to delete them or to include them as single character tokens in the output: noneDelete, firstDelete, allDelete, noneKeep, firstKeep, allKeep. The default is “firstDelete”.
- escapeForwardSlashAsterisk: Whether to put a backslash escape in front of / and * as the old PTB3 WSJ does for some reason (something to do with Lisp readers??).
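To make the effect of these switches a bit more concrete, here is a small toy sketch in Python. It is not the tokenizer itself and does not call any of its code; it only mimics the token mappings that the option descriptions talk about, applied to a list of tokens that has already been split. The function names, the `PTB_STYLE` dictionary and the sample tokens are made up for illustration, and the names used for square and curly brackets (-LSB-, -RSB-, -LCB-, -RCB-) are my assumption about the usual Penn Treebank convention. The untokenizable option is only parsed here, not simulated.

```python
# Toy illustration only: this is NOT the real tokenizer, just the token
# mappings that the options describe, applied to pre-split tokens.
PAREN_MAP = {"(": "-LRB-", ")": "-RRB-"}
# Assumed Penn Treebank names for square and curly brackets.
OTHER_BRACKET_MAP = {"[": "-LSB-", "]": "-RSB-", "{": "-LCB-", "}": "-RCB-"}


def parse_options(option_string):
    """Turn 'a=false, b=allKeep' into {'a': 'false', 'b': 'allKeep'}."""
    options = {}
    for part in option_string.split(","):
        key, _, value = part.strip().partition("=")
        options[key] = value
    return options


def normalize_tokens(tokens, options):
    """Apply only those mappings whose option is set to 'true'."""
    normalized = []
    for token in tokens:
        if options.get("normalizeParentheses") == "true":
            token = PAREN_MAP.get(token, token)
        if options.get("normalizeOtherBrackets") == "true":
            token = OTHER_BRACKET_MAP.get(token, token)
        if options.get("escapeForwardSlashAsterisk") == "true":
            token = token.replace("/", "\\/").replace("*", "\\*")
        normalized.append(token)
    return normalized


# The option string from this post: every normalization is switched off.
OPTIONS_USED = parse_options(
    "normalizeParentheses=false, normalizeOtherBrackets=false, "
    "untokenizable=allKeep, escapeForwardSlashAsterisk=false"
)
# Hypothetical contrast setting with all three mappings switched on.
PTB_STYLE = {
    "normalizeParentheses": "true",
    "normalizeOtherBrackets": "true",
    "escapeForwardSlashAsterisk": "true",
}

tokens = ["(", "4/5", "stars", ")", "[", "sic", "]"]
print(normalize_tokens(tokens, OPTIONS_USED))  # tokens stay as typed
print(normalize_tokens(tokens, PTB_STYLE))     # brackets and slash rewritten
```

With the options from this post the tokens come out exactly as they were typed, which is usually what you want for review text: a "4/5" in a review stays "4/5" instead of turning into an escaped form that downstream tools then have to undo.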