Sentence Tokenization: Keeping track of parentheses and quotations
Full sentences may appear inside parentheses. What is the consensus on how to handle those?
Nazia: We also have to consider sentences inside quotations, e.g.:
She turned to him, 'This is great. ' she said.
which currently gets split as: "She turned to him, 'This is great. "
and "' she said. "
Another example (fo, Faroese): ['Pamela Ferguson professari í Dundee fróðskaparsetrinum ger vart við, at “tíðindafólk tykjast at ganga á markinum, um tey almannagera myndir o.s.fr. av fólki undir illgruna.” Hetta gevur teimum fleiri smá støð at goyma seg fyri rándjórum.'] no-split
The tokenizer does not recognize that it should split here: ...fólki undir illgruna.”
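
One possible rule that would cover both examples above, as a rough sketch only and not the tokenizer's current logic: treat closing quotes or brackets that follow a sentence terminator as part of that sentence, and split only if the next word starts with an uppercase letter. The function name split_sentences and the CLOSERS/OPENERS character sets are made up for illustration. The sketch does not track nesting depth, so full sentences inside parentheses in the middle of another sentence would still be split, and abbreviations followed by a capitalized word are not handled.

```python
TERMINATORS = ".!?"
CLOSERS = "\"'”’)]»"   # assumed set of closing quotes/brackets
OPENERS = "\"'“‘([«"   # assumed set of opening quotes/brackets


def _boundary_end(text, j):
    """Return the index just past closing quotes/brackets that belong to
    the sentence whose terminator run ends at index j."""
    n = len(text)
    k = j
    while k < n and text[k] == " ":       # tolerate "great. '" (space before quote)
        k += 1
    m = k
    while m < n and text[m] in CLOSERS:
        m += 1
    # Count them as closers only if whitespace or end of text follows;
    # otherwise the quote most likely opens the next sentence.
    if m > k and (m == n or text[m].isspace()):
        return m
    return j


def split_sentences(text):
    sentences, start, i, n = [], 0, 0, len(text)
    while i < n:
        if text[i] not in TERMINATORS:
            i += 1
            continue
        j = i + 1
        while j < n and text[j] in TERMINATORS:   # runs such as "?!" or "..."
            j += 1
        end = _boundary_end(text, j)
        p = end
        while p < n and text[p].isspace():
            p += 1
        q = p
        while q < n and text[q] in OPENERS:       # next sentence may open a quote
            q += 1
        # Split at end of text, or when whitespace is followed by an uppercase word.
        if p == n or (p > end and q < n and text[q].isupper()):
            sentences.append(text[start:end].strip())
            start, i = p, p
        else:
            i = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

With this rule the English example stays in one piece, because 'she' after the closing quote is lowercase, and the fo example is split after illgruna.” because the next word, Hetta, is capitalized. "o.s.fr. av" is not split since "av" is lowercase.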