EACL 2006 is hosting a workshop on wikis and blogs in Trento, Italy.
Among other things, the workshop "Wikis and blogs and other dynamic text
sources" focuses on the type of text produced in these kinds of online
media. But is that text really so different from the text found in your
personal diary or an online news ticker? Yes, somewhat, but that should
not be surprising, and linguists already suffer from bad text. Just
recall the never-ending rants about the Penn Treebank, best known for its
collection of Wall Street Journal articles. From a linguist's perspective
this is really bad text, with lots of dirt in the data. Note, however,
that the WSJ is not some blog.
For a decade or so, parsers for English, mostly stochastic ones, have been
evaluated against that beast of a corpus. Rule-based systems such as ours
have a hard time covering all these irregularities. Irregular speech and
language are quite normal, though: think of speech recognition in noisy
environments, but also of ungrammatical language or child language. Every
aspect of language can deviate from your grammar knowledge for one reason
or another. So systems must cope with
so-called everyday data the same way they do with regular input: as well
as possible and robustly. That's why I like our stuff so much: it does not
implement robustness on top; it is robust out of the box, because parsing
is reduced to a constraint optimization problem (read the papers for more
info). So much for the new challenges.
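
To give a rough feeling for what "parsing as constraint optimization" can mean, here is a toy sketch in Python. It is not our parser: the three constraints, their penalty values and the names `CONSTRAINTS`, `score` and `parse` are all made up for illustration, and the search is a naive enumeration that does not even check for cycles. The point is only that every input gets its best-scoring analysis, so ill-formed input is handled by the same mechanism as well-formed input.

```python
from itertools import product

# Toy model: heads[i] is the head of word i, written as a 1-based word
# position, with 0 meaning "root". All constraints and penalties below are
# hypothetical; a penalty lies in (0, 1] and smaller means more serious.

def det_before_noun(i, head, tags):
    # A determiner should depend on a noun to its right.
    return tags[i] != "DET" or (head > i + 1 and tags[head - 1] == "NOUN")

def noun_under_verb(i, head, tags):
    # A noun should depend on a verb.
    return tags[i] != "NOUN" or (head != 0 and tags[head - 1] == "VERB")

def verb_is_root(i, head, tags):
    # A verb should be the root of the sentence.
    return tags[i] != "VERB" or head == 0

CONSTRAINTS = [
    ("det_before_noun", 0.3, det_before_noun),
    ("noun_under_verb", 0.5, noun_under_verb),
    ("verb_is_root", 0.2, verb_is_root),
]

def score(heads, tags):
    # Multiply the penalties of all violated constraints; 1.0 means no violation.
    total = 1.0
    for i, head in enumerate(heads):
        for _name, penalty, check in CONSTRAINTS:
            if not check(i, head, tags):
                total *= penalty
    return total

def parse(tags):
    # Enumerate every head assignment (no word heads itself) and keep the best.
    n = len(tags)
    options = [[h for h in range(n + 1) if h != i + 1] for i in range(n)]
    return max(product(*options), key=lambda heads: score(heads, tags))

good = ["DET", "NOUN", "VERB"]   # e.g. "the dog barks"
bad = ["NOUN", "DET", "VERB"]    # e.g. "dog the barks"
print(parse(good), score(parse(good), good))
print(parse(bad), score(parse(bad), bad))
```

The scrambled sentence still receives an analysis, just with a score below 1.0, because the misplaced determiner violates a constraint that can be violated at a price. Nothing extra had to be bolted on for that case.
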
Again, wikis are perceived only through their most prominent installation, Wikipedia:
"In contrast to blogs, wikis have high ambitions as regards factual correctness, persistence, editorial quality, and trustworthiness."
(from the call for participation).