Wednesday, 22 October 2008

Experiments with the projective 1st order McDonald model

I have been experimenting with the McDonald parser for the last few days, and I think it is very unlikely to work well with Local Optimality Parsing. For the record, here is what I have done. I trained the projective 1st-order McDonald parser on the Danish Dependency Treebank (Java memory errors prevented me from using more than the first 50,000 words). I then parsed all prefixes of the first two sentences in the test corpus, in order to determine the incremental behaviour of the McDonald model; a sketch of the prefix loop is given after the examples below. The optimal MST analysis of the first sentence (translated to English: "Two Russian historians Andronik Mirganjan and Igor Klamkin believe not that Russia can be-developed without an 'iron fist'.") is rather poor: three subjects and an invalid treatment of the coordination. What is worse, the optimal analysis changes unpredictably back and forth as more words are added to the input. For example:

1. "Two well-known Russian historians Andronik": "Two" is the head and "Andronik" a nominal object of "Two".
2. "Two well-known Russian historians Andronik Mirganjan": "Mirganjan" is the head, "Andronik" a first name to "Mirganjan", and "Two" is a modifier of "Mirganjan".
3. "Two well-known Russian historians Andronik Mirganjan and": "Mirganjan" becomes a nominal object of "Two".

I have also tried looking at the edge weights produced by the MST parser, in order to see whether they could be used for error detection. However, although large negative scores sometimes correlate with real errors, they often do not: real errors may be assigned large positive edge weights, and perfectly fine edges may be assigned a large negative score (e.g., the edge between "believe" and "that").
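
The check itself is trivial: line up the weight of each predicted edge against whether that edge is actually in the gold tree, and see whether the errors sink to one end of the list. A sketch, with hypothetical inputs (one predicted head, one edge weight, and one gold head per token):

```python
def edge_weights_vs_errors(predicted_heads, edge_weights, gold_heads):
    """Sort tokens by the weight of their predicted incoming edge and
    mark which attachments are wrong.  If low (negative) weights were
    a usable error signal, the errors would cluster at the top of the
    sorted output; in the experiment described above they do not.
    """
    rows = sorted(zip(edge_weights, range(len(gold_heads))))
    for weight, i in rows:
        status = "ERROR" if predicted_heads[i] != gold_heads[i] else "ok"
        print("token %3d  weight %8.3f  %s" % (i, weight, status))
```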

I think there are three main lessons to be learned from this:

1. The edge factoring in the MST model means that it will make quite a number of really stupid mistakes (several subjects, illegal coordination structures) that will be difficult to recover from in a local optimality parser.
2. The discriminative scoring means that the scores may have very little to do with probabilities, so they do not really provide any support for error-driven parsing (see the sketch below).
3. It is not enough that the probability model works well on full sentences; it must also work well on sentence prefixes. In other words, the incremental behaviour of the probability model really matters for incremental parsing.
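
To make lesson (2) concrete: the only way to read arc-factored discriminative scores as anything probability-like is to renormalise them after the fact, for instance with a softmax over the candidate heads of each dependent. This is purely an illustration of the mismatch, not something the training criterion ever optimised, so the resulting numbers carry no calibration guarantee:

```python
import math

def head_distribution(candidate_scores):
    """Softmax-normalise the raw edge scores for one dependent (one
    score per candidate head) into a distribution over heads.  An
    after-the-fact renormalisation like this is the closest one can
    get to 'probabilities' from the discriminatively trained model,
    which is exactly why the scores give so little support for
    error-driven parsing.
    """
    m = max(candidate_scores)
    exps = [math.exp(s - m) for s in candidate_scores]
    z = sum(exps)
    return [e / z for e in exps]
```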

In sum, I have reached a point where I do not think an implementation of local optimality parsing with the McDonald probability model is going to work. It would have been nice if it could have provided a quick "proof of concept", but it didn't -- which means that we haven't really learned a lot. Rather than spending more time on the McDonald model, I will now concentrate on the "real" OSDT development. When we have reached the point where the toolkit supports the implementation and training of the Collins model, I will see how far I can push it in terms of error detection. If that fails to lead to a "proof of concept", the next step will be to try the complex generative probability model that I presented in my dissertation.

1 comment:

Keith said...

Hi Matthias. I'm a little surprised by the results. John and I did experiments with a very similar model using my k-best MST parser, and we found good correlation between regions of temporary ambiguity and entropy (we computed a truncated entropy over the 50-best analyses). I wonder if the 1-best is just too noisy and going deeper down into the hypothesis space helps.
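
One plausible reading of the truncated entropy Keith mentions, since the comment does not spell out the normalisation, is the Shannon entropy of a softmax over the scores of the 50-best analyses; the sketch below is only that reading, not his actual computation:

```python
import math

def truncated_entropy(kbest_scores):
    """Entropy of the distribution obtained by softmax-normalising the
    scores of the k-best analyses (k = 50 in Keith's experiments).
    This is only one plausible reading of 'truncated entropy'; the
    exact normalisation used in those experiments is not given here.
    """
    m = max(kbest_scores)
    weights = [math.exp(s - m) for s in kbest_scores]
    z = sum(weights)
    probs = [w / z for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)
```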