Wednesday, 22 October 2008
I have been experimenting with the McDonald parser for the last few days, and I think it is very unlikely to work well with Local Optimality Parsing. For the record, here is what I have done. I trained the projective 1st-order McDonald parser on the Danish Dependency Treebank (Java memory errors prevented me from using more than the first 50,000 words). I then analyzed all prefixes of the first two sentences in the test corpus, in order to determine the incremental behaviour of the McDonald model (the examples below are translated from Danish into English, and a sketch of the prefix loop follows them). The optimal MST analysis of the first example ("Two Russian historians Andronik Mirganjan and Igor Klamkin believe not that Russia can be-developed without an 'iron fist'.") is rather poor: three subjects and an invalid treatment of coordination. What is worse, the optimal analysis changes unpredictably back and forth as more words are added to the input. Eg:
1. "Two well-known Russian historians Andronik": "Two" is the head and "Andronik" a nominal object of "Two".
2. "Two well-known Russian historians Andronik Mirganjan": "Mirganjan" is the head, "Andronik" a first name to "Mirganjan", and "Two" is a modifier of "Mirganjan".
3. "Two well-known Russian historians Andronik Mirganjan and": "Mirganjan" becomes a nominal object of "Two".
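Schematically, the prefix experiment amounts to a loop like the following. This is only a sketch: the `DependencyParser` interface below is a stand-in I made up for illustration, not the actual MSTParser API.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the trained parser; the real MSTParser API differs.
interface DependencyParser {
    // Returns, for each token, the index of its head (0 = artificial root).
    int[] parse(List<String> tokens);
}

public class PrefixParseCheck {
    // Parse every prefix of a sentence and print the head assignments,
    // so that reattachments like the ones listed above become visible.
    static void parsePrefixes(DependencyParser parser, List<String> tokens) {
        for (int n = 1; n <= tokens.size(); n++) {
            List<String> prefix = tokens.subList(0, n);
            int[] heads = parser.parse(prefix);
            System.out.println(prefix + " -> " + Arrays.toString(heads));
        }
    }
}
```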
I have also tried looking at the edge weights produced by the MST parser, in order to see whether they could be used for error detection. However, although large negative scores sometimes correlate with real errors, they often do not: real errors may be assigned large positive edge weights, and perfectly fine edges may be assigned a large negative score (eg, "believe" and "that").
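In code, this error-detection heuristic boils down to thresholding edge weights, roughly as in the sketch below (the `ScoredEdge` class is hypothetical, not part of MSTParser). The problem is not the code but the fact that the threshold separates real errors from correct edges so poorly.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical edge record; the head/dependent indices and the score
// would come from the trained MST model.
class ScoredEdge {
    final int head, dependent;
    final double score;
    ScoredEdge(int head, int dependent, double score) {
        this.head = head;
        this.dependent = dependent;
        this.score = score;
    }
}

class EdgeWeightCheck {
    // Flag the edges whose weight falls below a hand-picked threshold.
    static List<ScoredEdge> suspiciousEdges(List<ScoredEdge> edges, double threshold) {
        List<ScoredEdge> flagged = new ArrayList<ScoredEdge>();
        for (ScoredEdge e : edges) {
            if (e.score < threshold) {
                flagged.add(e);
            }
        }
        return flagged;
    }
}
```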
I think there are three main lessons to be learned from this:
1. The edge-factoring in the MST model means that it will make quite a number of really stupid mistakes (several subjects, illegal coordination structures) that will be difficult to recover from in a local optimality parser.
2. The discriminative scoring means that the scores may have very little to do with probabilities, so they do not really provide any support for error-driven parsing.
3. It is not enough that the probability model works well on full sentences; it must also work well on sentence prefixes, ie, the incremental behaviour of the probability model is really important for incremental parsing.
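To make lessons 1 and 2 a bit more concrete: in a first-order edge-factored model, the score of a dependency tree is just a sum of independently scored edges, and with discriminative training these scores are arbitrary real numbers rather than log-probabilities. A generative model, by contrast, assigns a proper joint probability that can at least in principle be compared across analyses and prefixes. Schematically:

```latex
% Edge-factored (first-order) MST score: each edge (h, d) is scored in
% isolation, so nothing penalizes, e.g., attaching several subjects to
% the same verb.
s(x, y) \;=\; \sum_{(h,d) \in y} \mathbf{w} \cdot \mathbf{f}(x, h, d)

% Generative alternative (schematic): a product of conditional
% probabilities, i.e. a sum of log-probabilities.
\log P(x, y) \;=\; \sum_{(h,d) \in y} \log P(d \mid h, \ldots)
```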
In sum, I have reached a point where I do not think an implementation of local optimality parsing with the McDonald probability model is going to work. It would have been nice if it could have provided a quick "proof of concept", but it didn't -- which means that we haven't really learned a lot. Rather than spending more time on the McDonald model, I will now concentrate on the "real" OSDT development. When we have reached the point where the toolkit supports the implementation and training of the Collins model, I will see how far I can push the Collins model in terms of error detection. If that fails to lead to a "proof of concept", the next step will be to try the complex generative probability model that I have presented in my dissertation.
Monday, 6 October 2008
The new OSDT blog
I just realized that a long time has gone by in which Martin and I have disappeared into a black hole of busy Java coding, without leaving much of a trace to the outside world. So I decided to join the crowd and create a blog for the OSDT project that will allow us to keep track of the latest progress and developments, in a somewhat informal way.
So what is the latest status? The Google code website now contains an outline of the entire OSDT system, with (incomplete) interfaces for a lot of the modules. A lot of work remains to be done, but progress is steady and I now have the feeling that the coding complexity of the system is perfectly manageable. I am pretty confident that we will have a working system in the spring.
If you want to track the progress, don't forget that we agreed to use the issue tracking system at Google code -- I am keeping it up-to-date so that you can at least see what I am up to at the moment.
I have been busy working on the graph module, which will make it possible to store a graph that consists of an arbitrary number of layers, each containing an arbitrary number of nodes and edges. Each layer specifies the features that the nodes and edges in the layer are allowed to have, and it also specifies the graph functions associated with all nodes and edges in the layer: functions that can be evaluated at any node or edge in the layer, and which may in turn depend on other functions evaluated at the same node/edge or at its neighbours in the graph. In other words, layer functions are an abstraction that covers functions like graph height and subtree node count, the individual generative log-probabilities in a generative probability model, or the log-probability feature vectors and feature weights for a feature-based system.
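To make the idea concrete, here is a minimal sketch of what such a layer function could look like. None of these interfaces are the actual OSDT classes; the names are made up for illustration.

```java
// Hypothetical node abstraction: features plus access to graph neighbours.
interface Node {
    Iterable<Node> neighbours();
    Object feature(String name);
}

// A function attached to a layer, evaluable at any node of that layer.
// It may look up other function values at the node or its neighbours
// through the evaluator, which is what creates the dependency structure.
interface LayerFunction<T> {
    T evaluate(Node node, FunctionEvaluator evaluator);
}

interface FunctionEvaluator {
    <T> T value(LayerFunction<T> function, Node node);
}

// Example: a trivial layer function that counts a node's neighbours.
class DegreeFunction implements LayerFunction<Integer> {
    public Integer evaluate(Node node, FunctionEvaluator evaluator) {
        int degree = 0;
        for (Node neighbour : node.neighbours()) {
            degree++;
        }
        return degree;
    }
}
```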
This would be really dull stuff, if it wasn't for the fact that incremental graph updates are the most important and time-consuming activity in a repair-based system: at every processing step, we need to evaluate a really large number of minor graph modifications to a huge and complex graph. This means that we don't want to destroy the original graph before we actually decide on which modification to apply, we don't want to copy the original graph, and we don't want to reevaluate all the functions at all the nodes and edges in the original graph unless we really have to.
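The non-destructive part can be sketched very simply (hypothetical names again): a candidate modification is recorded as a small delta on top of the unchanged base graph, and lookups fall back to the base for everything the delta does not override.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical copy-on-write overlay for the features of one node or edge:
// candidate repairs can be scored and discarded without copying or
// mutating the base graph.
class FeatureOverlay {
    private final Map<String, Object> baseFeatures;
    private final Map<String, Object> overrides = new HashMap<String, Object>();

    FeatureOverlay(Map<String, Object> baseFeatures) {
        this.baseFeatures = baseFeatures;
    }

    void set(String feature, Object value) {
        overrides.put(feature, value);      // the base map is never touched
    }

    Object get(String feature) {
        if (overrides.containsKey(feature)) {
            return overrides.get(feature);
        }
        return baseFeatures.get(feature);
    }
}
```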
This calls for an implementation where we can create a new graph as a copy of a base graph, with a number of subsequent modifications that do not change the original graph. It also means that we need a Makefile-like update system where we record the functional dependencies between the local properties in the graph (ie, the local functions, the local features, and the local graph structure at nodes and edges in the graph), and where a change of a local property (eg, as a consequence of a change of a feature or of the graph structure) propagates into a chain of reevaluations of all the local functions that are affected by these changes. I have implemented most of this machinery now -- the function updates are the last piece in the puzzle, and I am working on it right now.
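The Makefile-like bookkeeping could look roughly as follows (again a hypothetical sketch, not the OSDT code): every cached value records which local properties it was computed from, and a change to a property invalidates exactly the values that depend on it, directly or transitively.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical dependency tracker: maps each local property to the
// properties whose cached values were computed from it.
class DependencyTracker {
    private final Map<String, Set<String>> dependents =
            new HashMap<String, Set<String>>();

    // Declare that 'property' was computed from 'dependsOn'.
    void recordDependency(String property, String dependsOn) {
        Set<String> set = dependents.get(dependsOn);
        if (set == null) {
            set = new HashSet<String>();
            dependents.put(dependsOn, set);
        }
        set.add(property);
    }

    // Everything that must be reevaluated after 'changed' has changed.
    Set<String> invalidate(String changed) {
        Set<String> stale = new HashSet<String>();
        collect(changed, stale);
        return stale;
    }

    private void collect(String property, Set<String> stale) {
        Set<String> direct = dependents.get(property);
        if (direct == null) {
            return;
        }
        for (String p : direct) {
            if (stale.add(p)) {
                collect(p, stale);   // follow transitive dependents once each
            }
        }
    }
}
```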
Another change, compared to June, is that Marisa's comments about separating the graph and the graph interpreter have been vindicated. She was right and I was wrong. The graph as such is just an uninterpreted bunch of layers, nodes, and edges, so we need some components that take care of actually interpreting the different substructures in the graph. For this purpose, it is convenient to have a separate object (a view) that is responsible for interpreting the nodes and edges in a particular part of the graph. These views can be nested, and the OSDT code now contains views for trees, projective trees, word alignments, etc. This code hasn't quite stabilized yet, but it is getting close.
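Schematically, a view might be no more than this (hypothetical interfaces, reusing the `Node` type from the layer-function sketch above): it wraps a region of the graph and exposes it under an interpretation, and a larger view can delegate sub-regions to nested views.

```java
// Hypothetical view interfaces: the graph itself stays an uninterpreted
// collection of layers, nodes, and edges; a view interprets part of it.
interface GraphView {
    // E.g. a tree view would check acyclicity and the single-head constraint.
    boolean isWellFormed();
}

// A view that interprets its part of the graph as a (dependency) tree.
interface TreeView extends GraphView {
    Node root();
    Iterable<Node> children(Node node);
}
```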
Oh, did I remember to tell you about the 160 unit tests? Sounds like a lot, but it is far too little -- try turning on the unit test coverage module in NetBeans, and it will tell you the depressingly low percentage of the code that is covered by the tests. Nevertheless, unit tests have saved me an inordinate amount of time when restructuring the code, so if you contribute OSDT code, please don't forget the unit tests!!!
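For what it's worth, a test in that spirit for the overlay sketched earlier could look like this (JUnit 4; `FeatureOverlay` is the hypothetical class from the sketch above, not an OSDT class):

```java
import static org.junit.Assert.assertEquals;

import java.util.HashMap;
import java.util.Map;

import org.junit.Test;

public class FeatureOverlayTest {
    // Overriding a feature in the overlay must leave the base graph untouched.
    @Test
    public void overrideDoesNotTouchBase() {
        Map<String, Object> base = new HashMap<String, Object>();
        base.put("pos", "NN");
        FeatureOverlay overlay = new FeatureOverlay(base);

        overlay.set("pos", "VB");

        assertEquals("VB", overlay.get("pos"));
        assertEquals("NN", base.get("pos"));
    }
}
```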
I know Martin has been working on integrating the McDonald probability model into the OSDT toolkit, and he has also been working on the XHPM learning algorithm. John and his student Zhong Chen are currently looking at the search algorithms in the parser. When the core OSDT code enters a more stable phase, we will also be in a good position to incorporate Marco's graph query system based on first-order logic, and Keith's code for expectation maximization. But I would violate the personalized and highly opinionated spirit of a blog if I didn't leave it to Martin, John, Zhong, Marco, and Keith to fill in the details...