Marseille INRIA Columbia AT&T (MICA)
- MICA is a dependency parser trained on the Penn Treebank.
- MICA can associate dependency parses with rich linguistic
information such as voice, the presence of empty subjects (PRO),
wh-movement, and whether a verb heads a relative clause.
- MICA is fast (450 words per second plus 6 seconds initialization
on a standard high-end machine) and has state-of-the-art performance
(87.6% unlabeled dependency accuracy on the Penn Treebank).
- MICA consists of two processes: the supertagger, which associates
tags representing rich syntactic information with the input word
sequence, and the actual parser, based on the INRIA SYNTAX system,
which derives the syntactic structure from the n-best chosen
supertags. Only the supertagger uses lexical information; the parser
sees only the supertag hypotheses.
- MICA returns n-best parses for arbitrary n; each parse tree is
associated with a probability. A packed forest can also be returned.
MICA is still under development. The documentation in particular is
a work in progress.
- A minimal installation and user guide can be found in the Readme
file (see the Download section).
- A description of the output format of MICA can be found here.
- An overview of the parser can be found in the paper.
- The grammar implemented in MICA can be found in this file.
Each line in the file corresponds to an elementary tree.
Here is an example:
t27 S##1#l# NP#0#2#l#s NP#0#2#r#s VP##3#l# V##4#l#h V##4#r#h NP#1#5#l#s NP#1#5#r#s VP##3#r# S##1#r#
This is the tree t27 for the basic transitive verb.
The nodes of the tree are listed in a depth-first, left-to-right traversal.
Each node is listed twice: once when descending, and again when ascending.
For leaf nodes, this means that the two entries appear right next to each other.
Each node entry consists of fields separated by #. Using the entry NP#0#2#l#s from the example above, the fields are:
- NP - node label
- 0 - deep argument position (this is only filled for substitution nodes)
- 2 - the number of the node in the tree; each node has a unique number, but since each node is listed twice in this enumeration, each number will occur exactly twice
- l - marks this entry as the left version of the node; the right version (r) also occurs, in this case right next to it since it is a leaf node
- s - the type of node; this can be s for substitution, h for head, c for co-head (for strongly governed prepositions) or nothing
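The entry format described above can be read back into a tree with a simple stack. The following is a minimal sketch in Python, not part of MICA itself: the names Node, parse_tree_line and bracketed are hypothetical, the bracketed output is only an illustrative display format, and the code assumes that entries are separated by whitespace and that node labels never contain the # character.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    label: str               # node label, e.g. "NP"
    arg_pos: Optional[int]   # deep argument position (substitution nodes only)
    number: int              # number of the node within the tree
    node_type: str           # "s", "h", "c", or "" when the field is empty
    children: List["Node"] = field(default_factory=list)


def parse_tree_line(line: str) -> Tuple[str, Node]:
    """Rebuild one elementary tree from a line of the grammar file.

    Assumption: every entry has exactly five '#'-separated fields:
    label, argument position, node number, side (l/r), node type.
    """
    name, *entries = line.split()
    stack: List[Node] = []
    root: Optional[Node] = None
    for entry in entries:
        label, arg_pos, number, side, node_type = entry.split("#")
        if side == "l":
            # Descending: create the node and attach it to its parent.
            node = Node(label, int(arg_pos) if arg_pos else None,
                        int(number), node_type)
            if stack:
                stack[-1].children.append(node)
            else:
                root = node
            stack.append(node)
        elif side == "r":
            # Ascending: the entry must close the most recently opened node.
            node = stack.pop()
            if node.number != int(number):
                raise ValueError(f"mismatched entry {entry!r} in {name}")
        else:
            raise ValueError(f"unknown side {side!r} in {entry!r}")
    if stack or root is None:
        raise ValueError(f"unbalanced or empty tree {name}")
    return name, root


def bracketed(node: Node) -> str:
    """Render a tree in an ad hoc bracketed notation (label/type)."""
    tag = node.label + (f"/{node.node_type}" if node.node_type else "")
    if not node.children:
        return tag
    return "(" + tag + " " + " ".join(bracketed(c) for c in node.children) + ")"


T27 = ("t27 S##1#l# NP#0#2#l#s NP#0#2#r#s VP##3#l# V##4#l#h V##4#r#h "
       "NP#1#5#l#s NP#1#5#r#s VP##3#r# S##1#r#")

name, root = parse_tree_line(T27)
print(name, bracketed(root))
```

For the tree t27 shown above, this yields a root S with a substitution node NP (argument position 0) and a VP that dominates the head V and a second substitution node NP (argument position 1).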
A more readable version of the grammar is available here. Do not print it: it is over 1,000 pages long.
- Theoretical aspects of the parser are described in a series
of technical papers
- S. Bangalore, P. Haffner, Classification of Large Label Sets,
in Proceedings of the Snowbird Learning Workshop, 2005
- S. Bangalore, A. Joshi, Supertagging: An Approach to
Almost Parsing, in Computational Linguistics, 25(2), 1999
- A. Nasr, O. Rambow, Parsing with Lexicalized Probabilistic Recursive Transition Networks,
in Finite-State Methods and Natural Language Processing 2005, LNAI 4002
- A. Nasr, O. Rambow, A Simple String-Rewriting Formalism for Dependency Grammar, in Workshop on Recent Advances in Dependency Grammar -
- A. Nasr, O. Rambow, Supertagging and Full Parsing, in 7th International Workshop
on Tree Adjoining Grammars and Related Frameworks - TAG+ 2004
Information concerning MICA is provided through two mailing lists:
is dedicated to messages concerning new releases of MICA.
is for exchange of information among MICA users. Check the mailing
for past contributions.