The Montreal Transducer module for GATE
User guide
Copyright Luc Plamondon, Université de Montréal,
2004.
plamondl@iro.umontreal.ca
$Id: README.html,v 1.4 2004/04/14 19:35:55 plamondl Exp $
Table of contents
- What is GATE?
- What is the Montreal Transducer?
- Getting help
- Installation procedure
- How to use it with the GATE GUI
- How to use it in a standalone GATE program?
- Changes to the JAPE language
- For developers
- Licence
- Change log
1) What is GATE?
GATE is a development environment for language engineering. It is open
source and it can be downloaded from http://gate.ac.uk.
The processing of a document is divided into small tasks that are performed
by independent JavaBeans modules. The Montreal Transducer is one of those
modules.
2) What is the Montreal Transducer?
A transducer has 2 inputs: a document and a human-readable grammar. Generally,
the output is a document with annotations added according to the grammar,
but it could be anything else because the grammar allows Java code to be
executed upon the parsing of a rule. A transducer can be used to identify
named entities in a document, for example.
The GATE framework (2.1 and 2.2) comes with a basic "Jape Transducer",
which is fully described in the GATE user guide, along with the JAPE
grammar language it understands. There is also an "Ontology Aware
Transducer", which is a wrapper around the Jape Transducer (in fact, the
latter's core is already ontology aware), and an "ANNIE Transducer", which
is nothing more than a Jape Transducer that loads a named-entity
recognition grammar.
The Montreal Transducer is an improved Jape Transducer. It is intended
to make grammar authoring easier by providing a more flexible version of
the JAPE language and it also fixes a few bugs.
If you write JAPE grammars, see section Changes
to the JAPE language for all the details. Otherwise, here is
a short description of the enhancements:
a) The improvements
- While only '==' constraints were allowed on annotation attributes,
  the grammar now accepts constraints such as {MyAnnot.attrib != value},
  {MyAnnot.attrib > value}, {MyAnnot.attrib < value},
  {MyAnnot.attrib =~ value} and {MyAnnot.attrib !~ value}.
- The grammar now accepts negated constraints such as {!MyAnnot}
  (true if no annotation starting from the current node has the MyAnnot
  type) and {!MyAnnot.attrib == value} (true if {MyAnnot.attrib == value}
  fails), where '==' can be replaced by any other operator.
- Because the transducer compiles rules at run-time, the classpath must
  include the transducer jar file (unless the transducer is bundled in the
  GATE jar file). The Montreal Transducer updates the classpath
  automatically when it is initialised.
b) The bugs fixed
- Constraints on more than one annotation type for the same node now work.
  For example, {MyAnnot1, MyAnnot2} was accepted by the Jape Transducer
  but not actually implemented.
- The "*" and "+" Kleene operators were not greedy when they occurred
  inside a rule. The document region matched by a rule was correct, but
  ambiguous labels inside the rule were not resolved the expected way.
  In the following rule, for example, a node that matches both constraints
  should be part of the ":titles" label and not ":names", because the
  first "+" is expected to be greedy:
  ({Lookup.majorType == title})+:titles ({Token.orth == upperInitial})*:names
3) Getting help
The reader should be familiar with the JAPE language. See the GATE
user guide (version 1.33), more specifically section 6, "JAPE: Regular
Expressions Over Annotations", and Appendix B, "JAPE: Implementation".
The Montreal Transducer sources are freely available, so user support
will be very limited. You may find what you are looking for on the
project homepage.
Developers will find comments on classes and methods through the javadoc
pages: doc/javadoc/index.html.
4) Installation procedure
Java 1.4 or higher is required. The Montreal Transducer has been tested
on GATE 2.1 and 2.2.
Place the MtlTransducer.jar and creole.xml files in the same directory;
no other installation step is needed.
The directory must be accessible by the embedding application via the
"file:" protocol. Unlike for most GATE modules, the directory
(also known as a repository) of a transducer cannot be a web URL ("http://www...").
This is because the transducer compiles Java code (the grammar rules) every
time it is loaded and the resource jar file must be part of the classpath
when compiling, but only regular file URLs are allowed in the classpath.
The resource will try to add the jar file to the classpath automatically.
If problems arise when loading the transducer, add the jar file to the
classpath manually prior to running the application.
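The "file:" requirement described above can be checked before a repository
is registered. The following is a minimal sketch that uses only the JDK and
is not part of the Montreal Transducer; the class and method names
(RepositoryUrlSketch, isFileUrl, toFileUrl) are hypothetical.

```java
import java.io.File;
import java.net.URI;

final class RepositoryUrlSketch {

    /** Returns true if the given URL string uses the "file" scheme. */
    static boolean isFileUrl(String spec) {
        try {
            // getScheme() is null for relative references; equalsIgnoreCase handles that
            return "file".equalsIgnoreCase(URI.create(spec).getScheme());
        } catch (IllegalArgumentException e) {
            return false; // not a syntactically valid URI
        }
    }

    /** Converts a local directory path into an absolute "file:" URL string. */
    static String toFileUrl(String directory) {
        return new File(directory).toURI().toString();
    }

    public static void main(String[] args) {
        System.out.println(isFileUrl("file:MtlTransducer/build"));
        System.out.println(isFileUrl("http://www.example.org/repository"));
        System.out.println(toFileUrl("."));
    }
}
```

A caller could run isFileUrl on the repository URL and report a clear error
before GATE tries to load the transducer from an unsupported location.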
5) How to use it with the GATE GUI
In the GUI menu, click on File / Load a CREOLE Repository, then
enter the URL of the directory that contains the MtlTransducer.jar and
creole.xml files. The URL must begin with "file:"; it cannot be a web
URL (see Installation procedure).
Then click on File / New processing resource and choose Montreal
Transducer. The only mandatory field is the Grammar URL: enter
the path of a main.jape file in the same manner as for a regular
Jape Transducer (this URL can point to a file on the web). Add the new
module to a processing pipeline. It may be necessary to run a tokeniser
and gazetteer before the transducer if the grammar uses Token
and Lookup annotations.
6) How to use it in a standalone GATE program?
A good starting point is the example code here.
The following code registers a repository (the directory where the MtlTransducer.jar
and creole.xml files live; the directory cannot be a web URL,
see Installation procedure), then creates a
Montreal Transducer with specific parameters (the grammarURL parameter
is mandatory and it should point to a main.jape file like for
a regular Jape Transducer), and finally adds the resource to a pipeline.
It may be necessary to run a tokeniser and gazetteer before the transducer
if the grammar uses Token and Lookup annotations.
import java.net.URL;
import gate.Factory;
import gate.FeatureMap;
import gate.Gate;
import gate.ProcessingResource;
import gate.creole.SerialAnalyserController;

// Assumes GATE has already been initialised (Gate.init())
// Create a pipeline
SerialAnalyserController annieController = (SerialAnalyserController)
    Factory.createResource("gate.creole.SerialAnalyserController",
        Factory.newFeatureMap(), Factory.newFeatureMap(),
        "ANNIE_" + Gate.genSym());
// Load a tokeniser, gazetteer, etc. here
// Register the external repository where the Montreal Transducer jar file lives
Gate.getCreoleRegister().registerDirectories(new URL("file:MtlTransducer/build"));
// Create an instance of the transducer after having set the grammar URL
FeatureMap params = Factory.newFeatureMap();
params.put("grammarURL", new URL("file:creole/NE/main.jape"));
params.put("inputASName", "Original markups");
// The class name must be the one declared in creole.xml
ProcessingResource transducerPR = (ProcessingResource)
    Factory.createResource("ca.umontreal.iro.rali.gate.MtlTransducer", params);
// Add the transducer to the pipeline
annieController.add(transducerPR);
7) Changes to the JAPE language
The Montreal Transducer is based on the Transducer from the ANNIE suite
but with the following added features:
- It provides more comparison operators in left hand side constraints
- It allows conjunctions of constraints on different types of annotation
- It guarantees that the "*" and "+" Kleene operators are greedy
More comparison operators
The Montreal Transducer offers more comparison operators for use in left
hand side constraints of a JAPE grammar. The standard ANNIE transducer
allows only constraints like these:
- {MyAnnot} // true if the current annotation is a MyAnnot annotation
- {MyAnnot.attrib == "3"} // true if the attrib attribute has a value equal to 3
The Montreal Transducer allows the following constraints:
- {!MyAnnot} // true if NO annotation at the current point is a MyAnnot
- {!MyAnnot.attrib == 3} // true if attrib is not equal to 3
- {MyAnnot.attrib != 3} // true if attrib is not equal to 3
- {MyAnnot.attrib > 3} // true if attrib > 3
- {MyAnnot.attrib >= 3} // true if attrib ≥ 3
- {MyAnnot.attrib < 3} // true if attrib < 3
- {MyAnnot.attrib <= 3} // true if attrib ≤ 3
- {MyAnnot.attrib =~ "[Dd]ogs?"} // true if the regular expression matches attrib entirely
- {MyAnnot.attrib !~ "[Dd]ogs?"} // true if the regular expression does not match attrib
See the notes on the equality operators, comparison
operators, pattern matching operators and negation
operator.
Notes on equality operators: "==" and "!="
The "!=" operator is the negation of the "==" operator, that is to say:
{Annot.attribute != value} is equivalent to {!Annot.attribute == value}.
When a constraint on an attribute cannot be evaluated because an annotation
does not have a value for the attribute, the equality operator returns
false (and the difference operator returns true).
If the constraint's attribute is a string, then the String.equals method
is called with the annotation's attribute as a parameter. If the constraint's
attribute is an integer, then the Long.equals method is called. If the
constraint's attribute is a float, then the Double.equals method is called.
And if the constraint's attribute is a boolean, then the Boolean.equals
method is called. The grammar parser does not allow other types of constraints.
Normally, when the types of the constraint's and the annotation's attribute
differ, they cannot be equal. However, because some ANNIE processing resources
(namely the tokeniser) set all attribute values as strings even when they
are numbers (Token.length is set to a string value, for example),
the Montreal Transducer can convert the string to a Long/Double/Boolean
before testing for equality. In other words, for the token "dog" (whose
Token.length is set to the string "3"):
- {Token.length == "3"} is true using either the ANNIE transducer
  or the Montreal Transducer
- {Token.length == 3} is false using the ANNIE transducer, but true using
  the Montreal Transducer
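The equality rules above can be summarised in a few lines of Java. This is
a minimal sketch of the described behaviour, not the transducer's actual
source; the class and method names (EqualitySketch, looseEquals) are
hypothetical, and treating any text other than "true" or "false" as
unequal to a Boolean constraint is an assumption.

```java
final class EqualitySketch {

    /**
     * Compares a constraint value (String, Long, Double or Boolean) with an
     * annotation attribute value. When the types differ and the attribute
     * is a String, the String is first converted to the constraint's type.
     */
    static boolean looseEquals(Object constraintValue, Object attributeValue) {
        if (attributeValue == null) {
            return false; // a missing attribute is never equal to a value
        }
        if (constraintValue.getClass() == attributeValue.getClass()) {
            return constraintValue.equals(attributeValue);
        }
        if (!(attributeValue instanceof String)) {
            return false; // two different non-string types cannot be equal
        }
        String text = (String) attributeValue;
        try {
            if (constraintValue instanceof Long) {
                return constraintValue.equals(Long.valueOf(text));
            }
            if (constraintValue instanceof Double) {
                return constraintValue.equals(Double.valueOf(text));
            }
        } catch (NumberFormatException e) {
            return false; // text such as "dog" cannot be read as a number
        }
        if (constraintValue instanceof Boolean) {
            // Assumption: only "true" and "false" count as Boolean text
            boolean isBooleanText =
                text.equalsIgnoreCase("true") || text.equalsIgnoreCase("false");
            return isBooleanText && constraintValue.equals(Boolean.valueOf(text));
        }
        return false;
    }
}
```

With this sketch, looseEquals(3L, "3") is true, which mirrors
{Token.length == 3} matching a token whose length is stored as the
string "3".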
Notes on comparison operators: ">", "<", ">=" and "<="
If the constraint's attribute is a string, then the String.compareTo
method is called with the annotation's attribute as a parameter (strings
can be compared alphabetically). If the constraint's attribute is an integer,
then the Long.compareTo method is called. If the constraint's attribute
is a float, then the Double.compareTo method is called. The transducer
issues a warning if an attempt is made to compare two Boolean values
because, in Java 1.4, this type does not implement the Comparable interface
and thus has no compareTo method.
The transducer issues a warning when it encounters an annotation's attribute
that cannot be compared to the constraint's attribute because the value
types are different, or because one value is null. For example, given a
constraint {MyAnnot.attrib > 2}, a warning is issued for any MyAnnot
in the document for which attrib is not an integer, such as attrib
= "dog", because "dog" > 2 cannot be evaluated. Similarly,
{MyAnnot.attrib > 2} cannot be evaluated against attrib = 2.5 because 2.5
is a float. In this case, write the constraint value as a float:
{MyAnnot.attrib > 2.0}.
The transducer does not issue a warning when the constraint's attribute
is an integer/float and the annotation's attribute is a string that can
be parsed as an integer/float. Some ANNIE processing resources (namely
the tokeniser) set all attribute values as strings even when they are
numbers (Token.length is set to a string value, for example). Because
{Token.length < "10"} would lead to an alphabetical comparison, a
workaround was needed so that we could write {Token.length < 10}.
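The comparison rules above, including the string-to-number workaround, can
be sketched as follows. This is an illustration of the described behaviour
only, not the transducer's code; the names (ComparisonSketch, compare) are
hypothetical, and a null result stands for the cases where the text says a
warning is issued.

```java
final class ComparisonSketch {

    /**
     * Returns attribute.compareTo(constraint), or null when the two values
     * cannot be compared (null value, Boolean, or mismatched types).
     */
    static Integer compare(Object attributeValue, Object constraintValue) {
        if (attributeValue == null || constraintValue == null) {
            return null;
        }
        if (constraintValue instanceof String) {
            if (!(attributeValue instanceof String)) {
                return null;
            }
            // Two strings are compared alphabetically
            String text = (String) attributeValue;
            return Integer.valueOf(text.compareTo((String) constraintValue));
        }
        if (constraintValue instanceof Long) {
            Long value = asLong(attributeValue);
            return value == null
                ? null : Integer.valueOf(value.compareTo((Long) constraintValue));
        }
        if (constraintValue instanceof Double) {
            Double value = asDouble(attributeValue);
            return value == null
                ? null : Integer.valueOf(value.compareTo((Double) constraintValue));
        }
        return null; // Boolean constraints cannot be ordered
    }

    /** Accepts a Long, or a String that parses as a Long; otherwise null. */
    private static Long asLong(Object value) {
        if (value instanceof Long) {
            return (Long) value;
        }
        if (value instanceof String) {
            try {
                return Long.valueOf((String) value);
            } catch (NumberFormatException e) {
                return null;
            }
        }
        return null;
    }

    /** Accepts a Double, or a String that parses as a Double; otherwise null. */
    private static Double asDouble(Object value) {
        if (value instanceof Double) {
            return (Double) value;
        }
        if (value instanceof String) {
            try {
                return Double.valueOf((String) value);
            } catch (NumberFormatException e) {
                return null;
            }
        }
        return null;
    }
}
```

For a token of length 9 stored as the string "9", compare("9", "10") is
positive (alphabetically, "9" comes after "10"), whereas compare("9", 10L)
is negative, which is the result a grammar author expects from
{Token.length < 10}.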
Notes on pattern matching operators: "=~" and "!~"
The "!~" operator is the negation of the "=~" operator, that is to say:
{Annot.attribute !~ "value"} is equivalent to {!Annot.attribute =~ "value"}.
When a constraint on an attribute cannot be evaluated because an annotation
does not have a value for the attribute, the value defaults to an empty
string ("").
The regular expression must be enclosed in double quotes, otherwise
the transducer issues a warning:
- {MyAnnot.attrib =~ "[Dd]ogs?"} is correct
- {MyAnnot.attrib =~ 2} is incorrect
The regular expression must be a valid java.util.regex.Pattern, otherwise
a warning is issued.
To have a match, the regular expression must cover the entire attribute
string, not only a part of it. For example:
- {MyAnnot.attrib =~ "do"} does not match "does"
- {MyAnnot.attrib =~ "do.*"} matches "does"
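The whole-string rule corresponds to java.util.regex.Pattern.matches (as
opposed to Matcher.find, which accepts a partial match). The following
sketch illustrates the semantics described above and is not the
transducer's code; the names (PatternConstraintSketch, matchesEntirely) are
hypothetical, and returning false for an invalid pattern is an assumption
(the text only says that a warning is issued).

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class PatternConstraintSketch {

    /**
     * Returns true if the regular expression covers the entire attribute
     * value. A missing attribute value is treated as the empty string.
     */
    static boolean matchesEntirely(String regex, String attributeValue) {
        String text = attributeValue == null ? "" : attributeValue;
        try {
            return Pattern.matches(regex, text);
        } catch (PatternSyntaxException e) {
            return false; // assumption: an invalid pattern never matches
        }
    }

    public static void main(String[] args) {
        System.out.println(matchesEntirely("do", "does"));
        System.out.println(matchesEntirely("do.*", "does"));
        // A partial match, by contrast, would succeed with find():
        System.out.println(Pattern.compile("do").matcher("does").find());
    }
}
```

Because a missing value defaults to "", a pattern that accepts the empty
string, such as ".*", also matches annotations that lack the attribute.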
Notes on the negation operator: "!"
Bindings: when a constraint contains both negated and regular elements,
the negated elements do not affect the bindings of the regular elements.
Thus, {Person, !Organization} binds to the same annotations (amongst
those that start at the current node in the annotation graph) as {Person};
the difference between the two is that the first will simply not match
if one of the annotations starting at the current node is an Organization.
On the other hand, when a constraint contains only negated elements, such
as {!Organization}, it binds to all annotations starting at the current
node. It is important to keep that in mind, especially when a rule ends
with a constraint that has negated elements only: the longest annotation
at the current node will be preferred.
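The binding rule above can be made concrete with a small model in which the
annotations starting at the current node are represented only by their type
names. This is a simplified sketch under that assumption, not the
transducer's data structures; the names (NegationBindingSketch, types,
bind) are hypothetical, and a null result means "no match".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class NegationBindingSketch {

    /** Convenience helper that builds a set of annotation type names. */
    static Set<String> types(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    /**
     * Returns the annotation types bound by a constraint at the current
     * node, or null if the constraint does not match there.
     *
     * @param required     types that must start at the node, e.g. Person
     * @param negated      types that must NOT start at the node
     * @param startingHere types of all annotations starting at the node
     */
    static List<String> bind(Set<String> required, Set<String> negated,
                             List<String> startingHere) {
        for (String type : startingHere) {
            if (negated.contains(type)) {
                return null; // a negated type is present: no match
            }
        }
        if (!startingHere.containsAll(required)) {
            return null; // a required type is missing: no match
        }
        List<String> bound = new ArrayList<>();
        for (String type : startingHere) {
            // Only negated elements: bind everything that starts here.
            // Otherwise: bind only the required (regular) elements.
            if (required.isEmpty() || required.contains(type)) {
                bound.add(type);
            }
        }
        return bound;
    }
}
```

In this model, {Person, !Organization} and {Person} both bind only the
Person annotation at a node where a Person and a Token start, while
{!Organization} binds both.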
Conjunctions of constraints on different types of annotation
The Montreal Transducer allows constraints on different types of annotation.
Though the JAPE implementation described in the GATE 2.1 User Guide details
an algorithm that would allow such constraints, the ANNIE transducer does
not implement it. This transducer does. The following examples do not work
as expected with the ANNIE transducer but do with this transducer:
- {Person, Organization}
- {Person, Organization, Token.length == "10"}
- {Person, !Organization}
As described in the algorithm, the first example above matches points in
the document (or nodes in the annotation graph) where both a Person and
an Organization annotation begin, even if they do not end at the same
point in the document and even if other annotations begin at the same point.
When a negation is involved, such as in the third example above, no annotation
of that kind must begin at a given point for a match to occur (see the
note on the negation operator above).
Greedy Kleene operators: "*" and "+"
The ANNIE transducer does not behave consistently regarding the "*"
and "+" Kleene operators. Suppose we have the following rule with 2 bindings:
- ({Lookup.majorType == title})+:titles ({Token.orth == upperInitial})+:names
Given the sentence "the Honourable Mr. John Atkinson", we expect
the following bindings:
- titles: "Honourable Mr."
- names: "John Atkinson"
But the ANNIE transducer could give something like:
- titles: "Honourable"
- names: "Mr. John Atkinson"
This is not incorrect, but according to convention, "*" and "+" operators
match as many tokens as possible before moving on to the next constraint.
The Montreal Transducer guarantees that "*" and "+" are greedy.
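The difference between the two sets of bindings can be illustrated with
java.util.regex, whose "+" is greedy and whose "+?" is reluctant. This is
only an analogy, not the transducer's finite state machine. Each token of
"Honourable Mr. John Atkinson" is encoded as one hypothetical letter (h, m,
j, a); [hm] stands for tokens that carry a title Lookup and [hmja] for
tokens with an upper-case initial. The class and method names are
hypothetical as well.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyAnalogySketch {

    /**
     * Splits an encoded token sequence into {titles, names}.
     * Returns null if the sequence does not match the rule at all.
     */
    static String[] split(String tokens, boolean greedy) {
        // First group plays the role of ":titles", second of ":names"
        String titles = greedy ? "([hm]+)" : "([hm]+?)";
        Matcher m = Pattern.compile(titles + "([hmja]+)").matcher(tokens);
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        // Greedy: titles take "Honourable Mr." (h, m)
        System.out.println(Arrays.toString(split("hmja", true)));
        // Reluctant: titles take only "Honourable" (h)
        System.out.println(Arrays.toString(split("hmja", false)));
    }
}
```

Both calls cover the same region "hmja"; only the boundary between the two
groups moves, which is the kind of ambiguity the Montreal Transducer
resolves in favour of the earlier label.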
8) For developers
Developers will find comments on classes and methods through the javadoc
pages: doc/javadoc/index.html. Most of the source code comes from
the Jape Transducer in GATE. It was necessary to copy entire packages instead
of overriding a few methods because many class attributes and members were
not accessible outside the gate.xxx package. The Montreal Transducer needs
4 packages:
a) ca.umontreal.iro.rali.gate.creole
Contains only the MtlTransducer class, which is the module's interface
with the outside world. The MtlTransducer class is almost exactly the same
as gate.creole.Transducer (the basic Jape Transducer). The code of OntologyAwareTransducer
is also included in MtlTransducer. It was impossible to simply extend any
of those transducers because some members are private or package-protected.
b) ca.umontreal.iro.rali.gate.fsm
Same as the gate.fsm package. This package models the grammar as a finite
state machine. Only the convertComplexPE private method of the FSM class
has been substantially modified.
c) ca.umontreal.iro.rali.gate.jape
Almost the same as the gate.jape package. Significant modifications were
made to the SinglePhaseTransducer, Constraint and JdmAttribute classes.
d) ca.umontreal.iro.rali.gate.jape.parser
Almost the same as the gate.jape.parser package. Modifications were made to
ParseCpsl.jj so that the JAPE language could be extended. This file is to
be compiled with javacc. The other classes of the package are automatically
generated by javacc.
9) Licence
This work is a modification of some GATE libraries and therefore the binaries
and source code are distributed under the same licence as GATE itself.
GATE is licenced under the GNU Library General Public License, version 2
of June 1991. That licence is distributed with this module in the file
LICENCE.htm. GATE binaries and source code are available at
http://gate.ac.uk.
Modifications to the original source code are detailed in the header of
each file.
Basically, the Montreal Transducer source code and binaries are free.
A work that would be a modification of it should also be free. However,
a work that would only USE the Montreal Transducer would be exempted from
the terms of the licence, provided the GATE and the Montreal Transducer
binaries, source code and licence are distributed with the embedding work
and provided the use of this software is acknowledged. For additional
help on the interpretation of the GATE licence, see http://www.gate.ac.uk/gate/doc/index.html.
10) Change log
1.1:
- Bug fixed: a constraint with multiple negated tests on the same attribute
of a given annotation type would match when at least one test succeeds,
but it should match only when ALL negated tests succeed.
1.0: Initial release.