This is the third in a series of progress reports for the GATE 2 project (EPSRC grant GR/M31699, running from July 1999 to summer 2002). The previous report is here.
Our work over the last nine months, from January through September 2001, has centered on:
GATE has been upgraded as a result of the requirements of the EMILLE project, and both the core system and the bundled Information Extraction system are now proven capable of handling Indic (and many other) languages.
The software now supports display of all the EMILLE languages that are in the Unicode standard. This display is imperfect in JDK 1.3 but has been improved in JDK 1.4, and we are currently porting the system to the latter. In addition, the system supports input methods (for editing text) for 27 languages, including these EMILLE languages: Bengali, Urdu and Hindi (two variants).
We have conducted successful experiments in performing Named Entity recognition in Bengali.
The work has been made available to the community under the GNU open source library licence, and has already been taken up by the Max Planck Institute technical group in Nijmegen, who have extended the system's support for Chinese languages.