java.lang.Object
  +--gate.util.AbstractFeatureBearer
      +--gate.creole.AbstractResource
          +--gate.creole.AbstractProcessingResource
              +--gate.creole.tokeniser.DefaultTokeniser
Implementation of a Unicode rule-based tokeniser.
The tokeniser gets its rules from a file, an InputStream, or a Reader, which should be passed to one of the constructors.
The implementation is based on a finite state machine that is built from the set of rules.
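As a rough illustration of the finite state machine idea, the sketch below hand-codes a three-state machine for a single pattern: one uppercase letter followed by one or more lowercase letters, which is the LHS of the example rule in this description. It is not the tokeniser's implementation, which builds its machine from the rules it is given. The class name UpperInitialSketch and the method match are hypothetical; only Character.getType and the Character category constants are standard Java.

```java
// Illustrative sketch only; not part of gate.creole.tokeniser.
class UpperInitialSketch {

    /**
     * Returns the end offset (exclusive) of the longest match that starts
     * at 'start', or -1 if the pattern does not match there.
     */
    static int match(String text, int start) {
        // State 0: expect an uppercase letter.
        // State 1: uppercase seen, expect the first lowercase letter.
        // State 2: accepting, further lowercase letters extend the match.
        int state = 0;
        int i = start;
        while (i < text.length()) {
            int type = Character.getType(text.charAt(i));
            if (state == 0 && type == Character.UPPERCASE_LETTER) {
                state = 1;
            } else if (state >= 1 && type == Character.LOWERCASE_LETTER) {
                state = 2;
            } else {
                break; // no transition for this character category
            }
            i++;
        }
        return state == 2 ? i : -1;
    }

    public static void main(String[] args) {
        System.out.println(match("Hello world", 0)); // prints 5
        System.out.println(match("hello", 0));       // prints -1
    }
}
```

The transitions are keyed on the character's Unicode category rather than on the character itself, which is what lets one rule cover every script that has upper and lower case.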
A rule has two sides, the left-hand side (LHS) and the right-hand side (RHS), which are separated by the ">" character. The LHS is a regular expression that is matched against the input, while the RHS describes a Gate2 annotation in terms of an annotation type and attribute-value pairs.
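The two-sided rule format can be made concrete with a small parsing sketch. It splits a rule at the first ">" character and reads the RHS as an annotation type followed by attribute=value pairs separated by ";". That RHS layout is an assumption inferred from the example rule in this description, and the class RuleSketch is hypothetical; the tokeniser's real rule parser may behave differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; not the tokeniser's own rule parser.
class RuleSketch {
    final String lhs;             // regular expression over category names
    final String annotationType;  // for example "Token"
    final Map<String, String> features = new LinkedHashMap<>();

    RuleSketch(String rule) {
        // Assumption: the first ">" separates the LHS from the RHS.
        int sep = rule.indexOf('>');
        if (sep < 0) {
            throw new IllegalArgumentException("No '>' separator in rule: " + rule);
        }
        lhs = rule.substring(0, sep).trim();

        // Assumed RHS layout: type;attribute=value;attribute=value;
        String[] parts = rule.substring(sep + 1).trim().split(";");
        annotationType = parts[0].trim();
        for (int i = 1; i < parts.length; i++) {
            String part = parts[i].trim();
            if (part.isEmpty()) {
                continue;
            }
            int eq = part.indexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException("Expected attribute=value: " + part);
            }
            features.put(part.substring(0, eq).trim(), part.substring(eq + 1).trim());
        }
    }

    public static void main(String[] args) {
        RuleSketch r = new RuleSketch(
            "\"UPPERCASE_LETTER\" \"LOWERCASE_LETTER\"+ > Token;kind=upperInitial;");
        System.out.println(r.lhs);
        // prints "UPPERCASE_LETTER" "LOWERCASE_LETTER"+
        System.out.println(r.annotationType + " " + r.features);
        // prints Token {kind=upperInitial}
    }
}
```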
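The matching is done using the Unicode enumerated types defined by the Character class; the supported Unicode categories are those that were available at the time this class was written. For example, the following rule annotates an uppercase letter followed by one or more lowercase letters as a Token with kind=upperInitial:
"UPPERCASE_LETTER" "LOWERCASE_LETTER"+ > Token;kind=upperInitial;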
Field Summary
static int | maxTypeId | The maximum int value used internally as a type id.
static Map | stringTypeIds | Maps from type names to internal type ids.
static Map | typeIds | Maps from int (the static value on Character) to int (the internal value used by the tokeniser).
static String[] | typeMnemonics | Maps the internal type ids to the type names.
Constructor Summary
DefaultTokeniser() | Creates a tokeniser.
Method Summary
void | addProgressListener(ProgressListener l) |
void | addStatusListener(StatusListener listener) |
String | getAnnotationSetName() |
String | getDFSMgml() | Returns a string representation of the deterministic FSM graph using GML.
Document | getDocument() |
String | getEncoding() |
FeatureMap | getFeatures() | Get the feature set.
String | getFSMgml() | Returns a string representation of the non-deterministic FSM graph using GML (Graph Modelling Language).
String | getRulesResourceName() |
URL | getRulesURL() | Gets the value of the rulesURL property, which holds a URL to the file containing the rules for this tokeniser.
Resource | init() | Initialises this tokeniser by reading the rules from an external source (provided through a URL) and building the finite state machine at the core of the tokeniser.
void | removeProgressListener(ProgressListener l) |
void | removeStatusListener(StatusListener listener) |
void | reset() | Prepares this processing resource for a new run.
void | run() | The method that does the actual tokenisation.
void | setAnnotationSetName(String newAnnotationSetName) |
void | setDocument(Document newDocument) |
void | setEncoding(String newEncoding) |
void | setFeatures(FeatureMap features) | Set the feature set.
void | setRulesResourceName(String newRulesResourceName) |
void | setRulesURL(URL newRulesURL) | Sets the value of the rulesURL property, which holds a URL to the file containing the rules for this tokeniser.
Methods inherited from class gate.creole.AbstractProcessingResource: check, reInit
Methods inherited from class gate.creole.AbstractResource: getName, setName
Methods inherited from class java.lang.Object: equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface gate.ProcessingResource: check, reInit
Methods inherited from interface gate.util.FeatureBearer: getName, setName
Field Detail
public static Map typeIds
Maps from int (the static value on Character) to int (the internal value used by the tokeniser). The int values used by the tokeniser are consecutive values, starting from 0 and going as high as necessary.
They map all the public static int members on Character.
public static int maxTypeId
public static String[] typeMnemonics
public static Map stringTypeIds
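The relationship between these four fields can be pictured with a small sketch. It assigns consecutive internal ids, starting from 0, to a hand-picked subset of Character categories and fills the corresponding lookup structures. The class TypeIdSketch, the chosen subset, and the ordering of ids are assumptions for illustration; the real class covers the category constants on Character and may number them differently.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; mirrors the field names, not the real contents.
class TypeIdSketch {
    // Assumed subset of categories; the real tokeniser maps many more.
    static final int[] CHARACTER_TYPES = {
        Character.UPPERCASE_LETTER,
        Character.LOWERCASE_LETTER,
        Character.DECIMAL_DIGIT_NUMBER,
        Character.SPACE_SEPARATOR
    };
    static final String[] NAMES = {
        "UPPERCASE_LETTER",
        "LOWERCASE_LETTER",
        "DECIMAL_DIGIT_NUMBER",
        "SPACE_SEPARATOR"
    };

    // Character constant -> internal id
    static final Map<Integer, Integer> typeIds = new HashMap<>();
    // type name -> internal id
    static final Map<String, Integer> stringTypeIds = new HashMap<>();
    // internal id -> type name
    static final String[] typeMnemonics = new String[NAMES.length];
    // largest internal id in use
    static int maxTypeId;

    static {
        for (int id = 0; id < CHARACTER_TYPES.length; id++) {
            typeIds.put(CHARACTER_TYPES[id], id);
            stringTypeIds.put(NAMES[id], id);
            typeMnemonics[id] = NAMES[id];
        }
        maxTypeId = CHARACTER_TYPES.length - 1;
    }

    public static void main(String[] args) {
        // Look up the internal id for the category of a character.
        int id = typeIds.get(Character.getType('A'));
        System.out.println(id + " " + typeMnemonics[id]); // prints 0 UPPERCASE_LETTER
    }
}
```

Dense ids starting from 0 allow the internal id to be used directly as an array index, for example into a state's transition table, which the sparse Character constants would not allow as compactly.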
Constructor Detail
public DefaultTokeniser()
Method Detail
public Resource init() throws ResourceInstantiationException
Specified by: init in interface Resource
Overrides: init in class AbstractProcessingResource
Throws: ResourceInstantiationException
public void reset()
public String getFSMgml()
public String getDFSMgml()
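For readers unfamiliar with GML (Graph Modelling Language), the sketch below writes a small labelled graph in that format. It is a generic illustration of GML text, not a reproduction of what getFSMgml() or getDFSMgml() return, since this page does not specify their exact output. The class GmlSketch and its toGml method are hypothetical.

```java
// Illustrative sketch only; the real methods may emit different attributes.
class GmlSketch {

    /**
     * Builds a GML description of a directed graph.
     * edges[i] holds {source, target}; labels[i] is the label of edges[i].
     */
    static String toGml(int nodeCount, int[][] edges, String[] labels) {
        StringBuilder sb = new StringBuilder("graph [\n  directed 1\n");
        for (int n = 0; n < nodeCount; n++) {
            sb.append("  node [ id ").append(n)
              .append(" label \"").append(n).append("\" ]\n");
        }
        for (int e = 0; e < edges.length; e++) {
            sb.append("  edge [ source ").append(edges[e][0])
              .append(" target ").append(edges[e][1])
              .append(" label \"").append(labels[e]).append("\" ]\n");
        }
        return sb.append("]\n").toString();
    }

    public static void main(String[] args) {
        // Three states and two transitions, labelled with category names.
        int[][] edges = { {0, 1}, {1, 2} };
        String[] labels = { "UPPERCASE_LETTER", "LOWERCASE_LETTER" };
        System.out.print(toGml(3, edges, labels));
    }
}
```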
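public FeatureMap getFeatures()
Description copied from interface: FeatureBearer
Get the feature set.
Specified by: getFeatures in interface FeatureBearer
Overrides: getFeatures in class AbstractFeatureBearer
public void setFeatures(FeatureMap features)
Description copied from interface: FeatureBearer
Set the feature set.
Specified by: setFeatures in interface FeatureBearer
Overrides: setFeatures in class AbstractFeatureBearer
public void run()
Specified by: run in interface Runnable
Overrides: run in class AbstractProcessingResource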
public void addStatusListener(StatusListener listener)
public void removeStatusListener(StatusListener listener)
public void setRulesURL(URL newRulesURL)
Sets the value of the rulesURL property, which holds a URL to the file containing the rules for this tokeniser.
Parameters: newRulesURL
public URL getRulesURL()
Gets the value of the rulesURL property, which holds a URL to the file containing the rules for this tokeniser.
public void setDocument(Document newDocument)
public Document getDocument()
public void setAnnotationSetName(String newAnnotationSetName)
public String getAnnotationSetName()
public void setRulesResourceName(String newRulesResourceName)
public String getRulesResourceName()
public void setEncoding(String newEncoding)
public String getEncoding()
public void removeProgressListener(ProgressListener l)
public void addProgressListener(ProgressListener l)