java.lang.Object
  |
  +--gate.util.AbstractFeatureBearer
        |
        +--gate.creole.AbstractResource
              |
              +--gate.creole.AbstractProcessingResource
                    |
                    +--gate.creole.tokeniser.DefaultTokeniser
Implementation of a Unicode rule based tokeniser.
The tokeniser gets its rules from a file, an InputStream or a Reader,
which should be passed to one of the constructors.
The implementation is based on a finite state machine that is built from
the set of rules.
A rule has two sides, the left-hand side (LHS) and the right-hand side (RHS),
which are separated by the ">" character. The LHS is a regular expression
that is matched against the input, while the RHS describes a Gate2 annotation
in terms of an annotation type and attribute-value pairs.
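As a rough illustration of the rule layout described above, the following sketch splits one rule line at the ">" separator and reads the RHS as an annotation type followed by attribute-value pairs. It is not the tokeniser's own parser: the class RuleSplitSketch, the type ParsedRule and the method parse are hypothetical names, and the sketch assumes that ">" does not occur inside the LHS and that RHS parts are separated by ";".

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RuleSplitSketch {

    /** Holds the two sides of a rule. Hypothetical helper type, not part of GATE. */
    static final class ParsedRule {
        final String lhs;
        final String annotationType;
        final Map<String, String> attributes = new LinkedHashMap<>();

        ParsedRule(String lhs, String annotationType) {
            this.lhs = lhs;
            this.annotationType = annotationType;
        }
    }

    /** Splits a rule at the first '>' (assumed not to occur inside the LHS). */
    static ParsedRule parse(String line) {
        int sep = line.indexOf('>');
        if (sep < 0) {
            throw new IllegalArgumentException("Missing '>' separator: " + line);
        }
        String lhs = line.substring(0, sep).trim();
        String rhs = line.substring(sep + 1).trim();

        // Assumed RHS layout: type;name=value;name=value;
        String[] parts = rhs.split(";");
        ParsedRule rule = new ParsedRule(lhs, parts[0].trim());
        for (int i = 1; i < parts.length; i++) {
            String part = parts[i].trim();
            if (part.isEmpty()) {
                continue;
            }
            int eq = part.indexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException("Expected name=value but found: " + part);
            }
            rule.attributes.put(part.substring(0, eq).trim(), part.substring(eq + 1).trim());
        }
        return rule;
    }

    public static void main(String[] args) {
        ParsedRule r = parse("\"UPPERCASE_LETTER\" \"LOWERCASE_LETTER\"+ > Token;kind=upperInitial;");
        System.out.println("LHS:        " + r.lhs);
        System.out.println("Type:       " + r.annotationType);
        System.out.println("Attributes: " + r.attributes);
    }
}
```

The real parseRule(String) builds finite state machine states from the LHS instead of keeping it as a string; the sketch only shows how the two sides relate.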
The matching is done using the Unicode enumerated types defined by the Character
class. At the time of writing this class, the
supported Unicode categories were:
"UPPERCASE_LETTER" "LOWERCASE_LETTER"+ > Token;kind=upperInitial;
Field Summary
Modifier and type | Field | Description
protected String | annotationSetName | The annotation set where the new annotations will be added.
private static boolean | DEBUG | Debug flag.
protected static String | defaultResourceName |
protected Set | dfsmStates | A set containing all the states of the deterministic machine.
protected DFSMState | dInitialState | The initial state of the deterministic machine.
protected Document | document | The document to be tokenised.
private String | encoding |
protected FeatureMap | features |
protected Set | fsmStates | A set containing all the states of the non-deterministic machine.
(package private) static Set | ignoreTokens | A set of strings representing tokens to be ignored (e.g.
protected FSMState | initialState | The initial state of the non-deterministic machine.
(package private) static String | LHStoRHS | The separator between the LHS and the RHS.
static int | maxTypeId | The maximum int value used internally as a type id.
protected List | myProgressListeners |
protected List | myStatusListeners |
private Vector | progressListeners |
private String | rulesResourceName |
private URL | rulesURL |
static Map | stringTypeIds | Maps from type names to internal type ids.
static Map | typeIds | Maps from int (the static value on Character) to int (the internal value used by the tokeniser).
static String[] | typeMnemonics | Maps the internal type ids to the type names.
Fields inherited from class gate.creole.AbstractProcessingResource |
executionException |
Fields inherited from class gate.creole.AbstractResource |
serialVersionUID |
Constructor Summary | |
DefaultTokeniser()
Creates a tokeniser |
Method Summary
Modifier and type | Method | Description
(package private) static void | (static initialiser) | The static initialiser will inspect the class Character using reflection to find all the public static members and will map them to ids starting from 0.
void | addProgressListener(ProgressListener l) |
void | addStatusListener(StatusListener listener) |
(package private) void | eliminateVoidTransitions() | Converts the FSM from a non-deterministic to a deterministic one by eliminating all the unrestricted transitions.
protected void | fireProcessFinished() |
protected void | fireProcessFinishedEvent() |
protected void | fireProgressChanged(int e) |
protected void | fireProgressChangedEvent(int i) |
protected void | fireStatusChangedEvent(String text) |
String | getAnnotationSetName() |
String | getDFSMgml() | Returns a string representation of the deterministic FSM graph using GML.
Document | getDocument() |
String | getEncoding() |
FeatureMap | getFeatures() | Get the feature set.
String | getFSMgml() | Returns a string representation of the non-deterministic FSM graph using GML (Graph Modelling Language).
String | getRulesResourceName() |
URL | getRulesURL() | Gets the value of the rulesURL property, which holds a URL to the file containing the rules for this tokeniser.
Resource | init() | Initialises this tokeniser by reading the rules from an external source (provided through a URL) and building the finite state machine at the core of the tokeniser.
private AbstractSet | lambdaClosure(Set s) | Converts the finite state machine to a deterministic one.
(package private) FSMState | parseLHS(FSMState startState, StringTokenizer st, String until) | Parses a part or the entire LHS.
(package private) String | parseQuotedString(StringTokenizer st, String until) | Parses from the given string tokeniser until it finds a specific delimiter.
(package private) void | parseRule(String line) | Parses one input line containing a tokeniser rule.
void | removeProgressListener(ProgressListener l) |
void | removeStatusListener(StatusListener listener) |
void | reset() | Prepares this processing resource for a new run.
void | run() | The method that does the actual tokenisation.
void | setAnnotationSetName(String newAnnotationSetName) |
void | setDocument(Document newDocument) |
void | setEncoding(String newEncoding) |
void | setFeatures(FeatureMap features) | Set the feature set.
void | setRulesResourceName(String newRulesResourceName) |
void | setRulesURL(URL newRulesURL) | Sets the value of the rulesURL property, which holds a URL to the file containing the rules for this tokeniser.
protected static String | skipIgnoreTokens(StringTokenizer st) | Skips the ignorable tokens from the input, returning the first significant token.
Methods inherited from class gate.creole.AbstractProcessingResource |
check, reInit |
Methods inherited from class gate.creole.AbstractResource |
getName, setName |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, registerNatives, toString, wait, wait, wait |
Methods inherited from interface gate.ProcessingResource |
check, reInit |
Methods inherited from interface gate.util.FeatureBearer |
getName, setName |
Field Detail |
private static final boolean DEBUG
protected FeatureMap features
protected List myProgressListeners
protected List myStatusListeners
protected String annotationSetName
protected FSMState initialState
protected Set fsmStates
protected DFSMState dInitialState
protected Set dfsmStates
static String LHStoRHS
static Set ignoreTokens
public static Map typeIds
Maps from int (the static value on Character) to int (the internal value used
by the tokeniser). The int values used by the tokeniser are consecutive values,
starting from 0 and going as high as necessary.
They map all the public static int members on Character.
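The static values on Character mentioned here are the general category constants that Character.getType returns for a character. As a small illustration of matching by category, the sketch below checks by hand whether a word has the shape of one UPPERCASE_LETTER followed by one or more LOWERCASE_LETTER characters. The class CategoryMatchSketch and the method isUpperInitial are hypothetical names; the tokeniser itself uses a finite state machine built from its rules, not a hand-written loop like this one.

```java
public class CategoryMatchSketch {

    /**
     * Returns true if the word is one UPPERCASE_LETTER character followed by
     * one or more LOWERCASE_LETTER characters, judged by Unicode category.
     */
    static boolean isUpperInitial(String word) {
        if (word.length() < 2) {
            return false;
        }
        // Character.getType returns the general category as an int.
        if (Character.getType(word.charAt(0)) != Character.UPPERCASE_LETTER) {
            return false;
        }
        for (int i = 1; i < word.length(); i++) {
            if (Character.getType(word.charAt(i)) != Character.LOWERCASE_LETTER) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"Gate", "GATE", "gate", "G"}) {
            System.out.println(word + " -> " + isUpperInitial(word));
        }
    }
}
```

Because the check relies on Unicode categories and not on ASCII ranges, it also applies to letters outside the Latin alphabet.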
public static int maxTypeId
public static String[] typeMnemonics
public static Map stringTypeIds
protected static String defaultResourceName
protected Document document
private String rulesResourceName
private URL rulesURL
private String encoding
private transient Vector progressListeners
Constructor Detail |
public DefaultTokeniser()
Method Detail |
public Resource init() throws ResourceInstantiationException
Specified by: init in interface Resource
Overrides: init in class AbstractProcessingResource
Throws: ResourceInstantiationException
public void reset()
void parseRule(String line) throws TokeniserException
Parameters:
line - the string containing the rule
FSMState parseLHS(FSMState startState, StringTokenizer st, String until) throws TokeniserException
Parameters:
startState - an FSMState object representing the initial state for the small FSM that will recognise the (part of the) rule parsed by this method.
st - a StringTokenizer that provides the input
until - the string that marks the end of the section to be recognised. This method will first be called by parseRule(String) with " >" in order to parse the entire LHS. When necessary, it will call parseLHS again to parse a region of the LHS (e.g. a "(",")" enclosed part).
String parseQuotedString(StringTokenizer st, String until) throws TokeniserException
Parameters:
st - a StringTokenizer that provides the input
until - a String representing the end delimiter.
protected static String skipIgnoreTokens(StringTokenizer st)
a set
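The two helpers above both consume tokens from a StringTokenizer. A minimal sketch of how such helpers could behave is given below, assuming a tokenizer created with returnDelims set to true so that whitespace and quotes arrive as tokens of their own. The class TokenSkippingSketch, the contents of IGNORE and the use of IllegalArgumentException (in place of TokeniserException) are assumptions for illustration; the real contents of ignoreTokens are not listed on this page.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

public class TokenSkippingSketch {

    // Hypothetical ignore set: plain whitespace tokens.
    static final Set<String> IGNORE =
            new HashSet<>(Arrays.asList(" ", "\t", "\r", "\n"));

    /** Returns the first token that is not in IGNORE, or null if the input is exhausted. */
    static String skipIgnoreTokens(StringTokenizer st) {
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (!IGNORE.contains(token)) {
                return token;
            }
        }
        return null;
    }

    /** Concatenates tokens until the delimiter is seen; the delimiter is consumed but not returned. */
    static String parseQuotedString(StringTokenizer st, String until) {
        StringBuilder sb = new StringBuilder();
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (token.equals(until)) {
                return sb.toString();
            }
            sb.append(token);
        }
        throw new IllegalArgumentException("Unterminated string, expected " + until);
    }

    public static void main(String[] args) {
        // returnDelims = true, so each delimiter character is returned as a token.
        StringTokenizer st = new StringTokenizer("  \"UPPERCASE_LETTER\" >", " \t\r\n\"", true);
        String opening = skipIgnoreTokens(st);          // the opening quote
        String category = parseQuotedString(st, "\""); // text up to the closing quote
        String next = skipIgnoreTokens(st);             // the LHS to RHS separator
        System.out.println(opening + " | " + category + " | " + next);
    }
}
```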
private AbstractSet lambdaClosure(Set s)
Parameters:
s -
void eliminateVoidTransitions() throws TokeniserException
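A common first step when eliminating unrestricted (lambda) transitions is to compute, for a set of states, every state reachable through such transitions alone. The sketch below shows that closure computation in generic form. It is not GATE's implementation: states are plain integers, the voidTransitions map is a hypothetical stand-in for the transitions stored on FSMState objects, and the method takes that map as an extra parameter.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class LambdaClosureSketch {

    /**
     * Returns the given states plus every state reachable from them by
     * following only void (lambda) transitions.
     */
    static Set<Integer> lambdaClosure(Set<Integer> states,
                                      Map<Integer, Set<Integer>> voidTransitions) {
        Set<Integer> closure = new HashSet<>(states);
        Deque<Integer> pending = new ArrayDeque<>(states);
        while (!pending.isEmpty()) {
            Integer state = pending.pop();
            Set<Integer> targets =
                    voidTransitions.getOrDefault(state, Collections.<Integer>emptySet());
            for (Integer next : targets) {
                // add() returns false for states already seen, so cycles terminate.
                if (closure.add(next)) {
                    pending.push(next);
                }
            }
        }
        return closure;
    }

    public static void main(String[] args) {
        Map<Integer, Set<Integer>> voidTransitions = new HashMap<>();
        voidTransitions.put(0, new HashSet<>(Arrays.asList(1)));
        voidTransitions.put(1, new HashSet<>(Arrays.asList(2)));
        voidTransitions.put(2, new HashSet<>(Arrays.asList(0))); // cycle back to 0
        voidTransitions.put(3, new HashSet<>(Arrays.asList(4))); // not reachable from 0

        Set<Integer> start = new HashSet<>(Arrays.asList(0));
        System.out.println(lambdaClosure(start, voidTransitions));
    }
}
```

In a subset construction, each such closure becomes one state of the deterministic machine.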
public String getFSMgml()
public String getDFSMgml()
public FeatureMap getFeatures()
Description copied from interface: FeatureBearer
Specified by: getFeatures in interface FeatureBearer
Overrides: getFeatures in class AbstractFeatureBearer
public void setFeatures(FeatureMap features)
Description copied from interface: FeatureBearer
Specified by: setFeatures in interface FeatureBearer
Overrides: setFeatures in class AbstractFeatureBearer
public void run()
Specified by: run in interface Runnable
Overrides: run in class AbstractProcessingResource
public void addStatusListener(StatusListener listener)
public void removeStatusListener(StatusListener listener)
protected void fireStatusChangedEvent(String text)
protected void fireProgressChangedEvent(int i)
protected void fireProcessFinishedEvent()
public void setRulesURL(URL newRulesURL)
Sets the value of the rulesURL property, which holds a URL to the file containing the rules for this tokeniser.
Parameters:
newRulesURL -
public URL getRulesURL()
Gets the value of the rulesURL property, which holds a URL to the file containing the rules for this tokeniser.
public void setDocument(Document newDocument)
public Document getDocument()
public void setAnnotationSetName(String newAnnotationSetName)
public String getAnnotationSetName()
public void setRulesResourceName(String newRulesResourceName)
public String getRulesResourceName()
public void setEncoding(String newEncoding)
public String getEncoding()
public void removeProgressListener(ProgressListener l)
public void addProgressListener(ProgressListener l)
static void()
The static initialiser will inspect the class Character using reflection to find all the public static members and will map them to ids starting from 0.
After that it will build all the static data: typeIds, maxTypeId, typeMnemonics, stringTypeIds.
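The initialiser described above can be approximated with java.lang.reflect. The sketch below is a guess at the general approach, not the tokeniser's code: it keeps only the public static byte constants of Character (the general category constants are declared as byte in current JDKs), skips the DIRECTIONALITY_ constants because their values overlap with the category values, and sorts names so that ids are assigned in a repeatable order. The class name TypeIdSketch and both filters are assumptions; the field names mirror the ones listed on this page.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class TypeIdSketch {

    /** Character constant value -> internal id. */
    static final Map<Integer, Integer> typeIds = new HashMap<>();
    /** Constant name -> internal id. */
    static final Map<String, Integer> stringTypeIds = new LinkedHashMap<>();
    /** Internal id -> constant name. */
    static final String[] typeMnemonics;
    /** Largest internal id in use. */
    static final int maxTypeId;

    static {
        // Constant name -> value on Character, sorted by name (assumed ordering).
        SortedMap<String, Integer> constants = new TreeMap<>();
        try {
            for (Field f : Character.class.getFields()) {
                boolean candidate = Modifier.isStatic(f.getModifiers())
                        && f.getType() == byte.class
                        && !f.getName().startsWith("DIRECTIONALITY_");
                if (candidate) {
                    // getInt widens the byte value to int.
                    constants.put(f.getName(), f.getInt(null));
                }
            }
        } catch (IllegalAccessException e) {
            throw new ExceptionInInitializerError(e);
        }

        typeMnemonics = new String[constants.size()];
        int id = 0;
        for (Map.Entry<String, Integer> entry : constants.entrySet()) {
            typeMnemonics[id] = entry.getKey();
            stringTypeIds.put(entry.getKey(), id);
            typeIds.put(entry.getValue(), id);
            id++;
        }
        maxTypeId = id - 1;
    }

    public static void main(String[] args) {
        int id = stringTypeIds.get("UPPERCASE_LETTER");
        System.out.println("UPPERCASE_LETTER has internal id " + id + " of " + maxTypeId);
    }
}
```

With these tables, the category of a character obtained from Character.getType can be translated to an internal id through typeIds, and back to a readable name through typeMnemonics.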
protected void fireProgressChanged(int e)
protected void fireProcessFinished()