gate.creole.tokeniser
Class SimpleTokeniser
java.lang.Object
|
+--gate.util.AbstractFeatureBearer
|
+--gate.creole.AbstractResource
|
+--gate.creole.AbstractProcessingResource
|
+--gate.creole.AbstractLanguageAnalyser
|
+--gate.creole.tokeniser.SimpleTokeniser
- All Implemented Interfaces:
- ANNIEConstants, Executable, FeatureBearer, LanguageAnalyser, NameBearer, ProcessingResource, Resource, Serializable
- public class SimpleTokeniser
- extends AbstractLanguageAnalyser
Implementation of a Unicode rule based tokeniser.
The tokeniser gets its rules from a file, an InputStream or a Reader, which
should be passed to one of the constructors.
The implementation is based on a finite state machine that is built from the
set of rules.
A rule has two sides, the left hand side (LHS) and the right hand side (RHS)
that are separated by the ">" character. The LHS represents a
regular expression that will be matched against the input while the RHS
describes a Gate2 annotation in terms of annotation type and attribute-value
pairs.
The matching is done using Unicode enumerated types as defined by the Character
class. At the time of writing this class the
supported Unicode categories were:
- UNASSIGNED
- UPPERCASE_LETTER
- LOWERCASE_LETTER
- TITLECASE_LETTER
- MODIFIER_LETTER
- OTHER_LETTER
- NON_SPACING_MARK
- ENCLOSING_MARK
- COMBINING_SPACING_MARK
- DECIMAL_DIGIT_NUMBER
- LETTER_NUMBER
- OTHER_NUMBER
- SPACE_SEPARATOR
- LINE_SEPARATOR
- PARAGRAPH_SEPARATOR
- CONTROL
- FORMAT
- PRIVATE_USE
- SURROGATE
- DASH_PUNCTUATION
- START_PUNCTUATION
- END_PUNCTUATION
- CONNECTOR_PUNCTUATION
- OTHER_PUNCTUATION
- MATH_SYMBOL
- CURRENCY_SYMBOL
- MODIFIER_SYMBOL
- OTHER_SYMBOL
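The category names above correspond to constants of java.lang.Character, and Character.getType reports which one applies to a given character. The following is a minimal sketch of that lookup, not the tokeniser's own code; the class name UnicodeCategoryDemo, the helper categoryOf and the fallback value "OTHER" are hypothetical, and only a subset of the categories is registered to keep it short.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class UnicodeCategoryDemo {
    // Hypothetical lookup table: a subset of the Character category
    // constants paired with the names used in the list above.
    private static final Map<Integer, String> NAMES = new LinkedHashMap<>();
    static {
        NAMES.put((int) Character.UPPERCASE_LETTER, "UPPERCASE_LETTER");
        NAMES.put((int) Character.LOWERCASE_LETTER, "LOWERCASE_LETTER");
        NAMES.put((int) Character.DECIMAL_DIGIT_NUMBER, "DECIMAL_DIGIT_NUMBER");
        NAMES.put((int) Character.SPACE_SEPARATOR, "SPACE_SEPARATOR");
        NAMES.put((int) Character.DASH_PUNCTUATION, "DASH_PUNCTUATION");
        NAMES.put((int) Character.MATH_SYMBOL, "MATH_SYMBOL");
        NAMES.put((int) Character.CURRENCY_SYMBOL, "CURRENCY_SYMBOL");
    }

    // Returns the category name for c, or "OTHER" for categories that
    // this sketch does not register.
    static String categoryOf(char c) {
        return NAMES.getOrDefault(Character.getType(c), "OTHER");
    }

    public static void main(String[] args) {
        for (char c : new char[] {'A', 'b', '7', ' ', '-', '$', '+'}) {
            System.out.println("'" + c + "' -> " + categoryOf(c));
        }
    }
}
```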
The accepted operators for the LHS are "+", "*" and "|" having the usual
interpretations of "1 to n occurrences", "0 to n occurrences" and
"boolean OR".
For instance this is a valid LHS:
"UPPERCASE_LETTER" "LOWERCASE_LETTER"+
meaning an uppercase letter followed by one or more lowercase letters.
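For readers more familiar with java.util.regex, the LHS above behaves much like the pattern \p{Lu}\p{Ll}+, where \p{Lu} and \p{Ll} are the regex names for the UPPERCASE_LETTER and LOWERCASE_LETTER categories. This is only an analogy under that assumption; the tokeniser builds its own finite state machine and does not use java.util.regex. The class name UpperInitialRegexDemo and the helper isUpperInitial are hypothetical.

```java
import java.util.regex.Pattern;

final class UpperInitialRegexDemo {
    // One uppercase letter followed by one or more lowercase letters.
    static final Pattern UPPER_INITIAL = Pattern.compile("\\p{Lu}\\p{Ll}+");

    // True when the whole string has the shape described by the LHS.
    static boolean isUpperInitial(String s) {
        return UPPER_INITIAL.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"Hello", "hello", "HELLO", "H", "\u00C9mile"}) {
            System.out.println(s + " -> " + isUpperInitial(s));
        }
    }
}
```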
The RHS describes an annotation that is to be created and inserted in the
annotation set provided in case of a match. The new annotation will span the
text that has been recognised. The RHS consists of the annotation type
followed by pairs of attributes and associated values.
E.g. for the LHS above a possible RHS can be:
Token;kind=upperInitial;
representing an annotation of type "Token" having one attribute
named "kind" with the value "upperInitial".
The entire rule will be:
"UPPERCASE_LETTER" "LOWERCASE_LETTER"+ > Token;kind=upperInitial;
The tokeniser ignores empty lines and lines that start with # or
//.
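The rule format described above can be illustrated with a small stand-alone parser. This is a sketch under stated assumptions, not the tokeniser's actual parsing code: it assumes that the first ">" on a line separates LHS from RHS, that surrounding whitespace is not significant, and that every attribute has the form name=value. The class name RuleLineDemo and all of its methods are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class RuleLineDemo {
    // Empty lines and lines starting with # or // carry no rule.
    // Trimming before the check is an assumption of this sketch.
    static boolean isIgnored(String line) {
        String t = line.trim();
        return t.isEmpty() || t.startsWith("#") || t.startsWith("//");
    }

    // Splits a rule at the first '>' into { LHS, RHS }.
    static String[] splitRule(String line) {
        int sep = line.indexOf('>');
        if (sep < 0) {
            throw new IllegalArgumentException("no '>' separator: " + line);
        }
        return new String[] {
            line.substring(0, sep).trim(),
            line.substring(sep + 1).trim()
        };
    }

    // The annotation type is the first ';'-separated field of the RHS.
    static String annotationType(String rhs) {
        return rhs.split(";")[0].trim();
    }

    // The remaining fields are read as attribute=value pairs.
    static Map<String, String> attributes(String rhs) {
        Map<String, String> attrs = new LinkedHashMap<>();
        String[] parts = rhs.split(";");
        for (int i = 1; i < parts.length; i++) {
            String p = parts[i].trim();
            int eq = p.indexOf('=');
            if (eq > 0) {
                attrs.put(p.substring(0, eq).trim(), p.substring(eq + 1).trim());
            }
        }
        return attrs;
    }

    public static void main(String[] args) {
        String line =
            "\"UPPERCASE_LETTER\" \"LOWERCASE_LETTER\"+ > Token;kind=upperInitial;";
        if (!isIgnored(line)) {
            String[] rule = splitRule(line);
            System.out.println("LHS:        " + rule[0]);
            System.out.println("type:       " + annotationType(rule[1]));
            System.out.println("attributes: " + attributes(rule[1]));
        }
    }
}
```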
- See Also:
- Serialized Form
Fields inherited from interface gate.creole.ANNIEConstants
ANNOTATION_COREF_FEATURE_NAME, DATE_ANNOTATION_TYPE, DOCUMENT_COREF_FEATURE_NAME, LOCATION_ANNOTATION_TYPE, LOOKUP_ANNOTATION_TYPE, LOOKUP_MAJOR_TYPE_FEATURE_NAME, LOOKUP_MINOR_TYPE_FEATURE_NAME, MONEY_ANNOTATION_TYPE, ORGANIZATION_ANNOTATION_TYPE, PERSON_ANNOTATION_TYPE, PERSON_GENDER_FEATURE_NAME, PR_NAMES, SENTENCE_ANNOTATION_TYPE, SPACE_TOKEN_ANNOTATION_TYPE, TOKEN_ANNOTATION_TYPE, TOKEN_CATEGORY_FEATURE_NAME, TOKEN_KIND_FEATURE_NAME, TOKEN_LENGTH_FEATURE_NAME, TOKEN_ORTH_FEATURE_NAME, TOKEN_STRING_FEATURE_NAME
Methods inherited from class gate.creole.AbstractResource
checkParameterValues, getName, getParameterValue, getParameterValue, removeResourceListeners, setName, setParameterValue, setParameterValue, setParameterValues, setParameterValues, setResourceListeners
SIMP_TOK_DOCUMENT_PARAMETER_NAME
public static final String SIMP_TOK_DOCUMENT_PARAMETER_NAME
SIMP_TOK_ANNOT_SET_PARAMETER_NAME
public static final String SIMP_TOK_ANNOT_SET_PARAMETER_NAME
SIMP_TOK_RULES_URL_PARAMETER_NAME
public static final String SIMP_TOK_RULES_URL_PARAMETER_NAME
SIMP_TOK_ENCODING_PARAMETER_NAME
public static final String SIMP_TOK_ENCODING_PARAMETER_NAME
typeIds
public static Map typeIds
- Maps from int (the static value on Character) to int (the internal value
used by the tokeniser). The int values used by the tokeniser are consecutive
values, starting from 0 and going as high as necessary.
They map all the public static int members on Character.
maxTypeId
public static int maxTypeId
- The maximum int value used internally as a type id.
typeMnemonics
public static String[] typeMnemonics
- Maps the internal type ids to the type names.
stringTypeIds
public static Map stringTypeIds
- Maps from type names to internal type ids.
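The relationship between these four fields can be sketched as follows. This is an illustration only, not the class's actual initialisation code: the field names mirror the documented ones, but the class TypeIdTablesDemo, the helpers register and internalIdOf, the use of a List instead of a String[] for the mnemonics, and the choice of only four categories are assumptions made to keep the example short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TypeIdTablesDemo {
    // Character category value -> consecutive internal id (0, 1, 2, ...).
    static final Map<Integer, Integer> typeIds = new HashMap<>();
    // Category name -> internal id.
    static final Map<String, Integer> stringTypeIds = new HashMap<>();
    // Internal id -> category name (the index is the id).
    static final List<String> typeMnemonics = new ArrayList<>();
    // Highest internal id handed out so far.
    static int maxTypeId = -1;

    // Hypothetical helper: assigns the next free id to one category.
    private static void register(int characterType, String name) {
        int id = typeMnemonics.size();
        typeIds.put(characterType, id);
        stringTypeIds.put(name, id);
        typeMnemonics.add(name);
        maxTypeId = id;
    }

    static {
        register(Character.UPPERCASE_LETTER, "UPPERCASE_LETTER");
        register(Character.LOWERCASE_LETTER, "LOWERCASE_LETTER");
        register(Character.DECIMAL_DIGIT_NUMBER, "DECIMAL_DIGIT_NUMBER");
        register(Character.SPACE_SEPARATOR, "SPACE_SEPARATOR");
    }

    // Internal id for the category of c, or -1 if it is not registered here.
    static int internalIdOf(char c) {
        return typeIds.getOrDefault(Character.getType(c), -1);
    }

    public static void main(String[] args) {
        int id = internalIdOf('b');
        System.out.println("'b' has internal id " + id
            + " (" + typeMnemonics.get(id) + "), maxTypeId = " + maxTypeId);
    }
}
```

Consecutive ids let a state machine index its transition tables directly by type id, which is presumably why the sparse Character values are remapped.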
SimpleTokeniser
public SimpleTokeniser()
- Creates a tokeniser
init
public Resource init()
throws ResourceInstantiationException
- Initialises this tokeniser by reading the rules from an external source (provided through a URL) and building
the finite state machine at the core of the tokeniser.
- Overrides:
init
in class AbstractProcessingResource
- Throws:
ResourceInstantiationException
reset
public void reset()
- Prepares this Processing resource for a new run.
getFSMgml
public String getFSMgml()
- Returns a string representation of the non-deterministic FSM graph using
GML (Graph Modelling Language).
getDFSMgml
public String getDFSMgml()
- Returns a string representation of the deterministic FSM graph using
GML.
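For orientation, GML describes a graph as nested key/value lists with node and edge entries. The sketch below renders a tiny three-state machine for the upperInitial rule in that general shape. It is an assumption-laden illustration: the exact keys, labels and layout produced by getFSMgml and getDFSMgml are not shown in this documentation, and the class GmlSketch, the method toGml and the state names are hypothetical.

```java
final class GmlSketch {
    // Renders a directed, labelled graph as GML text.
    // edges[i] holds { sourceNodeId, targetNodeId } and edgeLabels[i] its label.
    static String toGml(String[] nodeLabels, int[][] edges, String[] edgeLabels) {
        StringBuilder sb = new StringBuilder("graph [\n  directed 1\n");
        for (int i = 0; i < nodeLabels.length; i++) {
            sb.append("  node [ id ").append(i)
              .append(" label \"").append(nodeLabels[i]).append("\" ]\n");
        }
        for (int i = 0; i < edges.length; i++) {
            sb.append("  edge [ source ").append(edges[i][0])
              .append(" target ").append(edges[i][1])
              .append(" label \"").append(edgeLabels[i]).append("\" ]\n");
        }
        return sb.append("]\n").toString();
    }

    public static void main(String[] args) {
        String[] states = {"start", "afterUpper", "accept"};
        int[][] edges = {{0, 1}, {1, 2}, {2, 2}};
        String[] labels = {"UPPERCASE_LETTER", "LOWERCASE_LETTER", "LOWERCASE_LETTER"};
        System.out.print(toGml(states, edges, labels));
    }
}
```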
getFeatures
public FeatureMap getFeatures()
- Description copied from interface:
FeatureBearer
- Get the feature set
- Overrides:
getFeatures
in class AbstractFeatureBearer
setFeatures
public void setFeatures(FeatureMap features)
- Description copied from interface:
FeatureBearer
- Set the feature set
- Overrides:
setFeatures
in class AbstractFeatureBearer
execute
public void execute()
throws ExecutionException
- The method that does the actual tokenisation.
- Overrides:
execute
in class AbstractProcessingResource
setRulesURL
public void setRulesURL(URL newRulesURL)
- Sets the value of the
rulesURL
property which holds a URL
to the file containing the rules for this tokeniser.
- Parameters:
newRulesURL
- the URL to the file containing the rules for this tokeniser
getRulesURL
public URL getRulesURL()
- Gets the value of the
rulesURL
property which holds a
URL to the file containing the rules for this tokeniser.
setAnnotationSetName
public void setAnnotationSetName(String newAnnotationSetName)
getAnnotationSetName
public String getAnnotationSetName()
setRulesResourceName
public void setRulesResourceName(String newRulesResourceName)
getRulesResourceName
public String getRulesResourceName()
setEncoding
public void setEncoding(String newEncoding)
getEncoding
public String getEncoding()