gate.creole.orthomatcher
Class OrthoMatcher

java.lang.Object
  |
  +--gate.util.AbstractFeatureBearer
        |
        +--gate.creole.AbstractResource
              |
              +--gate.creole.AbstractProcessingResource
                    |
                    +--gate.creole.AbstractLanguageAnalyser
                          |
                          +--gate.creole.orthomatcher.OrthoMatcher
All Implemented Interfaces:
ANNIEConstants, Executable, FeatureBearer, LanguageAnalyser, NameBearer, ProcessingResource, Resource, Serializable

public class OrthoMatcher
extends AbstractLanguageAnalyser
implements ANNIEConstants

See Also:
Serialized Form

Field Summary
static String OM_ANN_SET_PARAMETER_NAME
           
static String OM_ANN_TYPES_PARAMETER_NAME
           
static String OM_CASE_SENSITIVE_PARAMETER_NAME
           
static String OM_DOCUMENT_PARAMETER_NAME
           
static String OM_EXT_LISTS_PARAMETER_NAME
           
static String OM_ORG_TYPE_PARAMETER_NAME
           
static String OM_PERSON_TYPE_PARAMETER_NAME
           
 
Fields inherited from interface gate.creole.ANNIEConstants
ANNOTATION_COREF_FEATURE_NAME, DATE_ANNOTATION_TYPE, DOCUMENT_COREF_FEATURE_NAME, LOCATION_ANNOTATION_TYPE, LOOKUP_ANNOTATION_TYPE, LOOKUP_CLASS_FEATURE_NAME, LOOKUP_MAJOR_TYPE_FEATURE_NAME, LOOKUP_MINOR_TYPE_FEATURE_NAME, LOOKUP_ONTOLOGY_FEATURE_NAME, MONEY_ANNOTATION_TYPE, ORGANIZATION_ANNOTATION_TYPE, PERSON_ANNOTATION_TYPE, PERSON_GENDER_FEATURE_NAME, PR_NAMES, SENTENCE_ANNOTATION_TYPE, SPACE_TOKEN_ANNOTATION_TYPE, TOKEN_ANNOTATION_TYPE, TOKEN_CATEGORY_FEATURE_NAME, TOKEN_KIND_FEATURE_NAME, TOKEN_LENGTH_FEATURE_NAME, TOKEN_ORTH_FEATURE_NAME, TOKEN_STRING_FEATURE_NAME
 
Constructor Summary
OrthoMatcher()
           
 
Method Summary
 void execute()
          Run the resource.
 String getAnnotationSetName()
          get the name of the annotation set
 List getAnnotationTypes()
          get the types of the annotation
 Boolean getCaseSensitive()
          Are we running in a case-sensitive mode?
 URL getDefinitionFileURL()
           
 String getEncoding()
           
 Boolean getExtLists()
           
 String getOrganizationType()
           
 String getPersonType()
           
 Boolean getProcessUnknown()
          Return whether the Unknown annotations are being processed.
 Resource init()
          Initialise this resource, and return it.
 boolean matchRule0(String s1, String s2)
          RULE #0: If the two names are listed in the table of spurious matches, they do NOT match. Condition(s): none. Applied to: all name annotations.
 boolean matchRule1(String s1, String s2, boolean matchCase)
          RULE #1: If the two names are identical, they match. No longer used, because identical strings are detected via the hash table of previous annotations. Condition(s): depends on case. Applied to: all name annotations.
 boolean matchRule10(String s1, String s2)
          RULE #10: Is one name the reverse of the other, reversing around prepositions only?
 boolean matchRule11(String s1, String s2)
          RULE #11: Does one name consist of contractions of the first two tokens of the other name?
 boolean matchRule12(String s1, String s2)
          RULE #12: Do the first and last tokens of one name match the first and last tokens of the other? Condition(s): case-sensitive match. Applied to: organisation annotations only.
 boolean matchRule13(String s1, String s2)
          RULE #13: Do multi-word names match except for one token?
 boolean matchRule14(String s1, String s2)
          RULE #14: Matches if the last token of one name equals the other name.
 boolean matchRule15(String s1, String s2)
          RULE #15: Does one token from a Person name appear as the other token? Note that this rule was NOT used in LaSIE's 1.5 namematcher; it was added for ACE at Di's request.
 boolean matchRule2(String s1, String s2)
          RULE #2: If the two names are listed as equivalent in the lookup table (alias), they match. Condition(s): none. Applied to: all name annotations.
 boolean matchRule3(String s1, String s2)
          RULE #3: Adding a possessive at the end of one name causes a match.
 boolean matchRule4(String s1, String s2)
          RULE #4: Do all tokens other than the punctuation marks "," and "." match?
 boolean matchRule5(String s1, String s2)
          RULE #5: Matches if the first token of one name equals the other name.
 boolean matchRule6(String s1, String s2)
          RULE #6: Matches if one name is the acronym of the other.
 boolean matchRule7(String s1, String s2)
          RULE #7: If one of the tokens in one of the names is in the list of separators, check whether the token before the separator matches the other name.
 boolean matchRule8(String s1, String s2)
          This rule is now obsolete, as "The" and the trailing CDG are stripped before matching.
 boolean matchRule9(String s1, String s2)
          RULE #9: Does one of the names match the token just before a trailing company designator in the other name? The company designator has already been chopped off, so the token before it is in fact the last token.
 String regularExpressions(String text, String replacement, String regEx)
          Substitute all multiple spaces, tabs and newlines with a single space.
 void setAnnotationSetName(String newAnnotationSetName)
          set the annotation set name
 void setAnnotationTypes(List newType)
          set the types of the annotations
 void setCaseSensitive(Boolean newCase)
          set the caseSensitive flag
 void setDefinitionFileURL(URL definitionFileURL)
           
 void setEncoding(String encoding)
           
 void setExtLists(Boolean newExtLists)
          set the extLists flag
 void setOrganizationType(String newOrganizationType)
           
 void setPersonType(String newPersonType)
           
 void setProcessUnknown(Boolean processOrNot)
          set whether to process the Unknown annotations
 
Methods inherited from class gate.creole.AbstractLanguageAnalyser
getCorpus, getDocument, setCorpus, setDocument
 
Methods inherited from class gate.creole.AbstractProcessingResource
addProgressListener, addStatusListener, cleanup, interrupt, isInterrupted, reInit, removeProgressListener, removeStatusListener
 
Methods inherited from class gate.creole.AbstractResource
checkParameterValues, getName, getParameterValue, getParameterValue, removeResourceListeners, setName, setParameterValue, setParameterValue, setParameterValues, setParameterValues, setResourceListeners
 
Methods inherited from class gate.util.AbstractFeatureBearer
getFeatures, setFeatures
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface gate.ProcessingResource
reInit
 
Methods inherited from interface gate.Resource
cleanup, getParameterValue, setParameterValue, setParameterValues
 
Methods inherited from interface gate.util.FeatureBearer
getFeatures, setFeatures
 
Methods inherited from interface gate.util.NameBearer
getName, setName
 
Methods inherited from interface gate.Executable
interrupt, isInterrupted
 

Field Detail

OM_DOCUMENT_PARAMETER_NAME

public static final String OM_DOCUMENT_PARAMETER_NAME
See Also:
Constant Field Values

OM_ANN_SET_PARAMETER_NAME

public static final String OM_ANN_SET_PARAMETER_NAME
See Also:
Constant Field Values

OM_CASE_SENSITIVE_PARAMETER_NAME

public static final String OM_CASE_SENSITIVE_PARAMETER_NAME
See Also:
Constant Field Values

OM_ANN_TYPES_PARAMETER_NAME

public static final String OM_ANN_TYPES_PARAMETER_NAME
See Also:
Constant Field Values

OM_ORG_TYPE_PARAMETER_NAME

public static final String OM_ORG_TYPE_PARAMETER_NAME
See Also:
Constant Field Values

OM_PERSON_TYPE_PARAMETER_NAME

public static final String OM_PERSON_TYPE_PARAMETER_NAME
See Also:
Constant Field Values

OM_EXT_LISTS_PARAMETER_NAME

public static final String OM_EXT_LISTS_PARAMETER_NAME
See Also:
Constant Field Values
Constructor Detail

OrthoMatcher

public OrthoMatcher()
Method Detail

init

public Resource init()
              throws ResourceInstantiationException
Initialise this resource, and return it.

Specified by:
init in interface Resource
Overrides:
init in class AbstractProcessingResource
Throws:
ResourceInstantiationException

execute

public void execute()
             throws ExecutionException
Run the resource. It doesn't make sense not to override this in subclasses so the default implementation signals an exception.

Specified by:
execute in interface Executable
Overrides:
execute in class AbstractProcessingResource
Throws:
ExecutionException

setExtLists

public void setExtLists(Boolean newExtLists)
set the extLists flag


setCaseSensitive

public void setCaseSensitive(Boolean newCase)
set the caseSensitive flag


setAnnotationSetName

public void setAnnotationSetName(String newAnnotationSetName)
set the annotation set name


setAnnotationTypes

public void setAnnotationTypes(List newType)
set the types of the annotations


setProcessUnknown

public void setProcessUnknown(Boolean processOrNot)
set whether to process the Unknown annotations


setOrganizationType

public void setOrganizationType(String newOrganizationType)

setPersonType

public void setPersonType(String newPersonType)

getAnnotationSetName

public String getAnnotationSetName()
get the name of the annotation set


getAnnotationTypes

public List getAnnotationTypes()
get the types of the annotation


getOrganizationType

public String getOrganizationType()

getPersonType

public String getPersonType()

getExtLists

public Boolean getExtLists()

getCaseSensitive

public Boolean getCaseSensitive()
Are we running in a case-sensitive mode?


getProcessUnknown

public Boolean getProcessUnknown()
Return whether the Unknown annotations are being processed.


matchRule0

public boolean matchRule0(String s1,
                          String s2)
RULE #0: If the two names are listed in the table of spurious matches, they do NOT match.
Condition(s): none
Applied to: all name annotations


matchRule1

public boolean matchRule1(String s1,
                          String s2,
                          boolean matchCase)
RULE #1: If the two names are identical, they match. This rule is no longer used, because identical strings are detected via the hash table of previous annotations.
Condition(s): depends on case
Applied to: all name annotations


matchRule2

public boolean matchRule2(String s1,
                          String s2)
RULE #2: If the two names are listed as equivalent in the lookup table (alias), they match.
Condition(s): none
Applied to: all name annotations


matchRule3

public boolean matchRule3(String s1,
                          String s2)
RULE #3: Adding a possessive at the end of one name causes a match,
e.g. "Standard and Poor" == "Standard and Poor's" and also "Standard and Poor" == "Standard's".
Condition(s): case-insensitive match
Applied to: all name annotations
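
The possessive check can be sketched in plain Java as below. This is an illustration under stated assumptions, not GATE's implementation: the class and method names are hypothetical, tokens are split on whitespace, and only the "'s" suffix is handled.

```java
import java.util.Locale;

/** Illustrative sketch only; not the GATE implementation. */
final class PossessiveRuleSketch {

    /** True if other is name, or the first token of name, followed by "'s". */
    static boolean possessiveOf(String name, String other) {
        String base = name.trim().toLowerCase(Locale.ROOT);
        String candidate = other.trim().toLowerCase(Locale.ROOT);
        if (base.isEmpty() || !candidate.endsWith("'s")) {
            return false;
        }
        // Drop the trailing "'s" and compare with the whole name or its first token.
        String stem = candidate.substring(0, candidate.length() - 2);
        String firstToken = base.split("\\s+")[0];
        return stem.equals(base) || stem.equals(firstToken);
    }

    /** Symmetric, case-insensitive check in the spirit of RULE #3. */
    static boolean matches(String s1, String s2) {
        return possessiveOf(s1, s2) || possessiveOf(s2, s1);
    }
}
```

With this sketch, PossessiveRuleSketch.matches("Standard and Poor", "Standard's") holds, mirroring the second example above.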


matchRule4

public boolean matchRule4(String s1,
                          String s2)
RULE #4: Do all tokens other than the punctuation marks "," and "." match?
e.g. "Smith, Jones" == "Smith Jones"
Condition(s): case-insensitive match
Applied to: organisation and person annotations
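
A minimal sketch of the punctuation-insensitive comparison follows. It assumes commas and periods act as separate tokens that can simply be dropped; the class and method names are hypothetical and do not come from GATE.

```java
/** Illustrative sketch only; not the GATE implementation. */
final class PunctuationRuleSketch {

    /** Replaces "," and "." with spaces, then collapses runs of whitespace. */
    static String stripCommasAndPeriods(String name) {
        return name.replaceAll("[,.]", " ").trim().replaceAll("\\s+", " ");
    }

    /** Case-insensitive comparison of the remaining tokens, in the spirit of RULE #4. */
    static boolean matches(String s1, String s2) {
        return stripCommasAndPeriods(s1).equalsIgnoreCase(stripCommasAndPeriods(s2));
    }
}
```

Because punctuation becomes a token boundary here, "R.H." is compared as the two tokens "R" and "H", which is a design choice of the sketch.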


matchRule5

public boolean matchRule5(String s1,
                          String s2)
RULE #5: Matches if the first token of one name equals the other name,
e.g. "Pepsi Cola" == "Pepsi"
Condition(s): case-insensitive match
Applied to: all name annotations
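
The first-token comparison might look like the following sketch. The names are hypothetical, whitespace tokenisation is an assumption, and the longer name is required to have at least two tokens so that identical single-token names are left to other rules.

```java
/** Illustrative sketch only; not the GATE implementation. */
final class FirstTokenRuleSketch {

    /** True if the first token of a multi-token name equals shortName, ignoring case. */
    static boolean firstTokenEquals(String fullName, String shortName) {
        String[] tokens = fullName.trim().split("\\s+");
        return tokens.length > 1 && tokens[0].equalsIgnoreCase(shortName.trim());
    }

    /** Symmetric check in the spirit of RULE #5. */
    static boolean matches(String s1, String s2) {
        return firstTokenEquals(s1, s2) || firstTokenEquals(s2, s1);
    }
}
```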


matchRule6

public boolean matchRule6(String s1,
                          String s2)
RULE #6: Matches if one name is the acronym of the other,
e.g. "Imperial Chemical Industries" == "ICI"
Applied to: organisation annotations only
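
An acronym test can be sketched by concatenating token initials. The code below is an assumption-laden illustration, not GATE's code: names are hypothetical, the comparison is case-sensitive, and periods in the candidate acronym (as in "I.C.I.") are ignored.

```java
/** Illustrative sketch only; not the GATE implementation. */
final class AcronymRuleSketch {

    /** True if shortForm consists of the initials of the tokens of fullName. */
    static boolean acronymOf(String fullName, String shortForm) {
        String[] tokens = fullName.trim().split("\\s+");
        if (tokens.length < 2) {
            return false; // a single token has no meaningful acronym
        }
        StringBuilder initials = new StringBuilder();
        for (String token : tokens) {
            initials.append(token.charAt(0));
        }
        // Assumption: "I.C.I." and "ICI" are treated alike.
        return initials.toString().equals(shortForm.replace(".", "").trim());
    }

    /** Symmetric check in the spirit of RULE #6. */
    static boolean matches(String s1, String s2) {
        return acronymOf(s1, s2) || acronymOf(s2, s1);
    }
}
```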


matchRule7

public boolean matchRule7(String s1,
                          String s2)
RULE #7: If one of the tokens in one of the names is in the list of separators (e.g. "&"), check whether the token before the separator matches the other name,
e.g. "R.H. Macy & Co." == "Macy"
Condition(s): case-sensitive match
Applied to: organisation annotations only
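
The separator rule can be sketched as a scan for a separator token followed by a comparison of the token before it. Everything named here is hypothetical, and the separator set contains only "&" because the full list is not given in this documentation.

```java
import java.util.Collections;
import java.util.Set;

/** Illustrative sketch only; not the GATE implementation. */
final class SeparatorRuleSketch {

    // Assumed separator list; only "&" is mentioned in the rule description.
    static final Set<String> SEPARATORS = Collections.singleton("&");

    /** True if a token directly before a separator in name equals other (case-sensitive). */
    static boolean tokenBeforeSeparatorEquals(String name, String other) {
        String[] tokens = name.trim().split("\\s+");
        String target = other.trim();
        for (int i = 1; i < tokens.length; i++) {
            if (SEPARATORS.contains(tokens[i]) && tokens[i - 1].equals(target)) {
                return true;
            }
        }
        return false;
    }

    /** Symmetric check in the spirit of RULE #7. */
    static boolean matches(String s1, String s2) {
        return tokenBeforeSeparatorEquals(s1, s2) || tokenBeforeSeparatorEquals(s2, s1);
    }
}
```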


matchRule8

public boolean matchRule8(String s1,
                          String s2)
This rule is now obsolete, as "The" and the trailing CDG (company designator) are stripped before matching. DO NOT CALL!
RULE #8: Matches if the names are equal, ignoring "The" and a trailing company designator (both of which have already been stripped),
e.g. "The Magic Tricks Co." == "Magic Tricks"
Condition(s): case-sensitive match
Applied to: organisation annotations only


matchRule9

public boolean matchRule9(String s1,
                          String s2)
RULE #9: Does one of the names match the token just before a trailing company designator in the other name? The company designator has already been chopped off, so the token before it is in fact the last token,
e.g. "R.H. Macy Co." == "Macy"
Applied to: organisation annotations only


matchRule10

public boolean matchRule10(String s1,
                           String s2)
RULE #10: Is one name the reverse of the other, reversing around prepositions only?
e.g. "Department of Defence" == "Defence Department"
Condition(s): case-sensitive match
Applied to: organisation annotations only
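
Reversal around a preposition can be sketched by swapping the token runs on either side of the preposition and dropping the preposition itself. The preposition list and all names below are assumptions made for illustration; they are not taken from GATE.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Illustrative sketch only; not the GATE implementation. */
final class PrepositionReversalRuleSketch {

    // Assumed preposition list; the list GATE uses is not given in this documentation.
    static final Set<String> PREPOSITIONS = new HashSet<>(Arrays.asList("of", "for", "in"));

    /** True if swapping the parts of withPreposition around a preposition yields other. */
    static boolean reversalOf(String withPreposition, String other) {
        String[] tokens = withPreposition.trim().split("\\s+");
        String target = other.trim().replaceAll("\\s+", " ");
        // A preposition must have at least one token on each side.
        for (int i = 1; i < tokens.length - 1; i++) {
            if (PREPOSITIONS.contains(tokens[i])) {
                String head = String.join(" ", Arrays.copyOfRange(tokens, 0, i));
                String tail = String.join(" ", Arrays.copyOfRange(tokens, i + 1, tokens.length));
                if ((tail + " " + head).equals(target)) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Symmetric, case-sensitive check in the spirit of RULE #10. */
    static boolean matches(String s1, String s2) {
        return reversalOf(s1, s2) || reversalOf(s2, s1);
    }
}
```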


matchRule11

public boolean matchRule11(String s1,
                           String s2)
RULE #11: Does one name consist of contractions of the first two tokens of the other name?
e.g. "Communications Satellite" == "ComSat" and "Pan American" == "Pan Am"
Condition(s): case-sensitive match
Applied to: organisation annotations only
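
One way to read "contraction" is that the short name, with spaces removed, splits into a prefix of the first token followed by a prefix of the second. The sketch below implements that reading only; the names are hypothetical, and the minimum prefix length of two characters is an assumption added to avoid matching bare initials.

```java
/** Illustrative sketch only; not the GATE implementation. */
final class ContractionRuleSketch {

    /** True if shortForm is a prefix of the first token followed by a prefix of the second. */
    static boolean contractionOf(String fullName, String shortForm) {
        String[] tokens = fullName.trim().split("\\s+");
        if (tokens.length < 2) {
            return false;
        }
        String first = tokens[0];
        String second = tokens[1];
        String candidate = shortForm.replaceAll("\\s+", "");
        // A contraction must be shorter than the two tokens it abbreviates.
        if (candidate.length() >= first.length() + second.length()) {
            return false;
        }
        // Try every split point, keeping at least two characters on each side (assumption).
        for (int k = 2; k <= candidate.length() - 2; k++) {
            if (first.startsWith(candidate.substring(0, k))
                    && second.startsWith(candidate.substring(k))) {
                return true;
            }
        }
        return false;
    }

    /** Symmetric, case-sensitive check in the spirit of RULE #11. */
    static boolean matches(String s1, String s2) {
        return contractionOf(s1, s2) || contractionOf(s2, s1);
    }
}
```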


matchRule12

public boolean matchRule12(String s1,
                           String s2)
RULE #12: Do the first and last tokens of one name match the first and last tokens of the other?
Condition(s): case-sensitive match
Applied to: organisation annotations only
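
The first-and-last-token comparison could be sketched as follows. The class name, the whitespace tokenisation, and the requirement that both names have at least two tokens are assumptions; the organisation names in the usage note are invented for illustration.

```java
/** Illustrative sketch only; not the GATE implementation. */
final class FirstAndLastTokenRuleSketch {

    /** Case-sensitive comparison of the first and last tokens of two multi-token names. */
    static boolean matches(String s1, String s2) {
        String[] a = s1.trim().split("\\s+");
        String[] b = s2.trim().split("\\s+");
        if (a.length < 2 || b.length < 2) {
            return false; // assumption: single-token names are left to other rules
        }
        return a[0].equals(b[0]) && a[a.length - 1].equals(b[b.length - 1]);
    }
}
```

Under this sketch, the hypothetical names "Acme Widget Corporation" and "Acme Corporation" match, while "Acme Widget Corporation" and "Acme Widget" do not.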


matchRule13

public boolean matchRule13(String s1,
                           String s2)
RULE #13: Do multi-word names match except for one token?
e.g. "Second Force Recon Company" == "Force Recon Company"
Note that this rule was NOT used in LaSIE's 1.5 namematcher.
Restrictions:
- remove the cdg (company designator) first
- the shortest name should be 2 words or more
- if N is the number of tokens of the longest name, then N-1 tokens should be matched
Condition(s): case-sensitive match
Applied to: organisation or person annotations only
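
The restrictions above can be sketched as a token count: with N tokens in the longer name, exactly N-1 of them must also occur in the shorter one. The names below are hypothetical, the company designator is assumed to have been stripped already, and tokens are matched regardless of position, which the documentation does not specify.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; not the GATE implementation. */
final class AllButOneTokenRuleSketch {

    /** Case-sensitive check that all but one token of the longer name occur in the shorter. */
    static boolean matches(String s1, String s2) {
        String[] a = s1.trim().split("\\s+");
        String[] b = s2.trim().split("\\s+");
        String[] longer = a.length >= b.length ? a : b;
        String[] shorter = a.length >= b.length ? b : a;
        if (shorter.length < 2) {
            return false; // the shortest name should be 2 words or more
        }
        // Consume each matched token so repeated tokens are not counted twice.
        List<String> remaining = new ArrayList<>(Arrays.asList(shorter));
        int matched = 0;
        for (String token : longer) {
            if (remaining.remove(token)) {
                matched++;
            }
        }
        return matched == longer.length - 1;
    }
}
```

Identical names give N matches rather than N-1, so this sketch rejects them and leaves exact equality to other checks.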


matchRule14

public boolean matchRule14(String s1,
                           String s2)
RULE #14: Matches if the last token of one name equals the other name,
e.g. "Hamish Cunningham" == "Cunningham"
Condition(s): case-insensitive match
Applied to: all person annotations
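
A surname-style comparison on the last token might be sketched like this. The names are hypothetical and tokens are split on whitespace, which is an assumption rather than GATE's tokenisation.

```java
/** Illustrative sketch only; not the GATE implementation. */
final class LastTokenRuleSketch {

    /** True if the last token of a multi-token name equals shortName, ignoring case. */
    static boolean lastTokenEquals(String fullName, String shortName) {
        String[] tokens = fullName.trim().split("\\s+");
        return tokens.length > 1
                && tokens[tokens.length - 1].equalsIgnoreCase(shortName.trim());
    }

    /** Symmetric check in the spirit of RULE #14. */
    static boolean matches(String s1, String s2) {
        return lastTokenEquals(s1, s2) || lastTokenEquals(s2, s1);
    }
}
```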


matchRule15

public boolean matchRule15(String s1,
                           String s2)
RULE #15: Does one token from a Person name appear as the other token?
Note that this rule was NOT used in LaSIE's 1.5 namematcher; it was added for ACE at Di's request.


regularExpressions

public String regularExpressions(String text,
                                 String replacement,
                                 String regEx)
Substitute all multiple spaces, tabs and newlines with a single space.
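
The whitespace normalisation described here can be sketched with java.util.regex. The sketch assumes the method replaces every match of regEx in text with replacement, which the signature suggests but this documentation does not state; the class name and the "\\s+" pattern in the usage note are likewise assumptions.

```java
import java.util.regex.Pattern;

/** Illustrative sketch only; not the GATE implementation. */
final class WhitespaceNormaliserSketch {

    /** Replaces every match of regEx in text with replacement. */
    static String regularExpressions(String text, String replacement, String regEx) {
        return Pattern.compile(regEx).matcher(text).replaceAll(replacement);
    }
}
```

Calling WhitespaceNormaliserSketch.regularExpressions(name, " ", "\\s+") collapses each run of spaces, tabs and newlines in name into a single space.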


setDefinitionFileURL

public void setDefinitionFileURL(URL definitionFileURL)

getDefinitionFileURL

public URL getDefinitionFileURL()

setEncoding

public void setEncoding(String encoding)

getEncoding

public String getEncoding()