gate.corpora
Class DocumentImpl

java.lang.Object
  |
  +--gate.util.AbstractFeatureBearer
        |
        +--gate.creole.AbstractResource
              |
              +--gate.creole.AbstractLanguageResource
                    |
                    +--gate.corpora.DocumentImpl
All Implemented Interfaces:
Comparable, Document, FeatureBearer, LanguageResource, Resource, Serializable
Direct Known Subclasses:
DocumentWrapper

public class DocumentImpl
extends AbstractLanguageResource
implements Document

Represents the commonalities between all sorts of documents.

Editing

The DocumentImpl class implements the Document interface. The DocumentContentImpl class models the textual or audio-visual materials which are the source and content of Documents. The AnnotationSetImpl class supplies annotations on Documents.

Abbreviations: D = Document; DC = DocumentContent; AS = AnnotationSet.

We add an edit method to each of these classes; for DC and AS the methods are package private; D has the public method.

   void edit(Long start, Long end, DocumentContent replacement)
   throws InvalidOffsetException;
 

D receives edit requests and forwards them to DC and AS. On DC, this method makes a change to the content - e.g. replacing a String range from start to end with replacement. (Deletions are catered for by having replacement = null.) D then calls AS.edit on each of its annotation sets.
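
The content-side replacement can be illustrated with a plain String standing in for DocumentContent. This is only a sketch: the class name ContentEditSketch and its edit helper are hypothetical, and IndexOutOfBoundsException is used here in place of InvalidOffsetException. The range is taken as start-inclusive, end-exclusive.

```java
final class ContentEditSketch {

    /**
     * Replaces the range [start, end) of text with replacement.
     * A null replacement is treated as a deletion.
     * Hypothetical helper, not part of the GATE API.
     */
    static String edit(String text, long start, long end, String replacement) {
        if (start < 0 || end < start || end > text.length()) {
            // Stand-in for InvalidOffsetException in this sketch.
            throw new IndexOutOfBoundsException("invalid range " + start + "-" + end);
        }
        String repl = (replacement == null) ? "" : replacement;
        return new StringBuilder(text).replace((int) start, (int) end, repl).toString();
    }

    public static void main(String[] args) {
        System.out.println(edit("abcdef", 1, 3, "BC"));   // aBCdef
        System.out.println(edit("abcdef", 1, 3, null));   // adef
        System.out.println(edit("abcdef", 1, 3, "BBCC")); // aBBCCdef
    }
}
```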

On AS, edit calls replacement.size() (i.e. DC.size()) to figure out how long the replacement is (0 for null). It then treats annotations that terminate (start or end) in the altered or deleted range as invalid; annotations that terminate after the range have their offsets adjusted.

A note re. AS and annotations: annotations no longer have offsets as in the old model, they now have nodes, and nodes have offsets.

To implement AS.edit, we have several indices:

   HashMap annotsByStartNode, annotsByEndNode;
 
which map node ids to annotations;
   RBTreeMap nodesByOffset;
 
which maps offset to Nodes.

When we get an edit request, we traverse that part of the nodesByOffset tree representing the altered or deleted range of the DC. For each node found, we delete any annotations that terminate on the node, and then delete the node itself. We then traverse the rest of the tree, changing the offset on all remaining nodes by:

   newOffset =
     oldOffset -
     (
       (end - start) -                                     // size of mod
       ( (replacement == null) ? 0 : replacement.size() )  // size of repl
     );
 
Note that we use the same convention as e.g. java.lang.String: start offsets are inclusive; end offsets are exclusive. I.e. for string "abcd" range 1-3 = "bc". Examples, for a node with offset 4:
 edit(1, 3, "BC");
 newOffset = 4 - ( (3 - 1) - 2 ) = 4

 edit(1, 3, null);
 newOffset = 4 - ( (3 - 1) - 0 ) = 2

 edit(1, 3, "BBCC");
 newOffset = 4 - ( (3 - 1) - 4 ) = 6
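
The traversal and the offset formula above can be sketched with java.util.TreeMap standing in for RBTreeMap and integer node ids standing in for Node objects. The class NodeShiftSketch is hypothetical. It also assumes that a node counts as inside the range when start <= offset < end, which the description does not spell out, and it leaves out the deletion of annotations attached to removed nodes.

```java
import java.util.Map;
import java.util.TreeMap;

final class NodeShiftSketch {

    /** The offset arithmetic from the formula above. */
    static long newOffset(long oldOffset, long start, long end, long replacementSize) {
        return oldOffset - ((end - start) - replacementSize);
    }

    /**
     * Returns a new offset-to-node-id index reflecting an edit of [start, end).
     * Assumption of this sketch: nodes with start <= offset < end are dropped.
     */
    static TreeMap<Long, Integer> edit(TreeMap<Long, Integer> nodesByOffset,
                                       long start, long end, long replacementSize) {
        TreeMap<Long, Integer> result = new TreeMap<>();
        // Nodes before the edited range keep their offsets.
        result.putAll(nodesByOffset.headMap(start, false));
        // Nodes inside the range are skipped, i.e. deleted.
        // Nodes at or after the end of the range are shifted.
        for (Map.Entry<Long, Integer> e : nodesByOffset.tailMap(end, true).entrySet()) {
            result.put(newOffset(e.getKey(), start, end, replacementSize), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        TreeMap<Long, Integer> nodes = new TreeMap<>();
        nodes.put(0L, 1);
        nodes.put(2L, 2);
        nodes.put(4L, 3);
        System.out.println(edit(nodes, 1, 3, 2)); // {0=1, 4=3}
        System.out.println(edit(nodes, 1, 3, 0)); // {0=1, 2=3}
        System.out.println(edit(nodes, 1, 3, 4)); // {0=1, 6=3}
    }
}
```

The three calls in main mirror the three examples above: the node at offset 4 stays at 4, moves to 2, or moves to 6, while the node at offset 2 lies inside the edited range and is removed.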
 

See Also:
Serialized Form

Field Summary
protected  DocumentContent content
          The content of the document
private static boolean DEBUG
          Debug flag
protected  AnnotationSet defaultAnnots
          The default annotation set
private  Vector documentListeners
           
protected  String encoding
          The encoding of the source of the document content
protected  FeatureMap features
          The features associated with this document.
private  Vector gateListeners
           
private  Boolean markupAware
           
protected  Map namedAnnotSets
          Named sets of annotations
protected  int nextAnnotationId
          The id of the next new annotation
protected  int nextNodeId
          The id of the next new node
(package private) static long serialVersionUID
          Freeze the serialization UID.
protected  URL sourceUrl
          The source URL
protected  Long sourceUrlEndOffset
          The end of the range that the content comes from at the source URL (or null if none).
protected  Long sourceUrlStartOffset
          The start of the range that the content comes from at the source URL (or null if none).
private  Vector statusListeners
           
private  String stringContent
           
 
Fields inherited from class gate.creole.AbstractLanguageResource
dataStore
 
Constructor Summary
DocumentImpl()
          Default construction.
 
Method Summary
 void addDocumentListener(DocumentListener l)
           
 void addGateListener(GateListener l)
           
 void addStatusListener(StatusListener l)
           
private  String annotationSetToXml(AnnotationSet anAnnotationSet)
          This method saves an AnnotationSet as XML.
protected  boolean check(Object a, Object b)
          Check: test 2 objects for equality
 int compareTo(Object o)
          Ordering based on URL.toString() and the URL offsets (if any)
 void edit(Long start, Long end, DocumentContent replacement)
          Propagate edit changes to the document content and annotations.
 boolean equals(Object other)
          Equals
private  String featuresToXml(FeatureMap aFeatureMap)
          This method saves a FeatureMap as XML elements.
protected  void fireAnnotationSetAdded(DocumentEvent e)
           
protected  void fireAnnotationSetRemoved(DocumentEvent e)
           
protected  void fireGateEvent(GateEvent e)
           
protected  void fireStatusChanged(String e)
           
 AnnotationSet getAnnotations()
          Get the default set of annotations.
 AnnotationSet getAnnotations(String name)
          Get a named set of annotations.
 DocumentContent getContent()
          The content of the document: a String for text; MPEG for video; etc.
 String getEncoding()
          Get the encoding of the document content source
 FeatureMap getFeatures()
          Get the features associated with this document.
 Boolean getMarkupAware()
          Get the markup awareness status of the Document.
 Map getNamedAnnotationSets()
          Returns a map with the named annotation sets.
 Integer getNextAnnotationId()
          Generate and return the next annotation ID
 Integer getNextNodeId()
          Generate and return the next node ID
protected  String getOrderingString()
          Utility method to produce a string for comparison in ordering.
 URL getSourceUrl()
          Documents are identified by URLs
 Long getSourceUrlEndOffset()
          Documents may be packed within files; in this case an optional pair of offsets refer to the location of the document.
 Long[] getSourceUrlOffsets()
          Documents may be packed within files; in this case an optional pair of offsets refer to the location of the document.
 Long getSourceUrlStartOffset()
          Documents may be packed within files; in this case an optional pair of offsets refer to the location of the document.
 int hashCode()
          Hash code
 Resource init()
          Initialise this resource, and return it.
 boolean isValidOffset(Long offset)
          Check that an offset is valid.
 boolean isValidOffsetRange(Long start, Long end)
          Check that both start and end are valid offsets and that they constitute a valid offset range.
 void removeAnnotationSet(String name)
          Removes one of the named annotation sets.
 void removeDocumentListener(DocumentListener l)
           
 void removeGateListener(GateListener l)
           
 void removeStatusListener(StatusListener l)
           
 void setContent(DocumentContent content)
          Set method for the document content
 void setEncoding(String encoding)
          Set the encoding of the document content source
 void setFeatures(FeatureMap features)
          Set the feature set
 void setMarkupAware(Boolean newMarkupAware)
          Make the document markup-aware.
 void setSourceUrl(URL sourceUrl)
          Set method for the document's URL
 void setSourceUrlEndOffset(Long sourceUrlEndOffset)
          Documents may be packed within files; in this case an optional pair of offsets refer to the location of the document.
 void setSourceUrlStartOffset(Long sourceUrlStartOffset)
          Documents may be packed within files; in this case an optional pair of offsets refer to the location of the document.
 void setStringContent(String newStringContent)
           
private  String textWithNodes(String aText)
          This method creates Node XML elements and inserts them at the corresponding offset inside the text.
 String toString()
          String representation
 String toXml()
          Returns a GateXml document
 
Methods inherited from class gate.creole.AbstractLanguageResource
getDataStore, setDataStore, sync
 
Methods inherited from class gate.creole.AbstractResource
getName, setName
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, registerNatives, wait, wait, wait
 
Methods inherited from interface gate.LanguageResource
getDataStore, setDataStore, sync
 
Methods inherited from interface gate.util.FeatureBearer
getName, setName
 

Field Detail

DEBUG

private static final boolean DEBUG
Debug flag

features

protected FeatureMap features
The features associated with this document.

nextAnnotationId

protected int nextAnnotationId
The id of the next new annotation

nextNodeId

protected int nextNodeId
The id of the next new node

sourceUrl

protected URL sourceUrl
The source URL

content

protected DocumentContent content
The content of the document

encoding

protected String encoding
The encoding of the source of the document content

sourceUrlStartOffset

protected Long sourceUrlStartOffset
The start of the range that the content comes from at the source URL (or null if none).

sourceUrlEndOffset

protected Long sourceUrlEndOffset
The end of the range that the content comes from at the source URL (or null if none).

defaultAnnots

protected AnnotationSet defaultAnnots
The default annotation set

namedAnnotSets

protected Map namedAnnotSets
Named sets of annotations

statusListeners

private transient Vector statusListeners

documentListeners

private transient Vector documentListeners

gateListeners

private transient Vector gateListeners

stringContent

private String stringContent

markupAware

private Boolean markupAware

serialVersionUID

static final long serialVersionUID
Freeze the serialization UID.
Constructor Detail

DocumentImpl

public DocumentImpl()
Default construction. Content left empty.
Method Detail

init

public Resource init()
              throws ResourceInstantiationException
Initialise this resource, and return it.
Specified by:
init in interface Resource
Overrides:
init in class AbstractResource

getSourceUrl

public URL getSourceUrl()
Documents are identified by URLs
Specified by:
getSourceUrl in interface Document

setSourceUrl

public void setSourceUrl(URL sourceUrl)
Set method for the document's URL
Specified by:
setSourceUrl in interface Document

getSourceUrlOffsets

public Long[] getSourceUrlOffsets()
Documents may be packed within files; in this case an optional pair of offsets refer to the location of the document.
Specified by:
getSourceUrlOffsets in interface Document

getSourceUrlStartOffset

public Long getSourceUrlStartOffset()
Documents may be packed within files; in this case an optional pair of offsets refer to the location of the document. This method gets the start offset.
Specified by:
getSourceUrlStartOffset in interface Document

setSourceUrlStartOffset

public void setSourceUrlStartOffset(Long sourceUrlStartOffset)
Documents may be packed within files; in this case an optional pair of offsets refer to the location of the document. This method sets the start offset.

getSourceUrlEndOffset

public Long getSourceUrlEndOffset()
Documents may be packed within files; in this case an optional pair of offsets refer to the location of the document. This method gets the end offset.
Specified by:
getSourceUrlEndOffset in interface Document

setSourceUrlEndOffset

public void setSourceUrlEndOffset(Long sourceUrlEndOffset)
Documents may be packed within files; in this case an optional pair of offsets refer to the location of the document. This method sets the end offset.

getContent

public DocumentContent getContent()
The content of the document: a String for text; MPEG for video; etc.
Specified by:
getContent in interface Document

setContent

public void setContent(DocumentContent content)
Set method for the document content
Specified by:
setContent in interface Document

getEncoding

public String getEncoding()
Get the encoding of the document content source

setEncoding

public void setEncoding(String encoding)
Set the encoding of the document content source

getAnnotations

public AnnotationSet getAnnotations()
Get the default set of annotations. The set is created if it doesn't exist yet.
Specified by:
getAnnotations in interface Document

getAnnotations

public AnnotationSet getAnnotations(String name)
Get a named set of annotations. Creates a new set if one with this name doesn't exist yet.
Specified by:
getAnnotations in interface Document
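
The create-on-first-access behaviour described for named sets can be sketched as follows. The names NamedSetsSketch and FakeAnnotationSet are hypothetical stand-ins, and the use of Map.computeIfAbsent is a choice made for this illustration, not a statement about how DocumentImpl is written.

```java
import java.util.HashMap;
import java.util.Map;

final class NamedSetsSketch {

    /** Stand-in for an annotation set; not the GATE AnnotationSet type. */
    static final class FakeAnnotationSet {
        final String name;

        FakeAnnotationSet(String name) {
            this.name = name;
        }
    }

    private final Map<String, FakeAnnotationSet> namedSets = new HashMap<>();

    /** Returns the set registered under name, creating it on first access. */
    FakeAnnotationSet getAnnotations(String name) {
        return namedSets.computeIfAbsent(name, FakeAnnotationSet::new);
    }

    /** Number of named sets created so far. */
    int size() {
        return namedSets.size();
    }

    public static void main(String[] args) {
        NamedSetsSketch doc = new NamedSetsSketch();
        FakeAnnotationSet first = doc.getAnnotations("tokens");
        FakeAnnotationSet second = doc.getAnnotations("tokens");
        System.out.println(first == second); // true: the set is created only once
    }
}
```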

setMarkupAware

public void setMarkupAware(Boolean newMarkupAware)
Make the document markup-aware. This will trigger the creation of a DocumentFormat object at Document initialisation time; the DocumentFormat object will unpack the markup in the Document and add it as annotations. Documents are not markup-aware by default.
Specified by:
setMarkupAware in interface Document
Parameters:
newMarkupAware - markup awareness status.

getMarkupAware

public Boolean getMarkupAware()
Get the markup awareness status of the Document. Documents are not markup-aware by default.
Specified by:
getMarkupAware in interface Document
Returns:
whether the Document is markup aware.

toXml

public String toXml()
Returns a GateXml document
Specified by:
toXml in interface Document
Returns:
a string representing a Gate Xml document

featuresToXml

private String featuresToXml(FeatureMap aFeatureMap)
This method saves a FeatureMap as XML elements.

textWithNodes

private String textWithNodes(String aText)
This method creates Node XML elements and inserts them at the corresponding offset inside the text. Nodes are created from the default annotation set, as well as from all existing named annotation sets.
Parameters:
aText - The text representing the document's plain text.
Returns:
The text with empty elements.

annotationSetToXml

private String annotationSetToXml(AnnotationSet anAnnotationSet)
This method saves an AnnotationSet as XML.
Parameters:
anAnnotationSet - The annotation set that has to be saved as XML.
Returns:
a String like this: ....

getNamedAnnotationSets

public Map getNamedAnnotationSets()
Returns a map with the named annotation sets. It returns null if no named annotation set exists.
Specified by:
getNamedAnnotationSets in interface Document

removeAnnotationSet

public void removeAnnotationSet(String name)
Removes one of the named annotation sets. Note that the default annotation set cannot be removed.
Specified by:
removeAnnotationSet in interface Document
Parameters:
name - the name of the annotation set to be removed

getFeatures

public FeatureMap getFeatures()
Get the features associated with this document.
Specified by:
getFeatures in interface FeatureBearer
Overrides:
getFeatures in class AbstractFeatureBearer

setFeatures

public void setFeatures(FeatureMap features)
Set the feature set
Specified by:
setFeatures in interface FeatureBearer
Overrides:
setFeatures in class AbstractFeatureBearer

edit

public void edit(Long start,
                 Long end,
                 DocumentContent replacement)
          throws InvalidOffsetException
Propagate edit changes to the document content and annotations.
Specified by:
edit in interface Document

isValidOffset

public boolean isValidOffset(Long offset)
Check that an offset is valid, i.e. it is non-null, greater than or equal to 0 and less than the size of the document content.

isValidOffsetRange

public boolean isValidOffsetRange(Long start,
                                  Long end)
Check that both start and end are valid offsets and that they constitute a valid offset range, i.e. start is less than or equal to end.
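
A minimal sketch of the two checks, with the content size passed in as a plain long. The class OffsetCheckSketch is hypothetical. It follows the wording of isValidOffset literally (offset strictly less than the content size) and assumes that a range is valid when both offsets are valid and start does not exceed end; the real methods may treat the upper bound differently.

```java
final class OffsetCheckSketch {

    /** Non-null, at least 0, and less than the content size. */
    static boolean isValidOffset(Long offset, long contentSize) {
        return offset != null && offset >= 0 && offset < contentSize;
    }

    /** Both offsets valid, and start not greater than end (assumed ordering). */
    static boolean isValidOffsetRange(Long start, Long end, long contentSize) {
        return isValidOffset(start, contentSize)
            && isValidOffset(end, contentSize)
            && start <= end;
    }

    public static void main(String[] args) {
        System.out.println(isValidOffset(3L, 10));            // true
        System.out.println(isValidOffset(null, 10));          // false
        System.out.println(isValidOffsetRange(1L, 3L, 10));   // true
        System.out.println(isValidOffsetRange(3L, 1L, 10));   // false
    }
}
```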

getNextAnnotationId

public Integer getNextAnnotationId()
Generate and return the next annotation ID

getNextNodeId

public Integer getNextNodeId()
Generate and return the next node ID

compareTo

public int compareTo(Object o)
              throws ClassCastException
Ordering based on URL.toString() and the URL offsets (if any)
Specified by:
compareTo in interface Comparable

getOrderingString

protected String getOrderingString()
Utility method to produce a string for comparison in ordering. String is based on the source URL and offsets.
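
One way such an ordering string could be assembled is sketched below. The exact format used by DocumentImpl is not given here, so the space-separated layout, the class OrderingSketch and its method names are all assumptions made for illustration. The URL is passed as a String to keep the example free of checked exceptions.

```java
final class OrderingSketch {

    /** Builds "url[ start][ end]"; null offsets are omitted (assumed format). */
    static String orderingString(String sourceUrl, Long start, Long end) {
        StringBuilder sb = new StringBuilder(sourceUrl);
        if (start != null) {
            sb.append(' ').append(start);
        }
        if (end != null) {
            sb.append(' ').append(end);
        }
        return sb.toString();
    }

    /** Orders two documents by comparing their ordering strings. */
    static int compare(String urlA, Long startA, Long endA,
                       String urlB, Long startB, Long endB) {
        return orderingString(urlA, startA, endA)
            .compareTo(orderingString(urlB, startB, endB));
    }

    public static void main(String[] args) {
        System.out.println(orderingString("http://a/x", 0L, 5L)); // http://a/x 0 5
        System.out.println(compare("http://a/x", null, null,
                                   "http://a/y", null, null) < 0); // true
    }
}
```

A string-based ordering like this compares offsets as text, so "10" sorts before "9"; that is a property of this sketch and may not hold for the real method.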

removeStatusListener

public void removeStatusListener(StatusListener l)

addStatusListener

public void addStatusListener(StatusListener l)

fireStatusChanged

protected void fireStatusChanged(String e)

check

protected boolean check(Object a,
                        Object b)
Check: test 2 objects for equality

equals

public boolean equals(Object other)
Equals
Overrides:
equals in class Object

hashCode

public int hashCode()
Hash code
Overrides:
hashCode in class Object

toString

public String toString()
String representation
Overrides:
toString in class Object

removeDocumentListener

public void removeDocumentListener(DocumentListener l)
Specified by:
removeDocumentListener in interface Document

addDocumentListener

public void addDocumentListener(DocumentListener l)
Specified by:
addDocumentListener in interface Document

fireAnnotationSetAdded

protected void fireAnnotationSetAdded(DocumentEvent e)

fireAnnotationSetRemoved

protected void fireAnnotationSetRemoved(DocumentEvent e)

removeGateListener

public void removeGateListener(GateListener l)
Specified by:
removeGateListener in interface Document

addGateListener

public void addGateListener(GateListener l)
Specified by:
addGateListener in interface Document

fireGateEvent

protected void fireGateEvent(GateEvent e)

setStringContent

public void setStringContent(String newStringContent)