gate.corpora
Class DocumentImpl

java.lang.Object
  |
  +--gate.util.AbstractFeatureBearer
        |
        +--gate.creole.AbstractResource
              |
              +--gate.creole.AbstractLanguageResource
                    |
                    +--gate.corpora.DocumentImpl
All Implemented Interfaces:
Comparable, Document, FeatureBearer, LanguageResource, Resource, Serializable
Direct Known Subclasses:
DocumentWrapper

public class DocumentImpl
extends AbstractLanguageResource
implements Document

Represents the commonalities between all sorts of documents.

Editing

The DocumentImpl class implements the Document interface. The DocumentContentImpl class models the textual or audio-visual materials which are the source and content of Documents. The AnnotationSetImpl class supplies annotations on Documents.

Abbreviations: D = Document, DC = DocumentContent, AS = AnnotationSet.

We add an edit method to each of these classes; for DC and AS the methods are package private; D has the public method.

   void edit(Long start, Long end, DocumentContent replacement)
   throws InvalidOffsetException;
 

D receives edit requests and forwards them to DC and AS. On DC, this method makes a change to the content - e.g. replacing a String range from start to end with replacement. (Deletions are catered for by having replacement = null.) D then calls AS.edit on each of its annotation sets.
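
The content-level step can be pictured with plain strings. The sketch below is illustrative only: ContentEditSketch and replaceRange are hypothetical names, a String stands in for DocumentContent, and a standard exception stands in for InvalidOffsetException. The real classes may validate and store content differently.

```java
/** Hypothetical stand-in for the content-level edit; not the GATE implementation. */
class ContentEditSketch {

    /**
     * Replaces the range [start, end) of content with replacement.
     * A null replacement means deletion, as described above.
     */
    static String replaceRange(String content, long start, long end, String replacement) {
        if (start < 0 || end < start || end > content.length()) {
            // The documented method throws InvalidOffsetException; a standard
            // exception keeps this sketch self-contained.
            throw new IndexOutOfBoundsException("invalid range " + start + "-" + end);
        }
        String repl = (replacement == null) ? "" : replacement;
        return new StringBuilder(content)
                .replace((int) start, (int) end, repl)
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceRange("abcd", 1, 3, "BC"));
        System.out.println(replaceRange("abcd", 1, 3, null));
        System.out.println(replaceRange("abcd", 1, 3, "BBCC"));
    }
}
```

In this model a deletion and a replacement share one code path, which is presumably why null is accepted as the replacement.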

On AS, edit calls replacement.size() (i.e. DC.size()) to find the length of the replacement (0 for null). It then treats annotations that terminate (start or end) in the altered or deleted range as invalid; annotations that terminate after the range have their offsets adjusted.

A note re. AS and annotations: annotations no longer have offsets as in the old model, they now have nodes, and nodes have offsets.

To implement AS.edit, we have several indices:

   HashMap annotsByStartNode, annotsByEndNode;
 
which map node ids to annotations;
   RBTreeMap nodesByOffset;
 
which maps offset to Nodes.

When we get an edit request, we traverse that part of the nodesByOffset tree representing the altered or deleted range of the DC. For each node found, we delete any annotations that terminate on the node, and then delete the node itself. We then traverse the rest of the tree, changing the offset on all remaining nodes by:

   newOffset =
     oldOffset -
     (
       (end - start) -                                     // size of mod
       ( (replacement == null) ? 0 : replacement.size() )  // size of repl
     );
 
Note that we use the same convention as e.g. java.lang.String: start offsets are inclusive; end offsets are exclusive. I.e. for string "abcd" range 1-3 = "bc". Examples, for a node with offset 4:
 edit(1, 3, "BC");
 newOffset = 4 - ( (3 - 1) - 2 ) = 4

 edit(1, 3, null);
 newOffset = 4 - ( (3 - 1) - 0 ) = 2

 edit(1, 3, "BBCC");
 newOffset = 4 - ( (3 - 1) - 4 ) = 6
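
The traversal described above can be sketched with java.util.TreeMap in place of RBTreeMap. Everything here is a hypothetical illustration: the Node and Annotation classes, the extra annotations map, and the method names are assumptions. The sketch also assumes that a node counts as inside the range when start <= offset < end, which the text does not spell out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of the AS-level edit; not the GATE implementation. */
class AnnotationSetEditSketch {

    static final class Node {
        final int id;
        long offset;
        Node(int id, long offset) { this.id = id; this.offset = offset; }
    }

    static final class Annotation {
        final int id;
        final Node start, end;
        Annotation(int id, Node start, Node end) { this.id = id; this.start = start; this.end = end; }
    }

    // The indices described above: node id -> annotations, and offset -> node.
    final Map<Integer, List<Annotation>> annotsByStartNode = new HashMap<>();
    final Map<Integer, List<Annotation>> annotsByEndNode = new HashMap<>();
    final TreeMap<Long, Node> nodesByOffset = new TreeMap<>();
    // Extra bookkeeping for this sketch only: all live annotations by id.
    final Map<Integer, Annotation> annotations = new HashMap<>();
    private int nextNodeId = 0, nextAnnotationId = 0;

    private Node nodeAt(long offset) {
        return nodesByOffset.computeIfAbsent(offset, o -> new Node(nextNodeId++, o));
    }

    Annotation add(long startOffset, long endOffset) {
        Annotation a = new Annotation(nextAnnotationId++, nodeAt(startOffset), nodeAt(endOffset));
        annotsByStartNode.computeIfAbsent(a.start.id, k -> new ArrayList<>()).add(a);
        annotsByEndNode.computeIfAbsent(a.end.id, k -> new ArrayList<>()).add(a);
        annotations.put(a.id, a);
        return a;
    }

    /** Applies an edit of [start, end); replacementSize is 0 for a deletion. */
    void edit(long start, long end, long replacementSize) {
        // 1. Drop nodes inside the range, with the annotations terminating on them.
        List<Node> inRange = new ArrayList<>(nodesByOffset.subMap(start, end).values());
        for (Node n : inRange) {
            removeAnnotationsOn(n);
            nodesByOffset.remove(n.offset);
        }
        // 2. Shift every node at or after the end of the range.
        long shift = (end - start) - replacementSize;
        List<Node> tail = new ArrayList<>(nodesByOffset.tailMap(end).values());
        for (Node n : tail) nodesByOffset.remove(n.offset);
        for (Node n : tail) {
            n.offset -= shift;               // newOffset = oldOffset - shift
            nodesByOffset.put(n.offset, n);  // re-key under the new offset
        }
    }

    private void removeAnnotationsOn(Node n) {
        List<Annotation> hit = new ArrayList<>();
        hit.addAll(annotsByStartNode.getOrDefault(n.id, Collections.emptyList()));
        hit.addAll(annotsByEndNode.getOrDefault(n.id, Collections.emptyList()));
        for (Annotation a : hit) {
            annotations.remove(a.id);
            annotsByStartNode.getOrDefault(a.start.id, new ArrayList<>()).remove(a);
            annotsByEndNode.getOrDefault(a.end.id, new ArrayList<>()).remove(a);
        }
    }
}
```

With annotations added at 4-6 and 2-5, edit(1, 3, 0) removes the second (its start node lies in the range) and moves the first to 2-4, in line with the deletion example above. Nodes left without annotations are not cleaned up in this sketch.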
 

See Also:
Serialized Form

Constructor Summary
DocumentImpl()
          Default construction.
 
Method Summary
 void addDocumentListener(DocumentListener l)
           
 void addGateListener(GateListener l)
           
 void addStatusListener(StatusListener l)
           
 int compareTo(Object o)
          Ordering based on URL.toString() and the URL offsets (if any)
 void edit(Long start, Long end, DocumentContent replacement)
          Propagate edit changes to the document content and annotations.
 boolean equals(Object other)
          Equals
 AnnotationSet getAnnotations()
          Get the default set of annotations.
 AnnotationSet getAnnotations(String name)
          Get a named set of annotations.
 DocumentContent getContent()
          The content of the document: a String for text; MPEG for video; etc.
 String getEncoding()
          Get the encoding of the document content source
 FeatureMap getFeatures()
          Get the features associated with this document.
 Boolean getMarkupAware()
          Get the markup awareness status of the Document.
 Map getNamedAnnotationSets()
          Returns a map with the named annotation sets.
 Integer getNextAnnotationId()
          Generate and return the next annotation ID
 Integer getNextNodeId()
          Generate and return the next node ID
 URL getSourceUrl()
          Documents are identified by URLs
 Long getSourceUrlEndOffset()
          Documents may be packed within files; in this case an optional pair of offsets refer to the location of the document.
 Long[] getSourceUrlOffsets()
          Documents may be packed within files; in this case an optional pair of offsets refer to the location of the document.
 Long getSourceUrlStartOffset()
          Documents may be packed within files; in this case an optional pair of offsets refer to the location of the document.
 int hashCode()
          Hash code
 Resource init()
          Initialise this resource, and return it.
 boolean isValidOffset(Long offset)
          Check that an offset is valid.
 boolean isValidOffsetRange(Long start, Long end)
          Check that both start and end are valid offsets and that they constitute a valid offset range.
 void removeAnnotationSet(String name)
          Removes one of the named annotation sets.
 void removeDocumentListener(DocumentListener l)
           
 void removeGateListener(GateListener l)
           
 void removeStatusListener(StatusListener l)
           
 void setContent(DocumentContent content)
          Set method for the document content
 void setEncoding(String encoding)
          Set the encoding of the document content source
 void setFeatures(FeatureMap features)
          Set the feature set
 void setMarkupAware(Boolean newMarkupAware)
          Make the document markup-aware.
 void setSourceUrl(URL sourceUrl)
          Set method for the document's URL
 void setSourceUrlEndOffset(Long sourceUrlEndOffset)
          Documents may be packed within files; in this case an optional pair of offsets refer to the location of the document.
 void setSourceUrlStartOffset(Long sourceUrlStartOffset)
          Documents may be packed within files; in this case an optional pair of offsets refer to the location of the document.
 void setStringContent(String newStringContent)
           
 String toString()
          String representation
 String toXml()
          Returns a GateXml document
 
Methods inherited from class gate.creole.AbstractLanguageResource
getDataStore, setDataStore, sync
 
Methods inherited from class gate.creole.AbstractResource
getName, setName
 
Methods inherited from class java.lang.Object
getClass, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface gate.LanguageResource
getDataStore, setDataStore, sync
 
Methods inherited from interface gate.util.FeatureBearer
getName, setName
 

Constructor Detail

DocumentImpl

public DocumentImpl()
Default construction. Content left empty.
Method Detail

init

public Resource init()
              throws ResourceInstantiationException
Initialise this resource, and return it.
Specified by:
init in interface Resource
Overrides:
init in class AbstractResource

getSourceUrl

public URL getSourceUrl()
Documents are identified by URLs
Specified by:
getSourceUrl in interface Document

setSourceUrl

public void setSourceUrl(URL sourceUrl)
Set method for the document's URL
Specified by:
setSourceUrl in interface Document

getSourceUrlOffsets

public Long[] getSourceUrlOffsets()
Documents may be packed within files; in this case an optional pair of offsets refer to the location of the document.
Specified by:
getSourceUrlOffsets in interface Document

getSourceUrlStartOffset

public Long getSourceUrlStartOffset()
Documents may be packed within files; in this case an optional pair of offsets refer to the location of the document. This method gets the start offset.
Specified by:
getSourceUrlStartOffset in interface Document

setSourceUrlStartOffset

public void setSourceUrlStartOffset(Long sourceUrlStartOffset)
Documents may be packed within files; in this case an optional pair of offsets refer to the location of the document. This method sets the start offset.

getSourceUrlEndOffset

public Long getSourceUrlEndOffset()
Documents may be packed within files; in this case an optional pair of offsets refer to the location of the document. This method gets the end offset.
Specified by:
getSourceUrlEndOffset in interface Document

setSourceUrlEndOffset

public void setSourceUrlEndOffset(Long sourceUrlEndOffset)
Documents may be packed within files; in this case an optional pair of offsets refer to the location of the document. This method sets the end offset.

getContent

public DocumentContent getContent()
The content of the document: a String for text; MPEG for video; etc.
Specified by:
getContent in interface Document

setContent

public void setContent(DocumentContent content)
Set method for the document content
Specified by:
setContent in interface Document

getEncoding

public String getEncoding()
Get the encoding of the document content source

setEncoding

public void setEncoding(String encoding)
Set the encoding of the document content source

getAnnotations

public AnnotationSet getAnnotations()
Get the default set of annotations. The set is created if it doesn't exist yet.
Specified by:
getAnnotations in interface Document

getAnnotations

public AnnotationSet getAnnotations(String name)
Get a named set of annotations. Creates a new set if one with this name doesn't exist yet.
Specified by:
getAnnotations in interface Document
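
The create-on-first-use behaviour of the two getAnnotations methods can be sketched as follows. This is a hypothetical illustration, not the GATE code: LazyAnnotationSetsSketch is an invented name and a Set of strings stands in for AnnotationSet.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of lazily created annotation sets; not the GATE implementation. */
class LazyAnnotationSetsSketch {

    // A Set<String> stands in for AnnotationSet in this sketch.
    private Set<String> defaultSet;
    private final Map<String, Set<String>> namedSets = new HashMap<>();

    /** Returns the default set, creating it on first access. */
    Set<String> getAnnotations() {
        if (defaultSet == null) {
            defaultSet = new HashSet<>();
        }
        return defaultSet;
    }

    /** Returns the set with the given name, creating it if it does not exist yet. */
    Set<String> getAnnotations(String name) {
        return namedSets.computeIfAbsent(name, k -> new HashSet<>());
    }

    /** Number of named sets created so far (sketch-only helper). */
    int namedSetCount() {
        return namedSets.size();
    }
}
```

One consequence of this design is that merely asking for a named set creates it, so a lookup with a misspelled name yields a new empty set rather than an error.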

setMarkupAware

public void setMarkupAware(Boolean newMarkupAware)
Make the document markup-aware. This will trigger the creation of a DocumentFormat object at Document initialisation time; the DocumentFormat object will unpack the markup in the Document and add it as annotations. Documents are not markup-aware by default.
Specified by:
setMarkupAware in interface Document
Parameters:
newMarkupAware - markup awareness status.

getMarkupAware

public Boolean getMarkupAware()
Get the markup awareness status of the Document. Documents are not markup-aware by default.
Specified by:
getMarkupAware in interface Document
Returns:
whether the Document is markup aware.

toXml

public String toXml()
Returns a GateXml document
Specified by:
toXml in interface Document
Returns:
a string representing a Gate Xml document

getNamedAnnotationSets

public Map getNamedAnnotationSets()
Returns a map with the named annotation sets. It returns null if no named annotation set exists.
Specified by:
getNamedAnnotationSets in interface Document

removeAnnotationSet

public void removeAnnotationSet(String name)
Removes one of the named annotation sets. Note that the default annotation set cannot be removed.
Specified by:
removeAnnotationSet in interface Document
Parameters:
name - the name of the annotation set to be removed

getFeatures

public FeatureMap getFeatures()
Get the features associated with this document.
Specified by:
getFeatures in interface FeatureBearer
Overrides:
getFeatures in class AbstractFeatureBearer

setFeatures

public void setFeatures(FeatureMap features)
Set the feature set
Specified by:
setFeatures in interface FeatureBearer
Overrides:
setFeatures in class AbstractFeatureBearer

edit

public void edit(Long start,
                 Long end,
                 DocumentContent replacement)
          throws InvalidOffsetException
Propagate edit changes to the document content and annotations.
Specified by:
edit in interface Document

isValidOffset

public boolean isValidOffset(Long offset)
Check that an offset is valid, i.e. it is non-null, greater than or equal to 0 and less than the size of the document content.

isValidOffsetRange

public boolean isValidOffsetRange(Long start,
                                  Long end)
Check that both start and end are valid offsets and that they constitute a valid offset range, i.e. start is less than or equal to end.
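
A minimal sketch of the two checks follows, under stated assumptions: OffsetCheckSketch is a hypothetical class, the content size is passed in as a plain long, the upper bound follows the isValidOffset wording literally (strictly less than the content size), and the range condition is read as start being no greater than end. Whether the real implementation also accepts an offset equal to the content size is not stated here.

```java
/** Hypothetical sketch of the offset checks; not the GATE implementation. */
class OffsetCheckSketch {

    private final long contentSize;

    OffsetCheckSketch(long contentSize) {
        this.contentSize = contentSize;
    }

    /** Non-null, at least 0, and less than the content size. */
    boolean isValidOffset(Long offset) {
        return offset != null && offset >= 0 && offset < contentSize;
    }

    /** Both offsets valid, and start does not exceed end. */
    boolean isValidOffsetRange(Long start, Long end) {
        // The null checks run first, so the unboxing in start <= end is safe.
        return isValidOffset(start) && isValidOffset(end) && start <= end;
    }
}
```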

getNextAnnotationId

public Integer getNextAnnotationId()
Generate and return the next annotation ID

getNextNodeId

public Integer getNextNodeId()
Generate and return the next node ID

compareTo

public int compareTo(Object o)
              throws ClassCastException
Ordering based on URL.toString() and the URL offsets (if any)
Specified by:
compareTo in interface Comparable
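
An ordering of this kind can be sketched with java.util.Comparator. The class SourceKeySketch is hypothetical, it uses a typed compareTo rather than the Object signature documented here, and sorting absent (null) offsets first is an assumption; the text above only says that offsets take part in the ordering if present.

```java
import java.util.Comparator;

/** Hypothetical sketch of URL-then-offset ordering; not the GATE implementation. */
class SourceKeySketch implements Comparable<SourceKeySketch> {

    final String url;        // stands in for getSourceUrl().toString()
    final Long startOffset;  // may be null when the document is not packed in a file
    final Long endOffset;    // may be null likewise

    // Assumption: a missing offset sorts before any present offset.
    private static final Comparator<Long> NULLS_FIRST =
            Comparator.nullsFirst(Comparator.naturalOrder());

    private static final Comparator<SourceKeySketch> ORDER =
            Comparator.comparing((SourceKeySketch k) -> k.url)
                    .thenComparing(k -> k.startOffset, NULLS_FIRST)
                    .thenComparing(k -> k.endOffset, NULLS_FIRST);

    SourceKeySketch(String url, Long startOffset, Long endOffset) {
        this.url = url;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }

    @Override
    public int compareTo(SourceKeySketch other) {
        return ORDER.compare(this, other);
    }
}
```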

removeStatusListener

public void removeStatusListener(StatusListener l)

addStatusListener

public void addStatusListener(StatusListener l)

equals

public boolean equals(Object other)
Equals
Overrides:
equals in class Object

hashCode

public int hashCode()
Hash code
Overrides:
hashCode in class Object

toString

public String toString()
String representation
Overrides:
toString in class Object

removeDocumentListener

public void removeDocumentListener(DocumentListener l)
Specified by:
removeDocumentListener in interface Document

addDocumentListener

public void addDocumentListener(DocumentListener l)
Specified by:
addDocumentListener in interface Document

removeGateListener

public void removeGateListener(GateListener l)
Specified by:
removeGateListener in interface Document

addGateListener

public void addGateListener(GateListener l)
Specified by:
addGateListener in interface Document

setStringContent

public void setStringContent(String newStringContent)