gate.corpora
Class XmlDocumentFormat

java.lang.Object
  |
  +--gate.util.AbstractFeatureBearer
        |
        +--gate.creole.AbstractResource
              |
              +--gate.creole.AbstractLanguageResource
                    |
                    +--gate.DocumentFormat
                          |
                          +--gate.corpora.TextualDocumentFormat
                                |
                                +--gate.corpora.XmlDocumentFormat
All Implemented Interfaces:
FeatureBearer, LanguageResource, Resource, Serializable

public class XmlDocumentFormat
extends TextualDocumentFormat

The format of Documents. Subclasses of DocumentFormat know about particular MIME types and how to unpack the information in any markup or formatting they contain into GATE annotations. Each MIME type has its own subclass of DocumentFormat, e.g. XmlDocumentFormat, RtfDocumentFormat, MpegDocumentFormat. These classes register themselves with a static index residing in DocumentFormat when they are constructed. The static getDocumentFormat methods can then be used to get the appropriate format class for a particular document.
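
The registration scheme described above can be illustrated with a minimal, standalone sketch. It does not use the GATE classes: Format, XmlFormat and forMimeType are hypothetical names invented for this illustration, and registering from the constructor is an assumption taken from the wording of the description rather than from the GATE source.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class FormatRegistrySketch {

    /** Hypothetical stand-in for a format base class that owns a static index. */
    abstract static class Format {
        // Static index from MIME type to the format instance that handles it.
        private static final Map<String, Format> INDEX = new ConcurrentHashMap<>();

        // Assumption: each format registers itself when it is constructed.
        protected Format(String mimeType) {
            INDEX.put(mimeType, this);
        }

        /** Static lookup; returns null if no format is registered for the type. */
        static Format forMimeType(String mimeType) {
            return INDEX.get(mimeType);
        }

        abstract String name();
    }

    /** Hypothetical XML format that registers itself under "text/xml". */
    static final class XmlFormat extends Format {
        XmlFormat() {
            super("text/xml");
        }

        @Override
        String name() {
            return "xml";
        }
    }
}
```

Constructing an XmlFormat once makes it available through Format.forMimeType("text/xml"); a lookup for an unregistered MIME type returns null in this sketch.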

See Also:
Serialized Form

Constructor Summary
XmlDocumentFormat()
          Default construction
 
Method Summary
 Resource init()
          Initialise this resource, and return it.
 void unpackMarkup(Document doc)
          Unpack the markup in the document.
 void unpackMarkup(Document doc, String originalContentFeature)
          Unpack the markup in the document.
 
Methods inherited from class gate.corpora.TextualDocumentFormat
getDataStore
 
Methods inherited from class gate.DocumentFormat
addStatusListener, getDocumentFormat, getDocumentFormat, getDocumentFormat, getElement2StringMap, getFeatures, getMarkupElementsMap, getMimeType, removeStatusListener, setElement2StringMap, setFeatures, setMarkupElementsMap, setMimeType
 
Methods inherited from class gate.creole.AbstractLanguageResource
setDataStore, sync
 
Methods inherited from class gate.creole.AbstractResource
getName, setName
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface gate.LanguageResource
setDataStore, sync
 
Methods inherited from interface gate.util.FeatureBearer
getName, setName
 

Constructor Detail

XmlDocumentFormat

public XmlDocumentFormat()
Default construction
Method Detail

unpackMarkup

public void unpackMarkup(Document doc)
                  throws DocumentFormatException
Unpack the markup in the document. This converts markup from the native format (e.g. XML) into annotations in GATE format. Uses the markupElementsMap to determine which elements to convert, and what annotation type names to use. If the document has a valid source URL, the parser parses the XML document that the URL points to. If the URL is not valid, or is null, the doc's content is parsed instead. If the document was created from a String, it is therefore recommended to set the doc's sourceUrl to null. If the doc's content is not valid XML, parsing may fail.
Overrides:
unpackMarkup in class TextualDocumentFormat
Parameters:
Document - doc The GATE document you want to parse. If doc.getSourceUrl() returns null, the content of doc will be parsed. Using a URL is recommended because the parser will report errors correctly if the XML document is not well formed.
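
The choice between parsing the source URL and parsing the document content can be sketched with the standard Java SAX API. This is an illustration of the documented behaviour, not the GATE implementation: SourceSelectionSketch and its parse method are hypothetical, and treating a URL that cannot be opened as "not valid" is an assumption.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.net.URL;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SourceSelectionSketch {

    private static SAXParser newParser()
            throws SAXException, ParserConfigurationException {
        return SAXParserFactory.newInstance().newSAXParser();
    }

    /**
     * Parses the XML behind sourceUrl when it can be read, otherwise the
     * given content. Returns "url" or "content" to show which one was used.
     */
    static String parse(URL sourceUrl, String content)
            throws SAXException, ParserConfigurationException, IOException {
        DefaultHandler handler = new DefaultHandler();
        if (sourceUrl != null) {
            try (InputStream in = sourceUrl.openStream()) {
                // Passing the system id lets parse errors name the source.
                newParser().parse(in, handler, sourceUrl.toString());
                return "url";
            } catch (IOException e) {
                // Assumption: an unreadable URL counts as "not valid",
                // so fall through and parse the content instead.
            }
        }
        // Null or unusable URL: parse the in-memory content.
        newParser().parse(new InputSource(new StringReader(content)), handler);
        return "content";
    }
}
```

In this sketch, content that is not well formed surfaces as a SAXException from the parser, whichever source was selected.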

unpackMarkup

public void unpackMarkup(Document doc,
                         String originalContentFeature)
                  throws DocumentFormatException
Unpack the markup in the document. This converts markup from the native format (e.g. XML, RTF) into annotations in GATE format. Uses the markupElementsMap to determine which elements to convert, and what annotation type names to use. It behaves like unpackMarkup(Document doc), but the document's old content is preserved in a feature attached to the doc.
Overrides:
unpackMarkup in class TextualDocumentFormat
Parameters:
gate.Document - doc The GATE document you want to parse and create annotations for.
String - originalContentFeature The name of a feature that will preserve the old content of the document.
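
The idea of unpacking markup into offset-based annotations while keeping the old content in a named feature can be sketched in standalone Java. Doc, Span and unpack are hypothetical stand-ins for this illustration and are not the GATE types. The sketch also ignores markupElementsMap and simply uses each element name as the annotation type, which is a simplification.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class UnpackSketch {

    /** Hypothetical annotation: a type plus character offsets into the text. */
    static final class Span {
        final String type;
        final int start;
        final int end;

        Span(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** Hypothetical document: content, features and annotations. */
    static final class Doc {
        String content;
        final Map<String, Object> features = new HashMap<>();
        final List<Span> spans = new ArrayList<>();

        Doc(String content) {
            this.content = content;
        }
    }

    static void unpack(Doc doc, String originalContentFeature) throws Exception {
        final StringBuilder text = new StringBuilder();
        final Deque<Integer> starts = new ArrayDeque<>();
        final List<Span> found = new ArrayList<>();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                starts.push(text.length()); // element begins at current text offset
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                found.add(new Span(qName, starts.pop(), text.length()));
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length); // accumulate markup-free text
            }
        };

        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(doc.content)), handler);

        // Preserve the old content under the caller-supplied feature name,
        // then replace the content with the text stripped of markup.
        doc.features.put(originalContentFeature, doc.content);
        doc.content = text.toString();
        doc.spans.addAll(found);
    }
}
```

Because the feature is written only after parsing succeeds, a document whose content fails to parse is left unchanged in this sketch.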

init

public Resource init()
              throws ResourceInstantiationException
Initialise this resource, and return it.
Overrides:
init in class TextualDocumentFormat