gate.corpora
Class XmlDocumentFormat
java.lang.Object
|
+--gate.util.AbstractFeatureBearer
|
+--gate.creole.AbstractResource
|
+--gate.creole.AbstractLanguageResource
|
+--gate.DocumentFormat
|
+--gate.corpora.TextualDocumentFormat
|
+--gate.corpora.XmlDocumentFormat
- All Implemented Interfaces:
- FeatureBearer, LanguageResource, NameBearer, Resource, Serializable
- public class XmlDocumentFormat
- extends TextualDocumentFormat
The format of Documents. Subclasses of DocumentFormat know about
particular MIME types and how to unpack the information in any
markup or formatting they contain into GATE annotations. Each MIME
type has its own subclass of DocumentFormat, e.g. XmlDocumentFormat,
RtfDocumentFormat, MpegDocumentFormat. These classes register themselves
with a static index residing here when they are constructed. Static
getDocumentFormat methods can then be used to get the appropriate
format class for a particular document.
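
The self-registration scheme described above can be sketched roughly as follows. This is a minimal illustration, not GATE's implementation: the names FormatRegistrySketch, Format, XmlFormat and forMimeType are hypothetical, and the real static index and getDocumentFormat overloads may behave differently.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-ins for illustration; none of these are GATE classes.
public class FormatRegistrySketch {

    /** Base type that owns the static MIME type index. */
    static abstract class Format {
        private static final Map<String, Format> INDEX = new ConcurrentHashMap<>();

        /** Each concrete format registers itself when it is constructed. */
        protected Format(String mimeType) {
            INDEX.put(mimeType, this);
        }

        /** Static lookup; returns null when no format is registered. */
        static Format forMimeType(String mimeType) {
            return INDEX.get(mimeType);
        }
    }

    /** One subclass per MIME type. */
    static final class XmlFormat extends Format {
        XmlFormat() {
            super("text/xml");
        }
    }

    public static void main(String[] args) {
        new XmlFormat(); // construction alone performs the registration
        System.out.println(Format.forMimeType("text/xml").getClass().getSimpleName());
        System.out.println(Format.forMimeType("text/rtf"));
    }
}
```

In this sketch a lookup for an unregistered MIME type simply returns null; how the real class reports that case is not stated here.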
- See Also:
- Serialized Form
Field Summary |
private static boolean |
DEBUG
Debug flag |
Methods inherited from class gate.DocumentFormat |
addStatusListener, areEqual, decideBetweenThreeMimeTypes, decideBetweenTwoMimeTypes, fireStatusChanged, getDocumentFormat, getDocumentFormat, getDocumentFormat, getElement2StringMap, getFeatures, getFileSufix, getMarkupElementsMap, getMimeType, getMimeType, getMimeType, guessTypeUsingMagicNumbers, removeStatusListener, runMagicNumbers, setElement2StringMap, setFeatures, setMarkupElementsMap, setMimeType, unpackMarkup |
Methods inherited from class gate.creole.AbstractResource |
checkParameterValues, getName, getParameterValue, getParameterValue, removeResourceListeners, setName, setParameterValue, setParameterValue, setParameterValues, setParameterValues, setResourceListeners |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, registerNatives, toString, wait, wait, wait |
DEBUG
private static final boolean DEBUG
- Debug flag
XmlDocumentFormat
public XmlDocumentFormat()
- Default construction
unpackMarkup
public void unpackMarkup(Document doc)
throws DocumentFormatException
- Unpack the markup in the document. This converts markup from the
native format (e.g. XML) into annotations in GATE format.
Uses the markupElementsMap to determine which elements to convert, and
what annotation type names to use. If the document has a valid source
URL, the parser parses the XML document that the URL points to. If the
URL is not valid, or is null, the doc's content is parsed instead. If the
document was created from a String, it is therefore advisable to set the
doc's sourceUrl to null. If the doc's content is not valid XML, the
parser might crash.
- Overrides:
unpackMarkup
in class TextualDocumentFormat
- Parameters:
doc
- The GATE document you want to parse. If
doc.getSourceUrl()
returns null then the content of
doc will be parsed. Using a URL is recommended because the parser will
report errors correctly if the XML document is not well formed.
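
The behaviour described for unpackMarkup (parse from the source URL when there is one, otherwise parse the content, and turn mapped elements into annotations) might look like the sketch below. It uses only the JDK's SAX parser and makes several assumptions: UnpackMarkupSketch, Ann, Handler and unpack are hypothetical names, annotations are reduced to a type plus character offsets, and only a null URL triggers the fallback to the content (the invalid-URL case is not handled). It is not the parser GATE itself uses.

```java
import java.io.StringReader;
import java.net.URL;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class UnpackMarkupSketch {

    /** Hypothetical annotation: a type plus character offsets into the plain text. */
    static final class Ann {
        final String type;
        final int start;
        final int end;

        Ann(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return type + "[" + start + "," + end + ")";
        }
    }

    /** Collects the text and records an annotation for every mapped element. */
    static final class Handler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        final List<Ann> anns = new ArrayList<>();
        private final Deque<Integer> starts = new ArrayDeque<>();
        private final Map<String, String> elementsMap; // element name -> annotation type

        Handler(Map<String, String> elementsMap) {
            this.elementsMap = elementsMap;
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            starts.push(text.length()); // remember where this element's text begins
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            int start = starts.pop();
            String type = elementsMap.get(qName);
            if (type != null) { // unmapped elements yield no annotation
                anns.add(new Ann(type, start, text.length()));
            }
        }

        @Override
        public void characters(char[] ch, int off, int len) {
            text.append(ch, off, len);
        }
    }

    /** Parses from the URL when one is given, otherwise from the content string. */
    static Handler unpack(URL sourceUrl, String content, Map<String, String> elementsMap)
            throws Exception {
        InputSource in = (sourceUrl != null)
                ? new InputSource(sourceUrl.toExternalForm())
                : new InputSource(new StringReader(content));
        Handler handler = new Handler(elementsMap);
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> map = Collections.singletonMap("p", "paragraph");
        Handler h = unpack(null, "<doc><p>Hello</p><x>!</x></doc>", map);
        System.out.println(h.text + " " + h.anns);
    }
}
```

With the sample input, only the p element is mapped, so the sketch keeps the text of every element but records a single annotation. Content that is not well-formed XML makes the parse call throw, which corresponds to the failure mode noted in the method description.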
parseDocumentWithoutURL
private void parseDocumentWithoutURL(Document aDocument)
throws DocumentFormatException
- Called from unpackMarkup() if the document has been created from a
string
init
public Resource init()
throws ResourceInstantiationException
- Initialise this resource, and return it.
- Overrides:
init
in class TextualDocumentFormat