gate.corpora
Class TextualDocumentFormat
java.lang.Object
|
+--gate.util.AbstractFeatureBearer
|
+--gate.creole.AbstractResource
|
+--gate.creole.AbstractLanguageResource
|
+--gate.DocumentFormat
|
+--gate.corpora.TextualDocumentFormat
- All Implemented Interfaces:
- FeatureBearer, LanguageResource, Resource, Serializable
- Direct Known Subclasses:
- EmailDocumentFormat, HtmlDocumentFormat, RtfDocumentFormat, SgmlDocumentFormat, XmlDocumentFormat
- public class TextualDocumentFormat
- extends DocumentFormat
The format of Documents. Subclasses of DocumentFormat know about
particular MIME types and how to unpack the information in any
markup or formatting they contain into GATE annotations. Each MIME
type has its own subclass of DocumentFormat, e.g. XmlDocumentFormat,
RtfDocumentFormat, MpegDocumentFormat. These classes register themselves
with a static index held in DocumentFormat when they are constructed. The
static getDocumentFormat methods inherited from DocumentFormat can then be
used to get the appropriate format class for a particular document.
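The register-on-construction pattern described above can be sketched in isolation. This is a minimal illustration, not GATE code: the names SketchFormat, SketchTextFormat, SketchXmlFormat and lookup are hypothetical, and the real index and getDocumentFormat methods live in gate.DocumentFormat and may behave differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a format base class; NOT the GATE DocumentFormat API.
abstract class SketchFormat {
    // Static index shared by all formats: MIME type string -> handler instance.
    private static final Map<String, SketchFormat> INDEX = new HashMap<>();

    private final String mimeType;

    protected SketchFormat(String mimeType) {
        this.mimeType = mimeType;
        // Each format registers itself in the static index when constructed.
        INDEX.put(mimeType, this);
    }

    String getMimeType() {
        return mimeType;
    }

    // Static lookup; returns null when no format is registered for the type.
    static SketchFormat lookup(String mimeType) {
        return INDEX.get(mimeType);
    }
}

// Hypothetical subclass, one per MIME type.
class SketchTextFormat extends SketchFormat {
    SketchTextFormat() {
        super("text/plain");
    }
}

// Hypothetical subclass, one per MIME type.
class SketchXmlFormat extends SketchFormat {
    SketchXmlFormat() {
        super("text/xml");
    }
}
```

In this sketch a format becomes visible to lookup only after an instance has been created, which mirrors the "register when constructed" wording of the description.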
- See Also:
- Serialized Form
Field Summary |
private static boolean |
DEBUG
Debug flag |
Fields inherited from class gate.DocumentFormat |
element2StringMap, features, isGateXmlDocument, magic2mimeTypeMap, markupElementsMap, mimeString2ClassHandlerMap, mimeString2mimeTypeMap, mimeType, myStatusListeners, statusListeners, suffixes2mimeTypeMap |
Methods inherited from class gate.DocumentFormat |
addStatusListener, areEqual, decideBetweenThreeMimeTypes, decideBetweenTwoMimeTypes, fireStatusChanged, getDocumentFormat, getDocumentFormat, getDocumentFormat, getElement2StringMap, getFeatures, getFileSufix, getMarkupElementsMap, getMimeType, getMimeType, getMimeType, guessTypeUsingMagicNumbers, removeStatusListener, runMagicNumbers, setElement2StringMap, setFeatures, setMarkupElementsMap, setMimeType |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, registerNatives, toString, wait, wait, wait |
DEBUG
private static final boolean DEBUG
- Debug flag
TextualDocumentFormat
public TextualDocumentFormat()
- Default constructor.
init
public Resource init()
throws ResourceInstantiationException
- Initialise this resource, and return it.
- Overrides:
init
in class AbstractResource
unpackMarkup
public void unpackMarkup(Document doc)
throws DocumentFormatException
- Unpack the markup in the document. This converts markup from the
native format (e.g. XML, RTF) into annotations in GATE format.
Uses the markupElementsMap to determine which elements to convert, and
what annotation type names to use.
- Overrides:
unpackMarkup
in class DocumentFormat
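The idea of converting markup into annotations through an element map can be illustrated with a small, self-contained sketch. It is not the GATE implementation: MarkupSketch, Span, Result and unpack are hypothetical names, the input is limited to simple attribute-free tags, and the sketch assumes that elements absent from the map produce no annotation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; none of these names come from GATE.
class MarkupSketch {
    // A stretch of the plain text labelled with an annotation type name.
    static final class Span {
        final String type;
        final int start;
        final int end;

        Span(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    // Plain text with markup removed, plus the spans derived from the markup.
    static final class Result {
        final String text;
        final List<Span> spans;

        Result(String text, List<Span> spans) {
            this.text = text;
            this.spans = spans;
        }
    }

    // Matches attribute-free tags such as <p> and </p>.
    private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)>");

    // elementsMap: element name -> annotation type name (assumed semantics).
    static Result unpack(String marked, Map<String, String> elementsMap) {
        StringBuilder text = new StringBuilder();
        List<Span> spans = new ArrayList<>();
        Deque<String> openNames = new ArrayDeque<>();
        Deque<Integer> openOffsets = new ArrayDeque<>();
        Matcher m = TAG.matcher(marked);
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous tag and this one.
            text.append(marked, last, m.start());
            last = m.end();
            String name = m.group(2);
            if (m.group(1).isEmpty()) {
                // Opening tag: remember where the element starts in the plain text.
                openNames.push(name);
                openOffsets.push(text.length());
            } else if (!openNames.isEmpty() && openNames.peek().equals(name)) {
                // Matching closing tag: emit a span only for mapped elements.
                openNames.pop();
                int start = openOffsets.pop();
                String type = elementsMap.get(name);
                if (type != null) {
                    spans.add(new Span(type, start, text.length()));
                }
            }
        }
        text.append(marked, last, marked.length());
        return new Result(text.toString(), spans);
    }
}
```

Offsets in the resulting spans refer to the text after the tags have been stripped, so the markup is replaced by annotations over the remaining content.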
unpackMarkup
public void unpackMarkup(Document doc,
String originalContentFeatureType)
throws DocumentFormatException
- Description copied from class:
DocumentFormat
- Unpack the markup in the document. This converts markup from the
native format (e.g. XML, RTF) into annotations in GATE format.
Uses the markupElementsMap to determine which elements to convert, and
what annotation type names to use.
- Overrides:
unpackMarkup
in class DocumentFormat
getDataStore
public DataStore getDataStore()
- Description copied from interface:
LanguageResource
- Get the data store that this LR lives in. Null for transient LRs.
- Overrides:
getDataStore
in class AbstractLanguageResource