gate.corpora
Class SgmlDocumentFormat
java.lang.Object
  |
  +--gate.util.AbstractFeatureBearer
        |
        +--gate.creole.AbstractResource
              |
              +--gate.creole.AbstractLanguageResource
                    |
                    +--gate.DocumentFormat
                          |
                          +--gate.corpora.TextualDocumentFormat
                                |
                                +--gate.corpora.SgmlDocumentFormat
- All Implemented Interfaces:
- FeatureBearer, LanguageResource, Resource, Serializable
- public class SgmlDocumentFormat
- extends TextualDocumentFormat
The format of Documents. Subclasses of DocumentFormat know about
particular MIME types and how to unpack the information in any
markup or formatting they contain into GATE annotations. Each MIME
type has its own subclass of DocumentFormat, e.g. XmlDocumentFormat,
RtfDocumentFormat, MpegDocumentFormat. These classes register themselves
with a static index residing here when they are constructed. Static
getDocumentFormat methods can then be used to get the appropriate
format class for a particular document.
- See Also:
- Serialized Form
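The self-registration scheme described above can be illustrated with a minimal sketch. It does not use the GATE classes: FormatSketch and SgmlFormatSketch are hypothetical stand-ins, and the MIME type string "text/sgml" is an assumption made for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for gate.DocumentFormat; not the GATE class.
abstract class FormatSketch {
    // Static index from MIME type to the format instance that handles it.
    private static final Map<String, FormatSketch> INDEX = new HashMap<>();

    private final String mimeType;

    // Each format registers itself in the static index when constructed.
    protected FormatSketch(String mimeType) {
        this.mimeType = mimeType;
        INDEX.put(mimeType, this);
    }

    // Static lookup: returns the registered format, or null if none exists.
    static FormatSketch getDocumentFormat(String mimeType) {
        return INDEX.get(mimeType);
    }

    String getMimeType() {
        return mimeType;
    }
}

// Hypothetical SGML format; "text/sgml" is an assumed MIME type.
class SgmlFormatSketch extends FormatSketch {
    SgmlFormatSketch() {
        super("text/sgml");
    }

    public static void main(String[] args) {
        new SgmlFormatSketch(); // construction registers the format
        FormatSketch format = FormatSketch.getDocumentFormat("text/sgml");
        System.out.println(format.getClass().getSimpleName());
    }
}
```

Registering from the constructor mirrors the wording of the description; the real class hierarchy may perform registration at a different point, such as init().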
Methods inherited from class gate.DocumentFormat
addStatusListener, getDocumentFormat, getDocumentFormat, getDocumentFormat, getElement2StringMap, getFeatures, getMarkupElementsMap, getMimeType, removeStatusListener, setElement2StringMap, setFeatures, setMarkupElementsMap, setMimeType
SgmlDocumentFormat
public SgmlDocumentFormat()
- Default constructor.
unpackMarkup
public void unpackMarkup(Document doc)
throws DocumentFormatException
- Unpack the markup in the document. This converts markup from the
native format (e.g. SGML) into annotations in GATE format.
Uses the markupElementsMap to determine which elements to convert, and
what annotation type names to use.
The document's content is first converted to well-formed XML.
If this succeeds, the document is saved to a temporary file and parsed
as an XML document.
- Overrides:
unpackMarkup
in class TextualDocumentFormat
- Parameters:
doc - The GATE document you want to parse.
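The last two steps of the description (save to a temporary file, then parse as XML) can be sketched with the standard JDK only. This is not the GATE implementation: TempFileXmlSketch and annotationTypes are hypothetical names, the SGML-to-XML conversion is skipped by assuming the input is already well-formed, and it is assumed that only elements present in the map are converted.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; not part of GATE.
class TempFileXmlSketch {
    // Writes already well-formed XML to a temp file, parses it with SAX, and
    // returns the annotation type name chosen for each mapped element, in
    // document order. Elements missing from the map are skipped (assumption).
    static List<String> annotationTypes(String wellFormedXml,
                                        Map<String, String> markupElementsMap) {
        File tmp = null;
        try {
            tmp = File.createTempFile("sgml-sketch", ".xml");
            Files.write(tmp.toPath(), wellFormedXml.getBytes(StandardCharsets.UTF_8));
            final List<String> types = new ArrayList<>();
            DefaultHandler handler = new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    String type = markupElementsMap.get(qName);
                    if (type != null) {
                        types.add(type);
                    }
                }
            };
            SAXParserFactory.newInstance().newSAXParser().parse(tmp, handler);
            return types;
        } catch (IOException | SAXException | ParserConfigurationException e) {
            throw new IllegalStateException("Could not unpack markup", e);
        } finally {
            if (tmp != null) {
                tmp.delete(); // best-effort cleanup of the temp file
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put("p", "paragraph");
        System.out.println(annotationTypes("<doc><p>a</p><p>b</p></doc>", map));
    }
}
```

A real implementation would also record the character offsets of each element so that annotations can be anchored to the document text; the sketch only shows the element-to-type mapping.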
unpackMarkup
public void unpackMarkup(Document doc,
String originalContentFeatureType)
throws DocumentFormatException
- Unpack the markup in the document. This converts markup from the
native format (e.g. XML, RTF) into annotations in GATE format.
Uses the markupElementsMap to determine which elements to convert, and
what annotation type names to use.
It behaves like unpackMarkup(Document doc), but the document's
original content is preserved in a feature attached to the document.
- Overrides:
unpackMarkup
in class TextualDocumentFormat
- Parameters:
doc - The GATE document you want to parse and create annotations for.
originalContentFeatureType - The name of a feature that will
preserve the old content of the document.
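The difference from the one-argument method can be sketched as follows. DocSketch is a hypothetical stand-in for gate.Document (plain text plus a feature map), and stripping tags with a regular expression is only a placeholder for the real markup unpacking.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for gate.Document: text content plus a feature map.
class DocSketch {
    String content;
    final Map<String, Object> features = new HashMap<>();

    DocSketch(String content) {
        this.content = content;
    }
}

// Hypothetical sketch; not part of GATE.
class PreserveContentSketch {
    // Stores the original content under the given feature name, then
    // replaces the content with the markup-free text.
    static void unpackMarkup(DocSketch doc, String originalContentFeatureType) {
        doc.features.put(originalContentFeatureType, doc.content);
        // Placeholder for real unpacking: drop anything that looks like a tag.
        doc.content = doc.content.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        DocSketch doc = new DocSketch("<p>Hello</p>");
        unpackMarkup(doc, "originalContent"); // feature name chosen for illustration
        System.out.println(doc.content);
        System.out.println(doc.features.get("originalContent"));
    }
}
```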
init
public Resource init()
throws ResourceInstantiationException
- Initialise this resource, and return it.
- Overrides:
init
in class TextualDocumentFormat