gate.corpora
Class HtmlDocumentFormat

java.lang.Object
  |
  +--gate.util.AbstractFeatureBearer
        |
        +--gate.creole.AbstractResource
              |
              +--gate.creole.AbstractLanguageResource
                    |
                    +--gate.DocumentFormat
                          |
                          +--gate.corpora.TextualDocumentFormat
                                |
                                +--gate.corpora.HtmlDocumentFormat
All Implemented Interfaces:
FeatureBearer, LanguageResource, NameBearer, Resource, Serializable

public class HtmlDocumentFormat
extends TextualDocumentFormat

The format of Documents. Subclasses of DocumentFormat know about particular MIME types and how to unpack the information in any markup or formatting they contain into GATE annotations. Each MIME type has its own subclass of DocumentFormat, e.g. XmlDocumentFormat, RtfDocumentFormat, MpegDocumentFormat. These classes register themselves with a static index held in DocumentFormat when they are constructed. The static getDocumentFormat methods can then be used to get the appropriate format class for a particular document.
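
The self-registration pattern described above can be illustrated with a minimal, standalone sketch. It does not use the GATE API: the names FormatSketch, HtmlFormatSketch, register and forMimeType, and the "text/html" key, are all hypothetical and chosen only to show a static index that subclasses fill in on construction and that a static lookup method reads.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in for a format base class; this is NOT gate.DocumentFormat. */
abstract class FormatSketch {
    // Static index shared by all subclasses: MIME type string -> handler instance.
    private static final Map<String, FormatSketch> REGISTRY = new HashMap<>();

    /** Called by a subclass to register itself for a MIME type. */
    protected static void register(String mimeType, FormatSketch handler) {
        REGISTRY.put(mimeType.toLowerCase(Locale.ROOT), handler);
    }

    /** Static lookup: the handler registered for the MIME type, or null if there is none. */
    static FormatSketch forMimeType(String mimeType) {
        return REGISTRY.get(mimeType.toLowerCase(Locale.ROOT));
    }

    abstract String describe();
}

/** Hypothetical HTML handler that registers itself when constructed. */
class HtmlFormatSketch extends FormatSketch {
    HtmlFormatSketch() {
        // "text/html" is an assumed key for illustration only.
        register("text/html", this);
    }

    @Override
    String describe() {
        return "html";
    }
}
```

In this sketch, constructing `new HtmlFormatSketch()` once is enough for `FormatSketch.forMimeType("text/html")` to return that instance; a MIME type with no registered handler yields null.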

See Also:
Serialized Form

Field Summary
private static boolean DEBUG
          Debug flag
 
Fields inherited from class gate.DocumentFormat
element2StringMap, isGateXmlDocument, magic2mimeTypeMap, markupElementsMap, mimeString2ClassHandlerMap, mimeString2mimeTypeMap, suffixes2mimeTypeMap
 
Fields inherited from class gate.creole.AbstractLanguageResource
dataStore, lrPersistentId
 
Fields inherited from class gate.creole.AbstractResource
name
 
Constructor Summary
HtmlDocumentFormat()
          Default constructor
 
Method Summary
 Resource init()
          Initialise this resource, and return it.
 Boolean supportsRepositioning()
          Repositioning information can be collected during HTML parsing
 void unpackMarkup(Document doc)
          Old-style unpackMarkup that does not collect RepositioningInfo
 void unpackMarkup(Document doc, RepositioningInfo repInfo, RepositioningInfo ampCodingInfo)
          Unpack the markup in the document.
 
Methods inherited from class gate.corpora.TextualDocumentFormat
annotateParagraphs, getDataStore, setNewLineProperty
 
Methods inherited from class gate.DocumentFormat
addStatusListener, areEqual, decideBetweenThreeMimeTypes, decideBetweenTwoMimeTypes, fireStatusChanged, getDocumentFormat, getDocumentFormat, getDocumentFormat, getElement2StringMap, getFeatures, getMarkupElementsMap, getMimeType, getShouldCollectRepositioning, guessTypeUsingMagicNumbers, removeStatusListener, runMagicNumbers, setElement2StringMap, setFeatures, setMarkupElementsMap, setMimeType, setShouldCollectRepositioning, unpackMarkup
 
Methods inherited from class gate.creole.AbstractLanguageResource
cleanup, getLRPersistenceId, getParent, isModified, setDataStore, setLRPersistenceId, setParent, sync
 
Methods inherited from class gate.creole.AbstractResource
checkParameterValues, getName, getParameterValue, getParameterValue, removeResourceListeners, setName, setParameterValue, setParameterValue, setParameterValues, setParameterValues, setResourceListeners
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface gate.LanguageResource
getLRPersistenceId, getParent, isModified, setDataStore, setLRPersistenceId, setParent, sync
 
Methods inherited from interface gate.Resource
cleanup, getParameterValue, setParameterValue, setParameterValues
 
Methods inherited from interface gate.util.NameBearer
getName, setName
 

Field Detail

DEBUG

private static final boolean DEBUG
Debug flag

See Also:
Constant Field Values
Constructor Detail

HtmlDocumentFormat

public HtmlDocumentFormat()
Default constructor

Method Detail

supportsRepositioning

public Boolean supportsRepositioning()
Repositioning information can be collected during HTML parsing

Overrides:
supportsRepositioning in class DocumentFormat

unpackMarkup

public void unpackMarkup(Document doc)
                  throws DocumentFormatException
Old-style unpackMarkup that does not collect RepositioningInfo

Overrides:
unpackMarkup in class TextualDocumentFormat
Throws:
DocumentFormatException

unpackMarkup

public void unpackMarkup(Document doc,
                         RepositioningInfo repInfo,
                         RepositioningInfo ampCodingInfo)
                  throws DocumentFormatException
Unpack the markup in the document. This converts markup from the native format (e.g. HTML) into annotations in GATE format. Uses the markupElementsMap to determine which elements to convert, and what annotation type names to use. It always tries to parse the document's content, whether or not the sourceUrl is null.

Overrides:
unpackMarkup in class TextualDocumentFormat
Throws:
DocumentFormatException
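
To make the role of an element-to-annotation-type map concrete, here is a small standalone sketch. It is not GATE's implementation and does not use the GATE API: UnpackSketch, Ann, Result and unpack are hypothetical names, the tag scanner is a simple regular expression rather than a real HTML parser, and the input is assumed to be well nested with no void elements, comments or character entities.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: strip tags and emit stand-off annotations for mapped elements. */
class UnpackSketch {
    /** A stand-off annotation over the extracted plain text (hypothetical type). */
    static final class Ann {
        final String type;
        final int start;
        final int end;
        Ann(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
        @Override
        public String toString() {
            return type + "[" + start + "," + end + ")";
        }
    }

    /** Extracted text plus the annotations created for it. */
    static final class Result {
        final String text;
        final List<Ann> annotations;
        Result(String text, List<Ann> annotations) {
            this.text = text;
            this.annotations = annotations;
        }
    }

    /** An element that has been opened but not yet closed. */
    private static final class Open {
        final String name;
        final int start;
        Open(String name, int start) {
            this.name = name;
            this.start = start;
        }
    }

    // Matches an opening or closing tag; group 1 is "/" for closing tags, group 2 the element name.
    private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

    /** elementsMap maps element names to annotation type names; unmapped elements are only stripped. */
    static Result unpack(String markup, Map<String, String> elementsMap) {
        StringBuilder text = new StringBuilder();
        List<Ann> anns = new ArrayList<>();
        Deque<Open> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(markup);
        int last = 0;
        while (m.find()) {
            text.append(markup, last, m.start()); // copy the text before this tag
            last = m.end();
            String name = m.group(2).toLowerCase(Locale.ROOT);
            if (m.group(1).isEmpty()) {
                open.push(new Open(name, text.length()));
            } else if (!open.isEmpty() && open.peek().name.equals(name)) {
                Open o = open.pop();
                String type = elementsMap.get(name);
                if (type != null) {
                    anns.add(new Ann(type, o.start, text.length()));
                }
            }
        }
        text.append(markup, last, markup.length()); // trailing text after the last tag
        return new Result(text.toString(), anns);
    }
}
```

For the input `<p>Hello <b>world</b></p>` and a map containing only `p` mapped to `paragraph`, the sketch yields the text `Hello world` and a single annotation `paragraph[0,11)`. The `b` tags are removed from the text but produce no annotation, because the map has no entry for them.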

init

public Resource init()
              throws ResourceInstantiationException
Initialise this resource, and return it.

Specified by:
init in interface Resource
Overrides:
init in class TextualDocumentFormat
Throws:
ResourceInstantiationException