TXmlParser Reference

This document gives in-depth information about the TXmlParser XML Parser for Borland Delphi. You can download the latest version of the parser at www.destructor.de. I assume that you have an understanding of the terms of XML. The first letter of XML terms is capitalized in this document.

General

TXmlParser is a Delphi CLASS which is responsible for

handling the Document buffer (either by handling its own buffer [then, the FBufferSize attribute is > 0] or by handling a pointer to the buffer of another object [then, FBufferSize is 0])
handling general information about the XML document
handling the declarations of the DTD
scanning through the document and thereby processing External Entity References
passing back all important information to the application

In the terminology of the XML specification, TXmlParser is the "XML Processor" and the code that uses TXmlParser is the "Application".

The common container class for TXmlParser is the TObjectList that comes with Delphi 5. If you have a version below 5, you can easily derive one from TList. TObjectList assumes that the objects added to the list 'belong' to the list; so calls to "Delete", "Clear" or "Free" also destroy the objects.

It is out of the scope of TXmlParser to load documents from HTTP or FTP servers. If you want to implement an application which is able to do this, you will have to use your own HTTP or FTP (or whatever) network client and hand over the loaded document to the TXmlParser using the LoadFromBuffer or SetBuffer methods.

Optimized for Speed

TXmlParser was developed with speed in mind. You will notice this when you compare the speed with other XML parsers. The idea is that there is a PChar pointer running through the document, analyzing what's at its position.

For this reason, the entire document must be a null terminated string.
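The pointer-walking idea can be illustrated with a few lines of Delphi. This is only a sketch of the technique, not code taken from TXmlParser; the function name CountOpenBrackets is made up for the example. It shows why the terminating null character matters: it is the only end-of-document test the loop needs.

```pascal
PROGRAM PCharWalkSketch;
{$APPTYPE CONSOLE}

// Hypothetical helper: counts the '<' characters in a null-terminated buffer
FUNCTION CountOpenBrackets (Doc : PChar) : INTEGER;
VAR
  P : PChar;
BEGIN
  Result := 0;
  P := Doc;
  WHILE P^ <> #0 DO BEGIN      // the terminating null marks the end of the document
    IF P^ = '<' THEN
      Inc (Result);
    Inc (P);                   // advance the pointer by one character
  END;
END;

BEGIN
  WriteLn (CountOpenBrackets ('<a><b/></a>'));   // prints 3
END.
```

No length has to be passed around or checked, which keeps the inner loop short.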

Public Interface

Properties

Note: Properties with the prefix "Cur" hold data concerning the current "Scan" step. All other properties are more or less independent of scanning.

XmlVersion

Type: AnsiString, read-only

This string contains the XML version number which is declared in the document's XML Prolog. The string is filled when the Prolog is scanned. Before that, it has the value '1.0'.

Encoding

Type: AnsiString, read-only

The name of the character encoding. Default for XML is 'UTF-8'. Another widely used encoding is 'ISO-8859-1', which is about the same as the Windows "ANSI" 1252 character set (which would be 'windows-1252').

The Encoding property holds the encoding name for the XML Document. If you want to determine the encoding of the current content while scanning, you must use "CurEncoding" (the encoding can change during the document when the parser reads an external entity which has a different encoding than the root document).

You can read the Encoding property at any time, but its value is overwritten when the XML Prolog is scanned (CurPartType = ptXmlProlog).

Standalone

Type: Boolean, read-only

If the XML Prolog says "standalone='yes'", then the Standalone property is TRUE, else it is FALSE.

RootName

Type: AnsiString, read-only

The name of the Root Element. This is determined from the DOCTYPE declaration. Until the DOCTYPE declaration is found in the document, "RootName" is empty. It remains set after a "StartScan". Note that RootName will not be set by the start tag of the root element.

Normalize

Type: Boolean, read/write

Set this property to TRUE if you want Element Content to be normalized. This means that:

Contiguous Whitespace characters will be compressed to one Space (#x20) character
Whitespace at the beginning and end of Content will be trimmed off

You can set Normalize at any time. The next "Scan" call will immediately work accordingly. So if you find a Start Tag with an "xml:space" attribute, you can (and must) set "Normalize" yourself.
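One way to react to "xml:space" might look like the following sketch. The unit and procedure names are hypothetical. The sketch assumes it is called right after Scan has returned a ptStartTag part, and it relies on the Value method of the attribute list returning an empty string for a missing attribute.

```pascal
UNIT XmlSpaceSketch;

INTERFACE

USES LibXmlParser;

// Hypothetical helper: call it after Scan has returned a ptStartTag part
PROCEDURE ApplyXmlSpace (Parser : TXmlParser);

IMPLEMENTATION

PROCEDURE ApplyXmlSpace (Parser : TXmlParser);
VAR
  Space : STRING;
BEGIN
  Space := Parser.CurAttr.Value ('xml:space');   // empty if the attribute is absent
  IF Space = 'preserve' THEN
    Parser.Normalize := FALSE                    // keep whitespace as it is
  ELSE IF Space = 'default' THEN
    Parser.Normalize := TRUE;                    // back to normalized content
END;

END.
```

The sketch does not restore the previous setting when the matching End Tag is reached. An application that needs correct scoping would have to remember the old value itself, for example on a stack.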

Note: Normalization of Attribute Values is completely governed by the XML Spec.

Note: The XML spec requires that all line breaks be changed to single linefeed (#x0A) characters. TXmlParser doesn't change line break characters, so the content keeps whatever line breaks are in your XML file (CR+LF sequences, for example).

Source

Type: AnsiString, read-only

This is not the Document source itself but instead the name of the source you got the document from:

If you loaded it via LoadFromFile, "Source" contains the Filename
If you loaded it via LoadFromBuffer, "Source" contains the string '<MEM>'
If you loaded it via SetBuffer, "Source" contains the string '<BUFFER>'

DocBuffer

Type: PChar, read-only

Returns a pointer to the first character of the document. If there is no document loaded, "DocBuffer" returns a pointer to a null (#x00) character. So you always have a valid pointer and never NIL.

Elements

Type: TElemList (derived from TObjectList), Attribute

This list contains all Element declarations which have been found in the DTD. Every Element definition is stored in a TElemDef object.

Entities

Type: TNvpList (derived from TObjectList), Attribute

This list contains all General Entity declarations which have been found in the DTD. Every Entity definition is stored in a TEntityDef object.

ParEntities

Type: TNvpList (derived from TObjectList), Attribute

This list contains all Parameter Entity declarations which have been found in the DTD. Every Parameter Entity definition is also stored in a TEntityDef object.

Notations

Type: TNvpList (derived from TObjectList), Attribute

This list contains all Notation declarations which have been found in the DTD. Every Notation definition is stored in a TNotationDef object.

CurPartType

Type: TPartType (Enumeration Type), Attribute

Every time the "Scan" method returns, CurPartType holds the type of the current part which has been found by "Scan". This can be one of the following part types:

CurPartType	Meaning	CurName	CurContent
ptNone	This should never be returned. If it is, there must be an error in the XML document (or in my code ;-)	Undefined	Undefined
ptXmlProlog	The XML Prolog has been read in. Now you can read the properties XmlVersion, Encoding, and Standalone.	Undefined	Undefined
ptComment	A comment has been found. You can retrieve the comment by extracting the buffer part from CurStart to CurFinal	Undefined	Untouched
ptPI	A Processing Instruction has been found. If it has "pseudo attributes", you can find these in the CurAttr list	Target name	PI content
ptStartTag	A Start Tag has been found. You can find the Attributes in the CurAttr list	Element name	Untouched
ptEmptyTag	An Empty-Element tag has been found. You can find the Attributes in the CurAttr list. NOTE: TXmlParser distinguishes between Empty-Element Tags and a Start Tag directly followed by an End Tag. So <BR/> will be returned as an Empty-Element Tag (ptEmptyTag) and <BR></BR> will be returned as a Start Tag followed by an End Tag.	Tag name	Untouched
ptEndTag	An End Tag has been found.	Tag name	Untouched
ptContent	Text Content (the part between Tags) has been found. General Entities have been resolved. If "Normalize" is TRUE, the content is already normalized. The Encoding has been translated by the "TranslateEncoding" method. If "Normalize" is TRUE, Whitespace-only content will not be returned (i.e. there will be no ptContent part for it).	Untouched	Content
ptCData	A CDATA section has been found. The Encoding has been translated by the "TranslateEncoding" method. Whitespace is unchanged.	Empty	Content

CurName

Type: AnsiString, Attribute

The Name of the last part which has a name (e.g. start tags or PIs have a name, comments or text contents don't have a name). If there is a part without a name, the CurName attribute stays untouched. So when you have a ptContent part, you (usually) know the name of the last tag by looking at CurName.

CurContent

Type: AnsiString, Attribute

The last Content (from a ptContent, ptCData or ptPI). Like CurName, CurContent is not overwritten by parts which have no content (like Tags or Comments).

CurStart, CurFinal

Type: PChar, Attribute

A pointer to the first (CurStart) and last (CurFinal) character of the current part returned by the Scan method. You can use these pointers to retrieve the exact part string.

Example: To extract the text of a comment (which the Scan method does not do for you), you can use CurStart and CurFinal:

SetString (MyComment, CurStart, CurFinal - CurStart + 1);

or you could use the SetStringSF procedure which is exported by the LibXmlParser unit:

SetStringSF (MyComment, CurStart, CurFinal);

CurAttr

Type: TAttrList, Attribute

This is a list of TAttr Objects. Every TAttr has a Name and a Value field, which contain the name and value of one attribute. The ValueType field of TAttr tells you where the value comes from:

ValueType	Meaning
vtNormal	The attribute has been specified completely in the tag
vtImplied	The attribute value is undefined; the attribute is defined as #IMPLIED by the DTD. Your application must know how to handle this attribute
vtFixed	The attribute is defined as #FIXED in the DTD. If there was an attribute value in the tag, it has been overwritten by the attribute default value from the DTD
vtDefault	The attribute was not specified in the tag; instead, it was added because it was defined in the DTD. The attribute value is the default value from the DTD's ATTLIST definition.

The AttrType field tells you the type of the Attribute. It is copied from the TAttrDef object which was created when the attribute was declared in the DTD.

CurEncoding

Type: AnsiString, read-only property

This is the name of the current Encoding. Encoding can change in the middle of the document if the parser has to parse an External Parsed Entity which has a different Encoding than the main document. This value is mainly used by the TranslateEncoding method. But it can also be used by the application.

Methods

PROCEDURE LoadFromFile (Filename : STRING);

Loads the File into the internal Buffer of the TXmlParser instance. If this is successful, then the Source property holds the name of the file.

PROCEDURE LoadFromBuffer (Buffer : PChar);

Loads the null-terminated string given by Buffer into the internal Buffer of the TXmlParser instance. The Source property has the value '<MEM>' after this step.

PROCEDURE SetBuffer (Buffer : PChar);

If you already have the XML Document loaded into memory and you don't want TXmlParser to keep the entire document in its own piece of memory, you can use the SetBuffer method. SetBuffer will not allocate memory but instead will let the internal FBuffer attribute point to your Buffer.

Note: The XML document must be null-terminated (i.e. there must be a null (#x00) character at the end).

Do not free your buffer until you have freed your TXmlParser instance or called its Clear method. Otherwise the parser would still point into released memory, which causes access violations.
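The lifetime rule can be sketched as follows. The example assumes that TXmlParser has a parameterless constructor and that STRING is an AnsiString, as in the Delphi versions this document targets. The document text is made up for illustration.

```pascal
PROGRAM SetBufferSketch;
{$APPTYPE CONSOLE}

USES LibXmlParser;

VAR
  Doc    : STRING;
  Parser : TXmlParser;
BEGIN
  Doc := '<greeting>Hello</greeting>';      // a Delphi string is null-terminated
  Parser := TXmlParser.Create;
  TRY
    Parser.SetBuffer (PChar (Doc));         // no copy: the parser points into Doc
    Parser.StartScan;
    WHILE Parser.Scan DO
      IF Parser.CurPartType = ptContent THEN
        WriteLn (Parser.CurContent);
  FINALLY
    Parser.Free;                            // release the parser first ...
  END;
  Doc := '';                                // ... and only then the buffer
END.
```

If the buffer cannot be guaranteed to outlive the parser, LoadFromBuffer is the safer choice because it keeps the document in the parser's own buffer.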

PROCEDURE Clear;

Clears all internal variables and deallocates all buffer space previously allocated by the TXmlParser instance. After this, the TXmlParser is prepared for loading a new buffer. Clear is automatically called by the loading methods like LoadFromFile, LoadFromBuffer, or SetBuffer.

PROCEDURE StartScan;

While you scan through your document (using the Scan method), there is always a pointer pointing to the current part of the document (in fact, it's two pointers: CurStart and CurFinal). StartScan initializes all pointers and all Cur* attributes in order to prepare for a new scan from the beginning of the Document.

You may call StartScan as often as you want and at any time.

FUNCTION Scan : BOOLEAN;

This is where scanning through the XML document really happens. The Scan method performs the following steps:

It determines the current position in the document (using the CurFinal pointer)
It advances one character and examines what it finds there (a tag, text content, whatever)
It analyzes what it has found, thereby giving all Cur* attributes new values

After that, Scan returns a boolean value which is

true, if Scan has found a new part in your Document
false, if Scan reached the end of the Document (i.e. a NULL #x00 character).

This behaviour of the Scan method was chosen so that you can write a WHILE loop for scanning through the document.
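A minimal sketch of such a loop follows. The file name is hypothetical, and the example assumes that TXmlParser has a parameterless constructor. It only uses the methods and properties described in this document.

```pascal
PROGRAM ScanLoopSketch;
{$APPTYPE CONSOLE}

USES LibXmlParser;

VAR
  Parser : TXmlParser;
BEGIN
  Parser := TXmlParser.Create;
  TRY
    Parser.LoadFromFile ('example.xml');    // hypothetical file name
    Parser.StartScan;                       // position at the start of the document
    WHILE Parser.Scan DO                    // FALSE once the end is reached
      CASE Parser.CurPartType OF
        ptStartTag,
        ptEmptyTag : WriteLn ('Element: ', Parser.CurName);
        ptContent,
        ptCData    : WriteLn ('Text:    ', Parser.CurContent);
        ptEndTag   : WriteLn ('End of:  ', Parser.CurName);
      END;
  FINALLY
    Parser.Free;
  END;
END.
```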

With this WHILE loop, you can keep everything you need in local variables of the procedure/function/method which analyzes your XML Document. In an event-centric model, there would be a procedure call for every Document part, so you would have to keep your state in more or less global variables.

Virtual Methods

With the virtual methods of TXmlParser you can modify the behaviour of TXmlParser. Just override them in a class descendant of your own.

FUNCTION LoadExternalEntity (SystemId, PublicId, Notation : STRING) : TXmlParser; VIRTUAL;

Override this method if you have implemented a special mechanism for loading documents, or if you want to process PUBLIC IDs or Notations.

LoadExternalEntity is called for every External Entity (be it Parsed or Unparsed). The known System and/or Public IDs and the Notation are passed as strings to this method. It has to create a new TXmlParser instance and load the desired Entity into the buffer of that instance.

LoadExternalEntity is also called when the External DTD Subset is to be loaded.
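An override might be sketched like this. The class name TFileEntityParser is hypothetical, and the sketch simply treats every SYSTEM id as a local file name. It assumes that TXmlParser has a parameterless constructor and that the method can be overridden in the public section.

```pascal
UNIT FileEntityParserSketch;

INTERFACE

USES LibXmlParser;

TYPE
  // Hypothetical descendant which resolves every SYSTEM id as a local file name
  TFileEntityParser = CLASS (TXmlParser)
  PUBLIC
    FUNCTION LoadExternalEntity (SystemId, PublicId, Notation : STRING) : TXmlParser; OVERRIDE;
  END;

IMPLEMENTATION

FUNCTION TFileEntityParser.LoadExternalEntity (SystemId, PublicId, Notation : STRING) : TXmlParser;
BEGIN
  Result := TFileEntityParser.Create;   // a new parser instance receives the entity
  Result.LoadFromFile (SystemId);       // PublicId and Notation are ignored in this sketch
END;

END.
```

An application that loads entities over a network would fetch the document with its own client here and hand it over with LoadFromBuffer instead of LoadFromFile.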

FUNCTION TranslateEncoding (CONST Source : STRING) : STRING; VIRTUAL;

NOTE: This function has been dropped in Version 2. All conversions are now done automatically.

For Version 1: The XML Specification states that every XML parser must be able to handle UTF-8 and UTF-16 documents. Besides these, parsers should be able to handle other Encodings. The encoding for a document is defined in the XML Prolog (for entire XML Documents) or in a Text Declaration at the beginning of Parsed External Entities or External DTD subsets.

So there is a source Encoding (the Encoding of the Document and its external parts) and a destination encoding (the encoding your application wishes to process). For every content string which is passed to your application (Text Content between Tags, CDATA sections, Attribute values) the TranslateEncoding method is called. It retrieves the current source encoding by looking at the CurEncoding property and translates the passed "Source" string into the desired destination encoding.

The TranslateEncoding method that is built into TXmlParser assumes that the destination encoding is the Windows ANSI encoding used in Windows apps. It can handle UTF-8 and ISO-8859-1 as source encodings. Note: It is assumed here that ISO-8859-1 and "Windows ANSI" are the same, which is not exactly true for some characters. But for the largest part of documents this should be true.

UTF-8 is translated correctly into the single-byte Windows-1252 (ANSI) format.

At the time of this writing, TXmlParser is not able to handle multi-byte character strings. This is likely to change in the future.

PROCEDURE DtdElementFound (DtdElementRec : TDtdElementRec); VIRTUAL;

The Scan method only tells you that it has scanned the Document Type Declaration. It doesn't tell you anything about what it found in the DTD. (You could inspect the lists Elements, Entities, ParEntities, and Notations, but then you would still know nothing about comments or PIs inside the DTD.)

If you want to build a validating parser or a tool which presents the elements of the DTD of your XML document, or if you want to handle comments or PIs inside the DTD, you can override the DtdElementFound virtual method. It is called every time a DTD element is found during the scan of the Document Type Declaration.

DtdElementFound gets passed a TDtdElementRec, which is a variant record with the following declaration:

  TDtdElementRec = RECORD    // --- This Record is returned by the DTD parser callback function
                     CASE ElementType : TDtdElemType OF
                       deElement,
                       deAttList  : (ElemDef      : TElemDef);
                       deEntity   : (EntityDef    : TEntityDef);
                       deNotation : (NotationDef  : TNotationDef);
                       dePI       : (Target       : PChar;
                                     Content      : PChar;
                                     AttrList     : TAttrList);
                       deComment  : (Start, Final : PChar);
                       deError    : (Pos          : PChar);
                   END;

The ElementType field tells you which type of DTD element the parser has just found. Depending on this field, you can find out what has been found:

ElementType Field	Description
deElement, deAttList	An <!ELEMENT> or <!ATTLIST> declaration has been found. The ElemDef field points to the TElemDef instance created (for deElement) or filled with Attribute definitions (for deAttList)
deEntity	An <!ENTITY> declaration has been found. The EntityDef field points to the TEntityDef instance created
deNotation	A <!NOTATION> declaration has been found. The NotationDef field points to the TNotationDef instance created
dePI	A Processing Instruction (PI) has been found inside the DTD. Target points to a null-terminated string containing the PI target; Content points to the part between the target and ?> in the PI. AttrList is the list of pseudo attributes in the PI.
deComment	A comment has been found inside the DTD. Start points to the opening '<!--' and Final points to the closing '>' of the comment.
deError	There is an error inside the DTD. Pos points to the position of the error

Note that all pointers are only valid when DtdElementFound is called. Don't keep them for later use.
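The following sketch shows how an override could dispatch on the ElementType field. The class name TDtdLister is hypothetical, and the output goes to the console only for the sake of illustration. Because the pointers are only valid during the call, the comment text is copied into a string right away.

```pascal
UNIT DtdListerSketch;

INTERFACE

USES LibXmlParser;

TYPE
  // Hypothetical descendant which lists everything found in the DTD
  TDtdLister = CLASS (TXmlParser)
  PUBLIC
    PROCEDURE DtdElementFound (DtdElementRec : TDtdElementRec); OVERRIDE;
  END;

IMPLEMENTATION

PROCEDURE TDtdLister.DtdElementFound (DtdElementRec : TDtdElementRec);
VAR
  Comment : STRING;
BEGIN
  CASE DtdElementRec.ElementType OF
    deElement  : WriteLn ('ELEMENT  ', DtdElementRec.ElemDef.Name);
    deAttList  : WriteLn ('ATTLIST  ', DtdElementRec.ElemDef.Name);
    deEntity   : WriteLn ('ENTITY   ', DtdElementRec.EntityDef.Name);
    deNotation : WriteLn ('NOTATION ', DtdElementRec.NotationDef.Name);
    dePI       : WriteLn ('PI       ', DtdElementRec.Target);
    deComment  : BEGIN
                   // copy the text now; Start and Final are invalid after this call
                   SetStringSF (Comment, DtdElementRec.Start, DtdElementRec.Final);
                   WriteLn ('COMMENT  ', Comment);
                 END;
    deError    : WriteLn ('Error in DTD');
  END;
END;

END.
```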

Other Classes

TNvpNode, TNvpList

Name/Value Pairs (NVPs) are not handled using a TStringList (and its Names and Values properties). Instead, they are handled in a TNvpNode with the fields Name and Value and a special list for such nodes, TNvpList, which has special methods to get elements from the list. This concept was introduced because there are nodes derived from TNvpNode (like TAttr) which have additional fields.

Method	Description
PROCEDURE Add (Node : TNvpNode)	Adds a new node to the list. Nodes are always sorted by name so the Node method can use a binary search
FUNCTION Node (Name : STRING) : TNvpNode	Retrieves the node instance with the given name. If the node can not be found, NIL is returned.
FUNCTION Node (Index : INTEGER) : TNvpNode	Retrieves the node instance at the given position in the list. Index is zero-based, i.e. for Index=1 the second element is returned. If Index is smaller than zero or larger than the number of nodes in the list (minus 1), NIL is returned.
FUNCTION Value (Name : STRING) : STRING	Retrieves the string value of the given name. If there is no node with the name, an empty string is returned.
FUNCTION Value (Index : INTEGER) : STRING	Retrieves the string value at the given position in the list. Index is zero-based, i.e. for Index=1 the second element is returned. If Index is smaller than zero or larger than the number of nodes in the list (minus 1), an empty string is returned.
FUNCTION Name (Index : INTEGER) : STRING	Retrieves the name of the node at the given position in the list. Index is zero-based, i.e. for Index=1 the second element is returned. If Index is smaller than zero or larger than the number of nodes in the list (minus 1), an empty string is returned.
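As an illustration of the index-based methods, the following sketch prints all attributes of the current tag. The unit and procedure names are hypothetical. The Count property is assumed to be inherited from TObjectList.

```pascal
UNIT AttrDumpSketch;

INTERFACE

USES LibXmlParser;

// Hypothetical helper: call it after Scan has returned ptStartTag or ptEmptyTag
PROCEDURE DumpAttributes (Parser : TXmlParser);

IMPLEMENTATION

PROCEDURE DumpAttributes (Parser : TXmlParser);
VAR
  I : INTEGER;
BEGIN
  FOR I := 0 TO Parser.CurAttr.Count - 1 DO      // Index is zero-based
    WriteLn (Parser.CurAttr.Name (I), ' = ', Parser.CurAttr.Value (I));
END;

END.
```

When only one attribute is of interest, Value with the attribute name is shorter, for example Parser.CurAttr.Value ('id') with a hypothetical attribute "id".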

TAttr, TAttrList

Derived from TNvpNode and TNvpList. Used for passing back Tag attributes to the application.

TAttr has two additional fields to hold information about the Attribute, ValueType and AttrType:

Field	Description
ValueType	See the description of the CurAttr property
AttrType	The type of the attribute, as declared in the DTD:

Type	Description
atUnknown	Unknown type
atCData	Character data only
atID	An ID (Unique attribute value)
atIdRef	An ID reference
atIdRefs	Several ID References, separated by Whitespace
atEntity	Name of an unparsed Entity
atEntities	Several unparsed Entity names, separated by Whitespace
atNmToken	Name Token
atNmTokens	Several Name Tokens, separated by Whitespace
atNotation	A selection of Notation names (Unparsed Entity), separated by pipe symbols. You can find these in the Notations field of the TAttrDef definition to which the Attribute belongs
atEnumeration	Enumeration (possible values, separated by pipe symbols). You can find the enumeration definition in the TypeDef field of the TAttrDef definition to which the Attribute belongs

TEntityStack, TEntityStackNode

When the parser scans through the document, it can find a reference to a parsed entity, internal or external. In this case, the current position pointer is pushed to a stack (the EntityStack) and set to the first character of the entity replacement text. After the entity is scanned, the old pointer is popped off the stack and processing of the original document continues. As Entity references may nest, this has to be organized as a stack.

TAttrDef

Every Attribute definition from an <!ATTLIST> declaration is stored in a TAttrDef instance, which is inserted into the TElemDef to which the Attribute definition belongs. TAttrDef has the following fields:

Field	Description
Name	Name of the Attribute
Value	Default value
TypeDef	The Type definition from the <!ATTLIST> declaration
Notations	The listing of notations, if it is a NOTATION attribute. The notation names are separated by pipe symbols.
AttrType	Type of the Attribute
DefaultType	Type of the default value declaration of the Attribute (normal default value, #REQUIRED, #IMPLIED, #FIXED)

TElemDef, TElemList

TElemDef holds the data of an <!ELEMENT> definition:

Field	Description
Name	Name of the Element
ElemType	Type of the element (one of the values listed below)
Definition	The exact definition of the element from the DTD

The ElemType field can have the following values:

ElemType	Description
etEmpty	Element is always an Empty Element
etAny	Element can have any mixture of PCDATA and any elements
etChildren	Element must contain only elements, no PCDATA
etMixed	Mixed PCDATA and elements. The Definition field holds the exact definition as specified in the DTD.

Because TElemDef is both a node and a list, there is a special TElemList, which has almost the same code as TNvpList.

TEntityDef

TEntityDef holds the data of an <!ENTITY> definition. Depending on the type (General or Parameter Entity), the TEntityDef node is added to the Entities or ParEntities list.

Field	Description
Name	Name of the entity
Value	The replacement text of the entity
SystemId	For External Entities, this field contains the SYSTEM id
PublicId	For External Entities, this field contains the PUBLIC id. This field may be empty.
NotationName	For NDATA Unparsed External Entities, this field contains the Notation Name.

TNotationDef

TNotationDef holds the data of a <!NOTATION> definition:

Field	Description
Name	Name of the notation
Value	SYSTEM id
PublicId	PUBLIC id

Standalone Procedures and Functions

FUNCTION ConvertWs (Source: STRING; PackWs: BOOLEAN) : STRING;

Converts all Whitespace characters (Space, Tab, Carriage Return, Linefeed) in the String to Space #x20 characters. If the PackWs parameter is true, contiguous whitespace characters will be packed to one space character.
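The effect of the PackWs parameter can be tried out with a small program like the one below. The input string is made up for the example. The brackets in the output only make leading and trailing spaces visible.

```pascal
PROGRAM ConvertWsSketch;
{$APPTYPE CONSOLE}

USES LibXmlParser;

CONST
  Raw = 'one'#9#9'two'#13#10'three';    // two tabs, then a CR+LF pair

BEGIN
  // PackWs = FALSE: every whitespace character is replaced by its own space
  WriteLn ('[', ConvertWs (Raw, FALSE), ']');
  // PackWs = TRUE: each run of whitespace is packed to a single space
  WriteLn ('[', ConvertWs (Raw, TRUE), ']');
END.
```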

PROCEDURE SetStringSF (VAR S : STRING; BufferStart, BufferFinal : PChar);

Works like the standard SetString procedure, with one difference: the last parameter (BufferFinal) points to the last character to transfer into the string instead of giving the length.

FUNCTION StrSFPas (Start, Finish : PChar) : STRING;

Same as SysUtils.StrPas. In addition to the start of the string, the last character is also passed (Finish).

FUNCTION TrimWs (Source : STRING) : STRING;

Trims all whitespace characters off the beginning and end of the Source string.

FUNCTION AnsiToUtf8 (Source : ANSISTRING) : STRING;

Converts the Windows 1252 ANSI Source string to a UTF-8 string.

FUNCTION Utf8ToAnsi (Source : STRING; UnknownChar : CHAR = '¿') : ANSISTRING;

Converts the given UTF-8 string to Windows ANSI. Unicode characters which don't fit into the Windows-1252 range are converted to the "UnknownChar" character, which defaults to an inverted question mark ('¿').
