This document gives in-depth information about the TXmlParser XML Parser for Borland Delphi. You can download the latest version of the parser at www.destructor.de. I assume that you have an understanding of the terms of XML. The first letter of XML terms is capitalized in this document.
TXmlParser is a Delphi CLASS which is responsible for
As the XML specification points out, TXmlParser is the "XML Processor". The code that uses TXmlParser is the "Application".
The common container class for TXmlParser is the TObjectList that comes with Delphi 5. If you have a version below 5, you can easily derive one from TList. TObjectList assumes that the objects added to the list 'belong' to the list; so calls to "Delete", "Clear" or "Free" also destroy the objects.
It is out of the scope of TXmlParser to load documents from HTTP or FTP servers. If you want to implement an application which is able to do this, you will have to use your own HTTP or FTP (or whatever) network client and hand over the loaded document to the TXmlParser using the LoadFromBuffer or SetBuffer methods.
TXmlParser was developed with speed in mind. You will notice this when you compare the speed with other XML parsers. The idea is that there is a PChar pointer running through the document, analyzing what's at its position.
For this reason, the entire document must be a null terminated string.
Note: Properties with the prefix "Cur" hold data concerning the current "Scan" step. All other properties are more or less independent of scanning.
Type: AnsiString, read-only
This string contains the XML version number which is declared in the document's XML Prolog. The string is filled when the Prolog is scanned. Before that, it has the value '1.0'.
Type: AnsiString, read-only
The name of the character encoding. Default for XML is 'UTF-8'. Another widely used encoding is 'ISO-8859-1', which is about the same as the Windows "ANSI" 1252 character set (which would be 'windows-1252').
The Encoding property holds the encoding name for the XML Document. If you want to determine the encoding of the current content while scanning, you must use "CurEncoding" (the encoding can change during the document when the parser reads an external entity which has a different encoding than the root document).
You can read the Encoding property at any time, but it is reset when the XML Prolog is scanned (CurPartType = ptXmlProlog).
Type: Boolean, read-only
If the XML Prolog says "standalone='yes'", then the Standalone property is TRUE, else it is FALSE.
Type: AnsiString, read-only
The name of the Root Element. This is determined from the DOCTYPE declaration. Until the DOCTYPE declaration is found in the document, "RootName" is empty. It remains set after a "StartScan". Note that RootName will not be set by the start tag of the root element.
Type: Boolean, read/write
Set this property to TRUE if you want Element Content to be normalized. This means that:
You can set Normalize at any time. The next "Scan" call will immediately work accordingly. So if you find a Start Tag with an "xml:space" attribute, you can (and must) set "Normalize" yourself.
Note: Normalization of Attribute Values is completely governed by the XML Spec.
Note: The XML spec requires that all line breaks be changed to single linefeed (#x0A) characters. TXmlParser doesn't change line break characters, so the content may contain CR+LF sequences, depending on what's in your XML file.
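If your application needs the line-break handling the XML spec asks for, it can apply it to the content itself. The following is only a minimal sketch and not part of TXmlParser; the function name `ToLinefeeds` is made up for this illustration.

```pascal
PROGRAM LineBreakDemo;
{$APPTYPE CONSOLE}

USES
  SysUtils;

// Hypothetical helper, not part of TXmlParser: turns CR+LF pairs and
// lone CR characters into single LF characters.
FUNCTION ToLinefeeds (CONST S : STRING) : STRING;
BEGIN
  Result := StringReplace (S,      #13#10, #10, [rfReplaceAll]);
  Result := StringReplace (Result, #13,    #10, [rfReplaceAll]);
END;

BEGIN
  // 'a' CR LF 'b' CR 'c' (6 characters) becomes 'a' LF 'b' LF 'c' (5 characters)
  WriteLn (Length (ToLinefeeds ('a'#13#10'b'#13'c')));
END.
```

You would typically call such a helper on CurContent after a ptContent or ptCData part.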
Type: AnsiString, read-only
This is not the Document source itself but instead the name of the source you got the document from:
Type: PChar, read-only
Returns a pointer to the first character of the document. If there is no document loaded, "DocBuffer" returns a pointer to a null (#x00) character. So you always have a valid pointer and never NIL.
Type: TElemList (derived from TObjectList), Attribute
This list contains all Element declarations which have been found in the DTD. Every Element definition is stored in a TElemDef object.
Type: TNvpList (derived from TObjectList), Attribute
This list contains all General Entity declarations which have been found in the DTD. Every Entity definition is stored in a TEntityDef object.
Type: TNvpList (derived from TObjectList), Attribute
This list contains all Parameter Entity declarations which have been found in the DTD. Every Parameter Entity definition is also stored in a TEntityDef object.
Type: TNvpList (derived from TObjectList), Attribute
This list contains all Notation declarations which have been found in the DTD. Every Notation definition is stored in a TNotationDef object.
Type: TPartType (Enumeration Type), Attribute
Every time the "Scan" method returns, CurPartType holds the type of the current part which has been found by "Scan". This can be one of the following part types:
CurPartType | Meaning | CurName | CurContent |
---|---|---|---|
ptNone | This should never be returned. If it is, there must be an error in the XML document (or in my code ;-) | Undefined | Undefined |
ptXmlProlog | The XML Prolog has been read in. Now you can read the properties XmlVersion, Encoding, and Standalone. | Undefined | Undefined |
ptComment | A comment has been found. You can retrieve the comment by extracting the buffer part from CurStart to CurFinal | Undefined | Untouched |
ptPI | A Processing Instruction has been found. If it has "pseudo attributes", you can find these in the CurAttr list | Target name | PI content |
ptStartTag | A Start Tag has been found. You can find the Attributes in the CurAttr list | Element name | Untouched |
ptEmptyTag | An Empty-Element tag has been found. You can find the Attributes in the CurAttr list. NOTE: TXmlParser distinguishes between Empty-Element Tags and a Start Tag directly followed by an End Tag. So <BR/> will be returned as an Empty-Element Tag (ptEmptyTag) and <BR></BR> will be returned as a Start Tag followed by an End Tag. | Tag name | Untouched |
ptEndTag | An End Tag has been found. | Tag name | Untouched |
ptContent | Text Content (the part between Tags) has been found. General Entities have been resolved. If "Normalize" is TRUE, the content is already normalized and Whitespace-only content is not returned (i.e. there will be no ptContent part for it). The Encoding has been translated by the "TranslateEncoding" method. | Untouched | Content |
ptCData | A CDATA section has been found. The Encoding has been translated by the "TranslateEncoding" method. Whitespace is unchanged. | Empty | Content |
Type: AnsiString, Attribute
The Name of the last part which has a name (e.g. start tags or PIs have a name, comments or text contents don't have a name). If there is a part without a name, the CurName attribute stays untouched. So when you have a ptContent part, you (usually) know the name of the last tag by looking at CurName.
Type: AnsiString, Attribute
The last Content (from a ptContent, ptCData or ptPI). Like CurName, CurContent is not overwritten by parts which have no content (like Tags or Comments).
Type: PChar, Attribute
A pointer to the first (CurStart) and last (CurFinal) character of the current part returned by the Scan method. You can use these pointers to retrieve the exact part string.
Example: If you want to extract the contents of a comment (which is not done by the Scan method), you can use CurStart and CurFinal:
SetString (MyComment, CurStart, CurFinal - CurStart + 1);
or you could use the SetStringSF function which is exported by the LibXmlParser unit:
SetStringSF (MyComment, CurStart, CurFinal);
Type: TAttrList, Attribute
This is a list of TAttr Objects. Every TAttr has a Name and a Value field, which contain the name and value of one attribute. The ValueType field of TAttr tells you where the value comes from:
ValueType | Meaning |
---|---|
vtNormal | The attribute has been specified completely in the tag |
vtImplied | The attribute value is undefined; the attribute is defined as #IMPLIED by the DTD. Your application must know how to handle this attribute |
vtFixed | The attribute is defined as #FIXED in the DTD. If there was an attribute value in the tag, it has been overwritten by the attribute default value from the DTD |
vtDefault | The attribute was not specified in the tag; instead, it was added because it was defined in the DTD. The attribute value is the default value from the DTD's ATTLIST definition. |
The AttrType field tells you the type of the Attribute. It is copied from the TAttrDef object which was created when the attribute was declared in the DTD.
Type: AnsiString, read-only property
This is the name of the current Encoding. Encoding can change in the middle of the document if the parser has to parse an External Parsed Entity which has a different Encoding than the main document. This value is mainly used by the TranslateEncoding method. But it can also be used by the application.
Loads the File into the internal Buffer of the TXmlParser instance. If this is successful, then the Source property holds the name of the file.
Loads the null-terminated string given by Buffer into the internal Buffer of the TXmlParser instance. The Source property has the value '<MEM>' after this step.
If you already have the XML Document loaded into memory and you don't want TXmlParser to keep the entire document in its own piece of memory, you can use the SetBuffer method. SetBuffer will not allocate memory but instead will let the internal FBuffer attribute point to your Buffer.
Note: The XML document must be null-terminated (i.e. there is a null (#x00) character at the end).
You must not free your buffer before you free your TXmlParser instance or call the Clear method. This would cause access violations.
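A minimal sketch of the buffer lifetime rule, assuming a console program and an in-memory document held in an AnsiString. The document text is made up for this illustration, and the PAnsiChar cast assumes that SetBuffer takes a PChar as described above.

```pascal
PROGRAM SetBufferDemo;
{$APPTYPE CONSOLE}

USES
  LibXmlParser;

VAR
  Doc    : AnsiString;
  Parser : TXmlParser;
BEGIN
  Doc    := '<?xml version="1.0"?><root><item>text</item></root>';
  Parser := TXmlParser.Create;
  TRY
    // Delphi long strings carry a trailing #0, so the buffer is null-terminated
    Parser.SetBuffer (PAnsiChar (Doc));
    Parser.StartScan;
    WHILE Parser.Scan DO
      IF Parser.CurPartType = ptContent THEN
        WriteLn (Parser.CurContent);
  FINALLY
    Parser.Free;   // free the parser first ...
  END;
  Doc := '';       // ... and only then release the document memory
END.
```

Because the parser only points into `Doc`, the string must not be modified or released while the parser is still in use.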
Clears all internal variables and deallocates all buffer space previously allocated by the TXmlParser instance. After this, the TXmlParser is prepared for loading a new buffer. Clear is automatically called by the loading methods like LoadFromFile, LoadFromBuffer, or SetBuffer.
While you scan through your document (using the Scan method), there is always a pointer pointing to the current part of the document (in fact, it's two pointers: CurStart and CurFinal). StartScan initializes all pointers and all Cur* attributes in order to prepare for a new scan from the beginning of the Document.
You may call StartScan as often as you want and at any time.
This is where scanning through the XML document really happens. The Scan method performs the following steps:
After that, Scan returns a boolean value which is TRUE if another part was found and FALSE if the end of the document has been reached.
This behaviour of the Scan method was chosen so that you can write a WHILE loop for scanning through the document.
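Such a loop could look like the following minimal sketch. It assumes a console program and a file named `example.xml` (a hypothetical name), and it assumes that CurAttr exposes the `Count` property inherited from TObjectList. Checking whether loading succeeded is left out for brevity.

```pascal
PROGRAM ScanDemo;
{$APPTYPE CONSOLE}

USES
  LibXmlParser;

VAR
  Parser : TXmlParser;
  i      : INTEGER;
BEGIN
  Parser := TXmlParser.Create;
  TRY
    Parser.LoadFromFile ('example.xml');   // hypothetical file name
    Parser.StartScan;
    WHILE Parser.Scan DO
      CASE Parser.CurPartType OF
        ptStartTag,
        ptEmptyTag : BEGIN
                       WriteLn ('Tag: ', Parser.CurName);
                       // List the attributes of the current tag
                       FOR i := 0 TO Parser.CurAttr.Count - 1 DO
                         WriteLn ('  ', Parser.CurAttr.Name (i), ' = ',
                                        Parser.CurAttr.Value (i));
                     END;
        ptContent  : WriteLn ('Text: ', Parser.CurContent);
        ptCData    : WriteLn ('CDATA: ', Parser.CurContent);
        ptEndTag   : WriteLn ('End: ', Parser.CurName);
      END;
  FINALLY
    Parser.Free;
  END;
END.
```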
With this WHILE loop, you can handle everything that you need in local variables of the procedure/function/method which analyzes your XML Document. In an event centric model, there would be a procedure call for every Document part and so you would have to handle everything in more or less global variables.
With the virtual methods of TXmlParser you can modify the behaviour of TXmlParser. Just override them in a class descendant of your own.
Override this method if you have implemented a special mechanism for loading documents, or if you want to process PUBLIC IDs or Notations.
LoadExternalEntity is called for every External Entity (be it Parsed or Unparsed). The known System and/or Public IDs and the Notation are passed as strings to this method. It has to create a new TXmlParser instance and load the desired Entity into the buffer of that instance.
LoadExternalEntity is also called when the External DTD Subset is to be loaded.
NOTE: This function has been dropped in Version 2. All conversions are now done automatically.
For Version 1: The XML Specification states that every XML parser must be able to handle UTF-8 and UTF-16 documents. Beside these, parsers should be able to handle other Encodings. The encoding for a document is defined in the XML Prolog (for entire XML Documents) or in a Text Declaration at the beginning of Parsed External Entities or External DTD subsets.
So there is a source Encoding (the Encoding of the Document and its external parts) and a destination encoding (the encoding your application wishes to process). For every content string which is passed to your application (Text Content between Tags, CDATA sections, Attribute values) the TranslateEncoding method is called. It retrieves the current source encoding by looking at the CurEncoding property and translates the passed "Source" string into the desired destination encoding.
The TranslateEncoding method that is built into TXmlParser assumes that the destination encoding is the Windows ANSI encoding used in Windows apps. It can handle UTF-8 and ISO-8859-1 as source encodings. Note: It is assumed here that ISO-8859-1 and "Windows ANSI" are the same, which is not exactly true for some characters. But for the largest part of documents this should be true.
UTF-8 is correctly translated into the single-byte ANSI Windows-1252 format.
At the time of this writing, TXmlParser is not able to handle multi-byte character strings. This is likely to change in the future.
The Scan method only tells you that it has just scanned the DTD declaration. It doesn't tell you anything about what it found in the DTD. (You could inspect the lists Elements, Entities, ParEntities, and Notations, but then you would still know nothing about comments or PIs inside the DTD.)
If you want to build a validating parser or a tool which presents the elements of the DTD of your XML document, or if you want to handle comments or PIs inside the DTD, you can override the DtdElementFound virtual method. It is called every time a DTD element is found during the scan of the Document Type Declaration.
DtdElementFound gets passed a TDtdElementRec, which is a variant record with the following declaration:
    TDtdElementRec = RECORD    // --- This Record is returned by the DTD parser callback function
                       CASE ElementType : TDtdElemType OF
                         deElement,
                         deAttList  : (ElemDef      : TElemDef);
                         deEntity   : (EntityDef    : TEntityDef);
                         deNotation : (NotationDef  : TNotationDef);
                         dePI       : (Target       : PChar;
                                       Content      : PChar;
                                       AttrList     : TAttrList);
                         deComment  : (Start, Final : PChar);
                         deError    : (Pos          : PChar);
                     END;
The ElementType field tells you which type of DTD element the parser has just found. Depending on this field, you can find out what has been found:
ElementType Field | Description |
---|---|
deElement, deAttList | An <!ELEMENT> or <!ATTLIST> declaration has been found. The ElemDef field points to the TElemDef instance created (for deElement) or filled with Attribute definitions (for deAttList) |
deEntity | An <!ENTITY> declaration has been found. The EntityDef field points to the TEntityDef instance created |
deNotation | A <!NOTATION> declaration has been found. The NotationDef field points to the TNotationDef instance created |
dePI | A Processing Instruction (PI) has been found inside the DTD. Target points to a null-terminated string containing the PI target; Content points to the part between the target and ?> in the PI. AttrList is the list of pseudo attributes in the PI. |
deComment | A comment has been found inside the DTD. Start points to the opening '<!--' and Final points to the closing '>' of the comment. |
deError | There is an error inside the DTD. Pos points to the position of the error |
Note that all pointers are only valid when DtdElementFound is called. Don't keep them for later use.
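The following is a rough sketch of such an override. The exact signature of DtdElementFound is an assumption here (a single TDtdElementRec parameter, passed by value); check it against the declaration in the LibXmlParser unit. The unit and class names are hypothetical.

```pascal
UNIT DtdLister;   // hypothetical unit name

INTERFACE

USES
  LibXmlParser;

TYPE
  // Hypothetical descendant which prints what the DTD parser reports
  TDtdLister = CLASS (TXmlParser)
  PUBLIC
    // Assumed signature; verify against LibXmlParser
    PROCEDURE DtdElementFound (DtdElementRec : TDtdElementRec); OVERRIDE;
  END;

IMPLEMENTATION

PROCEDURE TDtdLister.DtdElementFound (DtdElementRec : TDtdElementRec);
VAR
  Comment : AnsiString;
BEGIN
  CASE DtdElementRec.ElementType OF
    deElement  : WriteLn ('ELEMENT:  ', DtdElementRec.ElemDef.Name);
    deAttList  : WriteLn ('ATTLIST:  ', DtdElementRec.ElemDef.Name);
    deEntity   : WriteLn ('ENTITY:   ', DtdElementRec.EntityDef.Name);
    deNotation : WriteLn ('NOTATION: ', DtdElementRec.NotationDef.Name);
    deComment  : BEGIN
                   // Copy the text now; the pointers are only valid during this call
                   SetStringSF (Comment, DtdElementRec.Start, DtdElementRec.Final);
                   WriteLn ('COMMENT:  ', Comment);
                 END;
    deError    : WriteLn ('Error in DTD');
  END;
END;

END.
```

Copying the comment into a string inside the callback is what makes it safe to use the text afterwards.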
Name/Value Pairs (NVPs) are not handled using a TStringList (and its Names and Values properties). Instead, they are handled in a TNvpNode with the fields Name and Value and a special list for such nodes, TNvpList, which has special methods to get elements from the list. This concept was introduced because there are nodes derived from TNvpNode (like TAttr) which have additional fields.
Method | Description |
---|---|
PROCEDURE Add (Node : TNvpNode) | Adds a new node to the list. Nodes are always sorted by name so the Node method can use a binary search |
FUNCTION Node (Name : STRING) : TNvpNode | Retrieves the node instance with the given name. If the node can not be found, NIL is returned. |
FUNCTION Node (Index : INTEGER) : TNvpNode | Retrieves the node instance at the given position in the list. Index is zero-based, i.e. for Index=1 the second element is returned. If Index is smaller than zero or larger than the number of nodes in the list (minus 1), NIL is returned. |
FUNCTION Value (Name : STRING) : STRING | Retrieves the string value of the given name. If there is no node with the name, an empty string is returned. |
FUNCTION Value (Index : INTEGER) : STRING | Retrieves the string value at the given position in the list. Index is zero-based, i.e. for Index=1 the second element is returned. If Index is smaller than zero or larger than the number of nodes in the list (minus 1), an empty string is returned. |
FUNCTION Name (Index : INTEGER) : STRING | Retrieves the name of the node at the given position in the list. Index is zero-based, i.e. for Index=1 the second element is returned. If Index is smaller than zero or larger than the number of nodes in the list (minus 1), an empty string is returned. |
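As an illustration of the lookup by name, here is a minimal sketch that uses the CurAttr list (whose type is derived from TNvpList) to tell a missing attribute apart from one that is present. The document text and the attribute name `id` are made up for this example.

```pascal
PROGRAM AttrLookup;
{$APPTYPE CONSOLE}

USES
  LibXmlParser;

VAR
  Parser : TXmlParser;
  Attr   : TNvpNode;
BEGIN
  Parser := TXmlParser.Create;
  TRY
    Parser.LoadFromBuffer ('<list><item id="a1"/><item/></list>');
    Parser.StartScan;
    WHILE Parser.Scan DO
      IF (Parser.CurPartType = ptEmptyTag) AND (Parser.CurName = 'item') THEN BEGIN
        // Node returns NIL if there is no attribute with this name
        Attr := Parser.CurAttr.Node ('id');
        IF Attr = NIL
          THEN WriteLn ('item without id')
          ELSE WriteLn ('item id = ', Attr.Value);
      END;
  FINALLY
    Parser.Free;
  END;
END.
```

Using Node instead of Value is a deliberate choice here: Value returns an empty string both for a missing attribute and for an attribute whose value is empty.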
TAttr and TAttrList are derived from TNvpNode and TNvpList, respectively. They are used for passing back Tag attributes to the application.
TAttr has two additional fields to hold information about the Attribute, ValueType and AttrType:
Field | Description |
---|---|
ValueType | See the description of the CurAttr property |
AttrType | The type of the attribute, as declared in the DTD |
When the parser scans through the document, it can find a reference to a parsed entity, internal or external. In this case, the current position pointer is pushed to a stack (the EntityStack) and set to the first character of the entity replacement text. After the entity is scanned, the old pointer is popped off the stack and processing of the original document continues. As Entity references may nest, this has to be organized as a stack.
Every <!ATTLIST> declaration gets transferred into a TAttrDef instance, which is inserted into the TElemDef to which the Attribute definition belongs. TAttrDef has the following fields:
Field | Description |
---|---|
Name | Name of the Attribute |
Value | Default value |
TypeDef | The Type definition from the <!ATTLIST> declaration |
Notations | The listing of notations, if it is a NOTATION attribute. The notation names are separated by pipe symbols. |
AttrType | Type of the Attribute |
DefaultType | Type of the default value declaration of the Attribute (normal default value, #REQUIRED, #IMPLIED, #FIXED) |
TElemDef holds the data of an <!ELEMENT> definition:
Field | Description |
---|---|
Name | Name of the Element |
ElemType | Type of the element |
Definition | The exact definition of the element from the DTD |
As TElemDef is both a node and a list, there is a special TElemList, which has almost the same code as TNvpList.
TEntityDef holds the data of an <!ENTITY> definition. Depending on the type (General or Parameter Entity), the TEntityDef node is added to the Entities or ParEntities list.
Field | Description |
---|---|
Name | Name of the entity |
Value | The replacement text of the entity |
SystemId | For External Entities, this field contains the SYSTEM id |
PublicId | For External Entities, this field contains the PUBLIC id. This field may be empty. |
NotationName | For NDATA Unparsed External Entities, this field contains the Notation Name. |
TNotationDef holds the data of a <!NOTATION> definition:
Field | Description |
---|---|
Name | Name of the notation |
Value | SYSTEM id |
PublicId | PUBLIC id |
Converts all Whitespace characters (Space, Tab, Carriage Return, Linefeed) in the String to Space #x20 characters. If the PackWs parameter is true, contiguous whitespace characters will be packed to one space character.
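The behaviour described above can be illustrated with a small stand-alone sketch. It is a re-implementation written for this illustration only, not the library's own code, and the function name `SpacesOnly` is hypothetical.

```pascal
PROGRAM WhitespaceDemo;
{$APPTYPE CONSOLE}

// Hypothetical re-implementation for illustration; not the library's own code
FUNCTION SpacesOnly (CONST Source : AnsiString; PackWs : BOOLEAN) : AnsiString;
VAR
  i     : INTEGER;
  WasWs : BOOLEAN;   // TRUE if the previous character was whitespace
BEGIN
  Result := '';
  WasWs  := FALSE;
  FOR i := 1 TO Length (Source) DO
    IF Source [i] IN [#32, #9, #13, #10] THEN BEGIN
      // When packing, emit a space only for the first character of a run
      IF NOT (PackWs AND WasWs) THEN
        Result := Result + ' ';
      WasWs := TRUE;
    END
    ELSE BEGIN
      Result := Result + Source [i];
      WasWs  := FALSE;
    END;
END;

BEGIN
  WriteLn ('[', SpacesOnly ('a'#9#13#10'b', TRUE),  ']');   // prints [a b]
  WriteLn ('[', SpacesOnly ('a'#9#13#10'b', FALSE), ']');   // prints [a   b]
END.
```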
Works like System.SetString, with one exception: the third parameter denotes the position of the last character to transfer into the string, not the length.
Same as SysUtils.StrPas. In addition to the start of the string, the last character is also passed (Finish).
Trims all whitespace characters off the beginning and end of the Source string.
Converts the Windows 1252 ANSI Source string to a UTF-8 string.
Converts the given UTF-8 string to Windows ANSI. Unicode characters which don't fit into the Windows-1252 range are converted to the "UnknownChar" character, which defaults to a reverse question mark.