Using the XML parser

The Baby X resource compiler contains a "vanilla" XML parser.

What is XML

XML is a format for storing hierarchical data. It was originally based on a generalisation of HTML, using the same basic idea of nested tags.

Example XML file

    <bookstore>
      <book category="COOKING">
        <title lang="en">Everyday Italian</title>
        <author>Giada De Laurentiis</author>
        <year>2005</year>
        <price>30.00</price>
      </book>
      <book category="CHILDREN">
        <title lang="en">Harry Potter</title>
        <author>J K. Rowling</author>
        <year>2005</year>
        <price>29.99</price>
      </book>
      <book category="WEB">
        <title lang="en">Learning XML</title>
        <author>Erik T. Ray</author>
        <year>2003</year>
        <price>39.95</price>
      </book>
    </bookstore>

Each element has an open and a close tag, unlike HTML, where some tags such as <BR> are traditionally stand-alone. Data can be associated with an element in three ways: as attributes, such as the book "category" in the example; as embedded text between the open and close tags, as we see with the leaf elements; or as child elements.

With XML version 1.1 the format has been complicated considerably, and the vanilla parser does not support all of the complications. Some of them, like recursively defined elements, are highly undesirable, as they make it trivially easy to write malicious XML files which expand to a vast amount of data.

The XML file parser

The XML parser is a vanilla parser. It doesn't support everything. It just converts the XML into a simple tree representation.

    
        typedef struct xmlattribute
        {
        char *name;                /* attribute name */
        char *value;               /* attribute value (without quotes) */
        struct xmlattribute *next; /* next pointer in linked list */
        } XMLATTRIBUTE;
        
        typedef struct xmlnode
        {
        char *tag;                 /* tag to identify data type */
        XMLATTRIBUTE *attributes;  /* attributes */
        char *data;                /* data as ascii */
        int position;              /* position of the node within parent's
                                      data string */
        int lineno;                /* line number of node in document */
        struct xmlnode *next;      /* sibling node */
        struct xmlnode *child;     /* first child node */
        } XMLNODE;
        
        typedef struct
        {
        XMLNODE *root;             /* the root node */
        } XMLDOC;
        
        
        XMLDOC *loadxmldoc(const char *fname, char *errormessage, int Nerr);
        XMLDOC *floadxmldoc(FILE *fp, char *errormessage, int Nerr);
        XMLDOC *xmldocfromstring(const char *str, char *errormessage, int Nerr);
        void killxmldoc(XMLDOC *doc);
        
        XMLNODE *xml_getroot(XMLDOC *doc);
        const char *xml_gettag(XMLNODE *node);
        const char *xml_getdata(XMLNODE *node);
        const char *xml_getattribute(XMLNODE *node, const char *attr);
        int xml_Nchildren(XMLNODE *node);
        int xml_Nchildrenwithtag(XMLNODE *node, const char *tag);
        XMLNODE *xml_getchild(XMLNODE *node, const char *tag, int index);
        XMLNODE **xml_getdescendants(XMLNODE *node, const char *tag, int *N);

Technical documentation on the functions here.

You will often need to access these structures directly to process the XML tree efficiently. For instance, if you do not know beforehand which attributes are associated with an element, the only way to get that information is to walk the attribute list and examine the "name" members.
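
Walking the attribute list can be sketched as follows. This is only an illustration: the structures are repeated from the header so that the fragment compiles on its own, and count_attributes and print_attributes are hypothetical helper names, not part of the parser.

```c
#include <stdio.h>

/* Structures repeated from the parser header so that this sketch compiles
   on its own. In a real program, include the parser's header instead. */
typedef struct xmlattribute
{
    char *name;                /* attribute name */
    char *value;               /* attribute value (without quotes) */
    struct xmlattribute *next; /* next pointer in linked list */
} XMLATTRIBUTE;

typedef struct xmlnode
{
    char *tag;
    XMLATTRIBUTE *attributes;
    char *data;
    int position;
    int lineno;
    struct xmlnode *next;
    struct xmlnode *child;
} XMLNODE;

/* Hypothetical helper: count the attributes attached to a node. */
int count_attributes(const XMLNODE *node)
{
    const XMLATTRIBUTE *attr;
    int N = 0;

    for (attr = node->attributes; attr != NULL; attr = attr->next)
        N++;

    return N;
}

/* Hypothetical helper: write each attribute as name="value", one per line. */
void print_attributes(const XMLNODE *node, FILE *fp)
{
    const XMLATTRIBUTE *attr;

    for (attr = node->attributes; attr != NULL; attr = attr->next)
        fprintf(fp, "%s=\"%s\"\n", attr->name, attr->value);
}
```

A node with no attributes has a NULL "attributes" pointer, so both loops simply do nothing in that case.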

Whilst in reality the structure is a binary tree, it is treated as an n-ary tree because the pointers are labelled "child" and "next" rather than "childa" and "childb". So the "next" pointer is interpreted as a linked list of younger siblings. You therefore iterate over the "next" pointer and recurse down the "child" pointer.
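
The iterate-over-"next", recurse-down-"child" pattern might look like this. It is a minimal sketch: the structures are repeated from the header so that it compiles on its own, and count_nodes and print_outline are hypothetical helpers, not parser functions.

```c
#include <stdio.h>

/* Structures repeated from the parser header so that this sketch compiles
   on its own. In a real program, include the parser's header instead. */
typedef struct xmlattribute
{
    char *name;
    char *value;
    struct xmlattribute *next;
} XMLATTRIBUTE;

typedef struct xmlnode
{
    char *tag;
    XMLATTRIBUTE *attributes;
    char *data;
    int position;
    int lineno;
    struct xmlnode *next;      /* sibling node */
    struct xmlnode *child;     /* first child node */
} XMLNODE;

/* Hypothetical helper: count a node, its younger siblings, and all of
   their descendants. Passing NULL gives 0. */
int count_nodes(const XMLNODE *node)
{
    int N = 0;

    while (node != NULL)
    {
        N += 1 + count_nodes(node->child); /* recurse down "child" */
        node = node->next;                 /* iterate over "next" */
    }

    return N;
}

/* Hypothetical helper: print the tags as an outline, indented two
   spaces per level of nesting. */
void print_outline(const XMLNODE *node, int depth, FILE *fp)
{
    int i;

    while (node != NULL)
    {
        for (i = 0; i < depth; i++)
            fputs("  ", fp);
        fprintf(fp, "%s\n", node->tag);
        print_outline(node->child, depth + 1, fp);
        node = node->next;
    }
}
```

Because the loop follows "next", calling either function on a node that has younger siblings also visits those siblings. The root node has none, so passing the root covers exactly the whole document.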

Example code

    

    
    #include <string.h>   /* for strcmp */

    /*
      test if two nodes have the same structure. Do the attributes and
      the hierarchy under them match, and are the element tag names the same?
      Params:
         nodea - the first node
         nodeb - the second node
         useattributes - if set, check that the attribute names match,
                         in the same order
         usechildren - if set, check that the children match
      Returns 1 if the nodes have the same structure, else 0
     */
    int nodeshavesamestructure(XMLNODE *nodea, XMLNODE *nodeb,
                                int useattributes, int usechildren)
    {
       XMLATTRIBUTE *attra, *attrb;
       XMLNODE *childa, *childb;

       if (strcmp(nodea->tag, nodeb->tag))
          return 0;

       if (useattributes)
       {
          attra = nodea->attributes;
          attrb = nodeb->attributes;
          while (attra && attrb)
          {
             if (strcmp(attra->name, attrb->name))
               return 0;
             attra = attra->next;
             attrb = attrb->next;
          }
          if (attra != NULL || attrb != NULL)
             return 0;
       }
       if (usechildren)
       {
          childa = nodea->child;
          childb = nodeb->child;
          while (childa && childb)
          {
             if (!nodeshavesamestructure(childa, childb, useattributes,
                                         usechildren))
               return 0;
             childa = childa->next;
             childb = childb->next;
          }
          if (childa != NULL || childb != NULL)
             return 0;
       }
       
       return 1;
    }

The structure is a simple tree which you can manipulate easily and cleanly.
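
As an illustration of that kind of manipulation, here is a minimal sketch that reverses the order of a node's children by relinking the "next" pointers. The structures are repeated from the header so that it compiles on its own, reverse_children is a hypothetical helper rather than a parser function, and the sketch assumes that nothing later relies on the "position" members, which it leaves unchanged.

```c
#include <stddef.h>

/* Structures repeated from the parser header so that this sketch compiles
   on its own. In a real program, include the parser's header instead. */
typedef struct xmlattribute
{
    char *name;
    char *value;
    struct xmlattribute *next;
} XMLATTRIBUTE;

typedef struct xmlnode
{
    char *tag;
    XMLATTRIBUTE *attributes;
    char *data;
    int position;
    int lineno;
    struct xmlnode *next;
    struct xmlnode *child;
} XMLNODE;

/* Hypothetical helper: reverse the order of a node's children in place.
   No memory is allocated or freed; only the sibling links change.
   Assumption: the "position" members are not updated. */
void reverse_children(XMLNODE *node)
{
    XMLNODE *prev = NULL;
    XMLNODE *cur = node->child;
    XMLNODE *next;

    while (cur != NULL)
    {
        next = cur->next;   /* remember the rest of the list */
        cur->next = prev;   /* point this child back at the previous one */
        prev = cur;
        cur = next;
    }

    node->child = prev;     /* the old last child is now the first */
}
```

Since the nodes themselves are only relinked, the document can still be passed to killxmldoc afterwards in the usual way.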