Attention: Project moved to GitHub!
libhpxml is a high performance XML stream parser library written in C with a simple API. It
is intended to parse
XML files very speed efficient. This may be required when processing
huge XML files
like the OSM
planet file which currently has 260GB+.
The development goals are speed
efficiency and
simple memory handling to reduce the
risk
of
memory leaks.
These objectives are achieved through
Being a stream parser, libhpxml returns in a loop one XML element after the other. It uses a result buffer which is initialized once and then reused for every element. Thus, repeated calls to malloc() and free() are omitted. The input data is read in blocks. The result buffer does not contain the data itself but just pointers to the XML elements within the input buffer. Thus, data is not copied, it is just pointed to.
libhpxml provides a set of functions and structures. hpx_ctrl_t is a control structure which contains all relevant information for a XML stream. The contents of the structure are used internally by the library and should not be modified in any way. The structure is initialized with a call to hpx_init() and must be freed again with hpx_free(). Note that hpx_free() will not close the file descriptor.
hpx_ctrl_t *hpx_init(int fd, int len); void hpx_free(hpx_ctrl_t *ctl);
The arguments to hpx_init() is a file descriptor to an open XML file and the length of the block buffer. It will initialize a hpx_ctrl_t structure and returns a pointer to it. In case of error NULL is returned and errno is set appropriately. The buffer size must be at least as large as the longest XML element in the file but it is recommended to be much larger. The larger the buffer the lesser the number of reads. If there is enough system memory available, it is safe to choose 100MB or even more.
Libhpxml now supports memory mapping through the system call mmap(). This is activated if hpx_init() is called with a negative len parameter. In case of memory mapping, len must be as long as the (negative value) of the total length of the XML input file. Memory mapping of files greater than 2 GB is currently just supported on 64 bit architectures (see manpage mmap(2) or POSIX manpage mmap(3), respectively).
The main application for memory mapping is if libhpxml is not just used as stream parser but
XML objects are kept in memory during the whole runtime. This is necessary if on-the-fly object
processing is not possible. This typically is the case if XML objects are nested or they depend
on each other. An example is the rendering process of OSM data.
Keeping pointers valid (see hpx_get_elem())
is still possible without memory mapping, but it requires that the buffer
is as large as the file itself because it has to pull in the whole file at once. Thus, this
works just if the system has enough memory. Memory mapping in contrast does not require physical memory,
hence, even a file with several hundred GB may be used.
Note that the preprocessor macro WITH_MMAP
must be defined at compile time to compile libhpxml with mmap()
support.
If it was not compiled with WITH_MMAP, hpx_init()
will fail, in which case NULL is returned
and errno is set to
EINVAL.
While parsing an XML file libhpxml returns pointers to the elements and attributes. C strings are usually '\0'-terminated but this is not applicable here because it would require that '\0' characters are inserted after each element, resulting in huge data movement. Thus, libhpxml uses "B strings" which are hold in the bstring_t structure. The structure contains a pointer to the string and its length. Additionally, a set of function is provided to handle those strings.
typedef struct bstring { int len; char *buf; } bstring_t;
After initializing the control structure, XML elements are subsequently retrieved by repeated calls to hpx_get_elem().
int hpx_get_elem(hpx_ctrl_t *ctl, bstring_t *b, int *in_tag, size_t *lno);
The function processes the buffer and fills out the bstring pointing to the next XML element. ctl is the pointer to control structure. in_tag is filled with either 0 or 1, either if the XML element is a tag (<...>) or if it is literal text between tags. lno is filled with the line number at which this element starts. Both, in_tag and lno may be NULL if it is not used. hpx_get_elem() returns the length of the bstring, 0 on EOF, and -1 in case of error. Such an element can now be parsed with a call to hpx_process_elem().
int hpx_process_elem(bstring_t b, hpx_tag_t *p); typedef struct hpx_tag { bstring_t tag; // name of tag int type; // type of tag int nattr; // number of attributes int mattr; // size of attr array hpx_attr_t attr[]; // array containing attributes } hpx_tag_t; typedef struct hpx_attr { bstring_t name; //! name of attribute bstring_t value; //! value of attribute char delim; //! delimiter character of attribute value } hpx_attr_t;
It takes a bstring which contains an XML element and parses it into
the hpx_tag_t structure.
This structure may be initialized using
hpx_tm_create() but it
may also be initialized manually. In the latter
case the structure member mattr must contain the size of the
attribute
array. Otherwise the program may segfault. The argument to hpx_tm_create() is the
maximum number of expected attributes. The tag structure should be
freed again with hpx_tm_free() after use. It is recommended
to
reuse the tag structure. This reduces unnecessary memory management
system calls.
Please note that a call to hpx_get_elem() may invalidate the
pointers within previously filled-out tag structures because it might read in the
next block of the input file. Thus, the tag must be processed
before the next call to hpx_get_elem().
The type member of hpx_tag_t
defines the type this XML element.
Currently, the following types are known.
enum | Description | Example |
HPX_ILL | Element unknown. This may indicate a syntax error. | |
HPX_OPEN | An XML opening tag. | <tagname attrname="attrval"...> |
HPX_SINGLE | A single, closed XML tag. | <tagname attrname="attrval".../> |
HPX_CLOSE | An XML closing tag. | </tagname> |
HPX_LITERAL | No tag, just text between tags. | |
HPX_ATT | Declarations. | <! ..... > |
HPX_INSTR | Instructions. | <? .... ?> |
HPX_COMMENT | Comments. | <!-- .... --> |
hpx_tag_t *hpx_tm_create(int n); void hpx_tm_free(hpx_tag_t *t);
The tag structure further contains an array of attributes. The member nattr contains the actual number of attributes parsed. It is always at most mattr elements. If an XML tag has more than mattr elements they are just ignored. At the current version there's no feedback to the calling function. This will be improved in future releases. The attributes themselves are stored each in an hpx_attr_t structure. It contains two bstrings, one for the name and one for the value of the attribute. The third member delim keeps the delimiter of the value which is either '\'' (single quote, 0x27) or '"' (double quote, 0x22).
This example parses an XML file and outputs some stats about each XML element. You can download the example directly here.