Selecting an XML-Database for reBiND


This article describes how an XML database was selected for the reBiND project of the Botanic Garden and Botanical Museum Berlin-Dahlem. It introduces the general idea of XML databases, the different underlying concepts, and the features and technologies that are commonly available for such databases. It then analyses which features the reBiND project needs, presents a detailed feature comparison matrix of 15 available XML databases, and states the final decision on which database will be used for the reBiND project.

Disclaimer:
Though the information contained in this document was thoroughly researched, there is no guarantee of its correctness/completeness. Also this document reflects the state of September 2011 and it WILL NOT BE UPDATED, as any newer information is irrelevant for the decision about the XML database for the reBiND project.




Introduction

The History of XML Databases

In the late nineties, the first databases appeared that were specifically designed to store XML data. In the following years, many new XML database projects (academic, commercial and open source) formed and started a real XML database hype. Many scientific papers were published during that time, along with many other articles and pieces of documentation, most notably the XML database lists by Ronald Bourret, who tried to maintain an almost complete list of all XML databases and related products.

However, around 2005 this hype came to an end, following the classic Hype Cycle curve. Many of the academic and open source projects had their last activity around that time. Companies providing XML databases went out of business or discontinued these products. This phase can be seen as a consolidation of the XML database market, and the projects that survived it came out stronger.

There are still many companies selling XML databases, and some of the open source products survived as well. All of these databases have been developed further in recent years. In the past two years, the W3C released several new XML standards or new versions of existing standards, such as XQuery 1.0, XPath 2.0, XSLT 2.0, XQuery Update Facility 1.0, XQuery and XPath Full Text 1.0, and several others.

Unfortunately, many of the available articles and much of the documentation about XML databases still reflect the state of around 2005, which makes it hard to find up-to-date comparisons of different databases. This is one of the reasons why this document was created.


XML Enabled Databases vs Native XML Databases (NXDs)

There are generally two types of XML databases. The first type is the XML enabled database. Internally, an XML enabled database stores documents in tables with rows and columns, just like a relational database. To generate these tables, the schema of the XML file is needed, and once the tables are created, only documents that are valid against that schema can be stored. For other XML documents, a new set of tables has to be created based on their schema.
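The following minimal sketch illustrates why an XML enabled database is tied to a schema. It is not how any particular product works: the table layout, the `shred` helper and the element names (`dataset`, `unit`, `name`, `date`) are hypothetical and chosen only for illustration, and an in-memory SQLite database stands in for the relational backend.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
# The table layout is derived from one (hypothetical) schema:
# every <unit> has an id attribute, a <name> and a <date>.
conn.execute(
    "CREATE TABLE unit (id TEXT PRIMARY KEY, name TEXT NOT NULL, date TEXT NOT NULL)"
)

def shred(xml_text: str) -> None:
    """Split a document into rows of the fixed table layout."""
    root = ET.fromstring(xml_text)
    for unit in root.iter("unit"):
        conn.execute(
            "INSERT INTO unit VALUES (?, ?, ?)",
            (unit.get("id"), unit.findtext("name"), unit.findtext("date")),
        )

# A document that matches the layout can be stored.
shred('<dataset><unit id="1"><name>Abies alba</name><date>1999-12-31</date></unit></dataset>')
rows = conn.execute("SELECT id, name, date FROM unit").fetchall()

# A document that does not match the layout (no <date>) is rejected.
try:
    shred('<dataset><unit id="2"><name>Picea abies</name></unit></dataset>')
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

In this sketch the second document is well formed XML, but it cannot be stored because the table layout has no way to represent it.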

"When your only tool is a relational database, everything looks like a table."
   — Elliotte Rusty Harold, Managing XML data: Native XML databases

The alternative is the so-called native XML database. A native XML database (NXD) is characterized by an internal storage mechanism that was specifically designed for XML data. This makes most NXDs schema agnostic, which means that the database does not need to know the schema of a document in order to store it. This also means that invalid XML files can be stored. The XML to be stored needs to be well formed, however. Indexing the data for searching and querying works better than with XML enabled databases, since NXDs are aware of the structure of the XML and are optimized for it. NXDs usually have a lot of additional functionality built in, such as transformation of XML files, the possibility to update parts of an XML document while it is stored, or APIs for programmatic access to the data.
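The difference between "well formed" and "valid" can be sketched with a small check. This example only tests well-formedness, which is the minimum a schema agnostic database requires; it does not validate against any schema. The function name and the sample records are hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML, regardless of any schema."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Well formed, but it would be invalid against a schema that expects
# an ISO 8601 date in <date>. A schema agnostic NXD could still store it.
well_formed_but_invalid = "<unit><date>31.12.1999</date></unit>"

# Not well formed: the <date> element is never closed.
broken = "<unit><date>31.12.1999</unit>"
```

Under these assumptions, `is_well_formed(well_formed_but_invalid)` is true while `is_well_formed(broken)` is false, so only the first record could be loaded.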

Though most NXDs are databases specifically designed to store XML, relational databases with a special storage type for XML (such as Microsoft SQL Server) are usually also considered NXDs, though this position is not universally accepted.

For the following comparison only Native XML Databases were considered.


Analysis

General Requirements and Use Case for reBiND

In order to select an XML database suitable for reBiND it is necessary to take a look at the use case of the database within the project and extract some general requirements.

ReBiND aims to save threatened data from the field of biodiversity science from legacy databases or other outdated sources and make them available to other researchers. This is done by first converting the existing data into the XML based ABCD format (Access to Biological Collection Data). The resulting XML documents will most likely be invalid, as some of the original data might not be formatted as expected by ABCD, e.g. the dates could be formatted differently. So the next step is to correct these files and make them valid. This can partially be done automatically (e.g. by detecting different common date formats and converting them into the expected format). Other corrections, like missing fields in the taxonomy, might require human intervention. The documents will then be stored in the XML database. Since the human corrections might take some time, it would be useful to also be able to save documents which are not yet valid and will be corrected soon.
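The automatic date correction mentioned above could look roughly like the following sketch. The list of input formats and the ISO 8601 target format are assumptions made for illustration; the real legacy data may use other formats, and ambiguous inputs such as day/month versus month/day order would need a project-specific rule.

```python
from datetime import datetime
from typing import Optional

# Assumed legacy formats, tried in this order.
KNOWN_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%d/%m/%Y")

def normalize_date(raw: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    # Unknown format: leave the value for human correction.
    return None
```

Values for which `normalize_date` returns `None` are the cases that would be left for manual correction.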

The database will then be connected to biodiversity networks like BioCASe and GBIF, so other researchers can search and access the stored data. An additional web interface would also allow the stored data sets to be queried directly.

Based on this general use case, the following characteristics and requirements were extracted:


With these requirements, a more detailed search was possible to find common features and technologies of XML databases and to evaluate whether they could be useful for the reBiND project.


Existing Features And Technologies For XML Databases

The features in this table are partially based on the chapter Features of Native XML Databases from the document XML and Databases by Ronald Bourret. The text is outdated, but provides a good overview of some of the issues that might become important for us.

Feature | Note | Relevant for reBiND | Explanation
Architecture | Text-Based (better for returning the original document) vs Model-Based (better for a lot of querying) | Model-Based | Querying the documents and retrieving only parts of them will probably be more common than returning entire documents.
Collections | The possibility of the NXD to combine several stored documents into a collection | Maybe | Not really needed, but could be useful, for example to combine all documents from one scientist or from one project.
Querying | Query the stored documents to find relevant parts of them, using query languages like XPath or XQuery | Needed | No question about it.
Schema Agnosticity | An XML database is schema agnostic when it does not need to know the schema of a document in order to store it. This also allows invalid XML files to be stored, as long as they are well formed. | Needed | Since some of the ABCD files are invalid after they have been automatically generated, it is useful to still be able to store them and correct them afterwards (especially since some of the corrections cannot be done automatically).
Validation of Schemas | The possibility to validate an XML document against its schema. For this, the schema either has to be linked in the document or the schema file has to be registered with the database. Different schemas can be supported. There are two types of validation: implicit validation (done when the document is stored; only valid documents are allowed) and explicit validation (done on request after the document has been stored; errors are shown, but the document is not removed from the database if validation fails). | Yes | In order to be able to correct the documents after they are stored, explicit validation is needed. The XML files can have more than one schema (e.g. ABCD with additional data in the BioCASe schema), so the database must be able to handle such documents. Most of the schemas will be XML Schema (the W3C recommendation), but it would be nice to have additional support for DTD as well.
Updating/Deleting | The possibility to change the stored documents using special languages like XQuery Update or XUpdate. | Maybe | Usually the documents will not change once they are in the database. The only exception could be if we store invalid files and correct them later on. But the question remains whether these corrections will be done using such update mechanisms or by retrieving the document from the database, modifying it with external programs and then storing it again in the database.
Transactions/Rollbacks | Combine changes into one transaction and only save them when all were executed correctly, just like for RDBMS | No | See above.
Locking | The ability to protect a document or a node from editing while someone edits it | No | Even if we were to use the update mechanism for correcting the documents, no locking is needed, since most likely only one person will be editing a document.
APIs | The ability to access the data through an interface for other programs. This could be a language dependent interface, like the XML:DB API or XQJ (both for Java), or a language independent API via HTTP (like REST). | Yes | Would be useful to run query requests and get the data independently of the data presentation on the web platform.
Round Tripping | The ability to load an XML document into the database and then export the content of the database to a new document which is identical to the first one. For data centric documents this feature is not important. In extreme cases things like comments, CDATA elements or element order will be lost. There are also several ways in which it can be supported. Character safe round tripping means that exactly the same document will be returned (i.e. input and output file have the same hash). For certain legal or medical documents this is needed to store the exact copy of a record, as required by law. Character safe round tripping is usually ensured by using a text based architecture. For model based architectures small changes, like the removal of duplicate white space, could occur. | Yes | But it does not need to be character safe.
Remote Data | The ability to load additional data from other sources when creating the XML documents | No |
Indexing | There are generally three basic kinds of indexes: Value Indexes (for the content of elements and attributes), Structural Indexes (for the names of elements and attributes) and Full Text Indexes (for full text search independently of elements) | Yes | Structural and value indexes are definitely needed; full text indexes would be nice.
External Storage | The ability to reference external documents | Yes | Could be useful for linking images and other non-XML files.
Normalization | Originating from RDBMS, the ability to store information without redundancy and to save storage. It also ensures that no data inconsistencies occur when parts of the data are changed. | No | Our data should not contain so much redundant information that storage capacity becomes an issue. And since our data is clearly focused on the individual documents, which are independent of each other, the problem of data inconsistencies does not apply to us.
References | The way the database handles external pointers (between documents within the server and to resources outside of the database) and internal pointers (within one document), and whether it is able to enforce their integrity. For external references within the database, an example would be that a document cannot be deleted if a pointer from another document points to it. For internal references this means that a node that has a pointer to it cannot be deleted. | No | Though internal references and references to other documents in the database (and their nodes) would be useful, this does not apply here, as editing references or deleting nodes/documents is not intended for our database and will probably not be needed in order to make the documents valid.
Scalability | The ability to run the database in a cluster and let this cluster grow as the demand grows. | No | We will most likely not have so much traffic that we need additional servers.
Version Control | Provide the previous versions of a document | Maybe | The individual changes made in order to make the documents valid do not have to be stored. Having the original file and the valid version stored as two different files would do just fine. The same goes for other (more hypothetical) cases where a new version of a document is needed (e.g. if a scientist discovers the presumably missing second half of a collection which is already in the database). However, if a database provides such functionality, there is no reason not to use it for these cases.
Live Editing | The possibility to edit a document and save it directly to the database | No | If the documents are corrected after loading them into the database, this will most likely be done either by retrieving the document, correcting it externally and storing it again, or by using XQuery Update or something similar, so live editing is not needed.
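The distinction between character safe round tripping and ordinary round tripping described in the table can be sketched as follows. This is only an illustration using Python's standard library parser as a stand-in for a model based storage engine; the sample document and the `sha256` helper are hypothetical, and a real NXD may preserve more or less than this parser does.

```python
import hashlib
import xml.etree.ElementTree as ET

def sha256(text: str) -> str:
    """Hash the serialized document to compare input and output byte for byte."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Hypothetical input with a doubled space inside the tag and a comment.
original = '<unit  id="1"><!-- checked --><name>Abies alba</name></unit>'

# Model based handling: parse into a tree, then serialize the tree again.
tree = ET.fromstring(original)
exported = ET.tostring(tree, encoding="unicode")

# Character safe round tripping would require identical hashes.
character_safe = sha256(original) == sha256(exported)

# The data itself survives even though the text differs.
same_data = tree.findtext("name") == ET.fromstring(exported).findtext("name")
```

Here the exported text differs from the input (the default parser drops the comment and the serializer normalizes the spacing inside the tag), so the round trip is not character safe, while the element content is unchanged. For reBiND, only the latter property is required.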
Technology | Note | Useful for reBiND | Explanation
Querying
XPath 1.0 WP | Addressing nodes in the document (W3C Recommendation 16 November 1999) | No | Not needed if XPath 2.0 is supported.
XPath 2.0 WP | XPath 2.0 was developed as part of XQuery 1.0, in close coordination with XSLT 2.0. It is only partially backward compatible with XPath 1.0. (Second Edition: W3C Recommendation 14 December 2010) | Yes | Backward compatibility should not be an issue for us if we start with 2.0.
XQuery 1.0 WP | "What SQL is to relational databases, XQuery is to native XML databases." [7]. It is the most common way to query XML. The specification also includes XPath 2.0 and was developed in close coordination with XSLT 2.0. (Second Edition W3C Recommendation 14 December 2010) | Yes |
XQuery 3.0 WP | Formerly planned as XQuery 1.1. This is currently only a working draft and BaseX is the only NXD that has already implemented it, probably because project founder Christian Grün is a member of the W3C XML Query Working Group. (W3C Working Draft 14 June 2011) | No | 1.0 will do just fine. Judging by the success of XQuery 1.0, it can be assumed that all still active projects will eventually implement version 3.0 once it has become a recommendation.
XQueryX 1.0 WP | XQuery requests in XML. Not intended for human readability, but for automated processing. Can also be automatically generated from a regular XQuery. (Second Edition: W3C Recommendation 14 December 2010) | No | Could be useful if we want to store queries to the database in the database itself.
XQuery and XPath Full Text WP | Allows full text search via XQuery. (W3C Recommendation 17 March 2011) | Yes | Would be useful for full text search, but it is unlikely that it is already implemented. An NXD with its own full text search would do just fine.
SPARQL WP Part1: Query Language, Part2: Protocol, 3: Query Results XML Format | Query language for RDF data. (Version 1.0: W3C Recommendation 15 January 2008, Version 1.1 as Working Draft) | No | We don't use RDF data.
Updating
XQuery Update WP | Officially called XQuery Update Facility. The possibility to change XML documents via XQuery. This includes updating, deleting and moving elements and attributes. (W3C Recommendation 17 March 2011) | Maybe | Could become useful if we want to correct the data after it has been published to the database.
XUpdate WP | Not to be confused with XQuery Update Facility. XUpdate was developed by the XML:DB Working Group, which wanted to develop a way to modify data in XML documents. This was before XQuery Update was even in development. The group has been inactive since 2003, but the specification is still implemented in several NXDs that have been around since the early days. (Working Draft September 14, 2000) | No | XQuery Update is the better choice here.
APIs
XQJ WP | XQuery API for Java | Maybe | Could be useful for other applications that build upon the data we provide.
XML:DB API | Programming language neutral API for vendor neutral access to NXDs. Another project of the inactive XML:DB Working Group. | No | If we need something like that, XQJ seems to be the better choice.
REST-ful API WP | Accessing (and modifying) XML data via HTTP by addressing it with URIs. Considered to be simpler than SOAP. | Maybe | This could allow other applications to call the data via regular HTTP.
SOAP WP Part 0: Primer, Part1: Messaging Framework, Part 2: Adjuncts | Exchanging structured data (XML) over HTTP. The payload is put in a SOAP envelope. (Second Edition: W3C Recommendation 27 April 2007) | No | A REST-ful API would be the better choice.
XML-RPC WP | Remote procedure calls via XML. Written by Dave Winer. Specification released on 15 June 1999. | No |
Schemas
DTD WP | Common but limited schema language for XML files | Maybe | We currently have no XML data with DTD files, but it could become useful if we have data that is already in its own XML format with its own DTD.
XML Schema WP Part 0: Primer Part 1: Structures Part 2: Datatypes | Schema language with a lot more features than DTD. An NXD that supports schemas can validate the data. However, some NXDs only do this optionally, also allowing non-valid XML to be loaded. (Version 1.0 Second Edition W3C Recommendation 28 October 2004, Version 1.1 W3C Candidate Recommendation 21 July 2011) | Yes | ABCD and BioCASe use XML Schema. Useful for checking whether the documents are valid. However, we need an NXD that does not enforce validation, which would allow us to load invalid files into the database and fix them later on. Support for XML Schema is a must-have if the database requires a schema in order to handle the documents. If the database is schema agnostic, XML Schema support for optional validation is a should-have.
RELAX NG WP | Another schema language. Better than DTD, different from XML Schema. See also: Wikipedia: XML Schema Language comparison | No |
Access
XACML WP | Markup for access control. (XACML 2.0 Specification 1 February 2005) | No |
WebDAV WP | Web-based Distributed Authoring and Versioning with features like locking, namespace management and properties. (RFC 4918, June 2007) | No |
Processing
XSLT 1.0 WP | Transforming XML with stylesheets into other documents. (W3C Recommendation 16 November 1999) | No | Not needed if version 2.0 is supported, but most likely both will be.
XSLT 2.0 WP | Was developed in close coordination with XQuery 1.0. See also: Wikipedia: XSLT elements (W3C Recommendation 23 January 2007) | Yes | For transforming the output to HTML or other output formats.
XProc WP | Defining XML pipelines. (W3C Recommendation 11 May 2010) | Maybe | Could be useful for building automatic correction systems in XML as well (for example using XQuery Update), which are run when the content is imported or afterwards.
XInclude WP | Linking to external XML or plain text content, which is included in the document when it is rendered. (Second Edition W3C Recommendation 15 November 2006) | No |
XQuery Scripting Extension WP | "adding imperative (procedural) features such as variable assignment and explicit sequencing to XQuery". (W3C Working Draft 8 April 2010) | Maybe | Could be useful if we want to do the error corrections in XML as well, for example by using XML pipelines and XQuery Update.
Linking
XPointer WP | Addressing components of XML, similar to XPath, but can also address entire specific sections (W3C Recommendation 25 March 2003) | No |
XLink WP | Linking within an XML document, to other XML documents and to non-XML files (Version 1.1 W3C Recommendation 06 May 2010) | Yes | Could be used to link to images and other non-XML files. Might also be practical for referencing other related documents from the database.
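To give an impression of what addressing parts of a document with XPath-style expressions looks like, here is a small sketch. It uses the limited XPath subset built into Python's `xml.etree.ElementTree`, not a full XPath 2.0 or XQuery engine as provided by an NXD, and the sample data set with its element names is hypothetical (it is not ABCD).

```python
import xml.etree.ElementTree as ET

# Hypothetical data set; element and attribute names are for illustration only.
doc = ET.fromstring("""
<dataset>
  <unit id="B-1001"><name>Abies alba</name><country>DE</country></unit>
  <unit id="B-1002"><name>Picea abies</name><country>AT</country></unit>
  <unit id="B-1003"><name>Abies grandis</name><country>DE</country></unit>
</dataset>
""")

# Structural lookup: every <name> element anywhere below the root.
all_names = [e.text for e in doc.findall(".//name")]

# Value lookup: ids of units whose <country> child has the text "DE".
german_ids = [u.get("id") for u in doc.findall(".//unit[country='DE']")]
```

The first expression is the kind of lookup a structural index would speed up in a database, the second is the kind a value index would speed up.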


Specific Requirements

Based on the detailed analysis of the existing technologies and features for NXDs and the general requirements mentioned earlier, this detailed list of desired features was created. It also shows the importance of a specific feature for us.

Category | Features | Priority
Storage | store XML Documents | must
Retrieval | Export full XML documents (with Round Tripping) | must
Retrieval | Querying and Addressing sections and elements via XQuery/XPath | must
Retrieval | Indexing for fast Querying and Addressing | must
Retrieval | Transforming output via XSLT | must
Retrieval | Full Text Search | should
Retrieval | API for Programmable Access (XQJ/REST) | should
Error Correction | Schema agnostic database (import invalid documents) | must
Error Correction | Schema Validation for XML Schema (and DTD) | should
Error Correction | Updating of documents | should
Error Correction | Versioning of documents | could
Error Correction | XProc for programming repair procedures | could
Error Correction | XQueryX for storing repair procedures in the database itself | could
External Storage | XLink for linking to external documents | should
External Storage | Indexing of external documents | could
Soft Requirements | store many documents |
Soft Requirements | store large documents |
Soft Requirements | good reading speed |
Soft Requirements | writing speed not important |
Soft Requirements | low traffic |


Feature Comparison

Column groups (left to right): About; Must Have; Should Have; Nice To Have; Can't Hurt To Have

Database | developed by | License | Version | XPath | XQuery | XSLT | Schema Agnostic | XML Schema | Full Text | XQuery Update | REST | XQJ | XLink | DTD | XProc | Scripting | XQueryX | XUpdate | XML:DB API | RELAX NG | XPointer | XInclude | Links | Other
Open Source
BaseX | University of Konstanz | BSD | 6.7.1 (2011-07-28) | 2.0 | 3.0 | 2.0[1] | Yes | No[2] | Yes | Yes | Yes | Yes | No[3] | No[2] | No | No | No | No | Yes | No[2] | No | No | WP [8] | written in Java
eXist | eXist | LGPL | 1.4.1 (2011-08-16) | 2.0 | 1.0 | 2.0[4] | Yes | Yes[5] | Yes | Yes | Yes | Yes and Similar[6] | Similar[7] | Yes[5] | Yes[8] | No | No | Yes | Yes | Yes[5] | Partially | Yes | WP, Fact Sheet | written in Java 6
Sedna | Russian Academy of Sciences[9] | Apache 2.0 | 3.4.66 (2010-09-30) | 2.0 | 1.0 | No | Yes | No[10] | Yes | Planned[11] | No | Yes | No | No[10] | No | No | No | Yes | Yes | No[10] | No | Partially[12] | WP | written in C/C++
Virtuoso Open | OpenLink Software | GPL 2 | 6.1.13 (2011-03-30) | 2.0 | Incomplete[13] | 1.1 | Yes | No[14] | Yes | No | Yes | Similar[15] | No | No[14] | No | No | No | No | Similar[15] | No[14] | No | No | WP, Commercial | hybrid server
Free of Charge
Qizx Free | PIXware | License | 4.2 (2011-06-06) | 2.0 | 1.1[16] | 2.0 | Yes | No[17] | Yes | Yes | Yes | Similar and Planned[18] | No | No[17] | No | Partially | No | No | No | No[17] | No | No | Features, Comparison | size limitations[19]
DB2 Express-C | IBM | commercial (no charge) | 9.7 (2009-06) | 2.0 | 1.0 | 2.0? | Yes | Yes[20] | Similar | Yes | No | No | No | Partially | No | No | No | No | No | No | No | No | WP:DB2, WP:pureXML | No support or update[21]
Commercial
Documentum xDB | EMC² | commercial[22] | 10.1. (2011-5) | 2.0 | 1.0[23] | Yes | Yes | Yes[24] | Partially[25] | Yes | No | Similar[26] | Yes | Yes[24] | Yes[27] | Similar[28] | No | No | Similar[26] | No | Yes | Yes[27] | Docu | written in Java
MarkLogic Server | MarkLogic Server | commercial | 4.2 (2010-10-19) | 2.0 | 1.0 | 2.0 | Yes | Yes[29] | Yes[30] | Maybe[31] | Yes | Maybe[32] | No | No | No[33] | No | No | Maybe[31] | Maybe[34] | No | Yes | Yes | WP, Fact Sheet | written in C++
Oracle XML DB | Oracle | commercial | 11g Release 2 (2010-09) | 2.0 | 1.0 | 2.0 | Yes | Yes | Similar[35] | No | No | Yes | Yes | Yes | No | No | No | No | No | No | No | Yes | Tech Overview |
TigerLogic XDMS | TigerLogic | commercial | 3.0 (2007-06-19)[36] | 2.0 | 1.0 | 2.0 | Yes | Yes[37] | Yes | Similar[38] | No | Yes | No | Yes[37] | No | No | No | Similar[38] | No | Maybe[37] | No | No | Technical Overview | Support Forum quite dead
XQuantum XML DB | Cognetic Systems | commercial | 1.5 (2008-09) | 2.0 | 1.0 | Yes | Yes | Yes[39] | Yes | No | No | Similar[40] | No | No | No | No | No | No | Similar[40] | No | No | No | Brochure | site not changed since Sept 09
Tamino | Software AG | commercial | 8.2 (2011-3)[41] | 2.0 | 1.0 | Maybe[42] | Yes | Yes | Yes | Maybe[43] | No | Maybe[44] | No | Yes | No | No | No | Maybe[43] | Maybe[45] | No | No | No | Fact Sheet | Support Forum quite dead
XMS | Xpriori | commercial | 3.2.2.47 (~2007) | 2.0 | 1.0 | Yes | Yes | No[46] | Yes | Maybe[47] | No | Similar[48] | No | No[46] | No | No | No | Similar[47] | Similar[48] | No[46] | No | No | Features | fka. myXMLdb. Seems dead
MS SQL Server | Microsoft | commercial | 2008 R2 (2010) | 2.0? | 1.0 | Yes | Yes | Yes[49] | Similar[50] | Similar[51] | No | No | No | No | No | No | No | Similar[50] | No | No | No | No | XML in SQL Server | XML as special column type
TEXTML | IXIA Soft | commercial | 4.1 (2010-04-27) | Yes | Similar[52] | No | Yes | Yes[53] | Yes | No[54] | No | Similar[55] | No | Yes[53] | No | No | No | No[54] | Similar[55] | No | No | No | Data Sheet |

Notes

  1. BaseX XSLT: 1.0 has native Support, 2.0 is supported via Saxon
  2. BaseX: Schema validation is not supported: https://mailman.uni-konstanz.de/pipermail/basex-talk/2010-July/000482.html
  3. BaseX XLink: not supported, but EXPath File is supported instead
  4. eXist XSLT: 1.0 based on Apache Xalan, 2.0 is optional via Saxon
  5. eXist: Schema validation is possible, either as implicit validation (only valid documents are allowed) or as explicit validation (validation is done on request and errors are shown, but the document is not removed from the database if validation fails). XML Schema, DTD, RelaxNG and other schemas are supported by the various validation engines. See: http://www.exist-db.org/validation.html
  6. eXist XQJ: http://www.xqjapi.com/exist/. Alternatively eXist also offers the Fluent API http://fluent.exist-db.org/
  7. eXist XLink: Image Module allows images in the database, including metadata
  8. eXist XProc: via an Extension
  9. Sedna Developed by: The Institute for System Programming of the Russian Academy of Sciences
  10. Sedna: Schema validation is not supported: http://www.mail-archive.com/sedna-discussion@lists.sourceforge.net/msg00555.html
  11. Sedna XQuery Update: planned, but no release date set http://www.mail-archive.com/sedna-discussion@lists.sourceforge.net/msg00750.html. Sedna currently implements its own update language (http://www.sedna.org/progguide/ProgGuidesu6.html#x12-440002.3 ), based on the Diploma Thesis by Patrick Lehti. “Design and Implementation of a Data Manipulation Processor for a XML Query Language” (http://www.lehti.de/beruf/diplomarbeit.pdf 2001)
  12. Sedna XInclude: http://www.mail-archive.com/sedna-discussion@lists.sourceforge.net/msg00968.html
  13. Virtuoso XQuery: implementation of XQuery 1.0 is allegedly incomplete. This was suggested by a comment in the corresponding Wikipedia article before Version 6.0 was released (preview version was already available), but no entries in the Change Log or the Feature Comparison suggest that these limitations have been fixed.
  14. Virtuoso: Schema validation is not supported.
  15. Virtuoso API: Proprietary APIs for Java and other programming languages exist
  16. Qizx XQuery: XQuery 1.1 became XQuery 3.0, but the current implementation does not reflect these changes, as it was based on a Working Draft: http://www.w3.org/TR/2009/WD-xquery-11-20091215/
  17. Qizx: Schema validation is not supported: http://www.xmlmind.com/qizx/features.html (section "XML Standards")
  18. Qizx XQJ: A similar JavaAPI is implemented and "Support for XQJ is planned but not with high priority.": http://www.xmlmind.com/qizx/features.html
  19. Qizx Limitations: The Free version comes with a size limitation of approximately 1000 megabytes of source XML.
  20. DB2 Schema: optional validation of one or more schemas is possible
  21. DB2 Limitation: No support or update for the free version
  22. Documentum xDB Price: ~ $1X,XXX [1][2]
  23. Documentum xDB XQuery: XQuery 3.0 is being implemented: https://community.emc.com/thread/120215
  24. Documentum xDB: validation of XML Schema and DTD is possible
  25. Documentum xDB Full Text: http://developer.emc.com/docs/documentum/xdb/manual/#doc:topic/xquery_full_text.html and it also provides its own full text search
  26. Documentum xDB has its own Java API
  27. Documentum xDB XProc: can be added via the XProc Engine Plugin: https://community.emc.com/docs/DOC-4242 , which also handles XInclude
  28. Documentum xDB Scripting: has a command line client which allows for scripting via a proprietary scripting language: http://developer.emc.com/docs/documentum/xdb/manual/#doc:topic/command_line_client.html#CommandLineClient-0308B7DE
  29. MarkLogic Server: validation of XML Schema is possible
  30. MarkLogic Full Text: Full text search is supported, though it is not clear whether this is done via the XQuery Full Text Extension
  31. MarkLogic Update: the possibility of updating exists, but it is not specified in the marketing material which technology is implemented.
  32. MarkLogic XQJ: an API for Java exists, but it is not specified in the marketing material which one.
  33. MarkLogic XProc: "MarkLogic will likely support it in a near future revision (Norman Walsh, now of MarkLogic, is the editor for the specification)" http://broadcast.oreilly.com/2009/03/xproc-xml-pipelines-and-restfu.html (March 2009)
  34. MarkLogic XML:DB API: an API for Java exists, but it is not specified in the marketing material which one.
  35. Oracle Full Text: see http://download.oracle.com/docs/cd/B28359_01/appdev.111/b28369/xdb09sea.htm
  36. TigerLogic Version: the date comes from the last published update: [3]
  37. TigerLogic XDMS: validation of some schemas (XML Schema, DTD, maybe also others) is possible
  38. TigerLogic Update: Updating is possible, but the homepage does not specify which technology is used. It only says "Document Create, Read, Update and Delete" or "Node level updates and node-level locking" in the feature list. It might be a proprietary format.
  39. XQuantum: Validation of XML Schema is possible
  40. XQuantum API: Proprietary APIs for Java and other programming languages exist
  41. Tamino Version: on the Tamino homepage it still says 8.0 (release in 2009) is the latest
  42. Tamino XSLT: There is a demo of an extension for Tamino 3.1, but it is not mentioned in the marketing material; it could have been dropped or included
  43. Tamino Update: the possibility of updating exists, but it is not specified in the marketing material which technology is implemented.
  44. Tamino XQJ: an API for Java exists, but it is not specified in the marketing material which one.
  45. Tamino XML:DB API: an API for Java exists, but it is not specified in the marketing material which one.
  46. XMS: Schema validation is not supported
  47. XMS Update: Some way of updating is possible, though it is not specified how this can be done: http://www.xpriori.com/products/xms/developers
  48. XMS API: Proprietary APIs for Java and other programming languages exist
  49. Microsoft SQL Server: validation of XML Schema is possible. However, if schema validation is enabled, only valid files can be loaded into the database. Only one schema per file is supported
  50. Microsoft SQL Server Full Text: Full text search is supported, but not via the XQuery Full Text Extension
  51. Microsoft SQL Server Update: Updating data is possible via the proprietary XML Data Modification Language (XML DML): http://msdn.microsoft.com/en-us/library/ms177454.aspx
  52. TEXTML XQuery: a proprietary TEXTML Query language is used
  53. TEXTML: validation of some schemas is possible
  54. TEXTML Update: Apparently it is possible to update data via the API
  55. TEXTML API: Proprietary APIs for Java and other programming languages exist


Detailed Comparison

After a comparison of the different XML databases based on the feature matrix above, three databases were selected to be looked at in more detail in order to make the final decision. The three final candidates were eXist-db, IBM's DB2 Express-C and Microsoft SQL Server. The SQL Server was considered because it is already in service at the Botanic Garden and Botanical Museum, which means it can be used free of charge for the reBiND project. Though some of the other commercial products had impressive features, the budget of the reBiND project did not allow for any of them.

| Feature | eXist | Microsoft SQL Server | DB2 with pureXML |
|---|---|---|---|
| Performance | Faster. eXist outperformed SQL Server in a simple test case of uploading a 375 MB XML file and performing some XQueries. Tests were run on a desktop computer. eXist automatically optimizes itself for queries that are run often. | Slower. Tests were run on the server. Queries needed the same time when run again. | untested |
| Stability and Scalability | presumably good[1] | presumably good | presumably good |
| Available documentation, examples and tutorials | good[2] | problematic[3] | good[4] |
| Existing server | No | Yes | No |
| Future availability | currently active community and it has been around for over 10 years | is a Microsoft Product | is an IBM Product |
| Cost of Acquisition | free | free for the reBiND project[5] | free with limitations |
| Restrictions | No | 2 GB limit for XML files | Free Version limited to 2 CPU-Cores with max 4GB RAM |
| Validation | Yes[6] | Yes[7] | Yes[8] |
| APIs | REST, XQL, XML:DB, XML-RPC, Fluent[9] | No | No |
| Versioning | Yes[10] | No | No |
| Editing/Updating Data | XQuery Update, XUpdate | XML DML[11] | XQuery Update |
| Collections | Yes | No | No |
| External Storage/References | Image Module, XInclude, XPointer (partially), Expath-Zip, Expath-Packaging | No | No |
| Uploading and Accessing Data | via APIs, Client GUI, Browser GUI, WebDAV | via SQL | via SQL |
| Other | Images and binary data can be uploaded just like XML documents. | | |
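
To make the REST entry in the APIs row more concrete, here is a minimal Python sketch of how a query URL for eXist's REST interface could be assembled. It assumes a local installation reachable at http://localhost:8080/exist/rest; the collection name /db/rebind, the query //Unit[1] and the helper name exist_rest_query_url are hypothetical and only serve as illustration. The sketch builds the URL and does not contact a server.

```python
from urllib.parse import quote, urlencode


def exist_rest_query_url(base_url, collection, xquery, how_many=10):
    """Build a GET URL asking eXist's REST interface to run an XQuery.

    base_url   -- REST endpoint of the server (assumed, see below)
    collection -- database collection path (hypothetical in this example)
    xquery     -- query text, passed in the _query parameter
    how_many   -- maximum number of results, passed in _howmany
    """
    path = quote(collection.strip("/"))
    params = urlencode({"_query": xquery, "_howmany": how_many})
    return f"{base_url.rstrip('/')}/{path}?{params}"


url = exist_rest_query_url(
    "http://localhost:8080/exist/rest",  # assumption: default local install
    "/db/rebind",                        # hypothetical collection name
    "//Unit[1]",                         # hypothetical query
)
print(url)
```

The resulting URL could then be fetched with any HTTP client, which is what makes this kind of access possible without a database-specific driver.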

Notes

  1. The website of the Office of the Historian of the US State Department (http://history.state.gov) runs entirely on eXist. The site renders text stored in XML-based TEI documents. Last year the site contained more than 50,000 documents with a total of 2 GB of XML and 10 GB of images: http://tei.oucs.ox.ac.uk/Talks/2010-07-oxford/materials/workshops/eXist/exist-tei-workshop/slides/exist-slides.pdf
  2. The site http://www.exist-db.org/ itself contains a lot of documentation and tutorials, and related sites do so as well. There is an active mailing list for specific problems. However, searching for specific information outside of these sources can be tricky due to the ambiguous name. Search queries must be phrased as "exist db", "exist database" or "exist xml", and even then some unrelated sites will show up.
  3. There are some articles in the Microsoft Knowledge Base which outline the basics. However, the examples are sometimes confusing and inconsistent. Searching for specific problems with "Microsoft SQL Server 2008 R2 XML" often brings up unrelated sites, where XML is just a technology keyword used somewhere in the text for a different purpose.
  4. Several very comprehensive IBM Redbooks [4][5][6] contain a lot of information about the use of XML in DB2. Searches for specific problems combined with "purexml" usually show relevant results.
  5. A Microsoft SQL Server already exists at the Botanic Garden and Botanical Museum and can be used for the project.
  6. eXist is schema agnostic. Schema validation is possible, either as implicit validation (only valid documents are allowed) or as explicit validation (validation is done on request and errors are shown, but the document is not removed from the database if validation fails). XML Schema, DTD, RelaxNG and other schema languages are supported by the various validation engines. XML must be well-formed. See: http://www.exist-db.org/validation.html
  7. Microsoft SQL Server is schema agnostic; validation of XML Schema is possible. However, if schema validation is enabled, only valid files can be loaded into the database. Only one schema per file is supported. DTD and RelaxNG are not supported. XML must be well-formed; however, loading fragments (no common root element, like <text>Hello</text><text>World</text>) is possible.
  8. DB2 Schema: optional validation against one or more schemas is possible. XML must be well-formed.
  9. eXist API: for Fluent, see http://fluent.exist-db.org/
  10. eXist Versioning: The versioning module has to be enabled explicitly
  11. Microsoft SQL Server Update: Updating data is possible via the proprietary XML Data Modification Language (XML DML): http://msdn.microsoft.com/en-us/library/ms177454.aspx
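
The difference between a well-formed document and a fragment, as mentioned in the note on Microsoft SQL Server, can be illustrated with a short Python sketch. It uses Python's standard XML parser and not any of the databases compared here, so it says nothing about how SQL Server handles fragments internally. The function name classify and the throwaway wrapper element are purely illustrative.

```python
import xml.etree.ElementTree as ET


def classify(text):
    """Return 'document', 'fragment' or 'malformed' for a piece of XML text."""
    try:
        # A well-formed document has exactly one root element.
        ET.fromstring(text)
        return "document"
    except ET.ParseError:
        pass
    try:
        # Wrapping the text in a throwaway root turns a sequence of
        # sibling elements into a single well-formed document.
        ET.fromstring(f"<wrapper>{text}</wrapper>")
        return "fragment"
    except ET.ParseError:
        return "malformed"


for sample in (
    "<root><text>Hello</text></root>",
    "<text>Hello</text><text>World</text>",  # example from the note above
    "<text>Hello",
):
    print(classify(sample), "<-", sample)
```

A strict XML parser rejects the second sample because it has no common root element, which is why accepting such input is worth noting as a special feature.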


The Final Decision

After reviewing the results of the detailed comparison, it became obvious that eXist-db is the XML database best suited to the needs of the reBiND project.

Here are some of the key features of eXist which are particularly important for the reBiND project.


Appendix

Other XML Databases Not Further Considered

As already mentioned in the section The History of XML Databases, a lot of projects and companies stopped their development around 2005. For the sake of completeness, here is a list of all the XML databases that were on any of the (mostly outdated) lists of XML databases and were looked at briefly, but were not considered any further for the comparison, for various reasons (mostly because they were no longer active).


XML-Products that were listed as XML-DBs but are not


Further Reading

As already mentioned, most of the available articles are quite outdated; therefore most of the articles listed here are more than 5 years old. Nevertheless, they can help with the general understanding of the underlying concepts, though any reference to a specific technology or product must be treated with caution.