Selecting an XML-Database for reBiND
This article describes the decision process of selecting an XML database to be used for the reBiND project of the Botanic Garden and Botanical Museum Berlin-Dahlem.
It describes the general idea of XML databases, the different
underlying concepts, and the specific features and technologies that are
commonly available for such databases. It provides an analysis of the
features that are needed for the reBiND project and a detailed feature
comparison matrix of 15 available XML databases, as well as the final
decision on which database will be used for the reBiND project.
Though the information contained in this document was thoroughly researched, there is no guarantee of its correctness/completeness. Also this document reflects the state of September 2011 and it WILL NOT BE UPDATED, as any newer information is irrelevant for the decision about the XML database for the reBiND project.
Introduction
The History of XML Databases
In the late nineties, the first databases appeared that were specifically designed to store XML data. In the following years, many new XML database projects (academic, commercial and open source) formed and started a real XML database hype. Many scientific papers were published during that time, along with many other articles and pieces of documentation, most notably the XML database lists by Ronald Bourret, who tried to maintain an almost complete list of all XML databases and related products.
However around 2005 this hype came to an end, following the classic Hype Cycle curve. Many of the academic and open source projects had their last activity around that time. Companies providing XML databases went out of business or discontinued these products. This phase can be seen as a consolidation of the XML database market and the projects that survived it came out stronger.
There are still many companies selling XML databases, and some of the open source products survived as well. All of these databases have been developed further in recent years. In the past two years, several new XML standards or versions of existing standards were released by the W3C, such as XQuery 1.0, XPath 2.0, XSLT 2.0, XQuery Update Facility 1.0, XQuery and XPath Full Text 1.0, and several others.
Unfortunately, a lot of the available articles and documentation about XML databases still reflect the state of around 2005, which makes it hard to get up-to-date comparisons of different databases. This is one of the reasons why this document was created.
XML Enabled Databases vs Native XML Databases (NXDs)
There are generally two different types of XML databases. The first type is the XML enabled database. Internally, XML enabled databases store the document in tables with rows and columns, just like a relational database. To generate these tables, the schema of the XML file is needed, and once the tables are created, only documents that are valid against that schema can be stored. For other XML documents, a new set of tables has to be created based on their schema.
"When your only tool is a relational database, everything looks like a table."
— Elliotte Rusty Harold, Managing XML data: Native XML databases
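The table-based storage used by XML enabled databases can be illustrated with a minimal sketch. It assumes a hypothetical, heavily simplified record format (not a real ABCD document) and uses Python's built-in `sqlite3` module as a stand-in for the relational back end. Real XML enabled databases derive the tables from the schema; here the table layout is simply hard-coded.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified record format; not a real ABCD document.
DOC = """<units>
  <unit><id>B-1</id><name>Abies alba</name></unit>
  <unit><id>B-2</id><name>Picea abies</name></unit>
</units>"""

def shred(xml_text, conn):
    """Map each <unit> element to one row of a fixed two-column table."""
    conn.execute("CREATE TABLE IF NOT EXISTS unit (id TEXT, name TEXT)")
    for unit in ET.fromstring(xml_text).iter("unit"):
        # Only the elements the table knows about survive; anything else is lost.
        conn.execute(
            "INSERT INTO unit VALUES (?, ?)",
            (unit.findtext("id"), unit.findtext("name")),
        )

conn = sqlite3.connect(":memory:")
shred(DOC, conn)
rows = conn.execute("SELECT id, name FROM unit ORDER BY id").fetchall()
```

A document with a different structure would need a different set of tables, which is exactly the limitation described above.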
The alternative is the so-called native XML database. A native XML database (NXD) is characterized by an internal storage mechanism that was specifically designed for XML data. This makes most NXDs schema agnostic, which means that the database does not need to know the schema of a document in order to store it. This also means that invalid XML files can be stored. The XML to be stored needs to be well formed, however. Indexing the data for searching and querying works better than with XML enabled databases, since NXDs are aware of the structure of the XML and optimized for it. NXDs usually have a lot of additional functionality built in, such as transformation of XML files, the possibility to update parts of an XML document while it is stored, or APIs for programmatic access to the data.
Though most NXDs are databases specifically designed to store XML, relational databases with a special storage type for XML (such as Microsoft SQL Server) are usually also considered NXDs, though this position is not universally accepted.
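The difference between "well formed" and "valid" can be made concrete with a small sketch. It only checks well-formedness, which is the minimum a schema agnostic database requires, and uses Python's standard `xml.etree.ElementTree` parser. The helper name `is_well_formed` and the sample elements are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the text parses as XML, regardless of any schema."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Well formed, although the date would fail a schema that expects ISO dates.
storable = is_well_formed("<unit><date>31.12.1999</date></unit>")
# Not well formed: the <date> element is never closed.
rejected = is_well_formed("<unit><date>31.12.1999</unit>")
```

Under these assumptions the first document could be stored in a schema agnostic NXD and corrected later, while the second would be refused by any XML database.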
For the following comparison only Native XML Databases were considered.
Analysis
General Requirements and Use Case for reBiND
In order to select an XML database suitable for reBiND it is necessary to take a look at the use case of the database within the project and extract some general requirements.
The reBiND project aims to save threatened data from the field of biodiversity science from legacy databases or other outdated sources and make them available to other researchers. This is done by first converting the existing data into the XML based ABCD format (Access to Biological Collection Data). The resulting XML documents will most likely be invalid, as some of the original data might not be formatted as expected by ABCD, e.g. the dates could be formatted differently. So the next step is to correct these files and make them valid. This can partially be done automatically (e.g. by detecting different common date formats and converting them into the expected format). Other corrections might require human intervention, like missing fields in the taxonomy. The documents will then be stored in the XML database. Since the human corrections might take some time, it would be useful to also be able to save documents which are not yet valid and will be corrected soon.
The database will then be connected to biodiversity networks like BioCASe and GBIF, so other researchers can search and access the stored data. An additional web interface would also allow the stored data sets to be queried directly.
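The automatic part of the correction step could look roughly like the following sketch. The list of input formats is a hypothetical example, not an inventory of the formats actually found in the legacy data, and the target format (ISO 8601, `YYYY-MM-DD`) is assumed here as the expected date format.

```python
from datetime import datetime

# Hypothetical list of formats seen in legacy data; extend as needed.
# Ambiguous inputs such as 01/02/1999 are read as day/month/year here.
KNOWN_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%d/%m/%Y")

def normalize_date(raw):
    """Return the date as YYYY-MM-DD, or None if a human has to look at it."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

fixed = normalize_date("31.12.1999")
unchanged = normalize_date("1999-12-31")
needs_review = normalize_date("ca. 1900")
```

Values that come back as `None` are the cases that would be left for manual correction, which is why storing not yet valid documents is useful.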
Based on this general use case, the following characteristics and requirements were extracted:
- store large XML documents (most of the documents will probably be under 1 megabyte, but large documents of several hundred megabytes or even up to 1 gigabyte might also be possible)
- store many XML documents
- different documents with different structure
- could be subject to common schema(s): ABCD, any of the ABCD-extensions, BioCASe
- Querying and Addressing of parts of the document
- generally: read only, no manipulation of the documents intended, once the valid files are saved
- possible exception for the correction process
- probably low volume traffic
- good reading speed
- writing speed not important
With these requirements, a more detailed search was possible to find
common features and technologies of XML databases and to evaluate whether
they could be useful for the reBiND project.
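To illustrate what "querying and addressing parts of the document" means in practice, here is a minimal sketch using Python's `xml.etree.ElementTree`, which supports only a small subset of XPath 1.0 (not XPath 2.0 or XQuery). The element names are simplified placeholders and do not follow the real ABCD schema.

```python
import xml.etree.ElementTree as ET

# Simplified placeholder structure; real ABCD documents use namespaces
# and a much deeper element hierarchy.
DOC = """<DataSet>
  <Unit><UnitID>B-1</UnitID><FullName>Abies alba</FullName></Unit>
  <Unit><UnitID>B-2</UnitID><FullName>Picea abies</FullName></Unit>
</DataSet>"""

root = ET.fromstring(DOC)
# Address a single unit by the text of its UnitID child ...
match = root.find("./Unit[UnitID='B-2']")
# ... and return only the requested part, not the whole document.
name = match.findtext("FullName")
all_names = [unit.findtext("FullName") for unit in root.iter("Unit")]
```

An XML database does the same kind of addressing, but across many stored documents and with the help of indexes rather than by parsing each file on request.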
Existing Features And Technologies For XML Databases
The features in this table are partially based on the chapter Features of Native XML Databases from the document XML and Databases by Ronald Bourret. The text is outdated, but provides a good overview of some of the issues that might become important for us.
Feature | Note | relevant for reBiND | explanation |
---|---|---|---|
Architecture | Text-Based (better for returning the original document) vs Model-Based (better for a lot of querying) | Model-Based | Querying the documents and retrieving only parts of them will probably be more common than returning entire documents. |
Collections | the possibility of the NXD to combine several stored documents into a collection | Maybe | not really needed, but could be useful, for example to combine all documents from one scientist or from one project |
Querying | Query the stored documents to find relevant parts of them, using query languages like XPath or XQuery | needed | no question about it |
Schema Agnosticity | An XML database is schema agnostic, when it does not need to know the schema of a document in order to store it. This also allows invalid XML files to be stored, as long as they are well formed. | needed | Since some of the ABCD files are invalid, after they have been automatically generated, it is useful to still be able to store them and correct them afterwards (especially since some of the corrections can not be done automatically). |
Validation of Schemas | The possibility to validate an XML document against its schema. For this, the schema either has to be linked in the document or the schema file has to be registered with the database. Different schemas can be supported. There are two different types of validation: implicit validation (it is done when the document is stored and only valid documents are allowed) or explicit validation (validation is done on request after the document has been stored and errors are shown, but the document is not removed from the database if validation fails). | Yes | In order to be able to correct the documents after they are stored, explicit validation is needed. The XML files can have more than one schema (e.g. ABCD with additional data in the BioCASe schema), so the database must be able to handle such documents. Most of the schemas will be XML Schema (the W3C recommendation), but it would be nice to have at least additional support for DTD as well. |
Updating/Deleting | The possibility to change the stored documents using special languages like XQuery Update or XUpdate. | Maybe | Usually the documents will not change once they are in the database. The only exception could be if we store invalid files and correct them later on. But the question remains whether these corrections will be done using such update mechanisms instead of retrieving the document from the database, modifying it using external programs and then storing it again in the database. |
Transactions/Rollbacks | combine changes to one transaction and only save them when all were executed correctly, just like for RDBMS | No | see above |
Locking | The ability to protect a document or a node from editing, while someone edits it | No | even if we were to use the update mechanism for correcting the documents, no locking is needed since it will most likely only be one person editing the document. |
APIs | The ability to access the data through an interface for other programs. This could be a language-dependent interface, like the XML:DB API or XQJ (both for Java), or a language-independent API via HTTP (like REST) | Yes | Would be useful to run query requests and get the data independently of the data presentation on the web platform. |
Round Tripping | The ability to load an XML document into the database and then export the content of the database to a new document which is identical to the first one. For data-centric documents this feature is not important. In extreme cases things like comments, CDATA sections or element order will be lost. There are also several levels at which it can be supported. Character safe round tripping means that exactly the same document will be returned (i.e. input and output file have the same hash). For certain legal or medical documents this is needed to store the exact copy of a record, as required by law. Character safe round tripping is usually ensured by using a text-based architecture. For model-based architectures small changes, like the removal of duplicate white space, could occur. | Yes | but it does not need to be character safe. |
Remote Data | The ability to load additional data from other sources when creating the XML documents | No | |
Indexing | There are generally three basic kinds of indexes: Value Indexes (for the content of elements and attributes), Structural Indexes (for the names of elements and attributes) and Full Text Indexes (for full text search independently of elements) | Yes | structural and value indexes are definitely needed, full text indexes would be nice |
External Storage | The ability to reference external documents | Yes | could be useful for linking images and other non-XML files |
Normalization | Originating from RDBMS, the ability to store information without redundancy and to save storage. It also ensures that no data inconsistencies occur when parts of the data are changed. | No | Our data should not have so much redundant information that storage capacity becomes an issue. And since our data is clearly focused on the individual documents, which are independent from each other, the problem of data inconsistencies does not apply to us. |
References | The way the database handles external pointers (between documents within the server and to resources outside of the database) and internal pointers (within one document) and is able to enforce their integrity. For external references within the database an example would be that a document can not be deleted if a pointer from another document points to it. For internal references this means that a node that has a pointer to it can not be deleted. | No | Though internal references and references to other documents in the database (and their nodes) would be useful, this does not apply here: editing references or deleting nodes/documents is not intended for our database, as this will probably not be needed in order to make the documents valid. |
Scalability | The ability to let the database be run in a cluster and let this cluster grow, as the demand grows. | No | We will most likely not have so much traffic, that we need additional servers. |
Version Control | provide the previous versions of a document | Maybe | The individual changes made in order to make the documents valid do not have to be stored. Having the original file and the valid version stored as two different files would do just fine. The same goes for other (more hypothetical) cases where a new version of a document is needed (e.g. if a scientist discovers the presumably missing second half of a collection which is already in the database). However, if a database provides such functionality, there is no reason not to use it for these cases. |
Live Editing | the possibility to edit a document and save it directly to the database | No | If the documents are corrected after they have been loaded into the database, this will most likely be done either by retrieving the document, correcting it externally and storing it again, or by using XQuery Update or something similar, so live editing is not needed. |
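The difference between character safe round tripping and the weaker form that is sufficient for reBiND can be shown with a short sketch. It compares two serializations of the same content, once by hash and once after canonicalization with `xml.etree.ElementTree.canonicalize` (available from Python 3.8). The sample documents are invented for illustration.

```python
import hashlib
import xml.etree.ElementTree as ET

def sha256(text):
    """Hash of the exact character sequence of a document."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

original = '<unit id="1" lang="en"><name>Abies alba</name></unit>'
# The same content as it might come back from a model-based store:
# attribute order, quoting and spacing inside the tag differ.
exported = "<unit lang='en'  id='1'><name>Abies alba</name></unit>"

# Character safe round tripping would require identical hashes.
character_safe = sha256(original) == sha256(exported)
# The weaker requirement: both serializations describe the same XML content.
same_content = ET.canonicalize(original) == ET.canonicalize(exported)
```

Note that `canonicalize` drops comments by default, so a check like this would not detect lost comments unless `with_comments=True` is passed.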
Technology | Link | Note | useful for reBiND | explanation |
---|---|---|---|---|
Querying | ||||
XPath 1.0 | WP | addressing nodes in the document (W3C Recommendation 16 November 1999) | No | not needed, if XPath 2.0 is supported |
XPath 2.0 | WP | XPath 2.0 was developed as part of XQuery 1.0, in close coordination with XSLT 2.0. It is only partially backward compatible with XPath 1.0. (Second Edition: W3C Recommendation 14 December 2010) | Yes | Backward compatibility should not be an issue for us, if we start with 2.0. |
XQuery 1.0 | WP | "What SQL is to relational databases, XQuery is to native XML databases." [7]. It is the most common way to query XML. The specification also includes XPath 2.0 and was developed in close coordination with XSLT 2.0. (Second Edition W3C Recommendation 14 December 2010) | Yes | |
XQuery 3.0 | WP | Formerly planned as XQuery 1.1. This is currently only a working draft and BaseX is the only NXD that has already implemented it, probably because project founder Christian Grün is a member of the W3C XML Query Working Group. (W3C Working Draft 14 June 2011) | No | 1.0 will do just fine. Judging by the success of XQuery 1.0, it can be assumed that all projects that are still active will eventually implement version 3.0 once it has become a recommendation. |
XQueryX 1.0 | WP | XQuery request in XML. Not intended for human readability, but for automated processing. Can also be automatically generated out of a regular XQuery. (Second Edition: W3C Recommendation 14 December 2010) | No | could be useful, if we want to store queries to the database in the database itself |
XQuery and XPath Full Text | WP | Allowing Full Text search via XQuery. (W3C Recommendation 17 March 2011) | Yes | would be useful for full text search, but it is unlikely that it is already implemented. An NXD with its own full text search would do just fine |
SPARQL | WP | Query language for RDF data. Part 1: Query Language, Part 2: Protocol, Part 3: Query Results XML Format. (Version 1.0: W3C Recommendation 15 January 2008, Version 1.1 as Working Draft) | No | we don't use RDF data |
Updating | ||||
XQuery Update | WP | Officially called XQuery Update Facility. The possibility to change XML documents via XQuery. This includes updating, deleting and moving elements and attributes. (W3C Recommendation 17 March 2011) | Maybe | could become useful, if we want to correct the data after it has been published to the database |
XUpdate | WP | not to be confused with XQuery Update Facility. XUpdate was developed by the XML:DB Working Group, which wanted to develop a way to modify data in XML documents. This was before XQuery Update was even in development. The Group has been inactive since 2003, but the specification is still implemented in several NXDs that have been around since the early days. (Working Draft September 14, 2000) | No | XQuery Update is the better choice here. |
APIs | ||||
XQJ | WP | XQuery API for Java | Maybe | could be useful for other applications that build upon the data we provide |
XML:DB API | | Programming language neutral API for vendor neutral accessing of NXDs. Another project of the inactive XML:DB Working Group. | No | if we need something like that XQJ seems to be the better choice |
REST-ful API | WP | accessing (and modifying) XML Data via HTTP by addressing it with URIs. Considered to be simpler than SOAP. | Maybe | This could allow other applications to call the data via regular HTTP |
SOAP | WP | Part 0: Primer, Part1: Messaging Framework, Part 2: Adjuncts. Exchanging structured data (XML) over HTTP. Payload is put in a SOAP envelope. (Second Edition: W3C Recommendation 27 April 2007) | No | a REST-ful API would be the better choice |
XML-RPC | WP | Remote Procedure calls via XML. Written by Dave Winer of UserLand Software. Specification released on 15 June 1999 | No | |
Schemas | ||||
DTD | WP | common but limited schema language for XML files | Maybe | we currently have no XML data with DTD files, but it could become useful if we have data that is already in its own XML format with its own DTD |
XML Schema | WP | Part 0: Primer, Part 1: Structures, Part 2: Datatypes. Schema language with a lot more features than DTD. An NXD that supports XML Schema can validate the data. However, some databases only do this optionally and also allow non-valid XML to be loaded. (Version 1.0 Second Edition W3C Recommendation 28 October 2004, Version 1.1 W3C Candidate Recommendation 21 July 2011) | Yes | ABCD and BioCASe use XML Schema. Useful for checking if the documents are valid. However, we need an NXD that does not enforce validation, which would allow us to load invalid files into the database and fix them later on. Support for XML Schema is a must-have if the database requires a schema in order to handle the documents. If the database is schema agnostic, XML Schema support for optional validation is a should-have. |
RELAX NG | WP | another schema language. Better than DTD, different from XML Schema. See also: Wikipedia: XML Schema Language comparison | No | |
Access | ||||
XACML | WP | Markup for Access Control. (XACML 2.0 Specification 1 February 2005) | No | |
WebDAV | WP | Web-based Distributed Authoring and Versioning with features like locking, namespace management and properties. (RFC 4918, June 2007) | No | |
Processing | ||||
XSLT 1.0 | WP | Transforming XML with stylesheets to other documents. (W3C Recommendation 16 November 1999) | No | not needed, if version 2.0 is supported, but most likely both will be. |
XSLT 2.0 | WP | Was developed in close coordination with XQuery 1.0. See also: Wikipedia: XSLT elements (W3C Recommendation 23 January 2007) | Yes | for transforming the output to HTML or other output formats |
XProc | WP | defining XML pipelines. (W3C Recommendation 11 May 2010) | Maybe | could be useful for building up automatic correction systems also in XML (for example using XQuery Update) which are run when the content is imported, or afterwards |
XInclude | WP | linking to external XML or plain text content, which is included in the document, when it is rendered. (Second Edition W3C Recommendation 15 November 2006) | No | |
XQuery Scripting Extension | WP | "adding imperative (procedural) features such as variable assignment and explicit sequencing to XQuery". (W3C Working Draft 8 April 2010) | Maybe | could be useful, if we want to do the error corrections in XML as well, for example by using XML pipelines and XQuery Update |
Linking | ||||
XPointer | WP | addressing components of XML, similar to XPath but can also address entire specific sections (W3C Recommendation 25 March 2003) | No | |
XLink | WP | linking within an XML document, to other XML documents and non-XML files (Version 1.1 W3C Recommendation 06 May 2010) | Yes | could be used to link to images and other non-XML files. might also be practical to reference other related documents from the database. |
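How a REST-ful API lets other applications "call the data via regular HTTP" can be sketched as a URL builder. The parameter names `_query` and `_howmany` are assumptions modelled on eXist's REST interface and should be checked against the documentation of the database actually used; the host and collection path are hypothetical. No request is sent here; the sketch only shows how a query ends up in a URI.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_query_url(base_url, collection, query, max_items=10):
    """Build a GET URL that sends a query to a collection.

    The parameter names `_query` and `_howmany` are assumptions modelled on
    eXist's REST interface; other databases use different names.
    """
    params = urlencode({"_query": query, "_howmany": max_items})
    return f"{base_url.rstrip('/')}/{collection.strip('/')}?{params}"

# Hypothetical host and collection path.
url = build_query_url(
    "http://localhost:8080/exist/rest", "/db/rebind", "//Unit/FullName"
)
# Decoding the query string again shows what the server would receive.
path = urlsplit(url).path
sent = parse_qs(urlsplit(url).query)
```

Because the query travels in a plain URL, any client that can issue an HTTP GET request could retrieve the data, independent of programming language.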
Specific Requirements
Based on the detailed analysis of the existing technologies and features for NXDs and the general requirements mentioned earlier, this detailed list of desired features was created. It also shows the importance of a specific feature for us.
Category | Features | Priority |
---|---|---|
Storage | store XML Documents | must |
Retrieval | Export full XML documents (with Round Tripping) | must |
Retrieval | Querying and Addressing sections and elements via XQuery/XPath | must |
Retrieval | Indexing for fast Querying and Addressing | must |
Retrieval | Transforming output via XSLT | must |
Retrieval | Full Text Search | should |
Retrieval | API for Programmable Access (XQJ/REST) | should |
Error Correction | Schema agnostic database (import invalid documents) | must |
Error Correction | Schema Validation for XML Schema (and DTD) | should |
Error Correction | Updating of documents | should |
Error Correction | Versioning of documents | could |
Error Correction | XProc for programming repair procedures | could |
Error Correction | XQueryX for storing repair procedures in the database itself | could |
External Storage | XLink for linking to external documents | should |
External Storage | Indexing of external documents | could |
Soft Requirements | store many documents | |
Soft Requirements | store large documents | |
Soft Requirements | good reading speed | |
Soft Requirements | writing speed not important | |
Soft Requirements | low traffic | |
Feature Comparison
Column groups, from left to right: About, Must Have, Should Have, Nice To Have, Can't Hurt To Have.

Database | developed by | License | Version | XPath | XQuery | XSLT | Schema Agnostic | XML Schema | Full Text | XQuery Update | REST | XQJ | XLink | DTD | XProc | Scripting | XQueryX | XUpdate | XML:DB API | RELAX NG | XPointer | XInclude | Links | Other |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Open Source | ||||||||||||||||||||||||
BaseX | University of Konstanz | BSD | 6.7.1 (2011-07-28) | 2.0 | 3.0 | 2.0[1] | Yes | No[2] | Yes | Yes | Yes | Yes | No[3] | No[2] | No | No | No | No | Yes | No[2] | No | No | WP [8] | written in Java |
eXist | eXist | LGPL | 1.4.1 (2011-08-16) | 2.0 | 1.0 | 2.0[4] | Yes | Yes[5] | Yes | Yes | Yes | Yes and Similar[6] | Similar[7] | Yes[5] | Yes[8] | No | No | Yes | Yes | Yes[5] | Partially | Yes | WP, Fact Sheet | written in Java 6 |
Sedna | Russian Academy of Sciences[9] | Apache 2.0 | 3.4.66 (2010-09-30) | 2.0 | 1.0 | No | Yes | No[10] | Yes | Planned [11] | No | Yes | No | No[10] | No | No | No | Yes | Yes | No[10] | No | Partially [12] | WP | written in C/C++ |
Virtuoso Open | OpenLink Software | GPL 2 | 6.1.13 (2011-03-30) | 2.0 | Incomplete[13] | 1.1 | Yes | No[14] | Yes | No | Yes | Similar[15] | No | No[14] | No | No | No | No | Similar[15] | No[14] | No | No | WP, Commercial | hybrid server |
Free of Charge | ||||||||||||||||||||||||
Qizx Free | PIXware | License | 4.2 (2011-06-06) | 2.0 | 1.1[16] | 2.0 | Yes | No[17] | Yes | Yes | Yes | Similar and Planned [18] | No | No[17] | No | Partially | No | No | No | No[17] | No | No | Features, Comparison | size limitations[19] |
DB2 Express-C | IBM | commercial (no charge) | 9.7 (2009-06) | 2.0 | 1.0 | 2.0? | Yes | Yes[20] | Similar | Yes | No | No | No | Partially | No | No | No | No | No | No | No | No | WP:DB2, WP:pureXML | No support or update[21] |
Commercial | ||||||||||||||||||||||||
Documentum xDB | EMC² | commercial[22] | 10.1. (2011-5) | 2.0 | 1.0[23] | Yes | Yes | Yes[24] | Partially[25] | Yes | No | Similar[26] | Yes | Yes[24] | Yes[27] | Similar[28] | No | No | Similar[26] | No | Yes | Yes[27] | Docu | written in Java |
MarkLogic Server | MarkLogic Server | commercial | 4.2 (2010-10-19) | 2.0 | 1.0 | 2.0 | Yes | Yes[29] | Yes[30] | Maybe[31] | Yes | Maybe[32] | No | No | No[33] | No | No | Maybe[31] | Maybe[34] | No | Yes | Yes | WP, Fact Sheet | written in C++ |
Oracle XML DB | Oracle | commercial | 11g Release 2 (2010-09) | 2.0 | 1.0 | 2.0 | Yes | Yes | Similar[35] | No | No | Yes | Yes | Yes | No | No | No | No | No | No | No | Yes | Tech Overview | |
TigerLogic XDMS | TigerLogic | commercial | 3.0 (2007-06-19)[36] | 2.0 | 1.0 | 2.0 | Yes | Yes[37] | Yes | Similar[38] | No | Yes | No | Yes[37] | No | No | No | Similar[38] | No | Maybe[37] | No | No | Technical Overview | Support Forum quite dead |
XQuantum XML DB | Cognetic Systems | commercial | 1.5 (2008-09) | 2.0 | 1.0 | Yes | Yes | Yes[39] | Yes | No | No | Similar[40] | No | No | No | No | No | No | Similar[40] | No | No | No | Brochure | site not changed since Sept 09 |
Tamino | Software AG | commercial | 8.2 (2011-3)[41] | 2.0 | 1.0 | Maybe[42] | Yes | Yes | Yes | Maybe[43] | No | Maybe[44] | No | Yes | No | No | No | Maybe[43] | Maybe[45] | No | No | No | Fact Sheet | Support Forum quite dead |
XMS | Xpriori | commercial | 3.2.2.47 (~2007) | 2.0 | 1.0 | Yes | Yes | No[46] | Yes | Maybe[47] | No | Similar[48] | No | No[46] | No | No | No | Similar[47] | Similar[48] | No[46] | No | No | Features | fka. myXMLdb. Seems dead |
MS SQL Server | Microsoft | commercial | 2008 R2 (2010) | 2.0? | 1.0 | Yes | Yes | Yes[49] | Similar[50] | Similar[51] | No | No | No | No | No | No | No | Similar[50] | No | No | No | No | XML in SQL Server | XML as special column type |
TEXTML | IXIA Soft | commercial | 4.1 (2010-04-27) | Yes | Similar[52] | No | Yes | Yes[53] | Yes | No[54] | No | Similar[55] | No | Yes[53] | No | No | No | No[54] | Similar[55] | No | No | No | Data Sheet |
Notes
- ↑ BaseX XSLT: 1.0 has native Support, 2.0 is supported via Saxon
- ↑ 2.0 2.1 2.2 BaseX: Schema validation is not supported: https://mailman.uni-konstanz.de/pipermail/basex-talk/2010-July/000482.html
- ↑ BaseX XLink: not supported, but EXPath File is supported instead
- ↑ eXist XSLT: 1.0 based on Apache Xalan, 2.0 is optional via Saxon
- ↑ 5.0 5.1 5.2 eXist: Schema validation is possible, either as implicit validation (only valid documents are allowed) or as explicit validation (validation is done on request and errors are shown, but document is not removed from database if validation fails). XML Schema, DTD, RelaxNG and other schemas are supported by the various validation engines. See: http://www.exist-db.org/validation.html
- ↑ eXist XQJ: http://www.xqjapi.com/exist/. Alternatively eXist also offers the Fluent API http://fluent.exist-db.org/
- ↑ eXist XLink: Image Module allows images in the database, including metadata
- ↑ eXist XProc: via an Extension
- ↑ Sedna Developed by: The Institute for System Programming of the Russian Academy of Sciences
- ↑ 10.0 10.1 10.2 Sedna: Schema validation is not supported: http://www.mail-archive.com/sedna-discussion@lists.sourceforge.net/msg00555.html
- ↑ Sedna XQuery Update: planned, but no release date set http://www.mail-archive.com/sedna-discussion@lists.sourceforge.net/msg00750.html. Sedna currently implements its own update language (http://www.sedna.org/progguide/ProgGuidesu6.html#x12-440002.3 ), based on the Diploma Thesis by Patrick Lehti. “Design and Implementation of a Data Manipulation Processor for a XML Query Language” (http://www.lehti.de/beruf/diplomarbeit.pdf 2001)
- ↑ Sedna XInclude: http://www.mail-archive.com/sedna-discussion@lists.sourceforge.net/msg00968.html
- ↑ Virtuoso XQuery: implementation of XQuery 1.0 is allegedly incomplete. This was suggested by a comment in the corresponding Wikipedia article before Version 6.0 was released (preview version was already available), but no entries in the Change Log or the Feature Comparison suggest that these limitations have been fixed.
- ↑ 14.0 14.1 14.2 Virtuoso: Schema validation is not supported.
- ↑ 15.0 15.1 Virtuoso API: Proprietary APIs for Java and other Programming Languages exist
- ↑ Qizx XQuery: XQuery 1.1 became XQuery 3.0, but the current implementation does not reflect these changes, as it was done, on a Working Draft: http://www.w3.org/TR/2009/WD-xquery-11-20091215/
- ↑ 17.0 17.1 17.2 Qizx: Schema validation is not supported: http://www.xmlmind.com/qizx/features.html (section "XML Standards")
- ↑ Qizx XQJ: A similar JavaAPI is implemented and "Support for XQJ is planned but not with high priority.": http://www.xmlmind.com/qizx/features.html
- ↑ Qizx Limitations: The Free version comes with a size limitation of approximately 1000 megabytes of source XML.
- ↑ DB2 Schema: optional validation of one or more schemas is possible
- ↑ DB2 Limitation: No support or update for the free version
- ↑ Documentum xDB Price: ~ $1X,XXX [1][2]
- ↑ Documentum xDB XQuery: XQuery 3.0 is being implemented: https://community.emc.com/thread/120215
- ↑ 24.0 24.1 Documentum xDB validation of XML Schema and DTD is possible
- ↑ Documentum xDB Full Text: http://developer.emc.com/docs/documentum/xdb/manual/#doc:topic/xquery_full_text.html and it also provides its own full text search
- ↑ 26.0 26.1 Documentum xDB has its own Java API
- ↑ 27.0 27.1 Documentum xDB XProc: can be added via the XProc Engine Plugin: https://community.emc.com/docs/DOC-4242 , which also handles XInclude
- ↑ Documentum xDB Scripting: has a command line client which allows for scripting via a proprietary scripting language: http://developer.emc.com/docs/documentum/xdb/manual/#doc:topic/command_line_client.html#CommandLineClient-0308B7DE
- ↑ MarkLogic Server: validation of XML Schema is possible
- ↑ MarkLogic Full Text:Full text search is supported, though it is not clear if this is done via the XQuery Full Text Extension
- ↑ 31.0 31.1 MarkLogic Update: the possibility of updating exist, but it is not specified in the marketing material which technology is implemented.
- ↑ MarkLogic JQX: an API for Java exists, but it is not specified in the marketing material which one.
- ↑ MarkLogic XProc:"MarkLogic will likely support it in a near future revision (Norman Walsh, now of MarkLogic, is the editor for the specification)" http://broadcast.oreilly.com/2009/03/xproc-xml-pipelines-and-restfu.html (March 2009)
- ↑ MarkLogic XML:DB API: an API for Java exists, but it is not specified in the marketing material which one.
- ↑ Oracle Full Text: see http://download.oracle.com/docs/cd/B28359_01/appdev.111/b28369/xdb09sea.htm
- ↑ TigerLogic Version: the date comes from the last published update: [3]
- ↑ 37.0 37.1 37.2 TigerLogic XDMS: validation of some schemas (XML Schema, DTD, maybe also others) is possible
- ↑ 38.0 38.1 TigerLogic Update: Updating is possible, but on the homepage it is not specified with what technology this is possible. It only says "Document Create, Read, Update and Delete" or "Node level updates and node-level locking" in the feature list. It might be a proprietary format.
- ↑ XQuantum: Validation of XML Schema is possible
- ↑ 40.0 40.1 XQuantum API: Proprietary APIs for Java and other Programming Languages exist
- ↑ Tamino Version: the Tamino homepage still lists 8.0 (released in 2009) as the latest version
- ↑ Tamino XSLT: There is a demo of an extension for Tamino 3.1, but the marketing material does not mention it; it could have been dropped or included
- ↑ 43.0 43.1 Tamino Update: updating is possible, but the marketing material does not specify which technology is implemented.
- ↑ Tamino JQX: an API for Java exists, but it is not specified in the marketing material which one.
- ↑ Tamino XML:DB API: an API for Java exists, but it is not specified in the marketing material which one.
- ↑ 46.0 46.1 46.2 XMS: Schema validation is not supported
- ↑ 47.0 47.1 XMS Update: Some way of updating is possible, though it is not specified how this can be done: http://www.xpriori.com/products/xms/developers
- ↑ 48.0 48.1 XMS API: Proprietary APIs for Java and other Programming Languages exist
- ↑ Microsoft SQL Server: validation of XML Schema is possible. However, if schema validation is enabled, only valid files can be loaded into the database. Only one schema per file is supported
- ↑ 50.0 50.1 Microsoft SQL Server Full Text: Full text search is supported, but not via the XQuery Full Text Extension
- ↑ Microsoft SQL Server Update: Updating data is possible via the proprietary XML Data Modification Language (XML DML): http://msdn.microsoft.com/en-us/library/ms177454.aspx
- ↑ TEXTML XQuery: a proprietary TEXTML Query language is used
- ↑ 53.0 53.1 TEXTML: validation of some schema is possible
- ↑ 54.0 54.1 TEXTML Update: Apparently it is possible to update data via the API
- ↑ 55.0 55.1 TEXTML API: Proprietary APIs for Java and other Programming Languages exist
Detailed Comparison
After comparing the different XML databases based on the feature matrix above, three databases were selected for a more detailed look in order to make the final decision. The three final candidates were eXist-db, IBM's DB2 Express-C and Microsoft SQL Server. SQL Server was considered because it is already in service at the Botanic Garden and Botanical Museum, which means it can be used free of charge for the reBiND project. Though some of the other commercial products had impressive features, the budget of the reBiND project did not allow for any of them.
Feature | eXist | Microsoft SQL Server | DB2 with pureXML |
---|---|---|---|
Performance | faster: eXist outperformed SQL Server in a simple test case of uploading a 375 MB XML file and performing some XQueries. Tests were run on a desktop computer. eXist automatically optimizes itself for queries that are run often. | slower in the same test case. Tests were run on the server. Queries needed the same time when run again. | untested |
Stability and Scalability | presumably good[1] | presumably good | presumably good |
available documentation, examples and tutorials | good[2] | problematic[3] | good[4] |
existing server | No | Yes | No |
future availability | currently active community and it has been around for over 10 years | is a Microsoft Product | is an IBM Product |
Cost of Acquisition | free | free for the reBiND project[5] | free with limitations |
Restrictions | No | 2 GB limit for XML files | Free Version limited to 2 CPU-Cores with max 4GB RAM |
Validation | Yes[6] | Yes[7] | Yes[8] |
APIs | REST, XQJ, XML:DB, XML-RPC, Fluent[9] | No | No |
Versioning | Yes[10] | No | No |
Editing/Updating Data | XQuery Update, XUpdate | XML DML[11] | XQuery Update |
Collections | Yes | No | No |
External Storage/References | Image Module, XInclude, XPointer (partially), Expath-Zip, Expath-Packaging | No | No |
Uploading and Accessing Data | via APIs, Client GUI, Browser GUI, WebDAV | via SQL | via SQL |
Other | Images and binary data can be uploaded just like XML documents. | | |
Notes
- ↑ The website of the Office of the Historian of the US State Department, http://history.state.gov, runs entirely on eXist. The site renders text stored in XML-based TEI documents. Last year the site contained 50,000+ documents with a total of 2 GB XML + 10 GB images: http://tei.oucs.ox.ac.uk/Talks/2010-07-oxford/materials/workshops/eXist/exist-tei-workshop/slides/exist-slides.pdf
- ↑ The site http://www.exist-db.org/ itself contains a lot of documentation and tutorials, and so do related sites. There is an active mailing list for specific problems. However, searching for specific information outside of these sources can be tricky due to the ambiguous name. Search queries must be combined as "exist db", "exist database" or "exist xml", and even then some unrelated sites will show up.
- ↑ There are some articles in the Microsoft Knowledge Base which outline the basics. However, the examples are sometimes confusing and inconsistent. Searching for specific problems with "Microsoft SQL Server 2008 R2 XML" often brings up unrelated sites, where XML is just a technology keyword used somewhere in the text for a different purpose.
- ↑ Several very comprehensive IBM Redbooks [4][5][6] contain a lot of information about the use of XML in DB2. Searches for specific problems combined with "purexml" usually show relevant results.
- ↑ A Microsoft SQL Server installation already exists at the Botanic Garden and Botanical Museum and can be used for the project.
- ↑ eXist is schema agnostic. Schema validation is possible, either as implicit validation (only valid documents are allowed) or as explicit validation (validation is done on request and errors are shown, but the document is not removed from the database if validation fails). XML Schema, DTD, RelaxNG and other schemas are supported by the various validation engines. XML must be well formed. See: http://www.exist-db.org/validation.html
- ↑ Microsoft SQL Server is schema agnostic; validation of XML Schema is possible. However, if schema validation is enabled, only valid files can be loaded into the database. Only one schema per file is supported. DTD and RelaxNG are not supported. XML must be well formed; however, loading fragments (no common root element, like <text>Hello</text><text>World</text>) is possible.
- ↑ DB2 Schema: optional validation of one or more schemas is possible. XML must be well formed.
- ↑ eXist API: for Fluent, see http://fluent.exist-db.org/
- ↑ eXist Versioning: The versioning module has to be enabled explicitly
- ↑ Microsoft SQL Server Update: Updating data is possible via the proprietary XML Data Modification Language (XML DML): http://msdn.microsoft.com/en-us/library/ms177454.aspx
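The difference between a well-formed document and a fragment, as mentioned in the note on Microsoft SQL Server validation above, can be illustrated with a minimal Python sketch. It uses only the parser from the Python standard library and none of the databases compared here; the helper name `is_well_formed` and the wrapper element `root` are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as a single well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# A fragment: two sibling elements without a common root element.
fragment = "<text>Hello</text><text>World</text>"

# Wrapping the fragment in a (hypothetical) root element turns it
# into a well-formed document.
document = "<root>" + fragment + "</root>"

print(is_well_formed(fragment))
print(is_well_formed(document))
```

A strict XML parser rejects the fragment, which is why a database that accepts such input has to treat it as a special case rather than as a document.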
The Final Decision
After reviewing the results of the detailed comparison, it became obvious that eXist-db is the XML database best suited to the needs of the reBiND project.
Here are some of the key features of eXist which are particularly important for the reBiND project.
- Addressing parts of the document with XPath
- Querying documents with XQuery
- Automated Indexing of XML for faster addressing and querying, including Full Text search
- Transforming XML using XSLT
- Schema Agnostic
- optional Schema Validation
- Updating XML files in the database using the XQuery Update Facility
- several APIs to access the data from other programs, like REST, XQJ, Fluent and others
- Versioning of files is supported
- Access to the files via a Web Interface and a GUI-Client makes eXist comfortable to use also for people who are not programmers or power users
- eXist is Free and Open Source, which is in the spirit of the reBiND mission statement: "[...] to develop cost-efficient workflows for rescuing legacy databases [...]"
- it is well documented and an active community of supporters can help with other potential problems
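To make the first two items of the list more concrete, here is a small sketch of addressing parts of a document with XPath. It uses the Python standard library, which implements only a limited subset of XPath, and does not talk to eXist at all; the sample data and all element and attribute names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; element and attribute names are made up.
SAMPLE = """
<collection>
  <specimen id="s1"><genus>Rosa</genus><country>Germany</country></specimen>
  <specimen id="s2"><genus>Quercus</genus><country>France</country></specimen>
  <specimen id="s3"><genus>Rosa</genus><country>Italy</country></specimen>
</collection>
"""

root = ET.fromstring(SAMPLE)

# Path expression: every <genus> element directly below a <specimen>.
genera = [g.text for g in root.findall("./specimen/genus")]

# Predicate: only specimens whose <genus> child has the text 'Rosa'.
rosa_ids = [s.get("id") for s in root.findall("./specimen[genus='Rosa']")]

print(genera)
print(rosa_ids)
```

In an XML database the same kind of path expression is evaluated against stored documents and can be backed by indexes, so that the documents do not have to be parsed again for every query.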
Appendix
other XML Databases not further considered
As already mentioned in the section The History of XML Databases, a lot of projects/companies stopped their development around 2005. For the sake of completeness, here is a list of all the XML databases that were on any of the (mostly outdated) lists of XML databases and were looked at briefly, but were not considered any further for the comparison, for various reasons (mostly because they were not active anymore).
- DOMSafeXML, not active anymore, last update July 2009, project domain http://www.domsafexml.com/ does not exist anymore, not listed on the manufacturer's website anymore
- EsTerra XML Storage Server, most available information is in Japanese: http://www.mediafusion.co.jp/XSS/xss.html
- GemFire from VMware, formerly owned by GemStone, appeared on some lists as an XML database, but neither the homepage nor the data sheets mention XML
- Toshiba TX1, description only in Japanese
- TIMBER small university project, no edits since 2006
- NeoCore XML DB, description only in Japanese, English Site not available anymore
- XStreamDB, apparently the company Bluestream does not offer it anymore. Their only product xDocs (for XML-based Content Component Management) is based on a (My)SQL database
- GoXML DB, not available anymore
- X-Hive bought by EMC, now (turned into/part of) EMC Documentum xDB
- Ozone, project inactive, last changes 2005
- eXcelon eXtensible Information Server (XIS), company was bought by Sonic Software, which turned the XIS into Sonic XML Server and was then bought by Progress Software, neither XIS nor Sonic XML Server mentioned on the homepage anymore
- Sekaiju (known as Yggdrasill in Japan), potential predecessor of EsTerra, again homepage only in Japanese
- Infonyte DB, (formerly PDOM), company apparently does not exist anymore
- Cyber Luxeon, site in Japanese, now (cooperation with/bought by) NeoCore, English version of homepage quite brief
- 4 Suite website http://www.4suite.org/ not available anymore
- DBDom, last changes 2001
- DBXML, website http://www.dbxml.com/ shows default Wordpress page, last changes 2004
- eXtc, last changes 2005
- ExtraWay, website only in Italian
- Lore, project stopped in 2000
- M/DB:X, 5 releases from 14. June 2009 till 6 July 2009, no new ones since, still a little activity on the Mailing List
- myXMLDB, inactive since 2004, also more like a Middle-Ware to a mySQL-DB
- Natix, inactive since 2007, alternative domain does not exist anymore
- SQL/XML-IMDB from Quilogic, site confusing, no information about releases or versions. According to the Internet Archive the site has been the same since 2003.
- TOTAL XML (formerly Socrates XML), company apparently stopped the project in the beginning of 2007 (based on Archive.org redirects)
- xml.gax.com (formerly NaX Base), website of company not available anymore, company was allegedly bought by Naxoft, their website is only in Japanese
- Xyleme Zone Server not offered anymore by the company
- MonetDB/XQuery Project retired in March 2011, allegedly the functionality is still available in combination with the Pathfinder-Project but it looks more like a middleware, which translates XQuery into SQL or MIL (Monet Interpreter Language)
- Oracle Berkeley DB XML, not a native XML database, built on top of a legacy version of Berkeley DB (sources: Wikipedia, Data Sheet)
- Ipedo XML Database Company appears to be inactive since the beginning of 2009
- Toshiba Large-Scale Distributed XML Database only a research project, no code released
- Toshiba Lightweight XML Database only a research project, no code released, also intended for mobile devices
- infozone: Somehow related to Ozone. website now used differently
- RDFDB: only for RDF data
- Redfoot: only for RDF data and website not available anymore
- XDBM: website not available anymore
- Xindice: inactive. Moved to the Apache Attic.
- TeraText Database System (DBS): Because of the missing support for XQuery, XPath and XSLT not further considered
XML-Products that were listed as XML-DBs but are not
- Altova XMLSpy
- IB Engine XML Search Engine (not sure if XML DB) for embedded systems, also the homepage is very confusing.
- Dieselpoint
- XAware
Further Reading
As already mentioned, most of the available articles are quite outdated; therefore most of the articles listed here are more than 5 years old. Nevertheless, they can help with the general understanding of the underlying concepts, though any reference to a specific technology or product must be treated with caution.
- Ronald Bourret: XML and Databases (2005) One of the most extensive descriptions of XML databases, their structures, features etc. Like everything else a bit outdated, but very useful to get a wide overview of the topic.
- Ronald Bourret: XML Database Products (though it says the list was last updated in 2010, many of the products in the section Native XML Databases are way outdated)
- Wikipedia article: XML database (a little outdated, but the comparison chart is pretty much cleaned up, though quite small)
- A feature comparison of different XML databases (pretty up to date, who would have thought that?)
- The State of Native XML Databases (2007), short chapters about Mark Logic, eXist, DB2 9 and Berkeley DB XML
- Using XML and Databases (2008) - W3C Standards in Practice. White Paper sponsored by EMC². Good overview of the XML technologies.
- IBM developerWorks: Comparing XML database approaches - What are the similarities and differences between pureXML and native XML databases? (2008)
- IBM developerWorks: Managing XML data: Native XML databases - Theory and reality (2005)
- IBM developerWorks: Working XML: Comparing XSLT 2.0 and XQuery - Two dialects of XPath for different tasks (2006)
- IBM Redbooks Publications: XML for DB2 Information Integration (2004, PDF), Chapter 1: XML and databases (very good and with a lot of details and examples)
- Introduction to Native XML Databases (2001)
- Ronald Bourret: Going native: Use cases for native XML databases (2007)
- An Exploration of XML in Database Management Systems (2001), general introduction of XML, DTD, XML Schema, XPath, XQuery. Chapter on XML Databases way outdated