I agree with you, ilsaloving. At MarkLogic, we think data consistency is a critical feature of a DBMS. Some NoSQL technologies have it, some don't. Choosing NoSQL doesn't have to mean settling for eventual consistency. And yes, many banks do use our software. As you point out, transactional consistency and durability are important to them.
Since you're asking an honest question, I'll try to provide a good answer.
Enterprises choose NoSQL when they have data that's hard to represent in a relational model. Often that's just due to the complexity of the data. Equities transactions, for example, are easily modeled in rows and columns, but derivatives transactions are like contracts. They're hard to model in a relational database. Not impossible, but hard. Health records are another good example; these are frequently document-centric data structures with lots of freeform text in them. Journal articles, emails, legal documents, etc. are all examples of this kind of hard-to-model data. Sure, you can store it as a BLOB or CLOB in a relational database, but then it's hard to query. Another type of complex data is very sparse data. If you need tens of thousands of columns because any given field only occasionally holds a value but still has to be queryable, that's not efficient. In many sparse data models, the "columns" change frequently as well. One day you may discover that you have to track some new trait. Do you add a column for that? With a document-centric model it's much easier. Just ingest a document that has that element in it, and MarkLogic will index it for you.
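To make the sparse-data point concrete, here's a toy Python sketch. It is not MarkLogic's API or implementation; the `DocumentStore` class, its methods, and the sample fields are all made up for illustration. It only shows the idea that if every field is indexed on ingest, a brand-new field becomes queryable without any schema change.

```python
from collections import defaultdict


class DocumentStore:
    """Hypothetical schema-agnostic store: every field is indexed on ingest."""

    def __init__(self):
        self.docs = {}
        # (field, value) -> set of document ids containing that pair
        self.index = defaultdict(set)

    def ingest(self, doc_id, doc):
        """Store the document and index each of its fields, whatever they are."""
        self.docs[doc_id] = doc
        for field, value in doc.items():
            self.index[(field, value)].add(doc_id)

    def find(self, field, value):
        """Answer an equality query from the index alone."""
        return sorted(self.index.get((field, value), set()))


store = DocumentStore()
store.ingest("p1", {"name": "Ann", "blood_type": "A"})
store.ingest("p2", {"name": "Bob", "blood_type": "O"})
# A new trait shows up one day: no ALTER TABLE, just ingest a document that has it.
store.ingest("p3", {"name": "Cy", "blood_type": "A", "allergy": "penicillin"})

by_allergy = store.find("allergy", "penicillin")
by_blood = store.find("blood_type", "A")
```

Documents that never mention "allergy" cost nothing for that field, which is the contrast with a wide table full of mostly empty columns.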
That brings me to the other reason why people opt for NoSQL. Even if the data is relatively simple, if you have many data sources and they change all the time, it's hard to keep a relational schema up to date. This is the problem financial institutions have with reference data. They get feeds from a variety of sources, and these feeds change from time to time with no notice. They also have to track metadata like the provenance of each piece of data, which complicates the problem. So if you're trying to combine lots of different data sources that are changing all the time, that's another good use case for NoSQL.
Hey ilsaloving, if you have an application where you need to parse your JSON to extract only the entries that satisfy some criteria (in other words query your JSON with arbitrarily complex queries), and you need to do it fast, you should check out MarkLogic. We can do that.
Hi Anonymous. As I mentioned in a previous post, I run Engineering at MarkLogic. I'm sorry to hear about the difficulties you had. Here's the link to our admin API, which allows you to script backups: http://docs.marklogic.com/admin/database. Hope this helps. If you have other questions, feel free to contact Support, which is available 24x7 to assist our customers.
It's also worth pointing out (since we're talking about backups) that in order to back up a database, you need to be able to get a consistent view of the data at a point in time and hold that consistency for the duration of the backup. For databases that don't provide strong consistency, you usually have to quiesce the database before backing it up. That's another advantage of ACID transactions, even for applications that aren't OLTP-type applications.
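Here's a minimal Python sketch of what "a consistent view at a point in time" can look like when a store keeps multiple versions of each value. This is a generic illustration, not how MarkLogic implements backups; `VersionedStore` and its methods are hypothetical, and deletes are left out to keep it short.

```python
class VersionedStore:
    """Hypothetical multi-version store: writes append versions, never overwrite."""

    def __init__(self):
        self.clock = 0
        # key -> list of (commit_timestamp, value) in ascending timestamp order
        self.versions = {}

    def write(self, key, value):
        """Commit a new version and return its timestamp."""
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))
        return self.clock

    def read_as_of(self, key, ts):
        """Return the newest value committed at or before ts, or None."""
        result = None
        for commit_ts, value in self.versions.get(key, []):
            if commit_ts > ts:
                break
            result = value
        return result

    def backup(self, ts):
        """Copy the database exactly as it looked at timestamp ts."""
        snapshot = {}
        for key in self.versions:
            value = self.read_as_of(key, ts)
            if value is not None:
                snapshot[key] = value
        return snapshot


store = VersionedStore()
store.write("acct:1", 100)
snapshot_ts = store.write("acct:2", 50)  # the backup's point in time

# Writes keep arriving while the backup runs; no need to quiesce.
store.write("acct:1", 70)
store.write("acct:3", 30)

backup_copy = store.backup(snapshot_ts)
```

The backup ignores everything committed after `snapshot_ts`, so it stays consistent even though writers were never paused.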
Hi Chris. XML and JSON are data serialization formats. It is true that they are often used for interchange (especially JSON), but for data that is best modeled as denormalized documents, it makes sense to store it in a denormalized form (versus shredding it into a complex relational schema that requires a lot of joins to re-combine it when you want to use it again). XQuery is not a data interchange or serialization format. It's a query language for content conforming to the XML data model.
If your source data is XML, using a database like MarkLogic that can work well with XML is actually like using a screwdriver as a screwdriver. Using an RDBMS to do this is more like the hammer metaphor.
Just to be clear, MarkLogic doesn't store XML on disk. We store a tokenized, compressed binary representation of a hierarchical document structure. That could be XML or JSON (or just flat text). We tokenize it on ingest as part of our indexing strategy, so compression adds no overhead beyond what indexing the document already costs. We index both full text and structure, so we can answer database-style queries combined with full-text queries. Most queries can be resolved out of indexes, resulting in sub-second response times (in the healthcare.gov case, MarkLogic's response time is less than 1/4 second for 99.9% of queries). There is some overhead in indexing, which you recapture many times over on the query end by being able to do complex queries quickly.
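As a rough illustration of answering a combined structure-plus-full-text query out of indexes, here's a small Python sketch. It is a generic inverted-index toy under my own assumptions, not MarkLogic's on-disk format or query engine; `CombinedIndex`, the path notation, and the sample documents are invented for the example, and it assumes leaf values are strings or numbers.

```python
import re
from collections import defaultdict


class CombinedIndex:
    """Hypothetical index over both words and (path, value) pairs."""

    def __init__(self):
        self.words = defaultdict(set)   # word -> document ids
        self.values = defaultdict(set)  # (path, value) -> document ids

    def ingest(self, doc_id, doc, path=""):
        """Walk a nested dict, indexing structure and tokenized text together."""
        for key, value in doc.items():
            child = f"{path}/{key}"
            if isinstance(value, dict):
                self.ingest(doc_id, value, child)
                continue
            self.values[(child, value)].add(doc_id)
            if isinstance(value, str):
                for word in re.findall(r"\w+", value.lower()):
                    self.words[word].add(doc_id)

    def search(self, word, path, value):
        """Intersect the two posting sets; no document is scanned at query time."""
        text_hits = self.words.get(word.lower(), set())
        value_hits = self.values.get((path, value), set())
        return sorted(text_hits & value_hits)


index = CombinedIndex()
index.ingest("d1", {"patient": {"state": "CA"}, "note": "Persistent cough"})
index.ingest("d2", {"patient": {"state": "NY"}, "note": "Fever and chills"})
index.ingest("d3", {"patient": {"state": "CA"}, "note": "High fever overnight"})

hits = index.search("fever", "/patient/state", "CA")
```

All the work happens at ingest; the query itself is a set intersection, which is the trade-off between indexing overhead and query speed described above.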
Yes, this is definitely a complex project. This is exactly the type of thing you want to use something like MarkLogic for. Combining complex data from a variety of sources and making it queryable in real time is what we do (I work for MarkLogic).
Hello Ganjadude. I run Engineering at MarkLogic, so I'm quite familiar with the technology, and I'd be happy to answer any questions about what it is and how it works. Here's a start:
MarkLogic (the company) is a privately held company based in Silicon Valley, backed by Sequoia Capital, Tenaya Capital, and Northgate Capital. We've been around for over a decade, and have hundreds of customers in production. We like to think we were "NoSQL before NoSQL was cool."
MarkLogic (the product) is an Enterprise-class document-centric NoSQL database. Enterprise-class means that it doesn't throw out the enterprise functionality you need to run mission-critical applications: things like ACID transactions, automatic failover, database replication, and item-level role-based security. Document-centric NoSQL means that it's optimized for storing complex data that contains a mix of traditional values (dates, names, etc.) and unstructured full text in denormalized (document) form. The document-centric model allows it to store hierarchical, sparse, or repeating data easily. MarkLogic is schema-agnostic, which means that entities (documents) in the database do not need to conform to a schema (although if they do, we can take advantage of it to do things like schema validation). This also means that a database can contain a mix of data from different sources in different formats that can all be queried together. This is a key reason why many customers use us. If you want to combine complex data from multiple sources (especially if those sources are changing over time) and query it in real time with sub-second response time, MarkLogic lets you do that.
Like many NoSQL technologies, MarkLogic is built with a scale-out architecture that allows it to scale horizontally. It processes queries in parallel across nodes in the cluster, and it uses a sophisticated indexing scheme that mixes full-text and structured indexing together to provide sub-second response time to complex queries across huge amounts of data. MarkLogic uses MVCC to provide ACID transactions. For transactions that span nodes in a cluster, we use two-phase commit. Unlike many NoSQL technologies, we were designed for enterprise use, so for our customers, data consistency is important.
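For anyone unfamiliar with two-phase commit, here's a bare-bones Python sketch of the general protocol. It is a textbook-style toy, not MarkLogic's implementation: `Participant`, `two_phase_commit`, and the `can_commit` flag are hypothetical, everything is in memory, and a real system also needs durable logs, timeouts, and crash recovery.

```python
class Participant:
    """Hypothetical cluster node that can stage, commit, or abort updates."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # simulates a node that must vote "no"
        self.data = {}
        self.staged = None

    def prepare(self, updates):
        """Phase 1: stage the updates and vote yes or no."""
        if not self.can_commit:
            return False
        self.staged = dict(updates)
        return True

    def commit(self):
        """Phase 2 (success): make the staged updates visible."""
        self.data.update(self.staged)
        self.staged = None

    def abort(self):
        """Phase 2 (failure): discard the staged updates."""
        self.staged = None


def two_phase_commit(work):
    """work is a list of (participant, updates); returns True if committed."""
    prepared = []
    for participant, updates in work:
        if participant.prepare(updates):
            prepared.append(participant)
        else:
            # One "no" vote aborts everyone who already prepared.
            for p in prepared:
                p.abort()
            return False
    for participant in prepared:
        participant.commit()
    return True


node_a = Participant("a")
node_b = Participant("b")
node_c = Participant("c", can_commit=False)

ok = two_phase_commit([(node_a, {"x": 1}), (node_b, {"y": 2})])
failed = two_phase_commit([(node_a, {"x": 99}), (node_c, {"z": 3})])
```

The second transaction leaves `node_a` untouched because `node_c` voted no, which is the all-or-nothing behavior you want for a transaction that spans nodes.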
MarkLogic is not the only technology in use in the healthcare.gov architecture, but we are used in places where it makes sense to take advantage of our ability to integrate data from a variety of sources and query it quickly and consistently. In this particular application MarkLogic is performing well, responding in less than 1/4 second for 99.9% of queries.
MarkLogic is in use in hundreds of Global 1000 companies, and in many applications in public sector, civilian, military, and intelligence. We power the emergency operations network for the FAA. We powered the BBC's live coverage of the 2012 Olympics. We power the operational trade store for major investment banks. If you haven't heard of us, you should check us out. You can read more about us in Gartner's recent operational DBMS Magic Quadrant: http://gtnr.it/Ieh0hq. You can also download and play with MarkLogic at http://developer.marklogic.com/products.
To be honest, I had never even heard of MarkLogic until this whole blowup happened, so I think I will have to look into what it can do.
We'd love to have you check us out. If you want to learn more, you can read about us on our web site at http://www.marklogic.com/what-is-marklogic or in Gartner's recent operational DBMS Magic Quadrant: http://gtnr.it/Ieh0hq. You can also download and play with MarkLogic at http://developer.marklogic.com/products. Let me know if you'd like more info or help.
Best,
-- David
Hey Timmy, I'm curious as to why you would say that NoSQL is ideal for low-volume applications.