New challenges demand new approaches.
For the past 30 years, industry professionals have developed applications using relational databases. When asked, “What database do you use?” they expect to hear industry standards like Oracle or SQL Server. Relational databases rule the world.
So why have people started developing and using alternative database solutions in the last few years? There turn out to be several reasons:
- Data Explosion – The Big Data Movement.
We have witnessed an explosive growth of applications on the web, on mobile devices and on social networks. Correspondingly, enterprises big and small want to capture as much data about users as possible, and use it to predict the consumer’s next move via analytics and other statistical techniques. Storing and processing this amount of data demands new approaches. Let’s see how a typical RDBMS scales:
Companies have to buy bigger servers and more expensive hardware in order to keep up with the increasing amount of data. This has become very costly. At the same time, physical limitations and greater heat dissipation mean that chip manufacturers, such as Intel and AMD, are unable to keep producing ever more powerful processors with higher clock speeds and more cores.
According to Moore’s Law, over the history of computing hardware, the number of transistors on integrated circuits doubles approximately every two years. The period is often quoted as “18 months,” a figure that comes from Intel executive David House, who predicted that chip performance would double every 18 months. This trend has continued unabated for more than half a century. Sources in 2005 expected it to continue until at least 2015 or 2020.
However, the 2010 update to the International Technology Roadmap for Semiconductors projected that growth would slow at the end of 2013, after which transistor counts and densities would double only every three years. This means bigger and faster servers will reach the market at a far slower rate, while demand for speed keeps increasing.
So vertical scaling, buying ever bigger machines, can no longer keep up.
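To put the slowdown described above in numbers, here is a minimal sketch that compares how much transistor counts would grow over a decade under the three doubling periods mentioned (18 months, two years, three years). The function name `growth_factor` and the ten-year horizon are assumptions chosen for illustration only.

```python
def growth_factor(years: float, doubling_period_years: float) -> float:
    """Multiplicative growth after `years` if capacity doubles every `doubling_period_years`."""
    return 2 ** (years / doubling_period_years)


if __name__ == "__main__":
    decade = 10
    for period in (1.5, 2, 3):
        factor = growth_factor(decade, period)
        print(f"doubling every {period} years -> about {factor:.1f}x in {decade} years")
```

Under these assumptions, a two-year doubling period gives a 32-fold increase per decade, while a three-year period gives only about a 10-fold increase.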
To satisfy market demands, a new crop of databases that use different approaches to store and manage data has been invented. All have one common factor – horizontal scalability.
Practically all are distributed systems, designed to run on huge clusters of commodity hardware. If you need to store and process more data, you simply add more machines to the cluster.
Each node stores and processes a portion of the data, and central controllers manage the whole cluster. This type of architecture is called sharding. It allows the database to distribute load across multiple machines, as well as harvest the power of many machines and processors.
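The idea of each node holding a portion of the data can be sketched in a few lines. This is a toy illustration, not how any particular database does it: the class `ShardedStore`, the node names, and the choice of hash-modulo routing are all assumptions made for the example, and each "node" is just an in-memory dictionary.

```python
import hashlib


class ShardedStore:
    """Toy key-value store that spreads keys over a fixed set of nodes."""

    def __init__(self, node_names):
        self.node_names = list(node_names)
        # Each "node" is only an in-memory dict in this sketch.
        self.nodes = {name: {} for name in self.node_names}

    def node_for(self, key: str) -> str:
        # Use a stable hash (not Python's per-process salted hash())
        # so the same key is always routed to the same node.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.node_names)
        return self.node_names[index]

    def put(self, key: str, value) -> None:
        self.nodes[self.node_for(key)][key] = value

    def get(self, key: str):
        return self.nodes[self.node_for(key)].get(key)


if __name__ == "__main__":
    store = ShardedStore(["node-a", "node-b", "node-c"])
    for i in range(1000):
        store.put(f"user:{i}", {"id": i})
    for name, data in store.nodes.items():
        print(name, "holds", len(data), "keys")
```

Here the router object plays the part of the controller: it decides which node owns a key, and the nodes themselves only see their own share. Simple modulo routing reshuffles most keys when the number of nodes changes, so real systems tend to use more elaborate placement schemes.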
- Performance Demands – Low latency access.
Today, users expect everything to work instantly; three-second delays, for example, are considered intolerable for a page load. Relational databases cannot deliver that speed once they grow beyond a certain size because of hardware limitations, bottlenecks in drive I/O, and the way they store and access the data. Additionally, joins across large tables are notoriously expensive.
In contrast to an RDBMS, most NoSQL solutions do not have joins. They store data either as embedded documents, as MongoDB does, or in columns, as HBase does. This allows much faster data access.
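The difference between joining and embedding can be shown with plain Python data structures. This is only a sketch: the "tables", field names and helper functions below are invented for the example and do not use any database driver.

```python
# Relational-style layout: two "tables" linked by a foreign key.
users = [{"id": 1, "name": "Ada"}]
orders = [
    {"id": 10, "user_id": 1, "item": "keyboard"},
    {"id": 11, "user_id": 1, "item": "monitor"},
]


def orders_for_user_joined(user_id):
    """Emulate a join: find the user row, then scan orders for matching keys."""
    user = next(u for u in users if u["id"] == user_id)
    items = [o["item"] for o in orders if o["user_id"] == user_id]
    return {"name": user["name"], "items": items}


# Document-style layout: the orders are embedded inside the user document,
# so one lookup by key returns everything needed.
user_documents = {
    1: {"name": "Ada", "orders": [{"item": "keyboard"}, {"item": "monitor"}]},
}


def orders_for_user_embedded(user_id):
    """Single lookup: no second collection has to be consulted."""
    doc = user_documents[user_id]
    return {"name": doc["name"], "items": [o["item"] for o in doc["orders"]]}


if __name__ == "__main__":
    print(orders_for_user_joined(1))
    print(orders_for_user_embedded(1))
```

Both functions return the same result, but the embedded version touches one record instead of combining two collections. The price is that embedded data may be duplicated and has to be kept in sync by the application.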
According to Wikipedia, “the object-relational impedance mismatch is a set of conceptual and technical difficulties that are often encountered when a relational database management system (RDBMS) is being used by a program written in an object-oriented programming language or style; this is particularly apparent when objects or class definitions are mapped in a straightforward way to database tables or relational schemata.”
To address this difficulty, various design patterns and tools were developed, including POJOs, DAOs, Hibernate, and Active Record, to name a few.
Nevertheless, while they hide the impedance mismatch from developers, they add substantial complexity to the systems and sometimes impact performance. NoSQL databases, by contrast, can store and retrieve objects represented in a more native format, such as JSON documents or simple maps.
This allows developers to create solutions considerably faster, with no need to read 500 pages of “Hibernate in Action,” and to deliver better-quality products to market sooner.
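As a rough illustration of storing objects in a more native format, the sketch below turns a nested object straight into a JSON document and back, with no mapping layer in between. The `Customer` and `Address` classes and the two helper functions are hypothetical names made up for this example.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Address:
    city: str
    country: str


@dataclass
class Customer:
    name: str
    addresses: list = field(default_factory=list)


def to_document(customer: Customer) -> str:
    # asdict() recurses into nested dataclasses, so the whole object graph
    # becomes one JSON document instead of rows in several tables.
    return json.dumps(asdict(customer))


def from_document(raw: str) -> Customer:
    data = json.loads(raw)
    return Customer(
        name=data["name"],
        addresses=[Address(**a) for a in data["addresses"]],
    )


if __name__ == "__main__":
    ada = Customer("Ada", [Address("London", "UK")])
    raw = to_document(ada)
    print(raw)
    print(from_document(raw) == ada)
```

A document store would persist a string like `raw` under a key or id; the object's shape in memory and its shape in storage stay essentially the same.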
- Shifting Deployment Patterns – Cloud computing.
With the proliferation of cloud solutions (such as Infrastructure as a Service and Platform as a Service), it is much easier and cheaper to deploy data storage and applications to a cluster of small machines.
Offerings like Amazon Web Services and Microsoft Azure make it much easier and financially feasible to store and process huge amounts of data by acquiring a set of small virtual machines. Companies can even provision additional resources in a matter of minutes to distribute database load during peak times and then shrink the cluster to save money. The distributed database model of NoSQL fits this approach very well. That is why we see hosted NoSQL database offerings from companies like Amazon (DynamoDB), Heroku, and CloudBees, to name a few.
This allows small companies to explore and build solutions without huge upfront investment in databases.
- Different Databases for Different Problems
Previously, we had one database technology for every problem. Whatever the task, the only solution was an RDBMS.
Now, we have choices that help us deliver solutions faster and make previously impossible tasks possible. For example, MongoDB provides easy programmability, a query interface, high availability with automated failover, and automated sharding capabilities. It allows for a smooth transition to NoSQL data stores from the RDBMS model, with the inclusion of familiar concepts, such as the ability to define indexes.
Graph databases store information as arbitrarily interconnected nodes linked by named relations, rather than as tables and joins. Schema-less and highly extensible, they are an excellent choice for modeling semi-structured data in complex domains. Neo4j is the front-runner in the space – both its REST API and its Cypher query language support simple and fast storage and traversal of graphs.
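To make "nodes linked by named relations" concrete, here is a small in-memory sketch in plain Python. It is not Neo4j's API or Cypher; the `Graph` class, its methods, and the sample names are assumptions made purely for illustration.

```python
from collections import deque


class Graph:
    """Toy graph: nodes connected by directed, named relations."""

    def __init__(self):
        # node -> list of (relation_name, target_node)
        self.edges = {}

    def relate(self, source, relation, target):
        self.edges.setdefault(source, []).append((relation, target))
        self.edges.setdefault(target, [])

    def reachable(self, start, relation, max_depth):
        """Breadth-first traversal that follows only edges with the given name."""
        seen = {start}
        found = []
        queue = deque([(start, 0)])
        while queue:
            node, depth = queue.popleft()
            if depth == max_depth:
                continue  # do not expand beyond the requested depth
            for rel, target in self.edges.get(node, []):
                if rel == relation and target not in seen:
                    seen.add(target)
                    found.append(target)
                    queue.append((target, depth + 1))
        return found


if __name__ == "__main__":
    g = Graph()
    g.relate("alice", "KNOWS", "bob")
    g.relate("bob", "KNOWS", "carol")
    g.relate("carol", "KNOWS", "dave")
    g.relate("alice", "WORKS_AT", "acme")
    print(g.reachable("alice", "KNOWS", max_depth=2))  # ['bob', 'carol']
```

A "friends of friends" question becomes a short walk along edges rather than a self-join on a relationship table, which is the kind of query this data model is meant for.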
Riak and Redis are distributed key-value stores that are schema-less and data-type agnostic. They can be put to good use in write-heavy projects to store data such as sessions, shopping carts and streaming logs. Riak additionally retains the ability to perform complex queries through full-text search. Its distributed cluster can self-recover without a single master, has tunable consistency and availability settings and can do conflict detection and resolution if needed – all of which can be particularly helpful in high availability environments.
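One common way to think about tunable consistency and availability settings is in terms of replica quorums. The sketch below assumes a store that keeps `n` copies of each value, waits for `r` replicas on a read and `w` replicas on a write. These parameter names follow a widespread convention but are assumptions here, and the function `quorum_properties` is hypothetical.

```python
def quorum_properties(n: int, r: int, w: int) -> dict:
    """Summarise the trade-offs of a replica configuration (n copies, r reads, w writes)."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return {
        # With strict quorums, r + w > n means every read set shares at least
        # one replica with every write set.
        "reads_overlap_writes": r + w > n,
        # 2 * w > n means any two write sets share at least one replica.
        "writes_overlap_writes": 2 * w > n,
        # How many replicas may be unavailable before reads or writes fail.
        "read_tolerates_failures": n - r,
        "write_tolerates_failures": n - w,
    }


if __name__ == "__main__":
    for n, r, w in [(3, 2, 2), (3, 1, 1), (3, 1, 3)]:
        print((n, r, w), quorum_properties(n, r, w))
```

Lower values of `r` and `w` favour availability and latency; higher values favour consistency. That is the dial such settings expose.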
Couchbase is a persistent cache with auto-sharding features, master-less clusters and replicated data to avoid cache misses. Because it supports the Memcached protocol, it can serve as a drop-in replacement in Memcached-based systems.
Cassandra stores data in columns, as opposed to the record storage of an RDBMS, allowing very fast access patterns. Cassandra’s ColumnFamily data model offers the convenience of column indexes with the performance of log-structured updates, strong support for materialized views, and powerful built-in caching.
Corporations, from multinational giants to tiny startups, are using different types of NoSQL storage to solve their problems. The database world is no longer a binary choice between Oracle and SQL Server; it now offers a multicolored framework of solutions. We live in exciting times, when we can paint our solution using the full palette of rich, evocative colors.