NoSQL is complementary to SQL (not NoSQL versus SQL)steemCreated with Sketch.

in nosql •  7 years ago  (edited)


Normalization and De-Normalization

Relational databases, through a process called Normalization, help eliminate data duplication (and consequently, errors related to duplication – for e.g. – having to update the same piece of data in multiple places). In addition, they helped define an optimal design for storing and retrieving data through a fully normalized data schema. 

   When I say 'they', I mean mathematicians, who, are, arguably, the most rigorous group of scientists. I once had a math professor spend 15 minutes rigorously proving that 0+1 = 1+0 = 1. When I asked my math professor why we went to the trouble of proving something obvious, he gave me an example (from group theory), where this wasn't true! 

   SQL essentially resulted from advanced mathematics (SET theory) and solving the problem of optimizing the storage and retrieval of data in the most efficient manner possible.

   In other words, there is NOT a more efficient way to handle both storage and retrieval, in general, than that provided by SQL. It isn't mathematically possible.  

    I used the term 'in general' in the statement above. And, in general, the statement above is true. However, there are certain specific cases where SQL may incur more overhead than necessary.For e.g. - a query that says find me all the comments on this tweet . All you care about is a particular tweet and all the comments on it. You are not concerned about any other tweet or any other aspect of the current tweet (such as how many people shared it).   

   If what you are after is THAT specific, that you do not care about anything else, then there are more efficient ways to store your data (and retrieve it).Instead of creating complex relationships between Tweets, Users, Comments and other entities, all you do is store a tweet and it's comments in ONE PLACE and one place only. 

   Instead of spanning multiple records across multiple tables, all your comments are available in one table, in one record (in a sense).It should be obvious that retrieving just ONE record from one place is a lot faster than performing a JOIN across tables and records to return the same result.

And that, in effect, is what denormalization (NoSQL) accomplishes; it reworks data storage in such a way that certain (and only certain) queries are fully optimized to return in the shortest time possible.

Sounds great, but as you guessed, it only works for specific cases. Going back to the same twitter example - remember, all tweets are associated with YOU (say John Doe) and each tweet has a set of comments associated with it. If you were to change your name (say John MiddleName Doe), guess what? 

  There is no way to retain all the tweets that belonged to you along with all their comments. Why? Because you never stored the relationship between your name and the tweets in the first place! If you HAD stored those relationships, you would be back to Normalization and Relational Data!

Use Cases around Updates and Transactional Updates of Data

So, denormalization speeds up certain queries, but suffers if you have to perform updates (for e.g. updating the name of the tweeter , in the example above). In fact, if your use cases consist of lots and lots of updates, NoSQL is not only inefficient, it is much more error prone. 

   Think back to why SQL was invented; to eliminate errors due to duplication through solid relationships between data. If you remove the relationships, you remove the data consistency.ACID Transactions are only available in SQL Databases. Say your ecommerce site just submitted an order and you automatically decremented the available inventory. 

    Those two operations are part of a single transaction. Suppose, they were done separately - and the inventory decrement failed? Well - you would end up showing an incorrect inventory, and maybe have another order come in, when it shouldn't. This is why those operations need to be grouped together into a single unit, which either fails together or succeeds in it's entirety.

    Relational Databases were designed address this ACID transactionality issue, even across multiple nodes. 

NoSQL database do not guarantee ACIDity of your transactions. Eventual Consistency, which is what NoSQL provides in lieu of ACID transactions, is not enough to meet the needs of most, highly transactional, backend systems. Which is where NoSQL is the real deal breaker for most production workloads today.However, if all you have are simple relationships (such as key-value pairs), this becomes less of an issue (as discussed in the 'Use Cases for NoSQL' section below). DNS Servers are a good example of 'eventual consistency'; but an e-commerce system simply cannot rely on anything except 'immediate' consistency.

Use Case of Primarily Reading of Data

Conversely, if your app queries are primarily READS, then, traversing a set of SQL relationships, just to read a single record of user data, seems like a lot of overhead. Here NoSQL wins hands down; by storing all the relevant data in ONE place, instead of across records spanning multiple tables.

But how can NoSQL work without defining any relationships?

In SQL, Foreign Keys help define relationships between entities. In NoSQL , while you eliminate the need for Foreign Keys, you can an do define relationships between data. These relationships are a lot simpler than the FOREIGN KEYS of SQL, and are thus, very limited in their capabilities. Mostly, these are simple Parent-Child based relationships.Back to the twitter example - say you want to associate (relate) UserComments with a Tweet

  You would define Comments as a child of Tweet - this would constitute a Parent-Child relationship. Now, you could run queries (searches is more appropriate) against this parent child COLLECTION of Tweet-Comments objects, and it should outperform a SQL based query of the same type.

But can't you de-normalize data in regular SQL as well?

That is true. DeNormalization existed long before NoSQL came around. A materialized view, in SQL, can serve you denormalized data from a normalized data store. So you can have your cake (fast reads) and eat it too (data integrity).Where NoSQL distinguishes itself is by offering a simpler approach to scaling when the volume of data gets humongous (think, 1 million comments on the tweet above). One essentially creates a cluster of 'nodes', each node being able to store a few thousand comments. 

  This ability to store (and retrieve) data across a large cluster, in relatively short times, is what gives NoSQL it's edge.In effect, the nodes can be related to each other because there isn't the requirement that the data be consistent (there is no ACID requirement). In SQL, even though you can create the same cluster of nodes, due to the ACID requirement that all nodes must satisfy, their performance does not match up to that of a simple NoSQL cluster.

So, what are the use cases that fit NoSQL well?

  1. Simple Relationships, Big Volumes of data - Think key-value pairs (or Hash Maps). Anything that can be stored as key value pairs, is a good candidate for NoSQL. Web Sessions, Comments on a post, cached objects are all key-value friendly use cases. Things that need to be sorted, as a counterexample, are not. Things that need to be looked up in reverse - e.g. Value to Key - are also not ideal for key-value pairs storage. However, there is a large universe of applications that can be helped with simple key-value pairs, stored and retrieved at lightning fast speeds.
  2. Low Latency, High Throughput - This goes hand-in-hand with the key-value, simple relationships point above. Data retrievals that require low latency and has high I/O requirements, are good candidates for NoSQL. Think of the millions of sessions that amazon.com experiences every second, or the thousands of comments on a tweet that the POTUS writes.
  • Flexible schema and No-Schema (Schemaless). As opposed to traditional row and column oriented data structures, NoSQL supports document-oriented, key-value oriented, column-oriented and even graph oriented. This is extremely powerful, as one does not need to introduce new data storage paradigms to support say, graphical data (like a complex tree graph). In addition, document oriented stores essentially remove the need for a schema - as your entire document, with all it's fields, can be stored and searched through.
  • Parallel computing. MapReduce is built into several NoSQL products, making it easy to exploit parallel computing alongside the node clustering storage that NoSQL provides.

What are the use cases that will not fit NoSQL well?

  1. Complex data with relationships which are updated through transactions (multiple operations part of single transactional unit). ACID Transactions are only available in SQL Databases. Say your ecommerce site just submitted an order - and you automatically decremented the available inventory. Those two operations are part of a single transaction. Suppose, they were done separately - and the inventory decrement failed? Well - you would end up showing an incorrect inventory, and maybe have another order come in, when it shouldn't. Relational Databases address the ACID transactionality issue, NoSQL ones do not. Eventual Consistency, which is what NoSQL provides, is not enough to meet the needs of most, highly transactional, backend systems.
  2. Pre-Existing relational data (Don't let anyone tell you they can take all your relational data to a NoSQL store seamlessly)
  3. BLOB data - Again, NoSQL is not ideal for large BLOB types of data, which cannot be handled in a low latency manner.
  4. Writes and Updates across multiple entities are also better served by SQL databases.

Who is using NoSQL? Are they relying completely on NoSQL?

Some of Facebook's core features (messaging, monitoring) utilize NoSQL (HBase) as their backend. Hadoop/Hive is used for data analytics at FB. Linked In and Google are other heavy users of NoSQL.It is important to realize that all these companies use traditional SQL databases as their primary backend.  

   Usually, these are MySQL variants or traditional Oracle RACs, SQL Server Always On Groups etc. NoSQL is used ALONGSIDE traditional SQL, in addition to relational databases. I am not aware of any large company that was built entirely on NoSQL.SUMMARYSQL schema design is basically antithetical to the data needs of web applications. Most web app use cases involve data in key value pairs. This was why property bags, a poor man's key value pair storage using objects, became popular in languages such as C# and Java.

    This is also why ORM layers were needed, to reduce the impedance mismatch between the OO business objects and the underlying relational schema.Moving away from SQL was a natural outcome of these impedance mismatches with the data requirements of typical web applications. NoSQL was a natural outcome of this movement. In addition to providing simplicity of data definition and storage, it added horizontal scalability to tackle huge volumes of data. 

  The advent of facebook, twitter and similar high data volume applications turned out to be textbook use-cases for NoSQL. And it's success and importance should not be underemphasized.However, the important thing to understand is that there is trade-off made whenever we de-normalize data to achieve better performing searches. Some important trade-offs are:

  1. Poorer performance on UPDATES and INSERTS than traditional SQL would provide, unless your data relationships are extremely simple (key value pairs).
  2. Scalability at the loss of data integrity. NoSQL data is spread out across shards (cluster nodes) - transactions based around these nodes are not guaranteed to be ACID, unlike transactions in the SQL world.
  3. Poorer performance on storing and retrieving complex data (multiple relationships).

On the other hand, the positives of NoSQL include:

  1. BigData - NoSQL's biggest strength is it's ability to utilize a cluster of nodes to scale almost seamlessly. Think of twitter, with it's needs to store 10 TB of data PER DAY. Obviously, this cannot go on a single node, and needs the ability to rapidly spin up new storage nodes,
  2. No Single point of Failure - While not usually touted, this is a hidden strength of the NoSQL cluster. Should one node fail, there are several nodes available to take on the volumes of incoming data.
  3. Parallel Computing - with MapReduce support, is built into most NoSQL products.

NoSQL should not be thought of as a competing technology against SQL. Rather, it is a complimentary technology, providing advantages over traditional SQL Storage. It was never meant to replace SQL. It has never been about NoSQL versus SQL; rather NoSQL alongside SQL. 

Authors get paid when people like you upvote their post.
If you enjoyed what you read here, create your account today and start earning FREE STEEM!