Databases: To SQL or Not to SQL ???
Similarities | Differences | Use Cases | Joins | Scaling
Similarities
- Both SQL and NoSQL are interfaces/abstractions to build and manage databases. Each uses a programming language to query and modify data (e.g: MySQL, PostgreSQL — SQL and MongoDB, REDIS — NoSQL)
- Both patterns use a data structure blueprint called schema
- They share some data types although SQL is a lot more comprehensive. and there are subtle differences. For example an email attribute is stored as a string of characters in both cases. A simple decimal number is stored as an integer. However a date type in mongo db is stored as 64-bit integer vs in SQL databases it is stored as a string.
- Indexing works much the same way in both by storing a subset of data in a certain order so that data lookups and sometimes sorting is quick and efficient. If indexing did not exist, you would essentially have to scan every document in a collection/table. This is encapsulated by the key constraint.
Where they differ
- Relational vs Non-Relational. This is the biggest difference. SQL is build around the concept of the relation between two entities. Let us take the example of a social media user and posts. In a real world scenario we need to associate a post with a particular user so that we know who posted “My new puppy is cute”. In SQL there are standard ways to represent different types of relations such one-to-one, one-to-many, many-to-many. We can also establish a relation in NoSQL, for example using aggregation in MongoDB but it is not as performant as we will observe below.
- Normalized vs Denormalized. Relational databases such as SQL due to their normalized nature can use a JOIN clause which allows you to combine rows from two or more tables using a common field (e.g id => user_id) between them. Document-oriented databases such as MongoDB are designed to store de-normalized data, which can have different formats. But using a ORM you can achieve some data consistency more easily and also as of MongoDB 3.2 a new
$lookup
operator was introduced which can perform a LEFT-OUTER-JOIN-like operation on two or more collections. This however is only permitted in aggregation operations which basically compute results in memory and run slower. - Constraints. SQL databases enforce a number of constraints varying from key constraints to guarantee uniqueness to foreign keys for referencial integrity and not null to ensure data presence. The later two are not enforced in NoSQL databases. There are ways to maintain a relationship between two tables either at the application level or having a data layer abstraction such as an ORM but joins are typically more expensive as we will discuss.
- Nomenclature. SQL data is stored in entities called tables which contain rows of data with a uniform structure, tables have a predefined schema and all the CRUD operations are done in accordance with it. NoSQL on the other hand is comprised of collections which also follows a certain schema but it doesn’t have to adhere to it, hence it is a flexible schema although as mentioned above using a ORM you can enforce data consistency since incorrectly formatted documents will not be allowed to be saved into the DB.
- Transactions and ACID properties: Relational databases support transactions, which are sets of related operations that either succeed or fail as a whole. This feature ensures the ACID properties (Atomicity, Consistency, Isolation, and Durability) are enforced, guaranteeing data consistency and integrity. This is much harder in NoSQL due to the asynchronous nature of updates and horizontal scaling. So the C part in “ACID” more closely resembles “Eventually Consistent” and Atomicity is not always guaranteed unless you use databases migrations or a similar mechanism to perform updates.
__________________________________________________________
SQL Joins
SQL Server creates a new table (not physical) by combining rows that have matching values in two or more tables. This join is based on a logical relationship (or a common field) between the tables and is used to retrieve data that appears in both tables.
SELECT Column_List
FROM TABLE1
INNER JOIN TABLE2
ON Table1.ColName = Table2.ColName(e.g: inner join on customer id between orders and customers)
SELECT Orders.OrderID, Customers.CustomerName, Orders.OrderDate
FROM Orders
INNER JOIN Customers
ON Orders.CustomerID=Customers.CustomerID;
NoSQL Joins
This is possible using MongoDB’s aggregation framework, which is modeled on the concept of data processing pipelines. Documents enter a multi-stage pipeline that transforms the documents into an aggregated result. For example:
db.orders.aggregate([
{ $match: { status: "A" } },
{ $group: { _id: "$cust_id", total: { $sum: "$amount" } } }
])First Stage: The $match stage filters the documents by the status field and passes to the next stage those documents that have status equal to "A".Second Stage: The $group stage groups the documents by the cust_id field to calculate the sum of the amount for each unique cust_id.The most basic pipeline stages provide filters that operate like queries and document transformations that modify the form of the output document.* The pipeline provides efficient data aggregation using native operations within MongoDB, and is the preferred method for data aggregation in MongoDB.
Finally if need to work with mongo db i highly recommend this article to incorporate mongoose into your project. It as an excellent ORM library that creates an abstraction layer between the database and the backend.
Scaling
Vertical scaling, referred to as scale-up, is the process of adding more power (CPU, RAM, etc.) to your servers. Horizontal scaling, referred to as scale-out, allows you to scale by adding more servers into your pool of resources.
When traffic is low, vertical scaling is a great option, and the simplicity of vertical scaling is its main advantage. This simplicity makes it ideal for financial application or healthcare for example when data needs to be highly reliable. Unfortunately, it comes with serious limitations.
- Vertical scaling has a hard physical limit. It is impossible to add unlimited CPU and memory to a single server.
- Also it does handle failover and lacks redundancy. If one server goes down, the website/app goes down with it completely for that particular server. We will explore this a little more in depth.
Hence horizontal scaling is more desirable for large scale applications. But its implementation for SQL Databases is a lot more cumbersome.
MySQL databases can actually scale horizontally by a technique called sharding. We are basically splitting or partitioning a table to be stored across multiple servers. These smaller tables make up the shards. The databases in your cluster each store one of the shards. Together, the databases in the cluster constitute your full data set. When working with large data sets and high throughput, sharding helps decrease the load on a single server and enables scaling through the addition or removal of servers, depending on the need. But while it is possible does not mean it is the right solution. Since the data lives on different data servers performing a join across shard comes at a cost to performance at larger scales. It also complicates application logic or introduce coupling with the data layer.
MongoDB and other NoSQL databases, on the other hand, are designed to accommodate sharding at a structural level. A shard is a subset of data and MongoDB lets us horizontally scale by deploying shards as replica sets. Replica sets are clusters of at least three nodes with redundant copies of the same data. They provide availability and redundancy when spread across large environments and are not confined by a predetermined schema. While they lack the rigidly enforced consistency of ACID compliant SQL databases, MongoDB and other NoSQL databases excel at availability and partition tolerance.
Ok, so when do you use each?
- Small databases, a few million records or less with structured data: SQL
- Large databases with unstructured data: NoSQL
- Large databases with rigid structured data (financial transactions): SQL
Of course there will always be counterpoints to that general guideline. You can scale MySQL horizontally, and MongoDB started supporting Multi-Document ACID Transactions. The more we understand how these databases are designed, we gain the insight to pick the best tool for the job.
There is a third category which we did not about that which don’t require disk storage at all. These group is:
3. In-Memory Databases
These are super fast as everything’s stored in memory. They are typically used for fast lookups of data that is meant to be temporary (e.g: Once a task has completed). Examples include Redis, Memcached. They typically have a very simple interface that lets you get and set data based on a key-value relationship similar to a map data structure.