The choice of a database can significantly influence the performance, functionality, and scalability of your application. In another guide, we explored the differences between SQL and NoSQL databases.
This guide provides a deep dive into popular database options, specifically SQL databases like MySQL and PostgreSQL, as well as NoSQL options like MongoDB.
When choosing a database for your project, there are a few factors to consider:
- Supported Data Types: Databases support a variety of data types, which specify which type of data can be stored and manipulated within a given structure. This can include integers (whole numbers), strings (text), blobs (binary data), and complex types like JSON or XML. Choosing a database that can handle the kind of data you need is essential.
- Indexing: Indexing enhances database performance by optimizing the speed at which data can be retrieved. An index provides a pointer to data in a table, enabling faster data lookups and retrievals.
- Concurrency Control: As many applications have multiple users accessing data simultaneously, how a database manages simultaneous data access becomes crucial. Concurrency control mechanisms like locking (which restricts multiple users from modifying the same data simultaneously) or optimistic concurrency control (which allows simultaneous data access but checks for conflicts before committing changes) can help ensure smooth data operations.
- Scalability: As your user base or data grows, the database should be able to handle the increased load. With horizontal scaling, we can add more machines to a network to manage increased demand. Vertical scalability involves adding more power to an existing machine.
- Primary-Replica Configuration: For enhanced data availability, databases use replication. In this process, data from one database (primary) is copied to another (replica). This ensures that if the primary server goes down, the replica can take over, ensuring high availability.
- ACID Compliance: ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties ensuring that database transactions are processed reliably. It's especially important for systems where consistent and accurate data is crucial, like in financial or inventory management systems.
Let’s take a look at some primary differences between MySQL, PostgreSQL, and MongoDB.
MySQL offers a comprehensive set of standard SQL data types, like INTEGER
, FLOAT
, VARCHAR
and TEXT
. This makes it flexible for applications that require a wide range of basic data types, from numbers to text to timestamps.
-- Creating a table named "users"CREATE TABLE users( id INT AUTO_INCREMENT PRIMARY KEY, name VARCHAR(50), email VARCHAR(100) UNIQUE);
-- Inserting a record into the "users" table
MySQL uses several index types.
- Primary Key: A unique index where all key column values must be unique.
- Unique Key: Similar to a primary key but allows for one
NULL
value. - Full-Text: Allows text searches.
- Spatial Index: Useful for spatial data types.
CREATE INDEX idx_username ON users (username);
MySQL uses Multi-Version Concurrency Control (MVCC). MVCC allows multiple "versions" of a data record, ensuring smooth simultaneous data operations without conflicts.
MySQL was designed primarily for vertical scaling, enhancing a single server's capability for better performance.
MySQL primarily utilizes a primary-replica replication model. In this model, one server (the primary) handles write operations, while one or more replicas replay these operations to ensure data consistency with the primary. MySQL is well-suited for read-heavy workloads. However, if the primary server fails, it can introduce a single point of failure.
MySQL is fully ACID-compliant, guaranteeing reliable data transaction processing.
Beyond standard SQL data types, PostgreSQL supports a variety of unique and specialized types such as HSTORE
for key-value stores, JSON
for JSON data, and geometric and network address types. This makes PostgreSQL a good choice for applications requiring complex or flexible data structures.
-- Creating a table named "users"CREATE TABLE users( id SERIAL PRIMARY KEY, name VARCHAR(50), email VARCHAR(100) UNIQUE, profile_data JSON, location POINT, ip_address INET);
-- Inserting a record into the "users" tableINSERT INTO users(name, email, profile_data, location, ip_address) VALUES ('John Doe', '[email protected]', '{"age": 30, "address": {"city": "New York", "state": "NY"}}', POINT(40.730610, -73.935242), '192.168.1.1');
PostgreSQL offers more advanced indexing techniques, including partial, expression-based, and materialized views, which can be indexed directly.
Partial Indexing
This type of index is built over a subset of rows in a table, usually defined by a WHERE
clause. For example, if you have a table of orders and frequently query for pending orders, you could use a partial index on orders where the status is "pending"
. This would make that particular query much faster, and the index would consume less disk space because it only applies to a subset of the table.
CREATE INDEX idx_partial_orders ON orders (order_id) WHERE status = 'pending';
Expression-based Indexing
You can create indexes not just on a column, but also on the result of an expression or function. This can speed up queries that use that exact expression or function.
For example, if you often query a table to find the length of a text field, you could create an index based on the LENGTH
function to speed up those queries.
CREATE INDEX idx_expression_orders ON orders (LENGTH(order_text));
Materialized Views
A VIEW
in a database is a virtual table created by a query that pulls data from one or more existing tables, allowing you to display specific information without physically storing it.
Unlike regular views, materialized views query the data once and then save the results as a table. This can drastically improve performance for complex queries that don't change often but are read frequently. Moreover, you can index materialized views directly, making reads even faster. You also have the option to refresh a materialized view to update the persisted data, ensuring that it stays current with any changes in the underlying data sources.
CREATE MATERIALIZED VIEW mat_view_orders AS SELECT * FROM orders WHERE status = 'pending';CREATE INDEX idx_mat_view_orders ON mat_view_orders (order_id);
The right type of index can significantly boost your application's read performance, so it's worth investing time to understand and implement the most effective indexing strategy for your particular use case.
Similar to MySQL, PostgreSQL uses Multi-Version Concurrency Control (MVCC) which allows for multiple versions of a data record to exist at the same time, which enables non-blocking reads and writes. This is important in environments with high transaction rates.
PostgreSQL is more flexible when it comes to scaling and can be scaled both vertically and horizontally.
For horizontal scaling, PostgreSQL provides built-in partitioning and supports sharding through extensions like Citus.
CREATE TABLE sales ( sale_id serial PRIMARY KEY, sale_date DATE NOT NULL, product_id INT, quantity INT) PARTITION BY RANGE (sale_date);
-- Creating partitions for specific date rangesCREATE TABLE sales_2022 PARTITION OF sales FOR VALUES FROM ('2022-01-01') TO ('2023-01-01');CREATE TABLE sales_2023 PARTITION OF sales FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
PostgreSQL offers a range of replication solutions, including both synchronous and asynchronous options such as log-based and statement-based replication. It also provides various high-availability options like Warm Standby and Hot Standby.
Log-based vs. Statement-based Replication
- Log-based Replication: This is the most common method of replication in PostgreSQL. Changes to the primary servers's data are recorded in the write-ahead log. The replicas then use these logs to replay changes, ensuring they have the same data as the primary.
- Statement-based Replication: Instead of replicating the results of a transaction (like in log-based replication), the SQL statements that caused the changes are sent to the replicas. While it can be efficient for certain workloads, it might cause inconsistencies with non-deterministic functions (e.g.
NOW()
orRANDOM()
), as they can produce different results on the primary and replicas.
Synchronous vs. Asynchronous Replication
- Synchronous Replication: In this mode, a write transaction is only considered complete when the data has been written to the primary and at least one replica. This ensures immediate consistency between the primary and the replica, but it can introduce latency if the write operation has to wait for confirmation from the replica.
- Asynchronous Replication: The primary server doesn't wait for acknowledgments from the replicas. The transaction is considered complete as soon as data is written to the primary, making this method faster. However, there's a potential risk of data loss if the primary crashes before the replicas have caught up.
High-Availability Options
- Warm Standby: The replica server is running and continuously updates using the logs from the primary. However, it doesn't accept connections until it's promoted to be a primary server (often in the event the primary fails). The transition from warm standby to primary is relatively quick, ensuring minimal downtime.
- Hot Standby: This is an enhanced version of the warm standby. The replica is not just receiving the logs and updating its data; it also allows read-only connections. This means users can query the replica server even while it's updating, providing load balancing and reducing the read load on the primary server.
Like MySQL, PostgreSQL is fully ACID-compliant, ensuring that all database transactions are reliably processed. This is important for applications where data integrity is crucial, such as banking or healthcare systems.
MongoDB uses a schema-less model that allows fields to contain data of any supported data type, even within the same document. This can include other documents, arrays, and arrays of documents. This is ideal for applications with unpredictable or flexible data.
// Creating a collection named "users"db.createCollection("users")
// Inserting a document into the "users" collectiondb.users.insertOne({ name: 'John Doe', profile_data: { age: 30, address: { city: 'New York', state: 'NY' } }, location: { coordinates: [-73.935242, 40.730610] }, ip_address: "192.168.1.1"});
Supports various index types including hashed indexes for sharding, text indexes for text search, and geospatial indexes for querying geographical data.
// Create a hashed index on the 'userId' fielddb.users.createIndex({ "userId": "hashed" });
// Create a text index on the 'description' fielddb.articles.createIndex({ "description": "text" });
// Create a 2dsphere index on the 'location' fielddb.places.createIndex({ "location": "2dsphere" });
MongoDB doesn't use MVCC but offers concurrency via horizontal scaling and partitioning, which distributes data across multiple servers. This helps in load balancing and handling more concurrent operations.
This database is built from scratch to support horizontal scaling. They distribute data across multiple servers, making them ideal for managing large data sets and high-velocity data.
MongoDB uses a replica set model to achieve high availability. A replica set is a group of MongoDB servers maintaining the same data set. If the primary server fails, MongoDB provides automated failover to a backup server.
MongoDB offers a flexible approach to balancing performance with data accuracy tailored to the application's needs but doesn't conform to ACID properties by default.
It follows the CAP theorem, which states that distributed systems can maintain only two out of the three core properties simultaneously— Consistency, Availability, or Partition tolerance.
- Write Concerns: MongoDB allows you to specify how many nodes must acknowledge a write operation before considering it successful. This can range from just the primary node to every node in the cluster. By ensuring that a certain number of nodes have acknowledged a write before considering it successful, MongoDB can guarantee data durability to varying degrees. The more nodes that need to acknowledge, the higher the durability (and trade-offs with write speed).
- Read Concerns: MongoDB provides granular control over read operations. You can prioritize data freshness (often reading from the primary node) or high availability, which may involve reading older data from secondary nodes. By allowing users to control where their reads come from, MongoDB lets users balance between reading the most up-to-date data (consistency) and system availability.
- Transactions: MongoDB (v4.0+) supports multi-document transactions, enabling atomic operations across multiple documents and even databases. Multi-document transactions ensure that operations are atomic, meaning all changes in a transaction are committed or none are. If a transaction is interrupted, none of its changes are saved, maintaining data integrity.
- Retryable Writes: If a write fails initially (e.g. network issues), MongoDB tries again to ensure that the write takes place, thereby enhancing the guarantee that once committed, data will remain saved.
- Session Consistency: Within a single client session in MongoDB, reads will reflect previous writes, which guarantees that subsequent reads in the same session will see that write once a write is acknowledged.
While MySQL and PostgreSQL offer robust, reliable, ACID-compliant relational data storage solutions, NoSQL databases like MongoDB provide flexible, horizontally scalable alternatives. The choice between these databases should be based on the specific needs of the application, considering factors such as data integrity, performance, scalability, and the type of the data being handled.