This post briefly covers the history of databases and transaction fundamentals, and then explains what I would change if I saw your database. It is highly opinionated, and based on decades of experience working with relational databases.
History
The term relational database originates with E. F. Codd at IBM, who introduced the relational model in his 1970 paper “A Relational Model of Data for Large Shared Data Banks.” The first commercially available RDBMS was Oracle, released in 1979 by Relational Software, now Oracle Corporation. Currently, the most popular relational databases are Oracle, MySQL, Microsoft SQL Server, PostgreSQL, DB2, SQLite, and Sybase. They all use a variant of SQL. My personal favorite database is PostgreSQL. It is free, open-source, and most faithfully implements ANSI SQL.
Relational Databases Today
The relational database can provide the right persistence solution for most business needs. Relational databases are extraordinarily fast and reliable, they can manage petabyte-scale data, there are plenty of tools written for them, and they are simple to use. Right or wrong, they are also one of the most common enterprise integration patterns.
The Advent of NoSQL
What about NoSQL?
NoSQL solutions provide a viable alternative persistence solution, depending on your needs.
Many NoSQL solutions are distributed by design and thus horizontally scalable. They often compromise consistency (see the CAP theorem) in favor of availability, partition tolerance, and speed.
Examples of the different types of NoSQL solutions are:
- Key Value Store (Riak, Memcached)
- Data-Structure Store (Redis)
- Big Table or Column Store (Cassandra, HBase)
- Document Store (Couchbase, Lucene, MongoDB)
Ironically, most NoSQL solutions have an SQL-like interface.
When to Use an RDBMS
The RDBMS is the right choice for a very wide variety of use cases, such as when:
- Data is inherently relational.
- Data is tabular.
- Data is user-generated (as opposed to machine generated).
- You need a standard interface such as SQL used by business people or multiple applications.
- Your application fits into the 99.9% of applications that can perform and scale perfectly well with an RDBMS.
- Your application truly needs ACID.
Fundamentals
Transactions
A transaction is a group of SQL statements that form a logical unit of work attributed to one connection. Most RDBMSs support transactions.
A transaction has four fundamental properties: Atomicity, Consistency, Isolation, and Durability (ACID).
“Atomic” describes how, after the transaction finishes, the changes that each statement made appear as though they were completed as one statement: either all of them take effect, or none do.
“Consistent” describes how the RDBMS moves from one valid state to another: all constraints still hold after the transaction.
“Isolation” describes the rules for how one transaction can see or affect values changed by another transaction happening at the same time.
“Durable” describes the persistent nature of an RDBMS: once the transaction has been committed, the update will be persisted, even if there is a system failure.
Transaction Isolation Levels
| Isolation Level | Phantom Reads | Nonrepeatable Reads | Dirty Reads |
|---|---|---|---|
| READ_UNCOMMITTED | yes | yes | yes |
| READ_COMMITTED | yes | yes | no |
| REPEATABLE_READ | yes | no | no |
| SERIALIZABLE | no | no | no |
| Phenomenon | Description |
|---|---|
| Phantom Read | T1 runs a query. T2 inserts rows. T1 runs the same query and finds new rows. |
| Nonrepeatable Read | T1 reads a row. T2 updates the same row. T1 reads the same row again and sees different values. |
| Dirty Read | T1 updates a row, but without commit. T2 reads the updated row. T1 rolls back. |
I encourage you to experiment with isolation levels to understand the differences. Experiment by creating two connections to a database, where T1 is connection one and T2 is connection two. Please note that not all databases support all isolation levels.
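For example, below is a minimal JDBC sketch of such an experiment, assuming a PostgreSQL database with a hypothetical accounts table (the connection settings are placeholders): T1 holds an uncommitted UPDATE while T2 attempts a dirty read.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class IsolationExperiment {

    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://localhost:5432/demo"; // hypothetical settings
        try (Connection t1 = DriverManager.getConnection(url, "demo", "demo");
             Connection t2 = DriverManager.getConnection(url, "demo", "demo")) {

            t1.setAutoCommit(false);
            t2.setAutoCommit(false);
            // Ask for the weakest isolation level on T2 and try to observe a dirty read.
            t2.setTransactionIsolation(Connection.TRANSACTION_READ_UNCOMMITTED);

            // T1 updates a row but does not commit.
            try (Statement s1 = t1.createStatement()) {
                s1.executeUpdate("UPDATE accounts SET balance = balance + 100 WHERE id = 1");
            }

            // T2 reads the same row. A database that truly implements READ UNCOMMITTED
            // may return the uncommitted balance (a dirty read). PostgreSQL treats
            // READ UNCOMMITTED as READ COMMITTED, so it will show the old value.
            try (Statement s2 = t2.createStatement();
                 ResultSet rs = s2.executeQuery("SELECT balance FROM accounts WHERE id = 1")) {
                while (rs.next()) {
                    System.out.println("T2 sees balance = " + rs.getLong(1));
                }
            }

            t1.rollback(); // a dirty read, if any, was of data that never officially existed
            t2.rollback();
        }
    }
}
```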
What I Would Change
Use a Migration Tool
Introduce migrations to manage your database development and deployment process. My favorite choice is Liquibase; Active Record Migrations is another popular option.
When possible, prepare a rollback plan for each migration. Also, remember that it is simpler to add a column to a production system than to remove one.
Introduce only a manageable number of incremental migrations per sprint.
Below is an example of how to run Liquibase through docker-compose.
```yaml
liquibase.db:
  image: liquibase/liquibase:4.18
  command: --defaultsFile=/liquibase/config/liquibase.properties update
  depends_on:
    db:
      condition: service_healthy
  volumes:
    - ./conf/liquibase/liquibase.properties:/liquibase/config/liquibase.properties
    - ./microservice/src/main/resources/db/changelog:/liquibase/changelog
```
Continuously Document Your Schema
Introduce a utility that publishes your schema entity relationship diagram after every database migration.
You’ll be surprised how many dialogues this opens up during design, problem-solving sessions, agile grooming, planning poker, and many other scenarios.
Below is an example of how to run SchemaSpy through docker-compose.
```yaml
schemaspy.db:
  image: schemaspy/schemaspy:6.2.2
  volumes:
    - schemaspy.db:/output:rw
    - ./conf/schemaspy/schemaspy.properties:/schemaspy.properties
  command: SCHEMASPY_OUTPUT=/schemaspy/db /usr/local/bin/schemaspy -schemas public
  depends_on:
    liquibase.db:
      condition: service_completed_successfully

volumes:
  schemaspy.db: {}
```
Introduce Coding Conventions
- TABLE names are plural.
- COLUMN names are singular.
- Each TABLE has: id INTEGER PRIMARY KEY.
- Index each PRIMARY KEY and FOREIGN KEY column.
- FOREIGN KEY columns are named with the pattern singular_table_name_id.
- Use COMMENT data-definition language.
- Favor NOT NULL, since nullable columns are candidates for normalization.
Identifiers should use the following suffixes.
| Suffix | Object Type |
|---|---|
| _PK | Primary Key |
| _CK | Check Constraint |
| _UK | Unique Key Constraint |
| _V | View |
| _MV | Materialized View |
| _FK | Foreign Key |
| _X | Index |
| _UX | Unique Index |
| _FX | Function-Based Index |
| _SEQ | Sequences |
| _TRG | Triggers |
| _PKG | Packages and Package Bodies |
| Constraint Type | Naming Convention |
|---|---|
| Foreign Key | REFERENCING_TABLE#REFERENCED_TABLE#REFERENCED_COLUMN_NAME_FK |
| Unique | TABLE_NAME#COLUMN_NAME_UX |
| Check | TABLE_NAME#COLUMN_NAME_CK |
| Not Null | TABLE_NAME#COLUMN_NAME_NN |
Not everybody will agree on these conventions, and you might have different conventions in mind. However, it is better to stick to one convention than to have none, because consistency leads to higher maintainability.
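To make this concrete, here is a minimal sketch that applies several of these conventions through JDBC, using hypothetical customers and orders tables. Note that PostgreSQL requires quoting identifiers that contain #, while Oracle allows them unquoted.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ConventionsExample {

    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/demo", "demo", "demo"); // hypothetical settings
             Statement st = con.createStatement()) {

            // Plural table name, singular column names, an integer id primary key,
            // and NOT NULL wherever possible.
            st.execute("""
                    CREATE TABLE customers (
                        id INTEGER PRIMARY KEY,
                        name TEXT NOT NULL
                    )""");

            // Foreign key column named singular_table_name_id, with a constraint
            // named per the convention table above (quoted because of the '#').
            st.execute("""
                    CREATE TABLE orders (
                        id INTEGER PRIMARY KEY,
                        customer_id INTEGER NOT NULL,
                        CONSTRAINT "orders#customers#customer_id_fk"
                            FOREIGN KEY (customer_id) REFERENCES customers (id)
                    )""");

            // The primary key is indexed automatically by its constraint; index
            // the foreign key column explicitly, with the _X suffix.
            st.execute("CREATE INDEX \"orders#customer_id_x\" ON orders (customer_id)");

            // Use COMMENT data-definition language to document the schema.
            st.execute("COMMENT ON TABLE orders IS 'One row per customer order.'");
        }
    }
}
```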
Always Start with 3rd Normal Form
When starting your database design, start with and favor 3rd Normal Form (3NF). Deviate from it only for optimization purposes, and only after you’ve already tried 3NF. Do NOT prematurely optimize away from 3NF.
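As a sketch of what this means in practice (the tables are hypothetical): in the denormalized table below, department data is repeated on, and can drift between, employee rows; in 3NF each fact is stored once and referenced by id.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ThirdNormalFormExample {

    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/demo", "demo", "demo"); // hypothetical settings
             Statement st = con.createStatement()) {

            // NOT 3NF: department_location depends on department_name, not on the
            // employee key, so it is repeated (and can drift) on every employee row.
            st.execute("""
                    CREATE TABLE employees_denormalized (
                        id INTEGER PRIMARY KEY,
                        name TEXT NOT NULL,
                        department_name TEXT NOT NULL,
                        department_location TEXT NOT NULL
                    )""");

            // 3NF: every non-key column depends only on the key; the department
            // becomes its own table, referenced by a compact integer id.
            st.execute("""
                    CREATE TABLE departments (
                        id INTEGER PRIMARY KEY,
                        name TEXT NOT NULL,
                        location TEXT NOT NULL
                    )""");
            st.execute("""
                    CREATE TABLE employees (
                        id INTEGER PRIMARY KEY,
                        name TEXT NOT NULL,
                        department_id INTEGER NOT NULL REFERENCES departments (id)
                    )""");
        }
    }
}
```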
Use Foreign Key Constraints
Favor Foreign Key Constraints for applications still in development, or for “small” applications.
Although Foreign Key Constraints protect the integrity of the database, they are not strictly necessary in a bug-free application that never violates them, and they do add overhead to every write.
Foreign Key constraints make it very difficult to manage or refactor large “enterprise” databases with thousands of tables.
Only Use Numeric Id Primary Keys
Favor auto-generated integer PRIMARY KEY columns that are populated by a SEQUENCE.
Use the name “id” for the PRIMARY KEY column. It is the most popular convention, and it will allow your database to more easily integrate with other frameworks such as Ruby on Rails.
Do NOT use a COMPOSITE PRIMARY KEY; instead, use a UNIQUE INDEX, as sketched below.
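Here is a minimal sketch with a hypothetical order_items table: the surrogate id, populated by a sequence, is the primary key, while a unique index enforces what would otherwise have been the composite key.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SurrogateKeyExample {

    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/demo", "demo", "demo"); // hypothetical settings
             Statement st = con.createStatement()) {

            st.execute("CREATE SEQUENCE order_items_seq");

            // Surrogate integer primary key named "id", populated by a sequence.
            st.execute("""
                    CREATE TABLE order_items (
                        id INTEGER PRIMARY KEY DEFAULT nextval('order_items_seq'),
                        order_id INTEGER NOT NULL,
                        product_id INTEGER NOT NULL
                    )""");

            // The natural key is still enforced, but by a unique index rather than
            // a composite primary key (quoted because of the '#' in the name).
            st.execute("CREATE UNIQUE INDEX \"order_items#order_id#product_id_ux\" "
                    + "ON order_items (order_id, product_id)");
        }
    }
}
```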
Use Compact Data-Types
Use a proper data-type. Do not make everything type “text”!
Use the smallest data-type appropriate for the given column. This will help performance in the long run, on both the server and the client.
3rd Normal Form also allows for more compact data-types, since it removes redundant data (e.g., an association stores a compact integer id instead of repeated text).
Design for Concurrency
Design a schema that is concurrently safe with multiple clients or processes.
- Avoid row contention.
- Favor inserting rows that you sum over updating a single running-total row.
- Use optimistic locking with row version numbers, as sketched below.
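Below is a minimal JDBC sketch of optimistic locking against a hypothetical accounts table: the UPDATE succeeds only if the version is unchanged since the read, and zero updated rows means another client won the race.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class OptimisticLockingExample {

    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/demo", "demo", "demo")) { // hypothetical settings
            long id = 1;
            long balance;
            long version;

            // Read the row along with its current version.
            try (PreparedStatement read = con.prepareStatement(
                    "SELECT balance, version FROM accounts WHERE id = ?")) {
                read.setLong(1, id);
                try (ResultSet rs = read.executeQuery()) {
                    if (!rs.next()) return;
                    balance = rs.getLong(1);
                    version = rs.getLong(2);
                }
            }

            // Write back, but only if nobody else updated the row in the meantime.
            try (PreparedStatement update = con.prepareStatement(
                    "UPDATE accounts SET balance = ?, version = version + 1 "
                            + "WHERE id = ? AND version = ?")) {
                update.setLong(1, balance + 100);
                update.setLong(2, id);
                update.setLong(3, version);
                if (update.executeUpdate() == 0) {
                    // Lost the race: another client changed the row. Re-read and retry.
                    System.out.println("Concurrent update detected; retry.");
                }
            }
        }
    }
}
```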
Design for Growth
Depending on your needs, you may also need to architect for growth. It’s easy to scale up a database (better processors, more RAM). However, scaling up has its limits, and you may need a distributed design that incorporates the following.
- Caching
- Partitioning
- Sharding
- Replication
- An intelligent driver or client
MySQL has very nice sharding and partitioning capabilities. The MySQL NDB Cluster automatically partitions tables across nodes, and queries access the correct shards; sharding is transparent to the application. Unlike many other databases, users can perform JOIN operations and keep ACID guarantees when performing queries and transactions across shards (source: MySQL NDB Cluster: Scalability).
Introduce Auditing
Introduce auditing into your design.
Each DML statement can be audited with a version number, user id, update time, and create time: the target table shows the current values, and a history table keeps every prior version.
This will also help with optimistic locking, future integrations with other systems, and with troubleshooting.
Below is an example audit trigger that I use with PostgreSQL.
```sql
-- Trigger function that does two things: 1) it updates the version and updated_at
-- fields in the case of an UPDATE; 2) on every UPDATE and DELETE operation it adds
-- a record to the corresponding _H audit table.
CREATE OR REPLACE FUNCTION audit_update_trigger() RETURNS TRIGGER
LANGUAGE plpgsql
AS $$
BEGIN
    IF (TG_OP = 'UPDATE') AND (NEW.last_updated_by > 0) THEN
        NEW.version := OLD.version + 1;
        NEW.updated_at := CURRENT_TIMESTAMP;
        EXECUTE format('INSERT INTO audit.%I_h SELECT nextval(''audit.%I_h_id_seq''), %L, now(), ($1).*',
                       TG_TABLE_NAME, TG_TABLE_NAME, TG_OP) USING OLD;
        RETURN NEW;
    ELSEIF (TG_OP = 'DELETE') THEN
        EXECUTE format('INSERT INTO audit.%I_h SELECT nextval(''audit.%I_h_id_seq''), %L, now(), ($1).*',
                       TG_TABLE_NAME, TG_TABLE_NAME, TG_OP) USING OLD;
        RETURN OLD;
    END IF;
    RETURN NEW;
END;
$$;

COMMENT ON FUNCTION audit_update_trigger() IS
    'Trigger that does two things: 1) it updates the version and updated_at fields of a record in the case of an UPDATE 2) on every UPDATE, and DELETE operation it adds a record to the _H audit table.';
```
Use Prepared Statements for ALL Production Code
They are optimized on the server, and thus perform faster.
They are cached on the server, but you can also cache them on the client.
They are secure: parameter binding prevents SQL injection attacks.
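For example, here is a minimal sketch against a hypothetical users table: the parameter value travels separately from the SQL text, so hostile input cannot change the statement’s structure.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PreparedStatementExample {

    public static void main(String[] args) throws Exception {
        String name = "O'Brien'; DROP TABLE users; --"; // hostile input
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/demo", "demo", "demo"); // hypothetical settings
             // NEVER concatenate: "... WHERE name = '" + name + "'" is an injection hole.
             PreparedStatement ps = con.prepareStatement(
                     "SELECT id, name FROM users WHERE name = ?")) {
            ps.setString(1, name); // bound as data, never parsed as SQL
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getLong(1) + " " + rs.getString(2));
                }
            }
        }
    }
}
```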
Use Batch Updates
JDBC Batch Update allowed me to support 7 billion impressions per day on MySpace’s Ad Server. That’s a throughput of 81,000 impressions per second!
Batch Update allowed me to perform 15 million inserts across two indexed tables in 7 minutes. Without batch update, the same code would take 15 hours!
Additionally, streaming frameworks such as Apache Flink’s JDBC connector also use JDBC batching to perform their upserts.
The Java code below shows how to perform JDBC batch updates.
```java
// Insert one million products in batches of 100 statements, committing every 10 batches.
con.setAutoCommit(false);
try (
    PreparedStatement seq = con.prepareStatement(
            "SELECT nextval('products_seq') FROM generate_series(1, 1000)");
    PreparedStatement insert = con.prepareStatement(
            "INSERT INTO products(id, name) VALUES(?, ?)");
) {
    int b = 0; // rows added to batches so far
    int c = 0; // batches executed so far
    for (int i = 0; i < 1000; i++) {
        try (ResultSet products_seq = seq.executeQuery()) {
            while (products_seq.next()) {
                long product_id = products_seq.getLong(1);
                insert.setLong(1, product_id);
                insert.setString(2, "T-Shirt #" + product_id);
                insert.addBatch();
                b++;
                if (b % 100 == 0) {
                    insert.executeBatch(); // send 100 inserts in a single round trip
                    c++;
                    if (c % 10 == 0) {
                        con.commit(); // commit every 10 batches (1,000 rows)
                    }
                }
            }
        }
    }
    // Flush and commit any remainder.
    if (b % 100 != 0) {
        insert.executeBatch();
    }
    con.commit();
}
```
Eliminate all N+1 Queries from Production!
Your associate programmers are just happy to make the application functional.
Your senior programmers should replace N + 1 queries with join queries (or another variant if a join won’t work)!
The code below shows an example of the N + 1 query problem.
```java
try (
    Connection con = buildConnection();
    PreparedStatement p = con.prepareStatement("SELECT id, name FROM products");
) {
    ResultSet prs = p.executeQuery();
    while (prs.next()) {
        // One extra query per product row: this is the N + 1 anti-pattern.
        PreparedStatement o = con.prepareStatement(
                "SELECT id, name FROM options WHERE product_id = ?");
        o.setLong(1, prs.getLong(1));
        ResultSet ors = o.executeQuery();
        while (ors.next()) {
            long option_id = ors.getLong(1);
            // remaining business logic …
        }
    }
}
```
Instead, use an SQL join to bring in the data in one query instead of N + 1 queries.
```sql
SELECT p.id   AS product_id,
       p.name AS product_name,
       o.id   AS option_id,
       o.name AS option_name
  FROM products p
  LEFT JOIN options o ON o.product_id = p.id
```
Introduce Production Query Audits
Your development process should collect all queries used in production code. This allows experts to review each query’s EXPLAIN or EXPLAIN ANALYZE plan, introduce indices on the right columns, or refactor the query.
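For example, here is a minimal JDBC sketch of pulling a plan for one collected query (PostgreSQL syntax; the query and connection settings are hypothetical):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ExplainExample {

    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/demo", "demo", "demo"); // hypothetical settings
             Statement st = con.createStatement();
             // EXPLAIN ANALYZE executes the statement and reports the actual plan,
             // timings, and row counts. A sequential scan on a large table here
             // usually means a missing index, e.g. on products (name).
             ResultSet rs = st.executeQuery(
                     "EXPLAIN ANALYZE SELECT * FROM products WHERE name = 'T-Shirt #42'")) {
            while (rs.next()) {
                System.out.println(rs.getString(1)); // one line of the plan per row
            }
        }
    }
}
```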
Introduce Performance Monitoring
Focus on variance.
If you scaled vertically, then focus on the Top-N worst-performing queries.
If you scaled horizontally, then focus on the Top-P worst-performing hosts.
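On PostgreSQL, one way to rank queries by variance is the pg_stat_statements extension. This is a sketch only; it assumes the extension is installed and enabled, and the column names are for PostgreSQL 13+ (older versions use mean_time and stddev_time).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class VarianceReport {

    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/demo", "demo", "demo"); // hypothetical settings
             Statement st = con.createStatement();
             // Rank queries by variance in execution time, not just by the mean.
             ResultSet rs = st.executeQuery("""
                     SELECT query, calls, mean_exec_time, stddev_exec_time
                     FROM pg_stat_statements
                     ORDER BY stddev_exec_time DESC
                     LIMIT 10""")) {
            while (rs.next()) {
                System.out.printf("%.2f ms +/- %.2f ms (%d calls): %s%n",
                        rs.getDouble(3), rs.getDouble(4), rs.getLong(2), rs.getString(1));
            }
        }
    }
}
```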
Additional Tips
- Consider query paging when querying large tables (see the paging sketch after this list).
- Distributed caches can sometimes increase performance.
- Use connection pooling appropriately.
- Know your database frameworks. You probably only need to know three persistence frameworks to survive as a Java developer; learn them and master them:
- JDBC
- JPA
- Spring JDBC
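As promised above, here is a minimal sketch of query paging over a hypothetical products table. It uses keyset (seek) pagination, which, unlike a growing OFFSET, stays cheap on late pages because the index seeks straight to the last key seen.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PagingExample {

    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/demo", "demo", "demo"); // hypothetical settings
             // Keyset pagination: remember the last id seen and ask for the next
             // 100 rows after it, rather than using an ever-growing OFFSET.
             PreparedStatement page = con.prepareStatement(
                     "SELECT id, name FROM products WHERE id > ? ORDER BY id LIMIT 100")) {
            long lastId = 0;
            boolean more = true;
            while (more) {
                page.setLong(1, lastId);
                more = false;
                try (ResultSet rs = page.executeQuery()) {
                    while (rs.next()) {
                        lastId = rs.getLong(1);
                        more = true;
                        // process the row …
                    }
                }
            }
        }
    }
}
```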
Spring JDBC In a Nutshell
Although JDBC is very versatile, it is quite low-level when compared to the simplicity that Spring JDBC provides on top of it.
Spring JDBC will:
- Provide full control over the production queries.
- Provide a very simple interface for result set extraction.
- Provide transactional control through annotations and Aspect Oriented Programming.
- Provide named query parameters!
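Below is a minimal sketch of the named-parameter style; how the DataSource is wired is up to you, and the products table is a hypothetical example.

```java
import java.util.List;
import java.util.Map;
import javax.sql.DataSource;
import org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate;

public class ProductDao {

    private final NamedParameterJdbcTemplate jdbc;

    public ProductDao(DataSource dataSource) { // wiring of the DataSource is assumed
        this.jdbc = new NamedParameterJdbcTemplate(dataSource);
    }

    /** Full control over the SQL, named parameters, and a one-line row mapper. */
    public List<String> findNames(long minId) {
        return jdbc.query(
                "SELECT name FROM products WHERE id >= :minId ORDER BY name",
                Map.of("minId", minId),
                (rs, rowNum) -> rs.getString("name"));
    }
}
```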
JPA in a Nutshell
JPA is the standard ORM for Java. It does all the things expected of an ORM, and more.
- Named Queries are optimized and utilize prepared statements.
- Developers have less control over the generated queries, so more care must be taken to avoid the N+1 query problem.
- JPA provides the cliché table-mapping features expected of an ORM framework, but it can also map ANY query to ANY class with the @ConstructorResult annotation.
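For instance, here is a minimal sketch of @ConstructorResult. The jakarta.persistence imports assume JPA 3 (use javax.persistence on JPA 2), and the Product entity and query are hypothetical.

```java
import jakarta.persistence.ColumnResult;
import jakarta.persistence.ConstructorResult;
import jakarta.persistence.Entity;
import jakarta.persistence.EntityManager;
import jakarta.persistence.Id;
import jakarta.persistence.SqlResultSetMapping;
import java.util.List;

// A plain DTO: not an entity, not managed, simply constructed once per result row.
class ProductSummary {
    final Long id;
    final String name;

    ProductSummary(Long id, String name) {
        this.id = id;
        this.name = name;
    }
}

// The mapping is declared on an entity and ties result columns to the DTO constructor.
@Entity
@SqlResultSetMapping(
        name = "ProductSummaryMapping",
        classes = @ConstructorResult(
                targetClass = ProductSummary.class,
                columns = {
                        @ColumnResult(name = "id", type = Long.class),
                        @ColumnResult(name = "name")
                }))
class Product {
    @Id
    Long id;
    String name;
}

class ProductQueries {

    @SuppressWarnings("unchecked")
    static List<ProductSummary> summaries(EntityManager em) {
        // Any native query whose columns match the mapping can be returned as DTOs.
        return em.createNativeQuery(
                        "SELECT p.id, p.name FROM products p", "ProductSummaryMapping")
                .getResultList();
    }
}
```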
References
Source Code
https://github.com/minmay/improving-your-relational-database-architecture