This post briefly covers the history of databases and transaction fundamentals, and then explains what I would change if I saw your database. It is highly opinionated, and based on decades of experience working with relational databases.
History
The term relational database originates with E. F. Codd at IBM, who introduced the relational model in his 1970 paper “A Relational Model of Data for Large Shared Data Banks.” The first commercially available RDBMS was Oracle, released in 1979 by Relational Software, now Oracle Corporation. Currently, the most popular relational databases are Oracle, MySQL, Microsoft SQL Server, PostgreSQL, DB2, SQLite, and Sybase. They all use a variant of SQL. My personal favorite database is PostgreSQL. It is free, open-source, and most faithfully implements ANSI SQL.
Relational Databases Today
The relational database can provide the right persistence solution for most business needs. Relational databases are extraordinarily fast and reliable, they can manage petabyte-scale data, there are plenty of tools written for them, and they are simple to use. Right or wrong, they are also one of the most common enterprise integration patterns.
The Advent of NoSQL
What about NoSQL?
NoSQL solutions provide a viable alternative persistence solution, depending on your needs.
Many NoSQL solutions are distributed by design and thus horizontally scalable. They often compromise consistency (see the CAP theorem) in favor of availability, partition tolerance, and speed.
Examples of the different types of NoSQL solutions are:
- Key Value Store (Riak, Memcached)
- Data-Structure Store (Redis)
- Big Table or Column Store (Cassandra, HBase)
- Document Store (Couchbase, Lucene, MongoDB)
Ironically, most NoSQL solutions have an SQL-like interface.
When to Use an RDBMS
The RDBMS is the right choice for a very wide variety of use cases, such as when:
- Data is inherently relational.
- Data is tabular.
- Data is user-generated (as opposed to machine generated).
- You need a standard interface such as SQL used by business people or multiple applications.
- Your application fits into the 99.9% of applications that can perform and scale perfectly well with an RDBMS.
- Your application truly needs ACID.
Fundamentals
Transactions
A transaction is a group of SQL statements that form a logical unit of work attributed to one connection. Most RDBMSs support transactions.
A transaction has four fundamental properties: Atomicity, Consistency, Isolation, and Durability (ACID).
“Atomic” describes how, after the transaction finishes, the changes that each statement made appear as though they were completed as one statement: either all of them take effect, or none do.
“Consistent” describes how the RDBMS moves from one valid state to another: all constraints still hold after the transaction.
“Isolation” describes the rules for how one transaction can see or affect values changed by another transaction happening at the same time.
“Durable” describes the persistent nature of an RDBMS: once the transaction has been committed, the update will be persisted, even if there is a system failure.
Transaction Isolation Levels
| Isolation Level | Phantom Reads | Nonrepeatable Reads | Dirty Reads |
|---|---|---|---|
| READ_UNCOMMITTED | yes | yes | yes |
| READ_COMMITTED | yes | yes | no |
| REPEATABLE_READ | yes | no | no |
| SERIALIZABLE | no | no | no |
| Phenomenon | Description |
|---|---|
| Phantom Read | T1 runs a query. T2 inserts rows. T1 runs the same query and finds new rows. |
| Nonrepeatable Read | T1 reads a row. T2 updates the same row. T1 reads the same row again and sees different values. |
| Dirty Read | T1 updates a row, but without commit. T2 reads the updated row. T1 rolls back. |
I encourage you to experiment with isolation levels to understand the differences. Experiment by creating two connections to a database, where T1 is connection one and T2 is connection two. Please note that not all databases support all isolation levels.
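For example, below is a minimal JDBC sketch of such an experiment, assuming a PostgreSQL database with a hypothetical accounts table (the connection settings are placeholders): T1 holds an uncommitted UPDATE while T2 attempts a dirty read.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class IsolationExperiment {

    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://localhost:5432/demo"; // hypothetical settings
        try (Connection t1 = DriverManager.getConnection(url, "demo", "demo");
             Connection t2 = DriverManager.getConnection(url, "demo", "demo")) {

            t1.setAutoCommit(false);
            t2.setAutoCommit(false);
            // Ask for the weakest isolation level on T2 and try to observe a dirty read.
            t2.setTransactionIsolation(Connection.TRANSACTION_READ_UNCOMMITTED);

            // T1 updates a row but does not commit.
            try (Statement s1 = t1.createStatement()) {
                s1.executeUpdate("UPDATE accounts SET balance = balance + 100 WHERE id = 1");
            }

            // T2 reads the same row. A database that truly implements READ UNCOMMITTED
            // may return the uncommitted balance (a dirty read). PostgreSQL treats
            // READ UNCOMMITTED as READ COMMITTED, so it will show the old value.
            try (Statement s2 = t2.createStatement();
                 ResultSet rs = s2.executeQuery("SELECT balance FROM accounts WHERE id = 1")) {
                while (rs.next()) {
                    System.out.println("T2 sees balance = " + rs.getLong(1));
                }
            }

            t1.rollback(); // a dirty read, if any, was of data that never officially existed
            t2.rollback();
        }
    }
}
```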
What I Would Change
Use a Migration Tool
Introduce migrations to manage your database development and deployment process. My favorite choice is Liquibase; Active Record Migrations is another popular option.
When possible, prepare a rollback plan for each migration. Also, remember that it is simpler to add a column to a production system than to remove one.
Introduce only a manageable number of incremental migrations per sprint.
Below is an example of how to run Liquibase through docker-compose.
```yaml
liquibase.db:
  image: liquibase/liquibase:4.18
  command: --defaultsFile=/liquibase/config/liquibase.properties update
  depends_on:
    db:
      condition: service_healthy
  volumes:
    - ./conf/liquibase/liquibase.properties:/liquibase/config/liquibase.properties
    - ./microservice/src/main/resources/db/changelog:/liquibase/changelog
```
Continuously Document Your Schema
Introduce a utility that publishes your schema entity relationship diagram after every database migration.
You’ll be surprised how many dialogues this opens up during design, problem-solving sessions, agile grooming, planning poker, and many other scenarios.
Below is an example of how to run SchemaSpy through docker-compose.
```yaml
schemaspy.db:
  image: schemaspy/schemaspy:6.2.2
  volumes:
    - schemaspy.db:/output:rw
    - ./conf/schemaspy/schemaspy.properties:/schemaspy.properties
  command: SCHEMASPY_OUTPUT=/schemaspy/db /usr/local/bin/schemaspy -schemas public
  depends_on:
    liquibase.db:
      condition: service_completed_successfully

volumes:
  schemaspy.db: {}
```
Introduce Coding Conventions
- TABLE names are plural.
- COLUMN names are singular.
- Each TABLE has: id INTEGER PRIMARY KEY.
- Index each PRIMARY KEY and FOREIGN KEY column.
- FOREIGN KEY columns are named with the pattern singular_table_name_id.
- Use COMMENT data-definition language.
- Favor NOT NULL, since nullable columns are candidates for normalization.
Identifiers should use the following suffixes.
| Suffix | Object Type |
|---|---|
| _PK | Primary Key |
| _CK | Check Constraint |
| _UK | Unique Key Constraint |
| _V | View |
| _MV | Materialized View |
| _FK | Foreign Key |
| _X | Index |
| _UX | Unique Index |
| _FX | Function-Based Index |
| _SEQ | Sequences |
| _TRG | Triggers |
| _PKG | Packages and Package Bodies |
| Constraint Type | Naming Convention |
|---|---|
| Foreign Key | REFERENCING_TABLE#REFERENCED_TABLE#REFERENCED_COLUMN_NAME_FK |
| Unique | TABLE_NAME#COLUMN_NAME_UX |
| Check | TABLE_NAME#COLUMN_NAME_CK |
| Not Null | TABLE_NAME#COLUMN_NAME_NN |
Not everybody will agree on these conventions, and you might have different conventions in mind. However, it is better to stick to one convention than to have none, because consistency leads to higher maintainability.
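To make this concrete, here is a minimal sketch that applies several of these conventions through JDBC, using hypothetical customers and orders tables. Note that PostgreSQL requires quoting identifiers that contain #, while Oracle allows them unquoted.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ConventionsExample {

    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/demo", "demo", "demo"); // hypothetical settings
             Statement st = con.createStatement()) {

            // Plural table name, singular column names, an integer id primary key,
            // and NOT NULL wherever possible.
            st.execute("""
                    CREATE TABLE customers (
                        id INTEGER PRIMARY KEY,
                        name TEXT NOT NULL
                    )""");

            // Foreign key column named singular_table_name_id, with a constraint
            // named per the convention table above (quoted because of the '#').
            st.execute("""
                    CREATE TABLE orders (
                        id INTEGER PRIMARY KEY,
                        customer_id INTEGER NOT NULL,
                        CONSTRAINT "orders#customers#customer_id_fk"
                            FOREIGN KEY (customer_id) REFERENCES customers (id)
                    )""");

            // The primary key is indexed automatically by its constraint; index
            // the foreign key column explicitly, with the _X suffix.
            st.execute("CREATE INDEX \"orders#customer_id_x\" ON orders (customer_id)");

            // Use COMMENT data-definition language to document the schema.
            st.execute("COMMENT ON TABLE orders IS 'One row per customer order.'");
        }
    }
}
```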
Always Start with 3rd Normal Form
When starting your database design, start with and favor 3rd Normal Form (3NF). Deviate from it only for optimization purposes, and only after you’ve already tried 3NF. Do NOT prematurely optimize away from 3NF.
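As a sketch of what this means in practice (the tables are hypothetical): in the denormalized table below, department data is repeated on, and can drift between, employee rows; in 3NF each fact is stored once and referenced by id.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ThirdNormalFormExample {

    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/demo", "demo", "demo"); // hypothetical settings
             Statement st = con.createStatement()) {

            // NOT 3NF: department_location depends on department_name, not on the
            // employee key, so it is repeated (and can drift) on every employee row.
            st.execute("""
                    CREATE TABLE employees_denormalized (
                        id INTEGER PRIMARY KEY,
                        name TEXT NOT NULL,
                        department_name TEXT NOT NULL,
                        department_location TEXT NOT NULL
                    )""");

            // 3NF: every non-key column depends only on the key; the department
            // becomes its own table, referenced by a compact integer id.
            st.execute("""
                    CREATE TABLE departments (
                        id INTEGER PRIMARY KEY,
                        name TEXT NOT NULL,
                        location TEXT NOT NULL
                    )""");
            st.execute("""
                    CREATE TABLE employees (
                        id INTEGER PRIMARY KEY,
                        name TEXT NOT NULL,
                        department_id INTEGER NOT NULL REFERENCES departments (id)
                    )""");
        }
    }
}
```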
Use Foreign Key Constraints
Favor Foreign Key Constraints for applications still in development, or for “small” applications.
Although Foreign Key Constraints protect the integrity of the database, they are not strictly necessary in a bug-free application that never violates them, and they do add overhead to every write.
Foreign Key constraints make it very difficult to manage or refactor large “enterprise” databases with thousands of tables.
Only Use Numeric Id Primary Keys
Favor auto-generated integer PRIMARY KEY columns that are populated by a SEQUENCE.
Use the name “id” for the PRIMARY KEY column. It is the most popular convention, and it will allow your database to more easily integrate with other frameworks such as Ruby on Rails.
Do NOT use a COMPOSITE PRIMARY KEY; instead, use a UNIQUE INDEX, as sketched below.
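Here is a minimal sketch with a hypothetical order_items table: the surrogate id, populated by a sequence, is the primary key, while a unique index enforces what would otherwise have been the composite key.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SurrogateKeyExample {

    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/demo", "demo", "demo"); // hypothetical settings
             Statement st = con.createStatement()) {

            st.execute("CREATE SEQUENCE order_items_seq");

            // Surrogate integer primary key named "id", populated by a sequence.
            st.execute("""
                    CREATE TABLE order_items (
                        id INTEGER PRIMARY KEY DEFAULT nextval('order_items_seq'),
                        order_id INTEGER NOT NULL,
                        product_id INTEGER NOT NULL
                    )""");

            // The natural key is still enforced, but by a unique index rather than
            // a composite primary key (quoted because of the '#' in the name).
            st.execute("CREATE UNIQUE INDEX \"order_items#order_id#product_id_ux\" "
                    + "ON order_items (order_id, product_id)");
        }
    }
}
```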
Use Compact Data-Types
Use a proper data-type. Do not make everything type “text”!
Use the smallest data-type appropriate for the given column. This will help performance in the long run, on both the server and the client.
3rd Normal Form also allows for more compact data-types, since it removes redundant data (e.g., an association stores a compact integer id instead of repeated text).
Design for Concurrency
Design a schema that is concurrently safe with multiple clients or processes.
- Avoid row contention.
- Favor inserting rows that you sum over updating a single running-total row.
- Use optimistic locking with row version numbers, as sketched below.
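Below is a minimal JDBC sketch of optimistic locking against a hypothetical accounts table: the UPDATE succeeds only if the version is unchanged since the read, and zero updated rows means another client won the race.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class OptimisticLockingExample {

    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/demo", "demo", "demo")) { // hypothetical settings
            long id = 1;
            long balance;
            long version;

            // Read the row along with its current version.
            try (PreparedStatement read = con.prepareStatement(
                    "SELECT balance, version FROM accounts WHERE id = ?")) {
                read.setLong(1, id);
                try (ResultSet rs = read.executeQuery()) {
                    if (!rs.next()) return;
                    balance = rs.getLong(1);
                    version = rs.getLong(2);
                }
            }

            // Write back, but only if nobody else updated the row in the meantime.
            try (PreparedStatement update = con.prepareStatement(
                    "UPDATE accounts SET balance = ?, version = version + 1 "
                            + "WHERE id = ? AND version = ?")) {
                update.setLong(1, balance + 100);
                update.setLong(2, id);
                update.setLong(3, version);
                if (update.executeUpdate() == 0) {
                    // Lost the race: another client changed the row. Re-read and retry.
                    System.out.println("Concurrent update detected; retry.");
                }
            }
        }
    }
}
```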
Design for Growth
Depending on your needs, you may also need to architect for growth. It’s easy to scale up a database (better processors, more RAM). However, scaling up has its limits, and you may need a distributed design that incorporates the following.
- Caching
- Partitioning
- Sharding
- Replication
- An intelligent driver or client
MySQL has very nice sharding and partitioning capabilities. The MySQL NDB Cluster automatically partitions tables across nodes, and queries access the correct shards; sharding is transparent to the application. Unlike many other databases, users can perform JOIN operations and keep ACID guarantees when performing queries and transactions across shards (source: MySQL NDB Cluster: Scalability).
Introduce Auditing
Introduce auditing into your design.
Each DML statement can be audited with a version number, user id, update time, and create time: the target table shows the current values, and a history table keeps every prior version.
This will also help with optimistic locking, future integrations with other systems, and with troubleshooting.
Below is an example audit trigger that I use with PostgreSQL.
```sql
-- Trigger function that does two things: 1) it updates the version and updated_at
-- fields in the case of an UPDATE; 2) on every UPDATE and DELETE operation it adds
-- a record to the corresponding _H audit table.
CREATE OR REPLACE FUNCTION audit_update_trigger() RETURNS TRIGGER
LANGUAGE plpgsql
AS $$
BEGIN
    IF (TG_OP = 'UPDATE') AND (NEW.last_updated_by > 0) THEN
        NEW.version := OLD.version + 1;
        NEW.updated_at := CURRENT_TIMESTAMP;
        EXECUTE format('INSERT INTO audit.%I_h SELECT nextval(''audit.%I_h_id_seq''), %L, now(), ($1).*',
                       TG_TABLE_NAME, TG_TABLE_NAME, TG_OP) USING OLD;
        RETURN NEW;
    ELSEIF (TG_OP = 'DELETE') THEN
        EXECUTE format('INSERT INTO audit.%I_h SELECT nextval(''audit.%I_h_id_seq''), %L, now(), ($1).*',
                       TG_TABLE_NAME, TG_TABLE_NAME, TG_OP) USING OLD;
        RETURN OLD;
    END IF;
    RETURN NEW;
END;
$$;

COMMENT ON FUNCTION audit_update_trigger() IS
    'Trigger that does two things: 1) it updates the version and updated_at fields of a record in the case of an UPDATE 2) on every UPDATE, and DELETE operation it adds a record to the _H audit table.';
```
Use Prepared Statements for ALL Production Code
They are optimized on the server, and thus perform faster.
They are cached on the server, but you can also cache them on the client.
They are secure: parameter binding prevents SQL injection attacks.
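For example, here is a minimal sketch against a hypothetical users table: the parameter value travels separately from the SQL text, so hostile input cannot change the statement’s structure.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PreparedStatementExample {

    public static void main(String[] args) throws Exception {
        String name = "O'Brien'; DROP TABLE users; --"; // hostile input
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/demo", "demo", "demo"); // hypothetical settings
             // NEVER concatenate: "... WHERE name = '" + name + "'" is an injection hole.
             PreparedStatement ps = con.prepareStatement(
                     "SELECT id, name FROM users WHERE name = ?")) {
            ps.setString(1, name); // bound as data, never parsed as SQL
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getLong(1) + " " + rs.getString(2));
                }
            }
        }
    }
}
```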
Use Batch Updates
JDBC Batch Update allowed me to support 7 billion impressions per day on MySpace’s Ad Server. That’s a throughput of 81,000 impressions per second!
Batch Update allowed me to perform 15 million inserts across two indexed tables in 7 minutes. Without batch update, the same code would take 15 hours!
Additionally, streaming frameworks such as Apache Flink’s JDBC connector also use JDBC batching to perform their upserts.
The Java code below shows how to perform JDBC batch updates.
```java
// Insert one million products in batches of 100 statements, committing every 10 batches.
con.setAutoCommit(false);
try (
    PreparedStatement seq = con.prepareStatement(
            "SELECT nextval('products_seq') FROM generate_series(1, 1000)");
    PreparedStatement insert = con.prepareStatement(
            "INSERT INTO products(id, name) VALUES(?, ?)");
) {
    int b = 0; // rows added to batches so far
    int c = 0; // batches executed so far
    for (int i = 0; i < 1000; i++) {
        try (ResultSet products_seq = seq.executeQuery()) {
            while (products_seq.next()) {
                long product_id = products_seq.getLong(1);
                insert.setLong(1, product_id);
                insert.setString(2, "T-Shirt #" + product_id);
                insert.addBatch();
                b++;
                if (b % 100 == 0) {
                    insert.executeBatch(); // send 100 inserts in a single round trip
                    c++;
                    if (c % 10 == 0) {
                        con.commit(); // commit every 10 batches (1,000 rows)
                    }
                }
            }
        }
    }
    // Flush and commit any remainder.
    if (b % 100 != 0) {
        insert.executeBatch();
    }
    con.commit();
}
```
Eliminate all N+1 Queries from Production!
Your associate programmers are just happy to make the application functional.
Your senior programmers should replace N + 1 queries with join queries (or another variant if a join won’t work)!
The code below shows an example of the N + 1 query problem.
```java
try (
    Connection con = buildConnection();
    PreparedStatement p = con.prepareStatement("SELECT id, name FROM products");
) {
    ResultSet prs = p.executeQuery();
    while (prs.next()) {
        // One extra query per product row: this is the N + 1 anti-pattern.
        PreparedStatement o = con.prepareStatement(
                "SELECT id, name FROM options WHERE product_id = ?");
        o.setLong(1, prs.getLong(1));
        ResultSet ors = o.executeQuery();
        while (ors.next()) {
            long option_id = ors.getLong(1);
            // remaining business logic …
        }
    }
}
```
Instead, use an SQL join to bring in the data in one query instead of N + 1 queries.
```sql
SELECT p.id   AS product_id,
       p.name AS product_name,
       o.id   AS option_id,
       o.name AS option_name
  FROM products p
  LEFT JOIN options o ON o.product_id = p.id
```
Introduce Production Query Audits
Your development process should collect all queries used in production code. This allows experts to review each query’s EXPLAIN or EXPLAIN ANALYZE plan, introduce indices on the right columns, or refactor the query.
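For example, here is a minimal JDBC sketch of pulling a plan for one collected query (PostgreSQL syntax; the query and connection settings are hypothetical):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ExplainExample {

    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/demo", "demo", "demo"); // hypothetical settings
             Statement st = con.createStatement();
             // EXPLAIN ANALYZE executes the statement and reports the actual plan,
             // timings, and row counts. A sequential scan on a large table here
             // usually means a missing index, e.g. on products (name).
             ResultSet rs = st.executeQuery(
                     "EXPLAIN ANALYZE SELECT * FROM products WHERE name = 'T-Shirt #42'")) {
            while (rs.next()) {
                System.out.println(rs.getString(1)); // one line of the plan per row
            }
        }
    }
}
```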
Introduce Performance Monitoring
Focus on variance.
If you scaled vertically, then focus on the Top-N worst-performing queries.
If you scaled horizontally, then focus on the Top-P worst-performing hosts.
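On PostgreSQL, one way to rank queries by variance is the pg_stat_statements extension. This is a sketch only; it assumes the extension is installed and enabled, and the column names are for PostgreSQL 13+ (older versions use mean_time and stddev_time).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class VarianceReport {

    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/demo", "demo", "demo"); // hypothetical settings
             Statement st = con.createStatement();
             // Rank queries by variance in execution time, not just by the mean.
             ResultSet rs = st.executeQuery("""
                     SELECT query, calls, mean_exec_time, stddev_exec_time
                     FROM pg_stat_statements
                     ORDER BY stddev_exec_time DESC
                     LIMIT 10""")) {
            while (rs.next()) {
                System.out.printf("%.2f ms +/- %.2f ms (%d calls): %s%n",
                        rs.getDouble(3), rs.getDouble(4), rs.getLong(2), rs.getString(1));
            }
        }
    }
}
```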
Additional Tips
- Consider query paging when querying large tables (see the paging sketch after this list).
- Distributed caches can sometimes increase performance.
- Use connection pooling appropriately.
- Know your database frameworks. You probably only need to know three persistence frameworks to survive as a Java developer; learn them and master them:
- JDBC
- JPA
- Spring JDBC
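As promised above, here is a minimal sketch of query paging over a hypothetical products table. It uses keyset (seek) pagination, which, unlike a growing OFFSET, stays cheap on late pages because the index seeks straight to the last key seen.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PagingExample {

    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/demo", "demo", "demo"); // hypothetical settings
             // Keyset pagination: remember the last id seen and ask for the next
             // 100 rows after it, rather than using an ever-growing OFFSET.
             PreparedStatement page = con.prepareStatement(
                     "SELECT id, name FROM products WHERE id > ? ORDER BY id LIMIT 100")) {
            long lastId = 0;
            boolean more = true;
            while (more) {
                page.setLong(1, lastId);
                more = false;
                try (ResultSet rs = page.executeQuery()) {
                    while (rs.next()) {
                        lastId = rs.getLong(1);
                        more = true;
                        // process the row …
                    }
                }
            }
        }
    }
}
```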
Spring JDBC In a Nutshell
Although JDBC is very versatile, it is quite low-level when compared to the simplicity that Spring JDBC provides on top of it.
Spring JDBC will:
- Provide full control over the production queries.
- Provide a very simple interface for result set extraction.
- Provide transactional control through annotations and Aspect Oriented Programming.
- Provide named query parameters!
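Below is a minimal sketch of the named-parameter style; how the DataSource is wired is up to you, and the products table is a hypothetical example.

```java
import java.util.List;
import java.util.Map;
import javax.sql.DataSource;
import org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate;

public class ProductDao {

    private final NamedParameterJdbcTemplate jdbc;

    public ProductDao(DataSource dataSource) { // wiring of the DataSource is assumed
        this.jdbc = new NamedParameterJdbcTemplate(dataSource);
    }

    /** Full control over the SQL, named parameters, and a one-line row mapper. */
    public List<String> findNames(long minId) {
        return jdbc.query(
                "SELECT name FROM products WHERE id >= :minId ORDER BY name",
                Map.of("minId", minId),
                (rs, rowNum) -> rs.getString("name"));
    }
}
```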
JPA in a Nutshell
JPA is the standard ORM for Java. It does all the things expected of an ORM, and more.
- Named Queries are optimized and utilize prepared statements.
- Developers have less control over the generated queries, so more care must be taken to avoid the N+1 query problem.
- JPA provides the cliché table-mapping features expected of an ORM framework, but it can also map ANY query to ANY class with the @ConstructorResult annotation.
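For instance, here is a minimal sketch of @ConstructorResult. The jakarta.persistence imports assume JPA 3 (use javax.persistence on JPA 2), and the Product entity and query are hypothetical.

```java
import jakarta.persistence.ColumnResult;
import jakarta.persistence.ConstructorResult;
import jakarta.persistence.Entity;
import jakarta.persistence.EntityManager;
import jakarta.persistence.Id;
import jakarta.persistence.SqlResultSetMapping;
import java.util.List;

// A plain DTO: not an entity, not managed, simply constructed once per result row.
class ProductSummary {
    final Long id;
    final String name;

    ProductSummary(Long id, String name) {
        this.id = id;
        this.name = name;
    }
}

// The mapping is declared on an entity and ties result columns to the DTO constructor.
@Entity
@SqlResultSetMapping(
        name = "ProductSummaryMapping",
        classes = @ConstructorResult(
                targetClass = ProductSummary.class,
                columns = {
                        @ColumnResult(name = "id", type = Long.class),
                        @ColumnResult(name = "name")
                }))
class Product {
    @Id
    Long id;
    String name;
}

class ProductQueries {

    @SuppressWarnings("unchecked")
    static List<ProductSummary> summaries(EntityManager em) {
        // Any native query whose columns match the mapping can be returned as DTOs.
        return em.createNativeQuery(
                        "SELECT p.id, p.name FROM products p", "ProductSummaryMapping")
                .getResultList();
    }
}
```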
References
Source Code
https://github.com/minmay/improving-your-relational-database-architecture