What I would change if I saw your database

This post briefly covers the history of databases and transaction fundamentals, and then explains what I would change if I saw your database. It is highly opinionated and based on decades of experience working with relational databases.

History

The relational model was invented by E. F. Codd at IBM in 1970 in his paper “A Relational Model of Data for Large Shared Data Banks.” The first commercially available RDBMS was Oracle, released in 1979 by Relational Software, now Oracle Corporation. Currently, the most popular relational databases are Oracle, MySQL, Microsoft SQL Server, PostgreSQL, DB2, SQLite, and Sybase. They all use a variant of SQL. My personal favorite is PostgreSQL. It is free, open-source, and most faithfully implements ANSI SQL.

Relational Databases Today

Relational databases provide the right persistence solution for most business needs. They are extraordinarily fast and reliable, they can manage petabyte-scale data, there are plenty of tools written for them, and they are simple to use. Right or wrong, they are also one of the most common enterprise integration patterns.

The Advent of NoSQL

What about NoSQL?

NoSQL solutions provide a viable alternative persistence solution, depending on your needs.

Many NoSQL solutions are distributed by design, thus horizontally scalable. They often compromise consistency (CAP theorem) in favor of availability, partition tolerance, and speed.

Examples of the different types of NoSQL solutions are:

  • Key Value Store (Riak, Memcached)
  • Data-Structure Store (Redis)
  • Big Table or Column Store (Cassandra, HBase)
  • Document Store (Couchbase, Lucene, MongoDB)

Ironically, most NoSQL solutions offer a SQL-like interface.

When to use RDBMS

The RDBMS is the right choice for a very wide variety of use cases, such as when:

  • Data is inherently relational.
  • Data is tabular.
  • Data is user-generated (as opposed to machine generated).
  • You need a standard interface such as SQL used by business people or multiple applications.
  • Your application fits into the 99.9% of applications that can perform and scale perfectly well with an RDBMS.
  • Your application truly needs ACID.

Fundamentals

Transactions

A transaction is a group of SQL statements that form a logical unit of work attributed to one connection. Most RDBMSs support transactions.

A transaction has four fundamental properties: Atomicity, Consistency, Isolation, and Durability (ACID).

“Atomicity” means that after the transaction finishes, the changes made by its statements appear as though they were completed as one statement: either all of them take effect or none do.

“Consistency” means the database holds valid data after the transaction: it moves from one valid state to another.

“Isolation” describes the degree to which one transaction can see the changes made by another transaction running at the same time.

“Durability” describes the persistent nature of an RDBMS: once a transaction has been committed, its changes survive even a system failure.

Transaction Isolation Levels

Isolation Level     Phantom Reads   Nonrepeatable Reads   Dirty Reads
READ_UNCOMMITTED    yes             yes                   yes
READ_COMMITTED      yes             yes                   no
REPEATABLE_READ     yes             no                    no
SERIALIZABLE        no              no                    no

Phantom Read: T1 runs a query. T2 inserts rows. T1 runs the same query and finds the new rows.

Nonrepeatable Read: T1 reads a row. T2 updates the same row. T1 reads the same row again and sees different values.

Dirty Read: T1 updates a row without committing. T2 reads the updated row. T1 rolls back.

I encourage you to experiment with isolation levels to understand the difference. Experiment by creating two connections to a database. T1 is connection one. T2 is connection two. Please note that not all databases support all isolation levels.

What I would Change

Use a Migration Tool

Introduce migrations to manage your database development and management process. My favorite choice is Liquibase; Flyway and Active Record Migrations are also popular alternatives.

When possible, prepare a rollback plan for each migration. Also, remember that it is simpler to add a column to a production system than to remove one.

Introduce only manageable incremental migrations per sprint.

Below is an example of how to run liquibase through docker-compose.
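A minimal sketch of such a setup, assuming a PostgreSQL service and a changelog at ./changelog/db.changelog-master.xml (service names, credentials, and flags are illustrative, and the Liquibase CLI flags vary slightly by version):

```yaml
version: "3.8"
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: app
      POSTGRES_USER: app
      POSTGRES_PASSWORD: secret

  liquibase:
    image: liquibase/liquibase
    depends_on:
      - postgres
    volumes:
      # mount your changelog directory into the container
      - ./changelog:/liquibase/changelog
    command: >
      --url=jdbc:postgresql://postgres:5432/app
      --username=app
      --password=secret
      --changeLogFile=changelog/db.changelog-master.xml
      update
```

Running `docker-compose up liquibase` then applies any pending changesets against the postgres service.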

Continuously Document Your Schema

Introduce a utility that publishes your schema entity relationship diagram after every database migration.

You’ll be surprised how many conversations this sparks during design, problem-solving sessions, agile grooming, planning poker, and many other scenarios.

Below is an example of how to run schemaspy through docker-compose.
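A sketch of a SchemaSpy service that could be added to the same docker-compose file (the postgres service, database name, and credentials are assumptions; the generated HTML and ER diagrams land in ./schemaspy/output):

```yaml
  schemaspy:
    image: schemaspy/schemaspy
    depends_on:
      - postgres
    volumes:
      # the report (HTML + entity relationship diagrams) is written here
      - ./schemaspy/output:/output
    command: >
      -t pgsql
      -host postgres
      -port 5432
      -db app
      -s public
      -u app
      -p secret
```

Running it after each migration keeps the published ER diagram in sync with the schema.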

Introduce Coding Conventions

  1. TABLE names are plural.
  2. COLUMN names are singular.
  3. Each TABLE has: id INTEGER PRIMARY KEY.
  4. Index each PRIMARY KEY and FOREIGN KEY column.
  5. FOREIGN KEY columns are named with the pattern singular_table_name_id.
  6. Use COMMENT data-definition language.
  7. Favor NOT NULL, since nullable columns are candidates for normalization.

Identifiers should use the following suffixes.

_PK   Primary Key
_CK   Check Constraint
_UK   Unique Key Constraint
_V    View
_MV   Materialized View
_FK   Foreign Key
_X    Index
_UX   Unique Index
_FX   Function-Based Index
_SEQ  Sequence
_TRG  Trigger
_PKG  Package and Package Body

Constraint Type   Naming Convention
Foreign Key       REFERENCING_TABLE#REFERENCED_TABLE#REFERENCED_COLUMN_NAME_FK
Unique            TABLE_NAME#COLUMN_NAME_UX
Check             TABLE_NAME#COLUMN_NAME_CK
Not Null          TABLE_NAME#COLUMN_NAME_NN

Not everybody will agree on these conventions, and you might have different conventions in mind. However, it is better to stick to one convention than to have no conventions because this will lead to higher maintainability.
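The conventions above can be sketched in PostgreSQL-flavored DDL (the tables and columns are hypothetical, and the index naming is my reading of the suffix table):

```sql
CREATE TABLE users (
    id    INTEGER PRIMARY KEY,
    email TEXT NOT NULL,
    -- unique constraint: TABLE_NAME#COLUMN_NAME_UX
    CONSTRAINT "USERS#EMAIL_UX" UNIQUE (email)
);

CREATE TABLE orders (
    id      INTEGER PRIMARY KEY,
    -- singular_table_name_id foreign key column
    user_id INTEGER NOT NULL,
    amount  NUMERIC(12, 2) NOT NULL,
    -- foreign key: REFERENCING_TABLE#REFERENCED_TABLE#REFERENCED_COLUMN_NAME_FK
    CONSTRAINT "ORDERS#USERS#ID_FK" FOREIGN KEY (user_id) REFERENCES users (id)
);

-- index every FOREIGN KEY column (suffix _X for a plain index)
CREATE INDEX "ORDERS#USER_ID_X" ON orders (user_id);

-- use COMMENT DDL to document the schema
COMMENT ON TABLE orders IS 'One row per customer order.';
```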

Always Start with 3rd Normal Form

When starting your database design, start with and favor 3rd Normal Form (3NF). Deviate from that only for optimization purposes, after you’ve already tried 3NF. Do NOT prematurely optimize away from 3NF.

Use Foreign Key Constraints

Favor Foreign Key Constraints for applications still in development, or for “small” applications.

Although Foreign Key Constraints protect the integrity of the database, they are not strictly necessary in a bug-free application that never violates referential integrity, and they add overhead to every write.

Foreign Key constraints make it very difficult to manage or refactor large “enterprise” databases with thousands of tables.

Only Use Numeric Id Primary Keys

Favor auto-generated integer PRIMARY KEY columns that are populated by a SEQUENCE.

Use the name “id” for the PRIMARY KEY column. It is the most popular convention, and it will allow your database to more easily integrate with other frameworks such as Ruby on Rails.

Do NOT use a COMPOSITE PRIMARY KEY, instead use a UNIQUE INDEX.
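A sketch of the idea, with a hypothetical order_items join table (the unique index name follows the TABLE_NAME#COLUMN_NAME_UX convention):

```sql
-- Instead of a composite primary key:
--   PRIMARY KEY (order_id, product_id)
-- use a surrogate id plus a unique index over the natural key:
CREATE TABLE order_items (
    id         INTEGER PRIMARY KEY,
    order_id   INTEGER NOT NULL,
    product_id INTEGER NOT NULL
);

-- the unique index preserves the same integrity guarantee
CREATE UNIQUE INDEX "ORDER_ITEMS#ORDER_ID#PRODUCT_ID_UX"
    ON order_items (order_id, product_id);
```

The surrogate id keeps joins and ORM mappings simple, while the unique index still rejects duplicate (order_id, product_id) pairs.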

Use Compact Data-Types

Use a proper data-type. Do not make everything type “text”!

Use the smallest byte sized data-type appropriate for the given column. This will help performance in the long run on the server and on the client.

3rd Normal Form also allows for more compact data types, since it removes redundant data (e.g., an association stores an integer id, which is more compact than repeating text).

Design for Concurrency

Design a schema that is concurrently safe with multiple clients or processes.

  • Avoid row contention.
  • Favor inserting rows and summing them over updating a single running-total row.
  • Use optimistic locking with row version numbers.
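Optimistic locking with a version column can be sketched like this (the accounts table and values are hypothetical):

```sql
-- Read the row and remember the version you saw:
SELECT balance, version FROM accounts WHERE id = 42;
-- suppose this returned balance = 100.00, version = 7

-- Update only if nobody changed the row since the read:
UPDATE accounts
   SET balance = 150.00,
       version = version + 1
 WHERE id = 42
   AND version = 7;  -- the version observed at read time
-- If zero rows were updated, a concurrent transaction won the race:
-- re-read the row and retry.
```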

Design for Growth

Depending on your needs, you may also have to architect for growth. It’s easy to scale up a database (better processors, more RAM). However, scaling up has its limits, and you may need a distributed design that incorporates the following.

  • Caching
  • Partitioning
  • Sharding
  • Replication
  • An intelligent driver or client

MySQL has very nice sharding and partitioning capabilities. MySQL NDB Cluster automatically partitions tables across nodes, and queries access the correct shards. Sharding is transparent to the application. Unlike many other databases, it lets users perform JOIN operations and keep ACID guarantees when running queries and transactions across shards (source: MySQL NDB Cluster: Scalability).

Introduce Auditing

Introduce auditing into your design.

Each DML statement can be audited with a version number, user id, update time, and create time, keeping the current values on the target table and the full history in a history table.

This will also help with optimistic locking, future integrations with other systems, and with troubleshooting.

Below is an example audit trigger that I use with PostgreSQL.
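A minimal sketch of such a trigger, assuming PostgreSQL 11+ and a hypothetical users table (a real version would also cover INSERT and DELETE, and record the acting user):

```sql
CREATE TABLE users (
    id         INTEGER PRIMARY KEY,
    email      TEXT NOT NULL,
    version    INTEGER NOT NULL DEFAULT 1,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- history table with the same columns as the audited table
CREATE TABLE users_history (LIKE users);

CREATE OR REPLACE FUNCTION audit_users() RETURNS trigger AS $$
BEGIN
    -- bump the version and touch updated_at on the live row
    NEW.version    := OLD.version + 1;
    NEW.updated_at := now();
    -- preserve the previous state in the history table
    INSERT INTO users_history SELECT OLD.*;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_audit_trg
    BEFORE UPDATE ON users
    FOR EACH ROW EXECUTE FUNCTION audit_users();
```

Because the trigger increments version on every update, the same column doubles as the optimistic-locking token described earlier.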

Use Prepared Statements for ALL Production Code

They are optimized on the server, and thus perform faster.

They are cached on the server, but you can also cache them on the client.

They are secure: parameter binding prevents SQL injection attacks.
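A minimal JDBC sketch (the connection URL, credentials, and users table are assumptions; adjust for your environment):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class FindUser {
    public static void main(String[] args) throws Exception {
        // hypothetical connection details
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/app", "app", "secret")) {
            // the parameter is bound, never concatenated into the SQL string,
            // which is what prevents SQL injection
            String sql = "SELECT id, email FROM users WHERE email = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, args[0]);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getInt("id") + " " + rs.getString("email"));
                    }
                }
            }
        }
    }
}
```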

Use Batch Updates

JDBC Batch Update allowed me to support 7 billion impressions per day on MySpace’s Ad Server. That’s a throughput of 81,000 impressions per second!

Batch Update allowed me to perform 15 million inserts across two indexed tables in 7 minutes. Without batch update, the same code would take 15 hours!

Additionally, streaming frameworks such as Apache Flink’s JDBC Connector also use JDBC batching to perform their upserts.

The Java code below shows how to perform JDBC batch updates.
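A hedged sketch of the pattern (the connection details and users table are hypothetical; the batch size is a typical starting point, not a universal constant):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class BatchInsert {
    public static void main(String[] args) throws Exception {
        List<String> emails = List.of("a@example.com", "b@example.com", "c@example.com");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/app", "app", "secret")) {
            conn.setAutoCommit(false); // commit once per batch, not per row
            String sql = "INSERT INTO users (email) VALUES (?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                int batchSize = 1000;
                int count = 0;
                for (String email : emails) {
                    ps.setString(1, email);
                    ps.addBatch();          // queue the row locally
                    if (++count % batchSize == 0) {
                        ps.executeBatch();  // one round trip per chunk
                    }
                }
                ps.executeBatch();          // flush the remainder
            }
            conn.commit();
        }
    }
}
```

The speedup comes from amortizing the network round trip and commit overhead across many rows instead of paying it per insert.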

Eliminate all N+1 Queries from Production!

Your associate programmers are just happy to make the application functional.

Your senior programmers should replace N + 1 with join queries! (or another variant if join won’t work).

The code below shows an example of the N + 1 query problem.

Instead, use an SQL join to bring in the data in one query instead of N + 1 queries.
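In SQL terms, the anti-pattern and its fix look like this (the orders and order_items tables are hypothetical):

```sql
-- N + 1 anti-pattern: one query for the parent rows...
SELECT id FROM orders;
-- ...then one query PER parent row, N more round trips:
SELECT * FROM order_items WHERE order_id = 1;
SELECT * FROM order_items WHERE order_id = 2;
-- ... repeated N times

-- Better: a single join brings everything back in one round trip
SELECT o.id AS order_id, i.id AS item_id, i.product_id
  FROM orders o
  JOIN order_items i ON i.order_id = o.id;
```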

Introduce Production Query Audits

Your development process should collect all queries used in production code. This allows experts to optimize the query with an EXPLAIN or EXPLAIN ANALYZE plan to introduce indices on the right columns or refactor the query.
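For example, a collected query can be examined in PostgreSQL like this (the query and tables are hypothetical):

```sql
-- run the query and report the actual plan and timings
EXPLAIN ANALYZE
SELECT o.id
  FROM orders o
  JOIN order_items i ON i.order_id = o.id
 WHERE o.user_id = 42;
-- a sequential scan over a large table in the output
-- usually points at a missing index, e.g. on orders (user_id)
```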

Introduce Performance Monitoring

Focus on variance.

If you vertically scaled, then focus on Top-N worst performing queries.

If you horizontally scaled, then focus on the Top-P worst performing hosts.

Additional Tips

  1. Consider query paging when querying large tables.
  2. Distributed caches can sometimes increase performance.
  3. Use connection pooling appropriately.
  4. Know your database frameworks. As a Java developer, you probably only need to know three persistence frameworks; learn and master them:
    1. JDBC
    2. JPA
    3. Spring JDBC

Spring JDBC In a Nutshell

Although JDBC is very versatile, it is quite low-level compared to the simplicity that Spring JDBC provides on top of it.

Spring JDBC will:

  1. Provide full-control over the production queries.
  2. Provide a very simple interface for result set extraction.
  3. Provide transactional control through annotations and Aspect Oriented Programming.
  4. Provide named query parameters!
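A sketch of named parameters with Spring JDBC (the DataSource wiring is omitted, and the users table and columns are assumptions):

```java
import java.time.LocalDate;
import java.util.List;
import java.util.Map;
import javax.sql.DataSource;
import org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate;

public class UserRepository {
    private final NamedParameterJdbcTemplate template;

    public UserRepository(DataSource dataSource) {
        this.template = new NamedParameterJdbcTemplate(dataSource);
    }

    // :since reads far better than a positional '?' placeholder,
    // especially in queries with many parameters
    public List<String> emailsSince(LocalDate since) {
        String sql = "SELECT email FROM users WHERE created_at >= :since";
        return template.query(
                sql,
                Map.of("since", since),
                (rs, rowNum) -> rs.getString("email"));
    }
}
```

The lambda is the result set extraction mentioned above: Spring iterates the ResultSet and maps each row for you.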

JPA in a Nutshell

JPA is the standard ORM for Java. It does all things expected of ORM, and more.

  1. Named Queries are optimized and utilize prepared statements.
  2. Developers have less control over the generated queries, so more care must be taken to avoid the N+1 query problem.
  3. JPA provides the standard table-mapping features expected of an ORM framework, but it can also map ANY query to ANY class with the @ConstructorResult annotation.

References

Source Code

https://github.com/minmay/improving-your-relational-database-architecture