Designing Data Intensive Applications: Data Models & Query Languages

December 31, 2021

This blog captures important notes from the book "Designing Data-Intensive Applications" by Martin Kleppmann.

Data Models & Query Languages

Most applications are built by applying one data model on top of another. As a app developer, data is modelled in terms of real world entities. Represented as JSON, XML, relational tables or graphs Each layer abstracts the lower layer model. Data model has profound impact on what the software can and cannot do.

Relational vs Document Model

Data is organized in tables and SQL is used to query data in tables. Relational is more common in today's world. No SQL is not only SQL. Become prominent for setting up distributed datasets and data with less relations. Most applications are coded as relational models. Object Relation Mapping (And Active record) is a layer above DB modelling.

For example: For a data structure like resume with multiple education & job details, a JSON structure might be easier to avoid complex queries and multiple joins. Mondo DB, Rethink DB, couch DB are document based DBs.

Many-Many and Many-One relationships:

Using ID for referring to common nouns like cities, regions etc help to maintain this data only in one place. It's easier to do updates on this data. It's helps in UI layer to render dropdown for these nouns.

In document DB, the joins happen at the application layer. In resume example, a query for profile, a query for education references and a query for job companies are made in parallel and then joined based on the reference columns.

The network model:

In a tree model, one node can have only one parent. In a network model, one node can have multiple parents. The application should take care of handling all relations.

Tha relational model:

A relation or table is a collection of rows. Can read all or some rows based on conditions. Indexes are built by query optimizer. To query data on different dimensions, new indexes are built.

Relational vs Document Databases:

Document DBs store in hierarchical structure. In both relational vs document, a reference ID to other relation or document is maintained. Joins are made using this relation ID.

In document DBs, joins are made by application and in case of many-many, the application can become complex and slow when compared to joins in native DB layer. Hence highly connected data are recommended to use relational DBs.

Document DBs are schema-less or schema on read. Schemas are at application layer. Schema on read are runtime checks while schema on write are compile time checks. Schema changes are application level changes and both old & new schemas should be supported.

In relational, a schema change is migration of DB relations and move all data to latest schema version.

Both relational & document DBs support XML, JSON based modelling.

A hybrid of relational and document DB features is quite flexible and might be the path in the future.

SQL is declarative query language. It is more attractive then the imperative language in which many applications are coded. Declarative queries can be parallelized while it's difficult to parallelize imperative code because it executes on an order. CSS is also a declarative language. CSS styling is better than using JS to style elements imperatively.

Map reduce querying:

It's neither a declarative nor a imperative. Somewhere in-between. It's expressed in code. The map operations collect records and reduce are fold / eject records. Restrictive in what they are allowed to do. They must accept pure functions and must not have side effects. MR is lower level query language and SQL can be expressed as set of MR functions.

Graph like data models:

Common use case is many to many relationships. Relational DBs make many such relations complex. Instead vertices & edges represent these relations better.

Property graphs:

Each vertex consists of unique identifier and collections of properties. Each edge has a identifier, a head & tail vertex and collections of properties. Pros:

Any vertex can be connected by edge to any vertices.
Given a vertex, you can query for it edges or relations to other vertices.
By using labels for different kind of relationships, we can store any kind of data and can have varying granularity.
New relations are easier to create and easily evolvable.

Cipher, Triple-Stores, SPARQL are examples.

Cipher is built for Neo4j graph database. Queries are made based on the properties in vertex and edges.

In Triple-Stores, all information are stored in form of subject, predicate or verb & object. Subject and object are vertices and predicate is the edge.

The RDF data model:

The Resource Description Framework is a World Wide Web Consortium standard originally designed as a data model for metadata. It has come to be used as a general method for description and exchange of graph data

SPARQL is an RDF query language—that is, a semantic query language for databases

Datalog is older language and laid the foundation for many new query languages. It's model is predicate, subject & object.