Disclaimer: Don't read this review; go ahead and read the book, it will be much better for you. I wrote this only for my own future reference.
First of all, I love to read books, and I do not believe that a review should influence whether someone reads a book or not. However, as a researcher, one of the most important things that I learned during my years at the university was the importance of writing reviews of papers. After reading a paper, we should write a review. Not to publish it, but to describe in our own words what we read and to keep it for future reference. So, do not expect a complete summary of the book or of the concepts presented in it. Also, do not think that this is a replacement for or a complement to the book; it is not. This book is fantastic and you should read it if you like this subject.
After the important disclaimer and the clarification about this review, here are my notes after reading this book.
The book Designing Data-Intensive Applications by Martin Kleppmann is divided into three parts. Part 1, called Foundations of Data Systems, is composed of chapters 1 to 4 and describes the foundations of the book, related to the storage, encoding, and retrieval of information. Part 2, called Distributed Data, is composed of chapters 5 to 9 and presents common problems and solutions in distributed systems. Part 3, called Derived Data, is the last part of the book and is composed of chapters 10 to 12. This part is dedicated to offline and near-real-time systems.
Chapter 1 is the introduction to the book and describes in detail its three key concepts: Reliability, Scalability, and Maintainability. Chapter 2 compares different data models and query languages. This chapter shows examples of how to model the same problem in relational and NoSQL databases and compares the pros and cons of each solution. It also provides a historical overview of how data models and query languages have evolved.
Chapter 3 describes different ways of storing and retrieving information. It has an excellent introduction to log-structured merge-trees (LSM-trees) and B-trees. The chapter finishes by comparing different types of storage for online transaction processing (OLTP) and online analytical processing (OLAP). Chapter 4, the last of part 1, describes ways to serialize data when we need to send it over a network. It also describes the forward and backward compatibility problem that appears when a data serialization model is updated, and gives an overview of serialization problems with possible solutions. CSV, XML, and JSON are the textual data formats discussed, while Thrift, Protocol Buffers, and Avro are presented as binary format options. At the end of the chapter, REST, remote procedure calls (RPC), and message-passing dataflow are described. If you are not interested in how to serialize data or in how to enable two services to communicate via data, you can skip this chapter.
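To remind myself what forward and backward compatibility mean in practice, here is a minimal sketch in Python. It is my own illustration, not an example from the book: the record with the fields "name" and "email", and all function names, are hypothetical, and I use JSON from the standard library instead of Thrift, Protocol Buffers, or Avro to keep it self-contained.

```python
import json

# Hypothetical schemas (assumption for illustration):
# version 1 has only "name"; version 2 adds an optional "email".

def encode_v1(name):
    """Writer running the old code."""
    return json.dumps({"name": name})

def encode_v2(name, email):
    """Writer running the new code."""
    return json.dumps({"name": name, "email": email})

def decode_v1(payload):
    """Old reader: keeps the fields it knows and ignores the rest.
    This is what makes it forward compatible with newer writers."""
    record = json.loads(payload)
    return {"name": record["name"]}

def decode_v2(payload):
    """New reader: falls back to a default when the new field is absent.
    This is what makes it backward compatible with older writers."""
    record = json.loads(payload)
    return {"name": record["name"], "email": record.get("email")}

# Old code reading data written by new code.
old_reads_new = decode_v1(encode_v2("Ada", "ada@example.com"))
# New code reading data written by old code.
new_reads_old = decode_v2(encode_v1("Ada"))

print(old_reads_new)
print(new_reads_old)
```

The point of the sketch is only that a new field must be optional or have a default, and that old code must tolerate fields it does not know. The binary formats discussed in the chapter each have their own rules for achieving this.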
Chapter 5 describes the problems of replication and their solutions; replication is a very simple concept but difficult to implement correctly. Chapter 6 complements the previous chapter and explains the partitioning problem and possible solutions. Even though replication and partitioning look like similar problems, they differ in purpose. Replication keeps copies of the same data on several servers, which lets us distribute requests between them, while partitioning splits the data so that each server holds only a part of it. These two concepts are complementary and are usually used together.
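The difference is easier for me to remember with a small sketch. This is my own simplified illustration in Python, not a scheme taken from the book: the node names, the number of partitions, the replication factor, and the placement rule (a hash of the key, then the next nodes in the list) are all assumptions chosen to keep the example short.

```python
import hashlib

# Hypothetical cluster layout (all values are assumptions for illustration).
NODES = ["node-a", "node-b", "node-c", "node-d"]
NUM_PARTITIONS = 8
REPLICATION_FACTOR = 3

def partition_for(key):
    """Partitioning: each key belongs to exactly one partition,
    chosen here by a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

def replicas_for(partition):
    """Replication: each partition is copied to several distinct nodes,
    chosen here by walking the node list from a starting position."""
    start = partition % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]

def nodes_for_key(key):
    """Both ideas together: find the partition, then its replicas."""
    return replicas_for(partition_for(key))

for key in ["user:1", "user:2", "user:3"]:
    print(key, "-> partition", partition_for(key), "on", nodes_for_key(key))
```

Each key lives in a single partition (the data is divided), and each partition lives on three nodes (the data is copied). A real system also has to handle rebalancing and failed nodes, which this sketch ignores.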
Chapter 7 explains the concept of transactions and the guarantees they can provide. In addition, it explains several isolation levels and what happens when a race condition occurs. It is a very important chapter if you would like to understand the problems that can appear when working with transactions in databases. Chapter 8 describes many different types of errors that can happen in a distributed system. This chapter is very useful if you would like to start understanding how difficult distributed systems can be. Chapter 9, the last of part 2, presents two central ideas in distributed systems: Consistency and Consensus. It explains how difficult it can be to solve the problems presented in Chapter 8. In addition, it describes many techniques to solve them.
Chapter 10 is dedicated to batch processing and shows a very good example of this technique with Unix tools. It also describes the foundations of MapReduce and the problems that this technology solves. Of course, it shows the limitations of this technology too. Chapter 11 presents stream processing and its similarities to and differences from batch processing. Part 3 finishes with Chapter 12, which presents the author's vision of the future of data systems.
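As a note to myself on the MapReduce idea, here is a minimal single-process sketch in Python. It is my own toy word count, not code from the book, and the function names are hypothetical. A real MapReduce framework runs the map and reduce steps on many machines and does the grouping over the network; here everything happens in memory.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (key, value) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)

def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))

counts = word_count(["the quick fox", "the lazy dog", "The end"])
print(counts)
```

The structure is similar in spirit to a Unix pipeline that splits the input, sorts it, and counts adjacent duplicates: each step does one small job and passes its output to the next one.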
When I was a young software engineer, I always tried to learn all the concepts presented in this book, and I remember that it was not easy to find good references to study these subjects. I believe that this book is exactly that: a strong reference on distributed systems. In my opinion, this book belongs in the "must read" category. Another aspect that got my attention is the list of literature references that it provides. Just by reading this book, you can find excellent references to dive into any subject presented in it. I hope you will enjoy reading this book!