Balancing Data Locality and Scalability

When I was studying at the University of Cincinnati, I took a databases class with Professor Lawrence J. Mazlack. In one of his early lectures, he shared a joke that went like this:

“What do databases and real estate have in common? Location, Location, Location.”

When discussing data, we often emphasize data locality because it directly affects performance and maintenance. Data stored close together can be read faster and backed up more efficiently. Relational database engines achieve high performance and offer various consistency models by leveraging B-trees, a popular data structure that keeps related records close to one another. This is possible because an RDBMS manages well-structured records that live on the same machine or network.
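As a toy illustration of why locality-friendly structures matter, here is a minimal Python sketch in which a sorted list stands in for a B-tree index. It is not how a database engine is actually implemented, but it shows the gap between an indexed seek and a full scan over the same records:

```python
import bisect
import random

# One million "primary keys", kept sorted the way an index keeps
# related entries physically close together.
keys = sorted(random.sample(range(10_000_000), 1_000_000))

def indexed_lookup(key):
    """O(log n) search over sorted keys: the idea behind a B-tree seek."""
    i = bisect.bisect_left(keys, key)
    return i < len(keys) and keys[i] == key

def full_scan(key):
    """O(n) scan: roughly what you pay when no suitable index exists."""
    return any(k == key for k in keys)

print(indexed_lookup(keys[500_000]))  # True, after ~20 comparisons
print(full_scan(keys[500_000]))       # True, after ~500,000 comparisons
```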

On the application side, we have developed both monolithic and loosely coupled systems. Monoliths aim to maximize software performance with a fixed set of hardware resources. Loosely coupled solutions, such as microservices, provide linear scalability, allowing them to handle larger data loads by employing more machines.

Over time, several interesting things have happened as more people embraced technology:

  • Data volumes grew, and data was spread across different storage technologies
  • Containerization replaced virtualization
  • Microservices gained popularity

Achieving linear scalability is often considered the gold standard for modern solutions. However, with each of these transitions, we sacrificed locality to gain scalability:

  • Storing data across different formats and engines performs much worse than keeping it in a single RDBMS
  • Containers require instructions to pass through several layers before reaching the CPU
  • Microservices duplicate data across services, often guaranteeing only “eventual consistency” (sketched right after this list)
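To illustrate that last point, here is a small Python sketch of what eventual consistency looks like when a read model duplicates data owned by another service. The queue stands in for a message broker, and all the names are invented for the example:

```python
from queue import Queue

events = Queue()                      # stands in for a message broker
orders_service = {}                   # source of truth
reporting_service = {}                # duplicated copy used for reads

def place_order(order_id, total):
    orders_service[order_id] = total  # the write succeeds immediately...
    events.put((order_id, total))     # ...and an event is published

def sync_reporting():
    while not events.empty():         # in reality this runs asynchronously
        order_id, total = events.get()
        reporting_service[order_id] = total

place_order("o-1", 99.0)
print(reporting_service.get("o-1"))   # None: the duplicated copy is stale
sync_reporting()
print(reporting_service.get("o-1"))   # 99.0: eventually consistent
```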

While scaling systems, we’ve sacrificed data locality, leading to various problems, such as:

  • Processing data from different sources requires more memory (RAM), since data must be buffered while answering queries or manipulating it (see the sketch after this list)
  • Container orchestrators have become new points of failure
  • Microservices introduce complexities with consistency validation due to additional layers such as message brokers, API gateways, service registries, and distributed caches
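The first problem above is easiest to see in code. The sketch below, with made-up data and names, joins records from two separate sources in application memory; both sides have to be buffered before the question can be answered, whereas a single RDBMS would do the join inside the engine:

```python
# Application-side join across two data sources.
users_db = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Linus"},
]                                     # e.g. rows from a relational store
orders_api = [
    {"user_id": 1, "total": 30.0},
    {"user_id": 1, "total": 12.5},
    {"user_id": 2, "total": 99.0},
]                                     # e.g. JSON from another service

def totals_per_user():
    # Buffer one side into a lookup table (extra RAM), then merge.
    by_user = {u["id"]: u["name"] for u in users_db}
    totals = {}
    for order in orders_api:
        name = by_user[order["user_id"]]
        totals[name] = totals.get(name, 0.0) + order["total"]
    return totals

print(totals_per_user())              # {'Ada': 42.5, 'Linus': 99.0}
```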

When designing systems, data locality is often not given enough attention. Several factors should be weighed when deciding on the data storage stack:

  • Cost – Startups are likely to choose a technology that is free, open-source, and has a supportive community maintaining the database engine.
  • Consistency Needs – Does the system require ACID operations at a record level?
  • Partitioning and Sharding – Does the data need to be split across multiple stores or shards?
  • Encryption – Does the data need to be encrypted at rest and during transport?
  • Redundancy – Does the data need to be geographically redundant and available to customers in different markets?
  • Ingestion Speed Needs – Will the database engine be able to handle millions of records per second?
  • Reporting Needs – Can the engine accommodate the required reporting tools?
  • Recovery – What kind of backup and recovery mechanisms are required?
  • High Availability – What kind of SLA should the system deliver?
  • Blob Data – How should blob data be handled by the database?
  • User Base and Data Growth – How many users will use the system, and how much data will be added every quarter/year?

A single database is the right choice for most systems, and many database engines provide packaged solutions to these challenges. A microservice-based solution that is hosted in Kubernetes clusters, highly available, and geographically redundant can be very expensive to build and maintain. Moreover, the benefits of microservices only become evident when the system must handle massive loads and harness significant processing power from multiple machines. In a simple scenario, a monolith will outperform a microservice-based solution because there are fewer communication hops between components.
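A rough back-of-the-envelope model makes that last point concrete. The numbers below are assumptions chosen for illustration, not measurements; what matters is the shape of the result, since every boundary that becomes a network hop adds its cost to the critical path:

```python
# Illustrative, assumed costs (microseconds), not measurements.
IN_PROCESS_CALL_US = 1   # crossing a module boundary inside a monolith
NETWORK_HOP_US = 500     # one intra-cluster RPC: serialize, send, deserialize

def request_latency_us(boundaries, per_boundary_us, work_us=2_000):
    """Total latency: the real work plus the cost of crossing each boundary."""
    return work_us + boundaries * per_boundary_us

print(request_latency_us(4, IN_PROCESS_CALL_US))  # monolith: ~2,004 us
print(request_latency_us(4, NETWORK_HOP_US))      # microservices: ~4,000 us
```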

Conclusion

Addressing data needs and data locality in each system component can save a lot of money in the long run. Loosely coupled solutions, such as microservices, add more complexity because data locality is sacrificed for linear scalability.
