Designing Data-Intensive Applications Learning Note

wang in Los Angeles

Tue May 28 2024

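I've finally started reading DDIA, and I want to keep some notes on this blog along with related experience from work. My job is closely related to what the book covers, but I don't have much experience yet and my ideas are probably naive, so writing them down is mainly a way for me to summarize and reflect. I'll also keep the specifics of my work and its tech stack vague.

So let's begin! Today is Chapter 1, and the first step is always the hardest.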

# Thinking About Data Systems

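Data-intensive: this concept is about the amount of data, its complexity, and the speed at which it changes. It is defined in contrast to compute-intensive, where CPU cycles are the main cost. In a data-intensive application, the data is considered the limiting factor.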

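Data systems: the book uses this umbrella term to cover different tools such as databases, queues, and caches, and explains why.

One reason is that the old categories no longer fit the newer tools. The book gives two examples:
- Datastores that are also used as message queues (Redis)
- Message queues with database-like durability guarantees (Apache Kafka)

I have used both tools, but I had no idea they had these properties. All I knew was that Redis is an in-memory database that can serve as a cache, and that Kafka is a message queue. So why does the book say this?

From the Redis documentation: "Redis can be used as a database, cache, streaming engine, message broker, and more." So Redis really can act as a message queue! https://redis.io/glossary/redis-queue/ . From a quick look, a Redis queue suits high-performance, low-latency publish/subscribe scenarios. For scenarios that need long retention or cannot tolerate losing data, though, Redis's in-memory storage is not a good fit, and as far as I can tell it also falls short on high throughput and on complex message routing and filtering. So there really is a reason we don't use Redis as a message queue at work.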

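Kafka, on the other hand, is suited to distributed, high-throughput, real-time data processing. It has several mechanisms designed to protect data durability:
- Messages are stored on disk in order as a partition log and are not modified once written.
- Multiple replicas can be configured.
- The message ack level can be configured.
- Consumer offsets are stored persistently, so a consumer can pick up where it left off even after a restart.
- Log cleanup policies are based on time and size.

Partitions and offsets are both things I've touched at work, and now I finally have a sense of what they do for durability!

The other reason for the umbrella term: a single application often uses more than one of these tools.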
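To make the Kafka durability points above more concrete for myself, here is a minimal sketch of two of them: an append-only partition log on disk, and a persisted consumer offset that survives a restart. This is a toy under my own assumptions, not Kafka's implementation. The class `PartitionLog`, the file names, and the consumer group `"billing"` are all hypothetical, and calling `fsync` on every append is a simplification of this toy rather than a claim about how Kafka flushes data.

```python
import json
import os
import tempfile


class PartitionLog:
    """Toy append-only log for a single partition (hypothetical, not Kafka)."""

    def __init__(self, directory: str) -> None:
        self.log_path = os.path.join(directory, "partition-0.log")
        self.offsets_path = os.path.join(directory, "offsets.json")

    def append(self, message: str) -> int:
        """Append one message and return its offset (its line index)."""
        offset = len(self.read_from(0))
        with open(self.log_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(message) + "\n")
            f.flush()
            os.fsync(f.fileno())  # toy simplification: flush to disk on every write
        return offset

    def read_from(self, offset: int) -> list[str]:
        """Return all messages at or after the given offset."""
        if not os.path.exists(self.log_path):
            return []
        with open(self.log_path, encoding="utf-8") as f:
            return [json.loads(line) for line in f][offset:]

    def commit(self, group: str, next_offset: int) -> None:
        """Persist the next offset this consumer group should read."""
        offsets = self._load_offsets()
        offsets[group] = next_offset
        with open(self.offsets_path, "w", encoding="utf-8") as f:
            json.dump(offsets, f)

    def committed(self, group: str) -> int:
        """Return the committed offset for a group, or 0 if none exists."""
        return self._load_offsets().get(group, 0)

    def _load_offsets(self) -> dict:
        if not os.path.exists(self.offsets_path):
            return {}
        with open(self.offsets_path, encoding="utf-8") as f:
            return json.load(f)


directory = tempfile.mkdtemp()
log = PartitionLog(directory)
for message in ["a", "b", "c"]:
    log.append(message)

# The consumer handles the first two messages, then commits its position.
first_batch = log.read_from(log.committed("billing"))[:2]
log.commit("billing", 2)

# Simulate a restart: a brand-new object pointed at the same directory.
restarted = PartitionLog(directory)
remaining = restarted.read_from(restarted.committed("billing"))
print(remaining)  # ['c']
```

Because both the log and the offset live on disk, the "restarted" consumer neither loses message `c` nor re-reads `a` and `b`.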

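This part raises several tricky questions.

How do you ensure that the data remains correct and complete, even when things go wrong internally?

This should be what reliability is about. Roughly... replaying logs to fill in missing data? Online or offline data validation to correct errors? That is how it works in my experience, and in theory the data ends up eventually consistent.

How do you provide consistently good performance to clients, even when parts of your system are degraded?

Reliability again. What I've touched here: data sharding (fault isolation), primary/standby storage failover (redundancy and replication), load balancers that route requests to healthy nodes, and auto-scaling.

How do you scale to handle an increase in load?

This should be scalability. So far all I have done is add capacity: more machines! More CPU cores! Bigger caches! It has felt sufficient, as long as every stage can be scaled like crazy...

What does a good API for the service look like?

A maintainability question. My current thinking is to balance generality against specific needs.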

# Reliability

Continuing to work correctly, even when things go wrong.

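The section starts by explaining fault versus failure. My rough understanding: a fault is one part of the system going wrong, while a failure is the system as a whole going wrong. The system has to tolerate faults (be resilient) so that they never grow into failures.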
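A minimal sketch of the fault/failure distinction above, using redundant replicas: one replica being down is a fault, and the read only becomes a failure if every replica is down. The names `Replica`, `ReplicaDown`, and `fault_tolerant_get` are hypothetical and only for illustration.

```python
class ReplicaDown(Exception):
    """Raised by a replica that cannot serve a request (a fault)."""


class Replica:
    def __init__(self, name: str, data: dict, healthy: bool = True) -> None:
        self.name = name
        self.data = data
        self.healthy = healthy

    def get(self, key: str) -> str:
        if not self.healthy:
            raise ReplicaDown(self.name)
        return self.data[key]


def fault_tolerant_get(replicas: list, key: str) -> str:
    """Try each replica in order; fail only if all of them are faulty."""
    failed = []
    for replica in replicas:
        try:
            return replica.get(key)
        except ReplicaDown as exc:
            failed.append(str(exc))  # tolerate the fault and try the next one
    raise RuntimeError(f"all replicas failed: {failed}")


data = {"user:1": "alice"}
replicas = [
    Replica("primary", data, healthy=False),  # a fault in one component
    Replica("standby", data),
]
result = fault_tolerant_get(replicas, "user:1")
print(result)  # alice
```

The caller still gets `alice` even though the primary is down, so the fault was tolerated and never surfaced as a failure.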

# Hardware Faults

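This was a refresher on MTTF (mean time to failure): hardware faults are usually permanent, so we reason about an expected average lifetime.

It reminded me of the time I asked a colleague on the database team "why did the machine break?" and got the answer "machines just break!" Ah, it was as if I had amnesia about this whole area.

The common way to deal with hardware faults is to add redundancy: keep spares around and swap them in promptly. The more machines you use, the higher the chance that one of them is broken at any given moment. For example, on a cloud platform (AWS), instances can become unavailable without warning, because such platforms are designed to prioritize flexibility and elasticity over single-machine reliability.

Migrating instances! The first operation I learned on call.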
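To put a rough number on "more machines, more broken machines", here is a back-of-the-envelope calculation. It assumes disks fail independently at a constant rate of 1/MTTF, which is a simplification. The book cites an MTTF of about 10 to 50 years for hard disks and a cluster of 10,000 disks; the function name `expected_failures_per_day` is my own.

```python
def expected_failures_per_day(num_disks: int, mttf_years: float) -> float:
    """Expected disk failures per day, assuming independent disks
    that each fail at a constant rate of 1 / MTTF."""
    mttf_days = mttf_years * 365
    return num_disks / mttf_days


for mttf_years in (10, 30, 50):
    rate = expected_failures_per_day(10_000, mttf_years)
    print(f"MTTF {mttf_years} years: about {rate:.2f} failures per day")
```

Even with a long per-disk lifetime, a large enough fleet sees a dead disk somewhere on the order of once a day, which matches the "machines just break" answer.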

# Software Errors

The book gives several examples:

A software bug that causes every instance of an application server to crash when given a particular bad input. For example, consider the leap second on June 30, 2012, that caused many applications to hang simultaneously due to a bug in the Linux kernel.

The leap second problem: this is the first I've heard of it, so here is a link in case I forget: https://juejin.cn/post/7177561474288582714

A runaway process that uses up some shared resource—CPU time, memory, disk space, or network bandwidth.

This one I have run into. Memory leaks and infinite retries should both fall into this category.

A service that the system depends on that slows down, becomes unresponsive, or starts returning corrupted responses.

When a core dependency has problems, the services that call it run into problems too. True enough.

Cascading failures, where a small fault in one component triggers a fault in another component, which in turn triggers further faults.

I haven't run into this yet. It sounds formidable, so I'll keep an eye out for it (though never running into it would be best!).
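For the "infinite retry" kind of runaway process mentioned above, one common mitigation is to cap the number of attempts and back off between them. A minimal sketch under my own assumptions: the helper `call_with_retries` and its parameters are hypothetical, and catching every `Exception` is a simplification that a real service would narrow to retryable errors.

```python
import random
import time


def call_with_retries(operation, max_attempts=5, base_delay=0.1,
                      max_delay=2.0, sleep=time.sleep):
    """Call operation(), retrying on error with capped exponential backoff.

    Gives up and re-raises after max_attempts so that retries stay bounded.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # bounded: stop instead of retrying forever
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))  # jitter spreads out retry bursts


attempts = {"count": 0}


def flaky():
    """Fails twice, then succeeds (stands in for a slow dependency)."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("dependency unavailable")
    return "ok"


# Pass a no-op sleep so the example runs instantly.
result = call_with_retries(flaky, sleep=lambda seconds: None)
print(result, attempts["count"])  # ok 3
```

Bounding retries also bears on the cascading-failure example: a caller that backs off puts less extra load on a dependency that is already struggling.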

# Human Errors

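Ways to prevent or handle this kind of problem: design systems to minimize opportunities for human error; set up sandbox environments; test thoroughly with unit tests, automated tests, and manual tests; make it possible to roll back quickly and roll out changes gradually; and set up detailed, clear monitoring.

I've been exposed to nearly all of these, so it seems my work environment is pretty well run!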
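As a small illustration of the gradual rollout idea in the list above, here is a sketch of percentage-based rollout using a stable hash, so the same user always lands in the same bucket. The function names, the feature name `"new-checkout"`, and the user IDs are hypothetical, and real rollout systems have many more controls than this.

```python
import hashlib


def rollout_bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_enabled(user_id: str, feature: str, percent: int) -> bool:
    """True if the user falls inside the first `percent` buckets."""
    return rollout_bucket(user_id, feature) < percent


users = [f"user-{i}" for i in range(1000)]
at_10 = {u for u in users if is_enabled(u, "new-checkout", 10)}
at_50 = {u for u in users if is_enabled(u, "new-checkout", 50)}

print(len(at_10), len(at_50))
print(at_10 <= at_50)  # True: raising the percentage only adds users
```

Setting the percentage back to 0 disables the feature for everyone, which gives a quick rollback path without a redeploy.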
