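System Design: Design Dropbox/Google Drive

These are notes on the North American mock-interview question "Design Dropbox". The original video is available on the System Design Guru channel.

System design interviews in North America have no standard answer. They are entirely open-ended, and any answer is acceptable as long as it is well reasoned and clearly explains the tradeoffs of each choice.

The question:

Design a Dropbox-like file system that supports:

1. Users uploading and downloading files

2. Syncing files across multiple clients/users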

Key Points:

  1. Data Storage
    1. Metadata Storage:
      • Use a Relational Database (RDBMS) for metadata because metadata is well-structured, has a fixed schema, and benefits from ACID transactions. This helps avoid inconsistency and handle concurrency effectively.
    2. File Storage:
      • Use a Blob Store for file storage as it is cost-effective and supports unstructured data.
      • In the interview, say "Blob Store" rather than naming a specific technology such as S3, so the design is not tied to the limitations of one implementation.
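
As a rough illustration of the split described in this section, the sketch below keeps file metadata in a relational table and the bytes in a separate store. The table layout, the column names, the `save_file`/`load_file` helpers, and the in-memory dict standing in for a blob store are all assumptions made for this example, not part of the original notes.

```python
import sqlite3
import uuid

# Hypothetical schema: one row per file; the bytes live elsewhere under blob_key.
SCHEMA = """
CREATE TABLE files (
    file_id   INTEGER PRIMARY KEY,
    owner_id  INTEGER NOT NULL,
    name      TEXT    NOT NULL,
    size      INTEGER NOT NULL,
    blob_key  TEXT    NOT NULL UNIQUE,
    UNIQUE (owner_id, name)
);
"""

blob_store = {}  # in-memory stand-in for a real blob store (assumption)


def save_file(db, owner_id, name, data):
    """Write the bytes to the blob store, then record metadata in one transaction."""
    blob_key = uuid.uuid4().hex
    blob_store[blob_key] = data  # bytes first, so metadata never points at nothing
    with db:  # commits on success, rolls back if the INSERT fails
        cur = db.execute(
            "INSERT INTO files (owner_id, name, size, blob_key) VALUES (?, ?, ?, ?)",
            (owner_id, name, len(data), blob_key),
        )
    return cur.lastrowid


def load_file(db, file_id):
    """Look up the blob key in the metadata table, then fetch the bytes."""
    row = db.execute(
        "SELECT blob_key FROM files WHERE file_id = ?", (file_id,)
    ).fetchone()
    return blob_store[row[0]]


db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
file_id = save_file(db, owner_id=1, name="notes.txt", data=b"hello")
```

The `UNIQUE (owner_id, name)` constraint is one small example of what the relational side buys: a second upload with the same owner and name is rejected by the database rather than by application code.
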
  2. Data Sharding
    1. Shard by Item ID:
      • Pros:
        1. Even Distribution: Ensures balanced load across shards.
        2. Fault Tolerance: If a shard is down, only some files are inaccessible rather than all files for a user.
      • Cons:
        1. Complex Queries: Cross-shard queries are more complex and can result in higher network latency.
    2. Shard by User ID:
      • Pros:
        1. Data Locality: All files uploaded by one user are stored together in one shard, simplifying user-specific queries.
      • Cons:
        1. Potential Hot Spots: Heavy users can create hotspots, e.g., a user with millions of files.
        2. Fault Tolerance: If a shard is down, affected users lose access to all their files.
        3. Data Movement: Moving entire user datasets between shards during rebalancing is complex.
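
The two sharding keys can be compared with a small routing sketch. It assumes a fixed shard count and simple hash-modulo placement; `NUM_SHARDS`, `shard_for`, and the key prefixes are hypothetical names chosen for this example.

```python
import hashlib

NUM_SHARDS = 8  # assumed shard count


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard with a stable hash.

    Python's built-in hash() is salted per process for strings,
    so a cryptographic digest is used to get repeatable placement.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def shard_by_item(item_id: str) -> int:
    return shard_for(f"item:{item_id}")


def shard_by_user(user_id: str) -> int:
    return shard_for(f"user:{user_id}")


# One heavy user with 1000 files.
files = [("alice", f"file-{i}") for i in range(1000)]
user_shards = {shard_by_user(user) for user, _ in files}
item_shards = {shard_by_item(item) for _, item in files}
```

With user-ID sharding, `user_shards` holds a single shard, so listing this user's files is one query but all the load and all the outage risk sit on that shard. With item-ID sharding, the same files spread over many shards, so the listing has to fan out.
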
  3. File Uploading
    1. Async Uploading:
      • Use message queues to run upload processing as asynchronous jobs, so the system can handle large-scale and concurrent file uploads efficiently.
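
A minimal sketch of the asynchronous pattern follows. It uses an in-process `queue.Queue` and a single worker thread as a stand-in for a real message queue and worker fleet; the job fields and the "processing" step are hypothetical.

```python
import queue
import threading

jobs = queue.Queue()  # in-process stand-in for a message queue (assumption)
processed = []


def worker():
    """Consume jobs until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            jobs.task_done()
            break
        # Placeholder for real work such as virus scanning or indexing.
        processed.append(f"processed {job['file_id']}")
        jobs.task_done()


def handle_upload_request(file_id):
    """Enqueue the job and return immediately instead of doing the work inline."""
    jobs.put({"file_id": file_id})
    return "accepted"


thread = threading.Thread(target=worker, daemon=True)
thread.start()
for fid in ("f1", "f2", "f3"):
    handle_upload_request(fid)
jobs.put(None)  # tell the worker to stop
jobs.join()     # wait until every queued item has been handled
```

The request handler only pays the cost of an enqueue, so a burst of uploads fills the queue rather than exhausting request threads.
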
    2. Data Chunking:
      • Why:
        1. Handling Large Files: Chunking breaks large files into manageable pieces, improving resource management and reliability.
        2. Parallel Uploads: Speeds up uploads by allowing multiple chunks to be uploaded simultaneously, maximizing throughput.
        3. Resume Capability: Allows uploads to resume from the last successfully uploaded chunk after an interruption.
        4. Scalability: Distributes upload load across multiple servers or storage nodes.
      • Chunk Size:
        • Small Chunks: Minimize retry impact but increase overhead.
        • Large Chunks: Reduce overhead but increase the risk of large data retransmission.
        • Typical Chunk Size: 4MB to 8MB is a common balance.
    3. Offset:
      • Use offsets to determine the location of a chunk in the file, enabling accurate file reconstruction after all chunks are uploaded.
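
Chunking and offset-based reconstruction can be sketched together. The function names are hypothetical, and the 4 MB default is simply the low end of the range quoted above.

```python
CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Yield (offset, chunk) pairs that cover the whole payload."""
    for offset in range(0, len(data), chunk_size):
        yield offset, data[offset:offset + chunk_size]


def reassemble(chunks, total_size: int) -> bytes:
    """Rebuild the file from (offset, chunk) pairs received in any order."""
    buf = bytearray(total_size)
    written = 0
    for offset, chunk in chunks:
        buf[offset:offset + len(chunk)] = chunk
        written += len(chunk)
    if written != total_size:
        raise ValueError("missing or duplicate chunks")
    return bytes(buf)


# Small chunk size so the example stays tiny.
data = bytes(range(256)) * 10
chunks = list(split_into_chunks(data, chunk_size=1000))
rebuilt = reassemble(reversed(chunks), total_size=len(data))
```

Because each chunk carries its own offset, chunks can arrive out of order (for example from parallel uploads) and a resumed upload only needs to send the offsets that are still missing.
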
    4. Authentication:
      • Prefer routing uploads through an upload service, which gives better authentication, validation, and control over file uploads.
      • If clients upload directly to the blob store instead, access can be managed with pre-signed URLs that have short TTLs.
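
The idea behind a pre-signed URL with a short TTL can be shown with a generic HMAC sketch. This is not the signing scheme of S3 or any other product; the `SECRET` key, the URL layout, and the `presign`/`verify` helpers are assumptions for illustration only.

```python
import hashlib
import hmac
import time

SECRET = b"demo-secret"  # hypothetical server-side signing key


def _sign(path: str, expires: int) -> str:
    msg = f"{path}:{expires}".encode("utf-8")
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()


def presign(path: str, ttl_seconds: int, now=None) -> str:
    """Return a URL that is valid until now + ttl_seconds."""
    now = time.time() if now is None else now
    expires = int(now) + ttl_seconds
    return f"{path}?expires={expires}&sig={_sign(path, expires)}"


def verify(url: str, now=None) -> bool:
    """Accept the URL only if the signature matches and it has not expired."""
    now = time.time() if now is None else now
    path, _, query = url.partition("?")
    params = dict(part.split("=", 1) for part in query.split("&"))
    expires = int(params["expires"])
    signature_ok = hmac.compare_digest(params["sig"], _sign(path, expires))
    return signature_ok and now <= expires


url = presign("/upload/abc", ttl_seconds=60, now=1000)
```

The upload service authenticates the user once and hands out the URL; the storage side only needs the shared key to check it, and the short TTL limits how long a leaked URL is useful.
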
    5. Failure Recovery (Retry Mechanism):
      • Use exponential backoff, jitter, and retry limits to handle failures efficiently.
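
A minimal retry helper combining the three ingredients might look like the following. The parameter names and defaults are assumptions, and the jitter variant shown (a uniform random wait up to the exponential cap) is one common choice among several.

```python
import random
import time


def retry_with_backoff(op, max_retries=5, base_delay=0.5, max_delay=30.0,
                       sleep=time.sleep):
    """Call op() and retry on failure with exponential backoff and jitter.

    The wait before retry number `attempt` is drawn uniformly from
    [0, min(max_delay, base_delay * 2**attempt)].
    """
    for attempt in range(max_retries + 1):
        try:
            return op()
        except Exception:
            # A real client would retry only transient errors (assumption).
            if attempt == max_retries:
                raise  # retry limit reached, surface the error
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))
```

The `sleep` parameter exists only so the waiting can be replaced in tests. The jitter spreads out clients that failed at the same moment so they do not all retry together.
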
    6. File Checksum:
      • Why: Ensures data integrity by verifying that the uploaded file matches the original file.
      • Implementation: Store the checksum of the original file in the metadata. After the upload completes, compute the checksum of the stored file and compare the two.
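
The checksum flow can be sketched in a few lines. SHA-256 is an assumption here (the notes do not name a hash function), and the `metadata` dict and helper names are hypothetical.

```python
import hashlib

metadata = {}  # file_id -> {"checksum": ...}, stand-in for the metadata store


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def begin_upload(file_id: str, original: bytes) -> None:
    """Record the checksum the client computed over the original file."""
    metadata[file_id] = {"checksum": sha256_hex(original)}


def verify_upload(file_id: str, stored: bytes) -> bool:
    """Recompute the checksum over what was actually stored and compare."""
    return metadata[file_id]["checksum"] == sha256_hex(stored)


begin_upload("f1", b"hello world")
intact = verify_upload("f1", b"hello world")
corrupted = verify_upload("f1", b"hello w0rld")
```

A mismatch tells the server that the stored bytes differ from what the client sent, so the upload can be rejected or retried.
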
    7. Push vs. Pull for Uploading Updates (Notification System):
      • Push:
        • Pros:
          1. Real-time updates.
          2. Reduces unnecessary network requests.
        • Cons:
          1. Resource-intensive due to maintaining open connections.
      • Pull:
        • Pros:
          1. Clients control update frequency.
        • Cons:
          1. Updates arrive only as often as the client polls, so they are not real-time.
          2. Frequent polling increases network traffic and server load.
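
The two delivery styles can be contrasted with a tiny in-memory model. `ChangeFeed` and its methods are hypothetical, and the subscriber callbacks stand in for the open connections a real push system would hold.

```python
class ChangeFeed:
    """In-memory change log that supports both push and pull delivery."""

    def __init__(self):
        self.events = []       # append-only list of change events
        self.subscribers = []  # callbacks standing in for open connections

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, event):
        self.events.append(event)
        for callback in self.subscribers:  # push: deliver immediately
            callback(event)

    def poll(self, cursor):
        """Pull: return the events after `cursor` plus the new cursor."""
        new_events = self.events[cursor:]
        return new_events, cursor + len(new_events)


feed = ChangeFeed()
pushed = []
feed.subscribe(pushed.append)
feed.publish("a.txt changed")

first_poll, cursor = feed.poll(0)
second_poll, cursor = feed.poll(cursor)
```

The push subscriber sees the event as soon as it is published, at the cost of the server tracking every subscriber. The pull client sees it only when it next polls, and `second_poll` comes back empty, which is the wasted request that frequent polling multiplies.
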
    8. Versioning (Conflict Resolution):
      • Handle conflicts, such as two clients modifying the same file.
      • Solutions include "First Write Wins" or "Last Write Wins" strategies.
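
One way to picture the two strategies is with a version counter on each file. Treating "First Write Wins" as an optimistic version check is an interpretation made for this sketch, and `VersionedFile` is a hypothetical class.

```python
class VersionedFile:
    def __init__(self, content: bytes = b""):
        self.content = content
        self.version = 0

    def write_first_wins(self, content: bytes, base_version: int) -> bool:
        """Accept the write only if the client edited the latest version."""
        if base_version != self.version:
            return False  # another client wrote first; the caller must resolve
        self.content = content
        self.version += 1
        return True

    def write_last_wins(self, content: bytes) -> bool:
        """Always accept; the most recent write overwrites earlier ones."""
        self.content = content
        self.version += 1
        return True


# Clients A and B both start from version 0 of the same file.
doc = VersionedFile(b"v0")
a_ok = doc.write_first_wins(b"from A", base_version=0)
b_ok = doc.write_first_wins(b"from B", base_version=0)

other = VersionedFile(b"v0")
other.write_last_wins(b"from A")
other.write_last_wins(b"from B")
```

Under first write wins, B's write is rejected and B has to re-sync or save a conflicted copy. Under last write wins, A's change is silently overwritten, which is simpler but loses data.
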
