系统设计学习（三）Design Amazon‘s sales rank by category feature

本文链接：https://blog.csdn.net/xiaocaij_icai/article/details/125241602

本文详细介绍了如何设计一个高可用的服务来计算过去一周内各类别中最受欢迎的产品。通过使用MapReduce处理销售日志，统计每个商品的销售量并进行排序，最终存储在SQL数据库中。系统设计考虑了大规模交易数据（10亿/月）和读取请求（1000亿/月），并采用多阶段MapReduce确保数据处理效率。同时，提出了使用REST API对外提供查询接口，以便用户获取过去一周的热门商品信息。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

老规矩，先看题

Use cases
We’ll scope the problem to handle only the following use case
Service calculates the past week’s most popular products by category
User views the past week’s most popular products by category
Service has high availability
Out of scope
The general e-commerce site：Design components only for calculating sales rank Constraints and assumptions
State assumptions Traffic is not evenly distributed
Items can be in multiple categories
Items cannot change categories
There are no subcategories ie foo/bar/baz Results must be updated hourly
More popular products might need to be updated more frequently
10 million products
1000 categories
1 billion transactions per month
100 billion read requests per month
100:1 read to write ratio

那么接下来的第一步是进行计算，理清楚，如果要进行粗略的使用情况的计算的话
首先看下每笔交易要用到多大的存储空间
creared_at 5bytes
product_at 8bytes
category_id 4bytes
seller_id 8bytes
buyer_id 8bytes
quantity 4bytes
total_price 5bytes
total 40bytes
因为他说一个月1billion次交易，所以每个月要花费40gb去进行存储这些交易国恒
如果三年的话需要1.44tb进行存储

那么完成粗略的计算之后我们要简单的设计一下这个系统
在这里插入图片描述

完成了系统轮廓的设计之后我们要进行细节的填充了
首先，既然服务要统计过去一周最受欢迎的商品的话，那他就一定要统计所有的订单的信息，我们可以使用sale api去存储处理好的日志，在Object Store而不是管理我们自己的文件系统

我们假设看到的日志如下

timestamp   product_id  category_id    qty     total_price   seller_id    buyer_id
t1          product1    category1      2       20.00         1            1
t2          product1    category2      2       20.00         2            2
t2          product1    category2      1       10.00         2            3
t3          product2    category1      3        7.00         3            4
t4          product3    category2      7        2.00         4            5
t5          product4    category1      1        5.00         5            6
...

Sales Rank service可以使用sale api打印的日志作为输入数据，然后内部使用mapreduce进行一个分布式日志的统计，最后得出sales_rank等信息存储在一个SQL数据库中

我们将使用一个多部的mapreduce
1.转换数据为 (category, product_id), sum(quantity)
2.进行一个分布式的排序

class SalesRanker(MRJob):

    def within_past_week(self, timestamp):
        """Return True if timestamp is within past week, False otherwise."""
        ...

    def mapper(self, _ line):
        """Parse each log line, extract and transform relevant lines.

        Emit key value pairs of the form:

        (category1, product1), 2
        (category2, product1), 2
        (category2, product1), 1
        (category1, product2), 3
        (category2, product3), 7
        (category1, product4), 1
        """
        timestamp, product_id, category_id, quantity, total_price, seller_id, \
            buyer_id = line.split('\t')
        if self.within_past_week(timestamp):
            yield (category_id, product_id), quantity

    def reducer(self, key, value):
        """Sum values for each key.

        (category1, product1), 2
        (category2, product1), 3
        (category1, product2), 3
        (category2, product3), 7
        (category1, product4), 1
        """
        yield key, sum(values)

    def mapper_sort(self, key, value):
        """Construct key to ensure proper sorting.

        Transform key and value to the form:

        (category1, 2), product1
        (category2, 3), product1
        (category1, 3), product2
        (category2, 7), product3
        (category1, 1), product4

        The shuffle/sort step of MapReduce will then do a
        distributed sort on the keys, resulting in:

        (category1, 1), product4
        (category1, 2), product1
        (category1, 3), product2
        (category2, 3), product1
        (category2, 7), product3
        """
        category_id, product_id = key
        quantity = value
        yield (category_id, quantity), product_id

    def reducer_identity(self, key, value):
        yield key, value

    def steps(self):
        """Run the map and reduce steps."""
        return [
            self.mr(mapper=self.mapper,
                    reducer=self.reducer),
            self.mr(mapper=self.mapper_sort,
                    reducer=self.reducer_identity),
        ]

结果是下面的排好序的列表，我们可以将其存入sales_rank表

(category1, 1), product4
(category1, 2), product1
(category1, 3), product2
(category2, 3), product1
(category2, 7), product3

sales_rank表的解构大概如下

id int NOT NULL AUTO_INCREMENT
category_id int NOT NULL
total_sold int NOT NULL
product_id int NOT NULL
PRIMARY KEY(id)
FOREIGN KEY(category_id) REFERENCES Categories(id)
FOREIGN KEY(product_id) REFERENCES Products(id)

我们将在 id 、 category_id 和 product_id 上创建索引以加快查找速度（记录时间而不是扫描整个表）并将数据保存在内存中。
这样的话，当用户想要访问过去一周最受欢迎的产品（每一个类别里面）的时候

client发送请求到webserver，执行反向代理，
webserver直接请求read api，
read api读数据从sql数据库中
我们使用一个公共的rest api

$ curl https://amazon.com/api/v1/popular?category_id=1234

{
    "id": "100",
    "category_id": "1234",
    "total_sold": "100000",
    "product_id": "50",
},
{
    "id": "53",
    "category_id": "1234",
    "total_sold": "90000",
    "product_id": "200",
},
{
    "id": "75",
    "category_id": "1234",
    "total_sold": "80000",
    "product_id": "3",
},