Design an Ad Click Aggregator
Functional Requirements
Users can click on an ad and be redirected to the advertiser’s website
Advertisers can query ad click metrics over time with a minimum granularity of 1 minute.
Non-functional Requirements
- The system is fault tolerant and the aggregated counts remain available
- The system can scale to handle high QPS
- Queries should be fast (< 500 ms)
- Results should be as close to real time as possible
- Each click is counted idempotently (duplicate clicks are not double counted)
API interface
input: user click data
output: advertisers’ metrics
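A rough sketch of what these two endpoints could look like; the paths, field names, and types are illustrative assumptions rather than part of the notes above.

```java
import java.time.Instant;
import java.util.List;

// Hypothetical request/response shapes for the two APIs above.
public class AdAggregatorApi {

    // POST /clicks: sent by the browser when a user clicks an ad
    // (the service records the click and redirects to the advertiser's URL).
    public record ClickRequest(String adId, String impressionId, Instant clickedAt) {}

    // GET /ads/{adId}/metrics?from=...&to=...&granularity=MINUTE: queried by advertisers.
    public record MetricsBucket(Instant bucketStart, long clicks) {}
    public record MetricsResponse(String adId, List<MetricsBucket> buckets) {}
}
```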
High Level Design
Benefits of Flink
We can set the aggregation window to 1 minute,
but keep the flush (emit) interval shorter so the results stay close to real time.
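A minimal sketch of that Flink job, assuming a simple ClickEvent type and leaving the real Kafka/Kinesis source and OLAP sink out: the tumbling window stays at 1 minute, while ContinuousEventTimeTrigger emits partial counts every 10 seconds so the metrics reach the store quickly (the sink would then upsert the latest count per window).

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.triggers.ContinuousEventTimeTrigger;

public class ClickAggregationJob {

    // Simplified click event; in the real system this is deserialized from the stream record.
    public static class ClickEvent {
        public String adId;
        public long timestampMillis;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<ClickEvent> clicks = sampleSource(env); // placeholder for the Kafka/Kinesis source

        clicks
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<ClickEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                    .withTimestampAssigner((event, ts) -> event.timestampMillis))
            .keyBy(event -> event.adId)
            // 1-minute tumbling windows match the minimum query granularity
            .window(TumblingEventTimeWindows.of(Time.minutes(1)))
            // emit partial counts every 10 seconds so downstream stays near real time
            .trigger(ContinuousEventTimeTrigger.of(Time.seconds(10)))
            .aggregate(new CountClicks())
            .print(); // replace with the OLAP sink

        env.execute("ad-click-aggregation");
    }

    // Counts clicks per (adId, window).
    public static class CountClicks implements AggregateFunction<ClickEvent, Long, Long> {
        public Long createAccumulator() { return 0L; }
        public Long add(ClickEvent e, Long acc) { return acc + 1; }
        public Long getResult(Long acc) { return acc; }
        public Long merge(Long a, Long b) { return a + b; }
    }

    private static DataStream<ClickEvent> sampleSource(StreamExecutionEnvironment env) {
        ClickEvent sample = new ClickEvent();
        sample.adId = "ad-1";
        sample.timestampMillis = System.currentTimeMillis();
        return env.fromElements(sample);
    }
}
```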
Deep Dive
How do we support 10k clicks per second?
Sharding!
Click Processor Service: We can easily scale this service horizontally by adding more instances. Most modern cloud providers like AWS, Azure, and GCP provide managed services that automatically scale services based on CPU or memory usage. We’ll need a load balancer in front of the service to distribute the load across instances.
Stream: Both Kafka and Kinesis are distributed and can handle a large number of events per second but need to be properly configured. Kinesis, for example, has a limit of 1MB/s or 1000 records/s per shard, so we’ll need to add some sharding. Sharding by AdId is a natural choice; this way, the stream processor can read from multiple shards in parallel since they will be independent of each other (all events for a given AdId will be in the same shard).
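A hedged producer-side sketch with the standard Kafka Java client; the topic name, broker address, and JSON payload are assumptions. Using the AdId as the record key is what guarantees all clicks for a given ad hash to the same partition (shard).

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ClickProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String adId = "ad-123"; // hypothetical ad id
            String click = "{\"adId\":\"ad-123\",\"impressionId\":\"imp-456\"}";
            // Keyed by adId: the default partitioner hashes the key, so every click for
            // this ad lands on the same partition and is read by the same processor task.
            producer.send(new ProducerRecord<>("ad-clicks", adId, click));
        }
    }
}
```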
Stream Processor: The stream processor, like Flink, can also be scaled horizontally by adding more tasks or jobs. We’ll have a separate Flink job reading from each shard, doing the aggregation for the AdIds in that shard.
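As a sketch of the scaling knob on the Flink side, the same effect can be had by running one job whose parallelism matches the shard count instead of one job per shard; the count of 8 below is an assumption.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParallelismConfig {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // One parallel sub-task per stream shard: each sub-task reads its shard and
        // aggregates the AdIds that hash to it, independently of the other sub-tasks.
        env.setParallelism(8);
    }
}
```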
OLAP Database: The OLAP database can be scaled horizontally by adding more nodes. While we could shard by AdId, we may also consider sharding by AdvertiserId instead. In doing so, all the data for a given advertiser will be on the same node, making queries for that advertiser’s ads faster. This is in anticipation of advertisers querying for all of their active ads in a single view. Of course, it’s important to monitor the database and query performance to ensure they meet the SLAs, and to adapt the sharding strategy as needed.
How do we ensure we don’t lose data?
A 7-day retention policy on the stream lets us replay the raw click events if anything downstream fails.
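For example, assuming the stream is Kafka, the click topic could be created with an explicit 7-day retention so raw events stay replayable; the topic name, partition count, and replication factor below are assumptions.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateClickTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 8 partitions, replication factor 3, and 7 days of retention so raw clicks
            // can be replayed if a downstream component loses data.
            NewTopic topic = new NewTopic("ad-clicks", 8, (short) 3)
                    .configs(Map.of("retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```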
Flink has checkpointing, but when the aggregation window is small it is not a great fit; checkpoints are relatively costly.
High-throughput scenario: when the data volume is very large, recovering from a checkpoint can be faster than reprocessing all the data.
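If checkpointing is turned on for the high-throughput case, the setup is small; the 60-second interval, exactly-once mode, and minimum pause below are assumptions.

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot operator state every 60 seconds; with a 1-minute window the protected
        // state is small, which is why checkpointing may not be worth its cost here.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);
        // Leave breathing room between checkpoints so they don't dominate throughput.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);
    }
}
```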
How do we make the system fault tolerant?
Introduce a Spark batch job to store/reprocess the raw click events, so the aggregates can be rebuilt if the streaming path fails.
How do we keep click processing idempotent?
The better approach is to have the Ad Placement Service generate a unique impression ID for each ad instance shown to the user. This impression ID would be sent to the browser along with the ad and will serve as an idempotency key. When the user clicks on the ad, the browser sends the impression ID along with the click data. This way we can dedup clicks based on the impression ID.
We won’t be able to do this easily in Flink because we need to dedup across aggregation windows. If a duplicate click comes in on either side of a minute boundary, it will be counted as two clicks. Instead, we should dedup before we put the click in the stream. When a click comes in, we check if the impression ID exists in a cache. If it does, then it’s a duplicate and we ignore it. If it doesn’t, then we put the click in the stream and add the impression ID to the cache.
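A minimal sketch of that cache check, assuming Redis via the Jedis client (any cache with an atomic set-if-absent works): SET with NX plus a TTL performs the existence check and the insert in one atomic step, so two concurrent duplicates cannot both pass. If firstClick returns true we publish the click to the stream; otherwise we drop it.

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class ClickDeduper {
    private final Jedis redis = new Jedis("localhost", 6379); // assumed cache endpoint

    /** Returns true only the first time this impression ID is seen. */
    public boolean firstClick(String impressionId) {
        // SET key value NX EX: succeeds only if the key does not already exist,
        // and expires it after 24h so the cache does not grow without bound.
        String result = redis.set("click:" + impressionId, "1",
                SetParams.setParams().nx().ex(24 * 60 * 60));
        return "OK".equals(result);
    }
}
```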
The end-to-end flow:
1. The Ad Placement Service generates a unique impression ID for each ad instance shown to the user.
2. The impression ID is signed with a secret key and sent to the browser along with the ad.
3. When the user clicks on the ad, the browser sends the impression ID along with the click data.
4. The Click Processor verifies the signature of the impression ID.
5. The Click Processor checks whether the impression ID exists in a cache. If it does, it’s a duplicate and we ignore it. If it doesn’t, we put the click in the stream and add the impression ID to the cache.
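A sketch of the signing and verification steps using HMAC-SHA256 from the JDK; the token format (impressionId.signature) and the way the key is provided are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class ImpressionSigner {
    private final SecretKeySpec key;

    public ImpressionSigner(byte[] secret) {
        this.key = new SecretKeySpec(secret, "HmacSHA256");
    }

    /** Ad Placement Service: returns "impressionId.signature" to embed with the ad. */
    public String sign(String impressionId) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(key);
        byte[] sig = mac.doFinal(impressionId.getBytes(StandardCharsets.UTF_8));
        return impressionId + "." + Base64.getUrlEncoder().withoutPadding().encodeToString(sig);
    }

    /** Click Processor: rejects clicks whose impression ID was forged or tampered with. */
    public boolean verify(String token) throws Exception {
        int dot = token.lastIndexOf('.');
        if (dot < 0) return false;
        String expected = sign(token.substring(0, dot));
        // Constant-time comparison to avoid leaking signature bytes via timing.
        return MessageDigest.isEqual(
                expected.getBytes(StandardCharsets.UTF_8),
                token.getBytes(StandardCharsets.UTF_8));
    }
}
```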
How do we query metrics with low latency?
Where this query can still be slow is when we are aggregating over larger time windows, like days, weeks, or even years. In this case, we can pre-aggregate the data in the OLAP database by creating a new table that stores the data at a coarser granularity, such as daily or weekly. A nightly cron job runs a query that aggregates the minute-level data and stores it in the new table. When an advertiser queries the data, they query the pre-aggregated table at the coarser granularity and then drill down to the finer granularity if needed.
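A hedged sketch of that nightly job as a plain JDBC program driven by cron; the table and column names (ad_clicks_minutely, ad_clicks_daily) and the generic SQL are assumptions that would be adapted to the chosen OLAP store.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class DailyRollupJob {
    public static void main(String[] args) throws Exception {
        // args[0]: JDBC URL of the OLAP database. Schedule this class from a nightly cron job.
        try (Connection conn = DriverManager.getConnection(args[0]);
             Statement stmt = conn.createStatement()) {
            // Roll yesterday's minute-level counts up into a daily table so queries over
            // long time ranges scan far fewer rows.
            stmt.executeUpdate(
                "INSERT INTO ad_clicks_daily (ad_id, day, clicks) " +
                "SELECT ad_id, CAST(minute_start AS DATE) AS day, SUM(clicks) " +
                "FROM ad_clicks_minutely " +
                "WHERE minute_start >= CURRENT_DATE - INTERVAL '1' DAY " +
                "  AND minute_start < CURRENT_DATE " +
                "GROUP BY ad_id, CAST(minute_start AS DATE)");
        }
    }
}
```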