Tap-News project

StellaLiu萤窗小语

Published 2021-03-09 06:24:56

Category: System Design

Article link: https://blog.csdn.net/anqi3776/article/details/114568354

Project: Collaborative Online Judge System
● Implemented a web-based collaborative code editor that supports simultaneous editing by multiple users (ACE, Socket.io, Redis);
● Designed and developed a single-page web application for coding problems (Angular2, Auth0, Node.js, MongoDB);
● Built a user-code executor service that builds and executes users' code (Docker, Flask);
● Refactored the system and improved throughput by decoupling services with RESTful APIs and load balancing with Nginx (REST API, Nginx).
Skills:
JavaScript, Python, Angular2, Node.js, Express, Redis, MongoDB, Docker, RESTful API, Nginx.
Tools:
Unix / Linux, VirtualBox, Git, Sublime(Atom/Vim), Apache Bench.

(Tap News) Tell me about this project
In my recent project, I designed and implemented a real-time news scraping and recommendation system. The system uses a news pipeline to scrape the latest news from various sources such as CNN and BBC. To render the news, I built a single-page web application using React. In addition, to customize the news list for each user, I designed and built a training pipeline for news topic modeling using TensorFlow.

(Tap News) What’s the biggest challenge in this project?
One challenge I faced was the architecture design of the entire system, which consists of several subsystems. Some, like the web server, require a prompt response. Others, such as news fetching, take longer to process but can run asynchronously. To make all subsystems work together, I decoupled the web server, backend server and machine learning serving system using RPC. I also used the Worker Thread pattern, integrated with RabbitMQ, to implement news fetching. As a result, all subsystems are organized in a Service Oriented Architecture.

(Tap News) What can be improved upon your current implementation?
There are several directions I can improve.
First, the web server is a single Node.js server, which can become the bottleneck of the system. We can deploy an Nginx reverse proxy as a load balancer in front of multiple Node.js servers to make them scalable. The same approach can be applied to the backend server and the machine learning serving system. One option is to migrate to gRPC for its native load balancing feature.
Second, there is still a lot of room for improvement in the machine learning part. One direction is to do more research and introduce extra features into the training, such as news text, news source and date. This is an open question.
Third, we can improve latency by introducing caching in the web server and backend server. We can use Memcached or Redis as an in-memory cache to store news details. Caching reduces the number of RPC calls and so lowers latency.

Tap-News is a full-stack system that lets registered users browse personalized news recommendations fetched from multiple news websites. This document covers the implementation of web_server, backend_server, and several services based on an SOA design, from an engineering perspective.

Major Use Cases:

Users can sign up and log in to the system in order to view news.
Users can click on news they are interested in and are redirected to the news source for more details.
Users can scroll down the page to load more news.
Different users see different recommended news based on their click history.
Users can search news by keywords.
Users can click on a tag name, and the page displays only the news with that tag, ordered by date/time.

1. Front-end

The front end is proposed to add three more features to the existing Tap-News project:
● Today’s New:
Provide a “today’s new” section displaying the news added today.
● Display news based on topics:
When the user clicks a tag, we display all the news with this tag at the top of the page. New users (registered and using the news page for the first time) can click on the topics they like as a cold start. The top two news items in each topic would be: (1) the newest by date/time; if there are several, show a random one. (2) the news item clicked the most in the past 3 days; if there is none, show one of the newest items by date/time; if there are several, show the newer one.
● Search bar:
Provide a “search bar” for users. Lucene or Elasticsearch is proposed for searching news by keywords.
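
The top-two rule for a topic can be sketched roughly as follows. This is only an illustration under assumptions, not the project's implementation: the function name `pick_top_two` and the item fields `id`, `published_at` and `click_times` are hypothetical, and the sketch assumes the second item must differ from the first.

```python
import random
from datetime import datetime, timedelta


def pick_top_two(news_items, now, rng=random):
    """Return up to two items for one topic, following the two rules above.

    Each item is a dict with hypothetical keys:
      "id"           - unique identifier
      "published_at" - datetime of publication
      "click_times"  - list of datetimes, one per click
    """
    if not news_items:
        return []

    # Rule 1: the newest item; ties on the timestamp are broken at random.
    newest_time = max(item["published_at"] for item in news_items)
    newest = rng.choice(
        [item for item in news_items if item["published_at"] == newest_time]
    )

    # Rule 2: among the remaining items, the one clicked most in the past 3 days.
    rest = [item for item in news_items if item["id"] != newest["id"]]
    if not rest:
        return [newest]
    cutoff = now - timedelta(days=3)

    def recent_clicks(item):
        return sum(1 for t in item["click_times"] if t >= cutoff)

    # Sort key: more recent clicks first, then newer publication time.
    # If nothing was clicked, every count is 0 and the newest remaining
    # item wins, which is the fallback described in the rule.
    second = max(rest, key=lambda item: (recent_clicks(item), item["published_at"]))
    return [newest, second]
```

Passing `now` explicitly keeps the function deterministic in tests; a real service would more likely compute the click counts in the database query.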

2. Create a new config_service to configure all the modules and services.

In our project, the default package.json file could be replaced with a config_service. A new folder named config could be created, with one .json file per module/service. In each module, we create a config_client.py file that requests the module's config parameters from config_service by sending an RPC request.
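
A minimal sketch of this idea is shown below, under several assumptions. The names `load_module_config` and `ConfigClient` are hypothetical, and the RPC transport is deliberately left out: `fetch` stands in for whatever RPC call config_service would expose, so only the per-module JSON lookup and the client-side caching are illustrated.

```python
import json
from pathlib import Path


def load_module_config(config_dir, module_name):
    """Server side: read <config_dir>/<module_name>.json and return it as a dict."""
    config_path = Path(config_dir) / f"{module_name}.json"
    with config_path.open(encoding="utf-8") as f:
        return json.load(f)


class ConfigClient:
    """Client side (hypothetical config_client): fetch once, then serve from memory.

    `fetch` is any callable taking a module name and returning a dict.
    In the real system it would wrap the RPC request to config_service.
    """

    def __init__(self, module_name, fetch):
        self._module_name = module_name
        self._fetch = fetch
        self._cache = None

    def get(self, key, default=None):
        # Load lazily so that importing the module never triggers a request.
        if self._cache is None:
            self._cache = self._fetch(self._module_name)
        return self._cache.get(key, default)
```

For local experiments the two halves can be wired together directly, for example `ConfigClient("web_server", lambda name: load_module_config("config", name))`, and the lambda can later be swapped for the RPC call.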

3. Apply machine learning to display personalized news list.

● Mark tags for news based on topics:
Machine learning is used to label the topic of every fetched news item. The model would be built with TensorFlow, using a DNN and NLP techniques. The training results are to be visualized with TensorBoard.
One way to infer user preferences is to use the user’s click behavior. By setting a threshold, we can compare the click frequency and/or the time spent on a specific news item, and then save the news above the threshold as the user’s preference list. Google Analytics could be used here to get the time a user spent on a news item. If the time is larger than the threshold, the preference gains weight.
One thing that needs to be considered is the topic distribution of the news we provide. If our news topics are biased in the first place, we may wrongly predict the user’s preferred topic. Also, the news is going to be re-ranked every 24 hours.
● Store search bar keywords for user preferences:
A MongoDB collection is used to store user information: {userid, keywords_searched[ ], clickedNews_category_rates[ ]}
keywords_searched is a queue that stores the keywords the user searched. News that matches the keywords better would be displayed for this user.
clickedNews_category_rates would be used to store the user’s preferences. A time decay model (Time_decay_model) maintains the probability of each topic based on the user’s clicks.
● “Recommended for you” personalization feature:
Assume the user is logged in. If either keywords_searched[ ] or clickedNews_category_rates[ ] is not empty, we display the top news item for the user based on their search keywords, or the most clicked news in their preferred category or categories. Otherwise, this section is hidden.
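
The text names a time decay model but does not give its update rule, so the following is only one plausible sketch. It assumes a simple exponential decay: on each click the clicked topic moves toward 1 and all other topics shrink, so recent clicks count more than old ones. The function names, the uniform starting distribution and the value `alpha=0.2` are assumptions, not taken from the project.

```python
def initial_rates(topics):
    """Start a new user with a uniform distribution over all topics."""
    return {topic: 1.0 / len(topics) for topic in topics}


def update_topic_rates(rates, clicked_topic, alpha=0.2):
    """Return the per-topic probabilities after one click.

    clicked topic: p = (1 - alpha) * p + alpha
    other topics:  p = (1 - alpha) * p
    If the input values sum to 1, the output values also sum to 1.
    """
    if clicked_topic not in rates:
        raise KeyError(f"unknown topic: {clicked_topic}")
    return {
        topic: (1 - alpha) * p + (alpha if topic == clicked_topic else 0.0)
        for topic, p in rates.items()
    }
```

The resulting dict is the kind of value that could be stored in clickedNews_category_rates. A larger `alpha` makes the profile follow recent clicks more quickly, at the cost of forgetting older interests sooner.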

4. Add complete logging, and use a suitable monitoring system to build a visual monitoring system for key parameters and indicators.

Logging data includes two parts: the logs written to a local file, and the metrics of the running Tap-News system. The local file logs include token, CPU usage and memory usage, while the metrics include user information and server responses, for example IP, login speed, requests/min, loading time, exceptions and debugging. Each step should include a timestamp.
Logging data would be collected through a plugin app (e.g. StatsD or collectd, not decided yet). The data is then fed into Carbon, which later writes it to Whisper for long-term storage. Users work with the Graphite web UI, and the requested graph is constructed and shown with data from Carbon and Whisper. Graphite would display things like RabbitMQ status and operating system load.
Another idea is to use TrakERR (https://trakerr.io/#/; https://github.com/trakerr-com/trakerr-python) to track and visualize logging and key system parameters.
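
As a rough illustration of the StatsD option (which the text leaves undecided), the sketch below formats a metric in the StatsD line format `<name>:<value>|<type>` and sends it over UDP. The metric names are hypothetical, and the host and port are assumptions (8125 is the usual StatsD default); no client library is used.

```python
import socket

# Metric types used here: counter, timer (milliseconds), gauge.
STATSD_TYPES = {"c", "ms", "g"}


def format_statsd_metric(name, value, metric_type):
    """Build one StatsD line, e.g. 'tapnews.login.time:120|ms'."""
    if metric_type not in STATSD_TYPES:
        raise ValueError(f"unsupported metric type: {metric_type}")
    return f"{name}:{value}|{metric_type}"


def send_metric(line, host="127.0.0.1", port=8125):
    """Send one metric line to a StatsD daemon over UDP (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

For example, `send_metric(format_statsd_metric("tapnews.web_server.requests", 1, "c"))` would count one request. Because UDP is connectionless, the call does not fail when no daemon is listening, which keeps monitoring from slowing down the request path.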

5. Continuous deployment pipeline and load balancing.

Jenkins and GitLab would be used to implement continuous deployment. Nginx would be used for load balancing. Unit tests and end-to-end tests would be conducted.
For Nginx, after installing, two folders are created in
By including load_balancer in /nginx/nginx.conf, we can load balance http://executor across the three listed servers. In our project, every server should have an Nginx load balancer.
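
The server list referred to above is not included in this post. To make the behavior concrete, here is a small Python illustration of round-robin selection, which is the default method Nginx uses for an upstream group. It is not Nginx configuration, the class name is hypothetical, and any server addresses passed in are placeholders.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out backend servers in a fixed rotating order."""

    def __init__(self, servers):
        servers = list(servers)
        if not servers:
            raise ValueError("at least one server is required")
        self._servers = cycle(servers)

    def next_server(self):
        """Return the server that should receive the next request."""
        return next(self._servers)
```

With three hypothetical executor instances, successive calls to `next_server()` return the first, second and third server and then start over, so each instance receives about a third of the requests.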

Scalability:

Docker is under consideration for scalability, since it lets us run multiple Docker containers for our project.
