News Pipeline
- Real Time News Scraping and Recommendation System
- Building Record
- POST Design
- React Frontend UI
- NodeJS Web Server
- Frontend and Backend Http Protocol(RESTful API)
- Backend - SOA (Service Oriented Architecture) Design
- Backend - MongoDB connection
- CloudAMQP: Message Queue
- Pylint (Python Coding Style Check)
- :hammer: Refactor : Create an Operator to Receive all API Request from Backend Server
- News Data Pipeline
- Authentication UI
- Authentication Logic
- Web Server Feature - Pagination
- Web Server Feature - Preference Model
- Web Server Feature - Click Log Processor
- React FrontEnd Build Up
- Decouple into Components
- Create React App
- Build up APP Component
- Build up NewsPanel Component
- Build up NewsCard Component
- Refactor those Components into Web Server file
- Continuous loading News (Server-Side REST API - NodeJS & Client-Side - React)
- Express application generator - NodeJS Server
- Configure App.js
- Server Side Routing
- RESTful API: Send Backend data from Server
- NewsPanel Requests to Backend for Loading More JSON data
- Access Control Allow Origin
- Handle Scrolling
- Debounce
- SOA (Service Oriented Architecture)
- SOA Design Pattern
- Normal Application Design Logic
- RPC Backend Service
- BackEnd Server
- JSONRPClib Libraries
- Testing by Postman
- NodeJS Server as a RPCclient - jayson
- MongoDB
- Mongo Syntax
- Backend Connect to MongoDB - pymongo
- CloudAMQP
- CloudAMQP && Pika
- Heart Beat
- Backend API send Request to CloudAMQPClient API for Asking News in Queue
- Pylint
- Refactor: Operations
- Refactor : Let Utils be used by Both Backend Server and Data pipeline
- News Pipeline
- Authentication
- Authentication Implementation
- Login
- SignUp
- Web
- FrontEnd Auth
- JWT and Salt
- Base Component with Login and SignUp
- React Router in Client
- Server Side Auth
- Server for getting DB (validateLogin/SignUp/Passport)
- bcrypt - Salt and Hash
- Login Passport
- Bind the two LocalStrategy instances to app.js
- Middleware
- Body Parser
- Auth API
- Validator
- WebServer Features
- Pagination
- Backend Server (Web Server doesn't deal with business Logic)
- Operations.py
- Connect with FrontEnd - RPC Client(in web server)
- Refactor the Get News API
- Change Client - NewsPanel.js
- Preference Model
- Click Log Processor - Modify the model by User Clicks
- Log Processor
- Recommendation Service
- news_Classes - JSON dictionary (List) about all topics
- Recommendation Service
- Recommendation service client
- Week 4
Real Time News Scraping and Recommendation System
- Implemented a data pipeline that monitors, scrapes, and dedupes the latest news (MongoDB, Redis, RabbitMQ).
- Designed data monitors that obtain the latest news from well-known websites and recommend it to the web server.
- Built news scrapers that fetch useful data from the original news websites.
- Built dedupers that filter out duplicate news by using NLP (TF-IDF) to analyze the similarity of articles scraped from news websites.
- Used TensorFlow for machine learning to show news according to users' interests.
Built a single-page web application.
Building Record
POST Design
React Frontend UI
- Build up App Component with Materialize Styling
- Build up NewsPanel Component
- Build up NewsCard Component
- Refactor those Components into Web Server file
NodeJS Web Server
- Express application generator - NodeJS Server
- Configure App.js
- Server Side Routing
- RESTful API: Send Backend data from Server (Mock Data)
RESTful API features (By Routing)
Frontend and Backend Http Protocol(RESTful API)
- NewsPanel Requests to Backend for Loading More JSON data
- Access Control Allow Origin
- Handle Scrolling
- Debounce
Backend - SOA (Service Oriented Architecture) Design
- SOA Design Pattern
- RPC Backend Service
- JSONRPClib Libraries
- Testing by Postman
- NodeJS Server as a RPCclient - jayson
Backend - MongoDB connection
CloudAMQP: Message Queue
- CloudAMQP
- CloudAMQP & Pika
- CloudAMQP with Python(doc)
- Heart Beat
- Backend API send Request to CloudAMQPClient API for Asking News in Queue
Pylint (Python Coding Style Check)
🔨 Refactor : Create an Operator to Receive all API Request from Backend Server
- Refactor: Operations
- [CloudAMQP_Client]
- [Mongodb_Client]
- [News_api_Client]
- [News_recommendation_service_Client]
News Data Pipeline
Monitor -> Q(scrape) -> Fetcher -> Q(dedupe)
News Monitor
- News Monitor w/ Redis, RabbitMQ, News API
- Send News to Redis (hashlib)
- Send to RabbitMQ
- Create a Tool to clean the Queue
News Fetcher (Crawler)
- Web Scrapers(has been replaced)
- News Fetcher
- Newspaper 3k replaces XPath in News Fetcher
- Newspaper3k(doc)
- Test Monitor and Fetcher
News Deduper
Authentication UI
Authentication Logic
Frontend - src/Auth
- Check if user owns a token or redirect to login page
- FrontEnd Auth - token base
- Base Component with Login and SignUp
- Send Http Request to Backend to handle login logic
- LoginPage(deal with logic)
- SignUpPage
React Router - With Auth
- isUserAuthenticated()
- React Router in Client
Backend auth
- Hash and salt the password, since it cannot be saved to the database in plain text (the salt also defends against rainbow table attacks)
- Validate the email input
- Deal with the DB connection and password comparison
- Check the token the user owns to authorize the user to load more news
🔨 Auth Refactor
Web Server Feature - Pagination
Web Server Feature - Preference Model
Web Server Feature - Click Log Processor
React FrontEnd Build Up
Decouple into Components
[Image: image/app_structure.png]
Components:
- Base : Whole React App (Navbar + App)
- App : Image(title) + NewsPanel
- NewsPanel : Contains many NewsCards as a news list (while the user is scrolling, the backend sends new NewsCards continuously)
- NewsCard : A single news item added to NewsPanel, with the news image, title, content, description, tag, and link to the news. (User clicks will be recorded in the future)
Create React App
- Deal with webpack and give a whole framework
- Install CRA (Global) : Local Developing Tool
sudo npm install -g create-react-app
- Create a new React App
create-react-app top-news
- Test Connection
cd top-news
npm start
Build up APP Component
- public : images
- src : Each Component has its own folder
App / App.js
App.js
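- There is only one top-level div tag in the render function
- Import React, ‘./App.css’ (the CSS file), and the logo
- Use “className” instead of “class”: class is a reserved word in JavaScript (ES6 uses it to define classes such as App), so JSX uses className for the HTML attribute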
import React from 'react';
import './App.css';
import logo from './logo.png';

class App extends React.Component {
  render() {
    return (
      <div>
        <img className='logo' src={logo} alt='logo' />
        <div className='container'>
          {/* TODO */}
        </div>
      </div>
    );
  }
}

export default App;
- Why use “default App”?
- Without default, importing App from another file requires braces:
import { App } from './App.js';
- With a default export you can drop the {} and write: import App from './App.js';
- CSS setup
.App {
text-align: center;
}
.logo {
display: block;
margin-left: auto;
margin-right: auto;
padding-top: 30px;
width: 20%;
}
Materialize CSS Design
- Install on the client side
npm install materialize-css --save
- Import
import 'materialize-css/dist/css/materialize.min.css';
index.js in Client Side
- Create an index.js as the entry point of the client side
touch src/index.js
- index.js
import React from 'react';
import ReactDOM from 'react-dom';
import App from './App/App';
ReactDOM.render(
<App />,
document.getElementById('root')
);
- Where is root?
- public -> index.html
<div id="root"></div>
Build up NewsPanel Component
Save all NewsCard and connect with BackEnd
- Create NewsPanel folder and NewsPanel.js
mkdir src/NewsPanel
code src/NewsPanel/NewsPanel.js
import React from 'react';
import './NewsPanel.css';
- Since we need to store the news content, we need internal state (which requires a constructor)
class NewsPanel extends React.Component {
  constructor() {
    super();
    this.state = { news: null };
  }
- state = { news: null } -> will later hold a list of JSON objects
- Render conditions: if there is news, create the NewsCards; otherwise show the loading message
render() {
  if (this.state.news) {
    return (
      <div>
        {this.renderNews()}
      </div>
    );
  } else {
    return (
      <div>
        Loading ...
      </div>
    );
  }
}
- Local function renderNews(): renders the news and dynamically builds the NewsCards.
- Clickable - use an A tag in HTML
- Key - in React, each item in a list needs a ‘key’, since the Virtual DOM must know which items changed so it can update only those items instead of re-rendering all of them.
- “list-group-item” elements need to be placed inside a “list-group”, which shows the {news_list}
- Get all news from state.news -> map each news item to a NewsCard -> put the NewsCards into the list-group
renderNews() {
  const news_list = this.state.news.map(news => {
    return (
      <a className='list-group-item' key={news.digest} href='#'>
        <NewsCard news={news} />
      </a>
    );
  });
  return (
    <div className='container-fluid'>
      <div className='list-group'>
        {news_list}
      </div>
    </div>
  );
}
- Local function loadMoreNews(): gets news from the backend on the initial load. (For now it sets mock data.)
loadMoreNews() {
  this.setState({
    news: [
      {
        ....data
      }
    ]
  });
}
- After render() has run for the first time, React executes componentDidMount() -> load news into state
componentDidMount () {
this.loadMoreNews();
}
- Import NewsCard
import NewsCard from '../NewsCard/NewsCard';
- Export NewsPanel
export default NewsPanel;
Add NewsPanel CSS
- Left empty for now, reserved for future use
touch src/NewsPanel/NewsPanel.css
Import NewsPanel into App.js
- App.js
import NewsPanel from '../NewsPanel/NewsPanel';
<div className = 'container'>
<NewsPanel />
</div>
Build up NewsCard Component
- Create NewsCard Component Folder
mkdir src/NewsCard
touch src/NewsCard/NewsCard.js src/NewsCard/NewsCard.css
- class NewsCard (For HTML contents)
class NewsCard extends React.Component {
render() {
return(
HTML....
HTML Structure
- news-container
- row
- col s4 fill
- image
- col s8
- news-intro-col
- news-intro-panel
- news-description
- news-chip
- onClick -> redirectToUrl()
redirectToUrl(url, event) {
event.preventDefault();
window.open(url, '_blank');
}
- Get the data from props.news, which is passed in from NewsPanel.js
<h4>{this.props.news.title}</h4>
- NewsCard can read this data because NewsPanel passes it as a prop:
<a className='list-group-item' key={news.digest} href='#'>
  <NewsCard news={news} />
</a>
- Don't render a chip if there is no source (this.props.news.source != null &&)
{this.props.news.source != null && <div className='chip light-blue news-chip'>{this.props.news.source}</div>}
- CSS file
.news-intro-col {
display: inline-flex;
color: black;
height: 100%;
}
CSS....
Refactor those Components into Web Server file
- Create a web_server folder and move top-news into it, renamed to “client”
mkdir web_server
mv top-news/ ./web_server/client
Continuous loading News (Server-Side REST API - NodeJS & Client-Side - React)
- Once deployed to AWS, there is no separate client and server: the Node server serves the built client.
- Create React App provides a “Development Server” for development, but we won't use it to serve users.
- Development: Node Server + Development Server
- Production: Node Server + build (built by Create React App)
Express application generator - NodeJS Server
- Install Globally
sudo npm install express-generator -g
- Create a Server in web_server
express server // Usage: express [options] [dir]
- Install dependencies
cd server
npm install
npm start
Configure App.js
(The generator installs many requirements by default)
- Delete :
- bodyParser: POST Request
- cookieParser: Authentication
- logger: Login
- users: Login
- Change views engine
- Put the default folder to /client/build
app.set('views', path.join(__dirname, '../client/build'));
- Express static: serves the images and other built assets. Bug found here: the ‘/static’ mount path was missing.
app.use('/static',
express.static(path.join(__dirname, '../client/build/static')));
- Client Webpack: Build a build folder for server to use
- static - css
- static - js
- Error Handler
app.use(function(req, res, next) {
  // Send a body so that requests to unknown routes do not hang
  res.status(404).send('Not Found');
});
- package.json : change start
"scripts": {
"start": "nodemon ./bin/www"
},
Server Side Routing
index.js serves index.html from the build folder
- The initial request hits ‘/’, which is routed to routes/index.js
app.use('/', index);
- index.js : sends index.html from the build folder back to the client
- Get home page!
var express = require('express');
var router = express.Router();
var path = require('path');
router.get('/', function(req, res, next) {
res.sendFile("index.html",
{
root: path.join(__dirname, '../../client/build')});
});
module.exports = router;
- bin -> www : the entry point that initializes the app.
RESTful API: Send Backend data from Server
News Routes
- In routes/news.js
touch server/routes/news.js
- Provide mock data here and send it as JSON
var express = require('express');
var router = express.Router();

router.get('/', function(req, res, next) {
  var news = [
    .....DATA
  ];
  res.json(news);
});

module.exports = router;
- In app.js require the news Route
var news = require('./routes/news');
app.use('/news', news);
NewsPanel Requests to Backend for Loading More JSON data
- NewsPanel.js -> loadMoreNews() now talks to the backend
- cache: 'no-cache' -> revalidate with the server so stale news is not shown from the cache (the Fetch API expects a cache mode string here, not a boolean)
- news_url -> built from window.location.hostname
- ‘http://’ + window.location.hostname + ‘:3000’ + ‘/news’
- method: GET
const news_url = 'http://' + window.location.hostname + ':3000' + '/news';
const request = new Request(news_url, { method: 'GET', cache: 'no-cache' });
- fetch + .then : HTTP request & Promise
- res.json() is asynchronous, so we need another “.then” to consume the parsed JSON
- After we get the JSON, handle the news data
- If no news has been loaded yet, use the new list directly; otherwise “concat” it to the existing one
fetch(request)
.then(res => res.json())
.then(news => {
this.setState({
news: this.state.news ? this.state.news.concat(news) : news,
});
});
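The merge rule inside setState can be pulled out into a small pure function, which makes it easy to check outside React. This is only a sketch: `mergeNews` is a hypothetical helper name that does not appear in the original NewsPanel.js, and the two sample items are made up.

```javascript
// Hypothetical helper (not part of the original NewsPanel.js): merge a freshly
// fetched page of news into the list that is already held in state.
function mergeNews(existing, incoming) {
  // First load: state.news is still null, so the new page becomes the list.
  if (!existing) {
    return incoming;
  }
  // Later loads: append without mutating the previous array.
  return existing.concat(incoming);
}

const firstPage = mergeNews(null, [{ digest: 'a', title: 'First' }]);
const secondPage = mergeNews(firstPage, [{ digest: 'b', title: 'Second' }]);
console.log(secondPage.map(item => item.digest)); // [ 'a', 'b' ]
```

Because concat returns a new array, the previous state is never modified in place, which is what React expects from state updates.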
Access Control Allow Origin
Both client and server run on localhost during development
- They run on different ports (localhost:3000 and localhost:3001), so the browser blocks the cross-origin request by default.
- Temporarily allow access across the different ports
- (BUT THIS NEEDS TO BE REMOVED BEFORE THE FINAL PUBLISH)
- app.js
- app.js
app.all('*', function(req, res, next) {
res.header("Access-Control-Allow-Origin", "*");
res.header("Access-Control-Allow-Headers", "X-Requested-With");
next();
});
Failed to load http://localhost:3000/news: No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'http://localhost:3001' is therefore not allowed access.
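One way to make sure the relaxed headers never reach the final build is to wrap them in a middleware that only acts outside production. The sketch below is an assumption, not the original app.js: `devCors`, its `allowedOrigin` and `env` parameters, and the stand-in response object are all hypothetical, and it is exercised without a running Express server.

```javascript
// Hypothetical middleware factory: adds the CORS headers only when the
// environment is not 'production'.
function devCors(allowedOrigin, env = process.env.NODE_ENV) {
  return function (req, res, next) {
    if (env !== 'production') {
      res.header('Access-Control-Allow-Origin', allowedOrigin);
      res.header('Access-Control-Allow-Headers', 'X-Requested-With');
    }
    next();
  };
}

// Exercise the middleware with stand-in objects instead of a real server.
function runWith(env) {
  const sentHeaders = {};
  const fakeRes = { header: (name, value) => { sentHeaders[name] = value; } };
  let nextCalled = false;
  devCors('http://localhost:3001', env)({}, fakeRes, () => { nextCalled = true; });
  return { sentHeaders, nextCalled };
}

const devResult = runWith('development');
const prodResult = runWith('production');
console.log(devResult.sentHeaders['Access-Control-Allow-Origin']); // http://localhost:3001
console.log(Object.keys(prodResult.sentHeaders).length);           // 0
```

In an Express app this would be mounted with something like app.use(devCors('http://localhost:3001')), naming the client's origin instead of using "*".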
Handle Scrolling
- Keep calling loadMoreNews by combining it with a scroll event listener
- document.body.offsetHeight: the full height of the page
- window.scrollY: the part of the page already scrolled past, above the viewport
- window.innerHeight: the height of the visible viewport, which sits below scrollY
- window.innerHeight + scrollY >= document.body.offsetHeight - 50 means the viewport is within 50px of the bottom boundary -> load more news
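- “this.loadMoreNews()” is only reachable inside handleScroll when the listener is registered through an arrow function, which keeps “this” bound to the component
- First attempt (buggy: the () after handleScroll is missing, so the handler never runs):
window.addEventListener('scroll', () => this.handleScroll);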
- handleScroll()
handleScroll() {
  const scrollY = window.scrollY
    || window.pageYOffset
    || document.documentElement.scrollTop;
  // Within 50px of the bottom of the page: ask for the next batch of news
  if ((window.innerHeight + scrollY) >= (document.body.offsetHeight - 50)) {
    this.loadMoreNews();
  }
}
- Don't forget the (): the listener must call this.handleScroll()
componentDidMount() {
this.loadMoreNews();
window.addEventListener('scroll', () => this.handleScroll());
}
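The bottom-of-page check depends only on three numbers, so it can be tried without a browser. A minimal sketch, assuming a hypothetical helper name `isNearBottom` and made-up page sizes:

```javascript
// Hypothetical helper: true when the bottom edge of the viewport is within
// `threshold` pixels of the end of the page.
function isNearBottom(innerHeight, scrollY, offsetHeight, threshold = 50) {
  return innerHeight + scrollY >= offsetHeight - threshold;
}

// A 3000px page viewed through an 800px viewport.
console.log(isNearBottom(800, 0, 3000));    // false: still at the top
console.log(isNearBottom(800, 2149, 3000)); // false: 2949 < 2950
console.log(isNearBottom(800, 2150, 3000)); // true: 2950 >= 2950
```

Inside handleScroll the three arguments would come from window.innerHeight, the computed scrollY, and document.body.offsetHeight.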
Debounce
- Install Lodash in client
npm install lodash --save
- Solves the problem of scroll events firing too frequently
- Without it, requests are sent to the backend far too often
import _ from 'lodash';
componentDidMount() {
this.loadMoreNews();
this.loadMoreNews = _.debounce(this.loadMoreNews, 1000);
window.addEventListener('scroll', () => this.handleScroll());
}
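To see what the debounce wrapper does, here is a simplified trailing-edge version written by hand. It is an illustration only and not Lodash's implementation (_.debounce has more options, such as leading-edge calls); the 20 ms wait and the counter are made up for the demo.

```javascript
// Simplified stand-in for _.debounce: run fn only after waitMs have passed
// without another call.
function debounce(fn, waitMs) {
  let timer = null;
  return function (...args) {
    clearTimeout(timer); // cancel the call scheduled by the previous event
    timer = setTimeout(() => {
      timer = null;
      fn.apply(this, args);
    }, waitMs);
  };
}

let calls = 0;
const load = debounce(() => { calls += 1; }, 20);

// A burst of three "scroll events" in a row.
load();
load();
load();
const callsRightAfterBurst = calls; // nothing has run yet

// After the quiet period only the last call of the burst has gone through.
setTimeout(() => console.log(calls), 60); // 1
```

This is why a fast scroll near the bottom of the page results in one request to the backend rather than one per scroll event.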
SOA (Service Oriented Architecture)
SOA Design Pattern
- All service interfaces should be designed for both internal and external users
Benefits:
- Isolation: language / technology / tools / decoupling / independence / deployment / maintenance
- Ownership: minimal gray areas and gaps
- Scalability: easy to scale up and modify
Cons:
- Complexity: sometimes unnecessary
- Latency: network communication takes time
- Test effort: all services require E2E tests
- DevOps: on-call!!!
Normal Application Design Logic
- Often built as a three-tier architecture:
[Desktop User]
|
[Presentation Tier] : client interaction via a web browser
|
[Logic Tier] : provides the application's functionality via detailed processing
|
[Storage Tier] : handles persisting and retrieving application data
Unfortunately, things get more complicated: conflicts!!!
- Other types of users
- Attachments
- Bulk operations
- Data pipelines
- Notifications
- Monitoring
- Testing
   Mobile    Desktop     UI
    User      User      Test
        \       |       /
Chrome
Extension  - Presentation -  Prober
                Tier
    File        |        File
   Upload   \   |   /   Download
              Logic
Notifications - Tier -  Command Line Tool
            /   |   \
    CSV      Storage     CSV
   Upload      Tier     Download
            /       \
        Data         Data
      Provider     Consumer
With SOA:
- Front-end Service handles all external interactions
- Back-end implements one protocol to talk to front-end
- All clients see same business abstraction
- Consistent business logic enforcement
- Easy internal refactoring
[Image: image/SOA_structure.png]
RPC Backend Service
Participants: Client, Node Server, Backend Server, Redis, MongoDB, ML Server
Request flow, in order:
- Client -> Node Server: fetch more news (userID / pageNum)
- Node Server -> Backend Server: getNewsSummariesForUser (userID / pageNum)
- Backend Server <-> Redis: check if the news is in Redis
- Backend Server <-> MongoDB: (if not) get news from DB
- Backend Server <-> ML Server: get recommended news from ML server
- Backend Server -> Redis: store combined news in Redis
- Backend Server -> Node Server: sliced news
- Node Server -> Client: sliced news
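The flow above is a cache-aside lookup followed by slicing out one page. The sketch below shows the idea with in-memory stand-ins only: the Map replaces Redis, the array replaces MongoDB plus the ML ranking, and `PAGE_SIZE`, `loadRankedNews`, and 1-based page numbers are assumptions made for illustration. The real backend in this project is written in Python.

```javascript
const PAGE_SIZE = 2;                              // assumed page size
const cache = new Map();                          // stands in for Redis
const database = ['n1', 'n2', 'n3', 'n4', 'n5'];  // stands in for MongoDB
let databaseReads = 0;

// Stand-in for "get news from DB" plus "get recommended news from ML server".
function loadRankedNews(userId) {
  databaseReads += 1;
  return database.slice();
}

function getNewsSummariesForUser(userId, pageNum) {
  // 1. Check whether this user's combined news list is already cached.
  let news = cache.get(userId);
  if (!news) {
    // 2. Cache miss: build the list and store it for the next page request.
    news = loadRankedNews(userId);
    cache.set(userId, news);
  }
  // 3. Return only the requested slice (pageNum assumed to start at 1).
  const start = (pageNum - 1) * PAGE_SIZE;
  return news.slice(start, start + PAGE_SIZE);
}

const page1 = getNewsSummariesForUser('user1', 1);
const page3 = getNewsSummariesForUser('user1', 3);
console.log(page1);         // [ 'n1', 'n2' ]
console.log(page3);         // [ 'n5' ]
console.log(databaseReads); // 1
```

The second request is served from the cache, so the slower database and ML calls happen once per user rather than once per page.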
BackEnd Server
- Create a backend_server folder and a service.py file
mkdir backend_server
touch backend_server/service.py
JSONRPClib Libraries
- Lets us build a client or server that sends or receives RPC requests
- jsonrpclib does not support Python 3.5 well, so we also need jsonrpclib-pelix for development
- Libraries: JSONRPClib, JSONRPClib-pelix
- Install the libraries
pip3 install jsonrpclib
pip3 install jsonrpclib-pelix
RPC Server - Testing
- Define the server host
- Define the server port
- (They are defined as constants at the top so that a future change only needs to touch those lines)
- Provide a function: add
- Register the host, the port, and your functions
from jsonrpclib.SimpleJSONRPCServer import SimpleJSONRPCServer

SERVER_HOST = 'localhost'
SERVER_PORT = 4040

def add(a, b):
    """Test method: add two numbers."""
    print("Add is called with %d and %d " % (a, b))
    return a + b

RPC_SERVER = SimpleJSONRPCServer((SERVER_HOST, SERVER_PORT))
RPC_SERVER.register_function(add, 'add')
print("Starting RPC server")
RPC_SERVER.serve_forever()
Testing by Postman
- Send a REQUEST:
- jsonrpc : protocol version
- id : to identify
- method : add
- params : give a & b
POST Request:
{
"jsonrpc" : "2.0",
"id" : 1,
"method" : "add",
"params" : [1,2]
}
Result:
{
"result": 3,
"id": 1,
"jsonrpc": "2.0"
}
Server log (this capture shows a call made with 13 and 2, not the [1, 2] request above):
Add is called with 13 and 2
127.0.0.1 - - [13/Jan/2018 14:48:25] "POST / HTTP/1.1" 200 -
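What the server does with such a request can be sketched in a few lines: look up the method by name, call it with params, and echo the id back. This is not how jsonrpclib is implemented; `handleRpc` and the `methods` table are hypothetical names, and only the message shapes and the -32601 "Method not found" code come from the JSON-RPC 2.0 specification.

```javascript
// Hypothetical dispatcher for JSON-RPC 2.0 requests with positional params.
function handleRpc(request, methods) {
  const known = Object.prototype.hasOwnProperty.call(methods, request.method);
  if (!known) {
    return {
      jsonrpc: '2.0',
      id: request.id,
      error: { code: -32601, message: 'Method not found' },
    };
  }
  // Spread the params array into the registered function.
  const result = methods[request.method](...request.params);
  return { jsonrpc: '2.0', id: request.id, result };
}

const methods = { add: (a, b) => a + b };

const okResponse = handleRpc(
  { jsonrpc: '2.0', id: 1, method: 'add', params: [1, 2] },
  methods
);
const errorResponse = handleRpc(
  { jsonrpc: '2.0', id: 2, method: 'subtract', params: [1, 2] },
  methods
);

console.log(JSON.stringify(okResponse)); // {"jsonrpc":"2.0","id":1,"result":3}
console.log(errorResponse.error.code);   // -32601
```

The id is what lets a client match each response to the request it sent, which matters once several calls are in flight at the same time.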
NodeJS Server as a RPCclient - jayson
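- Create a new folder in web_server/server/
mkdir web_server/server/rpc_client
- Change news.js so that it no longer hard-codes the data but gets the news from our backend server (backend_server is a placeholder here; the RPC client that provides it is built in the next step)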
var express = require('express');
var router = express.Router();
/* GET News List. */
router.get('/', function(req, res, next) {
var news = backend_server.getNews();
res.json(news);
});
module.exports = router;
Make NodeJS a client - npm jayson
- Install jayson in server
npm install jayson --save
- Create rpc_client.js with a helper method so that news.js can call “getNews()” on our backend server
var jayson = require('jayson');
// create a client
var client = jayson.client.http({
hostname: 'localhost',
port: 4040
});
function add(a, b, callback) {
client.request('add', [a, b], function(err, response) {
if(err) throw err;
console.