Outline
This project uses Python code to crawl important news from different websites. The news (URL, title, time, and summary) is stored in a MongoDB database. The web server is designed with the MVC architecture, and DOM technology is used for the web pages. The website's functions include displaying different categories of news, admin login, news searching, and so on.
This is my undergraduate "Provincial College Students' Innovation and Entrepreneurship Training Program" project (October 2017 to October 2018), and I have made some improvements since then.
Project Structure
The project has two parts: EduProject_crawl (for data crawling) and EduProject (for the website). EduProject_crawl contains five important Python modules: html_downloader, html_parser, url_manager, html_outputer, and spider_main. EduProject contains Java classes (for the server-side MVC design) and JSP files (for the web page design). Let me introduce them one by one.
Demo
All functions of this website are demonstrated in this Demo Video
Data crawling
- Install Eclipse, the Tomcat server, the MongoDB database, and adminMongo.
- Use the "sudo brew services start mongodb-community@4.2" command to start MongoDB.
- Use "cd /Users/adminMongo" to change into the directory of the MongoDB visualization tool (adminMongo).
- Use "npm start" to start adminMongo.
- Run "spider_main.py" to crawl data (a quick way to verify that the crawled data reached MongoDB is sketched below).
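The following is a quick sanity check, not part of the repository: a minimal pymongo sketch, assuming the database name "Edu_project" and the collection names used elsewhere in this project, that prints how many news items each collection contains.

```python
# quick sanity check of the crawled data (illustrative sketch, not part of the project)
from pymongo import MongoClient

client = MongoClient("localhost", 27017)          # same host/port the project uses
db = client["Edu_project"]                        # database name taken from NewsDao.java below

for name in ["general", "preschool", "basic", "higher", "family",
             "inter", "book", "enter", "mental", "tech"]:
    print(name, db[name].count_documents({}))     # number of stored news items per collection

print(db["tech"].find_one())                      # one stored document: url, title, summary, time
```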
Websites
Keep the database connected and run "FirstPage.jsp" in EduProject.
Introduction: the website has four main sections: **First Page, Educational Information, Integrated Section, and Admin Login**.
First Page: this page is a static web page that involves some important real-time news.
As the picture shows, the Educational Information section has six subsections: general, preschool, basic, higher, family, and international.
The Integrated Section has four subsections: mental, entertainment, book, and technology.
When the user clicks the Admin Login section, the admin login page is shown.
html_parser
Main Idea: parse the web content of a given URL and add newly found URLs to the URL manager. Each subsection needs a different parsing method, so there are ten different html_parser files:
Take “html_parser_family” as an example:
import re
import urllib.parse
from bs4 import BeautifulSoup


class HtmlParser(object):

    def parse(self, page_url, html_cont):
        if page_url is None or html_cont is None:
            return
        soup = BeautifulSoup(html_cont, 'html.parser', from_encoding='utf-8')
        new_urls = self._get_new_urls(page_url, soup)
        new_data = self._get_new_data(page_url, soup)
        return new_urls, new_data

    # collect the full URLs of further news pages
    def _get_new_urls(self, page_url, soup):
        new_urls = set()
        links = soup.find_all('a', href=re.compile(r"http://www.jyb.cn/rmtzgjyb/(.*)"))
        for link in links:
            new_url = link['href']
            new_full_url = urllib.parse.urljoin(page_url, new_url)
            new_urls.add(new_full_url)
        return new_urls

    # get the url, summary, title and time of the news
    def _get_new_data(self, page_url, soup):
        res_data = {}
        res_data['url'] = page_url
        title_node = soup.find('div', class_="xl_title").find("h1")
        res_data['title'] = title_node.get_text()
        summary = ""
        summary_node = soup.find('div', class_="xl_text").findAll("p")
        for a in summary_node:
            s = str(a)
            s = re.sub(r"<a.*>.*</a>", "", s)   # strip embedded links from the paragraph
            summary = summary + s
        res_data['summary'] = summary
        time = ""
        time_node = soup.find('div', class_="xl_title").findAll("span")
        for b in time_node:
            r = str(b)
            r = re.sub(r"<span>\D*</span>", "", r)   # drop spans without digits, keeping the date
            time = time + r
        res_data['time'] = time
        return res_data
url_manager
Main Idea: manage the new (not yet crawled) and old (already crawled) URLs.
class UrlManager(object):

    def __init__(self):
        self.new_urls = set()   # URLs waiting to be crawled
        self.old_urls = set()   # URLs that have already been crawled

    def add_new_url(self, url):
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url
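html_downloader
Main Idea: download the raw HTML content of a given URL; spider_main calls it through self.downloader.download(new_url). The original file is not reproduced in this write-up, so the following is only a minimal sketch assuming a plain urllib fetch; the real implementation may differ.

```python
# html_downloader.py -- minimal sketch (assumed implementation, the original file may differ)
import urllib.request


class HtmlDownloader(object):

    def download(self, url):
        if url is None:
            return None
        response = urllib.request.urlopen(url)   # fetch the page
        if response.getcode() != 200:            # anything but 200 OK is treated as a failure
            return None
        return response.read()                   # raw bytes, later decoded by BeautifulSoup
```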
html_outputer
Main Idea: store the crawled news in the database and record its URL in a .txt file. If the .txt file is empty or the URL has not appeared before, the news with this URL is inserted into the database directly; if the URL already exists, the relevant data is not stored again. The core loop looks like this:
for data in self.datas:
    urlForCheck = data['url'] + '\n'
    current_url_txt = EduNews.switch_content.switch_txt(t)   # pick the .txt file for this subsection
    # print("current url txt is", current_url_txt)
    if check_url_exist(urlForCheck, current_url_txt) == 0:   # URL not seen before
        if data['title'] != '' and data['summary'] != '' and data['time'] != '':
            sheet_table.insert(data)                          # store the news in the database
            # print("data has been inserted to db")
            store_url(data['url'], current_url_txt)           # remember the URL in the .txt file
    else:
        print("the url exists, no need to store")
The variable "t" in html_outputer should be changed in order to write the content to a different .html file (used to check whether the crawled data has the correct format) and a different .txt file (used to check whether a URL has been seen before, i.e., to avoid storing the same news repeatedly).
Note: the mark * stands for one of the ten subsections.
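The helpers check_url_exist and store_url used above are not shown in full; a minimal sketch of what they could look like, assuming the .txt file simply keeps one already-stored URL per line (the real html_outputer code may differ):

```python
# assumed behaviour of the two helpers used above -- a sketch, not the original code
def check_url_exist(url_line, txt_path):
    """Return 1 if url_line (URL plus trailing newline) is already recorded in the .txt file."""
    try:
        with open(txt_path, "r") as f:
            return 1 if url_line in f.readlines() else 0
    except FileNotFoundError:
        return 0   # no file yet, so the URL cannot have been stored before


def store_url(url, txt_path):
    """Append the URL to the .txt file so the same news is skipped next time."""
    with open(txt_path, "a") as f:
        f.write(url + "\n")
```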
spider_main
Main Idea: start and stop the crawling and monitor its running state. Note: the root URL should be changed according to the target collection, and the matching html_parser_(*) should be used as well.
import ssl
import time

import EduNews.url_manager
import EduNews.html_downloader
import EduNews.html_parser_tech
import EduNews.html_outputer
from pymongo.errors import ConnectionFailure   # raised by pymongo when MongoDB is unreachable


class SpiderMain(object):

    # initialize url_manager, html_downloader, html_parser and html_outputer
    def __init__(self):
        self.maxcount = 1000                                          # maximum number of pages to crawl
        self.urls = EduNews.url_manager.UrlManager()                  # set up the URL manager
        self.downloader = EduNews.html_downloader.HtmlDownloader()    # set up the URL downloader
        self.parser = EduNews.html_parser_tech.HtmlParser()           # set up the URL parser
        self.outputer = EduNews.html_outputer.HtmlOutputer()          # set up the news outputer

    # execute the crawling procedure: crawl about 1000 URLs starting from root_url and store them via the outputer
    def craw(self, root_url):
        count = 1
        self.urls.add_new_url(root_url)
        while self.urls.has_new_url():
            try:
                new_url = self.urls.get_new_url()
                print("craw %d : %s" % (count, new_url))
                html_cont = self.downloader.download(new_url)
                new_urls, new_data = self.parser.parse(new_url, html_cont)
                self.urls.add_new_urls(new_urls)
                self.outputer.collect_data(new_data)
                if count == self.maxcount:
                    break
                count = count + 1
            except ConnectionFailure:
                print('craw failed')
        self.outputer.output_html()


def switch_collection(item):   # map each collection to its root URL
    return {
        'general': "http://www.mnw.cn/edu/news/2289080.html",
        'preschool': "http://www.jyb.cn/rmtzgjyb/202006/t20200628_340144.html",
        'basic': "http://www.jyb.cn/rmtzcg/xwy/wzxw/202006/t20200629_340408.html",
        'family': "http://www.jyb.cn/rmtzgjyb/202006/t20200618_337837.html",
        'higher': "https://gaokao.chsi.com.cn/gkxx/zc/ss/202006/20200619/1932050338.html",
        'inter': "http://au.liuxue360.com/plan/03958175.html",
        'book': "https://haoshutj.com/17097.html",
        'enter': "http://www.chinanews.com/yl/2020/06-28/9223584.shtml",
        'mental': "http://xljk.gznc.edu.cn/info/1094/2065.htm",
        'tech': "http://www.ityears.com/tech/201903/31185.html"
    }.get(item, "nothing")


if __name__ == "__main__":
    root_url = switch_collection('tech')
    ssl._create_default_https_context = ssl._create_unverified_context   # skip HTTPS certificate verification
    obj_spider = SpiderMain()
    obj_spider.craw(root_url)
    time.sleep(3)
Database
Ten database collections have been created, one per subsection.
- The ten web pages, implemented by the Java classes in com.edu.*, read data from the database; this covers controlling the page length, retrieving the details of a news item, and the search function (the shape of a stored news document is sketched after this list).
- The administrative operations (delete, insert, and find), implemented by the Java classes in com.test, perform extra operations on the database.
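Every document written by the crawler carries the fields that these Java classes read back (url, title, summary, time, plus the automatically generated _id). A purely illustrative sketch of one such document, with placeholder values rather than real crawled data:

```python
# shape of one news document in any of the ten collections (placeholder values, for illustration only)
news_document = {
    "_id": "<ObjectId, added automatically by MongoDB>",
    "url": "<page the news item was crawled from>",
    "title": "<news title>",
    "summary": "<concatenated <p> paragraphs of the article>",
    "time": "<publication time parsed from the page>",
}
```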
Website Establishing
Server MVC
The MVC (Model-View-Controller) architecture has been applied to the web server design, which separates the model code from the view code.
- Model: interacts with the data in the database (handles the data logic). The model layer contains the Service and Dao classes.
Partial code of NewsDao.java
public class NewsDao {

    private DBCollection coll = null;
    // connect to the MongoDB service
    private MongoClient mongoClient = new MongoClient("localhost", 27017);
    // connect to the database
    private DB db = null;
    private int Get;

    public NewsDao(String tableName) {
        db = mongoClient.getDB("Edu_project");
        coll = db.getCollection(tableName);
    }

    // get one page of news (offset/limit paging)
    public List<News> queryList(int offset, int limit) {
        DBCursor cursor = coll.find().skip(offset).limit(limit);
        List<News> list = new ArrayList<News>();
        while (cursor.hasNext()) {
            News news = new News();
            DBObject bd = cursor.next();
            news.setId(bd.get("_id").toString());
            news.setUrl(bd.get("url").toString());
            news.setTitle(bd.get("title").toString());
            news.setSummary(bd.get("summary").toString());
            news.setTime(bd.get("time").toString());
            list.add(news);
        }
        return list;
    }

    // count the news whose title matches the search keyword
    public String searchFx(String name) {
        String c;
        BasicDBObject fx = new BasicDBObject();
        fx.put("title", getLikeStr(name));
        Get = coll.find(fx).count();
        c = String.valueOf(Get);
        return c;
    }
}
Partial code of NewsService.java
public class newsService {

    private static final int eachPageNumber = 8;   // number of news items on each page
    private NewsDao newsdao;

    public newsService(String tName) {
        newsdao = new NewsDao(tName);
    }

    // get the news list of a page according to the page number
    public List<News> getEachPageList(int nowPage) {
        int limit = eachPageNumber;
        int offset = (nowPage - 1) * eachPageNumber;
        return newsdao.queryList(offset, limit);
    }

    // get the number of news items whose title matches the keyword
    public String getEachPageList2(String name) {
        return newsdao.searchFx(name);
    }
}
- View: the user interface (the JSP pages).
- Controller: receives requests from the View, handles them with the methods defined in the Service layer, and passes data to the Model.
First Page
This page shows some real-time news and is a static web page.
Related documents: FirstPage.jsp (the start page; run it with the Tomcat server), and the Firstpage_Allphoto, FirsPageNews, and FirstpPageImg folders.
Integrated Section
WebServlet
Web page name | related.jsp | @WebServlet(*) to get this page | @WebServlet(*) to get the detail of this page | collection name |
---|---|---|---|---|
mental | alltypes.jsp | “/getPageMental_alltypes” | “/getNewsMental_alltypes” | mental |
entertainment | enter.jsp | “/getPageEnter” | “/getNewsEnter” | enter |
book | book.jsp | “/getPageBook” | “/getNewsBook” | book |
technology | tech.jsp | “/getPageTech” | “/getNewsTech” | tech |
Note: HOW IT WORKS will be explained in the next part, since all sections are visited in the same way.
Educational Information Section
WebServlet
Web page name | related.jsp doc | @WebServlet(*) to get this page | @WebServlet(*) to get the detail of this page | collection name |
---|---|---|---|---|
general education | general.jsp | “/getPageGeneral” | “/getNewsGeneral” | general |
preschool education | preschool.jsp | “/getPagePreschool” | “/getNewsPreschool” | preschool |
higher education | higher.jsp | “/getPageHigher” | “/getNewsHigher” | higher |
family education | family.jsp | “/getPageFamily” | “/getNewsFamily” | family |
International education | inter.jsp | “/getPageInter” | “/getNewsInter” | inter |
HOW TO VISIT WEB PAGE:
Take “Family educational news” web page as an example:
- When the user clicks the "family education" subsection of "Educational Information", the system sends an HTTP request via:
<li><a href="http://localhost:8080/EduProject/getPageFamily">家庭教育</a></li>
- Two functions are defined in "family.jsp":
<script>
    var pageNow = ${pageNow}
    // go to the previous (change == 0) or next page
    function doChangePage(change){
        if(change == 0){
            pageNow = pageNow - 1
        }else{
            pageNow = pageNow + 1
        }
        $("#pageNow").val(pageNow)
        $("#changePageForm").submit()
    }
    // open the detail page of the news with the given id
    function toNewsPage(id){
        $("#id").val(id)
        $("#newsForm").submit()
    }
</script>

<div class=one_all>
    <!-- this block renders the newsList returned from getPageFamily.java -->
    <ul>
        <c:forEach items='${newsList}' var='news'>
            <li>
                <div class=Title><h2><a href="#" onclick="toNewsPage('${news.id}')">${news.title}</a></h2></div>
                <div class=Info>${news.time}</div>
                <div class=summary>${news.summary}</div>
            </li>
        </c:forEach>
    </ul>
</div>

<form id="changePageForm" hidden="hidden" action="/EduProject/getPageFamily">
    <input id="pageNow" name="pageNow">
</form>
<form id="newsForm" hidden="hidden" action="/EduProject/getNewsFamily">
    <input id="id" name="id">
</form>
At the end of family.jsp, two hidden forms are defined (shown above):
a. When the user clicks "next page", the page sends a request; a getPageFamily servlet instance handles the request, and the value "pageNow" is passed from the original page to it.
b. When the user clicks the title of a news item (i.e., wants to view its details), the page sends a request; a getNewsFamily servlet instance handles it, the news "id" is passed from the original page, and finally the content of the news is shown in all_details.jsp if the news ID can be found in the database.
getPageFamily.java
@WebServlet("/getPageFamily")
public class getPageFamily extends HttpServlet {
newsService newse = new newsService("family");//connects to database
@Override
protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
int pageNow;
if(request.getParameter("pageNow")==null)
{
pageNow=1;
}else{
pageNow=Integer.parseInt(request.getParameter("pageNow"));
}
request.setAttribute("newsList",newse.getEachPageList(pageNow));
request.setAttribute("pageNow", pageNow);
request.getRequestDispatcher("/family.jsp").forward(request, response);
//request for dispatching, the URL of the server will not be changed but the real URL it visits will be changed, and it will not throw the data from original page away
}
@Override
protected void doPost(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
// TODO 自动生成的构造函数存根
doGet(request, response);
}
private static final long serialVersionUID = 3L;
}
getNewsFamily.java
@WebServlet("/getNewsFamily")
public class getNewsFamily extends HttpServlet {
newsService newsfa = new newsService("family");
News news;
@Override
protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
String id=req.getParameter("id");
news=newsfa.getNewsById(id);
req.setAttribute("news", news);
req.getRequestDispatcher("/all_details.jsp").forward(req, resp);
//transfer the content of news to all_details.jsp
}
@Override
protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
doGet(req, resp);
}
}
doGet vs doPost:
doGet (called when the method of a form is "get"): GET is used to request data from the server and return it to the client. When the servlet URL is opened directly, e.g. typed into the browser or linked from an HTML or JSP page, the GET method is called. With GET, the request parameters are visible in the URL, which can be a security problem: for example, if a login form used GET, the username and password would appear directly in the URL.
doPost (called when the method of a form is "post"): POST sends the data from the client to the server in the request body, so it is not shown in the URL; it is usually used for transferring larger amounts of data.
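To make the difference concrete, here is a small illustration using Python's requests library (not part of the project; the dologin.jsp URL and the admin credentials are taken from the login code below and used only as placeholders): with GET the parameters become part of the URL, with POST they travel in the request body.

```python
# illustration of GET vs POST (placeholder URL, not part of this project)
import requests

params = {"username": "admin", "password": "admin"}

r_get = requests.get("http://localhost:8080/EduProject/dologin.jsp", params=params)
print(r_get.url)    # parameters are visible in the URL: ...dologin.jsp?username=admin&password=admin

r_post = requests.post("http://localhost:8080/EduProject/dologin.jsp", data=params)
print(r_post.url)   # parameters are sent in the request body, so the URL stays clean
```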
partial code of all_details.jsp
When req.getRequestDispatcher("/all_details.jsp").forward(req, resp); in getNewsFamily.java is executed, all_details.jsp shows the content of the news.
<div class=middle_right>
    <div><h2 style="text-align:center;">${news.title}</h2></div>
    <div><h4 style="text-align:center;">${news.time}</h4></div>
    <p>${news.summary}</p>
</div>
Administrator Login
After logging in, the administrator can manage the database: delete, search, and insert news.
WebServlet
related .jsp | two outcomes | responding .jsp |
---|---|---|
dologin.jsp | if the login fails, jump to login_failure.jsp | |
dologin.jsp | if the login succeeds, jump to login_success.jsp, which then leads to → | delete.jsp (delete the corresponding news in the database according to the collection name and the title of the news) |
 | | insert.jsp (insert news into the database according to the collection name and the title, summary, and time of the news) |
 | | find.jsp (find the related news in the database according to the collection name and the title of the news) |

partial code of dologin.jsp
<body>
    <%
        request.setCharacterEncoding("UTF-8");
        String username = "";
        String password = "";
        // get the user's name and password from the login form
        username = request.getParameter("username");
        password = request.getParameter("password");
        String url = request.getRequestURI();
        // if the username is "admin" and the password is "admin", the login succeeds
        if (("admin".equals(username) && "admin".equals(password)) && (url != null)) {
            session.setAttribute("loginUser", username);
            // login information is correct: jump to login_success.jsp
            request.getRequestDispatcher("login_success.jsp").forward(request, response);
        } else {
            // redirect to the failure page if the login fails
            response.sendRedirect("http://localhost:8080/EduProject/login_failure.jsp");
        }
    %>
</body>
partial code of insert.jsp
<body>
    <%
        request.setCharacterEncoding("UTF-8");
        response.setCharacterEncoding("UTF-8");
        response.setHeader("content-type", "text/html;charset=UTF-8");
        String insertTitleName = "";
        String insertTimeName = "";
        String insertSummaryName = "";
        String insertcollectionName = "";
        insertTitleName = request.getParameter("insertTitleName");
        insertTimeName = request.getParameter("insertTimeName");
        insertSummaryName = request.getParameter("insertSummaryName");
        insertcollectionName = request.getParameter("insertcollectionName");
        MongoDao mdDao3 = new MongoDao(insertcollectionName);
        mdDao3.InsertTest(insertTitleName, insertTimeName, insertSummaryName);
    %>
    <!-- "添加成功" means "inserted successfully" -->
    <div class="title"><p>添加成功</p></div>
</body>
Search
In this picture, the user inputs "中考" (high school entrance exam) in the search box, and the page shows that 6 news items have been found.
WebServlet
related .jsp | @WebServlet(*) to get this page | @WebServlet(*) to get the detail of this page |
---|---|---|
Search.jsp (search succeeded) | "/getPageAllCollSearch" | "/getNewsSearch" |
search_failure.jsp (search failed) | | |
When the user inputs the keywords they want to search for and presses the search button, a request is sent:
<form action="/EduProject/getPageAllCollSearch" method="post">
<input type="text" name="name" value="" />
<input type="submit" value="搜索" class="login" style="cursor: pointer;">
</form>
Then a getPageAllCollSearch servlet instance handles the request and searches all database collections. If news containing the keywords is found, it is returned to Search.jsp.
partial code of getPageAllCollSearch.java
@Override
protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
    init_service_lists();   // initialize one newsService per collection
    int pageNow;
    if (request.getParameter("pageNow") == null) {
        pageNow = 1;
    } else {
        pageNow = Integer.parseInt(request.getParameter("pageNow"));
    }
    if ((request.getParameter("name") == null) || request.getParameter("name").equals("")) {
        // jump to search_failure.jsp if no keyword was entered
        request.getRequestDispatcher("/search_failure.jsp").forward(request, response);
    } else {
        // re-decode the parameter to avoid garbled characters
        String name = new String(request.getParameter("name").getBytes("iso8859-1"), "utf-8");
        request.getSession().setAttribute("name1", request.getParameter("name"));
        List<News> combinelist = new ArrayList<News>(100);
        for (newsService s : coll_list) {   // search every collection
            List<News> n = s.getEachPageList1(pageNow, name);
            if (!n.isEmpty()) {
                for (News ns : n) {
                    if (!combinelist.contains(ns)) {
                        combinelist.add(ns);
                    }
                }
            }
        }
        request.setAttribute("newsList", combinelist);
        count = 0;
        for (newsService s : coll_list) {
            String n = s.getEachPageList2(name);
            if (!n.isEmpty()) {   // compare string content, not object references
                count = count + Integer.parseInt(n);
            }
        }
        request.setAttribute("newsCount", String.valueOf(count));
        request.setAttribute("newsName", name);
        request.setAttribute("pageNow", pageNow);
        request.getRequestDispatcher("/Search.jsp").forward(request, response);
    }
    coll_list.clear();
}
Future Improvements
- The data can be stored in a cloud database.
- More news can be stored.
- The web page design could be more polished.
Conclusion
This project gave me a good chance to exercise my research and team-management skills. As the project manager, I had to balance all the tasks, distribute them to team members, and hold regular meetings to monitor progress. It was a really good experience for me. As a team member, I was responsible for part of the crawling code, the admin operations on the database, the search function, and part of the web page design. Taking crawling as an example, I needed to come up with a different parsing method for each subsection. I have learned a lot from this project, especially about HTTP, Python crawling, web design, and database operation. Thanks for reading!