The Research and Development Of Educational News Website Based On Big Data Analysis

Outline

This project uses Python to crawl important news from different websites and stores each item's URL, title, time, and summary in a MongoDB database. The MVC architecture has been used to design the web server, and DOM technology has been applied to design the web pages. The website's functions include news display by section, admin login, news search, and so on.
This is my undergraduate "Provincial College Students' Innovation and Entrepreneurship Training Program" project (October 2017 to October 2018), and I have made some improvements afterward.

Project Structure

The project has two parts: EduProject_crawl (for data crawling) and EduProject (for the website). EduProject_crawl contains five important Python modules: html_downloader, html_parser, url_manager, html_outputer, and spider_main. EduProject contains Java classes (for the server's MVC design) and JSP files (for web page design). Let me introduce them one by one.

Demo

All functions of this website have been demonstrated in this Demo Video

Data crawling

  1. Install Eclipse, the Tomcat server, the MongoDB database, and adminMongo.
  2. Run "sudo brew services start mongodb-community@4.2" to start MongoDB.
  3. Run "cd /Users/adminMongo" to enter the directory of the MongoDB GUI tool (adminMongo).
  4. Run "npm start" to start adminMongo.
  5. Run "spider_main.py" to crawl data.

Websites

With the database connected, run "FirstPage.jsp" in EduProject.

Introduction: the website has four main sections: First Page, Educational Information, Integrated Section, and Admin Login.

First Page: this page is a static web page that shows some important real-time news.
[Screenshot: First Page]
As the screenshot below shows, the Educational Information section has six subsections: general, preschool, basic, higher, family, and international.
[Screenshot: Educational Information section]
The Integrated Section has four subsections: mental, entertainment, book, and technology.
[Screenshot: Integrated Section]
When the user clicks the Admin Login section, the login page is shown.

html_downloader

Main Idea: download the web page at a given URL, store it as a string, and transfer it to the parser, which analyzes it.

import urllib.request


class HtmlDownloader(object):
    ip_list = []

    def download(self, url):
        if url is None:
            return None
        # some measures against anti-crawling: send a normal browser User-Agent
        req = urllib.request.Request(url)
        req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36')
        response = urllib.request.urlopen(req)
        if response.getcode() != 200:
            return None
        return response.read()

html_parser

Main Idea: parse the page content fetched for a given URL, pass newly found URLs to the URL manager, and extract the news data. Each subsection needs a different parsing method, so there are ten different html_parser modules:

Take “html_parser_family” as an example:

import re
import urllib.parse

from bs4 import BeautifulSoup


class HtmlParser(object):

    def parse(self, page_url, html_cont):
        if page_url is None or html_cont is None:
            return
        soup = BeautifulSoup(html_cont, 'html.parser', from_encoding='utf-8')
        new_urls = self._get_new_urls(page_url, soup)
        new_data = self._get_new_data(page_url, soup)
        return new_urls, new_data
        
    # get entire URL
    def _get_new_urls(self, page_url, soup):
        new_urls = set()
        links = soup.find_all('a', href=re.compile(r"http://www.jyb.cn/rmtzgjyb/(.*)"))
        for link in links:
            new_url = link['href']          
            new_full_url = urllib.parse.urljoin(page_url, new_url)
            new_urls.add(new_full_url)
        return new_urls

    # get the url,summary,title and time of news
    def _get_new_data(self, page_url, soup):
        res_data = {}
        res_data['url'] = page_url
        title_node = soup.find('div', class_= "xl_title").find("h1")
        res_data['title'] = title_node.get_text()
        
        summary = ""
        summary_node = soup.find('div', class_= "xl_text").findAll("p")
        for a in summary_node:
            s = str(a)
            s = re.sub(r"<a.*>.*</a>","",s)
            summary = summary + s
        
        res_data['summary'] = summary
        
        time = ""
        time_node = soup.find('div', class_= "xl_title").findAll("span")
        for b in time_node:
            r = str(b)
            r = re.sub(r"<span>\D*</span>","",r)
            time = time + r
        res_data['time'] = time
        
        return res_data
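
A quick way to try this parser on a single article page (a sketch, assuming both classes are importable; the URL is the family root URL listed later in switch_collection, and the exact output depends on the live page):

downloader = HtmlDownloader()  # from html_downloader above
parser = HtmlParser()          # the family parser shown here

url = "http://www.jyb.cn/rmtzgjyb/202006/t20200618_337837.html"
html_cont = downloader.download(url)
new_urls, new_data = parser.parse(url, html_cont)

print(new_data['title'], new_data['time'])
print(len(new_urls), "candidate links for the URL manager")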

url_manager

Main Idea: manage new and old URLs: keep a set of URLs waiting to be crawled and a set of URLs already crawled, so each page is fetched only once.

class UrlManager(object):

    def __init__(self):
        self.new_urls = set()
        self.old_urls = set()

    def add_new_url(self, url):
        if url is None:           
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)
        
    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url
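
A small usage sketch showing the deduplication behavior (the URLs are hypothetical):

manager = UrlManager()
manager.add_new_url("http://example.com/a")
manager.add_new_urls(["http://example.com/a", "http://example.com/b"])  # "a" is not added twice

while manager.has_new_url():
    print(manager.get_new_url())  # each URL moves from new_urls to old_urls

manager.add_new_url("http://example.com/a")  # already in old_urls, so ignored
print(manager.has_new_url())  # False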

html_outputer

Main Idea: store each URL in a .txt file and the news in the database. If the .txt file is empty or the URL does not appear in it, store the news with this URL in the database directly; if the URL already exists there, the news has been stored before, so there is no need to store it again.

for data in self.datas:
    urlForCheck = data['url'] + '\n'
    current_url_txt = EduNews.switch_content.switch_txt(t)
    # print("current url txt is", current_url_txt)

    if check_url_exist(urlForCheck, current_url_txt) == 0:
        if data['title'] != '' and data['summary'] != '' and data['time'] != '':
            sheet_table.insert(data)
            # print("data has been inserted to db")
            store_url(data['url'], current_url_txt)
    else:
        print("the url exists, no need to store")

The variable "t" in html_outputer selects the subsection, so the content is written to the matching .html file (used to check whether the crawled data has the correct format) and .txt file (used to check whether a new URL has been seen before, that is, to avoid storing the same news repeatedly); a minimal sketch of the two helper functions follows below.
Note: the mark * can be any one of the ten subsections.
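
The snippet relies on two helper functions, check_url_exist and store_url, whose implementations are not listed in this post. A minimal sketch of what they could look like, assuming the .txt file stores one URL per line (the bodies are my assumption):

def check_url_exist(url_line, txt_path):
    # hypothetical helper: return 1 if url_line (which already ends with '\n',
    # matching the caller's urlForCheck) is recorded in the file, else 0
    try:
        with open(txt_path, 'r', encoding='utf-8') as f:
            return 1 if url_line in f.readlines() else 0
    except FileNotFoundError:
        return 0  # a missing file means no URL has been stored yet


def store_url(url, txt_path):
    # hypothetical helper: append the newly stored URL, one per line
    with open(txt_path, 'a', encoding='utf-8') as f:
        f.write(url + '\n')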

spider_main

Main Idea: start and stop the crawl and monitor its running condition. Note: the root URL must be changed for each collection, and the matching html_parser_(*) module must be selected as well.

import ssl
import time

import EduNews.html_downloader
import EduNews.html_outputer
import EduNews.html_parser_tech
import EduNews.url_manager


class SpiderMain(object):
    # initiate url_manager, html_downloader, html_parser, html_outputer
    def __init__(self):
        self.maxcount = 1000  # set the maximum number of crawling
        self.urls = EduNews.url_manager.UrlManager() # set up the url manager
        self.downloader = EduNews.html_downloader.HtmlDownloader() #set up the url downloader
        self.parser = EduNews.html_parser_tech.HtmlParser() #set up the url parser
        self.outputer = EduNews.html_outputer.HtmlOutputer() #set up the news outputer

    # execute the crawling procedure, crawl about 1000 root_urls, and store them in output.html
    def craw(self, root_url):
        count = 1
        self.urls.add_new_url(root_url)
        while self.urls.has_new_url():
            try :
                new_url = self.urls.get_new_url()              
                print ("ss %d : %s" % (count, new_url))
                html_cont = self.downloader.download(new_url)
                new_urls, new_data = self.parser.parse(new_url, html_cont)
                self.urls.add_new_urls(new_urls)   
                self.outputer.collect_data(new_data)
                 
                if count == self.maxcount:
                    break
                count = count + 1
            except Exception:
                # catch any per-page failure (the original code named ConnectionFailure,
                # which would need "from pymongo.errors import ConnectionFailure")
                print ('craw failed')
        self.outputer.output_html()

def switch_collection(item):# contains different root_urls
        return {
        'general':"http://www.mnw.cn/edu/news/2289080.html",
        'preschool':"http://www.jyb.cn/rmtzgjyb/202006/t20200628_340144.html",
        'basic':"http://www.jyb.cn/rmtzcg/xwy/wzxw/202006/t20200629_340408.html",
        'family':"http://www.jyb.cn/rmtzgjyb/202006/t20200618_337837.html",
        'higher':"https://gaokao.chsi.com.cn/gkxx/zc/ss/202006/20200619/1932050338.html",
        'inter':"http://au.liuxue360.com/plan/03958175.html",
        'book':"https://haoshutj.com/17097.html",
        'enter':"http://www.chinanews.com/yl/2020/06-28/9223584.shtml",
        'mental':"http://xljk.gznc.edu.cn/info/1094/2065.htm",
        'tech':"http://www.ityears.com/tech/201903/31185.html"
        }.get(item,"nothing")
        

if __name__ == "__main__":    
    root_url = switch_collection('tech')
    ssl._create_default_https_context = ssl._create_unverified_context  # skip HTTPS certificate verification for the https root URLs
    obj_spider = SpiderMain()
    obj_spider.craw(root_url)   
    time.sleep(3)

Database

Ten database collections have been stored, one per subsection:
[Screenshot: the ten collections shown in adminMongo]

  1. The ten web pages, implemented by the Java classes in com.edu.*, read their data from the database; this includes controlling the web page length, fetching the details of a news item, and the search function.
  2. The administrative operations (delete, insert, and find), implemented by the Java classes in com.test, perform extra operations on the database.
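
As a quick sanity check, the collections and their document counts can also be listed from Python (a sketch using a recent pymongo; the database name Edu_project follows NewsDao.java below):

from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client['Edu_project']
# print every collection with the number of news documents it holds
for name in sorted(db.list_collection_names()):
    print(name, db[name].count_documents({}))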

Building the Website

Server MVC

The MVC (Model View Controller) architecture has been applied to the web server design; it separates the Model code from the View code.

  1. Model means interaction with the data in the database (it handles the data logic of the code). The Model contains the Service and Dao layers.

Partial code of NewsDao.java

public class NewsDao {
	private  DBCollection coll=null;
	// connect to mongodb service
	private   MongoClient mongoClient = new MongoClient( "localhost" , 27017 );
	// connect to database
	private  DB db=null; 
	private int Get;
   
	public NewsDao(String tableName) {
		 db = mongoClient.getDB( "Edu_project" );
		 coll = db.getCollection(tableName);
	}
	
	public List<News> queryList(int offset,int limit){// get all the news		
		 DBCursor cursor =coll.find().skip(offset).limit(limit);		
		 List<News> list = new ArrayList<News>();		 
			while(cursor.hasNext())
			{   News news = new News();
				DBObject bd = cursor.next();											
				 news.setId(bd.get("_id").toString());
				 news.setUrl(bd.get("url").toString());				 
				 news.setTitle(bd.get("title").toString());				 
				 news.setSummary(bd.get("summary").toString());				 
				 news.setTime(bd.get("time").toString());			 
				 list.add(news);			
			}		 		
		return list;
	}
	
	public String searchFx(String name){// search
		String c;
		BasicDBObject fx = new BasicDBObject();
		fx.put("title", getLikeStr(name));
		Get = coll.find(fx).count();
		c = String.valueOf(Get);
		return c;
	}
}
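
searchFx above calls getLikeStr, which is not listed. A plausible sketch, assuming it builds a "title contains keyword" regular expression (the legacy Mongo driver translates a java.util.regex.Pattern value into a $regex query; the body is my assumption):

import java.util.regex.Pattern;

// hypothetical helper inside NewsDao; the real implementation is not shown
private Pattern getLikeStr(String name) {
    // match any title that contains the keyword
    return Pattern.compile(".*" + Pattern.quote(name) + ".*");
}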

Partial code of NewsService.java

public class newsService {

	private static final int eachPageNumber=8;// news number of each page is limited
	private NewsDao newsdao;
	
	public newsService(String tName) {
		newsdao=new NewsDao(tName);				
	}

	public List<News> getEachPageList(int nowPage){ // get the news list of a page according to the page number
		int limit = eachPageNumber;
		int offset = (nowPage-1)*eachPageNumber;
		return newsdao.queryList(offset, limit);
	}
		
	public  String getEachPageList2(String name){// get news list of page according to the news title
		return newsdao.searchFx(name);		
	}	
}
  2. View means the user interface.
  3. Controller means receiving the request from the View and handling it with the methods defined by the Service; the Controller sends data to the Model.

First Page

This page shows some real-time news and is a static web page.
Related files: FirstPage.jsp (the start page; run it with the Tomcat server), the Firstpage_Allphoto folder, the FirsPageNews folder, and the FirstpPageImg folder.

Integrated Section

[Screenshot: Integrated Section pages]

WebServlet

| Web page name | related .jsp | @WebServlet(*) to get this page | @WebServlet(*) to get the detail of this page | collection name |
| --- | --- | --- | --- | --- |
| mental | alltypes.jsp | "/getPageMental_alltypes" | "/getNewsMental_alltypes" | mental |
| entertainment | enter.jsp | "/getPageEnter" | "/getNewsEnter" | enter |
| book | book.jsp | "/getPageBook" | "/getNewsBook" | book |
| technology | tech.jsp | "/getPageTech" | "/getNewsTech" | tech |

Note: HOW IT WORKS will be introduced in the next part, since all sections are visited in the same way.

Educational Information Section

[Screenshot: Educational Information section pages]

WebServlet

| Web page name | related .jsp | @WebServlet(*) to get this page | @WebServlet(*) to get the detail of this page | collection name |
| --- | --- | --- | --- | --- |
| general education | general.jsp | "/getPageGeneral" | "/getNewsGeneral" | general |
| preschool education | preschool.jsp | "/getPagePreschool" | "/getNewsPreschool" | preschool |
| higher education | higher.jsp | "/getPageHigher" | "/getNewsHigher" | higher |
| family education | family.jsp | "/getPageFamily" | "/getNewsFamily" | family |
| international education | inter.jsp | "/getPageInter" | "/getNewsInter" | inter |

HOW TO VISIT WEB PAGE:
Take “Family educational news” web page as an example:

  1. When the user clicks the "family education" subsection of "Educational Information", the browser sends an HTTP request according to:
    <li><a href="http://localhost:8080/EduProject/getPageFamily">家庭教育</a></li>

  2. Two functions have been defined in “family.jsp”:

<script>
    
    var pageNow=${pageNow}
   
    function doChangePage(change){
    	
    	if(change==0){
    		pageNow=pageNow-1
    	}else{
    		pageNow=pageNow+1
    	}
    	
    	$("#pageNow").val(pageNow)
    	$("#changePageForm").submit()
    }
    
    function toNewsPage(id){
    	
    	$("#id").val(id)
    	$("#newsForm").submit()
    	
    }
</script>


<div class=one_all>
<!-- this class can get the content of newslist returned from getPageFamily.java  -->
				<ul>
					<c:forEach items='${newsList}' var='news'>
					<li>
							<div class=Title ><h2 ><a  href="#"  onclick="toNewsPage('${news.id}')">${news.title}</a></h2></div>
							<div class=Info>${news.time}</div>
							<div class=summary>${news.summary}</div>
					</li>
					</c:forEach>				
				</ul>
				</div>
				

<form id="changePageForm" hidden="hidden" action="/EduProject/getPageFamily">
	<input id="pageNow" name="pageNow">
</form>

<form id="newsForm" hidden="hidden" action="/EduProject/getNewsFamily">
	<input id="id" name="id">
</form>

At the end of family.jsp, two forms are defined:
a. When the user clicks "next page", the page submits changePageForm; a getPageFamily instance is created to handle the request, and the value "pageNow" is transferred from the original page to that instance.

b. When the user clicks the title of a news item (that is, he wants to view its details), the page submits newsForm; a getNewsFamily instance is created to handle the request, and the news "id" is transferred from the original page to that instance. Finally, the content of the news is shown in all_details.jsp if the news id can be found in the database.

getPageFamily.java

@WebServlet("/getPageFamily")
public class getPageFamily extends HttpServlet {

	newsService newse = new newsService("family");//connects to database
	
	@Override
	protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
		int pageNow;
		if(request.getParameter("pageNow")==null)
		{
			pageNow=1;
			
		}else{
			pageNow=Integer.parseInt(request.getParameter("pageNow"));
		}
		request.setAttribute("newsList",newse.getEachPageList(pageNow));
		request.setAttribute("pageNow", pageNow);
	
	    request.getRequestDispatcher("/family.jsp").forward(request, response);
        //request for dispatching, the URL of the server will not be changed but the real URL it visits will be changed, and it will not throw the data from original page away  
	}

	@Override
	protected void doPost(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
		// TODO auto-generated method stub
		doGet(request, response);
	}
	private static final long serialVersionUID = 3L;
}

getNewsFamily.java


@WebServlet("/getNewsFamily")
public class getNewsFamily extends HttpServlet {
	newsService newsfa = new newsService("family");
	News news;			
	@Override
	protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
		String id=req.getParameter("id");
		news=newsfa.getNewsById(id);
		req.setAttribute("news", news);	
		req.getRequestDispatcher("/all_details.jsp").forward(req, resp);
		//transfer the content of news to all_details.jsp
	}
	@Override
	protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
		    doGet(req, resp);
	}	
}
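
getNewsById is called above but its implementation was not listed. A plausible sketch, assuming newsService delegates to a NewsDao method that looks the news up by its MongoDB ObjectId (the method bodies are my assumption):

// in NewsDao.java (hypothetical; requires import org.bson.types.ObjectId)
public News queryById(String id) {
    DBObject bd = coll.findOne(new BasicDBObject("_id", new ObjectId(id)));
    if (bd == null) {
        return null; // no news with this id in the collection
    }
    News news = new News();
    news.setId(bd.get("_id").toString());
    news.setUrl(bd.get("url").toString());
    news.setTitle(bd.get("title").toString());
    news.setSummary(bd.get("summary").toString());
    news.setTime(bd.get("time").toString());
    return news;
}

// in newsService.java
public News getNewsById(String id) {
    return newsdao.queryById(id);
}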

doGet vs doPost:
doGet (called when the method of a form is "get"):
GET obtains data from the server (as a response) and returns it to the client. When the URL of a servlet is visited directly, through a search engine result, an HTML link, or a JSP page, the GET method is called. With GET, the submitted data is shown in the URL, which is a security problem: for example, if a login form used GET, the username and password would appear directly in the browser's address bar.

doPost (called when the method of a form is "post"):
POST transfers data from the client to the server in the request body, so it is not shown in the URL, and it is the usual choice when transferring a large amount of data.
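
As a small illustration of the difference (the action URL "/EduProject/demo" is hypothetical, not a servlet from this project):

<!-- method="get": the fields end up in the URL, e.g.
     /EduProject/demo?username=admin&password=admin -->
<form action="/EduProject/demo" method="get">
    <input type="text" name="username">
    <input type="password" name="password">
    <input type="submit" value="GET">
</form>

<!-- method="post": the same fields travel in the request body,
     so they are not visible in the address bar -->
<form action="/EduProject/demo" method="post">
    <input type="text" name="username">
    <input type="password" name="password">
    <input type="submit" value="POST">
</form>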

Partial code of all_details.jsp
When req.getRequestDispatcher("/all_details.jsp").forward(req, resp); in getNewsFamily.java is executed, all_details.jsp shows the content of the news.

<div class=middle_right>		
				<div><h2 style="text-align:center;">${news.title}</h2></div>
				<div><h4 style="text-align:center;">${news.time}</h4></div>
				<p>${news.summary}</p>							
		</div>

Administrator Login

After pressing the login button and logging in successfully, the administrator can manage the database: delete, search, and insert news.

[Screenshot: Admin Login page]

WebServlet

| related .jsp | outcome | responding .jsp |
| --- | --- | --- |
| dologin.jsp | login fails | jump to login_failure.jsp |
| dologin.jsp | login succeeds | jump to login_success.jsp, which leads to delete.jsp (delete the corresponding news by collection name and news title), insert.jsp (insert news by collection name plus the title, summary, and time of the news), and find.jsp (find related news by collection name and news title) |

Partial code of dologin.jsp
<body>
        <%
              request.setCharacterEncoding("UTF-8");
		      String username ="";
		      String password ="";
		      //get user's name and user's password
		      username = request.getParameter("username");
		      password = request.getParameter("password");
		      String url=request.getRequestURI();
		      //if the username is "admin" and the password is "admin", login successfully
		      if(("admin".equals(username)&&"admin".equals(password))&&(url!=null)){
		         session.setAttribute("loginUser", username);
		         //if login information is correct, jump to login_success.jsp
		         request.getRequestDispatcher("login_success.jsp").forward(request, response);
		      }
		      else{
		    	  //redirect to failure page if fail to login
		         response.sendRedirect("http://localhost:8080/EduProject/login_failure.jsp");
		      }
        %>
</body>

Partial code of insert.jsp

<body>
        <%
              request.setCharacterEncoding("UTF-8");
              response.setCharacterEncoding("UTF-8");
              response.setHeader("content-type","text/html;charset=UTF-8");
		      String insertTitleName ="";
		      String insertTimeName ="";
		      String insertSummaryName ="";
		      String insertcollectionName ="";
		      
		    
		      insertTitleName = request.getParameter("insertTitleName");
		      insertTimeName = request.getParameter("insertTimeName");
		      insertSummaryName = request.getParameter("insertSummaryName");
		      insertcollectionName = request.getParameter("insertcollectionName");
		      
		      
		      MongoDao mdDao3 = new MongoDao(insertcollectionName);
		      mdDao3.InsertTest(insertTitleName, insertTimeName, insertSummaryName);
		  %>    

<div class="title"><p>添加成功</p></div>

</body>
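
MongoDao and its InsertTest method belong to com.test and are not listed here. A minimal sketch, mirroring the structure of NewsDao.java above (the method body is my assumption):

public class MongoDao {
    // connect to the local mongodb service, same as NewsDao
    private MongoClient mongoClient = new MongoClient("localhost", 27017);
    private DBCollection coll = null;

    public MongoDao(String tableName) {
        DB db = mongoClient.getDB("Edu_project");
        coll = db.getCollection(tableName);
    }

    // hypothetical body: insert one news document with title, time, and summary
    public void InsertTest(String title, String time, String summary) {
        BasicDBObject doc = new BasicDBObject();
        doc.put("title", title);
        doc.put("time", time);
        doc.put("summary", summary);
        coll.insert(doc);
    }
}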

Search

In the screenshot below, the user inputs "中考" (the high school entrance exam) in the search box, and the page reports that 6 news items have been found.
[Screenshot: search results for "中考"]
WebServlet

| related .jsp | @WebServlet(*) to get this page | @WebServlet(*) to get the detail of this page |
| --- | --- | --- |
| Search.jsp (search succeeded) | "/getPageAllCollSearch" | "/getNewsSearch" |
| search_failure.jsp (search failed) | | |

When the user inputs the keywords he wants to search for and presses the search button, a request is sent:

<form action="/EduProject/getPageAllCollSearch" method="post">           
     <input type="text" name="name" value="" /> 
     <input type="submit" value="搜索" class="login" style="cursor: pointer;"> 
</form>

Then a getPageAllCollSearch instance is created to handle this request; it searches all database collections. If news containing the keywords is found, it is returned to Search.jsp.

Partial code of getPageAllCollSearch.java

@Override
	protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
		init_service_lists();//initialize all the service lists
		int pageNow;
		if(request.getParameter("pageNow")==null)
		{
			pageNow=1;
			
		}else{
			pageNow=Integer.parseInt(request.getParameter("pageNow"));
		}

		if((request.getParameter("name")==null)||request.getParameter("name").equals(""))
		{
			request.getRequestDispatcher("/search_failure.jsp").forward(request, response);// jump to search_failure.jsp if there is no name input
		}
		else
		{
			String name=new String(request.getParameter("name").getBytes("iso8859-1"),"utf-8");// re-decode the ISO-8859-1 bytes as UTF-8 to avoid garbled characters

			request.getSession().setAttribute("name1",request.getParameter("name"));
			
			List<News> combinelist = new ArrayList<News>(100);
			
			for (newsService s : coll_list) {//  find in all collection lists
				List<News> n = s.getEachPageList1(pageNow,name);
			    if (n.isEmpty() == false)
			    {
			    	for (News ns : n)
			    	{
			    		if(combinelist.contains(ns)==false) {
			    		   combinelist.add(ns);
			    		}
			    	}
			    }
			}
			
			request.setAttribute("newsList",combinelist);
			count = 0;
			
			for (newsService s : coll_list) {
				String n = s.getEachPageList2(name);
			    if (!n.isEmpty())  // compare strings with equals/isEmpty, not !=
			    {
			    	count = count + Integer.parseInt(n);
			    }
			}
			
			request.setAttribute("newsCount",String.valueOf(count));
			request.setAttribute("newsName",name);
			request.setAttribute("pageNow", pageNow);
		
			
		    request.getRequestDispatcher("/Search.jsp").forward(request, response);
		}
		coll_list.clear();


	}
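
init_service_lists and coll_list are used above without being shown. A plausible sketch, assuming one newsService is created per collection (the ten collection names follow switch_collection in spider_main; the field and method bodies are my assumption):

// hypothetical fields and helper inside getPageAllCollSearch
private List<newsService> coll_list = new ArrayList<newsService>();
private int count;

private void init_service_lists() {
    String[] collections = {"general", "preschool", "basic", "higher", "family",
                            "inter", "book", "enter", "mental", "tech"};
    for (String c : collections) {
        coll_list.add(new newsService(c));
    }
}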

Future Improvements

  1. The data can be stored in a cloud database.
  2. More news can be stored.
  3. The web page design can be further polished.

Conclusion

This project gave me a good chance to exercise my research and team-management abilities. As the project manager, I had to balance all the tasks, distribute them to team members, and hold regular meetings to monitor progress; it was a really good experience. As a team member, I was responsible for part of the crawling code, the admin operations on the database, the search function, and part of the web page design. Taking crawling as an example, I had to come up with a different parsing method for each subsection. I have learned a lot from this project, especially about HTTP, Python crawling, web design, and database operations. Thanks for reading!
