Function:
Input: URL of a university-ranking web page
Output: ranking information printed to the screen (rank, university name, total score)
Technical stack: requests-bs4
Focused crawler: crawls only the given URL; does not follow links to other pages
Steps:
1. Fetch the ranking page content from the web
2. Extract the information from the page into a suitable data structure
3. Use that data structure to display and output the results
Program structure:
1. Fetch the ranking page content from the web
   getHTMLText()
2. Extract the information from the page into a suitable data structure
   fillUnivList()
3. Use that data structure to display and output the results
   printUnivList()
Code:
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import bs4

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()                  # raise on HTTP errors (4xx/5xx)
        r.encoding = r.apparent_encoding      # guess encoding from content
        return r.text
    except:
        return ""

def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):   # skip NavigableString children
            tds = tr('td')
            ulist.append([tds[0].string, tds[1].string, tds[2].string])

def printUnivList(ulist, num):
    print("{:^10}\t{:^6}\t{:10}".format("排名", "学校名称", "总分"))
    for i in range(num):
        u = ulist[i]
        print("{:^10}\t{:^6}\t{:10}".format(u[0], u[1], u[2]))

def main():
    uinfo = []
    url = 'http://www.zuihaodaxue.com/zuihaodaxuepaiming2017.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20)  # top 20 universities

main()
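To see what the fillUnivList() parsing step does row by row, the same logic can be exercised on a small hand-written HTML fragment. This is a minimal sketch: the table content below is invented for illustration, and it assumes BeautifulSoup 4 is installed.

```python
import bs4
from bs4 import BeautifulSoup

# A tiny stand-in for the real ranking page (made-up data)
html = """
<table><tbody>
  <tr><td>1</td><td>清华大学</td><td>95.9</td></tr>
  <tr><td>2</td><td>北京大学</td><td>82.6</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.find('tbody').children:
    # .children also yields whitespace NavigableStrings; keep Tags only
    if isinstance(tr, bs4.element.Tag):
        tds = tr('td')  # tag(...) is shorthand for tag.find_all(...)
        rows.append([tds[0].string, tds[1].string, tds[2].string])

print(rows)
```

The isinstance() check is what makes the loop safe: without it, the newline text nodes between the <tr> tags would raise errors when indexed like tags.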
Result:
Cause of the Chinese alignment problem:
When a Chinese field is narrower than the column width, it is padded with Western (ASCII) space characters, which occupy a different display width than Chinese characters, so the columns drift out of alignment.
Fix: pad with the full-width Chinese space character, chr(12288).
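A minimal illustration of the fix, using one of the program's own header strings: the field width is the same in both cases, but only the chr(12288) padding occupies uniform Chinese-character cells on screen.

```python
# chr(12288) is U+3000, the full-width ideographic space: it renders as
# wide as a Chinese character, while the default ASCII space is narrower.
narrow = "{:^10}".format("学校名称")                  # padded with ASCII spaces
wide = "{0:{1}^10}".format("学校名称", chr(12288))    # padded with U+3000

# Both results are 10 characters long, but `wide` aligns with Chinese text
print(repr(narrow))
print(repr(wide))
```

Note the fill character is passed as a nested format argument (`{1}` inside the spec for `{0}`), which is how the optimized printUnivList() template works.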
After optimization:
# CrawUnivRankingA.py
import requests
from bs4 import BeautifulSoup
import bs4

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')
            ulist.append([tds[0].string, tds[1].string, tds[3].string])

def printUnivList(ulist, num):
    # Fill the school-name column with chr(12288), the full-width space,
    # so the Chinese columns stay aligned
    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    print(tplt.format("排名", "学校名称", "总分", chr(12288)))
    for i in range(num):
        u = ulist[i]
        print(tplt.format(u[0], u[1], u[2], chr(12288)))

def main():
    uinfo = []
    url = 'https://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20)  # top 20 universities

main()