The holidays are here, so I plan to relearn Python web scraping.
- Reinstall PyCharm
- Build a few simple scrapers
- Work out this week's schedule and plan
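Reinstalling PyCharm
Last night I installed Ubuntu. It felt nice, but I still can't get used to managing dependencies (clearly I'm still too much of a beginner). Worse, I wiped my D: drive while partitioning the disk, so I have to reinstall the tools I need, PyCharm and IDEA, all over again.
(I most likely won't touch Qt for a long while; I want to learn other things.)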
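Building a few simple scrapers to review what I learned before
To be honest, I never really mastered this the first time around; I was a confirmed copy-and-paste (Ctrl+C/Ctrl+V) engineer.
First, a scraper for university rankings: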
import requests
from lxml import etree


def get_html(url):
    '''
    Fetch the page and return its HTML text, or None if the status is not 200.
    '''
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/75.0.3770.142 Safari/537.36'
    }
    # A timeout keeps the script from hanging forever on a stalled connection.
    response = requests.get(url, headers=headers, timeout=10)
    if response.status_code == 200:
        response.encoding = 'utf-8'
        return response.text
    return None


def get_infos(html):
    '''
    Extract the ranking fields and print one line per school.
    '''
    if html is None:
        print("Failed to fetch the page")
        return
    tree = etree.HTML(html)
    data = tree.xpath("//tr[@class='alt']")
    for info in data:
        rank = info.xpath('./td[1]/text()')[0]
        name = info.xpath('./td[2]/div/text()')[0]
        place = info.xpath('./td[3]/text()')[0]
        # Named school_type so the built-in type() is not shadowed.
        school_type = info.xpath('./td[4]/text()')[0]
        score = info.xpath('./td[5]/text()')[0]
        message = ("rank:" + rank + " School:" + name + " City:" + place
                   + " Type:" + school_type + " Score:" + score)
        print(message)


def main():
    '''
    Entry point.
    '''
    print("ShanghaiRanking (软科) 2020")
    url = 'http://www.zuihaodaxue.com/zuihaodaxuepaiming2020.html'
    html = get_html(url)
    get_infos(html)


if __name__ == '__main__':
    main()
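To keep things simple, I'm just putting the code here directly. It has three parts (`get_html`, `get_infos`, and `main`), each with a clear job.
A quick note on etree.
The fields we want to capture are rank, school name, city, type, and score.
`etree.HTML` parses the fetched HTML text into an element tree object.
Calling `xpath` on that tree returns a list of every element that matches the expression, here each `tr` row with class `alt`. Each row is then queried with relative paths such as `./td[1]/text()` to pull out the individual fields.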
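To make the etree step concrete, here is a minimal sketch that runs against a small hand-written table instead of the live site. The markup in `SAMPLE_HTML` and the helper name `extract_rows` are assumptions made up for illustration; the real ranking page may be structured differently.

```python
from lxml import etree

# Hypothetical markup that mimics the shape the scraper expects;
# the live page may differ.
SAMPLE_HTML = """
<table>
  <tr class="alt"><td>1</td><td><div>Example University A</div></td><td>Beijing</td></tr>
  <tr class="alt"><td>2</td><td><div>Example University B</div></td><td>Shanghai</td></tr>
  <tr class="other"><td>x</td><td><div>ignored</div></td><td>ignored</td></tr>
</table>
"""


def extract_rows(html_text):
    """Parse HTML text and return (rank, name, city) for each tr.alt row."""
    tree = etree.HTML(html_text)               # build an element tree from the text
    rows = tree.xpath("//tr[@class='alt']")    # matching rows anywhere in the tree
    results = []
    for row in rows:
        # A leading "./" makes each path relative to the current row.
        rank = row.xpath('./td[1]/text()')[0]
        name = row.xpath('./td[2]/div/text()')[0]
        city = row.xpath('./td[3]/text()')[0]
        results.append((rank, name, city))
    return results


if __name__ == '__main__':
    for item in extract_rows(SAMPLE_HTML):
        print(item)
```

Note that `xpath` always returns a list, which is why each field is taken with `[0]`; if a cell were missing, that index would raise an `IndexError`.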
Next, a scraper for the Douban Books Top 250:
import requests
from lxml import etree


def get_html(url):
    '''
    Fetch the page and return its HTML text, or None if the status is not 200.
    '''
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/75.0.3770.142 Safari/537.36'
    }
    # A timeout keeps the script from hanging forever on a stalled connection.
    response = requests.get(url, headers=headers, timeout=10)
    if response.status_code == 200:
        response.encoding = 'utf-8'
        return response.text
    return None


def get_infos(html):
    '''
    Extract title, publication info, and rating for each book and print them.
    '''
    if html is None:
        print("Failed to fetch the page")
        return
    tree = etree.HTML(html)
    data = tree.xpath("//tr[@class='item']")
    for info in data:
        # The first link in the second cell holds the book title.
        title = info.xpath('./td[2]/div[1]/a/text()')[0]
        des = info.xpath('./td[2]/p[1]/text()')[0]
        score = info.xpath('./td[2]/div[2]/span[2]/text()')[0]
        # One-line quote (disabled):
        # inq = info.xpath('./td[2]/p[2]/span/text()')[0]
        message = title.replace("\n", '').replace(" ", '') + " " + des + " " + score + "\n"
        print(message)


def main():
    '''
    Entry point.
    '''
    print("Douban Books Top 250")
    # 10 pages of 25 books each: start=0, 25, ..., 225
    for i in range(0, 10):
        url = 'https://book.douban.com/top250?start={}'.format(i * 25)
        print(url)
        html = get_html(url)
        get_infos(html)


if __name__ == '__main__':
    main()
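The paging loop relies on the `start` query parameter advancing in steps of 25. Below is a small sketch of just that URL arithmetic, with no network access. The helper name `page_urls` and its parameters are assumptions for illustration; the base URL is the one used in the scraper.

```python
def page_urls(base='https://book.douban.com/top250', total=250, page_size=25):
    """Return one URL per page, using the start offset the scraper relies on."""
    return ['{}?start={}'.format(base, offset)
            for offset in range(0, total, page_size)]


if __name__ == '__main__':
    urls = page_urls()
    print(len(urls))   # number of pages
    print(urls[0])     # first page
    print(urls[-1])    # last page
```

Building the list up front keeps the offset logic in one place, and it makes it easy to add a `time.sleep` pause between requests should you want to slow the crawl down.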
Running the Douban scraper prints output like the following: