Python 3.5 Development, Part 6 - Implementing a Simple Focused Crawler
Open:
http://www.jikexueyuan.com/robots.txt
Contents - what crawlers are asked not to fetch:
User-agent: *
Disallow: /?*
Disallow: /course/.html?
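These rules can be checked programmatically. A minimal sketch using the standard library's urllib.robotparser, feeding it the rules above directly instead of fetching the live robots.txt:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.modified()  # mark the rules as loaded; can_fetch() refuses to answer otherwise
# Parse the rules shown above directly, instead of downloading robots.txt
rp.parse([
    'User-agent: *',
    'Disallow: /?*',
    'Disallow: /course/.html?',
])

# The site root is not disallowed, so a crawler may fetch it
print(rp.can_fetch('*', 'http://www.jikexueyuan.com/'))
```

Note that robotparser's matching is a simple prefix check; it does not interpret the `*` wildcard the way some search engines do.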
Installing Requests:
Windows: pip install requests
Linux: sudo pip install requests
Tips for installing third-party libraries:
- Avoid easy_install: it can install packages but cannot uninstall them
- Prefer installing with pip
- Hit a wall? Try ->
http://www.lfd.uci.edu/~gohlke/pythonlibs/
Fetching page source with Requests:
import requests

# Set a browser User-Agent to get past basic anti-crawler checks
head = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36'}
html = requests.get('https://movie.douban.com/top250', headers=head).content.decode()
print(html)
Introduction to XPath - what is XPath?
- XPath is a language
- XPath finds information in XML documents
- XPath also works on HTML
- XPath navigates by elements and attributes
- XPath can be used to extract information
- XPath is more powerful than regular expressions
- XPath is simpler than regular expressions
Regex vs. XPath:
- Regex: matches by textual features
- XPath: locates by address in the document tree
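The contrast shows on a tiny example: the regex keys on a textual feature of the markup, while the XPath names the element's address in the tree. The sample HTML below is made up for illustration:

```python
import re
import lxml.html

sample = '<div id="url"><a href="http://example.com">Example</a></div>'

# Regex: describe the textual *feature* surrounding the data
links_re = re.findall(r'href="(.*?)"', sample)

# XPath: give the *address* of the data in the element tree
links_xp = lxml.html.fromstring(sample).xpath('//div[@id="url"]/a/@href')

print(links_re)
print(links_xp)
```

Both produce the same link here, but the XPath keeps working if, say, the attribute order changes, while the regex is tied to the exact text layout.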
Introduction to XPath: installing and using it
- Install the lxml library
$ pip install lxml-3.6.0-cp35-cp35m-win32.whl
- Code template
import lxml.html
selector = lxml.html.document_fromstring(page_source)
selector.xpath(some_xpath_expression)
XPath in practice - XPath and HTML structure
- Tree structure
- Expand level by level
- Locate level by level
- Find a uniquely identifiable node
XPath in practice - getting an element's XPath
- Manual analysis
- Generate it with Chrome DevTools
XPath in practice - extracting content with XPath
- // locates the root node of the search (pick a unique, distinguishing feature)
- / descends one level
- Extract text content: /text()
- Extract an attribute: /@xxxx
Code:
import lxml.html
html = '''
<!DOCTYPE html>
<html>
<head lang="en">
    <meta charset="UTF-8">
    <title>Test - basic usage</title>
</head>
<body>
    <div id="content">
        <ul id="useful">
            <li class="thisiswhatIwant">This is the first message</li>
            <li class="thisiswhatIwant">This is the second message</li>
            <li>This is the third message</li>
        </ul>
        <ul id="useless">
            <li>Unwanted info 1</li>
            <li>Unwanted info 2</li>
            <li>Unwanted info 3</li>
        </ul>
        <div id="url">
            <a href="http://jikexueyuan.com">Jikexueyuan</a>
            <a href="http://jikexueyuan.com/course/" title="Jikexueyuan course catalog">Click to open the course catalog</a>
        </div>
    </div>
</body>
</html>
'''
selector = lxml.html.fromstring(html)
# Extract text
content = selector.xpath('//ul[@id="useful"]/li[@class="thisiswhatIwant"]/text()')
# content = selector.xpath('//html/body/div/ul[@id="useful"]/li[@class="thisiswhatIwant"]/text()')
print(content)
for each in content:
    print(each)
# Extract attributes
link = selector.xpath('//div[@id="url"]/a/@href')
for each in link:
    print(each)
title = selector.xpath('//a/@title')
print(title[0])
Special XPath techniques
- Elements whose attribute shares a common prefix
- starts-with(@attribute_name, shared_prefix)
Code (html1 is the lesson's sample page source, not shown here):
selector = lxml.html.fromstring(html1)
content = selector.xpath('//div[starts-with(@id,"test")]/text()')
for each in content:
    print(each)
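Since the html1 page source is not shown above, here is a self-contained version with a made-up html1 sample, so the starts-with() behavior can be run directly:

```python
import lxml.html

# html1 is an assumed sample: several divs whose ids share the "test" prefix
html1 = '''
<div>
    <div id="test-1">need this 1</div>
    <div id="test-2">need this 2</div>
    <div id="useless">not this</div>
</div>
'''

selector = lxml.html.fromstring(html1)
# Match every div whose id attribute starts with "test"
content = selector.xpath('//div[starts-with(@id,"test")]/text()')
for each in content:
    print(each)
```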
- Nested tags
- string(.)
Code:
import lxml.html
html2 = '''
<!DOCTYPE html>
<html>
<head lang="en">
    <meta charset="UTF-8">
    <title></title>
</head>
<body>
    <div id="test3">
        Azure Dragon on my left,
        <span id="tiger">
            White Tiger on my right,
            <ul>Vermilion Bird above,
                <li>Black Tortoise below,</li>
            </ul>
            an old ox in the middle,
        </span>
        and a dragon's head on my chest.
    </div>
</body>
</html>
'''
# string(.) collects the complete text, including text nested inside child tags
selector = lxml.html.fromstring(html2)
data = selector.xpath('//div[@id="test3"]')[0]
print(data)
info = data.xpath('string(.)')
content_2 = ' '.join(info.split())   # collapse newlines and runs of whitespace
print(content_2)
Key takeaways:
- Install third-party Python libraries confidently
- Use requests to fetch the source of most web pages
- Use XPath to extract data from page source
Demo: implementing a Douban Movie Top 250 crawler
# -*- coding: utf8 -*-
import requests
import lxml.html
import csv

doubanUrl = 'https://movie.douban.com/top250?start={}&filter='

def getSource(url):
    head = {'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.108 Safari/537.36'}
    content = requests.get(url, headers=head)
    content.encoding = 'utf-8'
    return content.text   # .text honors the encoding set above; .content would be raw bytes

def getEveryItem(source):
    selector = lxml.html.document_fromstring(source)
    movieItemList = selector.xpath('//div[@class="info"]')
    movieList = []
    for eachMovie in movieItemList:
        movieDict = {}
        title = eachMovie.xpath('div[@class="hd"]/a/span[@class="title"]/text()')
        otherTitle = eachMovie.xpath('div[@class="hd"]/a/span[@class="other"]/text()')
        link = eachMovie.xpath('div[@class="hd"]/a/@href')[0]
        directorAndActor = eachMovie.xpath('div[@class="bd"]/p[@class=""]/text()')
        star = eachMovie.xpath('div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()')[0]
        quote = eachMovie.xpath('div[@class="bd"]/p[@class="quote"]/span/text()')
        # Not every movie has a quote
        quote = quote[0] if quote else ''
        movieDict['title'] = ''.join(title + otherTitle)
        movieDict['url'] = link
        movieDict['directorAndActor'] = ''.join(directorAndActor).replace(' ', '').replace('\r', '').replace('\n', '')
        movieDict['star'] = star
        movieDict['quote'] = quote
        movieList.append(movieDict)
    return movieList

def writeData(movieList):
    with open('doubanMovie_formal.csv', 'w', encoding='UTF-8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['title', 'directorAndActor', 'star', 'quote', 'url'])
        writer.writeheader()
        for each in movieList:
            writer.writerow(each)

if __name__ == '__main__':
    movieList = []
    for i in range(10):
        pageLink = doubanUrl.format(i * 25)
        print(pageLink)
        source = getSource(pageLink)
        movieList += getEveryItem(source)
    # Ratings are strings such as '9.7'; compare them as numbers, not text
    movieList = sorted(movieList, key=lambda k: float(k['star']), reverse=True)
    writeData(movieList)
Tip: in Excel, open the UTF-8 encoded CSV via Data -> From Text so the characters display correctly.
Homework: a Baidu Tieba crawler
Target site: http://tieba.baidu.com/p/3522395718
Target content: reply author, reply text, reply time
Skills involved:
- Fetching pages with Requests
- Extracting content with XPath
- A multithreaded crawler via map
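The map-based multithreading mentioned here usually refers to multiprocessing.dummy's thread pool, whose map has the same shape as the built-in map. A minimal sketch with a stand-in work function; a real crawler would call requests.get and XPath inside it, and the pn page numbers below are illustrative:

```python
from multiprocessing.dummy import Pool  # thread pool, not processes

def crawl(url):
    # Stand-in worker: a real crawler would fetch the page with
    # requests.get(url) and pull fields out with XPath
    return 'done: ' + url

# Tieba paginates with a pn parameter; these values are illustrative
urls = ['http://tieba.baidu.com/p/3522395718?pn={}'.format(i)
        for i in range(1, 5)]

pool = Pool(4)                   # 4 worker threads
results = pool.map(crawl, urls)  # results keep the order of urls
pool.close()
pool.join()
print(results)
```

Because Pool.map blocks until every URL is processed and preserves input order, the download loop parallelizes without any other code changes.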