Web Scraping, Day 02


weixin_44372247


Category: Python

Article link: https://blog.csdn.net/weixin_44372247/article/details/89156987


1. Learning BeautifulSoup

Learn BeautifulSoup and use it to extract content from a page.
Use BeautifulSoup to extract the replies in a thread on the DXY (丁香园) forum.
Learning resource: https://beautifulsoup.readthedocs.io/zh_CN/latest/

Extracting DXY forum replies with BeautifulSoup

DXY thread: http://www.dxy.cn/bbs/thread/626626#626626. Press F12 to open the browser developer tools and locate the elements that hold the reply content.
import requests
from bs4 import BeautifulSoup as bs


def main():
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/72.0.3626.121 Safari/537.36"
    }
    url = 'http://www.dxy.cn/bbs/thread/626626'
    response = requests.get(url, headers=headers)
    # Guess the encoding from the body so that response.text decodes correctly
    response.encoding = response.apparent_encoding
    html = bs(response.text, 'lxml')
    getItem(html)


def getItem(html):
    datas = []  # collects (username, reply) pairs
    for data in html.find_all("tbody"):
        try:
            userid = data.find("div", class_="auth").get_text(strip=True)
            print(userid)
            content = data.find("td", class_="postbody").get_text(strip=True)
            print(content)
            datas.append((userid, content))
        except AttributeError:
            # find() returned None, so this <tbody> is not a reply block
            pass
    print(datas)


if __name__ == '__main__':
    main()
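To try the extraction logic without hitting the network, here is a minimal sketch that runs the same `find_all` / `find` / `get_text` calls on an inline fragment. The `SAMPLE_HTML` markup and the `extract_replies` helper are made up for illustration; only the `auth` and `postbody` class names come from the script above, and the real page may be structured differently. It uses the built-in `html.parser` so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that imitates the structure targeted above.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr>
      <td><div class="auth"><a href="#">alice</a></div></td>
      <td class="postbody"> First reply. </td>
    </tr>
  </tbody>
  <tbody>
    <tr><td>A block with no author or reply body.</td></tr>
  </tbody>
  <tbody>
    <tr>
      <td><div class="auth"><a href="#">bob</a></div></td>
      <td class="postbody">Second reply.</td>
    </tr>
  </tbody>
</table>
"""


def extract_replies(markup):
    """Return a list of (username, reply) tuples found in the markup."""
    soup = BeautifulSoup(markup, "html.parser")
    replies = []
    for block in soup.find_all("tbody"):
        author = block.find("div", class_="auth")
        body = block.find("td", class_="postbody")
        # Skip blocks that are not replies instead of catching exceptions.
        if author is None or body is None:
            continue
        replies.append((author.get_text(strip=True), body.get_text(strip=True)))
    return replies


if __name__ == "__main__":
    for user, reply in extract_replies(SAMPLE_HTML):
        print(user, ":", reply)
```

Checking for `None` explicitly does the same job as the `try`/`except` above, but it will not hide unrelated errors.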

2. Learning XPath
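XPath (XML Path Language) is a language for addressing parts of an XML document.
XPath has seven kinds of nodes: element, attribute, text, namespace, processing-instruction, comment, and document (root) nodes. An XML document is treated as a tree of nodes.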
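As a rough illustration of the node types, the sketch below selects five of them with lxml. The XML snippet is invented for this example, and namespace and document nodes are left out to keep it short.

```python
from lxml import etree

# Hypothetical document, written only to show the different node types.
XML = """<thread id="626626">
  <?render mode="compact"?>
  <!-- first reply -->
  <post author="alice">Hello</post>
</thread>"""

root = etree.fromstring(XML)

elements = root.xpath("//post")                 # element nodes
attributes = root.xpath("//post/@author")       # attribute nodes (returned as strings)
texts = root.xpath("//post/text()")             # text nodes (returned as strings)
comments = root.xpath("//comment()")            # comment nodes
pis = root.xpath("//processing-instruction()")  # processing-instruction nodes

print([e.tag for e in elements])
print(attributes)
print(texts)
print([c.text for c in comments])
print([pi.target for pi in pis])
```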

2.1 Extracting DXY forum replies with XPath
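Path expressions in XPath:
The differences between text(), string(), and data() in XPath are:
text() returns only the text nodes that are direct children of the selected element.
string() returns the text of every descendant node of the selected element, concatenated into a single string.
data() is interchangeable with string() in most cases, but frequent use is discouraged because it is reported to hurt XPath performance. Note that data() is an XPath 2.0 function, so it is not available in lxml, which implements XPath 1.0.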
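The difference between text() and string() is easiest to see on an element that contains nested tags. This is a small sketch on a made-up fragment, using lxml:

```python
from lxml import etree

# Hypothetical fragment: a reply whose text is split by a nested <b> tag.
HTML = '<div class="postbody">Hello <b>dear</b> reader</div>'

tree = etree.HTML(HTML)
node = tree.xpath('//div[@class="postbody"]')[0]

# text() selects only the text nodes that are direct children of the div.
direct_text = node.xpath('text()')
print(direct_text)   # ['Hello ', ' reader']

# string(.) concatenates the text of all descendants, including <b>.
full_text = node.xpath('string(.)')
print(full_text)     # Hello dear reader
```

For forum replies that mix plain text with links or bold text, string(.) is usually the more convenient choice because nothing inside the nested tags is dropped.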

Learn XPath and use lxml with XPath to extract content.
Use XPath to extract the replies in the DXY forum thread.
DXY thread: http://www.dxy.cn/bbs/thread/626626#626626

import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36',
    'referer': 'http://www.dxy.cn/bbs/thread/626626#626626',
}

url = 'http://www.dxy.cn/bbs/thread/626626#626626'

response = requests.get(url, headers=headers)
tree = etree.HTML(response.text)
# Usernames are the link text inside each div with class "auth"
user_ids = tree.xpath('//div[@class="auth"]/a/text()')
# Reply bodies are the td elements with class "postbody"
contents = tree.xpath('//td[@class="postbody"]')
for user_id, content in zip(user_ids, contents):
    # string(.) joins all descendant text of the reply cell
    print(user_id + " : " + content.xpath('string(.)').strip())
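One caveat with pairing two separate result lists through zip: if any author block has no link, the username list comes out shorter and later usernames get attached to the wrong replies. A possible alternative, sketched here on a made-up fragment, is to iterate over each reply block and evaluate relative XPath expressions inside it. The `SAMPLE_HTML` markup and the `extract_replies` helper are assumptions for illustration; only the `auth` and `postbody` class names come from the script above.

```python
from lxml import etree

# Hypothetical fragment; the second author has no <a> link on purpose.
SAMPLE_HTML = """
<table>
  <tbody><tr>
    <td><div class="auth"><a href="#">alice</a></div></td>
    <td class="postbody">First reply.</td>
  </tr></tbody>
  <tbody><tr>
    <td><div class="auth">anonymous</div></td>
    <td class="postbody">Reply whose author has no link.</td>
  </tr></tbody>
  <tbody><tr>
    <td><div class="auth"><a href="#">bob</a></div></td>
    <td class="postbody">Third <b>reply</b>.</td>
  </tr></tbody>
</table>
"""


def extract_replies(markup):
    """Return (username, reply) tuples, one per block that has a reply body."""
    tree = etree.HTML(markup)
    replies = []
    # Select only the <tbody> blocks that actually contain a reply cell.
    for block in tree.xpath('//tbody[.//td[@class="postbody"]]'):
        # The leading "." keeps each expression relative to the current block.
        author = block.xpath('string(.//div[@class="auth"])').strip()
        body = block.xpath('string(.//td[@class="postbody"])').strip()
        replies.append((author, body))
    return replies


if __name__ == "__main__":
    for user, reply in extract_replies(SAMPLE_HTML):
        print(user + " : " + reply)
```

Because the author and the body are read from the same block, they stay paired even when one block is missing a piece.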
