Learning Python: Web Scraping, Part 1

深海里的盐汽水

Published 2021-08-18 17:32:38

Tags: python

Article link: https://blog.csdn.net/qq_45253758/article/details/119784033

Python: Web Scraping, Part 1

A crawler can collect information from web pages. The examples below briefly show how to do it.

Extracting information with regular expressions

(1) Find the URL of the page you want to scrape and fetch its source. This example uses the Lianjia second-hand housing listings page.

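import requests

URL = 'https://cd.lianjia.com/ershoufang/'
# A browser-like User-Agent, so the request does not identify itself as a script.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'
}
# Fetch the page; resp.text holds the decoded HTML source.
resp = requests.get(url=URL, headers=headers)
print(resp.text)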

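The contents of headers can be found in the browser, typically among the request headers listed in the developer tools' Network panel. [Image not available]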
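To see what the headers argument changes, the request can be built without being sent. The sketch below is only an illustration: the User-Agent string is shortened, and nothing here contacts the site. It compares the default identifier that requests would send with the browser-like value supplied by hand. Some sites may treat the two differently, which is why the article sets it.

```python
import requests

URL = 'https://cd.lianjia.com/ershoufang/'
# Shortened User-Agent, for illustration only; copy the full value from your own browser.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

# What requests sends as User-Agent when none is given (python-requests/<version>).
default_ua = requests.utils.default_user_agent()

# Build the request object without sending it, to inspect the outgoing headers.
prepared = requests.Request('GET', URL, headers=headers).prepare()

print(default_ua)
print(prepared.headers['User-Agent'])
```

Passing the same headers dictionary to requests.get, as in step (1), sends exactly the value printed on the second line.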
(2) Optionally, write the page source to a file. (You can skip this and simply replace content in step 3 with resp.text.)

with open('二手房.html', 'w', encoding='utf-8') as file:
    file.write(resp.text)
with open('二手房.html', 'r', encoding='utf-8') as file:
    content = file.read()
    # print(content)

(3) Use a regular expression to match the content you need (here: each listing's title, address, total price, and unit price).

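import re

# Capture groups, in order: title, two address parts, total price, unit price.
pattern = re.compile(r'<a class="" href.*?>(.+?)</a><.*?"region">(.+?)</a>.*?>(.+?)</a>.*?<span>(.+?)</span>.*?<span>(单价\d+元/平米)</span>')
result = pattern.findall(content)
# print(len(result))
for title, region, area, total_price, unit_price in result:
    print(f'{title}; address: {region} - {area}; price: {total_price} wan (10,000 yuan); {unit_price}')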

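Note: a regular expression is only useful if it matches exactly the information you want. A practical approach is to first copy the fragment of page source that contains that information, then convert it into a regular expression step by step.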
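The step-by-step conversion described in the note can be sketched as follows. The HTML fragment is hypothetical and only loosely modeled on a listing page; it is not real Lianjia markup. The idea is to start from the literal text, confirm it matches, and then replace the parts that vary with non-greedy wildcards and capture groups.

```python
import re

# Hypothetical fragment for illustration; not real Lianjia markup.
snippet = (
    '<div class="title"><a class="" href="https://example.com/1.html">Sunny two-bedroom flat</a></div>'
    '<div class="price"><span>120</span></div>'
)

# Step 1: start from the literal text, escaped so that it matches itself exactly.
literal = re.escape('<a class="" href="https://example.com/1.html">Sunny two-bedroom flat</a>')
literal_found = re.search(literal, snippet) is not None

# Step 2: replace the parts that change from listing to listing
# with non-greedy wildcards (.*?) and capture groups (...).
pattern = re.compile(r'<a class="" href=".*?">(.+?)</a>.*?<span>(\d+)</span>')
matches = pattern.findall(snippet)

print(literal_found)
print(matches)
```

Checking the result after each substitution makes it easier to spot which change broke the match.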

Scraping with bs4

1. Import the module
import bs4
bs4 (Beautiful Soup 4) can extract data from HTML or XML.

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = bs4.BeautifulSoup(html, 'lxml')

Note: html holds the page source. 'lxml' names the parser, which requires the lxml package to be installed.
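Once the soup object exists, tags and attributes can be reached directly. The following is a minimal sketch on a shortened copy of the sample document. It assumes the parser 'html.parser', which ships with Python, instead of the 'lxml' parser used above, so that it runs without an extra install.

```python
import bs4

# Shortened copy of the sample document, for illustration.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</body></html>
"""

# Assumption: 'html.parser' is swapped in for 'lxml' to avoid the extra dependency.
soup = bs4.BeautifulSoup(html, 'html.parser')

title_text = soup.title.string   # text inside <title>
first_p_class = soup.p['class']  # class can hold several values, so this is a list
link_href = soup.a['href']       # attributes are read like dictionary keys

print(title_text)
print(first_p_class)
print(link_href)
```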

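2. Selecting tag content
select: selects tags with a CSS selector (id, class, tag, attribute, child, descendant, general sibling, or adjacent sibling selector) and returns the matches as a list.
select_one: takes the same selectors and returns the first element of what select would return, or None if nothing matches.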

p_list = soup.select('body > p')
print(p_list)

p_list_1 = soup.select('body > .title')
print(p_list_1)

p = soup.select_one('body > p')
print(p)
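The selector types listed above can each be tried on a small document. This sketch adapts the sample so that every link has visible text, and it assumes the built-in 'html.parser' rather than 'lxml'. The variable names are illustrative.

```python
import bs4

# Adapted from the sample document; every link has text here.
html = """
<html><body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body></html>
"""
soup = bs4.BeautifulSoup(html, 'html.parser')

by_id = soup.select_one('#link2')           # id selector
by_class = soup.select('.sister')           # class selector
by_attr = soup.select('a[href$="tillie"]')  # attribute selector: href ends with "tillie"
descendants = soup.select('body a')         # descendant selector: any <a> inside <body>
children = soup.select('body > a')          # child selector: only direct children of <body>
adjacent = soup.select_one('#link1 + a')    # adjacent sibling: the <a> right after #link1
siblings = soup.select('#link1 ~ a')        # general siblings: every later <a> sibling
missing = soup.select_one('div')            # no match: select_one returns None

names = [a.get_text() for a in by_class]
print(names)
print(by_id['href'], adjacent['id'], len(siblings), len(children), missing)
```

Because select always returns a list, an empty result is simply an empty list, while select_one needs a None check before its attributes are read.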
