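Tools: Python 3; libraries: requests, urllib, Beautiful Soup (urllib2 exists only in Python 2; in Python 3 its functionality lives in urllib.request and urllib.error)
Reference: Introduction to Python web crawlers http://python.jobbole.com/81332/
Beautiful Soup 4 official documentation: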
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#problems-after-installation
Python 3 regular expressions:
http://www.runoob.com/python3/python3-reg-expressions.html
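Method 1: regular expressions
Steps:
1. Decide which URL to crawl.
2. Call urlopen(url). It returns a response object whose read() method returns the page content (as bytes in Python 3).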
urlopen(url, data, timeout)
print(response.read())
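Step 2 can be sketched as a small helper. This is only a minimal illustration: the name fetch_html and the default timeout and encoding are assumptions, not part of the original notes.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10, encoding="utf-8"):
    """Return the decoded body of `url` (hypothetical helper name)."""
    # urlopen returns a response object; read() yields the raw bytes.
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode(encoding)


if __name__ == "__main__":
    # A data: URL keeps the demo offline; use a real http(s) URL to crawl.
    print(fetch_html("data:text/plain;charset=utf-8,hello"))
```

Decoding is needed because read() returns bytes, while the regular expressions used later are str patterns.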
3. Inspect the page's HTML source.
4. Use re.compile to build a pattern that matches the tags to extract.
Syntax of re.compile:
re.compile(pattern[, flags])
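Parameters:
- pattern: the regular expression, as a string
- flags: optional matching mode (ignore case, multi-line, and so on). Available flags:
- re.I ignore case
- re.L make \w, \W, \b, \B and case-insensitive matching depend on the current locale (in Python 3, usable only with bytes patterns)
- re.M multi-line mode: ^ and $ also match at the start and end of each line
- re.S make '.' match any character, including a newline (by default '.' does not match a newline)
- re.U make \w, \W, \b, \B, \d, \D, \s, \S follow the Unicode character database (already the default for str patterns in Python 3)
- re.X verbose mode: whitespace in the pattern and comments after '#' are ignored, for readability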
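The flags listed above can be tried on a short string. The sample text and variable names below are made up for illustration.

```python
import re

text = "Dog food\nDOG treats\n"

# re.I: case-insensitive matching.
ignore_case = re.findall(r"dog", text, re.I)      # ['Dog', 'DOG']

# re.M: ^ matches at the start of every line, not only the first.
line_starts = re.findall(r"^\w+", text, re.M)     # ['Dog', 'DOG']

# re.S: '.' also matches the newline between the two lines.
with_dotall = re.findall(r"food.DOG", text, re.S)  # ['food\nDOG']
without_dotall = re.findall(r"food.DOG", text)     # []

# re.X: whitespace and '#' comments inside the pattern are ignored.
price = re.compile(r"""
    (\d+)    # integer part
    \.       # decimal point
    (\d{2})  # two decimal digits
""", re.X)
prices = price.findall("45.00 and 128.50")  # [('45', '00'), ('128', '50')]

print(ignore_case, line_starts, with_dotall, without_dotall, prices)
```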
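Regular expression symbols to note:
\d matches a single digit
\w matches any word character, including the underscore
\W matches any non-word character
\s matches any whitespace character
\S matches any non-whitespace character
* matches the preceding element zero or more times
[] matches any one character from the given set
? matches the preceding subexpression zero or one time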
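A quick way to check these symbols is to run them against a short string. The sample string is invented for this sketch; `+` (one or more), which the scraping pattern later in these notes also uses, appears alongside the symbols from the list.

```python
import re

sample = "sku_1670563212 price: 45.00"

digits = re.findall(r"\d+", sample)      # runs of digits
words = re.findall(r"\w+", sample)       # letters, digits and underscore
non_space = re.findall(r"\S+", sample)   # anything except whitespace
numbers = re.findall(r"[\d.]+", sample)  # a set: digits or a literal dot
optional = re.findall(r"colou?r", "color colour")  # 'u' zero or one time
star = re.findall(r"ab*", "a ab abbb")   # 'b' zero or more times

print(digits)     # ['1670563212', '45', '00']
print(words)      # ['sku_1670563212', 'price', '45', '00']
print(non_space)  # ['sku_1670563212', 'price:', '45.00']
print(numbers)    # ['1670563212', '45.00']
print(optional)   # ['color', 'colour']
print(star)       # ['a', 'ab', 'abbb']
```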
5. Once all tags have been matched, findall returns the results as a list.
findall(string[, pos[, endpos]])
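Parameters:
- string: the string to search.
- pos: optional; index at which the search starts. Defaults to 0.
- endpos: optional; index at which the search stops. Defaults to the length of the string.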
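The pos and endpos parameters exist only on the findall method of a compiled pattern; the module-level re.findall does not accept them. A small sketch with an invented string:

```python
import re

pattern = re.compile(r"\d+")
text = "a1 b22 c333"

all_matches = pattern.findall(text)    # search the whole string
from_pos = pattern.findall(text, 3)    # search from index 3 onward
window = pattern.findall(text, 3, 9)   # search only text[3:9], i.e. "b22 c3"

print(all_matches)  # ['1', '22', '333']
print(from_pos)     # ['22', '333']
print(window)       # ['22', '3']
```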
6. Print the scraped content.
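Method 2: Beautiful Soup
Steps:
The first three steps are the same as in Method 1.
4. Parse the HTML (or XML) page with the bs4 library and collect the matching tags in a list.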
from bs4 import BeautifulSoup
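soup = BeautifulSoup(open("index.html"), "html.parser")
find_all(name, attrs, recursive, string, limit, **kwargs)
find_all searches all descendants of the current tag and returns those that match the filters (the string argument was called text before Beautiful Soup 4.4.0).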
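As a minimal sketch of find_all, the snippet below parses a small hand-written HTML string (not the real JD page) and filters by tag name, CSS class and attribute presence. It assumes the bs4 package is installed.

```python
from bs4 import BeautifulSoup

# Invented markup for illustration; only the class name gl-item comes from the notes.
html = """
<ul>
  <li class="gl-item"><a title="Item A" href="//example.com/a">A</a></li>
  <li class="gl-item"><a title="Item B" href="//example.com/b">B</a></li>
  <li class="other"><a title="Item C" href="//example.com/c">C</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Filter by tag name and CSS class; the result behaves like a list of tags.
items = soup.find_all("li", class_="gl-item")
titles = [li.a["title"] for li in items]

# href=True keeps only <a> tags that have an href attribute.
links = [a["href"] for a in soup.find_all("a", href=True)]

print(titles)  # ['Item A', 'Item B']
print(links)   # ['//example.com/a', '//example.com/b', '//example.com/c']
```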
5. Iterate over the list and write the parsed content to a txt file.
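The last step could look like the sketch below. The helper name save_lines and the output file name are assumptions made for illustration.

```python
def save_lines(lines, path, encoding="utf-8"):
    """Write one item per line to a text file (hypothetical helper)."""
    # An explicit encoding keeps Chinese product names intact on any platform.
    with open(path, "w", encoding=encoding) as f:
        for line in lines:
            f.write(f"{line}\n")


if __name__ == "__main__":
    save_lines(["Item A", "Item B"], "products.txt")
```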
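Ways to implement the crawler
1. Regular expressions
2. BS4
How it works
Implementation:
Page scraped: JD.com search results for dog food (狗粮)
HTML of one scraped item: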
<div class="gl-i-wrap">
<div class="p-img">
<a target="_blank" title="安全才能更安心,京东发货配送,不吃包退,进入主会场" href="https://ccc-x.jd.com/dsp/nc?ext=aHR0cHM6Ly9pdGVtLmpkLmNvbS8xNjcwNTYzMjEyLmh0bWw&log=JcLIlvAF9ia93akZHh0ef2WqVm0ipX_8oufVmdBGSK1VLkzRex16MKxI2be05Okl8JsdqN591ZetY4l-NN-qnVeJJTL6n0k8ub-DRP9m_Z8ub6bhe-3-A3HClmixZp2kVZcBgUa0cSo4yiTWnjZTi5b1LD3MR1tth9OzRagjnAwq23NZ8vh5VK-DdCW1DhVu040ShPoCsNKTtBa_9GjDNPyYvBElcM7Yq85n0T4jU83GAzAtLpJg6yzn2h8NMe-DUBQgU_YOp8L99gWHfzOYNSAHbz1kHRbuHIKKqpsvfLj9K0FinR77Yh-IoEb2Drvp_XQSRAmDDnObwb3sM4f_-eSauLzVg3NWa1o3Bi4V4T6ZjfLxIxW9aZn_UAyXfNStvIqwG1kRiB7H0nlYevJNeEtNOw5WdUB3zNKQeUrzqJQ&v=404" onclick="searchlog(1,1670563212,0,2,'','adwClk=1')">
<img class="err-product" data-img="1" src="//img12.360buyimg.com/n7/jfs/t19021/212/1898737322/240079/6d3de82/5adef9caNedc6fad2.jpg" width="220" height="220">
</a> <div data-lease="" data-catid="7002" data-venid="77122" data-presale="" data-done="1"></div>
</div>
<div class="p-price">
<strong class="J_1670563212" data-done="1"><em>¥</em><i>45.00</i></strong> </div>
<div class="p-name p-name-type-2">
<a target="_blank" title="安全才能更安心,京东发货配送,不吃包退,进入主会场" href="https://ccc-x.jd.com/dsp/nc?ext=aHR0cHM6Ly9pdGVtLmpkLmNvbS8xNjcwNTYzMjEyLmh0bWw&log=JcLIlvAF9ia93akZHh0ef2WqVm0ipX_8oufVmdBGSK1VLkzRex16MKxI2be05Okl8JsdqN591ZetY4l-NN-qnVeJJTL6n0k8ub-DRP9m_Z8ub6bhe-3-A3HClmixZp2kVZcBgUa0cSo4yiTWnjZTi5b1LD3MR1tth9OzRagjnAwq23NZ8vh5VK-DdCW1DhVu040ShPoCsNKTtBa_9GjDNPyYvBElcM7Yq85n0T4jU83GAzAtLpJg6yzn2h8NMe-DUBQgU_YOp8L99gWHfzOYNSAHbz1kHRbuHIKKqpsvfLj9K0FinR77Yh-IoEb2Drvp_XQSRAmDDnObwb3sM4f_-eSauLzVg3NWa1o3Bi4V4T6ZjfLxIxW9aZn_UAyXfNStvIqwG1kRiB7H0nlYevJNeEtNOw5WdUB3zNKQeUrzqJQ&v=404" onclick="searchlog(1,1670563212,0,1,'','adwClk=1')">
<em>好主人<font class="skcolor_ljg">狗粮</font> 全犬种通用型幼犬天然粮2.5kg增强免疫力</em>
<i class="promo-words" id="J_AD_1670563212">安全才能更安心,京东发货配送,不吃包退,进入主会场</i>
</a>
</div>
<div class="p-commit">
<strong><a id="J_comment_1670563212" target="_blank" href="https://ccc-x.jd.com/dsp/nc?ext=aHR0cHM6Ly9pdGVtLmpkLmNvbS8xNjcwNTYzMjEyLmh0bWw&log=JcLIlvAF9ia93akZHh0ef2WqVm0ipX_8oufVmdBGSK1VLkzRex16MKxI2be05Okl8JsdqN591ZetY4l-NN-qnVeJJTL6n0k8ub-DRP9m_Z8ub6bhe-3-A3HClmixZp2kVZcBgUa0cSo4yiTWnjZTi5b1LD3MR1tth9OzRagjnAwq23NZ8vh5VK-DdCW1DhVu040ShPoCsNKTtBa_9GjDNPyYvBElcM7Yq85n0T4jU83GAzAtLpJg6yzn2h8NMe-DUBQgU_YOp8L99gWHfzOYNSAHbz1kHRbuHIKKqpsvfLj9K0FinR77Yh-IoEb2Drvp_XQSRAmDDnObwb3sM4f_-eSauLzVg3NWa1o3Bi4V4T6ZjfLxIxW9aZn_UAyXfNStvIqwG1kRiB7H0nlYevJNeEtNOw5WdUB3zNKQeUrzqJQ&v=404" onclick="searchlog(1,1670563212,0,3,'','adwClk=1')">1.1万+</a>条评价</strong>
</div>
<div class="p-shop" data-selfware="0" data-score="0" data-reputation="99" data-verderid="77122" data-done="1"><span class="J_im_icon"><a target="_blank" onclick="searchlog(1,73630,0,58)" href="//mall.jd.com/index-73630.html" title="好主人旗舰店">好主人旗舰店</a><b class="im-01" title="联系第三方卖家进行咨询" onclick="searchlog(1,73630,0,61)"></b></span></div>
<div class="p-icons" id="J_pro_1670563212" data-done="1">
<i class="goods-icons2 J-picon-tips" data-tips="退换货免运费">险</i></div>
<div class="p-operate">
<a class="p-o-btn contrast J_contrast" data-sku="1670563212" href="javascript:;" onclick="searchlog(1,1670563212,0,6,'','adwClk=1')"><i></i>对比</a>
<a class="p-o-btn focus J_focus" data-sku="1670563212" href="javascript:;" onclick="searchlog(1,1670563212,0,5,'','adwClk=1')"><i></i>关注</a>
<a class="p-o-btn addcart" href="//cart.jd.com/gate.action?pid=1670563212&pcount=1&ptype=1" target="_blank" onclick="searchlog(1,1670563212,0,4,'','adwClk=1')" data-limit="0"><i></i>加入购物车</a>
</div>
<span class="p-promo-flag">广告</span>
<img src="https://im-x.jd.com/dsp/np?log=JcLIlvAF9ia93akZHh0ef2WqVm0ipX_8oufVmdBGSK1VLkzRex16MKxI2be05Okl8JsdqN591ZetY4l-NN-qnVeJJTL6n0k8ub-DRP9m_Z8ub6bhe-3-A3HClmixZp2kVZcBgUa0cSo4yiTWnjZTi5b1LD3MR1tth9OzRagjnAzOzUgR6Nxl3gYgjctL-s7XfFz362_ZPxk2A_1_kbuW9OH-Q7O6V1VLYP4BScpoCY_V4g5QN_7jJ5doJ_ZBs0a86KHGfMIv9iUjxfercjPC-gnkSm9Fu76qyiuIU8fFeeJEE75pxxb7fdOFzVoMGtElYi8bIV8o4I3_7GO-kzvSIJYCzhHgdUmjLvCDfLWWkZh5gbehpEoKIXteodkBtRIL3hJ0iWdX2-Xom88c63ZFjD2rXPmt-WziZiBEhzwg24KvjX9UAx2aY9MDpTSTzuH7&v=404" style="display:none;">
</div>
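To see how a pattern maps onto this markup, here is a rough sketch that pulls the price and the shop name out of an abridged copy of the sample. The fragment keeps only two of the sample's elements, with most attributes dropped. The patterns are illustrative and tied to this exact layout. In this sample the price is the text of the <i> element, not an attribute.

```python
import re

# Abridged from the sample item above; most attributes are omitted.
fragment = '''<div class="p-price">
<strong class="J_1670563212" data-done="1"><em>¥</em><i>45.00</i></strong> </div>
<div class="p-shop" data-selfware="0"><span class="J_im_icon"><a target="_blank" href="//mall.jd.com/index-73630.html" title="好主人旗舰店">好主人旗舰店</a></span></div>'''

# [^>]* skips whatever attributes the tag carries; ([\d.]+) captures the number.
price_match = re.search(
    r'<div class="p-price">\s*<strong[^>]*><em>¥</em><i>([\d.]+)</i>', fragment)

# re.S lets '.' cross line breaks; .*? stops at the first <a> inside p-shop.
shop_match = re.search(
    r'<div class="p-shop"[^>]*>.*?<a[^>]*title="([^"]+)"', fragment, re.S)

price = price_match.group(1)
shop = shop_match.group(1)
print(price)  # 45.00
print(shop)   # 好主人旗舰店
```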
Code for Method 1
import re  # scrape with a regular expression
from urllib.request import urlopen
from urllib.parse import quote

def get_prod(keyword):
    # Target search URL; the keyword is percent-encoded
    url = "https://search.jd.com/Search?keyword=" + quote(keyword) + "&enc=utf-8"
    html = urlopen(url).read().decode('utf-8')
    # Raw string so that \d, \s and \w reach the regex engine unchanged
    regex = re.compile(r'<li data-sku="\d*?" class="gl-item">\s+?<div class="gl-i-wrap">\s+?<div class="p-img">\s+?<a target="_blank" title="([\w\W]+?)" href="([\w\W]+?)" onclick="[\w\W]+?">\s+?<img width="\d*?" height="\d*?" class="err-product" data-img="\d*?"[\s\S]+?src="([\w\W]+?)"[\s\S]+?</a>[\s\S]+?<div data-cid1="\d*?"[\w\W]+?</div>[\s\S]+?<div class="p-price">[\s\S]+?<strong class="[\w\W]+?" data-price="([\d\D]+?)"')
    # findall returns one (title, href, img src, price) tuple per product
    for i in regex.findall(html):
        print("name: ", i[0])
        print("url: ", i[1].split('//')[1])
        print("img: ", i[2].split('//')[1])
        print("price: ", i[3])
        print("-----------------")

if __name__ == '__main__':
    get_prod("狗粮")
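The code above strips the scheme with split('//'). If full links are wanted instead, the protocol-relative URLs in the page (those starting with //) can be resolved against the search page address with urljoin. This is an optional sketch: the helper name absolute_url and the BASE constant are assumptions.

```python
from urllib.parse import urljoin

# Address of the page the links were scraped from (assumed base for resolving).
BASE = "https://search.jd.com/Search"


def absolute_url(link, base=BASE):
    """Resolve a scraped link against the page URL (hypothetical helper)."""
    # A link starting with // inherits the scheme of the base URL.
    return urljoin(base, link)


shop = absolute_url("//mall.jd.com/index-73630.html")
already_absolute = absolute_url("https://ccc-x.jd.com/dsp/nc?v=404")

print(shop)              # https://mall.jd.com/index-73630.html
print(already_absolute)  # https://ccc-x.jd.com/dsp/nc?v=404
```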
Code for Method 2
from urllib.request import urlopen
from urllib.parse import quote
from bs4 import BeautifulSoup

def get_prod(keyword):
    url = "https://search.jd.com/Search?keyword=" + quote(keyword) + "&enc=utf-8"
    html = urlopen(url).read().decode('utf-8')
    # Name the parser explicitly; without it bs4 guesses one and emits a warning
    soup = BeautifulSoup(html, "html.parser")
    # Every product sits in an <li class="gl-item"> element
    li_all = soup.find_all('li', 'gl-item')
    for i in li_all:
        print("title: ", i.a["title"])
        print("url: ", i.a["href"])
        # `"src" in tag` tests the tag's children, not its attributes, so use get()
        img = i.img.get("src") or i.img.get("data-lazy-img")
        print("img: ", img)
        print("price: ", i.strong.get("data-price"))

if __name__ == '__main__':
    get_prod("狗粮")