Python 3 Web Scraping (Part 1)

Tools: Python 3, the standard-library urllib (urllib.request / urllib.parse), and Beautiful Soup (bs4). Note: urllib2 is Python 2 only; in Python 3 its functionality lives in urllib.request. The third-party requests library is an optional alternative.

Reference: Getting Started with Python Scraping http://python.jobbole.com/81332/

BeautifulSoup4 official documentation:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#problems-after-installation

Python 3 regular expressions:

http://www.runoob.com/python3/python3-reg-expressions.html

Method 1: regular expressions

Steps:

1. Decide which URL to scrape

2. Call urlopen(url), which returns a response object; its read() method returns the page content

urlopen(url, data, timeout)
print(response.read())
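The step above can be sketched end to end. Here example.com is a stand-in URL, and a browser-like User-Agent header is added because some sites (JD included) reject the default Python-urllib client string:

```python
from urllib.request import Request, urlopen
from urllib.error import URLError

# Build a request with an explicit User-Agent header; the default
# "Python-urllib/x.y" string is often blocked by commercial sites.
req = Request("https://example.com", headers={"User-Agent": "Mozilla/5.0"})

try:
    with urlopen(req, timeout=10) as response:
        html = response.read().decode("utf-8")  # read() returns bytes
    print(html[:100])  # first 100 characters of the page source
except URLError as e:
    html = ""  # no network available
    print("request failed:", e)
```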

3. Inspect the page's HTML source

4. Use re.compile to extract the target tags from the page

The syntax of re.compile:

re.compile(pattern[, flags])
Parameters:

  • pattern : a regular expression given as a string
  • flags : optional; matching modes such as ignore-case or multi-line. The options are:
    • re.I ignore case
    • re.L make \w, \W, \b, \B, \s, \S depend on the current locale
    • re.M multi-line mode
    • re.S make '.' match any character including newlines (by default '.' does not match a newline)
    • re.U make \w, \W, \b, \B, \d, \D, \s, \S depend on the Unicode character-property database
    • re.X verbose mode: whitespace is ignored and '#' starts a comment, for readability
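For instance, flags can be combined with the | operator; here re.I makes the tag names case-insensitive and re.S lets the pattern span a newline:

```python
import re

# re.I ignores case; re.S lets '.' also match newline characters.
pattern = re.compile(r"<title>(.*?)</title>", re.I | re.S)

html = "<TITLE>First\nSecond</TITLE>"
match = pattern.search(html)
print(match.group(1))  # "First\nSecond"

# Without re.S, '.' cannot cross the newline, so there is no match.
print(re.compile(r"<title>(.*?)</title>", re.I).search(html))  # None
```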

Common regex symbols:

\d matches a digit character

\w matches a word character (letters, digits, and underscore)

\W matches any non-word character

\s matches any whitespace character

\S matches any non-whitespace character

* matches the preceding character 0 or more times

[] matches any one character from the given set

? matches the preceding sub-expression 0 or 1 times
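A quick demonstration of these symbols on throwaway strings:

```python
import re

print(re.findall(r"\d+", "a1 b22"))             # ['1', '22']        \d: digits
print(re.findall(r"\w+", "dog_food 2.5kg"))     # ['dog_food', '2', '5kg']  \w: word chars incl. underscore
print(re.findall(r"\S+", " a  b "))             # ['a', 'b']          \S: non-whitespace runs
print(re.findall(r"ab*", "a ab abb"))           # ['a', 'ab', 'abb']  *: zero or more of the previous char
print(re.findall(r"[0-9.]+", "price 45.00"))    # ['45.00']           []: character set
print(re.findall(r"colou?r", "color colour"))   # ['color', 'colour'] ?: zero or one of the previous char
```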

5. After all the tags have been matched, findall returns the results as a list

findall(string[, pos[, endpos]])

Parameters:

  • string : the string to match against.
  • pos : optional; index at which to start matching, default 0.
  • endpos : optional; index at which to stop matching, default is the length of the string.
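The pos and endpos parameters are available on a compiled pattern's findall; a small illustration:

```python
import re

digits = re.compile(r"\d+")
s = "12 ab 34 cd 56"

print(digits.findall(s))        # ['12', '34', '56']  whole string
print(digits.findall(s, 3))     # ['34', '56']  start matching at index 3
print(digits.findall(s, 3, 9))  # ['34']  stop before index 9
```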

6. Print the scraped content

Method 2: using BeautifulSoup

Steps:

The first three steps are the same as in Method 1

4. Use the bs4 library to parse the HTML/XML page and collect the results into a list

from bs4 import BeautifulSoup
soup = BeautifulSoup(open("index.html"), "html.parser")  # name the parser explicitly
find_all( name , attrs , recursive , text , **kwargs )
find_all searches all tag children of the current tag and returns the ones that match the given filters
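A small, self-contained sketch of find_all on a toy snippet shaped like the JD markup (class names are illustrative):

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li class="gl-item"><a href="/a" title="Item A">A</a></li>
  <li class="gl-item"><a href="/b" title="Item B">B</a></li>
  <li class="other">skip me</li>
</ul>
"""

# Pass an explicit parser; otherwise bs4 warns and may pick
# different parsers on different machines.
soup = BeautifulSoup(html, "html.parser")

# find_all(name, attrs, ...): here, all <li> tags with class "gl-item".
items = soup.find_all("li", "gl-item")
for li in items:
    print(li.a["title"], li.a["href"])
```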

5. Iterate over the list and write the parsed content to a txt file
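Step 5 can be sketched as follows; `items` stands in for the list produced by findall or find_all:

```python
# Placeholder data standing in for the parsed (title, url) pairs.
items = [("Item A", "/a"), ("Item B", "/b")]

# Write one tab-separated line per item, UTF-8 encoded so that
# Chinese product names survive the round-trip.
with open("result.txt", "w", encoding="utf-8") as f:
    for title, url in items:
        f.write(f"{title}\t{url}\n")
```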

Ways to implement the scraper

1. Regular expressions

2. bs4

Implementation code:

Target page: JD search results for dog food (狗粮)

A sample of the scraped HTML:

<div class="gl-i-wrap">
					<div class="p-img">
						<a target="_blank" title="安全才能更安心,京东发货配送,不吃包退,进入主会场" href="https://ccc-x.jd.com/dsp/nc?ext=aHR0cHM6Ly9pdGVtLmpkLmNvbS8xNjcwNTYzMjEyLmh0bWw&log=JcLIlvAF9ia93akZHh0ef2WqVm0ipX_8oufVmdBGSK1VLkzRex16MKxI2be05Okl8JsdqN591ZetY4l-NN-qnVeJJTL6n0k8ub-DRP9m_Z8ub6bhe-3-A3HClmixZp2kVZcBgUa0cSo4yiTWnjZTi5b1LD3MR1tth9OzRagjnAwq23NZ8vh5VK-DdCW1DhVu040ShPoCsNKTtBa_9GjDNPyYvBElcM7Yq85n0T4jU83GAzAtLpJg6yzn2h8NMe-DUBQgU_YOp8L99gWHfzOYNSAHbz1kHRbuHIKKqpsvfLj9K0FinR77Yh-IoEb2Drvp_XQSRAmDDnObwb3sM4f_-eSauLzVg3NWa1o3Bi4V4T6ZjfLxIxW9aZn_UAyXfNStvIqwG1kRiB7H0nlYevJNeEtNOw5WdUB3zNKQeUrzqJQ&v=404" onclick="searchlog(1,1670563212,0,2,'','adwClk=1')">
							<img class="err-product" data-img="1" src="//img12.360buyimg.com/n7/jfs/t19021/212/1898737322/240079/6d3de82/5adef9caNedc6fad2.jpg" width="220" height="220">
</a>						<div data-lease="" data-catid="7002" data-venid="77122" data-presale="" data-done="1"></div>
					</div>
					<div class="p-price">
<strong class="J_1670563212" data-done="1"><em>¥</em><i>45.00</i></strong>					</div>
					<div class="p-name p-name-type-2">
						<a target="_blank" title="安全才能更安心,京东发货配送,不吃包退,进入主会场" href="https://ccc-x.jd.com/dsp/nc?ext=aHR0cHM6Ly9pdGVtLmpkLmNvbS8xNjcwNTYzMjEyLmh0bWw&log=JcLIlvAF9ia93akZHh0ef2WqVm0ipX_8oufVmdBGSK1VLkzRex16MKxI2be05Okl8JsdqN591ZetY4l-NN-qnVeJJTL6n0k8ub-DRP9m_Z8ub6bhe-3-A3HClmixZp2kVZcBgUa0cSo4yiTWnjZTi5b1LD3MR1tth9OzRagjnAwq23NZ8vh5VK-DdCW1DhVu040ShPoCsNKTtBa_9GjDNPyYvBElcM7Yq85n0T4jU83GAzAtLpJg6yzn2h8NMe-DUBQgU_YOp8L99gWHfzOYNSAHbz1kHRbuHIKKqpsvfLj9K0FinR77Yh-IoEb2Drvp_XQSRAmDDnObwb3sM4f_-eSauLzVg3NWa1o3Bi4V4T6ZjfLxIxW9aZn_UAyXfNStvIqwG1kRiB7H0nlYevJNeEtNOw5WdUB3zNKQeUrzqJQ&v=404" onclick="searchlog(1,1670563212,0,1,'','adwClk=1')">
							<em>好主人<font class="skcolor_ljg">狗粮</font>  全犬种通用型幼犬天然粮2.5kg增强免疫力</em>
							<i class="promo-words" id="J_AD_1670563212">安全才能更安心,京东发货配送,不吃包退,进入主会场</i>
						</a>
					</div>
					<div class="p-commit">
						<strong><a id="J_comment_1670563212" target="_blank" href="https://ccc-x.jd.com/dsp/nc?ext=aHR0cHM6Ly9pdGVtLmpkLmNvbS8xNjcwNTYzMjEyLmh0bWw&log=JcLIlvAF9ia93akZHh0ef2WqVm0ipX_8oufVmdBGSK1VLkzRex16MKxI2be05Okl8JsdqN591ZetY4l-NN-qnVeJJTL6n0k8ub-DRP9m_Z8ub6bhe-3-A3HClmixZp2kVZcBgUa0cSo4yiTWnjZTi5b1LD3MR1tth9OzRagjnAwq23NZ8vh5VK-DdCW1DhVu040ShPoCsNKTtBa_9GjDNPyYvBElcM7Yq85n0T4jU83GAzAtLpJg6yzn2h8NMe-DUBQgU_YOp8L99gWHfzOYNSAHbz1kHRbuHIKKqpsvfLj9K0FinR77Yh-IoEb2Drvp_XQSRAmDDnObwb3sM4f_-eSauLzVg3NWa1o3Bi4V4T6ZjfLxIxW9aZn_UAyXfNStvIqwG1kRiB7H0nlYevJNeEtNOw5WdUB3zNKQeUrzqJQ&v=404" onclick="searchlog(1,1670563212,0,3,'','adwClk=1')">1.1万+</a>条评价</strong>
					</div>
					<div class="p-shop" data-selfware="0" data-score="0" data-reputation="99" data-verderid="77122" data-done="1"><span class="J_im_icon"><a target="_blank" onclick="searchlog(1,73630,0,58)" href="//mall.jd.com/index-73630.html" title="好主人旗舰店">好主人旗舰店</a><b class="im-01" title="联系第三方卖家进行咨询" onclick="searchlog(1,73630,0,61)"></b></span></div>
					<div class="p-icons" id="J_pro_1670563212" data-done="1">
					<i class="goods-icons2 J-picon-tips" data-tips="退换货免运费">险</i></div>
					<div class="p-operate">
						<a class="p-o-btn contrast J_contrast" data-sku="1670563212" href="javascript:;" onclick="searchlog(1,1670563212,0,6,'','adwClk=1')"><i></i>对比</a>
						<a class="p-o-btn focus J_focus" data-sku="1670563212" href="javascript:;" onclick="searchlog(1,1670563212,0,5,'','adwClk=1')"><i></i>关注</a>
						<a class="p-o-btn addcart" href="//cart.jd.com/gate.action?pid=1670563212&pcount=1&ptype=1" target="_blank" onclick="searchlog(1,1670563212,0,4,'','adwClk=1')" data-limit="0"><i></i>加入购物车</a>
					</div>
					<span class="p-promo-flag">广告</span>
					<img src="https://im-x.jd.com/dsp/np?log=JcLIlvAF9ia93akZHh0ef2WqVm0ipX_8oufVmdBGSK1VLkzRex16MKxI2be05Okl8JsdqN591ZetY4l-NN-qnVeJJTL6n0k8ub-DRP9m_Z8ub6bhe-3-A3HClmixZp2kVZcBgUa0cSo4yiTWnjZTi5b1LD3MR1tth9OzRagjnAzOzUgR6Nxl3gYgjctL-s7XfFz362_ZPxk2A_1_kbuW9OH-Q7O6V1VLYP4BScpoCY_V4g5QN_7jJ5doJ_ZBs0a86KHGfMIv9iUjxfercjPC-gnkSm9Fu76qyiuIU8fFeeJEE75pxxb7fdOFzVoMGtElYi8bIV8o4I3_7GO-kzvSIJYCzhHgdUmjLvCDfLWWkZh5gbehpEoKIXteodkBtRIL3hJ0iWdX2-Xom88c63ZFjD2rXPmt-WziZiBEhzwg24KvjX9UAx2aY9MDpTSTzuH7&v=404" style="display:none;">
	</div>

Code for Method 1:

import re  # extract fields with regular expressions
from urllib.request import Request, urlopen
from urllib.parse import quote

def get_prod(keyword):
    url = "https://search.jd.com/Search?keyword=" + quote(keyword) + "&enc=utf-8"  # target URL
    # some sites reject the default urllib User-Agent, so send a browser-like one
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    html = urlopen(req).read().decode('utf-8')
    # raw string (r'...') avoids invalid-escape warnings in Python 3
    regex = re.compile(r'<li data-sku="\d*?" class="gl-item">\s+?<div class="gl-i-wrap">\s+?<div class="p-img">\s+?<a target="_blank" title="([\w\W]+?)" href="([\w\W]+?)" onclick="[\w\W]+?">\s+?<img width="\d*?" height="\d*?" class="err-product" data-img="\d*?"[\s\S]+?src="([\w\W]+?)"[\s\S]+?</a>[\s\S]+?<div data-cid1="\d*?"[\w\W]+?</div>[\s\S]+?<div class="p-price">[\s\S]+?<strong class="[\w\W]+?" data-price="([\d\D]+?)"')
    patt = regex.findall(html)  # findall already returns a list
    for i in patt:
        print("name: ", i[0])
        print("url: ", i[1].split('//')[1])
        print("img: ", i[2].split('//')[1])
        print("price: ", i[3])
        print("-----------------")


if __name__ == '__main__':
    get_prod("狗粮")

Code for Method 2:

from urllib.request import Request, urlopen
from urllib.parse import quote
from bs4 import BeautifulSoup

def get_prod(keyword):
    url = "https://search.jd.com/Search?keyword=" + quote(keyword) + "&enc=utf-8"
    # some sites reject the default urllib User-Agent, so send a browser-like one
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    html = urlopen(req).read().decode('utf-8')
    soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
    li_all = soup.find_all('li', 'gl-item')
    for i in li_all:
        print("title: ", i.a["title"])
        print("url: ", i.a["href"])
        # lazily loaded images carry the address in data-lazy-img instead of src
        img = i.img["src"] if i.img.has_attr("src") else i.img.get("data-lazy-img")
        print("img: ", img)
        print("price: ", i.strong.get("data-price"))


if __name__ == '__main__':
    get_prod("狗粮")






