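Regular expressions:
A regular expression is a logical formula for operating on strings. Predefined special characters, and combinations of them, form a "pattern string" that expresses a filtering rule to apply to strings.
Using the re module:
Sample string: I say Good not food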
import re
dir(re)
Matching a single character:
. (dot) matches any single character (except a newline by default)
re.findall(".ood","I say Good not food")
[] matches exactly one character out of those listed inside the brackets
re.findall("[Gf]ood","I say Good not food")
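\d matches a single digit
re.findall(r"\d","I am 40")  # ['4', '0']
\w matches a single word character: a digit, a letter or an underscore ([0-9a-zA-Z_])
re.findall(r"\w","I am 40")  # ['I', 'a', 'm', '4', '0']
\s matches a single whitespace character; spaces and tabs both count
re.findall(r"\s","I am 40")  # [' ', ' ']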
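The single-character classes above can be compared side by side. This is a minimal sketch; the variable names are only for illustration, and the patterns are written as raw strings (r"...") so that Python passes the backslash through to the regex engine unchanged.

```python
import re

text = "I am 40"

# Raw strings keep the backslash intact for the regex engine.
digits = re.findall(r"\d", text)       # each digit is a separate match
word_chars = re.findall(r"\w", text)   # letters, digits and underscore
spaces = re.findall(r"\s", text)       # whitespace characters

print(digits)      # ['4', '0']
print(word_chars)  # ['I', 'a', 'm', '4', '0']
print(spaces)      # [' ', ' ']

# The dot and a character set, applied to the sample sentence.
sentence = "I say Good not food"
dot_matches = re.findall(r".ood", sentence)
set_matches = re.findall(r"[Gf]ood", sentence)

print(dot_matches)  # ['Good', 'food']
print(set_matches)  # ['Good', 'food']
```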
To match a whole string, write it literally in the pattern:
Literal match:
re.findall("good","I say Good not food")  # [] (empty), because a literal match is case-sensitive
Using the | separator to match either of two different strings:
re.findall("Good|food","I say Good not food")  # ['Good', 'food']
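The case-sensitivity point and the | separator can be checked together. In this sketch the re.IGNORECASE flag is an addition that the notes above do not use; it is a standard flag of the re module and is shown only as one possible way to relax the case rule.

```python
import re

sentence = "I say Good not food"

# A literal pattern is case-sensitive, so "good" does not match "Good".
exact = re.findall(r"good", sentence)

# re.IGNORECASE (not used in the notes above) relaxes the case rule.
ignore_case = re.findall(r"good", sentence, re.IGNORECASE)

# The | separator matches either alternative.
either = re.findall(r"Good|food", sentence)

print(exact)        # []
print(ignore_case)  # ['Good']
print(either)       # ['Good', 'food']
```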
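Quantifiers apply to the character immediately to their left.
*: the preceding character occurs 0 or more times
re.findall("go*gle","I like google not ggle goooogle and gooooooogle")
+: the preceding character occurs 1 or more times
re.findall("go+gle","I like google not ggle goooogle and gooooooogle")
?: the preceding character occurs 0 or 1 times
re.findall("go?gle","I like google not ggle goooogle and gooooooogle")
{}: sets how many times the preceding character occurs, either exactly ({2}) or within a range ({2,10})
re.findall("go{2}gle","I like google not ggle goooogle and gooooooogle")
re.findall("go{2,10}gle","I like google not ggle goooogle and gooooooogle")
re.findall("go{2,3}gle","I like google not ggle goooogle and gooooooogle")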
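To see how the quantifiers differ, it helps to run them all against one short string. The sample text below is made up for this sketch; its four words contain 0, 1, 2 and 4 letters "o" respectively.

```python
import re

# Hypothetical sample: the words contain 0, 1, 2 and 4 o's.
text = "ggle gogle google goooogle"

patterns = {
    "star": r"go*gle",         # 0 or more o's
    "plus": r"go+gle",         # 1 or more o's
    "optional": r"go?gle",     # 0 or 1 o
    "exactly_2": r"go{2}gle",  # exactly 2 o's
    "2_to_3": r"go{2,3}gle",   # between 2 and 3 o's
    "2_to_10": r"go{2,10}gle", # between 2 and 10 o's
}

results = {name: re.findall(pattern, text) for name, pattern in patterns.items()}

for name, matches in results.items():
    print(name, matches)
# star ['ggle', 'gogle', 'google', 'goooogle']
# plus ['gogle', 'google', 'goooogle']
# optional ['ggle', 'gogle']
# exactly_2 ['google']
# 2_to_3 ['google']
# 2_to_10 ['google', 'goooogle']
```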
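^: checks whether the string starts with the given text
re.findall("^I like","I like google not ggle goooogle and gooooooogle")  # match
re.findall("^and","I like google not ggle goooogle and gooooooogle")  # no match
$: checks whether the string ends with the given text
re.findall("gogle$","I like google not ggle goooogle and gooooooogle")  # [] because the string ends in "oogle", not "gogle"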
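The two anchors can be tried on a shorter string where each outcome is easy to verify by eye. This is a small sketch with a made-up sample sentence.

```python
import re

text = "I like google"

starts_ok = re.findall(r"^I like", text)   # the string starts with "I like"
starts_bad = re.findall(r"^google", text)  # "google" is not at the start
ends_ok = re.findall(r"google$", text)     # the string ends with "google"
ends_bad = re.findall(r"like$", text)      # "like" is not at the end

print(starts_ok)   # ['I like']
print(starts_bad)  # []
print(ends_ok)     # ['google']
print(ends_bad)    # []
```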
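() groups part of a pattern and saves what it matched; \number refers back to a saved group
test=re.search("(allen)\\1","my name is allenallen")
test.group()  # 'allenallen'
How \\1 is read:
\\: an escaped backslash, so the regex engine receives \1
1: the content saved by group 1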
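The backreference can also be written with a raw string, which avoids doubling the backslash. The sketch below shows that both spellings are the same pattern, and that the match fails when the group's text is not repeated; the variable names are illustrative.

```python
import re

# r"(allen)\1" and "(allen)\\1" are the same pattern text.
same_pattern = r"(allen)\1" == "(allen)\\1"

match = re.search(r"(allen)\1", "my name is allenallen")
whole = match.group()    # the full matched text
first = match.group(1)   # what group 1 saved

# Without a repetition there is nothing for \1 to match.
no_repeat = re.search(r"(allen)\1", "my name is allen")

print(same_pattern)  # True
print(whole)         # allenallen
print(first)         # allen
print(no_repeat)     # None
```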
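1. Fetching the home page with a crawler: how to get a page's HTML source
2. Filtering out the image addresses
3. Downloading the images
The site used for this image-crawling test is https://www.dxsbb.com/
paqu.py: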
import urllib.request
import re

class GetHtml(object):
    def __init__(self, URL, HEAD):
        self.url = URL
        self.head = HEAD

    def get_index(self, url=None):
        # Fetch the raw bytes at url (defaults to the home page)
        request = urllib.request.Request(url or self.url)
        # The standard header name is "User-Agent", not "user_agent"
        request.add_header("User-Agent", self.head)
        with urllib.request.urlopen(request) as response:
            return response.read()

    def get_list(self):
        # Find the relative image paths in the page bytes;
        # raw bytes pattern, with the dot before "jpg" escaped
        imglist = re.findall(rb"upFiles/infoImg/\w{16}\.jpg", self.get_index())
        # Decode each path and prepend the site URL to get a full address
        return [self.url + str(i, encoding="utf8") for i in imglist]

    def get_image(self):
        # Save the images as 1.jpg, 2.jpg, ... in the current working directory
        for num, img_url in enumerate(self.get_list(), start=1):
            with open(str(num) + ".jpg", "wb") as f:
                f.write(self.get_index(img_url))

html = GetHtml("https://www.dxsbb.com/", "Mozilla/5.0 (Windows NT 8.1; Win32; x32; rv:105.0) Gecko/20100101 QQBroswer/105.0")
# print(html.get_index())
html.get_image()
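Result: my paqu.py file sits on the desktop and I ran it from there, so the downloaded images were saved to the desktop. The files are written to the current working directory, which is the script's own directory when you run the script from there.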
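The filtering step (step 2) can be tried without any network access. The sketch below assumes a made-up HTML fragment; the file names in it are hypothetical and only follow the "16 word characters plus .jpg" shape that the script's pattern expects. It also uses urllib.parse.urljoin, which is a swap-in for the plain string concatenation in paqu.py, to build the full addresses.

```python
import re
from urllib.parse import urljoin

base_url = "https://www.dxsbb.com/"

# Hypothetical page fragment: two names have 16 word characters, one does not.
sample_html = (
    b'<img src="upFiles/infoImg/2022101112345678.jpg">'
    b'<img src="upFiles/infoImg/short.jpg">'
    b'<img src="upFiles/infoImg/abcdefgh12345678.jpg">'
)

# Same pattern as in paqu.py: a bytes pattern, because the page is read as bytes.
paths = re.findall(rb"upFiles/infoImg/\w{16}\.jpg", sample_html)

# Decode each relative path and resolve it against the site address.
image_urls = [urljoin(base_url, p.decode("utf8")) for p in paths]

for url in image_urls:
    print(url)
# https://www.dxsbb.com/upFiles/infoImg/2022101112345678.jpg
# https://www.dxsbb.com/upFiles/infoImg/abcdefgh12345678.jpg
```

Testing the pattern on a fixed sample like this makes it easier to see why an image is skipped (here, the name that is shorter than 16 characters) before running the crawler against the live site.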