Wildcards
Wildcard characters:
* : matches zero or more characters
? : matches a single character
. : the current directory
.. : the parent directory
[0-9]: any single digit from 0 to 9
[a-z]: any single letter from a to z
[A-Z]: any single letter from A to Z
[a-zA-Z]: any single letter from a to z or A to Z
Python's glob does NOT recognize the POSIX character classes below:
[[:digit:]]
[[:alpha:]]
[[:upper:]]
[[:lower:]]
[[:space:]]
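Although glob itself cannot use these classes, the re module's \d, \s, etc. cover the same character sets, so one workaround (a sketch; the filename list here is made up) is to filter names with re:

```python
import re

# glob cannot use [[:digit:]], but re's \d matches the same characters.
# Filter a hypothetical list of filenames for names containing a digit:
names = ['grub.conf', 'rsyslog5.conf', 'DIR_COLORS.conf']
with_digit = [n for n in names if re.search(r'\d', n)]
print(with_digit)  # ['rsyslog5.conf']
```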
import glob
# glob.glob: returns every path matching the glob pattern (as a list)
print(glob.glob('/etc/*.conf'))
print(glob.glob('/etc/?????.conf'))
print(glob.glob('/etc/*[0-9]*.conf'))
print(glob.glob('/etc/*[A-Z]*.conf'))
print(glob.glob('/etc/*[0-9A-Z]*.conf'))
# glob.iglob: returns every path matching the glob pattern (as a generator)
print(glob.iglob('/etc/*[0-9A-Z]*.conf'))
Regular expressions
re: regular expression == regex
Purpose: processes strings; it checks whether the string matches the regular expression you wrote:
if it matches, the matching content is extracted;
if it does not match, the non-matching content is ignored.
Common methods (findall, match, search):
findall: returns a list of all the matched substrings
match: tries to match from the very beginning of the string;
if the beginning does not match, it returns None;
if the beginning matches, it returns a match object; use the group method to get the matched string.
search: scans the whole string and returns only the first successful match;
if one is found, it returns a match object; use the group method to get the matched string.
import re
s='/home/kiosk/Desktop/python1/home/kiosk/Desktop/正则表达式'
pattern1 = r'cooffee'
pattern2 = r'kiosk'
print(re.findall(pattern1,s))
print(re.findall(pattern2,s))
pattern3 = r'/home'
print(re.match(pattern2,s))
print(re.match(pattern3,s))
matchobj = re.match(pattern3,s)
print(matchobj.group())
mat1obj = re.search(pattern2,s)
print(mat1obj)
print(mat1obj.group())
mat2obj = re.search(pattern3,s)
print(mat2obj)
print(mat2obj.group())
Special character classes in regular expressions
Literal match:
r'cooffee'
Character classes:
[pP]ython: 'python' with a lowercase or uppercase first letter
cooffee[pP]: 'cooffee' followed by p or P
[aeiou]: matches one vowel character
[a-z]: matches one character from a to z
[A-Z]: matches one character from A to Z
[a-zA-Z0-9]: a letter of either case or a digit
[^aeiou]: any character except a vowel
[^0-9]: any character except a digit
Special character classes:
. : matches any character except \n (pass the re.S flag to make it match \n as well)
\d: digit; matches one digit character, equivalent to [0-9]
\D: matches one non-digit character, equivalent to [^0-9]
\s: space (whitespace in the broad sense: space, \t, \n, \r); matches one whitespace character
\S: matches one non-whitespace character
\w: a letter, digit, or underscore, [a-zA-Z0-9_] (in Python 3, \w also matches other Unicode word characters, such as Chinese)
\W: anything except \w, [^a-zA-Z0-9_]
import re
print(re.findall(r'[^0-9]','cooffee520floation'))
print(re.findall(r'[0-9]','cooffee520floation'))
print(re.findall(r'.','cooffee\n'))
print(re.findall(r'\d','当前文章阅读量为8'))
print(re.findall(r'\d','当前文章阅读量为8000'))
print(re.findall(r'\D','当前文章阅读量为8000'))
print(re.findall(r'\s','\n当前\r文章\t阅读量为8'))
print(re.findall(r'\S','\n当前\r文章\t阅读量为8'))
print(re.findall(r'\w','12当前cooffee_读量为8&'))
print(re.findall(r'\W','12当前cooffee_读量为8&'))
Specifying how many times a character occurs
Quantifiers:
*: the preceding character occurs 0 or more times; e.g. \d*, .*
+: the preceding character occurs 1 or more times; e.g. \d+
?: the preceding character occurs 0 or 1 times; useful when a character may or may not be present
A second form:
{m}: the preceding character occurs exactly m times
{m,}: the preceding character occurs at least m times; * == {0,}; + == {1,}
{m,n}: the preceding character occurs m to n times; ? == {0,1}
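A minimal sketch of these quantifiers on a made-up sample string:

```python
import re

s = 'color colour colouur'
print(re.findall(r'colou?r', s))      # ? -> 0 or 1 'u':   ['color', 'colour']
print(re.findall(r'colou*r', s))      # * -> 0 or more:    ['color', 'colour', 'colouur']
print(re.findall(r'colou{1,2}r', s))  # {1,2} -> 1 or 2:   ['colour', 'colouur']
```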
eg: matching a phone number
import re
pattern = r'\d{3}[\s-]?\d{4}[\s-]?\d{4}'
print(re.findall(pattern,'188 3789 7597'))
print(re.findall(pattern,'18837897597'))
print(re.findall(pattern,'188-3789-7597'))
eg: matching an email address
Requirement: match a qq mailbox (e.g. xdshcdshvfhdvg@qq.com); the part before @ may contain letters, digits, and underscores, must not start with a digit or an underscore, and is 6 to 12 characters long.
import re
# '\.' escapes the dot so it matches a literal '.';
# [a-zA-Z] (not [A-z], which also covers '[', ']', '_', etc.) enforces a letter first
pattern = r'[a-zA-Z]\w{5,11}@qq\.com'
s = '''
你好,各种格式的邮箱入下所示:
kevintian126@126.com
2. 1136667341@qq.com
3. meiya@cn-meiya.com
4. wq901200@hotmail.com
5. meiyahr@163.com
6. meiyuan@0757info.com
7. chingpeplo@sina.com
8. tony@erene.com.com
9. melodylu@buynow.com
10.cooffee@qq.com
11.f1234567@qq.com
'''
datali = re.findall(pattern, s)
with open('email.txt', 'w') as f:
    for email in datali:
        f.write(email + '\n')
eg: matching an IP address
import re
pattern = r'^(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.' \
r'(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.' \
r'(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.' \
r'(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)$'
obj = re.match(pattern, '172.25.1.178')
if obj:
    print('Found a match:', obj.group())
else:
    print('Not Found')
obj1 = re.match(pattern, '172.25.1.278')
if obj1:
    print('Found a match:', obj1.group())
else:
    print('Not Found')
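As an alternative to anchoring with ^ and $, re.fullmatch (Python 3.4+) succeeds only if the entire string matches; a sketch reusing the same octet alternation:

```python
import re

octet = r'(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)'
pattern = r'\.'.join([octet] * 4)   # octet\.octet\.octet\.octet

# fullmatch returns a match object only when the whole string matches
print(bool(re.fullmatch(pattern, '172.25.1.178')))  # True
print(bool(re.fullmatch(pattern, '172.25.1.278')))  # False
```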
Getting the total page count of a Tieba thread
from itertools import chain
from urllib.request import urlopen
import re
def getPageContent(url):
    """
    Fetch the page source.
    :param url: the target url
    :return: the page content (as a str)
    """
    with urlopen(url) as html:
        return html.read().decode('utf-8')
def parser_page(content):
    """
    Extract the thread's total page count from the page content.
    :param content: page content
    :return: the total page count
    """
    pattern = r'<span class="red">(\d+)</span>'
    data = re.findall(pattern, content)
    return data[0]
def parser_all_page(pageCount):
    """
    Build the url for each page of the thread and collect every email address found.
    :param pageCount:
    :return:
    """
    emails = []
    for page in range(int(pageCount)):
        url = 'http://tieba.baidu.com/p/2314539885?pn=%d' % (page + 1)
        print('Crawling: %s' % (url))
        content = getPageContent(url)
        pattern = r'[a-zA-Z0-9][-\w.+]*@[A-Za-z0-9][-A-Za-z0-9]+\.+[A-Za-z]{2,14}'
        findEmail = re.findall(pattern, content)
        print(findEmail)
        emails.append(findEmail)
    return emails
def main():
    url = 'http://tieba.baidu.com/p/2314539885'
    content = getPageContent(url)
    pageCount = parser_page(content)
    emails = parser_all_page(pageCount)
    print(emails)
    with open('tiebaEmail.text', 'w') as f:
        for tieba in chain(*emails):
            f.write(tieba + '\n')
main()
Characters that need escaping in regex, and grouping
Because these characters have special meanings in regex, they must be escaped to be matched literally: \., \+, \?, \*
Grouping:
| : matches the expression on either side of it
(ab): treats the parenthesized characters as one group
\num: backreference to whatever group number num matched
(?P<name>): gives a group a name
import re
# When grouping is used, findall returns only the contents of the groups
print(re.findall(r'(cooffee|hello)\d+','cooffee12hello2'))
# When findall is not enough, consider search or match
obj = re.search(r'(cooffee|hello)(\d+)','cooffee12hello2')
if obj:
    print(obj.group())
    print(obj.groups())
else:
    print("Not Found")
# \num
s = '<html><title>cooffee</title></html>'
pattern = r'<(\w+)><(\w+)>(\w+)</\w+></\w+>'
print(re.findall(pattern,s))
s = '<html><title>cooffee</title></html>'
# There are three groups here; \1 refers back to what group 1 matched, \2 to what group 2 matched
pattern = r'<(\w+)><(\w+)>(\w+)</\2></\1>'
print(re.findall(pattern,s))
s = '<html><title>cooffee</tile></html>'
pattern = r'<(\w+)><(\w+)>(\w+)</\2></\1>'
print(re.findall(pattern,s))
# (?P<name>)
s1 = 'http://www.cooffee.org/linux/book/'
pattern = r'http://[\w\.]+/(?P<courseName>\w+)/(?P<courseType>\w+)/'
obj1 = re.match(pattern, s1)
if obj1:
    print(obj1.group())
    print(obj1.groups())
    print(obj1.groupdict())
else:
    print('Not Found')
# ID card number: 610 897 19900415 4534
s = '610897199004154534'
pattern = r'(?P<Province>\d{3})[\s-]?(?P<City>\d{3})[\s-]?(?P<Year>\d{4})[\s-]?' \
r'(?P<Month>\d{2})(?P<Day>\d{2})(\d{4})'
Obj = re.search(pattern, s)
if Obj:
    print(Obj.groupdict())
else:
    print('Not Found')
Crawling images
Crawling a single image
from urllib.request import urlopen
url = 'http://imgsrc.baidu.com/forum/w%3D580/sign=e23a670db9b7d0a27' \
'bc90495fbee760d/38292df5e0fe9925f33f62ef3fa85edf8db17159.jpg'
# 1. Fetch the image content
content = urlopen(url).read()
# 2. Write it to a local file
with open('hello.jpg', 'wb') as f:
    f.write(content)
Crawling the images from the specified pages of a Tieba thread
import re
from itertools import chain
from urllib.request import urlopen
def get_content(url):
    """
    Fetch the page content.
    :param url:
    :return:
    """
    with urlopen(url) as html:
        return html.read()
def parser_get_img_url(pageCount):
    """
    Parse the thread content and collect the urls of all the scenery images.
    """
    imgLi = []
    for page in range(pageCount):
        url = 'http://tieba.baidu.com/p/5437043553?pn=%d' % (page + 1)
        content = get_content(url)
        pattern = r'<img class="BDE_Image".*?src="(http://.*?\.jpg)".*?>'
        imgUrl = re.findall(pattern, content.decode('utf-8').replace('\n', ' '))
        imgLi.append(imgUrl)
    return imgLi
def main():
    imgLi = parser_get_img_url(7)
    index = 0
    for imgurl in chain(*imgLi):
        index += 1
        # fetch each image's content from its url
        content = get_content(imgurl)
        with open('img/%s.jpg' % (index), 'wb+') as f:
            f.write(content)
        print("Image %s downloaded...." % (index))
main()
Batch replacement and splitting on multiple delimiters with regex
# split(): split on several delimiters at once
import re
ip = '172.25.254.178'
print(ip.split('.'))
s = '12+13-15/16'
print(re.split(r'[\+\-\*/]',s))
# replace() does simple substitution
s = 'cooffee is a company'
print(s.replace('cooffee','floation'))
# To replace a number whose exact value is not fixed, use a regex
s = '本次转发数为100'
print(re.sub(r'\d+','0',s))
# sub automatically passes addNum one argument: the match object
def addNum(sreobj):
    """Add 1 to the matched number"""
    # by default, what the regex matched is still a string
    num = sreobj.group()
    new_num = int(num) + 1
    return str(new_num)
s1 = '本次转发数为100,分享数量为99'
print(re.sub(r'\d+',addNum,s1))
compile:
For regular expressions that are used repeatedly, compile them once with re.compile; later uses reuse the compiled object directly, which runs faster. compile also accepts the re.VERBOSE flag, which lets you annotate the pattern with comments as you write it. Example code:
import re
text = "the number is 20.50"
r = re.compile(r"""
\d+ # the digits before the decimal point
\.? # the decimal point
\d* # the digits after the decimal point
""", re.VERBOSE)
ret = re.search(r, text)
print(ret.group())
Getting past anti-crawler measures, step 1: disguise yourself as a browser
How do you access a page while pretending to be a browser?
from urllib.request import urlopen
from urllib import request
url = "http://www.cbrc.gov.cn/chinese/jrjg/index.html"
# 1. Define a real browser's User-Agent string; to find your browser's User-Agent,
#    type in the address bar: javascript:alert(navigator.userAgent)
user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0'
# 2. Put it into the request headers
req = request.Request(url,headers={'User-Agent':user_agent})
# 3. Open the page and fetch its content
print(urlopen(req).read().decode('utf-8'))
Crawling bank names and official website addresses
Requirement: crawl every bank's name and official website address (skip banks that have no website) and write them to a database. (http://www.cbrc.gov.cn/chinese/jrjg/index.html)
from urllib.request import urlopen
from urllib import request
import re
import pymysql
def get_content(url):
    user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:38.0) ' \
                 'Gecko/20100101 Firefox/38.0'
    req = request.Request(url, headers={'User-Agent': user_agent})
    return urlopen(req).read().decode('utf-8')
def parser_get_url_name(content):
    pattern = r'<a href="(http://[\w\./]+)" target="_blank".*>\s+(\w+)\s+</a>'
    print("Crawling.....")
    urlname = re.findall(pattern, content)
    return urlname
def write_data(urlnameLi):
    conn = pymysql.connect(host='172.25.254.78', user='cooffee',
                           password='cooffee', charset='utf8', autocommit=True)
    cur = conn.cursor()
    conn.select_db('cooffee')
    insert_sqli = 'insert into blankdata VALUES (%s, %s);'
    print("Writing to the database.....")
    cur.executemany(insert_sqli, urlnameLi)
url = 'http://www.cbrc.gov.cn/chinese/jrjg/index.html'
content = get_content(url)
urlnameLi=parser_get_url_name(content)
write_data(urlnameLi)
print('ok')
(to be continued; the screenshots are not finished yet....)
Maoyan Movies TOP 100
from urllib.request import urlopen
from urllib import request
import re
import pymysql
def get_content(url):
    user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:38.0) ' \
                 'Gecko/20100101 Firefox/38.0'
    req = request.Request(url, headers={'User-Agent': user_agent})
    return urlopen(req).read()
def parser_get_url(content):
    pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)" alt="(.*?)" class="board-img" />.*?star">(.*?)</p>.*?releasetime">(.*?)</p>.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    infomations = re.findall(pattern, content.decode('utf-8'))
    infomationLi = []
    for infomation in infomations:
        info_dict = {
            'imgurl': infomation[1],
            'title': infomation[2],
            'actor': infomation[3].strip()[3:],
            'time': infomation[4],
        }
        infomationLi.append(info_dict)
    return infomationLi
def save_img(infomationLi):
    for infomation in infomationLi:
        contentimg = get_content(infomation['imgurl'])
        print("Saving image....")
        with open('imgmao/%s.jpg' % (infomation['title']), 'wb') as f:
            f.write(contentimg)
def save_infomation(infomationLi):
    conn = pymysql.connect(host='172.25.254.78', user='cooffee', password='cooffee', charset='utf8', autocommit=True)
    cur = conn.cursor()
    conn.select_db('cooffee')
    saveLi = [(infomation['title'], infomation['actor'], infomation['time']) for infomation in infomationLi]
    insert_sqli = 'insert into maodata VALUES (%s,%s,%s);'
    print("Writing to the database.....")
    cur.executemany(insert_sqli, saveLi)
for page in range(10):
    url = 'http://maoyan.com/board/4?offset=%s' % (page * 10)
    content = get_content(url)
    infomationLi = parser_get_url(content)
    save_img(infomationLi)
    save_infomation(infomationLi)