Regular expressions
Single-character matches
Character | Meaning |
---|---|
. | Any character except a newline |
\d | A digit: [0-9] |
\D | Any non-digit character |
\w | A word character: [a-zA-Z0-9_] |
\W | Any non-word character |
\s | A whitespace character: space, \n, \t, \r |
\S | Any non-whitespace character |
^ | Anchors the match to the start of the string |
$ | Anchors the match to the end of the string |
[0-9a-z] | Any digit 0-9 or lowercase letter a-z |
[^a-z] | Any character NOT in a-z |
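A quick sanity check of these character classes and anchors (the sample strings are made up for illustration):

```python
import re

# \d matches a digit; the pattern must match at the start of the string for re.match
assert re.match(r'\d\d\d', '123abc')

# ^ and $ anchor the whole string; _ counts as a word character (\w)
assert re.match(r'^\w+$', 'hello_123')

# [0-9a-z]+ matches the first run of digits or lowercase letters
print(re.search(r'[0-9a-z]+', 'ABC42xyz').group())   # -> 42xyz

# [^a-z] matches the first character outside a-z
print(re.search(r'[^a-z]', 'abcD').group())          # -> D
```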
Multi-character matches (greedy)
Character | Meaning |
---|---|
* | Matches the preceding character zero or more times |
+ | Matches the preceding character one or more times |
? | Matches the preceding character zero or one time |
{n,m} | Matches the preceding character n to m times |
Multi-character matches (non-greedy)
Appending ? to a quantifier makes it match as few characters as possible:
- *? (zero or more, as few as possible)
- +? (one or more, as few as possible)
- ?? (zero or one, preferring zero)
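The difference between greedy and non-greedy quantifiers is easiest to see on an HTML-like string (the example input is made up):

```python
import re

html = '<b>bold</b><i>italic</i>'

# Greedy: .* consumes as much as possible, swallowing everything between
# the first '<' and the last '>'
print(re.findall(r'<.*>', html))    # -> ['<b>bold</b><i>italic</i>']

# Non-greedy: .*? stops at the first '>' it can
print(re.findall(r'<.*?>', html))   # -> ['<b>', '</b>', '<i>', '</i>']

# {n,m} matches the preceding character n to m times (greedily)
print(re.findall(r'a{2,3}', 'a aa aaa aaaa'))   # -> ['aa', 'aaa', 'aaa']
```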
Other
- () grouping
- | alternation (logical OR)
- \ escapes a metacharacter
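Grouping, alternation, and escaping in one short example (the date string is just sample data):

```python
import re

# () captures groups that can be read back with m.group(n)
m = re.match(r'(\d{4})-(\d{2})-(\d{2})', '2019-08-15')
print(m.group(1), m.group(2), m.group(3))   # -> 2019 08 15

# | tries each alternative in turn
print(re.findall(r'cat|dog', 'cat dog bird'))   # -> ['cat', 'dog']

# \. escapes the dot so it matches a literal '.' instead of any character
print(re.search(r'\d+\.\d+', 'pi is 3.14').group())   # -> 3.14
```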
Methods of the re module
import re
re.compile(): builds a compiled regular-expression object
re.match(): matches only at the beginning of the string; returns a match object for the matching substring, or None if there is no match (single match)
re.search(): scans the whole string from the beginning; returns a match object for the first matching substring, or None if there is no match (single match)
re.findall(): finds every non-overlapping match in the string and returns them as a list
re.finditer(): finds every match in the string and returns an iterable of match objects
re.sub(): replaces every substring in the string that matches the pattern
re.split(): splits the string on the pattern and returns a list
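The methods above, side by side on one sample string (the string is made up to show the differences):

```python
import re

text = 'one1two22three333'

pattern = re.compile(r'\d+')        # re.compile builds a reusable pattern object

print(pattern.match(text))          # -> None: the string does not START with digits
print(pattern.search(text).group()) # -> '1': first match anywhere in the string
print(pattern.findall(text))        # -> ['1', '22', '333']: all matches as a list
print([m.group() for m in pattern.finditer(text)])  # same matches, via an iterator
print(pattern.sub('#', text))       # -> 'one#two#three#': every match replaced
print(re.split(r'\d+', text))       # -> ['one', 'two', 'three', '']: split on matches
```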
xpath
- Install: pip install lxml
- Import: from lxml import etree
Create an etree object to parse the target data
- 1. Local file
  - tree = etree.parse('path/to/local/file')
  - tree.xpath('xpath expression')
- 2. Network response
  - tree = etree.HTML('page data from a network request')
  - tree.xpath('xpath expression')
(Name the result tree rather than etree, otherwise the assignment shadows the etree module itself.)
Common xpath expressions:
- 1. Locating by attribute:
  Find the div whose class attribute is "song"
  //div[@class='song']
- 2. Locating by hierarchy and index
  Find the a that is a direct child of the second li under the ul that is a direct child of the div whose class is "tang"
  //div[@class='tang']/ul/li[2]/a
- 3. Logical operators
  Find the a whose href attribute is empty and whose class attribute is "du"
  //a[@href='' and @class='du']
- 4. Getting text
  /text() returns the text directly inside a tag
  //div[@class='song']/p[1]/text()
  //text() returns the text of a tag and of all its descendant tags
  //div[@class='tang']//text()
- 5. Getting an attribute with @
  //div[@class='tang']//li[2]/a/@href
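A minimal sketch exercising these expressions against a small made-up page (the class names mirror the examples above; the HTML itself is invented):

```python
from lxml import etree

html = '''
<div class="tang">
  <ul>
    <li><a href="/a1">Li Bai</a></li>
    <li><a href="/a2">Du Fu</a></li>
  </ul>
</div>
'''
tree = etree.HTML(html)   # lenient HTML parser; wraps the fragment in html/body

# Attribute + hierarchy + index: text of the a inside the second li
print(tree.xpath('//div[@class="tang"]/ul/li[2]/a/text()'))   # -> ['Du Fu']

# Take an attribute value with @
print(tree.xpath('//div[@class="tang"]//li[2]/a/@href'))      # -> ['/a2']

# //text() gathers text from the div and all its descendants
print([t.strip() for t in tree.xpath('//div[@class="tang"]//text()') if t.strip()])
```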
```python
# pip install lxml
from lxml import etree
import requests


class CollegateRank(object):

    def get_page_data(self, url):
        response = self.send_request(url=url)
        if response:
            # Save a local copy of the page, then parse it
            with open('page.html', 'w', encoding='gbk') as file:
                file.write(response)
            self.parse_page_data(response)

    def parse_page_data(self, response):
        # Parse the page with xpath: each dl under the ranking list is one school
        etree_xpath = etree.HTML(response)
        ranks = etree_xpath.xpath('//div[@class="scores_List"]/dl')
        for dl in ranks:
            school_info = {}
            school_info['url'] = self.extract_first(dl.xpath('./dt/a[1]/@href'))
            school_info['icon'] = self.extract_first(dl.xpath('./dt/a[1]/img/@src'))
            school_info['name'] = self.extract_first(dl.xpath('./dt/strong/a/text()'))
            school_info['address'] = self.extract_first(dl.xpath('./dd/ul/li[1]/text()'))
            school_info['features'] = '、'.join(dl.xpath('./dd/ul/li[2]/span/text()'))
            school_info['type'] = self.extract_first(dl.xpath('./dd/ul/li[3]/text()'))
            school_info['belong'] = self.extract_first(dl.xpath('./dd/ul/li[4]/text()'))
            school_info['level'] = self.extract_first(dl.xpath('./dd/ul/li[5]/text()'))
            school_info['weburl'] = self.extract_first(dl.xpath('./dd/ul/li[6]/text()'))
            print(school_info)

    def extract_first(self, data=None, default=None):
        # xpath always returns a list; take its first element if there is one
        if data:
            return data[0]
        return default

    def send_request(self, url, headers=None):
        headers = headers if headers else {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/76.0.3809.100 Safari/537.36'}
        response = requests.get(url=url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None


if __name__ == '__main__':
    url = 'http://college.gaokao.com/schlist/'
    obj = CollegateRank()
    obj.get_page_data(url)
```