Python Crawler Data Extraction

Latest recommended article published on 2021-08-02 21:40:59

weixin_34345753

Tags: crawler python c#

Original link: https://segmentfault.com/a/1190000013196759

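The content fetched by a crawler has to be reduced to the parts that are actually useful. This step is called data extraction, or data cleaning.

The content generally falls into two categories: unstructured data and structured data.
Unstructured data: the data comes first and the structure afterwards, for example free text, phone numbers and email addresses (handled with regular expressions), and HTML files (handled with regular expressions, XPath, or CSS selectors).
Structured data: the structure comes first and the data afterwards, for example JSON (JSONPath) and XML (XPath, regular expressions, and so on).
Different kinds of data call for different handling.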
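
As a rough illustration of the two cases, the sketch below pulls an email address and a phone number out of free text with regular expressions, and reads a field out of JSON with the standard json module. The sample text, the JSON payload, and both patterns are made up for this example and are deliberately simple, not production-grade validators.

```python
import json
import re

# Unstructured: free text, so we describe the shape we want with a regex.
text = "Contact Tom at tom@example.com or call 010-12345678."
emails = re.findall(r"[\w.]+@[\w.]+\.\w+", text)
phones = re.findall(r"\d{3}-\d{8}", text)

# Structured: the data already has a schema, so we parse it and index by key.
payload = '{"name": "Tom", "email": "tom@example.com"}'
record = json.loads(payload)

print(emails)            # ['tom@example.com']
print(phones)            # ['010-12345678']
print(record["email"])   # tom@example.com
```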

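A crawler really has just four main steps:

Define the target (know which site or scope you are going to search)
Crawl (download all of the content from those sites)
Extract (discard the data that is of no use to us)
Process the data (store and use it the way we want)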

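What is a regular expression

A regular expression (regex) is typically used to search for and replace text that conforms to some pattern (rule).
A regular expression is a kind of logical formula for operating on strings: predefined special characters, and combinations of them, form a "rule string" that expresses a filtering logic for strings.

Most programming languages support regular expressions, including JS, Java, and C#. Python added the re module in version 1.5, and it gives Python a full-featured regular expression engine.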

Regex matching rules

[Figure: table of regular expression metacharacters]

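Steps for using the re module

In Python, regular expressions are available through the built-in re module.

One thing needs special attention: regular expressions use the backslash \ to escape special characters, and so do ordinary Python string literals. To keep the backslashes exactly as written, use a raw string by adding the r prefix, for example: r'test\t.\tpython'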
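
The effect of the r prefix can be checked directly. The sketch below is only an illustration (the sample strings are made up): it compares a normal string with a raw string and then uses a raw string for a regex escape.

```python
import re

# In a normal string, "\t" is a single tab character.
print(len("\t"))    # 1
# In a raw string, r"\t" is two characters: a backslash and the letter t.
print(len(r"\t"))   # 2

# The regex engine understands \t itself, so both patterns match a tab;
# the raw form simply passes the backslash through untouched.
sample = "test\tpython"
same = re.findall("\t", sample) == re.findall(r"\t", sample)
print(same)         # True

# For regex escapes that have no meaning in a Python string, such as \d,
# a raw string avoids ambiguity and invalid-escape warnings.
digits = re.findall(r"\d+", "a11b22")
print(digits)       # ['11', '22']
```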

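The general steps for using the re module are:
1. Use compile() to compile the string form of the regular expression into a Pattern object.
2. Use the methods of the Pattern object to search the text. Methods such as search() and match() return a Match object, while findall() returns a list of strings.
3. Use the attributes and methods of the Match object to get the information you need, then carry on with whatever else is required.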

import re
text = "abcdefabcdef"  # the string to search
# compile() takes the pattern string and, optionally, flags such as
# re.I (ignore case) or re.S ("." also matches newlines)
m = re.compile("a")
result = m.findall(text)
print(result)  # prints ['a', 'a']
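
The flags mentioned in the comment can be tried out as follows. This is a small sketch with a made-up three-line string; it only shows re.I and re.S, the two flags named above.

```python
import re

text = "Python\npython\nPYTHON"

# Without flags the match is case-sensitive.
no_flag = re.compile("python").findall(text)
print(no_flag)       # ['python']

# re.I (re.IGNORECASE) ignores case.
ignore_case = re.compile("python", re.I).findall(text)
print(ignore_case)   # ['Python', 'python', 'PYTHON']

# By default "." does not match a newline, so this cannot span lines.
dot_default = re.compile("Py.*ON").findall(text)
print(dot_default)   # []

# re.S (re.DOTALL) lets "." match newlines as well.
dot_all = re.compile("Py.*ON", re.S).findall(text)
print(dot_all)       # ['Python\npython\nPYTHON']
```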

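How to write a particular regular expression comes down to combining the metacharacters from the figure above so that they match what you want.
Exercise 1: find the numbers in a string. \d matches a digit 0-9, and + matches the preceding item one or more times.

import re
text = "a11b22c3"
m = re.compile(r"\d+")
print(m.findall(text))  # prints ['11', '22', '3']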

Exercise 2: find the words that contain oo

import re

text = "1oo1 tina is a good girl ,she is cool"
m = re.compile(r"[a-z]oo[a-z]")
print(m.findall(text))  # prints ['good', 'cool']

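Now that you have a rough idea of how regular expressions are written, let's look at the remaining steps. There is no rush: writing regular expressions is a skill that builds up over time.
compile() turns the string form of a regular expression into a Pattern object, and the Pattern object provides a set of methods for searching text. They are listed below.
m.search() scans the string for the pattern and returns as soon as it finds the first match. If nothing in the string matches, it returns None.

import re

text = "1oo1 tina is a good girl ,she is cool"
m = re.compile(r"[a-z]oo[a-z]")
print(m.search(text))  # <re.Match object; span=(15, 19), match='good'> (shown as _sre.SRE_Match before Python 3.7)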
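
Step 3 of the workflow, reading information from the Match object, might look like the sketch below. The two capture groups in the pattern are an addition for this illustration and are not part of the pattern used above.

```python
import re

text = "1oo1 tina is a good girl ,she is cool"
# Parentheses create capture groups around the first and last letter.
pattern = re.compile(r"([a-z])oo([a-z])")

match = pattern.search(text)
if match is not None:                  # search() returns None when nothing matches
    print(match.group())               # whole match: good
    print(match.group(1))              # first capture group: g
    print(match.group(2))              # second capture group: d
    print(match.start(), match.end())  # 15 19
    print(match.span())                # (15, 19)
```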

m.findall() goes through the whole string and returns every non-overlapping match as a list.

import re

text = "1oo1 tina is a good girl ,she is cool"
m = re.compile(r"[a-z]oo[a-z]")
print(m.findall(text))  # prints ['good', 'cool']

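m.match() checks whether the regular expression matches at the start of the string.
The difference between re.match and re.search:
re.match only matches at the beginning of the string; if the beginning does not fit the regular expression, the match fails and the function returns None. re.search scans the whole string until it finds a match.

import re

text = "aooz tina is a good girl ,she is cool"
m = re.compile(r"[a-z]oo[a-z]")
# match(string, pos, endpos): pos is the index to start at and endpos is the index
# to stop before (not a length), so this call only looks at text[0:6]
print(m.match(text, 0, 6))  # <re.Match object; span=(0, 4), match='aooz'>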
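
To see the difference side by side, here is a short sketch on a made-up string that does not start with a match. fullmatch() is included as a related method that is not discussed above; it requires the whole string to match.

```python
import re

pattern = re.compile(r"[a-z]oo[a-z]")
text = "tina is a good girl"

at_start = pattern.match(text)
print(at_start)      # None: the string does not begin with a match

found = pattern.search(text).group()
print(found)         # good: search() keeps scanning

from_pos = pattern.match(text, 10).group()
print(from_pos)      # good: matching starts at index 10

print(pattern.fullmatch("good") is not None)    # True
print(pattern.fullmatch("goods") is not None)   # False
```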

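m.split() splits the string at every substring that matches and returns the pieces as a list.

import re

text = "aa1bb2cc3dd4"
m = re.compile(r"\d+")
# split(string[, maxsplit]): maxsplit caps the number of splits; leave it out to split everywhere
# parts = m.split(text)  # gives ['aa', 'bb', 'cc', 'dd', '']
parts = m.split(text, 2)  # gives ['aa', 'bb', 'cc3dd4']
print(parts)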

m.sub() replaces every substring that matches with the replacement and returns the resulting string.

import re

text = "aa1bb2cc3dd4"
m = re.compile(r"\d+")
result = m.sub('*', text)
print(result)  # prints aa*bb*cc*dd*
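
sub() also accepts a count, a backreference in the replacement, or a function as the replacement. The sketch below shows each on the same sample string; the variable names are chosen just for this illustration.

```python
import re

text = "aa1bb2cc3dd4"

# count limits how many matches are replaced.
first_two = re.sub(r"\d+", "*", text, count=2)
print(first_two)   # aa*bb*cc3dd4

# \1 in the replacement refers to the first capture group.
bracketed = re.sub(r"(\d+)", r"[\1]", text)
print(bracketed)   # aa[1]bb[2]cc[3]dd[4]

# A function receives each Match object and returns the replacement text.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), text)
print(doubled)     # aa2bb4cc6dd8
```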

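Regex exercises

1. Given the string:
info = '<a href="http://www.baidu.com">baidu</a>'
use the re module to extract the URL "http://www.baidu.com" and the link text "baidu".
2. Given the string "one1two2three3four4", use a regular expression to output "1234".
3. Given the string text = "JGood is a handsome boy, he is cool, clever, and so on...", find all the words that contain 'oo'.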

Answers to the regex exercises:

import re

# 1. Given the string:
# info = '<a href="http://www.baidu.com">baidu</a>'
# use the re module to extract the URL "http://www.baidu.com" and the link text "baidu"
info = '<a href="http://www.baidu.com">baidu</a>'
# pattern1 = re.compile(r'http:.+\.com')  # ['http://www.baidu.com']
# pattern1 = re.compile(r"[a-z.]*baidu[.a-z]*")  # ['www.baidu.com', 'baidu']
pattern1 = re.compile(r"[w.]*baidu\.*\w*")  # ['www.baidu.com', 'baidu'] (the host, without "http://")
f1 = pattern1.findall(info)
print(f1)
# print(f1[0])

# 2. Given the string "one1two2three3four4", use a regular expression to output "1234"
info1 = "one1two2three3four4"
pattern2 = re.compile(r'\d')
f2 = pattern2.findall(info1)
print(f2)  # ['1', '2', '3', '4']
print("".join(f2))  # 1234

# 3. Given the string text = "JGood is a handsome boy, he is cool, clever, and so on...", find all the words that contain 'oo'.
info3 = "JGood is a handsome boy, he is cool, clever, and so on..."
pattern3 = re.compile(r'\w*oo\w*')
f3 = pattern3.findall(info3)
print(f3)  # ['JGood', 'cool']
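
For exercise 1, one alternative is to use capture groups so that the full URL (including "http://") and the link text come back together. This is a sketch of one possible answer, not the only one, and it assumes the tag has exactly the simple shape shown in the exercise.

```python
import re

info = '<a href="http://www.baidu.com">baidu</a>'

# Group 1 captures the href value, group 2 captures the link text.
# The non-greedy .*? stops at the first closing quote or closing tag.
link_pattern = re.compile(r'<a href="(.*?)">(.*?)</a>')

# With more than one group, findall() returns a list of tuples.
links = link_pattern.findall(info)
print(links)   # [('http://www.baidu.com', 'baidu')]

url, label = links[0]
print(url)     # http://www.baidu.com
print(label)   # baidu
```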

Scraping Neihan Duanzi jokes with regex matching

#coding:utf8

from urllib import request
import re

# Define a crawler class
class Splider:
    def __init__(self):
        # Page number to start from
        self.page = 1
        # Crawl switch: keep crawling while this is True
        self.switch = True

    def loadPage(self):
        '''
        Download one page
        '''
        # Build the full url
        url = 'http://www.neihan8.com/article/list_5_'+str(self.page)+'.html'

        headers = {"User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}

        request1 = request.Request(url,headers=headers)
        response = request.urlopen(request1)

        # HTML source of the page
        html = response.read().decode('gbk')
        print(html)

        # Build the regex object that matches each joke on the page; re.S lets "." match newlines too
        pattern = re.compile(r'<div\sclass="f18 mb20">(.*?)</div>',re.S)

        # Apply the pattern to the HTML source; returns the list of all jokes on this page
        content_list = pattern.findall(html)

        print(content_list)

        self.dealPage(content_list)

    def dealPage(self,content_list):

        '''
        Process the jokes of one page
        content_list: list of the jokes on the page
        '''

        for item in content_list:
            item = item.replace('<p>',"").replace('</p>','').replace('&rdquo;','').replace('&ldquo;','').replace('<br />','').replace('&hellip;','')
            print(item)
            # Once cleaned, call writePage() to write this joke to the file
            self.writePage(item)
    def writePage(self,item):
        '''
            Write the jokes to the file one by one
            item: a single cleaned joke
        '''
        # Write to the file
        print('Writing data....')
        with open('duanzi.txt','a',encoding='utf-8') as f:
            f.write(item)


    def startWork(self):
        '''
        Control the crawler
        '''

        while self.switch:
            # The user decides how many pages to crawl
            self.loadPage()
            command = input('Press Enter to keep crawling (type quit to exit)')
            if command == 'quit':
                # Typing quit stops the crawl
                self.switch = False
            # Increment the page number on every loop
            self.page = self.page+1
        print("Thanks for using")

if __name__ == '__main__':
    duanziSpider = Splider()
    duanziSpider.startWork()
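
The chain of replace() calls in dealPage() has to list every tag and entity by hand. As a possible alternative, the sketch below strips tags with a regex and decodes entities with the standard library's html.unescape. Note the behavioural difference: entities such as &ldquo; are decoded to their characters rather than deleted. The helper name clean_fragment and the sample fragment are hypothetical.

```python
import html
import re

def clean_fragment(fragment):
    """Strip tags and decode HTML entities in a scraped fragment (hypothetical helper)."""
    no_tags = re.sub(r"<[^>]+>", "", fragment)   # drops <p>, </p>, <br /> and similar
    return html.unescape(no_tags).strip()        # decodes &ldquo;, &rdquo;, &hellip; and others

sample = "<p>&ldquo;Hello&rdquo;<br />world&hellip;</p>"
cleaned = clean_fragment(sample)
print(cleaned)
```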

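XPath

If you are not fluent in regular expressions, processing HTML documents with them is tiring. With XPath, we can first convert the HTML file into an XML document and then use XPath to locate HTML nodes or elements.
XPath (XML Path Language) is a language for finding information in an XML document; it can be used to traverse the elements and attributes of an XML document.
W3School documentation: http://www.w3school.com.cn/xp...

XPath development tools
Open-source XPath expression editor: XMLQuire (works with XML files)
Chrome extension: XPath Helper
Firefox extension: XPath Checker
Installing the Chrome extension XPath Helper: the Chrome Web Store must be reachable first (a proxy may be needed); illustrated installation guide: http://blog.csdn.net/after95/...
Press Shift+Ctrl+X to bring up the extension. Besides typing a matching rule like the one below yourself, you can hold Shift and hover the mouse over an element, and the extension will suggest the full expression for you.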

//h2[@class="title blog-type-common blog-type-1"]/a/@href
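
To show what an expression like this selects without installing anything, the sketch below uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath (lxml, used later in this article, supports the full expression including /@href). The markup is a made-up, well-formed snippet shaped like the page the expression targets.

```python
import xml.etree.ElementTree as ET

# Made-up snippet: two headings with the target class, one without.
doc = """
<div>
  <h2 class="title blog-type-common blog-type-1"><a href="/post/1">First</a></h2>
  <h2 class="title blog-type-common blog-type-1"><a href="/post/2">Second</a></h2>
  <h2 class="other"><a href="/post/3">Third</a></h2>
</div>
"""
root = ET.fromstring(doc.strip())

# ElementTree can filter on an attribute value, but it cannot select
# "/@href" directly, so we read the attribute from each matched <a> element.
anchors = root.findall('.//h2[@class="title blog-type-common blog-type-1"]/a')
hrefs = [a.get("href") for a in anchors]
print(hrefs)   # ['/post/1', '/post/2']
```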

Downloading Tieba images with XPath

import urllib.parse
import urllib.request
from lxml import etree

tieb=input("Enter the name of the tieba to crawl: ")
startpage=int(input("Enter the first page number: "))
endpage=int(input("Enter the last page number: "))
url="http://tieba.baidu.com/f?"
keyw=urllib.parse.urlencode({"kw":tieb})
url=url+keyw

# Load a list page
def loadpage(fullurl):
    head={"User-Agent":"Mozilla/5.0 (compatible; Baiduspider-render/2.0; +http://www.baidu.com/search/spider.html)"}
    req=urllib.request.Request(fullurl,headers=head)
    res=urllib.request.urlopen(req)
    html=res.read().decode('utf-8')
    # Apply the XPath rule to collect the links to each post
    obj=etree.HTML(html)    
    linklist=obj.xpath('//div[@class="t_con cleafix"]/div/div/div/a/@href')
    for link in linklist:
        url="http://tieba.baidu.com"+link
        clicklink(url)

# Simulate clicking from the list through to the post detail page
def clicklink(url):
    head={"User-Agent":"Mozilla/5.0 (compatible; Baiduspider-render/2.0; +http://www.baidu.com/search/spider.html)"}
    req=urllib.request.Request(url,headers=head)
    res=urllib.request.urlopen(req)
    html=res.read().decode('utf-8')
    # Apply the XPath rule to collect the image URLs in the post
    obj=etree.HTML(html)    
    srclist=obj.xpath('//img[@class="BDE_Image"]/@src')
    for src in srclist:
        saveimg(src)

# Save one image
def saveimg(src):
    head={"User-Agent":"Mozilla/5.0 (compatible; Baiduspider-render/2.0; +http://www.baidu.com/search/spider.html)"}
    req=urllib.request.Request(src,headers=head)
    res=urllib.request.urlopen(req)
    data=res.read()
    # Use the last 10 characters of the image URL as the file name
    filename=src[-10:]
    with open(filename,"wb") as f:
        f.write(data)

def spidertieba(startpage,endpage):
    for page in range(startpage,endpage+1):
        # Each list page holds 50 posts, so pn is the offset of the first post
        pn=(page-1)*50
        fullurl=url+"&pn="+str(pn)
        loadpage(fullurl)
  
spidertieba(startpage,endpage)
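
The URL-building part of the script can be checked on its own, without any network access. The sketch below wraps it in a hypothetical helper, build_page_url, and passes both kw and pn through urlencode instead of concatenating "&pn=" by hand; the step of 50 per page is taken from the script above.

```python
from urllib.parse import urlencode

BASE_URL = "http://tieba.baidu.com/f?"
POSTS_PER_PAGE = 50   # the step used for "pn" in the script above

def build_page_url(keyword, page):
    """Return the list-page URL for a 1-based page number (hypothetical helper)."""
    query = urlencode({"kw": keyword, "pn": (page - 1) * POSTS_PER_PAGE})
    return BASE_URL + query

print(build_page_url("python", 1))   # http://tieba.baidu.com/f?kw=python&pn=0
print(build_page_url("python", 3))   # http://tieba.baidu.com/f?kw=python&pn=100
# Non-ASCII keywords are percent-encoded as UTF-8 by urlencode.
print(build_page_url("美食", 2))
```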

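BeautifulSoup4

Tool            Speed     Difficulty of use   Installation difficulty
Regex           Fastest   Hard                None (built in)
BeautifulSoup   Slow      Easiest             Easy
lxml            Fast      Easy                Moderate

Beautiful Soup is also an HTML/XML parser; its main job is likewise parsing HTML/XML and extracting data from it.
lxml only traverses locally, whereas Beautiful Soup is based on the HTML DOM: it loads the whole document and parses the entire DOM tree, so its time and memory costs are much higher and its performance is lower than lxml's. Beautiful Soup makes parsing HTML fairly simple, with a very friendly API. It supports CSS selectors and the HTML parser in the Python standard library, and it can also use lxml's XML parser. Beautiful Soup 3 is no longer developed; Beautiful Soup 4 is recommended for current projects.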

Install it with pip: pip install beautifulsoup4

Beautiful Soup turns a complex HTML document into a tree structure in which every node is a Python object

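import urllib.request
from bs4 import BeautifulSoup
ua_headers={"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"}
request=urllib.request.Request("http://www.lrcgc.com/lyric-106-330486.html",headers=ua_headers)
response=urllib.request.urlopen(request)
html=response.read().decode("utf-8")

# Name the parser explicitly; "html.parser" is the one in the Python standard library
soup=BeautifulSoup(html,"html.parser")

print(soup.title)                      # the <title> tag
print(soup.select("#J_lyric"))         # CSS selector by id
print(soup.head.meta)                  # first <meta> tag inside <head>
print(soup.select(".Text>.f14"))       # direct children with class f14 under class Text
print(soup.select("p[class='f14']"))   # <p> tags whose class attribute is f14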