(In progress) Book notes: Web Scraping with Python (网络爬虫权威指南), Chapters 4, 5, and 6


harosha


Tags: web crawlers

Article link: https://blog.csdn.net/harosha/article/details/120825790


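This post covers scraping web page content with requests, BeautifulSoup, and the Scrapy framework: fetching and parsing URLs through a Crawler class, extracting information with CSS and XPath selectors, and organizing the data with Content and Website classes. The focus is on extracting titles and body text and on describing site structure when structuring a crawler.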


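How do you go up one level, or back to the root directory, in a cmd.exe window?

To go up one level, type cd.. and press Enter.

To return to the root directory, type cd\ and press Enter.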

https://m.ituring.com.cn/book/tupubarticle/25962?bookID=1980&type=tubook&subject=%E7%AC%AC%204%20%E7%AB%A0%E3%80%80%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB%E6%A8%A1%E5%9E%8B

Chapter 4: Web Crawling Models

# An example Content class

import requests
from bs4 import BeautifulSoup

class Content:
    def __init__(self,url,title,body):
        self.url=url
        self.title=title
        self.body=body

def getPage(url):
    req=requests.get(url)
    return BeautifulSoup(req.text,'html.parser')


def scrapeBrooking(url):
    soup=getPage(url)
    title=soup.find('h1').text
    body=soup.find('div',{'class':'post-body'}).text
    return Content(url,title,body)

url='https://www.brookings.edu/blog/future-development/2018/01/26/delivering-inclusive-urban-access-3-uncomfortable-truths/'

content=scrapeBrooking(url)
print('Title:{}'.format(content.title))
print('URL:{}\n'.format(content.url))
print(content.body)

print('*******************************')
print('*******************************')


def scrapeNYTime(url):
    soup=getPage(url)
    title=soup.find('h1').text
    lines=soup.find_all('p',{'class':'story-content'})
    body='\n'.join([line.text for line in lines])
    return Content(url,title,body)

url='https://www.nytimes.com/2018/01/25/opinion/sunday/silicon-valley-immortality.html'
content=scrapeNYTime(url)
print('Title:{}'.format(content.title))
print('URL:{}\n'.format(content.url))
print(content.body)
# print('Body:{}\n'.format(content.body))

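Each site's parsing function does essentially the same thing:
1. Select the title element and extract the text from it

2. Select the main content of the article

3. Select other content items as needed

4. Return a Content object instantiated from the strings found in the previous steps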
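The four steps above can be sketched as one reusable function. This is only a minimal sketch, not the book's code: the function name `scrape_article`, the sample HTML, and the example URL are made up for illustration, and the HTML is parsed from a string so that nothing is fetched over the network.

```python
from bs4 import BeautifulSoup


class Content:
    """Holds the URL, title, and body text of one page."""

    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body


def scrape_article(html, url, title_selector, body_selector):
    """Apply the four parsing steps to an HTML string (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    # Step 1: select the title element and extract its text.
    title_elem = soup.select_one(title_selector)
    title = title_elem.get_text().strip() if title_elem else ''
    # Step 2: select the main content of the article.
    body = '\n'.join(p.get_text().strip() for p in soup.select(body_selector))
    # Step 3 (other content items) is left out of this sketch.
    # Step 4: return a Content object built from the extracted strings.
    return Content(url, title, body)


# Made-up page used only for this illustration.
SAMPLE_HTML = """
<html><body>
  <h1>Example headline</h1>
  <div class="post-body"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
"""

content = scrape_article(SAMPLE_HTML, 'http://example.com/post',
                         'h1', 'div.post-body p')
print(content.title)
print(content.body)
```

Passing the selectors in as arguments is what lets one function serve several sites; only the selector strings differ from site to site.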

from bs4 import BeautifulSoup
import requests
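# Scrape a few items per page with BeautifulSoup's select function, using a single CSS selector for each item,
# and keep those selectors together in one place (here a Website object rather than a dictionary)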
class Content:
    # Common base class for all articles/pages

    def __init__(self,url,title,body):
        self.url=url
        self.title=title
        self.body=body

    def print(self):
        print('URL:{}'.format(self.url))
        print('Body:\n{}'.format(self.body))
        print('Title:{}'.format(self.title))

class Website:
    # Holds information about the structure of a website
    def __init__(self,name,url,titleTag,bodyTag):
        self.name=name
        self.url=url
        self.titleTag=titleTag
        self.bodyTag=bodyTag


# With the Content and Website classes above, we can write a Crawler that scrapes the title and body of any page on any site
import requests
from bs4 import BeautifulSoup

class Crawler:

    def getPage(self,url):
        try:
            req=requests.get(url)
        except requests.exceptions.RequestException:
            return None
        return BeautifulSoup(req.text,'html.parser')

    def safeGet(self,pageObj,selector):
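        # Helper that gets content from a BeautifulSoup object and a selector
        # Returns an empty string if the selector matches nothing
        selectedElems=pageObj.select(selector)  # list of every element matching the CSS selector
        if selectedElems is not None and len(selectedElems)>0:
            return '\n'.join([elem.get_text() for elem in selectedElems])
        return ''   # nothing matched, so hand back an empty string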

    def parse(self,site,url):
        # Extract content from the given URL
        bs=self.getPage(url)
        if bs is not None:
            title=self.safeGet(bs,site.titleTag)
            body=self.safeGet(bs,site.bodyTag)
            if title!=''and body!='':
                content=Content(url,title,body)
                content.print()


# The following code defines the website objects and kicks off the process:
crawler=Crawler()

siteData = [
    ['O\'Reilly Media', 'http://oreilly.com',
    'h1', 'section#product-description'],
    ['Reuters', 'http://reuters.com', 'h1',
    'div.StandardArticleBody_body_1gnLA'],
    ['Brookings', 'http://www.brookings.edu',
    'h1', 'div.post-body'],
    ['New York Times', 'http://nytimes.com',
    'h1', 'p.story-content']
]
websites = []
for row in siteData:
    websites.append(Website(row[0], row[1], row[2], row[3]))

crawler.parse(websites[0], 'http://shop.oreilly.com/product/'\
    '0636920028154.do')
crawler.parse(websites[1], 'http://www.reuters.com/article/'\
    'us-usa-epa-pruitt-idUSKBN19W2D0')
crawler.parse(websites[2], 'https://www.brookings.edu/blog/'\
    'techtank/2016/03/01/idea-to-retire-old-methods-of-policy-education/')
crawler.parse(websites[3], 'https://www.nytimes.com/2018/01/'\
    '28/business/energy-environment/oil-boom.html')
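To see what `select` returns and why `safeGet` falls back to an empty string, the behaviour can be tried on a small HTML string without any network access. This is a sketch under assumptions: the HTML snippet is invented, and `safe_get` is a simplified stand-in for the method above rather than a replacement for it.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration.
html = '<div class="post-body"><p>one</p><p>two</p></div>'
soup = BeautifulSoup(html, 'html.parser')


def safe_get(page_obj, selector):
    """Return the joined text of all matches, or '' when nothing matches."""
    selected = page_obj.select(selector)  # always a list, possibly empty
    return '\n'.join(elem.get_text() for elem in selected)


found = safe_get(soup, 'div.post-body p')
missing = safe_get(soup, 'section#product-description')
print(repr(found))
print(repr(missing))
```

Because `select` returns an empty list rather than None when nothing matches, joining over the result already yields an empty string in that case.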

The code once more, this time with my notes:

from bs4 import BeautifulSoup
import requests
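# Scrape a few items per page with BeautifulSoup's select function, using a single CSS selector for each item,
# and keep those selectors together in one place (here a Website object rather than a dictionary)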
class Content:
    # Common base class for all articles/pages

    def __init__(self,url,title,body):
        self.url=url
        self.title=title
        self.body=body

    def print(self):
        print('URL:{}'.format(self.url))
        print('Body:\n{}'.format(self.body))
        print('Title:{}'.format(self.title))

class Website:
    # Holds information about the structure of a website
    def __init__(self,name,url,titleTag,bodyTag):
        self.name=name
        self.url=url
        self.titleTag=titleTag
        self.bodyTag=bodyTag


# With the Content and Website classes above, we can write a Crawler that scrapes the title and body of any page on any site
import requests
from bs4 import BeautifulSoup

class Crawler:

    def getPage(self,url):   # fetches the URL, then parses the response with BeautifulSoup
        try:
            req=requests.get(url)
        except requests.exceptions.RequestException:
            return None
        return BeautifulSoup(req.text,'html.parser')

    def safeGet(self,pageObj,selector):  # selector is a CSS selector string
        # Helper that gets content from a BeautifulSoup object and a selector
        # Returns an empty string if the selector matches nothing
        selectedElems=pageObj.select(selector)  # list of every element matching the selector
        if selectedElems is not None and len(selectedElems)>0:
            return '\n'.join([elem.get_text() for elem in selectedElems])
        return ''   # nothing matched


    def parse(self,site,url):  # extracts the title and body of one page and prints them
        # Extract content from the given URL
        bs=self.getPage(url)  # fetch and parse the page
        if bs is not None:     # the page was fetched: look up the title with titleTag and the body with bodyTag
            title=self.safeGet(bs,site.titleTag)
            body=self.safeGet(bs,site.bodyTag)
            if title!=''and body!='':   # only when both title and body are non-empty, build a Content and print it
                content=Content(url,title,body)
                content.print()

# The following code defines the website objects and kicks off the process:
crawler=Crawler()

siteData = [
    ['O\'Reilly Media', 'http://oreilly.com',
    'h1', 'section#product-description'],       # the four fields are name, url, titleTag, bodyTag, as in the Website class above
    ['Reuters', 'http://reuters.com', 'h1',
    'div.StandardArticleBody_body_1gnLA'],
    ['Brookings', 'http://www.brookings.edu',
    'h1', 'div.post-body'],
    ['New York Times', 'http://nytimes.com',
    'h1', 'p.story-content']
]
websites = []  # a list (not a dictionary) holding one Website object per row of siteData
for row in siteData:
    websites.append(Website(row[0], row[1], row[2], row[3]))

crawler.parse(websites[0], 'http://shop.oreilly.com/product/'\
    '0636920028154.do')
# parse takes two arguments: site is websites[0] and url is 'http://shop.oreilly.com/product/0636920028154.do'
crawler.parse(websites[1], 'http://www.reuters.com/article/'\
    'us-usa-epa-pruitt-idUSKBN19W2D0')
crawler.parse(websites[2], 'https://www.brookings.edu/blog/'\
    'techtank/2016/03/01/idea-to-retire-old-methods-of-policy-education/')
crawler.parse(websites[3], 'https://www.nytimes.com/2018/01/'\
    '28/business/energy-environment/oil-boom.html')

print('**********')
for web in websites:
    print(web)
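The final loop prints each Website with Python's default representation, which looks something like `<__main__.Website object at 0x...>`. One possible way to make that output readable is to define `__repr__`. The sketch below is an assumption about what would be useful here, not part of the book's code.

```python
class Website:
    """Describes the structure of one site (same fields as above)."""

    def __init__(self, name, url, titleTag, bodyTag):
        self.name = name
        self.url = url
        self.titleTag = titleTag
        self.bodyTag = bodyTag

    def __repr__(self):
        # Show the fields instead of the default memory-address form.
        return 'Website(name={!r}, url={!r}, titleTag={!r}, bodyTag={!r})'.format(
            self.name, self.url, self.titleTag, self.bodyTag)


site = Website('Brookings', 'http://www.brookings.edu', 'h1', 'div.post-body')
print(site)
```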

One last pass through the code:

# Three classes: Content, Website, and Crawler

# The Content class has __init__ and print methods. It is the common base class for all articles/pages
class Content:
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body

    def print(self):
        print('URL:{}'.format(self.url))
        print('TITLE:{}'.format(self.title))
        print('BODY:{}\n'.format(self.body))


# The Website class has only __init__; it holds information about the structure of a site
class Website:
    def __init__(self,name,url,titleTag,bodyTag):
        # Unlike in Content, url here is the site's home page, and titleTag/bodyTag are CSS selectors rather than extracted text
        self.name=name
        self.url=url
        self.titleTag=titleTag
        self.bodyTag=bodyTag

import requests
from bs4 import BeautifulSoup

# The Crawler class holds the scraping logic: getPage (fetch and parse a URL), safeGet (apply a selector), parse (extract and print)
class Crawler:

    # getPage fetches a URL and parses the response
    def getPage(self,url):
        try:
            req=requests.get(url)
        except requests.exceptions.RequestException:
            return None
        return BeautifulSoup(req.text,'html.parser')

    # safeGet runs a CSS selector and joins the text of every match; it returns '' when nothing matches
    def safeGet(self,pageObj,selector):
        selectedElems=pageObj.select(selector)
        if selectedElems is not None and len(selectedElems)>0:
            return '\n'.join([elem.get_text() for elem in selectedElems])
        return ''

    # parse extracts the title and body of one page and prints them
    def parse(self,site,url):
        bs=self.getPage(url)
        if bs is not None:  # page fetched; the selectors come from the Website object
            title=self.safeGet(bs,site.titleTag)
            body=self.safeGet(bs,site.bodyTag)
            if title!='' and body!='':   # both were found, so wrap them in a Content object
                content=Content(url,title,body)
                content.print()




# The code below lists which sites to scrape and then starts scraping
crawler=Crawler()  # create the crawler

siteData=[
    ['O\'Reilly Media', 'http://oreilly.com',
     'h1', 'section#product-description'],  # the four fields are name, url, titleTag, bodyTag, as in the Website class above
    ['Reuters', 'http://reuters.com', 'h1',
     'div.StandardArticleBody_body_1gnLA'],
    ['Brookings', 'http://www.brookings.edu',
     'h1', 'div.post-body'],
    ['New York Times', 'http://nytimes.com',
     'h1', 'p.story-content']
]

websites=[]

for row in siteData:
    websites.append(Website(row[0],row[1], row[2], row[3]))
    print('Website:',websites)
    print('row:',row)

print('*'*20)
# crawler.parse below calls the parse method of the Crawler class
crawler.parse(websites[0], 'http://shop.oreilly.com/product/'\
    '0636920028154.do')
# parse takes two arguments: site is websites[0] and url is 'http://shop.oreilly.com/product/0636920028154.do'
crawler.parse(websites[1], 'http://www.reuters.com/article/'\
    'us-usa-epa-pruitt-idUSKBN19W2D0')
crawler.parse(websites[2], 'https://www.brookings.edu/blog/'\
    'techtank/2016/03/01/idea-to-retire-old-methods-of-policy-education/')
crawler.parse(websites[3], 'https://www.nytimes.com/2018/01/'\
    '28/business/energy-environment/oil-boom.html')
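Indexing rows as `row[0]`, `row[1]`, and so on works, but it relies on remembering the column order. As one possible alternative, each site can be described by a dictionary and looked up by name. This is only a sketch: the names `site_data` and `websites_by_name` are hypothetical, and just two of the four sites are listed to keep it short.

```python
class Website:
    """Describes the structure of one site (same fields as above)."""

    def __init__(self, name, url, titleTag, bodyTag):
        self.name = name
        self.url = url
        self.titleTag = titleTag
        self.bodyTag = bodyTag


# Each row names its fields, so the order no longer matters.
site_data = [
    {'name': 'Brookings', 'url': 'http://www.brookings.edu',
     'titleTag': 'h1', 'bodyTag': 'div.post-body'},
    {'name': 'New York Times', 'url': 'http://nytimes.com',
     'titleTag': 'h1', 'bodyTag': 'p.story-content'},
]

# Build one Website per row and index the results by site name.
websites_by_name = {row['name']: Website(**row) for row in site_data}

print(websites_by_name['Brookings'].bodyTag)
```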

4.3 Structuring Crawlers

The rest of this section was too difficult for me, so I have not written it up yet.

Chapter 5: Scrapy

Getting started with crawlers: Scrapy framework basics, Rule and LinkExtractors (11) - 诚实善良小郎君 - cnblogs (cnblogs.com) https://www.cnblogs.com/why957/p/9276338.html

# Create a new Scrapy project in the current directory
scrapy startproject wikiSpider

# This command creates a new subdirectory named wikiSpider in the directory where it is run

Scrapy also has its own lookup tools: CSS selectors, XPath selectors, and regular expressions.

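To extract an attribute, use "tag::attr(attribute)". For example, the expression for a link URL is a::attr(href), the expression for an image address is img::attr(src), and so on. Now that we know the extraction tools Scrapy provides, we can pull out the URLs above. There are several ways to do it; the most direct is: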

response.css("a::attr(href)")

extract(): returns a list of strings. If there is only one match, the result is still a list, in the form ['ABC'].

extract_first(): returns a single string, the first item of that list.

import scrapy

class ArticleSpider(scrapy.Spider):
    name='article'  # name the spider 'article'

    def start_requests(self):
        urls=['http://en.wikipedia.org/wiki/Python_'
              '%28programming_language%29',
              'https://en.wikipedia.org/wiki/Functional_programming',
              'https://en.wikipedia.org/wiki/Monty_Python']
        return[scrapy.Request(url=url,callback=self.parse) for url in urls]
        # Build one Request per URL above (three in total); each response is passed to the parse callback

    def parse(self,response):
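        url=response.url   # response is the downloaded page that Scrapy passes to the callback
        title=response.css('h1::text').extract_first()
        # tag::attr(name) extracts an attribute; h1::text instead selects the text inside the h1 tag
        # extract_first() returns a single string, the first one in the list of matches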
        print('URL is:{}'.format(url))
        print('Title is:{}'.format(title))
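        # On a large site with several kinds of content (blog posts, press releases, articles, and so on), give each kind its own Scrapy item with its own fields; they all run in the same Scrapy project, and every spider name in the project must be unique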

# Crawling with rules
# Use Scrapy's CrawlSpider class


Learning the Scrapy framework (4): CrawlSpider, LinkExtractors, Rule, and a crawler example - Widsom's blog - CSDN https://blog.csdn.net/qq_33689414/article/details/78669514