[Untitled]

(The complete code is at the end of the post; learned from "我的IT私塾".)

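What is a web crawler

  • A web crawler is a program or script that automatically fetches information from the internet according to a set of rules. Because data on the internet is so varied and resources are limited, fetching and analyzing only the pages relevant to the user's needs has become the mainstream crawling strategy.
  • In essence, a crawler imitates a browser opening a web page and extracts the part of the page's data that we want.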

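Basic workflow

  • Preparation: inspect and analyze the target page in a browser, and learn the basic coding conventions.
  • Fetching data: use an HTTP library to send a request to the target site. The request can carry extra information such as headers. If the server responds normally, you get a response, which contains the page content you are after.
  • Parsing content: the content may be HTML, JSON, or another format, and can be parsed with an HTML parsing library, regular expressions, and so on.
  • Saving data: there are many options. The data can be saved as plain text, stored in a database, or written to a file in a specific format.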
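The parse and save steps above can be sketched with the standard library alone. This is only an illustration under stated assumptions: the HTML string stands in for a fetched page, and the names LinkCollector, parse_links and to_csv are made up for this example rather than taken from the original post.

```python
import csv
import io
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a> tag
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def parse_links(html):
    """Parsing step: turn HTML text into a list of rows."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def to_csv(rows):
    """Saving step: serialize the rows as CSV text (written to memory here)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["href", "text"])
    writer.writerows(rows)
    return buffer.getvalue()


# Stand-in for the fetching step: a hard-coded page instead of a real request
page = '<p><a href="/a">First</a> and <a href="/b">Second</a></p>'
rows = parse_links(page)
print(rows)
print(to_csv(rows))
```

In a real crawler the page string would come from an HTTP request, and the CSV text would be written to a file instead of being printed.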

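Coding conventions

  • Python programs traditionally start with an encoding declaration on the first line, # -*- coding: utf-8 -*- or # coding=utf-8, so that the code can contain Chinese text. There must be no space between "coding" and the ":" or "=" that follows it, otherwise the declaration is not recognized. Python 3 already reads source files as UTF-8 by default, so the line is only strictly needed for Python 2 or for files saved in another encoding.

  • In Python, a function groups the code for a single task or a set of related tasks, which improves readability and code reuse. A function block starts with the def keyword, followed by a space, the function name, parentheses, and a colon. Parameters go inside the parentheses, and the function body is indented. return ends the function and can return a value; a return without an expression returns None.

  • A Python file can include an entry-point check, which is useful for testing the program. The block under it runs only when the file is executed directly:
    if __name__ == "__main__":

  • Python uses # to add comments that explain what a piece of code does.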
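The conventions above fit into a few lines. The function names below are invented for illustration only.

```python
# -*- coding: utf-8 -*-


def rectangle_area(width, height):
    """Return the area of a rectangle."""
    return width * height


def log_message(text):
    """Print a message; with no return statement the function returns None."""
    print("log:", text)


def main():
    print(rectangle_area(3, 4))
    result = log_message("done")
    print(result is None)


# Runs only when the file is executed directly, not when it is imported
if __name__ == "__main__":
    main()
```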

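Importing modules

  • Module: a way to organize Python code (variables, functions, classes) logically. In practice a module is just a .py file, and splitting code into modules makes it easier to maintain. Python uses import to load modules.
  • To install a third-party package, open the Terminal tab at the bottom of the IDE window and run pip install xxx, where xxx is the package you need.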
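As a small illustration of the import forms, and of checking whether a package is available before using it, here is a sketch. The helper is_installed is a made-up name; importlib.util.find_spec is the standard-library call it relies on.

```python
import importlib.util
import math                        # import a whole module
from urllib import parse           # import one submodule from a package
import urllib.request as request   # import under an alias


def is_installed(module_name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


print(math.sqrt(16))
print(parse.quote("a b"))
print(request.Request("http://example.com").full_url)
print(is_installed("json"))   # part of the standard library
print(is_installed("bs4"))    # True only after: pip install beautifulsoup4
```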

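Fetching data with the urllib library

  • The urllib library lets the program imitate a browser: it sends a request to a web page and receives the response data the page returns.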
import socket
import urllib.request, urllib.parse, urllib.error

# Send a GET request
response = urllib.request.urlopen("http://www.baidu.com")
print(response.read().decode("utf-8"))  # decode the page source as UTF-8

# Send a POST request (passing data makes urlopen use POST)
data = bytes(urllib.parse.urlencode({"hello": "world"}), encoding='utf-8')
response1 = urllib.request.urlopen("http://httpbin.org/post", data)
print(response1.read().decode('utf-8'))

# Handle timeouts
try:
    response2 = urllib.request.urlopen('http://httpbin.org/get', timeout=0.01)
    print(response2.read().decode('utf-8'))
except (urllib.error.URLError, socket.timeout):
    print('Timed out!')

# Inspect the response
response3 = urllib.request.urlopen('http://www.baidu.com')
print(response3.status)  # status code
print(response3.getheaders())  # all response headers
print(response3.getheader("Date"))  # one specific response header
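The error handling can be wrapped in a small helper so that a failed request does not stop the program. This is a sketch, not the original author's code: fetch_text is a hypothetical name, and returning None on failure is just one possible design.

```python
import socket
import urllib.error
import urllib.request


def fetch_text(url, timeout=5.0, encoding="utf-8"):
    """Return the decoded body of url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode(encoding)
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 418
        print("HTTP error:", exc.code)
    except urllib.error.URLError as exc:
        # The request never got a proper answer (DNS failure, refused, timeout)
        print("Request failed:", exc.reason)
    except socket.timeout:
        # A timeout can also surface while the body is being read
        print("Timed out while reading the response")
    return None
```

A call such as fetch_text("http://httpbin.org/get", timeout=0.01) would then most likely print a message and return None instead of raising.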

    
    
import urllib.request, urllib.parse

# To avoid being identified as a crawler, send a browser User-Agent header
url = 'http://httpbin.org/post'
data = bytes(urllib.parse.urlencode({'name': 'tom'}), encoding='utf-8')
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"
}
req = urllib.request.Request(url=url, data=data, headers=headers, method='POST')
response = urllib.request.urlopen(req)
print(response.read().decode('utf-8'))
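It can help to look at what is about to be sent before sending it. The sketch below builds Request objects and inspects them without touching the network. The form values and the shortened User-Agent string are placeholders chosen for this example.

```python
import urllib.parse
import urllib.request

# urlencode turns a dict into a form-encoded string; spaces become "+"
form = urllib.parse.urlencode({"name": "tom", "city": "Hang Zhou"})
data = form.encode("utf-8")  # the request body must be bytes

post_req = urllib.request.Request(
    url="http://httpbin.org/post",
    data=data,
    headers={"User-Agent": "Mozilla/5.0 (example)"},
)
get_req = urllib.request.Request(url="http://httpbin.org/get")

print(form)
print(post_req.get_method())  # POST, because data was supplied
print(get_req.get_method())   # GET, because there is no data
# Request stores header names capitalized, hence "User-agent"
print(post_req.get_header("User-agent"))
```

Nothing is sent until the Request is passed to urllib.request.urlopen.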

    
    
import urllib.request, urllib.parse

# Visit Douban
url = 'https://movie.douban.com/top250'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"
}
req = urllib.request.Request(url=url, headers=headers, method='GET')
response = urllib.request.urlopen(req)
print(response.read().decode('utf-8'))

    
    

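Parsing data

Introduction to BeautifulSoup4

  • The response returned by the page is parsed with the BeautifulSoup4 library. BeautifulSoup4 turns a complex HTML document into a tree structure in which every node is a Python object. All objects fall into 4 kinds:
  • 1. Tag: a tag in the HTML
  • 2. NavigableString: the content (string) inside a tag
  • 3. BeautifulSoup: the whole document
  • 4. Comment: a comment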
'''
BeautifulSoup4 turns a complex HTML document into a tree structure in which
every node is a Python object. All objects fall into 4 kinds:
-- Tag
-- NavigableString
-- BeautifulSoup
-- Comment
'''
from bs4 import BeautifulSoup

with open('./baidu.html', encoding='UTF-8') as file:
    html = file.read()
bs = BeautifulSoup(html, 'lxml')  # needs the lxml package installed; 'html.parser' also works

# 1. Tag: a tag in the HTML
print(bs.title)  # the first title tag: <title>百度一下,你就知道</title>
print(bs.a)  # the first a tag
print(bs.find_all('a'))  # all a tags
print(type(bs.head))  # <class 'bs4.element.Tag'>

# 2. NavigableString: the content (string) inside a tag
print(bs.title.string)  # content of the first title tag: 百度一下,你就知道

# 3. Attribute values of a tag
print(bs.a.attrs)  # all attributes of the first a tag as a dict: {'class': ['mnav'], 'href': ' ', 'name': 'tj_trnews'}

# 4. BeautifulSoup: the whole document
print(type(bs))  # <class 'bs4.BeautifulSoup'>
print(bs)  # prints the whole document

# 5. Comment: a comment
print(bs.a.string)  # prints 音乐1; the comment markers <!-- --> are stripped automatically
print(type(bs.a.string))  # <class 'bs4.element.Comment'>
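The listing above depends on a local baidu.html file. For readers who do not have it, the same four object kinds can be seen with an inline HTML string. The markup below is invented for this example, and the built-in html.parser is used so that lxml is not required.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

html = """
<html><head><title>Demo page</title></head>
<body>
<a class="mnav" href="http://example.com/news" name="tj_news"><!--News--></a>
<a class="mnav" href="http://example.com/map">Map</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
first_link = soup.a

print(type(soup).__name__)                             # the document object
print(isinstance(first_link, Tag))                     # a tag
print(first_link.attrs)                                # its attributes as a dict
print(isinstance(soup.title.string, NavigableString))  # text inside a tag
print(isinstance(first_link.string, Comment))          # a comment inside a tag
print(first_link.string)                               # comment text without <!-- -->
```

Comment is a subclass of NavigableString, which is why printing a comment looks just like printing ordinary text. Checking the type is the only way to tell them apart.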


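Traversing the document

# Traversing the document
print(bs.head.contents)  # the direct children of the head tag (tags and text nodes) as a list
print(bs.head.contents[1])
# For more, see the BeautifulSoup documentation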
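A short sketch of a few traversal attributes, using invented inline HTML and the built-in html.parser. It also shows why index 1 is often the first real tag in a pretty-printed file: the line break after the opening tag is itself a child node.

```python
from bs4 import BeautifulSoup

compact = "<html><head><title>Demo</title><style>p { color: red; }</style></head></html>"
soup = BeautifulSoup(compact, "html.parser")

print(soup.head.contents)                              # direct children as a list
print([child.name for child in soup.head.children])    # the same children, via an iterator
print(soup.title.parent.name)                          # move up one level

# With line breaks in the source, whitespace strings are children too
pretty = BeautifulSoup("<head>\n<title>Demo</title>\n</head>", "html.parser")
print(len(pretty.head.contents))
print(pretty.head.contents[1])
```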

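Searching the document

# Searching the document

# 1. The find_all method
print(bs.find_all('a'))  # string match: the tag name must equal the string exactly; returns all a tags as a list

import re
# 2. Regular expression search (details in testRe)
tag_list = bs.find_all(re.compile('a'))  # returns, as a list, all tags whose name contains the letter a
print(tag_list)

# 3. Pass in a function and search according to what it returns
def name_is_exists(tag):
    return tag.has_attr('name')  # True for tags that have a name attribute

list1 = bs.find_all(name_is_exists)  # find_all calls the function on every tag and keeps those for which it returns True

# 4. kwargs arguments
list2 = bs.find_all(id='head')  # search by the id attribute
list3 = bs.find_all(class_=True)  # search for tags that have a class attribute

# 5. The string argument (named text in older versions of BeautifulSoup)
list4 = bs.find_all(string='hao123')  # ['hao123']
list5 = bs.find_all(string=['hao123', '地图', '贴吧'])  # ['hao123', '地图', '贴吧']

# 6. The limit argument
list6 = bs.find_all('a', limit=3)  # limit caps the number of results; returns the first 3
print(list6)

# 7. CSS selectors
print(bs.select('title'))  # search by tag name
print(bs.select('.mnav'))  # search by class: prefix with '.'
print(bs.select('#head'))  # search by id: prefix with '#'
print(bs.select('a[class="bri"]'))  # square brackets select by attribute; use double quotes inside single quotes, or the other way round
print(bs.select('head > title'))  # search by child: the title tag directly inside head
list7 = bs.select('.mnav ~ .bri')  # search by sibling: tags with class=bri that follow a sibling with class=mnav
print(list7[0].get_text())  # get just the text content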
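The same search styles can be tried on a small inline document. The markup below is invented for this example; string= is the current name of the argument that older BeautifulSoup versions call text=, and select relies on the soupsieve package that is normally installed together with bs4.

```python
import re

from bs4 import BeautifulSoup

html = """
<div id="head">
  <a class="mnav" href="/news" name="tj_news">News</a>
  <a class="mnav" href="/map">Map</a>
  <a class="bri" href="/more">More</a>
  <b>bold</b>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def has_name_attr(tag):
    """Filter function: find_all calls it once per tag."""
    return tag.has_attr("name")


print([t.name for t in soup.find_all(re.compile("^b"))])     # tag names starting with b
print([t["href"] for t in soup.find_all(has_name_attr)])     # tags with a name attribute
print(soup.find_all(string="Map"))                           # match on text content
print(len(soup.find_all("a", limit=2)))                      # stop after 2 results
print([t.get_text() for t in soup.select(".mnav ~ .bri")])   # later siblings of .mnav
print([t.get_text() for t in soup.select('a[href="/map"]')]) # attribute selector
```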


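Supplement: regular expressions with the re library

  • For the common regular-expression rules, you can search online for a regular-expression reference and look up the pattern you need.
  • To use regular expressions, import the re library.
  • Common usage: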
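import re

# Create a pattern object
pat = re.compile("AA")  # AA is the regular expression used to check other strings
m = pat.search("CBA")  # the argument of search is the string being checked
print(m)  # None
n = pat.search("asjkhgAAasdnjAA")
print(n)  # <re.Match object; span=(6, 8), match='AA'>; only the first match is returned, and span is half-open

# Without creating a pattern object
x = re.search("AA", "AJKAADSD")  # the first argument is the pattern, the second is the string being checked
print(x)  # <re.Match object; span=(3, 5), match='AA'>

# The first string is the rule (regular expression), the second is the string being checked
print(re.findall("a", 'akjgakjga'))  # ['a', 'a', 'a']
print(re.findall('[A-Z]', 'AUhjkjbAJBd'))  # ['A', 'U', 'A', 'J', 'B']
print(re.findall('[A-Z]+', 'AUhjkjbAJBd'))  # ['AU', 'AJB'] (greedy)

# sub: in the third argument, replace every match of the first argument with the second
print(re.sub('a', 'A', 'hjagjakhj'))  # hjAgjAkhj

# Tip: prefix the pattern string with r (a raw string) so that backslashes need no extra escaping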
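Three details matter a lot when regular expressions are applied to HTML: greedy versus non-greedy matching, raw strings, and the re.S flag. The sketch below uses made-up input strings to show each one.

```python
import re

html = '<span class="title">A</span><span class="title">B</span>'

# Greedy .* runs to the last </span>; non-greedy .*? stops at the first one
greedy = re.findall(r'<span class="title">(.*)</span>', html)
lazy = re.findall(r'<span class="title">(.*?)</span>', html)
print(greedy)  # ['A</span><span class="title">B']
print(lazy)    # ['A', 'B']

# In a raw string the backslash is kept as-is, so r"\n" is two characters
print(len("\n"), len(r"\n"))  # 1 2
digits = re.findall(r"\d+", "rated by 12345 people")
print(digits)  # ['12345']

# Without re.S the dot does not match a newline; with re.S it does
text = "<p>line1\nline2</p>"
without_flag = re.findall(r"<p>(.*?)</p>", text)
with_flag = re.findall(r"<p>(.*?)</p>", text, re.S)
print(without_flag)  # []
print(with_flag)     # ['line1\nline2']
```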


Parsing content with the bs4 and re libraries

Regular expressions and the BeautifulSoup library are used together to parse the fetched page data.

import re
import bs4

# Global variables: extraction rules (regular expressions) for each movie entry
# Link to the movie page
findLink = re.compile(r'<a href="(.*?)">')
# Movie title
findTitle = re.compile(r'<span class="title">(.*?)</span>', re.S)  # re.S lets . match newlines as well
# Link to the movie poster image
findPicScr = re.compile(r'<img .*src="(.*?)".*>')
# Movie rating
findScore = re.compile(r'<span class="rating_num" property="v:average">(.*)</span>')
# Number of people who rated the movie
findJudgeNum = re.compile(r'<span>(\d*)人评价</span>')
# Other information about the movie
findBd = re.compile(r'<p class="">(.*?)</p>', re.S)
# One-line summary of the movie
findInq = re.compile(r'<span class="inq">(.*?)</span>')

datalist = []
# Parse the data item by item: bs4 parses the fetched HTML (html below is the page source)
soup = bs4.BeautifulSoup(html, 'lxml')
# Every movie's information sits in its own <div class="item"> tag
for item in soup.find_all('div', class_='item'):
    data = []
    item = str(item)
    # Extract the main fields with regular expressions

    # Movie title
    title = re.findall(findTitle, item)
    if len(title) == 2:
        data.append(title[0])  # Chinese title
        data.append(title[1])  # original title
    else:
        data.append(title[0])
        data.append(' ')

    # Movie link
    link = re.findall(findLink, item)[0]
    data.append(link)

    # Movie poster
    picScr = re.findall(findPicScr, item)[0]
    data.append(picScr)

    # Movie rating
    score = re.findall(findScore, item)[0]
    data.append(score)

    # Number of ratings
    judgeNum = re.findall(findJudgeNum, item)[0]
    data.append(judgeNum)

    # Other information about the movie
    bd = re.findall(findBd, item)[0]
    # Remove the clutter in the middle: <br/> tags, non-breaking spaces, newlines
    bd = re.sub(r'<br(\s+)?/>(\s+)?', "", bd)
    bd = re.sub(r'\xa0', '', bd)
    bd = re.sub(r'\n', '', bd)
    bd = bd.strip()
    data.append(bd)

    # One-line summary (some movies have none)
    inq = re.findall(findInq, item)
    if len(inq) != 0:
        data.append(inq[0])
    else:
        data.append(' ')
    datalist.append(data)
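To see the patterns at work without fetching anything, they can be run against a hand-written snippet. Everything in SAMPLE_ITEM is invented to fit the patterns, so the real page markup may differ. The helpers first_or_blank and parse_item are hypothetical names, and returning a dict instead of a list is a choice made for readability here.

```python
import re

findLink = re.compile(r'<a href="(.*?)">')
findTitle = re.compile(r'<span class="title">(.*?)</span>', re.S)
findScore = re.compile(r'<span class="rating_num" property="v:average">(.*)</span>')
findJudgeNum = re.compile(r'<span>(\d*)人评价</span>')
findInq = re.compile(r'<span class="inq">(.*?)</span>')

# Made-up markup for one entry; note that it has no "inq" span
SAMPLE_ITEM = '''
<div class="item">
  <a href="https://example.com/subject/1/">poster</a>
  <span class="title">Chinese Title</span>
  <span class="title"> / Original Title</span>
  <span class="rating_num" property="v:average">9.7</span>
  <span>12345人评价</span>
</div>
'''


def first_or_blank(pattern, text):
    """Return the first match, or a single space when nothing matches."""
    found = pattern.findall(text)
    return found[0] if found else " "


def parse_item(item):
    """Extract the fields of one entry into a dict."""
    titles = findTitle.findall(item)
    return {
        "title": titles[0] if titles else " ",
        "original_title": titles[1].replace("/", "").strip() if len(titles) > 1 else " ",
        "link": first_or_blank(findLink, item),
        "score": first_or_blank(findScore, item),
        "votes": first_or_blank(findJudgeNum, item),
        "summary": first_or_blank(findInq, item),
    }


record = parse_item(SAMPLE_ITEM)
print(record)
```

Falling back to a blank for every field, not only the summary, means one odd entry cannot stop the whole run with an IndexError.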

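Working with Excel using the xlwt library

import xlwt

workbook = xlwt.Workbook(encoding='utf-8')  # corresponds to an Excel file
worksheet = workbook.add_sheet('sheet1')  # corresponds to a sheet in the Excel file, named sheet1
worksheet.write(0, 0, '111')  # write 111 at row 0, column 0 of the sheet
workbook.save('test.xls')  # save the file as test.xls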
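The workflow section lists a database as another place to store results. As a rough sketch of that option, the standard-library sqlite3 module can hold the same kind of rows. The table name, the columns, and the sample rows are all assumptions made for this example.

```python
import sqlite3

# Placeholder rows: (title, link, score, votes)
rows = [
    ("Movie A", "https://example.com/a", 9.7, 12345),
    ("Movie B", "https://example.com/b", 9.6, 6789),
]


def save_to_sqlite(db_path, rows):
    """Insert the rows into a movie table and return the number of stored rows."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back if an error is raised
            conn.execute(
                "CREATE TABLE IF NOT EXISTS movie ("
                "title TEXT, link TEXT, score REAL, votes INTEGER)"
            )
            # ? placeholders let sqlite3 handle quoting of the values
            conn.executemany("INSERT INTO movie VALUES (?, ?, ?, ?)", rows)
        return conn.execute("SELECT COUNT(*) FROM movie").fetchone()[0]
    finally:
        conn.close()


stored = save_to_sqlite(":memory:", rows)  # use a file name to keep the data
print(stored)
```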


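Putting the libraries together: complete code for scraping the Douban Top 250

The head dictionary in askURL can be changed to match your own browser.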

# -*- coding: utf-8 -*-
# @Time: 2022/9/27 14:17
# @Author: Sunhao
# @File: spider.py
# @Software: PyCharm

import re  # regular expressions, for text matching
import bs4  # parses the page to get the data
import lxml  # parser used by bs4 below
import urllib.request, urllib.error, urllib.parse  # request a URL and get the page data
import xlwt  # Excel operations
import sqlite3  # SQLite database operations (not used below)

# Extraction rules (regular expressions) for each movie entry
# Link to the movie page
findLink = re.compile(r'<a href="(.*?)">')
# Movie title
findTitle = re.compile(r'<span class="title">(.*?)</span>', re.S)  # re.S lets . match newlines as well
# Link to the movie poster image
findPicScr = re.compile(r'<img .*src="(.*?)".*>')
# Movie rating
findScore = re.compile(r'<span class="rating_num" property="v:average">(.*)</span>')
# Number of people who rated the movie
findJudgeNum = re.compile(r'<span>(\d*)人评价</span>')
# Other information about the movie
findBd = re.compile(r'<p class="">(.*?)</p>', re.S)
# One-line summary of the movie
findInq = re.compile(r'<span class="inq">(.*?)</span>')


def main():
    baseurl = 'https://movie.douban.com/top250?start='
    # 1. Fetch the pages
    # 2. Parse the data
    datalist = getData(baseurl)
    # 3. Save the data
    savepath = './豆瓣电影Top250.xls'
    saveData(savepath, datalist)
    print('Saved successfully')


# Get the page content of a given URL; returns an empty string on failure
def askURL(url):
    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"
    }
    request = urllib.request.Request(url=url, headers=head, method="GET")
    html = ""
    try:
        response = urllib.request.urlopen(request)
        html = response.read().decode('utf-8')
    except urllib.error.URLError as error:
        if hasattr(error, 'code'):
            print(error.code)
        if hasattr(error, 'reason'):
            print(error.reason)
    return html


# Fetch and parse the pages
def getData(baseurl):
    datalist = []
    for i in range(0, 10):  # 10 pages of 25 movies each
        url = baseurl + str(i * 25)
        html = askURL(url)
        if not html:  # skip a page that could not be fetched
            continue
        # Parse the data item by item: bs4 parses the fetched HTML
        soup = bs4.BeautifulSoup(html, 'lxml')
        # Every movie's information sits in its own <div class="item"> tag
        for item in soup.find_all('div', class_='item'):
            data = []
            item = str(item)
            # Extract the main fields with regular expressions

            # Movie title
            title = re.findall(findTitle, item)
            if len(title) == 2:
                data.append(title[0])  # Chinese title
                title[1] = title[1].replace('/', '')
                data.append(title[1])  # original title
            else:
                data.append(title[0])
                data.append(' ')

            # Movie link
            link = re.findall(findLink, item)[0]
            data.append(link)

            # Movie poster
            picScr = re.findall(findPicScr, item)[0]
            data.append(picScr)

            # Movie rating
            score = re.findall(findScore, item)[0]
            data.append(score)

            # Number of ratings
            judgeNum = re.findall(findJudgeNum, item)[0]
            data.append(judgeNum)

            # Other information about the movie
            bd = re.findall(findBd, item)[0]
            bd = re.sub(r'<br(\s+)?/>(\s+)?', "", bd)
            bd = re.sub(r'\xa0', '', bd)
            bd = re.sub(r'\n', '', bd)
            bd = bd.strip()
            data.append(bd)

            # One-line summary (some movies have none)
            inq = re.findall(findInq, item)
            if len(inq) != 0:
                data.append(inq[0])
            else:
                data.append(' ')
            datalist.append(data)
    return datalist


# Save the data
def saveData(savepath, datalist):
    workbook = xlwt.Workbook(encoding="utf-8", style_compression=0)  # style_compression: whether to compress; the default 0 means no compression, 1 means compress
    worksheet = workbook.add_sheet("豆瓣电影TOP250", cell_overwrite_ok=True)  # cell_overwrite_ok=True allows cells to be overwritten
    col = ('影片中文名称', '影片原名', '影片链接', '影片图片', '影片评分', '评价人数', '影片相关信息', '影片简介')
    # First row of the sheet: column headers
    for i in range(0, 8):
        worksheet.write(0, i, col[i])
    # Write datalist into the sheet; enumerate avoids an IndexError if fewer than 250 rows were fetched
    for i, data in enumerate(datalist):
        print('Saving record %i' % i)
        for j in range(0, 8):
            worksheet.write(i + 1, j, data[j])
    workbook.save(savepath)


if __name__ == "__main__":
    main()
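getData walks through the list 25 movies at a time by changing the start parameter. The page URLs can be built up front, as in the sketch below. The function name page_urls and its defaults are assumptions for illustration; the base URL is the one used in the script.

```python
from urllib.parse import urlencode


def page_urls(base="https://movie.douban.com/top250", total=250, page_size=25):
    """Build one URL per result page using the start query parameter."""
    return [
        f"{base}?{urlencode({'start': offset})}"
        for offset in range(0, total, page_size)
    ]


urls = page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

When looping over these URLs, a short pause between requests (for example with time.sleep) is a common courtesy toward the site.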
