Python Web Crawler

xiao52x

Article link: https://blog.csdn.net/weixin_44953928/article/details/121330585

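1. Task Overview

1. Requirements Analysis

Task: scrape the basic information for each film in the Douban Top 250 list: title, Douban rating, number of ratings, one-line summary, link to the film page, and so on.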

http://movie.douban.com/top250

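2. Understanding Crawlers

1. What is a crawler

2. What a crawler can do

3. The essence of a crawler

A crawler **imitates a browser**: it opens a web page and extracts the part of the data we want.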

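3. Basic Workflow

Preparation -> Fetch data -> Parse content -> Save data

Preparation
Inspect and analyze the target page in a browser, and learn the basic coding conventions.
Fetch data
Send a request to the target site through an HTTP library. The request can carry extra information such as headers. If the server responds normally, we get a Response, which holds the page content we want.
Parse content
The content may be HTML, JSON, or another format; it can be parsed with a page-parsing library, regular expressions, and so on.
Save data
The data can be saved in many forms: as plain text, in a database, or in a file of a specific format.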
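
The four steps can be sketched as one small program. This is only an illustrative skeleton: the function names fetch, parse, save and main are made up here, and fetch returns canned HTML instead of contacting a real site, so the sketch runs offline.

```python
import os
import re
import tempfile


def fetch(url):
    """Fetch step (stand-in): returns canned HTML instead of sending a request."""
    return '<div class="item"><span class="title">Example Film</span></div>'


def parse(html):
    """Parse step: pull every title out of the HTML with a regular expression."""
    return re.findall(r'<span class="title">(.*?)</span>', html)


def save(rows, path):
    """Save step: write one title per line to a text file."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(row + "\n")


def main(url, path):
    html = fetch(url)   # fetch data
    rows = parse(html)  # parse content
    save(rows, path)    # save data
    return rows


if __name__ == "__main__":
    # Hypothetical URL and a throwaway output file, for demonstration only.
    out_path = os.path.join(tempfile.mkdtemp(), "titles.txt")
    print(main("http://example.invalid/top", out_path))
```

In a real crawler only fetch would change: it would send the request and return the decoded response body.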

3.1 Preparation

URL analysis

The list contains 250 films in total

What differs between the URLs of the individual pages
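
As a hedged illustration of the per-page difference: the sketch below assumes the list is paged 25 films at a time and that the page is chosen by a start query parameter (0, 25, 50, ...). Neither detail is stated in the text above, so check both against the live site. The helper name page_urls is made up for this example.

```python
# Base URL from the task description; the "?start=" suffix is an assumption.
BASE_URL = "http://movie.douban.com/top250?start="


def page_urls(total=250, per_page=25):
    """Build one URL per page by stepping the offset in units of per_page."""
    return [BASE_URL + str(offset) for offset in range(0, total, per_page)]


urls = page_urls()
print(len(urls))   # number of pages
print(urls[0])     # first page
print(urls[-1])    # last page
```

Under these assumptions the crawler loops over page_urls() and fetches each page in turn.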

3.1.1 Analyzing the Data

Use Chrome DevTools (F12) to inspect the page, and locate the data you need under the Elements tab.

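3.1.2 Coding Conventions

1. A Python source file usually declares its encoding on the first line so that the code can contain Chinese characters. (Python 3 already assumes UTF-8 by default, so the declaration matters mainly for Python 2.)

    # -*- coding: utf-8 -*-   or   # coding=utf-8

Use def to define a function.

A Python file can include a main entry point for testing the program:

if __name__ == "__main__":  # true when the file is run directly
    pass  # call your functions here

Python uses # to add comments that explain what a piece of code does.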

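3.1.3 Importing Modules

1. Module: a way to organize Python code (variables, functions, classes) logically. A module is essentially a .py file, and splitting code into modules makes it easier to maintain. Python uses import to bring a module in.

from bs4 import BeautifulSoup    # HTML parsing, extracting data
import re     # regular expressions, text matching
import sys
import urllib
import xlwt     # Excel operations

import urllib.request, urllib.error  # open a URL and fetch the page
import sqlite3  # SQLite database operations

# import system modules
import sys
import os
# import a custom module
from text1 import t1  # load module t1 from the folder (package) text1
print(t1.add(2,3))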

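2. Packages used

import bs4    # HTML parsing, extracting data; alternatively: from bs4 import BeautifulSoup
import re     # regular expressions, text matching
import urllib.request,urllib.error  # open a URL and fetch the page
import xlwt     # Excel operations
import sqlite3  # SQLite database operations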

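3.2 Fetching Data

3.2.1 Python typically uses the urllib library to fetch pages

import urllib.request, urllib.error

# Fetch the content of the page at a given URL
def askURL(url):
    # The User-Agent header tells the Douban server what kind of machine and browser
    # is making the request (in effect, what kind of content the client can accept)
    head = {  # imitate a browser's request headers
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36"
    }

    request = urllib.request.Request(url, headers=head)
    html = ""
    try:
        response = urllib.request.urlopen(request)
        html = response.read().decode("utf-8")
        # print(html)
    except urllib.error.URLError as e:
        if hasattr(e, "code"):    # HTTP status code, present on HTTPError
            print(e.code)
        if hasattr(e, "reason"):  # description of what went wrong
            print(e.reason)

    return html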
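
The hasattr checks in askURL exist because urllib can raise two related exceptions: URLError for network-level failures, and its subclass HTTPError, which additionally carries an HTTP status code. A minimal offline sketch of that distinction follows; the helper describe_error and the two hand-built exception objects are made up for the demonstration, and no request is sent.

```python
import urllib.error


def describe_error(e):
    """Summarize a URLError (or its subclass HTTPError) as a short message."""
    parts = []
    if hasattr(e, "code"):    # only HTTPError has a status code
        parts.append("status " + str(e.code))
    if hasattr(e, "reason"):  # both exception types expose a reason
        parts.append("reason: " + str(e.reason))
    return ", ".join(parts)


# Hand-built exceptions standing in for real failures.
http_err = urllib.error.HTTPError("http://example.invalid/", 404, "Not Found", None, None)
net_err = urllib.error.URLError("connection refused")

print(describe_error(http_err))
print(describe_error(net_err))
print(isinstance(http_err, urllib.error.URLError))
```

Because HTTPError is a subclass of URLError, the single except clause in askURL catches both kinds of failure.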

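Beautiful Soup

from bs4 import BeautifulSoup

file = open("./baidu.html", "rb")  # read in binary mode
html = file.read().decode("utf-8")
file.close()
bs = BeautifulSoup(html, "html.parser")  # parse the HTML document; html.parser is the parser to use

# II. Searching the document (the most important part)

# (1) find_all()
# String filter: matches tags whose name equals the string exactly
t_list = bs.find_all("a")  # find every <a> tag (the name must be exactly "a")
print(t_list)

# Regular-expression filter: tag names are tested with the pattern's search() method (important, commonly used)
import re
# t_list = bs.find_all(re.compile("a"))  # find every tag whose name contains "a"
# print(t_list)

# 2. kwargs parameters

t_list = bs.find_all(id="head")  # every tag with id="head", together with its contents
for item in t_list:  # print the items one by one, which is easier to read
    print(item)

# t_list = bs.find_all(class_=True)  # every tag that has a class attribute
# for item in t_list:
#     print(item)

# 3. text parameter (newer bs4 releases prefer the name string=)
# t_list = bs.find_all(text="hao123")
# t_list = bs.find_all(text=["hao123", "地图", "贴吧"])
# t_list = bs.find_all(text=re.compile(r"\d"))  # use a regex to find strings (text inside tags) that contain a digit
# for item in t_list:
#     print(item)

# 5. CSS selectors (important)

t_list = bs.select('title')  # select by tag name
# t_list = bs.select('.mnav')  # select by class name; the dot stands for class
# t_list = bs.select('#u1')  # select by id
# t_list = bs.select('a[class="bri"]')  # select by attribute
# t_list = bs.select('head>title')  # select a child tag (the title inside head)
#
for item in t_list:  # print the items one by one, which is easier to read
    print(item)
#
# t_list = bs.select(".mnav ~ .bri")  # select siblings: .bri elements that follow a .mnav element
# print(t_list[0].get_text())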
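
The searches above depend on a local baidu.html file. For experimenting without that file, here is a minimal self-contained sketch that runs the same kinds of query against a small made-up HTML string. The markup, class names and example.invalid links are invented for illustration. It assumes a reasonably recent bs4, since it uses the string= keyword and select().

```python
import re
from bs4 import BeautifulSoup

# Invented markup, loosely shaped like a navigation bar.
html = """
<html><head><title>Demo page</title></head>
<body>
  <div id="head">
    <a class="mnav" href="http://news.example.invalid">News</a>
    <a class="mnav" href="http://map.example.invalid">Map</a>
    <a class="bri" href="http://more.example.invalid">More 123</a>
  </div>
</body></html>
"""
bs = BeautifulSoup(html, "html.parser")

links = bs.find_all("a")                            # by exact tag name
by_id = bs.find_all(id="head")                      # by keyword argument
with_digit = bs.find_all(string=re.compile(r"\d"))  # strings containing a digit
by_class = bs.select(".mnav")                       # CSS class selector
title = bs.select("head > title")[0].get_text()     # CSS child selector
siblings = bs.select(".mnav ~ .bri")                # CSS sibling selector

print(len(links), len(by_id), len(by_class), len(siblings))
print(with_digit)
print(title)
```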

More on urllib

import urllib.request, urllib.parse, urllib.error

# Visit Douban
url = "http://www.douban.com"
# headers holds the request header information
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36"}
req = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(req)
print(response.read().decode("utf-8"))

# A site for testing requests: httpbin.org
import urllib.request

# Send a GET request
response = urllib.request.urlopen("http://www.baidu.com")
print(response.read().decode("utf-8"))  # decode the fetched page source as UTF-8

# Send a POST request; used when simulating a user login on a site
import urllib.parse
data = bytes(urllib.parse.urlencode({"hello": "world"}), encoding="utf-8")
# Imitates a browser submitting a form; think of hello as the user name and world as the password
response = urllib.request.urlopen("http://httpbin.org/post", data=data)
print(response.read().decode("utf-8"))

# GET with timeout handling: timeout limits how long the request may take; try/except handles the failure
try:
    response = urllib.request.urlopen("http://httpbin.org/get", timeout=1)
    print(response.read().decode("utf-8"))
except urllib.error.URLError as e:
    print("Something went wrong")



# Pretend to be a browser when visiting a site, so that it is not detected as a crawler

# url = "http://www.douban.com"
# Headers that make the request look like it comes from a browser
url = "http://httpbin.org/post"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36"}
data = bytes(urllib.parse.urlencode({'name': 'eric'}), encoding="utf-8")
req = urllib.request.Request(url=url, data=data, headers=headers, method="POST")
response = urllib.request.urlopen(req)
print(response.read().decode("utf-8"))
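
Before sending anything, it can help to inspect what a Request object will carry. The sketch below builds requests like the ones above and prints their method, body and header without opening a connection. The User-Agent value is a shortened placeholder, not a real browser string.

```python
import urllib.parse
import urllib.request

headers = {"User-Agent": "Mozilla/5.0 (demo)"}  # placeholder value for illustration
data = urllib.parse.urlencode({"name": "eric"}).encode("utf-8")

post_req = urllib.request.Request(
    url="http://httpbin.org/post", data=data, headers=headers, method="POST"
)
get_req = urllib.request.Request("http://httpbin.org/get", headers=headers)

print(post_req.get_method())              # method set explicitly
print(post_req.data)                      # URL-encoded form body as bytes
print(post_req.get_header("User-agent"))  # header names are stored capitalized
print(get_req.get_method())               # no data and no method given
```

Nothing here touches the network; only urllib.request.urlopen(req) actually sends the request.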

Supplement: the re module

Flag	Description
re.I	Makes matching case-insensitive
re.S	Makes . match any character, including a newline
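
A short sketch of what these two flags change, using made-up sample strings:

```python
import re

# re.I: ignore case when matching.
names = "Douban DOUBAN douban"
print(re.findall(r"douban", names))        # case-sensitive
print(re.findall(r"douban", names, re.I))  # case-insensitive

# re.S: let "." also match the newline character.
block = "<p>first line\nsecond line</p>"
print(re.findall(r"<p>(.*?)</p>", block))        # "." stops at the newline
print(re.findall(r"<p>(.*?)</p>", block, re.S))  # "." crosses the newline
```

re.S matters for scraped HTML because a single tag's content often spans several lines.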

3.3 Parsing the Content

1. (.*?)  # the ? after * makes the match non-greedy: it matches as few characters as possible
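
To see why the ? matters, compare the greedy (.*) with the non-greedy (.*?) on a made-up line that holds two title spans:

```python
import re

html = '<span class="title">A</span><span class="title">B</span>'

greedy = re.findall(r'<span class="title">(.*)</span>', html)
lazy = re.findall(r'<span class="title">(.*?)</span>', html)

print(greedy)  # one long match that runs to the last </span>
print(lazy)    # one match per span
```

The greedy form swallows everything up to the final closing tag, while the non-greedy form stops at the first one.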

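1. Parse the fetched HTML

soup = BeautifulSoup(html, "html.parser")
for item in soup.find_all('div', class_='item'):  # find each film entry
    data = []
    item = str(item)  # convert to a string
    # link to the film's detail page
    link = re.findall(findLink, item)[0]
    data.append(link)  # add the detail link

    imgSrc = re.findall(findImgSrc, item)[0]
    data.append(imgSrc)  # add the image link
    titles = re.findall(findTitle, item)
    # a film may have only a Chinese title, with no foreign title
    if len(titles) == 2:
        ctitle = titles[0]
        data.append(ctitle)  # add the Chinese title
        otitle = titles[1].replace("/", "")  # strip the irrelevant slash
        data.append(otitle)  # add the foreign title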

1. Parsing the page content

Use BeautifulSoup to locate specific tags
Use regular expressions to extract the exact content

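3.3.1 Tag Parsing

Beautiful Soup

Beautiful Soup is a library that provides simple, Pythonic functions for navigating, searching, and modifying the parse tree; by parsing the document it gives you the data you want to scrape. Each film we need sits inside a <div> tag, and each of those div tags has the attribute class="item".

soup = BeautifulSoup(html, "html.parser")
for item in soup.find_all('div', class_='item'):  # find each film entry
    pass  # handle one film entry here

BeautifulSoup: creates a BeautifulSoup object; html is the page content, and html.parser is one of the available page parsers
class_='item': selects the elements that each hold one complete film entry, that is, every div on the page whose class is item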
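
A minimal sketch of this selection on invented markup. The HTML below only imitates the shape described above (one div with class item per film); it is not copied from the real page.

```python
from bs4 import BeautifulSoup

# Invented sample: two film entries and one unrelated div.
html = """
<ol>
  <li><div class="item"><span class="title">Film A</span></div></li>
  <li><div class="item"><span class="title">Film B</span></div></li>
  <li><div class="other"><span class="title">Not a film entry</span></div></li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

items = soup.find_all("div", class_="item")  # only divs whose class is "item"
titles = [item.find("span", class_="title").get_text() for item in items]

print(len(items))
print(titles)
```

The keyword is spelled class_ with a trailing underscore because class is a reserved word in Python.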

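3.3.2 Regular-Expression Parsing

Regular expressions

Regular expressions are typically used to search for, or replace, text that matches a given pattern (rule). A regular expression is a logical formula for operating on strings: predefined special characters, and combinations of them, form a "rule string" that expresses a filter over strings. Python works with regular expressions through the re module.

findLink = re.compile(r'<a href="(.*?)">')  # link to the film's detail page
findImgSrc = re.compile(r'<img.*src="(.*?)"', re.S)  # film poster image; .* skips any attributes before src
findTitle = re.compile(r'<span class="title">(.*)</span>')  # film title
# rating
findRating = re.compile(r'<span class="rating_num" property="v:average">(.*)</span>')
# number of ratings; the page text reads "N人评价" ("rated by N people")
findJudge = re.compile(r'<span>(\d*)人评价</span>')
# one-line summary
findInq = re.compile(r'<span class="inq">(.*)</span>')
# other film details: director, cast, year, region, genre
findBd = re.compile(r'<p class="">(.*?)</p>', re.S)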
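
To check patterns like these without hitting the site, they can be run against a hand-written entry. The snippet below is an invented stand-in for one film entry (hypothetical link, titles and numbers). It only imitates the tag shapes the patterns expect, and the pattern names follow the article's naming style.

```python
import re

findLink = re.compile(r'<a href="(.*?)">')
findTitle = re.compile(r'<span class="title">(.*)</span>')
findRating = re.compile(r'<span class="rating_num" property="v:average">(.*)</span>')
findJudge = re.compile(r'<span>(\d*)人评价</span>')
findBd = re.compile(r'<p class="">(.*?)</p>', re.S)

# Invented sample entry; not real page data.
item = '''
<div class="item">
  <a href="https://movie.example.invalid/subject/1/">
  <span class="title">示例电影</span>
  <span class="title"> / Example Film</span>
  <p class="">
    Director: Someone
    1994 / US / Drama
  </p>
  <span class="rating_num" property="v:average">9.7</span>
  <span>12345人评价</span>
</div>
'''

link = findLink.findall(item)[0]
titles = findTitle.findall(item)
foreign_title = titles[1].replace("/", "").strip()  # drop the separator slash
rating = findRating.findall(item)[0]
votes = findJudge.findall(item)[0]
details = findBd.findall(item)[0].strip()           # multi-line, hence re.S

print(link, titles[0], foreign_title, rating, votes)
print(details)
```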

3.4 Saving the Data

3.4.1 Storing in an Excel Sheet

1. Excel storage

Use the Python library xlwt to write the extracted data, datalist, into an Excel sheet.
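
A minimal sketch of that step, assuming datalist is a list of rows with one list of values per film. The function name save_data, the sheet name, the column headers and the sample rows are all invented for illustration. xlwt writes the legacy .xls format.

```python
import os
import tempfile

import xlwt


def save_data(datalist, savepath):
    """Write a header row plus one row per record to an .xls workbook."""
    book = xlwt.Workbook(encoding="utf-8")
    sheet = book.add_sheet("top250", cell_overwrite_ok=True)
    columns = ("Link", "Title", "Rating", "Votes")  # illustrative headers
    for col, name in enumerate(columns):
        sheet.write(0, col, name)                   # row 0 holds the headers
    for row, record in enumerate(datalist, start=1):
        for col, value in enumerate(record):
            sheet.write(row, col, value)
    book.save(savepath)


# Invented sample rows standing in for scraped data.
datalist = [
    ["https://movie.example.invalid/subject/1/", "Film A", "9.7", "12345"],
    ["https://movie.example.invalid/subject/2/", "Film B", "9.6", "6789"],
]
path = os.path.join(tempfile.mkdtemp(), "top250.xls")
save_data(datalist, path)
print(os.path.exists(path))
```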
