2021-05-20

Vivid

Tags: python

Article link: https://blog.csdn.net/vivid0610/article/details/117081764

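An In-Depth Study of Web Data Scraping
1. Preparation
Inspect and analyze the target web page in a browser, and learn the basic coding conventions.
Fetching data
Send a request to the target site through an HTTP library. The request can carry extra information such as headers. If the server responds normally, you get back a Response, which holds the page content you are after.
Parsing content
The content may be HTML, JSON, or another format, and can be parsed with a page-parsing library, regular expressions, and so on.
Saving data
Data can be saved in many forms: as plain text, in a database, or as a file in a specific format.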
Use Chrome DevTools (F12) to analyze the page, and locate the data you need under the Elements tab.
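The three steps above (fetch, parse, save) can be sketched roughly as follows. This is only a minimal illustration, not the code used later in this post: the function names, the `<h2>` pattern, and the output file name are all assumptions made for the example, and a hard-coded HTML string stands in for a real page so the sketch runs without network access.

```python
import os
import re
import tempfile
import urllib.request


def fetch(url):
    """Step 1: request the page and return its body as text."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")


def parse(html):
    """Step 2: pull the text of every <h2> element out of the HTML."""
    return [t.strip() for t in re.findall(r"<h2[^>]*>(.*?)</h2>", html, re.S)]


def save(items, path):
    """Step 3: write one item per line to a text file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(items))


# A hard-coded page stands in for fetch(url) so the sketch runs offline.
sample_html = "<html><body><h2>First post</h2><h2> Second post </h2></body></html>"
titles = parse(sample_html)

# Hypothetical output location: a file in the system temp directory.
out_path = os.path.join(tempfile.gettempdir(), "titles.txt")
save(titles, out_path)
print(titles)
```

On a real site you would call `fetch(url)` instead of using `sample_html`, and the pattern in `parse` would depend on what you find in the Elements tab.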

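2. Coding conventions
In Python 2, the first line of a program usually needs one of the following declarations:
# -*- coding: utf-8 -*- or # coding=utf-8
This allows the source code to contain Chinese characters. Python 3 source files default to UTF-8, so there the declaration is optional.
In Python, a function implements a single piece of functionality, or a group of related ones, which improves readability and code reuse. A function block starts with the def keyword, followed by a space, the function name, parentheses (), and a colon :. Parameters can be passed inside the parentheses. The function body is indented (a Tab or four spaces; choose one and do not mix them). return ends the function and can return a value; with no expression it returns None.
A Python file can include a main function for testing the program:

if __name__ == "__main__":
    main()

Python uses # to add comments that explain what a piece of code does.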
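Putting these conventions together, a small file might look like the sketch below. The function `page_offsets` is made up for illustration; its step of 25 mirrors the `i*25` paging used later in this post.

```python
# -*- coding: utf-8 -*-
# The line above is optional in Python 3 and only kept to show the convention.


def page_offsets(pages, step=25):
    """Return the start offset of each page: 0, step, 2*step, ..."""
    return [i * step for i in range(pages)]


def main():
    # Quick self-check of the function above.
    print(page_offsets(3))


# Runs only when the file is executed directly, not when it is imported.
if __name__ == "__main__":
    main()
```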
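3. Importing modules
Module: a way to organize Python code (variables, functions, classes) logically. A module is essentially a .py file, and using modules makes code easier to maintain. Python imports modules with import, for example: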

import sys
import re
import urllib.request
import urllib.error
import xlwt
from bs4 import BeautifulSoup

def main():
    print("Starting the crawl...")
    baseurl = 'https://blog.csdn.net/eastmount/article/details/52577215'
    datalist = getData(baseurl)  # getData is not shown in this post
    savapath = u'/home/aistudio/data/CDSN.xls'
    saveData(datalist, savapath)  # saveData is not shown in this post

Main flow

main()
print("Crawl finished, please check the .xls file")

Fetch the data, parse the content, save the data

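Module: in most cases, a file with the .py suffix.
A module can be seen as a utility class. It lets you share code or hide implementation details; putting related code in one module makes it easier to use and understand, and lets the coder focus on high-level logic.
A module can define functions, classes, and variables, and can also contain executable code. Modules come from three sources:
① Python's built-in modules (the standard library);
② third-party modules;
③ your own modules.
Package: to avoid module name clashes, Python organizes modules by directory, which is called a package. A package is a folder that contains Python modules.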
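To make the module and package idea concrete, here is a small sketch that builds a package on the fly and imports from it. The package name `scrape_demo`, the module `textutil`, and the function `shout` are all hypothetical; in a real project you would simply create these files by hand next to your script.

```python
import pathlib
import sys
import tempfile

# Create a throwaway directory holding one package with one module.
root = pathlib.Path(tempfile.mkdtemp())
pkg = root / "scrape_demo"               # the package is just a folder
pkg.mkdir()
(pkg / "__init__.py").write_text("")     # marks the folder as a package
(pkg / "textutil.py").write_text(        # the module is just a .py file
    "def shout(s):\n"
    "    return s.upper() + '!'\n"
)

# Make the parent directory importable, then import from package.module.
sys.path.insert(0, str(root))
from scrape_demo.textutil import shout

result = shout("done")
print(result)
```

Because the function lives in `scrape_demo.textutil`, another package could define its own `shout` without any clash.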

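4. Fetching data
In Python 3, pages are usually fetched with the built-in urllib.request module (its Python 2 counterpart was urllib2).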

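import urllib.request
import urllib.error

# Excerpt: request ten pages, 25 items apart (baseurl is defined by the caller)
for i in range(0, 10):
    url = baseurl + str(i * 25)
    html = askURL(url)

# Get the full content of one page
def askURL(url):
    request = urllib.request.Request(url)  # build the request
    html = ""  # returned unchanged if the request fails
    try:
        response = urllib.request.urlopen(request)  # get the response
        html = response.read()  # read the page content
        print(html)
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    return html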

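Fetching page data
For each page, call the askURL function to get the page content.
Define a page-fetching function askURL that takes one parameter, url, the address of the page, for example
https://blog.csdn.net/eastmount/article/details/52577215
urllib.request.Request builds the request;
urllib.request.urlopen sends the request and returns the response; read() reads the page content.
Errors are common when accessing pages, so a try...except statement is added to catch exceptions and keep the program running.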
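To see the try...except branch in action without hitting a real site, the sketch below starts a tiny local HTTP server and requests one page that exists and one that does not. Everything here is an assumption made for the demo: the handler, the `/ok` path, and the `ask_url` variant (which returns bytes and uses a timeout) are not part of the original program.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Tiny local site: /ok returns a page, everything else is a 404."""

    def do_GET(self):
        if self.path == "/ok":
            body = b"<html><body>hello</body></html>"
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def ask_url(url):
    """Return the page body as bytes, or b"" if the request fails."""
    html = b""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            html = response.read()
    except urllib.error.URLError as e:
        # HTTPError (a URLError subclass) carries .code; both carry .reason.
        if hasattr(e, "code"):
            print("status:", e.code)
        if hasattr(e, "reason"):
            print("reason:", e.reason)
    return html


# Ignore any system proxy so the local demo server is reached directly.
urllib.request.install_opener(
    urllib.request.build_opener(urllib.request.ProxyHandler({}))
)

server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port

good = ask_url(base + "/ok")        # succeeds and returns the page bytes
bad = ask_url(base + "/missing")    # 404: prints the error, returns b""
server.shutdown()
server.server_close()
```

Initializing `html` before the `try` matters: without it, a failed request would leave the variable undefined when the function reaches `return`.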

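Supplement: the urllib module
The most basic requests
urllib is Python's built-in HTTP request library and needs no extra installation. You only need to care about the request URL and its parameters; the library also provides powerful URL parsing.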

 urllib.request: request module
 urllib.error: exception handling module
 urllib.parse: URL parsing module
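Of the three submodules, urllib.parse is the only one not demonstrated on its own in this post, so here is a short sketch of what it does. The query parameters `start` and `q` are invented for the example.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Split a URL into its components.
url = "https://blog.csdn.net/eastmount/article/details/52577215"
parts = urlparse(url)
print(parts.scheme, parts.netloc, parts.path)

# Turn a dict into a query string (spaces become '+').
query = urlencode({"start": 25, "q": "python scraping"})
print(query)

# And turn a query string back into a dict of lists.
decoded = parse_qs(query)
print(decoded)
```

The same `urlencode` call is what prepares the form data for the POST request shown later in this section.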

A simple GET request

import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))

A simple POST request

import urllib.parse
import urllib.request
data = bytes(urllib.parse.urlencode({'hello':'world'}),encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post',data=data)
print(response.read())

Timeout handling

import urllib.request
response = urllib.request.urlopen('http://httpbin.org/get',timeout=1)
print(response.read())

import urllib.request
import socket
import urllib.error
try:
    response = urllib.request.urlopen('http://httpbin.org/get',timeout=0.01)
except urllib.error.URLError as e:
    if isinstance(e.reason,socket.timeout):  # check the cause of the error
        print('time out!')
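One detail worth knowing: a timeout while connecting or sending is wrapped in URLError, but a timeout while waiting for the response can surface as a bare socket.timeout. The sketch below catches both. It uses a deliberately slow local server instead of httpbin.org so it runs offline; the handler, the delays, and the helper name `fetch_with_timeout` are assumptions made for the demo.

```python
import socket
import threading
import time
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class SlowHandler(BaseHTTPRequestHandler):
    """Local stand-in for a slow site: waits before answering."""

    def do_GET(self):
        time.sleep(0.5)
        try:
            self.send_response(200)
            self.send_header("Content-Length", "2")
            self.end_headers()
            self.wfile.write(b"ok")
        except OSError:
            pass  # the client has already given up

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_with_timeout(url, seconds):
    """Return the body, or None if the request timed out."""
    try:
        with urllib.request.urlopen(url, timeout=seconds) as response:
            return response.read()
    except urllib.error.URLError as e:
        if isinstance(e.reason, socket.timeout):
            return None  # timed out while connecting or sending
        raise
    except socket.timeout:
        return None      # timed out while waiting for or reading the response


# Ignore any system proxy so the local demo server is reached directly.
urllib.request.install_opener(
    urllib.request.build_opener(urllib.request.ProxyHandler({}))
)

server = HTTPServer(("127.0.0.1", 0), SlowHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

too_short = fetch_with_timeout(url, 0.1)  # gives up before the server answers
patient = fetch_with_timeout(url, 3)      # waits long enough
server.shutdown()
server.server_close()
print(too_short, patient)
```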

Printing the response type, status code, and response headers

import urllib.request
response=urllib.request.urlopen('http://www.baidu.com')
print(type(response))

import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
print(response.status) # status code, tells you whether the request succeeded
print(response.getheaders()) # response headers, as a list of tuples
print(response.getheader('Server')) # one specific response header
print(response.read().decode('utf-8')) # response body: raw bytes, decoded here as UTF-8

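urlopen called with a plain URL string cannot carry request headers or an explicit method, so we need another way to supply them.
We declare a Request object and attach those parameters to it.

import urllib.request
request = urllib.request.Request('https://python.org') # a Request object can carry headers, data, and a method
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

You can also build strings, dictionaries, and so on separately and pass them into the Request object.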

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Host': 'httpbin.org'
}
# named form rather than dict, to avoid shadowing the built-in
form = {
    'name': 'jay'
}
data = bytes(parse.urlencode(form), encoding='utf-8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
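Before sending a request it can help to check what the Request object actually holds. The sketch below builds one and inspects it without touching the network. The user-agent string `demo-client/0.1` and the extra `Accept` header are made up for illustration.

```python
from urllib import parse, request

form = {"name": "jay"}
data = parse.urlencode(form).encode("utf-8")  # form data must be bytes

req = request.Request(
    url="http://httpbin.org/post",
    data=data,
    headers={"User-Agent": "demo-client/0.1"},  # hypothetical user agent
    method="POST",
)
# Headers can also be added after the object is created.
req.add_header("Accept", "application/json")

# Nothing has been sent yet; these calls only inspect the object.
print(req.get_method())
print(req.full_url)
print(req.data)
print(req.header_items())
```

Note that Request stores header names in capitalized form, so the lookup key for the first header is `User-agent`.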

Summary: fetching, parsing, and saving data
