Web Scraping Study Notes (2)

Latest recommended article published on 2022-12-07 21:51:10

小李同学314


Categories: crawlers, python, quantitative. Tags: crawlers, python, front end.

Article link: https://blog.csdn.net/weixin_44602227/article/details/122463156


Learning web scraping

Note: these notes were written in Jupyter.

Web front-end basics

Jupyter can run HTML and JavaScript directly: just put %%html or %%javascript on the first line of the cell.

%%html

<html>
    <head>
        <title>python爬虫开发与项目实战</title>
        <meta charset='UTF-8'>
    </head>
    <body>
        Document markup tags<br>
        <p>This is a paragraph</p>
    </body>
</html>

python爬虫开发与项目实战 Document markup tags

This is a paragraph

%%html
<html>
    <head>
        <script type='text/javascript'>
            alert('Hello,world!');
            var str1 = 'hi';
            var str2 = 'you';
            str1 += str2;
            alert(str1);
        </script>
    </head>
    <body>
        python crawler
    </body>
</html>

python crawler str1

JavaScript can also be run directly, as shown below.

%%javascript
alert('hello li');
var str1 = 'hi';
var str2 = 'you';
str1 += str2;
alert(str1);
var person = {name: 'li', age: 17};
alert(person.name);

<IPython.core.display.Javascript object>

XPath nodes

%%html
<?xml version="1.0" encoding="ISO-8859-1"?>
<classroom>
    <student>
        <id>1001</id>
        <name lang="en">marry</name>
        <country>China</country>
    </student>
</classroom>

1001 marry China
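To make the idea of XPath nodes concrete, here is a minimal sketch that queries the same sample document from Python. It assumes the standard library's xml.etree.ElementTree, which supports only a subset of XPath; the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The sample document from above, written as well-formed XML (bytes literal).
XML_DOC = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<classroom>
    <student>
        <id>1001</id>
        <name lang="en">marry</name>
        <country>China</country>
    </student>
</classroom>"""

root = ET.fromstring(XML_DOC)  # root is the <classroom> element node

# Element nodes selected by path, relative to the root.
student_ids = [e.text for e in root.findall("./student/id")]
# Element nodes filtered by an attribute node (lang="en").
english_names = [e.text for e in root.findall(".//name[@lang='en']")]
# Text of the first matching element.
country = root.findtext("./student/country")

print(student_ids, english_names, country)
```

For full XPath 1.0 support a third-party parser would be needed; the subset above is enough to show element, attribute, and text nodes.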

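CSS (Cascading Style Sheets)

A CSS rule consists of a selector and one or more declarations.
There are three common ways to apply styles:

Inline styles: set the style attribute directly on an element, for example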

<body style='background-color:green;margin:0;padding:0;'></body>

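Embedded style sheets: the rules go between <style type='text/css'> and </style>.
External style sheets: the CSS lives in a separate file that the page references with <link rel='StyleSheet' type='text/css' href='style.css'>.

JavaScript

There are two ways to include it:

Write the code directly in the page: <script type='text/javascript'>alert('hello')</script>
Reference an external file: <script src='temp/test1.js'></script>, usually placed inside <head></head>.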
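A crawler eventually has to recognize these inclusion styles when it reads a page. The sketch below uses the standard library's html.parser to list how a small page pulls in CSS and JavaScript. The AssetFinder class and the PAGE sample are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class AssetFinder(HTMLParser):
    """Records how a page includes CSS and JavaScript (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "style" in attrs:
            # Inline style: a style attribute on any element.
            self.found.append(("inline style", tag))
        if tag == "style":
            self.found.append(("embedded style sheet", None))
        elif tag == "link" and (attrs.get("rel") or "").lower() == "stylesheet":
            self.found.append(("external style sheet", attrs.get("href")))
        elif tag == "script":
            kind = "external script" if "src" in attrs else "inline script"
            self.found.append((kind, attrs.get("src")))


PAGE = """
<html>
  <head>
    <link rel='StyleSheet' type='text/css' href='style.css'>
    <style type='text/css'>p { color: red; }</style>
    <script src='temp/test1.js'></script>
    <script type='text/javascript'>alert('hello')</script>
  </head>
  <body style='background-color:green;margin:0;padding:0;'></body>
</html>
"""

finder = AssetFinder()
finder.feed(PAGE)
for kind, target in finder.found:
    print(kind, target)
```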

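The HTTP standard

Common status codes: 200 means the request succeeded; 301 means the resource has moved permanently to another URL; 404 means the requested resource does not exist; 500 means an internal server error.
Headers: User-Agent is the most commonly used one, and servers often check it as an anti-scraping measure.
GET versus POST: GET passes data in the URL, so the parameters show up in the address bar and their length is limited in practice (1024 bytes is a commonly quoted figure, although the HTTP specification itself sets no limit and the real cap depends on the browser and server). POST passes data in the request body, with no comparable size limit, and keeps the parameters out of the URL; the data is still sent in plain text unless HTTPS is used.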
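The points above can be checked without sending anything over the network. This sketch uses only the standard library; the example.com URL and the query parameters are made up for illustration.

```python
from http import HTTPStatus
from urllib.parse import urlencode
from urllib.request import Request

# Standard reason phrases for the status codes listed above.
for code in (200, 301, 404, 500):
    print(code, HTTPStatus(code).phrase)

params = {"q": "python", "page": "2"}  # made-up parameters
query = urlencode(params)

# GET: the parameters are part of the URL itself.
get_req = Request("http://example.com/search?" + query)
# POST: the parameters go into the request body, as bytes.
post_req = Request("http://example.com/search", data=query.encode("utf-8"))

# Nothing is sent here; the Request objects are only inspected.
print(get_req.get_method(), get_req.full_url, get_req.data)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

urllib decides between GET and POST purely by whether a request body is attached, which is why the two objects report different methods.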

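Overview of Python crawlers

Types of crawlers:

General-purpose web crawlers, such as the ones behind the Baidu and Google search engines.
Focused web crawlers: programs that automatically download pages on a chosen topic.
Incremental web crawlers: update what has changed and skip downloading what has not.
Deep web crawlers: for pages that can only be reached after logging in.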
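As an illustration of the incremental idea in the list above, here is a minimal sketch that remembers a hash of each page and reports whether the content changed. The function names and the in-memory `seen` dictionary are assumptions for this example, not taken from the book.

```python
import hashlib


def content_fingerprint(body: bytes) -> str:
    """Hash the page body so that two versions can be compared cheaply."""
    return hashlib.sha256(body).hexdigest()


def needs_update(url: str, body: bytes, seen: dict) -> bool:
    """Return True (and record the new fingerprint) if the page is new or changed."""
    fingerprint = content_fingerprint(body)
    if seen.get(url) == fingerprint:
        return False
    seen[url] = fingerprint
    return True


seen = {}
first = needs_update("http://example.com/a", b"<p>v1</p>", seen)    # first visit
second = needs_update("http://example.com/a", b"<p>v1</p>", seen)   # unchanged
third = needs_update("http://example.com/a", b"<p>v2</p>", seen)    # changed
print(first, second, third)
```

This version still has to fetch the body before hashing it. A crawler that wants to skip the download itself would more likely rely on conditional request headers such as If-Modified-Since or ETag, where the server supports them.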

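Implementing HTTP requests in Python

import urllib.request  # a bare "import urllib" does not reliably load the request submodule
response = urllib.request.urlopen('http://www.zhihu.com')
html = response.read()
print(html[0:20])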

b'<!doctype html>\n<htm'

The code above sends a GET request. Below is a POST request.

# encoding:utf-8
import urllib.parse
import urllib.request
# import urllib2  (Python 2 only)
url = 'http://www.zhihu.com/login'
postdata = {b'username': b'qiye',
            b'password': b'qiye_pass'}
# The form data must be URL-encoded and then converted to bytes before it can be sent
data = urllib.parse.urlencode(postdata).encode('utf-8')
req = urllib.request.Request(url, data)
response = urllib.request.urlopen(req)
html = response.read()
print(html[0:20])

b'<!DOCTYPE html>\n<htm'

The code from the book fails with:

AttributeError: module 'urllib' has no attribute 'urlencode'

Fix: urllib was split into submodules in Python 3, so

urllib.urlencode()

becomes

urllib.parse.urlencode()

That leads to another error:

TypeError: POST data should be bytes, an iterable of bytes, or a file object. It cannot be of type str.

Fix: encode the data as UTF-8 bytes:

data = urllib.parse.urlencode(postdata).encode('utf-8')
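The two steps in that line can be separated to see where the TypeError came from: urlencode() returns a str, and only the .encode() call turns it into the bytes a POST body requires. A small offline sketch, using the same placeholder credentials as the example above:

```python
from urllib.parse import urlencode

postdata = {"username": "qiye", "password": "qiye_pass"}

encoded = urlencode(postdata)   # still a str: this is what triggered the TypeError
data = encoded.encode("utf-8")  # bytes: acceptable as a POST body

print(type(encoded).__name__, encoded)
print(type(data).__name__, data)
```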

Then yet another error:

HTTPError: HTTP Error 403: Forbidden

Fix: the original URL was not specific enough. Use:

url = 'http://www.zhihu.com/login'

Handling request headers

import urllib.parse
import urllib.request
url = 'https://www.cnblogs.com/login'
user_agent = 'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)'
referer = 'https://www.cnblogs.com'
postdata = {'username':'小李同学314','password':'***'}
# Put user_agent and referer into the request headers (the header name is User-Agent, with a hyphen)
headers = {'User-Agent':user_agent,'Referer':referer}
data = urllib.parse.urlencode(postdata).encode('utf-8')
req = urllib.request.Request(url,data,headers)
response = urllib.request.urlopen(req)
html = response.read()
print(html[0:50])

b'<!DOCTYPE html>\n<html lang="zh-cn">\n<head>\n    <me'

You can also use the add_header() method.

import urllib.parse
import urllib.request
url = 'https://www.cnblogs.com/login'
user_agent='Mozilla/4.0(compatible;MSIE 5.5;Windows NT)'
referer = 'https://www.cnblogs.com'
postdata = {'username':'小李同学314','password':'***'}
data = urllib.parse.urlencode(postdata).encode('utf-8')
# add_header
req = urllib.request.Request(url)
req.add_header('User-Agent',user_agent)
req.add_header('Referer',referer)
req.data = data
response = urllib.request.urlopen(req)
html = response.read()
print(html[0:10])

b'<!DOCTYPE '

Both examples above probably still have a problem: a wrong password returns the same result.
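One thing that can be checked without contacting the server is whether the two approaches attach the same headers. The sketch below builds both Request objects and compares them offline. Note that urllib stores header names with only the first letter capitalized, so the lookup key is 'User-agent'.

```python
from urllib.request import Request

url = "https://www.cnblogs.com/login"
user_agent = "Mozilla/4.0(compatible;MSIE 5.5;Windows NT)"
referer = "https://www.cnblogs.com"

# Approach 1: pass a headers dict to the constructor.
req_a = Request(url, headers={"User-Agent": user_agent, "Referer": referer})

# Approach 2: add the headers one at a time.
req_b = Request(url)
req_b.add_header("User-Agent", user_agent)
req_b.add_header("Referer", referer)

same_headers = sorted(req_a.header_items()) == sorted(req_b.header_items())
print(same_headers)
print(req_a.get_header("User-agent"))
```

If the headers match and the response is still identical for a wrong password, the cause is more likely on the login side (for example, the form fields or the endpoint) than in the header handling. That is a guess, not something these notes verify.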

Introducing the requests library

import requests
r = requests.get('https://www.baidu.com')
print(r.content[0:20])

b'<!DOCTYPE html>\r\n<!-'

import requests
postdata = {'key':'value'}
r = requests.post('https://www.baidu.com/login',data=postdata)
print(r.content)

b'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>404 Not Found</title>\n</head><body>\n<h1>Not Found</h1>\n<p>The requested URL /login was not found on this server.</p>\n</body></html>\n'

Responses and encoding

import requests
r = requests.get('https://www.baidu.com')
print(r.encoding)
r.encoding = 'utf-8'
print(r.text[0:20])

ISO-8859-1
<!DOCTYPE html>
<!-

import chardet
import requests
r = requests.get('https://www.baidu.com')
print(chardet.detect(r.content))
# detect() returns a dict; only its 'encoding' entry is a valid value for r.encoding
r.encoding = chardet.detect(r.content)['encoding']
print(r.text[0:20])

{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
<!DOCTYPE html>
<!-
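Why the encoding matters is easy to see offline. The sketch below decodes the same UTF-8 bytes with ISO-8859-1 (the encoding requests reported above) and with UTF-8; the sample text is made up. As far as I know, requests falls back to ISO-8859-1 when a text response declares no charset, which would explain the first output above.

```python
text = "中文网页"              # made-up sample text
raw = text.encode("utf-8")     # the bytes a server would send

wrong = raw.decode("iso-8859-1")  # never raises, but yields one garbage character per byte
right = raw.decode("utf-8")       # the intended text

print(repr(wrong))
print(right)

# The damage is reversible as long as the wrongly decoded string is intact.
recovered = wrong.encode("iso-8859-1").decode("utf-8")
print(recovered == text)
```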

Besides reading the whole response at once, there is also a streaming mode, which reads the response as a raw byte stream.

import requests
r = requests.get('https://www.baidu.com',stream=True)
print(r.raw.read(10))

b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'
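The bytes printed above are not HTML: they begin with 1f 8b 08, which is the gzip signature followed by the deflate method byte. This suggests the raw stream is the still-compressed body. A small offline sketch with the standard gzip module shows the same prefix; the sample HTML is made up.

```python
import gzip

html = b"<!DOCTYPE html><html><body>hello</body></html>"  # made-up page body
compressed = gzip.compress(html)

prefix = compressed[:3]                 # gzip magic number plus compression method
restored = gzip.decompress(compressed)  # back to the original bytes

print(prefix)
print(restored == html)
```

To get decompressed data from the raw stream, urllib3's read() is documented to accept decode_content=True; r.content and r.text already handle this automatically.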

Handling request headers

import requests
user_agent= 'Mozilla/4.0 (compatible;MSIE 5.5;Windows NT)'
headers = {'User-Agent':user_agent}
r = requests.get('https://www.baidu.com',headers=headers)
print(r.content[0:20])

b'<!DOCTYPE html><!--S'

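Handling status codes and response headers

import requests
r = requests.get('http://www.baidu.com')
if r.status_code == requests.codes.ok:
    print(r.status_code)
    # Preferred: .get() returns None when the header is missing,
    # whereas r.headers['content-type'] raises an exception in that case.
    print(r.headers.get('content-type'))
    print(r.headers)
else:
    r.raise_for_status()  # raises an HTTPError for 4xx and 5xx responses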

200
text/html
{'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Wed, 12 Jan 2022 13:00:11 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:28:12 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}
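The difference between .get() and square-bracket access can be reproduced offline with a plain dictionary. The get_header helper below is hypothetical; it imitates the case-insensitive lookup that r.headers provides, since an ordinary dict is case-sensitive.

```python
def get_header(headers, name, default=None):
    """Case-insensitive header lookup that never raises (hypothetical helper)."""
    wanted = name.lower()
    for key, value in headers.items():
        if key.lower() == wanted:
            return value
    return default


# Two entries copied from the response headers printed above.
headers = {"Content-Type": "text/html", "Server": "bfe/1.0.8.18"}

print(get_header(headers, "content-type"))
print(get_header(headers, "X-Missing"))
print(get_header(headers, "X-Missing", "n/a"))

try:
    headers["X-Missing"]  # direct indexing fails for an absent key
    raised = False
except KeyError:
    raised = True
print(raised)
```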

Handling cookies

import requests
user_agent = 'Mozilla/4.0 (compatible;MSIE 5.5;Windows NT)'
headers = {'User-Agent':user_agent}
r = requests.get('https://www.baidu.com',headers=headers)
for cookie in r.cookies.keys():
    print(cookie+':'+r.cookies.get(cookie))

BAIDUID:29C35B46ABEAF0273D2DC8EB99F1EE42:FG=1
BIDUPSID:29C35B46ABEAF02741391EDC08B8E0B8
H_PS_PSSID:35106_35627_35489_34584_35491_35698_35688_35541_35316_26350_35613_22159
PSTM:1641993903
BDSVRTM:13
BD_HOME:1
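Cookies reach the client through Set-Cookie response headers. As a rough illustration of what such a header contains, this sketch parses the Set-Cookie value from the response headers printed earlier, using the standard library's http.cookies. It does not show how requests handles cookies internally.

```python
from http.cookies import SimpleCookie

# Set-Cookie value copied from the response headers shown earlier.
set_cookie = "BDORZ=27315; max-age=86400; domain=.baidu.com; path=/"

jar = SimpleCookie()
jar.load(set_cookie)

for name, morsel in jar.items():
    # Each morsel holds the cookie value plus its attributes.
    print(name, morsel.value, morsel["max-age"], morsel["domain"], morsel["path"])
```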

Here is a way to handle cookies automatically, so that they carry over from one page to the next:

import requests
loginUrl = 'https://www.baidu.com'
s = requests.Session()
r = s.get(loginUrl,allow_redirects=True)
datas = {'name':'qiye','passwd':'qiye'}
r = s.post(loginUrl,data=datas,allow_redirects=True)
print(r.text[0:10])

Using proxies

import requests
# Keys are URL schemes without a trailing colon; the addresses are placeholders
proxies = {
    'http':'http://0.10.1.10:3128',
    'https':'http://0.10.1.10:1080'
}
requests.get('http://example.org',proxies=proxies)
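The keys of the proxies dictionary matter because the proxy is chosen by the scheme of the URL being requested. The sketch below is a simplified stand-in for that lookup, not requests' actual implementation; pick_proxy and the proxy.example addresses are made up for illustration.

```python
from urllib.parse import urlsplit


def pick_proxy(url, proxies):
    """Return the proxy registered for the URL's scheme, or None (simplified sketch)."""
    return proxies.get(urlsplit(url).scheme)


good = {
    "http": "http://proxy.example:3128",   # made-up proxy addresses
    "https": "http://proxy.example:1080",
}
bad = {
    "http:": "http://proxy.example:3128",  # trailing colon: never matches a scheme
}

print(pick_proxy("http://example.org", good))
print(pick_proxy("https://example.org", good))
print(pick_proxy("http://example.org", bad))
```

With a mistyped key the lookup finds nothing, so the request would go out directly instead of through the proxy, without any error to signal the mistake.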
