Web Crawler Basics

Latest recommended article published on 2024-05-13 19:32:52

sunmlight

Category columns: Crawler, Python. Article tag: Python

Article link: https://blog.csdn.net/qq_39926957/article/details/80322338

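This article introduces the basics of web crawling in Python: sending GET and POST requests with requests, urllib.request, and urllib, accessing sites that use CA certificate verification, custom Handlers and Openers, and regular expressions. It also covers the Requests library, HTML parsing with XPath and BeautifulSoup4, JSON handling, and the use of multithreading in crawlers.

(Summary generated automatically by CSDN.)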

Wrapping request headers with Request

from urllib.request import Request, urlopen

url = 'http://baidu.com'
# Send a browser-like User-Agent instead of the default Python-urllib one
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'}
request = Request(url, headers=headers)
response = urlopen(request)
response_data = response.read()  # raw bytes
print(response_data.decode("utf-8"))
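
Before sending anything, it can help to check what the Request object will actually carry. The following is a minimal sketch that runs without network access; the shortened User-Agent string is a placeholder chosen for illustration, not the value used above.

```python
from urllib.request import Request

url = "http://baidu.com"
# Placeholder User-Agent (an assumption for this sketch)
headers = {"User-Agent": "Mozilla/5.0 (placeholder UA for illustration)"}
request = Request(url, headers=headers)

# urllib stores header names capitalized, so the key becomes "User-agent"
user_agent = request.get_header("User-agent")
method = request.get_method()      # "GET", because no data is attached
full_url = request.full_url

print(user_agent)
print(method)
print(full_url)
```

Only urlopen(request) touches the network, so the object can be built and inspected freely beforehand.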

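GET requests: URL-encoding Chinese characters

Chinese characters in a URL must be percent-encoded (URL-encoded).
from urllib import parse
wd = {"wd": "尚硅谷"}
urlencode = parse.urlencode(wd)  # 'wd=%E5%B0%9A%E7%A1%85%E8%B0%B7'
# Decode back to the original text
decode = parse.unquote("wd=%E5%B0%9A%E7%A1%85%E8%B0%B7")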
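
The encode and decode steps can be checked as a round trip. This is a small sketch; the base URL http://www.baidu.com/s? is an assumption used only to show where the encoded query string would go.

```python
from urllib import parse

query = parse.urlencode({"wd": "尚硅谷"})
print(query)                       # wd=%E5%B0%9A%E7%A1%85%E8%B0%B7

# Assumed base URL, for illustration only
full_url = "http://www.baidu.com/s?" + query
print(full_url)

# unquote() reverses the percent-encoding
decoded = parse.unquote(query)
print(decoded)                     # wd=尚硅谷

# parse_qs() goes one step further and rebuilds a dict of lists
params = parse.parse_qs(query)
print(params)                      # {'wd': ['尚硅谷']}
```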

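POST requests

1. Build the form data as a dict: data = {"name": "zhangsan", "age": "18"}
2. URL-encode it and convert it to bytes: data = parse.urlencode(data).encode("utf-8")
3. Pass the bytes as data; a Request that carries data is sent as POST: request = Request(url, data=data)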
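
The three steps can be put together as follows. This is a sketch that only builds the request and does not send it; the endpoint http://example.com/login is hypothetical.

```python
from urllib import parse
from urllib.request import Request

url = "http://example.com/login"   # hypothetical endpoint, not from the article
form = {"name": "zhangsan", "age": "18"}

# urlencode() returns str; urllib needs bytes for the request body
data = parse.urlencode(form).encode("utf-8")
request = Request(url, data=data)

print(data)                  # b'name=zhangsan&age=18'
print(request.get_method())  # POST, because data is present
```

Calling urlopen(request) would then submit the form.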

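Accessing sites with CA certificate verification from code

1. import ssl
2. context = ssl._create_unverified_context()  (creates a context that skips certificate verification, so use it only for sites you trust)
3. response = urlopen(request, context=context)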
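
To see what the unverified context changes, it can be compared with the default one. A minimal sketch, with no network access; the variable names are chosen for illustration.

```python
import ssl

verified = ssl.create_default_context()          # what urlopen uses for HTTPS by default
unverified = ssl._create_unverified_context()    # private helper used in the steps above

# The default context requires a valid certificate and checks the host name
print(verified.verify_mode == ssl.CERT_REQUIRED, verified.check_hostname)

# The unverified context does neither
print(unverified.verify_mode == ssl.CERT_NONE, unverified.check_hostname)
```

Because _create_unverified_context() is a private function, it is best limited to experiments and to sites whose certificate problems are already understood.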

urllib.request: Handlers and custom Openers

Custom opener
- from urllib.request import HTTPHandler, Request, build_opener
- 1. http_handler = HTTPHandler()  creates an HTTPHandler instance
- 2. opener = build_opener(http_handler)  creates an opener that can handle HTTP requests
- 3. url = 'http://www.baidu.com'  (the scheme is required)
- 4. request = Request(url)
- 5. response = opener.open(request)

ProxyHandler
- from urllib.request import ProxyHandler, build_opener
- proxy_handler = ProxyHandler({"http": "114.67.228.126:16819"})
- opener = build_opener(proxy_handler)
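
An opener is a chain of handlers, and the chain can be inspected without sending a request. The sketch below assumes a placeholder proxy address (127.0.0.1:8080) that does not need to exist, because nothing is opened.

```python
from urllib.request import HTTPHandler, ProxyHandler, build_opener

# Placeholder proxy address (an assumption for this sketch)
proxy_handler = ProxyHandler({"http": "127.0.0.1:8080"})
opener = build_opener(HTTPHandler(), proxy_handler)

# build_opener() adds its default handlers next to the ones passed in
names = [type(h).__name__ for h in opener.handlers]
print(names)
print(proxy_handler.proxies)   # {'http': '127.0.0.1:8080'}
```

A later opener.open(request) would route HTTP traffic through the configured proxy.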

Getting cookies and saving them in a CookieJar() object
- from urllib.request import HTTPCookieProcessor, build_opener
- from http.cookiejar import CookieJar
- 1. cookiejar = CookieJar()  creates a CookieJar instance to store the cookies
- 2. handler = HTTPCookieProcessor(cookiejar=cookiejar)  # create a cookie handler with HTTPCookieProcessor()
- 3. opener = build_opener(handler)
- 4. response = opener.open("http://www.baidu.com/")
- 5. Print the cookies:

cookie_str = ""
for item in cookiejar:
    # print(item)
    cookie_str = cookie_str + item.name + "=" + item.value + ";"
cookie_str = cookie_str[:-1]  # drop the trailing semicolon
print(cookie_str)
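
The loop that builds the cookie string can be tried offline by filling a CookieJar by hand. This is a sketch under stated assumptions: make_cookie is a hypothetical helper, and the cookie names, values, and domain are invented for illustration. Normally the server sets the cookies through HTTPCookieProcessor.

```python
from http.cookiejar import Cookie, CookieJar

def make_cookie(name, value, domain="example.com"):
    """Hypothetical helper: build a session cookie by hand."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )

cookiejar = CookieJar()
cookiejar.set_cookie(make_cookie("sessionid", "abc123"))
cookiejar.set_cookie(make_cookie("theme", "dark"))

# join() avoids having to trim a trailing separator afterwards
cookie_str = "; ".join(item.name + "=" + item.value for item in cookiejar)
print(cookie_str)
```

The resulting string has the shape expected in a Cookie request header.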

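Logging in with a cookie
- Log in through a browser and copy the request headers, including the Cookie
- Use Sublime Text to turn the copied headers into dict entries:

- Regex replace: Find What: ^(.*): (.*)$  Replace With: "\1":"\2",
- from urllib.request import Request, urlopen
- headers = {the processed header entries}
- request = Request(url, headers=headers)
- response = urlopen(request)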
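
The same header conversion can be done in Python instead of an editor. A minimal sketch, assuming the headers were copied as "Name: value" lines; the header values below are invented for illustration.

```python
# Headers as copied from the browser's developer tools (sample values)
raw_headers = """Host: www.example.com
User-Agent: Mozilla/5.0
Cookie: sessionid=abc123; theme=dark
Referer: http://www.example.com/login"""

headers = {}
for line in raw_headers.splitlines():
    # Split at the first ": " only, so values may contain colons
    name, sep, value = line.partition(": ")
    if sep:
        headers[name] = value

print(headers)
```

The resulting dict can be passed directly as Request(url, headers=headers).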

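Requests

import requests
response = requests.get("http://www.baidu.com/", params=params, headers=headers)  # params and headers are dicts
print(response.request)  # the request that was sent, which shows the request type
print(response.content)  # the response body as raw bytes
print(response.text)     # the response body decoded to str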

Saving an image

with open('name.jpg', 'wb') as f:
    # Write the body chunk by chunk, 1024 bytes at a time
    for block in response.iter_content(1024):
        if not block:
            break
        f.write(block)

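Regular expressions

match

import re
content = 'hello world python'
pattern = re.compile(r'python')
print(pattern.match(content))  # None: match() only matches at the start of the string

search/findall

pattern = re.compile(r'\d')
content = 'qwer1234'
print(pattern.search(content))  # the first match only, here '1'
search() returns the first match; findall() returns all matches.

re.S and re.I
- pattern = re.compile(r'\d', re.S)
- re.I ignores case.

- Without re.S, "." does not match a newline, so a pattern such as .* only matches within one line. With re.S the string is treated as a whole and \n is matched like any ordinary character.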
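
The differences between match, search, and findall, and the effect of the two flags, are easiest to see side by side. A short sketch; the html sample string is invented for illustration.

```python
import re

content = "hello world python"
pattern = re.compile(r"python")

at_start = pattern.match(content)             # None: "python" is not at position 0
anywhere = pattern.search(content).group()    # "python"
digits = re.findall(r"\d", "qwer1234")        # every digit, as a list

# Sample markup with a newline inside the tag
html = "<div>line one\nline two</div>"
without_s = re.findall(r"<div>(.*?)</div>", html)        # "." stops at the newline
with_s = re.findall(r"<div>(.*?)</div>", html, re.S)     # "." also matches "\n"

ignore_case = re.search(r"PYTHON", content, re.I).group()

print(at_start, anywhere, digits)
print(without_s, with_s)
print(ignore_case)
```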

Crawling Neihan jokes

# Goal: crawl the jokes at http://www.neihan8.com/article/list_5_1.html
import requests
import re

class Spider(object):
    # Step 1: request the page data
    # page is the number of the page to crawl
    def loader_page(self, page):
        url = "http://www.neihan8.com/article/list_5_" + str(page) + ".html"
        response = requests.get(url)
        return response.content

    def write_file(self, item):
        # Set the encoding explicitly so the text is written the same way on every platform
        with open("neihan8_jokes.txt", "a", encoding="utf-8") as f:
            f.write(item)

if __name__ == "__main__":
    print("""
        The Neihan crawler is starting work
    """)
    page = 1
    spider = Spider()
    while True:
        cmd = input("Press Enter to crawl the next page, or type quit to stop: ")
        if cmd == "quit":
            break  # stop right away instead of crawling one more page
        print("Crawling page [%d]" % page)
        content = spider.loader_page(page)
        content = content.decode("gbk")
        # print(content)
        # Step 2: extract the jokes with a regex and clean the data
        pattern = re.compile(r'<div class="f18 mb20">(.*?)</div>', re.S)
        lists = pattern.findall(content)
        for item in lists:
            item = item.replace("<p>", "").replace("</p>", "").replace("<br />", "").replace("&ldquo;", "").replace("&hellip;", "").replace("&rdquo;", "")
            print(item)
            spider.write_file(item)
        page += 1
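
The extraction and cleaning step can be tested without the site being reachable. The sketch below runs the same pattern on an invented sample page. It swaps the chain of replace() calls for two other techniques: a generic tag-stripping regex and html.unescape(), which converts entities such as &ldquo; into characters instead of deleting them. The clean helper and the sample markup are assumptions for illustration.

```python
import html
import re

# Invented sample that imitates the structure the crawler expects
sample_page = '''
<div class="f18 mb20">
<p>&ldquo;First joke&hellip;&rdquo;</p>
</div>
<div class="other">not a joke</div>
<div class="f18 mb20"><p>Second joke <br />on two lines</p></div>
'''

pattern = re.compile(r'<div class="f18 mb20">(.*?)</div>', re.S)

def clean(fragment):
    """Hypothetical helper: strip tags, decode entities, trim whitespace."""
    text = re.sub(r"<[^>]+>", "", fragment)
    return html.unescape(text).strip()

jokes = [clean(item) for item in pattern.findall(sample_page)]
for joke in jokes:
    print(joke)
```

Keeping the parsing in a function that takes a string makes it possible to check the regex before pointing the crawler at real pages.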

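XML and HTML
- XML: Extensible Markup Language. It is designed to transport and store data, and its focus is the content of the data.
- HTML: HyperText Markup Language. It is designed to display data, and its focus is how the data looks.

HTML DOM: Document Object Model for HTML. Through the HTML DOM you can access every HTML element together with the text and attributes it contains. You can modify and delete that content, and you can also create new elements.

XML nodes
- Parent: every element and attribute has one parent.
- Children: an element node can have zero, one, or more children.
- Sibling: nodes that share the same parent.
- Ancestor: a node's parent, the parent's parent, and so on.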
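XPath definition
XPath (XML Path Language) is a language for finding information in an XML document. It can be used to traverse the elements and attributes of an XML document.

XPath syntax
- nodename  Selects all child nodes of the named node.
- /  Selects from the root node.
- //  Selects matching nodes anywhere in the document from the current node, regardless of their position. (You can start from anywhere and do not have to walk down level by level from the root with /.)
- .  Selects the current node.
- ..  Selects the parent of the current node.
- @  Selects attributes. When used as a condition it goes inside square brackets, as in //title[@lang].
- //@name  Selects all attributes named name.
- Selecting unknown nodes:

- *  Matches any element node.
- @*  Matches any attribute node. It has to be combined with other expressions.
- node()  Matches a node of any kind.

- Operators
- Use | to select several paths at once; see example 11.
- Add +, subtract -, multiply *, divide div: these return the computed result.
- Equal =, not equal !=, less than <, less than or equal <=, or, and: these return true or false.
- mod  Computes the remainder of a division.
- Getting the text content: //name/text()
- Examples:

1. /bookstore/book[1]   Selects the first book element that is a child of bookstore.
2. /bookstore/book[last()]  Selects the last book element that is a child of bookstore.
3. /bookstore/book[last()-1]    Selects the second-to-last book element that is a child of bookstore.
4. /bookstore/book[position()<3]    Selects the first two book elements that are children of bookstore.
5. //title[@lang]   Selects all title elements that have an attribute named lang.
6. /bookstore/book[price>35.00] Selects all book elements of bookstore whose price element has a value greater than 35.00.
7. /bookstore/book[price>35.00]/title   Selects the title elements of all book elements of bookstore whose price element has a value greater than 35.00.
8. /bookstore/* Selects all child elements of bookstore.
9. //*  Selects all elements in the document.
10. //title[@*] Selects all title elements that have at least one attribute.
11. //book/title | //book/price Selects all title and price elements of book elements.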
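
Several of these expressions can be tried with the standard library alone. The sketch below uses xml.etree.ElementTree, which supports only a subset of XPath: positions, last(), attribute tests, and * work, while numeric comparisons such as price>35.00 do not and are done in Python here. Full XPath needs a library such as lxml. The bookstore document is invented for illustration, and paths are written relative to the root element, so book[1] corresponds to /bookstore/book[1].

```python
import xml.etree.ElementTree as ET

# Invented sample document
xml_text = """
<bookstore>
  <book><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book><title lang="en">Learning XML</title><price>39.95</price></book>
  <book><title>Untitled Draft</title><price>45.00</price></book>
</bookstore>
"""
bookstore = ET.fromstring(xml_text)   # the root element, <bookstore>

first = bookstore.find("book[1]/title").text
last = bookstore.find("book[last()]/title").text
second_last = bookstore.find("book[last()-1]/title").text
with_lang = [t.text for t in bookstore.findall(".//title[@lang]")]
children = [child.tag for child in bookstore.findall("*")]

# ElementTree has no numeric comparisons, so price > 35.00 is filtered in Python
expensive = [b.find("title").text for b in bookstore.findall("book")
             if float(b.find("price").text) > 35.00]

print(first, last, second_last)
print(with_lang, children, expensive)
```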

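Using XPath in Python

Use the etree module of lxml to parse and convert documents
- from lxml import etree
- html = etree.HTML(text)  etree.HTML() parses a string into an HTML document (lxml automatically repairs broken HTML)
- result = etree.tostring(html)  etree.tostring() serializes the HTML document back to a string (bytes by default)
- html = etree.parse('./hello.html')  etree.parse() reads a local file
- result = html.xpath('//li')  xpath() extracts the targets; the argument is an XPath expression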

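BeautifulSoup4

Install with pip: pip install beautifulsoup4
Import: from bs4 import BeautifulSoup
soup = BeautifulSoup(text, 'lxml')  creates the soup object (data, parser)

The four kinds of objects:
- Tag
- soup.p : the first p tag
- soup.p.text / soup.p.string : the content of p
- soup.p.attrs : the attributes of p (a dict); soup.p.attrs["class"] gets the value of the class attribute
- soup.p["class"] = "new" : changes an attribute value

NavigableString
- soup.a.string is a NavigableString, whereas .text is a plain str
BeautifulSoup
- soup = BeautifulSoup(text, 'lxml')
Comment is a subclass of NavigableString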
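Traversing the document tree
- Direct children: .contents / .children
- .contents returns the tag's children as a list, so they can be accessed by index such as [1] or [2]
- .children returns an iterator over the same children; loop over it with for to get the content
- All descendants: .descendants
- Iterate to get the content: for child in soup.descendants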

Searching the document tree
- find_all
- The name argument finds all tags whose name is name; string objects are ignored automatically
- A regex as the argument: find_all(re.compile('a'))
- A list as the argument: find_all(["a", "b"])
- Keyword arguments: find by id with find_all(id='idname')
- text argument: searches the string content of the document. Like name, it accepts a string or a regular expression
- href: links = soup.find_all(href=re.compile(r'http://example.com/'))
- CSS selectors (select): soup.select() returns a list
- By tag name: soup.select('title')
- By class name: soup.select('.sister')
- By id: soup.select('#link1')
- Combined, the element with id=link1 inside p: soup.select('p #link1'). For direct children, separate with >: soup.select("head > title")
- By attribute: soup.select('a[class="sister"]')
- Getting the content with get_text(): soup.select('title')[0].get_text()
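
The search methods above can be compared on one small document. A sketch, assuming beautifulsoup4 is installed; it uses the built-in "html.parser" instead of 'lxml' so that no extra parser is required, and the HTML is invented for illustration.

```python
from bs4 import BeautifulSoup

# Invented sample document
html_doc = """
<html><head><title>Demo page</title></head>
<body>
<p class="title"><b>Links</b></p>
<p class="story">
<a href="http://example.com/a" class="sister" id="link1">First</a>
<a href="http://example.com/b" class="sister" id="link2">Second</a>
</p>
</body></html>
"""
# "html.parser" ships with Python; 'lxml' would need a separate install
soup = BeautifulSoup(html_doc, "html.parser")

title_text = soup.title.string
first_p_class = soup.p["class"]                  # class is multi-valued, so a list
all_links = [a["href"] for a in soup.find_all("a")]
by_id = soup.find_all(id="link2")[0].get_text()
by_css = [a.get_text() for a in soup.select("p #link1")]
by_attr = len(soup.select('a[class="sister"]'))
head_title = soup.select("head > title")[0].get_text()

print(title_text, first_p_class, all_links)
print(by_id, by_css, by_attr, head_title)
```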

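JSON and JsonPath

json module
- json.loads(): JSON string to Python object, in memory
- dict_str = json.loads('{"name": "zhangsan"}')  (JSON strings must use double quotes)
- json.dumps(): Python object to JSON string, in memory
- json.dump(): Python object to JSON, written to a file
- list_str = [{"city": "北京"}, {"name": "大刘"}]
- fw = open("list_str.json", "w", encoding="utf-8")
- json.dump(list_str, fw, ensure_ascii=False)
- json.load(): JSON to Python object, read from a file

JsonPath
- Syntax:
- Root node: $
- Current node: @
- Child node: . or []
- Select everything that matches regardless of position: .. (equivalent to // in XPath)
- Match all element nodes: *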
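
The four functions form two pairs: loads/dumps work on strings, load/dump work on files. A minimal sketch of both round trips; the temporary directory is an assumption so that the example does not write into the working directory.

```python
import json
import os
import tempfile

# String <-> object, in memory
data = json.loads('{"name": "zhangsan"}')
list_str = [{"city": "北京"}, {"name": "大刘"}]
escaped = json.dumps(list_str)                        # non-ASCII becomes \uXXXX
readable = json.dumps(list_str, ensure_ascii=False)   # keeps the characters as-is

# Object <-> file (temporary location chosen for this sketch)
path = os.path.join(tempfile.mkdtemp(), "list_str.json")
with open(path, "w", encoding="utf-8") as fw:
    json.dump(list_str, fw, ensure_ascii=False)
with open(path, encoding="utf-8") as fr:
    loaded = json.load(fr)

print(data)
print(escaped)
print(readable)
print(loaded == list_str)
```

Using with for the file handles ensures the file is closed and flushed before it is read back.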
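
To make the meaning of .. concrete, the following sketch imitates an expression like $..name with a hand-written recursive function. It is not the jsonpath library and supports nothing else from the syntax; find_key and the sample document are hypothetical.

```python
def find_key(node, key):
    """Collect every value stored under `key` at any depth (like $..key)."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            found.extend(find_key(v, key))   # keep descending into the value
    elif isinstance(node, list):
        for item in node:
            found.extend(find_key(item, key))
    return found

# Invented sample document
doc = {"store": {"book": [{"name": "A", "price": 10},
                          {"name": "B", "price": 20}],
                 "owner": {"name": "C"}}}

names = find_key(doc, "name")
prices = find_key(doc, "price")
print(names, prices)
```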

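Multithreading

Multithreading improves efficiency
Thread-safety issue: solved with a queue (Queue)
Saving files from several threads needs thread synchronization to keep the data safe: use a mutex lock
- Creating a queue:
- from queue import Queue  # import
- page_queue = Queue(10)  # the queue holds at most 10 items
- page_queue.put(page)  # put page into the queue
- page_queue.get(block=False)  # take an item out; block=False makes the call non-blocking, so it raises queue.Empty once the queue is empty
- Creating a thread:
- from threading import Thread
- crawl = ThreadCrawl(arguments)
- crawl.start()

  class ThreadParse(Thread):
      def __init__(self, arg):
          super(ThreadParse, self).__init__()
          self.arg = arg
      def run(self):
          # the code the thread executes
          pass
- Mutex lock
- from threading import Lock
- lock = Lock()
- with lock:  (the indented block runs while the lock is held)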
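
The queue, the Thread subclass, and the lock fit together as follows. This is a sketch under stated assumptions: the download is replaced by a fake string so the example runs offline, and the class layout and parameter names are illustrative.

```python
from queue import Empty, Queue
from threading import Lock, Thread

class ThreadCrawl(Thread):
    def __init__(self, page_queue, results, lock):
        super().__init__()
        self.page_queue = page_queue
        self.results = results
        self.lock = lock

    def run(self):
        while True:
            try:
                page = self.page_queue.get(block=False)
            except Empty:
                break   # the queue is drained, so this worker stops
            fake_html = "<html>page %d</html>" % page   # stand-in for a real download
            with self.lock:   # only one thread appends at a time
                self.results.append((page, fake_html))

page_queue = Queue(10)
for page in range(1, 6):
    page_queue.put(page)

results, lock = [], Lock()
threads = [ThreadCrawl(page_queue, results, lock) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()   # wait until every worker has finished

pages_done = sorted(page for page, _ in results)
print(pages_done)
```

The order in which pages finish varies between runs, which is why the result is sorted before printing.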

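About crawlers
1. Make as few requests as possible
2. Mobile apps and H5 pages tend to have fewer anti-crawler measures
3. The defending side usually limits request frequency per IP
4. Multithreading or even distributed crawlers can be used to improve performance

About anti-crawling
1. The backend counts visits and blocks any single IP that exceeds a threshold
2. The backend counts visits and blocks any single session that exceeds a threshold
3. The backend counts visits and blocks any single User-Agent that exceeds a threshold
4. Dynamic HTML: JavaScript, jQuery, Ajax, DHTML

Handling CAPTCHAs on dynamic sites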
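
The first three measures share one mechanism: count visits per key and block once a threshold is passed. A deliberately simplified sketch of that idea; VisitCounter, the threshold of 3, and the IP addresses are assumptions, and a real backend would also reset the counts per time window.

```python
from collections import Counter

class VisitCounter:
    """Hypothetical counter: the key can be an IP, a session id, or a User-Agent."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.hits = Counter()

    def allow(self, key):
        self.hits[key] += 1
        return self.hits[key] <= self.threshold

limiter = VisitCounter(threshold=3)   # illustrative threshold
decisions = [limiter.allow("10.0.0.1") for _ in range(5)]
other = limiter.allow("10.0.0.2")     # a different key has its own count

print(decisions)
print(other)
```

Seen from the crawler's side, this is why fewer requests per IP, session, and User-Agent lower the chance of being blocked.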

Test module

import time
# Import Python's unit testing module
import unittest

# The class name is arbitrary, but it must inherit from unittest.TestCase
class DouyuTest(unittest.TestCase):
    # Fixed name; runs before every test method and usually does initialization
    def setUp(self):
        print("setUp()....")
        self.num1 = 1
        self.num2 = 1

    def testTest(self):
        for i in range(1, 3):
            print("testTest()==", i)
            self.num1 += 1
            time.sleep(1)

    def testTest2(self):
        for i in range(1, 3):
            print("testTest2()==", i)
            self.num2 += 1
            time.sleep(1)

    # Fixed name; called once after every test method finishes
    def tearDown(self):
        print("tearDown()...")
        print("num1==", self.num1)
        print("num2==", self.num2)

if __name__ == "__main__":
    # Fixed way to run the tests: just call unittest.main()
    unittest.main()
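
Because setUp() runs before every test method, changes made in one test do not leak into another. The sketch below shows this with assertions and runs the suite programmatically instead of through unittest.main(); the CounterTest class is an illustrative assumption.

```python
import io
import unittest

class CounterTest(unittest.TestCase):
    def setUp(self):
        self.num = 1   # reset before each test method

    def test_increment_twice(self):
        self.num += 2
        self.assertEqual(self.num, 3)

    def test_starts_fresh(self):
        # Still 1, whatever the other test did to its own instance
        self.assertEqual(self.num, 1)

suite = unittest.TestLoader().loadTestsFromTestCase(CounterTest)
# Send the runner's report to a buffer to keep the output short
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
print(result.testsRun, result.wasSuccessful())
```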
