Python 3 Web Crawler Study Notes

Latest recommended article published on 2023-01-25 16:21:28

KILOLOKI

Tags: python3, web crawler, basic syntax

Article link: https://blog.csdn.net/qq_37688437/article/details/81585026

I. Prerequisites

1 Tools

JetBrains PyCharm, Fiddler, MySQL Workbench

2 Libraries and packages

urllib, re, pymysql, lxml, time

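II. Disguising the crawler as a browser

1 Setting headers

Each header field has the basic form "field name: field value".

Field 1 Accept (content types the browser accepts)
Field 2 Accept-Encoding (compression encodings the browser supports)
Field 3 Accept-Language (languages the browser accepts)
Field 4 User-Agent (identifies the client software)
Field 5 Connection (type of connection between client and server)
Field 6 Host (host name of the server being requested)
Field 7 Referer (the page the request came from)

The value of each field can be read from a real browser request captured in Fiddler.
Setting User-Agent and Accept is usually enough.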

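import urllib.request

# Option 1: attach the header to an opener
url = '        '
headers = ('User-Agent', '       ')
opener = urllib.request.build_opener()
opener.addheaders = [headers]  # addheaders is a list attribute, not a method
data = opener.open(url).read()

# Option 2: attach the header to a Request object

url = '       '
req = urllib.request.Request(url)
req.add_header('User-Agent', '         ')
data = urllib.request.urlopen(req).read()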
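
As a minimal sketch of the Request-based approach, headers can also be passed as a dictionary when the Request is created and inspected before anything is sent. The function name `build_request`, the example URL, and the User-Agent string are illustrative assumptions, not values from these notes.

```python
import urllib.request


def build_request(url, user_agent):
    """Return a Request that carries browser-like headers (nothing is sent yet)."""
    headers = {
        "User-Agent": user_agent,
        "Accept": "*/*",
    }
    return urllib.request.Request(url, headers=headers)


# Hypothetical values, for illustration only
req = build_request("http://example.com/", "Mozilla/5.0 (X11; Linux x86_64)")

# urllib stores header names in capitalized form, so the key is "User-agent"
print(req.get_header("User-agent"))
print(req.header_items())
```

Passing the finished `req` to `urllib.request.urlopen()` would then send the request with these headers.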

2 Using a proxy server

Proxy address format: "IP:port"

import urllib.request

url = '    '
IP = '    '
PORT = '    '
proxy_addr = IP + ':' + PORT
# ProxyHandler maps a URL scheme to the proxy address used for it
proxy = urllib.request.ProxyHandler({'http': proxy_addr})
opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
# Make this opener the default for later urlopen() calls
urllib.request.install_opener(opener)
data = urllib.request.urlopen(url).read().decode('utf-8')
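
If installing a global opener is not wanted, one option is to keep the proxy opener in a variable and call its `open()` method directly. The sketch below only builds the opener and checks its configuration; the helper name `make_proxy_opener` and the address 127.0.0.1:8888 are assumptions for illustration.

```python
import urllib.request


def make_proxy_opener(ip, port):
    """Build an opener that routes http:// requests through ip:port."""
    proxy_addr = f"{ip}:{port}"
    proxy = urllib.request.ProxyHandler({"http": proxy_addr})
    return urllib.request.build_opener(proxy)


# Hypothetical proxy address; replace it with a proxy you are allowed to use
opener = make_proxy_opener("127.0.0.1", 8888)

# Confirm that the proxy handler is part of the opener's handler chain
proxy_handlers = [h for h in opener.handlers
                  if isinstance(h, urllib.request.ProxyHandler)]
print(proxy_handlers[0].proxies)

# opener.open(url) would now send http:// requests through the proxy
```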

3 Using helper tools

For example, Selenium + PhantomJS.

III. Exception handling

1 Setting a timeout

import urllib.request

url = '   '
data = urllib.request.urlopen(url, timeout=1).read().decode('utf-8')

2 Catching exceptions

import urllib.request
import urllib.error

url = '   '
try:
    urllib.request.urlopen(url)
except urllib.error.URLError as e:
    # HTTPError (a subclass of URLError) also carries a status code
    if hasattr(e, "code"):
        print(e.code)
    if hasattr(e, "reason"):
        print(e.reason)
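
The `hasattr` checks work because `urllib.error.HTTPError` is a subclass of `URLError` that adds a `code` attribute. An alternative sketch tells the two apart with `isinstance`. The helper `describe_error` is hypothetical, and the exceptions are constructed by hand so the example runs without a network connection.

```python
import urllib.error


def describe_error(exc):
    """Turn a urllib exception into a short message."""
    # Check HTTPError first: it is a subclass of URLError
    if isinstance(exc, urllib.error.HTTPError):
        return f"HTTP error {exc.code}: {exc.reason}"
    if isinstance(exc, urllib.error.URLError):
        return f"URL error: {exc.reason}"
    return f"unexpected error: {exc!r}"


# Hand-built exceptions standing in for what urlopen() would raise
http_err = urllib.error.HTTPError(
    "http://example.com/missing", 404, "Not Found", None, None)
url_err = urllib.error.URLError("connection refused")

print(describe_error(http_err))
print(describe_error(url_err))
```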

3 DebugLog

import urllib.request

url = '      '
httphd = urllib.request.HTTPHandler(debuglevel = 1)
httpshd = urllib.request.HTTPSHandler(debuglevel = 1)
opener = urllib.request.build_opener(httphd, httpshd)
urllib.request.install_opener(opener)
data = urllib.request.urlopen(url)

IV. HTTP requests

1 Types of HTTP request

GET request
POST request
PUT request
DELETE request
HEAD request
OPTIONS request

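2 GET requests

A GET request carries its fields in the URL: http://address?field1=value1&field2=value2...

import urllib.request
import urllib.parse

# key is the search keyword
key = '   '
keyword = urllib.parse.quote(key)  # percent-encode the keyword for use in a URL
url = 'http://www.baidu.com/s?wd=' + keyword
data = urllib.request.urlopen(url)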
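
When a GET request has several fields, `urllib.parse.urlencode` can build the whole query string from a dictionary. A minimal sketch follows; the helper name `build_search_url` and the keyword are illustrative assumptions, and no request is actually sent.

```python
import urllib.parse


def build_search_url(keyword, base="http://www.baidu.com/s"):
    """Return base?wd=<keyword> with the keyword safely encoded."""
    query = urllib.parse.urlencode({"wd": keyword})
    return f"{base}?{query}"


url = build_search_url("web crawler")
print(url)

# Decoding the query string gives the original field back
fields = urllib.parse.parse_qs(urllib.parse.urlsplit(url).query)
print(fields)
```

Note that `urlencode` writes a space as `+`, whereas `quote` writes it as `%20`; both forms are decoded back to a space by `parse_qs`.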

3 POST requests

The crawler submits information by POSTing form data.

import urllib.request
import urllib.parse


url = '   '
postdata=urllib.parse.urlencode({"name":"ceo@iqianyue.com","pass":"aA123456"}).encode('utf-8')
req = urllib.request.Request(url, postdata)
req.add_header('User-Agent',"      ")
data = urllib.request.urlopen(req).read()
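
To see what such a request contains before sending it, the Request object can be inspected. In this sketch the URL and the form fields are hypothetical placeholders; the point is that supplying `data` (as bytes) is what turns the request into a POST.

```python
import urllib.parse
import urllib.request

# Hypothetical form fields and URL, for illustration only
form = {"name": "alice", "pass": "secret"}
postdata = urllib.parse.urlencode(form).encode("utf-8")  # must be bytes
req = urllib.request.Request("http://example.com/login", data=postdata)

print(req.get_method())  # the method urllib will use for this request
print(req.data)          # the encoded request body
```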

V. Regular expressions

1 Common non-printing characters

Symbol	Meaning
\n	Matches a newline
\t	Matches a tab

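2 Common character classes and their meanings

Symbol	Meaning
\w	Matches any single letter, digit, or underscore
\W	Matches any single character other than a letter, digit, or underscore
\d	Matches any single decimal digit
\D	Matches any single character other than a decimal digit
\s	Matches any single whitespace character
\S	Matches any single non-whitespace character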

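3 Common metacharacters and their meanings

Symbol	Meaning
.	Matches any character except a newline
^	Matches the start of the string
$	Matches the end of the string
*	Matches the preceding atom zero or more times
?	Matches the preceding atom zero or one time
+	Matches the preceding atom one or more times
{n}	The preceding atom occurs exactly n times
{n,}	The preceding atom occurs at least n times
{n,m}	The preceding atom occurs at least n and at most m times
|	Alternation (matches either alternative)
()	Grouping (treats the enclosed pattern as one unit)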

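4 Common pattern modifiers (flags) and their meanings

Symbol	Meaning
I	Ignore case when matching
M	Multi-line matching
L	Locale-aware matching
U	Interpret characters according to the Unicode character set
S	Let . match newlines as well, so that "." matches any character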
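
In Python these modifiers are passed as flags such as `re.I`, `re.M`, and `re.S`, and can be combined with `|`. A short sketch on a made-up two-line string:

```python
import re

text = "Python\npython"  # sample text, chosen for illustration

# No flags: matching is case-sensitive and ^ only matches at the very start
no_flags = re.findall("^python", text)

# re.I ignores case; re.M lets ^ match at the start of every line
with_flags = re.findall("^python", text, re.I | re.M)

# Without re.S the dot cannot cross the newline; with re.S it can
dot_default = re.search("Python.python", text)
dot_all = re.search("Python.python", text, re.S)

print(no_flags, with_flags)
print(dot_default is None, dot_all is not None)
```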

5 Greedy and lazy matching

Greedy matching consumes as much text as possible.
Lazy matching consumes as little text as possible.

import re
pattern1 = "p.*y"   # greedy
pattern2 = "p.*?y"  # lazy
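
Running both patterns against the same string makes the difference visible. The sample string here is an assumption chosen so that it contains more than one "y" after the first "p".

```python
import re

text = "poythonyhjskjsa"  # sample string, for illustration only

# Greedy: runs from the first "p" to the LAST "y" it can reach
greedy = re.search("p.*y", text).group()

# Lazy: stops at the FIRST "y" after the "p"
lazy = re.search("p.*?y", text).group()

print(greedy)
print(lazy)
```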

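6 Common functions

re.match()
re.search()
Global matching functions (for example re.findall())
re.sub()

match() and search() differ in where they look: match() only matches at the start of the string, while search() scans the whole string for the first match.
Note: when writing a regular expression, wrap the part you want to extract in parentheses.
Some commonly used regular expressions:

Email address	^\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$
URL	[a-zA-Z]+://[^\s]* or ^http://([\w-]+\.)+[\w-]+(/[\w./?%&=-]*)?$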
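
The functions listed in this section can be compared side by side on one string. This is a small sketch with made-up sample text and a made-up HTML fragment; the last call shows how parentheses limit what `findall` returns.

```python
import re

text = "hellomypythonhispythonourpythonend"  # sample text

at_start = re.match("python", text)    # match() only looks at position 0
anywhere = re.search("python", text)   # search() finds the first occurrence
every = re.findall("p.*?y", text)      # findall() returns every match
replaced = re.sub("python", "php", text, count=2)  # replace the first two

# Only the parenthesized part of each match is returned
html = '<a href="/a.html">A</a><a href="/b.html">B</a>'
links = re.findall(r'<a href="(.*?)">', html)

print(at_start, anywhere.span())
print(every, replaced, links)
```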

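VI. Cookie

Cookies preserve session information (a login, for example) across requests.
Steps:
- Import the cookie handling module http.cookiejar
- Create a CookieJar object with http.cookiejar.CookieJar()
- Create a cookie processor with HTTPCookieProcessor and pass it to build_opener() to build an opener
- Install that opener as the global default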

import urllib.request
import urllib.parse
import http.cookiejar

url = '  '
postdata = urllib.parse.urlencode({"  ":"  ","  ":"  "}).encode('utf-8')
req = urllib.request.Request(url, postdata)
req.add_header('User-Agent','    ')
cjar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cjar))
urllib.request.install_opener(opener)
data = opener.open(req).read()
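
The same steps can be wrapped in a small helper that returns both the opener and its cookie jar, so the stored cookies can be examined later. This is a sketch only: `make_cookie_opener` is a hypothetical name, and no request is sent, so the jar starts out empty.

```python
import http.cookiejar
import urllib.request


def make_cookie_opener():
    """Return an opener that remembers cookies, together with its jar."""
    cjar = http.cookiejar.CookieJar()
    processor = urllib.request.HTTPCookieProcessor(cjar)
    opener = urllib.request.build_opener(processor)
    return opener, cjar


opener, cjar = make_cookie_opener()
print(len(cjar))  # number of cookies currently stored in the jar

# After opener.open(login_request), cookies set by the server are kept in
# cjar and sent automatically on later opener.open() calls:
# for cookie in cjar:
#     print(cookie.name, cookie.value)
```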

VII. XPath

Sample document

<?xml version="1.0" encoding="ISO-8859-1"?>

<bookstore>

<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>

<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>

</bookstore>

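Examples

Path expression	Result
bookstore	Selects all elements named bookstore (relative to the current node).
/bookstore	Selects the root element bookstore. Note: a path that starts with a slash ( / ) always represents an absolute path to an element.
bookstore/book	Selects all book elements that are children of bookstore.
//book	Selects all book elements, wherever they are in the document.
bookstore//book	Selects all book elements that are descendants of bookstore, wherever they sit beneath it.
//@lang	Selects all attributes named lang.
/bookstore/book[1]	Selects the first book element that is a child of bookstore.
/bookstore/book[last()]	Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]	Selects the second-to-last book element that is a child of bookstore.
//title[@lang]	Selects all title elements that have an attribute named lang.
//title[@lang='eng']	Selects all title elements whose lang attribute has the value eng.
/bookstore/*	Selects all child elements of bookstore.
//*	Selects all elements in the document.
//title[@*]	Selects all title elements that have at least one attribute.
//book/title | //book/price	Selects all title and price elements of all book elements.
//title | //price	Selects all title and price elements in the document.
/bookstore/book/title | //price	Selects all title elements of the book elements under bookstore, plus all price elements in the document.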
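
A few of these expressions can be tried against the sample document. To keep the sketch free of third-party dependencies it uses the standard library's `xml.etree.ElementTree` rather than lxml; ElementTree supports only a subset of XPath (no attribute selection such as //@lang, no | union), and paths are written relative to the root element. The XML declaration is omitted from the string for simplicity.

```python
import xml.etree.ElementTree as ET

DOC = """
<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>
"""

root = ET.fromstring(DOC)  # root is the bookstore element

# Counterpart of /bookstore/book[1]/title
first_title = root.find("./book[1]/title").text
# Counterpart of /bookstore/book[last()]/price
last_price = root.find("./book[last()]/price").text
# Counterpart of //title[@lang='eng']
eng_titles = [t.text for t in root.findall(".//title[@lang='eng']")]

print(first_title, last_price, eng_titles)
```

With lxml, the full expressions from the table can be passed unchanged to the `xpath()` method of a parsed document.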

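Code example

Collect the URLs of every category listed in the real estate law section of 法律快车 (lawtime.cn).

import urllib.request
from lxml import etree

# addheaders expects a list of (name, value) tuples, not a dict
headers = [("Accept", '*/*'),
           ('User-Agent', "    ")]
opener = urllib.request.build_opener()
opener.addheaders = headers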
urllib.request.install_opener(opener)

url = 'http://www.lawtime.cn/info/fangdichan/list.html'
data = urllib.request.urlopen(url).read()
html = etree.HTML(data)
urllist = html.xpath('//div[@class="navigation-block-item clearfix"]/ul//li/a/@href')
print(urllist)

VIII. MySQL database

1 Creating a table (the database was created beforehand in Workbench)

import pymysql

db = pymysql.connect(host='127.0.0.1', port=3306, user='root', passwd='CLiu123#', db='mypydb', charset='utf8')
cursor = db.cursor()
sql_create = "create table fcfalvkc(id int not null auto_increment, question text not null, answer text not null, primary key (id));"
cursor.execute(sql_create)
db.commit()

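2 Inserting data

import pymysql

db = pymysql.connect(host='127.0.0.1', port=3306, user='root', passwd='     ', db='mypydb', charset='utf8')
cursor = db.cursor()
question = '     '  # text to store (placeholder)
answer = '     '    # text to store (placeholder)
# Use %s placeholders so the driver quotes the values, instead of concatenating strings
sql_insert = "insert into fcfalvkc(question, answer) values (%s, %s)"    # SQL statement
cursor.execute(sql_insert, (question, answer))
db.commit()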

3 Querying data

import pymysql

db = pymysql.connect(host='127.0.0.1', port=3306, user='root', passwd='     ', db='mypydb', charset='utf8')
cursor = db.cursor()
sql_select = '''select * from fcfalvkc'''    # SQL statement
cursor.execute(sql_select)
result = cursor.fetchall()
for row in result:
    id = row[0]
    question11 = row[1]
    answer11 = row[2]
    print("id = %d, question = %s, answer = %s" % (id, question11, answer11))
print("end")
db.close()
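
The create, insert, and query steps can be rehearsed without a MySQL server. The sketch below swaps in the standard library's `sqlite3` with an in-memory database, which is an assumption made purely so the example is self-contained: the table definition is adapted to SQLite syntax, the placeholder is `?` (PyMySQL uses `%s`), and the sample rows are invented.

```python
import sqlite3

db = sqlite3.connect(":memory:")  # throwaway in-memory database
cursor = db.cursor()
cursor.execute(
    "create table fcfalvkc("
    "id integer primary key autoincrement, "
    "question text not null, "
    "answer text not null)"
)

# Invented sample rows; the quotes inside them are handled by the driver
rows = [
    ("What is a deed?", "A document that transfers ownership."),
    ("It's a lease?", 'He said "yes".'),
]
cursor.executemany(
    "insert into fcfalvkc(question, answer) values (?, ?)", rows)
db.commit()

cursor.execute("select id, question, answer from fcfalvkc order by id")
result = cursor.fetchall()
for row_id, question, answer in result:
    print(f"id = {row_id}, question = {question}, answer = {answer}")
db.close()
```

Passing values separately from the SQL text means quotation marks in scraped text cannot break the statement.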

IX. Multithreading basics

import threading

class A(threading.Thread):
    def __init__(self):
        # Initialize the thread
        threading.Thread.__init__(self)
    def run(self):
        # Work performed by this thread
        for i in range(10):
            print("I am thread A")
class B(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
    def run(self):
        for i in range(10):
            print("I am thread B")
# Create an instance of thread A as t1
t1 = A()
# Start thread t1
t1.start()
# Create an instance of thread B as t2
t2 = B()
# Start thread t2
t2.start()
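
In a crawler, threads usually share a list of pages to fetch. One possible arrangement, sketched here under stated assumptions, has worker threads pull URLs from a `queue.Queue` and the main thread wait for them with `join()`. The function `fake_fetch` is a hypothetical stand-in for a real download, and the URLs are invented.

```python
import queue
import threading


def fake_fetch(url):
    """Stand-in for a real download; hypothetical helper for this sketch."""
    return f"content of {url}"


def worker(tasks, results, lock):
    """Take URLs from the queue until it is empty."""
    while True:
        try:
            url = tasks.get_nowait()
        except queue.Empty:
            return
        page = fake_fetch(url)
        with lock:  # protect the shared dict while writing
            results[url] = page


urls = [f"http://example.com/page/{i}" for i in range(5)]
tasks = queue.Queue()
for url in urls:
    tasks.put(url)

results = {}
lock = threading.Lock()
threads = [threading.Thread(target=worker, args=(tasks, results, lock))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait until every worker has finished

print(len(results))
```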