1. Parsing and extraction (BeautifulSoup, json)
When the data is embedded in the page's HTML source, parse it with BeautifulSoup; if the text comes back garbled, set the encoding manually with response.encoding = 'xxx'.
When the data is hidden in an XHR response, parse it as JSON.
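A minimal sketch of both cases; the URLs and the 'h2' tag are hypothetical placeholders, not from the notes:

import requests
from bs4 import BeautifulSoup

# Case 1: the data sits in the HTML source -- parse it with BeautifulSoup.
response = requests.get('https://example.com')          # hypothetical URL
response.encoding = 'utf-8'                             # set manually if the text is garbled
soup = BeautifulSoup(response.text, 'html.parser')
for title in soup.find_all('h2'):                       # 'h2' is an assumed tag
    print(title.text)

# Case 2: the data arrives via XHR -- parse the JSON body instead.
api_response = requests.get('https://example.com/api')  # hypothetical endpoint
data = api_response.json()                              # Python dict/list parsed from JSON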
2. More powerful requests (get, post, cookies)
- requests.get() with params: send query parameters along with the request, e.g. which page do I want, what keyword am I searching for, how many results do I want?
- requests.get() with headers: the request headers.
- get sends its parameters openly (in the URL); post does not.
- requests.post() with data: used much like params.
- cookies: make the server "remember you".
Example:
import requests

# Log in with a POST request, then reuse the cookies it returns.
url_1 = 'https://…'
headers = {'user-agent': ''}
data = {}
login_in = requests.post(url_1, headers=headers, data=data)
cookies = login_in.cookies

# Carry the cookies on a later GET request so the server "remembers" us.
url_2 = 'https://…'
params = {}
response = requests.get(url_2, headers=headers, params=params, cookies=cookies)
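Alternatively, requests.Session keeps cookies across requests automatically, so they don't have to be passed around by hand; a sketch with the same placeholder URLs:

import requests

session = requests.Session()                    # a Session stores cookies between requests
session.headers.update({'user-agent': ''})      # headers shared by every request
session.post('https://…', data={})              # log in; the session keeps the cookies
response = session.get('https://…', params={})  # later requests send them automatically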
3. Storage (csv, openpyxl)
csv
import csv

# Write a CSV file; newline='' avoids blank lines on Windows.
csv_file = open('demo.csv', 'w', newline='')
writer = csv.writer(csv_file)
writer.writerow(['Movie', 'Douban rating'])
csv_file.close()
import csv

# Read the CSV file back row by row.
csv_file = open('demo.csv', 'r', newline='')
reader = csv.reader(csv_file)
for row in reader:
    print(row)
csv_file.close()
Excel files
import openpyxl

# Create a workbook, rename the active sheet, and write some cells.
wb = openpyxl.Workbook()
sheet = wb.active
sheet.title = 'new title'
sheet['A1'] = 'Marvel Universe'
rows = [['Captain America', 'Iron Man', 'Spider-Man', 'Thor'],
        ['are', 'Marvel', 'Universe', 'classic', 'characters']]
for row in rows:
    sheet.append(row)      # append each list as a new row
print(rows)
wb.save('Marvel.xlsx')
import openpyxl

# Reopen the workbook and read the data back.
wb = openpyxl.load_workbook('Marvel.xlsx')
sheet = wb['new title']
sheetname = wb.sheetnames      # list of all sheet names in the workbook
print(sheetname)
A1_value = sheet['A1'].value
print(A1_value)
4. More crawlers (coroutines/gevent, queue)
from gevent import monkey
monkey.patch_all()       # patch the standard library first, before importing requests

import time
import gevent
import requests
from gevent.queue import Queue

start = time.time()

url_list = ['https://www.baidu.com/',
            'https://www.sina.com.cn/',
            'http://www.sohu.com/',
            'https://www.qq.com/',
            'https://www.163.com/',
            'http://www.iqiyi.com/',
            'https://www.tmall.com/',
            'http://www.ifeng.com/']

# Put every URL into a queue that the workers share.
work = Queue()
for url in url_list:
    work.put_nowait(url)

def crawler():
    # Each worker keeps taking URLs until the queue is empty.
    while not work.empty():
        url = work.get_nowait()
        r = requests.get(url)
        print(url, work.qsize(), r.status_code)

# Spawn 2 coroutines and wait for all of them to finish.
tasks_list = []
for x in range(2):
    task = gevent.spawn(crawler)
    tasks_list.append(task)
gevent.joinall(tasks_list)

end = time.time()
print(end - start)
5. A more powerful crawler (the Scrapy framework)
Scrapy's structure; how Scrapy works; how to use Scrapy (a minimal spider sketch follows).
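The notes stop at the topic list, so here is a minimal spider sketch; the spider name, start URL, and CSS selector are all hypothetical:

import scrapy

class TitleSpider(scrapy.Spider):
    # Every name below is hypothetical; adapt it to the real target site.
    name = 'titles'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Scrapy calls parse() with each downloaded response.
        for title in response.css('h2::text').getall():
            yield {'title': title}

Saved as title_spider.py, it can be run without a full project via: scrapy runspider title_spider.py -o titles.json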
6. Giving the crawler wings (selenium; email: smtplib + email; scheduling: schedule)
selenium: methods for extracting data; the conversion between driver, element, and string objects; getting the page source as a string: html_source = driver.page_source (see the sketch below).
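A minimal selenium sketch (the URL and selector are hypothetical; assumes a Chrome driver is available):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()                           # assumes ChromeDriver is installed
driver.get('https://example.com')                     # hypothetical URL
element = driver.find_element(By.CSS_SELECTOR, 'h1')  # a WebElement object
print(element.text)                                   # element -> string: extract its text
html_source = driver.page_source                      # the whole page as one string
driver.quit()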
Methods for automating the browser. Email (smtplib + email): the workflow, with example code:
import smtplib
from email.mime.text import MIMEText
from email.header import Header

# Connect to the SMTP server (QQ mail, port 25) and log in.
mailhost = 'smtp.qq.com'
qqmail = smtplib.SMTP()
qqmail.connect(mailhost, 25)
account = input('Enter your email address: ')
password = input('Enter your password: ')
qqmail.login(account, password)

# Build the message: plain-text body, UTF-8 subject header.
receiver = input("Enter the recipient's email address: ")
content = input('Enter the message body: ')
message = MIMEText(content, 'plain', 'utf-8')
subject = input('Enter the subject line: ')
message['Subject'] = Header(subject, 'utf-8')

try:
    qqmail.sendmail(account, receiver, message.as_string())
    print('Email sent successfully')
except smtplib.SMTPException:
    print('Failed to send the email')
qqmail.quit()
Scheduling (schedule):

import schedule
import time

def job():
    print("I'm working...")

# Register the job on several schedules.
schedule.every(10).minutes.do(job)              # every 10 minutes
schedule.every().hour.do(job)                   # every hour
schedule.every().day.at("10:30").do(job)        # every day at 10:30
schedule.every().monday.do(job)                 # every Monday
schedule.every().wednesday.at("13:15").do(job)  # every Wednesday at 13:15

# Loop forever, running any jobs that are due.
while True:
    schedule.run_pending()
    time.sleep(1)
7. A roadmap for taking crawlers further
- Parsing and extraction: parsing libraries such as xpath/lxml; regular expressions (the re module)
- Storage: the MySQL and MongoDB libraries; the SQL language
- Data analysis and visualization: Pandas / Matplotlib / NumPy / Scikit-Learn / SciPy
- More crawlers: multiprocessing (the multiprocessing library)
- More powerful crawlers (frameworks): Scrapy with simulated login, database storage, HTTP proxies, and distributed crawling; the PySpider framework
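As a taste of the first item on the path, a minimal lxml/XPath sketch; the HTML snippet is made up for illustration:

from lxml import etree

# Parse a made-up HTML snippet and query it with XPath.
html = etree.HTML('<html><body><h1>Hello</h1><p>world</p></body></html>')
print(html.xpath('//h1/text()'))   # -> ['Hello']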