Based on Cui Qingcai's book 《网络爬虫开发实战》 (Web Crawler Development in Practice), I record my key learning points as follows:
1. Analyze the Ajax request. Build the request URL from a params dict with urlencode (the encoded result is shown after the snippet below).
params = {
    'offset': offset,
    'format': 'json',
    'keyword': '电影',
    'autoload': 'true',
    'count': '20',
    'cur_tab': '3',
    'from': 'gallery'
}
url = 'https://www.toutiao.com/search_content/?' + urlencode(params)
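As a quick check of what urlencode produces: non-ASCII values such as '电影' are percent-encoded as their UTF-8 bytes, which is exactly the keyword=%E7%94%B5%E5%BD%B1 seen in the browser's request URL.

from urllib.parse import urlencode

# '电影' becomes its UTF-8 bytes, percent-encoded; plain ASCII passes through.
print(urlencode({'keyword': '电影', 'offset': 0}))
# -> keyword=%E7%94%B5%E5%BD%B1&offset=0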

2. The site's code changed and the image_detail field was removed, so the images could no longer be scraped in a single pass. The crawl strategy was therefore split into two steps: first traverse the search page and collect each gallery's URL, then fetch every image under each gallery URL, as in the sketch below.
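A minimal sketch of the two-step flow, reusing the function names from the full code at the end of this post (treat it as pseudocode until those functions are defined):

def crawl_galleries(offsets):
    # Step 1: traverse the search pages and collect each gallery's URL.
    gallery_urls = []
    for offset in offsets:
        page_json = get_page(offset)              # search-result JSON
        gallery_urls.extend(get_movie_url(page_json))
    # Step 2: visit each gallery page and save every image found in it.
    for url in gallery_urls:
        gallery_page = parse_pic(url)
        for pic_url in get_pic(gallery_page):
            savefile(pic_url)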

3. Regular expressions: filtering the image URLs out of the page and stripping the escape characters (a worked example follows the snippet below).

pattern = re.compile('url_list(.*?),', re.S)
result = re.findall(pattern, movie_page)
for i in result:
    yield re.sub(r'\\', '', i[15:-3])
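To make the magic numbers in i[15:-3] concrete, here is a hypothetical fragment shaped like the escaped gallery JSON in the raw page source. The slice strips the fixed \":[{\"url\":\" prefix (15 characters) and the \"} suffix (3 characters), and re.sub then removes the leftover backslashes:

import re

# Hypothetical snippet in the shape of the raw page source (escaped quotes and slashes):
sample = r'\"url_list\":[{\"url\":\"http:\\/\\/p3.pstatp.com\\/large\\/sample\"},'
pattern = re.compile('url_list(.*?),', re.S)
match = re.findall(pattern, sample)[0]
# match == r'\":[{\"url\":\"http:\\/\\/p3.pstatp.com\\/large\\/sample\"}'
print(re.sub(r'\\', '', match[15:-3]))   # -> http://p3.pstatp.com/large/sample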
4. The first run saved the images without the .jpg extension, so I later batch-appended the extension with the os module (a pathlib variant follows the snippet).
# -*- coding: utf-8 -*-
"""
Created on Sat Jun 9 16:13:00 2018
@author: 01
"""
import os

pic_dir = r"C:\Users\01\test\JRTT\pic"
files = os.listdir(pic_dir)                   # list every file in the picture directory
for filename in files:
    portion = os.path.splitext(filename)      # split the name from its extension
    print(portion)
    if portion[1] == "":                      # no extension yet
        newname = portion[0] + ".jpg"         # the new extension to append
        os.rename(os.path.join(pic_dir, filename),
                  os.path.join(pic_dir, newname))
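For reference, the same batch rename can be written without juggling paths by using pathlib; a minimal sketch, assuming the same directory:

from pathlib import Path

pic_dir = Path(r"C:\Users\01\test\JRTT\pic")
for f in pic_dir.iterdir():
    if f.is_file() and f.suffix == "":    # no extension yet
        f.rename(f.with_suffix(".jpg"))   # rename in place with the new suffix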
Finally, the full code:
# -*- coding: utf-8 -*-
"""
Created on Wed Jun 6 13:09:40 2018
@author: 01
"""
import re
import json
import requests
from urllib.parse import urlencode

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3423.2 Safari/537.36'
}

def get_page(offset):
    # Build the request URL with urlencode.
    # Request URL: https://www.toutiao.com/search_content/?offset=0&format=json&keyword=%E7%94%B5%E5%BD%B1&autoload=true&count=20&cur_tab=3&from=gallery
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': '电影',
        'autoload': 'true',
        'count': '20',
        'cur_tab': '3',
        'from': 'gallery'
    }
    url = 'https://www.toutiao.com/search_content/?' + urlencode(params)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
    except requests.ConnectionError:
        return None

def get_movie_url(html):
    # Walk the search-result JSON and yield each gallery's URL.
    data = json.loads(html)
    if data and 'data' in data.keys():
        for item in data.get('data'):
            if item.get('article_url'):
                yield item.get('article_url')

def parse_pic(movie_url):
    # Request each gallery page.
    try:
        response = requests.get(movie_url, headers=headers)
        if response.status_code == 200:
            return response.text
    except requests.ConnectionError:
        return None

def get_pic(movie_page):
    # Filter out each image URL with a regular expression.
    pattern = re.compile('url_list(.*?),', re.S)
    result = re.findall(pattern, movie_page)
    for i in result:
        yield re.sub(r'\\', '', i[15:-3])

def savefile(pic_url):
    # Request each image URL and write the bytes to a file.
    pic = requests.get(pic_url)
    pic_name = pic_url[41:] + '.jpg'
    with open(pic_name, 'wb') as f:
        f.write(pic.content)

if __name__ == '__main__':
    # Walk through a range of offsets (search pages).
    for offset in range(0, 100, 20):
        html = get_page(offset)
        if not html:                  # request failed; skip this page
            continue
        for movie_url in get_movie_url(html):
            movie_page = parse_pic(movie_url)
            if not movie_page:        # gallery request failed; skip it
                continue
            for pic_url in get_pic(movie_page):
                savefile(pic_url)
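One fragile spot in savefile is the hard-coded pic_url[41:] slice, which breaks silently if the URL prefix length ever changes. A common alternative is to name each file after the MD5 hash of its bytes, which avoids the slice entirely and deduplicates identical images as a side effect; a sketch of this variant (savefile_md5 is my own naming, not from the code above):

import requests
from hashlib import md5

def savefile_md5(pic_url):
    # Name the file after the MD5 of its content: no magic slice,
    # and two identical images simply map to the same file name.
    pic = requests.get(pic_url)
    pic_name = md5(pic.content).hexdigest() + '.jpg'
    with open(pic_name, 'wb') as f:
        f.write(pic.content)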
This post summarizes my study of Cui Qingcai's 《网络爬虫开发实战》: analyzing Ajax requests and building the request URL, adjusting the crawl strategy after the site's code changed, filtering URLs with regular expressions and stripping escape characters, and batch-appending file extensions with the os module.