Scraping Toutiao Street Snaps with Python: A Python 3 Crawler for Toutiao Street Snaps

Latest recommended article published on 2024-08-18 15:31:51

陈村


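I worked through 大才哥's online video tutorial and am writing up a summary here to share.

Unlike the Qiushibaike crawler in the previous post, scraping Toutiao street snaps means analyzing data that comes back from Ajax requests.

First, this is the start page for the crawl.

As you scroll down, new data is loaded on the fly, which means it is fetched by Ajax requests.

Press F12, tick Preserve log, select the XHR filter, and look for requests like the one below. Scrolling the page produces further requests that differ only in their offset parameter.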


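Click one of these requests and open the Preview tab. The response is JSON, and the article_url of each item under data is the URL of one gallery page.

Next, open an article_url to enter one of the galleries. Now we need to find the addresses of the images on that page.

In the Network panel, look at the first request. The image addresses are not in the HTML markup; instead, a script defines a gallery variable that holds the image address information.

An HTML parser such as bs4 does not parse the JavaScript inside a script tag, so we match it directly with a regular expression.

That completes the analysis. Now for the code.

First fetch the index page. Find the query parameters of the GET request in the Headers tab and copy them.

Use urlencode to turn the dict into a GET query string, then send the request with requests. Check the status code, and return response.text if the request succeeded. Don't forget to wrap it in try/except, because any network request can fail.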

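import os
from urllib.parse import urlencode

import requests
from requests.exceptions import RequestException

def get_page_index(offset, keyword):
    # Query parameters copied from the XHR request seen in the browser
    data = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': 20,
        'cur_tab': 3
    }
    url = 'http://www.toutiao.com/search_content/?' + urlencode(data)
    print('[%d] Downloading index page %s' % (os.getpid(), url))
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Failed to request index page', url)
        return None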
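To see what urlencode actually produces, here is a minimal sketch that builds two index URLs without sending any request. The helper name build_index_url is my own for illustration; the endpoint and parameters are the ones used above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = 'http://www.toutiao.com/search_content/?'  # endpoint used in this post


def build_index_url(offset, keyword):
    """Build the index-page URL for one scroll step (hypothetical helper)."""
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': 20,
        'cur_tab': 3,
    }
    return BASE_URL + urlencode(params)


first = build_index_url(0, '街拍')
second = build_index_url(20, '街拍')
print(first)
print(second)

# Decode both query strings and list the parameters whose values differ.
q1 = parse_qs(urlsplit(first).query)
q2 = parse_qs(urlsplit(second).query)
changed = sorted(k for k in q1 if q1[k] != q2[k])
print(changed)
```

Only offset changes between the two URLs, which matches what the XHR list shows while scrolling. urlencode also percent-encodes the non-ASCII keyword, so there is no need to encode it by hand.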

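Next, from the returned response.text we need to pull out the article_url of each item under data in the JSON. Import the json library and write the parse_page_index function.

json.loads() parses the JSON string straight into a Python dict, and data.keys() returns all of its keys.

yield turns the function into a generator, so a for loop can later pick up the yielded values one by one.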

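import json
from json.decoder import JSONDecodeError

def parse_page_index(html):
    try:
        data = json.loads(html)
        if data and 'data' in data.keys():
            for item in data.get('data'):
                yield item.get('article_url')
    except (JSONDecodeError, TypeError):
        # TypeError: html is None because the index request failed
        pass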
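If generators are new to you, the following sketch may help. It runs the same idea against a hand-written JSON string, so no network is needed. The sample data and the name iter_article_urls are made up for illustration, and unlike the function above it skips items that have no article_url.

```python
import json

# Hypothetical response body, shaped like the JSON described above.
sample = json.dumps({
    'data': [
        {'article_url': 'http://example.com/a1'},
        {'title': 'item without a link'},
        {'article_url': 'http://example.com/a2'},
    ]
})


def iter_article_urls(text):
    """Yield each article_url found under the top-level 'data' list."""
    try:
        payload = json.loads(text)
    except (ValueError, TypeError):
        # ValueError covers json.JSONDecodeError; TypeError covers text=None.
        return
    if not isinstance(payload, dict):
        return
    for item in payload.get('data') or []:
        url = item.get('article_url')
        if url:
            yield url


gen = iter_article_urls(sample)
print(next(gen))    # values are produced one at a time, on demand
print(list(gen))    # whatever the generator has left
print(list(iter_article_urls(None)))   # a failed request yields nothing
```

Because the function is a generator, nothing is parsed until the first value is requested, and a caller can stop early without processing the rest.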

With the article_url values parsed out, we fetch each detail page. As with the index page, we first check that the request succeeded.

def get_page_detail(url):
    # Same pattern as get_page_index: return the HTML, or None on failure
    print('[%d] Downloading detail page %s' % (os.getpid(), url))
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Failed to request detail page', url)
        return None

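Next, parse the detail page.

Pass the value returned by get_page_detail, together with the url, into parse_page_detail().

Use re.compile() to build the pattern, and store the match for the JSON assigned to the gallery variable in result.

Parse the JSON string with json.loads() as before. The text matched by (.*?) is retrieved with result.group(), where capture groups are numbered from 1.

Then check whether the key sub_images exists.

The following line collects every url in sub_images into a list.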

images = [item.get('url') for item in sub_images]

With the image URLs in hand, we can write the function that downloads the images. At the end, return a dict describing the gallery.

import re

from bs4 import BeautifulSoup

def parse_page_detail(html, url):
    soup = BeautifulSoup(html, 'lxml')
    title = soup.select('title')[0].get_text()
    print(title)
    # The image data lives in a JavaScript variable, not in the markup
    images_pattern = re.compile(r'var gallery = (.*?);', re.S)
    result = re.search(images_pattern, html)
    if result:
        data = json.loads(result.group(1))
        if data and 'sub_images' in data.keys():
            sub_images = data.get('sub_images')
            images = [item.get('url') for item in sub_images]
            for image in images:
                download_img(image)
            return {
                'title': title,
                'url': url,
                'images': images
            }
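To look at the regular-expression step on its own, here is a small sketch that runs against a hand-written page fragment. The HTML below is invented for illustration and the real page may be laid out differently; only the pattern is taken from the function above.

```python
import json
import re

# Hypothetical page fragment; the real page layout may differ.
html = '''
<html><head><title>demo</title></head><body>
<script>
var gallery = {"count": 2, "sub_images": [{"url": "http://example.com/1.jpg"}, {"url": "http://example.com/2.jpg"}]};
var other = 1;
</script>
</body></html>
'''

pattern = re.compile(r'var gallery = (.*?);', re.S)
match = pattern.search(html)

images = []
if match:
    # group(0) is the whole match; group(1) is the first parenthesised group.
    data = json.loads(match.group(1))
    images = [item.get('url') for item in data.get('sub_images', [])]

print(images)
```

One thing to keep in mind: the non-greedy (.*?); stops at the first semicolon, so if the JSON ever contained a semicolon inside a string, the captured text would be cut short and json.loads() would raise an error. For the data in this sketch that does not happen.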

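Next, store the returned data in MongoDB. You need to install MongoDB itself and the pymongo package.

MONGO_URL is the connection address from the config file (localhost); connect=False avoids errors when running with multiple processes.

MONGO_DB is the database name and MONGO_TABLE is the collection name.

With MongoDB you don't have to create the database or the collection yourself, they are created on first use, and the dict we return is stored as a document without any extra conversion.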

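import pymongo

from config import MONGO_URL, MONGO_DB, MONGO_TABLE

client = pymongo.MongoClient(MONGO_URL, connect=False)
db = client[MONGO_DB]

def save_to_mongo(result):
    # insert() was removed in PyMongo 4; insert_one() is its replacement
    if db[MONGO_TABLE].insert_one(result):
        print('Saved to MongoDB', result)
        return True
    return False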

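Finally, write the download part.

Again, first check that the request succeeded. Unlike response.text, response.content returns the raw bytes, which is what we want for images.

Then build the storage path with '{}'.format.

os.getcwd() returns the current working directory, and md5 hashes identical content to the same value, so the same image is never saved twice.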

from hashlib import md5

def download_img(url):
    print('[%d] Downloading image %s' % (os.getpid(), url))
    try:
        response = requests.get(url)
        if response.status_code == 200:
            save_image(response.content)
        return None
    except RequestException:
        print('Failed to request image', url)
        return None

def save_image(content):
    # Save under ./images, using the MD5 of the content as the file name
    image_dir = os.path.join(os.getcwd(), 'images')
    os.makedirs(image_dir, exist_ok=True)
    file_path = os.path.join(image_dir, '{0}.{1}'.format(md5(content).hexdigest(), 'jpg'))
    if not os.path.exists(file_path):
        with open(file_path, 'wb') as f:
            f.write(content)
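Here is a quick way to check the deduplication idea without downloading anything. It is only a sketch: save_bytes is a hypothetical helper that takes the target directory as an argument, and the "image" is a few placeholder bytes written into a temporary directory.

```python
import os
import tempfile
from hashlib import md5


def save_bytes(content, directory):
    """Write content to <directory>/<md5>.jpg unless that file already exists."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, md5(content).hexdigest() + '.jpg')
    created = not os.path.exists(path)
    if created:
        with open(path, 'wb') as f:   # binary mode, matching response.content
            f.write(content)
    return path, created


with tempfile.TemporaryDirectory() as tmp:
    fake_image = b'\xff\xd8 not a real JPEG, just placeholder bytes'
    first_path, first_created = save_bytes(fake_image, tmp)
    second_path, second_created = save_bytes(fake_image, tmp)
    saved_files = os.listdir(tmp)

print(first_created, second_created, len(saved_files))
```

Saving the same bytes twice maps to the same file name, so the second call writes nothing and the directory still holds a single file.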

Main function

from config import KEYWORD

def main(offset):
    print('Fetching index page (%d)' % offset)
    html = get_page_index(offset, KEYWORD)
    for url in parse_page_index(html):
        html = get_page_detail(url)
        if html:
            result = parse_page_detail(html, url)
            if result:
                save_to_mongo(result)

Entry point

[x*20 for x in range(GROUP_START, GROUP_END)]

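This builds a list of 20 times x for every x in the given range, which corresponds to the different offset values of the Ajax requests fired while scrolling, as described at the beginning.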
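As a quick check of that list, here is the same comprehension with a smaller range. GROUP_START and GROUP_END are the names from config.py, set to small demo values here; PAGE_SIZE is a name I made up for the count of 20 per request.

```python
GROUP_START = 0   # same names as in config.py; small values chosen for the demo
GROUP_END = 5
PAGE_SIZE = 20    # hypothetical name for the 'count' query parameter

groups = [x * PAGE_SIZE for x in range(GROUP_START, GROUP_END)]
print(groups)  # → [0, 20, 40, 60, 80]
```

Note that range() excludes GROUP_END, so with the values 0 and 20 from config.py the crawler requests 20 index pages, with offsets from 0 to 380.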

Use a process pool.

from multiprocessing import Pool

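pool.map(main, groups) calls main once for each value in groups, passing that value as the argument.

PROCESSE_NUM (spelled this way in config.py) sets the number of processes.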

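from config import GROUP_START, GROUP_END, PROCESSE_NUM

if __name__ == '__main__':
    groups = [x*20 for x in range(GROUP_START, GROUP_END)]
    pool = Pool(processes=PROCESSE_NUM)
    pool.map(main, groups)
    # Stop accepting work and wait for the workers to finish
    pool.close()
    pool.join()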
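To watch pool.map distribute work without touching the network, here is a minimal sketch. fake_main is a hypothetical stand-in for main that only reports which process handled each offset, and run is a small wrapper I added for the demo.

```python
from multiprocessing import Pool
import os


def fake_main(offset):
    """Stand-in for main(offset): report which process handled the offset."""
    return offset, os.getpid()


def run(offsets, processes):
    # The with-block terminates the worker processes on exit.
    with Pool(processes=processes) as pool:
        return pool.map(fake_main, offsets)


if __name__ == '__main__':
    results = run([0, 20, 40, 60], processes=2)
    # pool.map returns results in input order, whichever worker ran them.
    print([offset for offset, _ in results])
    print(len({pid for _, pid in results}), 'worker process(es) used')
```

The worker function has to be defined at module level so the child processes can find it, and the pool is created under the __main__ guard for the same reason as in the crawler itself.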

Configuration file config.py

MONGO_URL = 'localhost'

MONGO_DB = 'toutiao'

MONGO_TABLE = 'toutiao'

GROUP_START = 0

GROUP_END = 20

KEYWORD = '街拍'

PROCESSE_NUM = 8

Result images

Source code
