Learning Web Scraping: Crawling Street-Snap (街拍) Photos from Toutiao

Latest recommended article published on 2021-02-17 21:21:06

没有刺的仙人掌

Category: python. Tags: python, Ajax

Article link: https://blog.csdn.net/qq_21467113/article/details/82192025

1. Open the Toutiao home page, search for 街拍 (street snaps), and analyze the source of the results page

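The results page is not ordinary static HTML: Toutiao loads the results in the background through Ajax requests. The format does not matter as long as we can get the data we want. With Preserve log enabled in the browser's developer tools, every request and response is recorded; the author found the response we need under the Doc filter (Ajax responses are usually listed under XHR). It contains the URLs of the individual result entries.

Once the entry URLs are located, we can write the crawling code.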

def get_page_data(offset, keyword):
    # Query parameters of the Ajax search request seen in the browser
    data = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': 20,
        'cur_tab': 3,
        'from': 'gallery',
    }
    url = 'https://www.toutiao.com/search_content/?' + urlencode(data)
    try:
        # A timeout keeps a stalled request from blocking the crawl
        respond = requests.get(url, timeout=10)
        if respond.status_code == 200:
            return respond.text
        else:
            return None
    except RequestException:
        print('Error requesting the index page:', url)
        return None

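A data dictionary holds the query parameters of the street-snap search page, and the urlencode function converts the dictionary into a URL query string.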
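
As a quick illustration of that step, the sketch below encodes a shortened parameter dictionary and then parses the result back. The variable names are only for this example, and the parameter set is a subset of the one used in get_page_data.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# A shortened version of the parameter dictionary from get_page_data
params = {'offset': 20, 'format': 'json', 'keyword': '街拍', 'count': 20}

# urlencode joins the pairs with '&' and percent-encodes non-ASCII values as UTF-8
url = 'https://www.toutiao.com/search_content/?' + urlencode(params)
print(url)

# Parsing the query string back recovers the original keyword
decoded = parse_qs(urlsplit(url).query)
print(decoded['keyword'])
```

Note that parse_qs returns every value as a list of strings, so the offset comes back as ['20'] rather than the integer 20.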

After fetching the response, json.loads converts it into a dictionary so that we can pull out the entry URLs.

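def get_image_url(page):
    # get_page_data returns None on failure, so there is nothing to parse
    if not page:
        return
    data = json.loads(page)
    if data and 'data' in data:
        for item in data.get('data'):
            if item.get('article_url'):
                yield item.get('article_url')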
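
To see what such a generator yields, here is a small sketch run against a made-up payload. The function name iter_article_urls and the sample data are assumptions for illustration; the real response carries many more fields per entry.

```python
import json

def iter_article_urls(page):
    """Yield the article_url of each entry in a search-result JSON string."""
    if not page:
        return
    data = json.loads(page)
    for item in data.get('data') or []:
        url = item.get('article_url')
        if url:
            yield url

# Made-up payload with the same overall shape: a 'data' list of entries
sample_page = json.dumps({
    'data': [
        {'title': 'first', 'article_url': 'https://example.com/a1'},
        {'title': 'entry without a URL'},
        {'title': 'second', 'article_url': 'https://example.com/a2'},
    ]
})

urls = list(iter_article_urls(sample_page))
print(urls)
```

Because the function is a generator, nothing is parsed until the caller starts iterating, and a failed request (a None page) simply produces no URLs.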

2. Open an entry page and analyze its source

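Fetching the entry page with requests did not return the content we need, so I fell back on the selenium library, which drives a real browser and lets us read the rendered page source.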

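from selenium.common.exceptions import TimeoutException

def get_image_detail(image_url):
    browser = webdriver.Chrome()
    try:
        browser.get(image_url)
        return browser.page_source
    except TimeoutException:
        # Page load timed out; skip this entry
        print('Timed out loading the page:', image_url)
        return None
    finally:
        # quit() ends the browser session even when loading fails
        browser.quit()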

Parse the page source to extract the title, the image URLs, and the URL of the page itself.

def prase_image_page(html, url):
    # get_image_detail may return None when the page failed to load
    if not html:
        return None
    soup = BeautifulSoup(html, 'lxml')
    title = soup.select('title')[0].get_text()
    # Raw string, so the regex escapes reach re unchanged
    image_pattern = re.compile(r'gallery: JSON\.parse\("(.*?)"\),', re.S)
    result = re.search(image_pattern, html)
    if result is None:
        # No gallery data on this page
        return None
    # Strip the backslashes that escape the embedded JSON string
    images_data = result.group(1).replace('\\', '')
    data = json.loads(images_data)
    if data and 'sub_images' in data:
        sub_images = data.get('sub_images')
        images = [item.get('url') for item in sub_images]
        for image in images:
            download_image(image)
        return {
            'title': title,
            'url': url,
            'images': images
        }
    return None

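Beautiful Soup is used first to get the article title, and a regular expression then pulls out the JSON string that holds the image URLs. Toutiao has obfuscated that string, apparently as an anti-scraping measure (it is backslash-escaped), so str.replace() strips the interfering characters to make it valid JSON, and json.loads then turns it into a dictionary for extraction.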
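
The extraction steps can be tried offline with the sketch below. SAMPLE_HTML is a made-up fragment that only imitates the escaped gallery string, and the helper names are assumptions for this example. It compares the backslash-stripping approach described above with an alternative that decodes the JavaScript string literal first.

```python
import json
import re

# Made-up page fragment imitating the escaped gallery string
SAMPLE_HTML = r'''<script>var BASE_DATA = {
    gallery: JSON.parse("{\"count\":2,\"sub_images\":[{\"url\":\"http:\\/\\/example.com\\/a.jpg\"},{\"url\":\"http:\\/\\/example.com\\/b.jpg\"}]}"),
    siblingList: []
};</script>'''

GALLERY_PATTERN = re.compile(r'gallery: JSON\.parse\("(.*?)"\),', re.S)

def parse_by_stripping(raw):
    # Approach from the article: drop every backslash, then parse
    return json.loads(raw.replace('\\', ''))

def parse_by_decoding(raw):
    # Alternative: decode the string literal first, then parse the JSON inside it
    return json.loads(json.loads('"' + raw + '"'))

match = GALLERY_PATTERN.search(SAMPLE_HTML)
raw = match.group(1)
stripped = parse_by_stripping(raw)
decoded = parse_by_decoding(raw)
image_urls = [item.get('url') for item in decoded.get('sub_images', [])]
print(image_urls)
```

Both approaches agree on this sample. Stripping every backslash would, however, corrupt a value that legitimately contains an escaped quote or backslash, so decoding in two steps may be the safer choice if the page data includes free text.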

3. Store the extracted data

Save the extracted data to a MongoDB database.

def save_to_mongo(result):
    # insert() was removed in PyMongo 4; insert_one() is the replacement
    if result and db[MONGO_TABLE].insert_one(result):
        print('Saved to MongoDB:', result)
        return True
    return False

Download the images and save them to the current working directory.

def download_image(url):
    print('Downloading image:', url)
    try:
        respond = requests.get(url, timeout=10)
        if respond.status_code == 200:
            # respond.content holds the raw bytes of the image
            save_image(respond.content)
        else:
            return None
    except RequestException:
        print('Error requesting the image:', url)
        return None

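def save_image(content):
    # Name the file after the MD5 of its bytes so duplicates are saved once
    file_path = '{0}/{1}.{2}'.format(os.getcwd(), md5(content).hexdigest(), 'jpg')
    if not os.path.exists(file_path):
        with open(file_path, 'wb') as f:
            # The with block closes the file on exit
            f.write(content)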
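
The effect of naming files by content hash can be checked with a short sketch. The helper save_bytes_once and the byte strings are assumptions for illustration, and a temporary directory stands in for the working directory used by save_image.

```python
import os
import tempfile
from hashlib import md5

def save_bytes_once(content, directory, ext='jpg'):
    """Write content to <directory>/<md5 of content>.<ext> unless it exists.

    Returns (path, written), where written says whether a file was created.
    """
    name = '{0}.{1}'.format(md5(content).hexdigest(), ext)
    file_path = os.path.join(directory, name)
    if os.path.exists(file_path):
        return file_path, False
    with open(file_path, 'wb') as f:
        f.write(content)
    return file_path, True

with tempfile.TemporaryDirectory() as tmp:
    first = save_bytes_once(b'fake image bytes', tmp)
    second = save_bytes_once(b'fake image bytes', tmp)   # same bytes, same name
    third = save_bytes_once(b'other image bytes', tmp)
    saved_files = sorted(os.listdir(tmp))

print(len(saved_files))
```

Identical bytes always map to the same file name, so an image that appears in several articles is written only once.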

4. Create a configuration file

A configuration file holds the global settings.

MONGO_URL = 'localhost'
MONGO_DB = 'toutiao'
MONGO_TABLE = 'toutiao'

# Range of page groups to crawl; group x maps to offset x * 20
GROUP_START = 1
GROUP_END = 20

KEYWORD = '街拍'

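5. Use a process pool to crawl in batches

This part is much like the previous post, so here is the complete code. The configuration file config.py is not repeated.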

import json
import os
import re
from hashlib import md5
from multiprocessing import Pool
from urllib.parse import urlencode

import pymongo
import requests
from bs4 import BeautifulSoup
from requests.exceptions import RequestException
from selenium import webdriver
from selenium.common.exceptions import TimeoutException

from config import *

# connect=False defers the connection until first use, so each worker
# process opens its own connection after the pool starts
client = pymongo.MongoClient(MONGO_URL, connect=False)
db = client[MONGO_DB]


def get_page_data(offset, keyword):
    # Query parameters of the Ajax search request seen in the browser
    data = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': 20,
        'cur_tab': 3,
        'from': 'gallery',
    }
    url = 'https://www.toutiao.com/search_content/?' + urlencode(data)
    try:
        respond = requests.get(url, timeout=10)
        if respond.status_code == 200:
            return respond.text
        else:
            return None
    except RequestException:
        print('Error requesting the index page:', url)
        return None

def get_image_url(page):
    # get_page_data returns None on failure, so there is nothing to parse
    if not page:
        return
    data = json.loads(page)
    if data and 'data' in data:
        for item in data.get('data'):
            if item.get('article_url'):
                yield item.get('article_url')

def get_image_detail(image_url):
    browser = webdriver.Chrome()
    try:
        browser.get(image_url)
        return browser.page_source
    except TimeoutException:
        # Page load timed out; skip this entry
        print('Timed out loading the page:', image_url)
        return None
    finally:
        # quit() ends the browser session even when loading fails
        browser.quit()

def prase_image_page(html, url):
    # get_image_detail may return None when the page failed to load
    if not html:
        return None
    soup = BeautifulSoup(html, 'lxml')
    title = soup.select('title')[0].get_text()
    # Raw string, so the regex escapes reach re unchanged
    image_pattern = re.compile(r'gallery: JSON\.parse\("(.*?)"\),', re.S)
    result = re.search(image_pattern, html)
    if result is None:
        # No gallery data on this page
        return None
    # Strip the backslashes that escape the embedded JSON string
    images_data = result.group(1).replace('\\', '')
    data = json.loads(images_data)
    if data and 'sub_images' in data:
        sub_images = data.get('sub_images')
        images = [item.get('url') for item in sub_images]
        for image in images:
            download_image(image)
        return {
            'title': title,
            'url': url,
            'images': images
        }
    return None

def save_to_mongo(result):
    # insert() was removed in PyMongo 4; insert_one() is the replacement
    if result and db[MONGO_TABLE].insert_one(result):
        print('Saved to MongoDB:', result)
        return True
    return False

def download_image(url):
    print('Downloading image:', url)
    try:
        respond = requests.get(url, timeout=10)
        if respond.status_code == 200:
            save_image(respond.content)
        else:
            return None
    except RequestException:
        print('Error requesting the image:', url)
        return None

def save_image(content):
    # Name the file after the MD5 of its bytes so duplicates are saved once
    file_path = '{0}/{1}.{2}'.format(os.getcwd(), md5(content).hexdigest(), 'jpg')
    if not os.path.exists(file_path):
        with open(file_path, 'wb') as f:
            f.write(content)

def main(offset):
    page = get_page_data(offset, KEYWORD)
    for image_url in get_image_url(page):
        image_detail = get_image_detail(image_url)
        page_detail = prase_image_page(image_detail, image_url)
        # Pages without gallery data yield None and are skipped
        if page_detail:
            save_to_mongo(page_detail)

if __name__ == '__main__':
    # One offset per page group: 20, 40, ... up to GROUP_END * 20
    group = [x * 20 for x in range(GROUP_START, GROUP_END + 1)]
    with Pool() as pool:
        pool.map(main, group)
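
The batching pattern at the end of the script can be seen in isolation in the sketch below. The names build_offsets, crawl_group and run_all are assumptions for this example, and crawl_group is only a stand-in for main(offset) that does no network access.

```python
from multiprocessing import Pool

def build_offsets(group_start, group_end, page_size=20):
    """Return one search offset per page group, as the script's __main__ block does."""
    return [x * page_size for x in range(group_start, group_end + 1)]

def crawl_group(offset):
    # Stand-in for main(offset); a real worker would fetch and parse one page
    return 'crawled offset {0}'.format(offset)

def run_all(group_start, group_end):
    offsets = build_offsets(group_start, group_end)
    # Each offset is handed to a worker process; map returns results in input order
    with Pool(processes=2) as pool:
        return pool.map(crawl_group, offsets)

if __name__ == '__main__':
    for line in run_all(1, 3):
        print(line)
```

With the configuration values shown earlier (GROUP_START = 1, GROUP_END = 20) the same calculation gives 20 offsets, from 20 to 400, so offset 0 is not requested. The __main__ guard matters here because worker processes may re-import the module on platforms that spawn rather than fork.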
