Scraping Toutiao street-snap photos with a Python crawler

dd205qq

Article link: https://blog.csdn.net/dd205qq/article/details/88572376

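As a beginner with crawlers, I tried scraping the street-snap (街拍) images from Toutiao. The search page is https://www.toutiao.com/search/?keyword=%E8%A1%97%E6%8B%8D
This page loads its results with Ajax, so the first step is to analyze how the request URLs are put together.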
1.https://www.toutiao.com/api/search/content/?aid=24&app_name=web_search&offset=40&format=json&keyword=街拍&autoload=true&count=20&en_qc=1&cur_tab=1&from=search_tab&pd=synthesis&timestamp=1552627477975
2.https://www.toutiao.com/api/search/content/?aid=24&app_name=web_search&offset=60&format=json&keyword=街拍&autoload=true&count=20&en_qc=1&cur_tab=1&from=search_tab&pd=synthesis&timestamp=1552627569611
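Comparing the two URLs, offset is the parameter that changes between requests (timestamp differs as well, but it is just the request time). Since offset goes up by 20 and count is 20, each request presumably loads the next 20 results.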
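The comparison can also be done in code instead of by eye. The following is a small sketch using only the standard library; `diff_query` is a helper name made up for this illustration and is not part of the crawler itself.

```python
from urllib.parse import urlsplit, parse_qsl

def diff_query(url_a, url_b):
    """Return the query parameters whose values differ between two URLs."""
    a = dict(parse_qsl(urlsplit(url_a).query))
    b = dict(parse_qsl(urlsplit(url_b).query))
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }

# The two request URLs captured from the browser
first = ('https://www.toutiao.com/api/search/content/?aid=24&app_name=web_search'
         '&offset=40&format=json&keyword=街拍&autoload=true&count=20&en_qc=1'
         '&cur_tab=1&from=search_tab&pd=synthesis&timestamp=1552627477975')
second = ('https://www.toutiao.com/api/search/content/?aid=24&app_name=web_search'
          '&offset=60&format=json&keyword=街拍&autoload=true&count=20&en_qc=1'
          '&cur_tab=1&from=search_tab&pd=synthesis&timestamp=1552627569611')

print(diff_query(first, second))
# {'offset': ('40', '60'), 'timestamp': ('1552627477975', '1552627569611')}
```

Only offset and timestamp differ, which supports the guess that offset is the paging parameter.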
Here is how the URL is assembled:


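import requests
from urllib.parse import urlencode

def get_page(offset, keyword):
    # Query parameters copied from the Ajax request seen in the browser
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': '20',
        'en_qc': '1',
        'cur_tab': '1',
        'from': 'search_tab',
        'pd': 'synthesis',
        'aid': '24',
        'app_name': 'web_search'}
    base_url = 'https://www.toutiao.com/api/search/content/?'
    url = base_url + urlencode(params)  # urlencode also percent-encodes the keyword
    headers = {}
    headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"
    try:
        resp = requests.get(url, headers=headers)  # actually send the headers built above
        if resp.status_code == 200:
            return resp.json()  # <class 'dict'>
    except requests.ConnectionError:
        print('Failed to connect to', url)
    return None  # request failed or returned a non-200 status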
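To check what the assembled URLs look like without sending any request, the URL-building step can be tried on its own. This is only a sketch: `build_search_url` and `BASE_URL` are names invented for the example, and the parameter values are the ones observed in the captured requests.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.toutiao.com/api/search/content/?'

def build_search_url(offset, keyword, count=20):
    """Build one search-API URL (hypothetical helper, no network access)."""
    params = {
        'aid': '24',
        'app_name': 'web_search',
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': count,
        'en_qc': '1',
        'cur_tab': '1',
        'from': 'search_tab',
        'pd': 'synthesis',
    }
    return BASE_URL + urlencode(params)

# First three pages: offsets 0, 20, 40
for page in range(3):
    print(build_search_url(page * 20, '街拍'))
```

Note that urlencode turns 街拍 into %E8%A1%97%E6%8B%8D, the same encoded form that appears in the search page address.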

With the URL assembled, the next step is to parse the response and pull out the image information.

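def get_images(json):
    # json is the dict returned by get_page; it may be None if the request failed
    if json and json.get('data'):
        data = json.get('data')
        for item in data:
            title = item.get('title')
            image_list = item.get('image_list')
            if image_list is not None:
                for image in image_list:
                    # yield makes this function a generator: results are produced lazily, so the caller can loop over them
                    yield {
                        'image': image.get('url'),
                        'title': title
                    }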
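To make the lazy behaviour of yield concrete, here is a small sketch run against hand-written data. The `sample` dict is invented and only imitates the fields used above (data, title, image_list, url); `iter_images` is an illustrative name, not the function from this post.

```python
def iter_images(result):
    """Yield one {'image', 'title'} dict per image found in a search result."""
    for item in (result or {}).get('data') or []:
        title = item.get('title')
        # Entries without image_list are skipped
        for image in item.get('image_list') or []:
            yield {'image': image.get('url'), 'title': title}

# Made-up response with one entry that has images and one that does not
sample = {
    'data': [
        {'title': 'post A', 'image_list': [{'url': 'http://example.com/a1.jpg'},
                                            {'url': 'http://example.com/a2.jpg'}]},
        {'title': 'post B'},
    ]
}

images = iter_images(sample)
print(next(images))             # nothing runs until the first item is requested
print(list(images))             # the remaining items
print(list(iter_images(None)))  # a failed request simply yields nothing
```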

Next comes saving the images, using a process pool to speed things up.

import os
from hashlib import md5
import requests

def save_images(item):  # dedupe: the MD5 fingerprint of the content (32 hex digits) is used as the file name
    '''
    1. Save the image
    2. Skip duplicates
    '''
    # Target directory; os.path.join picks / or \ depending on the OS. Fall back to 'untitled' when the title is missing.
    img_path = os.path.join('img', item.get('title') or 'untitled')
    os.makedirs(img_path, exist_ok=True)  # exist_ok avoids a race when several processes create the same directory
    # Download the image and save it
    headers = {}
    headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
    try:
        resp = requests.get(item.get('image'), headers=headers)
        if resp.status_code == 200:
            content = resp.content
            filename = md5(content).hexdigest()  # 32 hex digits
            file_suffix = '.jpg'
            file_path = os.path.join(img_path, filename + file_suffix)
            if not os.path.exists(file_path):  # this image is not in the directory yet
                with open(file_path, 'wb') as fo:
                    fo.write(content)
                    print('Saved the downloaded image to', file_path)
            else:
                print('Duplicate image', file_path)
    except requests.ConnectionError:
        print('Failed to download', item.get('title'))
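The MD5-based deduplication can be tried offline. The sketch below writes made-up bytes into a temporary directory instead of downloading anything; `save_bytes` is a hypothetical helper that isolates just the naming and duplicate check.

```python
import os
import tempfile
from hashlib import md5

def save_bytes(content, directory):
    """Write content to <directory>/<md5>.jpg.

    Returns (path, True) if a new file was written, (path, False) if a file
    with the same fingerprint already existed.
    """
    os.makedirs(directory, exist_ok=True)
    file_path = os.path.join(directory, md5(content).hexdigest() + '.jpg')
    if os.path.exists(file_path):
        return file_path, False
    with open(file_path, 'wb') as fo:
        fo.write(content)
    return file_path, True

with tempfile.TemporaryDirectory() as tmp:
    first = save_bytes(b'fake image bytes', tmp)
    second = save_bytes(b'fake image bytes', tmp)  # same content, same name
    third = save_bytes(b'different bytes', tmp)
    print(first[1], second[1], third[1])  # True False True
    print(len(os.listdir(tmp)))           # 2
```

Because the file name is derived from the content, identical images end up with identical names, and the existence check is enough to skip them. The check only covers one directory, so the same image under two different titles would still be saved twice.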

keyword = '街拍'
# Process-pool approach: each worker process handles one offset
def task(offset):
    json = get_page(offset, keyword)
    for item in get_images(json):
        print('Downloading:', item.get('image'), '   ', item.get('title'))
        save_images(item)
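The code that actually starts the process pool is not shown above, so here is one way it might look with `multiprocessing.Pool`. This is a sketch under assumptions: `crawl_page` is a placeholder standing in for the real per-offset function so that the example runs without network access, and `make_offsets`, the page count and the pool size are all made up for illustration.

```python
from multiprocessing import Pool

def make_offsets(pages, step=20):
    """Offsets for the first `pages` result pages: 0, 20, 40, ..."""
    return [page * step for page in range(pages)]

def crawl_page(offset):
    # Placeholder: in the real crawler this would fetch and save one page
    return offset

def run(pages=5, processes=4):
    # Each worker process receives one offset at a time
    with Pool(processes) as pool:
        return pool.map(crawl_page, make_offsets(pages))

if __name__ == '__main__':
    # The guard is needed on platforms that start workers by re-importing the module
    print(run())
```

Since the work is mostly waiting on downloads, a pool of a few processes is usually enough; the pages are independent of each other, which is what makes `pool.map` a natural fit.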

That is the main idea behind the crawl. There is surely plenty to improve, so corrections are very welcome.