My First Crawler (Requests and BeautifulSoup), Version 1: Not Easy at All

After several days of learning Python, I finally wrote my first crawler. It's rough, but just getting it to crawl feels like a real achievement. I got stuck in plenty of places along the way: back to the docs, and when the docs made no sense, off to find a video. Not easy at all. For now it merely works; when I find the time I'll optimize it and write a second version~

import requests
from bs4 import BeautifulSoup
import re
import time
# Spoof a desktop browser User-Agent so the site serves normal pages
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'}
lst = []
for page in range(1, 167):  # the photo list spans 166 pages
    response = requests.get(
        'https://www.tupianzj.com/meinv/xiezhen/list_179_{0}.html'.format(page),
        headers=headers)
    response.raise_for_status()
    response.encoding = response.apparent_encoding

    soup = BeautifulSoup(response.text, 'html.parser')
    soup = soup.find(name='ul', attrs={"class": "list_con_box_ul"})

    # Collect every gallery link on this list page
    for a in soup.find_all('a'):
        g = a.get('href')
        if len(a) == 4:  # crude filter: the gallery <a> tags have exactly 4 child nodes
            lst.append(g)
lst = ['https://www.tupianzj.com' + x for x in lst]  # the hrefs are relative, so prepend the domain
print(lst)
# Visit each gallery page collected above
for j in range(len(lst)):
    response1 = requests.get(lst[j], headers=headers)
    response1.raise_for_status()
    response1.encoding = response1.apparent_encoding
    soup1 = BeautifulSoup(response1.text, 'html.parser')
    
    # Find out how many pages this gallery has
    SoupPage = soup1.find(name='div', attrs={"class": "pages"})
    SoupPage = SoupPage.find('a')
    SoupPage = re.sub(r"\D", "", SoupPage.text)  # strip non-digits to leave just the page count
    print(SoupPage)

    # Save the first image before looping over the remaining pages
    # Get the image download URL (the first <img> on the page is the photo)
    SoupDown = soup1.find(name='img')
    SoupDown = SoupDown.get('src')
    print(SoupDown)
    
    # Download the image into the current folder
    # (the with block closes the file by itself, so no explicit close is needed)
    pic = requests.get(SoupDown, headers=headers)
    Shi = int(time.time())  # timestamp prefix keeps the file names unique
    with open("{0}{1}".format(Shi, SoupDown[55:]), "wb") as f:  # [55:] crudely slices the file name off the URL
        f.write(pic.content)
    
    # Build the URL of every remaining page of this gallery
    for k in range(2, int(SoupPage) + 1):
        # page k's URL is the gallery URL with "_k" inserted before ".html"
        PersonUrlPage = list(lst[j])
        PersonUrlPage.insert(-5, '_{0}'.format(k))
        PersonUrlPage = "".join(PersonUrlPage)

        response2 = requests.get(PersonUrlPage, headers=headers)
        response2.raise_for_status()
        response2.encoding = response2.apparent_encoding
        soup2 = BeautifulSoup(response2.text, 'html.parser')

        # Get the image download URL
        SoupDown = soup2.find(name='img')
        SoupDown = SoupDown.get('src')
        
        # Download the image into the current folder
        pic = requests.get(SoupDown, headers=headers)
        Shi = int(time.time())
        with open("{0}{1}".format(Shi, SoupDown[55:]), "wb") as f:
            f.write(pic.content)
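
A note to myself for the second version: the SoupDown[55:] slice only works while the URL prefix happens to be exactly 55 characters, and opening a fresh connection for every image is slow. Below is a minimal sketch of a sturdier download helper (the save_image name, the 10-second timeout, and the 0.5-second pause are my own picks, not anything the site requires), using os.path.basename to pull the file name out of the URL and a shared requests.Session to reuse connections:

import os
import time
from urllib.parse import urlparse

import requests

session = requests.Session()  # one shared session reuses TCP connections
session.headers['User-Agent'] = 'Mozilla/5.0'  # same idea as the headers dict above

def save_image(img_url):
    # basename() extracts the file name no matter how long
    # the URL prefix is, unlike the hard-coded [55:] slice
    name = os.path.basename(urlparse(img_url).path)
    resp = session.get(img_url, timeout=10)
    resp.raise_for_status()
    with open("{0}{1}".format(int(time.time()), name), "wb") as f:
        f.write(resp.content)
    time.sleep(0.5)  # be polite: pause between downloads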