My First Crawler (Requests and BeautifulSoup), Version 1: Not Easy at All

After several days of learning Python, I finally wrote my first crawler. It's rough, but just getting it to crawl feels like a real achievement. I got stuck in plenty of places along the way: back to the docs, and when the docs made no sense, off to find a video. Not easy at all. For now it merely works; when I find the time I'll optimize it and write a second version~

import requests
from bs4 import BeautifulSoup
import re
import time
# Spoof a desktop browser User-Agent so the site serves normal pages
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'}
lst = []
for page in range(1, 167):  # the photo list spans 166 pages
    response = requests.get(
        'https://www.tupianzj.com/meinv/xiezhen/list_179_{0}.html'.format(page),
        headers=headers)
    response.raise_for_status()
    response.encoding = response.apparent_encoding

    soup = BeautifulSoup(response.text, 'html.parser')
    soup = soup.find(name='ul', attrs={"class": "list_con_box_ul"})

    # Collect every gallery link on this list page
    for a in soup.find_all('a'):
        g = a.get('href')
        if len(a) == 4:  # crude filter: the gallery <a> tags have exactly 4 child nodes
            lst.append(g)
lst = ['https://www.tupianzj.com' + x for x in lst]  # the hrefs are relative, so prepend the domain
print(lst)
# Visit each gallery page collected above
for j in range(len(lst)):
    response1 = requests.get(lst[j], headers=headers)
    response1.raise_for_status()
    response1.encoding = response1.apparent_encoding
    soup1 = BeautifulSoup(response1.text, 'html.parser')
    
    # Find out how many pages this gallery has
    SoupPage = soup1.find(name='div', attrs={"class": "pages"})
    SoupPage = SoupPage.find('a')
    SoupPage = re.sub(r"\D", "", SoupPage.text)  # strip non-digits to leave just the page count
    print(SoupPage)

    # Save the first image before looping over the remaining pages
    # Get the image download URL (the first <img> on the page is the photo)
    SoupDown = soup1.find(name='img')
    SoupDown = SoupDown.get('src')
    print(SoupDown)
    
    # Download the image into the current folder
    # (the with block closes the file by itself, so no explicit close is needed)
    pic = requests.get(SoupDown, headers=headers)
    Shi = int(time.time())  # timestamp prefix keeps the file names unique
    with open("{0}{1}".format(Shi, SoupDown[55:]), "wb") as f:  # [55:] crudely slices the file name off the URL
        f.write(pic.content)
    
    # Build the URL of every remaining page of this gallery
    for k in range(2, int(SoupPage) + 1):
        # page k's URL is the gallery URL with "_k" inserted before ".html"
        PersonUrlPage = list(lst[j])
        PersonUrlPage.insert(-5, '_{0}'.format(k))
        PersonUrlPage = "".join(PersonUrlPage)

        response2 = requests.get(PersonUrlPage, headers=headers)
        response2.raise_for_status()
        response2.encoding = response2.apparent_encoding
        soup2 = BeautifulSoup(response2.text, 'html.parser')

        # Get the image download URL
        SoupDown = soup2.find(name='img')
        SoupDown = SoupDown.get('src')
        
        # Download the image into the current folder
        pic = requests.get(SoupDown, headers=headers)
        Shi = int(time.time())
        with open("{0}{1}".format(Shi, SoupDown[55:]), "wb") as f:
            f.write(pic.content)
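
A note to myself for the second version: the SoupDown[55:] slice only works while the URL prefix happens to be exactly 55 characters, and opening a fresh connection for every image is slow. Below is a minimal sketch of a sturdier download helper (the save_image name, the 10-second timeout, and the 0.5-second pause are my own picks, not anything the site requires), using os.path.basename to pull the file name out of the URL and a shared requests.Session to reuse connections:

import os
import time
from urllib.parse import urlparse

import requests

session = requests.Session()  # one shared session reuses TCP connections
session.headers['User-Agent'] = 'Mozilla/5.0'  # same idea as the headers dict above

def save_image(img_url):
    # basename() extracts the file name no matter how long
    # the URL prefix is, unlike the hard-coded [55:] slice
    name = os.path.basename(urlparse(img_url).path)
    resp = session.get(img_url, timeout=10)
    resp.raise_for_status()
    with open("{0}{1}".format(int(time.time()), name), "wb") as f:
        f.write(resp.content)
    time.sleep(0.5)  # be polite: pause between downloads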