Writing a Simple Python Crawler: Fetching Images

寒月望山

Original post, 2018-03-04 15:28:56

Project: scrape the Taobao model (淘女郎) photos and name each file as name + city.

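Key points:

1. Learn to use selenium's webdriver to simulate browser behavior

2. Get familiar with the basic open-source libraries for Python crawlers and put them to use

3. Master the basic usage of regular expressions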
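
The third point carries most of the extraction work in this post, so here is a minimal sketch of the non-greedy capture `(.*?)` that the patterns in the final code rely on. The sample string is made up for illustration.

```python
import re

# Made-up sample: two name spans on a single line.
sample = '<span class="name">Alice</span><span class="name">Bella</span>'

# Greedy: (.*) runs to the LAST </span>, so both spans collapse into one match.
greedy = re.findall(r'<span class="name">(.*)</span>', sample)

# Non-greedy: (.*?) stops at the FIRST </span>, giving one match per span.
lazy = re.findall(r'<span class="name">(.*?)</span>', sample)

print(greedy)
print(lazy)
```

With several items on one line of HTML, only the non-greedy form returns one entry per item, which is why every capture group in the crawler uses `(.*?)`.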

Find the target URL:

https://mm.taobao.com/search_tstar_model.htm?

First, write a simple crawler script that fetches the page source, to check whether this approach is feasible:

import urllib.request

from bs4 import BeautifulSoup

url = "https://mm.taobao.com/search_tstar_model.htm?"
page = urllib.request.urlopen(url)
html = page.read().decode('gbk')  # decode the response as GBK
soup = BeautifulSoup(html, "html.parser")
print(soup)  # inspect the static source for the data we want

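The page turns out to load its content with AJAX, so the source fetched this way does not contain the list data. We can use webdriver to obtain the dynamically rendered page instead.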
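
One way to make this feasibility check concrete is to count the elements you expect in the static source and compare that with the rendered DOM. The sketch below uses only the standard library; the two HTML strings are hypothetical stand-ins for the raw response and for the browser-rendered page, and `count_class` is a helper name introduced here for illustration. The class name `img` is the one targeted by the regular expressions in the final code.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, class_name):
    counter = ClassCounter(class_name)
    counter.feed(html)
    counter.close()
    return counter.count


# Hypothetical static response: the list container stays empty until a script fills it.
static_html = '<div id="list"></div><script src="app.js"></script>'
# Hypothetical rendered DOM, as a browser would expose it after the AJAX call.
rendered_html = '<div id="list"><div class="img"><img src="//x/a.jpg"/></div></div>'

print(count_class(static_html, "img"))
print(count_class(rendered_html, "img"))
```

If the count is zero for the raw response but positive for `driver.page_source`, the data is being filled in by script, and a plain `urllib` fetch will not be enough.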

Final code:

import os
import re
import time
import urllib.request

from bs4 import BeautifulSoup
from selenium import webdriver

# Path to the PhantomJS executable (webdriver.PhantomJS needs Selenium 3.x or earlier)
BROWSER_PATH = 'E:/虚拟浏览器/phantomjs-2.1.1-windows/bin/phantomjs.exe'
FIRST_URL = 'https://mm.taobao.com/search_tstar_model.htm?'
SAVE_DIR = 'E:/taobao'


def gethtml(url, page):
    """Open the list page, jump to the given page number, and return the parsed DOM."""
    driver = webdriver.PhantomJS(executable_path=BROWSER_PATH)
    try:
        driver.get(url)
        time.sleep(2.5)  # wait for the AJAX content to load
        # Type the page number into the "skip to page" box and confirm
        driver.find_element_by_class_name("page-skip").send_keys(str(page))
        driver.find_element_by_class_name("page-btn").click()
        time.sleep(2)  # wait for the requested page to render
        return BeautifulSoup(driver.page_source, "html5lib")
    finally:
        driver.quit()  # close the browser instead of killing it with taskkill


def getjpg(bsOBj, page):
    """Download every picture on the page as <name><city>.jpg."""
    html = str(bsOBj)
    # Lazy-loaded images keep the real address in data-ks-lazyload
    jpgurl = re.findall(r'<div class="img"><img data-ks-lazyload="(.*?)" src=".*?"/></div>', html)
    # Images that are already loaded carry the address in src
    jpgurl1 = re.findall(r'<div class="img"><img src="(.*?)"/></div>', html)
    name = re.findall(r'<span class="name">(.*?)</span>', html)
    site = re.findall(r'<span class="city">(.*?)</span>', html)
    # Merge the already-loaded images into the list, starting at index 1 as in the original
    for i in range(len(jpgurl1)):
        jpgurl.insert(i + 1, jpgurl1[i])

    folder = '%s/%s' % (SAVE_DIR, page)
    os.makedirs(folder, exist_ok=True)  # also works when the folder already exists
    for k in range(len(jpgurl)):
        img = 'http:' + jpgurl[k]  # the extracted addresses lack a scheme
        picname = name[k] + site[k]
        urllib.request.urlretrieve(img, '%s/%s.jpg' % (folder, picname))


for page in range(1, 51):
    bsOBj = gethtml(FIRST_URL, page)
    getjpg(bsOBj, page)
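
The extraction step can be tried offline against a small HTML snippet before running the full crawl. The sketch below is not the author's two-list merge: it swaps in a single alternation pattern that picks up lazy-loaded and already-loaded images in document order. The sample markup, the `example.com` addresses, and the `safe_name` and `extract_records` helpers are all hypothetical and only illustrate the idea.

```python
import re

# Hypothetical snippet mimicking the markup targeted by the patterns above.
SAMPLE = (
    '<div class="img"><img src="//img.example.com/a.jpg"/></div>'
    '<span class="name">Alice</span><span class="city">Hangzhou</span>'
    '<div class="img"><img data-ks-lazyload="//img.example.com/b.jpg" '
    'src="//img.example.com/blank.gif"/></div>'
    '<span class="name">Bella</span><span class="city">Shanghai</span>'
)

# Either a lazy-loaded image (group 1) or an already-loaded one (group 2).
IMG_RE = re.compile(
    r'<div class="img"><img '
    r'(?:data-ks-lazyload="(.*?)" src=".*?"|src="(.*?)")/></div>'
)


def safe_name(text):
    """Replace characters that Windows does not allow in file names."""
    return re.sub(r'[\\/:*?"<>|]', "_", text).strip()


def extract_records(html):
    """Return (image URL, file name) pairs in document order."""
    urls = [lazy or plain for lazy, plain in IMG_RE.findall(html)]
    names = re.findall(r'<span class="name">(.*?)</span>', html)
    cities = re.findall(r'<span class="city">(.*?)</span>', html)
    # zip stops at the shortest list, so a missing name cannot raise IndexError
    return [
        ("http:" + url, safe_name(name + city) + ".jpg")
        for url, name, city in zip(urls, names, cities)
    ]


for record in extract_records(SAMPLE):
    print(record)
```

Pairing the three lists with `zip` trades completeness for safety: if the page yields fewer names than images, the extra images are skipped rather than crashing the run.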

Result:
