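Background for this crawler: a freshly installed Ubuntu 16.04 virtual machine whose environment still needs some configuration. I wrote the crawler on Windows a few months ago, and today I am running it inside the VM. (With VMware Tools installed, you can drag files from the host machine straight into the VM.)
The XPath scraping relies on the urllib2 and lxml libraries. Ubuntu 16.04 ships with Python 2.7.11, which already includes urllib2, but lxml has to be installed separately.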
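Before running the script, it can help to confirm which of the two libraries is actually importable. The sketch below is a minimal check and is not part of the original program; the helper name `has_module` is made up for illustration. If lxml turns out to be missing, `sudo apt-get install python-lxml` or `pip install lxml` is the usual way to get it on Ubuntu.

```python
import importlib


def has_module(name):
    """Return True if the named module can be imported (hypothetical helper)."""
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False


if __name__ == "__main__":
    # urllib2 exists only on Python 2; lxml is a third-party package.
    for name in ("urllib2", "lxml"):
        status = "ok" if has_module(name) else "missing"
        print("%s: %s" % (name, status))
```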
Here is the program:
# -*- coding:utf-8 -*-
import urllib2

from lxml import etree


def loadPage(url):
    # Send browser-like headers; some sites reject requests without them.
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:54.0) Gecko/20100101 Firefox/54.0",
        "Referer": "http://www.mmonly.cc/mmtp/xgmn/175265_4.html",
    }
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    html = response.read()
    # print html
    # Parse the HTML and collect the src attribute of every thumbnail image.
    content = etree.HTML(html)
    link_list = content.xpath('//div[@class="thumb"]/img/@src')
    for link in link_list:
        writeImage(link)


def writeImage(link):
    # Download one image and save it under the last 10 characters of its URL.
    request = urllib2.Request(link)
    image = urllib2.urlopen(request).read()
    filename = link[-10:]
    with open(filename, 'wb') as f:
        f.write(image)
    print "download successful: " + filename


if __name__ == "__main__":
    url = "http://www.xiaoliaobaike.cn/qutu"
    # raw_input returns the text as typed; Python 2's input() would eval it.
    p = raw_input("please input a page number: ")
    fullurl = url + "?p=" + str(p)
    loadPage(fullurl)
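To see what the XPath expression in loadPage selects, here is a small sketch that runs the same expression against an inline HTML snippet instead of a live page. The sample markup, the example.com URLs, and the helper name `extract_thumb_links` are assumptions for illustration; the real page's structure may differ.

```python
from lxml import etree

# Hypothetical markup that mimics the structure the crawler expects.
SAMPLE_HTML = """
<html><body>
  <div class="thumb"><img src="http://example.com/img/a0001.jpg"></div>
  <div class="thumb"><img src="http://example.com/img/b0002.jpg"></div>
  <div class="other"><img src="http://example.com/img/skip.jpg"></div>
</body></html>
"""


def extract_thumb_links(html):
    """Return the src of every <img> directly inside a div with class "thumb"."""
    tree = etree.HTML(html)
    return tree.xpath('//div[@class="thumb"]/img/@src')


if __name__ == "__main__":
    for link in extract_thumb_links(SAMPLE_HTML):
        print(link)
```

Note that `@class="thumb"` compares the whole attribute value, so a div with `class="thumb large"` would not match; `contains(@class, "thumb")` is a looser alternative if the page uses several classes.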
Output of the run:
Checking the files:
Open the corresponding folder to view the downloaded images.
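The files in that folder are named with the last ten characters of each image URL (`link[-10:]`), which can cut longer names short or, for shorter names, pick up part of the path including a slash. One possible alternative, sketched below under the assumption that the image URLs end in an ordinary file name, is to take the last path segment instead. The function name `filename_from_url` and the fallback name `image.jpg` are hypothetical.

```python
import posixpath

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2


def filename_from_url(link, default="image.jpg"):
    """Return the last path segment of a URL, ignoring any query string."""
    path = urlparse(link).path
    name = posixpath.basename(path)
    # Fall back to a fixed name when the URL ends in a slash.
    return name or default


if __name__ == "__main__":
    print(filename_from_url("http://example.com/img/photo_12345.jpg?size=large"))
```

This would slot into writeImage in place of `filename = link[-10:]`; two images with the same base name would still overwrite each other, so it is a sketch rather than a complete fix.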