1. Image crawler
1) Define a custom function that crawls the images on one page
2) Use a for loop to crawl every page in the category
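import re
import urllib.error
import urllib.request

def craw(url, page):
    # Download the listing page. str() on the raw bytes keeps each newline as
    # the two literal characters "\n", so ".+?" can span them without re.S.
    html1 = urllib.request.urlopen(url).read()
    html1 = str(html1)
    # Stage 1: cut out the product-list block.
    pat1 = '<div id="plist".+?<div class="page clearfix">'
    result1 = re.compile(pat1).findall(html1)
    if not result1:
        # No product list found (layout changed or request blocked).
        return
    result1 = result1[0]
    # Stage 2: collect the image URLs inside that block.
    pat2 = r'<img width="220" height="220" data-img="1" data-lazy-img="//(.+?\.jpg)">'
    imagelist = re.compile(pat2).findall(result1)
    x = 1
    for imageurl in imagelist:
        # "_" keeps page 1, image 11 and page 11, image 1 from sharing a name.
        imagename = "C:/Users/alibaba/Desktop/jupyter/code/picturecrawler/jdphoto/" + str(page) + "_" + str(x) + ".jpg"
        imageurl = "http://" + imageurl
        try:
            urllib.request.urlretrieve(imageurl, filename=imagename)
        except urllib.error.URLError as e:
            # Skip images that fail to download; the extra increments
            # leave a gap in the numbering.
            if hasattr(e, "code"):
                x += 1
            if hasattr(e, "reason"):
                x += 1
        x += 1

# Walk pages 1 to 78 of the category.
for i in range(1, 79):
    url = "http://list.jd.com/list.html?cat=9987,653,655&page=" + str(i)
    craw(url, i)
    # Stops after the first page (handy while testing);
    # remove this line to crawl all pages.
    break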
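The two-stage matching in craw can be tried without touching the network. The sketch below is only an illustration: the HTML fragment, the img.example.com addresses, and the function name extract_image_urls are made up for the example. It also assumes the page has been decoded to normal text, so it passes re.S to let ".+?" cross real line breaks instead of relying on str() of the raw bytes.

```python
import re

# Made-up fragment imitating the structure the two patterns expect.
# The first <img> sits outside the product list and should be ignored.
sample_html = (
    '<img width="220" height="220" data-img="1" data-lazy-img="//img.example.com/ad.jpg">\n'
    '<div id="plist" class="goods">\n'
    '<img width="220" height="220" data-img="1" data-lazy-img="//img.example.com/a/1.jpg">\n'
    '<img width="220" height="220" data-img="1" data-lazy-img="//img.example.com/b/2.jpg">\n'
    '</div><div class="page clearfix">1 2 3</div>\n'
)

def extract_image_urls(html):
    """Stage 1 narrows the page to the product list; stage 2 pulls image URLs from it."""
    block_pat = r'<div id="plist".+?<div class="page clearfix">'
    img_pat = r'<img width="220" height="220" data-img="1" data-lazy-img="//(.+?\.jpg)">'
    # re.S lets "." match newlines in decoded text.
    blocks = re.compile(block_pat, re.S).findall(html)
    if not blocks:
        return []
    return ["http://" + u for u in re.compile(img_pat).findall(blocks[0])]

urls = extract_image_urls(sample_html)
print(urls)
```

Restricting the second pattern to the block found by the first is what keeps images outside the product list, such as the ad image in the sample, out of the result.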
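2. Link crawling
1) Decide on the entry URL to crawl
2) Build the regular expression
3) Crawl the page while imitating a browser
4) Extract the links you need (re.compile(pat).findall(data))
5) Filter out duplicate links (list(set(link)))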
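Steps 4) and 5) can be tried on a string before any page is fetched. This is a minimal sketch under stated assumptions: the sample page, the blog.example.com addresses, the link pattern, and the helper names extract_links and dedupe are all hypothetical and are not taken from the notes. It also shows one thing to keep in mind about list(set(link)): it removes duplicates but does not keep the original order, while dict.fromkeys keeps the first occurrence of each link in order.

```python
import re

# Hypothetical page content standing in for the downloaded data.
sample_page = '''
<a href="http://blog.example.com/post/1">first</a>
<a href="http://blog.example.com/post/2">second</a>
<a href="http://blog.example.com/post/1">first again</a>
<a href="/relative/path">not an absolute link</a>
'''

def extract_links(data):
    # Hypothetical pattern: absolute http/https links inside href="...".
    link_pat = r'href="(https?://[^"]+)"'
    return re.compile(link_pat).findall(data)

def dedupe(links):
    # Same set of links as list(set(links)), but in first-seen order.
    return list(dict.fromkeys(links))

links = extract_links(sample_page)
unique_links = dedupe(links)
print(unique_links)
```

If order does not matter, list(set(links)) from step 5) gives the same set of links.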
import re
import urllib.request
def getlink(url):
    # Imitate a browser
    headers = ("User-Agent","Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like