Python Web Scraping Basics: Background Knowledge for Writing Crawlers

Latest recommended article published on 2022-05-07 16:24:37

未有涵涵然

Tags: Python web scraping basics

Article link: https://blog.csdn.net/weixin_32668987/article/details/113984572

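1. Using global variables:

A global variable cannot simply be modified wherever you like. Reading a global inside a function works without any declaration, but assigning to it inside a function requires declaring it first, for example global test for a global variable named test. Code under if __name__ == '__main__': is not inside a function; it runs at module level, so no declaration is needed there. Adding one is redundant, and it raises a SyntaxError if the name has already been assigned earlier in the module. Details: https://blog.csdn.net/my2010sam/article/details/17735159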
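The rule above can be illustrated with a minimal sketch. The names counter, read_counter, increment and broken_increment are made up for this example and do not come from the original script.

```python
counter = 0  # module-level (global) variable


def read_counter():
    # Reading a global needs no declaration.
    return counter


def increment():
    # Rebinding a global inside a function requires the declaration.
    global counter
    counter += 1


def broken_increment():
    # Without "global", the assignment makes counter a local name,
    # so reading it before assignment fails at call time.
    counter += 1


increment()

try:
    broken_increment()
    error_name = None
except UnboundLocalError as exc:
    error_name = type(exc).__name__

print(read_counter(), error_name)
```

Only increment changes the module-level value; broken_increment fails before it can touch anything.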

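2. Finding the original image on a page:

The images displayed on a web page are usually compressed thumbnails, while what we want to crawl is the full-resolution original. In that case, right-click the page, choose "View page source", and search the source: the link to the original image can usually be found there. On Baidu Images, for example, the original link is stored under a field named objURL, which Ctrl+F finds quickly. Other sites tend to follow a similar pattern, so a careful search of the source is usually enough.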
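As a rough sketch of this lookup in code, the snippet below pulls every objURL value out of a page source with a regular expression. SAMPLE_SOURCE is invented data on example.com that only imitates the shape of such a field; real page source will differ.

```python
import re

# Hypothetical excerpt of page source: each entry carries a thumbnail
# link and the full-resolution link under "objURL".
SAMPLE_SOURCE = '''
{"thumbURL":"https://example.com/thumb/1.jpg","objURL":"https://example.com/full/1.jpg","width":1920},
{"thumbURL":"https://example.com/thumb/2.jpg","objURL":"https://example.com/full/2.jpg","width":1280},
'''

# Non-greedy group: capture everything up to the closing quote and comma.
OBJ_URL_PATTERN = re.compile(r'"objURL":"(.*?)",')


def extract_original_urls(source):
    """Return all full-resolution image links found in the page source."""
    return OBJ_URL_PATTERN.findall(source)


urls = extract_original_urls(SAMPLE_SOURCE)
print(urls)
```

The thumbnail links are skipped because the pattern is anchored on the objURL key.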

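3. Finding the link to the next page:

Some page URLs are extremely long; Baidu Images URLs are a tedious example. Working out the URL pattern by inspection and building the next page's link from parameters is then impractical. The alternative is to open the page source and search it for the text shown on the page's pagination control, such as "下一页" ("next page"). Taking Baidu Images as the example again: first switch to paged mode with the toggle in the top-right corner, then search the page source for that text. (The original post showed a screenshot here.)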
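The search for the next-page link could be sketched as follows. The HTML in SAMPLE_HTML and the NEXT_LINK pattern are assumptions made for illustration, not the actual Baidu markup; the pattern has to be adapted to whatever the real page source contains.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://image.baidu.com"

# Hypothetical pagination markup; the real markup must be checked in the page source.
SAMPLE_HTML = (
    '<div id="page">'
    '<a href="/search/flip?word=cat&pn=0">上一页</a>'
    '<a href="/search/flip?word=cat&pn=40" class="n">下一页</a>'
    '</div>'
)

# Capture the href of the anchor whose visible text is "下一页" (next page).
NEXT_LINK = re.compile(r'<a\s+href="([^"]+)"[^>]*>下一页</a>')


def find_next_page(html, base_url=BASE_URL):
    """Return the absolute URL of the next page, or None if there is no link."""
    match = NEXT_LINK.search(html)
    if match is None:
        return None
    # The href is relative, so resolve it against the site root.
    return urljoin(base_url, match.group(1))


next_url = find_next_page(SAMPLE_HTML)
print(next_url)
```

Returning None when no link is found gives the caller a natural stopping condition on the last page.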

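4. Finally, the most important point: decoding Chinese text in a page:

After a successful request with the get function of the requests library, we often want to save the page source. The Chinese characters in the saved file may come out garbled no matter how the file is written. To avoid this, add one statement before saving: r.encoding = r.apparent_encoding. Here r.apparent_encoding is the encoding that requests detects from the response body, so the statement tells requests to decode the page with the detected encoding instead of the one it guessed from the headers. Then save the text with with open('file.txt', 'w', encoding='utf-8') as f: .... The saved file will no longer contain garbled Chinese.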
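A minimal sketch of this step is shown below. save_page is a hypothetical helper, not part of the original script, and it is only defined here, not called. The lines after it reproduce the failure offline: GBK bytes decoded as ISO-8859-1, the fallback requests uses for text responses whose headers declare no charset, come out garbled.

```python
from pathlib import Path


def save_page(url, path):
    """Fetch a page and save its source as UTF-8 text (hypothetical helper)."""
    import requests  # third-party; imported here so the offline demo below runs without it

    r = requests.get(url, timeout=10)
    r.raise_for_status()
    # Decode with the encoding detected from the body, not the header guess.
    r.encoding = r.apparent_encoding
    Path(path).write_text(r.text, encoding="utf-8")


# Offline illustration of why the text gets garbled.
original = "百度图片"
raw = original.encode("gbk")          # bytes as a GBK-encoded server would send them
garbled = raw.decode("iso-8859-1")    # wrong codec: mojibake
correct = raw.decode("gbk")           # right codec: the original text

print(repr(garbled), correct)
```

Writing the file with an explicit encoding='utf-8' matters as well, because the platform default for open may not be UTF-8.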

Here is the complete script:

import os
import re
import time
from hashlib import md5
from multiprocessing.pool import Pool
from urllib.parse import quote

import requests
from fake_useragent import UserAgent

# UserAgent().random is a property, so it is read without parentheses.
headers = {'User-Agent': UserAgent().random}

url_list = []  # result-page URLs, in crawl order
page_num = 1   # 1-based counter used in the progress message

# Full-resolution image links are stored in "objURL" fields of the page source.
OBJ_URL_PATTERN = re.compile(r'objURL":"(.*?)",', re.S)

# The next-page pattern was lost from the source, which reads only: re.compile('
# Set this to a compiled regex whose second group captures the relative href of
# the "下一页" (next page) link. While it is None, only the first page is crawled.
NEXT_PAGE_PATTERN = None


def get_one_page(url):
    """Return the HTML of one result page, or None if the request fails."""
    global page_num
    try:
        r = requests.get(url=url, headers=headers, timeout=10)
    except requests.RequestException:
        return None
    if r.status_code != 200:
        return None
    print("Downloading page %s, HTTP status %s" % (page_num, r.status_code))
    page_num = page_num + 1
    return r.text


def get_image_list(html):
    """Return the image URLs on a page and queue the next page's URL."""
    image_list = OBJ_URL_PATTERN.findall(html)
    if NEXT_PAGE_PATTERN is not None:
        matches = NEXT_PAGE_PATTERN.findall(html)
        if matches:
            url_list.append('https://image.baidu.com' + ''.join(matches[0][1]))
    return image_list


def save_image(image_url):
    """Download one image into ./picture, named by the MD5 of its bytes."""
    os.makedirs('picture', exist_ok=True)
    try:
        response = requests.get(url=image_url, headers=headers, timeout=2)
        file_path = '{0}/{1}.{2}'.format('picture', md5(response.content).hexdigest(), 'jpg')
        if not os.path.exists(file_path):
            with open(file_path, 'wb') as f:
                f.write(response.content)
            print("Downloaded: " + file_path)
        else:
            print("Image already exists: " + file_path)
        time.sleep(5)
    except (requests.RequestException, OSError):
        print("Download failed")


if __name__ == '__main__':
    keyword = quote(input("Keyword to crawl: "))     # search term, URL-encoded
    page = int(input("Number of pages to crawl: "))  # how many result pages
    url = 'https://image.baidu.com/search/flip?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&fm=result&fr=&sf=1&fmq=1535006333854_R&pv=&ic=0&nc=1&z=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&ctd=1535006333855%5E00_1903X943&word=' + keyword
    url_list.append(url)
    pool = Pool()
    for each in range(page):
        if each >= len(url_list):
            break  # no next-page link was found
        print(url_list[each])
        html = get_one_page(url_list[each])
        if html is None:
            break
        pool.map(save_image, get_image_list(html))
    # Close the pool only after the loop has finished.
    pool.close()
    pool.join()

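One point in the code deserves attention: pool.map() is called inside the for loop, so the pool must stay open while the loop is running. Do not call pool.close() and pool.join() inside the loop; call them once after the loop has finished. Otherwise the next pool.map() call raises an error.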
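The ordering can be sketched as below. To keep the snippet runnable without an if __name__ == '__main__': guard, it swaps in ThreadPool, which exposes the same map, close and join methods as the process Pool; that substitution, along with the names square and process_batches, is an assumption of this example and not what the original script uses.

```python
from multiprocessing.pool import ThreadPool


def square(n):
    return n * n


def process_batches(batches):
    """Reuse one pool for every batch, then close it once at the end."""
    pool = ThreadPool(2)
    results = []
    try:
        for batch in batches:
            # The pool is still open here, so map() can be called repeatedly.
            results.append(pool.map(square, batch))
    finally:
        # Close and join only after the loop is done.
        pool.close()
        pool.join()
    return results, pool


results, closed_pool = process_batches([[1, 2], [3, 4]])

# Calling map() on a pool that has already been closed fails.
try:
    closed_pool.map(square, [5])
    reuse_error = None
except ValueError as exc:
    reuse_error = exc

print(results, reuse_error)
```

The final try block shows the error the text warns about: a closed pool rejects further work.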
