Common Functions for Python Web Crawlers

Latest recommended article published on 2023-03-13 14:41:56

jay&chuxu

Categories: web crawler, python. Tags: web crawler, python

Article link: https://blog.csdn.net/jayandchuxu/article/details/55817479

import urllib.request
import re
import time

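Function for downloading an image

# imgurl: URL of the image on the network
# r'D:\python\code\girls\%s.jpg' % img_num: local save path and file name
# The raw string (r'...') keeps the backslashes in the Windows path from being read as escape sequences

download_img = urllib.request.urlretrieve(imgurl, r'D:\python\code\girls\%s.jpg' % img_num)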
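The one-liner above assumes the target folder already exists. A minimal sketch of a reusable wrapper follows; the function name `download_image` and the `save_dir` parameter are made up for illustration and are not part of the original post.

```python
import os
import urllib.request


def download_image(img_url, save_dir, img_num):
    """Save the image at img_url as <save_dir>/<img_num>.jpg and return that path."""
    os.makedirs(save_dir, exist_ok=True)  # create the target folder if it is missing
    local_path = os.path.join(save_dir, '%s.jpg' % img_num)
    # urlretrieve returns a (filename, headers) tuple; only the filename is used here
    filename, _headers = urllib.request.urlretrieve(img_url, local_path)
    return filename
```

A call such as `download_image(imgurl, r'D:\python\code\girls', img_num)` would then write to the same location as the original line.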

Sleep for 1 s between requests so the crawler does not hit the site too frequently and get detected

time.sleep(1)
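In practice the delay usually sits inside the loop that visits each URL. The sketch below shows one way to do that; `visit_politely` and its `handle` callback are hypothetical names, and the 1-second default simply mirrors the value used above.

```python
import time


def visit_politely(urls, handle, delay=1.0):
    """Call handle(url) for each URL, sleeping `delay` seconds between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay)  # pause before every request except the first
        results.append(handle(url))
    return results
```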

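Regex matching: re.findall returns a list of every substring that matches the pattern

# The dots in the host name are escaped so they match a literal '.' rather than any character
reg2 = r'http://www\.douban\.com/group/topic/\d+'
topiclist = re.findall(reg2, html2)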
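To see what `re.findall` returns, here is a small self-contained run. The `sample_html` string is invented for illustration, and the de-duplication step is an optional extra that the original post does not include.

```python
import re

TOPIC_PATTERN = r'http://www\.douban\.com/group/topic/\d+'

# Made-up page fragment containing one repeated topic link
sample_html = (
    '<a href="http://www.douban.com/group/topic/123/">first</a>'
    '<a href="http://www.douban.com/group/topic/456/">second</a>'
    '<a href="http://www.douban.com/group/topic/123/">first again</a>'
)

topic_list = re.findall(TOPIC_PATTERN, sample_html)
unique_topics = list(dict.fromkeys(topic_list))  # drop duplicates, keep first-seen order
print(unique_topics)
# ['http://www.douban.com/group/topic/123', 'http://www.douban.com/group/topic/456']
```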

Impersonate a browser and read the page

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/51.0.2704.63 Safari/537.36'
}
req = urllib.request.Request(url=article_url, headers=headers)
html = urllib.request.urlopen(req).read().decode('utf8', 'ignore')  # 'ignore' drops undecodable bytes instead of raising an error
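The same steps can be wrapped into helpers so that every request carries the browser-like header. This is only a sketch: `build_request`, `fetch_html`, and the 10-second timeout are assumptions added here, not something the original post specifies.

```python
import urllib.request

DEFAULT_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/51.0.2704.63 Safari/537.36'
}


def build_request(url, headers=None):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url=url, headers=headers or DEFAULT_HEADERS)


def fetch_html(url, timeout=10):
    """Open the URL and decode the body as UTF-8, skipping undecodable bytes."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read().decode('utf8', 'ignore')
```

Calling `fetch_html(article_url)` would then replace the three lines above.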
