Web Crawler

Article link: https://blog.csdn.net/E_hero_/article/details/99626012

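Required steps

1. Use Python to send a request to the server.
2. Once the server responds, you get a response object (the page source plus other response information). response.read() returns the source as bytes.
If what you need is the page source as text, call decode() on the bytes to convert them to a string; decode() uses UTF-8 by default.
3. Use an extraction tool (regular expressions) to process the string.
4. Save the extracted data.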
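
The four steps above can be sketched end to end without touching the network. In this minimal sketch, a hard-coded byte string stands in for what response.read() would return, and the helper names extract_links and links_to_csv are hypothetical, chosen only for illustration. The CSV is written to an in-memory buffer rather than a file so the result is easy to inspect.

```python
import csv
import io
import re

# Same idea as the pattern used later in this post: capture absolute links in <a> tags.
LINK_PATTERN = re.compile(r'<a.*?href="(http.*?)".*?>', re.S | re.I)


def extract_links(html_text):
    """Step 3: pull every http(s) href out of the page source."""
    return LINK_PATTERN.findall(html_text)


def links_to_csv(links):
    """Step 4: serialize the links as one-column CSV rows (in memory)."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    for link in links:
        writer.writerow([link])
    return buffer.getvalue()


# Steps 1 and 2 are simulated: these bytes stand in for response.read().
raw_bytes = (
    b'<a href="http://example.com/a">A</a>'
    b'<a class="x" href="https://example.com/b">B</a>'
)
html_text = raw_bytes.decode()  # bytes -> str, UTF-8 by default

links = extract_links(html_text)
csv_text = links_to_csv(links)
print(csv_text)
```

Keeping each step in its own function makes it possible to check the extraction and storage logic on sample data before pointing the crawler at a real site.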

Step 1: Send a request to the server with Python

from urllib.request import urlopen
import re
# urlopen(url) sends a request to the server and returns a response object
url="http://lagou.com"
response=urlopen(url)
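
A bare urlopen(url) call sends Python's default User-Agent, has no timeout, and raises an exception on any failure. The following is a minimal sketch of a more defensive request, assuming the site accepts a browser-like User-Agent; the helper names build_request and fetch_bytes and the User-Agent string are illustrative assumptions, not part of the original code.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def build_request(url, user_agent="Mozilla/5.0 (compatible; demo-crawler)"):
    """Wrap the URL in a Request so a custom User-Agent header can be attached."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_bytes(url, timeout=10):
    """Return the raw response body, or None if the request fails."""
    request = build_request(url)
    try:
        # The with-block closes the connection once the body has been read.
        with urlopen(request, timeout=timeout) as response:
            return response.read()
    except HTTPError as err:
        # HTTPError is a subclass of URLError, so it must be caught first.
        print(f"server returned HTTP {err.code} for {url}")
    except URLError as err:
        print(f"could not reach {url}: {err.reason}")
    return None
```

Calling fetch_bytes("http://lagou.com") would then return the page bytes, or None with a short message if the server rejects the request or cannot be reached.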

Step 2: Get the page source

read() returns bytes; decode() converts the bytes to a string, using UTF-8 by default.

#print(response.read().decode())
html_text=response.read().decode()
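
decode() with no argument assumes UTF-8, which fails on pages served in another encoding such as GBK. A minimal sketch of a more tolerant decode follows; decode_body is a hypothetical helper name, and the fallback policy (replace undecodable bytes) is one possible choice among several.

```python
def decode_body(raw_bytes, declared_charset=None):
    """Decode with the declared charset if there is one, else UTF-8.

    Falls back to UTF-8 with replacement characters when the declared
    charset is unknown or does not match the actual bytes.
    """
    encoding = declared_charset or "utf-8"
    try:
        return raw_bytes.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        return raw_bytes.decode("utf-8", errors="replace")


# With a live response, the declared charset can be read from the headers:
#   charset = response.headers.get_content_charset()
#   html_text = decode_body(response.read(), charset)
sample = "中文".encode("gbk")
print(decode_body(sample, "gbk"))
```

get_content_charset() returns None when the server does not declare a charset, in which case the helper behaves like the plain decode() call above.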

Step 3: Extract information with a regular expression

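# Capture the target of every <a> tag whose href starts with "http"
res_url="<a.*?href=\"(http.*?)\".*?>"
# re.S lets "." match newlines; re.I ignores case
r=re.findall(res_url,html_text,re.M|re.S|re.I)
# print(r)
for i in r:
	print(i)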
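
Regular expressions work for simple pages but are easy to break with single-quoted attributes or unusual markup. As an alternative technique (not the approach used in this post), here is a minimal sketch based on the standard library's html.parser; the class name LinkCollector and the function extract_links are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that start with "http"."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().startswith("http"):
                self.links.append(value)


def extract_links(html_text):
    collector = LinkCollector()
    collector.feed(html_text)
    return collector.links


sample_html = """
<A HREF="http://example.com/1">one</A>
<a href='/relative'>skipped</a>
<a class="nav" href="https://example.com/2">two</a>
"""
for link in extract_links(sample_html):
    print(link)
```

The parser handles upper-case tags and either quote style, which the regular expression would need extra alternations to cover.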

Step 4: Store the data

Excel, JSON, or a database

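import csv

# newline="" stops the csv module from writing blank lines between rows on Windows
with open("c:/lagou.csv","wt",newline="") as f:
	w=csv.writer(f)
	for i in r:
		# write each link as a one-column row
		w.writerow([i])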
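
JSON and a database are listed above as storage options alongside CSV. The following is a minimal sketch of both using only the standard library, with SQLite standing in for "a database"; the function names, the table name links, and the file names are assumptions made for illustration. The demo writes to a temporary directory so it leaves nothing behind.

```python
import json
import os
import sqlite3
import tempfile


def save_links_json(links, path):
    """Write the list of links to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(links, f, ensure_ascii=False, indent=2)


def save_links_sqlite(links, db_path):
    """Insert links into a one-column table, skipping duplicates."""
    connection = sqlite3.connect(db_path)
    try:
        with connection:  # commits on success, rolls back on error
            connection.execute(
                "CREATE TABLE IF NOT EXISTS links (url TEXT PRIMARY KEY)"
            )
            connection.executemany(
                "INSERT OR IGNORE INTO links (url) VALUES (?)",
                [(link,) for link in links],
            )
    finally:
        connection.close()


sample_links = ["http://example.com/a", "https://example.com/b", "http://example.com/a"]

with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = os.path.join(tmp_dir, "links.json")
    db_path = os.path.join(tmp_dir, "links.db")
    save_links_json(sample_links, json_path)
    save_links_sqlite(sample_links, db_path)

    # Read both back to confirm what was stored.
    with open(json_path, encoding="utf-8") as f:
        loaded_json = json.load(f)
    connection = sqlite3.connect(db_path)
    stored_urls = [row[0] for row in connection.execute("SELECT url FROM links ORDER BY url")]
    connection.close()

print(loaded_json)
print(stored_urls)
```

The JSON file keeps the list exactly as given, while the PRIMARY KEY constraint in the SQLite table drops the repeated link.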

Text data
Image data
Fetching image data

img_url="https://www.baidu.com/img/bd_logo1.png?where=super"
response_img=urlopen(img_url)
#print(response_img.read())
# The URL points to a PNG, so save the bytes with a .png extension
with open("1.png","wb") as f:
	f.write(response_img.read())
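
Hard-coding the output file name makes it easy to save a file under the wrong extension. A minimal sketch of deriving the name from the URL and streaming the body to disk follows; filename_from_url and download_file are hypothetical helper names, and the default name download.bin is an arbitrary choice.

```python
import os
import posixpath
import shutil
from urllib.parse import urlsplit
from urllib.request import urlopen


def filename_from_url(url, default="download.bin"):
    """Take the last path segment of the URL, ignoring the query string."""
    path = urlsplit(url).path
    name = posixpath.basename(path)
    return name or default


def download_file(url, target_dir=".", timeout=10):
    """Stream the response body to a file named after the URL."""
    target_path = os.path.join(target_dir, filename_from_url(url))
    with urlopen(url, timeout=timeout) as response, open(target_path, "wb") as f:
        # copyfileobj copies in chunks, so large files are not held in memory.
        shutil.copyfileobj(response, f)
    return target_path


print(filename_from_url("https://www.baidu.com/img/bd_logo1.png?where=super"))
```

With these helpers, download_file(img_url) would save the image as bd_logo1.png in the current directory, provided the URL is still reachable.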