Study Notes on Prof. Song Tian's Web Crawler and Information Extraction Course (Part 3)

Latest recommended article published on 2021-12-25 21:35:09

cnnf

Category: Python Web Crawlers. Tags: python, regular expressions, cookie

Article link: https://blog.csdn.net/xuechen_gemgirl/article/details/105449684

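These notes contain the code I typed along with Prof. Song Tian's video lectures. The code is modular: the three steps of the crawler example are implemented as three functions, described in the comments below. Thanks to the MOOC platform, to Prof. Song for his detailed explanations, and to the teaching assistant whose illustrated walkthrough shows how to get past Taobao's access restriction.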
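
Before the full script, here is a minimal sketch of how the paginated search URLs in step 1 can be assembled. It assumes, as the script below does, that Taobao's search page takes the keyword in `q` and a result offset in `s` that advances by 44 per page. `build_search_urls`, `BASE_URL` and `ITEMS_PER_PAGE` are hypothetical names introduced for illustration, and `urllib.parse.urlencode` is used here in place of the script's plain string concatenation so that the keyword is percent-encoded explicitly.

```python
from urllib.parse import urlencode

BASE_URL = "https://s.taobao.com/search"
ITEMS_PER_PAGE = 44  # assumption: taken from the script's `i*44` offset


def build_search_urls(keyword, depth):
    """Return one search URL per result page (hypothetical helper)."""
    urls = []
    for page in range(depth):
        # q is the search keyword, s the offset of the first item on the page.
        query = urlencode({"q": keyword, "s": page * ITEMS_PER_PAGE})
        urls.append(BASE_URL + "?" + query)
    return urls


if __name__ == "__main__":
    for url in build_search_urls("佳明", 2):
        print(url)
```

Building the list of URLs separately from fetching them makes the pagination logic easy to check without sending any request.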

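# Targeted crawler for Taobao product information
# Purpose: fetch Taobao search result pages and extract each product's name and price
# Things to understand: Taobao's search interface
#                       how pagination is handled
# Technical approach: requests + re
# URL: https://s.taobao.com/search?q=%E5%8C%85&imgfile=&js=1&stats_click=search_radio_all%3A1&initiative_id=staobaoz_20200410&ie=utf8
# Program structure
# Step 1: submit the product search request and fetch the result pages in a loop.
# Step 2: for each page, extract the product names and prices.
# Step 3: print the results to the screen.
# Taobao requires a login, so two request headers were added to the original program and passed to requests.get: User-Agent and cookie. Setting them covers two of the common ways for a crawler to avoid being blocked.
# 1. User-Agent makes the HTTP request headers look like those of a normal browser. A Python program otherwise announces itself through its default User-Agent: python-requests/<version> for requests, or something like Python-urllib/3.4 for the urllib standard library.
#    The code therefore sets it to a browser's User-Agent string.
# 2. Websites use cookies to track a visit and may cut it off when they detect crawler-like behavior. When this program fetches Taobao data without a cookie, the request is redirected to the login page and no data is returned,
#    so the cookie has to be supplied. To find it, I logged in to Taobao, searched for the product, pressed F12 on the results page, opened the Network tab, selected the request whose name starts with search?q=..., and copied the cookie value from Request Headers under Headers.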
import re
import requests
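def getHTMLText(url):
	"""Fetch url and return the page text, or "" if the request fails."""
	try:
		headers = {
			# Browser-like User-Agent so the request does not identify itself as a script.
			"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
			# Paste the cookie copied from a logged-in browser session here.
			"cookie": "your cookie value; find it yourself as shown in the screenshot below. Good luck!",
		}
		r = requests.get(url, timeout=30, headers=headers)
		r.raise_for_status()
		r.encoding = r.apparent_encoding
		return r.text
	except requests.RequestException:
		return ""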
	

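def parsePage(ilt, html):
	"""Append a [price, title] pair to ilt for every product found in html."""
	try:
		# Capture groups return the bare values, so no eval() is run on page text
		# and a ':' inside a title cannot truncate it.
		regex1 = re.compile(r'"view_price":"([\d.]*)"')
		regex2 = re.compile(r'"raw_title":"(.*?)"')
		plt = regex1.findall(html)
		tlt = regex2.findall(html)
		# zip pairs prices and titles by position and stops at the shorter list.
		for price, title in zip(plt, tlt):
			ilt.append([price, title])
	except Exception:
		print("parsePageEx")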

def printGoodsList(ilt):
	"""Print the collected [price, title] pairs as a numbered table."""
	tplt = "{:4}\t{:8}\t{:16}"
	print(tplt.format("No.", "Price", "Product name"))
	for count, g in enumerate(ilt, start=1):
		print(tplt.format(count, g[0], g[1]))

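def main():
	goods = "佳明"  # search keyword (the brand Garmin)
	depth = 2  # number of result pages to fetch
	start_url = "https://s.taobao.com/search?q=" + goods
	infoList = []
	for i in range(depth):
		try:
			# s is the offset of the first item on the page; it advances by 44 per page.
			url = start_url + '&s=' + str(i * 44)
			html = getHTMLText(url)
			parsePage(infoList, html)
		except Exception:
			print("mainEx")
			continue
	printGoodsList(infoList)


if __name__ == "__main__":
	main()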
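
The extraction step can be tried offline on a hand-written fragment, without a cookie or any network access. The sketch below assumes the page embeds fields shaped like `"raw_title":"..."` and `"view_price":"..."`, which is what the regular expressions in `parsePage` look for. `SAMPLE_HTML` and `extract_goods` are made up for illustration and are not real Taobao output. Each value is read through a capture group, which keeps a title intact even when it contains a colon, whereas splitting the match on `':'` and taking element `[1]` would cut the title at its first colon.

```python
import re

# Hypothetical fragment imitating the fields embedded in a search result page.
SAMPLE_HTML = (
    '{"raw_title":"Garmin watch: GPS edition","view_price":"1299.00"},'
    '{"raw_title":"Garmin strap","view_price":"59.90"}'
)

PRICE_RE = re.compile(r'"view_price":"([\d.]*)"')
TITLE_RE = re.compile(r'"raw_title":"(.*?)"')


def extract_goods(html):
    """Return [price, title] pairs found in html, paired by position."""
    prices = PRICE_RE.findall(html)  # with one group, findall returns the group text
    titles = TITLE_RE.findall(html)
    return [[price, title] for price, title in zip(prices, titles)]


if __name__ == "__main__":
    for price, title in extract_goods(SAMPLE_HTML):
        print(price, title)
```

Pairing by position only works if every product carries both fields in the same order; if a real page ever omits one of them, the prices and titles would drift out of step, so a stricter parser would be needed.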