A Brief Look at Python Web Crawlers

JevonK

Published 2023-02-15 09:31:58

Article link: https://blog.csdn.net/zanj0525/article/details/129037390

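Background

At work, the operations team once handed me a requirement: scrape the school information, and the majors offered by each school, from the "Sunshine Colleges" site (阳光高校, gaokao.chsi.com.cn in the code below). The data would be used in our application to handle school types, so that people from different schools could be classified. I was a bit lost when the request first came in, but after thinking it over I decided it was doable; the only open question was what to write it in. Python happened to be very popular at the time, so I decided to use this requirement as practice. After all, it sat outside our core business, so nothing else would be affected. After a quick look at Python's syntax, I got started.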

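Introduction

Since I chose Python for this crawler, it makes sense to start with a little of Python's history.

Python's creator is the Dutchman Guido van Rossum. He received a master's degree in mathematics and computer science from the University of Amsterdam in 1982.

In the mid-1980s, Guido van Rossum was working at CWI (a research center for mathematics and theoretical computer science in Amsterdam), contributing code to the ABC language. ABC was a research project aimed at programming beginners. It influenced Guido greatly, and Python inherited a lot from it: for example, strings, lists, and byte sequences all support indexing, slicing, sorting, and concatenation.

After working at CWI for a while, Guido conceived a programming language devoted to solving problems, because he felt the existing languages were very unfriendly to people outside computer science. So in December 1989, to pass a dull Christmas holiday, he started writing the first version of Python. The origin of the name is worth mentioning: "python" means a large snake, but Guido's choice had nothing to do with snakes. While implementing the language he was also reading the scripts of Monty Python's Flying Circus, a BBC comedy series from the 1970s. He wanted a name that was short, unique, and slightly mysterious, so he decided to call the language Python.

In 1991 the first Python interpreter was born. It was implemented in C; much of its syntax came from C, and it was also heavily influenced by ABC. Some of the syntax inherited from ABC remains controversial to this day, mandatory indentation being one example. Most languages are free-form: they do not care how much you indent or which line something is on, as long as the necessary whitespace is there. Python, however, requires indentation, which has led programmers from other languages to joke that "Python programmers need to know how to use vernier calipers."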

Python 1.0 was released in January 1994. Its main new features were lambda, map, filter, and reduce, although Guido was not fond of them.

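Six and a half years later, in October 2000, Python 2.0 was released. Its main new features were memory management, a cycle-detecting garbage collector, and Unicode support. Even more important was the change in the development process: Python now had a more transparent community.

In December 2008, Python 3.0 was released. Python 3.x is not backward compatible with Python 2.x, which means Python 3.x may be unable to run Python 2.x code. Python 3 represents the future of the language.

Today Python is firmly in the 3.x era and its community is thriving. When you run into a Python problem, someone has almost always hit the same problem and solved it already. So learning Python is not that hard: install the environment, start writing code, run into problems, solve them.

Enough background. Let's move on to the main topic: building the crawler.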

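Content

The requirement is to scrape the school information from the site together with each school's majors, so the school records (parent) and the major records (child) have to be linked.

Linking and managing data means using a database. I was working with MySQL 5.6 at the time, so I simply used that.

Since the source is a web page, we need to imitate a browser request to get the rendered page, then parse the HTML and extract the text from it.

To sum up, we need three things: fetch the page HTML, store data in a database, and parse the HTML to get the text.

Next we need a Python package for each of these. Following the principle of "Baidu-driven development", I searched around and found that requests, pymysql, and bs4 would do the job.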
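To make the "parse the HTML and extract the text" step more concrete, here is a minimal sketch. It deliberately uses the standard library's html.parser as a stand-in for bs4 so that it runs without third-party packages. The sample markup and the names CellTextParser and extract_rows are made up for illustration; the real page's structure may well differ.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being parsed
        self._cell = None   # text chunks of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a <td> cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # rows with only <th> cells stay empty and are skipped
                self.rows.append(self._row)
            self._row = None


def extract_rows(html):
    parser = CellTextParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample markup, not the real page
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>City</th></tr>
  <tr><td> School A </td><td>Beijing</td></tr>
  <tr><td>School B</td><td><span>Shanghai</span></td></tr>
</table>
"""

rows = extract_rows(SAMPLE_HTML)
print(rows)
```

With bs4 the same idea collapses to a few select() and get_text() calls, which is why the script in this article uses it.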

Now let's look at the code that implements the requirement.

First, import the packages we need.

import requests 
import bs4
import pymysql
import json
from xpinyin import Pinyin

Connect to the database and create the tables we need.

# Open the database connection
db = pymysql.connect(host = "localhost", user = "root", password = "123456", database = "test")
# Create a cursor object with cursor()
cursor = db.cursor()
# Drop the school and major tables if they already exist
cursor.execute("drop table if exists school")
cursor.execute("drop table if exists major")
# Fetch the first results page
res = requests.get('https://gaokao.chsi.com.cn/sch/search.do?start=0')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
# Read the total number of pages from the pager
page_num = soup.select('.ch-page li')[7].string
th = soup.select('th')
th_herd = []
# Converter from Chinese characters to pinyin
pin = Pinyin()
field = ""
for item in th:
	# Each table header becomes a varchar column named after its pinyin
	field += (","+pin.get_pinyin(item.get_text(), '_')+" varchar(100)")
	th_herd.append(pin.get_pinyin(item.get_text(), '_'))
# CREATE TABLE statements for the school and major tables
sql = "CREATE TABLE school (id int NOT NULL AUTO_INCREMENT "+field+",PRIMARY KEY ( id ))"
sql1 = "CREATE TABLE major (s_id int , name varchar(100))"
# Run the statements
cursor.execute(sql)
cursor.execute(sql1)
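Because the column names are generated from page headers, it can help to normalize them before they go into a CREATE TABLE statement. The sketch below is one possible way to do that, not the approach used in the script above. It assumes the headers have already been romanized (for example by xpinyin), uses sqlite3 as a stand-in for MySQL so it runs without a server, and the helper names to_column_name and build_create_table are hypothetical.

```python
import re
import sqlite3


def to_column_name(header):
    """Turn a romanized header into a lowercase identifier (hypothetical helper)."""
    name = re.sub(r"[^0-9a-zA-Z]+", "_", header).strip("_").lower()
    # Identifiers should not be empty or start with a digit
    if not name or name[0].isdigit():
        name = "col_" + name
    return name


def build_create_table(table, headers):
    """Return the CREATE TABLE statement and the list of generated column names."""
    columns = [to_column_name(h) for h in headers]
    column_sql = ", ".join(f"{c} VARCHAR(100)" for c in columns)
    sql = f"CREATE TABLE {table} (id INTEGER PRIMARY KEY AUTOINCREMENT, {column_sql})"
    return sql, columns


# Made-up sample headers for illustration
headers = ["yuan xiao ming cheng", "suo-zai-di", "985"]
create_sql, columns = build_create_table("school", headers)

# sqlite3 stands in for MySQL here; the MySQL syntax for the id column differs
conn = sqlite3.connect(":memory:")
conn.execute(create_sql)
table_columns = [row[1] for row in conn.execute("PRAGMA table_info(school)")]
conn.close()
print(columns)
```

Keeping the generated names in a list also makes it easy to reuse them later when building the INSERT statement.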

Fetch the school information and assemble the INSERT statement for the school table.

# Turn the list of column names into a comma-separated string
field = ','.join(th_herd)
j = 0
m = 1
while j<int(page_num):
	# Print how many rows have been processed so far
	print(j*20)
	res = requests.get('https://gaokao.chsi.com.cn/sch/search.do?start=%s'%(j*20))
	res.raise_for_status()
	soup = bs4.BeautifulSoup(res.text, 'html.parser')
	tr = soup.select("table tr")
	i = 0
	a = ""
	a1 = ""
	while i<len(tr):
		# Skip the header row
		if i == 0:
			i += 1
			continue
		b = ''
		for x in tr[i].select('td')[5].select('span'):
			b = b + "    " + str(x.get_text())
		c = (1 if (tr[i].select('td')[6].string) is None else 0)
		# The last row ends the statement with ";", all other rows with ","
		d = (";" if (i+1) == len(tr) else ",")
		a += "('%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s')%s" % (tr[i].select('td')[0].get_text().strip(), tr[i].select('td')[1].get_text().strip(), tr[i].select('td')[2].get_text().strip(),tr[i].select('td')[3].get_text().strip(), tr[i].select('td')[4].get_text().strip(), b, c, tr[i].select('td')[7].get_text().strip(), d)
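Building the VALUES list by string formatting breaks as soon as a scraped value contains a quote character. A common alternative is to pass the rows as parameters and let the driver do the escaping. The sketch below shows the idea with sqlite3 as a stand-in for MySQL so that it runs anywhere; the table layout and sample rows are made up. With pymysql the placeholder would be %s rather than ?.

```python
import sqlite3

# Made-up sample rows; note the apostrophe, which would break a hand-built SQL string
rows = [
    ("School A", "Beijing", "O'Brien Campus"),
    ("School B", "Shanghai", ""),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE school (id INTEGER PRIMARY KEY, name TEXT, city TEXT, note TEXT)")

# One parameterized statement, executed once per row; values are never spliced into the SQL text
conn.executemany("INSERT INTO school (name, city, note) VALUES (?, ?, ?)", rows)
conn.commit()

stored = conn.execute("SELECT name, city, note FROM school ORDER BY id").fetchall()
conn.close()
print(stored)
```

This also removes the need to track whether a row is the last one in order to pick "," or ";".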

Fetch the majors for each school and build the INSERT statement that stores them.

		# Fetch the school's majors
		## Look up the school by name
		res1 = requests.get('https://gaokao.chsi.com.cn/zyk/pub/myd/specAppraisalTop.action?yxmc=%s'%(tr[i].select('td')[0].get_text().strip()))
		res1.raise_for_status()
		soup1 = bs4.BeautifulSoup(res1.text, 'html.parser')
		url = soup1.select('.check_detail')
		## Open the "view all majors" pages (at most the first two links)
		u = 0
		for x in url:
			if u > 1:
				break
			u = u + 1
			res1 = requests.get('https://gaokao.chsi.com.cn/%s'%(x['href']))
			res1.raise_for_status()
			soup1 = bs4.BeautifulSoup(res1.text, 'html.parser')
			td = soup1.select(".first_td")
			j1 = 0
			while j1<len(td):
				# Skip the first cell, which is the column header
				if j1 == 0:
					j1 = j1 + 1
					continue
				d1 = ','
				# m is the id of the current school
				a1 = a1 + "(%s, '%s')%s" % (m, td[j1].get_text().strip(), d1)
				j1 = j1 + 1
		i = i + 1
		m = m + 1
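The request URLs above are assembled with plain string formatting. As a hedged alternative, the standard library's urllib.parse can build them, which takes care of percent-encoding Chinese school names and of joining relative links. The sketch below only constructs URLs and sends no requests. The helper names are hypothetical, and the href in the last line is a made-up example rather than a real path on the site.

```python
from urllib.parse import urlencode, urljoin, urlsplit, parse_qs

BASE_URL = "https://gaokao.chsi.com.cn/"
PAGE_SIZE = 20  # rows per result page, as in the script above


def list_page_url(page_index):
    """URL of the school list page with the given zero-based index."""
    query = urlencode({"start": page_index * PAGE_SIZE})
    return urljoin(BASE_URL, "sch/search.do?" + query)


def school_search_url(school_name):
    """URL of the major lookup for one school; the name is percent-encoded."""
    query = urlencode({"yxmc": school_name})
    return urljoin(BASE_URL, "zyk/pub/myd/specAppraisalTop.action?" + query)


def absolute_url(href):
    """Resolve a link taken from the page, with or without a leading slash."""
    return urljoin(BASE_URL, href)


print(list_page_url(2))
search_url = school_search_url("北京大学")
print(search_url)
# Decoding the query string gives back the original school name
print(parse_qs(urlsplit(search_url).query))
# Hypothetical href, for illustration only
print(absolute_url("/zyk/detail.action?id=1"))
```

Using urljoin avoids the doubled slash that simple concatenation produces when an href already starts with "/".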

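Execute the SQL statements, roll back if an error occurs, and finally close the database connection.

	# Run the INSERT statements for this page
	sql = """INSERT INTO school("""+field+""") VALUES """ + a
	sql1 = """INSERT INTO major(s_id,name) VALUES """ + a1
	j = j + 1
	try:
		# Execute the SQL statements (sql1[:-1] drops the trailing comma)
		cursor.execute(sql)
		cursor.execute(sql1[:-1])
		# Commit the transaction
		db.commit()
	except Exception:
		# Roll back on error, then print the failing statement and stop
		db.rollback()
		print(sql1)
		exit()
db.close()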
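The commit-or-rollback pattern is easier to see in isolation. The following sketch saves one page of schools and majors as a single transaction: if either insert fails, nothing from that page is kept. It uses sqlite3 in place of MySQL so it can run without a server, and the function name save_page, the table layout, and the sample data are all assumptions made for the example.

```python
import sqlite3


def save_page(conn, schools, majors):
    """Insert one page of schools and majors atomically; return True on success."""
    try:
        conn.executemany("INSERT INTO school (id, name) VALUES (?, ?)", schools)
        conn.executemany("INSERT INTO major (s_id, name) VALUES (?, ?)", majors)
        conn.commit()
        return True
    except sqlite3.Error as exc:
        # Undo everything inserted for this page, then report the problem
        conn.rollback()
        print("page skipped:", exc)
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE school (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("CREATE TABLE major (s_id INTEGER NOT NULL, name TEXT NOT NULL)")
conn.commit()

ok = save_page(conn, [(1, "School A")], [(1, "Physics")])
# The second page is deliberately broken: a major without a name violates NOT NULL
failed = save_page(conn, [(2, "School B")], [(2, None)])

school_count = conn.execute("SELECT COUNT(*) FROM school").fetchone()[0]
major_count = conn.execute("SELECT COUNT(*) FROM major").fetchone()[0]
conn.close()
print(ok, failed, school_count, major_count)
```

Whether a failed page should stop the whole crawl, as in the script above, or just be skipped, as here, is a design choice that depends on how complete the data needs to be.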

# Complete script, assembled from the fragments above
import requests
import bs4
import pymysql
from xpinyin import Pinyin

# Open the database connection
db = pymysql.connect(host = "localhost", user = "root", password = "123456", database = "test")
# Create a cursor object with cursor()
cursor = db.cursor()
# Drop the school and major tables if they already exist
cursor.execute("drop table if exists school")
cursor.execute("drop table if exists major")
# Fetch the first results page
res = requests.get('https://gaokao.chsi.com.cn/sch/search.do?start=0')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
# Read the total number of pages from the pager
page_num = soup.select('.ch-page li')[7].string
th = soup.select('th')
th_herd = []
# Converter from Chinese characters to pinyin
pin = Pinyin()
field = ""
for item in th:
	# Each table header becomes a varchar column named after its pinyin
	field += (","+pin.get_pinyin(item.get_text(), '_')+" varchar(100)")
	th_herd.append(pin.get_pinyin(item.get_text(), '_'))
# CREATE TABLE statements for the school and major tables
sql = "CREATE TABLE school (id int NOT NULL AUTO_INCREMENT "+field+",PRIMARY KEY ( id ))"
sql1 = "CREATE TABLE major (s_id int , name varchar(100))"
# Run the statements
cursor.execute(sql)
cursor.execute(sql1)
# Turn the list of column names into a comma-separated string
delimiter = ','
field = delimiter.join(th_herd)
j = 0
m = 1
while j<int(page_num):
	# Print how many rows have been processed so far
	print(j*20)
	res = requests.get('https://gaokao.chsi.com.cn/sch/search.do?start=%s'%(j*20))
	res.raise_for_status()
	soup = bs4.BeautifulSoup(res.text, 'html.parser')
	tr = soup.select("table tr")
	i = 0
	a = ""
	a1 = ""
	while i<len(tr):
		# Skip the header row
		if i == 0:
			i += 1
			continue
		b = ''
		for x in tr[i].select('td')[5].select('span'):
			b = b + "    " + str(x.get_text())
		c = (1 if (tr[i].select('td')[6].string) is None else 0)
		# The last row ends the statement with ";", all other rows with ","
		d = (";" if (i+1) == len(tr) else ",")
		a += "('%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s')%s" % (tr[i].select('td')[0].get_text().strip(), tr[i].select('td')[1].get_text().strip(), tr[i].select('td')[2].get_text().strip(),tr[i].select('td')[3].get_text().strip(), tr[i].select('td')[4].get_text().strip(), b, c, tr[i].select('td')[7].get_text().strip(), d)
		# Fetch the school's majors
		## Look up the school by name
		res1 = requests.get('https://gaokao.chsi.com.cn/zyk/pub/myd/specAppraisalTop.action?yxmc=%s'%(tr[i].select('td')[0].get_text().strip()))
		res1.raise_for_status()
		soup1 = bs4.BeautifulSoup(res1.text, 'html.parser')
		url = soup1.select('.check_detail')
		## Open the "view all majors" pages (at most the first two links)
		u = 0
		for x in url:
			if u > 1:
				break
			u = u + 1
			res1 = requests.get('https://gaokao.chsi.com.cn/%s'%(x['href']))
			res1.raise_for_status()
			soup1 = bs4.BeautifulSoup(res1.text, 'html.parser')
			td = soup1.select(".first_td")
			j1 = 0
			while j1<len(td):
				# Skip the first cell, which is the column header
				if j1 == 0:
					j1 = j1 + 1
					continue
				d1 = ','
				# m is the id of the current school
				a1 = a1 + "(%s, '%s')%s" % (m, td[j1].get_text().strip(), d1)
				j1 = j1 + 1
		i = i + 1
		m = m + 1
	# Run the INSERT statements for this page
	sql = """INSERT INTO school("""+field+""") VALUES """ + a
	sql1 = """INSERT INTO major(s_id,name) VALUES """ + a1
	j = j + 1
	try:
		# Execute the SQL statements (sql1[:-1] drops the trailing comma)
		cursor.execute(sql)
		cursor.execute(sql1[:-1])
		# Commit the transaction
		db.commit()
	except Exception:
		# Roll back on error, then print the failing statement and stop
		db.rollback()
		print(sql1)
		exit()
db.close()

#coding=utf-8
# A second script: download Baidu image search results for a keyword
import urllib.request
from urllib.parse import quote
import string
import json
import requests
import os
 
print("    I love animals.")
print("              ┏┓      ┏┓")
print("            ┏┛┻━━━┛┻┓")
print("            ┃      ☃      ┃")
print("            ┃  ┳┛  ┗┳  ┃")
print("            ┃      ┻      ┃")
print("            ┗━┓      ┏━┛")
print("                ┃      ┗━━━┓")
print("                ┃  Beast bless ┣┓")
print("                ┃ no bugs!    ┏┛")
print("                ┗┓┓┏━┳┓┏┛")
print("                  ┃┫┫  ┃┫┫")
print("                  ┗┻┛  ┗┻┛")
print("      ")
 
def mkdir(path):
    # Strip leading and trailing whitespace
    path=path.strip()
    # Strip a trailing backslash
    path=path.rstrip("\\")

    # Check whether the path already exists
    isExists=os.path.exists(path)

    if not isExists:
        # Create the directory if it does not exist
        os.makedirs(path)
        print(path+' created')
        return True
    else:
        # Leave an existing directory alone and say so
        print(path+' already exists')
        return False


def upload_file(fileUrl, filePath, fileName):
    # Despite its name, this function downloads one image and saves it as a .jpg file
    try:
        pic = requests.get(fileUrl, timeout=15)
        savePath = filePath + '/' + str(fileName) + '.jpg'
        with open(savePath, 'wb') as f:
            f.write(pic.content)
            print('Downloaded image %s: %s' % (fileName, str(fileUrl)))
    except Exception as e:
        print('Failed to download image %s: %s' % (fileName, str(fileUrl)))
        print(e)
 
filePath = "./baiduImages"
mkdir(filePath)
# Search keyword ("puppy")
searchName = "小狗"
# Number of results per page
row = 30
i = 1
gsm = "1e"
while True:
    # Offset of the page to request
    page = row*i
    i = i+1
    url = "https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord="+searchName+"&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=0&hd=&latest=&copyright=&word="+searchName+"&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&expermode=&force=&pn="+str(page)+"&rn="+str(row)+"&gsm="+gsm
    headers = {
        'User-Agent':'Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
        'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
    }
    print(url)
    # Percent-encode the non-ASCII keyword while leaving printable ASCII untouched
    request = urllib.request.Request(quote(url,safe = string.printable),headers=headers)
    response = urllib.request.urlopen(request)
    responseData = response.read()
    responseJson = json.loads(responseData)
    print(responseJson)
    # Stop when a page comes back with no results
    if len(responseJson['data']) == 0:
        break
    gsm = responseJson['gsm']
    for j in range(len(responseJson['data'])):
        # Skip empty entries
        if len(responseJson['data'][j]) == 0:
            continue
        upload_file(responseJson['data'][j]['hoverURL'], filePath, str(i) + str(j))
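The loop above indexes straight into each result entry, so an entry without a hoverURL field would raise a KeyError. A small, defensive way to pull the URLs out of one decoded page is sketched below. The function name image_urls and the sample payload are made up for illustration, and the real response may contain other fields or a different layout.

```python
import json


def image_urls(payload, key="hoverURL"):
    """Return the non-empty image URLs found in one decoded result page."""
    urls = []
    for entry in payload.get("data", []):
        # Entries can be empty dicts or lack the key; skip those quietly
        url = entry.get(key) if isinstance(entry, dict) else None
        if url:
            urls.append(url)
    return urls


# Made-up sample payload shaped like the fields the script reads ('gsm', 'data', 'hoverURL')
SAMPLE = json.loads("""
{"gsm": "3c", "data": [
    {"hoverURL": "https://example.com/a.jpg"},
    {"hoverURL": ""},
    {"thumbURL": "https://example.com/b.jpg"},
    {}
]}
""")

urls = image_urls(SAMPLE)
print(urls)
```

Each returned URL could then be passed to upload_file one at a time, with a running counter used as the file name.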
