A Web-Scraping Newbie's Tale of Blood, Sweat, and Tears (continuously updated; I'll keep digging until I'm buried)

三癫_

Tags: web scraping

Article link: https://blog.csdn.net/hengyanxu/article/details/87899634


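As an intern newbie freshly recruited into my school's IT studio, I'm writing this post at 3 a.m. with a mix of nerves and excitement, kicking off my own career as a code monkey (and future baldie).
My first assignment was to scrape course information from 实验楼 (shiyanlou.com).
This is the kind of page that has to be scraped: [image]
Start by opening Inspect Element and looking at the page's HTML. Hovering over a tag highlights the matching part of the page in blue, which makes it easy to locate the course name, whether the course is free, the number of followers, learners, and comments, and the teacher's name.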

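As a newbie, how could I not make a few dumb mistakes? I was scraping with the requests library and regular expressions, and at first I didn't understand the rules for greedy versus non-greedy matching. As a result, tags kept getting scraped along with the data, or the parts I actually wanted were missed.
On top of that, I also had to scrape the number of labs. Unable to work out what was wrong, I split the job into two separate pieces of code (which did turn out to be a dumb move...). So let's go straight to the final code: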
re.compile(r'<h4\sclass="course.*?(.*?).*?<span.*?course-type=".*?">(.*?)' +
           r'.*?<div\sclass="course-info-details.*?(.*?).*?(.*?).*?(.*?)' +
           r'.*?<div\sclass="lab-item-index">(.*?).*?<div.*?mooc-teacher.*?name.*?(.*?).*?<div.*?data-course-id="(.*?)".*?data-course-name.*?', re.S)
And with that, the data I needed was scraped beautifully.
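To make the greedy versus non-greedy difference concrete, here is a minimal sketch. The HTML fragment and the class name in it are made up for illustration and are not taken from the real course page.

```python
import re

# Hypothetical HTML fragment, invented for this illustration only.
html = '<h4 class="course-name">Linux Basics</h4><h4 class="course-name">Python 101</h4>'

# Greedy: .* grabs as much as it can, so a single match runs from the first
# opening tag to the last closing tag and drags the tags in between along.
greedy = re.findall(r'<h4 class="course-name">(.*)</h4>', html, re.S)

# Non-greedy: .*? stops at the first closing tag, giving one clean match per course.
lazy = re.findall(r'<h4 class="course-name">(.*?)</h4>', html, re.S)

print(greedy)
print(lazy)
```

The greedy version is exactly the "tags got scraped too" symptom described earlier; adding the `?` after `*` is what fixes it.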
Having finally finished the regular expression, I promptly got stuck again on saving the data into a txt file.
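for i in range(0, len(items)):
    items[i] = str(items[i])  # each match is a tuple; turn it into a string
str1 = ""
items = str1.join(items)
print(items)
with open('result4.txt', 'a+b') as f:
    f.write(items.encode('utf-8') + b'\r')

len(items) gives the length of the list of match tuples that holds all the scraped data. The loop then converts each entry of items to a string, and the with open block writes the result to the file. Note that in binary mode the str has to be converted to bytes. A few things to watch here: the encoding should be utf-8, and the line terminator has to be bytes too, here b'\r'. The "a" mode appends to the file, while "w" overwrites it. I was stuck on this for a very, very, very long time... When writing to the txt file fails, read the reason given in the error message carefully; in my case the earlier errors were all caused by type problems.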
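An alternative to converting to bytes by hand is to open the file in text mode and let Python do the encoding. The sketch below is not the code used above: `append_rows`, the sample rows, and the temporary file are all made up for illustration. It also joins the fields of each match with tabs, which is the separator the txt-to-xls step splits on.

```python
import os
import tempfile


def append_rows(path, rows):
    """Append each row (a tuple of strings) as one tab-separated line."""
    # Text mode with an explicit encoding: Python encodes str to bytes itself.
    # "a" appends to the file; "w" would overwrite it instead.
    with open(path, 'a', encoding='utf-8', newline='\n') as f:
        for row in rows:
            f.write('\t'.join(row) + '\n')


# Hypothetical sample data, shaped like what re.findall returns
# for a pattern with several capture groups.
sample = [('Linux Basics', 'free', '100'), ('实验楼 intro', 'member', '42')]
demo_path = os.path.join(tempfile.mkdtemp(), 'result_demo.txt')

append_rows(demo_path, sample[:1])
append_rows(demo_path, sample[1:])  # the second call appends, it does not overwrite

with open(demo_path, encoding='utf-8') as f:
    content = f.read()
print(content)
```

Because the file is opened in text mode, a type error from mixing str and bytes cannot occur in the write call.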
Next comes converting the txt file into an xls file.
import xlwt

def txt_xls(filename, xlsname):
    """Convert a txt file to an xls file; filename is the text file, xlsname is the Excel file name."""
    xls = xlwt.Workbook()  # create the Excel workbook
    sheet = xls.add_sheet('sheet1', cell_overwrite_ok=True)
    x = 0
    with open(filename, encoding='utf-8') as f:
        while True:
            # loop line by line, reading the text file
            line = f.readline()
            if not line:
                break  # no more content, so leave the loop
            fields = line.rstrip('\r\n').split('\t')
            for i in range(len(fields)):
                item = fields[i]
                sheet.write(x, i, item)  # x is the row index, i is the column index
            x += 1  # move on to the next Excel row
    xls.save(xlsname + '.xls')  # save the xls file
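While running the crawler, I noticed that the program always crashed once the last number in the URL reached 10. It turned out that the failing pages were all returning 404 and could not be accessed.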
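def main():
    global j
    j = 1
    while j <= 1145:
        url = 'https://www.shiyanlou.com/courses/' + str(j)
        response = requests.get(url, proxies=proxy)
        if response.status_code == 200:
            # Some pages return 404 and no usable HTML, so the
            # scraping code only runs when the status code is 200.
            html = response.text
            parse_one_page(html)
        j += 1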
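The status-code check can be separated from the downloading so that it is easy to try out offline. This is only a sketch under assumptions: `crawl`, `requests_fetch`, and `fake_fetch` are hypothetical names, and the timeout of 10 seconds is an arbitrary choice, not something from the original program.

```python
def crawl(course_ids, fetch, handle_html):
    """Fetch each course page and parse only those that return 200."""
    skipped = []
    for course_id in course_ids:
        url = 'https://www.shiyanlou.com/courses/' + str(course_id)
        status_code, text = fetch(url)
        if status_code == 200:
            handle_html(text)
        else:
            # Remember what was skipped instead of crashing on it.
            skipped.append((course_id, status_code))
    return skipped


def requests_fetch(url):
    """Real fetcher built on requests (third-party, imported only when used)."""
    import requests
    try:
        response = requests.get(url, timeout=10)
    except requests.RequestException:
        return None, ''  # network error: report it as "no status"
    return response.status_code, response.text


def fake_fetch(url):
    """Offline stand-in that pretends course 10 does not exist."""
    return (404, '') if url.endswith('/10') else (200, '<html>ok</html>')


pages = []
skipped = crawl(range(8, 12), fake_fetch, pages.append)
print(skipped)
```

For a real run, `requests_fetch` would be passed in place of `fake_fetch`, and the parsing function in place of `pages.append`.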
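As more and more content got scraped, things became slower and slower. Was I being rate-limited? So I added multiprocessing, though in the end it didn't seem to help much; maybe it got just a tiny bit faster?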
if __name__ == '__main__':
    main()
    p = multiprocessing.Pool(processes=8)
    p.apply_async(main)
    p.close()
    p.join()
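One likely reason the pool barely helped: apply_async(main) submits a single task, so only one of the eight workers ever does anything, and it repeats the whole crawl. To spread the pages across workers, each task has to cover one page. A minimal sketch, swapping in a thread pool from the standard library (downloading is mostly waiting on the network) instead of multiprocessing.Pool; `fetch_course` and `crawl_all` are hypothetical names, and `fetch_course` only simulates the work.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_course(course_id):
    """Hypothetical stand-in for 'download and parse one course page'."""
    return course_id, 'course-%d' % course_id


def crawl_all(course_ids, worker, max_workers=8):
    # map() hands each ID to whichever worker thread is free and
    # returns the results in the same order as the input IDs.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, course_ids))


results = crawl_all(range(1, 4), fetch_course)
print(results)
```

If the site is throttling requests, more workers will not help and may make things worse, so a small `max_workers` is the safer starting point.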
Then I also set a proxy, proxy={'http': 'http://183.61.71.112:8888'}, to get around anti-scraping measures. I still don't really understand this part, so it needs more study later.
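One detail worth knowing here: requests picks the proxy by the scheme of the URL being requested, and the course URLs start with https://, so a dict with only an 'http' key does not apply that proxy to them. A small sketch, where `build_proxies` is a hypothetical helper and it is an assumption that the proxy at this address accepts HTTPS traffic at all:

```python
def build_proxies(address):
    """Return a requests-style proxies dict covering both URL schemes."""
    # The keys are URL schemes; the value is the proxy to use for that scheme.
    return {'http': address, 'https': address}


proxy = build_proxies('http://183.61.71.112:8888')
print(proxy)
# Usage with requests (third-party): requests.get(url, proxies=proxy, timeout=10)
```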
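That's about it. This was a rookie's first scraping experiment, and all in all there were plenty of moments that felt pretty rewarding, hehe. Pain and joy at the same time???
Full code: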
import multiprocessing
import re

import requests
import xlwt

j = 1

proxy = {'http': 'http://183.61.71.112:8888'}

def parse_one_page(html):
    pattern = re.compile(r'<h4\sclass="course.*?(.*?).*?<span.*?course-type=".*?">(.*?)' +
                         r'.*?<div\sclass="course-info-details.*?(.*?).*?(.*?).*?(.*?)' +
                         r'.*?<div\sclass="lab-item-index">(.*?).*?<div.*?mooc-teacher.*?name.*?(.*?).*?<div.*?data-course-id="(.*?)".*?data-course-name.*?', re.S)
    items = re.findall(pattern, html)
    for i in range(0, len(items)):
        items[i] = str(items[i])  # each match is a tuple; turn it into a string
    str1 = ""
    items = str1.join(items)
    print(items)
    with open('result4.txt', 'a+b') as f:
        f.write(items.encode('utf-8') + b'\r')

def txt_xls(filename, xlsname):
    xls = xlwt.Workbook()
    sheet = xls.add_sheet('sheet1', cell_overwrite_ok=True)
    x = 0  # current row in the sheet
    with open(filename, encoding='utf-8') as f:
        while True:
            line = f.readline()
            if not line:
                break
            fields = line.rstrip('\r\n').split('\t')
            for i in range(len(fields)):
                item = fields[i]
                sheet.write(x, i, item)
            x += 1
    xls.save(xlsname + '.xls')

def main():
    global j
    j = 1
    while j <= 1145:
        url = 'https://www.shiyanlou.com/courses/' + str(j)
        response = requests.get(url, proxies=proxy)
        if response.status_code == 200:
            html = response.text
            parse_one_page(html)
        else:
            print('not found')
        j += 1
    # Raw strings, so the backslashes in Windows paths are not read as escapes
    filename = r"C:\Users\lenovo\PycharmProjects\untitled3\result4.txt"
    xlsname = r"C:\Users\lenovo\Deskto2"
    txt_xls(filename, xlsname)

if __name__ == '__main__':
    main()
    p = multiprocessing.Pool(processes=8)
    p.apply_async(main)  # submits one more full run of main() to a single worker
    p.close()
    p.join()
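Along the way I learned more than just web scraping. Some of the most basic things, such as using global variables and running programs on a schedule, somehow clicked a little after several random flashes of insight (the frustrating part being that most of it went unused). I also butted heads with the pip install command while installing all sorts of libraries. The most troublesome one was the scrapy library, which had me hunting across websites for something to download. Still, I got a lot out of it, hehe. One more handful of dirt dug; keep it up from here, me.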