Python Web Scraping Practice

非鸽传书

Tags: python

Article link: https://blog.csdn.net/qq_15351029/article/details/109113197

Python Web Scraping

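For work, I needed to scrape data to make business processing easier. Until now I only knew that Python could be used for scraping; this time I actually get to write a scraper, so I decided to record the learning process.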

First, look for tutorials online:

https://www.runoob.com/python/python-install.html

https://www.zhihu.com/question/20899988/answer/96904827

Of course, you can also use a mature framework:

https://blog.csdn.net/sinat_38682860/article/details/81044027

Environment Setup

You can refer to the following; in any case, there are plenty of guides online:

https://www.runoob.com/python/python-install.html

https://www.cnblogs.com/vuciao/p/10562416.html

Development Tools

There are many development tools to choose from; I went with VS Code:

https://www.cnblogs.com/xiaojwang/p/11331202.html

Learning to Scrape

https://www.zhihu.com/question/20899988/answer/96904827

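Because the pages I need to scrape are static, I chose requests and bs4; if the task involves anything else, the corresponding libraries are needed as well. The packages are installed with pip, for example: pip install requests. On my machine the pip command was not recognized, so I first changed into the Scripts folder under the Python installation directory and ran the command from there. I will add notes for Linux later if I get the chance.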
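
A possible way around the "pip not recognized" problem, sketched here as an assumption rather than what I actually did, is to run pip through the interpreter itself (python -m pip), which does not depend on the Scripts folder being on PATH. The helper names below (missing_packages, pip_install_command) are made up for illustration. Note that the import names bs4 and PIL are installed under the PyPI names beautifulsoup4 and Pillow, and that pytesseract additionally needs the Tesseract OCR program installed on the system.

```python
import importlib.util
import sys


def missing_packages(import_names):
    """Return the import names that cannot be found by the current interpreter."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]


def pip_install_command(packages):
    """Build a pip command bound to the running interpreter, so `pip` need not be on PATH."""
    return [sys.executable, "-m", "pip", "install", *packages]


if __name__ == "__main__":
    # Import names used by the script in this post.
    print("not importable:", missing_packages(["requests", "bs4", "PIL", "pytesseract"]))
    # PyPI package names differ from the import names for bs4 and PIL.
    print(" ".join(pip_install_command(["requests", "beautifulsoup4", "Pillow", "pytesseract"])))
```

The printed command can be pasted into a terminal, or passed to subprocess.run if you prefer to install from a script.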

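Below is the simple scraper I wrote. When writing one, you need to inspect the structure of the data you want to scrape and handle each structure accordingly. Given my unfamiliarity with the environment and the project schedule, I am not using this approach for the real task for now; I will keep improving and finish the scraper when I get the chance. One more thing to keep in mind: scraping code has to keep changing as the target site changes.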

import requests
from bs4 import BeautifulSoup
from PIL import Image
from io import BytesIO
import pytesseract

'''
Simple version, with no error handling
'''

# URL prefix
base_url = "..."
pre_url = "/flight/fnum/"
# URL suffix
after_url = ".html?AE71649A58c77&fdate="
# Flight number
flight_code = "KN5977"
# Date to scrape
fdate = "20201014"

# URL to scrape
url = base_url + pre_url + flight_code + after_url + fdate

# Prepare the request header values (list items are comma-separated; ';' only introduces parameters such as q=)
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36'
host = '...'
connection = 'keep-alive'
accept_Language = 'zh-CN,zh;q=0.9'
accept_Encoding = 'gzip, deflate'
accept = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'
referer = '...'


# Assemble the request headers
headers = {
    'Accept': accept, 
    'Accept-Language': accept_Language,
    'Connection': connection,
    'Host': host,
    'Referer': referer,
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': user_agent
}


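# Send the request. The '/get' suffix looks like a leftover from an httpbin-style
# example; drop it if the target site does not expect it.
r_get = requests.get(url + '/get', headers=headers)
r_get.encoding = 'utf-8'

# Response body: raw bytes and decoded text
htmlText = r_get.content
text = r_get.text

# Parse the page HTML
soup = BeautifulSoup(text, 'html.parser')

# List of result blocks
item = soup.find_all('div', class_="li_com")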

print(item)
print('----------------------------------------------')
index1 = item[0]

# company = index1.find_all('a')[0].get_text()
# code = index1.find_all('a')[1].get_text()
# Flight info
flyInfo = index1.find_all('a')[0].get_text() + index1.find_all('a')[1].get_text()

print('Flight info: ' + flyInfo)

# Image list (each value is downloaded as an image and read with OCR)
# Actual departure time (2nd image)
picDTime = index1.find_all('img')[1]
imgRespDTime = requests.get(base_url + picDTime['src'] + '/get', headers=headers)
image = Image.open(BytesIO(imgRespDTime.content))
dtime = pytesseract.image_to_string(image)
print('Actual departure: ' + dtime)
# Actual arrival time (3rd image)
picATime = index1.find_all('img')[2]
imgRespATime = requests.get(base_url + picATime['src'] + '/get', headers=headers)
image = Image.open(BytesIO(imgRespATime.content))
atime = pytesseract.image_to_string(image)
print('Actual arrival: ' + atime)
# On-time rate (4th image)
punctualityRate = index1.find_all('img')[3]
imgPunctualityRate = requests.get(base_url + punctualityRate['src'] + '/get', headers=headers)
image = Image.open(BytesIO(imgPunctualityRate.content))
rate = pytesseract.image_to_string(image)
print('On-time rate: ' + rate)

# Parsing via cStringIO (Python 2 only; io.BytesIO above is the Python 3 equivalent)
# file = cStringIO.StringIO((base_url + picDTime['src']).urlopen(url).read())
# image = Image.open(file)

# Parse a local image
# image = Image.open(r'C:/Users/HP/Desktop/16027604786851.png')
# xx = pytesseract.image_to_string(image)
# print('Image content: ' + xx)
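
The script above is the version without error handling. As a rough sketch of what a more defensive version might add (this is my assumption of a reasonable next step, not finished code), the helpers below wrap the request in a try/except with a timeout, resolve image paths with urljoin instead of string concatenation, and avoid an IndexError when the page contains fewer elements than expected. The names fetch, image_url and nth are hypothetical, and the 10-second timeout is an arbitrary choice.

```python
from typing import Callable, Optional
from urllib.parse import urljoin

import requests


def fetch(url: str, headers: dict, get: Callable = requests.get,
          timeout: float = 10.0) -> Optional[requests.Response]:
    """Return the response, or None if the request fails or returns an error status."""
    try:
        resp = get(url, headers=headers, timeout=timeout)
        resp.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.RequestException as exc:
        print(f"request failed for {url}: {exc}")
        return None
    return resp


def image_url(base_url: str, src: str) -> str:
    """Resolve an <img> src against the site root (works for relative and absolute src)."""
    return urljoin(base_url, src)


def nth(items, index):
    """Return items[index], or None when there are fewer items than expected."""
    return items[index] if 0 <= index < len(items) else None
```

With these in place, the main script could check each step before moving on, for example skipping the OCR step when nth(index1.find_all('img'), 2) returns None, instead of crashing when the site layout changes.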
