A beginner's Python web crawler: my first Python crawler program

Latest recommended article published 2024-03-24 15:29:25

无状态小黄

Tags: Python web crawler beginner program

Article link: https://blog.csdn.net/weixin_36096137/article/details/112829524

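This post presents a beginner-level Python web crawler that scrapes data from a weather website and stores it in a CSV file. It uses the requests library to fetch the page, BeautifulSoup to parse the HTML, and a custom UnicodeWriter class to handle CSV encoding. The program extracts the date, weather, and highest and lowest temperatures from the given page and saves them to the CSV file weather.csv.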

The program's main job is to fetch data from a weather website and store it in a CSV file.

The environment is Python 2.7. The required libraries are:

import requests

import csv

import random

from bs4 import BeautifulSoup

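requests: makes the HTTP requests; it has to be installed separately

csv: Python's built-in library for reading and writing CSV files

random: picks a random value to use as the request timeout, to imitate a real request

BeautifulSoup: a great replacement for regular expressions that makes it much easier to pull the content we need out of HTML (installed separately as beautifulsoup4)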

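The work breaks down into three steps:

1. Fetch the content of the current page

2. Parse the page and extract the target data

3. Write the data to a CSV file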
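
The three steps above can be sketched as a small pipeline before looking at the real functions. This is only an illustration under stated assumptions: it targets Python 3, and `run_pipeline`, `fake_fetch`, `fake_parse`, and the example URL are hypothetical stand-ins that are not part of the original program. The stand-ins keep the sketch runnable without network access.

```python
import csv
import io


def run_pipeline(url, fetch, parse, out):
    """Fetch a page, parse it into rows, and write the rows to `out` as CSV."""
    html_text = fetch(url)       # step 1: fetch the page content
    rows = parse(html_text)      # step 2: extract the target data
    writer = csv.writer(out)     # step 3: write the data as CSV
    writer.writerow(["date", "weather", "high", "low"])
    writer.writerows(rows)
    return len(rows)


# Hypothetical stand-ins for the real fetch and parse steps
def fake_fetch(url):
    return "Mon,Sunny,25,18\nTue,Cloudy,22,16"


def fake_parse(text):
    return [line.split(",") for line in text.splitlines()]


buffer = io.StringIO()
count = run_pipeline("http://example.invalid/weather", fake_fetch, fake_parse, buffer)
print(count)
print(buffer.getvalue())
```

Passing the fetch and parse steps in as arguments makes each step easy to swap or test on its own; the author's program simply calls its functions directly.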

The code is as follows:

# Fetch the page content
def get_content(url, data=None):
    header = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Connection': 'keep-alive',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.235'
    }
    # Pick a random timeout (in seconds) for the request
    timeout = random.choice(range(80, 180))
    rep = requests.get(url, headers=header, timeout=timeout)
    rep.encoding = 'utf-8'
    return rep.text

# Parse the page and return the target data
def get_data(html_text):
    final = []
    bs = BeautifulSoup(html_text, "html.parser")  # create the BeautifulSoup object
    body = bs.body  # get the body element
    data = body.find('div', {'id': '7d'})  # find the div whose id is 7d
    ul = data.find('ul')  # get the ul element
    li = ul.find_all('li')  # get all li elements
    for day in li:  # loop over each li tag (one per day)
        temp = []
        date = day.find('h1').string  # the date
        temp.append(date)
        inf = day.find_all('p')  # all p tags inside the li
        temp.append(inf[0].string)  # first p tag: the weather conditions
        if inf[1].find('span') is None:
            # The forecast may have no high for the current day (this happens
            # by evening), so store an empty string and report only the low
            temperature_highest = u''
        else:
            temperature_highest = inf[1].find('span').string  # the high
            # At night the page changes and the high also ends with ℃
            temperature_highest = temperature_highest.replace(u'℃', u'')
        temperature_lowest = inf[1].find('i').string  # the low
        temperature_lowest = temperature_lowest.replace(u'℃', u'')  # strip the trailing ℃
        temp.append(temperature_highest)
        temp.append(temperature_lowest)
        final.append(temp)  # one row per day
    return final
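
The handling of the ℃ suffix and of a missing high temperature can be isolated in a small helper. The following is a minimal sketch, assuming Python 3; `parse_temperature` is a hypothetical name and is not part of the original program, which keeps the temperatures as strings.

```python
def parse_temperature(text):
    """Return the temperature as an int, or None when the text is missing or empty."""
    if text is None:
        return None
    cleaned = text.replace("℃", "").strip()  # drop the unit suffix if present
    return int(cleaned) if cleaned else None


print(parse_temperature("25℃"))
print(parse_temperature("-3℃"))
print(parse_temperature(None))
```

Text that is not a number after stripping the suffix raises `ValueError` here, which is probably what one wants while the page layout is still being explored.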

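# Write to the file
def write_data(data, name):
    file_name = name
    # In Python 2 the csv module expects the file to be opened in binary mode
    with open(file_name, 'wb') as f:
        myCsv = UnicodeWriter(f)
        # Header row: date, weather, highest temperature, lowest temperature
        myCsv.writerow([u'日期', u'天气', u'最高温度', u'最低温度'])
        myCsv.writerows(data)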

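Because the data contains Unicode characters, it could not be written to the file. After some searching I found that the official documentation gives a solution: wrap the csv writer in a custom class.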

import csv, codecs, cStringIO

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
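
For comparison, on Python 3 the csv module works with text directly, so the wrapper class is not needed. This is a minimal sketch assuming Python 3 rather than the 2.7 environment used in this post; `write_rows`, `read_rows`, the sample row, and the temporary path are hypothetical and only illustrate the idea.

```python
import csv
import os
import tempfile


def write_rows(rows, path):
    # newline="" lets the csv module control line endings;
    # encoding="utf-8" takes care of the non-ASCII text
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["日期", "天气", "最高温度", "最低温度"])  # header row
        writer.writerows(rows)


def read_rows(path):
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f))


path = os.path.join(tempfile.mkdtemp(), "weather.csv")
write_rows([["7日（今天）", "晴", "25", "18"]], path)
print(read_rows(path))
```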

The final code

#!/usr/bin/python

# -*- coding: UTF-8 -*-