Fetching Historical Weather Data with a Python Crawler

Latest recommended article published on 2023-12-21 15:25:03

YangYang~

Tags: python, data analysis

Article link: https://blog.csdn.net/weixin_44909868/article/details/108639509

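To look up a city's historical weather, you can query the Tianqihoubao (天气后报) website, www.tianqihoubao.com.
If you need a large amount of historical weather data for analysis, you can collect it with a crawler.
For example, suppose we want the monthly weather summary for Beijing in September 2020. The page on the site looks like the screenshot below.
[Image: screenshot of the site's monthly weather summary page]
To scrape the data in this table, first set the request headers. Setting headers is one of the ways to get past anti-crawler checks when using requests: the headers make the request look as if it came from a regular browser rather than from a script. For pages protected against crawlers, you can fill in a few header fields so that the request imitates a browser visiting the site.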

# Browser-like request headers sent with every request
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # Only advertise encodings that requests can decode on its own
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-us;q=0.5,en;q=0.3',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0'
}
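
To see what "setting headers" amounts to, the sketch below builds a request object and prints the headers it would carry, without sending anything. It uses the standard library's `urllib.request` instead of requests, purely so the headers can be inspected offline. The names `demo_headers` and `demo_url` are hypothetical, and the URL is only an assumed example of the site's URL pattern.

```python
from urllib.request import Request

# A shortened copy of the browser-like headers (hypothetical name)
demo_headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0',
}

# Assumed example URL following the site's pattern; nothing is fetched here
demo_url = 'http://www.tianqihoubao.com/lishi/beijing/month/202009.html'

# Creating the Request object does not open a network connection
req = Request(demo_url, headers=demo_headers)

# urllib stores header names via str.capitalize(), e.g. 'User-agent'
for name, value in req.header_items():
    print(f'{name}: {value}')
```

With requests, the same dictionary is simply passed as the `headers=` argument of `requests.get`.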

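Notes:
(1) Based on the pattern of the page URL, you can substitute the city name and the year and month to scrape whichever weather data you need.
(2) Beautiful Soup is a Python library for parsing HTML and XML. It provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses the document and hands you the data you want to extract.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8, so you usually do not need to think about encodings.
(3) Build the weather table. Here the daily high and low temperatures are split into separate columns.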
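
Note (1) can be sketched as a small helper that builds the page URLs for several months. This is only an illustration: the helper names `build_month_url` and `month_urls` are hypothetical, and the city slug `'beijing'` and the zero-padded month are assumptions about the site's URL format.

```python
BASE_URL = 'http://www.tianqihoubao.com/lishi/'


def build_month_url(city, year, month):
    """Return the monthly summary URL for one city, year and month.

    Assumes the URL pattern <city>/month/<YYYYMM>.html with a zero-padded month.
    """
    return f'{BASE_URL}{city}/month/{year:04d}{month:02d}.html'


def month_urls(city, year, months):
    """Return one URL per month for the given city and year."""
    return [build_month_url(city, year, m) for m in months]


# URLs for July to September 2020 (city slug is an assumed example)
urls = month_urls('beijing', 2020, range(7, 10))
for u in urls:
    print(u)
```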

import re

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

def GetWeather(year, month, city):
    # year, month and city are strings that are concatenated into the page URL
    url = 'http://www.tianqihoubao.com/lishi/' + city + '/month/' + year + month + '.html'
    htmlsingle = requests.get(url, headers=headers)
    # Pass the raw bytes so that Beautiful Soup detects the page encoding itself
    soup = BeautifulSoup(htmlsingle.content, 'lxml')
    # Collect every table cell, then skip the first four (treated as the header row)
    cells = soup.find_all('td')
    del cells[:4]
    # Strip line breaks and spaces from the text of each cell
    TextList = [re.sub('[\n\r ]', '', cell.text) for cell in cells]
    # Every table row has four cells: date, weather, high/low temperature, wind
    WeatherDf = pd.DataFrame(np.array(TextList).reshape(-1, 4))
    WeatherDf.columns = ['date', 'weather', 'high_low', 'wind']
    # Split the 'high/low' text at the slash and drop the degree unit
    low = []
    high = []
    for value in WeatherDf['high_low']:
        high_part, _, low_part = value.partition('/')
        high.append(high_part.replace('℃', ''))
        low.append(low_part.replace('℃', ''))
    WeatherDf['high'] = high
    WeatherDf['low'] = low
    WeatherDf = WeatherDf.loc[:, ['date', 'weather', 'high', 'low', 'wind']]
    return WeatherDf
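
The high/low split at the end of GetWeather can be tried in isolation. The sketch below is a hypothetical helper, `split_high_low`, that also converts the values to integers, which GetWeather itself does not do. The sample strings are made up for illustration and only assume the "high/low" format with a ℃ unit.

```python
def split_high_low(value):
    """Split a string such as '28℃/17℃' into a (high, low) pair of integers.

    A missing side is returned as None; non-numeric text raises ValueError.
    """
    high_text, _, low_text = value.partition('/')

    def to_int(text):
        # Drop the degree unit and surrounding whitespace before converting
        text = text.replace('℃', '').strip()
        return int(text) if text else None

    return to_int(high_text), to_int(low_text)


# Made-up sample values in the same format as the high_low column
for sample in ['28℃/17℃', '-3℃/-12℃', '28℃/']:
    print(sample, split_high_low(sample))
```

Converting to integers this way makes the two columns usable for numeric analysis, at the cost of having to decide what to do with days where a value is missing.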