（1用API爬取天气预报数据）Python爬虫与数据清洗的进化

最新推荐文章于 2024-08-17 13:56:26 发布

daxi0ng

最新推荐文章于 2024-08-17 13:56:26 发布

阅读量1.8k

点赞数 1

分类专栏： Python学习

本文链接：https://blog.csdn.net/qq_40989258/article/details/88740425

版权

Python学习专栏收录该内容

7 篇文章 1 订阅

订阅专栏

1、一个简单网页源代码爬取

import requests
url='http://www.cntour.cn/'
strhtml=requests.get(url)
print(strhtml.text[:50]) #提取前50个字符

2、使用Beautiful Soup解析网页，可以顺便安装一下lxml库，功能强大，速度更快。

复制CSS选择器路径。

将css选择器路径复制到soup.select中。

import requests
from bs4 import BeautifulSoup
url='http://www.cntour.cn/'
strhtml=requests.get(url)
soup=BeautifulSoup(strhtml.text,'lxml')
data=soup.select('#main > div > div.mtop.firstMod.clearfix > div.leftBox > div:nth-child(2) > ul > li > a')
for item in data:
    result={
        'Title':item.get_text(),
        'Link':item.get('href'), 
    }
    print(result)

3、使用正则表达式提取文章数字ID

\d 匹配数字 +匹配前一个字符一次或多次

使用re库的findall方法，第一个参数为正则表达式，第二个参数表示要提取的文本。

import requests
from bs4 import BeautifulSoup
import re
url='http://www.cntour.cn/'
strhtml=requests.get(url)
soup=BeautifulSoup(strhtml.text,'lxml')
data=soup.select('#main > div > div.mtop.firstMod.clearfix > div.leftBox > div:nth-child(2) > ul > li > a')
for item in data:
    result={
        'ID': re.findall('\d+', item.get('href')),
        'Title':item.get_text(),
        'Link':item.get('href'), 
    }
    print(result)

4、用API爬取天气预报数据

网址：https://www.heweather.com/

删除顶部不需要的数据使用remove方法，通过一个列表元素item循环输出每一行前14个字符（城市编码）

import requests
url='https://cdn.heweather.com/china-scenic-list.txt'
strhtml=requests.get(url)
strhtml.encoding='utf-8'
data=strhtml.text
new_data=data.split('\n')
for i in range(3):
    new_data.remove(new_data[0])
for item in new_data:
    print(item[0:14])

调用接口获取数据

import requests
import time
with open('outcity.txt','r')as f:
    list=f.readlines()
for i in range(2):
    list.remove(list[0])
for item in list:
    print(item[0:12])
    url='https://free-api.heweather.net/s6/weather?location='+item[0:12]+'&key=5ce6464c17824a6ea9aa12dd749e2bc5'
    strhtml=requests.get(url)
    strhtml.encoding='utf-8'
    time.sleep(1)
    print(strhtml.text)

将返回的json数据解析出来，通过观察路径接下来三天的最高温度在 ["HeWeather6"][0]["daily_forecast"][n]["tmp_max"]下面，其中n表示分节点

    dic=strhtml.json()
    for item in dic["HeWeather6"][0]["daily_forecast"]:
        print(item["tmp_max"])

将数据上传到mongodb中

import requests
import time
import pymongo #加载pymongo库
client=pymongo.MongoClient('localhost',27017) #建立连接
book_weather=client['weather'] #新建weather数据库
sheet_weather=book_weather['sheet_weather_3'] #在weather库中新建sheet_weather_3表
with open('outcity.txt','r')as f:
    list=f.readlines()
for i in range(2):
    list.remove(list[0])
for item in list:
    print(item[0:12])
    url='https://free-api.heweather.net/s6/weather?location='+item[0:12]+'&key=5ce6464c17824a6ea9aa12dd749e2bc5'
    strhtml=requests.get(url)
    strhtml.encoding='utf-8'
    time.sleep(1)
    dic=strhtml.json()
    sheet_weather.insert_one(dic)

在pycharm中下载mongo plus插件可以便捷查看数据。

MongoDB数据库查询（北京的天气数据）

import pymongo
client=pymongo.MongoClient('localhost',27017)
book_weather=client['weather']
sheet_weather=book_weather['sheet_weather_3']
for item in sheet_weather.find({'HeWeather6.basic.location':'北京'}):
    print(item)

查询最高温度低于20摄氏度的城市 $lt:< $lte:<= $gt:> $gte:>=，其中update_one方法用于指定更新一条数据，第一个参数{'_id:item['_id']}表示要更新的查询条件。第二个参数表示要更新的信息，$set是MongoDB中的一个修改器，用于制定一个键并更新键值，若键不存在则创建一个件键。

import pymongo
client=pymongo.MongoClient('localhost',27017)
book_weather=client['weather']
sheet_weather=book_weather['sheet_weather_3']
for item in sheet_weather.find():
    tmp_max=item["HeWeather6"][0]["daily_forecast"][0]['tmp_max']
    sheet_weather.update_one({'_id':item['_id']},{'$set':{'HeWeather6.0.daily_forecast.{}.tmp_max'.format(0):int(tmp_max)}})
    for item in sheet_weather.find({'HeWeather6.0.daily_forecast.0.tmp_max':{'$lt':20}}):
        print(item['HeWeather6'][0]['basic']['location'])