[python] Scraping historical weather data for cities across China

  • Scrapes daily weather data for every Chinese city from 2011 through 2020
  • Fetches pages with requests and parses them with BeautifulSoup
  • Crawls with multiple threads
  • Crawls by city name, then saves the results as .xls files grouped by province
  • When building the dict that maps city names to their pinyin, some cities share the same pinyin (see the sketch after this list)
  • The site itself misspells the pinyin of a few cities, so their data cannot be fetched
  • Builds a province → city dict and creates one folder per province for archiving
  • Data source: the tianqihoubao (天气后报) website
  • Full code and data: project repository
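As a sketch of the dict construction mentioned above: the pypinyin package can derive each city's pinyin, which also surfaces the pinyin-collision problem. The province/city sample below is illustrative only, not the project's full list:

```python
import os
from pypinyin import lazy_pinyin  # third-party: pip install pypinyin

# Illustrative sample; the real project loads the full province -> cities list
province_city = {
    "江苏": ["苏州", "南京"],
    "安徽": ["宿州", "合肥"],
}

city_dict = {}  # city pinyin -> city name, consumed by the crawler threads
for province, cities in province_city.items():
    os.makedirs(os.path.join("weather_data", province), exist_ok=True)  # one folder per province
    for city in cities:
        py = "".join(lazy_pinyin(city))
        if py in city_dict:
            # e.g. 苏州 and 宿州 both romanize to "suzhou": such collisions
            # need manual disambiguation before crawling
            print("pinyin collision:", py, city_dict[py], city)
        city_dict[py] = city
```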
Analyzing the resource URL

http://www.tianqihoubao.com/lishi/beijing/month/201101.html

Evidently the pattern is http://www.tianqihoubao.com/lishi/{city pinyin}/month/{yyyyMM}.html
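A quick spot check (a sketch; the city and month are arbitrary) confirms the template resolves to a real page:

```python
import requests

# Probe one known city/month before generating ~120 URLs per city
url = "http://www.tianqihoubao.com/lishi/{}/month/{}.html".format("beijing", "201101")
r = requests.get(url, timeout=30)
print(r.status_code)  # 200 means the pinyin and date segment are valid
```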

Main code
```python
import threading
import time

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Module-level globals (gLock, city_dict, start, target_year_list,
# target_month_list) are set up by the driver code; see the sketch below.

class Crawler(threading.Thread):

    def run(self):
        print("%s is running" % threading.current_thread())
        while True:
            # Lock before touching the shared dict
            gLock.acquire()
            if len(city_dict) == 0:
                # No cities left: release the lock and stop this worker
                # (the original `continue` here made idle threads spin forever)
                gLock.release()
                break
            item = city_dict.popitem()  # (city pinyin, city name)
            gLock.release()
            data_ = list()
            urls = self.get_urls(item[0])
            for url in urls:
                try:
                    # List merge: append this month's rows to the city's data_
                    data_.extend(self.get_data(url))
                except Exception as e:
                    print(e)
            self.saveTocsv(data_, item[1])  # despite the name, writes an .xls file
        end = time.time()
        print("elapsed time:", end - start)

    # Build the list of monthly history URLs for one city
    def get_urls(self, city_pinyin):
        urls = []
        for year in target_year_list:
            for month in target_month_list:
                date = year + month
                # e.g. http://www.tianqihoubao.com/lishi/beijing/month/201812.html
                urls.append("http://www.tianqihoubao.com/lishi/{}/month/{}.html".format(city_pinyin, date))
        return urls

    def get_soup(self, url):
        try:
            r = requests.get(url, timeout=30)
            r.raise_for_status()  # raise HTTPError if the request failed
            soup = BeautifulSoup(r.text, "html.parser")
            return soup
        except Exception as e:
            print(e)

    # Save one city's weather data to an .xls file
    def saveTocsv(self, data, city):
        fileName = './weather_data/' + city + '天气.xls'
        # Columns: date, weather condition, temperature, wind force/direction
        result_weather = pd.DataFrame(data, columns=['日期', '天气状况', '气温', '风力风向'])
        # .xls output needs the legacy xlwt engine; switch to .xlsx on newer pandas
        result_weather.to_excel(fileName, index=False)
        print('Save all weather success!')
        print('remaining: {}'.format(len(city_dict)))

    # Parse one month page into an N x 4 array of daily records
    def get_data(self, url):
        print(url)
        try:
            soup = self.get_soup(url)
            all_weather = soup.find('div', class_="wdetail").find('table').find_all("tr")
            data = list()
            for tr in all_weather[1:]:  # skip the header row
                td_li = tr.find_all("td")
                for td in td_li:
                    s = td.get_text()
                    data.append("".join(s.split()))  # strip whitespace inside cells
            # Four cells per day: date, condition, temperature, wind
            res = np.array(data).reshape(-1, 4)
            return res
        except Exception as e:
            print(e)
```
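The listing above references several module-level globals without defining them. A minimal driver sketch follows; the thread count is my choice, not the project's:

```python
import os
import threading
import time

gLock = threading.Lock()  # guards city_dict across worker threads
target_year_list = [str(y) for y in range(2011, 2021)]          # 2011..2020
target_month_list = ["{:02d}".format(m) for m in range(1, 13)]  # 01..12
# city_dict (pinyin -> city name) is built as sketched earlier
start = time.time()

os.makedirs('./weather_data', exist_ok=True)  # saveTocsv expects this folder

threads = [Crawler() for _ in range(8)]  # 8 workers: an arbitrary choice
for t in threads:
    t.start()
for t in threads:
    t.join()
```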