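After reading a few tutorials online, I decided to scrape the 7-day forecast from China Weather (weather.com.cn).
Then I hit F12 and was stunned: the page seems to have been redesigned, and the skyid even changes from one element to the next...
Oh well, it's not like it can't be scraped.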
import requests
import bs4
import pandas as pd
url = r'http://www.weather.com.cn/weather/101280101.shtml'
response = requests.get(url)
soup = bs4.BeautifulSoup(response.text, features='lxml')
weather_all = [i.text for i in soup.findAll(name='ul', attrs={'class': 't clearfix'})]
print(weather_all)
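The first thing I got back was garbled Chinese text.
At first I thought my PyCharm settings were wrong and wasted ages on that, before realizing that requests was not decoding the page as UTF-8. Ugh.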
print(response.encoding)
Set the response encoding to UTF-8:
response.encoding = 'utf-8'
Solved.
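To see why setting the encoding helps, here is a minimal offline sketch of the same effect. It assumes the garbling came from UTF-8 bytes being decoded as ISO-8859-1, which as far as I know is what requests falls back to for text responses whose Content-Type header carries no charset. The sample string is made up for illustration and is not real output from the site.

```python
# Hypothetical sample text standing in for part of the page body.
original = "今天 多云 25℃"

# What the server sends: the text encoded as UTF-8 bytes.
raw_bytes = original.encode("utf-8")

# Decoding those bytes with the wrong codec yields mojibake.
garbled = raw_bytes.decode("iso-8859-1")

# Decoding with the right codec recovers the text, which is what
# response.encoding = 'utf-8' makes response.text do.
fixed = raw_bytes.decode("utf-8")

print(garbled == original)  # False
print(fixed == original)    # True
```

With a live response, `response.apparent_encoding` gives a guess based on the body itself, which can be handy when the right codec is not known in advance.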
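But then a new problem showed up: the text is full of \n.
What to do? My first idea was to split out each day separately, but that didn't really work. In the end I went with split('\n').
Let's try it: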
weather_all = [i.text.split('\n') for i in soup.findAll(name='ul', attrs={'class': 't clearfix'})]
That works, except now there is a pile of '' entries.
After searching around online, I used a loop to delete the extra '' entries:
for i in weather_all:
    # keep removing '' until none are left in this sub-list
    while '' in i:
        i.remove('')
Yep, that works.
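For what it's worth, the same cleanup can be done in one pass with a list comprehension, which avoids rescanning the list for every '' that gets removed. This is only a sketch on a made-up string that mimics the shape of the scraped text; the values are hypothetical.

```python
# Hypothetical sample mimicking the text of one forecast entry.
raw = "\n7日(今天)\n\n多云\n30/22℃\n\n<3级\n\n"

parts = raw.split("\n")

# Keep only the non-empty pieces, trimming stray whitespace as well.
cleaned = [p.strip() for p in parts if p.strip()]

print(cleaned)  # ['7日(今天)', '多云', '30/22℃', '<3级']
```

Unlike the while loop, this builds a new list rather than modifying the old one in place, so the result has to be assigned back to a variable.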
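When I went to sort the items into separate lists, I found that everything sits inside weather_all[0]. What a trap.
So I ended up writing four loops to move them into new lists one by one (it still feels clumsy, but I couldn't think of another way).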
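days, weather, temper, wind = [], [], [], []
# Each day takes 4 consecutive items: date, weather, temperature, wind.
# 7 days * 4 items = 28 items, at indexes 0..27.
for i in range(0, 25, 4):
    if i <= 24:  # always true for this range
        days.append(weather_all[0][i])
for j in range(1, 26, 4):
    if j <= 25:  # always true for this range
        weather.append(weather_all[0][j])
for k in range(2, 27, 4):
    if k <= 26:  # always true for this range
        temper.append(weather_all[0][k])
for l in range(3, 28, 4):
    if l <= 27:  # always true for this range
        wind.append(weather_all[0][l])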
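One possibly simpler way to do the same split is slicing with a step of 4, which picks every fourth item starting from a given offset. The sketch below assumes the flat list really does repeat date, weather, temperature, wind in that order; the list itself is hypothetical sample data, shortened to three days.

```python
# Hypothetical flat list: date, weather, temperature, wind, repeated per day.
flat = [
    "7日(今天)", "多云", "30/22℃", "<3级",
    "8日(明天)", "阵雨", "28/21℃", "<3级",
    "9日(后天)", "晴", "31/23℃", "3-4级",
]

# flat[i::4] takes every 4th item starting at offset i.
days, weather, temper, wind = (flat[i::4] for i in range(4))

print(days)    # ['7日(今天)', '8日(明天)', '9日(后天)']
print(temper)  # ['30/22℃', '28/21℃', '31/23℃']
```

Slices never raise IndexError, so this also copes with a page that returns fewer days than expected, whereas the fixed ranges above would fail on a shorter list.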
Finally, build a dict and turn it into a table with pandas:
seven_days_weather = {'日期': days, '天气': weather, '温度': temper, '风力': wind}
pd.DataFrame(seven_days_weather)
I ran this in Jupyter Notebook, where the DataFrame on the last line is displayed automatically.
Full code:
import requests
import bs4
import pandas as pd
url = r'http://www.weather.com.cn/weather/101280101.shtml'
response = requests.get(url)
response.encoding = 'utf-8'
soup = bs4.BeautifulSoup(response.text, features='lxml')
weather_all = [i.text.split('\n') for i in soup.findAll(name='ul', attrs={'class': 't clearfix'})]
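for i in weather_all:
    # keep removing '' until none are left in this sub-list
    while '' in i:
        i.remove('')
print(weather_all)
days, weather, temper, wind = [], [], [], []
# Each day takes 4 consecutive items: date, weather, temperature, wind.
for i in range(0, 25, 4):
    if i <= 24:  # always true for this range
        days.append(weather_all[0][i])
for j in range(1, 26, 4):
    if j <= 25:  # always true for this range
        weather.append(weather_all[0][j])
for k in range(2, 27, 4):
    if k <= 26:  # always true for this range
        temper.append(weather_all[0][k])
for l in range(3, 28, 4):
    if l <= 27:  # always true for this range
        wind.append(weather_all[0][l])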
seven_days_weather = {'日期': days, '天气': weather, '温度': temper, '风力': wind}
pd.DataFrame(seven_days_weather)
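Lastly, I'd love to hear about any cleaner way to do this. Experts, feel free to treat this post as a bit of comic relief, haha.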