A First Look at Web Scraping (Python)
I've spent the last couple of days tidying up the scrapers I wrote when I was first learning Python. Here is a simple example to get you started:
Scraping provinces, cities, and their temperatures.
The general approach:
1. Fetch the page content
```python
from bs4 import BeautifulSoup
import requests

def get_temperature(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
        'Upgrade-Insecure-Requests': '1',
        # 'Referer': 'http://www.weather.com.cn/textFC/hb.shtml',
        'Host': 'www.weather.com.cn'
    }
    # e.g. url = "http://www.weather.com.cn/textFC/hb.shtml"
    req = requests.get(url, headers=headers)
    # If the decoded text comes out garbled, set req.encoding explicitly
    # (e.g. 'utf-8' or 'gbk') before reading req.text.
    print(req.content)
```
2. Parse the data
```python
    # The page content fetched above
    content = req.content
    soup = BeautifulSoup(content, 'lxml')
    # Provinces: each conMidtab2 block holds one province's table
    conMid_tab = soup.find('div', class_='conMidtab')
    conMid_list = conMid_tab.find_all('div', class_='conMidtab2')
    for x in conMid_list:
        tr_list = x.find_all('tr')[2:]  # skip the header rows
        # print(tr_list)
        # Cities
        for index, tr in enumerate(tr_list):
            if index == 0:  # in the first tr, the province and city names share the row
                td_list = tr.find_all('td')
                province = td_list[0].text.replace('\n', ' ')  # province
                city = td_list[1].text.replace('\n', ' ')      # city
                minW = td_list[7].text.replace('\n', ' ')      # minimum temperature
            else:  # every later tr carries only the city name
                td_list = tr.find_all('td')
                city = td_list[0].text.replace('\n', ' ')  # replace('\n', ' ') strips the newline
                minW = td_list[6].text.replace('\n', ' ')  # minimum temperature
            print('%s|%s' % (province + city, minW))
```
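The row-walking logic above can be checked offline against a reduced stand-in for the table markup. The class names below match the real page, but the cell contents and column padding are made up for illustration:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the conMidtab markup on weather.com.cn:
# two header rows, then one row with province+city and one with city only.
html = """
<div class="conMidtab">
  <div class="conMidtab2">
    <table>
      <tr><td>header</td></tr>
      <tr><td>header</td></tr>
      <tr>
        <td>Beijing</td><td>Beijing</td><td>-</td><td>-</td>
        <td>-</td><td>-</td><td>-</td><td>-2</td>
      </tr>
      <tr>
        <td>Haidian</td><td>-</td><td>-</td><td>-</td>
        <td>-</td><td>-</td><td>-3</td>
      </tr>
    </table>
  </div>
</div>
"""

soup = BeautifulSoup(html, 'lxml')
rows = []
for block in soup.find('div', class_='conMidtab').find_all('div', class_='conMidtab2'):
    for index, tr in enumerate(block.find_all('tr')[2:]):
        td_list = tr.find_all('td')
        if index == 0:  # province and city share the first data row
            province = td_list[0].text.strip()
            city = td_list[1].text.strip()
            minw = td_list[7].text.strip()
        else:           # later rows carry only the city
            city = td_list[0].text.strip()
            minw = td_list[6].text.strip()
        rows.append((province + city, minw))

print(rows)  # [('BeijingBeijing', '-2'), ('BeijingHaidian', '-3')]
```

This makes the column offsets easy to see: the first data row has one extra leading cell (the province), which is why the minimum temperature sits at index 7 there but index 6 everywhere else.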
3. Data analysis
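The post stops before this step, but as a hedged sketch: one simple analysis is to collect the scraped pairs into a list instead of printing them, then rank cities by minimum temperature. The sample data below is made up and only mirrors the shape the scraper prints:

```python
# Made-up sample of (province+city, minimum temperature) pairs,
# in the same shape the scraper above prints.
rows = [('BeijingBeijing', '-2'), ('HebeiShijiazhuang', '1'),
        ('HeilongjiangHarbin', '-19'), ('GuangdongGuangzhou', '12')]

# Sort ascending by temperature; the scraped values are strings,
# so cast to int for a numeric sort.
coldest = sorted(rows, key=lambda r: int(r[1]))

for name, temp in coldest[:3]:
    print('%s: %s' % (name, temp))
```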
P.S. The libraries you'll need:
**bs4, for navigating the parsed page**
pip install bs4
**lxml, the HTML parser backend**
pip install lxml
**requests, for making HTTP requests**
pip install requests