Learning to scrape specific web page content with requests and bs4

work-harder

Tags: python win7

# Check a city's PM2.5 value and air quality; if the given city does not exist, show the result for Beijing instead.
'''

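As the title says, this is a learning exercise. The idea of scraping PM2.5 data is borrowed from posts found online; the implementation details are my own attempts, and I have not checked how much they overlap with similar posts.

If you have any comments, please send me a private message. Thanks.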
key points:
1. use requests to fetch the page and read its text;
2. pass the page text to BeautifulSoup, which builds a DOM-like parse tree;
3. find where the target information is stored; keep trying until the result is correct;
4. items in a result collection can be accessed one by one, e.g. select('abc') for matching elements, or abc['abc'] for an attribute of a tag;
5. analyse the page source in Chrome, not IE.

result:
works well in win7, python 3.0, requests module, bs4 module
'''
from bs4 import BeautifulSoup
import requests
#checkcity='jiangmen'
checkcity='abc'
find_checkcity=''
pm25url='http://www.pm25.com/'
tempurl=pm25url+checkcity+'.html'
#print (tempurl) #test step
res=requests.get(tempurl)
res.encoding='utf-8'

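#print(res.text)
# build the soup first: it must exist before select() is called on it
soup=BeautifulSoup(res.text,'html.parser')
#print (soup.text) #works

#if checkcity is not in the city list, fall back to beijing
for city1 in soup.select('.city_province_item'):
    for href1 in city1.select('a'):
        # .get() avoids a KeyError on an <a> tag without href
        if checkcity in href1.get('href',''):
            find_checkcity='yes' # assignment (=), not comparison (==)
if find_checkcity=='':
    # city not found: fetch and parse the beijing page instead
    find_checkcity='beijing'
    res=requests.get(pm25url+find_checkcity+'.html')
    res.encoding='utf-8'
    soup=BeautifulSoup(res.text,'html.parser')
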
for city in soup.select('.banner_index'):
    mycity=city.select('h2')[0].text
    mypm25=city.select('a')[2]['pm25']
    myqua=city.select('a')[2]['qua']
    print(mycity, ": PM2.5-",mypm25, ", air quality-" , myqua, sep="")
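
The selection steps from the key points can be tried offline, without calling the website. The sketch below is only an illustration: `SAMPLE_HTML` is made-up markup that imitates the class names and attributes the script relies on (the real page may be structured differently), and `city_is_listed` and `read_banner` are hypothetical helper names, not part of requests or bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the live page; the real structure may differ.
SAMPLE_HTML = """
<div class="city_province_item">
  <a href="/beijing.html">Beijing</a>
  <a href="/jiangmen.html">Jiangmen</a>
  <a>no link here</a>
</div>
<div class="banner_index">
  <h2>Beijing</h2>
  <a href="#">first</a>
  <a href="#">second</a>
  <a href="#" pm25="42" qua="good">detail</a>
</div>
"""

def city_is_listed(soup, city):
    """Return True if a link in the city list points at '<city>.html'."""
    for link in soup.select('.city_province_item a'):
        # .get() returns '' instead of raising when href is missing
        if link.get('href', '').endswith('/' + city + '.html'):
            return True
    return False

def read_banner(soup):
    """Return (city name, pm2.5 value, quality) from the first banner, or None."""
    banner = soup.select_one('.banner_index')
    if banner is None:
        return None
    title = banner.select_one('h2')
    links = banner.select('a')
    if title is None or len(links) < 3:
        return None
    # tag['name'] reads an attribute; the third link is assumed to carry the data
    return (title.text, links[2]['pm25'], links[2]['qua'])

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
print(city_is_listed(soup, 'jiangmen'), city_is_listed(soup, 'abc'))
print(read_banner(soup))
```

Matching on the end of the href rather than with `in` avoids a short name such as 'abc' accidentally matching a longer link, and returning None keeps a changed page layout from raising an IndexError.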