Python Web Scraping: BeautifulSoup

风起云永

Category: Python. Tags: python, web scraping, bs

Article link: https://blog.csdn.net/xingweiyong/article/details/51322085

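This code scrapes historical housing price data for every district and county of Tianjin from the official website of the Tianjin Housing Administration Bureau. The starting URL is [Tianjin Housing Administration Bureau: 2007 housing price details by district and county](http://www2.tjfdc.gov.cn/Lists/List51/DispForm1.aspx?ID=2).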

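The overall approach: download each page's source with urllib2, parse it with BeautifulSoup, locate the required information by node and field characteristics, and save it as strings. See the BeautifulSoup reference documentation for API details. A for loop over the page ID handles paging.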
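
As a small illustration of locating data by node and attribute, the sketch below parses an inline HTML fragment rather than a live page. The fragment is invented for this example (its class name and row height mirror the ones the scraper checks for, but the cell values are placeholders), the function name `extract_rows` is hypothetical, and the code is written for Python 3 with the `bs4` package installed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the attributes mirror the ones the scraper tests for;
# the values are placeholders, not real data.
SAMPLE_HTML = """
<table>
  <tr><td class="UserArticleHeader">Sample title</td></tr>
  <tr style="height:14.25pt">
    <td><span lang="EN-US">101</span></td>
    <td><span lang="EN-US">202</span></td>
  </tr>
  <tr style="height:20pt">
    <td><span lang="EN-US">ignored</span></td>
  </tr>
</table>
"""

def extract_rows(html):
    """Return (title, rows); each row is a list of cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    # Locate the title cell by its class attribute.
    title_cell = soup.find("td", class_="UserArticleHeader")
    title = title_cell.get_text(strip=True) if title_cell else ""
    rows = []
    # Locate data rows by their exact inline style, then pick the spans
    # inside each row with a CSS attribute selector.
    for tr in soup.find_all("tr", style="height:14.25pt"):
        cells = [span.get_text(strip=True)
                 for span in tr.select('span[lang="EN-US"]')]
        rows.append(cells)
    return title, rows

if __name__ == "__main__":
    print(extract_rows(SAMPLE_HTML))
```

Filtering on the exact `style` string is brittle, which is why the scraper has to handle a second row height once the page layout changes.
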
The full code (Python 2) follows:

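# coding: utf-8
# Python 2 code (urllib2, print statement)
import urllib2
from bs4 import BeautifulSoup
import sys
import random,time
# Python 2 only: make utf-8 the default encoding for implicit str/unicode conversions
reload(sys)
sys.setdefaultencoding('utf-8')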
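# Extract a field: slice the text that follows the first '">' up to the first '<' found from index 2 onward
def get_sub(markup):
    start=markup.find('">')+2
    end=markup.find('<',2)
    return markup[start:end]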
# Write a string to the output file
def save_as(text,f):
    f.write(text)
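# Check for network or parsing errors: returns True if the page downloads and parses
def check_error(web_url):
    try:
        request=urllib2.Request(web_url)
        response=urllib2.urlopen(request)
        BeautifulSoup(response)  # parse once so that parsing errors are caught here too
        return True
    except Exception as e:
        print e
        return False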
# Main function: download one page, extract the rows, and append them to result1.txt
def get_web(web_url):
    request=urllib2.Request(web_url)
    response=urllib2.urlopen(request)
    soup=BeautifulSoup(response)
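    #print soup.prettify()  # BeautifulSoup converts the page source to Unicode by default, then outputs it as utf-8
    temp=[]
    tit=u''  # page title; stays empty if no matching <td> is found
    for tag in soup.find_all('td'):
        if tag.has_attr('class') and tag.get('class')[0]=='UserArticleHeader':
            if tag.string is not None:
                tit=tag.string+' '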
    for tag in soup.find_all('tr'):
        if tag.has_attr('style'):
            if tag['style']=='height:14.25pt':
                temp.append(tit)
                if tag.td.p.span.span is not None:
                    for i in tag.td.p.span.span.stripped_strings:
                        if i!=u'县':  # skip the standalone "county" suffix; u'蓟县' is restored when saving
                            temp.append(i+' ')
                else:
                    for item in tag.td.p.strings:
                        if item!=u'成交套数（套）':  # skip the "units sold" header cell
                            temp.append(item+' ')
                item=tag.select('span[lang="EN-US"]')
                if len(item)>6:  # filter out noise rows
                    for cell in item:
                        # skip empty cells and stray opening parentheses
                        if cell.string is not None and cell.string!=u'(':
                            temp.append(cell.string+' ')
                    temp.append('\n')
            if tag['style']=='height:15.75pt':  # the page style changed (different row height)
                temp.append(tit)
                # the district name sits in <font> when present, otherwise directly in <span>
                if tag.td.p.span.font is not None:
                    temp.append(get_sub(str(tag.td.p.span.font))+' ')
                else:
                    temp.append(get_sub(str(tag.td.p.span))+' ')
                item=tag.select('span[lang="EN-US"]')
                if len(item)>11:  # filter out noise rows
                    for cell in item:
                        if cell.string is not None:
                            temp.append(cell.string+' ')
                    temp.append('\n')
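    # Drop a duplicated leading title or a stray "(" before saving
    if len(temp)>1 and temp[1]==tit:
        del temp[0]
    if len(temp)>1 and temp[1]==u'(':
        del temp[0:2]
    result_f=open('result1.txt','a')
    try:
        for item in temp:
            if item==u'蓟 ':
                item=u'蓟县 '  # restore the county name, keeping the trailing separator
            save_as(item.encode('utf-8'), result_f)
    finally:
        result_f.close()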
if __name__=='__main__':
    for i in range(2,3079):#[3,1000,2000,3000,3015]
        url='http://www2.tjfdc.gov.cn/Lists/List51/DispForm1.aspx?ID='+ str(i)
        print url
        time.sleep(random.randint(1,4))
        if check_error(url):
            get_web(url)
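
One thing to note about the code above is that every page is fetched twice: once in `check_error` and again in `get_web`. A possible way to avoid this is to fetch once and return `None` on failure. The sketch below is a Python 3 adaptation (using `urllib.request` in place of `urllib2`), not the author's code; `fetch_page`, `crawl`, and the `opener`, `handle`, and `delay` parameters are hypothetical names introduced for this example.

```python
import random
import time
import urllib.error
import urllib.request

BASE_URL = "http://www2.tjfdc.gov.cn/Lists/List51/DispForm1.aspx?ID="

def fetch_page(url, opener=urllib.request.urlopen, timeout=10):
    """Download url once; return the raw bytes, or None if the request fails."""
    try:
        response = opener(url, timeout=timeout)
        try:
            return response.read()
        finally:
            response.close()
    except (urllib.error.URLError, OSError) as exc:
        print("skipping %s: %s" % (url, exc))
        return None

def crawl(ids, handle, opener=urllib.request.urlopen, delay=(1, 4)):
    """Fetch each page ID once and pass successful downloads to handle()."""
    fetched = 0
    for page_id in ids:
        url = BASE_URL + str(page_id)
        # Random pause between requests, as in the original loop.
        time.sleep(random.randint(*delay))
        html = fetch_page(url, opener=opener)
        if html is not None:
            handle(html)
            fetched += 1
    return fetched
```

Usage might look like `crawl(range(2, 3079), parse_page)`, where `parse_page` stands for whatever function turns the downloaded bytes into rows. Passing the opener as a parameter also makes the loop easy to exercise without network access.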
