Web Crawler: Scraping Web Page Data into a Spreadsheet

Latest recommended article published on 2024-06-24 12:33:01

无纸~文 (pinned post)

Category column: Python Learning

Article link: https://blog.csdn.net/noingw96/article/details/82177587

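Recently, for my own needs, I have been teaching myself web scraping from books and online material. The target site is http://mzj.beijing.gov.cn; the script filters and organizes its content and saves it as an Excel file.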

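The first step is to set up the spreadsheet. The workbook encoding is set to utf-8 and one sheet is added. head holds the header row; once it is defined, sheet.write writes each header cell into row 0.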

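import xlwt

book = xlwt.Workbook(encoding='utf-8')
sheet = book.add_sheet('ke_qq')
# Header row, kept in the site's own field labels: organization name, registration no.,
# unified social credit code, supervising unit, registration authority, organization type,
# start-up funds, business scope, legal representative, phone, address, postal code,
# registration status, date established, industry category
head = ['组织名称','登记证号','统一社会信用代码','业务主管单位','登记管理机关','社会组织类型','开办资金','业务范围','法定代表人','电话','地址','邮编','登记状态','成立日期','行业分类']
for h in range(len(head)):
    sheet.write(0, h, head[h])    # write the header into row 0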

The pages are fetched with requests and parsed with BeautifulSoup.

response = requests.get(url, headers=headers)
# response.text is already a decoded str, so a from_encoding argument would be ignored
soup = BeautifulSoup(response.text, 'html.parser')

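Next, the useful fields are extracted from the page. soup.stripped_strings yields each text string with its leading and trailing whitespace removed and skips strings that contain only whitespace, which gets rid of the spaces and blank lines.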

str1 = []
nice = []
for wz in soup.stripped_strings:
    str1.append(repr(wz))    # repr() keeps each string wrapped in quotes
k = len(str1)
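
One detail of the snippet above is easy to miss: repr() wraps every string in quote characters, which is why the full script later compares against literals such as '\'电话：\''. The following is a minimal sketch of that effect; the hard-coded list is a hypothetical stand-in for soup.stripped_strings, and the phone number is made up.

```python
# Hypothetical stand-in for list(soup.stripped_strings).
strings = ['电话：', '010-00000000']

quoted = [repr(s) for s in strings]   # what the snippet above stores
plain = list(strings)                 # the same data without repr()

# repr() adds surrounding quotes, so a comparison has to include them.
matches_quoted = quoted[0] == "'电话：'"
# Without repr(), the label can be compared directly.
matches_plain = plain[0] == '电话：'
```

Because the quoted form is also what ends up in the spreadsheet cells, dropping repr() and comparing against the plain labels may be preferable if the quotes are not wanted.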

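Finally, the data is tidied up according to your own needs. Here insert, pop, and append are used to adjust the list, for example to fill in a placeholder where a field has no value on the page.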
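
The fixed-index insert and pop calls depend on the exact order of the fields on the page. As an alternative sketch (not the approach used in this post), the strings can be walked once and each known label paired with the string that follows it, falling back to '无' when the next string is another label or is missing. The function name pair_fields and the sample data are hypothetical, and the sketch assumes plain strings without the repr() quoting.

```python
def pair_fields(tokens, labels, default='无'):
    """Map each known label to the token that follows it.

    If a label is followed by another label (or by nothing), the
    value was missing on the page, so `default` is used instead.
    """
    label_set = set(labels)
    fields = {label: default for label in labels}
    for i, token in enumerate(tokens):
        if token not in label_set:
            continue
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if nxt is not None and nxt not in label_set:
            fields[token] = nxt
    return fields


# Hypothetical sample: '地址：' has no value on the page.
labels = ['电话：', '地址：', '邮编：']
tokens = ['电话：', '010-00000000', '地址：', '邮编：', '100000']
row = pair_fields(tokens, labels)
```

With a dictionary keyed by label, the spreadsheet columns can then be written in whatever order the header list defines, independent of the order on the page.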

The complete code is as follows:

# coding:utf-8
import requests
from bs4 import BeautifulSoup
import operator as op
import xlwt

user_agent = 'Mozilla/4.0 (compatible;MSIE5.5;windows NT)'
headers = {'User-Agent': user_agent}
num=1
book = xlwt.Workbook(encoding='utf-8')
sheet = book.add_sheet('ke_qq')
# Header row, kept in the site's own field labels
head = ['组织名称','登记证号','统一社会信用代码','业务主管单位','登记管理机关','社会组织类型','开办资金','业务范围','法定代表人','电话','地址','邮编','登记状态','成立日期','行业分类']
for h in range(len(head)):
    sheet.write(0,h,head[h])    # write the header into row 0
for one in range(10001,17000):
    keyword = 10000000001
    keywords=keyword+one
    url = 'http://mzj.beijing.gov.cn/wssbweb/wssb/dc/orgInfo.do?action=seeParticular&orgId=0000' + str(keywords) + '&websitId=&netTypeId='
    response = requests.get(url, headers=headers)
    # response.text is already a decoded str, so a from_encoding argument would be ignored
    soup = BeautifulSoup(response.text, 'html.parser')
    str1 = []
    nice = []
    for wz in soup.stripped_strings:
        str1.append(repr(wz))
    k = len(str1)
    if k>5:
        i = 1
        for content in str1:
            if i > 3:
                nice.append(content)
            i = i + 1
        try:
            # num=num+1
            if  op.eq(nice[4], '\'业务主管单位：\''):
                nice.insert(4, '无')
            if op.eq(nice[14], '\'法定代表人/负责人：\''):
                nice.insert(14, '无')
            if op.eq(nice[13], '\'活动地域：\''):
                nice.pop(13)
                nice.pop(13)
            if op.eq(nice[16], '\'电话：\''):
                nice.insert(16, '无')
            if op.eq(nice[18], '\'地址：\''):
                nice.insert(18, '无')
            if op.eq(nice[20], '\'邮编：\''):
                nice.insert(20, '无')
            if len(nice)>22:
                if op.eq(nice[22], '\'登记状态：\''):
                    nice.insert(22, '无')
            if len(nice) > 27:
                if op.eq(nice[27], '\'行业分类：\'') and len(nice) == 28:
                    nice.append('无')
                # if op.eq(nice[13], '\'活动地域：\''):
                #   nice.pop(13)
                #  nice.pop(13)
            if op.eq(nice[12], '\'元\''):
                nice[12] = '0'
            # print(nice)
            # Even positions hold the values (odd positions are the labels),
            # so value number k goes into column k of the current row.
            for col, data in enumerate(nice[::2]):
                sheet.write(num, col, data)
            print(num)
            num += 1
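        except Exception as exc:
            # Typically an IndexError when a page has fewer fields than expected.
            # (The original 'error'+num raised a TypeError: str + int.)
            print('error', num, exc)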

# Raw string so the backslashes in the Windows path are taken literally
book.save(r'E:\WU\pyfile\shuju\save2\shuju2.xls')

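Note that the keyword in the page URL is specific to this site; for other sites, the way the URL is constructed may need to be different.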
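
To make that dependency easier to change, the URL construction from the script can be pulled into one small function. This is only a sketch of the same string concatenation; the names BASE_URL, KEYWORD, and build_url are hypothetical and do not appear in the original code.

```python
BASE_URL = 'http://mzj.beijing.gov.cn/wssbweb/wssb/dc/orgInfo.do'
KEYWORD = 10000000001  # base number used in the script above


def build_url(one):
    """Return the detail-page URL for loop counter `one`."""
    org_id = '0000' + str(KEYWORD + one)
    return (BASE_URL + '?action=seeParticular&orgId=' + org_id
            + '&websitId=&netTypeId=')


# First value produced by range(10001, 17000) in the script.
first_url = build_url(10001)
```

For a different site, only this function would need to be rewritten, while the fetching and parsing loop stays the same.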