Python Web Scraping Practice 2: Querying Enterprise Business Registration Information

Latest recommended article published on 2024-09-13 22:22:02

kyle_1111

Categories: Data, Python

Article link: https://blog.csdn.net/kyle_1111/article/details/103781641

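This article describes a hands-on use of a Python scraper to look up business registration information on Qichamao, stressing the importance of data cleaning and the problems encountered during the project. By setting a crawl interval and reworking the code, 8,000 records were successfully retrieved in 7 hours.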

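In terms of content, a project can be divided into four major parts: data, process, organization, and functionality. Ranked by workload, data and process firmly hold the top two spots. This is especially true when it comes to cleaning up opening (initial) data: tens of thousands of junk records of very poor quality, and yet the company's business still manages to run. It gave me a deep appreciation for the wisdom of veteran employees: if the mountain won't move, the water goes around it; if the water won't move, the people do. There is always some clever idea, some unorthodox workaround, that keeps the business flowing.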

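Ranked by tedium, data work once again lands in the top tier. Fortunately, everyone knows that data cleaning is slow work, so I used the slack time to tinker with Python, which at least has some personal value. While organizing the suppliers' initial data, I found that required fields were missing. I found the following script online: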

GitHub: batch query of enterprise business registration information https://github.com/nigo81/tools-for-auditor

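The approach works: it scrapes Qichamao (企查猫, www.qichamao.com). Its advantages are that no secondary identity verification is required and that anti-scraping monitoring is weak. I set it to sleep for 30 seconds after every 50 records. In practice it scraped 8,000 records in 7 hours.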
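The "50 records, then sleep 30 seconds" schedule can be sketched roughly as follows. This is only a minimal illustration, not the code from the linked repository: the names `throttled` and `total_pause_seconds` are hypothetical, and I am assuming the pause happens between batches and not after the last one.

```python
import time


def throttled(items, batch_size=50, pause_seconds=30, sleep=time.sleep):
    """Yield items one by one, pausing after every full batch.

    `items` must be a sequence (it needs len()); `sleep` is injectable
    so the schedule can be checked without actually waiting.
    """
    for index, item in enumerate(items, start=1):
        yield item
        # Pause only between batches, not after the final item.
        if index % batch_size == 0 and index < len(items):
            sleep(pause_seconds)


def total_pause_seconds(n_items, batch_size=50, pause_seconds=30):
    """Total time spent sleeping for n_items under the schedule above."""
    full_batches = n_items // batch_size
    # No pause is needed after the last batch when it ends exactly at n_items.
    if n_items % batch_size == 0 and full_batches > 0:
        full_batches -= 1
    return full_batches * pause_seconds


if __name__ == "__main__":
    # Hypothetical company names standing in for the real search keys.
    names = ["company_%d" % i for i in range(120)]
    for name in throttled(names, sleep=lambda s: print("sleeping", s, "s")):
        pass  # the actual lookup request would go here
    print(total_pause_seconds(8000), "seconds of pauses for 8000 records")
```

Under these assumptions, the pauses for 8,000 records add up to 159 × 30 = 4,770 seconds, about 80 minutes. The rest of the 7 hours would then be request and parsing time, roughly 2.5 seconds per record on average.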

I rewrote parts of the code, shown below. The original code would not run as is.

# -*- coding: utf-8 -*-
# Author: wwcheng
import csv
import logging
import logging.handlers
import os
import random
import time
from urllib import request
import numpy as np
import pandas as pd
import prettytable as pt
import requests
import xlwt
from fake_useragent import UserAgent
from lxml import etree

class QiChaMao(object):
  def __init__(self):  
    self.start_time = time.time()
    self.timestr = str(int(time.time() * 1000))
    self.nowpath = os.path.abspath(os.curdir)
    self.search_key = ''
    self.company_url = ''
    self.company_id = ''
    self.company_name = ''
    self.basic_info = []
    self.cyxx_info = ''
    self.fzjg_info = ''
    self.share_info = ''
    self.dwtz_info = ''
    self.company_list = []
    self.base_headers = {
      'Host': 'www.qichamao.com',
      'Connection': 'keep-alive',
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
      'Accept-Encoding': 'gzip, deflate, br',
      'Accept-Language': 'zh-CN,zh;q=0.9',
    }
    self.__logging()  

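  def __cookies_load(self):
    '''Internal helper: load cookies and refresh the trailing timestamp.'''
    filepath = os.path.join(self.nowpath, 'Input', 'cookies.txt')
    with open(filepath, 'r') as f:
      cookies = f.read().strip()
    if len(cookies) < 50:
      self.logger.info('Please fill in your cookies in cookies.txt first')
      return None
    else:
      # Slice off the last ten characters and append the current millisecond timestamp
      cookies = cookies[0:-10]+str(int(time.time() * 1000))
      return cookies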

  def __logging(self):
    '''Internal helper: set up logging.'''
    logging.basicConfig(filename=os.path.join(self.nowpath, "Output", "error.log"),
                        filemode="w",
                        format="%(asctime)s %(name)s:%(levelname)s:%(message)s",
                        # %m is the month; the original %M would print minutes here
                        datefmt="%d-%m-%Y %H:%M:%S",
                        level=logging.INFO
                        )
    self.logger = logging.getLogger("15Scends")
    handler = logging.StreamHandler()
    handler.setLevel(logging.INFO)
    formatter = logging.Formatter(
      "%(asctime)s %(name)s %(levelname)s %(message)s")
    handler.setFormatter(formatter)
    self.logger.addHandler(handler)

  def __get_html(self):
    '''Internal helper: fetch the page source.'''
    headers2 = {'Referer': self.search_url,
                'Cookie':