A project like this breaks down into four parts by content: data, processes, organization, and functionality. Ranked by workload, data and processes take the top two spots, especially the cleanup of opening-balance data: tens of thousands of junk records. With data this bad, the fact that the business still runs at all gave me a real appreciation for the veterans' wisdom. As the saying goes, if the mountain won't turn, the water turns; if the water won't turn, the people do. There is always some clever workaround keeping the business flowing.
If you rank project tasks by tedium, data again takes first place. Fortunately, everyone knows data cleanup is slow work, so I used the downtime to tinker with Python, which at least adds some personal value. While preparing the supplier opening-balance data, I found that required fields were missing. I found a script online:
GitHub - batch lookup of corporate registration data: https://github.com/nigo81/tools-for-auditor
The approach works: it scrapes qichamao.com (企查猫). The advantage is that no secondary identity verification is required, and the site's anti-scraping monitoring is weak. I configured it to fetch 50 records per batch with a 30-second sleep between batches. In practice, fetching 8,000 records took about 7 hours.
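The batch-and-sleep pattern above can be sketched as a small helper. This is a minimal illustration, not code from the original script; `crawl_in_batches` and `fetch` are hypothetical names, and the batch size and pause match the settings described above.

```python
import time


def crawl_in_batches(items, fetch, batch_size=50, pause=30):
    """Fetch items in fixed-size batches, sleeping between batches
    to stay under the site's rate limits.

    Hypothetical helper for illustration: `fetch` is any callable
    that retrieves one record (e.g. one company's registration data).
    """
    results = []
    for i in range(0, len(items), batch_size):
        batch = items[i:i + batch_size]
        results.extend(fetch(item) for item in batch)
        # Pause only if there are more batches left to fetch.
        if i + batch_size < len(items):
            time.sleep(pause)
    return results
```

With 8,000 records at 50 per batch, this yields about 160 batches, so the 30-second pauses alone account for roughly 80 minutes; the rest of the 7 hours is request latency and parsing.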
I rewrote part of the code, shown below; the original would not run as-is.
# -*- coding: utf-8 -*-
# Author: wwcheng
import csv
import logging
import logging.handlers
import os
import random
import time
from urllib import request
import numpy as np
import pandas as pd
import prettytable as pt
import requests
import xlwt
from fake_useragent import UserAgent
from lxml import etree
class QiChaMao(object):
    def __init__(self):
        self.start_time = time.time()
        self.timestr = str(int(time.time() * 1000))
        self.nowpath = os.path.abspath(os.curdir)
        self.search_key = ''
        self.company_url = ''
        self.company_id = ''
        self.company_name = ''
        self.basic_info = []
        self.cyxx_info = ''
        self.fzjg_info = ''
        self.share_info = ''
        self.dwtz_info = ''
        self.company_list = []
        self.base_headers = {
            'Host': 'www.qichamao.com',
            'Connection': 'keep-alive',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9',
        }
        self.__logging()

    def __cookies_load(self):
        '''Internal: load cookies and swap in a fresh timestamp.'''
        filepath = os.path.join(self.nowpath, 'Input', 'cookies.txt')
        with open(filepath, 'r') as f:
            cookies = f.read().strip()
        if len(cookies) < 50:
            self.logger.info('Please fill in your cookies in cookies.txt first')
            return None
        # Replace the last ten characters (a millisecond timestamp)
        # with the current time.
        return cookies[0:-10] + str(int(time.time() * 1000))

    def __logging(self):
        '''Internal: set up logging to file and console.'''
        logging.basicConfig(filename=os.path.join(self.nowpath, 'Output', 'error.log'),
                            filemode="w",
                            format="%(asctime)s %(name)s:%(levelname)s:%(message)s",
                            datefmt="%d-%m-%Y %H:%M:%S",  # %m is month; the original %M meant minutes
                            level=logging.INFO)
        self.logger = logging.getLogger("15Scends")
        handler = logging.StreamHandler()
        handler.setLevel(logging.INFO)
        formatter = logging.Formatter(
            "%(asctime)s %(name)s %(levelname)s %(message)s")
        handler.setFormatter(formatter)
        self.logger.addHandler(handler)

    def __get_html(self):
        '''Internal: fetch the page source.'''
        headers2 = {'Referer': self.search_url,
                    'Cookie':