A Beginner's Python Crawler

Latest recommended article published on 2024-08-17 20:45:04

有纯金理想的萝卜

Category: Essays. Tags: python, crawler, beginner, saving to a MySQL database

Article link: https://blog.csdn.net/YCJLXDLB/article/details/81703084

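A crawler in three steps:

1. Fetch the HTML of the page you want to crawl

2. Parse the HTML and extract the information you want

3. Download the text and images to local disk, or save them to a database

Environment: MySQL + Python 2.7 (32-bit) + VS Code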

html_utils.py (requesting and parsing)

#-*-coding:utf-8 -*-

import requests
from bs4 import BeautifulSoup
import other_utils as otherUtil
import sys

# Python 2 only: make UTF-8 the default for implicit str/unicode conversions
reload(sys)
sys.setdefaultencoding('UTF-8')

# Fetch the whole page; returns the requests response object
def download_page_html(url,headers):
    data = requests.get(url,headers=headers)
    return data

# Parse the returned HTML (Lagou)
def lagou_parse_page_html(data):
    soup = BeautifulSoup(data.content,"html.parser")
    # All <li> tags that hold job postings
    job_items = soup.find_all('li',attrs={'class':'con_list_item default_list'})
    object_list = []
    for li in job_items:
        # Job title (value of the <li> tag's data-positionname attribute)
        job_name = li.get('data-positionname')
        # Salary range
        salary_range = li.get('data-salary')
        # Company name
        company_name = li.get('data-company')
        # Company address (text of the <em> child of the <li> tag)
        company_address = li.find('em').string
        # Company type and funding stage (text of the <div> with class=industry, whitespace trimmed)
        company_type = li.find('div',attrs={'class':'industry'}).string.strip()
        # Job requirements: keep the last non-empty text fragment inside the li_b_l <div>
        job_requirement = ''
        for text in li.find('div',attrs={'class':'li_b_l'}).stripped_strings:
            job_requirement = text
        # Company self-description (first and last characters dropped)
        company_desc = li.find('div',attrs={'class':'li_b_r'}).string[1:-1]
        # Company logo
        company_logo = 'https:' + li.find('img').get('src')
        img_name = li.find('img').get('alt')
        # Download the company logo to local disk
        otherUtil.download_img(company_logo,img_name)
        # Named "job" rather than "dict" to avoid shadowing the built-in
        job = {
            'jobName':job_name,
            'salaryRange':salary_range,
            'companyName':company_name,
            'companyAddress':company_address,
            'companyType':company_type,
            'jobRequirement':job_requirement,
            'companyDesc':company_desc,
            'companyLogo':company_logo
        }
        object_list.append(job)

    # URL of the next page ('javascript:;' means this is the last page;
    # the search apparently has to start from certain parent nodes)
    next_page_a = soup.find('div',attrs={'id':'s_position_list'}).find('div',attrs={'class':'pager_container'}).contents[-2]
    next_page_url = next_page_a.get('href')

    return object_list,next_page_url

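For an overview of the requests module, see https://zhuanlan.zhihu.com/p/20410446; that author's column also has some good crawler tutorials.

BeautifulSoup documentation in Chinese: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

That covers requesting and parsing the page. What remains is saving the data locally or to a database, depending on what you need.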
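As a rough illustration of the parsing step that runs without any third-party package, the sketch below reads the same `data-positionname`, `data-salary`, and `data-company` attributes with Python 3's standard-library `html.parser` in place of BeautifulSoup. The `JobItemParser` class and the sample markup are made up for this example; the real page's markup may differ.

```python
from html.parser import HTMLParser


class JobItemParser(HTMLParser):
    """Collect data-* attributes from <li class="con_list_item ..."> tags."""

    def __init__(self):
        super().__init__()
        self.jobs = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == "li" and "con_list_item" in classes:
            self.jobs.append({
                "jobName": attr_map.get("data-positionname"),
                "salaryRange": attr_map.get("data-salary"),
                "companyName": attr_map.get("data-company"),
            })


# Hypothetical sample markup, shaped like the attributes read above.
SAMPLE_HTML = """
<ul>
  <li class="con_list_item default_list" data-positionname="Python Engineer"
      data-salary="15k-25k" data-company="Example Co"></li>
  <li class="other_item" data-positionname="ignored"></li>
</ul>
"""

parser = JobItemParser()
parser.feed(SAMPLE_HTML)
print(parser.jobs)
```

Only the first `<li>` is collected, because the second one does not carry the `con_list_item` class.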

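other_utils.py (downloading)

This file also contains code for scraping proxy IPs from http://www.xicidaili.com/wt, but I did not end up using it; only the image download function is used.

Some sites need more work to get around their anti-crawling measures, otherwise your IP can easily get banned. The Douban Top 250 movie list is one example.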
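One common way to lower the risk of a ban is to pause between requests and rotate through a pool of proxies. The sketch below (Python 3, standard library only) shows that idea in isolation. The `polite_settings` helper, the delay bounds, and the proxy addresses are all assumptions made for illustration; the addresses come from a range reserved for documentation and are not real proxies.

```python
import itertools
import random


def polite_settings(proxies, min_delay=1.0, max_delay=3.0, seed=None):
    """Yield (proxy, delay_seconds) pairs: proxies rotate, delays are random."""
    rng = random.Random(seed)
    for proxy in itertools.cycle(proxies):
        yield proxy, rng.uniform(min_delay, max_delay)


# Placeholder proxies in the {"http": ..., "https": ...} shape that requests accepts.
PROXIES = [
    {"http": "http://192.0.2.10:8080", "https": "https://192.0.2.10:8080"},
    {"http": "http://192.0.2.11:8080", "https": "https://192.0.2.11:8080"},
]

settings = polite_settings(PROXIES, seed=42)
first_three = [next(settings) for _ in range(3)]
for proxy, delay in first_three:
    # A real crawler would call time.sleep(delay) here and then
    # requests.get(url, headers=headers, proxies=proxy).
    print(proxy["http"], round(delay, 2))
```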

#-*-coding:utf-8 -*-

import requests
from bs4 import BeautifulSoup
import telnetlib
import codecs
import myconfig as config

# Fetch proxy IPs from the proxy listing site
def get_ip_list(url,headers):
    proxy_list = []
    data = requests.get(url,headers=headers)
    # Parse with BeautifulSoup
    soup = BeautifulSoup(data.content,"html.parser")
    # Adjust the parsing below to the page's actual markup
    row_list = soup.find_all('tr',limit=20)
    for i in range(2,len(row_list)):
        print 'wait for i:' + str(i)
        line = row_list[i].find_all('td')
        ip = line[1].text
        port = line[2].text
        # Check whether the proxy IP is usable
        flag = check_proxie(ip,port)
        print str(i) + ' is ' + str(flag)
        if flag:
            proxy = get_proxy(ip,port)
            proxy_list.append(proxy)
    return proxy_list

# Check whether a proxy IP is reachable
def check_proxie(ip,port):
    try:
        ip = str(ip)
        port = str(port)
        # Open a connection and close it again right away
        telnetlib.Telnet(ip,port=port,timeout=200).close()
    except Exception as e:
        # Print the error details
        print str(e)
        return False
    else:
        return True

# Build the proxies argument, adding the http/https prefixes
def get_proxy(ip,port):
    aip = ip + ':' + port
    proxy_ip = str('http://' + aip)
    proxy_ips = str('https://' + aip)
    proxy = {"http":proxy_ip,"https":proxy_ips}
    return proxy

# Download a company logo
def download_img(url,img_name):
    try:
        # Local path to save to
        temp_path = 'D:\\tarinee_study_word\\Python\\spider_project\\images\\' + str(img_name) + '.jpg'
        # Path encoding: img_name is Chinese, so convert the UTF-8 path to GBK
        save_path = temp_path.decode('UTF-8').encode('GBK')
        r = requests.get(url)
        r.raise_for_status()
        with codecs.open(save_path,'wb') as f:
            f.write(r.content)
    except Exception as e:
        print 'Error Msg:' + str(e)
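mysqldb_utils.py (saving to the database)

The myconfig module imported here is my configuration file. It holds the database account and password along with the other configuration values.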
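The configuration file itself is not shown in this post. As a rough guide, it could look something like the sketch below. The constant names are the ones the code in this post reads from `config`; every value is a placeholder chosen for illustration, not the author's actual setting.

```python
# myconfig.py (hypothetical contents; all values are placeholders)

# MySQL connection settings read by mysqldb_utils.py
HOST = "127.0.0.1"
PORT = 3306            # must be an integer; 3306 is MySQL's default port
USER = "crawler"
PASSWD = "change-me"
DB = "spider"
CHARSET = "utf8"

# Start URL and request headers read by index.py
DOWNLOAD_URL = "https://www.example.com/jobs/"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; example-crawler)",
}
```

Keeping credentials in a separate module like this makes it easy to leave them out of version control.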

#-*-coding:utf-8 -*-

import MySQLdb
import myconfig as config

# Save the parsed records to the database
def insert_data(object_list):
    # Open the connection
    conn = MySQLdb.Connect(
        host= config.HOST,
        port= config.PORT,
        user= config.USER,
        passwd= config.PASSWD,
        db= config.DB,
        charset= config.CHARSET
    )

    # Get a cursor
    cur = conn.cursor()

    # SQL template; the values are spliced in with str.format,
    # so a value that contains a single quote will break the statement
    sql = "insert into t_job(company_logo,company_name,company_address,company_desc,salary_range,job_name,job_requirement,company_type) values('{0}','{1}','{2}','{3}','{4}','{5}','{6}','{7}')"
    # Dictionary keys, listed in the same order as the columns above
    keys = ['companyLogo','companyName','companyAddress','companyDesc','salaryRange','jobName','jobRequirement','companyType']
    for item in object_list:
        # Look each value up by key; the order of dict.values() is not guaranteed
        skr = sql.format(*[item[k] for k in keys])
        cur.execute(skr)   # Execute the SQL statement

    # Commit
    conn.commit()
    print 'insert success......'
    # Close the cursor and the connection
    cur.close()
    conn.close()

index.py (entry point)

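import myconfig as config
import html_utils as htmlUtil
import mysqldb_utils as dbUtil

def lagou_main():
    url = config.DOWNLOAD_URL
    while url:
        print url
        # 1. Fetch the whole HTML page
        html = htmlUtil.download_page_html(url,config.HEADERS)
        # 2. Parse the HTML page and extract the data we want
        data,next_page_url = htmlUtil.lagou_parse_page_html(html)
        # 'javascript:;' marks the last page, so stop after this one
        if next_page_url != 'javascript:;':
            url = next_page_url
        else:
            url = None
        # 3. Save to the database
        dbUtil.insert_data(data)

if __name__ == '__main__':
    lagou_main()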
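The control flow of this loop can be tried out without touching the network. The sketch below (Python 3) swaps the real fetch, parse, and save functions for stand-ins; `crawl_all`, `FAKE_SITE`, and the page names are hypothetical. It also keeps a list of visited URLs, an extra safeguard that is not in the original, so that a page linking back to itself cannot cause an endless loop.

```python
LAST_PAGE = "javascript:;"


def crawl_all(start_url, fetch, parse, save):
    """Follow next-page links until the parser reports the last page."""
    url = start_url
    visited = []
    while url and url not in visited:
        visited.append(url)
        html = fetch(url)
        items, next_url = parse(html)
        save(items)
        url = None if next_url == LAST_PAGE else next_url
    return visited


# Stand-in pages: each URL maps to (items, next_url).
FAKE_SITE = {
    "page1": (["job-a", "job-b"], "page2"),
    "page2": (["job-c"], LAST_PAGE),
}
saved = []
pages = crawl_all(
    "page1",
    fetch=lambda url: url,
    parse=lambda html: FAKE_SITE[html],
    save=saved.extend,
)
print(pages, saved)
```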

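Problems I ran into with the database:

I. Installing MySQLdb failed (error: Microsoft Visual C++ 9.0 is required Unable to find vcvarsall.bat)

Solution:

1. Install the wheel package first, so that .whl files can be installed

pip install wheel

2. Download the MySQL-python wheel that matches your Python's bitness from https://www.lfd.uci.edu/~gohlke/pythonlibs/#mysql-python, then install it (the file below is the one for 32-bit Python)

pip install MySQL_python-1.2.5-cp27-none-win32.whl

II. execute(sql [,list]): in Python 2.7, when I print a list object or pass one as an argument, the data inside the list appears to have encoding problems, although each item looks fine when taken out individually. I have not solved this. I could only take the values one at a time and splice them into the SQL statement. If you know the answer, please leave a comment.

MySQLdb can take a list directly, or use executemany(sql [,list]) to run several statements in one call, but with the list encoding problem it raises an error.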
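A commonly recommended alternative to splicing values into the SQL string is a parameterized query, which passes the values separately from the SQL text. Whether it resolves the exact error described above is not confirmed here. One thing worth checking, though: in Python 2, printing a list shows the repr of each element, so a unicode string appears as escapes such as u'\u4e2d\u6587' even when the data is intact. The sketch below is Python 3 and uses the standard-library `sqlite3` module as a stand-in for MySQL so it runs without a server; `sqlite3` uses `?` placeholders where MySQLdb uses `%s`. The table is cut down to three columns and the sample rows are invented.

```python
import sqlite3

COLUMNS = ("company_name", "job_name", "salary_range")
KEYS = ("companyName", "jobName", "salaryRange")

jobs = [
    {"companyName": "示例公司", "jobName": "Python 工程师", "salaryRange": "15k-25k"},
    {"companyName": "O'Reilly", "jobName": "Editor", "salaryRange": "10k-20k"},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_job (company_name TEXT, job_name TEXT, salary_range TEXT)")

# Placeholders keep values out of the SQL text, so quotes need no escaping.
# With MySQLdb the placeholder would be %s instead of ?.
sql = "INSERT INTO t_job ({}) VALUES ({})".format(
    ", ".join(COLUMNS), ", ".join("?" for _ in COLUMNS)
)
# Build each row from explicit keys rather than relying on dict ordering.
rows = [tuple(job[key] for key in KEYS) for job in jobs]
conn.executemany(sql, rows)
conn.commit()

stored = conn.execute(
    "SELECT company_name, job_name FROM t_job ORDER BY rowid"
).fetchall()
conn.close()
print(stored)
```

The second row contains a single quote, which would break a statement assembled with string formatting but is stored unchanged here.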

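Summary

Writing the Python code itself was not a problem. As for Python's encoding issues, I have read other people's explanations, but I still feel I only half understand them. It probably takes more experimenting while writing code to really get it.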
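The distinction that tends to cause this confusion is between text and the bytes that encode it. The short sketch below is Python 3, where the two types are `str` and `bytes`; in Python 2.7 the corresponding types are `unicode` and `str`. The sample string is arbitrary.

```python
text = "公司"                      # text (str in Python 3, unicode in Python 2)

utf8_bytes = text.encode("utf-8")  # 3 bytes per character here
gbk_bytes = text.encode("gbk")     # 2 bytes per character here

# Decoding must use the encoding the bytes were written with.
round_trip = utf8_bytes.decode("utf-8")

# Decoding with the wrong codec either raises or yields wrong characters.
try:
    wrong = utf8_bytes.decode("gbk")
except UnicodeDecodeError:
    wrong = None

print(len(utf8_bytes), len(gbk_bytes), round_trip == text, wrong == text)
```

The same text produces different byte sequences under UTF-8 and GBK, which is why `download_img` converts the path from one to the other before opening the file.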
