中华英才网 (ChinaHR) Crawler Walkthrough (1): Implementing the Basic Crawler

Ejasmine

Categories: 中华英才网 (ChinaHR) crawler; Python crawler tutorial, from beginner to expert. Tags: python, web crawler, 中华英才网

Article link: https://blog.csdn.net/weixin_42183408/article/details/88043513

Welcome to the advanced, hands-on crawler tutorial. Open your IDE and let's start the Python journey!

The 中华英才网 (ChinaHR) crawler

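Having covered the basics of Python crawlers, we now move on to hands-on practice, using a real example to explain more advanced crawler techniques. The crawler code is published at https://github.com/code-nick-python/yingcaiwang-spider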

This example involves multithreading with threading and queue, and distributed crawling with redis. Without further ado, let's get started!
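
To give a feel for how threading and queue fit together before the real code appears, here is a minimal sketch of the usual worker-pool pattern. It is an illustration only, not the repository's code: `fetch`, `worker` and `crawl` are hypothetical names, `fetch` fakes the HTTP request so the example runs offline, and redis is left out entirely.

```python
import queue
import threading


def fetch(url):
    # Hypothetical stand-in for a real HTTP request; returns a fake page body.
    return f"<html>{url}</html>"


def worker(tasks, results):
    # Each worker pulls URLs from the shared queue until it sees the sentinel.
    while True:
        url = tasks.get()
        if url is None:  # sentinel: no more work for this thread
            tasks.task_done()
            break
        results.put((url, fetch(url)))
        tasks.task_done()


def crawl(urls, num_threads=4):
    tasks = queue.Queue()
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for url in urls:
        tasks.put(url)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    tasks.join()  # wait until every queued item has been processed
    for t in threads:
        t.join()
    pages = {}
    while not results.empty():
        url, body = results.get()
        pages[url] = body
    return pages
```

Calling `crawl(["page1", "page2"], num_threads=2)` returns a dict mapping each URL to its fetched body. The queue hands out work safely between threads, so no explicit lock is needed here.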

Writing the basic crawler

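Here is what the web page looks like:
[Image: screenshot of the job listing page]
Our goal is to scrape the job title, salary, city, education requirement, number of openings, company and category. Next, let's work through the HTML!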
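
The seven target fields can be thought of as one record per job posting. The sketch below shows one possible way to model that record. The class name `JobPosting`, the field names and the sample values are all assumptions made up for illustration; the original crawler may store the data differently.

```python
from dataclasses import dataclass, asdict


@dataclass
class JobPosting:
    # Hypothetical field names, one per item the crawler is meant to collect.
    name: str        # job title
    salary: str
    city: str
    education: str
    headcount: str   # number of openings
    company: str
    category: str


# Made-up sample values, only to show the shape of a record.
example = JobPosting(
    name="Python Developer",
    salary="10k-15k",
    city="Beijing",
    education="Bachelor",
    headcount="2",
    company="Example Co.",
    category="Software",
)

# asdict() gives a plain dict, a convenient shape for a MongoDB document
# or for the parameters of a MySQL insert.
record = asdict(example)
```

Keeping every field as a string mirrors what comes out of the HTML; any conversion to numbers can happen later.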

First, let's look at connecting to the databases:

from pymongo import MongoClient
import pymysql

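# Holds the connection settings and opens the MongoDB and MySQL connections
class spider:
    # host is shared; port is MongoDB's default; user, passwd, db and charset are for MySQL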
    def __init__(self, data=''):
        self.host = 'localhost'
        self.port = 27017
        self.user = 'root'
        self.passwd = 'nick2005'
        self.db = 'scraping'
        self.charset = 'utf8'
        self.data = data

    # Connect to MongoDB and empty the collection
    def connect_to_mongodb(self):
        client = MongoClient(host=self.host, port=self.port)
        db = client.blog_database
        collection = db.blog
        # Collection.remove() no longer exists in PyMongo 4; delete_many({}) deletes every document
        collection.delete_many({})
        return collection

    # Connect to MySQL and empty the yingcaiwang table
    def connect_to_mysql(self):
        # password= and database= are the current keyword names; passwd= and db= are deprecated aliases
        conn = pymysql.connect(host=self.host, user=self.user, password=self.passwd, database=self.db, charset=self.charset)
        cur = conn.cursor()
        cur.execute('truncate table yingcaiwang;')
        return cur, conn
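
Once the connections exist, each scraped record has to be written through them. The following is a rough sketch of how a parameterized MySQL insert could be built for one record. `build_insert` is a hypothetical helper, and the column names and sample values are assumptions, since the schema of the `yingcaiwang` table is not shown here.

```python
def build_insert(table, record):
    # Build an INSERT statement with %s placeholders (the style pymysql expects)
    # plus the matching tuple of values. Table and column names are interpolated
    # directly, so they must come from trusted code, never from scraped input.
    columns = list(record)
    column_list = ", ".join(f"`{c}`" for c in columns)
    placeholders = ", ".join(["%s"] * len(columns))
    sql = f"INSERT INTO `{table}` ({column_list}) VALUES ({placeholders})"
    return sql, tuple(record[c] for c in columns)


# Made-up record with hypothetical column names.
job = {"name": "Python Developer", "city": "Beijing"}
sql, params = build_insert("yingcaiwang", job)

# With a live connection from connect_to_mysql():
#     cur.execute(sql, params)
#     conn.commit()
# With the collection from connect_to_mongodb():
#     collection.insert_one(job)
```

Passing the values separately from the SQL text lets the driver handle quoting, which avoids broken statements when a scraped field contains quotes.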