农机资讯网 (nongji360.com agricultural machinery news): data scraping (source code), by 一蓑烟雨任平生

Latest recommended article published 2023-12-01 19:22:29

一蓑烟雨任平生√


Categories: python, web scraping (爬虫)

Article link: https://blog.csdn.net/Jaeger_Java/article/details/109630808


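No long preamble; straight to the code.

Today's unlucky site is 农机资讯网 (news.nongji360.com, an agricultural machinery news site). The script below walks the list pages, opens each article, and stores the title, time, source, and body text in a MySQL table named knowledge.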

# -*- coding: utf-8 -*-
import requests
import pymysql
from bs4 import BeautifulSoup  # parses the fetched HTML
import uuid
import time

url = "http://news.nongji360.com"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 '
                  'Safari/537.36',
    'Accept-Language': 'zh-CN,zh;q=0.8'
}
conn = pymysql.connect(host='127.0.0.1', user='root', passwd='123456', db='zhang', charset='utf8')
cur = conn.cursor()
print("Database connection established")

for i in range(1, 10):  # scrape list pages 1 through 9
    resp = requests.get(f"http://news.nongji360.com/list/9?p={i}", headers=headers)
    page_one = BeautifulSoup(resp.content, "html.parser")
    dd = page_one.find('div', class_='layer2_left').find_all('h3')
    for ss in dd:
        sUrl = url + ss.find('a')['href']

        # Fetch the article (second-level) page and parse it
        rp = requests.get(sUrl, headers=headers)
        page_two = BeautifulSoup(rp.content, "html.parser")
        paper_id = str(uuid.uuid1())

        # Skip pages that do not use the expected article layout
        left = page_two.find('div', class_='content_left1')
        body = page_two.find('div', id='article_content')
        if left is None or body is None:
            continue
        # Title
        title = left.find('h1').text
        print(sUrl)
        # Metadata line; the fixed offsets below depend on this site's exact layout
        meta = left.find('div').text
        # Time (10-character slice)
        timet = meta[12:22]
        # Source
        source = meta[32:].strip()
        print(source)
        # Body text
        content = body.text.strip()
        print(content)
        sql = "insert into knowledge(id,title,timet,content,p_type,url,source) VALUES (%s,%s,%s,%s,%s,%s,%s)"
        cur.execute(sql, (paper_id, title, timet, content, "机械农业", sUrl, source))
    print("Page {} scraped and inserted".format(i))
    conn.commit()
    time.sleep(1)  # pause one second between list pages so the server is not overloaded
cur.close()
conn.close()
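
The script builds article links by string concatenation and pulls the time out of the metadata line with fixed slice offsets, so both steps break if the page layout shifts slightly. Below is a minimal sketch of two more tolerant helpers. It assumes the time appears as a `YYYY-MM-DD` date somewhere in the metadata text, which the post does not actually show. The names `build_article_url`, `extract_date`, `BASE_URL`, and `DATE_PATTERN` are hypothetical and not part of the original script.

```python
import re
from typing import Optional
from urllib.parse import urljoin

BASE_URL = "http://news.nongji360.com"

# Assumption: the metadata line contains a date formatted as YYYY-MM-DD.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def build_article_url(href: str, base: str = BASE_URL) -> str:
    """Resolve a link taken from a list page against the site root.

    Unlike plain concatenation, urljoin leaves absolute URLs untouched
    and handles relative paths with or without a leading slash.
    """
    return urljoin(base + "/", href)


def extract_date(meta_text: str) -> Optional[str]:
    """Return the first YYYY-MM-DD date found in the text, or None."""
    match = DATE_PATTERN.search(meta_text)
    return match.group(0) if match else None
```

In the loop, `build_article_url(ss.find('a')['href'])` could stand in for `url + ss.find('a')['href']`, and `extract_date` could be applied to the text of the first `div` inside `content_left1` instead of slicing it. When `extract_date` returns `None`, the row can be skipped or logged rather than stored with a wrong value.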

