No more preamble, straight to the code.
Today's unlucky target is 中国农网 (farmer.com.cn, the China Farmers' Daily site).
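Before the full script, here's a quick peek at the list endpoint it relies on. The site paginates its hot-topic news as JSON at NewsList_{page}.json, and each entry carries the fields the scraper uses. Note the response shape here is inferred from the scraper itself, not from any official docs:

import requests

# fetch page 0 of the list and look at the first entry
resp = requests.get(
    "http://www.farmer.com.cn/xbpd/xw/rdjj_2458/NewsList_0.json",
    headers={"User-Agent": "Mozilla/5.0"},
).json()
first = resp["info"][0]
print(first["id"], first["ovtitle"], first["createTime"], first["source"], first["url"])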
import requests                       # HTTP requests
import pymysql                        # MySQL storage
from fake_useragent import UserAgent  # random User-Agent strings
from bs4 import BeautifulSoup         # HTML parsing
import time

# 中国农网 (farmer.com.cn)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/52.0.2743.116 Safari/537.36',
    'Accept-Language': 'zh-CN,zh;q=0.8'
}

conn = pymysql.connect(host='127.0.0.1', user='root', passwd='123456', db='zhang', charset='utf8')
cur = conn.cursor()
print("Connected to MySQL")
for i in range(0, 20):  # scrape the first 20 pages of the list
    headers["User-Agent"] = UserAgent().random  # rotate the User-Agent on every page
    resp = requests.get(f"http://www.farmer.com.cn/xbpd/xw/rdjj_2458/NewsList_{i}.json", headers=headers).json()
    for dd in resp["info"]:
        news_id = dd["id"]        # article id
        title = dd["ovtitle"]     # title
        timet = dd["createTime"]  # creation time
        url = dd["url"]           # article URL
        source = dd["source"]     # source
        rp = requests.get(url, headers=headers)
        page_two = BeautifulSoup(rp.content, "html.parser")
        # print(page_two)  # uncomment to inspect the raw page while debugging
        # pages that render a <span id="xy"> are error/placeholder pages; skip them
        err = page_two.find("span", id='xy')
        if err is not None:
            continue
        body = page_two.find('div', id='article_main')
        if body is None:  # a few articles use a different layout; skip those too
            continue
        content = ''
        for aa in body.find_all('p'):
            content += aa.text.strip()
        print(content)
        # crude keyword-based classification
        if "食品安全" in content:
            n_type = "食品安全"
        elif "农业环境" in content:
            n_type = "农业环境"
        elif "农业病虫害" in content:
            n_type = "农业病虫害"
        elif "农业耕地浪费" in content:
            n_type = "农业耕地浪费"
        elif "农产品质量安全" in content:
            n_type = "农产品质量安全"
        else:
            n_type = ""
        sql = ("insert into new_paper(id, n_source, n_title, n_timet, n_type, n_url, n_content) "
               "VALUES (%s, %s, %s, %s, %s, %s, %s)")
        cur.execute(sql, (news_id, source, title, timet, n_type, url, content))
    print("Page {} done".format(i))
    conn.commit()
    time.sleep(1)  # pause a second between pages so the server doesn't fall over
cur.close()
conn.close()
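One thing the script takes for granted: the new_paper table must already exist. The original post doesn't show the DDL, so the column types below are my guesses based on the INSERT statement. A one-time setup sketch:

import pymysql

# assumed schema for new_paper; adjust types/lengths to your data
ddl = """
CREATE TABLE IF NOT EXISTS new_paper (
    id        VARCHAR(32) PRIMARY KEY,  -- article id from the list API
    n_source  VARCHAR(255),             -- source
    n_title   VARCHAR(512),             -- title
    n_timet   VARCHAR(64),              -- creation time, stored as returned
    n_type    VARCHAR(64),              -- keyword-derived category
    n_url     VARCHAR(512),             -- article URL
    n_content TEXT                      -- full article text
)
"""

conn = pymysql.connect(host='127.0.0.1', user='root', passwd='123456', db='zhang', charset='utf8')
with conn.cursor() as cur:
    cur.execute(ddl)
conn.commit()
conn.close()

With a PRIMARY KEY on id, rerunning the scraper will reject duplicate articles instead of inserting them twice, which is usually what you want here.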