004 - Chinese University Rankings [Web Scraping] [Python]

Latest recommended article published on 2023-12-29 23:49:31

MoltenDivineCore

Tags: python, web scraping, Chinese university rankings

Article link: https://blog.csdn.net/MoltenDivineCore/article/details/102712009


Let's recall the four steps of a web scraper:

Fetch the page
Parse the data
Store the data
main()

Let's go through them once more. This time the target is http://www.zuihaodaxue.cn/zuihaodaxuepaiming2019.html

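Fetching the page: call request.urlopen(url), then response.read().decode('utf-8') to turn the raw bytes into text so that the Chinese characters are readable.
Parsing the data: use bs4. BeautifulSoup(html, "lxml") builds the tree, and soup.find('tbody').children iterates over the direct children of the 'tbody' tag, i.e. the 'tr' rows (plus the whitespace text nodes between them, which is why the code checks for bs4.element.Tag). From each row we take the 'td' cells, and yield hands the values of one row at a time back to the caller as a dict, ready to be written to a csv file.
Storing the data: I won't go into detail here. The core is these two lines: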

w = csv.DictWriter(f, fieldnames=fieldnames)
w.writerow(items)
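
To make the storage step a little more concrete, here is a minimal sketch of how csv.DictWriter maps dicts onto columns. It writes to an in-memory io.StringIO buffer instead of a real file, and the two rows are made-up placeholders, not data from the ranking page. It also calls writeheader(), which the script in this post does not do, so that the output carries column names.

```python
import csv
import io

# Placeholder rows for illustration only (not real ranking data)
rows = [
    {"ranking": "1", "school": "University A", "province": "Province X", "score": "90.0"},
    {"ranking": "2", "school": "University B", "province": "Province Y", "score": "80.0"},
]

fieldnames = ["ranking", "school", "province", "score"]

# An in-memory text buffer stands in for the csv file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()    # first line: the column names
writer.writerows(rows)  # one line per dict, columns ordered by fieldnames

csv_text = buffer.getvalue()
print(csv_text)
```

The order of the columns comes from fieldnames, not from the order of the keys in each dict, so every dict yielded by the parser must use exactly these four keys.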

Here is the source code:

import csv
import bs4

from urllib import request
from bs4 import BeautifulSoup

#============================================================================
# 1 - Fetch the page
def get_one_page(url):
    try:
        response = request.urlopen(url)
        html_data = response.read().decode('utf-8')
        return html_data
    except (OSError, UnicodeDecodeError):
        # URLError and HTTPError are subclasses of OSError; return None on failure
        return None
#============================================================================
# 2 - Parse the data
def parse_one_page(html):
    soup = BeautifulSoup(html, "lxml")
    for tr in soup.find('tbody').children:  # iterate over the direct children of 'tbody'
        # .children also yields whitespace text nodes, so keep only real tags
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')  # shorthand for tr.find_all('td')
            if len(tds) < 4:
                continue  # skip rows without the four expected cells
            yield {
                'ranking': tds[0].string,
                'school': tds[1].string,
                'province': tds[2].string,
                'score': tds[3].string
            }

#============================================================================
# 3 - Store the data
def write_to_file(items):
    # 'a': append mode
    # utf_8_sig: UTF-8 with a BOM, so the Chinese text is not garbled (e.g. when opened in Excel)
    with open('save1.csv', 'a', encoding='utf_8_sig', newline='') as f:
        fieldnames = ['ranking', 'school', 'province', 'score']
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writerow(items)
        print("University ranked No. %s saved" % items["ranking"])

#============================================================================
# 0 - Main()
def main():
    start_url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2019.html'
    html = get_one_page(start_url)
    if html is None:
        # get_one_page returns None when the request or the decoding fails
        print("Failed to fetch %s" % start_url)
        return
    for item in parse_one_page(html=html):
        write_to_file(item)

if __name__ == '__main__':
    main()
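
To try the parsing step without touching the network, the following sketch runs the same idea on a small inline HTML fragment. The fragment and its values are invented for illustration and only mimic a four-column table; the real page may be laid out differently. The helper name parse_rows is hypothetical, and the sketch uses the built-in "html.parser" backend rather than "lxml" so that it needs nothing beyond bs4 itself.

```python
import bs4
from bs4 import BeautifulSoup

# Made-up fragment that imitates a four-column ranking table
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><td>1</td><td>University A</td><td>Province X</td><td>90.0</td></tr>
    <tr><td>2</td><td>University B</td><td>Province Y</td><td>80.0</td></tr>
  </tbody>
</table>
"""

KEYS = ("ranking", "school", "province", "score")


def parse_rows(html):
    """Yield one dict per table row found inside the first <tbody>."""
    soup = BeautifulSoup(html, "html.parser")
    for child in soup.find("tbody").children:
        # .children also yields the whitespace between rows (NavigableString),
        # so anything that is not a Tag is skipped
        if not isinstance(child, bs4.element.Tag):
            continue
        cells = [td.get_text(strip=True) for td in child.find_all("td")]
        if len(cells) >= len(KEYS):
            # zip stops after the four keys, ignoring any extra cells
            yield dict(zip(KEYS, cells))


rows = list(parse_rows(SAMPLE_HTML))
print(rows)
```

Because parse_rows is a generator, nothing is parsed until the caller iterates over it; wrapping it in list() here just forces all rows to be produced at once.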
