Python Web Scraping Assignment

Latest recommended article published on 2024-07-22 15:19:55

一箱哇哈哈

Tags: web scraping

Article link: https://blog.csdn.net/m0_72543096/article/details/134935835

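Task (2): Use the get() function of the requests library to access the website below 20 times. Print the returned status code and the text content, then compute the length of the page content returned by the text attribute and by the content attribute: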

The code is as follows:

import requests

URL = 'https://cn.bing.com/?mkt=zh-cn'

# The task asks for 20 requests
for i in range(20):
    response = requests.get(URL, timeout=10)
    print('Status code:', response.status_code)
    print('text content:', response.text)
    print('Length of text:', len(response.text))
    print('Length of content:', len(response.content))

Output:
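
The two lengths in this task generally differ, because `text` is the decoded string (counted in characters) while `content` is the raw response body (counted in bytes). The following is a minimal offline sketch of that difference, assuming the page is encoded as UTF-8. The helper name `text_and_content_lengths` and the sample fragment are made up for illustration and are not part of the assignment.

```python
def text_and_content_lengths(html, encoding='utf-8'):
    """Return (character count, byte count) for the same page source."""
    raw = html.encode(encoding)     # plays the role of response.content (bytes)
    decoded = raw.decode(encoding)  # plays the role of response.text (str)
    return len(decoded), len(raw)


# Made-up page fragment containing two Chinese characters
sample = '<title>必应</title>'
text_len, content_len = text_and_content_lengths(sample)
print('len(text):   ', text_len)
print('len(content):', content_len)
```

Each Chinese character counts once in the string but takes three bytes in UTF-8, so a page with non-ASCII text has a larger `content` length than `text` length.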

Task (3): a. Print the content of the head tag together with the last two digits of your student ID; b. Get the content of the body tag; c. Get the tag object whose id is first; d. Get and print the Chinese characters in the HTML page:

The code is as follows:

import re
from bs4 import BeautifulSoup

# Load the local HTML file and parse it once
with open('C:/Users/86136/Desktop/programming/PythonPractices/reptile2.html', 'r', encoding='utf-8') as file:
    content = file.read()
soup = BeautifulSoup(content, 'lxml')  # soup is the parsed document returned by BeautifulSoup

# a: the head tag, followed by the last two digits of the student ID
print('a:')
print(soup.head, '06\n')

# b: the body tag, pretty-printed
print('b:')
print(soup.body.prettify(), '\n')

# c: the tag whose id is "first"
first_element = soup.find(id='first')
print('c:')
print(first_element, '\n')

# d: every run of Chinese characters (U+4E00 to U+9FFF) in the page text
print('d:')
for text in soup.stripped_strings:
    for chunk in re.findall(r'[\u4e00-\u9fff]+', text):
        print(chunk)

Output:
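
Part d depends on how "Chinese characters" are detected. The sketch below compares two possible approaches on plain strings: a check that accepts a string only when every character is Chinese, and a regular expression that pulls the Chinese runs out of mixed text. Both assume the range U+4E00 to U+9FFF (CJK Unified Ideographs), which leaves out punctuation and rarer extension blocks. The function names and sample strings are illustrative only.

```python
import re

# Runs of characters in the CJK Unified Ideographs block
CJK_RUN = re.compile(r'[\u4e00-\u9fff]+')


def is_all_chinese(text):
    """True only if the string is non-empty and every character is Chinese."""
    return bool(text) and all('\u4e00' <= ch <= '\u9fff' for ch in text)


def chinese_runs(text):
    """Return every run of consecutive Chinese characters in the string."""
    return CJK_RUN.findall(text)


samples = ['菜鸟教程', '菜鸟教程(runoob.com)', 'Google']
for s in samples:
    print(repr(s), is_all_chinese(s), chinese_runs(s))
```

The all-characters check skips mixed strings such as a title that contains both Chinese and Latin text, whereas the regular expression still returns the Chinese part.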

Task (4): Scrape the content of the Chinese university ranking website:

https://www.shanghairanking.cn/rankings/bcur/201811

Requirements: (1) Scrape the university ranking for the year assigned by the last digit of your student ID: ending in 1 or 2, year 2020; ending in 3 or 4, year 2016; ending in 5 or 6, year 2017; ending in 7 or 8, year 2018; ending in 9 or 0, year 2019.

(2) Save the scraped data as a CSV file.
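
The digit-to-year rule in requirement (1) can be written as a small lookup. This is a sketch only: the function names and the sample student ID are hypothetical, and the URL pattern (year followed by `11`) is an assumption inferred from the example URL ending in `201811`, not something the site documents here.

```python
# Last digit of the student ID -> ranking year, as listed in requirement (1)
YEAR_BY_LAST_DIGIT = {
    1: 2020, 2: 2020,
    3: 2016, 4: 2016,
    5: 2017, 6: 2017,
    7: 2018, 8: 2018,
    9: 2019, 0: 2019,
}


def ranking_year(student_id):
    """Return the ranking year assigned to a student ID (given as str or int)."""
    last_digit = int(str(student_id)[-1])
    return YEAR_BY_LAST_DIGIT[last_digit]


def ranking_url(year):
    """Build the ranking page URL; the '11' suffix is an assumed pattern."""
    return f'https://www.shanghairanking.cn/rankings/bcur/{year}11'


# Hypothetical student ID used only to demonstrate the lookup
year = ranking_year('20230042')
print(year, ranking_url(year))
```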

The code is as follows:

import csv

import requests
from bs4 import BeautifulSoup

# Request the page
url = "https://www.shanghairanking.cn/rankings/bcur/201611"
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on an HTTP error status
response.encoding = 'utf-8'
html_content = response.text

# Parse the page and locate the ranking table
soup = BeautifulSoup(html_content, 'html.parser')
table = soup.find('table', {'class': 'rk-table'})
rows = table.find_all('tr')[1:]  # skip the header row

# Extract the name and rank of each university
data = []
for row in rows:
    cols = row.find_all('td')
    university_name = cols[1].find('a', class_='name-cn').text.strip()
    university_rank = cols[0].text.strip()
    data.append([university_name, university_rank])

# Save as a CSV file
with open('university_ranking_2016.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['University', 'Rank'])  # write the header row
    writer.writerows(data)  # write the data rows

Output:
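
The CSV step can be checked without touching the network by writing to an in-memory buffer and reading the result back. The sketch below does that with the standard `csv` and `io` modules. The helper name `rows_to_csv` and the rows are placeholders, not real ranking data. One row contains a comma to show that `csv.writer` quotes such fields.

```python
import csv
import io


def rows_to_csv(rows, header=('University', 'Rank')):
    """Serialize the header and rows to CSV text held in memory."""
    buffer = io.StringIO(newline='')  # newline='' leaves csv's line endings untouched
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder rows, not real ranking data
sample_rows = [['University A', '1'], ['University, B', '2']]
csv_text = rows_to_csv(sample_rows)
print(csv_text)

# Reading the text back recovers the header and the original rows
parsed = list(csv.reader(io.StringIO(csv_text, newline='')))
print(parsed)
```

When the file is meant to be opened in Excel, opening it with `encoding='utf-8-sig'` instead of `'utf-8'` is a common way to help Chinese text display correctly, though that depends on the reader's setup.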