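Lanzhou University Python Lab: Third Lab Assignment (Web Crawler) - CSDN Blog

Article link: https://blog.csdn.net/yuebeixin/article/details/136252385

This article shows how to implement a multithreaded and multiprocess crawler in Python, compares the efficiency of single-threaded, multithreaded, and multiprocess crawling on a website of moderate size, and saves the scraped results to CSV. In this I/O-bound task, multithreading outperformed multiprocessing; multiprocessing is generally the better fit for CPU-bound work.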

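Lab Task

Write a crawler for a website (or set of pages) of reasonable size (data volume). The program must use multiprocessing or multithreading, and the report must compare single-process (single-thread) execution against multi-process (multi-thread) execution. The lab report must include: the approach, the code (with necessary comments), the result files and screenshots, and the single- versus multi-process (thread) comparison.

Note: no two students may crawl the same website.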

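Introduction:

The fields scraped for each news article are: title, source, time, and body.

Title tag: an h1 element with class "zhengbiaoti" (as used in the crawler code below).

Source and time: the first p element inside the div with class "rexw-ly".

Body: the p elements inside the div with class "content".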
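The source and time arrive as one line of text, which the crawler further below splits with a regular expression. The following is a minimal sketch of that step, assuming the line looks like a "YYYY-MM-DD HH:MM" timestamp followed by whitespace and the source name. The helper name split_time_and_source and the sample strings are made up for illustration and are not taken from the site.

```python
import re

# Same pattern as in the crawler: timestamp, whitespace, then a run of
# characters that are neither digits nor whitespace (the source name).
PATTERN = re.compile(r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2})\s+([^\d\s]+)')


def split_time_and_source(line):
    """Return (time, source), or None when the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return match.group(1), match.group(2)


# Hypothetical sample lines, for illustration only.
print(split_time_and_source("2024-02-23 10:30  ExampleNews"))
print(split_time_and_source("no timestamp here"))
```

Returning None for a non-matching line is one possible design; the crawler below instead lets the failed lookup raise and treats the whole page as failed.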

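Approach:

The core of this lab is scraping the pages and comparing single-threaded against multithreaded runs, and single-process against multi-process runs.

Page-scraping approach: first, look up the User-Agent header your browser sends to the target site and reuse it in the crawler, so the requests are less likely to be blocked by anti-crawler measures.

Then call requests.get to fetch the page's HTML and parse it with BeautifulSoup. Inspect the page to see which tags surround the content you want, extract it with soup.find_all, and combine the pieces. That essentially completes the scraping.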
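The parsing step can be tried without touching the network. The sketch below runs the same soup.find_all lookups on a small inline page. SAMPLE_HTML and parse_article are hypothetical names, and the snippet only assumes that the real pages use the class names that appear in the crawler code.

```python
from bs4 import BeautifulSoup

# Hypothetical page snippet; the class names mirror those used in the crawler.
SAMPLE_HTML = """
<html><body>
  <h1 class="zhengbiaoti">Sample headline</h1>
  <div class="rexw-ly"><p>2024-02-23 10:30 ExampleNews</p></div>
  <div class="content"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
"""


def parse_article(html):
    """Extract title, source line, and body text from one article page."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find_all("h1", class_="zhengbiaoti")[0].text
    source_line = soup.find_all("div", class_="rexw-ly")[0].find_all("p")[0].text
    paragraphs = soup.find_all("div", class_="content")[0].find_all("p")
    # Join the text of every <p> in the body, one paragraph per line.
    body = "\n".join(p.get_text(strip=True) for p in paragraphs)
    return title, source_line, body


title, source_line, body = parse_article(SAMPLE_HTML)
print(title)
print(source_line)
print(body)
```

In a live run, the html argument would be the text of the response returned by requests.get.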

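Single-threaded vs. multithreaded: define two functions, one single-threaded and one multithreaded. Record the time at the start and at the end of each function and return the difference, which is that function's running time. Comparing the two times shows the difference between single-threaded and multithreaded execution directly.

Single-process vs. multi-process: define two functions, one single-process and one multi-process. Record the time at the start and at the end of each function and return the difference, which is that function's running time. Comparing the two times shows the difference between single-process and multi-process execution directly.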
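The start-and-end timing described above can also be factored into one helper instead of being repeated in every function. This is only a sketch: timed and fake_job are hypothetical names, and time.perf_counter is swapped in for time.time because it is a monotonic clock intended for measuring intervals.

```python
import time


def timed(func, *args):
    """Run func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()  # monotonic clock suited to measuring intervals
    result = func(*args)
    return result, time.perf_counter() - start


def fake_job(delay):
    """Stand-in for a crawl; sleeps instead of touching the network."""
    time.sleep(delay)
    return "done"


result, elapsed = timed(fake_job, 0.05)
print(f"{result} in {elapsed:.3f}s")
```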

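Code:

1. Collecting reachable URLs:

Why: when I crawled this US news site repeatedly from the same network, it slowed down, possibly because of the site's protection measures. In addition, when crawling a large number of URLs, something usually goes wrong partway (for example, the crawl stops midway because of a network problem). My solution is a separate script, 获取可用网址.py ("get available URLs"), which first probes a small batch of URLs to check whether they exist and saves the ones that do to url.txt. The file is opened in append mode, so repeated runs accumulate a large number of usable URLs. The crawler code further below reads url.txt.

Note: when repeating the experiment, keep track of which page ranges you have already probed. Because the file is opened in append mode, probing the same range again adds duplicate URLs.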

import requests  # HTTP client library

start_page = int(input("Enter the first page number: "))
end_page = int(input("Enter the last page number: "))

url = "http://www.uscntv.com/news/usa/{0}.html"
# Present the request as if it came from a Chrome browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'}
urls = []
num = 0

for page in range(start_page, end_page + 1):  # check whether each page exists
    page_url = url.format(page)
    try:
        # The timeout keeps one stalled request from hanging the whole run
        r = requests.get(page_url, headers=headers, timeout=10)
        print(page_url)
        r.raise_for_status()
        num += 1
    except requests.exceptions.RequestException:
        continue
    urls.append(page_url)
    print(num, end=" ! ")
print("")

try:
    # Append mode, so repeated runs accumulate URLs in the same file
    with open('url.txt', 'a+', encoding='utf_8_sig') as f:
        for URL in urls:
            f.write(URL)
            f.write("\n")  # text mode translates "\n" to the platform line ending
    print("Saved successfully!")
except OSError:
    print("Save failed!")
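Because the script above appends to url.txt, probing the same page range twice writes the same URLs twice. One possible way around this is to read the existing entries first and append only the new ones. The sketch below is an assumption about how that could look, not part of the original script: append_new_urls is a hypothetical helper, and the demo uses a temporary file and example.com URLs so the real url.txt is not touched.

```python
import os
import tempfile


def append_new_urls(path, candidates):
    """Append only URLs not already in the file; return how many were added."""
    seen = set()
    if os.path.exists(path):
        with open(path, "r", encoding="utf_8_sig") as f:
            seen = {line.strip() for line in f if line.strip()}
    added = 0
    with open(path, "a", encoding="utf_8_sig") as f:
        for url in candidates:
            if url not in seen:
                f.write(url + "\n")
                seen.add(url)
                added += 1
    return added


# Demo on a temporary file so the real url.txt is left untouched.
demo_path = os.path.join(tempfile.mkdtemp(), "url_demo.txt")
first = append_new_urls(demo_path, ["http://example.com/1.html", "http://example.com/2.html"])
second = append_new_urls(demo_path, ["http://example.com/2.html", "http://example.com/3.html"])
print(first, second)
```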

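Run results:

1.

……

2. Because the URLs from 128000 to 128550 had already been tested earlier and the file is opened in append mode, the txt file still contains the earlier URLs.

……

2. Crawler: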

import requests  # HTTP client library
from bs4 import BeautifulSoup  # HTML parser
import re  # regular expressions
import csv  # reading and writing CSV files
import threading
import multiprocessing  # threads and processes for the comparison
import time

def get_urls():
    urls = []
    # Open url.txt for reading
    with open('url.txt', 'r', encoding='utf_8_sig') as file:
        # Read the file line by line
        for line in file:
            # Strip the trailing newline and surrounding whitespace
            cleaned_line = line.strip()
            # Keep only non-empty lines
            if cleaned_line:
                urls.append(cleaned_line)
    return urls

def web_scraping(url):
    try:
        news = []
        # Set the request headers
        headers = {
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"}
        # Send the HTTP GET request; the timeout keeps a stalled request from hanging the run
        response = requests.get(url, headers=headers, timeout=10)
        response.encoding = "utf-8"
        # Parse the HTML content
        soup = BeautifulSoup(response.text, 'html.parser')
        # Extract the news title
        news_title = soup.find_all('h1', class_="zhengbiaoti")[0].text
        # Extract the line that holds the source and the time
        news_source_div = soup.find_all("div", class_="rexw-ly")
        news_spe = news_source_div[0].find_all("p")[0].text
        news_time, news_source = get_news_source(news_spe)
        # Extract the body
        news_content_div = soup.find_all("div", class_="content")
        news_content_all = news_content_div[0].find_all("p")
        news_content = ''
        # Join the text of every <p> inside the content div into one string
        for p in news_content_all:
            news_content += p.get_text(strip=True) + "\n"
        news.append(news_title)
        news.append(news_time)
        news.append(news_source)
        news.append(news_content)
        return news
    except Exception:
        return []  # on any error, return an empty list

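def get_news_source(source):
    # Pattern: a "YYYY-MM-DD HH:MM" timestamp, whitespace, then the source name
    pattern = r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2})\s+([^\d\s]+)'
    # Take the first match; if nothing matches, the IndexError is caught in web_scraping
    match = re.findall(pattern, source)[0]
    # Named news_time rather than time to avoid shadowing the time module
    news_time = match[0]
    news_source = match[1]
    return (news_time, news_source)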

def save_to_csv(list_all):
    try:
        # Open news.csv for writing. newline='' lets the csv module control line
        # endings, and utf_8_sig writes a BOM so that Excel detects the encoding.
        with open('news.csv', 'w+', newline='', encoding='utf_8_sig') as csv_file:
            # Create a writer object with csv.writer()
            writer = csv.writer(csv_file)
            for x in list_all:
                # writerow() writes one row to the CSV file
                writer.writerow(x)
        print("Saved successfully")
    except (OSError, csv.Error):
        print("Save failed")

# Single-threaded crawl
def single_thread(urls):
    start_time = time.time()
    for url in urls:
        web_scraping(url)
    return time.time() - start_time

# Multithreaded crawl
def multi_thread(urls):
    start_time = time.time()
    threads = []
    for url in urls:
        # Create a thread with its target function and arguments
        thread = threading.Thread(target=web_scraping, args=(url,))
        # Keep the thread object so it can be joined later
        threads.append(thread)
        # Start the thread
        thread.start()
    # Wait for all threads to finish
    for thread in threads:
        thread.join()
    return time.time() - start_time

# Single-process crawl (same sequential loop as single_thread)
def single_process(urls):
    start_time = time.time()
    for url in urls:
        web_scraping(url)
    return time.time() - start_time

# Multi-process crawl
def multi_process(urls):
    start_time = time.time()
    processes = []
    for url in urls:
        # Create a process with its target function and arguments
        process = multiprocessing.Process(target=web_scraping, args=(url,))
        # Keep the process object so it can be joined later
        processes.append(process)
        # Start the process
        process.start()
    # Wait for all processes to finish
    for process in processes:
        process.join()
    return time.time() - start_time

if __name__ == "__main__":
    # Load the URL list collected by the first script
    urls = get_urls()
    # Run the single-threaded and multithreaded crawls
    time_single_thread = single_thread(urls)
    time_multi_thread = multi_thread(urls)
    # Run the single-process and multi-process crawls
    time_single_process = single_process(urls)
    time_multi_process = multi_process(urls)
    # Print the timing comparison
    print("")
    print("Comparison!")
    print(f"Single-thread time: {time_single_thread}")
    print(f"Multi-thread time: {time_multi_thread}")
    print("")
    print("Comparison!")
    print(f"Single-process time: {time_single_process}")
    print(f"Multi-process time: {time_multi_process}")
    print("")
    print("Saving crawl results!")
    all_news = [["Title", "Time", "Source", "Body"]]
    for url in urls:
        news = web_scraping(url)
        if news:  # failed pages return an empty list; skip them
            all_news.append(news)
    save_to_csv(all_news)
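The listing above starts one thread per URL and discards what each thread scraped, so the pages are fetched again sequentially before saving. One alternative, sketched here as an assumption rather than as the author's method, is a bounded thread pool from the standard library's concurrent.futures that also collects the results. fake_scrape is a hypothetical stand-in for web_scraping that sleeps instead of making a request, and crawl_with_pool, max_workers=8, and the example.com URLs are illustrative choices.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_scrape(url):
    """Stand-in for web_scraping(url): waits briefly, then returns one row."""
    time.sleep(0.05)  # simulated network wait
    if url.endswith("bad.html"):
        return []  # same convention as the crawler: empty list on failure
    return [f"title of {url}", "2024-02-23 10:30", "ExampleNews", "body"]


def crawl_with_pool(urls, scrape=fake_scrape, max_workers=8):
    """Scrape urls with a bounded thread pool; return rows and elapsed time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(scrape, urls))  # results keep the input order
    rows = [row for row in results if row]  # drop failed pages
    return rows, time.perf_counter() - start


demo_urls = [f"http://example.com/{i}.html" for i in range(6)]
demo_urls.append("http://example.com/bad.html")
rows, elapsed = crawl_with_pool(demo_urls)
print(len(rows), f"{elapsed:.2f}s")
```

Passing scrape=web_scraping would run the real crawler through the same pool, and the returned rows could go straight to save_to_csv after the header row.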

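Run results:

Summary:

Overall, the choice between multiprocessing and multithreading depends on the nature of the task. For CPU-bound tasks, multiprocessing is likely the better fit; for I/O-bound tasks, multithreading is likely the better fit.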
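The I/O-bound case can be illustrated without a network connection. In the sketch below, time.sleep stands in for waiting on a server; this is an assumption made for the demonstration, and io_task, run_sequential, and run_threaded are hypothetical names. While one thread waits, the others can wait at the same time, so the threaded run should take roughly as long as one task rather than the sum of all of them.

```python
import threading
import time


def io_task(delay=0.05):
    """Simulated I/O-bound task: waiting, not computing."""
    time.sleep(delay)


def run_sequential(n):
    """Run n tasks one after another and return the elapsed seconds."""
    start = time.perf_counter()
    for _ in range(n):
        io_task()
    return time.perf_counter() - start


def run_threaded(n):
    """Run n tasks in n threads and return the elapsed seconds."""
    start = time.perf_counter()
    threads = [threading.Thread(target=io_task) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


sequential = run_sequential(8)
threaded = run_threaded(8)
print(f"sequential: {sequential:.2f}s, threaded: {threaded:.2f}s")
```

A CPU-bound task would not overlap this way under CPython's global interpreter lock, which is why separate processes are the usual suggestion for that case.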

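In this task, the single-process (single-thread) runs were clearly slower than both the multi-process and the multithreaded runs.

Multiprocessing, however, performed worse than multithreading. This is consistent with the task being I/O-bound: most of the time is spent waiting on the network, and starting a process costs more than starting a thread.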