Douban Group Scraper

I've been looking for a place to rent recently, so I wrote a scraper for Douban's Beijing rental-housing groups. It pages through a group's discussion list 25 posts at a time, pulls out each post's title, author, last-reply time, and link with XPath, and writes everything to an Excel file. Straight to the code:

```python
from lxml import etree
import requests
import time
import pandas as pd
import tqdm

def get_code(start, group_url):
    """Fetch one page of the group's discussion list and return its HTML text."""
    headers = {
        # Fill in the Cookie from your own logged-in session; see
        # https://blog.csdn.net/weixin_41666747/article/details/80315002
    }
    params = {
        'start': start,  # offset into the post list (Douban shows 25 posts per page)
        'type': 'new'
    }
    response = requests.get(url=group_url, params=params, headers=headers)
    return response.text


def list_posts(response, page, titles, urls, dates, authors, replies):
    """Parse one discussion-list page and append its fields to the per-page lists."""
    tree = etree.HTML(response)
    # Every row of <table class="olt"> is one post: title link, author,
    # reply count, and last-reply time.
    titles[page].extend(tree.xpath('//table[@class="olt"]/tr//td/a/@title'))
    urls[page].extend(tree.xpath('//table[@class="olt"]/tr//td[@class="title"]//a/@href'))
    dates[page].extend(tree.xpath('//table[@class="olt"]/tr//td[@class="time"]/text()'))
    authors[page].extend(tree.xpath('//table[@class="olt"]/tr//td[@nowrap="nowrap"]/a/text()'))
    replies[page].extend(tree.xpath('//table[@class="olt"]/tr//td[@class="r-count "]/text()'))
    return titles, urls, dates, authors, replies


def get_page(all_page, group_url):
    """Scrape all_page pages of the discussion list (25 posts per page)."""
    titles = [[] for _ in range(all_page)]
    urls = [[] for _ in range(all_page)]
    dates = [[] for _ in range(all_page)]
    authors = [[] for _ in range(all_page)]
    replies = [[] for _ in range(all_page)]
    for i in tqdm.tqdm(range(all_page)):
        start = i * 25      # pagination offset
        time.sleep(0.05)    # small pause between requests
        response = get_code(start, group_url)
        titles, urls, dates, authors, replies = list_posts(response, i, titles, urls, dates, authors, replies)
    return titles, urls, dates, authors, replies


# 北京租房 group (beijingzufang): 1,000,000+ members
group_url = 'https://www.douban.com/group/beijingzufang/discussion'
# 北京租房 group (sweethome): 481,519 members
# Only the last assignment takes effect; comment out the group you don't want.
group_url = 'https://www.douban.com/group/sweethome/discussion'

all_page = 100
print('Scraping ' + str(all_page * 25) + ' posts, please wait...')

titles, urls, dates, authors, replies = get_page(all_page, group_url)

data = []

# Flatten the per-page lists into one row per post.
for i in range(all_page):
    for j in range(len(titles[i])):
        data.append({"Title": titles[i][j], "Author": authors[i][j], "Date": dates[i][j], "Url": urls[i][j]})
#         print("[Found! page %d]" % (i + 1))
#         print("Title: " + titles[i][j])
#         print("Author: " + authors[i][j])
# #       print("Replies: " + replies[i][j])
#         print("Last reply: " + dates[i][j])
#         print("Link: " + urls[i][j])
#         print('-----------------------------------------------------')

df = pd.DataFrame(data)
# Write the table to an Excel file.
df.to_excel("output.xlsx", index=True)
print('Done scraping!')
```
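If the XPath expressions in `list_posts` look opaque, here is a minimal self-contained sketch of the row structure they assume. The sample HTML below is a simplified, made-up approximation of Douban's discussion-list markup (the class names are the ones the script queries); the live page may differ, and the selectors will need adjusting whenever Douban changes its layout.

```python
from lxml import etree

# Hypothetical, simplified snippet of one row of the discussion table.
sample = """
<table class="olt">
  <tr>
    <td class="title"><a href="https://www.douban.com/group/topic/1/" title="整租 两居室 近地铁">整租 两居室 近地铁</a></td>
    <td nowrap="nowrap"><a href="https://www.douban.com/people/someone/">someone</a></td>
    <td class="r-count ">12</td>
    <td class="time">04-01 12:00</td>
  </tr>
</table>
"""

tree = etree.HTML(sample)
print(tree.xpath('//table[@class="olt"]/tr//td/a/@title'))                    # ['整租 两居室 近地铁']
print(tree.xpath('//table[@class="olt"]/tr//td[@class="title"]//a/@href'))    # ['https://www.douban.com/group/topic/1/']
print(tree.xpath('//table[@class="olt"]/tr//td[@nowrap="nowrap"]/a/text()'))  # ['someone']
print(tree.xpath('//table[@class="olt"]/tr//td[@class="time"]/text()'))       # ['04-01 12:00']
```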

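As a small follow-up, the exported spreadsheet can be filtered by keyword before browsing, which helps when you are hunting for a specific neighbourhood or room type. This is only a sketch and not part of the original script; the column names match the DataFrame built above, while the keyword list and the `filtered.xlsx` file name are hypothetical examples.

```python
import pandas as pd

# Load the file written by the scraper; index_col=0 skips the saved index column.
df = pd.read_excel("output.xlsx", index_col=0)

keywords = ["整租", "两居"]  # hypothetical search terms, adjust to taste
mask = df["Title"].astype(str).str.contains("|".join(keywords), na=False)

print("Matched %d of %d posts" % (mask.sum(), len(df)))
df[mask].to_excel("filtered.xlsx", index=False)
```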
