Forum crawler: scraping tid, title, and uid

This is practice crawler code I wrote myself; I hope it helps others with their approach. My skills are still limited and I'm still practicing. Keep going!!

# -*- coding: utf-8 -*-
# @Time : 2021/8/10 16:56
# @Author : yz
# @File : 论坛2.py
# @Software : PyCharm
# Import the requests library: requests is a simple, easy-to-use HTTP library for Python, much cleaner than urllib
import requests
from lxml import etree
import re


def main():
    head()
    url = "http://114.112.74.132:8089/forum.php?mod=viewthread&tid="
    askurl(url)


# Regex to pull the numeric uid out of a profile link such as "...&uid=123"
findUid = re.compile(r'uid=(\d+)')
# Highest thread id to try
max_tid = 5846

# Check that the URL is reachable (HTTP 200) and that the thread still exists;
# return the parsed HTML tree on success, or False otherwise
def get_url_content(url):
    try:
        response = requests.get(url, timeout=10)
    except requests.RequestException:
        return False
    if response.status_code == 200:
        # Discuz! returns this message for deleted or moderated threads
        if "抱歉,指定的主题不存在或已被删除或正在被审核" in response.text:
            return False
        # Reachable: return the parsed page source
        return etree.HTML(response.text)
    return False

# Parse the HTML tree and pull out the thread title and author uid
def pare_post_data(html_text, tid):
    title_list = html_text.xpath('//*[@id="thread_subject"]')
    title = title_list[0].text

    # The author's profile link looks like "...&uid=123";
    # extract the number with the precompiled findUid regex
    uid_list = html_text.xpath('//*[@id="favatar%s"]/div[1]/div/a/@href' % tid)
    uid = findUid.search(str(uid_list[0])).group(1)

    post_content_info = {
        "tid": tid,
        "title": title,
        "uid": uid,
    }
    return post_content_info

def head():
    # Write the CSV header line, truncating any previous output
    with open(".\\luntan.txt", "w", encoding='utf-8') as f:
        f.write("tid,title,uid\n")

# Crawl every thread page and write the data out
def askurl(url):
    # Open the output file once instead of reopening it for every thread
    with open(".\\luntan.txt", "a", encoding='utf-8') as f:
        for i in range(1, max_tid + 1):
            html = get_url_content(url + str(i))
            if html is not False:
                post_info = pare_post_data(html, str(i))
                line = post_info["tid"] + ',' + post_info["title"] + ',' + post_info["uid"]
                print(line)
                f.write(line + '\n')

if __name__ == "__main__":
    print("tid,title,uid")
    main()
    print("Crawl finished!!!")
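The two XPath queries above are the heart of the parser. Here is a self-contained sketch of them run against a made-up snippet that mimics the Discuz! thread layout; the ids, structure, and values in the sample HTML are assumptions chosen only to match the selectors in the script.

```python
from lxml import etree
import re

# Hypothetical HTML mimicking a Discuz! thread page: a span holding the
# title and a "favatar<tid>" block holding the author's profile link.
sample = """
<html><body>
  <span id="thread_subject">Sample thread title</span>
  <div id="favatar99"><div><div>
    <a href="home.php?mod=space&amp;uid=123">author</a>
  </div></div></div>
</body></html>
"""

tree = etree.HTML(sample)

# Same queries as in pare_post_data(), with tid fixed to 99
title = tree.xpath('//*[@id="thread_subject"]')[0].text
href = tree.xpath('//*[@id="favatar99"]/div[1]/div/a/@href')[0]
uid = re.search(r'uid=(\d+)', href).group(1)
print(title, uid)  # -> Sample thread title 123
```

Note that lxml decodes the `&amp;` entity for you, so the regex sees a plain `&uid=123` in the href.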
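One weakness of writing `tid + ',' + title + ',' + uid` by hand is that thread titles often contain commas, which makes the output file ambiguous. A minimal sketch of a safer alternative using the standard-library `csv` module; the rows below are hypothetical stand-ins for what `pare_post_data()` returns, and `luntan.csv` is an assumed output name.

```python
import csv

# Hypothetical parsed rows; the first title contains a comma on purpose
rows = [
    {"tid": "1", "title": "hello, world", "uid": "42"},
    {"tid": "2", "title": "second post", "uid": "7"},
]

# newline="" is required so the csv module controls line endings itself
with open("luntan.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["tid", "title", "uid"])
    writer.writeheader()
    writer.writerows(rows)  # fields containing commas are quoted automatically
```

Reading the file back with `csv.DictReader` recovers the original fields intact, commas and all.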