Life is short, I use Python.
Today we'll scrape post details from the Hupu BXJ (步行街, "Walking Street") forum.
The end result looks like this:
First, build the links for the first ten pages from the Hupu BXJ base URL:
for i in range(1, 11):
    link = "https://bbs.hupu.com/bxj-" + str(i)
Next, fetch the HTML of each page from its link:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers=headers)
html = r.content.decode('utf-8')   # decode the raw response bytes as UTF-8
soup = BeautifulSoup(html, 'lxml')
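Putting the page loop and the fetch together, here is a minimal sketch of how the two snippets can be combined; the helper name get_page_soup is my own, not from the original post, and it assumes requests, BeautifulSoup, and the lxml parser are installed:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}

def get_page_soup(link):
    # fetch one forum page and return its parsed HTML; fail loudly on HTTP errors
    r = requests.get(link, headers=headers)
    r.raise_for_status()
    return BeautifulSoup(r.content.decode('utf-8'), 'lxml')

for i in range(1, 11):
    soup = get_page_soup("https://bbs.hupu.com/bxj-" + str(i))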
Once we have the full page HTML, we can extract the information we need. First, grab the list that holds every post on the page:
post_list = soup.find('ul', class_="for-list")
# print(post_list)
post_all = post_list.find_all('li')   # each <li> is one post
From each post we can then pull the detailed fields: title, post link, author, author page, post date, and the reply/view count:
data_list = []
for post in post_all:   # iterate over the <li> elements, not the <ul> itself
    title = post.find('div', class_='titlelink box').text.strip()
    post_link = post.find('div', class_='titlelink box').a['href']
    post_link = "https://bbs.hupu.com" + post_link
    author = post.find('div', class_='author box').a.text.strip()
    author_page = post.find('div', class_='author box').a['href']
    start_date = post.find('div', class_='author box').contents[5].text.strip()
    reply_view = post.find('span', class_='ansour box').text.strip()
    data_list.append([title, post_link, author, author_page, start_date, reply_view])
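To turn the scraped fields into the kind of table shown in the result above, one option (an assumption on my part, not necessarily what the original code did) is to load data_list into a pandas DataFrame and write it to CSV:
import pandas as pd

# the column names below are my own labels for the fields collected above
columns = ['title', 'post_link', 'author', 'author_page', 'start_date', 'reply_view']
df = pd.DataFrame(data_list, columns=columns)
df.to_csv('hupu_bxj.csv', index=False, encoding='utf-8-sig')
print(df.head())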