A Python Beginner Scrapes One User's Posts from a Forum Thread: 《健美大神之路》 on Hupu

AbelXv

Article link: https://blog.csdn.net/ACHPXYZ/article/details/77461609

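On Hupu, a poster has been translating 《健美大神之路》. I liked it a lot, but I could not find an e-book version, so I decided to scrape the posts myself and save them to a text file.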

I used two tools: urllib2 and BeautifulSoup.

The two biggest problems I ran into were encoding errors and other people's posts ending up in the scraped text.

I still do not really understand the first problem; I eventually got past it by trial and error.

            for string in tags.next_sibling.next_sibling.find('div',class_='quote-content').strings:
                string_utf8=string.encode('utf-8')  # unicode -> UTF-8 bytes
                file.write(string_utf8)

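If the second line omits the .encode('utf-8') call, a gbk codec error is raised. The .strings generator yields unicode objects, so they have to be encoded explicitly before being written to the file.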
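The encoding issue can be reproduced in isolation. The sketch below is written for Python 3 (not the Python 2 used in this post) and assumes, without confirmation from the original run, that the error came from text being encoded with a GBK default, as is common on Chinese-language Windows systems. The sample string and the temporary file path are made up for illustration.

```python
import os
import tempfile

# Sample text (hypothetical); the emoji is outside the GBK character set.
text = "健美大神之路 \U0001F4AA"

# Encoding to UTF-8 succeeds for any valid Unicode text.
utf8_bytes = text.encode("utf-8")

# Encoding to GBK fails as soon as a character is not representable in GBK.
try:
    text.encode("gbk")
    gbk_failed = False
except UnicodeEncodeError:
    gbk_failed = True

# Passing an explicit encoding to open() avoids relying on the platform default.
path = os.path.join(tempfile.mkdtemp(), "book.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Read the file back with the same encoding to confirm nothing was lost.
with open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()
```

Naming the encoding at the point where the file is opened keeps the behavior the same on every machine, which is usually easier to reason about than encoding each string by hand.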

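The second problem comes down to finding what distinguishes the target user's posts from everyone else's in the HTML source, and then filtering on that condition.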

url='https://bbs.hupu.com/19201877.html'

At first I looked for tags like

<div class="quote-content">

and extracted the strings inside them.

Because every user's posts sit inside tags of this kind, everything was scraped.

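Once I knew the cause, I carefully compared the posts to work out how to pick out only the

<div class="quote-content">

tags I wanted. I then noticed something obvious: every post starts with a header containing the poster's information, and that was the way in.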

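I now had the right direction and the right tool, BeautifulSoup's sibling navigation, but I kept getting errors because I misunderstood how siblings work, that is, how the HTML tree is actually structured. Two points strike me as especially important: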

In real documents, the .next_sibling or .previous_sibling of a tag will usually be a string containing whitespace. Going back to the “three sisters” document:

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>

You might think that the .next_sibling of the first <a> tag would be the second <a> tag. But actually, it’s a string: the comma and newline that separate the first <a> tag from the second:

link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

link.next_sibling
# u',\n'

The second <a> tag is actually the .next_sibling of the comma:

link.next_sibling.next_sibling
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

1. The passage above is the official BeautifulSoup documentation's explanation of a common mistake with sibling nodes.

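2. Always examine the HTML tag structure (the node structure) from the outside in, from the largest element down to the smallest; the browser's Inspect Element feature is better for this than viewing the raw source.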
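Both points can be tried out on a small document. The sketch below uses made-up markup that only imitates the structure described in this post (an author block followed by a sibling block holding the post body); the example.com URLs, the `post` class, and the `posts_by` helper are hypothetical. It uses `find_next_sibling`, a BeautifulSoup method that skips whitespace strings, as an alternative to chaining `.next_sibling.next_sibling`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each author block is followed by a block with the post body.
HTML = """
<div id="thread">
<div class="author"><div><a href="https://my.example.com/alice">alice</a></div></div>
<div class="post"><div class="quote-content">Chapter 1</div></div>
<div class="author"><div><a href="https://my.example.com/bob">bob</a></div></div>
<div class="post"><div class="quote-content">Nice thread!</div></div>
</div>
"""


def posts_by(html, author_href):
    """Return the text of every post whose author block links to author_href."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for author in soup.find_all("div", class_="author"):
        link = author.find("a", href=True)
        if link is None or link["href"] != author_href:
            continue
        # find_next_sibling skips whitespace strings and returns the next <div> tag.
        body = author.find_next_sibling("div")
        content = body.find("div", class_="quote-content") if body else None
        if content is not None:
            texts.append(content.get_text())
    return texts


soup = BeautifulSoup(HTML, "html.parser")
first_author = soup.find("div", class_="author")
# The direct next sibling is the newline between the two tags, not a tag.
whitespace_sibling = first_author.next_sibling
alice_posts = posts_by(HTML, "https://my.example.com/alice")
```

The `None` checks are there because real forum pages often contain author blocks without a link or posts without a `quote-content` div, in which case chained attribute access would raise an error.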

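Finally, here is the code. It is only a small program, so I did not lay it out as a proper project. I recommend setting up a project and defining your own modules and classes, which is good for your logic and for object-oriented thinking.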

# -*- coding:utf-8 -*-
# Python 2 script (urllib2 does not exist in Python 3).
import urllib2
from bs4 import BeautifulSoup

# Profile URL of the translator whose posts we want to keep.
author_url='https://my.hupu.com/232157742256797'

start_url='https://bbs.hupu.com/19201877.html'
all_urls=[start_url]
# Pages 2 to 5 of the thread follow the pattern <thread id>-<page>.html.
for x in range(2,6):
    all_urls.append('https://bbs.hupu.com/19201877-'+str(x)+'.html')

# The with statement closes the file even if a request fails.
with open('book.txt','w') as file:
    for url in all_urls:
        request=urllib2.Request(url)
        response=urllib2.urlopen(request)
        cont=response.read()
        soup=BeautifulSoup(cont,"lxml",from_encoding='utf-8')
        for tags in soup.find_all('div',class_="author"):
            # Keep only posts whose author block links to the translator's profile.
            if tags.div.a['href']==author_url:
                # The first next_sibling is a whitespace string; the second is the post body tag.
                for string in tags.next_sibling.next_sibling.find('div',class_='quote-content').strings:
                    string_utf8=string.encode('utf-8')  # unicode -> UTF-8 bytes
                    file.write(string_utf8)
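For readers on Python 3, where urllib2 has been replaced by urllib.request, the URL-building and download steps might look like the sketch below. This is an illustration rather than a tested port of the script: the `page_urls` and `fetch` helpers, the timeout, and the User-Agent header are assumptions, and `fetch` is defined but not called here.

```python
from urllib.request import Request, urlopen

# Thread address without the .html suffix, taken from the script above.
BASE = "https://bbs.hupu.com/19201877"


def page_urls(base, last_page):
    """Return the thread's page URLs: page 1 has no suffix, later pages use -N."""
    return [base + ".html"] + [
        "{}-{}.html".format(base, n) for n in range(2, last_page + 1)
    ]


def fetch(url, timeout=10):
    """Download one page and return its raw bytes (not called in this sketch)."""
    # The User-Agent value is an assumption; some sites reject the default one.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return response.read()


urls = page_urls(BASE, 5)
```

Keeping the URL construction in its own function makes it easy to check the page list before any request is sent.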
