Searching crawled Tieba posts for the information you want

Latest recommended article published 2023-06-26 14:15:34

行者刘6

Article link: https://blog.csdn.net/qq_38282706/article/details/80958610

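The author crawled Tieba posts and saved them as JSON files, then wrote a query tool and searched for the keyword '肠胃' (stomach and intestines), focusing on everything posted by the user '秋天的小绵羊3' after 2018-10-22.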

Summary generated automatically by CSDN.

The posts are saved as JSON files in the following format:

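tiezi={'title':    'thread title',
       'author':   'thread author',
       'tid':      'thread id',
       'reply_num':'number of replies',
       'last_reply_time':'time of the last reply',
       'last_reply_author':'author of the last reply',
       'pages':          'total number of pages',
   # the actual content of the thread, one entry per floor
       'post_list': ['floor 1',
                     'floor 2',
                     'floor 3',
                     'floor 4',
                     '.....'
                     ]
   }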

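# the list of floors
post_list=[ # floor 1
                {
                 'page':      'page the floor is on',
                 'author':    'author of the floor',
                 'floor':     'floor number',
                 'time':'reply time',
                 'pid':       'id of this floor',
                 'content':   'reply content (text, images, custom emoticons)',
                 'voice':     'present only if the floor has a voice message',
                 'comment_num':'number of nested replies',
                 'comment_list':    # present only if comment_num is not 0
                               ['reply 1',
                                'reply 2',
                                'reply 3']
                 },
                 {'floor 2'},
                 {'floor 3'}
                ]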
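# nested replies inside a floor
comment_list=[# reply 1
         {'page':      'page of the nested replies it is on',
          'author':    'author of the reply',
          'time':'reply time',
          'pid':       'id of this reply',
          'content':   'reply content (text only)',
          'voice':     'present only if the reply has a voice message',
          },

          {'reply 2'},
          {'reply 3'},
        ]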
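
To make the nesting above concrete, here is a minimal sketch of how one such record could be flattened into a single stream of floors and nested replies. The helper name `iter_replies` and the sample record are hypothetical, invented only for illustration; the field names come from the format above.

```python
from typing import Dict, Iterator, Tuple


def iter_replies(tiezi: Dict) -> Iterator[Tuple[str, Dict]]:
    """Yield ('floor', post) for every floor and ('comment', comment)
    for every nested reply, in reading order."""
    for post in tiezi.get('post_list', []):
        yield 'floor', post
        # 'comment_list' only exists when the floor has nested replies
        for comment in post.get('comment_list', []):
            yield 'comment', comment


# Hypothetical sample record that follows the layout described above
sample = {
    'title': 'demo title', 'author': 'user_a', 'tid': '1',
    'post_list': [
        {'floor': '1', 'author': 'user_a', 'time': '2018-07-01 10:00',
         'content': 'first floor', 'comment_num': 1,
         'comment_list': [{'author': 'user_b', 'time': '2018-07-01 10:05',
                           'content': 'nested reply'}]},
        {'floor': '2', 'author': 'user_c', 'time': '2018-07-01 11:00',
         'content': 'second floor', 'comment_num': 0},
    ],
}

flat = [(kind, item['author'], item['content'])
        for kind, item in iter_replies(sample)]
```

Treating floors and nested replies as one stream means a later keyword or author search only has to look at `author`, `time` and `content`, which both levels share.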

I have now written a script to make querying easier.

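'''
Three operations:
search_keyword: search the post content for a keyword
search_author:  search for everything a given user has posted
get_content:    all post content after a given time (or everything in the folder)

Workflow:
list all files (file_list) -> set the output file name (save_word_filename)
-> loop over each file ---- (search_keyword/search_author/get_content)
        -> loop over each line (search_one_file)
                    -> search -> filter again -> set the saved format and add to the target value_dict ---- (keyword_paths/author_paths/content_paths)
                    -> return value_dict and write it to the file (save_file)
        -> once every file has been traversed and saved, merge the lines of the output file and save again (Data_Merge)

'''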
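
The workflow above can be sketched roughly as follows. This is not the script's actual code. It assumes that each line of a saved file holds one thread as a JSON object, and that `time` values are zero-padded strings such as '2018-10-23 08:00', so plain string comparison orders them correctly. The names `scan_lines`, `merge_results`, `by_keyword` and `by_author` are hypothetical.

```python
import json
from typing import Callable, Dict, Iterable, Iterator, List


def iter_items(tiezi: Dict) -> Iterator[Dict]:
    """Yield every floor and every nested reply of one thread."""
    for post in tiezi.get('post_list', []):
        yield post
        yield from post.get('comment_list', [])


def scan_lines(lines: Iterable[str],
               keep: Callable[[Dict], bool]) -> Dict[str, List[Dict]]:
    """Scan one file line by line (assumed: one JSON thread per line)
    and return {tid: [matching floors/replies]}."""
    value_dict: Dict[str, List[Dict]] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        tiezi = json.loads(line)
        hits = [item for item in iter_items(tiezi) if keep(item)]
        if hits:
            value_dict.setdefault(tiezi['tid'], []).extend(hits)
    return value_dict


def merge_results(results: Iterable[Dict[str, List[Dict]]]) -> Dict[str, List[Dict]]:
    """Merge the per-file dictionaries into one, grouped by thread id."""
    merged: Dict[str, List[Dict]] = {}
    for value_dict in results:
        for tid, hits in value_dict.items():
            merged.setdefault(tid, []).extend(hits)
    return merged


def by_keyword(word: str) -> Callable[[Dict], bool]:
    """Predicate: the content contains the keyword."""
    return lambda item: word in item.get('content', '')


def by_author(name: str, since: str = '') -> Callable[[Dict], bool]:
    """Predicate: written by `name` at or after `since` (string comparison)."""
    return lambda item: (item.get('author') == name
                         and item.get('time', '') >= since)
```

An open file object can be passed directly as `lines`, for example `scan_lines(open(path, encoding='utf-8'), by_keyword('肠胃'))`. Because the search condition is an ordinary predicate, the keyword, author and time filters can share the same scanning and merging code.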

The file is as follows:

import os


class find_the_word(object):

    def __init__(self, dir_path, tieba):
        self.dir_path = dir_path
        self.tieba = tieba
        # os.path.join picks the right separator instead of a hard-coded '\\'
        self.tieba_dir = os.path.join(dir_path, tieba)
        self.count = 0


    def file_list(self):
        '''Return the full paths of the JSON files in the folder that holds them.'''
        file_list = []
        for name in os.listdir(self.tieba_dir):
            file_path = os.path.join(self.tieba_dir, name)
            if os.path.isfile(file_path):
                file_list.append(file_path)
        return file_list


    def save_word_filename(self, search_type, target_word):
        '''Build the output file name; if the file already exists, delete it.'''
        # The full-width colon is kept from the original file name pattern
        save_filename = self.tieba_dir + '~%s：%s.json' % (search_type, target_word)
        if os.path.exists(save_filename):  # remove a stale result file
            os.remove(save_filename)
        return save_filename



    def search_one_f