Scraping a friend's Qzone posts (说说, "shuoshuo"): an enhanced version, with a summary of the difficulties encountered

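I had previously downloaded some source code from the web. When I tried it out on a classmate's account, the output was nothing but advertising posts. After some digging, the cause turned out to be that the original code turned pages by locating a clickable "下一页" ("next page") link, and that classmate had once reposted a post from a user named "下一个" ("next one"). The crawler therefore wandered into that user's space, where every post was an advertisement.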

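    The corrected code jumps directly by page number, using the pager at the bottom of the page.

    However, the ID of the page-number input keeps changing. After several revisions, the final code is as follows.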

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
import time
import json


# Append one record to the output file.
def saveFile0(data, add):
    # errors='ignore' silently drops any character that GBK cannot encode
    with open(add, 'a', encoding='gbk', errors='ignore') as f_obj:
        f_obj.write(data + '\n')    # one record per line; the with block closes the file
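# Use Selenium with the PhantomJS headless browser.
# This listing targets the Selenium 3 API: webdriver.PhantomJS and the find_element_by_* helpers are not available in current Selenium 4 releases.
driver = webdriver.PhantomJS(executable_path="C:\\Python\\Python36-32\\Scripts\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe")    # on Windows, backslashes in the path must be escaped like this
driver.set_window_size(1920, 1080)


# Log in if required, then scrape every page of posts.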
def get_shuoshuo(qq, add):
    print(qq)
    driver.get('https://user.qzone.qq.com/{}/311'.format(qq))    # open the friend's Qzone posts page
    time.sleep(3)
    try:
        driver.find_element_by_id('login_div')
        log_sign = True
    except Exception:    # no login box was found, so no login is needed
        log_sign = False
    if log_sign:
        driver.switch_to.frame('login_frame')
        driver.find_element_by_id('switcher_plogin').click()
        driver.find_element_by_id('u').clear()
        driver.find_element_by_id('u').send_keys('   ')    # your QQ number for login
        driver.find_element_by_id('p').clear()
        driver.find_element_by_id('p').send_keys('   ')    # your QQ password for login
        driver.find_element_by_id('login_button').click()
        time.sleep(6)
    driver.implicitly_wait(6)
    # Uncomment the following lines to test whether you have permission to view the friend's Qzone:
    #try:
    #    driver.find_element_by_id('QM_OwnerInfo_Icon')
    #    v_right = True
    #except:
    #    v_right = False
    #if v_right == True:
    driver.switch_to.frame('app_canvas_frame')    # this raises an error if qq is not a friend whose space you can open
    not_tail = True
    num = 0
    while not_tail:
        # Nickname lookup disabled: the result was never used, and the selector as written
        # (class names without the leading '.') matched nothing.
        #nickname = driver.find_elements_by_css_selector('qz_311_author.c_tx nickname.goProfile')
        content = driver.find_elements_by_css_selector('.content')
        stime = driver.find_elements_by_css_selector('.c_tx.c_tx3.goDetail')
        for con, sti in zip(content, stime):
            data = {
                '昵称': qq,          # nickname (the QQ number is stored here)
                '时间': sti.text,    # time
                '说说': con.text     # post text
                }
            #print(data)
            saveFile0(str(data), add)
        try:
            driver.find_element_by_link_text(str(num+2))    # is there a link to the next page number?
            # Matching the link text '下一页' ("next page") instead goes wrong when a user's name on the page matches it.
            #t = driver.find_element_by_link_text('下一页')
            not_tail = True
        except Exception:
            not_tail = False
        if not_tail:
            driver.find_element_by_id('pager_go_'+str(num)).send_keys(str(num+2))    # the id of the page-jump input changes after every jump
            num = num + 1
            t = driver.find_elements_by_class_name("bt_tx2")[1]    # the confirm button; alternative: driver.find_element_by_link_text('确定')
            t.click()    # find_elements_by_class_name() returns a list, so pick one element before calling click()
            time.sleep(1)    # give Qzone enough time to load the next page

    cookie = driver.get_cookies()    # collect the session cookies
    cookie_dict = []
    for c in cookie:
        ck = "{0}={1};".format(c['name'], c['value'])
        cookie_dict.append(ck)
    i = ''.join(cookie_dict)    # all cookies as one "name=value;" string (not used further in this script)
    print("\n")
    print("==========Done================")
 

 
if __name__ == '__main__':
    get_shuoshuo(int("QQ number"), 'E:/Python/crawler/test1.txt')    # replace "QQ number" with the digits of the QQ account to crawl, and set the path of the output file
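
The page-jump bookkeeping in the loop above can be looked at in isolation. The following is a minimal sketch, assuming the pattern the listing relies on still holds (the jump input is `pager_go_0` on the first page, `pager_go_1` after one jump, and so on, while the target page is `num + 2`). The helper names `pager_step` and `pager_plan` are hypothetical and exist only for illustration; Qzone's actual markup may differ.

```python
def pager_step(num):
    """Return (input_id, target_page) for the num-th jump, counting from 0.

    Mirrors the listing: 'pager_go_' + str(num) receives str(num + 2).
    """
    return 'pager_go_{}'.format(num), num + 2


def pager_plan(jumps):
    """List the (input_id, target_page) pairs for the first `jumps` jumps."""
    return [pager_step(n) for n in range(jumps)]


if __name__ == '__main__':
    for input_id, page in pager_plan(3):
        print(input_id, '->', page)
```

Keeping the counter and the id in one place makes it easier to adjust the script if the id scheme turns out to be different.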

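The listing waits a fixed `time.sleep(1)` after each page jump, which can be too short on a slow connection and wasteful on a fast one. Selenium's `WebDriverWait` (imported in the listing but never used) is the usual tool for this. The sketch below is a library-free stand-in that shows the same polling idea; `wait_until` is a hypothetical helper, and the commented-out driver call is only an assumption about how it might be wired in.

```python
import time


def wait_until(predicate, timeout=6.0, interval=0.2):
    """Call predicate() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within {} s'.format(timeout))
        time.sleep(interval)


# Stand-in for a page that becomes ready on the third poll.
calls = {'n': 0}


def ready_after_three_polls():
    calls['n'] += 1
    return calls['n'] >= 3


wait_until(ready_after_three_polls, timeout=2.0, interval=0.01)
print('polls needed:', calls['n'])

# With a real driver, the predicate might be (assumption, Selenium 3 API):
# wait_until(lambda: driver.find_elements_by_css_selector('.content'))
```
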
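Summary of difficulties: 1. pip reported an SSL certificate error when installing selenium and bs4:

Skip the verification:

pip --trusted-host pypi.python.org install <package name>

Also try several times; sometimes the failure is simply caused by high latency when reaching servers abroad.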

2. Output encoding errors (messages about the gbk codec and similar):

Change the python.sublime-build file to:

{
"cmd":["python.exe", "-u", "$file"],
"path":"C:/Python/Python36-32",   // note: set this to your own Python installation path
"file_regex": "^[ ]*File \"(...*?)\", line ([0-9]*)",
"selector": "source.python",
"encoding":"cp936"
}

and open the output file as follows:

f_obj = open(add, 'a', encoding='gbk', errors='ignore')
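
What `errors='ignore'` does is worth seeing once, because it hides the error by dropping data. The sketch below is an illustration under stated assumptions: `round_trip` is a hypothetical helper, the sample string is made up, and UTF-8 is shown only as an alternative, not as what the original script uses.

```python
import os
import tempfile


def round_trip(text, encoding, errors='strict'):
    """Append text to a fresh temporary file, then read the file back."""
    path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
    with open(path, 'a', encoding=encoding, errors=errors) as f:
        f.write(text + '\n')
    with open(path, encoding=encoding) as f:
        return f.read()


# An emoji (U+1F600) has no GBK encoding; the Chinese characters do.
sample = 'hello \U0001F600 说说'

gbk_result = round_trip(sample, 'gbk', errors='ignore')    # emoji is silently dropped
utf8_result = round_trip(sample, 'utf-8')                   # everything is kept

# ascii() keeps the printout safe on consoles that cannot display these characters.
print(ascii(gbk_result))
print(ascii(utf8_result))
```

If posts contain emoji or other characters outside GBK, writing the file as UTF-8 keeps them, at the cost of needing an editor that opens the file as UTF-8.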

