2020.01.05

bobbykey

Category column: Graduation Project - Public Opinion Mining and Analysis

Article link: https://blog.csdn.net/bobbykey/article/details/103839232

1. scrapy: turn a str into an HTML selector so that XPath can be run on it

from scrapy.selector import Selector
names = Selector(text=datas).xpath("//div[contains(@class,'jDesc')]/a/text()").extract()

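2. selenium webdriver: passing a parameter into the expression given to find_element_by_xpath() (this works like formatted output in C; it is Python string formatting, not an XPath feature)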

driver.find_element_by_xpath("//td[contains(text(),'%s')]" % cluster_name)

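Here cluster_name is the parameter and %s is the format specifier (a string in this case; use %d for an integer). The parameter must be assigned a value beforehand.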
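A minimal sketch of the same formatting trick, written so that it can be tried without a browser. The helper names `xpath_literal`, `td_containing` and `nth_child_div` are illustrative and not part of selenium. One assumption goes beyond the original note: a plain `'%s'` breaks when the value itself contains a single quote, so the sketch quotes the value first.

```python
def xpath_literal(value: str) -> str:
    """Return value as an XPath 1.0 string literal (hypothetical helper)."""
    if "'" not in value:
        return "'%s'" % value
    if '"' not in value:
        return '"%s"' % value
    # Both quote kinds are present: glue single-quoted pieces together
    # with a double-quoted apostrophe using XPath's concat().
    parts = value.split("'")
    return "concat(" + ", \"'\", ".join("'%s'" % p for p in parts) + ")"


def td_containing(text: str) -> str:
    """Build the expression from the note above for an arbitrary text."""
    return "//td[contains(text(),%s)]" % xpath_literal(text)


def nth_child_div(index: int) -> str:
    """%d formats an integer, here used as a positional index."""
    return ".//div[%d]" % index
```

The resulting string is what gets handed to the driver, for example `driver.find_element_by_xpath(td_containing(cluster_name))`.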

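3. Make the auto-increment primary key start from 1 again

truncate table tablename;

Note that truncate also deletes every row in the table.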

4. Douban mirror for pip (XXX is the package name): pip install -i https://pypi.doubanio.com/simple/ XXX

5. Clicking through to a Weibo post's detail page:

ac = self.web.find_element_by_xpath(".//div[@class = 'm-container-max']/div/div/div[%s]" % j).find_element_by_xpath(".//footer/div[2]/h4")
self.web.execute_script("arguments[0].click();", ac)  # perform the click through JavaScript

The click on a Weibo post can only be simulated through self.web.execute_script.

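6. Clicking the QQ login:

After the QQ login page has been opened:

self.web.page_source does not contain the source of the left-hand part of the page. That source lives inside an iframe, so the driver has to switch into the iframe first.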

self.web.switch_to.frame(self.web.find_element_by_xpath(".//iframe[@id = 'ptlogin_iframe']"))  # enter the iframe; without this, the source inside the iframe cannot be reached
ac = self.web.find_element_by_xpath(".//span[@id = 'img_out_11943809']")  # the id depends on the QQ number
self.web.execute_script("arguments[0].click();", ac)  # perform the click through JavaScript

7. When the Weibo crawler is not logged in, it can fetch at most 29 pages in a row.

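8. Starting a spider with scrapyd

Change to the root directory of the spider project, then:

1. scrapyd (start the scrapyd service)

2. scrapyd-deploy (deploy the project to the service)

3. curl http://localhost:6800/schedule.json -d project=weibo -d spider=film

Stopping a spider:
curl http://localhost:6800/cancel.json -d project=<scrapy project name> -d job=<job ID>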
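The two curl calls above can also be issued from Python. The following is only a sketch using the standard library; the helper names `build_request`, `call`, `schedule` and `cancel` are made up for illustration, and the code assumes scrapyd is listening on localhost:6800 as in the commands above.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

SCRAPYD = "http://localhost:6800"  # same address as in the curl commands


def build_request(endpoint: str, **params: str) -> Request:
    """Build the form-encoded POST that `curl -d key=value` would send."""
    body = urlencode(params).encode("utf-8")
    return Request("%s/%s" % (SCRAPYD, endpoint), data=body, method="POST")


def call(endpoint: str, **params: str) -> dict:
    """Send the request and decode scrapyd's JSON reply."""
    with urlopen(build_request(endpoint, **params), timeout=10) as resp:
        return json.load(resp)


def schedule(project: str, spider: str) -> str:
    """Start a spider and return the job id reported by schedule.json."""
    return call("schedule.json", project=project, spider=spider)["jobid"]


def cancel(project: str, job: str) -> dict:
    """Stop a running job through cancel.json."""
    return call("cancel.json", project=project, job=job)
```

With the service running, `job = schedule("weibo", "film")` starts the spider, and `cancel("weibo", job)` stops it again, so the job ID no longer has to be copied by hand.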

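9. Launching another .py file from Python

import os

# Find the directory four levels above this file and switch the working directory
GRANDFA = os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(os.path.realpath(__file__)))))
os.chdir(GRANDFA + '/Data_Management')  # move into the directory that holds the script to run
# pdb.set_trace()
# os.system('cd Data_Management')  # has no effect: the cd only applies to the child shell
os.system('python data_handle.py')  # process the data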
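As an alternative sketch, the same step can be done with `subprocess.run`, which takes the working directory as an argument instead of changing it for the whole process. This swaps in a different technique from the os.chdir/os.system approach above, and `run_script` is a hypothetical helper name.

```python
import subprocess
import sys
from pathlib import Path


def run_script(work_dir: Path, script: str) -> int:
    """Run `script` with the current interpreter, using work_dir as its cwd."""
    result = subprocess.run(
        [sys.executable, script],  # same Python that runs the caller
        cwd=work_dir,              # applies to the child process only
        check=True,                # raise CalledProcessError on a non-zero exit
    )
    return result.returncode
```

For the layout used above, the call would look like `run_script(Path(__file__).resolve().parents[3] / "Data_Management", "data_handle.py")`, where `parents[3]` corresponds to the four nested `os.path.dirname` calls.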