Scraping and saving news with Python: small crawler scripts, storing the Sohu news list in a database, news scraping, news collection...

weixin_39633891

Published 2020-11-26 04:14:16


Tags: scraping and saving news with Python

DrivingSubject.py

(2017.11.16)

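A Python 2.7 crawler that scrapes every question from Jiakao Baodian (驾考宝典), a driving-test question bank.

Once the dependencies are installed, run it from the console: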

python DrivingSubject.py 小车科目一

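An update check has been added. On Linux the script can be scheduled from crontab; it compares the questions it fetches to decide what needs updating.

See the code for details.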
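
The update check can be as simple as comparing the question IDs the API returns with the IDs already stored. The following is a minimal sketch in Python 3, assuming IDs are plain integers; `find_new_ids` is a hypothetical helper and is not taken from DrivingSubject.py.

```python
def find_new_ids(fetched_ids, stored_ids):
    """Return the IDs that were fetched but are not stored yet, in fetch order."""
    known = set(stored_ids)
    new_ids = []
    for qid in fetched_ids:
        if qid not in known:
            known.add(qid)  # also drops duplicates inside fetched_ids
            new_ids.append(qid)
    return new_ids
```

Only the IDs returned by this helper would then need to be requested from the question endpoint, which keeps a scheduled run cheap when nothing has changed.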

Notes:

Jiakao Baodian endpoint for fetching question IDs:

http://api2.jiakaobaodian.com/api/open/question/list-by-tag.htm?_r=111922017237088616081&cityCode=511300&page=1&limit=25&course=kemu1&tagId=2&carType=car&_=0.5066246786512065

Endpoint for reading questions by ID:

http://api2.jiakaobaodian.com/api/open/question/question-list.htm?_r=19604815519963578102&page=1&limit=25&questionIds=909400

The endpoint returns JSON.
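
The two-step flow above (list the IDs, then read the questions by ID) can be sketched by building the two request URLs. This is a Python 3 sketch, not the original Python 2.7 code. The helper names are hypothetical, the `_r` and `_` parameters are left out on the assumption that they are optional cache-busters, and joining several IDs with commas is also an assumption, since the source only shows a single ID.

```python
from urllib.parse import urlencode

BASE = "http://api2.jiakaobaodian.com/api/open/question"


def id_list_url(course="kemu1", car_type="car", tag_id=2,
                city_code="511300", page=1, limit=25):
    """Build the URL that lists question IDs for one tag (hypothetical helper)."""
    query = urlencode({
        "cityCode": city_code,
        "page": page,
        "limit": limit,
        "course": course,
        "tagId": tag_id,
        "carType": car_type,
    })
    return f"{BASE}/list-by-tag.htm?{query}"


def question_url(question_ids, page=1, limit=25):
    """Build the URL that reads questions by ID (hypothetical helper)."""
    # Assumption: multiple IDs are comma-separated; the source shows only one ID.
    ids = ",".join(str(qid) for qid in question_ids)
    query = urlencode({"page": page, "limit": limit, "questionIds": ids})
    return f"{BASE}/question-list.htm?{query}"
```

The resulting URLs can be passed to `requests.get`; the shape of the JSON they return is not shown in this post, so the sketch stops before parsing it.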

News.py

A single-threaded Python 2.7 script that scrapes the Sohu news list and batch-inserts the articles into a database.

News list: http://wei.sohu.com/roll/ has roughly 100 pages with 40 articles per page, so 40 × 100 ≈ 4,000 articles.
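
Collecting the article links from one list page could look like the sketch below. It uses the standard-library `html.parser` instead of the BeautifulSoup package the script depends on, purely to stay dependency-free. `LinkCollector` and `extract_links` are hypothetical names, and the markup of the real list page is not shown in this post, so a real run would still need to filter the links down to article URLs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:  # skip anchors without an href
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return all absolute link URLs found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```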

A few packages need to be installed first:

pip install requests

pip install BeautifulSoup

pip install bs4

pip install MySQL-python

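The article rewriting step (so-called "pseudo-original" word substitution) is not included in the Python script; you can add your own.

The PHP code for the rewriting step is: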

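    // Replace words in $str using the pairs in 词库.txt (one "word_replacement" pair per line).
    function str_reWords($str)
    {
        $words = array();
        $content = file_get_contents('词库.txt');
        $content = str_replace("\r", "", $content); // normalize Windows line endings
        $content = preg_split('/\n/', $content, -1, PREG_SPLIT_NO_EMPTY);
        foreach ($content as $k => $v) {
            if ($k != 0) { // the first non-empty line of the file is skipped
                $str_data = explode('_', $v);
                $words += array("$str_data[0]" => "$str_data[1]");
            }
        }
        return strtr($str, $words); // longest key wins; replaced text is not rescanned
    }

    // Respond with the rewritten text as JSON.
    die(json_encode(array('content' => str_reWords($_POST['content']))));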

Contents of the thesaurus file (词库.txt):

善良_善意

好人_不坏的人
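
Since the rewriting step is left out of the Python script, a Python port of the PHP function might look like the following. This is a sketch in Python 3 rather than the Python 2.7 the script targets; `load_words` and `rewrite` are hypothetical names. It mimics `strtr` with an array argument by trying longer keys first and never rescanning text that has already been replaced.

```python
import re


def load_words(thesaurus_text):
    """Parse "word_replacement" lines into a dict, skipping the first line like the PHP code."""
    lines = [ln for ln in thesaurus_text.replace("\r", "").split("\n") if ln]
    words = {}
    for line in lines[1:]:  # the PHP version ignores index 0
        parts = line.split("_")
        if len(parts) >= 2 and parts[0]:
            words.setdefault(parts[0], parts[1])  # first pair wins, as with PHP's +=
    return words


def rewrite(text, words):
    """Replace every key of `words` found in `text`, longest keys first."""
    if not words:
        return text
    keys = sorted(words, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: words[m.group(0)], text)
```

For example, `rewrite(article, load_words(open("词库.txt", encoding="utf-8").read()))` would apply the file shown above, assuming it is saved as UTF-8.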

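The Python script escapes the HTML when it stores an article in the database.

When reading an article in PHP, use stripslashes to undo the escaping:

$content = str_replace('\n', '', $row['content']); // remove the literal \n sequences

$content = stripslashes($content); // undo the backslash escaping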
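
On the Python side, an alternative to escaping by hand is to let the database driver bind the values, in which case nothing needs to be unescaped when reading. The sketch below is not the original News.py code: it uses the standard-library `sqlite3` module as a stand-in for MySQL so that it runs anywhere, and the `news` table, its columns, and `save_articles` are hypothetical. With MySQL-python the placeholder would be `%s` instead of `?`.

```python
import sqlite3


def save_articles(conn, articles):
    """Batch-insert (title, content) pairs and return the total row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS news ("
        "id INTEGER PRIMARY KEY, title TEXT NOT NULL, content TEXT NOT NULL)"
    )
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT INTO news (title, content) VALUES (?, ?)",
            articles,
        )
    return conn.execute("SELECT COUNT(*) FROM news").fetchone()[0]
```

Because the values are bound as parameters, quotes and newlines in the HTML are stored unchanged.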

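Extension:

The same approach can be adapted to scrape the Tencent News, Baidu News, and NetEase News lists, depending on each site's page structure.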
