Python auto-generated collection rules: a Python crawler combined with DedeCMS for automatic collection and publishing

凌沦

Article link: https://blog.csdn.net/weixin_28877505/article/details/113988394

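This article describes how to use a custom Python crawler together with DedeCMS to collect and publish articles automatically. By analyzing the DedeCMS login interface and the HTTP packet for publishing an article, it implements simulated login and article publishing. It also discusses the crawler's architecture, including how to handle the rules for different collection pages, and a database design that avoids duplicate publishing.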

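I wanted to build a crawler that collects other people's articles in real time, rewrites them according to my own rules, and then publishes them automatically. I chose DedeCMS for publishing because it can generate static HTML automatically and save remote images locally, among other advantages. For security, the front end and the admin back end can also be completely separated.

At first I planned to use the Scrapy framework. For custom development, however, I would only use some of Scrapy's basic features and would sometimes have to follow the framework's conventions, whereas hand-written code lets me define my own processing rules while keeping the useful pieces such as spiders and processors. In the end I wrote my own demo.

First, the requirements: Python does the crawling and DedeCMS does the publishing. I tackled publishing first. There were two options: simulate a login, or study the DedeCMS database schema and write to the database directly. I did not take the second route. Simulating a login means modifying the DedeCMS code to remove the CAPTCHA, since otherwise CAPTCHA recognition would be needed, which is pointless here: the target is my own site, and I have the account, the password, and permission to publish. So I changed the DedeCMS login code to add a login interface, then analyzed the HTTP packet that DedeCMS sends when publishing an article. With that done I designed the crawler, and the result turned out to resemble some of Scrapy's basic processing mechanisms.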

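The DedeCMS login interface is added as follows.

In config.php under the admin directory, find the following block around line 34, comment it out, and add the replacement after it: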

/**
// Check the user's login status
$cuserLogin = new userLogin();
if($cuserLogin->getUserID()==-1)
{
    header("location:login.php?gotopage=".urlencode($dedeNowurl));
    exit();
}
**/

// Replacement: if there is no session, try to log in with the request parameters
$cuserLogin = new userLogin();
if($cuserLogin->getUserID()==-1) {
    if($_REQUEST['username'] != ''){
        $res = $cuserLogin->checkUser($_REQUEST['username'], $_REQUEST['password']);
        if($res==1) $cuserLogin->keepUser();
    }
    if($cuserLogin->getUserID()==-1) {
        header("location:login.php?gotopage=".urlencode($dedeNowurl));
        exit();
    }
}

With this change, a request to http://127.0.0.2/dede/index.php?username=admin&password=admin returns a session ID, and that session ID is all that is needed to publish articles.
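
As a minimal sketch (Python 3, standard library only), the login URL can be assembled with `urllib.parse.urlencode` so that credentials containing special characters are escaped. The helper name `build_login_url` is hypothetical; the host, admin directory, and parameter names come from the example above.

```python
from urllib.parse import urlencode, urljoin


def build_login_url(domain, admin_dir, username, password):
    """Return the URL of the patched login interface (hypothetical helper)."""
    base = urljoin(domain, admin_dir + "index.php")
    # urlencode escapes characters such as '&' or spaces in the credentials
    return base + "?" + urlencode({"username": username, "password": password})


print(build_login_url("http://127.0.0.2/", "dede/", "admin", "admin"))
```

Passing the resulting URL to a session-based HTTP client keeps the returned cookies for the later publishing request.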

The HTTP packet for publishing an article looks like this:

# http://127.0.0.2/dede/article_add.php
POST /dede/article_add.php HTTP/1.1
Host: 127.0.0.2
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:48.0) Gecko/20100101 Firefox/48.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Referer: http://127.0.0.2/dede/article_add.php?cid=2
Cookie: menuitems=1_1%2C2_1%2C3_1; CNZZDATA1254901833=1497342033-1472891946-%7C1473171059; Hm_lvt_a6454d60bf94f1e40b22b89e9f2986ba=1472892122; ENV_GOBACK_URL=%2Fmd5%2Fcontent_list.php%3Farcrank%3D-1%26cid%3D11; lastCid=11; lastCidckMd5=2f82387a2b251324; DedeUserID=1; DedeUserIDckMd5=74be9ff370c4536f; DedeLoginTime=1473174404; DedeLoginTime__ckMd5=b8edc1b5318a3923; hasshown=1; Hm_lpvt_a6454d60bf94f1e40b22b89e9f2986ba=1473173893; PHPSESSID=m2o3k882tln0ttdi964v5aorn6
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Content-Type: multipart/form-data; boundary=---------------------------2802133914041
Content-Length: 3639

-----------------------------2802133914041
Content-Disposition: form-data; name="channelid"


-----------------------------2802133914041
Content-Disposition: form-data; name="dopost"

save
-----------------------------2802133914041
Content-Disposition: form-data; name="title"

2222222222
-----------------------------2802133914041
Content-Disposition: form-data; name="shorttitle"


-----------------------------2802133914041
Content-Disposition: form-data; name="redirecturl"


-----------------------------2802133914041
Content-Disposition: form-data; name="tags"


-----------------------------2802133914041
Content-Disposition: form-data; name="weight"

100
-----------------------------2802133914041
Content-Disposition: form-data; name="picname"


-----------------------------2802133914041
Content-Disposition: form-data; name="litpic"; filename=""
Content-Type: application/octet-stream


-----------------------------2802133914041
Content-Disposition: form-data; name="source"


-----------------------------2802133914041
Content-Disposition: form-data; name="writer"


-----------------------------2802133914041
Content-Disposition: form-data; name="typeid"


-----------------------------2802133914041
Content-Disposition: form-data; name="typeid2"


-----------------------------2802133914041
Content-Disposition: form-data; name="keywords"


-----------------------------2802133914041
Content-Disposition: form-data; name="autokey"


-----------------------------2802133914041
Content-Disposition: form-data; name="description"


-----------------------------2802133914041
Content-Disposition: form-data; name="dede_addonfields"


-----------------------------2802133914041
Content-Disposition: form-data; name="remote"


-----------------------------2802133914041
Content-Disposition: form-data; name="autolitpic"


-----------------------------2802133914041
Content-Disposition: form-data; name="needwatermark"


-----------------------------2802133914041
Content-Disposition: form-data; name="sptype"

hand
-----------------------------2802133914041
Content-Disposition: form-data; name="spsize"


-----------------------------2802133914041
Content-Disposition: form-data; name="body"

2222222222
-----------------------------2802133914041
Content-Disposition: form-data; name="voteid"


-----------------------------2802133914041
Content-Disposition: form-data; name="notpost"

0
-----------------------------2802133914041
Content-Disposition: form-data; name="click"


-----------------------------2802133914041
Content-Disposition: form-data; name="sortup"

0
-----------------------------2802133914041
Content-Disposition: form-data; name="color"


-----------------------------2802133914041
Content-Disposition: form-data; name="arcrank"

0
-----------------------------2802133914041
Content-Disposition: form-data; name="money"

0
-----------------------------2802133914041
Content-Disposition: form-data; name="pubdate"

2016-09-06 23:07:52
-----------------------------2802133914041
Content-Disposition: form-data; name="ishtml"


-----------------------------2802133914041
Content-Disposition: form-data; name="filename"


-----------------------------2802133914041
Content-Disposition: form-data; name="templet"


-----------------------------2802133914041
Content-Disposition: form-data; name="imageField.x"


-----------------------------2802133914041
Content-Disposition: form-data; name="imageField.y"


-----------------------------2802133914041--

# Request that regenerates the HTML
# http://127.0.0.2/dede/task_do.php?typeid=2&aid=109&dopost=makeprenext&nextdo=
GET /dede/task_do.php?typeid=2&aid=109&dopost=makeprenext&nextdo= HTTP/1.1
Host: 127.0.0.2
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:48.0) Gecko/20100101 Firefox/48.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Referer: http://127.0.0.2/dede/article_add.php
Connection: keep-alive
Upgrade-Insecure-Requests: 1
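
For reference, the query string of that follow-up request can be rebuilt from its parts. This is only a sketch in Python 3: `build_makehtml_url` is a hypothetical helper, and the parameter names and values are the ones visible in the captured GET request above.

```python
from urllib.parse import urlencode


def build_makehtml_url(admin_base, typeid, aid):
    """Rebuild the task_do.php URL seen in the capture (hypothetical helper)."""
    query = urlencode({
        "typeid": typeid,         # column ID of the article
        "aid": aid,               # ID of the article that was just saved
        "dopost": "makeprenext",  # action name observed in the capture
        "nextdo": "",             # empty in the capture
    })
    return admin_base + "task_do.php?" + query


print(build_makehtml_url("http://127.0.0.2/dede/", 2, 109))
```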

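Analyzing the packets above gives the following result:

POST http://127.0.0.2/dede/article_add.php

Parameters to configure:

channelid:1 # submit as a regular article
dopost:save # submit action
shorttitle:"" # short title
autokey:1 # extract keywords automatically
remote:1 # no thumbnail specified; fetch the thumbnail remotely
autolitpic:1 # use the first image as the thumbnail
sptype:auto # automatic pagination
spsize:5 # paginate automatically every 5 KB
notpost:1 # disable comments
sortup:0 # article ordering, default
arcrank:0 # read permission: open to everyone
money:0 # coins charged: 0
ishtml:1 # generate HTML
title:"article title" # article title
source:"article source" # article source
writer:"article author" # article author
typeid:"main column ID, e.g. 2" # main column ID
body:"article content" # article content
click:"click count" # article click count
pubdate:"submission time" # submission time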
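
The parameter list can be turned into a small payload builder. The sketch below (Python 3) is one possible arrangement, not the article's own code: `FIXED_FIELDS` and `build_article` are hypothetical names, and `money` is assumed to be 0. Building the payload inside a function gives every article its own click count and timestamp.

```python
import random
import time

# Fields that stay the same for every article (values from the list above).
FIXED_FIELDS = {
    "channelid": 1, "dopost": "save", "shorttitle": "", "autokey": 1,
    "remote": 1, "autolitpic": 1, "sptype": "auto", "spsize": 5,
    "notpost": 1, "sortup": 0, "arcrank": 0, "money": 0, "ishtml": 1,
}


def build_article(title, body, typeid, source="", writer=""):
    """Return a fresh form payload for one article (hypothetical helper)."""
    article = dict(FIXED_FIELDS)  # copy, so the template is never mutated
    article.update({
        "title": title,
        "body": body,
        "typeid": typeid,
        "source": source,
        "writer": writer,
        "click": random.randint(10, 300),  # random click count
        "pubdate": time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()),
    })
    return article


payload = build_article("test title", "test body", "2")
print(sorted(payload))
```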

Next comes a test that simulates publishing articles to DedeCMS. The Python code is as follows:

#!/usr/bin/python
#coding:utf8
import requests,random,time

# Request the login interface and keep the cookies in a session
sid = requests.session()
login_url = "http://127.0.0.2/dede/index.php?username=admin&password=admin"
header = {
    "User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:44.0) Gecko/20100101 Firefox/44.0",
    "Referer" :"http://127.0.0.2"
}

# Call the login interface to obtain the cookies
loadcookies = sid.get(url = login_url,headers = header)

# Open the add-article page
#get_html = sid.get('http://127.0.0.2/dede/article_add.php?channelid=1',headers = header)
#print get_html.content

# URL that article submissions are posted to
article_request = "http://127.0.0.2/dede/article_add.php"

# Fixed fields
article = {
    'channelid':1, # submit as a regular article
    'dopost':'save', # submit action
    'shorttitle':'', # short title
    'autokey':1, # extract keywords automatically
    'remote':1, # no thumbnail specified; fetch the thumbnail remotely
    'autolitpic':1, # use the first image as the thumbnail
    'sptype':'auto', # automatic pagination
    'spsize':5, # paginate automatically every 5 KB
    'notpost':1, # disable comments
    'sortup':0, # article ordering, default
    'arcrank':0, # read permission: open to everyone
    'money': 0, # coins charged: 0
    'ishtml':1, # generate HTML
    'click':random.randint(10, 300), # random click count
    'pubdate':time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()), # current time as the submission time
}

# Variable fields
article['source'] = "article source" # article source
article['writer'] = "article author" # article author
article['typeid'] = "2" # main column ID

"""
# Test submission
article['title'] = "test_article title" # article title
article['body'] = "test_article content" # article content
# After the submission the server redirects to generate the HTML; HTTP status 200 means success
res = sid.post(url = article_request,data = article, headers = header)
print(res)
"""

for i in range(50):
    article['title'] = str(i) + "_article title" # article title
    article['body'] = str(i) + "_article content" # article content
    #print article
    res = sid.post(url = article_request,data = article, headers = header)
    print(res)

The next stage is analyzing the crawler requirements:

Pages to collect from:

http://www.tunvan.com/col.jsp?id=115

http://www.zhongkerd.com/news.html

http://www.qianxx.com/news/field/

http://www.ifenguo.com/news/xingyexinwen/

http://www.ifenguo.com/news/gongsixinwen/

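Every collection page needs different rewriting rules, and the target column for publishing may differ as well, so a single crawler cannot cover everything and several are needed. The project therefore consists of spiders, processors, a configuration file, a shared function file (to avoid duplicating code), and a database file.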
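
One way to keep the per-site differences out of the spider code is a rule table in the configuration file. The sketch below (Python 3) is only an illustration: `SPIDER_RULES` and `rule_for` are hypothetical names, and the XPath strings and `typeid` values are placeholders rather than rules taken from the real sites.

```python
from urllib.parse import urlparse

# One entry per collection site. The XPath strings and typeid values are
# placeholders for illustration, not the real rules for these sites.
SPIDER_RULES = {
    "www.zhongkerd.com": {
        "list_url": "http://www.zhongkerd.com/news.html",
        "content_xpath": "//div[@class='content']",  # placeholder
        "typeid": "2",                               # placeholder column ID
    },
    "www.qianxx.com": {
        "list_url": "http://www.qianxx.com/news/field/",
        "content_xpath": "//div[@id='article']",     # placeholder
        "typeid": "3",                               # placeholder column ID
    },
}


def rule_for(url):
    """Look up the rule set for the site that a URL belongs to."""
    host = urlparse(url).netloc
    if host not in SPIDER_RULES:
        raise KeyError("no collection rule configured for " + host)
    return SPIDER_RULES[host]


print(rule_for("http://www.zhongkerd.com/news/content-1389.html")["typeid"])
```

Keying the table by host is a simplification. A site with several list pages, such as the two ifenguo.com columns above, would need a finer key, for example the list URL itself.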

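The database mainly stores each article's URL and title, which is enough to tell whether an article is new. If it has already been collected and published, it must not be published again. If it is not in the database, it is new and must be written to the database and published. One table with a few fields is sufficient. I used sqlite3, and the table in the database file db.dll is created as follows: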

CREATE TABLE history (
    id INTEGER PRIMARY KEY ASC AUTOINCREMENT,
    url VARCHAR( 100 ),
    title TEXT,
    date DATETIME DEFAULT ( ( datetime( 'now', 'localtime' ) ) )
);
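
The duplicate check against this table can be sketched as follows (Python 3). The function names `already_published` and `record_published` are hypothetical, and an in-memory database stands in for db.dll so the example runs on its own. The queries use `?` placeholders, so a quote in a title cannot break the SQL.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS history (
    id    INTEGER PRIMARY KEY ASC AUTOINCREMENT,
    url   VARCHAR(100),
    title TEXT,
    date  DATETIME DEFAULT (datetime('now', 'localtime'))
)
"""


def already_published(conn, url, title):
    """Return True if this url/title pair is already recorded."""
    row = conn.execute(
        "SELECT count(*) FROM history WHERE url = ? AND title = ?",
        (url, title),
    ).fetchone()
    return row[0] > 0


def record_published(conn, url, title):
    """Remember an article so that it is not published twice."""
    conn.execute("INSERT INTO history (url, title) VALUES (?, ?)", (url, title))
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, for the demo only
conn.execute(SCHEMA)
url, title = "http://example.com/a.html", "It's a title"
print(already_published(conn, url, title))  # False
record_published(conn, url, title)
print(already_published(conn, url, title))  # True
```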

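The project layout is as follows:

│ db.dll # SQLite database
│ dede.py # test of the dede login interface
│ function.py # shared functions
│ run.py # entry point that starts the spiders
│ settings.py # crawler configuration
│ spiders.py # example spider
│ sqlitestudio-2.1.5.exe # SQLite database editor
│ __init__.py # package initializer so the directory can be imported as a module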

dede.py:

#!/usr/bin/python
#coding:utf8
import requests,random,time
import lxml

# Site address and credentials
domain = "http://127.0.0.2/"
admin_dir = "dede/"
houtai = domain + admin_dir  # admin back-end base URL
username = "admin"
password = "admin"

# Request the login interface and keep the cookies in a session
sid = requests.session()
login_url = houtai + "index.php?username=" + username + "&password=" + password
header = {
    "User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:44.0) Gecko/20100101 Firefox/44.0",
    "Referer" : domain
}

# Call the login interface to obtain the cookies
loadcookies = sid.get(url = login_url,headers = header)

# Fixed fields
article = {
    'channelid':1, # submit as a regular article
    'dopost':'save', # submit action
    'shorttitle':'', # short title
    'autokey':1, # extract keywords automatically
    'remote':1, # no thumbnail specified; fetch the thumbnail remotely
    'autolitpic':1, # use the first image as the thumbnail
    'sptype':'auto', # automatic pagination
    'spsize':5, # paginate automatically every 5 KB
    'notpost':1, # disable comments
    'sortup':0, # article ordering, default
    'arcrank':0, # read permission: open to everyone
    'money': 0, # coins charged: 0
    'ishtml':1, # generate HTML
    'click':random.randint(10, 300), # random click count
    'pubdate':time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()), # current time as the submission time
}

# Variable fields
article['source'] = "article source" # article source
article['writer'] = "article author" # article author
article['typeid'] = "2" # main column ID

# URL that article submissions are posted to
article_request = houtai + "article_add.php"

"""
# Test submission
article['title'] = "11test_article title" # article title
article['body'] = "11test_article content" # article content
# After the submission the server redirects to generate the HTML; HTTP status 200 means success
res = sid.post(url = article_request,data = article, headers = header)
print(res)
"""

for i in range(50):
    article['title'] = str(i) + "_article title" # article title
    article['body'] = str(i) + "_article content" # article content
    #print article
    res = sid.post(url = article_request,data = article, headers = header)
    print(res)

function.py:

#coding:utf-8
from settings import *

# Check whether the article is already in the database: 0 means absent, non-zero means present
def res_check(article):
    # Parameterized query, so quotes in a URL or title cannot break the SQL
    exec_select = "SELECT count(*) FROM history WHERE url = ? AND title = ?"
    rows = cur.execute(exec_select, (article[0],article[1]))
    result = 0
    for res in rows:
        result = res[0]
    return result

# Write the article to the database
def res_insert(article):
    exec_insert = "INSERT INTO history (url,title) VALUES (?,?)"
    cur.execute(exec_insert, (article[0],article[1]))
    conn.commit()

# Publish an article through the simulated login
def send_article(title,body,typeid = "2"):
    article['title'] = title # article title
    article['body'] = body # article content
    article['typeid'] = typeid # main column ID
    #print article
    # After the submission the server redirects to generate the HTML; HTTP status 200 means success
    res = sid.post(url = article_request,data = article, headers = header)
    #print res
    if res.status_code == 200 :
        #print u"send mail!"
        send_mail(title = title,body = body)
        print(u"success article send!")
    else:
        # Handle a failed publish here
        pass

# Send a notification e-mail: send_mail(title, body)
def send_mail(title,body):
    shoujian = "admin@127.0.0.1" # recipient
    # Mail server settings: user name, password, and mailbox suffix
    mail_user = "610358898"
    mail_pass = "your mailbox password"
    mail_postfix = "qq.com"
    me = mail_user + "<" + mail_user + "@" + mail_postfix + ">"
    msg = MIMEText(body, 'html', 'utf-8')
    msg['Subject'] = title
    #msg['to'] = shoujian
    try:
        mail = smtplib.SMTP()
        mail.connect("smtp.qq.com") # SMTP server
        mail.login(mail_user,mail_pass)
        mail.sendmail(me,shoujian, msg.as_string())
        mail.close()
        print(u"send mail success!")
    except Exception as e:
        print(str(e))
        print(u"send mail exit!")
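
The notification message itself can be built and inspected without contacting an SMTP server. The sketch below (Python 3) uses a hypothetical helper, `build_notification`; the sender address is assumed to be the user name and mailbox suffix from the code above joined with `@`, and the display name is arbitrary.

```python
from email.mime.text import MIMEText
from email.utils import formataddr


def build_notification(sender, recipient, title, body):
    """Build (but do not send) the HTML notification e-mail."""
    msg = MIMEText(body, "html", "utf-8")
    msg["Subject"] = title
    msg["From"] = formataddr(("crawler", sender))  # display name is arbitrary
    msg["To"] = recipient
    return msg


msg = build_notification("610358898@qq.com", "admin@127.0.0.1",
                         "New article", "<p>published</p>")
print(msg["From"])
```

Setting the `From` and `To` headers explicitly helps receiving servers treat the message as well formed; the result of `msg.as_string()` can then be passed to `smtplib.SMTP.sendmail`.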

run.py:

# -*- coding: utf-8 -*-
import spiders

# Start the first spider
spiders.start()

settings.py:

#coding:utf-8
import re,sys,os,requests,lxml,string,time,random,logging
from bs4 import BeautifulSoup
from lxml import etree
import smtplib
from email.mime.text import MIMEText
import sqlite3
import HTMLParser

# Reload sys so that the default encoding can be set (Python 2 only)
reload(sys)
sys.setdefaultencoding( "utf-8" )

# Current time
#now = time.strftime( '%Y-%m-%d %X',time.localtime())

# Request headers used when crawling
headers={
    "User-Agent":"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.152 Safari/537.36",
    "Accept":"*/*",
    "Accept-Language":"zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
    "Accept-Encoding":"gzip, deflate",
    "Content-Type":"application/x-www-form-urlencoded; charset=UTF-8",
    "Connection":"keep-alive",
    "X-Requested-With":"XMLHttpRequest",
}

domain = u"北京软件外包".decode("string_escape") # hyperlink text to replace; note that domain is reassigned in the dede section below
html_parser = HTMLParser.HTMLParser() # HTML entity unescaper

######################################################## dede settings
# Site address and credentials
domain = "http://127.0.0.2/"
admin_dir = "dede/"
houtai = domain + admin_dir  # admin back-end base URL
username = "admin"
password = "admin"

# Request the login interface and keep the cookies in a session
sid = requests.session()
login_url = houtai + "index.php?username=" + username + "&password=" + password
header = {
    "User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:44.0) Gecko/20100101 Firefox/44.0",
    "Referer" : domain
}

# Call the login interface to obtain the cookies
loadcookies = sid.get(url = login_url,headers = header)

# Fixed fields
article = {
    'channelid':1, # submit as a regular article
    'dopost':'save', # submit action
    'shorttitle':'', # short title
    'autokey':1, # extract keywords automatically
    'remote':1, # no thumbnail specified; fetch the thumbnail remotely
    'autolitpic':1, # use the first image as the thumbnail
    'sptype':'auto', # automatic pagination
    'spsize':5, # paginate automatically every 5 KB
    'notpost':1, # disable comments
    'sortup':0, # article ordering, default
    'arcrank':0, # read permission: open to everyone
    'money': 0, # coins charged: 0
    'ishtml':1, # generate HTML
    'click':random.randint(10, 300), # random click count
    'pubdate':time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()), # current time as the submission time
}

# Variable fields
article['source'] = "article source" # article source
article['writer'] = "article author" # article author

# URL that article submissions are posted to
article_request = houtai + "article_add.php"

######################################################## database settings
# Open the database connection
conn = sqlite3.connect("db.dll")

# Create a cursor
cur = conn.cursor()

spiders.py:

#coding:utf-8
from settings import *
from function import *

# Fetch one article's content, given the article URL and the XPath expression of its body
def get_content( url = "http://www.zhongkerd.com/news/content-1389.html" , xpath_rule = "//html/body/div[3]/div/div[2]/div/div[2]/div/div[1]/div/div/dl/dd" ):
    html = requests.get(url,headers = headers).content
    tree = etree.HTML(html)
    res = tree.xpath(xpath_rule)[0]
    res_content = etree.tostring(res) # serialize the node to a string
    res_content = html_parser.unescape(res_content) # unescape HTML entities
    res_content = res_content.replace('\t','').replace('\n','') # strip tabs and newlines; add .replace(' ','') to strip spaces too
    return res_content

# Fetch the list page and return the article URLs
def get_article_list(url = "http://www.zhongkerd.com/news.html" ):
    body_html = requests.get(url,headers = headers).content
    #print body_html
    soup = BeautifulSoup(body_html,'lxml')