Python Crawler and Favorites Feature: Implementation Notes

Latest recommended article published 2023-05-11 15:23:43

weixin_33743661

Tags: python, crawler, json

Original link: http://www.cnblogs.com/bay1/p/10982545.html

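After about 20 days of work, I rushed this demo out the door.
Source code

Here are some brief notes on the problems I ran into along the way.

This project takes our university's college websites as its starting point and uses Python to crawl three kinds of data from the main college sites: college news, notices and announcements, and student and academic updates.
The data feeds a campus web system for hot-topic analysis and recommendation.
The system generates various reports for the school's website administrators, and it also aggregates and displays the data for students.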

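The crawler

The system is built around the college websites, and university sites are fairly easy to crawl.
I used requests together with lxml, which parses very efficiently.
lxml is not as well known as BeautifulSoup, but it is faster and, for me, more convenient to work with.
Crawling rules also differ from site to site, so each kind of site needs its own handling.

Crawling ordinary pages

Most of the college sites use the same site template,
which makes filtering the data much easier.

The approach is: news list link -> read the last page number from the first list page -> generate list_dict -> read the article links from each list page -> read the data from each article link.

Here is one site picked at random as an example.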

import re
import requests
from lxml import etree

collegenews_get=requests.get(url,timeout=10)
result=etree.HTML(collegenews_get.text) # parse into a tree that XPath can query
article_url_paths=result.xpath("//div[@class='wk_new_lb_title']/a[@title]/@href")
# re.search returns a Match object, so take .group() to get the page number as text
total_page=re.search(r'[0-9]+(?=[^0-9]*$)',str(end_page)).group()
...
for i in range(0,int(total_page)+1):
    new_link=link.replace("list","list"+str(i))
    link_list.append(new_link)
...
article_url_paths=result.xpath("//tr/td/a[@class]/@href")
...
for article_url_path in article_url_paths:
    # keep only relative .htm links and prefix them with the site root
    if '.htm' in article_url_path and 'http://' not in article_url_path:
        article_url.append('http://'+str(article_url_root)+str(article_url_path))
...
article_url_get_result=etree.HTML(article_url_get.text) # parse into a tree that XPath can query
article_time=article_url_get_result.xpath("//div[@class='article_other_info']/span[@id='Label_time']/text()")
article_author=article_url_get_result.xpath("//div[@class='article_other_info']/span[@class='arti_update']/text()")
title=article_url_get_result.xpath("//head/title/text()")
clicknumber_url=article_url_get_result.xpath("//table[@class='border2']/tr/td[@align='center']/span/@url")
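
The link-handling steps in this flow (reading the last page number, generating the list-page links, turning relative article paths into full URLs) can be sketched with the standard library alone. This is only a rough sketch under assumptions: the helper names and the example.edu URLs are made up for illustration, list pages are assumed to be numbered from 1, and urljoin is swapped in for the manual 'http://' + root concatenation used above.

```python
import re
from urllib.parse import urljoin


def last_page_number(end_page_href):
    """Return the last run of digits in the href as an int, or None if there is none."""
    match = re.search(r'[0-9]+(?=[^0-9]*$)', end_page_href)
    return int(match.group()) if match else None


def build_list_links(first_link, total_page):
    """Generate list1 ... listN links from a first-page link (assumed numbering)."""
    return [first_link.replace("list", "list" + str(i), 1)
            for i in range(1, total_page + 1)]


def absolutize(page_url, hrefs):
    """Keep only .htm links and resolve them against the page they came from."""
    return [urljoin(page_url, href) for href in hrefs if '.htm' in href]
```

Because urljoin leaves absolute URLs untouched, the separate 'http://' check is no longer needed in this variant.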

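The variable names make their purpose clear. The extracted lists may contain "impurities",
so a regular-expression pass can filter them.
Below are the ones I used; each does what its name suggests.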

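# Matches a YYYY-MM-DD date with a valid month and day (February stops at the 28th, so 02-29 is rejected)
date_re = re.compile(r'([0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]{1}|[0-9]{1}[1-9][0-9]{2}|[1-9][0-9]{3})-(((0[13578]|1[02])-(0[1-9]|[12][0-9]|3[01]))|((0[469]|11)-(0[1-9]|[12][0-9]|30))|(02-(0[1-9]|[1][0-9]|2[0-8])))')
# Matches a run of letters, digits, underscores, hyphens, or Chinese characters
author_re=re.compile(r'[A-Za-z0-9_\-\u4e00-\u9fa5]+')
# Captures the host part of an http or https URL
url_reg =re.match(r'^https?:\/\/([a-z0-9\-\.]+)[\/\?]?',link)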
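
As a rough illustration of this filtering step, the sketch below pulls the first usable date and the first name-like token out of a noisy XPath result list. It is a simplified variant, not the code from the project: the function names are hypothetical, and instead of one long date pattern it uses a loose pattern plus datetime.strptime for validation, which also accepts 29 February in leap years.

```python
import re
from datetime import datetime

DATE_CANDIDATE = re.compile(r'\d{4}-\d{2}-\d{2}')
AUTHOR_TOKEN = re.compile(r'[A-Za-z0-9_\-\u4e00-\u9fa5]+')


def first_valid_date(fragments):
    """Return the first YYYY-MM-DD substring that is a real calendar date."""
    for text in fragments:
        for candidate in DATE_CANDIDATE.findall(text):
            try:
                datetime.strptime(candidate, '%Y-%m-%d')
            except ValueError:
                continue  # looks like a date but is not one, e.g. 2019-02-30
            return candidate
    return None


def first_token(fragments):
    """Return the first run of name characters, skipping whitespace-only entries."""
    for text in fragments:
        match = AUTHOR_TOKEN.search(text)
        if match:
            return match.group()
    return None
```

Both helpers return None when nothing usable is found, so the caller can decide whether to skip the article or store an empty value.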

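Sites that paginate with JavaScript

For this kind of site, we can capture the network traffic and analyze the data exchange.
The site I ran into, for example, has this snippet in its page source: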

$.ajax({
    type: "POST",
    url: "/Services/ContentServices.asmx/GetContentList",
    data: {
        pageIndex: start_index + 1
        , countPerPage: items_on_page
        , category: category
    },
    success: function(data) {}
    });

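The data is clearly fetched with a POST request. After trying one POST, I found that the response carries JSON data,
so the code can be written accordingly.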

import json
import re
import requests
from lxml import etree

url='http://xxxxx/Services/ContentServices.asmx/GetContentList'
college_name=check_url(url)
college = Colleges.query.filter_by(name=college_name).first()
category=re.search(r'[0-9]+(?=[^0-9]*$)',str(link))
# Content-Length is left out: requests computes it from the form data
headers={
    'Accept': '*/*',
    'Origin': 'http://xxxxx',
    'X-Requested-With': 'XMLHttpRequest',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Referer': 'http://xxxxx/default.aspx',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'close'
}
for pageIndex in range(1,30):
    data={
        'pageIndex': pageIndex,
        'countPerPage':'30',
        'category': category[0]
    }
    # send the request inside the loop so that every page is fetched
    post_result=requests.post(url,headers=headers,data=data).content
    # the JSON payload sits inside a <string> element of the response
    result=json.loads(etree.HTML(post_result).xpath("//string/text()")[0])["Rows"]
...
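# the keys are the Chinese field names returned by the service
article_time=result[i]["添加时间"][0:10]  # time added, trimmed to YYYY-MM-DD
article_id=str(result[i]["内容ID"])  # content ID
title=[result[i]["标题"]]  # title
article_author=result[i]["姓名"]  # name
if result[i]["链接"]=="True":  # whether the row carries its own link
    article_url=result[i]["链接地址"]  # link address
else:
    article_url="http://sm.cumt.edu.cn/detail.aspx?m="+category[0]+"&cid="+article_id
clicknumber=result[i]["浏览次数"]  # view count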
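
To make the shape of this exchange easier to see without hitting the real service, here is a small offline sketch. Everything in it is an assumption for illustration: the sample response is invented (including the tempuri.org namespace), the helper names and the example.edu base URL are hypothetical, and the standard library's ElementTree stands in for lxml.

```python
import json
import xml.etree.ElementTree as ET

# Invented sample: JSON text wrapped in a <string> element, as described above.
SAMPLE_RESPONSE = (
    '<string xmlns="http://tempuri.org/">'
    '{"Rows": [{"内容ID": 42, "标题": "Sample title", "链接": "False", "链接地址": ""}]}'
    '</string>'
)


def build_form(page_index, category, count_per_page=30):
    """Form fields mirroring the data object in the page's $.ajax call."""
    return {"pageIndex": page_index,
            "countPerPage": str(count_per_page),
            "category": category}


def rows_from_response(xml_text):
    """Unwrap the <string> element and decode the JSON rows inside it."""
    root = ET.fromstring(xml_text)
    return json.loads(root.text)["Rows"]


def article_url_for(row, category, base="http://example.edu"):
    """Use the row's own link if it has one, otherwise build a detail-page URL."""
    if row["链接"] == "True":
        return row["链接地址"]
    return "{}/detail.aspx?m={}&cid={}".format(base, category, row["内容ID"])
```

Formatting the content ID into the URL avoids a type error if the service returns it as a number rather than a string.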

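Storing the data

In the database design, the favorites feature models users and each news category as many-to-many,
while colleges and news categories are modeled as one-to-many.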

# Association tables
UserToCollegeNews = db.Table('UserToCollegeNews',
            db.Column('user_id', db.Integer, db.ForeignKey('users.id'), primary_key=True),
            db.Column('collegenews_id', db.Integer, db.ForeignKey('collegenews.id'), primary_key=True)
)

UserToCollegeNotices = db.Table('UserToCollegeNotices',
            db.Column('user_id', db.Integer, db.ForeignKey('users.id'), primary_key=True),
            db.Column('collegenotices_id', db.Integer, db.ForeignKey('collegenotices.id'), primary_key=True)
)

UserToStudentWork = db.Table('UserToStudentWork',
            db.Column('user_id', db.Integer, db.ForeignKey('users.id'), primary_key=True),
            db.Column('studentworks_id', db.Integer, db.ForeignKey('studentworks.id'), primary_key=True)
)

class User(db.Model, UserMixin):
    __tablename__ = 'users'
    ...
    collegenews= db.relationship('CollegeNews', secondary=UserToCollegeNews, backref=db.backref('users', lazy='dynamic'))
    collegenotices= db.relationship('CollegeNotices', secondary=UserToCollegeNotices, backref=db.backref('users', lazy='dynamic'))
    studentworks= db.relationship('StudentWork', secondary=UserToStudentWork, backref=db.backref('users', lazy='dynamic'))

class Colleges(db.Model):
    __tablename__='colleges'
    ...
    collegenew = db.relationship('CollegeNews', backref='colleges', lazy='dynamic')
    collegenotice = db.relationship('CollegeNotices', backref='colleges', lazy='dynamic')
    studentwork = db.relationship('StudentWork', backref='colleges', lazy='dynamic')

class CollegeNews(db.Model):
    __tablename__='collegenews'
    ...
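    # Foreign key to colleges.name; the column type declared here should match the type of colleges.name
    college = db.Column(db.Integer, db.ForeignKey('colleges.name'))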

class CollegeNotices(db.Model):
    __tablename__='collegenotices'
    ...
    college = db.Column(db.Integer, db.ForeignKey('colleges.name'))

class StudentWork(db.Model):
    __tablename__='studentworks'
    ...
    college = db.Column(db.Integer, db.ForeignKey('colleges.name'))
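
For readers who want to see what the many-to-many association table does underneath the ORM, here is a minimal sketch using the standard library's sqlite3 instead of Flask-SQLAlchemy. The table and function names are simplified stand-ins, not the project's actual schema. The composite primary key is what keeps a user from saving the same article twice.

```python
import sqlite3


def make_db():
    """Create an in-memory database with users, news, and an association table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT UNIQUE);
        CREATE TABLE collegenews (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE user_to_collegenews (
            user_id INTEGER REFERENCES users(id),
            collegenews_id INTEGER REFERENCES collegenews(id),
            PRIMARY KEY (user_id, collegenews_id)
        );
    """)
    return conn


def add_favorite(conn, user_id, news_id):
    # OR IGNORE makes a repeated add a no-op instead of a primary-key error
    conn.execute("INSERT OR IGNORE INTO user_to_collegenews VALUES (?, ?)",
                 (user_id, news_id))


def remove_favorite(conn, user_id, news_id):
    conn.execute("DELETE FROM user_to_collegenews "
                 "WHERE user_id = ? AND collegenews_id = ?", (user_id, news_id))


def favorites(conn, user_id):
    """Titles of the articles a user has saved, ordered by article id."""
    rows = conn.execute(
        "SELECT n.title FROM collegenews AS n "
        "JOIN user_to_collegenews AS f ON f.collegenews_id = n.id "
        "WHERE f.user_id = ? ORDER BY n.id", (user_id,))
    return [row[0] for row in rows]
```

In the SQLAlchemy models above, appending to or removing from user.collegenews performs the equivalent insert or delete on the association table.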

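Building the charts

The system uses Plotly for its chart styles. The examples on the official site show that the chart axes read their data from lists,
so I pass the data through an API and, on the front end, receive it with Ajax and assemble the lists that generate the chart.
It looks like the following; other chart types work the same way.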

$.ajax({
    url: "../api/clicksum",
    dataType: "json",
    success: function(datas) {
        // split the records into the x and y lists that Plotly expects
        var name_list = [];
        var clicksum_list = [];
        for (var i = 0; i < datas.length; i++) {
            name_list.push(datas[i]['name']);
            clicksum_list.push(datas[i]['clicksum']);
        }
        var trace = {
            x: name_list,
            y: clicksum_list,
            // a line chart is a scatter trace drawn in "lines" mode
            type: 'scatter',
            mode: 'lines',
        };
        var data_two = [trace];
        var layout_two = {
            width: 250, // width in pixels, given as a number
        };
        Plotly.newPlot('my-graph-2', data_two, layout_two);
    }
});
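
On the server side, the endpoint only needs to return a JSON list of records with name and clicksum keys, since those are the keys the front end reads. The sketch below shows one way such a payload might be assembled. It is a guess at the shape, not the project's actual API code; the function names and the (college, clicks) input format are assumptions.

```python
import json
from collections import defaultdict


def clicksum_payload(articles):
    """Sum click counts per college from (college_name, clicks) pairs."""
    totals = defaultdict(int)
    for college, clicks in articles:
        totals[college] += int(clicks)
    return [{"name": name, "clicksum": total}
            for name, total in sorted(totals.items())]


def split_axes(payload):
    """Mirror of the front-end loop: one list for the x axis, one for the y axis."""
    return ([record["name"] for record in payload],
            [record["clicksum"] for record in payload])


def to_json(payload):
    # ensure_ascii=False keeps Chinese college names readable in the response
    return json.dumps(payload, ensure_ascii=False)
```

Keeping the aggregation in a plain function like this makes it easy to reuse the same data for other chart types.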

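Implementing the favorites feature

This was the first time I had built a feature like this.
My approach is: user clicks -> jump to a data-handling URL -> the handler function redirects back to the page that was clicked.
I also asked my former team lead about it. His suggestion was Ajax: use an API, much like the JS pagination described above.
I will only go into detail on my own approach.

First, define the route in the main blueprint. To make it easy to jump back to the original URL, the route takes several parameters.
Haha, I can't help feeling that so many user-controllable parameters is dangerous.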

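from flask import abort, redirect, url_for
from flask_login import current_user, login_required

@main.route('/collect/<is_collect>/<page>/<collegename>/<typename>/<collectarticle>/<title>')
@login_required
def article_collect(page,collectarticle,collegename,typename,title,is_collect):
    user=User.query.filter(User.username==current_user.username).first()
    query = CollegeNews.query.filter(CollegeNews.title==title).first()
    if query is None:
        abort(404)  # no article with this title
    if is_collect=='1':
        # already in favorites: remove it (skip if it is not actually there)
        if query in user.collegenews:
            user.collegenews.remove(query)
    elif query not in user.collegenews:
        user.collegenews.append(query)
    db.session.commit()
    return redirect(url_for('.articles',page=page,collegename=collegename,typename=typename))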
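
One way to ease the worry about user-controllable parameters is to validate them before touching the database. The sketch below is framework-free and purely illustrative: the helper names and the set of allowed type names are assumptions, not values taken from the project.

```python
import re

# Assumed type names; the real values depend on how typename is used in the app.
ALLOWED_TYPES = {"collegenews", "collegenotices", "studentworks"}


def validate_collect_params(is_collect, page, typename):
    """Return (already_collected, page_number, typename) or raise ValueError."""
    if is_collect not in ("0", "1"):
        raise ValueError("is_collect must be '0' or '1'")
    if not re.fullmatch(r'[0-9]+', page) or int(page) < 1:
        raise ValueError("page must be a positive integer")
    if typename not in ALLOWED_TYPES:
        raise ValueError("unknown typename")
    return is_collect == "1", int(page), typename


def toggle_favorite(favorite_ids, article_id, already_collected):
    """Return a new set of favorite ids after the click; never raises on repeats."""
    updated = set(favorite_ids)
    if already_collected:
        updated.discard(article_id)
    else:
        updated.add(article_id)
    return updated
```

In a Flask view, a ValueError from the validator could be turned into a 400 response. Looking articles up by id rather than by title would also avoid ambiguity when two articles share a title.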

An is_collect parameter tells the route whether to add the favorite or remove it.
Below is the relevant part of the HTML template.

{% if article in is_collect %}
    <a href="{{ url_for('.article_collect',page=pagination.page,collegename=collegename,typename=typename,collectarticle=article,title=article.title,is_collect=1) }}">
    <button class="btn btn-danger btn-xs" title='Click to remove from favorites'>
{% else %}
    <a href="{{ url_for('.article_collect',page=pagination.page,collegename=collegename,typename=typename,collectarticle=article,title=article.title,is_collect=0) }}">
    <button class="btn btn-default btn-xs" title='Click to add to favorites'>
{% endif %}
    <span class="glyphicon glyphicon-heart-empty" aria-hidden="true"></span>
    </button>
    </a>

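Form search

I happened to answer exactly this question for someone in a chat group a few days ago; it is covered in my previous post.

A few small tips

Customizing bootstrap/wtf styles

This renders both forms inline; the other options are easy to look up.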

{{ wtf.quick_form(form,form_type='inline') }}

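Or skip wtf.quick_form and define the action and the form styles yourself.

The Bootstrap grid system is a good thing

Bootstrap provides a responsive, mobile-first fluid grid system that scales up to at most 12 columns as the screen or viewport size increases.
Using a single set of .col-md-* grid classes, you can create a basic grid system; all columns must be placed inside a .row.
Use the .col-md-offset-* classes to shift a column to the right.
Use the .col-sm-* classes to target tablets.
Use the .col-xs-* and .col-md-* classes to target extra-small and medium screens.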

<div class="row">
    <div class="col-md-10 col-md-offset-1">
    ...
    </div>
</div>

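Summary

Having written up the whole post, the demo doesn't feel all that hard, haha.
But skill levels vary from person to person.
Once you start writing it yourself from step one, you find out how hard it really is.
In any case, I found implementing some of the features quite painful. ORZ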

ahaha

Reposted from: https://www.cnblogs.com/bay1/p/10982545.html
