Knockoff News Site: Scraping NetEase Tech with Python

GGGGJW

Category: Knockoff News Site. Tags: python, mysql

Article link: https://blog.csdn.net/guojiawei228/article/details/53365209

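This article walks through scraping NetEase Tech news with Python: connecting to the database, scraping the front page to collect article links, processing each page, and storing the news in a MySQL database. It highlights problems that can come up in practice with character encoding, HTML parsing, and database configuration.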

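Recently I wanted to build a knockoff news site. What technology does that take?
After some searching on Baidu and Google, it turned out to be fairly simple:
some Python, some MySQL, and a Java web framework.
Of course, that only covers the back end; the front end is left to our front-end engineer.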

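From here on I will develop while exploring, implementing the site's main features step by step.
The first part uses Python to scrape some of the news from NetEase Tech and store it in a database on the server.
The BeautifulSoup and MySQLdb libraries handle the scraping and the storage.
The code for each part follows: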

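Connecting to the database

import MySQLdb

# Replace the placeholder values with your own settings.
# host and port are separate arguments; 3306 is MySQL's default port.
db = MySQLdb.connect(host="server address", port=3306,
                     user="user name", passwd="your password",
                     db="database name",
                     use_unicode=True, charset="utf8")
db.ping(True)  # check the connection and reconnect automatically if it dropped
cursor = db.cursor()  # get a cursor for running SQL statements

# ... work with the database here ...

db.close()  # close the database connection

This part handles the database connection. Once the code is deployed on a real server, the connection may drop for all sorts of reasons; with automatic reconnection that is not a problem.
At the start you can use a locally created database, then switch to the server database once the code has been tested.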
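
For the local testing mentioned above, one lightweight option is a throwaway database. The sketch below is written for Python 3 (the article's own code is Python 2) and swaps in the standard library's sqlite3 module as a stand-in for MySQL, so it runs without a server. The article never shows the schema of news_list: the column names are taken from the insert statement later in the post, but the column types, the id column, and the helper name open_local_db are assumptions made for illustration.

```python
import sqlite3

# Assumed schema: the column names come from the article's insert statement,
# but the types and the id column are guesses for illustration only.
CREATE_NEWS_LIST = """
create table if not exists news_list (
    id            integer primary key autoincrement,
    news_title    text,
    news_time     text,
    news_image    text,
    news_category text,
    news_context  text
)
"""


def open_local_db(path=":memory:"):
    """Open a throwaway SQLite database that has an empty news_list table."""
    conn = sqlite3.connect(path)
    conn.execute(CREATE_NEWS_LIST)
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = open_local_db()
    # List the column names to confirm the table was created
    print([row[1] for row in conn.execute("pragma table_info(news_list)")])
    conn.close()
```

SQLite and MySQL differ in SQL dialect and in placeholder style, so this is only useful for checking the scraping logic before pointing the code at a real MySQL instance.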

Scraping the front page and collecting article URLs

import urllib2
import cookielib
from bs4 import BeautifulSoup

cookie = cookielib.CookieJar()
handler = urllib2.HTTPCookieProcessor(cookie)
# Create an opener that actually uses the cookie handler
opener = urllib2.build_opener(handler)
# Connect to NetEase Tech and get a response object
response = opener.open("http://tech.163.com/")
# Transcode the HTML from GBK to UTF-8 so the text is not garbled
content = response.read().decode('gbk', 'ignore').encode('utf-8', 'ignore')
# Build a BeautifulSoup object from content
soup = BeautifulSoup(content, "html.parser")

htmlList = []  # URLs collected from the front page
# Collect the URL of every news item under the newest-lists node
for tags in soup.find_all(class_="newest-lists"):
    for a in tags.find_all('a'):
        html = a.get('href')
        if html:  # skip <a> tags that have no href
            htmlList.append(html)
print 'index end'

Analysis:

<div class="lf-newest clearfix" id="lf_viewer">
    <div class="newest-top">
        <h2>最新快讯</h2>
    </div>
    <div ne-role="scrollbar">
        <div class="newest-lists">
            <ul>
                                        <li class="list_item">
                    <span class="wai_cicle"></span>
                    <div class="animate-dot"></div>
                    <a href="http://tech.163.com/16/1125/08/C6N2FOTV00097UF6.html" class="nl_detail">
                        <p class="nl-title">
                                                        BAT弱爆了！硅谷巨头的人工智能收购暗战
                            <em class="nl-time">08:27</em>
                        </p>
                    </a>                                                            
                </li>
                        <li class="list_item">
                    <span class="wai_cicle"></span>
                    <div class="animate-dot"></div>
                    <a href="http://tech.163.com/16/1125/08/C6N1UBL400097UF6.html"

The news list sits under class="newest-lists", so the rest is straightforward: extract that node and read the href attribute of each link inside it.
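
To make the extraction step concrete without a network connection, here is a minimal Python 3 sketch that pulls the links out of a trimmed copy of the markup shown above. It deliberately uses the standard library's html.parser instead of BeautifulSoup so it runs with no extra installs; the class NewestLinks, the function extract_links, the placeholder headlines, and the example.com link are all made up for illustration and are not part of the article's code.

```python
from html.parser import HTMLParser

# Trimmed copy of the front-page markup; the headlines and the last link
# are placeholders added for this example.
SAMPLE = """
<div class="newest-lists">
  <ul>
    <li class="list_item">
      <a href="http://tech.163.com/16/1125/08/C6N2FOTV00097UF6.html" class="nl_detail">
        <p class="nl-title">headline one</p>
      </a>
    </li>
    <li class="list_item">
      <a href="http://tech.163.com/16/1125/08/C6N1UBL400097UF6.html" class="nl_detail">
        <p class="nl-title">headline two</p>
      </a>
    </li>
  </ul>
</div>
<div class="other"><a href="http://example.com/elsewhere">not a news item</a></div>
"""


class NewestLinks(HTMLParser):
    """Collect href values of <a> tags nested inside div.newest-lists."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # div nesting depth inside the target div; 0 means outside
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth > 0:
                self.depth += 1
            elif "newest-lists" in (attrs.get("class") or "").split():
                self.depth = 1
        elif tag == "a" and self.depth > 0 and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1


def extract_links(html):
    """Return the hrefs found under div.newest-lists, in document order."""
    parser = NewestLinks()
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    for link in extract_links(SAMPLE):
        print(link)
```

Tracking the div depth is what keeps the link in the second, unrelated div out of the result; BeautifulSoup's find_all does the equivalent scoping for you.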

Processing each page

def getOne(html):
    # Title, time, body (text), image URL (img), and category of the news item
    title = 'null'
    time = 'null'
    text = ''
    img = 'null'
    category = "首页".decode("utf-8")  # "首页" means "front page"
    opener = urllib2.build_opener()
    response = opener.open(html)
    # Some pages contain invalid bytes, so ignore characters that cannot be decoded
    content = response.read().decode('gb18030', 'ignore').encode('utf-8', 'ignore')
    soup = BeautifulSoup(content, "html.parser")
    # Extract title, time, and text in turn
    for tag in soup.find_all(class_="post_content_main"):
        if tag.h1 is not None:
            title = tag.h1.get_text("", strip=True)
    for tag in soup.find_all(class_="post_time_source"):
        time = tag.get_text("", strip=True)
    for tag in soup.find_all(class_="post_text"):
        # Remove the ad and source blocks from the body
        for del_tag in tag.find_all(class_="gg200x300"):
            del_tag.decompose()
        for del_tag in tag.find_all(class_="ep-source cDGray"):
            del_tag.decompose()
        # Keep the paragraphs as HTML so the article formatting is preserved
        for statement in tag.find_all('p'):
            text += str(statement)
    # Decode once, after all paragraphs have been collected
    text = text.decode('utf-8')
    for tag in soup.find_all(class_="f_center"):
        if tag.img is not None:
            img = tag.img.get('src')
    # Store the result in the database via saveNews
    saveNews(title, time, img, category, text)

Each article page is handled in much the same way as the front page, so I will not go into detail here.
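
One detail worth spelling out is the 'ignore' error handler used when decoding. The Python 3 sketch below (the article's code is Python 2) shows the difference on a deliberately damaged byte string: strict decoding raises an error, while 'ignore' drops the bad bytes and keeps the rest. The helper name decode_page and the sample text are assumptions for illustration.

```python
def decode_page(raw, encoding="gb18030"):
    """Decode page bytes, silently dropping bytes that are not valid."""
    return raw.decode(encoding, errors="ignore")


# Simulate a damaged download: the last two-byte character is cut in half.
complete = "网易".encode("gb18030")
damaged = complete[:-1]

try:
    damaged.decode("gb18030")  # strict decoding, the default
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print("strict decoding failed:", strict_failed)
print("lenient result:", decode_page(damaged))
```

The trade-off is that 'ignore' hides data loss, so it suits a hobby scraper better than a pipeline where every character matters.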

Storing the news

def saveNews(title, time, image, category, text):
    # Use the connection and cursor created earlier
    global db, cursor
    # Let the driver fill in the %s placeholders, so quotes in the
    # article HTML cannot break the SQL statement
    sql = "insert into news_list(news_title, \
        news_time,news_image,news_category,news_context) \
        values(%s,%s,%s,%s,%s)"

    try:
        cursor.execute(sql, (title, time, image, category, text))
        db.commit()
        print "success"
    except MySQLdb.Error as e:
        db.rollback()
        print "i'm wrong", e
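
To see why it matters to pass values separately from the SQL text, here is a small Python 3 sketch that stores a body containing both single and double quotes. It uses the standard library's sqlite3 as a stand-in for MySQL so it runs anywhere; note that sqlite3 uses ? placeholders where MySQLdb uses %s. The function name save_news, the table definition, and the sample values are assumptions for illustration, not the article's code.

```python
import sqlite3


def save_news(conn, title, time, image, category, text):
    """Insert one news row; return True on success, False on failure."""
    sql = ("insert into news_list(news_title, news_time, news_image, "
           "news_category, news_context) values (?, ?, ?, ?, ?)")
    try:
        # Values travel separately from the SQL text, so quotes are harmless
        conn.execute(sql, (title, time, image, category, text))
        conn.commit()
        return True
    except sqlite3.Error:
        conn.rollback()
        return False


conn = sqlite3.connect(":memory:")
# Assumed column types; the article does not show the real schema
conn.execute("create table news_list(news_title text, news_time text, "
             "news_image text, news_category text, news_context text)")

tricky = "<p class='lead'>It's a \"quoted\" headline</p>"
ok = save_news(conn, "title", "2016-11-25 08:27", "null", "home", tricky)
stored = conn.execute("select news_context from news_list").fetchone()[0]
print(ok, stored == tricky)
```

Returning a boolean instead of printing makes it easier for the caller to count failures or retry.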