Python | Web Crawler

1) Crawler rule of thumb: behave like a normal visitor
Example: connect directly over the network, without adding any headers

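# Fetch the HTML source of the PTT movie board
import ssl
import urllib.request as request

# Note: this private helper disables TLS certificate verification; use it only for experiments
context = ssl._create_unverified_context()

src = 'https://www.ptt.cc/bbs/movie/index.html'
with request.urlopen(src, context=context) as response:
    data = response.read().decode("utf-8")

print(data)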

error message:

urllib.error.HTTPError: HTTP Error 403: Forbidden

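The server rejects the request outright. Press F12 to open the browser's developer tools and observe what happens when the server is visited normally.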
The browser sends a whole batch of headers. The most important one is User-Agent, which identifies the OS and browser you are using.
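To see why the first request stands out, it can help to look at what urllib sends when no header is set. The following is a minimal sketch, assuming CPython's standard urllib.request, where a default opener carries a single User-agent header of the form Python-urllib/x.y. The variable names are illustrative.

```python
import urllib.request as request

# A freshly built opener carries the headers urllib adds to every request by default.
opener = request.build_opener()
default_headers = dict(opener.addheaders)

# Usually a single entry of the form 'Python-urllib/x.y',
# which a server can easily tell apart from a real browser.
default_ua = default_headers.get("User-agent", "")
print(default_ua)
```

A server that filters on User-Agent can reject this value, which is consistent with the 403 shown above.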

2) Improved version (add a header to the request)

# Fetch the HTML source of the PTT movie board
import ssl
import urllib.request as request

context = ssl._create_unverified_context()

src = 'https://www.ptt.cc/bbs/movie/index.html'
# Build a Request object with the header information attached
req = request.Request(src, headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36"
})

with request.urlopen(req, context=context) as response:
    data = response.read().decode("utf-8")

print(data)

Returned message:

PS C:\Users\85380\Desktop\LearnPy> python .\test2.py
<!DOCTYPE html>
<html>
        <head>
                <meta charset="utf-8">


<meta name="viewport" content="width=device-width, initial-scale=1">

<title>看板 movie 文章列表 - 批踢踢實業坊</title>

<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.26/bbs-common.css">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.26/bbs-base.css" media="screen">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.26/bbs-custom.css">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.26/pushstream.css" media="screen">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.26/bbs-print.css" media="print">




        </head>
    <body>

<div id="topbar-container">
        <div id="topbar" class="bbs-content">
                <a id="logo" href="/bbs/">批踢踢實業坊</a>
                <span>&rsaquo;</span>
                <a class="board" href="/bbs/movie/index.html"><span class="board-label">看板 </span>movie</a>
                <a class="right small" href="/about.html">關於我們</a>
                <a class="right small" href="/contact.html">聯絡資訊</a>
        </div>
</div>

<div id="main-container">
        <div id="action-bar-container">
                <div class="action-bar">
                        <div class="btn-group btn-group-dir">
                                <a class="btn selected" href="/bbs/movie/index.html">看板</a>
                                <a class="btn" href="/man/movie/index.html">精華區</a>
                        </div>
                        <div class="btn-group btn-group-paging">
                                <a class="btn wide" href="/bbs/movie/index1.html">最舊</a>
                                <a class="btn wide" href="/bbs/movie/index8210.html">&lsaquo; 上頁</a>
                                <a class="btn wide disabled">下頁 &rsaquo;</a>
                                <a class="btn wide" href="/bbs/movie/index.html">最新</a>
                        </div>
                </div>
        </div>

        <div class="r-list-container action-bar-margin bbs-screen">
                <div class="search-bar">
                        <form type="get" action="search" id="search-bar">
                                <input class="query" type="text" name="q" value="" placeholder="搜尋文章&#x22ef;">
                        </form>
                </div>






                <div class="r-ent">
                        <div class="nrec"><span class="hl f2">8</span></div>
                        <div class="title">

                                <a href="/bbs/movie/M.1565026014.A.B3C.html">[新聞] 「終局之戰」、「亂世佳人」、「阿凡達」誰真正票房冠軍?</a>

                        </div>
                        <div class="meta">
                                <div class="author">orz44444</div>
                                <div class="article-menu">

                                        <div class="trigger">&#x22ef;</div>
                                        <div class="dropdown">
                                                <div class="item"><a href="/bbs/movie/search?q=thread%3A%5B%E6%96%B0%E8%81%9E%5D&#43;%E3%80%8C%E7%B5%82%E5%B1%80%E4%B9%8B%E6%88%B0%E3%80%8D%E3%80%81%E3%80%8C%E4%BA%82%E4%B8%96%E4%BD%B3%E4%BA%BA%E3%80%8D%E3%80%81%E3%80%8C%E9%98%BF%E5%87%A1%E9%81%94%E3%80%8D%E8%AA%B0%E7%9C%9F%E6%AD%A3%E7%A5%A8%E6%88%BF%E5%86%A0%E8%BB%8D%EF%BC%9F">搜尋同標題文章</a></div>

                                                <div class="item"><a href="/bbs/movie/search?q=author%3Aorz44444">搜尋看板內 orz44444 的文章</a></div>

                                        </div>

                                </div>
                                <div class="date"> 8/06</div>
                                <div class="mark"></div>
                        </div>
                </div>





                <div class="r-ent">
                        <div class="nrec"><span class="hl f2">6</span></div>
                        <div class="title">

                                <a href="/bbs/movie/M.1565027230.A.041.html">Re: [新聞] 凱文費奇透露《雷神索爾4》為何要拍女雷神</a>

                        </div>
                        <div class="meta">
                                <div class="author">godshibainu</div>
                                <div class="article-menu">

                                        <div class="trigger">&#x22ef;</div>
                                        <div class="dropdown">
                                                <div class="item"><a href="/bbs/movie/search?q=thread%3A%5B%E6%96%B0%E8%81%9E%5D&#43;%E5%87%B1%E6%96%87%E8%B2%BB%E5%A5%87%E9%80%8F%E9%9C%B2%E3%80%8A%E9%9B%B7%E7%A5%9E%E7%B4%A2%E7%88%BE4%E3%80%8B%E7%82%BA%E4%BD%95%E8%A6%81%E6%8B%8D%E5%A5%B3%E9%9B%B7%E7%A5%9E">搜尋同標題文章</a></div>

                                                <div class="item"><a href="/bbs/movie/search?q=author%3Agodshibainu">搜尋看板內 godshibainu 的文章</a></div>

                                        </div>

                                </div>
                                <div class="date"> 8/06</div>
                                <div class="mark"></div>
                        </div>
                </div>





                <div class="r-ent">
                        <div class="nrec"><span class="hl f3">10</span></div>
                        <div class="title">

                                <a href="/bbs/movie/M.1565027740.A.927.html">[新聞] 《復仇者4》驚見關史黛西!「就在蜘蛛人</a>

                        </div>
                        <div class="meta">
                                <div class="author">chufenyang</div>
                                <div class="article-menu">

                                        <div class="trigger">&#x22ef;</div>
                                        <div class="dropdown">
                                                <div class="item"><a href="/bbs/movie/search?q=thread%3A%5B%E6%96%B0%E8%81%9E%5D&#43;%E3%80%8A%E5%BE%A9%E4%BB%87%E8%80%854%E3%80%8B%E9%A9%9A%E8%A6%8B%E9%97%9C%E5%8F%B2%E9%BB%9B%E8%A5%BF%EF%BC%81%E3%80%8C%E5%B0%B1%E5%9C%A8%E8%9C%98%E8%9B%9B%E4%BA%BA">搜尋同標題文章</a></div>

                                                <div class="item"><a href="/bbs/movie/search?q=author%3Achufenyang">搜尋看板內 chufenyang 的文章</a></div>

                                        </div>

                                </div>
                                <div class="date"> 8/06</div>
                                <div class="mark"></div>
                        </div>
                </div>





                <div class="r-ent">
                        <div class="nrec"><span class="hl f2">4</span></div>
                        <div class="title">

                                <a href="/bbs/movie/M.1565031671.A.280.html">Re: [新聞] 必備片單!帝國雜誌評選30年來30部經典代</a>

                        </div>
                        <div class="meta">
                                <div class="author">Payne22</div>
                                <div class="article-menu">

                                        <div class="trigger">&#x22ef;</div>
                                        <div class="dropdown">
                                                <div class="item"><a href="/bbs/movie/search?q=thread%3A%5B%E6%96%B0%E8%81%9E%5D&#43;%E5%BF%85%E5%82%99%E7%89%87%E5%96%AE%EF%BC%81%E5%B8%9D%E5%9C%8B%E9%9B%9C%E8%AA%8C%E8%A9%95%E9%81%B830%E5%B9%B4%E4%BE%8630%E9%83%A8%E7%B6%93%E5%85%B8%E4%BB%A3">搜尋同標題文章</a></div>

                                                <div class="item"><a href="/bbs/movie/search?q=author%3APayne22">搜尋看板內 Payne22 的文章</a></div>

                                        </div>

                                </div>
                                <div class="date"> 8/06</div>
                                <div class="mark"></div>
                        </div>
                </div>



        <div class="r-list-sep"></div>




                <div class="r-ent">
                        <div class="nrec"><span class="hl f3">22</span></div>
                        <div class="title">

                                <a href="/bbs/movie/M.1559611458.A.DCA.html">[公告] 板規 2019/07/05</a>

                        </div>
                        <div class="meta">
                                <div class="author">ckshchen</div>
                                <div class="article-menu">

                                        <div class="trigger">&#x22ef;</div>
                                        <div class="dropdown">
                                                <div class="item"><a href="/bbs/movie/search?q=thread%3A%5B%E5%85%AC%E5%91%8A%5D&#43;%E6%9D%BF%E8%A6%8F&#43;2019%2F07%2F05">搜尋同標題文章</a></div>

                                                <div class="item"><a href="/bbs/movie/search?q=author%3Ackshchen">搜尋看板內 ckshchen 的文章</a></div>

                                        </div>

                                </div>
                                <div class="date"> 6/04</div>
                                <div class="mark">M</div>
                        </div>
                </div>



        </div>


</div>



<script>
  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
  })(window,document,'script','https://www.google-analytics.com/analytics.js','ga');

  ga('create', 'UA-32365737-1', {
    cookieDomain: 'ptt.cc',
    legacyCookieDomain: 'ptt.cc'
  });
  ga('send', 'pageview');
</script>



<script src="//ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<script src="//images.ptt.cc/bbs/v2.26/bbs.js"></script>

    </body>
</html>
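The request above can also be wrapped in a small helper that reports HTTP errors instead of crashing with a traceback. This is only a sketch, not the code used in this post: build_request and fetch are hypothetical helper names, the timeout value is an arbitrary choice, and certificate verification is left at urllib's default rather than disabled.

```python
import urllib.error
import urllib.request as request

USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36")


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries a browser-like User-Agent header."""
    return request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Return the decoded page body, or None if the server answers with an HTTP error."""
    req = build_request(url)
    try:
        with request.urlopen(req, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # For example 403 Forbidden when the server rejects the client.
        print("HTTP error:", err.code, err.reason)
        return None


req = build_request("https://www.ptt.cc/bbs/movie/index.html")
# urllib stores header names capitalized, hence "User-agent" here.
print(req.get_header("User-agent"))
```

Calling fetch(src) would perform the actual download. It is not called here, so the snippet runs without network access.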

3) Parse the HTML with the third-party package BeautifulSoup

# Fetch the HTML source of the PTT movie board
import ssl
import urllib.request as request

context = ssl._create_unverified_context()

src = 'https://www.ptt.cc/bbs/movie/index.html'
# Build a Request object with the header information attached
req = request.Request(src, headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36"
})

with request.urlopen(req, context=context) as response:
    data = response.read().decode("utf-8")

# Parse the source and get the title of every post
import bs4
root = bs4.BeautifulSoup(data, "html.parser")
# print(root.title.string)  # root.title is the <title> tag; root.title.string is the text inside it

# Find what distinguishes the data you want in the HTML, e.g. 霸王别姬 inside <div><a></a></div>
# titles = root.find("div", class_="title")  # find the first div tag with class="title"
# print(titles.a.string)  # prints the text of the <a> tag inside that first matching div

titles = root.find_all("div", class_="title")
for title in titles:
    # Skip entries that have no <a> tag (title.a is None)
    if title.a is not None:
        print(title.a.string)

result:

PS C:\Users\85380\Desktop\LearnPy> python .\test2.py
[新聞] 「終局之戰」、「亂世佳人」、「阿凡達」誰真正票房冠軍?
Re: [新聞] 凱文費奇透露《雷神索爾4》為何要拍女雷神
[新聞] 《復仇者4》驚見關史黛西!「就在蜘蛛人
Re: [新聞] 必備片單!帝國雜誌評選30年來30部經典代
Re: [新聞] 必備片單!帝國雜誌評選30年來30部經典代
[公告] 板規 2019/07/05
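The same find_all pattern can also collect each post's link along with its title. The sketch below runs on a small hand-written HTML fragment shaped like the output above, so it needs no network access. The second entry, which has no <a> tag, is a made-up stand-in for entries without a link, and extract_posts is a hypothetical helper name.

```python
from urllib.parse import urljoin

import bs4

BASE_URL = "https://www.ptt.cc/bbs/movie/index.html"

SAMPLE_HTML = """
<div class="r-ent">
  <div class="title">
    <a href="/bbs/movie/M.1559611458.A.DCA.html">[公告] 板規 2019/07/05</a>
  </div>
</div>
<div class="r-ent">
  <div class="title">(no link in this entry)</div>
</div>
"""


def extract_posts(html, base_url=BASE_URL):
    """Return (title, absolute_url) pairs for every div.title that contains a link."""
    root = bs4.BeautifulSoup(html, "html.parser")
    posts = []
    for div in root.find_all("div", class_="title"):
        link = div.a
        if link is None:  # entries without an <a> tag are skipped
            continue
        # href values on the page are site-relative, so resolve them against the page URL
        posts.append((link.get_text(strip=True), urljoin(base_url, link["href"])))
    return posts


posts = extract_posts(SAMPLE_HTML)
for title, url in posts:
    print(title, url)
```

Passing the downloaded data string to extract_posts instead of SAMPLE_HTML would apply the same logic to the live page.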