A Little Life of Web Scraping with Python

KUBET9

Tags: python, programming language

Article link: https://blog.csdn.net/KUBET9/article/details/139503289

First, start Jupyter Notebook

In CMD (the Windows Command Prompt), enter: jupyter notebook

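The Jupyter Notebook page then opens automatically in your browser

Click New in the upper-right corner and choose Python 3 to create a new notebook

From here you can start writing Python code (press Ctrl + Enter to run the current cell)

The first step in learning any programming language: Hello world!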
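
A minimal sketch of such a first cell, assuming a fresh Python 3 notebook; the variable name greeting is purely illustrative:

```python
# A first notebook cell: store a message and print it.
greeting = "Hello world!"
print(greeting)
```

Pressing Ctrl + Enter on this cell should show the message directly below it.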

Import the Python packages

import requests
from bs4 import BeautifulSoup

Download the web page with a GET request

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.ptt.cc/bbs/MobileComm/index.html")  # GET the HTML of this page
print(r.text)  # print the HTML
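
A request can fail or hang, so it may be worth checking the response before printing it. The sketch below is one possible way to do that and is not part of the original walkthrough: fetch_html is a hypothetical helper name, and the 10-second timeout is an arbitrary assumption. It relies on the timeout argument of requests.get and on Response.raise_for_status, which raises an exception for 4xx and 5xx status codes.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML text of url, raising on network or HTTP errors."""
    # timeout (in seconds) keeps the call from waiting forever on a stalled server
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError if the server answered with a 4xx or 5xx status
    response.raise_for_status()
    return response.text
```

Calling fetch_html("https://www.ptt.cc/bbs/MobileComm/index.html") would then return the same text as r.text, or raise an exception if the request fails.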

Parse the downloaded data with Beautiful Soup 4's HTML parser

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.ptt.cc/bbs/MobileComm/index.html")  # GET the page
soup = BeautifulSoup(r.text, "html.parser")  # parse the page with html.parser
sel = soup.select("div.title a")  # select the <a> tags inside <div class="title"></div> and store them in sel

I want to select the article titles on the page, which is why the selector passed to soup.select is div.title a

<div class="title">
    <a href="/bbs/MobileComm/M.1539248247.A.3CF.html">[Question] Pixel3 / XR / XZ3: which one to choose?</a>
</div>
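
To see what the selector matches without touching the network, the markup can be parsed from a string. This is a small sketch under assumptions: sample_html is hand-written for illustration, its title text is shortened, and the second div (which has no link) is a hypothetical entry rather than something copied from the site.

```python
from bs4 import BeautifulSoup

# Hand-written sample; the second <div> is a hypothetical entry without a link.
sample_html = """
<div class="title">
    <a href="/bbs/MobileComm/M.1539248247.A.3CF.html">[Question] Pixel3 / XR / XZ3</a>
</div>
<div class="title">
    (no link here)
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# "div.title a" matches <a> tags that are descendants of <div class="title">,
# so a title div that contains no <a> contributes nothing to the result.
links = soup.select("div.title a")

for link in links:
    # link["href"] reads the attribute; get_text(strip=True) trims whitespace
    print(link["href"], link.get_text(strip=True))
```

Only the first div yields a result here, since the selector asks for the a tag itself and not for the enclosing div.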

Finally, write a loop that prints the scraped article titles

for s in sel:
    print(s["href"], s.text)
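
The href values printed by this loop are site-relative paths such as /bbs/MobileComm/M.1539248247.A.3CF.html. If full links are wanted, one option is urljoin from the standard library. The sketch below assumes the base URL is the index page fetched earlier; to_absolute is a hypothetical helper name.

```python
from urllib.parse import urljoin

# Assumed base: the index page the links were scraped from.
BASE_URL = "https://www.ptt.cc/bbs/MobileComm/index.html"


def to_absolute(href, base=BASE_URL):
    """Resolve a relative href against the page it was found on."""
    return urljoin(base, href)


print(to_absolute("/bbs/MobileComm/M.1539248247.A.3CF.html"))
# prints https://www.ptt.cc/bbs/MobileComm/M.1539248247.A.3CF.html
```

Inside the loop this would amount to printing to_absolute(s["href"]) in place of s["href"].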

The complete code and demo

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.ptt.cc/bbs/MobileComm/index.html")  # GET the page
soup = BeautifulSoup(r.text, "html.parser")  # parse the page with html.parser
sel = soup.select("div.title a")  # select the <a> tags inside <div class="title"></div> and store them in sel
for s in sel:
    print(s["href"], s.text)