Python Beginner Notes (10): Scraping Data from the Web

12. Scraping Data from the Web
1) Project: mapIt.py using the webbrowser module
     The webbrowser module: webbrowser.open('url')
     test_1201.py

#! python3
# mapIt.py - launches a map in the browser using an address from the command line or clipboard.

import webbrowser, sys, pyperclip

if len(sys.argv) > 1:
    # Get address from command line.
    address = ' '.join(sys.argv[1:])
else:
    # Get address from clipboard.
    address = pyperclip.paste()

# Include the scheme so webbrowser.open() reliably treats this as a URL.
webbrowser.open('https://www.baidu.com/s?wd=' + address)

2) Downloading files with the requests module
     Download a web page: requests.get('url')
     Check the status code: res.status_code == requests.codes.ok
     Read the page text: res.text
     Check for errors: res.raise_for_status() (a short sketch follows)
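  A minimal sketch of both error checks, using example.com as a placeholder URL:

import requests

res = requests.get('https://www.example.com/')
# Option 1: compare the status code yourself.
if res.status_code == requests.codes.ok:
    print('Downloaded %s characters.' % len(res.text))
# Option 2: let requests raise an exception for 4xx/5xx responses.
res.raise_for_status()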
3) Saving downloaded files to disk
     Save the file with res.iter_content()
     Test program: test_1202.py

import requests

res = requests.get('https://www.csdn.net/')
try:
    res.raise_for_status()
except Exception as exc:
    print('There was a problem: %s' % (exc))

# Open the file in binary mode ('wb') to preserve the page's text encoding.
playFile = open('d:/temp/csdn.txt','wb')
for chunk in res.iter_content(100000):    # write in 100,000-byte chunks
    playFile.write(chunk)

playFile.close()
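  The same save loop can be written with a with statement, which closes the file automatically even if an exception occurs (behavior is otherwise identical):

import requests

res = requests.get('https://www.csdn.net/')
res.raise_for_status()

with open('d:/temp/csdn.txt', 'wb') as playFile:
    for chunk in res.iter_content(100000):
        playFile.write(chunk)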

4) HTML
    View the page's source code.
    Open the browser's developer tools and locate HTML elements.
    test_1203.html

<html>
    <body>
        <strong>hello </strong> world!
        Al's free <a href="https://www.csdn.net">Python books</a>
    </body>
</html>

5) Parsing HTML with the bs4 module
  The Beautiful Soup module (bs4) extracts information from HTML pages.
  test_1204.py

import bs4

exampleFile = open('test_1204.html')
exampleSoup = bs4.BeautifulSoup(exampleFile.read(), 'html.parser')
exampleFile.close()

# Select the element whose id is "author".
elems = exampleSoup.select('#author')
print('elem:' + str(elems[0]))
print('elem.text :' + elems[0].getText())
print('elem.attrs:' + str(elems[0].attrs))

# Select all <p> elements.
pelems = exampleSoup.select('p')
print('<p>0:' + str(pelems[0]))
print('<p>0-text: ' + pelems[0].getText())

print('<p>1:' + str(pelems[1]))
print('<p>2:' + str(pelems[2]))

  test_1204.html

<!-- This is the example.html example file. -->

<html>
<head><title>The Website Title</title></head>
<body>
    <p>Download my <strong>Python</strong> book from <a href="https://www.csdn.net">my website</a>.</p>
    <p class="slogan">Learn Python the easy way!</p>
    <p>By <span id="author">Al Sweigart</span></p>
</body>
</html>

    Create a BeautifulSoup object from HTML: bs4.BeautifulSoup()
    Find elements with the select() method: soup.select()
  a) Find <a> tags:
    print(soup.select('a')) # find by tag name
  b) Find by class name, e.g. class="sister":
    print(soup.select('.sister'))
  c) Find by id, e.g. id="link1":
    print(soup.select('#link1'))
  d) Special selectors:
    Select all <title> elements whose parent is <head>. Note that the selector is the single string 'head > title', not two strings 'head >' and 'title'.
    print(soup.select('head > title'))
  e) Get the text content:
    print(soup.select('title')[0].string)
    print(soup.select('title')[0].get_text())
  Get data from an element's attributes with elem.get(); a short sketch follows.
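  A short sketch of elem.get(), reusing the author <span> from test_1204.html:

import bs4

soup = bs4.BeautifulSoup('<span id="author">Al Sweigart</span>', 'html.parser')
spanElem = soup.select('span')[0]
print(spanElem.get('id'))            # 'author'
print(spanElem.get('nonexistent'))   # None for attributes the element lacks
print(spanElem.attrs)                # {'id': 'author'}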
6) Project: opening all search results
  Google is blocked here, so it only works through a proxy.
  Baidu has anti-scraping measures: searching directly on the site works, but fetching the same URL from a program returns a page without the result links.
  test_1205.py (currently not working)

#! python3
# searchpypi.py - Opens several search results.

import requests,sys,webbrowser,bs4
import logging
logging.basicConfig(level=logging.DEBUG,format='%(asctime)s - %(levelname)s - %(message)s')

searchUrl = 'https://www.baidu.com/s?wd=site:finance.sina.com.cn ' + ' '.join(sys.argv[1:])
logging.info('Search...' + searchUrl) # display text while downloading the search result page
webbrowser.open(searchUrl)

res = requests.get(searchUrl)
res.raise_for_status()

# Retrieve top search result links.
logging.debug('res len:' + str(len(res.text)))

pFile= open('d:/temp/res.html','wb')
for chunk in res.iter_content(10000):
    pFile.write(chunk)
pFile.close()

soup = bs4.BeautifulSoup(res.text,'html.parser')

# Open a browser tab for each result.
# Note: this selector is an assumption about Baidu's markup; a[href^="..."]
# matches links whose href starts with the given prefix. Adjust it to the real markup.
linkElems = soup.select('a[href^="https://www.baidu.com"]')

numOpen = min(5,len(linkElems))
logging.debug('numOpen:' + str(numOpen))


for i in range(numOpen):
    urlToOpen = linkElems[i].get('href')
    print('Open--',urlToOpen)
    webbrowser.open(urlToOpen)
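  One mitigation worth trying against Baidu's anti-scraping checks (with no guarantee it is enough) is sending a browser-like User-Agent header, which requests.get() accepts through its headers parameter:

import requests

# Example User-Agent value; any common browser string can be used here.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
res = requests.get('https://www.baidu.com/s?wd=python', headers=headers)
print(res.status_code, len(res.text))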

7) Project: downloading all XKCD comics
  This uses the XKCD Chinese mirror (xkcd.in); its page elements differ from the book's, so the code was adapted, and a limit of the first ten comic pages was added.
  test_1206.py

#! python3
# downloadXkcd.py - Downloads every single XKCD comic.

import requests,os,bs4,logging

os.chdir('d:/temp')

urlXkcd = 'https://xkcd.in/'            # starting url

url = urlXkcd
os.makedirs('d:/temp/xkcd',exist_ok = True)     # store comics in ./xkcd
for phn in range(10):       # limit the crawl to the first ten pages
    if url.endswith('#'):   # the final page's Next link points to '#'
        break
    # Download the page.
    print('No. %s Downloading page %s...' % (phn,url))
    res = requests.get(url)
    res.raise_for_status()
   
    print('res len:' + str(len(res.text)))

    # Debug: uncomment to save the raw page for inspection.
    #pFile = open('d:/temp/img_res.html','wb')
    #for chunk in res.iter_content(10000):
    #    pFile.write(chunk)
    #pFile.close()
  
    soup = bs4.BeautifulSoup(res.text,'html.parser')

    # Find the Url of the comic image.
    comicElem = soup.select('.comic-body a img')
    if comicElem == []:
        print('Could not find comic image.')
    else:
        comicUrl = urlXkcd + comicElem[0].get('src')
        # Download the image.
        print('Downloading image %s...' % (comicUrl))
        res = requests.get(comicUrl)
        res.raise_for_status()

        # Save the image to ./xkcd.
        imageFile = open(os.path.join('d:/temp/xkcd',os.path.basename(comicUrl)),'wb')

        for chunk in res.iter_content(100000):
            imageFile.write(chunk)
        imageFile.close()
        
    # Get the Next button's url
    nextLink = soup.select('.nextLink a')[0]
    url = urlXkcd + str(nextLink.get('href'))

print('Done.')
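  The comicUrl concatenation above assumes every img src is relative to the site root. If the site mixes relative and absolute src values, urllib.parse.urljoin from the standard library handles both:

from urllib.parse import urljoin

print(urljoin('https://xkcd.in/', '/comics/123.png'))
# -> https://xkcd.in/comics/123.png
print(urljoin('https://xkcd.in/', 'https://other.example/img.png'))
# -> https://other.example/img.png (absolute URLs pass through unchanged)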

8) Controlling the browser with the selenium module
  from selenium import webdriver
  browser = webdriver.Chrome() # use the Chrome browser
  browser.get('http://www.baidu.com')
  WebDriver methods for finding elements on the page, e.g.:
    find_element_by_class_name(name)
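  A minimal sketch (selenium 3 API; assumes ChromeDriver is installed and on the PATH):

from selenium import webdriver

browser = webdriver.Chrome()              # launches a visible Chrome window
browser.get('http://www.baidu.com')
# 'some-class' is a placeholder; inspect the page for a real class name.
elem = browser.find_element_by_class_name('some-class')
print(elem.text)
browser.quit()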
 
