Some basics of web scraping

This post collects some basic Python web-scraping usage: parsing pages with re.findall, urlopen, and BeautifulSoup, plus the basics of the Scrapy framework. It also goes over fundamental regular-expression concepts, mentions making HTTP requests (GET and POST) with the requests library, and shows how to grab links, images, and other data from pages with BeautifulSoup and Scrapy.

Some study notes, tidied up here.

Quick summary:

【1】re.findall is what you use with import re; find_all is the BeautifulSoup method.
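
A minimal sketch of the difference, on a made-up HTML snippet:

import re
from bs4 import BeautifulSoup

html = '<p class="intro">hello</p><p>world</p>'        # made-up sample

# re.findall works on the raw text
print(re.findall(r'<p.*?>(.*?)</p>', html))            # ['hello', 'world']

# find_all works on the parsed tree
soup = BeautifulSoup(html, features='lxml')
print([p.get_text() for p in soup.find_all('p')])      # ['hello', 'world']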

【2】Open the page with urlopen (right-click the page and choose View Source to see the same HTML):

        html = urlopen("https://morvanzhou.github.io/static/scraping/list.html").read().decode('utf-8')

        # run the page through BeautifulSoup to "process" it (not going to dig into the details)

         soup = BeautifulSoup(html, features='lxml')

         # You can also use select syntax here: ('tag'), ('tag > child'), ('.class'), ('#id'), ('tag #id'); a short sketch follows this item
         img_links = soup.find_all("img", {"src": re.compile(r'.*?\.jpg')})
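
A minimal sketch of that select syntax, run against the same list.html page used in example (3) below:

from bs4 import BeautifulSoup
from urllib.request import urlopen

html = urlopen("https://morvanzhou.github.io/static/scraping/list.html").read().decode('utf-8')
soup = BeautifulSoup(html, features='lxml')

print(soup.select('li.month'))       # tag.class
print(soup.select('ul.jan > li'))    # parent > direct child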

【3】Scrapy feels harder to get started with; for now, the first two tools seem sufficient.
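
Scrapy itself isn't demonstrated in this post; for reference, a minimal spider sketch (the class name and output field are illustrative, not from the post):

import scrapy

class TitleSpider(scrapy.Spider):
    name = 'title_demo'          # illustrative name
    start_urls = ['https://morvanzhou.github.io/static/scraping/basic-structure.html']

    def parse(self, response):
        # response.css works much like BeautifulSoup's select()
        yield {'title': response.css('title::text').get()}

# run with: scrapy runspider <this_file>.py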

【4】On regular expressions: import re

        . matches any character      * is a quantifier (zero or more of the preceding)      ? makes the preceding item optional      .*? matches anything, non-greedily      in (.*?\.jpg) the parenthesized part is what gets captured as output
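
A quick demo of greedy vs. non-greedy matching, on a made-up sample string:

import re

s = '<img src="a.jpg"> <img src="b.jpg">'     # made-up sample
print(re.findall(r'src="(.*?\.jpg)"', s))     # non-greedy: ['a.jpg', 'b.jpg']
print(re.findall(r'src="(.*\.jpg)"', s))      # greedy: ['a.jpg"> <img src="b.jpg']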

 

(1)

from urllib.request import urlopen
# if the page contains Chinese, apply decode('utf-8')
html = urlopen("https://morvanzhou.github.io/static/scraping/basic-structure.html").read().decode('utf-8')
print(html)


import re
res = re.findall(r"<title>(.+?)</title>", html)
print("\nPage title is: ", res[0])
# Page title is:  Scraping tutorial 1 | 莫烦Python


res = re.findall(r"<p>(.*?)</p>", html, flags=re.DOTALL)    # re.DOTALL lets . also match newlines
print("\nPage paragraph is: ", res[0])
# Page paragraph is:
#     这是一个在 <a href="https://morvanzhou.github.io/">莫烦Python</a>
#     <a href="https://morvanzhou.github.io/tutorials/scraping">爬虫教程</a> 中的简单测试.


res = re.findall(r'href="(.*?)"', html)
print("\nAll links: ", res)
# All links:  ['https://morvanzhou.github.io/static/img/description/tab_icon.png', 'https://morvanzhou.github.io/', 'https://morvanzhou.github.io/tutorials/scraping']
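
The summary above also mentions the requests library for GET and POST; a minimal sketch (httpbin.org is a public echo service, used here only for illustration):

import requests

r = requests.get("https://morvanzhou.github.io/static/scraping/basic-structure.html")
r.encoding = 'utf-8'             # plays the role of decode('utf-8') above
print(r.text[:30])

r = requests.post("https://httpbin.org/post", data={'name': 'test'})
print(r.json()['form'])          # {'name': 'test'}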


(2)

from bs4 import BeautifulSoup
from urllib.request import urlopen

# if the page contains Chinese, apply decode('utf-8')
html = urlopen("https://morvanzhou.github.io/static/scraping/basic-structure.html").read().decode('utf-8')

soup = BeautifulSoup(html, features='lxml')
print(soup.h1)
print('\n', soup.p)

all_href = soup.find_all('a')   # find elements by tag name

all_href = [l['href'] for l in all_href]
print('\n', all_href)
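
The extracted href values may be relative paths; a sketch of resolving them against the page URL (urljoin is from the standard library and isn't used in the original snippet):

from urllib.parse import urljoin
from urllib.request import urlopen
from bs4 import BeautifulSoup

base = "https://morvanzhou.github.io/static/scraping/basic-structure.html"
soup = BeautifulSoup(urlopen(base).read().decode('utf-8'), features='lxml')
print([urljoin(base, a['href']) for a in soup.find_all('a')])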

(3)

from bs4 import BeautifulSoup
from urllib.request import urlopen

# if the page contains Chinese, apply decode('utf-8')
html = urlopen("https://morvanzhou.github.io/static/scraping/list.html").read().decode('utf-8')

soup = BeautifulSoup(html, features='lxml')
# e.g. <div class="list-group has-bdb edit-btn-box">
month = soup.find_all('li', { "class": "month"})

for m in month:
    print(m.get_text())


jan = soup.find('ul', { "class": 'jan'})
d_jan = jan.find_all('li')              # use jan as a parent
for d in d_jan:
    print(d.get_text())
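
As an aside, find_all also accepts a class_ keyword instead of the attribute dict; a sketch on the same page:

from bs4 import BeautifulSoup
from urllib.request import urlopen

html = urlopen("https://morvanzhou.github.io/static/scraping/list.html").read().decode('utf-8')
soup = BeautifulSoup(html, features='lxml')

# class_ keyword form; same result as find_all('li', {"class": "month"})
for m in soup.find_all('li', class_='month'):
    print(m.get_text())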

(4)

from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

# if the page contains Chinese, apply decode('utf-8')
html = urlopen("https://morvanzhou.github.io/static/scraping/table.html").read().decode('utf-8')

soup = BeautifulSoup(html, features='lxml')

img_links = soup.find_all("img", {"src": re.compile(r'.*?\.jpg')})
for link in img_links:
    print(link['src'])
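
A natural follow-up is saving the matched images; a sketch using the standard library's urlretrieve, saving each file under its own base name:

import os
import re
from urllib.request import urlopen, urlretrieve
from bs4 import BeautifulSoup

html = urlopen("https://morvanzhou.github.io/static/scraping/table.html").read().decode('utf-8')
soup = BeautifulSoup(html, features='lxml')

for link in soup.find_all("img", {"src": re.compile(r'.*?\.jpg')}):
    url = link['src']
    urlretrieve(url, os.path.basename(url))    # save in the current directory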