Some basics of web scraping

This post collects some basic Python web-scraping usage: parsing pages with re.findall, urlopen, and BeautifulSoup, plus the basics of the Scrapy framework. It also goes over fundamental regular-expression concepts, mentions making HTTP requests (GET and POST) with the requests library, and shows how to grab links, images, and other data from pages with BeautifulSoup and Scrapy.

Some study notes, tidied up here.

Quick summary:

【1】re.findall is what you use with import re; find_all is the BeautifulSoup method.
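
A minimal sketch of the difference, on a made-up HTML snippet:

import re
from bs4 import BeautifulSoup

html = '<p class="intro">hello</p><p>world</p>'        # made-up sample

# re.findall works on the raw text
print(re.findall(r'<p.*?>(.*?)</p>', html))            # ['hello', 'world']

# find_all works on the parsed tree
soup = BeautifulSoup(html, features='lxml')
print([p.get_text() for p in soup.find_all('p')])      # ['hello', 'world']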

【2】Open the page with urlopen (right-click the page and choose View Source to see the same HTML):

        html = urlopen("https://morvanzhou.github.io/static/scraping/list.html").read().decode('utf-8')

        # run the page through BeautifulSoup to "process" it (not going to dig into the details)

         soup = BeautifulSoup(html, features='lxml')

         # You can also use select syntax here: ('tag'), ('tag > child'), ('.class'), ('#id'), ('tag #id'); a short sketch follows this item
         img_links = soup.find_all("img", {"src": re.compile(r'.*?\.jpg')})
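
A minimal sketch of that select syntax, run against the same list.html page used in example (3) below:

from bs4 import BeautifulSoup
from urllib.request import urlopen

html = urlopen("https://morvanzhou.github.io/static/scraping/list.html").read().decode('utf-8')
soup = BeautifulSoup(html, features='lxml')

print(soup.select('li.month'))       # tag.class
print(soup.select('ul.jan > li'))    # parent > direct child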

【3】Scrapy feels harder to get started with; for now, the first two tools seem sufficient.
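
Scrapy itself isn't demonstrated in this post; for reference, a minimal spider sketch (the class name and output field are illustrative, not from the post):

import scrapy

class TitleSpider(scrapy.Spider):
    name = 'title_demo'          # illustrative name
    start_urls = ['https://morvanzhou.github.io/static/scraping/basic-structure.html']

    def parse(self, response):
        # response.css works much like BeautifulSoup's select()
        yield {'title': response.css('title::text').get()}

# run with: scrapy runspider <this_file>.py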

【4】On regular expressions: import re

        . matches any character      * is a quantifier (zero or more of the preceding)      ? makes the preceding item optional      .*? matches anything, non-greedily      in (.*?\.jpg) the parenthesized part is what gets captured as output
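
A quick demo of greedy vs. non-greedy matching, on a made-up sample string:

import re

s = '<img src="a.jpg"> <img src="b.jpg">'     # made-up sample
print(re.findall(r'src="(.*?\.jpg)"', s))     # non-greedy: ['a.jpg', 'b.jpg']
print(re.findall(r'src="(.*\.jpg)"', s))      # greedy: ['a.jpg"> <img src="b.jpg']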

 

(1)

from urllib.request import urlopen
# if the page contains Chinese, apply decode('utf-8')
html = urlopen("https://morvanzhou.github.io/static/scraping/basic-structure.html").read().decode('utf-8')
print(html)


import re
res = re.findall(r"<title>(.+?)</title>", html)
print("\nPage title is: ", res[0])
# Page title is:  Scraping tutorial 1 | 莫烦Python


res = re.findall(r"<p>(.*?)</p>", html, flags=re.DOTALL)    # re.DOTALL lets . also match newlines
print("\nPage paragraph is: ", res[0])
# Page paragraph is:
#     这是一个在 <a href="https://morvanzhou.github.io/">莫烦Python</a>
#     <a href="https://morvanzhou.github.io/tutorials/scraping">爬虫教程</a> 中的简单测试.


res = re.findall(r'href="(.*?)"', html)
print("\nAll links: ", res)
# All links:  ['https://morvanzhou.github.io/static/img/description/tab_icon.png', 'https://morvanzhou.github.io/', 'https://morvanzhou.github.io/tutorials/scraping']
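
The summary above also mentions the requests library for GET and POST; a minimal sketch (httpbin.org is a public echo service, used here only for illustration):

import requests

r = requests.get("https://morvanzhou.github.io/static/scraping/basic-structure.html")
r.encoding = 'utf-8'             # plays the role of decode('utf-8') above
print(r.text[:30])

r = requests.post("https://httpbin.org/post", data={'name': 'test'})
print(r.json()['form'])          # {'name': 'test'}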


(2)

from bs4 import BeautifulSoup
from urllib.request import urlopen

# if the page contains Chinese, apply decode('utf-8')
html = urlopen("https://morvanzhou.github.io/static/scraping/basic-structure.html").read().decode('utf-8')

soup = BeautifulSoup(html, features='lxml')
print(soup.h1)
print('\n', soup.p)

all_href = soup.find_all('a')   # find elements by tag name

all_href = [l['href'] for l in all_href]
print('\n', all_href)
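
The extracted href values may be relative paths; a sketch of resolving them against the page URL (urljoin is from the standard library and isn't used in the original snippet):

from urllib.parse import urljoin
from urllib.request import urlopen
from bs4 import BeautifulSoup

base = "https://morvanzhou.github.io/static/scraping/basic-structure.html"
soup = BeautifulSoup(urlopen(base).read().decode('utf-8'), features='lxml')
print([urljoin(base, a['href']) for a in soup.find_all('a')])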

(3)

from bs4 import BeautifulSoup
from urllib.request import urlopen

# if the page contains Chinese, apply decode('utf-8')
html = urlopen("https://morvanzhou.github.io/static/scraping/list.html").read().decode('utf-8')

soup = BeautifulSoup(html, features='lxml')
# e.g. <div class="list-group has-bdb edit-btn-box">
month = soup.find_all('li', { "class": "month"})

for m in month:
    print(m.get_text())


jan = soup.find('ul', { "class": 'jan'})
d_jan = jan.find_all('li')              # use jan as a parent
for d in d_jan:
    print(d.get_text())
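
As an aside, find_all also accepts a class_ keyword instead of the attribute dict; a sketch on the same page:

from bs4 import BeautifulSoup
from urllib.request import urlopen

html = urlopen("https://morvanzhou.github.io/static/scraping/list.html").read().decode('utf-8')
soup = BeautifulSoup(html, features='lxml')

# class_ keyword form; same result as find_all('li', {"class": "month"})
for m in soup.find_all('li', class_='month'):
    print(m.get_text())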

(4)

from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

# if the page contains Chinese, apply decode('utf-8')
html = urlopen("https://morvanzhou.github.io/static/scraping/table.html").read().decode('utf-8')

soup = BeautifulSoup(html, features='lxml')

img_links = soup.find_all("img", {"src": re.compile(r'.*?\.jpg')})
for link in img_links:
    print(link['src'])
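
A natural follow-up is saving the matched images; a sketch using the standard library's urlretrieve, saving each file under its own base name:

import os
import re
from urllib.request import urlopen, urlretrieve
from bs4 import BeautifulSoup

html = urlopen("https://morvanzhou.github.io/static/scraping/table.html").read().decode('utf-8')
soup = BeautifulSoup(html, features='lxml')

for link in soup.find_all("img", {"src": re.compile(r'.*?\.jpg')}):
    url = link['src']
    urlretrieve(url, os.path.basename(url))    # save in the current directory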