Learning Web Scraping from Scratch: Using Beautiful Soup

Author: 励志成为生信高手

Last modified: 2023-12-27 21:46:26

First published: 2023-12-26 23:06:35

Article link: https://blog.csdn.net/qwesdqweds/article/details/135224209

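Basic knowledge about websites

Basic components of a website

A complete website usually has three elements: HTML, CSS, and JavaScript. They define the site's structure, its styling, and its interaction logic with the user, respectively. A simple way to think of them is as the skeleton, the clothes, and the movements.

Common HTML tags

Understanding how HTML is organized helps us locate the information we want when scraping. Copy the following into a text editor, save the file with the .html extension, and open it in a browser to get an extremely simple web page.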

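<!DOCTYPE HTML>  <!-- tells the browser this is an HTML document -->
<html>           <!-- root element -->
   <body>        <!-- body: the visible content -->
      <h1>hello world</h1>        <!-- h1 is a top-level heading -->
   </body>       <!-- the "/" tells the browser this element has ended, similar to a closing } -->
</html>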

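Here are some common tags:

<> opening tag

</> closing tag

<h1> level-1 heading

<h2> level-2 heading

and so on...

<p> paragraph tag

<br> line break tag (no closing tag)

<b> bold tag

<i> italic tag

<u> underline tag

<img> image tag (no closing tag)

<a> link tag

<div> and <span> are both generic container tags. <div> is a block-level element, so each <div> takes up a line of its own when rendered, while <span> is inline and stays on the same line as the content around it.

<ol> ordered list; each item inside it uses <li>

<table> table tag: <thead> is the table header, <tbody> is the table body, <tr> is a row, and <td> is an individual data cell

<table border='1'> adds a border to the table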

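The class attribute can appear on any tag and assigns the tag to a class. Spotting the pattern in a page's class names makes it much easier to scrape exactly what we want.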
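To make the point about class concrete, here is a minimal sketch of filtering tags by class with Beautiful Soup. The HTML fragment and its class names (gene, note) are invented for illustration and do not come from any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment: the class names are made up for this example.
html = """
<ul>
  <li class="gene">WRI1</li>
  <li class="note">not a gene</li>
  <li class="gene">FAD2</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# "class" is a reserved word in Python, so Beautiful Soup takes the keyword class_.
genes = [li.get_text(strip=True) for li in soup.find_all("li", class_="gene")]
print(genes)
```

Only the two items carrying class="gene" are kept; the item with a different class is skipped.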

A simple hands-on example to make this easier to understand:

<!DOCTYPE HTML>
<html>
   <head>
      <title>Learning Web Scraping from Scratch</title>
   </head>
   <body>
      <h1 style="background-color:red">hello world</h1>
      <h2>helloworld</h2>
      <p>hello<u>world</u></p>
      <p>hello<i>world</i></p>
      <p>hello<b>world</b></p>
      <p>hel<span style="background-color:violet">lo world</span></p>
      <img src="https://img1.baidu.com/it/u=2001447058,1445058726&fm=253&fmt=auto&app=120&f=JPEG?w=800&h=500"><br>
      <a href="https://blog.csdn.net/qwesdqweds?spm=1000.2115.3001.5343" target="_blank">励志成为生信高手's home page</a>
      <div>hello world</div><div>helloworld</div>
      <span>hello world</span><span>hello world</span>
      <ol>
         <li>h</li>
         <li>e</li>
         <li>l</li>
         <li>l</li>
         <li>o</li>
      </ol>
      <table border="10">
         <thead>
            <tr>
               <td>hello</td>
               <td>world</td>
            </tr>
         </thead>
         <tbody>
            <tr>
               <td>1</td>
               <td>2</td>
            </tr>
         </tbody>
      </table>
   </body>
</html>
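As a rough illustration of why knowing these tags matters for scraping, the sketch below parses a trimmed-down copy of the page above with Beautiful Soup and pulls out a few pieces by tag name. The variable names are my own, and the HTML is shortened for the example.

```python
from bs4 import BeautifulSoup

# A shortened version of the practice page, kept inline so nothing is downloaded.
html = """
<html>
  <head><title>Learning Web Scraping from Scratch</title></head>
  <body>
    <h1>hello world</h1>
    <a href="https://blog.csdn.net/qwesdqweds" target="_blank">home page</a>
    <ol><li>h</li><li>e</li><li>l</li><li>l</li><li>o</li></ol>
    <table>
      <thead><tr><td>hello</td><td>world</td></tr></thead>
      <tbody><tr><td>1</td><td>2</td></tr></tbody>
    </table>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

title = str(soup.title.string)                           # text inside <title>
heading = soup.h1.get_text()                             # text of the first <h1>
link = soup.a["href"]                                    # href attribute of the first <a>
letters = [li.get_text() for li in soup.find_all("li")]  # every list item
cells = [td.get_text() for td in soup.find_all("td")]    # every table cell, in document order

print(title, heading, link, letters, cells, sep="\n")
```

soup.h1 and soup.a return only the first matching tag, while find_all returns every match as a list.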

Beautiful Soup in practice: getting all transcripts of a gene from NCBI

Open a terminal and run:

pip install beautifulsoup4 requests

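The idea behind Beautiful Soup is simply to search the whole HTML document for the tag content you want. Some sites, however, deliberately make their structure very complicated to deter scraping, so becoming good at scraping takes a solid understanding of how pages are built.

Taking NCBI as the example, a gene page URL has the form https://www.ncbi.nlm.nih.gov/gene/ + gene ID

while a protein page has the form https://www.ncbi.nlm.nih.gov/protein/ + protein ID

So the basic plan is: first scrape the names (accession IDs) of all transcripts from the HTML of the gene page, then append each of them to the protein URL, and finally use a simple for loop to fetch every sequence.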
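The URL patterns above can be sketched as two small helpers. The function names gene_url and protein_url are hypothetical, and the protein ID used in the demo is a made-up placeholder rather than a real accession.

```python
NCBI_BASE = "https://www.ncbi.nlm.nih.gov"


def gene_url(gene_id):
    """Build the gene page URL from a gene ID (hypothetical helper)."""
    return f"{NCBI_BASE}/gene/{gene_id}"


def protein_url(protein_id):
    """Build the protein page URL from a protein ID (hypothetical helper)."""
    return f"{NCBI_BASE}/protein/{protein_id}"


print(gene_url(824599))

# Made-up placeholder IDs, only to show the loop over scraped names.
for pid in ["NP_000001.1", "NP_000002.1"]:
    print(protein_url(pid))
```

Keeping URL construction in one place means the scraping loop only has to deal with IDs.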

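Take the Arabidopsis thaliana WRI1 gene as an example: https://www.ncbi.nlm.nih.gov/gene/824599

Right-click the page and choose Inspect, then click the small arrow icon in the upper-left corner of the developer tools.

Point it at one of the transcripts and you will see that it is a link, that is, an <a> tag.

So the plan is to scrape every <a> tag and then use Python's string filtering to keep only the IDs we want (here, the ones beginning with NP, the prefix of RefSeq protein accessions). The code is as follows: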

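from bs4 import BeautifulSoup
import requests

headers = {
    # Placeholder: the author prefers not to share their own User-Agent;
    # see the first article in this series for how to find yours.
    'User-Agent': 'YOUR_USER_AGENT',
}
a = requests.get('https://www.ncbi.nlm.nih.gov/gene/824599', headers=headers).text
soup = BeautifulSoup(a, 'html.parser')
allprname = soup.find_all('a')  # collect every <a> tag
for i in allprname:
    pro = str(i.string)         # link text ('None' if the tag has no single string)
    if pro.startswith('NP'):    # keep every ID that starts with 'NP'
        print(pro)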

This prints all the IDs; simply store them in a list.
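One possible way to store the IDs in a list is sketched below. It runs on a small invented HTML fragment instead of the live NCBI page, the accessions in it are made up, and the helper name extract_protein_ids is hypothetical. It also skips duplicates, which is my own addition in case the same link appears more than once on a page.

```python
from bs4 import BeautifulSoup


def extract_protein_ids(html):
    """Return the link texts that start with 'NP_', in page order, without duplicates."""
    soup = BeautifulSoup(html, "html.parser")
    ids = []
    for a in soup.find_all("a"):
        text = a.get_text(strip=True)
        if text.startswith("NP_") and text not in ids:
            ids.append(text)
    return ids


# Invented fragment with made-up accessions, standing in for a gene page.
sample = """
<a href="/protein/NP_000001.1">NP_000001.1</a>
<a href="/nuccore/NM_000001.1">NM_000001.1</a>
<a href="/protein/NP_000002.1">NP_000002.1</a>
<a href="/protein/NP_000001.1">NP_000001.1</a>
<a href="/help">Help</a>
"""

proid = extract_protein_ids(sample)
print(proid)
```

The NM_ link and the Help link are filtered out, and the repeated NP_ link is kept only once.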

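Two hours later, the author found that requests has a limitation: NCBI's page source is further processed by JavaScript, which requests does not run. After some reading, the author learned that the selenium library can drive the browser installed on your computer to do the scraping. It needs one extra piece of software, ChromeDriver, which can be downloaded from the "Chrome for Testing availability" page.

Download the version that matches your Chrome browser.

Using the IDs scraped above, open the URL of any one of them; appending ?report=fasta to it shows the protein sequence directly. Inspecting the page again shows that the FASTA text is stored in a <pre> tag.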
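Before driving a real browser, the <pre> extraction step can be tried offline. The sketch below assumes the rendered page contains a single <pre> holding one FASTA record; the sample HTML, the accession, and the sequence are all made up, and fasta_from_html is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def fasta_from_html(html):
    """Return (header, sequence) from the first <pre> tag, or None if no record is found."""
    soup = BeautifulSoup(html, "html.parser")
    pre = soup.find("pre")
    if pre is None:
        return None
    # Drop blank lines and surrounding whitespace.
    lines = [ln.strip() for ln in pre.get_text().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        return None
    header = lines[0][1:]            # header line without the leading '>'
    sequence = "".join(lines[1:])    # sequence lines joined into one string
    return header, sequence


# Made-up record; '&gt;' is the HTML escape for the '>' that starts a FASTA header.
sample = "<div><pre>&gt;NP_000001.1 made-up protein\nMKTAYIAK\nQRQISFVK\n</pre></div>"

print(fasta_from_html(sample))
```

Returning None for a missing or empty <pre> makes it easy to notice pages that were read before the sequence had loaded.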

The full code is as follows:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
import requests
import time

headers = {
    'User-Agent': 'YOUR_USER_AGENT',  # placeholder: put your own User-Agent string here
}
# configure ChromeService
chrome_path = 'your/install/path/chromedriver.exe'
chrome_service = ChromeService(chrome_path)
driver = webdriver.Chrome(service=chrome_service)
# get the protein IDs
a = requests.get('https://www.ncbi.nlm.nih.gov/gene/824599', headers=headers).text
soup = BeautifulSoup(a, 'html.parser')
allprname = soup.find_all('a')
proid = []
for i in allprname:
    pro = str(i.string)
    if pro.startswith('NP'):
        proid.append(pro)
print(proid)
# get the FASTA records
for i in proid:
    driver.get(f'https://www.ncbi.nlm.nih.gov/protein/{i}?report=fasta')
    # Pause so the JavaScript can fill in the page before it is read. Adjust to
    # your network speed; when set too low, the author got blocked by NCBI.
    # (driver.implicitly_wait() only affects element lookups, so it would not pause here.)
    time.sleep(10)
    page_source = driver.page_source
    soup = BeautifulSoup(page_source, 'html.parser')
    fasta = soup.find_all('pre')
    for tag in fasta:
        print(tag.text)
driver.quit()  # close the browser when finished

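Done. From here, as needed, put all the gene names in a text file and have Python read them and run the steps above one at a time. With hundreds or thousands of genes this may be very slow, though, so next time I will learn how to use multithreading.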
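Reading the genes from a text file could look roughly like this. The sketch assumes one gene ID per line; the helper name read_gene_ids, the file name geneid.txt, and the second ID in the demo are all made up, and the demo writes to a temporary directory so it does not touch real files.

```python
import os
import tempfile


def read_gene_ids(path):
    """Read one gene ID per line, ignoring blank lines and surrounding whitespace."""
    with open(path, "r", encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


# Demo: write a small throwaway file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "geneid.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("824599\n\n  123456 \n")
    gene_ids = read_gene_ids(demo_path)

print(gene_ids)
```

Each ID in the returned list can then be passed through the scraping steps in a for loop.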

------------------------------------------

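With a few small changes, the script can now read gene IDs in bulk, and the occasional blank read has been fixed. Put the gene IDs (one per line) in a file in the directory you run the code from, name it geneid.fasta, and it will be read correctly.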

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import requests
import os

headers = {
    'User-Agent': 'YOUR_USER_AGENT',  # placeholder: put your own User-Agent string here
}
# configure ChromeService
current_directory = os.getcwd()
chrome_path = 'your/path/chromedriver.exe'
chrome_options = Options()
chrome_options.add_argument('--headless')     # run Chrome without opening a window
chrome_options.add_argument('--disable-gpu')
chrome_service = ChromeService(chrome_path)
driver = webdriver.Chrome(service=chrome_service, options=chrome_options)
# read the gene IDs, one per line, skipping blank lines
with open(os.path.join(current_directory, 'geneid.fasta'), 'r') as geneid:
    lines_stripped = [line.strip() for line in geneid if line.strip()]
# get the protein IDs of each gene
for j in lines_stripped:
    a = requests.get(f'https://www.ncbi.nlm.nih.gov/gene/{j}', headers=headers).text
    soup = BeautifulSoup(a, 'html.parser')
    allprname = soup.find_all('a')
    proid = []
    for i in allprname:
        pro = str(i.string)
        if pro.startswith('NP'):
            proid.append(pro)
    print(proid)
    # get the FASTA records
    for i in proid:
        driver.get(f'https://www.ncbi.nlm.nih.gov/protein/{i}?report=fasta')
        # wait up to 10 seconds for a <pre> tag to appear before reading the page
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, 'pre')))
        page_source = driver.page_source
        soup = BeautifulSoup(page_source, 'html.parser')
        fasta = soup.find_all('pre')
        for tag in fasta:
            print(tag.text)
driver.quit()  # close the browser when finished