Lab 5: Web Scraping Basics

胡说八道家

Article link: https://blog.csdn.net/m0_59088506/article/details/129695512

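I. Objectives

Learn the functions commonly used for web scraping.

Understand the overall approach to writing a scraper.

II. Environment

Operating system: Windows

Main software: Jupyter Notebook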

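III. Tasks

The ChEMBL ID of aspirin is CHEMBL25. Write a program that scrapes aspirin's English name, molecular weight, and SMILES string, and downloads the image of its molecular structure.

URL: https://www.ebi.ac.uk/chembl/compound_report_card/CHEMBL25/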

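Approach:

(1) Locate the information to scrape on the rendered page.

(2) View the page source and find where that information appears in the HTML.

(3) Fetch the HTML programmatically.

(4) Match each piece of information with a regular expression.

(5) Print the matched results.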
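
Steps (3) to (5) can be sketched on a small stand-in page before pointing the patterns at the real site. The snippet below is only an illustration: `SAMPLE_HTML`, `PATTERNS`, and `extract_fields` are hypothetical names, and the markup is invented for the example, so the real ChEMBL page source will need its own patterns.

```python
import re

# Hypothetical HTML standing in for a downloaded page. The real ChEMBL
# markup differs, so these labels and patterns are illustrative only.
SAMPLE_HTML = """
<div id="name">Name: ASPIRIN</div>
<div id="weight">Molecular Weight: 180.16</div>
<div id="smiles">SMILES: CC(=O)Oc1ccccc1C(=O)O</div>
<img src="https://example.org/images/CHEMBL25.svg">
"""

# One pattern per field; each has a single capture group for the value.
PATTERNS = {
    "name": r"Name:\s*([A-Z][A-Z0-9 \-]*)",
    "molecular_weight": r"Molecular Weight:\s*(-?\d+(?:\.\d+)?)",
    "smiles": r"SMILES:\s*([^\s<]+)",
    "image_url": r'<img\s+src="(https?://[^"]+)"',
}


def extract_fields(html):
    """Return {field: matched text, or None when the pattern finds nothing}."""
    result = {}
    for field, pattern in PATTERNS.items():
        match = re.search(pattern, html)
        result[field] = match.group(1) if match else None
    return result
```

Keeping the patterns in one dictionary makes it easy to adjust a single field after inspecting the actual page source, and returning `None` for a miss avoids calling `.group()` on a failed match.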

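Sample output:

Write a program that searches the PubMed database for the keyword "cancer" and scrapes the number of articles found, together with the PubMed ID and title of the first 10 articles.

URL: https://pubmed.ncbi.nlm.nih.gov/

Hint: use the urllib.request.urlopen function to request the page, and req.read().decode() to read the page source.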
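
The hint can be wrapped in two small helpers, shown here as a minimal sketch. `build_search_url` and `fetch_html` are hypothetical names; the `term` query parameter is the one used in the search URL later in this post, and the 30-second timeout is an arbitrary assumption.

```python
import urllib.parse
import urllib.request

PUBMED_URL = "https://pubmed.ncbi.nlm.nih.gov/"


def build_search_url(term):
    """Append the search keyword as the `term` query parameter."""
    return PUBMED_URL + "?" + urllib.parse.urlencode({"term": term})


def fetch_html(url, timeout=30):
    """Open the URL and decode the response body into a string."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        # Prefer the charset declared in the response, else assume UTF-8.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset)
```

Calling `fetch_html(build_search_url("cancer"))` would then return the source of the search page used in this task, provided the site accepts the request.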

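Sample output:

IV. Lab Report

1. Write the lab report in a Jupyter Notebook document, then export and submit it as a PDF file.

File naming rule: "day of the week + student ID + name + 实验5.pdf" (实验5 means Lab 5).

2. Record the experimental steps and results.

3. Record the problems encountered during the lab and how they were solved.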

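import re

import requests

# Task 1: scrape the ChEMBL compound report card for aspirin (CHEMBL25).
# The patterns below depend on the page source as it was when this was written.
url = "https://www.ebi.ac.uk/chembl/compound_report_card/CHEMBL25/"
req = requests.get(url, timeout=100)
req.encoding = "utf-8"
html = req.text

# English name: first run of capital letters after a lowercase key and a quote or space.
line1 = re.search(r'(["a-z:]+[" ]+)([A-Z]+)', html)
if line1 is not None:
    print("English name:", line1.group(2))

# Molecular weight: first number with one or two decimals that follows a key.
line2 = re.search(r'([A-Za-z":]+\s)((\-)?\d+(\.\d{1,2}))', html)
if line2 is not None:
    print("Molecular weight:", line2.group(2))

# SMILES: intended to capture the string that follows the "smiles" key.
line3 = re.search(r'([smile:"]{8}\s\S*\s*\S)([A-Za-z0-9=()]+)', html)
if line3 is not None:
    print("SMILES:", line3.group(2))

# Structure image: intended to capture the URL that follows the "image" key.
line4 = re.search(r'([image:" ]{10})(((http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?))', html)
if line4 is not None:
    img = line4.group(2)
    try:
        pic = requests.get(img, timeout=100)
    except requests.exceptions.RequestException:
        print("Download failed")
    else:
        # Write the file only when the request succeeded, so pic is always defined here.
        fileurl = "C:/Users/Polo/Desktop/python 高级编程/实验五下载的图片/" + img.split('/')[-1] + ".svg"
        with open(fileurl, 'wb') as fp:
            fp.write(pic.content)
        print("Image downloaded!")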
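
The image-saving step above builds its output path by string concatenation into a fixed folder. A more portable variant might look like the sketch below. It is not part of the original solution: `filename_from_url` and `save_bytes` are hypothetical helpers, and the target directory is left to the caller.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def filename_from_url(url, default="download"):
    """Use the last path segment of the URL as the file name."""
    name = PurePosixPath(urlparse(url).path).name
    return name or default


def save_bytes(content, directory, filename):
    """Write bytes to directory/filename, creating the directory if needed."""
    target = Path(directory) / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(content)
    return target
```

With these helpers, the downloaded response body could be stored with `save_bytes(pic.content, some_directory, filename_from_url(img))`, where `some_directory` is whatever folder suits the machine running the notebook.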

import urllib.request

import parsel

# Task 2: search PubMed for "cancer".
url = 'https://pubmed.ncbi.nlm.nih.gov/?term=cancer'
req = urllib.request.urlopen(url)
html = req.read().decode()
webtext = parsel.Selector(text=html)

# Total number of hits; remove the thousands separators.
result = webtext.css('.value::text').get(default='').replace(",", "")
print("Number of papers matching 'cancer':", result)

# PubMed ID and title of the first 10 results.
articles = webtext.css('.full-docsum')
for article in articles[:10]:
    pmid = article.css('.docsum-pmid ::text').get()
    title = ''.join(article.css('.docsum-title ::text').getall()).strip()
    print(pmid, title)
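
If parsel is not installed, the same extraction can be approximated with the standard library's `html.parser`. The sketch below is an alternative technique, not the method used above. It runs on a hypothetical, simplified snippet (`SAMPLE_HTML`) that reuses the class names from the script, and it assumes each title appears before its PMID and that the captured elements contain no void tags such as `<br>`.

```python
from html.parser import HTMLParser

# Hypothetical, simplified markup reusing the class names from the script
# above; the live PubMed page is considerably more complex.
SAMPLE_HTML = """
<span class="value">1,234</span>
<article class="full-docsum">
  <a class="docsum-title" href="/111/">First <b>cancer</b> study.</a>
  <span class="docsum-pmid">111</span>
</article>
<article class="full-docsum">
  <a class="docsum-title" href="/222/">Second study.</a>
  <span class="docsum-pmid">222</span>
</article>
"""

FIELDS = {"value": "count", "docsum-title": "title", "docsum-pmid": "pmid"}


class DocsumParser(HTMLParser):
    """Collect the result count and (PMID, title) pairs."""

    def __init__(self):
        super().__init__()
        self.count = None
        self.records = []
        self._field = None  # field currently being captured
        self._depth = 0     # nesting depth inside the captured element
        self._text = []
        self._title = ""

    def handle_starttag(self, tag, attrs):
        if self._field is not None:
            self._depth += 1  # nested tag such as <b> inside a title
            return
        for cls in (dict(attrs).get("class") or "").split():
            if cls in FIELDS:
                self._field, self._depth, self._text = FIELDS[cls], 1, []
                return

    def handle_data(self, data):
        if self._field is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._field is None:
            return
        self._depth -= 1
        if self._depth:
            return
        text, field = "".join(self._text).strip(), self._field
        self._field = None
        if field == "count" and self.count is None:
            self.count = int(text.replace(",", ""))
        elif field == "title":
            self._title = text
        elif field == "pmid":
            self.records.append((text, self._title))
```

After `parser = DocsumParser()` and `parser.feed(SAMPLE_HTML)`, the count is available as `parser.count` and the pairs as `parser.records`. For real pages a CSS-selector library such as parsel remains the more robust choice.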