Scraping the PubMed (Dingxiang) English-Chinese Dictionary

KC_A_CO

Tags: Scrapy, PubMed, Dingxiang English-Chinese dictionary, medical dictionary, web crawler

Article link: https://blog.csdn.net/KC_A_CO/article/details/81052101

Using Scrapy to scrape the PubMed (Dingxiang) English-Chinese dictionary

1. Create the project with Scrapy

scrapy startproject med

2. Go into the med folder and create HtmlFilter.py, which strips HTML tags.

Implementation reference: https://blog.csdn.net/yangyang_1009/article/details/19168055

import re

class FilterTag():
    def __init__(self):
        pass

    def filterHtmlTag(self, htmlStr):
        '''
        Strip the tags from an HTML string.
        :param htmlStr: an HTML string or page source
        '''
        self.htmlStr = htmlStr
        # Raw strings keep the regex escapes (\[, \s, \w) from being read as string escapes
        re_cdata = re.compile(r'//<!\[CDATA\[[^>]*//\]\]>', re.I)  # CDATA sections
        re_script = re.compile(r'<\s*script[^>]*>[^<]*<\s*/\s*script\s*>', re.I)  # <script> blocks
        re_style = re.compile(r'<\s*style[^>]*>[^<]*<\s*/\s*style\s*>', re.I)  # <style> blocks
        re_br = re.compile(r'<br\s*?/?>')  # line breaks
        re_h = re.compile(r'</?\w+[^>]*>')  # any HTML tag
        re_n = re.compile(r'\n')  # newlines
        re_t = re.compile(r'\t')  # tabs
        re_comment = re.compile(r'<!--[^>]*-->')  # HTML comments
        blank_line = re.compile(r'\n+')  # runs of blank lines
        s = re_cdata.sub('', htmlStr)  # remove CDATA first
        s = re_script.sub('', s)  # remove scripts
        s = re_style.sub('', s)  # remove styles
        s = re_br.sub('\n', s)  # turn <br> into a newline
        s = blank_line.sub('\n', s)  # collapse blank lines
        s = re_h.sub('', s)  # remove the remaining HTML tags
        s = re_n.sub('', s)  # drop every newline, so the result is a single line
        s = re_t.sub('', s)  # drop tabs
        s = re_comment.sub('', s)  # remove HTML comments
        s = self.replaceCharEntity(s)  # replace character entities
        return s

    def replaceCharEntity(self, htmlStr):
        '''
        Replace common HTML character entities with the characters they stand for.
        New entities can be added to CHAR_ENTITIES, a dict that maps an entity
        name or numeric code to its plain character.
        :param htmlStr: an HTML string
        '''
        self.htmlStr = htmlStr
        CHAR_ENTITIES = {'nbsp': ' ', '160': ' ',
                         'lt': '<', '60': '<',
                         'gt': '>', '62': '>',
                         'amp': '&', '38': '&',
                         'quot': '"', '34': '"', }
        re_charEntity = re.compile(r'&#?(?P<name>\w+);')
        sz = re_charEntity.search(htmlStr)
        while sz:
            # Entity without "&" and ";", e.g. "&nbsp;" -> "nbsp", "&gt;" -> "gt"
            key = sz.group('name')
            # Substitute known entities; unknown ones are replaced with an empty string
            htmlStr = re_charEntity.sub(CHAR_ENTITIES.get(key, ''), htmlStr, 1)
            sz = re_charEntity.search(htmlStr)
        return htmlStr

    def replace(self, s, re_exp, repl_string):
        # Pattern.sub takes the replacement first and the target string second
        return re_exp.sub(repl_string, s)

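    def strip_tags(self, htmlStr):
        '''
        Strip HTML tags using HTMLParser.
        :param htmlStr: an HTML string
        '''
        # Imported here so the method works even without a module-level import
        from html.parser import HTMLParser

        self.htmlStr = htmlStr
        htmlStr = htmlStr.strip()
        htmlStr = htmlStr.strip("\n")
        result = []
        parser = HTMLParser()
        parser.handle_data = result.append  # collect every text node
        parser.feed(htmlStr)
        parser.close()
        return ''.join(result)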

    def stripTagSimple(self, htmlStr):
        '''
        The simplest way to strip HTML tags. A tag must look like <name ...>;
        a bare <> is not matched.
        :param htmlStr: an HTML string
        '''
        self.htmlStr = htmlStr
        # Looser alternative: dr = re.compile(r'<[^>]+>', re.S)
        dr = re.compile(r'</?\w+[^>]*>', re.S)
        htmlStr = dr.sub('', htmlStr)
        return htmlStr
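
As a point of comparison with the regex-based filter above, here is a minimal sketch that does the same job with only the standard library's html.parser. It is not part of the original project: the names `_TextCollector` and `html_to_text` are made up for illustration, and collapsing all whitespace to single spaces is a choice of this sketch (the class above deletes newlines and tabs instead).

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; in text nodes
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html_str):
    """Return the visible text of html_str with whitespace collapsed."""
    parser = _TextCollector()
    parser.feed(html_str)
    parser.close()
    text = "".join(parser.parts)
    # \s also matches the non-breaking space produced by &nbsp;
    return re.sub(r"\s+", " ", text).strip()
```

Because the parser decodes entities itself, no hand-maintained entity table is needed in this variant.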

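3. In the spiders folder, create med_spider.py to extract the inflections, definitions, and example sentences.

For XPath usage, see the official Scrapy documentation and https://blog.csdn.net/flysky1991/article/details/75290805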

import scrapy
import HtmlFilter

class medSpider(scrapy.Spider):
    name = "med"
    start_urls = []
    # Read the words to crawl and build one URL per word
    with open("word.txt", 'r') as file:
        content = file.read()
        wordlist = content.split('\n')
        for w in wordlist:
            w = w.strip()
            if not w:
                continue  # skip blank lines, e.g. the one left by a trailing newline
            url = "http://dict.biomart.cn/" + w + ".htm"
            start_urls.append(url)

    def parse(self, response):
        filters = HtmlFilter.FilterTag()  # strips tags from the extracted HTML

        item = {}  # named item so the built-in dict is not shadowed

        # Headword, stored under "单词" (word)
        word = response.css('h5').extract()
        if not word:
            return  # no <h5> headword, so there is no entry to parse
        item["单词"] = filters.filterHtmlTag(word[0])

        # Inflections, stored under "变形"
        morph = response.xpath('//p[@class="p1"]').extract()
        # With two matches, the first is the pronunciation and the second the inflections
        if len(morph) == 2:
            item["变形"] = filters.filterHtmlTag(str(morph[-1]))

        # Definitions, stored under "释义"; there may be two
        exp = response.xpath('//h3[@class="x_title3"]').extract()
        item["释义"] = [filters.filterHtmlTag(e) for e in exp]

        # Paired English and Chinese example sentences, stored under "例句"
        eng_sent = response.xpath('//p[@class="c1 p1"]').extract()
        cn_sent = response.xpath('//p[@class="c1"]').extract()
        sent_list = []
        for es, cs in zip(eng_sent, cn_sent):  # zip stops at the shorter list
            sent_list.append(filters.filterHtmlTag(es))
            sent_list.append(filters.filterHtmlTag(cs))
        item["例句"] = sent_list
        if item["单词"]:
            yield item

        # Follow the "more example sentences" link, if there is one, and parse it with parse_sent
        sent_url = response.xpath('//p[@class="x_title4"]/a/@href').extract()
        if sent_url:
            yield scrapy.Request(response.urljoin(sent_url[0]), callback=self.parse_sent)



    def parse_sent(self, response):
        filters = HtmlFilter.FilterTag()
        word = response.css('h5').extract()
        if not word:
            return  # no headword on this page
        word = filters.filterHtmlTag(word[0])
        eng_sent = response.xpath('//p[@class="c1 p1"]').extract()
        cn_sent = response.xpath('//p[@class="c1"]').extract()
        sent_list = []
        for es, cs in zip(eng_sent, cn_sent):  # zip stops at the shorter list
            sent_list.append(filters.filterHtmlTag(es))
            sent_list.append(filters.filterHtmlTag(cs))
        # One item per word: the headword maps to its extra example sentences
        yield {word: sent_list}
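
Two small pieces of the spider can be tried without running Scrapy: turning the word list into URLs, and interleaving the English and Chinese sentences. The sketch below isolates both. The function names are hypothetical, and percent-encoding multi-word entries such as "joint pain" with `urllib.parse.quote` is an assumption of this sketch; the original spider concatenates the raw word.

```python
from urllib.parse import quote

BASE_URL = "http://dict.biomart.cn/"  # base URL used by the spider above


def build_start_urls(lines):
    """Build one dictionary URL per non-blank line of word.txt."""
    urls = []
    for line in lines:
        word = line.strip()
        if not word:
            continue  # ignore blank lines
        # quote() encodes spaces as %20 (an assumption about what the site accepts)
        urls.append(BASE_URL + quote(word) + ".htm")
    return urls


def interleave(eng_sentences, cn_sentences):
    """Return [eng1, cn1, eng2, cn2, ...], stopping at the shorter list."""
    result = []
    for eng, cn in zip(eng_sentences, cn_sentences):
        result.extend([eng, cn])
    return result
```

Using `zip` means a page with more English than Chinese paragraphs loses the unmatched tail instead of raising an IndexError.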

4. Use a disguised User-Agent

Implementation reference: https://www.colabug.com/167327.html

Note: DOWNLOADER_MIDDLEWARES should be changed to

DOWNLOADER_MIDDLEWARES = {
    'med.MidWare.HeaderMidWare.ProcessHeaderMidware': 543,
}
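
The setting above implies a file med/MidWare/HeaderMidWare.py containing a class named ProcessHeaderMidware. The referenced article's code is not reproduced here, so the following is only a guess at what such a middleware might look like: a downloader middleware whose `process_request` hook sets a random User-Agent header. The `USER_AGENTS` list and its contents are placeholders invented for this sketch.

```python
import random

# Placeholder values for illustration; replace with real browser strings
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/2.0",
]


class ProcessHeaderMidware:
    """Downloader middleware that sets a random User-Agent on each request."""

    def process_request(self, request, spider):
        # Scrapy calls this hook for every outgoing request
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Returning None lets Scrapy continue processing the request normally
        return None
```

The number 543 in the setting is the middleware's order relative to the other downloader middlewares.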

5. Modify pipelines.py

Here the items are simply written out as plain text.

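class MedPipeline(object):
    def open_spider(self, spider):
        # Opened once when the spider starts; UTF-8 so the Chinese text is written correctly
        self.file = open('dict.txt', 'w', encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # Write each item as the repr of a plain dict, one item per line
        content = dict(item)
        self.file.write(str(content) + '\n')
        return item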

In settings.py, uncomment ITEM_PIPELINES by removing the leading '#':

ITEM_PIPELINES = {
    'med.pipelines.MedPipeline': 300,
}
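
The pipeline in step 5 writes `str(content)`, which is a Python dict repr, not JSON. If the output needs to be read by other tools, one alternative is to write JSON Lines. The sketch below is a variant, not the author's pipeline: the class name `JsonLinesPipeline`, the `path` parameter, and the default file name `dict.jsonl` are assumptions made for illustration.

```python
import json


class JsonLinesPipeline:
    """Write each item as one JSON object per line (JSON Lines)."""

    def __init__(self, path="dict.jsonl"):
        # Scrapy instantiates pipelines without arguments, so the default is used there
        self.path = path
        self.file = None

    def open_spider(self, spider):
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # ensure_ascii=False keeps the Chinese text readable in the file
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

To use it, the ITEM_PIPELINES entry would point at this class instead, for example 'med.pipelines.JsonLinesPipeline' if it lives in pipelines.py.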

6. Return to the top-level med folder, create a 'word.txt' file, and enter the words to crawl, one per line

7. Crawl results

{'单词': 'jump', '变形': '[   原形: jump 现在分词: jumping 过去分词: jumped 第三人称: jumps ]', '释义': ['    条件转移,跳变,跳跃,突跳,跃变,跃迁,指令转移        ', '    n. 跳跃,上涨,惊跳 vt. 跳跃,跃过,突升,使跳跃 vi. 跳跃,暴涨            '], '例句': ['1.“I have found many hospitals making people jump through all sorts of hoops to try to get them knocked out of the financial-assistance category that they deserve to be in,” Alan Alop, deputy director of the Legal Assistance Foundation of Metropolitan Chicago, told the WSJ.', '美国金融危机，经济萧条，让更多人失去工作和医疗保险，但生老病死仍然依旧发生，医院接受捐款锐减，病人拒付或懒帐坏帐率剧增。如何应对这些问题，令许多医院头疼。部门医院开始借助于信用调查机构来筛查病人的支付能力，这样的防范措施是否奏效，令人关注！', '2.The number of global cancer deaths is projected to increase 45% from 2007 to 2030 (from 7.9 million to 11.5 million deaths), influenced in part by an increasing and aging global population. The estimated rise takes into account expected slight declines in death rates for some cancers in high resource countries. New cases of cancer in the same period are estimated to jump from 11.3 million in 2007 to 15.5 million in 2030.', '从2007年到2030年全球癌症死亡数目要增加45%（死亡人数从790万到1150万），其中部分原因是因为全球老龄化人口的增加。考虑到部分癌症在高发病国家死亡率轻度降低，总的癌症死亡率是呈上升趋势。同一时期新的癌症病理估计从2007年的1130万到2030年的1550万。', '3.Health care giant Johnson & Johnson on Tuesday reported a 30 percent jump in third-quarter profit, beating Wall Street expectations, due to the absence of a $745 million restructuring charge a year ago, as well as higher sales of consumer products and medical devices.', '一年前由于缺乏7 .45亿美元的重组费以及昂贵的产品和医疗设备费用，但本周二医疗保健巨头强生公司公布第三季度的利润上升30%，确实超过华尔街预期，。', "4.Even more surprising, she says, is the jump in the percentage of “e-patients,” as she calls them, who say Internet health resources have been helpful. Some 60% of e-patients say they or someone they know has been helped by following medical advice or health information found on the Web. That's up from 31% of e-patients in 2006. 
Just 3% said they or someone they know has been harmed by following medical advice or health information found on the Internet, a number that has remained stable since 2006.", '她说，更令人惊奇的是“e-病人”比例的突升，这些病人认为互联网络上的健康资源是有用的。60%的e-病人说，他们或者他们认识的一些人通过遵从在网站上寻找到的医疗建议或卫生知识而获得帮助，这与2006年31%的e-病人持此观点相比有所上升。只有3%的e-病人说，他们或者他们认识的一些人通过遵从在网站上寻找到的医疗建议或卫生知识而受害，这一比例自2006年起保持不变。', '5.HIV now infects 39.5 million people around the world, a jump of 2.3 million over the past 2 years, according to an update released today by the Joint United Nations Programme on HIV/AIDS (UNAIDS). "The evidence is showing that not only is the global epidemic growing, but there are also worrying trends where some countries are seeing a resurgence in infection rates," says Paul De Lay, who directs monitoring and evaluation for UNAIDS.', '联合国艾滋病规划署（UNAIDS）发布的最新报告显示，全球范围内有三千九百五十万人感染了HIV，在过去两年中感染者人数猛增了二百三十万。UNAIDS监测和评估负责人Paul De Lay说：“有证据表明不仅全球艾滋病呈上升趋势，而且部分国家的感染率出现反弹，这一趋势令人担忧。”']}
{'单词': 'cold', '释义': ['    不带放射性的,非标记的,寒,冷,冷的        ', '    n. 寒冷,[物]零下温度,伤风,感冒 adj. 寒冷的,使人战栗的,冷淡的,不热情的,失去知觉的            '], '例句': ['1.If the body is exposed to cold for several weeks, as at the beginning of winter, the thyroid gland enlarges and begins to produce greater quantities of thyroid hormone.', '当身体在初冬季节暴露于寒冷中达几个星期时，甲状腺将增大并开始产生较多的甲状腺激素。', '2.Under extreme conditions of cold, increase in thyroid hormone production over a period of weeks can step up the rate of heat production as much as 20 to 30 percent, thus allowing one to withstand the prolonged cold.', '在很冷的环境中，在几周里甲状腺激素生成增加可使产热率升高20～30％，因此使人能耐受长期寒冷。', '3.That is, when the air temperature falls very low, which excites the cold receptors of the skin, the "setting" of the hypothalamic thermostat is automatically increased to a temperature level several tenths of a degree above normal body temperature.', '即当气温降至很低时就兴奋皮肤的冷感受器，下丘脑恒温器的“调整”就会自动升高到高于正常体温零点几度(此时实际体温就比恒温器调整点低零点几度，于是产热中枢兴奋——注)。', '4.Therefore, even though the temperature is high, the skin remains cold, and shivering occurs.', '因此，虽然体温已升高，但皮肤依旧是冷的，而且发生颤抖。', '5.People with this personality are often affectively cold and may be abnormally aggressive or irresponsible.', '具有这种人格的人经常感情冷淡，可反常地放肆或不负责任。']}
{'单词': 'joint pain', '释义': ['    关节疼痛        '], '例句': []}
{'单词': 'salting out', '释义': ['    盐析,加盐分离        ', '    盐析            '], '例句': []}
{'单词': 'sarcoplasm', '释义': ['    肌浆,肌质        ', '    n. [解]肌质,肌浆            '], '例句': ['1.Nervous rather than endocrine (adrenaline) stimulation of glycogenolysis causes a rise of calcium ions, in the muscle sarcoplasm, from about 10-7to 10-5 mol/l.', '是糖元分解的神经刺激而不是内分泌(肾上腺素)刺激引起肌浆中钙离子升高，从约10-7mol／L升高到l0-5mol／L。']}
{'cold': ['1.根据疾病控制中心的数据，除了感冒病毒（cold viruses）外，肠病毒是第二常见的感染人体的病毒。大多数人感染肠病毒后没有任何症状。', 'Enteroviruses are very common, second only to the common cold viruses as the most common viral infections in humans, according to the CDC. Most people who are infected with an enterovirus have no symptoms at all.', '2."但结果却是相当小的数量——月50个左右的同类型神经元——就已经足以促发行为."与纽约"Cold Spring Harbor"实验室也有联系的Svoboda说.', '"But it turns out that a remarkably small number -- on the order of 50 or so activated neurons -- is sufficient to drive reliable behaviors," said Svoboda, who is also associated with the Cold Spring Harbor Laboratory, in New York.', '3.Cold Spring Harbor实验室领导的研究,可能会把被认为是无害的病毒与染色体不稳定性(CIN)和癌症联系起来.', 'Research led by Cold Spring Harbor Laboratory (CSHL) may link viruses that have been considered harmless to chromosomal instability (CIN) and cancer.']}
{'jump': []}
{'sarcoplasm': []}
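
Since each record in dict.txt is the `str()` of a Python dict, it can be read back with `ast.literal_eval`. The sketch below assumes every record occupies exactly one line; `parse_result_line` is a hypothetical helper name. It also trims the padding whitespace visible around the definitions in the output above.

```python
import ast


def parse_result_line(line):
    """Parse one line of dict.txt back into a dict and trim the definitions."""
    record = ast.literal_eval(line)  # safe evaluation of a Python literal
    if "释义" in record:
        # Definitions keep the page's padding whitespace, so strip it here
        record["释义"] = [definition.strip() for definition in record["释义"]]
    return record
```

Records produced by parse_sent have the headword itself as the only key, so they can be told apart from dictionary entries by the absence of the "单词" key.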