Fetching Wikipedia Information for Concept Entries with Python

Wikipedia, the largest and most popular reference work on the web, has long been favored by natural language processing researchers as a high-quality source of language data. Most of the time, Wikipedia data is obtained through the official database dumps (http://dumps.wikimedia.org), but the dumps are so large that importing them into a local database is painful (the English dump alone exceeds 10 GB, and a machine with less than 16 GB of RAM can barely handle it), so fetching Wikipedia data quickly has always been a problem.
While researching concept prerequisite relations based on Wikipedia in our lab, I developed a set of routines that query the MediaWiki API (https://www.mediawiki.org/wiki/API:Main_page) to retrieve Wikipedia information. Wikipedia exposes far too much information to cover exhaustively, but the program below covers most of the concept features in Wikipedia (link relations, category relations, and so on), so by analogy you can retrieve everything else as well. Python's third-party ecosystem also provides a wrapper for accessing Wikipedia, but in practice its network speed varies with the time of day, so it can feel very slow or simply error out; if you are interested, look up the `wikipedia` package.
I hope this article is useful to researchers interested in NLP and to Wikipedia enthusiasts.
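For completeness, here is a minimal sketch of the third-party `wikipedia` package mentioned above (installed via `pip install wikipedia`); the calls shown are its documented basics, and the article title is just an example:

import wikipedia

wikipedia.set_lang('en')                       # choose the language edition
print(wikipedia.summary('Machine learning'))   # lead section as plain text
page = wikipedia.page('Machine learning')
print(page.links[:10])                         # first ten outgoing links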

# -*- coding:utf-8 -*-
# Author:Zhou Yang

import requests
import json
import logging
import sys
import os.path
import re

# Build the API endpoint, e.g. https://en.wikipedia.org/w/api.php
agreement = 'https://'                     # URL scheme
language = 'en'                            # language edition; 'zh' for Chinese, etc.
organization = '.wikipedia.org/w/api.php'

API_URL = agreement + language + organization


program = os.path.basename(sys.argv[0])
logger = logging.getLogger(program)
logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')


def pageid(title=None, np=0):
    """Return the numeric page ID of an article, or of its category page when np != 0."""
    query_params = {
        'action': 'query',
        'prop': 'info',
        'format': 'json',
        'titles': title
    }
    if np != 0:
        query_params['titles'] = 'Category:' + title
    try:
        r = requests.get(API_URL, params=query_params)
        r.raise_for_status()
        text = json.loads(r.text)  # the API returns UTF-8 JSON
    except (requests.RequestException, json.JSONDecodeError):
        return -1
    try:
        # 'pages' is keyed by page ID; the API uses "-1" for a missing page.
        for i in text["query"]["pages"]:
            return int(i)
    except KeyError:
        return -1
    return -1

def summary(title=None):
    """Return the plain-text lead section (the text before the first heading)."""
    query_params = {
        'action': 'query',
        'prop': 'extracts',
        'explaintext': '',  # plain text instead of HTML
        'exintro': '',      # introduction only
        'format': 'json',
        'titles': title
    }
    try:
        r = requests.get(API_URL, params=query_params)
        r.raise_for_status()
        text = json.loads(r.text)
    except (requests.RequestException, json.JSONDecodeError):
        logger.error('error summary about ' + title)
        return ""
    id = list(text["query"]["pages"].keys())[0]
    try:
        return text["query"]["pages"][id]["extract"]
    except KeyError:
        return ""

def body(title=None):
    """Return the full article text with HTML tags and newlines stripped."""
    query_params = {
        'action': 'query',
        'prop': 'extracts',
        'exlimit': 'max',
        'format': 'json',
        'titles': title
    }
    try:
        r = requests.get(API_URL, params=query_params)
        r.raise_for_status()
        text = json.loads(r.text)
    except (requests.RequestException, json.JSONDecodeError):
        logger.error('error body about ' + title)
        return ""
    id = list(text["query"]["pages"].keys())[0]
    try:
        html_text = text["query"]["pages"][id]["extract"]

        def strip_tag_simple(html_str):
            """Simplest HTML tag filter: matches <tag ...> pairs, never a bare <>."""
            dr = re.compile(r'</?\w+[^>]*>', re.S)
            return re.sub(dr, '', html_str)

        html_text = strip_tag_simple(html_text)
        return html_text.replace("\n", "")
    except KeyError:
        return ""

def links(title=None):
    """Return outgoing article links whose titles also appear in the summary."""
    query_params = {
        'action': 'query',
        'prop': 'links',
        'pllimit': 'max',
        'plnamespace': '0',  # namespace 0 = articles only
        'format': 'json',
        'titles': title
    }
    try:
        r = requests.get(API_URL, params=query_params)
        r.raise_for_status()
        text = json.loads(r.text)
    except (requests.RequestException, json.JSONDecodeError):
        logger.error('error links about ' + title)
        return list()
    id = list(text["query"]["pages"].keys())[0]
    link = list()
    summ = summary(title)
    try:
        for obj in text["query"]["pages"][id]["links"]:
            # Keep only the links that are mentioned in the introduction.
            if obj['title'] in summ or obj['title'].lower() in summ:
                link.append(obj['title'])
    except KeyError:
        return link
    return link

def linkss(title=None):
    """Return all outgoing article links on the page, unfiltered."""
    query_params = {
        'action': 'query',
        'prop': 'links',
        'pllimit': 'max',
        'plnamespace': '0',
        'format': 'json',
        'titles': title
    }
    try:
        r = requests.get(API_URL, params=query_params)
        r.raise_for_status()
        text = json.loads(r.text)
    except (requests.RequestException, json.JSONDecodeError):
        logger.error('error linkss about ' + title)
        return list()
    id = list(text["query"]["pages"].keys())[0]
    link = list()
    try:
        for obj in text["query"]["pages"][id]["links"]:
            link.append(obj['title'])
    except KeyError:
        return link
    return link
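
# A hedged sketch, not part of the original program: even with 'pllimit=max'
# the API returns at most 500 links per response. The response then carries a
# top-level 'continue' object; merging it back into the request parameters and
# re-requesting pages through the rest. `all_links` is a hypothetical helper.
def all_links(title=None):
    params = {
        'action': 'query',
        'prop': 'links',
        'pllimit': 'max',
        'plnamespace': '0',
        'format': 'json',
        'titles': title
    }
    titles = list()
    while True:
        data = requests.get(API_URL, params=params).json()
        for page in data['query']['pages'].values():
            titles.extend(obj['title'] for obj in page.get('links', []))
        if 'continue' not in data:
            break
        params.update(data['continue'])  # adds e.g. 'plcontinue'
    return titles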

def backlinks(title=None):
    """Return the titles of articles that link to the given page."""
    query_params = {
        'action': 'query',
        'list': 'backlinks',
        'bllimit': 'max',
        'blnamespace': '0',
        'format': 'json',
        'bltitle': title
    }
    try:
        r = requests.get(API_URL, params=query_params)
        r.raise_for_status()
        text = json.loads(r.text)
    except (requests.RequestException, json.JSONDecodeError):
        logger.error('error backlinks about ' + title)
        return list()
    link = list()
    try:
        link = [obj['title'] for obj in text["query"]["backlinks"]]
    except KeyError:
        return link
    return link

def categories(title=None):
    """Return the non-hidden categories of an article, without the 'Category:' prefix."""
    query_params = {
        'action': 'query',
        'prop': 'categories',
        'cllimit': 'max',
        'clshow': '!hidden',  # skip hidden maintenance categories
        'format': 'json',
        'titles': title
    }
    try:
        r = requests.get(API_URL, params=query_params)
        r.raise_for_status()
        text = json.loads(r.text)
    except (requests.RequestException, json.JSONDecodeError):
        logger.error('error categories about ' + title)
        return list()
    id = list(text["query"]["pages"].keys())[0]
    category = list()
    try:
        # [9:] strips the leading 'Category:' from each title.
        category = [obj['title'][9:] for obj in text["query"]["pages"][id]["categories"]]
    except KeyError:
        return category
    return category

def redirects(title=None):
    """Return the titles of pages that redirect to the given article."""
    query_params = {
        'action': 'query',
        'prop': 'redirects',
        'rdlimit': 'max',
        'format': 'json',
        'titles': title
    }
    try:
        r = requests.get(API_URL, params=query_params)
        r.raise_for_status()
        text = json.loads(r.text)
    except (requests.RequestException, json.JSONDecodeError):
        logger.error('error redirects about ' + title)
        return list()
    id = list(text["query"]["pages"].keys())[0]
    redirect = list()
    try:
        redirect = [obj['title'] for obj in text["query"]["pages"][id]["redirects"]]
    except KeyError:
        return redirect
    return redirect

def subcats(title=None):
    """Return the subcategories of a category, without the 'Category:' prefix."""
    query_params = {
        'action': 'query',
        'list': 'categorymembers',
        'cmtype': 'subcat',
        'cmlimit': 'max',
        'format': 'json',
        'cmtitle': 'Category:' + title
    }
    try:
        r = requests.get(API_URL, params=query_params)
        r.raise_for_status()
        text = json.loads(r.text)
    except (requests.RequestException, json.JSONDecodeError):
        logger.error('error subcats about ' + title)
        return list()
    subcat = list()
    try:
        subcat = [obj['title'][9:] for obj in text["query"]["categorymembers"]]
    except KeyError:
        return subcat
    return subcat

def supercats(title=None):
    """Return the parent categories of a category, without the 'Category:' prefix."""
    query_params = {
        'action': 'query',
        'prop': 'categories',
        'cllimit': 'max',
        'format': 'json',
        'clshow': '!hidden',
        'titles': 'Category:' + title
    }
    try:
        r = requests.get(API_URL, params=query_params)
        r.raise_for_status()
        text = json.loads(r.text)
    except (requests.RequestException, json.JSONDecodeError):
        logger.error('error supercats about ' + title)
        return list()
    id = list(text["query"]["pages"].keys())[0]
    supercat = list()
    try:
        supercat = [obj['title'][9:] for obj in text["query"]["pages"][id]["categories"]]
    except KeyError:
        return supercat
    return supercat

def contributors(title=None):
    """Return the user IDs of the registered contributors to an article."""
    query_params = {
        'action': 'query',
        'prop': 'contributors',
        'pclimit': 'max',
        'format': 'json',
        'titles': title
    }
    try:
        r = requests.get(API_URL, params=query_params)
        r.raise_for_status()
        text = json.loads(r.text)
    except (requests.RequestException, json.JSONDecodeError):
        logger.error('error contributors about ' + title)
        return list()
    id = list(text["query"]["pages"].keys())[0]
    contributor_ids = list()
    try:
        for obj in text["query"]["pages"][id]["contributors"]:
            contributor_ids.append(obj['userid'])
    except KeyError:
        return contributor_ids
    return contributor_ids


if __name__ == '__main__':
    # The sample output below was produced for the article "Machine learning".
    title = "Machine learning"
    id = pageid(title)
    summ = summary(title)
    out = links(title)
    print(id)
    print(summ)
    print(out)
  • Page ID of the Wikipedia concept ("Machine learning")
233488
  • Summary section ("Machine learning")
Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to effectively perform a specific task without using explicit instructions, relying on patterns and inference instead. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms are used in a wide variety of applications, such as email filtering, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task. Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning. In its application across business problems, machine learning is also referred to as predictive analytics.
  • Hyperlinked concepts on the page ("Machine learning")
['Algorithm', 'Artificial intelligence', 'Computational statistics', 'Computer systems', 'Computer vision', 'Data mining', 'Email filtering', 'Exploratory data analysis', 'Inference', 'Mathematica', 'Mathematical optimization', 'Predictive analytics', 'STATISTICA', 'Statistical model', 'Statistics', 'Supervised learning', 'Training data']

You can try calling the other functions yourself. Note that to query a different language edition you only need to change the `language` variable in the code. This article uses the English Wikipedia; for Chinese, simply change it to 'zh' (accessing the Chinese edition requires a proxy in mainland China), and other languages work the same way.
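As a quick illustration (a minimal usage sketch; the titles are just examples and results depend on the live API), switching editions and exercising the other functions looks like this:

language = 'zh'                                 # Chinese Wikipedia
API_URL = agreement + language + organization   # -> https://zh.wikipedia.org/w/api.php

# Back on the English edition, the remaining functions follow the same pattern:
print(categories('Machine learning'))   # non-hidden categories of the article
print(backlinks('Machine learning'))    # articles that link to the page
print(subcats('Machine learning'))      # subcategories of Category:Machine learning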

This is my first technical blog post; corrections and suggestions are very welcome!
