Importing zhwiki data into MediaWiki with MWDumper on Windows

I used MWDumper ("a quick little tool for extracting sets of pages from a MediaWiki dump file") to import the Chinese Wikipedia dump into MediaWiki. The first thing to stress is that the mwdumper.jar from the official download page can no longer import dumps in the newer format; you need the build from http://csomalin.csoma.elte.hu/~tgergo/wiki/mwdumper.jar for it to work. On Windows it also offers a GUI, which makes it fairly convenient to operate. See also http://www.mediawiki.org/wiki/MWDumper#A_note_on_character_encoding
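If you would rather run it from the command line, the usual pattern is to pipe MWDumper's SQL output straight into the mysql client. A minimal sketch, where the dump file name, MySQL user, and database name (wikidb) are placeholders for your own setup:

```
java -jar mwdumper.jar --format=sql:1.5 zhwiki-latest-pages-articles.xml.bz2 | mysql -u root -p --default-character-set=utf8 wikidb
```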

Second: it needs JDK 1.5 or later to run. If it complains that server/jvm.dll cannot be found, copy the java/jdk/jre/bin/server folder into the java/jre/bin directory.
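For example, assuming the JDK lives under C:\Program Files\Java\jdk1.6.0 and the standalone JRE under C:\Program Files\Java\jre6 (both paths are only illustrative), the copy looks roughly like this at a Windows command prompt:

```
:: copy the server JVM (jvm.dll) from the JDK's bundled JRE into the standalone JRE
xcopy /E /I "C:\Program Files\Java\jdk1.6.0\jre\bin\server" "C:\Program Files\Java\jre6\bin\server"
```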

Third: SQL syntax errors. Check whether the MySQL database's default character set is utf8.
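A quick way to check is from the mysql client (here wikidb stands in for whatever your wiki database is called):

```
SHOW VARIABLES LIKE 'character_set%';
SHOW CREATE DATABASE wikidb;
```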

Fourth: "Duplicate entry" errors (duplicate primary keys). Either change the character set of the tables being imported, and of their columns, to binary:

ALTER TABLE text DEFAULT CHARACTER SET binary;
ALTER TABLE page DEFAULT CHARACTER SET binary;
ALTER TABLE revision DEFAULT CHARACTER SET binary;

or, if the tables already contain data, clear them out first:

DELETE FROM page;
DELETE FROM revision;
DELETE FROM text;
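Note that DEFAULT CHARACTER SET only applies to columns added later. If the existing text-type columns also need to be converted to binary, something along these lines should work (a sketch only; back the tables up first):

```
-- CONVERT TO turns CHAR/VARCHAR/TEXT columns into BINARY/VARBINARY/BLOB
ALTER TABLE text CONVERT TO CHARACTER SET binary;
ALTER TABLE page CONVERT TO CHARACTER SET binary;
ALTER TABLE revision CONVERT TO CHARACTER SET binary;
```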

Fifth: fields that are too long get truncated:

Exception in thread "main" java.io.IOException: com.mysql.jdbc.MysqlDataTruncation: Data truncation: Data too long for column 'rev_comment' at row 809
at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)

Fix: ALTER TABLE revision MODIFY COLUMN rev_comment BLOB NOT NULL;

Sixth: performance tips. Set max_allowed_packet = 128M and innodb_log_file_size = 100M; drop the tables' primary keys and indexes before importing; and turn off MySQL's binary log first. My own import of more than 9 million rows took a little over 3 hours in total.
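For reference, a rough sketch of the matching [mysqld] settings in my.ini / my.cnf (values copied from above; whether binary logging is enabled at all depends on your existing log-bin setting):

```
[mysqld]
max_allowed_packet   = 128M
# pre-5.6 MySQL: changing this requires a clean shutdown and removing the old ib_logfile* files
innodb_log_file_size = 100M
# comment out (or remove) log-bin to switch off binary logging during the import
# log-bin = mysql-bin
```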
