Check whether the page content in wikivoyage.json can be converted for use; output it as pages.
wikivoyage_xml_to_json.py:

    import json
    import xmltodict

    def wikivoyage_xml_to_json(filename):
        xml_str = open(filename).read()
        o = xmltodict.parse(xml_str)
        json_str = json.dumps(o)        # JSON is a string
        json_d = json.loads(json_str)   # back to a dict
        with open('data/wikivoyage/wikivoyage.json', 'w') as j:
            j.write(json.dumps(json_d))
        return json_d
1. json.dumps() and json.loads() handle JSON as in-memory strings (you can think of JSON as a string format).
(1) json.dumps() encodes a Python object such as a dict or list into a JSON string.
(2) json.loads() decodes a JSON string back into a dict.
2. json.dump() and json.load() are mainly for reading and writing JSON files.
    import json

    # json.dump() writes JSON to a file object
    json_info = {'age': '12'}
    file = open('1.json', 'w', encoding='utf-8')
    json.dump(json_info, file)
    file.close()
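To make the difference between the string pair (dumps/loads) and the file pair (dump/load) concrete, a minimal round-trip sketch (the file name 1.json just follows the snippet above):

```python
import json

info = {'age': '12'}

# string round-trip: dumps -> JSON string, loads -> dict
s = json.dumps(info)
assert json.loads(s) == info

# file round-trip: dump writes to a file object, load reads one back
with open('1.json', 'w', encoding='utf-8') as f:
    json.dump(info, f)
with open('1.json', encoding='utf-8') as f:
    assert json.load(f) == info
```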
Eda:
Building the Article dict from the Page list
    # keep only articles without ':' in their titles
    articles_dict = {}
    for article in articles:
        # strip the accents using unicodedata
        title = article['title']
        title = clean_word(title)
        if ':' not in title:
            articles_dict[title] = article

    print len(articles_dict)
    articles_dict[articles_dict.keys()[1]]
    final_articles = {}
    for title in final_titles:
        if '(disambiguation)' not in title:
            final_articles[title] = articles_dict[title]['revision']['text']['#text']
    print len(final_articles)
In this structure, each title maps to one #text.
Find a way to extract this part of the JSON, convert it to a dict, then dict-to-xml.
    articles = data['mediawiki']['page']
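As a sanity check of that access path, here is a tiny MediaWiki-shaped XML parsed with xmltodict (the tag names mirror the dump; the titles and text are made up). Note how attributes become '@'-prefixed keys and element text lands under '#text':

```python
import xmltodict

xml_str = """
<mediawiki>
  <page>
    <title>Foo</title>
    <revision><text xml:space="preserve">#REDIRECT [[Bar]]</text></revision>
  </page>
  <page>
    <title>Bar</title>
    <revision><text xml:space="preserve">Some article text</text></revision>
  </page>
</mediawiki>
"""

data = xmltodict.parse(xml_str)
articles = data['mediawiki']['page']   # a list when there are several <page> elements
print(articles[0]['title'])                       # Foo
print(articles[0]['revision']['text']['#text'])   # #REDIRECT [[Bar]]
```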
article (dict) -> xml
Python also has a module, dicttoxml, that converts a dict into XML.
    import dicttoxml
    ret_xml = dicttoxml.dicttoxml(some_dict)   # pass a dict variable, not the dict builtin
    print(type(ret_xml))
You can import the dicttoxml() function from the library:

    >>> from dicttoxml import dicttoxml
    >>> xml = dicttoxml(some_dict)
Fetch a JSON object from a URL and convert it into XML. Simple:

    >>> import json
    >>> import urllib
    >>> import dicttoxml
    >>> page = urllib.urlopen('http://quandyfactory.com/api/example')
    >>> content = page.read()
    >>> obj = json.loads(content)
    >>> print(obj)
    {u'mylist': [u'foo', u'bar', u'baz'], u'mydict': {u'foo': u'bar', u'baz': 1}, u'ok': True}
    >>> xml = dicttoxml.dicttoxml(obj)
    >>> print(xml)
    <?xml version="1.0" encoding="UTF-8" ?><root><mylist><item type="str">foo</item><item type="str">bar</item><item type="str">baz</item></mylist><mydict><foo type="str">bar</foo><baz type="int">1</baz></mydict><ok type="bool">true</ok></root>
Structure of a revision:
    <item type="dict">
      <comment type="str">otherwise a number just looks weird.. (Import from wikitravel.org/en)</comment>
      <sha1 type="str">jjresq501njc7hkah05ux8kk4t1y605</sha1>
      <format type="str">text/x-wiki</format>
      <timestamp type="str">2009-03-02T02:54:37Z</timestamp>
      <text type="dict">
        <key type="str" name="@xml:space">preserve</key>
        <key type="str" name="#text">#REDIRECT [[Town of 1770]]</key>
      </text>
      <contributor type="dict">
        <username type="str">Inas</username>
        <id type="str">1816</id>
      </contributor>
      <model type="str">wikitext</model>
      <id type="str">1</id>
    </item>
Attributes such as text and id already exist in the page.sql dataset; what we need is to extract the contents of revision (pulling the id out of the nested dict).
The full page set contains too much junk, so try storing only the cleaned data.
Iterate over the titles of the cleaned final_articles and pull the important content and the id out of articles_dict.
Attempt:
pip install dicttoxml
Create page_content.xml in the corresponding location.
    import dicttoxml

    # save before cleaning
    page_content = []
    for article in articles:
        page_content.append(article['revision'])
    page_content_xml = dicttoxml.dicttoxml(page_content)
    with open('../data/page_content.xml', 'wb') as x:
        x.write(page_content_xml)
    final_articles = {}
    page_contents = []
    for title in final_titles:
        if '(disambiguation)' not in title:
            final_articles[title] = articles_dict[title]['revision']['text']['#text']
            page_content = {}
            page_content['id'] = articles_dict[title]['revision']['id']
            page_content['title'] = title
            if 'comment' in articles_dict[title]['revision']:
                page_content['comment'] = articles_dict[title]['revision']['comment']
            else:
                # revisions without a comment all carry a parentid
                page_content['comment'] = "parent id :" + articles_dict[title]['revision']['parentid']
            page_content['timestamp'] = articles_dict[title]['revision']['timestamp']
            page_content['text'] = articles_dict[title]['revision']['text']['#text']
            if 'id' in articles_dict[title]['revision']['contributor']:
                page_content['contributor_id'] = articles_dict[title]['revision']['contributor']['id']
                page_content['contributor_name'] = articles_dict[title]['revision']['contributor']['username']
            elif 'ip' in articles_dict[title]['revision']['contributor']:
                page_content['contributor_ip'] = articles_dict[title]['revision']['contributor']['ip']
            page_contents.append(page_content)

    print len(final_articles)
    page_content_xml = dicttoxml.dicttoxml(page_contents)
    with open('../data/page_content.xml', 'wb') as x:
        x.write(page_content_xml)
Errors:

    KeyError: 'Szczecin'
    KeyError: 'id'

A KeyError is raised whenever the key being looked up does not exist.
To build a nested dict, the inner level must first be defined as a dict. Also, comment and ['contributor']['id'] may differ in structure: nested more than one level deep, or missing entirely. Check whether 'comment' is among the keys; if it is nested, extract the nested levels, and if it is missing, inspect the overall structure.
Checking whether a key is in a dict:
Besides in, you can also use not in to test that a key is absent; in is faster than has_key.

    # build a dict
    d = {'name': 'Tom', 'age': 10, 'Tel': 110}
    # d.keys() lists all the keys of the dict
    print 'name' in d.keys()
    print 'name' in d
    # both print True
Structure without a comment:

    {u'sha1': u'qw343ibxpjzzqzk6rhkqvb8yets1gvv',
     u'format': u'text/x-wiki',
     u'timestamp': u'2019-03-04T08:06:17Z',
     u'parentid': u'3497383',
     u'text': {u'@xml:space': u'preserve', u'#text': u'{{***}'},
     u'contributor': {u'username': u'Ground Zero', u'id': u'1423298'},
     u'model': u'wikitext',
     u'id': u'3737196',
     u'minor': None}
Every revision without a comment has a parentid.
contributor structures without an id:
    {u'ip': u'2A02:810D:9040:51DD:B954:392F:EFAC:BFFD'}
    {u'ip': u'84.248.202.190'}
Most of them have an ip; those without an ip look like:
    {u'@deleted': u'deleted'}
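The three contributor shapes observed above (username + id, ip only, @deleted) can be folded into one helper. The field names follow the page_content dict built earlier; the helper name itself is my own sketch, not from the original code:

```python
def contributor_fields(contributor):
    """Map the three observed contributor shapes onto flat fields."""
    if 'id' in contributor:        # registered user: username + id
        return {'contributor_id': contributor['id'],
                'contributor_name': contributor['username']}
    elif 'ip' in contributor:      # anonymous edit: ip only
        return {'contributor_ip': contributor['ip']}
    else:                          # {'@deleted': 'deleted'}: nothing to keep
        return {}

print(contributor_fields({'username': 'Inas', 'id': '1816'}))
print(contributor_fields({'ip': '84.248.202.190'}))
print(contributor_fields({'@deleted': 'deleted'}))
```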
Use the dict's built-in get(key[, default]) method: if the key exists it returns its value, otherwise it returns default. This method never raises a KeyError, e.g.:
    t = {'a': '1', 'b': '2', 'c': '3'}
    print(t.get('d', 'not exist'))
    print(t)
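Applied to the revision dicts above, .get() removes the KeyError without an explicit in-check; the fallback string follows the "parent id :" convention used earlier (the sample revision dict is trimmed down for illustration):

```python
# a revision with no 'comment' key, as observed in the dump
revision = {'parentid': '3497383', 'timestamp': '2019-03-04T08:06:17Z'}

# falls back to the parentid when there is no comment; never raises KeyError
comment = revision.get('comment', 'parent id :' + revision.get('parentid', ''))
print(comment)  # parent id :3497383
```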
Concatenating the parentid: Python string concatenation.
When joining with the plus sign (+), every variable or element being joined must be a string.
    print('The number is: ' + str(number))
When opening a file: r is read-only, w overwrites, a appends.
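A quick check of the three modes (the file name tmp_mode_demo.txt is arbitrary):

```python
with open('tmp_mode_demo.txt', 'w') as f:   # 'w' truncates, then writes
    f.write('first\n')
with open('tmp_mode_demo.txt', 'a') as f:   # 'a' appends to the end
    f.write('second\n')
with open('tmp_mode_demo.txt', 'r') as f:   # 'r' is read-only
    print(f.read())                         # first / second
with open('tmp_mode_demo.txt', 'w') as f:   # 'w' again: previous content is gone
    f.write('third\n')
with open('tmp_mode_demo.txt') as f:
    print(f.read())                         # third
```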
A list of nested dicts represents multiple records.
    desc = '51备忘录'.center(30, '-')            # banner: "memo pad"
    print(desc)
    welcome = 'welcome'
    print(f'{welcome}作者:', __author__)          # __author__ is defined at module level

    # each record looks like {'time': '8点', 'thing': '起床'}
    all_memo = []
    is_add = True
    while is_add:
        one = {}
        info = input('请输入备忘信息:')            # prompt: enter a memo
        # split around the character '点' (o'clock): time part and event part
        one['时间'] = info[info.find('点')-1:info.find('点')+1]   # time
        one['事件'] = info[info.find('点')+1:]                    # event
        all_memo.append(one)
        print(f'备忘录{all_memo}')
        num = 0
        for i in all_memo:
            num += 1
            print('项目%s:%s' % (num, i))         # item N: record
        print(f'共{len(all_memo)}个待办事项', end='')   # "N to-do items in total"
        is_add = input('是否继续 Y/N:') == 'Y'     # continue? Y/N
Importing XML data into MySQL
https://www.csdn.net/gather_28/MtTaMgysNTg5MS1ibG9n.html
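A minimal sketch of the import step: parse page_content.xml with the standard library and insert rows through Python's DB-API. I use sqlite3 here as a stand-in because it ships with Python; with MySQL the same shape works through a driver such as pymysql, changing only the connection line and the '?' placeholders to '%s'. The sample XML follows the dicttoxml output shown earlier (item elements with id/title/text children); the table name and columns are my own assumptions:

```python
import sqlite3
import xml.etree.ElementTree as ET

# sample in the shape dicttoxml produces for page_contents
xml_str = """<root>
  <item type="dict">
    <id type="str">1</id>
    <title type="str">Foo</title>
    <text type="str">#REDIRECT [[Bar]]</text>
  </item>
</root>"""

conn = sqlite3.connect(':memory:')   # stand-in for a MySQL connection
conn.execute('CREATE TABLE page_content (id TEXT, title TEXT, text TEXT)')

root = ET.fromstring(xml_str)
for item in root.findall('item'):
    row = (item.findtext('id'), item.findtext('title'), item.findtext('text'))
    conn.execute('INSERT INTO page_content VALUES (?, ?, ?)', row)
conn.commit()

print(conn.execute('SELECT title FROM page_content').fetchall())  # [('Foo',)]
```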
Wikipedia
Download method
https://wenku.baidu.com/view/f647b44ae45c3b3567ec8b2d.html
Processing methods
https://blog.csdn.net/weixin_34001430/article/details/94267243
https://blog.csdn.net/wangyangzhizhou/article/details/78348949
https://blog.csdn.net/jdbc/article/details/59483767
These basically only use the content inside the text tag to build a corpus, without paying attention to the other tags.
Wikivoyage has a fairly low barrier to entry, yet the Chinese edition has very few contributors?
Trying to find a dataset to illustrate the results, but everything loads painfully slowly; I really wish I had a faster proxy.