I've been writing on Douban lately, posting some personal reflections. In today's online environment there is a lot you simply can't say: certain words won't pass moderation, so I had been replacing the keywords by hand. It occurred to me to just write a simple Python script for it.
import os  # path handling; the other imports in the original draft (numpy, pandas, matplotlib, etc.) are unused here
wk_dir = "2022——社会科学研究方法/test_替换敏感词"
data_dir = "2022——社会科学研究方法/test_替换敏感词/data_dir_politic_senti_replace"
#---------------------------------------------------------#
#---- build the replacement dictionary ----#
#---------------------------------------------------------#
# read the dictionary file: one "word replacement" pair per line
with open(os.path.join(data_dir, "dct_politic_senti.txt"), encoding="UTF-8") as f:
    dct_code = f.readlines()
dct_code = [x.strip() for x in dct_code]     # drop trailing newlines
dct_code = [x.split(" ") for x in dct_code]  # split into [hanzi, yingwen]
hanzi = [x[0] for x in dct_code]             # words to replace
yingwen = [x[1] for x in dct_code]           # their replacements
dct_repl = dict(zip(hanzi, yingwen))         # hanzi -> replacement mapping
# read the article and apply every replacement
with open(os.path.join(data_dir, "artical1.txt"), encoding="utf-8") as f:
    txt = f.read()
for key, value in dct_repl.items():
    if key in txt:  # the check is optional: replace() is a no-op when key is absent
        txt = txt.replace(key, value)
print(txt)
All you need is a dictionary, i.e. a mapping that tells the script which words to replace and with what.
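For illustration, a minimal sketch of what the dictionary file and the parsing amount to. The entries here are hypothetical placeholders; the real file dct_politic_senti.txt uses the same one-pair-per-line, space-separated format:

```python
# hypothetical dictionary lines; the real dct_politic_senti.txt
# holds one space-separated "word replacement" pair per line
sample_lines = ["苹果 apple\n", "香蕉 banana\n"]

# strip newlines, split each line into a pair, build the mapping
pairs = [line.strip().split(" ") for line in sample_lines]
dct_repl = dict(pairs)
print(dct_repl)  # {'苹果': 'apple', '香蕉': 'banana'}
```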

The result looks like this; I'm not sure whether it will pass moderation when posted.

The key piece of code is this part:
for key, value in dct_repl.items():
    if key in txt:
        txt = txt.replace(key, value)
It scans the whole text once per word, filtering and replacing one word at a time, so the efficiency is a bit low, but I haven't thought of a better, more efficient approach yet.
Hoping someone more experienced can point me to a better way.
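One standard trick to avoid scanning the text once per word is to compile all the keys into a single alternation pattern and let re.sub do one pass, looking up each match in the dict. A sketch with hypothetical dictionary entries:

```python
import re

dct_repl = {"苹果": "apple", "香蕉": "banana"}  # hypothetical entries

# sort keys longest-first so a longer word wins when one key contains another,
# and escape each key in case it contains a regex metacharacter
pattern = re.compile(
    "|".join(map(re.escape, sorted(dct_repl, key=len, reverse=True)))
)

txt = "我买了苹果和香蕉"
# one pass over the text: each match is replaced via a dict lookup
txt = pattern.sub(lambda m: dct_repl[m.group(0)], txt)
print(txt)  # 我买了apple和banana
```

With many keys and a long text this does one scan instead of one per key, at the cost of building the pattern once up front.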




