Spam SMS Detection Based on Text Content

Which course does this assignment belong to: Natural Language Processing
Where are the assignment requirements: NLP Assignment 02: Course Design Report - CSDN community
My goal in this course: to learn how to implement basic text classification and how to process text data
How this assignment helps me reach that goal: the experiment involved text data processing at many points, which made me familiar with how text data should be prepared when using a neural network for text classification


1. Design Objectives

Through this course design exercise, students deepen their understanding and command of the natural language processing theory and skills taught in class and apply them to the design and development of a real engineering project. The exercise shows how NLP algorithms are used in actual projects and lays a foundation for independently developing, or assisting engineers in developing, AI products. It also trains teamwork and communication, the ability to use modern tools to analyze and solve complex engineering problems, a conscious commitment to professional ethics and norms, and professional habits such as observing the law, dedication, honesty, and innovation.

2. Design Requirements

2.1 Equipment and software

  1. A computer running a 64-bit Windows operating system.
  2. Python 3.8.5.
  3. The PyCharm Community Edition editor.
  4. Libraries such as jieba, PyTorch, and TensorFlow.

2.2 Design requirements

The course design has two main parts: the design artifact and the written report. Completing the artifact involves designing a solution, implementing it in code, and testing the result. The report comprehensively summarizes the theoretical design, the implementation process, and the test results, raising the practical work to a theoretical level.

3. Experimental Directions

This experiment explores the application of a convolutional neural network (CNN) to text classification on a multi-topic dataset. The specific directions are as follows:

3.1 Study the effectiveness of CNNs for text classification

Experimentally evaluate the accuracy and efficiency of CNN-based text classification, verifying its feasibility and effectiveness on a multi-topic dataset.

3.2 Explore word embeddings for text classification

Use a pre-trained word embedding model (such as GloVe) to convert text into numerical representations, observe experimentally how word embeddings affect classification performance, and explore their use inside a CNN.

3.3 Verify the stability and reliability of the results

Use a sensible train/test split and appropriate evaluation metrics to verify that the results are stable and reliable, so that the conclusions are scientifically credible.

3.4 Propose improvements and future work

Based on the results, analyze the strengths and weaknesses of CNN text classification and discuss possible improvements and research directions, providing guidance for further optimization.

By meeting these goals, I expect to gain a deeper understanding of the performance and characteristics of CNN-based text classification, provide an effective solution for practical classification problems, and help advance research in this area.

4. Experimental Content

4.1 Dataset preparation

First, I select a suitable public multi-topic dataset as the basis of the experiment, making sure it covers several topics or categories and that each category has enough samples. I then preprocess the data, including text cleaning, word segmentation, and stop-word removal, to prepare it for model training and evaluation.
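As a minimal sketch of this preprocessing step (the stop-word file stopwords.txt, the cleaning rule, and the example message are illustrative assumptions, not the exact pipeline of the code section below):

import re
import jieba

# Hypothetical stop-word list, one word per line
with open('stopwords.txt', encoding='utf-8') as f:
    stopwords = set(line.strip() for line in f)

def clean_text(text):
    # keep Chinese characters, letters and digits; drop punctuation and symbols
    return re.sub(r'[^\u4e00-\u9fa5A-Za-z0-9]', ' ', text)

def preprocess(text):
    tokens = jieba.lcut(clean_text(text))  # segment with jieba
    return [t for t in tokens if t.strip() and t not in stopwords]

print(preprocess('【免费领取】点击链接,立即获得1000元优惠券!'))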

4.2 Building the convolutional neural network model

I design a CNN-based model for text classification. Such a model typically contains an embedding layer, convolutional layers, pooling layers, and fully connected layers. I choose the number of layers, the kernel sizes, the pooling strategy, and other hyperparameters according to the needs of the experiment, together with suitable activation functions and an optimizer.
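The sketch below shows how these hyperparameters might be exposed as function arguments; the build_textcnn name and all default values are placeholders, and the model actually used in this report appears in the code section later on.

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, Flatten, Dense

def build_textcnn(vocab_size, embed_dim=100, seq_len=1000,
                  filters=128, kernel_size=5, num_classes=20):
    # filters, kernel_size and the pooling sizes are the tunable knobs
    model = Sequential([
        Embedding(vocab_size, embed_dim, input_length=seq_len),
        Conv1D(filters, kernel_size, activation='relu'),
        MaxPooling1D(5),
        Conv1D(filters, kernel_size, activation='relu'),
        MaxPooling1D(5),
        Flatten(),
        Dense(num_classes, activation='softmax'),
    ])
    model.compile(loss='categorical_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    return model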

4.3 Applying word embeddings

I use a pre-trained word embedding model (such as GloVe) to convert the text data into vector representations. These vectors capture semantic relationships between words and help extract features from the text. I explore different embedding dimensions and choices of pre-trained model and apply them inside the CNN.

4.4 Model training and optimization

The CNN is trained on the preprocessed training set. During training I use a suitable loss function and evaluation metrics, and optimize and tune the model with techniques such as cross-validation. Adjusting the hyperparameters, regularization, and batch size improves the model's performance and generalization.
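A rough sketch of cross-validated hyperparameter tuning, assuming the build_textcnn helper from section 4.2 and the data and labels arrays built in the code section below (the grid and epoch count are arbitrary):

import numpy as np
from sklearn.model_selection import KFold

for kernel_size in (3, 5):
    fold_scores = []
    for train_idx, val_idx in KFold(n_splits=3, shuffle=True,
                                    random_state=0).split(data):
        model = build_textcnn(vocab_size=20001, kernel_size=kernel_size)
        model.fit(data[train_idx], labels[train_idx],
                  epochs=2, batch_size=64, verbose=0)
        _, acc = model.evaluate(data[val_idx], labels[val_idx], verbose=0)
        fold_scores.append(acc)
    print(kernel_size, np.mean(fold_scores))  # mean validation accuracy per kernel size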

4.5 Model evaluation and comparison

After training, I evaluate the trained model on the preprocessed test set, computing accuracy, precision, recall, and F1 score for a complete picture of its performance. I also compare it with other common text classification methods to verify the advantages and competitiveness of the CNN-based approach.
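One way these metrics could be computed with scikit-learn, assuming the model, x_test and y_test produced by the code section below (y_test is one-hot encoded):

import numpy as np
from sklearn.metrics import classification_report

y_pred = np.argmax(model.predict(x_test), axis=1)  # predicted class indices
y_true = np.argmax(y_test, axis=1)                 # true class indices

# Per-class precision, recall and F1, plus overall accuracy
print(classification_report(y_true, y_pred))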

4.6 Result analysis and discussion

Based on the experimental results, I analyze and discuss the model's performance, focusing on how it behaves on the different topics and how the word embeddings affect it. From this analysis I draw conclusions and suggest improvements to the model.

5. Dataset and Preprocessing

This report uses a dataset containing multiple topics; each topic corresponds to a folder of text files. I first walk through each folder, read the file contents, and assign a label to each text. The texts are then segmented with the jieba library, and a vocabulary is built from the word-frequency counts.

6. Word Cloud

A word cloud is a visualization of the frequency and importance of the words in text data. In this experiment I drew one to better understand the characteristics and key information of the data. Before drawing it, I preprocessed the text (cleaning, segmentation, and stop-word removal) to reduce noise and extract the useful information, then computed the frequency or weight of each word and visualized the result. The word cloud is shown in the figure below:

7. Word Embeddings and Feature Representation

To represent the text in a form a computer can process, I use the pre-trained GloVe model to convert each word into a word vector and build an embedding matrix that maps the vocabulary to these vectors. The Tokenizer and pad_sequences utilities then turn the texts into fixed-length numeric sequences suitable as input for the convolutional neural network.

8. ANOVA Feature Selection

Analysis of variance (ANOVA) is a common statistical method for comparing the means of different groups. In feature selection, ANOVA can identify which features are most significant for a classification or regression task.

In this experiment I used ANOVA as the feature selection method, aiming to identify the features most strongly related to the target variable. Concretely, I took a dataset with many features and used ANOVA to score the association between each feature and the target variable.

The finally selected feature data are shown in the figure below:
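A minimal sketch of this step with scikit-learn's ANOVA F-test, assuming the data and labels arrays built in the code section below (k=500 is an arbitrary illustrative number of features to keep):

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

y = np.argmax(labels, axis=1)             # back from one-hot to class indices
selector = SelectKBest(f_classif, k=500)  # score each column with the ANOVA F-test
data_selected = selector.fit_transform(data, y)
print(data_selected.shape)                # (num_samples, 500)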

9. The TextCNN Model

This experiment uses a simple convolutional neural network for text classification. The model consists of an embedding layer, convolutional layers, pooling layers, and fully connected layers. The convolution and pooling layers extract local features from the text, and the fully connected layers classify the extracted features. Training uses the cross-entropy loss and the Adam optimizer, with a training set and a validation set for fitting and evaluation.

9.1 Embedding layer

Converts the input integer sequence into dense vectors, with each word represented by a fixed-length vector. The output shape is (None, 1000, 100), where 1000 is the input sequence length and 100 is the dimension of each word vector.

9.2 Conv1D layers

Three convolutional layers (conv1d, conv1d_1, conv1d_2) apply one-dimensional convolutions to their input. They use the relu activation and output feature maps of shape (None, sequence_length - kernel_size + 1, filters), where sequence_length is the input length, kernel_size the kernel width, and filters the number of kernels.

9.3 MaxPooling1D layers

Three max-pooling layers (max_pooling1d, max_pooling1d_1, max_pooling1d_2) apply max pooling to the feature maps to reduce their size. Their output shape is (None, reduced_length, filters), where reduced_length is the length after pooling.

9.4 Flatten layer

Flattens the pooled output into a one-dimensional tensor that can be fed to the fully connected layers.

9.5 Dense layers

Two fully connected layers (dense, dense_1) with 120 and num_classes neurons respectively, where num_classes is the number of classes. They use the relu and softmax activations and produce the final classification.

The model architecture is shown in the figure below:

10. Results and Analysis

I split the dataset into a training set and a test set, trained the model on the former, and evaluated it on the latter. The results show that the CNN-based text classification method reaches high accuracy on the test set. By plotting the training and validation curves, I also observed how the accuracy and loss evolved during training.

11. Conclusions and Outlook

This experiment presented a CNN-based text classification method and evaluated it on a multi-topic dataset. The results show that the method performs well on the task, with high accuracy and efficiency. There is still room for improvement: for example, a more complex convolutional architecture or an attention mechanism could push classification performance further.
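As a rough illustration of the attention idea mentioned above (not part of this experiment), a simple additive attention pooling layer could replace the final max-pooling step; the AttentionPooling name and design are assumptions:

import tensorflow as tf
from tensorflow.keras import layers

class AttentionPooling(layers.Layer):
    # Scores every time step with a small dense layer, normalizes the
    # scores with softmax, and returns the weighted sum over time steps.
    def __init__(self):
        super().__init__()
        self.score = layers.Dense(1)

    def call(self, x):  # x: (batch, steps, features)
        weights = tf.nn.softmax(self.score(x), axis=1)  # (batch, steps, 1)
        return tf.reduce_sum(weights * x, axis=1)       # (batch, features)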

Code and results:

import os
import numpy as np
 
# In[2]: 
 
embeddings_index = {}  # dictionary mapping each word to its GloVe vector

# Read the pre-trained GloVe vectors (100-dimensional)
with open('./data/glove.6B/glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()  # split each line into the word and its vector components
        word = values[0]  # the word itself
        coefs = np.asarray(values[1:], dtype='float32')  # the vector as a NumPy array
        embeddings_index[word] = coefs  # store the mapping
 
# In[3]: 
 
texts = []         # list of document bodies
labels_index = {}  # mapping from label name to numeric index
labels = []        # list of label indices, parallel to texts

# Each sub-folder of 20_newsgroup holds the documents of one class
for name in sorted(os.listdir('./data/20_newsgroup')):
    path = os.path.join('./data/20_newsgroup', name)

    if os.path.isdir(path):
        label_id = len(labels_index)  # next free index
        labels_index[name] = label_id

        # Every file whose name is all digits is a newsgroup post
        for fname in sorted(os.listdir(path)):
            if fname.isdigit():
                framepath = os.path.join(path, fname)

                # latin-1 avoids decoding errors in the raw Usenet posts
                with open(framepath, encoding='latin-1') as f:
                    texts.append(f.read())

                labels.append(label_id)  # record the label for this text
 
# In[4]: 
 
import jieba 
 
# In[5]: 
 
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
from wordcloud import WordCloud, STOPWORDS

# Count word frequencies over the corpus
def words_count():
    word_dict = {}  # word -> occurrence count

    # Only the first 10000 documents are used, to keep this fast
    for item in texts[:10000]:
        for i in jieba.cut(item):  # tokenize with jieba
            if len(i) <= 3:  # skip short tokens (mostly stop words and noise)
                continue
            if i not in word_dict:
                word_dict[i] = 1
            else:
                word_dict[i] += 1

    return word_dict

words_count()  # show the counts in the notebook
 
# In[21]: 
 
def WordCloud_plot(mask_picture='./duihuakuan.jpg'):
    p1 = plt.figure(figsize=(16, 8), dpi=80)  # new figure window
    image = Image.open(mask_picture)  # load the mask image
    graph = np.array(image)  # convert the image to a NumPy array

    wc = WordCloud(background_color='White',  # white background
                   mask=graph,  # shape the cloud with the mask image
                   max_words=2000,  # at most 2000 words in the cloud
                   stopwords=STOPWORDS,  # built-in stop-word list
                   font_path='./simhei.ttf',  # font that can render Chinese
                   max_font_size=100,  # largest font size
                   random_state=30  # fixed seed for a reproducible layout
                   )

    wc.generate_from_frequencies(words_count())  # build the cloud from the word counts

    plt.imshow(wc)  # draw the cloud
    plt.axis("off")  # hide the axes
    plt.show()

WordCloud_plot()
 

# In[5]: 
 
import tensorflow 
 
# In[6]: 
 
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
 
# In[7]: 
 
tokenizer = Tokenizer(num_words=20000)  # keep only the 20000 most frequent words
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)  # texts -> lists of word indices

word_index = tokenizer.word_index  # full word -> index mapping

data = pad_sequences(sequences, maxlen=1000)  # pad/truncate every sequence to length 1000

labels = to_categorical(np.asarray(labels))  # one-hot encode the labels
 
# In[8]: 
 
data  # inspect the padded sequence matrix in the notebook
 
# In[9]: 
 
# train_test_split from sklearn.model_selection
from sklearn.model_selection import train_test_split

# Hold out 10% of the data for testing, 90% for training
x_train, x_test, y_train, y_test = train_test_split(data,
                                                    labels,
                                                    test_size=0.1,
                                                    random_state=0)

# Print the shapes of the four splits
print('{}{}{}{}'.format(x_train.shape, x_test.shape, y_train.shape,
                        y_test.shape))
 
# In[10]: 
 
nb_words = min(20000, len(word_index))  # number of words actually used (capped at 20000)
embedding_matrix = np.zeros((nb_words + 1, 100))  # row i holds the 100-d vector of word index i

# Fill the matrix with the pre-trained GloVe vectors
for word, i in word_index.items():
    if i > nb_words:  # indices beyond the cap are not used by the model
        continue
    embedding_vector = embeddings_index.get(word)  # look up the GloVe vector
    if embedding_vector is not None:  # words without a pre-trained vector stay all-zero
        embedding_matrix[i] = embedding_vector
 
 
# In[11]: 
 
from tensorflow.keras.layers import Dense, Input, Flatten, Conv1D, MaxPooling1D, Embedding
from tensorflow.keras.models import Model, Sequential
import tensorflow as tf
 
# In[12]: 
 
x_train.shape 
 
# In[13]: 
 
class Model_cnn(Model):
    def __init__(self):
        super(Model_cnn, self).__init__()

        # Define the layers of the model; the embedding layer is
        # initialized with the pre-trained GloVe vectors
        self.e1 = Embedding(nb_words + 1, 100, input_length=1000,
                            weights=[embedding_matrix], trainable=True)
        self.c1 = Conv1D(128, 5, activation='relu')
        self.c2 = Conv1D(128, 5, activation='relu')
        self.c3 = Conv1D(128, 5, activation='relu')
        self.p1 = MaxPooling1D(5)
        self.p2 = MaxPooling1D(5)
        self.p3 = MaxPooling1D(35)
        self.flatten = Flatten()
        self.f1 = Dense(120, activation='relu')
        self.f2 = Dense(len(labels_index), activation='softmax')

    def call(self, x):
        # Forward pass
        x = self.e1(x)       # embedding
        x = self.c1(x)       # conv block 1
        x = self.p1(x)
        x = self.c2(x)       # conv block 2
        x = self.p2(x)
        x = self.c3(x)       # conv block 3
        x = self.p3(x)
        x = self.flatten(x)  # flatten to a vector
        x = self.f1(x)       # hidden dense layer
        y = self.f2(x)       # softmax over the classes
        return y

model = Model_cnn()  # (superseded by the equivalent Sequential model built below)

# In[18]: 
 
model = Sequential()
model.add(Embedding(nb_words + 1, 100, input_length=1000,
                    weights=[embedding_matrix], trainable=True))  # pre-trained GloVe vectors
model.add(Conv1D(128, 5, activation='relu'))
model.add(MaxPooling1D(5))
model.add(Conv1D(128, 5, activation='relu'))
model.add(MaxPooling1D(5))
model.add(Conv1D(128, 5, activation='relu'))
model.add(MaxPooling1D(35))
model.add(Flatten())
model.add(Dense(120, activation='relu'))
model.add(Dense(len(labels_index), activation='softmax'))  # one output per class
 
from tensorflow.keras.utils import plot_model

plot_model(model, to_file='model.png', show_shapes=True)
 
# In[13]: 
 
model.compile(loss='categorical_crossentropy', 
              optimizer='Adam', 
              metrics=['accuracy']) 
 
# In[14]: 
 
import tensorflow as tf
from matplotlib import pyplot as plt
 
# In[15]: 
 
def scheduler(epoch):
    # Keep the learning rate constant for the first 5 epochs,
    # then decay it exponentially
    if epoch < 5:
        return 0.001
    else:
        lr = 0.001 * tf.math.exp(0.1 * (5 - epoch))
        return lr.numpy()


reduce_lr = tf.keras.callbacks.LearningRateScheduler(scheduler)
 
# In[17]: 
 
history = model.fit(x_train,
                    y_train,
                    batch_size=64,
                    epochs=5,
                    validation_data=(x_test, y_test),
                    validation_freq=1,
                    callbacks=[reduce_lr])  # apply the learning-rate schedule
 
# In[18]: 
 
model.summary() 
 
# In[19]: 
 
from matplotlib import pyplot as plt 
 
# In[20]: 
 
acc = history.history['accuracy']  # training accuracy per epoch
val_acc = history.history['val_accuracy']  # validation accuracy per epoch
loss = history.history['loss']  # training loss per epoch
val_loss = history.history['val_loss']  # validation loss per epoch

plt.subplot(1, 2, 1)  # left panel of a 1x2 grid
plt.plot(acc, label='Training Accuracy')
plt.plot(val_acc, label='Validation Accuracy')
plt.title('Training and Validation Accuracy')
plt.legend()

plt.subplot(1, 2, 2)  # right panel
plt.plot(loss, label='Training Loss')
plt.plot(val_loss, label='Validation Loss')
plt.title('Training and Validation Loss')
plt.legend()

plt.show()
 
# In[26]: 
 
text='Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:49960 alt.atheism.moderated:713 news.answers:7054 alt.answers:126\nPath: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!bb3.andrew.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!magnus.acs.ohio-state.edu!usenet.ins.cwru.edu!agate!spool.mu.edu!uunet!pipex!ibmpcug!mantis!mathew\nFrom: mathew <mathew@mantis.co.uk>\nNewsgroups: alt.atheism,alt.atheism.moderated,news.answers,alt.answers\nSubject: Alt.Atheism FAQ: Atheist Resources\nSummary: Books, addresses, music -- anything related to atheism\nKeywords: FAQ, atheism, books, music, fiction, addresses, contacts\nMessage-ID: <19930329115719@mantis.co.uk>\nDate: Mon, 29 Mar 1993 11:57:19 GMT\nExpires: Thu, 29 Apr 1993 11:57:19 GMT\nFollowup-To: alt.atheism\nDistribution: world\nOrganization: Mantis Consultants, Cambridge. UK.\nApproved: news-answers-request@mit.edu\nSupersedes: <19930301143317@mantis.co.uk>\nLines: 290\n\nArchive-name: atheism/resources\nAlt-atheism-archive-name: resources\nLast-modified: 11 December 1992\nVersion: 1.0\n\n                              Atheist Resources\n\n                      Addresses of Atheist Organizations\n\n                                     USA\n\nFREEDOM FROM RELIGION FOUNDATION\n\nDarwin fish bumper stickers and assorted other atheist paraphernalia are\navailable from the Freedom From Religion Foundation in the US.\n\nWrite to:  FFRF, P.O. Box 750, Madison, WI 53701.\nTelephone: (608) 256-8900\n\nEVOLUTION DESIGNS\n\nEvolution Designs sell the "Darwin fish".  It\'s a fish symbol, like the ones\nChristians stick on their cars, but with feet and the word "Darwin" written\ninside.  The deluxe moulded 3D plastic fish is $4.95 postpaid in the US.\n\nWrite to:  Evolution Designs, 7119 Laurel Canyon #4, North Hollywood,\n           CA 91605.\n\nPeople in the San Francisco Bay area can get Darwin Fish from Lynn Gold --\ntry mailing <figmo@netcom.com>.  For net people who go to Lynn directly, the\nprice is $4.95 per fish.\n\nAMERICAN ATHEIST PRESS\n\nAAP publish various atheist books -- critiques of the Bible, lists of\nBiblical contradictions, and so on.  One such book is:\n\n"The Bible Handbook" by W.P. Ball and G.W. Foote.  American Atheist Press.\n372 pp.  ISBN 0-910309-26-4, 2nd edition, 1986.  Bible contradictions,\nabsurdities, atrocities, immoralities... contains Ball, Foote: "The Bible\nContradicts Itself", AAP.  Based on the King James version of the Bible.\n\nWrite to:  American Atheist Press, P.O. Box 140195, Austin, TX 78714-0195.\n      or:  7215 Cameron Road, Austin, TX 78752-2973.\nTelephone: (512) 458-1244\nFax:       (512) 467-9525\n\nPROMETHEUS BOOKS\n\nSell books including Haught\'s "Holy Horrors" (see below).\n\nWrite to:  700 East Amherst Street, Buffalo, New York 14215.\nTelephone: (716) 837-2475.\n\nAn alternate address (which may be newer or older) is:\nPrometheus Books, 59 Glenn Drive, Buffalo, NY 14228-2197.\n\nAFRICAN-AMERICANS FOR HUMANISM\n\nAn organization promoting black secular humanism and uncovering the history of\nblack freethought.  They publish a quarterly newsletter, AAH EXAMINER.\n\nWrite to:  Norm R. Allen, Jr., African Americans for Humanism, P.O. 
Box 664,\n           Buffalo, NY 14226.\n\n                                United Kingdom\n\nRationalist Press Association          National Secular Society\n88 Islington High Street               702 Holloway Road\nLondon N1 8EW                          London N19 3NL\n071 226 7251                           071 272 1266\n\nBritish Humanist Association           South Place Ethical Society\n14 Lamb\'s Conduit Passage              Conway Hall\nLondon WC1R 4RH                        Red Lion Square\n071 430 0908                           London WC1R 4RL\nfax 071 430 1271                       071 831 7723\n\nThe National Secular Society publish "The Freethinker", a monthly magazine\nfounded in 1881.\n\n                                   Germany\n\nIBKA e.V.\nInternationaler Bund der Konfessionslosen und Atheisten\nPostfach 880, D-1000 Berlin 41. Germany.\n\nIBKA publish a journal:\nMIZ. (Materialien und Informationen zur Zeit. Politisches\nJournal der Konfessionslosesn und Atheisten. Hrsg. IBKA e.V.)\nMIZ-Vertrieb, Postfach 880, D-1000 Berlin 41. Germany.\n\nFor atheist books, write to:\n\nIBDK, Internationaler B"ucherdienst der Konfessionslosen\nPostfach 3005, D-3000 Hannover 1. Germany.\nTelephone: 0511/211216\n\n\n                               Books -- Fiction\n\nTHOMAS M. DISCH\n\n"The Santa Claus Compromise"\nShort story.  The ultimate proof that Santa exists.  All characters and \nevents are fictitious.  Any similarity to living or dead gods -- uh, well...\n\nWALTER M. MILLER, JR\n\n"A Canticle for Leibowitz"\nOne gem in this post atomic doomsday novel is the monks who spent their lives\ncopying blueprints from "Saint Leibowitz", filling the sheets of paper with\nink and leaving white lines and letters.\n\nEDGAR PANGBORN\n\n"Davy"\nPost atomic doomsday novel set in clerical states.  The church, for example,\nforbids that anyone "produce, describe or use any substance containing...\natoms". \n\nPHILIP K. DICK\n\nPhilip K. Dick Dick wrote many philosophical and thought-provoking short \nstories and novels.  His stories are bizarre at times, but very approachable.\nHe wrote mainly SF, but he wrote about people, truth and religion rather than\ntechnology.  Although he often believed that he had met some sort of God, he\nremained sceptical.  Amongst his novels, the following are of some relevance:\n\n"Galactic Pot-Healer"\nA fallible alien deity summons a group of Earth craftsmen and women to a\nremote planet to raise a giant cathedral from beneath the oceans.  When the\ndeity begins to demand faith from the earthers, pot-healer Joe Fernwright is\nunable to comply.  A polished, ironic and amusing novel.\n\n"A Maze of Death"\nNoteworthy for its description of a technology-based religion.\n\n"VALIS"\nThe schizophrenic hero searches for the hidden mysteries of Gnostic\nChristianity after reality is fired into his brain by a pink laser beam of\nunknown but possibly divine origin.  He is accompanied by his dogmatic and\ndismissively atheist friend and assorted other odd characters.\n\n"The Divine Invasion"\nGod invades Earth by making a young woman pregnant as she returns from\nanother star system.  Unfortunately she is terminally ill, and must be\nassisted by a dead man whose brain is wired to 24-hour easy listening music.\n\nMARGARET ATWOOD\n\n"The Handmaid\'s Tale"\nA story based on the premise that the US Congress is mysteriously\nassassinated, and fundamentalists quickly take charge of the nation to set it\n"right" again.  
The book is the diary of a woman\'s life as she tries to live\nunder the new Christian theocracy.  Women\'s right to own property is revoked,\nand their bank accounts are closed; sinful luxuries are outlawed, and the\nradio is only used for readings from the Bible.  Crimes are punished\nretroactively: doctors who performed legal abortions in the "old world" are\nhunted down and hanged.  Atwood\'s writing style is difficult to get used to\nat first, but the tale grows more and more chilling as it goes on.\n\nVARIOUS AUTHORS\n\n"The Bible"\nThis somewhat dull and rambling work has often been criticized.  However, it\nis probably worth reading, if only so that you\'ll know what all the fuss is\nabout.  It exists in many different versions, so make sure you get the one\ntrue version.\n\n                             Books -- Non-fiction\n\nPETER DE ROSA\n\n"Vicars of Christ", Bantam Press, 1988\nAlthough de Rosa seems to be Christian or even Catholic this is a very\nenlighting history of papal immoralities, adulteries, fallacies etc.\n(German translation: "Gottes erste Diener. Die dunkle Seite des Papsttums",\nDroemer-Knaur, 1989)\n\nMICHAEL MARTIN\n\n"Atheism: A Philosophical Justification", Temple University Press,\n Philadelphia, USA.\nA detailed and scholarly justification of atheism.  Contains an outstanding\nappendix defining terminology and usage in this (necessarily) tendentious\narea.  Argues both for "negative atheism" (i.e. the "non-belief in the\nexistence of god(s)") and also for "positive atheism" ("the belief in the\nnon-existence of god(s)").  Includes great refutations of the most\nchallenging arguments for god; particular attention is paid to refuting\ncontempory theists such as Platinga and Swinburne.\n541 pages. ISBN 0-87722-642-3 (hardcover; paperback also available)\n\n"The Case Against Christianity", Temple University Press\nA comprehensive critique of Christianity, in which he considers\nthe best contemporary defences of Christianity and (ultimately)\ndemonstrates that they are unsupportable and/or incoherent.\n273 pages. ISBN 0-87722-767-5\n\nJAMES TURNER\n\n"Without God, Without Creed", The Johns Hopkins University Press, Baltimore,\n MD, USA\nSubtitled "The Origins of Unbelief in America".  Examines the way in which\nunbelief (whether agnostic or atheistic)  became a mainstream alternative\nworld-view.  Focusses on the period 1770-1900, and while considering France\nand Britain the emphasis is on American, and particularly New England\ndevelopments.  "Neither a religious history of secularization or atheism,\nWithout God, Without Creed is, rather, the intellectual history of the fate\nof a single idea, the belief that God exists." \n316 pages. ISBN (hardcover) 0-8018-2494-X (paper) 0-8018-3407-4\n\nGEORGE SELDES (Editor)\n\n"The great thoughts", Ballantine Books, New York, USA\nA "dictionary of quotations" of a different kind, concentrating on statements\nand writings which, explicitly or implicitly, present the person\'s philosophy\nand world-view.  Includes obscure (and often suppressed) opinions from many\npeople.  For some popular observations, traces the way in which various\npeople expressed and twisted the idea over the centuries.  Quite a number of\nthe quotations are derived from Cardiff\'s "What Great Men Think of Religion"\nand Noyes\' "Views of Religion".\n490 pages. 
ISBN (paper) 0-345-29887-X.\n\nRICHARD SWINBURNE\n\n"The Existence of God (Revised Edition)", Clarendon Paperbacks, Oxford\nThis book is the second volume in a trilogy that began with "The Coherence of\nTheism" (1977) and was concluded with "Faith and Reason" (1981).  In this\nwork, Swinburne attempts to construct a series of inductive arguments for the\nexistence of God.  His arguments, which are somewhat tendentious and rely\nupon the imputation of late 20th century western Christian values and\naesthetics to a God which is supposedly as simple as can be conceived, were\ndecisively rejected in Mackie\'s "The Miracle of Theism".  In the revised\nedition of "The Existence of God", Swinburne includes an Appendix in which he\nmakes a somewhat incoherent attempt to rebut Mackie.\n\nJ. L. MACKIE\n\n"The Miracle of Theism", Oxford\nThis (posthumous) volume contains a comprehensive review of the principal\narguments for and against the existence of God.  It ranges from the classical\nphilosophical positions of Descartes, Anselm, Berkeley, Hume et al, through\nthe moral arguments of Newman, Kant and Sidgwick, to the recent restatements\nof the classical theses by Plantinga and Swinburne.  It also addresses those\npositions which push the concept of God beyond the realm of the rational,\nsuch as those of Kierkegaard, Kung and Philips, as well as "replacements for\nGod" such as Lelie\'s axiarchism.  The book is a delight to read - less\nformalistic and better written than Martin\'s works, and refreshingly direct\nwhen compared with the hand-waving of Swinburne.\n\nJAMES A. HAUGHT\n\n"Holy Horrors: An Illustrated History of Religious Murder and Madness",\n Prometheus Books\nLooks at religious persecution from ancient times to the present day -- and\nnot only by Christians.\nLibrary of Congress Catalog Card Number 89-64079. 1990.\n\nNORM R. ALLEN, JR.\n\n"African American Humanism: an Anthology"\nSee the listing for African Americans for Humanism above.\n\nGORDON STEIN\n\n"An Anthology of Atheism and Rationalism", Prometheus Books\nAn anthology covering a wide range of subjects, including \'The Devil, Evil\nand Morality\' and \'The History of Freethought\'.  Comprehensive bibliography.\n\nEDMUND D. COHEN\n\n"The Mind of The Bible-Believer", Prometheus Books\nA study of why people become Christian fundamentalists, and what effect it\nhas on them.\n\n                                Net Resources\n\nThere\'s a small mail-based archive server at mantis.co.uk which carries\narchives of old alt.atheism.moderated articles and assorted other files.  For\nmore information, send mail to archive-server@mantis.co.uk saying\n\n   help\n   send atheism/index\n\nand it will mail back a reply.\n\n\nmathew\nÿ\n' 
 
# In[28]: 
 
text = [text]  # the tokenizer expects a list of documents
sequences = tokenizer.texts_to_sequences(text)

val = pad_sequences(sequences, maxlen=1000)  # same length as the training inputs
val.shape

# In[30]:

pre = model.predict(val)
np.argmax(pre, axis=1)  # index of the predicted class
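To map the predicted index back to a newsgroup name, the labels_index mapping built earlier can be inverted (a small convenience sketch):

index_to_label = {i: name for name, i in labels_index.items()}
print(index_to_label[int(np.argmax(pre, axis=1)[0])])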

Reflections

After nearly two weeks of practical training, I have gained a great deal. This was my first attempt at building a spam SMS classifier based on text content. I was curious about it and spent a lot of effort and time installing the required libraries, which also helped clarify my thinking.

At the start of the training I was somewhat lost and could not find a direction, so I began by working through the reference code provided by the teacher. From that code I formed a rough plan of what I needed to do (combining the teacher's code with my own thinking):

  1. Prepare the dataset: clean the text, segment it, and remove stop words, so the data is ready for model training and evaluation.
  2. Build a convolutional neural network model suited to text classification.
  3. Use a pre-trained word embedding model (such as GloVe) to convert the text data into vector representations.
  4. Train and optimize the model: fit the CNN on the preprocessed training set.
  5. Evaluate and compare: after training, evaluate the trained model on the preprocessed test set.
  6. Analyze and discuss: analyze the model's performance based on the experimental results.

During the training I first used the teacher's reference files to get a general picture. After understanding the code, I loaded the dataset (here I used my own dataset) into a dictionary so that I would not need to read the files again later. I then mapped the words to word vectors and counted them for the word cloud. After displaying the word cloud, I built the model for text classification, completed the classification, and drew and analyzed the conclusions.

The whole process looks simple, but it took a lot of thought. The first problem was incompatibility between libraries and between Python versions, which forced me to reinstall Anaconda and the related environment; I spent most of the first day setting it up. The next problem was the word-cloud parameters: a mistake of mine set them incorrectly, but I corrected it promptly and displayed the cloud successfully. The last problem was that when I first built the model the accuracy was low, only around 50%; after debugging it in PyCharm and fixing it, I obtained a much better accuracy.

After training, the experiment showed that the method performs well on the text classification task, with good accuracy and efficiency. There is still room for improvement, for example using a more complex convolutional network structure or introducing an attention mechanism to further improve classification performance.

Finishing the experiment gave me a great sense of achievement and a clearer view of myself, including my many shortcomings. It also taught me that even if you know nothing today, learning a little means you are a step ahead of yesterday; learning is a continuous process.
