Learning Python: text analysis of an article

When studying NLP we often work with corpora and need to analyze them. In this post we get some hands-on practice by analyzing one of Andrew Ng's answers on Quora:


The original text is copied below:

Deep Learning is an amazing tool that is helping numerous groups create exciting AI applications. It is helping us build self-driving cars, accurate speech recognition, computers that can understand images, and much more.
Despite all the recent progress, I still see huge untapped opportunities ahead. There're many projects in precision agriculture, consumer finance, medicine, ... where I see a clear opportunity for deep learning to have a big impact, but that none of us have had time to focus on yet. So I'm confident deep learning isn't going to "plateau" anytime soon and that it'll continue to grow rapidly.
Deep Learning has also been overhyped. Because neural networks are very technical and hard to explain, many of us used to explain it by drawing an analogy to the human brain. But we have pretty much no idea how the biological brain works. UC Berkeley's Michael Jordan calls deep learning a "cartoon" of the biological brain--a vastly oversimplified version of something we don't even understand--and I agree. Despite the media hype, we're nowhere near being able to build human-level intelligence. Because we fundamentally don't know how the brain works, attempts to blindly replicate what little we know in a computer also has not resulted in particularly useful AI systems. Instead, the most effective deep learning work today has made its progress by drawing from CS and engineering principles and at most a touch of biological inspiration, rather than try to blindly copy biology.
Concretely, if you hear someone say "The brain does X. My system also does X. Thus we're on a path to building the brain," my advice is to run away!

Many of the ideas used in deep learning have been around for decades. Why is it taking off only now? Two of the key drivers of its progress are: (i) scale of data and (ii) scale of computation. With our society spending more time on websites and mobile devices, for the past two decades we've been rapidly accumulating data. It was only recently that we figured out how to scale computation so as to build deep learning algorithms that can take advantage of this voluminous amount of data.
This has now put us in two positive feedback loops, which is accelerating the progress of deep learning:
First, now that we have huge machines to absorb huge amounts of data, the value of big data is clearer. This creates a greater incentive to acquire more data, which in turn creates a greater incentive to build bigger/faster neural networks.
Second, that we have fast deep learning implementations also speeds up innovation, and accelerates deep learning's research progress. Many people underestimate the impact of computer systems investments in deep learning. When carrying out deep learning research, we start out not knowing what algorithms will and won't work, and our job is to run a lot of experiments and figure it out. If we have an efficient compute infrastructure that lets you run an experiment in a day rather than a week, then your research progress could be almost 7x as fast!
This is why around 2008 my group at Stanford started advocating shifting deep learning to GPUs (this was really controversial at that time; but now everyone does it); and I'm now advocating shifting to HPC (High Performance Computing/Supercomputing) tactics for scaling up deep learning. Machine learning should embrace HPC. These methods will make researchers more efficient and help accelerate the progress of our whole field.

To summarize: Deep learning has already helped AI made tremendous progress. But the best is still to come!

We have two goals in analyzing this article: count how many times each word appears, and then pick out the words that appear most often. The steps below work toward these two goals:

We will use the matplotlib module to draw the chart:

1: Tokenization


A nice property of English text is that words are separated by spaces, but the words often come with punctuation such as periods and commas attached. We use the string module to remove it: string.punctuation holds the punctuation characters and string.whitespace holds the whitespace characters, and strip() removes any of them from both ends of a word. The line hist[word] = hist.get(word, 0) + 1 is equivalent to the if-else shown in the comment; it records each word together with the number of times it appears.
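A minimal sketch of this step, using the same process_line function as the full listing at the end of the post:

import string

def process_line(line, hist):
    # split the line on whitespace, clean each word, and count it
    for word in line.split():
        # strip punctuation and whitespace from both ends of the word
        word = word.strip(string.punctuation + string.whitespace)
        # lower-case so "Deep" and "deep" count as the same word
        word = word.lower()
        hist[word] = hist.get(word, 0) + 1

hist = {}
process_line('Deep Learning is an amazing tool.', hist)
print(hist)   # {'deep': 1, 'learning': 1, 'is': 1, 'an': 1, 'amazing': 1, 'tool': 1}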



2: Sorting


The previous function gives us every word and its count, but a dict is not ordered. Here we sort with a list instead: lists have a sort() method, and with reverse=True it sorts from largest to smallest.
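A sketch of this sorting step (the same most_word function as in the full listing): the count goes first in each [count, word] pair so that list.sort() orders by count.

def most_word(hist, num):
    # turn the dict into [count, word] pairs so sort() orders by count
    temp = []
    for key, value in hist.items():
        temp.append([value, key])
    temp.sort(reverse=True)   # largest count first
    return temp[:num]

For example, most_word(hist, 3) returns the three most frequent words as [count, word] pairs.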



3: Plotting


Plotting with matplotlib should be familiar to everyone; we draw a bar chart of the counts.


Ideally the chart would have word labels along the bottom, for example:


That would be best, but after switching to another computer I found the labels at the bottom looked terrible and far too crowded, so I removed them. If you're interested, you can add them back.
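If you do want the word labels back, one option (a sketch, not part of the original code; plot_top_words is just a hypothetical helper name) is to pass the words to plt.xticks and rotate them so they don't crowd each other:

from matplotlib import pyplot as plt

def plot_top_words(data):
    # data is a list of [count, word] pairs, as returned by most_word()
    counts = [count for count, word in data]
    words = [word for count, word in data]
    positions = range(len(data))
    plt.bar(positions, counts)
    # rotated tick labels stay readable even with 20 words
    plt.xticks(positions, words, rotation=45, ha='right')
    plt.xlabel('word')
    plt.ylabel('count')
    plt.title('show')
    plt.tight_layout()
    plt.show()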

The complete code is as follows:

#-*- coding:utf-8 -*-
import string
from matplotlib import pyplot as plt

def process_line(line, hist):
    # process one line of text
    for word in line.split():
        # strip punctuation from both ends of the word
        word = word.strip(string.punctuation+string.whitespace)
        # normalize every word to lower case
        word = word.lower()
        # dict idiom
        '''
        original logic:
        if word not in hist:
             hist[word] = 1
        else:
             hist[word] += 1
        '''
        hist[word]= hist.get(word,0)+1


def process_file(filename):
    # process the whole file
    res = {}
    with open(filename,'r') as f:
        for line in f:
            process_line(line, res)
    return res


def most_word(hist,num):
    # return the given number of word counts, from high to low
    temp = []
    for key,value in hist.items():
        temp.append([value,key])
    temp.sort(reverse=True)
    print(temp)
    return temp[:num]



if __name__ == '__main__':
    hist = process_file('emma.txt')
    data = most_word(hist, 20)  # the 20 most frequent words
    for i in range(len(data)):
        plt.bar(i, data[i][0])
    plt.xlabel('word')
    plt.ylabel('count')
    plt.title('show')
    plt.show()

