Python Data Analysis: Overview and Examples

wdyx55

Published 2024-06-12 21:02:35


Tags: python, data analysis

Article link: https://blog.csdn.net/wdyx55/article/details/139636028


I. Advantages of Python for data analysis

Python is a widely used programming language with natural advantages in data science.

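(1) Simple, concise syntax. Compared with other programming languages, Python is easier for beginners to pick up.

(2) A large collection of powerful libraries. Combined with its strength as a general-purpose language, this means data-centric applications can be built with Python alone.

(3) Powerful features. In terms of features, Python is a hybrid. Its rich toolset places it between traditional scripting languages and systems languages: it is as simple and easy to use as a scripting language, yet it also offers the advanced software-engineering tools found in compiled languages.

(4) Python suits not only research and prototyping but also production systems. When researchers and engineers use the same programming tool, the organization benefits noticeably and operating costs fall.

(5) Python is a glue language. Python programs can easily be "glued" to components written in other languages in several ways. For example, Python's C API lets Python programs call C code flexibly, so users can add functionality to a Python program as needed, or use Python inside other environments.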

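II. Common Python libraries for data analysis

The libraries used most often when learning Python for data analysis are numpy, scipy, pandas, matplotlib, seaborn, pyecharts, and scikit-learn. The strengths and weaknesses of each library are listed below: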

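NumPy

Advantages:

Efficiency: for the same numerical task, using NumPy is far more convenient than writing plain Python code.
Storage efficiency: NumPy arrays store data and perform input/output far more efficiently than the equivalent built-in Python data structures.
Low-level performance: most of NumPy is written in C, and its underlying algorithms are designed for performance, which makes NumPy faster than pure Python code.

Disadvantages:

Memory limits: NumPy arrays normally live in memory (memory-mapped files via np.memmap are available for larger data), so the amount of available memory limits how well it handles TB-scale files.
Generality: NumPy arrays are less general than Python's list container, because all elements of an array share one data type.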
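
To make these trade-offs concrete, here is a minimal sketch comparing a vectorized NumPy computation with an equivalent pure-Python one. The array size and all variable names are arbitrary choices for illustration, and the sketch assumes NumPy is installed.

```python
import sys
import numpy as np

n = 100_000  # arbitrary size chosen for illustration

# Pure-Python version: a list of ints and a generator expression.
py_values = list(range(n))
py_squares_sum = sum(v * v for v in py_values)

# NumPy version: one vectorized expression over a fixed-dtype array.
np_values = np.arange(n, dtype=np.int64)
np_squares_sum = int(np.sum(np_values * np_values))

# Storage: the array holds raw 8-byte integers, while the list holds
# pointers to separate Python int objects.
array_bytes = np_values.nbytes
list_bytes = sys.getsizeof(py_values) + sum(sys.getsizeof(v) for v in py_values)

print(py_squares_sum == np_squares_sum)
print(array_bytes, list_bytes)

# Generality: a list can mix types freely; an array coerces to one dtype.
mixed_list = [1, "two", 3.0]
mixed_array = np.array([1, 2.5])  # both elements are stored as float64
print(mixed_array.dtype)
```

Both versions compute the same sum; the difference lies in how the data is stored and where the loop runs (in C for NumPy, in the interpreter for the list).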

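SciPy

Advantages:

Optimization: SciPy includes a variety of mathematical optimization algorithms for finding the minimum or maximum of a function.
Signal processing: it provides a set of signal-processing tools for analyzing and processing signal data.
Statistics: it includes a range of statistical functions for describing and analyzing the statistical properties of data.

Disadvantages:

For specialized tasks in a particular domain, a more specialized library or tool may be needed.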
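
As a small illustration of the optimization and statistics features listed for SciPy, the sketch below minimizes a simple quadratic and summarizes a made-up sample. The function f and the sample values are invented for this example, and SciPy is assumed to be installed.

```python
from scipy import optimize, stats

# Optimization: find the minimum of f(x) = (x - 2)^2 + 1.
def f(x):
    return (x - 2.0) ** 2 + 1.0

result = optimize.minimize_scalar(f)
print(result.x, result.fun)  # location of the minimum and the value there

# Statistics: summary statistics for a small made-up sample.
sample = [2.1, 2.5, 1.9, 2.4, 2.2, 2.6]
summary = stats.describe(sample)
print(summary.nobs, summary.mean, summary.variance)
```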

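Pandas

(Note: Pandas is a Python library, not a Java library.)

Advantages:

Data processing: it offers a rich set of data processing and analysis functions, making it convenient to process and analyze large datasets.
Efficient data structures: it is built on NumPy arrays and can handle large datasets efficiently.
Easy to learn and use: the API is simple and easy to understand, so users can get started with data processing and analysis quickly.

Disadvantages:

Relatively slow: compared with some native data-processing libraries, it can be slower on large datasets.
Learning cost: for developers who are not familiar with Python, learning the Pandas syntax and API can take some time.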
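
A short sketch of the kind of data processing Pandas is used for: building a DataFrame, filtering rows, and aggregating by group. The sales records below are made up for illustration, and Pandas is assumed to be installed.

```python
import pandas as pd

# Made-up sales records, used only for illustration.
df = pd.DataFrame(
    {
        "city": ["Beijing", "Shanghai", "Beijing", "Shanghai", "Beijing"],
        "product": ["A", "A", "B", "B", "A"],
        "sales": [100, 150, 80, 120, 60],
    }
)

# Filter rows with a boolean mask, then aggregate per group.
large_orders = df[df["sales"] >= 80]
sales_by_city = df.groupby("city")["sales"].sum()

print(large_orders)
print(sales_by_city)
```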

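Matplotlib

Advantages:

Powerful and flexible: it offers rich functionality and flexible options that suit many data-visualization scenarios.
Cross-platform: it runs on multiple operating systems.
Integration with other libraries: it integrates seamlessly with other Python data-science libraries.

Disadvantages:

Plain default style: some scenarios require custom styling.
Relatively verbose: for beginners, some operations can be fairly complex.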

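Seaborn

Advantages:

Attractive figures: its plotting style leans toward statistical graphics, with pleasing colors and composition.
Concise API: for beginners, the API is simpler and clearer.

Disadvantages:

Customization is relatively limited, although it provides a rich set of preset themes and color palettes.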

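Pyecharts

Advantages:

Intuitive data presentation: a bar chart, for example, encodes differences in the data as bar heights, and viewers are sensitive to height differences.
Interactivity: it provides rich interactive features such as zooming and dragging.

Disadvantages:

Suited only to small and medium datasets: it may not be appropriate for visualizing large datasets.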

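scikit-learn

Advantages:

Easy to use: it provides a simple, consistent API that makes it convenient to train, evaluate, and deploy models.
Broad algorithm support: it includes many supervised learning algorithms to meet the needs of different problems.
Optimized performance: the algorithms in the library are usually highly optimized and deliver good computational performance.

Disadvantages:

Specialized algorithms: some special problems may call for a more specialized library or a deep-learning framework.
Large datasets: with large datasets, performance and memory management can become a challenge.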

That concludes the overview of the strengths and weaknesses of these Python libraries.

The worked examples follow.

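1. Reading text data

Use Python's built-in open() function or a third-party library such as pandas to read a text file:

# Read a text file with open()
with open('text_data.txt', 'r', encoding='utf-8') as file:
    text_content = file.read()

# Read a tab-separated text file with pandas
import pandas as pd
df = pd.read_csv('text_data.csv', delimiter='\t')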
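
The snippet above assumes the input files already exist. As a self-contained sketch that runs anywhere, the example below first writes a small tab-separated file to a temporary directory and then reads it back with open() and with the standard-library csv module. The file name and its contents are hypothetical.

```python
import csv
import os
import tempfile

# Create a small tab-separated file so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "text_data.tsv")  # hypothetical file name
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write("id\ttext\n1\tPython is easy to learn.\n2\tData analysis is fun.\n")

# Read the whole file as one string.
with open(path, "r", encoding="utf-8") as f:
    raw = f.read()

# Read it row by row as dictionaries keyed by the header line.
with open(path, "r", encoding="utf-8", newline="") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))

print(len(raw.splitlines()))
print(rows[0]["text"])
```

Passing an explicit encoding avoids surprises on systems whose default encoding is not UTF-8.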

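2. Text preprocessing

Cleaning the text is the first step of text analysis. It includes removing stop words and punctuation, converting to lowercase, and so on:

import re
import nltk
from nltk.corpus import stopwords

# The stop-word list must be downloaded once before first use
nltk.download('stopwords')

def preprocess_text(text):
    text = text.lower()                  # convert to lowercase
    text = re.sub(r'\W', ' ', text)      # replace non-word characters with spaces
    text = re.sub(r'\s+', ' ', text)     # collapse repeated whitespace
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in text.split() if word not in stop_words]
    return ' '.join(tokens)

preprocessed_text = preprocess_text(text_content)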
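
For readers who want to try the same steps without downloading NLTK data, here is a minimal standard-library sketch. The stop-word set is a tiny hand-picked assumption, far shorter than NLTK's English list, and simple_preprocess is a name invented for this example.

```python
import re

# A tiny hand-picked stop-word set, for illustration only; NLTK's
# English list is much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in"}

def simple_preprocess(text):
    """Lowercase, strip punctuation, collapse whitespace, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)      # punctuation -> space
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return [tok for tok in text.split() if tok not in STOP_WORDS]

tokens = simple_preprocess("The quick, brown fox is in the garden!")
print(tokens)  # ['quick', 'brown', 'fox', 'garden']
```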

3. Word frequency counting

Use nltk's FreqDist or Counter from the standard-library collections module to count word frequencies:

from nltk import FreqDist
from collections import Counter

# Count word frequencies with nltk
freq_dist = FreqDist(preprocessed_text.split())
print(freq_dist.most_common(10))

# Count word frequencies with Counter
word_count = Counter(preprocessed_text.split())
print(word_count.most_common(10))
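
To show what most_common actually returns, here is a small self-contained Counter example on a fixed sentence. The sentence is an arbitrary choice for illustration.

```python
from collections import Counter

tokens = "to be or not to be that is the question".split()
counts = Counter(tokens)

# most_common(n) returns (word, count) pairs, highest count first;
# words with equal counts keep the order in which they first appeared.
top_two = counts.most_common(2)
print(top_two)  # [('to', 2), ('be', 2)]

# Relative frequency of each word.
total = sum(counts.values())
rel_freq = {word: n / total for word, n in counts.items()}
print(rel_freq["to"])  # 0.2
```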

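4. Sentiment analysis

Use nltk or the TextBlob library for sentiment analysis:

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from textblob import TextBlob

# The VADER lexicon must be downloaded once before first use
nltk.download('vader_lexicon')

# Sentiment analysis with nltk (returns neg/neu/pos/compound scores)
sia = SentimentIntensityAnalyzer()
sentiment_nltk = sia.polarity_scores(text_content)
print(sentiment_nltk)

# Sentiment analysis with TextBlob (returns polarity and subjectivity)
blob = TextBlob(text_content)
sentiment_textblob = blob.sentiment
print(sentiment_textblob)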
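
Both libraries above rely on sentiment lexicons. The toy scorer below illustrates the general idea of lexicon-based scoring; it is not the VADER or TextBlob algorithm. The lexicon, the negation list, and the function name toy_sentiment are all assumptions made up for this sketch.

```python
# Hypothetical mini-lexicon; real lexicons contain thousands of entries.
LEXICON = {"good": 1.0, "great": 2.0, "love": 2.0,
           "bad": -1.0, "terrible": -2.0, "hate": -2.0}
NEGATIONS = {"not", "never", "no"}

def toy_sentiment(text):
    """Average the lexicon scores of the words, flipping the sign after a negation."""
    words = text.lower().replace(".", " ").replace(",", " ").split()
    scores = []
    for i, word in enumerate(words):
        if word in LEXICON:
            score = LEXICON[word]
            if i > 0 and words[i - 1] in NEGATIONS:
                score = -score
            scores.append(score)
    return sum(scores) / len(scores) if scores else 0.0

print(toy_sentiment("I love this great library."))  # 2.0
print(toy_sentiment("The docs are not good."))      # -1.0
print(toy_sentiment("It runs on Linux."))           # 0.0
```

Real analyzers add many refinements (intensifiers, punctuation, emoticons), so their scores will differ from this simplified version.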

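5. Text similarity

Use nltk or the gensim library to compute text similarity:

from nltk.metrics import jaccard_distance
from gensim.models import Word2Vec

# Compute Jaccard similarity with nltk
# (split() keeps punctuation attached, so 'text.' is treated as one token)
text1 = "This is a sample text."
text2 = "This is another example text."
set1 = set(text1.split())
set2 = set(text2.split())
similarity_nltk = 1 - jaccard_distance(set1, set2)
print(similarity_nltk)

# Compute Word2Vec similarity with gensim
# (a two-sentence corpus is far too small to learn meaningful vectors;
# it only demonstrates the API)
model = Word2Vec([text1.split(), text2.split()], min_count=1)
similarity_gensim = model.wv.similarity('sample', 'example')
print(similarity_gensim)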
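
The two measures can also be written out by hand, which makes clear what is being computed. The sketch below implements Jaccard similarity on token sets and cosine similarity on word-count vectors using only the standard library; the helper names are invented for this example, and both functions assume non-empty inputs.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep only alphabetic runs, so 'text.' becomes 'text'."""
    return re.findall(r"[a-z]+", text.lower())

def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of the token sets."""
    set_a, set_b = set(a), set(b)
    return len(set_a & set_b) / len(set_a | set_b)

def cosine_similarity(a, b):
    """Cosine of the angle between the two word-count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b)

t1 = tokenize("This is a sample text.")
t2 = tokenize("This is another example text.")
print(jaccard_similarity(t1, t2))  # 3 shared words out of 7 distinct words
print(cosine_similarity(t1, t2))
```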

6. Text classification

Use the scikit-learn library for text classification:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# text_data (a list of document strings) and labels (one class label per
# document) must be defined before running this code

# Convert the text into TF-IDF features with TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(text_data)
y = labels

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Classify the text with Multinomial Naive Bayes
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Predict and evaluate
y_pred = classifier.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
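
Because text_data and labels are not defined in the listing above, the following sketch supplies a tiny made-up corpus so the workflow can be run end to end. It uses a scikit-learn Pipeline, which is one reasonable way to combine the vectorizer and classifier rather than the only one; the documents and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy corpus; a real task needs far more labelled examples.
text_data = [
    "good great excellent",
    "great wonderful good",
    "bad terrible awful",
    "awful bad horrible",
]
labels = ["pos", "pos", "neg", "neg"]

# A pipeline keeps the vectorizer and classifier together, so the same
# transformation is applied at training time and at prediction time.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(text_data, labels)

predictions = model.predict(["good excellent", "terrible horrible"])
print(list(predictions))  # ['pos', 'neg']
```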

7. Topic modeling

Use the gensim library for topic modeling, for example with Latent Dirichlet Allocation (LDA):

from gensim import corpora, models

# Build the tokenized corpus and the dictionary
# (text_data is a list of document strings)
corpus = [text.split() for text in text_data]
dictionary = corpora.Dictionary(corpus)

# Convert each document to a bag-of-words representation
bow_corpus = [dictionary.doc2bow(text) for text in corpus]

# Fit the LDA topic model
lda_model = models.LdaModel(bow_corpus, num_topics=3, id2word=dictionary, passes=10)

# Print the topics
for idx, topic in lda_model.print_topics(-1):
    print(f"Topic {idx + 1}: {topic}")

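8. Text generation

Use a recurrent neural network (RNN) for text generation, for example with tensorflow and keras. The code below builds and trains a next-word prediction model, which is the basis for generating text:

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Convert the text into integer sequences with Tokenizer
# (text_data is a list of text lines)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(text_data)
total_words = len(tokenizer.word_index) + 1

# Build the input sequences: every prefix of each line with at least two tokens
input_sequences = []
for line in text_data:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

# Pad the input sequences to the same length
max_sequence_length = max([len(x) for x in input_sequences])
input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_length, padding='pre')

# Split into inputs (all tokens but the last) and one-hot targets (the last token)
X = input_sequences[:, :-1]
y = tf.keras.utils.to_categorical(input_sequences[:, -1], num_classes=total_words)

# Build the model
model = Sequential()
model.add(Embedding(total_words, 100, input_length=max_sequence_length-1))
model.add(LSTM(100))
model.add(Dense(total_words, activation='softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model (the number of epochs is an illustrative choice)
model.fit(X, y, epochs=50, verbose=1)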
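
The sequence preparation is the least obvious part of this listing, so here is a plain-Python sketch of the same idea on one short sequence. The token ids are hypothetical stand-ins for what a tokenizer would produce, and the left-padding mimics what padding='pre' does.

```python
# Hypothetical token ids for one sentence, e.g. as produced by a tokenizer.
token_list = [4, 7, 2, 9]

# Every prefix with at least two tokens becomes one training sequence.
sequences = [token_list[: i + 1] for i in range(1, len(token_list))]

# Left-pad with zeros to a common length (what padding='pre' does).
max_len = max(len(s) for s in sequences)
padded = [[0] * (max_len - len(s)) + s for s in sequences]

# Inputs are all tokens but the last; the target is the last token.
inputs = [row[:-1] for row in padded]
targets = [row[-1] for row in padded]

print(padded)   # [[0, 0, 4, 7], [0, 4, 7, 2], [4, 7, 2, 9]]
print(inputs)   # [[0, 0, 4], [0, 4, 7], [4, 7, 2]]
print(targets)  # [7, 2, 9]
```

Each row therefore teaches the model to predict the next word from the words that precede it, which is why the input length is one less than the padded sequence length.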

9. Text visualization

Use the wordcloud library to draw a word cloud that shows word frequencies:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Generate the word cloud (word_count is the Counter built in section 3)
wordcloud = WordCloud(width=800, height=400, random_state=21, max_font_size=110).generate_from_frequencies(word_count)

# Draw the word cloud
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

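10. Custom text analysis tasks

Text analysis sometimes calls for more specialized tasks, such as named entity recognition (NER) or keyword extraction. The following simple examples use two popular libraries, spaCy and bert-for-tf2, for these tasks:

1. Named entity recognition (NER) with spaCy

import spacy

# Load spaCy's English model
# (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne."

# Process the text and run named entity recognition
doc = nlp(text)

# Print each recognized entity and its type
for ent in doc.ents:
    print(f"Entity: {ent.text}, Type: {ent.label_}")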

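2. Keyword extraction with `bert-for-tf2`

First, make sure the bert-for-tf2 library is installed:

pip install bert-for-tf2

Then run the following code (it also imports the transformers library for the tokenizer):

import os
from bert import BertModelLayer
from bert.loader import StockBertConfig, map_stock_config_to_params
from transformers import BertTokenizer

# Load the BERT tokenizer and build the BERT layer
bert_model_name = 'bert-base-uncased'
bert_ckpt_dir = 'path/to/bert/ckpt/directory'

bert_tokenizer = BertTokenizer.from_pretrained(bert_model_name)

# Read the checkpoint configuration
# (assumes the checkpoint directory contains bert_config.json)
with open(os.path.join(bert_ckpt_dir, 'bert_config.json'), 'r') as reader:
    bert_config = StockBertConfig.from_json_string(reader.read())
bert_params = map_stock_config_to_params(bert_config)
bert_layer = BertModelLayer.from_params(bert_params, name='bert')

# Sample text
text = "Natural language processing (NLP) is a subfield of artificial intelligence."

# Encode the text with the tokenizer
input_ids = bert_tokenizer.encode(text, add_special_tokens=True)

# Print the tokens. Note: these are all WordPiece tokens of the input
# (including [CLS] and [SEP]), not ranked keywords; the BERT layer built
# above is not used in this step.
keywords = bert_tokenizer.convert_ids_to_tokens(input_ids)
print("Keywords:", keywords)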
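
Tokenizing a sentence does not by itself rank its words by importance. As a simple point of comparison, the sketch below ranks words with a hand-written TF-IDF score using only the standard library. This is a swapped-in, frequency-based technique and not a BERT-based method; the mini corpus and the function names are assumptions made up for this example.

```python
import math
import re
from collections import Counter

# Made-up mini corpus, used only to compute document frequencies.
docs = [
    "Natural language processing is a subfield of artificial intelligence.",
    "Machine learning is a subfield of artificial intelligence.",
    "Pandas is a library for data analysis.",
]

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

tokenized = [tokenize(d) for d in docs]
# Number of documents in which each word appears.
doc_freq = Counter(word for toks in tokenized for word in set(toks))

def tfidf_keywords(tokens, top_n=3):
    """Rank the words of one corpus document by term frequency times inverse document frequency."""
    tf = Counter(tokens)
    n_docs = len(tokenized)
    scores = {w: (c / len(tokens)) * math.log(n_docs / doc_freq[w])
              for w, c in tf.items()}
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]

print(tfidf_keywords(tokenized[0]))  # ['language', 'natural', 'processing']
```

Words that occur in every document, such as "is" and "a", receive a score of zero, which is why they never appear among the keywords.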

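Summary

This article outlined the key steps and tools for text data analysis with Python. It first covered reading text, preprocessing it, and converting it to lowercase. It then looked at core tasks such as word frequency counting, sentiment analysis, text similarity, and text classification, with examples based on nltk, TextBlob, scikit-learn, and gensim.

The article also introduced the more advanced tasks of topic modeling and text generation, demonstrated with gensim and tensorflow. In addition, it showed how to create a word cloud with the wordcloud library to visualize keywords.

Finally, it highlighted the importance of custom text analysis tasks such as NER and keyword extraction, with examples based on spaCy and bert-for-tf2. The code is intended to help readers better understand and apply Python tools for text data analysis.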