Common Python Functions for Data Processing (continuously updated)

Latest recommended article published on 2024-04-22 10:35:14

DecafTea

Categories: python, PyTorch, NLP

Article link: https://blog.csdn.net/DecafTea/article/details/113796443

enumerate(sequence, [start=0])

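Parameters:
sequence – a sequence, an iterator, or any other object that supports iteration.
start – the value at which the index starts counting.
Return value:
An enumerate object that yields (index, item) pairs; it is typically used in a for loop.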

for i, data in enumerate(train_loader):
    inputs, labels = data
    print(inputs.shape)
    print(labels.shape)
    break  # inspect only the first batch
# print output:
# torch.Size([64, 1, 28, 28])
# torch.Size([64])

enumerate is used very widely. One thing to watch out for: when iterating over a list of tuples, the index has to be kept separate from the value. For example:

data = [(1, 3), (2, 1), (3, 3)]

Without enumerate, there are two ways to write the loop header, and both work.

for x, y in data:
 
for (x, y) in data:

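With enumerate, each item becomes a pair (index, tuple), so the tuple must be parenthesized when it is unpacked in the loop header; writing for i, x, y in enumerate(data) raises a ValueError. The loop header therefore has to be written as: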

for i, (x, y) in enumerate(data):
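
The two points above (the start argument and unpacking tuples while enumerating) can be sketched as follows. This is a minimal illustration that reuses the data list from the text; the names rows and error_message are only illustrative.

```python
data = [(1, 3), (2, 1), (3, 3)]

# start=1 makes the index begin at 1 instead of 0
rows = [(i, x, y) for i, (x, y) in enumerate(data, start=1)]
print(rows)  # [(1, 1, 3), (2, 2, 1), (3, 3, 3)]

# Without the inner parentheses, Python tries to unpack each
# 2-item (index, tuple) pair into three names and raises ValueError
error_message = None
try:
    for i, x, y in enumerate(data):
        pass
except ValueError as err:
    error_message = str(err)

print("unpacking failed:", error_message)
```

If the tuple is needed as a whole, for i, pair in enumerate(data) also works, and pair can be unpacked inside the loop body.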

x = x.view(x.size()[0], -1)

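x is a multi-dimensional tensor. view() reshapes it into a 2-D tensor whose number of rows equals the batch size and whose number of columns equals the number of features of each input.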

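print("Test acc:{0}".format(correct.item()/len(test_dataset)))

str.format() uses {} and : in place of the older % formatting.

format() accepts any number of arguments, and the positions do not have to be used in order.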

>>> "{} {}".format("hello", "world")    # no positions given: default order
'hello world'
 
>>> "{0} {1}".format("hello", "world")  # explicit positions
'hello world'
 
>>> "{1} {0} {1}".format("hello", "world")  # explicit positions, reused and reordered
'world hello world'
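
The role of : inside {} is easier to see with a few format specifications. The sketch below is only an illustration; the numbers correct and total are made up and stand in for the accuracy computation shown earlier.

```python
correct, total = 9132, 10000  # made-up numbers for illustration

# :.4f -> fixed-point notation with 4 decimal places
line1 = "Test acc:{0:.4f}".format(correct / total)

# :.2% -> multiply by 100 and append a percent sign
line2 = "Test acc:{:.2%}".format(correct / total)

# Keyword arguments can replace positions; >6 right-aligns in 6 columns
line3 = "{name} scored {score:>6.1f}".format(name="model", score=91.32)

print(line1)  # Test acc:0.9132
print(line2)  # Test acc:91.32%
print(line3)  # model scored   91.3
```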

Counter
data = dict(Counter(data))

reference: https://blog.csdn.net/mouday/article/details/82012731

Counter is a counting container that subclasses dict, so nearly all dictionary operations work on it.

from collections import Counter

# Create a counter from an iterable
counter = Counter("abcabcccaaabbb")
print(counter)
# Counter({'a': 5, 'b': 5, 'c': 4})

# The 2 most common elements
print(counter.most_common(2))
# [('a', 5), ('b', 5)]

# All elements, each repeated as many times as its count
print("".join(counter.elements()))
# aaaaabbbbbcccc

# Like a dict: view the keys
print(counter.keys())
# dict_keys(['a', 'b', 'c'])

# Like a dict: view the values
print(counter.values())
# dict_values([5, 5, 4])
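
A few places where Counter differs from a plain dict are worth a short sketch. The example below is illustrative only; the variable names diff and plain are not from the original.

```python
from collections import Counter

counter = Counter("abcabcccaaabbb")

# A missing key returns 0 instead of raising KeyError
missing = counter["z"]
print(missing)  # 0

# update() adds counts from another iterable
counter.update("ccz")
print(counter["c"], counter["z"])  # 6 1

# Subtraction keeps only the positive counts
diff = Counter("aab") - Counter("ab")
print(diff)  # Counter({'a': 1})

# dict() converts to a plain dictionary, as in data = dict(Counter(data))
plain = dict(Counter(["x", "y", "x"]))
print(plain)  # {'x': 2, 'y': 1}
```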

filename.read()

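data = jieba.cut(fr.read(size))
read() reads at most size units from the file and returns them: characters (as a str) when the file is opened in text mode, bytes when it is opened in binary mode. If size is omitted or negative, everything up to the end of the file is read.
size – the number of characters (text mode) or bytes (binary mode) to read; it defaults to -1, which means the whole file is read.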
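
The difference between text mode and binary mode can be checked with a small temporary file. This is a minimal sketch; the file name sample.txt and its contents are made up for the illustration.

```python
import os
import tempfile

# Write a small UTF-8 file to a temporary directory
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="UTF-8") as fw:
    fw.write("你好world")

# Text mode: size counts characters and read() returns str
with open(path, "r", encoding="UTF-8") as fr:
    first_chars = fr.read(2)
    rest = fr.read()  # no size: read everything that is left

# Binary mode: size counts bytes and read() returns bytes
with open(path, "rb") as fb:
    first_bytes = fb.read(3)

print(first_chars, rest)  # 你好 world
print(first_bytes == "你".encode("UTF-8"))  # True
```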

def stopwordslist(filepath):
    # Read one stop word per line; the with block closes the file afterwards
    with open(filepath, 'r', encoding='UTF-8') as f:
        stopwords = [line.strip() for line in f]
    return stopwords

dict.items()

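The dictionary method items() returns the (key, value) pairs of the dictionary for iteration. In Python 3 it returns a view object (dict_items); in Python 2 it returned a list of tuples.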

#!/usr/bin/python3
# coding=utf-8
 
# Named sites rather than dict, which would shadow the built-in type
sites = {'Google': 'www.google.com', 'Runoob': 'www.runoob.com', 'taobao': 'www.taobao.com'}
 
print("Dictionary items : %s" % sites.items())
 
# Iterate over the (key, value) pairs
for key, value in sites.items():
    print(key, value)

# Dictionary items : dict_items([('Google', 'www.google.com'), ('Runoob', 'www.runoob.com'), ('taobao', 'www.taobao.com')])
# Google www.google.com
# Runoob www.runoob.com
# taobao www.taobao.com

with open('file3', 'w') as fw:  # open the file that will store the word counts
    for k, v in data.items():
        fw.write('%s,%d\n' % (k, v))
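
Because items() returns a view in Python 3, it follows later changes to the dictionary, and it can be passed to sorted() to order the pairs before writing them out. The sketch below is an illustration only; the dictionary counts and its contents are made up.

```python
counts = {"apple": 3, "pear": 1}
items = counts.items()

# The view reflects changes made to the dictionary afterwards
counts["plum"] = 5
view_length = len(items)
print(view_length)  # 3

# sorted() returns a new list of pairs, here ordered by count, largest first
by_count = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(by_count)  # [('plum', 5), ('apple', 3), ('pear', 1)]

# The same '%s,%d' pattern as in the file-writing loop above
lines = ["%s,%d" % (k, v) for k, v in by_count]
print(lines)  # ['plum,5', 'apple,3', 'pear,1']
```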

variablename.item()

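Converts a single-element value such as a one-element tensor into a plain Python number, which can then be used in ordinary arithmetic.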

print("Test acc:{0}".format(correct.item()/len(test_dataset)))

str.strip([chars])

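chars – the set of characters to remove from both ends of the string.
Returns a new string with those characters removed from the beginning and the end.

strip() removes the given characters from both ends of a string; when chars is omitted, it removes whitespace (spaces, tabs, newlines).

Note: this method only removes characters at the beginning or the end; it cannot remove characters in the middle.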

#!/usr/bin/python3
# -*- coding: UTF-8 -*-
 
s = "00000003210Runoob01230000000"
print(s.strip('0'))  # strip the character 0 from both ends
 
 
s2 = "   Runoob      "
print(s2.strip())  # strip leading and trailing spaces

# output:
# 3210Runoob0123
# Runoob
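
strip() has one-sided variants, lstrip() and rstrip(), and its chars argument is treated as a set of characters rather than as a prefix or suffix. A short sketch, using a shortened variant of the string above:

```python
s = "003210Runoob012300"

left = s.lstrip("0")   # strip only the beginning
right = s.rstrip("0")  # strip only the end
print(left)   # 3210Runoob012300
print(right)  # 003210Runoob0123

# chars is a set: any mix of 'x' and 'y' is removed from both ends
mixed = "xyxhelloyx".strip("xy")
print(mixed)  # hello

# With no argument, tabs and newlines are stripped as well as spaces
clean = "\t Runoob \n".strip()
print(repr(clean))  # 'Runoob'
```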

Two ways to read and write files

inputs = open('file1.txt', 'r', encoding='UTF-8')  # path of the file to process
outputs = open('file2', 'w')  # path of the file that receives the processed text
for line in inputs:
    line_seg = seg_sentence(line)  # the return value here is a string
    outputs.write(line_seg)
outputs.close()
inputs.close()

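# WordCount
with open('file2', 'r') as fr:  # read the file from which stop words have already been removed
    # Why segment again if the file already holds segmented text? fr.read() returns
    # the whole file as one string, so it must be split into words once more.
    data = jieba.cut(fr.read())
data = dict(Counter(data))  # without re-segmenting, Counter counts single characters, e.g. {'你': 1, ' ': 3, '今': 1, '天': 1, '吃': 1}

with open('file3', 'w') as fw:  # open the file that will store the word counts
    for k, v in data.items():
        fw.write('%s,%d\n' % (k, v))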
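
The first style above can also be written with a single with statement that manages both files, which avoids the explicit close() calls. The sketch below is only an illustration under stated assumptions: seg_sentence here is a hypothetical stand-in that merely normalizes whitespace (the original helper is not shown), str.split() replaces jieba.cut, and the files live in a temporary directory.

```python
import os
import tempfile
from collections import Counter


def seg_sentence(line):
    # Hypothetical stand-in for the original helper: only normalizes whitespace
    return " ".join(line.split()) + "\n"


workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "file1.txt")
dst = os.path.join(workdir, "file2")

with open(src, "w", encoding="UTF-8") as f:
    f.write("a  b   a\nb a\n")

# Both files are closed automatically, even if an exception is raised
with open(src, "r", encoding="UTF-8") as inputs, \
        open(dst, "w", encoding="UTF-8") as outputs:
    for line in inputs:
        outputs.write(seg_sentence(line))

with open(dst, "r", encoding="UTF-8") as fr:
    text = fr.read()

# Counter over the raw string counts characters; over split() it counts words
char_counts = Counter(text)
word_counts = Counter(text.split())
print(word_counts)  # Counter({'a': 3, 'b': 2})
```

The last two lines show why the text has to be split into words again before counting: applied to the raw string, Counter also counts spaces and newlines as items.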

random functions

random.randint(start, end): both included
random.randrange(start, end): start included, end excluded
np.random.randint(low, high): low included, high excluded
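
The boundary behaviour can be checked empirically with the standard-library random module. This is a minimal sketch; the seed and the number of draws are arbitrary choices made for the illustration.

```python
import random

random.seed(0)  # arbitrary seed, only to make the run repeatable

# randint(1, 3) can return 1, 2, or 3: both ends are included
randint_values = sorted({random.randint(1, 3) for _ in range(1000)})

# randrange(1, 3) can return 1 or 2: the end is excluded, like range()
randrange_values = sorted({random.randrange(1, 3) for _ in range(1000)})

print(randint_values)    # [1, 2, 3]
print(randrange_values)  # [1, 2]
```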

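String processing

strip() removes the given characters from both ends of a string; when chars is omitted, it removes whitespace (spaces, tabs, newlines).
str.strip([chars]) chars – the set of characters to remove from both ends of the string.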

import re

line = '    the quick brown fox jumped over a lazy dog    ' \
       '   a    '
# re.compile(r'\s+') matches runs of any whitespace (space, tab, newline, etc.);
# split() cuts the string at those runs and returns the substrings between them
# line.strip() removes the leading and trailing whitespace first
result = re.compile(r'\s+').split(line)
result1 = re.compile(r'\s+').split(line.strip())

print(result) # ['', 'the', 'quick', 'brown', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog', 'a', '']
print(result1) # ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog', 'a']

Using Counter to count distinct words and word frequency

from collections import Counter

f = Counter()
for word in result1:
    f[word] += 1
print(len(f), sum(f.values()))  # number of distinct words, total number of words
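
The loop can be shortened by passing the word list to Counter directly, and relative frequencies follow from dividing each count by the total. A minimal sketch, with the word list copied from the result1 output above; the names vocab_size, total, and freq are illustrative.

```python
from collections import Counter

words = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog', 'a']

# Counter accepts any iterable and counts its items in one step
f = Counter(words)
vocab_size = len(f)
total = sum(f.values())
print(vocab_size, total)  # 9 10

# The most frequent word and its count
top = f.most_common(1)
print(top)  # [('a', 2)]

# Relative frequency of each word
freq = {word: count / total for word, count in f.items()}
print(freq['a'])  # 0.2
```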

Python Counter() counting tool: https://www.cnblogs.com/nisen/p/6052895.html

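"str".join(sequence)
join() concatenates the elements of a sequence into a new string, placing the given separator string between them. sequence – the sequence of string elements to join.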

sep = "-"
seq = ("a", "b", "c")  # a sequence of strings
print(sep.join(seq))

# output: a-b-c
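
join() only accepts strings, so other types have to be converted first. The sketch below illustrates this with a made-up list nums; split() is shown as the reverse operation.

```python
nums = [1, 2, 3]

# Joining non-string elements raises TypeError
error_message = None
try:
    "-".join(nums)
except TypeError as err:
    error_message = str(err)
print("join failed:", error_message)

# Convert each element to str first
joined = "-".join(str(n) for n in nums)
print(joined)  # 1-2-3

# split() with the same separator reverses the join
parts = joined.split("-")
print(parts)  # ['1', '2', '3']
```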

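zip(iterable1, iterable2, ...)

zip() takes iterables as arguments, packs their corresponding elements into tuples, and returns an iterator over those tuples. Because the tuples are produced lazily, this saves a good deal of memory compared with building a list up front.

list() can be used to convert the result into a list.

If the iterables have different lengths, the result is as long as the shortest one.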

>>> a = [1,2,3]
>>> b = [4,5,6]
>>> c = [4,5,6,7,8]
>>> zipped = zip(a,b)     # returns a zip object
>>> zipped
<zip object at 0x103abc288>
>>> list(zipped)  # convert to a list with list()
[(1, 4), (2, 5), (3, 6)]
>>> list(zip(a,c))              # length matches the shortest input
[(1, 4), (2, 5), (3, 6)]
 
>>> a1, a2 = zip(*zip(a,b))          # the inverse of zip: zip(*...) can be read as unzipping, returning the original columns as tuples
>>> list(a1)
[1, 2, 3]
>>> list(a2)
[4, 5, 6]
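
Three related points are sketched below: a zip object is consumed after one pass, itertools.zip_longest pads the shorter input instead of truncating, and dict(zip(...)) builds a mapping from two parallel lists. The key names and the fill value 0 are arbitrary choices for the illustration.

```python
from itertools import zip_longest

a = [1, 2, 3]
c = [4, 5, 6, 7, 8]

# A zip object is an iterator: it is exhausted after one pass
zipped = zip(a, c)
first = list(zipped)
second = list(zipped)
print(first)   # [(1, 4), (2, 5), (3, 6)]
print(second)  # []

# zip_longest pads the shorter input instead of truncating
padded = list(zip_longest(a, c, fillvalue=0))
print(padded)  # [(1, 4), (2, 5), (3, 6), (0, 7), (0, 8)]

# Pair keys with values to build a dictionary
mapping = dict(zip(["x", "y", "z"], a))
print(mapping)  # {'x': 1, 'y': 2, 'z': 3}
```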