python导入data文件_使用Python导入包含文本和数字数据的文件

1586010002-jmsa.png

I have a .txt file which has text data and numerical data. The first two rows of the file have essential information in text data form, while the first column (I am referring to the zeroth column as the first column) also has essential data in text form. At all other locations in the file, the data is in numerical form. I wish to analyze the numerical data present in the file using libraries in python ,preferably numpy or pandas, or a combination of both (analysis like regression, correlation, scikit-learn etc). I reiterate that all of the data in the file is essential for my analysis. The following snapshot (taken from Excel) shows a truncated version of the format in which my data is in:UmDgT.png

The data shown in this snapshot can be found here.

In particular, what I want is to be able to import all the numerical data from this file using python (numpy or pandas), and be able to refer to specific rows in this data using the text data in the first two rows (Type, Tag) and the first column (object number). In my actual data file, I have hundreds of thousands of rows (object types) and scores of columns.

I have already tried using numpy.loadtxt(...) and pandas.read_csv(...) to open this file, but I have either run into errors, or have loaded data in clumsy formats. I will be really thankful to have some direction as to how I can import the file in python in a way so that I have the functionality that I desire.

解决方案

If I were you, I would use pandas, and import it using something like this:

df = pd.read_csv('dum.txt',sep='\t', header=[0,1], index_col=0)

This gives you the dataframe:

>>> df

Type T1 T2 T3 T4 T5

Tag Good Good Good Good Good

object1 1.1 2.1 3.1 4.1 5.1

object2 1.2 2.2 3.2 4.2 5.2

object3 1.3 2.3 3.3 4.3 5.3

object4 1.4 2.4 3.4 4.4 5.4

object5 1.5 2.5 3.5 4.5 5.5

object6 1.6 2.6 3.6 4.6 5.6

object7 1.7 2.7 3.7 4.7 5.7

object8 1.8 2.8 3.8 4.8 5.8

And all of your columns are floats:

>>> df.dtypes

Type Tag

T1 Good float64

T2 Good float64

T3 Good float64

T4 Good float64

T5 Good float64

dtype: object

It contains a multi-indexed column header:

>>> df.columns

MultiIndex(levels=[['T1', 'T2', 'T3', 'T4', 'T5'], ['Good']],

labels=[[0, 1, 2, 3, 4], [0, 0, 0, 0, 0]],

names=['Type', 'Tag'])

And a regular index containing the information from Type:

>>> df.index

Index(['object1', 'object2', 'object3', 'object4', 'object5', 'object6',

'object7', 'object8'],

dtype='object')

Furthermore, you can convert your values to a numpy array of floats simply by using:

>>> df.values

array([[1.1, 2.1, 3.1, 4.1, 5.1],

[1.2, 2.2, 3.2, 4.2, 5.2],

[1.3, 2.3, 3.3, 4.3, 5.3],

[1.4, 2.4, 3.4, 4.4, 5.4],

[1.5, 2.5, 3.5, 4.5, 5.5],

[1.6, 2.6, 3.6, 4.6, 5.6],

[1.7, 2.7, 3.7, 4.7, 5.7],

[1.8, 2.8, 3.8, 4.8, 5.8]])

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值