I have a .txt file which has text data and numerical data. The first two rows of the file have essential information in text data form, while the first column (I am referring to the zeroth column as the first column) also has essential data in text form. At all other locations in the file, the data is in numerical form. I wish to analyze the numerical data present in the file using libraries in python ,preferably numpy or pandas, or a combination of both (analysis like regression, correlation, scikit-learn etc). I reiterate that all of the data in the file is essential for my analysis. The following snapshot (taken from Excel) shows a truncated version of the format in which my data is in:
The data shown in this snapshot can be found here.
In particular, what I want is to be able to import all the numerical data from this file using python (numpy or pandas), and be able to refer to specific rows in this data using the text data in the first two rows (Type, Tag) and the first column (object number). In my actual data file, I have hundreds of thousands of rows (object types) and scores of columns.
I have already tried using numpy.loadtxt(...) and pandas.read_csv(...) to open this file, but I have either run into errors, or have loaded data in clumsy formats. I will be really thankful to have some direction as to how I can import the file in python in a way so that I have the functionality that I desire.
解决方案
If I were you, I would use pandas, and import it using something like this:
df = pd.read_csv('dum.txt',sep='\t', header=[0,1], index_col=0)
This gives you the dataframe:
>>> df
Type T1 T2 T3 T4 T5
Tag Good Good Good Good Good
object1 1.1 2.1 3.1 4.1 5.1
object2 1.2 2.2 3.2 4.2 5.2
object3 1.3 2.3 3.3 4.3 5.3
object4 1.4 2.4 3.4 4.4 5.4
object5 1.5 2.5 3.5 4.5 5.5
object6 1.6 2.6 3.6 4.6 5.6
object7 1.7 2.7 3.7 4.7 5.7
object8 1.8 2.8 3.8 4.8 5.8
And all of your columns are floats:
>>> df.dtypes
Type Tag
T1 Good float64
T2 Good float64
T3 Good float64
T4 Good float64
T5 Good float64
dtype: object
It contains a multi-indexed column header:
>>> df.columns
MultiIndex(levels=[['T1', 'T2', 'T3', 'T4', 'T5'], ['Good']],
labels=[[0, 1, 2, 3, 4], [0, 0, 0, 0, 0]],
names=['Type', 'Tag'])
And a regular index containing the information from Type:
>>> df.index
Index(['object1', 'object2', 'object3', 'object4', 'object5', 'object6',
'object7', 'object8'],
dtype='object')
Furthermore, you can convert your values to a numpy array of floats simply by using:
>>> df.values
array([[1.1, 2.1, 3.1, 4.1, 5.1],
[1.2, 2.2, 3.2, 4.2, 5.2],
[1.3, 2.3, 3.3, 4.3, 5.3],
[1.4, 2.4, 3.4, 4.4, 5.4],
[1.5, 2.5, 3.5, 4.5, 5.5],
[1.6, 2.6, 3.6, 4.6, 5.6],
[1.7, 2.7, 3.7, 4.7, 5.7],
[1.8, 2.8, 3.8, 4.8, 5.8]])