Importing data in Python: Introduction and flat files

1. Introduction and flat files


1.1 Importing entire text files

# Open a file: file
file = open('moby_dick.txt', mode = 'r')

# Print it
print(file.read())

# Check whether file is closed using the closed attribute
print(file.closed)

# Close file
file.close()
# Check whether file is closed
print(file.closed)

CHAPTER 1. Loomings.

Call me Ishmael. Some years ago–never mind how long precisely–having
little or no money in my purse, and nothing particular to interest me on
shore, I thought I would sail about a little and see the watery part of
the world. It is a way I have of driving off the spleen and regulating
the circulation. Whenever I find myself growing grim about the mouth;
whenever it is a damp, drizzly November in my soul; whenever I find
myself involuntarily pausing before coffin warehouses, and bringing up
the rear of every funeral I meet; and especially whenever my hypos get
such an upper hand of me, that it requires a strong moral principle to
prevent me from deliberately stepping into the street, and methodically
knocking people’s hats off–then, I account it high time to get to sea
as soon as I can. This is my substitute for pistol and ball. With a
philosophical flourish Cato throws himself upon his sword; I quietly
take to the ship. There is nothing surprising in this. If they but knew
it, almost all men in their degree, some time or other, cherish very
nearly the same feelings towards the ocean with me.
False
True
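
The open/close pair above can also be written with a with statement, which closes the file automatically when the block ends, even if an error occurs inside it. A minimal sketch, assuming a small throwaway file (sample.txt is a hypothetical name, created here so the example runs on its own):

```python
import os
import tempfile

# Create a small throwaway text file so the example is self-contained.
# 'sample.txt' is a hypothetical stand-in for moby_dick.txt.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, mode='w', encoding='utf-8') as f:
    f.write('CHAPTER 1. Loomings.\n\nCall me Ishmael.\n')

# The with statement closes the file when the block ends.
with open(path, mode='r', encoding='utf-8') as file:
    text = file.read()
    closed_inside = file.closed

closed_after = file.closed

print(closed_inside)  # False
print(closed_after)   # True
```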

1.2 Importing text files line by line

When a file object named file is open, file.readline() returns its first line, which you can print. Each further call to file.readline() returns the next line, and so on.

# Read & print the first 4 lines
with open('moby_dick.txt') as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())
    print(file.readline())

CHAPTER 1. Loomings.

Call me Ishmael. Some years ago–never mind how long precisely–having
little or no money in my purse, and nothing particular to interest me on
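
A few details of readline() are easier to see on a tiny file: each returned line keeps its trailing newline (so print() adds a second one), iterating over the file continues from the current position, and readline() returns an empty string at the end of the file. A minimal sketch, assuming a hypothetical three-line file created on the fly:

```python
import os
import tempfile

# Hypothetical three-line file, created so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'sample_lines.txt')
with open(path, mode='w', encoding='utf-8') as f:
    f.write('first line\nsecond line\nthird line\n')

with open(path, encoding='utf-8') as file:
    first = file.readline()    # the trailing '\n' is kept
    second = file.readline()   # the next call returns the next line
    # Iterating over the file picks up where readline() stopped.
    rest = [line.rstrip('\n') for line in file]
    at_eof = file.readline()   # empty string once the file is exhausted

print(repr(first))   # 'first line\n'
print(rest)          # ['third line']
print(repr(at_eof))  # ''
```

To avoid the doubled blank lines when printing, print(line, end='') or line.rstrip('\n') both work.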

1.3 Using NumPy to import flat files

# Import package
import numpy as np
import matplotlib.pyplot as plt

# Assign filename to variable: file
file = 'digits.csv'

# Load file as array: digits
digits = np.loadtxt(file, delimiter=',')

# Print datatype of digits
print(type(digits))

# Select and reshape a row
im = digits[21, 1:]
im_sq = np.reshape(im, (28, 28))

# Plot reshaped data (matplotlib.pyplot already loaded as plt)
plt.imshow(im_sq, cmap='Greys', interpolation='nearest')
plt.show()

np.loadtxt() takes several arguments that you'll find useful: delimiter sets the delimiter that loadtxt() expects (for example, ',' for comma-delimited and '\t' for tab-delimited files); skiprows specifies how many rows (a count, not row indices) to skip from the top of the file; usecols takes a list of the indices of the columns you wish to keep.

# Import numpy
import numpy as np

# Assign the filename: file
file = 'digits_header.txt'

# Load the data: skip the first row and import only the first and third columns
data = np.loadtxt(file, delimiter='\t', skiprows=1, usecols=[0, 2])

# Print data
print(data)
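
To see what delimiter, skiprows and usecols do without needing digits_header.txt, here is a minimal sketch on in-memory text. The column names and values are made up for illustration:

```python
import io
import numpy as np

# Hypothetical tab-delimited data with a one-line header.
text = "a\tb\tc\n1\t2\t3\n4\t5\t6\n7\t8\t9\n"

# Skip the header row and keep only columns 0 and 2.
data = np.loadtxt(io.StringIO(text), delimiter='\t', skiprows=1, usecols=[0, 2])

print(data)
print(data.shape)  # (3, 2)
```

The values come back as floats, which is the default dtype of np.loadtxt().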

import numpy as np
import matplotlib.pyplot as plt

# Assign filename: file
file = 'seaslug.txt'

# Import file: data
data = np.loadtxt(file, delimiter='\t', dtype=str)

# Print the first element of data
print(data[0])

# Import data as floats and skip the first row: data_float
data_float = np.loadtxt(file, delimiter='\t', dtype=float, skiprows=1)

# Print the 10th element of data_float
print(data_float[9])

# Plot a scatterplot of the data
plt.scatter(data_float[:, 0], data_float[:, 1])
plt.xlabel('time (min.)')
plt.ylabel('percentage of larvae')
plt.show()

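Many datasets have different datatypes in different columns, for example strings in one column and floats in another, and np.loadtxt() does not cope well with these. There is another function, np.genfromtxt(), which can handle such structures. If we pass dtype=None to it, it will figure out what type each column should be; names=True tells it that the first row is a header.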

data = np.genfromtxt('titanic.csv', delimiter=',', names=True, dtype=None)
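
The result is a one-dimensional structured array whose columns are accessed by field name. A minimal sketch on a made-up four-column sample (not the real titanic.csv); encoding='utf-8' is passed explicitly here as an assumption so that string columns come back as str rather than bytes:

```python
import io
import numpy as np

# Hypothetical sample with mixed datatypes (ints, strings, floats).
csv_text = (
    "PassengerId,Survived,Sex,Age\n"
    "1,0,male,22.0\n"
    "2,1,female,38.0\n"
    "3,1,female,26.0\n"
)

data = np.genfromtxt(io.StringIO(csv_text), delimiter=',', names=True,
                     dtype=None, encoding='utf-8')

print(data.dtype.names)  # ('PassengerId', 'Survived', 'Sex', 'Age')
print(data['Age'])       # [22. 38. 26.]
print(data[0])           # the first row, as a single record
```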

You have just used np.genfromtxt() to import data containing mixed datatypes. There is also another function, np.recfromcsv(), that behaves similarly to np.genfromtxt(), except that its defaults are delimiter=',', names=True and dtype=None.

# Assign the filename: file
file = 'titanic.csv'

# Import file using np.recfromcsv: d
d = np.recfromcsv(file, delimiter = ',', names=True)

# Print out first three entries of d
print(d[:3])

[(1, 0, 3, b'male', 22., 1, 0, b'A/5 21171', 7.25 , b'', b'S')
(2, 1, 1, b'female', 38., 1, 0, b'PC 17599', 71.2833, b'C85', b'C')
(3, 1, 3, b'female', 26., 0, 0, b'STON/O2. 3101282', 7.925 , b'', b'S')]
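
As far as I am aware, NumPy 2.0 and later no longer expose np.recfromcsv() in the main namespace, and np.genfromtxt() with a comma delimiter is the suggested replacement. A minimal sketch of a roughly equivalent call, assuming a made-up sample instead of titanic.csv: case_sensitive='lower' mimics the lowercased field names of recfromcsv(), and .view(np.recarray) gives attribute-style access to columns.

```python
import io
import numpy as np

# Hypothetical sample; the real titanic.csv has more columns.
csv_text = (
    "PassengerId,Sex,Age\n"
    "1,male,22.0\n"
    "2,female,38.0\n"
)

d = np.genfromtxt(io.StringIO(csv_text), delimiter=',', names=True,
                  dtype=None, encoding='utf-8',
                  case_sensitive='lower').view(np.recarray)

print(d.dtype.names)  # ('passengerid', 'sex', 'age')
print(d.age)          # [22. 38.]
```

With an explicit text encoding the string fields are ordinary str values, so the b'...' prefixes seen in the output above do not appear.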

1.4 Importing flat files using pandas

# Import pandas as pd
import pandas as pd

# Assign the filename: file
file = 'titanic.csv'

# Read the file into a DataFrame: df
df = pd.read_csv(file)

# View the head of the DataFrame
print(df.head())

# Import numpy (needed for np.array below)
import numpy as np

# Assign the filename: file
file = 'digits.csv'

# Read the first 5 rows of the file into a DataFrame: data
data = pd.read_csv(file, header = None, nrows = 5)

# Build a numpy array from the DataFrame: data_array
data_array = np.array(data)

# Print the datatype of data_array to the shell
print(type(data_array))
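
Here header=None tells pandas that the file has no header row, so the columns are simply numbered, and nrows limits how many rows are read. A minimal sketch on made-up in-memory data; it uses DataFrame.to_numpy() as an alternative to np.array(data):

```python
import io
import pandas as pd

# Hypothetical headerless, comma-delimited data (three rows, four columns).
text = "1,0,0,0\n0,5,9,0\n4,0,0,7\n"

# Read only the first 2 rows; columns are labelled 0, 1, 2, ...
data = pd.read_csv(io.StringIO(text), header=None, nrows=2)

# Convert the DataFrame to a NumPy array.
data_array = data.to_numpy()

print(data.columns.tolist())  # [0, 1, 2, 3]
print(data_array.shape)       # (2, 4)
```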

# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt

# Assign filename: file
file = 'titanic_corrupt.txt'

# Import file: data
data = pd.read_csv(file, sep='\t', comment='#', na_values='Nothing')
# na_values takes a string or a list of strings to recognize as NA/NaN, in this case the string 'Nothing'.
# comment='#' makes pandas ignore everything after a '#' character on a line.

# Print the head of the DataFrame
print(data.head())

# Plot 'Age' variable in a histogram
data[['Age']].hist()
plt.xlabel('Age (years)')
plt.ylabel('count')
plt.show()
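
The effect of sep, comment and na_values is easy to check on a small in-memory file. A minimal sketch, assuming made-up rows that imitate a "corrupt" file (the names and values are for illustration only):

```python
import io
import pandas as pd

# Hypothetical tab-delimited file with comment lines and a placeholder
# string ('Nothing') where a value is missing.
text = (
    "# passenger extract\n"
    "Name\tAge\tFare\n"
    "Braund\t22\t7.25\n"
    "# the next row has a missing age\n"
    "Cumings\tNothing\t71.28\n"
    "Heikkinen\t26\t7.92\n"
)

df = pd.read_csv(io.StringIO(text), sep='\t', comment='#', na_values=['Nothing'])

print(df)
print(df['Age'].isna().sum())  # 1
```

Because 'Nothing' is read as NaN, the Age column is numeric and can be plotted or summarized directly.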
