Customizing your pandas import:
# Import pandas and matplotlib.pyplot
import pandas as pd
import matplotlib.pyplot as plt
# Assign filename: file
file = 'titanic_corrupt.txt'
# Import file: data
data = pd.read_csv(file, sep='\t', comment='#', na_values='Nothing')
# Print the head of the DataFrame
print(data.head())
# Plot the 'Age' column as a histogram
data[['Age']].hist()
plt.xlabel('Age (years)')
plt.ylabel('count')
plt.show()
Sometimes the strings that pandas treats as missing values by default are not enough. By setting na_values we can tell pandas to treat additional values as NaN: the call above replaces every occurrence of 'Nothing' in the file with NaN.
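As a minimal sketch (using an in-memory string instead of a file on disk, with made-up rows), na_values can also take a list, so several sentinel strings map to NaN at once:

```python
import io

import pandas as pd

# A tiny tab-delimited "file" containing two sentinel strings for missing data
raw = "Name\tAge\nAlice\t29\nBob\tNothing\nCara\tmissing\n"

# Treat both 'Nothing' and 'missing' as NaN while parsing
data = pd.read_csv(io.StringIO(raw), sep='\t', na_values=['Nothing', 'missing'])

print(data['Age'].isna().sum())  # two missing values
```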
sep is the pandas name for the delimiter argument; sep='\t' tells read_csv that the file is tab-delimited.
data.head()  # returns the first 5 rows by default; pass an integer for a different number of rows.
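For instance, on a small illustrative DataFrame (not the Titanic data):

```python
import pandas as pd

df = pd.DataFrame({'x': range(10)})

print(len(df.head()))   # 5 rows by default
print(len(df.head(3)))  # pass an integer for a different number
```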
Introduction to other file types:
pickle provides simple persistence: it stores objects on disk as files. Almost every Python data type (lists, dictionaries, sets, class instances, etc.) can be serialized with pickle, although the serialized output is not human-readable.
If you merely want to be able to import them into Python, you can serialize them. All this means is converting the object into a sequence of bytes, or a bytestream.
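A minimal round-trip sketch with the standard-library pickle module (the filename 'data.pkl' and the dictionary contents are made up for illustration):

```python
import os
import pickle

# Almost any Python object can be pickled, e.g. a dictionary
d = {'June': '69.4', 'Aug': '85', 'Airline': '8', 'Mar': '84.4'}

# Serialize to disk ('wb' = write bytes)
with open('data.pkl', 'wb') as f:
    pickle.dump(d, f)

# Deserialize ('rb' = read bytes)
with open('data.pkl', 'rb') as f:
    restored = pickle.load(f)

print(restored == d)  # the round trip preserves the object
os.remove('data.pkl')  # clean up the temporary file
```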
Customizing your spreadsheet import:
# Load the workbook as an ExcelFile object: xl ('file' is the path to an .xlsx workbook)
xl = pd.ExcelFile(file)
# Parse the first sheet and rename the columns: df1
df1 = xl.parse(0, skiprows=[0], names=['Country', 'AAM due to War (2002)'])
# Print the head of the DataFrame df1
print(df1.head())
# Parse the first column of the second sheet and rename the column: df2
# (usecols replaced the deprecated parse_cols argument in newer pandas)
df2 = xl.parse(1, usecols=[0], skiprows=[0], names=['Country'])
# Print the head of the DataFrame df2
print(df2.head())
The Hello World of SQL Queries!
# Import packages
from sqlalchemy import create_engine
import pandas as pd
# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')
# Open engine connection: con
con = engine.connect()
# Perform query: rs (SQLAlchemy >= 2.0 requires wrapping the SQL string in sqlalchemy.text())
rs = con.execute("SELECT * FROM Album")
# Save results of the query to DataFrame: df
df = pd.DataFrame(rs.fetchall())
# Close connection
con.close()
# Print head of DataFrame df
print(df.head())
Execute the query that selects ALL columns from the Album table.
Store all of your query results in the DataFrame df by applying the fetchall() method to the results rs.
* To keep the DataFrame's column headers consistent with the database table, use df.columns = rs.keys().
* With this approach, remember to close the connection after every database interaction: con.close().
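The same fetch-then-build-DataFrame pattern can be sketched with the standard-library sqlite3 module and an in-memory database, so it runs without Chinook.sqlite (the Album rows here are made up; sqlite3 cursors expose column names via cursor.description rather than rs.keys()):

```python
import sqlite3

import pandas as pd

# Build a throwaway in-memory database standing in for Chinook.sqlite
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE Album (AlbumId INTEGER, Title TEXT, ArtistId INTEGER)")
con.executemany("INSERT INTO Album VALUES (?, ?, ?)",
                [(1, 'For Those About To Rock', 1), (2, 'Balls to the Wall', 2)])

# Perform the query and save the results to a DataFrame
rs = con.execute("SELECT * FROM Album")
df = pd.DataFrame(rs.fetchall())
# Keep the column headers consistent with the database table
df.columns = [col[0] for col in rs.description]

# Close the connection when done
con.close()
print(df.head())
```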
with engine.connect() as con:
    rs = con.execute("SELECT * FROM Employee WHERE EmployeeId >= 6")
    df = pd.DataFrame(rs.fetchall())
    df.columns = rs.keys()
Filtering rows: execute the query that selects all records from the Employee table where EmployeeId is greater than or equal to 6.
with engine.connect() as con:
    rs = con.execute("SELECT * FROM Employee ORDER BY BirthDate")
    df = pd.DataFrame(rs.fetchall())
Sorting by a column: inside the context manager, execute the query that selects all records from the Employee table and orders them in increasing order of the BirthDate column.
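Both the WHERE filter and the ORDER BY sort can be sketched against a tiny in-memory table (made-up Employee rows, standard-library sqlite3 in place of SQLAlchemy):

```python
import sqlite3

import pandas as pd

# A stand-in for the Chinook Employee table
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE Employee (EmployeeId INTEGER, LastName TEXT, BirthDate TEXT)")
con.executemany("INSERT INTO Employee VALUES (?, ?, ?)",
                [(5, 'Johnson', '1965-03-03'),
                 (6, 'Mitchell', '1973-07-01'),
                 (7, 'King', '1970-05-29')])

# WHERE filters the rows; ORDER BY sorts what remains
df = pd.read_sql_query(
    "SELECT * FROM Employee WHERE EmployeeId >= 6 ORDER BY BirthDate", con)
con.close()

print(df)  # King (1970) sorts before Mitchell (1973); Johnson is filtered out
```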
Pandas and The Hello World of SQL Queries!
We can collapse the four lines of query code against the database into a single pd.read_sql_query call:
# Import packages
from sqlalchemy import create_engine
import pandas as pd
# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')
# Execute query and store records in DataFrame: df
df = pd.read_sql_query("SELECT * FROM Album", engine)
# Print head of DataFrame
print(df.head())
# Open engine in context manager and store query result in df1
with engine.connect() as con:
    rs = con.execute("SELECT * FROM Album")
    df1 = pd.DataFrame(rs.fetchall())
    df1.columns = rs.keys()
# Confirm that both methods yield the same result
print(df.equals(df1))
We can even add more conditions within that single line:
df = pd.read_sql_query("SELECT * FROM Employee WHERE EmployeeId >= 6 ORDER BY BirthDate", engine)
The query can also merge data from two tables. Here we combine the Title column of the Album table with the Name column of the Artist table, matching rows on the shared ArtistId column:
with engine.connect() as con:
    rs = con.execute("SELECT Title, Name FROM Album INNER JOIN Artist ON Album.ArtistId = Artist.ArtistId")
    df = pd.DataFrame(rs.fetchall())
    df.columns = rs.keys()
A deluxe, upgraded one-line version (with a filter added as well):
df = pd.read_sql_query("SELECT * FROM PlaylistTrack INNER JOIN Track ON PlaylistTrack.TrackId = Track.TrackId WHERE Milliseconds < 250000", engine)
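The join logic can be sketched the same way with two small made-up tables (standard-library sqlite3 in place of SQLAlchemy, so no Chinook.sqlite is needed):

```python
import sqlite3

import pandas as pd

# Stand-ins for the Chinook Artist and Album tables
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE Artist (ArtistId INTEGER, Name TEXT)")
con.execute("CREATE TABLE Album (AlbumId INTEGER, Title TEXT, ArtistId INTEGER)")
con.executemany("INSERT INTO Artist VALUES (?, ?)", [(1, 'AC/DC'), (2, 'Accept')])
con.executemany("INSERT INTO Album VALUES (?, ?, ?)",
                [(1, 'For Those About To Rock', 1), (2, 'Balls to the Wall', 2)])

# INNER JOIN keeps only rows whose ArtistId appears in both tables
df = pd.read_sql_query(
    "SELECT Title, Name FROM Album "
    "INNER JOIN Artist ON Album.ArtistId = Artist.ArtistId", con)
con.close()

print(df)  # each album title paired with its artist's name
```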