[Python] Data Analysis Section 2.1: Series Data | from Coursera "Applied Data Science with Python"

1. Pandas Introduction

This week we're going to deepen our investigation into how Python can be used to manipulate, clean, and query data by looking at the Pandas data toolkit. Pandas was created by Wes McKinney in 2008, and is an open source project under a very permissive license. As an open source project it's got a strong community, with over one hundred software developers all committing code to help make it better. Before pandas existed we had only a hodgepodge of tools to use, such as numpy, the python core libraries, and some python statistical tools. But pandas has quickly become the de facto library for representing relational data for data scientists.

I want to take a moment here to introduce the question-answering site Stack Overflow. Stack Overflow is used broadly within the software development community to post questions about programming, programming languages, and programming toolkits. What's special about Stack Overflow is that it's heavily curated by the community. And the Pandas community, in particular, uses it as their number one resource for helping new members. It's quite possible if you post a question to Stack Overflow, and tag it as being Pandas and Python related, that a core Pandas developer will actually respond to your question. In addition to posting questions, Stack Overflow is a great place to go to see what issues people are having and how they can be solved. You can learn a lot from browsing questions and answers at Stack Overflow, and with pandas, this is where the developer community is.

A second resource you might want to consider are books. In 2012 Wes McKinney wrote the definitive Pandas reference book called Python for Data Analysis, published by O'Reilly, and it's recently been updated to a second edition. I consider this the go-to book for understanding how Pandas works. I also appreciate the briefer book "Learning the Pandas Library" by Matt Harrison. It's not a comprehensive book on data analysis and statistics. But if you just want to learn the basics of Pandas and want to do so quickly, I think it's a well laid out volume and it can be had for a good price.

The field of data science is rapidly changing. There are new toolkits and methods being created every day. It can be tough to stay on top of it all. Marco Rodriguez and Tim Golden maintain a wonderful blog aggregator site called Planet Python. You can visit the webpage at planetpython.org, subscribe with an RSS reader, or get the latest articles from the @PlanetPython Twitter feed. There are lots of regular Python data science contributors, and I highly recommend it if you follow RSS feeds.

Here's my last plug on how to deepen your learning. Kyle Polich runs an excellent podcast called Data Skeptic. It isn't Python based per se, but it's well produced and it has a wonderful mixture of interviews with experts in the field as well as short educational lessons. Much of the work he describes is specific to machine learning methods. But if that's something you are planning to explore through the specialization this course is in, I would really encourage you to subscribe to his podcast.

That's it for a little bit of an introduction to this week of the course. Next we're going to dive right into the Pandas library and talk about the series data structure.

2. Series Data Structure

In this lecture we're going to explore the pandas Series structure. By the end of this lecture you should be familiar with how to store and manipulate single dimensional indexed data in the Series object.

The series is one of the core data structures in pandas. You can think of it as a cross between a list and a dictionary. The items are all stored in an order, and there are labels with which you can retrieve them. An easy way to visualize this is two columns of data. The first is the special index, a lot like keys in a dictionary, while the second is your actual data. It's important to note that the data column has a label of its own and can be retrieved using the .name attribute. This is different from dictionaries, and is useful when it comes to merging multiple columns of data. We'll talk about that later on in the course.

# Let's import pandas to get started
import pandas as pd

# As you might expect, you can create a series by passing in a list of values. When you do this, Pandas automatically assigns an index starting with zero and sets the name of the series to None. Let's work on an example of this.

# One of the easiest ways to create a series is to use an array-like object, like a list. 

# Here I'll make a list of three students, Alice, Jack, and Molly, all as strings
students = ['Alice', 'Jack', 'Molly']

# Now we just call the Series function in pandas and pass in the students
pd.Series(students)

>>> 
0    Alice
1     Jack
2    Molly
dtype: object

# The result is a Series object which is nicely rendered to the screen. We see here that pandas has automatically identified the type of data in this Series as "object" and set the dtype property as appropriate. We see that the values are indexed with integers, starting at zero
# We don't have to use strings. If we passed in a list of whole numbers, for instance, we would see that pandas sets the type to int64. Underneath, pandas stores series values in a typed array using the NumPy library. This offers significant speedup when processing data versus traditional python lists.

# Let's create a little list of numbers
numbers = [1, 2, 3]
# And turn that into a series
pd.Series(numbers)

>>> 
0    1
1    2
2    3
dtype: int64

# And we see on my architecture that the result is a series with a dtype of int64
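As an aside, the data column's own label, mentioned at the start of this lecture, can be set when the Series is created; here is a quick sketch using the name parameter (the variable names are just for illustration):

```python
import pandas as pd

# The optional name parameter labels the data column; it is retrievable
# later through the .name attribute
s = pd.Series(['Alice', 'Jack', 'Molly'], name='students')
print(s.name)  # students
```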
# There's some other typing details that exist for performance that are important to know. The most important is how Numpy and thus pandas handle missing data. 

# In Python, we have the None type to indicate a lack of data. But what do we do if we want to have a typed list like we do in the series object?

# Underneath, pandas does some type conversion. If we create a list of strings and we have one element, a None type, pandas inserts it as a None and uses the type object for the underlying array. 

# Let's recreate our list of students, but leave the last one as a None
students = ['Alice', 'Jack', None]
# And let's convert this to a series
pd.Series(students)

>>> 
0    Alice
1     Jack
2     None
dtype: object

# However, if we create a list of numbers, integers or floats, and put in the None type, pandas automatically converts this to a special floating point value designated as NaN, which stands for "Not a Number".

# So let's create a list with a None value in it
numbers = [1, 2, None]
# And turn that into a series
pd.Series(numbers)

>>> 
0    1.0
1    2.0
2    NaN
dtype: float64
# You'll notice a couple of things. First, NaN is a different value. Second, pandas set the dtype of this series to floating point numbers instead of object or ints. That's maybe a bit of a surprise - why not just leave this as an integer? Underneath, pandas represents NaN as a floating point number, and because integers can be typecast to floats, pandas went and converted our integers to floats. So if you're wondering why the list of integers you put into a Series came back as floats, it's probably because there is some missing data.
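If you do need to keep integers alongside missing values, newer versions of pandas offer nullable integer dtypes. This goes beyond what the lecture covers, but a minimal sketch looks like this:

```python
import pandas as pd

# The nullable "Int64" dtype (note the capital I) keeps integer values as
# integers and stores the missing entry as pd.NA instead of upcasting to float
s = pd.Series([1, 2, None], dtype='Int64')
print(s.dtype)         # Int64
print(s.isna().sum())  # 1
```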

# For those who might not have done scientific computing in Python before, it is important to stress that None and NaN might be being used by the data scientist in the same way, to denote missing data, but that underneath these are not represented by pandas in the same way.

# NaN is *NOT* equivalent to None, and when we try the equality test, the result is False.

# Let's bring in numpy, which allows us to generate a NaN value
import numpy as np
# And lets compare it to None
np.nan == None

>>> False

# It turns out that you actually can't do an equality test of NaN to itself. When you do, the answer is always False. 

np.nan == np.nan

>>> False

# Instead, you need to use special functions to test for the presence of not a number, such as the NumPy isnan() function.

np.isnan(np.nan)

>>> True

# So keep in mind when you see NaN, its meaning is similar to None, but it's a numeric value and treated differently for efficiency reasons.
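Pandas itself also provides a helper that treats both None and NaN as missing, which sidesteps the equality-test pitfall above; a quick sketch:

```python
import numpy as np
import pandas as pd

# pd.isna() reports both None and NaN as missing, which is usually what
# a data scientist wants when testing for absent values
print(pd.isna(None))    # True
print(pd.isna(np.nan))  # True
print(pd.isna(1.0))     # False
```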
# Let's talk more about how pandas' Series can be created. While a list might be a common way to create some play data, often you have labeled data that you want to manipulate. A series can be created directly from dictionary data. If you do this, the index is automatically assigned to the keys of the dictionary that you provide and not just incrementing integers.

# Here's an example using some data of students and their classes.

students_scores = {'Alice': 'Physics',
                   'Jack': 'Chemistry',
                   'Molly': 'English'}
s = pd.Series(students_scores)
s

>>> 
Alice      Physics
Jack     Chemistry
Molly      English
dtype: object

# We see that, since it was string data, pandas set the data type of the series to "object". We see that the index, the first column, is also a list of strings.
# Once the series has been created, we can get the index object using the index attribute.

s.index

>>> Index(['Alice', 'Jack', 'Molly'], dtype='object')
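The values themselves can be pulled out as the underlying NumPy array too; a small sketch (not from the lecture, but using only standard pandas API):

```python
import numpy as np
import pandas as pd

s = pd.Series({'Alice': 'Physics', 'Jack': 'Chemistry', 'Molly': 'English'})
# .to_numpy() exposes the underlying NumPy array (the older .values
# attribute does much the same)
arr = s.to_numpy()
print(type(arr))  # <class 'numpy.ndarray'>
```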

# As you play more with pandas you'll notice that a lot of things are implemented as numpy arrays, and have the dtype value set. This is true of indices, and here pandas inferred that we were using objects for the index.
# Now, this is kind of interesting. The dtype of object is not just for strings, but for arbitrary objects. Let's create a more complex type of data, say, a list of tuples.
students = [("Alice","Brown"), ("Jack", "White"), ("Molly", "Green")]
pd.Series(students)

>>>
0    (Alice, Brown)
1     (Jack, White)
2    (Molly, Green)
dtype: object

# We see that each of the tuples is stored in the series object, and the type is object.
# You can also separate your index creation from the data by passing in the index as a list explicitly to the series.

s = pd.Series(['Physics', 'Chemistry', 'English'], index=['Alice', 'Jack', 'Molly'])
s

>>> 
Alice      Physics
Jack     Chemistry
Molly      English
dtype: object

# So what happens if your list of values in the index object is not aligned with the keys in your dictionary for creating the series? Well, pandas overrides the automatic creation to favor only and all of the index values that you provide. So it will ignore from your dictionary all keys which are not in your index, and pandas will add None or NaN type values for any index value you provide which is not in your dictionary key list.

# Here's an example. I'll pass in a dictionary of three items, in this case students and their courses
students_scores = {'Alice': 'Physics',
                   'Jack': 'Chemistry',
                   'Molly': 'English'}
# When I create the series object though I'll only ask for an index with three students, and I'll exclude Jack but include Sam
s = pd.Series(students_scores, index=['Alice', 'Molly', 'Sam'])
s

>>> 
Alice    Physics
Molly    English
Sam          NaN
dtype: object

# The result is that the Series object doesn't have Jack in it, even though he was in our original dataset, but it explicitly does have Sam in it as a missing value.

In this lecture we've explored the pandas Series data structure. You've seen how to create a series from lists and dictionaries, how indices on data work, and the way that pandas typecasts data including missing values.

3. Querying Series

In this lecture, we'll talk about one of the primary data types of the Pandas library, the Series. You'll learn about the structure of the Series, how to query and merge Series objects together, and the importance of thinking about parallelization when engaging in data science programming.

# A pandas Series can be queried either by the index position or the index label. If you don't give an index to the series when querying, the position and the label are effectively the same values. To query by numeric location, starting at zero, use the iloc attribute. To query by the index label, you can use the loc attribute. 

# Lets start with an example. We'll use students enrolled in classes coming from a dictionary
import pandas as pd
students_classes = {'Alice': 'Physics',
                   'Jack': 'Chemistry',
                   'Molly': 'English',
                   'Sam': 'History'}
s = pd.Series(students_classes)
s

>>> 
Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

# So, for this series, if you wanted to see the fourth entry we would use the iloc attribute with the parameter 3.
s.iloc[3]

>>> 'History'

# If you wanted to see what class Molly has, we would use the loc attribute with a parameter of Molly.
s.loc['Molly']

>>> 'English'

Keep in mind that iloc and loc are not methods, they are attributes. So you don't use parentheses to query them, but square brackets instead, which is called the indexing operator. In Python this calls get or set for an item depending on the context of its use. This might seem a bit confusing if you're used to languages where encapsulation of attributes, variables, and properties is common, such as in Java.

# Pandas tries to make our code a bit more readable and provides a sort of smart syntax using the indexing operator directly on the series itself. For instance, if you pass in an integer parameter, the operator will behave as if you want it to query via the iloc attribute. (Note that recent versions of pandas deprecate this positional fallback and may warn, so preferring iloc explicitly is safer.)
s[3]

>>> 'History'

# If you pass in an object, it will query as if you wanted to use the label based loc attribute.
s['Molly']

>>> 'English'

# So what happens if your index is a list of integers? This is a bit complicated and Pandas can't determine automatically whether you're intending to query by index position or index label. So you need to be careful when using the indexing operator on the Series itself. The safer option is to be more explicit and use the iloc or loc attributes directly.

# Here's an example using classes and their class code information, where classes are indexed by class codes, in the form of integers
class_code = {99: 'Physics',
              100: 'Chemistry',
              101: 'English',
              102: 'History'}
s = pd.Series(class_code)

# If we try and call s[0] we get a key error because there's no item in the classes series with a label of zero; instead we have to call iloc explicitly if we want the first item.

s[0] 

# So, that didn't call s.iloc[0] underneath as one might expect, instead it generates an error 
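The unambiguous version of the query, sticking to the explicit attributes, looks like this (a quick sketch using the same class-code data):

```python
import pandas as pd

class_code = {99: 'Physics',
              100: 'Chemistry',
              101: 'English',
              102: 'History'}
s = pd.Series(class_code)

# iloc is always positional and loc is always label-based, so there is
# no ambiguity even with an integer index
print(s.iloc[0])  # Physics  (first position)
print(s.loc[99])  # Physics  (label 99)
```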

Now that we know how to get data out of the series, let's talk about working with the data. A common task is to want to consider all of the values inside of a series and do some sort of operation. This could be trying to find a certain number, or summarizing data, or transforming the data in some way.

# A typical programmatic approach to this would be to iterate over all the items in the series, and invoke the operation one is interested in. For instance, we could create a Series of integers representing student grades, and just try and get an average grade

grades = pd.Series([90, 80, 70, 60])

total = 0
for grade in grades:
    total+=grade
print(total/len(grades))

>>> 75.0

# This works, but it's slow. Modern computers can do many tasks simultaneously, especially, but not only, tasks involving mathematics.
# Pandas and the underlying numpy libraries support a method of computation called vectorization. Vectorization works with most of the functions in the numpy library, including the sum function.

# Here's how we would really write the code using the numpy sum method. First we need to import the numpy module

import numpy as np

# Then we just call np.sum and pass in an iterable item. In this case, our panda series.

total = np.sum(grades)
print(total/len(grades))

>>> 75.0

# Now both of these methods create the same value, but is one actually faster? The Jupyter Notebook has a magic function which can help. 

# First, let's create a big series of random numbers. This is used a lot when demonstrating techniques with Pandas
numbers = pd.Series(np.random.randint(0,1000,10000))

# Now lets look at the top five items in that series to make sure they actually seem random. We can do this with the head() function
numbers.head()

>>> 
0    309
1    803
2    755
3    277
4     19
dtype: int64

# We can actually verify that length of the series is correct using the len function
len(numbers)
>>> 10000

Ok, we're confident now that we have a big series. The ipython interpreter has something called magic functions, which begin with a percentage sign. If we type this sign and then hit the Tab key, you can see a list of the available magic functions. You could write your own magic functions too, but that's a little bit outside of the scope of this course.

# Here, we're actually going to use what's called a cell magic function. These start with two percentage signs and wrap the code in the current Jupyter cell. The function we're going to use is called timeit. This function will run our code a few times to determine, on average, how long it takes.

# Let's run timeit with our original iterative code. You can give timeit the number of loops that you would like to run. By default, it is 1,000 loops. I'll ask timeit here to use 100 runs because we're recording this. Note that in order to use a cellular magic function, it has to be the first line in the cell

%%timeit -n 100
total = 0
for number in numbers:
    total+=number

total/len(numbers)

>>> 1.52 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Not bad. Timeit ran the code and it doesn't seem to take very long at all. Now let's try with vectorization.

%%timeit -n 100
total = np.sum(numbers)
total/len(numbers)

>>> 67.5 µs ± 11 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Wow! This is a pretty shocking difference in the speed and demonstrates why one should be aware of parallel computing features and start thinking in functional programming terms. Put more simply, vectorization is the ability for a computer to execute multiple instructions at once, and with high performance chips, especially graphics cards, you can get dramatic speedups. Modern graphics cards can run thousands of instructions in parallel.
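It's also worth knowing that the Series object carries vectorized summary methods of its own, so you often don't need to reach for numpy directly; a minimal sketch computing the same average as the grades example earlier:

```python
import pandas as pd

grades = pd.Series([90, 80, 70, 60])
# Summary statistics like mean() are vectorized under the hood,
# so there's no need to loop or even call np.sum yourself
print(grades.mean())  # 75.0
```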

# A related feature in pandas and numpy is called broadcasting. With broadcasting, you can apply an operation to every value in the series, changing the series. For instance, if we wanted to increase every value by 2, we could do so quickly using the += operator directly on the Series object. 

# Let's look at the head of our series
numbers.head()

>>> 
0    309
1    803
2    755
3    277
4     19
dtype: int64

# And now lets just increase everything in the series by 2
numbers+=2
numbers.head()

>>> 
0    311
1    805
2    757
3    279
4     21
dtype: int64

# The procedural way of doing this would be to iterate through all of the items in the series and increase the values directly. Pandas does support iterating through a series much like a dictionary, allowing you to unpack values easily.

# We can use the items() function (called iteritems() in older versions of pandas), which returns a label and value 
for label, value in numbers.items():
    # in early versions of pandas we would use the set_value() function
    # in current versions, we use the .iat[] or .at[] indexers
    numbers.iat[label] = value+2
# And we can check the result of this computation
numbers.head()

>>> 
0    313
1    807
2    759
3    281
4     23
dtype: int64

So the result is the same, though you may notice a warning depending upon the version of pandas being used. But if you find yourself iterating pretty much *any time* in pandas, you should question whether you're doing things in the best possible way.

# Let's take a look at some speed comparisons. First, let's try ten loops using the iterative approach

%%timeit -n 10
# we'll create a blank new series of items to deal with
s = pd.Series(np.random.randint(0,1000,1000))
# And we'll just rewrite our loop from above.
for label, value in s.items():
    s.loc[label]= value+2

>>> 51.2 ms ± 2.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Now lets try that using the broadcasting methods
%%timeit -n 10
# We need to recreate a series
s = pd.Series(np.random.randint(0,1000,1000))
# And we just broadcast with +=
s+=2

>>> 338 µs ± 136 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Amazing. Not only is it significantly faster, but it's more concise and even easier to read too. The typical mathematical operations you would expect are vectorized, and the numpy documentation outlines what it takes to create vectorized functions of your own. 
# One last note on using the indexing operators to access series data. The .loc attribute lets you not only modify data in place, but also add new data as well. If the value you pass in as the index doesn't exist, then a new entry is added. And keep in mind, indices can have mixed types. While it's important to be aware of the typing going on underneath, Pandas will automatically change the underlying NumPy types as appropriate.

# Here's an example using a Series of a few numbers. 
s = pd.Series([1, 2, 3])
# We could add some new value, maybe a university course
s.loc['History'] = 102
s

>>> 
0            1
1            2
2            3
History    102
dtype: int64

We see that mixed types for data values or index labels are no problem for Pandas. Since "History" is not in the original list of indices, s.loc['History'] essentially creates a new element in the series, with the index named "History", and the value of 102

# Up until now I've shown only examples of a series where the index values were unique. I want to end this lecture by showing an example where index values are not unique, and this makes pandas Series a little different conceptually than, for instance, a relational database.

# Lets create a Series with students and the courses which they have taken
students_classes = pd.Series({'Alice': 'Physics',
                   'Jack': 'Chemistry',
                   'Molly': 'English',
                   'Sam': 'History'})
students_classes

>>> 
Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

# Now lets create a Series just for some new student Kelly, which lists all of the courses she has taken. We'll set the index to Kelly, and the data to be the names of courses.
kelly_classes = pd.Series(['Philosophy', 'Arts', 'Math'], index=['Kelly', 'Kelly', 'Kelly'])
kelly_classes

>>> 
Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object

# Finally, we can append all of the data in this new Series to the first using the .append() function. (Note: in pandas 2.0 and later, Series.append() was removed; pd.concat() does the same job.)
all_students_classes = students_classes.append(kelly_classes)

# This creates a series which has our original people in it as well as all of Kelly's courses
all_students_classes

>>> 
Alice       Physics
Jack      Chemistry
Molly       English
Sam         History
Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object

# There are a couple of important considerations when using append. First, Pandas will take the series and try to infer the best data types to use. In this example, everything is a string, so there are no problems here. Second, the append method doesn't actually change the underlying Series objects, it instead returns a new series which is made up of the two appended together. This is a common pattern in pandas - by default returning a new object instead of modifying in place - and one you should come to expect. By printing the original series we can see that that series hasn't changed.
students_classes

>>> 
Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object
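If you're on pandas 2.0 or later, where Series.append() has been removed, the same result comes from pd.concat(); a minimal sketch:

```python
import pandas as pd

students_classes = pd.Series({'Alice': 'Physics',
                              'Jack': 'Chemistry',
                              'Molly': 'English',
                              'Sam': 'History'})
kelly_classes = pd.Series(['Philosophy', 'Arts', 'Math'],
                          index=['Kelly', 'Kelly', 'Kelly'])

# pd.concat() replaces Series.append() in pandas 2.0+; like append(),
# it returns a new Series and leaves the originals untouched
all_students_classes = pd.concat([students_classes, kelly_classes])
print(len(all_students_classes))  # 7
```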

# Finally, we see that when we query the appended series for Kelly, we don't get a single value, but a series itself. 
all_students_classes.loc['Kelly']

>>> 
Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object

In this lecture, we focused on one of the primary data types of the Pandas library, the Series. You learned how to query the Series, with .loc and .iloc, that the Series is an indexed data structure, how to merge two Series objects together with append(), and the importance of vectorization.

There are many more methods associated with the Series object that we haven't talked about. But with these basics down, we'll move on to talking about the pandas two-dimensional data structure, the DataFrame. The DataFrame is very similar to the series object, but includes multiple columns of data, and is the structure that you'll spend the majority of your time working with when cleaning and aggregating data.
