5-AM Project: day8 Practical data science with Python 1

Article link: https://blog.csdn.net/wendyponcho/article/details/136047254

Getting started with Python

Pip

Pip is the package installer for Python and the classic way to manage Python packages. The command to install a package with pip is pip install <packagename>. For example, pandas, a common package in data science, can be installed with pip install pandas. By default, this installs the latest version of the package. If our installed package is a little old and we want to upgrade it, we can do so with pip install --upgrade pandas. Sometimes we need to downgrade packages for compatibility or other reasons. We can install a specific version of a package with, for example, pip install pandas==1.1.4. The --force-reinstall flag can be added to the command to force installation of a particular version (for example, for downgrading or upgrading if the usual pip install is not working).

Python basics

Numbers

When we think of what to use programming for, one of the first things that comes to mind is math. A lot of what we do with Python for data science is math, so understanding how to use numbers in Python is crucial. Most of the time, we only care about two types of numbers in Python: integers and floats. Integers are whole numbers without a decimal place, such as 2. Floats have a decimal place, such as 2.0. We can check the type of an object in Python with the function type(), such as type(2), which outputs int.

With numbers, we can perform the usual math operations: addition, subtraction, multiplication, and division, as shown here:

2 + 2 # addition 
2 - 2 # subtraction
2 * 2 # multiplication 
2 / 2 # division 
2 // 2 # integer division
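As a small sketch of the difference between the two number types and the two division operators, the following can be run as-is. The variable names are only illustrative:

```python
# Integers and floats are distinct types.
int_type = type(2)      # int
float_type = type(2.0)  # float

# Regular division always returns a float, even when the result is whole.
true_div = 7 / 2   # 3.5
whole_div = 2 / 2  # 1.0

# Integer (floor) division drops the fractional part.
floor_div = 7 // 2  # 3

print(int_type, float_type, true_div, whole_div, floor_div)
```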

Strings

In programming, strings are text. We can create a string in Python with either single or double quotes:

'a string'
"a string"

Similar to how we convert a number to an integer or float with int() or float(), we can convert other objects (such as numbers) to strings with str().
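A minimal sketch of these conversions, with illustrative variable names:

```python
# Convert a number to a string, and strings back to numbers.
number_as_text = str(2)       # '2'
text_as_int = int('2')        # 2
text_as_float = float('2.5')  # 2.5

# int() on a float truncates the decimal part rather than rounding.
truncated = int(2.9)  # 2

# Strings are joined with +, so a number must be converted with str() first.
message = 'Books read: ' + str(3)
print(message)
```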

Python strings can be thought of as a series of characters and can be indexed. Indexing means selecting a subset, or part, of the string.

In Python, indexing starts at 0, and is done with square brackets after a string. That is, the first element of a string can be selected with the 0th index, like 'a string'[0], which would return a. The second character in the string can be selected with 'a string'[1] (returning the space character), and so on. If we want the last character in a string, we can specify -1 as our index, such as 'a string'[-1], which would give us g.

We can also select a subset of a string with indexing by providing start and stop points, such as 'a string'[0:4], which gives us the first four characters: a st.

Lastly, we can choose a 'step' for our indexing as the third part of our indexing format. For example, if we want every other letter, we can index a string like this: 'a string'[::2], giving us asrn. If we want to reverse a string, we can provide -1 as the step: 'a string'[::-1], yielding gnirts a.

These three parts of the Python indexing system can be thought of as a start, stop, and step, separated by colons: [start:stop:step].

If we do not specify any of start, stop, or step, the default values are taken, [0:None:1], meaning the entire string one character at a time, going forward.

Tying it together, if we wanted every other character from the first five characters of a string, we could index it as [:5:2] or [0:5:2]. We will revisit Python indexing again soon when we learn about lists.

'a string'[0]  # first character of a string
'a string'[-1]  # last character of a string
'a string'[0:4]  # index a string to get first 4 characters
'a string'[:4]  # index a string to get first 4 characters
'a string'[::2]  # get every other letter
'a string'[::-1]  # reverse the string
'a string'[:5:2]  # every other letter in the first 5 characters

Variables

Variables in programming are used to hold values. For example, say we want to keep track of how many books we've read. Here, we could use a variable. In this case, our variable would be an integer: books = 1
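Continuing the books example, a variable can be updated as our count changes. This is only a sketch of the idea:

```python
books = 1          # we have read one book so far

books = books + 1  # read another book; books is now 2
books += 1         # shorthand for the same kind of update; books is now 3

print(books)
```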

Lists, tuples, sets, and dictionaries

In most programming languages, we have data structures that can store a sequence of values. In Python, we have lists, tuples, sets and dictionaries; let's go through them each in turn.

Lists

One of the core data structures in Python is a list. Lists are contained in square brackets and can contain several values of differing data types. Lists can even contain lists themselves. Here is an example, a list of integers: [1, 2, 3]

Lists support several useful operations and methods in Python. We will cover the basics here:

Concatenation
Repetition
Length
Appending
Sorting
Indexing
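The operations listed above can be sketched as follows. The list contents and variable names are made up for illustration:

```python
a_list = [3, 1, 2]

concatenated = a_list + [4, 5]  # concatenation: [3, 1, 2, 4, 5]
repeated = a_list * 2           # repetition: [3, 1, 2, 3, 1, 2]
length = len(a_list)            # length: 3

a_list.append(0)                # appending modifies the list in place
a_list.sort()                   # sorting in place: [0, 1, 2, 3]

# sorted() returns a new list and leaves the original unchanged
descending = sorted(a_list, reverse=True)  # [3, 2, 1, 0]

first = a_list[0]               # indexing: 0
print(concatenated, repeated, length, a_list, descending, first)
```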

To get the last element of a list, we could use the length of the list minus 1, or -1:

a_list[len(a_list) - 1]
a_list[-1]

The negative number syntax works by counting backward from the last element of a list. -1 denotes the last element, -2 the second-to-last element, and so on.

If we want to select a range of elements, say the first three elements of a list, we can use indexing as follows:

a_list[0:3]
a_list[:3]

As with strings, we can also provide a step. For example, a_list[::2] selects every other element. Since the default values for start and stop in [start:stop:step] are 0 and None, this gives us every other element of the entire list, and would be the same as a_list[0:None:2]. We can also reverse lists with the handy trick of using -1 as our step:

a_list[::-1]
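To make the indexing rules concrete, here is a short sketch on an example list (the values are arbitrary):

```python
a_list = [10, 20, 30, 40, 50]

last = a_list[-1]                     # 50
same_last = a_list[len(a_list) - 1]   # 50, the longer way
first_three = a_list[:3]              # [10, 20, 30]
every_other = a_list[::2]             # [10, 30, 50]
reversed_list = a_list[::-1]          # [50, 40, 30, 20, 10]

print(last, first_three, every_other, reversed_list)
```

Slicing returns a new list, so a_list itself is unchanged by any of these lines.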

Tuples

A tuple is similar to a list, but it cannot be changed once it is created – this is also called immutability. Tuples have parentheses instead of square brackets, like this:

a_tuple = (2, 3)

Tuples are sometimes used as data structures in various Python packages. Lists or sets can be converted to tuples with the tuple() function:

tuple(a_list)
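A quick way to see immutability in practice is to try changing a tuple element. The following sketch catches the resulting error so that the script keeps running:

```python
a_list = [1, 2, 3]
a_tuple = tuple(a_list)  # (1, 2, 3)

try:
    a_tuple[0] = 99      # tuples do not support item assignment
except TypeError as error:
    error_message = str(error)
    print(error_message)

a_list[0] = 99           # the same assignment works on a list
print(a_list, a_tuple)
```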

Sets

Sets follow the mathematical definition, which is a group of unique values. Sets can be created with curly brackets or the set() function. For example, if we want to get the unique numbers from a list, we can convert it to a set:

set(a_list)

We can also create a set from scratch with curly brackets:

a_set = {1, 2, 3, 3}  # the duplicate 3 is dropped, leaving {1, 2, 3}

Sets find uses in natural language processing, examining unique values present in datasets, and more.
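Here is a small sketch of sets in use. The union and intersection operators are not covered in the text above and are included only to show how the mathematical definition carries over:

```python
a_list = [1, 2, 2, 3, 3, 3]
unique_values = set(a_list)   # duplicates are removed: {1, 2, 3}

a_set = {1, 2, 3, 3}          # also {1, 2, 3}
same = unique_values == a_set  # True
has_two = 2 in a_set           # membership test: True

other_set = {3, 4}
union = a_set | other_set          # {1, 2, 3, 4}
intersection = a_set & other_set   # {3}

print(unique_values, same, has_two, union, intersection)
```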

Dictionaries

Dictionaries are similar to sets in that their keys are unique, but each key also maps to a value, forming key-value pairs. Here is an example of a dictionary:

a_dict = {'books': 1, 'magazines': 2, 'articles': 7}
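A brief sketch of common dictionary operations follows. The added keys ('podcasts', 'videos') are made up for illustration:

```python
a_dict = {'books': 1, 'magazines': 2, 'articles': 7}

books = a_dict['books']   # look up a value by its key: 1
a_dict['books'] = 2       # update the value for an existing key
a_dict['podcasts'] = 5    # add a new key-value pair

keys = list(a_dict.keys())          # keys in insertion order
missing = a_dict.get('videos', 0)   # get() returns a default for absent keys

print(books, a_dict, keys, missing)
```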

Loops and comprehensions

Loops are fundamental to programming because we can step through lists or other data structures methodically, one element at a time. In Python, a for loop can be used to loop through a list or dictionary:

a_list = [1, 2, 3]
for element in a_list:
    print(element)

A few special keywords are available with loops: continue and break. The break keyword will end a loop, while the continue keyword will immediately move on to the next iteration in the loop, skipping any code left below it. If we wanted to run only one iteration of our loop for testing, for example, we could use break:

for element in a_list:
    print(element)
    break
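For comparison, here is a sketch that uses continue to skip one element, next to a loop that uses break conditionally. The lists printed and stopped_early exist only to record what each loop did:

```python
a_list = [1, 2, 3]

printed = []
for element in a_list:
    if element == 2:
        continue           # skip the rest of this iteration
    printed.append(element)
    print(element)

stopped_early = []
for element in a_list:
    if element == 3:
        break              # leave the loop entirely
    stopped_early.append(element)

print(printed, stopped_early)
```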

Some commonly used functions with loops are range() and len(). For example, if we want to loop through a list and get the index of each element, we can do that with range and len:

for i in range(len(a_list)):
    print(i)

The range() function takes at least one argument (the stop value, which gives a range from 0 up to but not including that value), but can also take up to three arguments, start, stop, and step, which is the same idea as indexing lists and strings. For example, if we want to get a range of numbers starting at 1 and going up to 6 in steps of 2, our function call to range is range(1, 7, 2), which yields 1, 3, and 5. Just like indexing, the upper boundary to the range is non-inclusive, so the value 7 means our range can go no higher than 6.
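Wrapping range() in list() lets us see the values it produces. A short sketch:

```python
up_to_three = list(range(3))       # [0, 1, 2]
one_to_six = list(range(1, 7))     # [1, 2, 3, 4, 5, 6]
odd_steps = list(range(1, 7, 2))   # [1, 3, 5]
countdown = list(range(5, 0, -1))  # [5, 4, 3, 2, 1]

print(up_to_three, one_to_six, odd_steps, countdown)
```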

A similar approach to this is to use the built-in enumerate() function:

a_list = [1, 2, 3]
for index, element in enumerate(a_list):
    print(index, element)

Both of these approaches would print out the numbers 0, 1, and 2, although the enumerate example also prints out the list elements 1, 2, and 3. The enumerate function yields tuples, each pairing a counter that starts at 0 with an element of the list or other iterable object.

So, our output from the preceding example looks like this:

0 1
1 2
2 3

Loops through lists can also be accomplished with list comprehensions. These are handy because they can make code shorter and sometimes run slightly faster than for loops.

Here is an example of a for loop and the same thing accomplished with a list comprehension:

a_list = []
for i in range(3):
    a_list.append(i)
# a list comprehension for the same end-result as the loop above
a_list = [i for i in range(3)]

We can also loop through dictionaries. To loop through a dictionary, we can use the .items() method/function:

a_dict = {'books': 1, 'magazines': 2, 'articles': 7}
for key, value in a_dict.items():
    print(f'{key}:{value}')

The idea is the same as looping through a list, but the .items() method of dictionaries gives us tuples of keys and values from the dictionary, one pair at a time.

Notice that we are using f-string formatting here to dynamically print out the keys and values as we loop through our dictionary.

We can also use dictionary comprehensions, which are very similar to list comprehensions. The following code creates a dictionary where the keys are the values 1, 2, 3, and the values are their squares (1, 4, 9):

a_dict = {i: i ** 2 for i in range(1, 4)}
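To check what the comprehension builds, we can compare it with an equivalent for loop. In this sketch, squares and squares_loop are illustrative names:

```python
# Dictionary comprehension: keys 1 to 3 mapped to their squares
squares = {i: i ** 2 for i in range(1, 4)}

# The same result built step by step with a for loop
squares_loop = {}
for i in range(1, 4):
    squares_loop[i] = i ** 2

print(squares)
print(squares == squares_loop)
```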

Booleans and conditionals

The last variable type we'll cover in Python is the Boolean. Booleans can take the binary values of True (1) or False (0). As implied in the parentheses, we can also use the values of 1 for True and 0 for False. We can use Booleans to test for a condition. Say we want to see whether the number of books we've read is greater than 10. We can test it in Python like so:

books_read = 11
books_read > 10
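A comparison like this is typically used in an if statement. The following is a minimal sketch. The thresholds and messages are invented for illustration:

```python
books_read = 11
is_avid_reader = books_read > 10  # the comparison evaluates to a Boolean

if books_read > 10:
    message = 'More than 10 books read'
elif books_read > 0:
    message = 'Between 1 and 10 books read'
else:
    message = 'No books read yet'

print(is_avid_reader, message)
print(True == 1, False == 0)  # Booleans compare equal to 1 and 0
```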

Packages and modules

Libraries, also called packages, are much of what makes Python so powerful for data science. Each package adds new functionality that wasn't there before, much like installing new apps on our smartphones. Python has a host of built-in modules and packages providing basic functionality, but the real power comes from community packages on GitHub and PyPI.

The built-in time module has utilities for timing (by built-in, I mean it comes installed with Python). One function in this module is time.time(), which gets us the current time in seconds since the epoch (since January 1, 1970). We can change a package or module name with an alias, like so:

import time as t
t.time()

Above, we change the name of our imported time module to t, and then use the same time.time() function. But instead of time.time(), it's now t.time() with the new alias.
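One common use of time.time() is measuring how long a piece of code takes. A rough sketch using the t alias, where the summing step is just an arbitrary task to time:

```python
import time as t

start = t.time()                 # seconds since the epoch, as a float
total = sum(range(1_000_000))    # some work to time
elapsed = t.time() - start       # elapsed wall-clock seconds

print(f'Summed to {total} in {elapsed:.4f} seconds')
```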

Functions

Functions in Python always use parentheses after their function name. We place arguments within the parentheses, like this:

a_list = [2, 4, 1]
sorted(a_list, reverse=True)

In this case, we are using the sorted function to sort our list. We also provide the reverse argument as True, so it sorts from greatest to least instead of the default of least to greatest.

To create a function, we use the def keyword, and then give the function name. The function name can be composed of letters, numbers, and underscores, and cannot start with a number, just like variables.

Then we give any arguments between the parentheses, although we can also specify no arguments if we choose. If we want to supply a default value to an argument, we set it to the default value with an equals sign, as with printAdd in the example below (printAdd='more').

Then we put a colon character after the closing parenthesis, and the function body starts on the next line after an indentation of four spaces.

Often, we write some documentation about the function just below the function definition line as a multi-line string (a docstring). If we want to return something from the function, we can add a return statement, which will give us printAdd in this case. The return statement will exit the function.

def test_function(doPrint, printAdd='more'):
    """
    A demo function.
    """
    if doPrint:
        print('test' + printAdd)
        return printAdd

One key concept with functions is scoping. If we create a variable inside a function, we can only access that variable within the function. For example, if we try to access the func_var variable outside test_function, we cannot:

def test_function():
    """
    A demo function.
    """
    func_var = 'testing'
    print(func_var)
print(func_var)  # raises NameError; variable not defined

If we run the preceding code, we will define the test_function function. Then, when we try to print out func_var outside of the function, we get an error: NameError: name 'func_var' is not defined. The func_var variable can only be accessed from within the function. There are ways around this, such as declaring variables as global variables. However, using global variables is not considered best practice and should be avoided.
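Rather than reaching for a global variable, we can usually return the value from the function. A minimal sketch, where get_func_var is a hypothetical function name:

```python
def get_func_var():
    """
    Create a local variable and hand its value back to the caller.
    """
    func_var = 'testing'
    return func_var


result = get_func_var()  # the value is now available outside the function

try:
    func_var             # the local name itself still does not exist here
except NameError as error:
    scope_error = str(error)

print(result)
print(scope_error)
```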

Classes

Python is an object-oriented language, which is a category of programming languages. It means that the Python language is fundamentally based on objects. Objects are a collection of data/variables and functions/methods. Objects can be defined with classes using the class keyword. For example, a simple class can be created like so:

class testObject:
    def __init__(self, attr=10):
        self.test_attribute = attr
    
    def test_function(self):
        print('testing123')
        print(self.test_attribute)

This creates a testObject class on the first line.

The __init__ function is run when we create a new instance of the class and is a standard feature of classes in Python.

For example, if we create an instance of the class with t_o = testObject(123), we create a new testObject object in the variable t_o, and set the attribute t_o.test_attribute equal to 123. If we call testObject() with no argument, the default value of 10 is used instead.

Setting the attribute test_attribute is done in the __init__ function, which runs when we initialize the t_o variable as an instance of the testObject class. We can access attributes of an instance with syntax such as t_o.test_attribute. Functions can be included with classes, such as the test_function function above; these are called methods.

Note that method definitions in classes take self as the first argument (self is a naming convention rather than a keyword), which allows us to refer to the instance of the class in the functions. This enables us to set attributes of the object and use them throughout the methods (functions) that the object has.
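Putting this together, here is a short sketch that creates two instances and uses their attributes and methods. The class is repeated so the example runs on its own, and default_obj is an illustrative name:

```python
class testObject:
    def __init__(self, attr=10):
        self.test_attribute = attr

    def test_function(self):
        print('testing123')
        print(self.test_attribute)


t_o = testObject(123)       # attr is 123 for this instance
default_obj = testObject()  # no argument, so attr defaults to 10

t_o.test_function()         # prints testing123, then 123
print(default_obj.test_attribute)

t_o.test_attribute = 456    # attributes can be reassigned after creation
print(t_o.test_attribute)
```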

Multithreading and multiprocessing

Modern CPUs have several cores, which can all run calculations simultaneously. However, Python is not parallelized by default, meaning it runs on only one core at a time. This is due to the global interpreter lock, or GIL, which restricts the running Python process to executing one thread (virtual CPU core) at a time.

There are often two threads per CPU core these days, so a lot of CPU power goes unused by Python by default. This limitation is considered a weakness of Python. Even with the infamous GIL, though, we can still parallelize code with a few lines of Python.

The multiprocessing and threading modules in Python allow for multiprocessing and multithreading, but it's easier to use the functions from the concurrent.futures module. You can check out the multithreading_demo.py file in the book's GitHub repository, which briefly shows how to use multiprocessing and multithreading.

Note that multiprocessing is useful for improving performance, but often we can use tools others have built and avoid hand-coding it ourselves with concurrent.futures. For example, we'll see in a future chapter that we can use the swifter package for parallelizing data processing, which is much easier than using concurrent.futures ourselves.
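As a rough idea of what concurrent.futures looks like, here is a minimal sketch using a thread pool. It is not the book's multithreading_demo.py, and the square function and the input numbers are made up for illustration:

```python
from concurrent.futures import ThreadPoolExecutor


def square(number):
    """A stand-in for a task we want to run concurrently."""
    return number ** 2


numbers = [1, 2, 3, 4]

# The executor hands each number to a worker thread; map() returns
# the results in the same order as the inputs.
with ThreadPoolExecutor(max_workers=2) as executor:
    results = list(executor.map(square, numbers))

print(results)
```

Because of the GIL, threads mainly help with tasks that spend time waiting, such as downloads or file reads. For CPU-heavy work, concurrent.futures also provides ProcessPoolExecutor with the same map() interface, which is usually run under an if __name__ == '__main__': guard in a script.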

Debugging errors and utilizing documentation

Debugging

Python comes with a module for debugging code called pdb. To use it, we insert the line import pdb; pdb.set_trace() in our code. Then, when we run our code, the execution stops on that line and allows us to type in Python code to examine the variables there. For example, try running the following code (or running the pdb_demo.py file from this book's GitHub repository):

test_str = 'a test string'
a = 2
b = 2
import pdb; pdb.set_trace()
c = a + b

Documentation

Using documentation is extremely important when coding. For any major programming language or package, there is documentation explaining how its components work.

For example, the official Python documentation has been referenced throughout the chapter so far and is useful for built-in Python functions and Python fundamentals.

Documentation for other packages can be found by searching an internet search engine for "<package name> documentation" or "<package name> docs", since documentation is often abbreviated as "docs". Searching for a specific function in a package can also be helpful in getting to the information you need faster.

Lastly, we can access documentation within IPython or Jupyter Notebooks with a question mark next to any object or using the help() command. For example, to bring up the documentation for the range function, we could use ?range or range?.
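The ? syntax only works in IPython and Jupyter, but help() and the underlying docstring are available in any Python session. A small sketch:

```python
# help() prints the documentation for an object
help(sorted)

# The same text is stored on the object as its docstring
doc_text = range.__doc__
print(doc_text)
```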

Version control with Git

Since data science tends to consist of Python code, we need a way to save and keep track of our code. The best practice for saving code, collaborating, and tracking changes is to use version control. There are several version control systems and software solutions out there, but Git is the most frequently used version control software for now, with GitHub being one of the most frequently used code-hosting platforms utilizing Git.

Git is a system for keeping track of changes in code, and GitHub allows us to use Git with a web service. GitHub lets us create accounts, store our code on their servers, and share it with the world. We can also easily collaborate with other people using GitHub. A Git/GitHub crash course is beyond the scope of this book, but if you are interested in a book on the subject, we can recommend Version Control with Git and GitHub, by Alex Magana from Packt.

Productivity tips

There are a few productivity hacks that can help you code faster. One big trick that we have already touched on is tab autocompletion. Tab autocompletion is available within many command consoles (terminals), IPython, Jupyter Notebooks, IDEs, and code editors. We can simply start typing a word, hit the Tab key, and the word can be autocompleted, or possible completions can be suggested.

A similar trick is using the up arrow in a terminal and/or IPython session. This will cycle through your recent commands, so you don't need to re-type the same exact thing more than once.

Another useful trick is using the Ctrl key (command or option keys on Mac) on your keyboard to navigate by word chunks. While holding down the Ctrl key, we can press the left and right arrow keys to move one word at a time. This can also be used with the delete and backspace keys to delete whole words at a time. Related to this is the use of the Ctrl key to select words by chunks, or even entire lines at a time by using the "home" and "end" keys on your keyboard. Combining this with Ctrl + c or Ctrl + x for copy or cut commands allows you to duplicate or move lines of code around quickly.

Also related to the Ctrl + arrows trick is adding brackets and quotes around a chunk of text. For example, if we type a word without quotes in Python, but want to make it a string, we can use the Ctrl key and the left arrow key to select the entire word and then type a quotation mark (either " or '). In most IDEs, text editors, and Jupyter Notebooks, this will add quotations on both sides of the word. We can also quickly add brackets or parentheses around text in this same way.