Week 1 notes on the data science intro course by the University of Michigan

Walker0319

Category: Notes. Tags: big data, python

Article link: https://blog.csdn.net/weixin_57901967/article/details/118026845

1. Python basics for data science

1.1 Functions:

def add_numbers(x, y):
	return x + y

1.2 types and sequences

type(None)

NoneType

type(add_numbers)

function

slice a string:

x = 'This is a string'
x[0] #first character  
x[0:2] #first 2 characters  
x[-1] #last character

split a string:

x.split(' ') #['This', 'is', 'a', 'string']

dictionaries associate keys with values:

x = {'Chris Brooks': 'brooks@umich.edu', 'Bill Gates': 'gates@microsoft.com'}
x['Chris Brooks'] # 'brooks@umich.edu'

iterate over all the keys:

for key in x: 
	print(key) #each key in the dictionary, also x.keys()  
	print(x[key]) #values of each key, also x.values()

unpack a sequence into variables:

fname, lname, email = ('Chris', 'Brooks', 'Brooks@umich.edu')

a convenient method of string formatting:

sales_record = {'price': 3.24, 'num_items': 4, 'person': 'Chris'}  
sales_statement = '{} bought {} item(s) at a price of {} each for a total of {}'
print(sales_statement.format(sales_record['person'], sales_record['num_items'], sales_record['price'], sales_record['num_items']*sales_record['price']))

1.3 Reading and writing CSV files

import csv
# important: IPython magic that displays floats with 2 decimal places (works in a notebook only)
%precision 2
with open('mpg.csv') as csvfile:
    # csv.DictReader reads each row of the csv file as a dictionary: [{k1: v11, k2: v12, ...}, {k1: v21, k2: v22, ...}, ...]
    # The keys come from the first (header) row of the file by default; pass fieldnames=[...] to supply them yourself.
    mpg = list(csv.DictReader(csvfile))
mpg[:3] # list the first three rows
# mpg[0], mpg[1], mpg[2], ..., mpg[233] are dictionaries, total len(mpg) = 234
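A minimal sketch of how DictReader builds those dictionaries, using an in-memory CSV instead of mpg.csv. The two data rows are made up for illustration and are not taken from the real file. The header row supplies the keys, and every value comes back as a string.

```python
import csv
import io

# Hypothetical two-row CSV standing in for mpg.csv
sample = io.StringIO("manufacturer,cyl,hwy\naudi,4,29\nford,8,17\n")

rows = list(csv.DictReader(sample))  # header row -> dictionary keys
print(rows[0]["manufacturer"])       # audi
print(list(rows[0].keys()))          # ['manufacturer', 'cyl', 'hwy']
print(type(rows[0]["hwy"]))          # <class 'str'>: convert with float() before doing math
```

This is why the notes wrap d['hwy'] in float() before summing.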

more conditions to consider:

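hwyavg = sum(float(d['hwy']) for d in mpg) / len(mpg) ## the average hwy mpg
cylinders = set(d['cyl'] for d in mpg) ## unique numbers of cylinders in the data
cyl_avg.sort(key=lambda x: x[0]) ## cyl_avg: a list of (cyl, average mpg) tuples built beforehand (not shown in these notes)

note the lambda here: it makes sort() order the tuples by their first element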
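The notes do not show how cyl_avg is built. One way it could be done is sketched below, assuming rows shaped like the DictReader output. The four rows are made-up sample data, and the variable names rows and values are hypothetical.

```python
# Made-up rows in the same shape csv.DictReader returns (all values are strings)
rows = [
    {"cyl": "8", "cty": "14"},
    {"cyl": "4", "cty": "21"},
    {"cyl": "4", "cty": "19"},
    {"cyl": "8", "cty": "12"},
]

cylinders = set(d["cyl"] for d in rows)

cyl_avg = []
for c in cylinders:
    # city mpg values of all rows with this cylinder count
    values = [float(d["cty"]) for d in rows if d["cyl"] == c]
    cyl_avg.append((c, sum(values) / len(values)))

# lambda x: x[0] is an anonymous function returning the first tuple element,
# so the list ends up sorted by cylinder count
cyl_avg.sort(key=lambda x: x[0])
print(cyl_avg)  # [('4', 20.0), ('8', 13.0)]
```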

1.4 dates and times

import datetime as dt
import time as tm

dt.date(), dt.timedelta(), tm.time()
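A short sketch of how these three pieces fit together. The date used here is arbitrary and chosen only for illustration.

```python
import datetime as dt
import time as tm

now_ts = tm.time()                       # seconds since the epoch (1970-01-01 UTC), as a float
now = dt.datetime.fromtimestamp(now_ts)  # the same instant as a datetime object

day = dt.date(2021, 6, 18)               # an arbitrary example date
delta = dt.timedelta(days=100)           # a duration of 100 days
print(day - delta)                       # 2021-03-10
print(day > day - delta)                 # True: dates can be compared directly
```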

1.5 objects and map()

define a class

class Person:
    depart = 'School of Info' # class variable, shared by all instances
    def set_name(self, new_name):
        self.name = new_name
    def set_loca(self, new_loca):
        self.loca = new_loca

use a class

person = Person()
person.set_name('Chris Brooks') #no 'self' parameter here
person.set_loca('Xingangxi Road')
print('{} lives in {} and works in the depart {}'.format(person.name, person.loca, person.depart))

map()

store1 = [19, 23, 42, 12]
store2 = [3, 34, 22, 45]
cheaper = map(min, store1, store2) ## a lazy map object; use list(cheaper) to see the values
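A small sketch of what map() does with the two price lists: it applies min() to each pair of items, but only when the result is consumed.

```python
store1 = [19, 23, 42, 12]
store2 = [3, 34, 22, 45]

cheapest = map(min, store1, store2)  # lazy: nothing is computed yet
result = list(cheapest)              # consuming the iterator runs min() pairwise
print(result)                        # [3, 23, 22, 12]
```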

1.6 lambda and list comprehensions

lambda

my_func = lambda a, b, c : a + b ## learn more about lambda

list comprehension

my_list_comp = [numb for numb in range(0, 1000) if numb %2 == 0]
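To make the two shortcuts concrete, here is a sketch that writes each one out the long way. The names my_func_def and evens_loop are introduced only for this comparison.

```python
# A lambda is a one-expression function; this def is equivalent to
# my_func = lambda a, b, c: a + b
def my_func_def(a, b, c):
    return a + b

my_func = lambda a, b, c: a + b
print(my_func(1, 2, 3) == my_func_def(1, 2, 3))  # True (c is accepted but unused)

# The list comprehension written as an explicit loop
evens_loop = []
for numb in range(0, 1000):
    if numb % 2 == 0:
        evens_loop.append(numb)

evens_comp = [numb for numb in range(0, 1000) if numb % 2 == 0]
print(evens_loop == evens_comp, len(evens_comp))  # True 500
```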

2. Numpy

NumPy is the foundation that pandas is built on.

import numpy as np
import math

2.1 Array Creation

a = np.array([[1,2,3],[4,5,6]])
## array([[1, 2, 3],
##        [4, 5, 6]])
a.shape ## (2, 3)
a.ndim ## 2
a.dtype ## an integer type, e.g. int64 (platform dependent)

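other ways to create arrays

b = np.random.rand(3,4) ## 3x4 array of random floats in [0, 1)
c = np.zeros((2,3))
e = np.ones((7,8)) ## there is no np.twos(); use np.full(shape, 2) for other constants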

create an array of every even number from ten (inclusive) to fifty (exclusive)

f = np.arange(10, 50, 2) ## array([10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48])

linspace() to generate a sequence of floats

np.linspace( 0, 2, 15 ) # 15 numbers from 0 (inclusive) to 2 (inclusive), array([0.        , 0.14285714, ..., 1.85714286, 2. ])

Array Operations
create a couple of arrays and compute

a = np.array([10,20,30,40])
b = np.array([1, 2, 3,4])
c = a-b
d = a*b

Note: for the matrix product, use the @ operator or the dot() function; * multiplies element by element.
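A quick sketch of the difference between the element-wise product and the matrix product, on two small arrays chosen for illustration:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

elementwise = A * B   # multiplies matching entries
matrix = A @ B        # matrix product, same as np.dot(A, B) for 2D arrays

print(elementwise.tolist())  # [[5, 12], [21, 32]]
print(matrix.tolist())       # [[19, 22], [43, 50]]
```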

NumPy arrays have many useful aggregation methods, such as sum(), max(), min(), and mean(). Here array3 stands for any numeric array defined earlier, for example the array a above.
print(array3.sum())
print(array3.max())
print(array3.min())
print(array3.mean())

b = np.arange(1,16,1).reshape(3,5)
A 2D array is made up of rows and columns, but it can also be regarded as one long ordered list of numbers; the shape of the array is just an abstraction on top of that list.
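The "shape is just an abstraction" idea can be checked directly. In this sketch the reshaped array is a view on the same underlying numbers, so writing through one changes the other.

```python
import numpy as np

flat = np.arange(1, 16, 1)   # 15 numbers in one ordered list
b = flat.reshape(3, 5)       # the same numbers viewed as 3 rows x 5 columns

print(b[1, 0])               # 6: row 1 starts after the first 5 numbers
b[0, 0] = 100                # writing through the 2D view...
print(flat[0])               # 100: ...changes the underlying 1D data too
print(np.shares_memory(flat, b))  # True
```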

an example to see how numpy comes into play.
use the python imaging library (PIL) and a function to display images

from PIL import Image
from IPython.display import display
im = Image.open('../chris.tiff')	
display(im)
array=np.array(im)

array is a 200x200 array, and its values are all uint8.
For black and white images, black is stored as 0 and white is stored as 255.
invert the image:

mask=np.full(array.shape,255)
modified_array = mask - array
modified_array = modified_array.astype(np.uint8)

display this array with reshape

reshaped = np.reshape(modified_array,(100,400))

NumPy arrays are abstractions on top of data, and that data has an underlying format (in this case, uint8). In some ways, this whole degree is about data and its abstractions. A data scientist's job is to understand what the data means.
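The same inversion and reshape can be tried without an image file. This sketch uses a tiny made-up 2x3 array of uint8 "pixels" in place of chris.tiff.

```python
import numpy as np

# A tiny made-up grayscale "image": 0 is black, 255 is white
img = np.array([[0, 64, 128], [192, 255, 10]], dtype=np.uint8)

inverted = 255 - img                    # dark pixels become light and vice versa
print(inverted.dtype)                   # uint8
print(inverted.tolist())                # [[255, 191, 127], [63, 0, 245]]

# Same pixels, different shape: only the layout changes
print(inverted.reshape(3, 2).tolist())  # [[255, 191], [127, 63], [0, 245]]
```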

2.2 Indexing, Slicing and Iterating

Indexing

a = np.array([[1,2], [7, 8], [5, 6]])
a[1,1] # 8, remember in python we start at 0!

Boolean Indexing

print(a >5)

[[False False]
[ True  True]
[False  True]]
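The boolean array is most useful as a mask. A short sketch, using the same array as above:

```python
import numpy as np

a = np.array([[1, 2], [7, 8], [5, 6]])
mask = a > 5

print(a[mask].tolist())  # [7, 8, 6]: selected elements come back as a 1D array
print(int(mask.sum()))   # 3: True counts as 1
a[mask] = 0              # a mask can also be used for assignment
print(a.tolist())        # [[1, 2], [0, 0], [5, 0]]
```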

Slicing

a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
a[:2, 1:3] ## [[2,3], [6,7]]

use genfromtxt() to load a dataset

wines = np.genfromtxt("wine.csv", dtype=None, delimiter=";", skip_header=1)

a dataset on admission chances

graduate_admission = np.genfromtxt('Admission_Predict.csv', dtype=None, delimiter=',', skip_header=1,
      names=('Serial No','GRE Score', 'TOEFL Score', 'University Rating', 'SOP',
      'LOR','CGPA','Research', 'Chance of Admit')) ## have numpy try and infer the type of a column by setting the dtype parameter to None

#Notice that the resulting array is actually a one-dimensional array with 400 tuples
graduate_admission['CGPA'][0:5] ## retrieve a column using column's name

len(graduate_admission[graduate_admission['Research'] == 1]) ## find how many students have research experience

Let’s see if students with high chance of admission (>0.8) on average have higher GRE score than those with lower chance of admission (<0.4)

graduate_admission[graduate_admission['Chance_of_Admit'] > 0.8]['GRE_Score'].mean()
graduate_admission[graduate_admission['Chance_of_Admit'] < 0.4]['GRE_Score'].mean()

When we do the boolean masking (graduate_admission['Chance_of_Admit'] > 0.8), we are left with an array that still has tuples in it, and numpy keeps a list of the columns with their names and indexes.
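A self-contained sketch of the same masking on a structured array. The four records are made up for illustration and do not come from Admission_Predict.csv; only three of the columns are reproduced.

```python
import numpy as np

# Made-up records; values are illustrative only
students = np.array(
    [(320, 1, 0.85), (300, 0, 0.35), (335, 1, 0.92), (295, 0, 0.30)],
    dtype=[("GRE_Score", "i4"), ("Research", "i4"), ("Chance_of_Admit", "f8")],
)

high = students[students["Chance_of_Admit"] > 0.8]  # still a structured array
print(high.dtype.names)                  # ('GRE_Score', 'Research', 'Chance_of_Admit')
print(float(high["GRE_Score"].mean()))   # 327.5
print(len(students[students["Research"] == 1]))  # 2
```

Because the masked result keeps its field names, a column can be selected by name right after filtering.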

3. regex

3.1 basics

pattern matching in strings using regular expressions, or regexes
3 main reasons:

check whether a pattern exists
get all instances of a complex pattern
clean source data using a pattern through splitting.

Regexes are a foundational technique for data cleaning in data science applications, and a solid understanding of regexes will help you quickly and efficiently manipulate text data for further data science application.

In order to best learn regexes you need to write regexes.

import re
# match() checks the beginning 
# search() checks anywhere
# findall() and split() parse the string 
text = "Amy works diligently. Amy gets good grades. Our student Amy is succesful.another Amy"
re.split("Amy", text)

['',
' works diligently. ',
' gets good grades. Our student ',
' is succesful.another ',
'']

# notice: the result starts and ends with an empty string '', because 'Amy' is both the first and the last thing in the text.

re.findall("Amy", text)

['Amy', 'Amy', 'Amy', 'Amy']

# Anchors: the caret ^ means start of the string, the dollar sign $ means end. Put ^ before the pattern and $ after it.
text = "Amy works diligently. gets good grades. Our student Amy is succesful."
textfind = re.search("^Amy",text) ## textfind[0]='Amy'
textfind = re.search("is succesful.$",text) ## textfind[0]='is succesful.'; the unescaped . matches any character
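A sketch contrasting match(), search() and the two anchors on one short string. The sentence is adapted from the example text, with the spelling corrected.

```python
import re

text = "Amy works diligently. Our student Amy is successful."

print(bool(re.match("Amy", text)))       # True: match() only looks at the start
print(bool(re.match("works", text)))     # False: "works" is not at the start
print(bool(re.search("works", text)))    # True: search() scans the whole string
print(bool(re.search("^works", text)))   # False: ^ pins search() to the start
print(bool(re.search(r"successful\.$", text)))  # True: \. is a literal dot, $ is the end
```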

3.2 Patterns and Character Classes

grades="ACAAAABCBCBAA"
re.findall("B",grades) # how many 'B's in the list

['B', 'B', 'B']

re.findall("[AB]",grades) # set operator []

['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A']

re.findall("[A][B-C]",grades)

['AC', 'AB']

# the same as follow:
re.findall("AB|AC",grades)

re.findall("^[^A]",grades) #first ^: begin with, second ^: not

3.3 Quantifiers

# Quantifiers are the number of times a pattern to be matched.
re.findall("A{2,3}",grades) # 2 as min, 3 as max

['AAA', 'AA']

re.findall("A{1,1}A{1,2}",grades)

['AAA', 'AA']

Don’t deviate from the {m,n} pattern. In particular, if you have an extra space in between the braces you’ll get an empty result

re.findall("A{2, 2}",grades) ## an extra space between the braces will get an empty result

[]

re.findall("AA",grades) ## default {1,1} without quantifier

['AA', 'AA', 'AA']

re.findall("A{2}",grades) ## one number is considered to be both m and n

['AA', 'AA', 'AA']

# Using this, we could find a decreasing trend in a student's grades
re.findall("A{1,10}B{1,10}C{1,10}",grades)

['AAAABC']

# other quantifiers
# an asterisk * to match 0 or more times, 
# a question mark ? to match zero or one time, 
# a plus sign + to match one or more times. 
with open("datasets/ferpa.txt","r") as file:
    wiki=file.read()

re.findall("[a-zA-Z]{1,100}\[edit\]",wiki)

['Overview[edit]', 'records[edit]', 'records[edit]']
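The three shorthand quantifiers are easiest to compare side by side. A small sketch on a made-up test string:

```python
import re

s = "b ab aab"

star = re.findall("a*b", s)      # zero or more a's before b
plus = re.findall("a+b", s)      # at least one a before b
optional = re.findall("a?b", s)  # at most one a before b

print(star)      # ['b', 'ab', 'aab']
print(plus)      # ['ab', 'aab']
print(optional)  # ['b', 'ab', 'ab']
```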

# Let's iteratively improve this. First, use \w to match any word character: letters, digits, and the underscore.
re.findall("[\w]{1,100}\[edit\]",wiki)

['Overview[edit]', 'records[edit]', 'records[edit]']

# metacharacter: \w is like [a-zA-Z0-9_], so it includes digits and the underscore
# \s matches any whitespace character.
re.findall("[\w]*\[edit\]",wiki)

['Overview[edit]', 'records[edit]', 'records[edit]']

re.findall('[\w ]*\[edit\]',wiki) ## add a space character

['Overview[edit]',
 'Access to public records[edit]',
 'Student medical records[edit]']

for title in re.findall("[\w ]*\[edit\]",wiki):
    print(re.split("[\[]",title)[0])

Overview
Access to public records
Student medical records

3.4 groups

re.findall("([\w ]*)(\[edit\])",wiki)

[('Overview', '[edit]'),
 ('Access to public records', '[edit]'),
 ('Student medical records', '[edit]')]

# findall() returns strings, and search() and match() return individual Match objects. But if we want a list of Match objects, use finditer()
for item in re.finditer("([\w ]*)(\[edit\])",wiki):
    print(item.groups())

('Overview', '[edit]')
('Access to public records', '[edit]')
('Student medical records', '[edit]')

for item in re.finditer("([\w ]*)(\[edit\])",wiki):
    print(item.group(1)) ## group(0) is the whole match, group(2) is '[edit]'

Overview
Access to public records
Student medical records

# One more feature of regex groups: labeling or naming, with the (?P<name>...) syntax.
for item in re.finditer("(?P<title>[\w ]*)(?P<edit_link>\[edit\])",wiki):
    print(item.groupdict()['title'])

Overview
Access to public records
Student medical records

3.5 Look-ahead and Look-behind

# look ahead with ?= syntax
for item in re.finditer("(?P<title>[\w ]+)(?=\[edit\])",wiki):
    print(item.groupdict())

{'title': 'Overview'}
{'title': 'Access to public records'}
{'title': 'Student medical records'}
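The section title also mentions look-behind, which the notes do not demonstrate. A minimal sketch of both assertions on a made-up line (the "$42" part is invented purely to have something to look behind):

```python
import re

line = "Overview[edit] price: $42"

# Look-ahead (?=...): match a title only if "[edit]" follows, without consuming it
titles = re.findall(r"[\w ]+(?=\[edit\])", line)
print(titles)   # ['Overview']

# Look-behind (?<=...): match digits only if a "$" comes right before them
amounts = re.findall(r"(?<=\$)\d+", line)
print(amounts)  # ['42']
```

In both cases the asserted text ("[edit]" or "$") has to be present but is left out of the returned match.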

3.6 Examples

with open("datasets/buddhist.txt","r") as file:
    wiki=file.read()

pattern="""
(?P<title>.*)        #the university title
(–\ located\ in\ )   #an indicator of the location
(?P<city>\w*)        #city the university is in
(,\ )                #separator for the state
(?P<state>\w*)       #the state the city is located in"""
for item in re.finditer(pattern,wiki,re.VERBOSE):
    print(item.groupdict())

{'title': 'Dhammakaya Open University ', 'city': 'Azusa', 'state': 'California'}
{'title': 'Dharmakirti College ', 'city': 'Tucson', 'state': 'Arizona'}

…
another example

with open("datasets/nytimeshealth.txt","r") as file:
    health=file.read()

pattern = '#[\w\d]*(?=\s)'
re.findall(pattern, health) ## omitting \d gives the same result, because \w already matches digits

4. regex cheatsheet

source: tutorialspoint

4.1 functions

findall() Returns a list containing all matches
search() Returns a Match object if there is a match anywhere in the string
split() Returns a list where the string has been split at each match
sub() Replaces one or many matches with a string
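sub() is the only one of these functions not used elsewhere in the notes, so here is a short sketch of it next to split() and search(). The phone-number text is made up.

```python
import re

text = "Call 555-1234 or 555-9876"

masked = re.sub(r"\d{3}-\d{4}", "XXX-XXXX", text)  # replace every match
parts = re.split(r"\s+or\s+", text)                # split at each match
first = re.search(r"\d{3}", text).group()          # first match anywhere

print(masked)  # Call XXX-XXXX or XXX-XXXX
print(parts)   # ['Call 555-1234', '555-9876']
print(first)   # 555
```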

4.2 metacharacters

Characters	Description	Example
[]	A set of characters	"[a-m]"
\	Signals a special sequence, also used to escape special characters	"\d"
.	Any character except the newline character	"he..o"
^	Starts with	"^Hello"
$	Ends with	"World$"
*	Zero or more occurrences	"aix*"
+	One or more occurrences	"aix+"
{}	Exactly the specified number of occurrences	"al{2}"
|	Either or	"short|long"
()	Capture and group

4.3 special sequences

Characters	Description	Example
\A	Returns a match if the specified characters are at the beginning of the string	"\APyt"
\b	Returns a match if the specified characters are at the start or at the end of a word	r"\bPython" r"world\b"
\B	Returns a match if the specified characters are present, but NOT at the start (or at the end) of a word	r"\BPython" r"World\B"
\d	Returns a match if the string contains digits	"\d"
\D	Returns a match if the string contains a character that is NOT a digit	"\D"
\s	Returns a match where the string contains a whitespace character	"\s"
\S	Returns a match where the string contains a character that is NOT whitespace	"\S"
\w	Returns a match if the string contains any word characters (letters from a to z and A to Z, digits from 0 to 9, and the underscore _ character)	"\w"
\W	Returns a match where the string contains a character that is NOT a word character	"\W"
\Z	Returns a match if the specified characters are at the end of the string	"world\Z"

4.4 sets

Set	Description
[raj]	Returns a match if one of the specified characters (r, a or j) is present
[a-r]	Returns a match for any lower case letter, alphabetically between a and r
[^raj]	Returns a match for any character except r, a and j
[0123]	Returns a match where any of the specified digits (0, 1, 2 or 3) is present
[0-9]	Returns a match for any digit between 0 and 9
[0-3][0-8]	Returns a match for any two-digit number whose first digit is 0 to 3 and whose second digit is 0 to 8 (from 00 to 38, but not numbers ending in 9)
[a-zA-Z]	Returns a match for any character alphabetically between a and z or A and Z
[+]	Returns a match for any + character in the string (inside a set, + has no special meaning)

5. assignments

list all of the names

simple_string = """Amy is 5 years old, and her sister Mary is 2 years old. Ruth and Peter,
   their parents, have 3 kids."""
re.findall('[A-Z]\w+',simple_string)

list those students who received a B in the course.

with open ("assets/grades.txt", "r") as file:
    grades = file.read()
re.findall("([\w ]+)(?=\:\sB)",grades)

convert log file to list of dictionaries

with open("assets/logdata.txt", "r") as file:
    logdata = file.read()
ftemp = re.finditer('(?P<host>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})(\s\-\s)(?P<user_name>\w+)(\s\[)(?P<time>[\w\d\s\/\:\-]*)(\]\s\")(?P<request>[\w\/\s\.]+)(\"[\d\s]+\\n)',logdata)  
for item in ftemp:
    print(item.groupdict())
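A more readable variant of the same idea, written with re.VERBOSE and collected into a list of dictionaries. This is a simplified sketch, not the pattern above. The sample line is made up to resemble a web server log, and the character classes are looser than in the original pattern.

```python
import re

# A made-up line in the same general shape as the log file
sample = '10.0.0.1 - alice [21/Jun/2019:15:45:24 -0700] "GET /index.html HTTP/1.1" 200 512\n'

pattern = r"""
(?P<host>\d{1,3}(?:\.\d{1,3}){3})   # IPv4-style host
\s-\s
(?P<user_name>\S+)                  # user name: any run of non-space characters
\s\[
(?P<time>[^\]]+)                    # everything up to the closing bracket
\]\s"
(?P<request>[^"]+)                  # everything inside the quotes
"
"""

logs = [m.groupdict() for m in re.finditer(pattern, sample, re.VERBOSE)]
print(logs[0]["host"])       # 10.0.0.1
print(logs[0]["request"])    # GET /index.html HTTP/1.1
```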
