Week 1 notes for the data science intro course from the University of Michigan

1. Python basics for data science

1.1 Functions:

def add_numbers(x, y):
	return x + y

1.2 types and sequences

type(None)

NoneType

type(add_numbers)  

function

slice a string:

x = 'This is a string'
x[0] #first character  
x[0:2] #first 2 characters  
x[-1] #last character

split a string:

x.split(' ') #['This', 'is', 'a', 'string']

dictionaries associate keys with values:

x = {'Chris Brooks': 'brooks@umich.edu', 'Bill Gates': 'gates@microsoft.com'}
x['Chris Brooks'] # 'brooks@umich.edu'

iterate over all the keys:

for key in x: 
	print(key) #each key in the dictionary, also x.keys()  
	print(x[key]) #values of each key, also x.values()

unpack a sequence into variables:

fname, lname, email = ('Chris', 'Brooks', 'Brooks@umich.edu')

a convenient method of string formatting:

sales_record = {'price': 3.24, 'num_items': 4, 'person': 'Chris'}  
sales_statement = '{} bought {} item(s) at a price of {} each for a total of {}'
print(sales_statement.format(sales_record['person'], sales_record['num_items'], sales_record['price'], sales_record['num_items']*sales_record['price']))

1.3 Reading and writing CSV files

import csv
%precision 2 # IPython magic: display floats with 2 decimal places
with open('mpg.csv') as csvfile:
    mpg = list(csv.DictReader(csvfile)) # csv.DictReader reads each row of the csv file as a dictionary: [{k1:v11, k2:v12, ...}, {k1:v21, k2:v22, ...}, ...]
# The keys appear only once in mpg.csv, in the header row. By default DictReader uses that first row as the keys of every dictionary; the fieldnames parameter can supply them instead.
mpg[:3] # list the first three rows.
# mpg[0], mpg[1], mpg[2], ... mpg[233] are dictionaries, total len(mpg)=234.
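To see how DictReader turns a single header row into the keys of every dictionary, here is a minimal sketch. The CSV content is made up for illustration and only stands in for mpg.csv; io.StringIO is used so that no file is needed.

```python
import csv
import io

# Hypothetical two-row CSV standing in for mpg.csv (the values are made up).
sample_csv = "manufacturer,cyl,hwy\naudi,4,29\nford,8,17\n"

# io.StringIO lets csv.DictReader read from a string as if it were a file.
rows = list(csv.DictReader(io.StringIO(sample_csv)))

print(rows[0])               # one dictionary per data row, keyed by the header row
print(list(rows[0].keys()))  # column names taken from the first line
print(len(rows))             # the header line is not counted as a row
```

Note that every value comes back as a string, which is why the notes wrap values in float() before averaging.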

more conditions to consider:

hwyavg = sum(float(d['hwy']) for d in mpg) / len(mpg) ## average highway mpg
cylinders = set(d['cyl'] for d in mpg) ## unique cylinder counts in the data
cyl_avg.sort(key=lambda x: x[0]) ## cyl_avg is assumed to be a list of (cyl, average mpg) tuples built beforehand; sort it by cyl

notice lambda here
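The notes do not show how cyl_avg is built, so the following is only a sketch of one way to do it: group the rows by cylinder count, average the city mpg of each group, then sort with the lambda. The rows and the 'cty' column are hypothetical stand-ins for the real data.

```python
# Hypothetical rows in the shape csv.DictReader produces (all values are strings).
mpg = [
    {'cyl': '8', 'cty': '14'},
    {'cyl': '4', 'cty': '21'},
    {'cyl': '4', 'cty': '23'},
    {'cyl': '8', 'cty': '16'},
]

cylinders = set(d['cyl'] for d in mpg)

cyl_avg = []
for c in cylinders:
    # Collect the city mpg of every car with this cylinder count.
    cty_values = [float(d['cty']) for d in mpg if d['cyl'] == c]
    cyl_avg.append((c, sum(cty_values) / len(cty_values)))

# The lambda picks the first element of each tuple (the cylinder count) as the sort key.
cyl_avg.sort(key=lambda x: x[0])
print(cyl_avg)
```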

1.4 dates and times

import datetime as dt
import time as tm

commonly used: dt.date(year, month, day), dt.timedelta(days=...), tm.time()
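A short sketch of how these pieces might be combined. The specific date is a hypothetical value chosen only for illustration.

```python
import datetime as dt
import time as tm

# tm.time() returns the seconds since the epoch (1 January 1970, UTC) as a float.
now_seconds = tm.time()

# Convert the timestamp to a datetime object and read its fields.
now = dt.datetime.fromtimestamp(now_seconds)
print(now.year, now.month, now.day)

# A timedelta is a duration; subtracting it from a date gives another date.
delta = dt.timedelta(days=100)
start = dt.date(2024, 3, 10)  # hypothetical date
earlier = start - delta
print(earlier)
```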

1.5 objects and map()

define a class

class Person:
    depart = 'School of Info'
    def set_name(self, new_name):
        self.name = new_name
    def set_loca(self, new_loca):
        self.loca = new_loca

use a class

person = Person()
person.set_name('Chris Brooks') #no 'self' parameter here
person.set_loca('Xingangxi Road')
print('{} lives in {} and works in the department {}'.format(person.name, person.loca, person.depart))

map()

store1 = [19, 23, 42, 12]
store2 = [3, 34, 22, 45]
cheaper = map(min, store1, store2)
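map() returns a lazy iterator, so printing cheaper directly only shows a map object. A small sketch of how to get at the values:

```python
store1 = [19, 23, 42, 12]
store2 = [3, 34, 22, 45]

# map() applies min to the pairs (19, 3), (23, 34), (42, 22), (12, 45).
cheaper = map(min, store1, store2)

# Nothing is computed until the iterator is consumed, for example by list().
cheapest_prices = list(cheaper)
print(cheapest_prices)  # [3, 23, 22, 12]
```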

1.6 lambda and list comprehensions

lambda

my_func = lambda a, b, c : a + b ## learn more about lambda

list comprehension

my_list_comp = [numb for numb in range(0, 1000) if numb %2 == 0]
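For comparison, the comprehension above should be equivalent to an explicit loop that appends to a list. A minimal sketch:

```python
# Loop version: collect the even numbers below 1000.
my_list = []
for number in range(0, 1000):
    if number % 2 == 0:
        my_list.append(number)

# Equivalent list comprehension, written on one line.
my_list_comp = [number for number in range(0, 1000) if number % 2 == 0]

print(my_list == my_list_comp)
print(len(my_list_comp))
```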

2. Numpy

NumPy is the foundation that pandas is built on.

import numpy as np
import math

2.1 Array Creation

a = np.array([[1,2,3],[4,5,6]])
array([[1, 2, 3],
       [4, 5, 6]])
a.shape ## (2,3)
a.ndim ## 2
a.dtype ## int

other ways to create arrays

b = np.random.rand(3,4)
c = np.zeros((2,3))
e = np.ones((7,8)) ## there is no np.twos(); np.full() fills an array with any other constant

create an array of every even number from ten (inclusive) to fifty (exclusive)

f = np.arange(10, 50, 2) ## array([10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48])

linspace() to generate a sequence of floats

np.linspace( 0, 2, 15 ) # 15 numbers from 0 (inclusive) to 2 (inclusive), array([0.        , 0.14285714, ..., 1.85714286, 2. ])

Array Operations
create a couple of arrays and compute

a = np.array([10,20,30,40])
b = np.array([1, 2, 3,4])
c = a-b
d = a*b

*** if we want to do matrix product, we use the “@” sign or use the dot function***
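A small sketch of the difference between the elementwise product and the matrix product. The two 2x2 matrices are arbitrary values picked for illustration.

```python
import numpy as np

A = np.array([[1, 1], [0, 1]])
B = np.array([[2, 0], [3, 4]])

elementwise = A * B      # multiplies matching entries
matrix_product = A @ B   # rows of A combined with columns of B
same_as_dot = A.dot(B)   # equivalent to A @ B for 2D arrays

print(elementwise)
print(matrix_product)
```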

NumPy arrays have many useful aggregation methods, such as sum(), max(), min(), and mean() (array3 here stands for any numeric array):
print(array3.sum())
print(array3.max())
print(array3.min())
print(array3.mean())

b = np.arange(1,16,1).reshape(3,5)
A 2D array is made up of rows and columns, but it can also be regarded as one long ordered list of numbers; the shape of the array is just an abstraction over that list.
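The following sketch illustrates the "shape is just an abstraction" idea: the reshaped array and the flat array hold the same numbers, and here they even share the same memory.

```python
import numpy as np

flat = np.arange(1, 16, 1)   # the numbers 1..15 as one long list
b = flat.reshape(3, 5)       # the same numbers viewed as 3 rows of 5

print(b.shape)
print(b[1, 0])               # first element of the second row
print(b.ravel())             # flattening gives back the original order

# For a contiguous array like this one, reshape returns a view, not a copy.
print(np.shares_memory(flat, b))
```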

an example to see how numpy comes into play.
use the python imaging library (PIL) and a function to display images

from PIL import Image
from IPython.display import display
im = Image.open('../chris.tiff')	
display(im)
array=np.array(im)

array is a 200x200 array and its values are all uint8.
For black and white images, black is stored as 0 and white is stored as 255.
invert image:

mask=np.full(array.shape,255)
modified_array = mask - array
modified_array = modified_array.astype(np.uint8)
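The same inversion can be tried without an image file. This sketch uses a tiny made-up uint8 array in place of the picture, so the effect on each pixel value is easy to read.

```python
import numpy as np

# Hypothetical 2x3 grayscale "image": 0 is black, 255 is white.
pixels = np.array([[0, 64, 128], [192, 255, 10]], dtype=np.uint8)

# np.full uses the default integer type here, which is wider than uint8.
mask = np.full(pixels.shape, 255)

# Subtract, then convert back to uint8 so the result is valid image data again.
inverted = (mask - pixels).astype(np.uint8)
print(inverted)
print(inverted.dtype)
```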

display this array with reshape

reshaped = np.reshape(modified_array,(100,400))

NumPy arrays are abstractions on top of data, and that data has an underlying format (in this case, uint8). In some ways, this whole degree is about data and the abstractions we build on top of it. A data scientist's job is to understand what the data means.

2.2 Indexing, Slicing and Iterating

  • Indexing
a = np.array([[1,2], [7, 8], [5, 6]])
a[1,1] # 8, remember in python we start at 0!
  • Boolean Indexing
print(a >5)
[[False False]
[ True  True]
[False  True]]
  • Slicing
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
a[:2, 1:3] ## [[2,3], [6,7]]
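The boolean array produced by a comparison can itself be used as an index, which is what makes boolean indexing useful for filtering. A minimal sketch with the same small array:

```python
import numpy as np

a = np.array([[1, 2], [7, 8], [5, 6]])

mask = a > 5          # boolean array with the same shape as a
selected = a[mask]    # 1D array of the values where mask is True
count = mask.sum()    # True counts as 1, so this counts the matches

print(selected)
print(count)
```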

use genfromtxt() to load a dataset

wines = np.genfromtxt("wine.csv", dtype=None, delimiter=";", skip_header=1)

a dataset on admission chances

graduate_admission = np.genfromtxt('Admission_Predict.csv', dtype=None, delimiter=',', skip_header=1,
      names=('Serial_No','GRE_Score', 'TOEFL_Score', 'University_Rating', 'SOP',
      'LOR','CGPA','Research', 'Chance_of_Admit')) ## setting dtype=None makes numpy infer the type of each column

#Notice that the resulting array is actually a one-dimensional array with 400 tuples
graduate_admission['CGPA'][0:5] ## retrieve a column using column's name

len(graduate_admission[graduate_admission['Research'] == 1]) ## find how many students have research experience

Let’s see if students with high chance of admission (>0.8) on average have higher GRE score than those with lower chance of admission (<0.4)

graduate_admission[graduate_admission['Chance_of_Admit'] > 0.8]['GRE_Score'].mean()
graduate_admission[graduate_admission['Chance_of_Admit'] < 0.4]['GRE_Score'].mean()

When we do the boolean masking (graduate_admission['Chance_of_Admit'] > 0.8), we are still left with an array of tuples, and numpy keeps track of the column names and their positions, so a column can still be selected by name afterwards.
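To experiment with this without the CSV file, a structured array can be built by hand. The records below are invented for illustration and only mimic three of the columns.

```python
import numpy as np

# Hypothetical records; the real file has more columns and 400 rows.
admissions = np.array(
    [(337, 1, 0.92), (300, 0, 0.36), (325, 1, 0.85), (295, 0, 0.38)],
    dtype=[('GRE_Score', 'i4'), ('Research', 'i4'), ('Chance_of_Admit', 'f8')],
)

# Boolean masking keeps whole records, so the column names survive.
high = admissions[admissions['Chance_of_Admit'] > 0.8]
low = admissions[admissions['Chance_of_Admit'] < 0.4]

print(high.dtype.names)
print(high['GRE_Score'].mean())
print(low['GRE_Score'].mean())
```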

3. regex

3.1 basics

pattern matching in strings using regular expressions, or regexes
3 main reasons:

  • check whether a pattern exists
  • get all instances of a complex pattern
  • clean source data using a pattern through splitting.

Regexes are a foundational technique for data cleaning in data science applications, and a solid understanding of regexes will help you quickly and efficiently manipulate text data for further analysis.

In order to best learn regexes you need to write regexes.

import re
# match() checks the beginning 
# search() checks anywhere
# findall() and split() parse the string 
text = "Amy works diligently. Amy gets good grades. Our student Amy is succesful.another Amy"
re.split("Amy", text)
['',
' works diligently. ',
' gets good grades. Our student ',
' is succesful.another ',
'']

# notice: the result starts and ends with an empty string '', because 'Amy' is both the first and the last word of the text.

re.findall("Amy", text)
['Amy', 'Amy', 'Amy', 'Amy']
# Anchors: the caret ^ means start of the string, the dollar sign $ means end. Put ^ before the pattern and $ after it.
text = "Amy works diligently. gets good grades. Our student Amy is succesful."
textfind = re.search("^Amy",text) ## textfind[0]='Amy'
textfind = re.search("is succesful.$",text) ## textfind[0]='is succesful.'
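A short sketch contrasting match(), search(), and the two anchors. The sentence is a made-up example.

```python
import re

text = "Our student Amy gets good grades."

at_start = re.match("Amy", text)        # None: the text does not begin with "Amy"
anywhere = re.search("Amy", text)       # found in the middle of the text
anchored = re.search("^Amy", text)      # None: ^ forces the match to the start
at_end = re.search(r"grades\.$", text)  # found: the text ends with "grades."

print(at_start, anchored)
print(anywhere.span(), at_end.group(0))
```

The raw string r"..." keeps the backslash in \. from being interpreted by Python before the regex engine sees it.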

3.2 Patterns and Character Classes

grades="ACAAAABCBCBAA"
re.findall("B",grades) # how many 'B's in the list
['B', 'B', 'B']
re.findall("[AB]",grades) # set operator []
['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A']
re.findall("[A][B-C]",grades)
['AC', 'AB']
# the following gives the same result:
re.findall("AB|AC",grades)
re.findall("^[^A]",grades) #first ^: begin with, second ^: not

3.3 Quantifiers

# Quantifiers specify how many times a pattern should be matched.
re.findall("A{2,3}",grades) # 2 as min, 3 as max
['AAA', 'AA']
re.findall("A{1,1}A{1,2}",grades)
['AAA', 'AA']

Don’t deviate from the {m,n} pattern. In particular, if you have an extra space in between the braces you’ll get an empty result

re.findall("A{2, 2}",grades) ## an extra space between the braces will get an empty result
[]
re.findall("AA",grades) ## default {1,1} without quantifier
['AA', 'AA', 'AA']
re.findall("A{2}",grades) ## one number is considered to be both m and n
['AA', 'AA', 'AA']
# Using this, we could find a decreasing trend in a student's grades
re.findall("A{1,10}B{1,10}C{1,10}",grades)
['AAAABC']
# other quantifiers
# an asterisk * to match zero or more times,
# a question mark ? to match zero or one time,
# a plus sign + to match one or more times.
with open("datasets/ferpa.txt","r") as file:
    wiki=file.read()
re.findall("[a-zA-Z]{1,100}\[edit\]",wiki)
['Overview[edit]', 'records[edit]', 'records[edit]']
# Let's iteratively improve this. First, use \w to match any letter, digit, or underscore.
re.findall("[\w]{1,100}\[edit\]",wiki)
['Overview[edit]', 'records[edit]', 'records[edit]']
# metacharacter: \w is roughly [a-zA-Z0-9_], so it includes digits and the underscore
# \s matches any whitespace character.
re.findall("[\w]*\[edit\]",wiki)
['Overview[edit]', 'records[edit]', 'records[edit]']
re.findall('[\w ]*\[edit\]',wiki) ## add a space character
['Overview[edit]',
 'Access to public records[edit]',
 'Student medical records[edit]']
for title in re.findall("[\w ]*\[edit\]",wiki):
    print(re.split("[\[]",title)[0])
Overview
Access to public records
Student medical records
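The three shorthand quantifiers mentioned above (*, ? and +) can be compared on one small string. This is a minimal sketch with a made-up input.

```python
import re

s = "A AB ABB"

zero_or_more = re.findall("AB*", s)  # A followed by any number of Bs, including none
zero_or_one = re.findall("AB?", s)   # A followed by at most one B
one_or_more = re.findall("AB+", s)   # A followed by at least one B

print(zero_or_more)  # ['A', 'AB', 'ABB']
print(zero_or_one)   # ['A', 'AB', 'AB']
print(one_or_more)   # ['AB', 'ABB']
```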

3.4 groups

re.findall("([\w ]*)(\[edit\])",wiki)
[('Overview', '[edit]'),
 ('Access to public records', '[edit]'),
 ('Student medical records', '[edit]')]
# findall() returns strings, and search() and match() return individual Match objects. But if we want a list of Match objects, use finditer()
for item in re.finditer("([\w ]*)(\[edit\])",wiki):
    print(item.groups())
('Overview', '[edit]')
('Access to public records', '[edit]')
('Student medical records', '[edit]')
for item in re.finditer("([\w ]*)(\[edit\])",wiki):
    print(item.group(1)) ## group(2) is '[edit]', and group(0) is the whole match
Overview
Access to public records
Student medical records
# One more feature of regex groups: labeling, or naming, them.
for item in re.finditer("(?P<title>[\w ]*)(?P<edit_link>\[edit\])",wiki):
    print(item.groupdict()['title'])
Overview
Access to public records
Student medical records

3.5 Look-ahead and Look-behind

# look ahead with ?= syntax
for item in re.finditer("(?P<title>[\w ]+)(?=\[edit\])",wiki):
    print(item.groupdict())
{'title': 'Overview'}
{'title': 'Access to public records'}
{'title': 'Student medical records'}
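To make the effect of the look-ahead visible, the sketch below runs the pattern with and without it on a small inline string. The string is a hypothetical stand-in for the contents of ferpa.txt.

```python
import re

# Hypothetical stand-in for the text read from the file.
wiki = "Overview[edit]\nAccess to public records[edit]\n"

# Without a look-ahead, "[edit]" is consumed and becomes part of the match.
consumed = [m.group(0) for m in re.finditer(r"[\w ]+\[edit\]", wiki)]

# With (?=...), "[edit]" must follow the title but is not part of the match.
looked_ahead = [m.group(0) for m in re.finditer(r"[\w ]+(?=\[edit\])", wiki)]

print(consumed)
print(looked_ahead)
```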

3.6 Examples

with open("datasets/buddhist.txt","r") as file:
    wiki=file.read()
pattern="""
(?P<title>.*)        #the university title
(–\ located\ in\ )   #an indicator of the location
(?P<city>\w*)        #city the university is in
(,\ )                #separator for the state
(?P<state>\w*)       #the state the city is located in"""
for item in re.finditer(pattern,wiki,re.VERBOSE):
    print(item.groupdict())
{'title': 'Dhammakaya Open University ', 'city': 'Azusa', 'state': 'California'}
{'title': 'Dharmakirti College ', 'city': 'Tucson', 'state': 'Arizona'}


another example

with open("datasets/nytimeshealth.txt","r") as file:
    health=file.read()
pattern = '#[\w\d]*(?=\s)'
re.findall(pattern, health) ## omitting \d gives the same result, because \w already includes digits

4. regex cheatsheet

source: tutorialspoint

4.1 functions

  • findall() Returns a list containing all matches
  • search() Returns a Match object if there is a match anywhere in the string
  • split() Returns a list where the string has been split at each match
  • sub() Replaces one or many matches with a string
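sub() and split() are the two functions in this list that return modified text rather than matches. A brief sketch on a made-up sentence:

```python
import re

text = "Amy works diligently. Amy gets good grades."

# Replace every match, or only the first one by passing count=1.
replaced = re.sub("Amy", "Ruth", text)
first_only = re.sub("Amy", "Ruth", text, count=1)

# Split at each period and any whitespace after it.
pieces = re.split(r"\.\s*", text)

print(replaced)
print(first_only)
print(pieces)
```

The trailing empty string in pieces appears because the text ends with a separator, the same effect as in the re.split("Amy", text) example earlier.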

4.2 metacharacters

Characters  Example       Description
[]          "[a-m]"       A set of characters
\           "\d"          Signals a special sequence, also used to escape special characters
.           "he..o"       Any character except the newline character
^           "^Hello"      Starts with
$           "World$"      Ends with
*           "aix*"        Zero or more occurrences
+           "aix+"        One or more occurrences
{}          "al{2}"       Exactly the specified number of occurrences
|           "short|long"  Either or
()                        Capture and group

4.3 special sequences

Characters  Example                 Description
\A          "\APyt"                 Returns a match if the specified characters are at the beginning of the string
\b          r"\bPython" r"world\b"  Returns a match if the specified characters are at the start or at the end of a word
\B          r"\BPython" r"World\B"  Returns a match if the specified characters are present, but NOT at the start (or at the end) of a word
\d          "\d"                    Returns a match if the string contains digits
\D          "\D"                    Returns a match if the string DOES NOT contain digits
\s          "\s"                    Returns a match where the string contains a white space character
\S          "\S"                    Returns a match where the string DOES NOT contain a white space character
\w          "\w"                    Returns a match if the string contains any word characters (letters from a to Z, digits from 0-9, and the underscore _ character)
\W          "\W"                    Returns a match where the string DOES NOT contain any word characters
\Z          "world\Z"               Returns a match if the specified characters are at the end of the string

4.4 sets

Set         Description
[raj]       Returns a match if one of the specified characters (r, a, or j) is present
[a-r]       Returns a match for any lower case letter, alphabetically between a and r
[^raj]      Returns a match for any character EXCEPT r, a, and j
[0123]      Returns a match where any of the specified digits (0, 1, 2, or 3) are present
[0-9]       Returns a match for any digit between 0 and 9
[0-3][0-8]  Returns a match for any two-digit number whose first digit is 0-3 and whose second digit is 0-8 (00 to 38, excluding numbers that end in 9)
[a-zA-Z]    Returns a match for any character alphabetically between a and z or A and Z
[+]         Returns a match for any + character in the string

5. assignments

  1. list all of the names
simple_string = """Amy is 5 years old, and her sister Mary is 2 years old. Ruth and Peter,
   their parents, have 3 kids."""
re.findall('[A-Z]\w+',simple_string)
  2. list those students who received a B in the course.
with open ("assets/grades.txt", "r") as file:
    grades = file.read()
re.findall("([\w ]+)(?=\:\sB)",grades)
  3. convert log file to list of dictionaries
with open("assets/logdata.txt", "r") as file:
    logdata = file.read()
ftemp = re.finditer('(?P<host>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})(\s\-\s)(?P<user_name>\w+)(\s\[)(?P<time>[\w\d\s\/\:\-]*)(\]\s\")(?P<request>[\w\/\s\.]+)(\"[\d\s]+\\n)',logdata)  
for item in ftemp:
    print(item.groupdict())
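The long pattern above is hard to read on one line. As a sketch of an alternative, the same idea can be written with re.VERBOSE and tried on a single inline log line. The log line and the simplified sub-patterns below are assumptions made for illustration, not the assignment's reference solution.

```python
import re

# Hypothetical log line in the usual web-server format.
logdata = '10.0.0.1 - alice [01/Jan/2020:10:00:00 -0700] "GET /index.html HTTP/1.1" 200 1234\n'

pattern = r'''
(?P<host>\d{1,3}(?:\.\d{1,3}){3})   # four dot-separated numbers
\s-\s                               # separator between host and user name
(?P<user_name>\S+)                  # user name: any run of non-space characters
\s\[
(?P<time>[^\]]+)                    # everything up to the closing bracket
\]\s"
(?P<request>[^"]+)                  # everything up to the closing quote
"
'''

# One dictionary per log line, keyed by the group names.
records = [m.groupdict() for m in re.finditer(pattern, logdata, re.VERBOSE)]
print(records)
```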