100 Must-Practice Pandas Exercises, from Basics to Advanced | Beginner Tutorial 2

zg1g

Published 2024-07-25 08:00:41

Article tags: pandas

Article link: https://blog.csdn.net/daigualu/article/details/140732692

Author: 郭震

16. How to get the positions of items of series A in another series B?

Get the positions of items of ser2 in ser1 as a list.

# input
import numpy as np
import pandas as pd
ser1 = pd.Series([10, 9, 6, 5, 3, 1, 12, 8, 13])
ser2 = pd.Series([1, 3, 10, 13])

# gets the positions, but in ser1's order rather than ser2's order
list(ser1[ser1.isin(ser2)].index)

# using numpy where
[np.where(i == ser1)[0].tolist()[0] for i in ser2]

# using pandas Index and get location
[pd.Index(ser1).get_loc(i) for i in ser2]
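
As a further variation, `Index.get_indexer` can look up all of ser2 in a single call. This is a minimal sketch that assumes the values in ser1 are unique; a value that is missing from ser1 would come back as -1.

```python
import pandas as pd

ser1 = pd.Series([10, 9, 6, 5, 3, 1, 12, 8, 13])
ser2 = pd.Series([1, 3, 10, 13])

# Look up every value of ser2 in ser1 at once (requires unique values in ser1)
positions = pd.Index(ser1).get_indexer(ser2).tolist()
print(positions)  # [5, 4, 0, 8]
```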

17. How to compute the mean squared error on a truth and predicted series?

Compute the mean squared error of truth and pred series.

# input
truth = pd.Series(range(10))
pred = pd.Series(range(10)) + np.random.random(10)

# BAD, don't use it: a slow element-by-element Python loop
np.mean([(truth_i - pred_i) ** 2 for truth_i, pred_i in zip(truth, pred)])

# using numpy
np.mean((truth - pred) ** 2)

# using sklearn metrics
from sklearn.metrics import mean_squared_error
mean_squared_error(truth, pred)
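
Because the input above is random, the result cannot be checked by hand. The sketch below uses small fixed values (chosen here only for illustration) so the arithmetic can be followed step by step.

```python
import pandas as pd

# Illustrative values, not taken from the exercise
truth = pd.Series([0.0, 1.0, 2.0, 3.0])
pred = pd.Series([0.5, 1.0, 2.0, 5.0])

# Squared errors are 0.25, 0, 0 and 4, so the mean is 4.25 / 4
mse = ((truth - pred) ** 2).mean()
print(mse)  # 1.0625
```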

18. How to convert the first character of each element in a series to uppercase?

Change the first character of each word in ser to upper case.

# input
ser = pd.Series(['just', 'a', 'random', 'list'])
ser

# using the Python string method title(); assumes every element is a string
[i.title() for i in ser]

# using lambda
ser.map(lambda x: x.title())

# other solution
ser.map(lambda x: x[0].upper() + x[1:])
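
The solutions above give the same result on this input, but they are not equivalent in general. The sketch below uses a slightly different, made-up series to show where `title()`, `capitalize()` and the manual approach diverge, using the vectorized `.str` accessor where one exists.

```python
import pandas as pd

# Illustrative input with a multi-word element and mixed case
ser = pd.Series(['just a', 'rANDOM', 'list'])

# title() capitalizes every word and lowercases the remaining letters
titled = ser.str.title().tolist()            # ['Just A', 'Random', 'List']

# capitalize() upper-cases only the first character and lowercases the rest
capitalized = ser.str.capitalize().tolist()  # ['Just a', 'Random', 'List']

# upper-casing the first character by hand leaves the rest untouched
manual = ser.map(lambda x: x[0].upper() + x[1:]).tolist()  # ['Just a', 'RANDOM', 'List']
```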

19. How to calculate the number of characters in each word in a series?

# input
ser = pd.Series(['just', 'a', 'random', 'list'])

# using list comprehension
[len(i) for i in ser]

# using series map
ser.map(len)

# using series apply
ser.apply(len)
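
A vectorized alternative is `Series.str.len()`. One practical difference, sketched below with a made-up series that contains a missing value, is that `str.len()` propagates the missing value, whereas `map(len)` would raise a `TypeError` on `None`.

```python
import pandas as pd

# Illustrative input with a missing value at the end
ser = pd.Series(['just', 'a', 'random', None])

# str.len() returns the length of each string and keeps the missing value missing
lengths = ser.str.len()
print(lengths.tolist())
```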

20. How to compute difference of differences between consecutive numbers of a series?

Difference of differences between the consecutive numbers of ser.

# input
ser = pd.Series([1, 3, 6, 10, 15, 21, 27, 35])

# Desired Output
# [nan, 2.0, 3.0, 4.0, 5.0, 6.0, 6.0, 8.0]
# [nan, nan, 1.0, 1.0, 1.0, 1.0, 0.0, 2.0]

# using pandas diff()
ser.diff(periods = 1).tolist()
ser.diff(periods = 1).diff(periods = 1).tolist()
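
If the leading NaN values are not needed, NumPy can compute the second difference directly. This is a small sketch using `np.diff` with `n=2`; unlike `Series.diff`, it returns a shorter array instead of padding with NaN.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 3, 6, 10, 15, 21, 27, 35])

# n=2 applies the first difference twice; the result has len(ser) - 2 elements
second_diff = np.diff(ser.to_numpy(), n=2).tolist()
print(second_diff)  # [1, 1, 1, 1, 0, 2]
```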

21. How to convert a series of date-strings to a timeseries?

# input
ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])


'''
Desired Output

0   2010-01-01 00:00:00
1   2011-02-02 00:00:00
2   2012-03-03 00:00:00
3   2013-04-04 00:00:00
4   2014-05-05 00:00:00
5   2015-06-06 12:20:00
'''

# using pandas to_datetime
# the strings use different formats, so pandas >= 2.0 needs format='mixed'
pd.to_datetime(ser, format='mixed')

# using dateutil parse
from dateutil.parser import parse
ser.map(lambda x: parse(x))

22. How to get the day of month, week number, day of year and day of week from a series of date strings?

Get the day of month, week number, day of year and day of week from ser.

# input
ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])

'''
Desired output

Date:  [1, 2, 3, 4, 5, 6]
Week number:  [53, 5, 9, 14, 19, 23]
Day num of year:  [1, 33, 63, 94, 125, 157]
Day of week:  ['Friday', 'Wednesday', 'Saturday', 'Thursday', 'Monday', 'Saturday']
'''

# parse once; the strings use different formats, so pandas >= 2.0 needs format='mixed'
dates = pd.to_datetime(ser, format='mixed')
# day
dates.dt.day.to_list()
# week (dt.week and dt.weekofyear were removed in pandas 2.0)
dates.dt.isocalendar().week.to_list()
# day of year
dates.dt.dayofyear.to_list()
# day of week in words
week_dict = {0:"Monday", 1:"Tuesday", 2:"Wednesday", 3:"Thursday", 4:"Friday", 5:"Saturday", 6:"Sunday"}
dates.dt.dayofweek.map(week_dict).to_list()
# another method (dt.weekday_name was removed in pandas 1.0)
dates.dt.day_name().to_list()

23. How to convert year-month string to dates corresponding to the 4th day of the month?

Change ser to dates that start with 4th of the respective months.

# input
ser = pd.Series(['Jan 2010', 'Feb 2011', 'Mar 2012'])

'''
Desired Output

0   2010-01-04
1   2011-02-04
2   2012-03-04
dtype: datetime64[ns]

'''

# solution using parser
from dateutil.parser import parse
ser.map(lambda x: parse('04 ' + x))

# another solution

from dateutil.parser import parse
# Parse the date
ser_ts = ser.map(lambda x: parse(x))

# Construct date string with date as 4
ser_datestr = ser_ts.dt.year.astype('str') + '-' + ser_ts.dt.month.astype('str') + '-' + '04'

# Format it (note: this yields a list of strings, not datetime values).
[parse(i).strftime('%Y-%m-%d') for i in ser_datestr]
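
The same result can probably be reached without dateutil. The sketch below assumes every string follows the 'Mon YYYY' pattern with English month abbreviations, parses it with an explicit format (the day then defaults to 1), and shifts by three days.

```python
import pandas as pd

ser = pd.Series(['Jan 2010', 'Feb 2011', 'Mar 2012'])

# '%b %Y' parses the abbreviated month name and the year; the day defaults to 1
first_of_month = pd.to_datetime(ser, format='%b %Y')

# Move from the 1st to the 4th of each month
fourth = first_of_month + pd.Timedelta(days=3)
print(fourth)
```

Unlike the string-formatting solution, this keeps the result as datetime values.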

24. How to filter words that contain at least 2 vowels from a series?

From ser, extract words that contain at least 2 vowels.

# input
ser = pd.Series(['Apple', 'Orange', 'Plan', 'Python', 'Money'])

'''
Desired Output


0     Apple
1    Orange
4     Money
dtype: object
'''

# using nested loops
vowels = list("aeiou")
list_ = []
for w in ser:
    c = 0
    for l in list(w.lower()):
        if l in vowels:
            c += 1
    if c >= 2:
        print(w)
        list_.append(w)

ser[ser.isin(list_)]

# another solution using counter

from collections import Counter
mask = ser.map(lambda x: sum([Counter(x.lower()).get(i, 0) for i in list('aeiou')]) >= 2)
ser[mask]
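
A shorter option is to count regex matches with `Series.str.count`. This sketch treats only a, e, i, o and u as vowels (so the 'y' in 'Python' is not counted) and ignores case through the `flags` argument.

```python
import re
import pandas as pd

ser = pd.Series(['Apple', 'Orange', 'Plan', 'Python', 'Money'])

# Count vowel characters in each word, ignoring case
vowel_counts = ser.str.count('[aeiou]', flags=re.IGNORECASE)

# Keep the words with two or more vowels
result = ser[vowel_counts >= 2]
print(result.tolist())  # ['Apple', 'Orange', 'Money']
```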

25. How to filter valid emails from a series?

Extract the valid emails from the series emails. The regex pattern for valid emails is provided as reference.

# input
emails = pd.Series(['buying books at amazom.com', 'rameses@egypt.com', 'matt@t.co', 'narendra@modi.com'])
pattern ='[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,4}'

'''
Desired Output

1    rameses@egypt.com
2            matt@t.co
3    narendra@modi.com
dtype: object
'''

# using powerful regex
import re
re_ = re.compile(pattern)
emails[emails.str.contains(pat = re_, regex = True)]

# other solutions
pattern ='[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,4}'
mask = emails.map(lambda x: bool(re.match(pattern, x)))
emails[mask]

# using str.findall
emails.str.findall(pattern, flags=re.IGNORECASE)

# using list comprehension
[x[0] for x in [re.findall(pattern, email) for email in emails] if len(x) > 0]
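
Note that `str.contains` accepts any string that has an email somewhere inside it, while the task asks for elements that are emails. The sketch below adds one hypothetical element (the last one, not part of the original input) to show the difference, and uses `Series.str.fullmatch`, which requires the whole string to match.

```python
import pandas as pd

emails = pd.Series([
    'buying books at amazom.com',
    'rameses@egypt.com',
    'matt@t.co',
    'narendra@modi.com',
    'mail rameses@egypt.com now',  # hypothetical extra element for illustration
])
pattern = r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}'

# contains: True wherever an email appears anywhere in the string
loose = emails[emails.str.contains(pattern, regex=True)]

# fullmatch: True only when the entire string is an email
strict = emails[emails.str.fullmatch(pattern)]
print(strict.tolist())
```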

26. How to get the mean of a series grouped by another series?

Compute the mean of weights of each fruit.

# input
fruit = pd.Series(np.random.choice(['apple', 'banana', 'carrot'], 10))
fruit
weights = pd.Series(np.linspace(1, 10, 10))
weights
#print(weights.tolist())
#print(fruit.tolist())

'''
Desired output

# values can change due to randomness
apple     6.0
banana    4.0
carrot    5.8
dtype: float64
'''

# using pandas groupby
df = pd.concat([fruit, weights], axis = 1)
df
df.groupby(0).mean()

# use one list to calculate a kpi from another
weights.groupby(fruit).mean()
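
Since the input above is random, here is a small deterministic sketch (with made-up values) of grouping one series by another, so the group means can be verified by hand.

```python
import pandas as pd

# Illustrative fixed data instead of random choices
fruit = pd.Series(['apple', 'banana', 'apple', 'carrot', 'banana', 'carrot'])
weights = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# The two series are aligned on their index; fruit supplies the group labels
means = weights.groupby(fruit).mean()
print(means.to_dict())  # {'apple': 2.0, 'banana': 3.5, 'carrot': 5.0}
```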

27. How to compute the euclidean distance between two series?

Compute the euclidean distance between series (points) p and q, without using a packaged formula.

# Input
p = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
q = pd.Series([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])
'''
Desired Output

18.165
'''

# using list comprehension
suma = np.sqrt(np.sum([(p_i - q_i) ** 2 for p_i, q_i in zip(p, q)]))
suma

# using series one to one operation
sum((p - q) ** 2) ** .5

# using numpy
np.linalg.norm(p-q)
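
To see where the number comes from, the calculation can be broken into its steps: square the element-wise differences, sum them, and take the square root. A small sketch:

```python
import math
import pandas as pd

p = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
q = pd.Series([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])

# Squared differences: [81, 49, 25, 9, 1, 1, 9, 25, 49, 81]
squared = ((p - q) ** 2).tolist()

# Their sum is 330, and the distance is its square root
total = sum(squared)
distance = math.sqrt(total)
print(total, distance)
```

The square root of 330 is roughly 18.1659, which matches the desired output up to rounding.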

28. How to find all the local maxima (or peaks) in a numeric series?

# input
ser = pd.Series([2, 10, 3, 4, 9, 10, 2, 7, 3])

'''
Desired output

array([1, 5, 7])
'''

# using pandas shift
local_max = ser[(ser.shift(1) < ser) & (ser.shift(-1) < ser)]
local_max.index

# using numpy
dd = np.diff(np.sign(np.diff(ser)))
dd
peak_locs = np.where(dd == -2)[0] + 1
peak_locs

29. How to replace missing spaces in a string with the least frequent character?

Replace the spaces in my_str with the least frequent character.

# input
my_str = 'dbc deb abed ggade'

'''
Desired Output

'dbccdebcabedcggade'  # least frequent is 'c'
'''

# using Counter
from collections import Counter
my_str_ = my_str
Counter_ = Counter(list(my_str_.replace(" ", "")))
Counter_
minimum = min(Counter_, key = Counter_.get)

print(my_str.replace(" ", minimum))

# using pandas
ser = pd.Series(list(my_str.replace(" ", "")))
ser.value_counts()
minimum = list(ser.value_counts().index)[-1]
minimum
print(my_str.replace(" ", minimum))

30. How to create a TimeSeries starting ‘2000-01-01’ and 10 weekends (saturdays) after that having random numbers as values?

'''
Desired Output
values can be random

2000-01-01    4
2000-01-08    1
2000-01-15    8
2000-01-22    4
2000-01-29    4
2000-02-05    2
2000-02-12    4
2000-02-19    9
2000-02-26    6
2000-03-04    6
'''

dti = pd.Series(pd.date_range('2000-01-01', periods=10, freq='W-SAT'))
random_num = pd.Series([np.random.randint(1, 10) for i in range(10)])


df = pd.concat({"Time":dti, "Numbers":random_num}, axis = 1)
df

# for more about time series functionality 
# https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases

# another solution just using pandas Series
ser = pd.Series(np.random.randint(1,10,10), pd.date_range('2000-01-01', periods=10, freq='W-SAT'))
ser
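
For reproducible values, a seeded generator can be used instead of the global random state. The sketch below (the seed is arbitrary) also checks that every date produced by `freq='W-SAT'` really is a Saturday, which `dayofweek` encodes as 5.

```python
import numpy as np
import pandas as pd

# Seeded generator so the values are repeatable; the seed itself is arbitrary
rng = np.random.default_rng(seed=0)

dates = pd.date_range('2000-01-01', periods=10, freq='W-SAT')
# integers(1, 10) draws from 1 to 9 inclusive, like np.random.randint(1, 10)
ser = pd.Series(rng.integers(1, 10, size=10), index=dates)

# dayofweek is 0 for Monday, so 5 means Saturday
all_saturdays = bool((ser.index.dayofweek == 5).all())
print(all_saturdays)  # True
```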

31. How to fill an intermittent time series so all missing dates show up with values of previous non-missing date?

ser has missing dates and values. Make all missing dates appear and fill up with value from previous date.

# input
ser = pd.Series([1,10,3,np.nan], index=pd.to_datetime(['2000-01-01', '2000-01-03', '2000-01-06', '2000-01-08']))

'''
Desired Output

2000-01-01     1.0
2000-01-02     1.0
2000-01-03    10.0
2000-01-04    10.0
2000-01-05    10.0
2000-01-06     3.0
2000-01-07     3.0
2000-01-08     NaN
'''

# Solution 1
# first let's fill the missing dates
indx = pd.date_range("2000-01-01", "2000-01-08")
# now let's reindex the series ser with the new index
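# we have to reassign it back to ser
ser = ser.reindex(indx)
# lastly let's populate the missing values
# (fillna(method="ffill") is deprecated since pandas 2.1; note that ffill() also fills the original NaN on 2000-01-08 with 3.0)
ser.ffill()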

# Solution 2
ser = pd.Series([1,10,3,np.nan], index=pd.to_datetime(['2000-01-01', '2000-01-03', '2000-01-06', '2000-01-08']))
ser.resample('D').ffill()  # fill with previous value
ser.resample('D').bfill()  # fill with next value
ser.resample('D').bfill().ffill()  # fill next else prev value
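
Another option is `Series.asfreq`, sketched below. Its `method` argument fills only the dates created by the reindexing, not NaN values that were already present, so the NaN on 2000-01-08 is kept as in the desired output.

```python
import numpy as np
import pandas as pd

ser = pd.Series(
    [1, 10, 3, np.nan],
    index=pd.to_datetime(['2000-01-01', '2000-01-03', '2000-01-06', '2000-01-08']),
)

# Convert to daily frequency; newly created dates take the previous value
filled = ser.asfreq('D', method='ffill')
print(filled)
```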

32. How to compute the autocorrelations of a numeric series?

Compute autocorrelations for the first 10 lags of ser. Find out which lag has the largest correlation.

# input
ser = pd.Series(np.arange(20) + np.random.normal(1, 10, 20))

'''
Desired Output

# values will change due to randomness
[0.29999999999999999, -0.11, -0.17000000000000001, 0.46000000000000002, 0.28000000000000003, -0.040000000000000001, -0.37, 0.41999999999999998, 0.47999999999999998, 0.17999999999999999]
Lag having highest correlation:  9
'''

# using pandas autocorr
# ser.autocorr(lag = 10)

# solution using list comprehension
autocorrelations = [ser.autocorr(i).round(2) for i in range(11)]
print(autocorrelations[1:])
print('Lag having highest correlation: ', np.argmax(np.abs(autocorrelations[1:]))+1)
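
It may help to see what `autocorr` computes: the Pearson correlation between the series and a copy of itself shifted by the lag. The sketch below uses a seeded generator (the seed and the lag are arbitrary) and compares the two.

```python
import numpy as np
import pandas as pd

# Seeded so the run is repeatable; the seed is arbitrary
rng = np.random.default_rng(seed=42)
ser = pd.Series(np.arange(20) + rng.normal(1, 10, 20))

lag = 3
# Built-in autocorrelation
builtin = ser.autocorr(lag)
# Same quantity by hand: correlate the series with its shifted copy
manual = ser.corr(ser.shift(lag))
print(builtin, manual)
```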

33. How to import only every nth row from a csv file to create a dataframe?

Import every 50th row of BostonHousing dataset as a dataframe.

# input
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# the data comes without headers, so we looked up the column names
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

# pure Python implementation
with open("/kaggle/input/boston-house-prices/housing.csv") as f:
    data = f.read()
    nth_rows = []
    for i, rows in enumerate(data.split("\n")):
        if i%50 == 0:
            nth_rows.append(rows)

# each element of nth_rows is a string of values separated by whitespace
# the next list comprehension splits every row into its values

nth_rows[0]
data_ = [nth_rows[i].split() for i in range(len(nth_rows))]
df = pd.DataFrame(data_, columns=names)
df

# other solutions

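# Solution 2: Use chunks and for-loop
# (DataFrame.append was removed in pandas 2.0, so collect the rows in a list instead;
#  the file is whitespace-separated and has no header row)
# chunks = pd.read_csv("/kaggle/input/boston-house-prices/housing.csv", sep=r"\s+", header=None, names=names, chunksize=50)
# rows = []
# for chunk in chunks:
#     rows.append(chunk.iloc[0])
# df2 = pd.DataFrame(rows)
# df2

# Solution 3: Use chunks and list comprehension
# chunks = pd.read_csv("/kaggle/input/boston-house-prices/housing.csv", sep=r"\s+", header=None, names=names, chunksize=50)
# df2 = pd.concat([chunk.iloc[0] for chunk in chunks], axis=1)
# df2 = df2.transpose()
# df2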
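
`read_csv` can also do the row selection itself: `skiprows` accepts a function that receives the row number and returns True for rows to skip. The sketch below runs on a small in-memory stand-in (200 made-up rows with hypothetical column names `a` and `b`), not on the real housing file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: 200 whitespace-separated rows, no header
text = "\n".join(f"{i} {i * 2}" for i in range(200))

# Keep rows 0, 50, 100, 150 and skip everything else
df = pd.read_csv(
    io.StringIO(text),
    sep=r"\s+",
    header=None,
    names=["a", "b"],
    skiprows=lambda i: i % 50 != 0,
)
print(df)
```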

34. How to change column values when importing csv to a dataframe?

Import the Boston housing dataset, but while importing change the 'medv' (median house value) column so that values < 25 become ‘Low’ and values >= 25 become ‘High’.

# input
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# first let's import using the previous code and save as a normal csv

names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
with open("/kaggle/input/boston-house-prices/housing.csv") as f:
    data = f.read()
    nth_rows = []
    for i, rows in enumerate(data.split("\n")):
        nth_rows.append(rows)

data_ = [nth_rows[i].split() for i in range(len(nth_rows))]

df = pd.DataFrame(data_, columns=names)
df.head()
df.to_csv("housing_preprocessed.csv")
del df

# now let's import as normal and use converters to convert the values
# skipfooter drops the last row, which holds only NaN values (it is only supported by the python engine);
# index_col specifies that the first column is the index
df = pd.read_csv("housing_preprocessed.csv", index_col=0, skipfooter=1, engine="python", converters={"MEDV": lambda x: "HIGH" if float(x) >= 25 else "LOW"})
df
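
The converter mechanism can be tried without the Kaggle file. The sketch below uses a tiny in-memory CSV with made-up values; each converter receives the raw cell as a string, which is why the lambda calls `float` first.

```python
import io
import pandas as pd

# Made-up sample in place of the real dataset
text = "RM,MEDV\n6.5,24.0\n6.4,21.6\n7.1,34.7\n6.9,25.0\n"

# The converter is applied to every cell of the MEDV column while parsing
df = pd.read_csv(
    io.StringIO(text),
    converters={"MEDV": lambda x: "HIGH" if float(x) >= 25 else "LOW"},
)
print(df["MEDV"].tolist())  # ['LOW', 'LOW', 'HIGH', 'HIGH']
```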

35. How to create a dataframe with rows as strides from a given series?

# input
L = pd.Series(range(15))

'''
Desired Output

array([[ 0,  1,  2,  3],
       [ 2,  3,  4,  5],
       [ 4,  5,  6,  7],
       [ 6,  7,  8,  9],
       [ 8,  9, 10, 11],
       [10, 11, 12, 13]])
'''

# using slicing
# let's generate a list of indexes we need to use
# outputs array([ 0,  2,  4,  6,  8, 10, 12, 14])
index_ = np.arange(0, 15, 2)
index_
my_list = []
for i in range(6):
    my_list.append(list(L[index_[i]:index_[i+2]]))
np.array(my_list)

# above code as list comprehension
np.array([L[index_[i]:index_[i+2]] for i in range(6)])

# another solution
def gen_strides(a, stride_len=5, window_len=5):
    n_strides = ((a.size-window_len)//stride_len) + 1
    return np.array([a[s:(s+window_len)] for s in np.arange(0, a.size, stride_len)[:n_strides]])

gen_strides(L, stride_len=2, window_len=4)
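
With NumPy 1.20 or newer, the same windows can be built with `sliding_window_view`. This sketch creates every window of length 4 and then keeps every second one, which corresponds to a stride of 2.

```python
import numpy as np
import pandas as pd

L = pd.Series(range(15))

# All windows of length 4: starts at positions 0, 1, ..., 11
windows = np.lib.stride_tricks.sliding_window_view(L.to_numpy(), window_shape=4)

# Keep every second window to get a stride of 2
strided = windows[::2]
print(strided)
```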

