Author: 郭震
16. How to get the positions of items of series A in another series B?
Get the positions of items of ser2 in ser1 as a list.
# input
ser1 = pd.Series([10, 9, 6, 5, 3, 1, 12, 8, 13])
ser2 = pd.Series([1, 3, 10, 13])
# gets the positions, but in ser1's order (ascending index) rather than ser2's order
list(ser1[ser1.isin(ser2)].index)
# using numpy where
[np.where(i == ser1)[0].tolist()[0] for i in ser2]
# using pandas Index and get location
[pd.Index(ser1).get_loc(i) for i in ser2]
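A vectorized alternative, sketched here on the assumption that the values in ser1 are unique: `pd.Index.get_indexer` looks up all items in one call and returns -1 for any item that is absent. The name `positions` is illustrative only.

```python
import pandas as pd

ser1 = pd.Series([10, 9, 6, 5, 3, 1, 12, 8, 13])
ser2 = pd.Series([1, 3, 10, 13])

# get_indexer requires unique values in ser1; items not found map to -1
positions = pd.Index(ser1).get_indexer(ser2).tolist()
print(positions)  # [5, 4, 0, 8]
```

Unlike the `isin` approach, the result follows the order of ser2.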
17. How to compute the mean squared error on a truth and predicted series?
Compute the mean squared error of truth and pred series.
# input
truth = pd.Series(range(10))
pred = pd.Series(range(10)) + np.random.random(10)
# BAD: slow element-wise Python loop, don't use it
np.mean([(truth_i - pred_i) ** 2 for truth_i, pred_i in zip(truth, pred)])
# using numpy
np.mean((truth - pred) ** 2)
# using sklearn metrics
from sklearn.metrics import mean_squared_error
mean_squared_error(truth, pred)
18. How to convert the first character of each element in a series to uppercase?
Change the first character of each word in ser to upper case.
# input
ser = pd.Series(['just', 'a', 'random', 'list'])
ser
# using the Python string method title(); assumes every element is a string
[i.title() for i in ser]
# using lambda
ser.map(lambda x: x.title())
# other solution
ser.map(lambda x: x[0].upper() + x[1:])
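The same result can likely be had without a lambda through the vectorized `.str` accessor. A minimal sketch, assuming every element is a string; the name `capitalized` is illustrative.

```python
import pandas as pd

ser = pd.Series(['just', 'a', 'random', 'list'])

# str.capitalize upper-cases the first character and lower-cases the rest
capitalized = ser.str.capitalize()
print(capitalized.tolist())  # ['Just', 'A', 'Random', 'List']
```

Note that `capitalize` also lower-cases the remaining characters, which differs from `x[0].upper() + x[1:]` when a word contains upper-case letters after the first.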
19. How to calculate the number of characters in each word in a series?
# input
ser = pd.Series(['just', 'a', 'random', 'list'])
# using list comprehension
[len(i) for i in ser]
# using series map
ser.map(len)
# using series apply
ser.apply(len)
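A further option, shown as a small sketch: the `.str` accessor has its own `len` method, which passes missing values through as NaN instead of raising. The name `lengths` is illustrative.

```python
import pandas as pd

ser = pd.Series(['just', 'a', 'random', 'list'])

# vectorized length of each string element
lengths = ser.str.len()
print(lengths.tolist())  # [4, 1, 6, 4]
```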
20. How to compute the difference of differences between consecutive numbers of a series?
Difference of differences between the consecutive numbers of ser.
# input
ser = pd.Series([1, 3, 6, 10, 15, 21, 27, 35])
# Desired Output
# [nan, 2.0, 3.0, 4.0, 5.0, 6.0, 6.0, 8.0]
# [nan, nan, 1.0, 1.0, 1.0, 1.0, 0.0, 2.0]
# using pandas diff()
ser.diff(periods = 1).tolist()
ser.diff(periods = 1).diff(periods = 1).tolist()
21. How to convert a series of date-strings to a timeseries?
# input
ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])
'''
Desired Output
0 2010-01-01 00:00:00
1 2011-02-02 00:00:00
2 2012-03-03 00:00:00
3 2013-04-04 00:00:00
4 2014-05-05 00:00:00
5 2015-06-06 12:20:00
'''
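# using pandas to_datetime
# (pandas >= 2.0 needs format='mixed' for mixed-format strings; drop the argument on older versions)
pd.to_datetime(ser, format='mixed')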
# using dateutil parse
from dateutil.parser import parse
ser.map(lambda x: parse(x))
22. How to get the day of month, week number, day of year and day of week from a series of date strings?
Get the day of month, week number, day of year and day of week from ser.
# input
ser = pd.Series(['01 Jan 2010', '02-02-2011', '20120303', '2013/04/04', '2014-05-05', '2015-06-06T12:20'])
'''
Desired output
Date: [1, 2, 3, 4, 5, 6]
Week number: [53, 5, 9, 14, 19, 23]
Day num of year: [1, 33, 63, 94, 125, 157]
Day of week: ['Friday', 'Wednesday', 'Saturday', 'Thursday', 'Monday', 'Saturday']
'''
# parse once; pandas >= 2.0 needs format='mixed' for mixed-format strings (drop the argument on older versions)
dates = pd.to_datetime(ser, format='mixed')
# day
dates.dt.day.to_list()
# week (dt.week and dt.weekofyear were removed in pandas 2.0; isocalendar() needs pandas >= 1.1)
dates.dt.isocalendar().week.to_list()
# day of year
dates.dt.dayofyear.to_list()
# day of week in words
week_dict = {0:"Monday", 1:"Tuesday", 2:"Wednesday", 3:"Thursday", 4:"Friday", 5:"Saturday", 6:"Sunday"}
dates.dt.dayofweek.map(week_dict).to_list()
# another method (dt.weekday_name was removed in pandas 1.0; day_name() replaces it)
dates.dt.day_name().to_list()
23. How to convert year-month string to dates corresponding to the 4th day of the month?
Change ser to dates that start with 4th of the respective months.
# input
ser = pd.Series(['Jan 2010', 'Feb 2011', 'Mar 2012'])
'''
Desired Output
0 2010-01-04
1 2011-02-04
2 2012-03-04
dtype: datetime64[ns]
'''
# solution using parser
from dateutil.parser import parse
ser.map(lambda x: parse('04 ' + x))
# another solution
from dateutil.parser import parse
# Parse the date
ser_ts = ser.map(lambda x: parse(x))
# Construct date string with date as 4
ser_datestr = ser_ts.dt.year.astype('str') + '-' + ser_ts.dt.month.astype('str') + '-' + '04'
# Format it.
[parse(i).strftime('%Y-%m-%d') for i in ser_datestr]
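A pandas-only variant, sketched under the assumption that the month abbreviations are English: parse each string with an explicit format, which yields the first day of the month, then shift by three days. The names `first_of_month` and `fourth_of_month` are illustrative.

```python
import pandas as pd

ser = pd.Series(['Jan 2010', 'Feb 2011', 'Mar 2012'])

# '%b %Y' parses to the first day of each month
first_of_month = pd.to_datetime(ser, format='%b %Y')
# adding three days lands on the 4th
fourth_of_month = first_of_month + pd.Timedelta(days=3)
print(fourth_of_month.dt.strftime('%Y-%m-%d').tolist())
# ['2010-01-04', '2011-02-04', '2012-03-04']
```

This keeps the result as a datetime64 series, whereas the list comprehension above returns plain strings.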
24. How to filter words that contain at least 2 vowels from a series?
From ser, extract words that contain at least 2 vowels.
# input
ser = pd.Series(['Apple', 'Orange', 'Plan', 'Python', 'Money'])
'''
Desired Output
0 Apple
1 Orange
4 Money
dtype: object
'''
# using nested loops
vowels = list("aeiou")
list_ = []
for w in ser:
    c = 0
    for l in list(w.lower()):
        if l in vowels:
            c += 1
    if c >= 2:
        print(w)
        list_.append(w)
ser[ser.isin(list_)]
# another solution using counter
from collections import Counter
mask = ser.map(lambda x: sum([Counter(x.lower()).get(i, 0) for i in list('aeiou')]) >= 2)
ser[mask]
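The vowel count can also be expressed with a regular expression. A minimal sketch using `Series.str.count` with a case-insensitive character class; the names `vowel_counts` and `filtered` are illustrative.

```python
import re
import pandas as pd

ser = pd.Series(['Apple', 'Orange', 'Plan', 'Python', 'Money'])

# count vowel matches per element, ignoring case, then keep words with two or more
vowel_counts = ser.str.count(r'[aeiou]', flags=re.IGNORECASE)
filtered = ser[vowel_counts >= 2]
print(filtered.tolist())  # ['Apple', 'Orange', 'Money']
```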
25. How to filter valid emails from a series?
Extract the valid emails from the series emails. The regex pattern for valid emails is provided as reference.
# input
emails = pd.Series(['buying books at amazom.com', 'rameses@egypt.com', 'matt@t.co', 'narendra@modi.com'])
pattern ='[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,4}'
'''
Desired Output
1 rameses@egypt.com
2 matt@t.co
3 narendra@modi.com
dtype: object
'''
# using powerful regex
import re
re_ = re.compile(pattern)
emails[emails.str.contains(pat = re_, regex = True)]
# other solutions
pattern ='[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,4}'
mask = emails.map(lambda x: bool(re.match(pattern, x)))
emails[mask]
# using str.findall
emails.str.findall(pattern, flags=re.IGNORECASE)
# using list comprehension
[x[0] for x in [re.findall(pattern, email) for email in emails] if len(x) > 0]
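One caveat with `str.contains` and `re.match` is that they accept strings that merely contain or start with an address. If the whole string should be an address, `Series.str.fullmatch` (pandas >= 1.1) is a stricter option. A sketch with the same pattern; the name `valid` is illustrative.

```python
import pandas as pd

emails = pd.Series(['buying books at amazom.com', 'rameses@egypt.com',
                    'matt@t.co', 'narendra@modi.com'])
pattern = r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}'

# fullmatch is True only when the entire string matches the pattern
valid = emails[emails.str.fullmatch(pattern)]
print(valid.tolist())
# ['rameses@egypt.com', 'matt@t.co', 'narendra@modi.com']
```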
26. How to get the mean of a series grouped by another series?
Compute the mean of weights of each fruit.
# input
fruit = pd.Series(np.random.choice(['apple', 'banana', 'carrot'], 10))
fruit
weights = pd.Series(np.linspace(1, 10, 10))
weights
#print(weights.tolist())
#print(fruit.tolist())
'''
Desired output
# values can change due to randomness
apple 6.0
banana 4.0
carrot 5.8
dtype: float64
'''
# using pandas groupby
df = pd.concat([fruit, weights], axis = 1)
df
df.groupby(0).mean()
# group one series directly by the other, without building a dataframe
weights.groupby(fruit).mean()
27. How to compute the euclidean distance between two series?
Compute the euclidean distance between series (points) p and q, without using a packaged formula.
# Input
p = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
q = pd.Series([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])
'''
Desired Output
18.165
'''
# using list comprehension
suma = np.sqrt(np.sum([(p_i - q_i) ** 2 for p_i, q_i in zip(p, q)]))
suma
# using series one to one operation
sum((p - q) ** 2) ** .5
# using numpy
np.linalg.norm(p-q)
28. How to find all the local maxima (or peaks) in a numeric series?
# input
ser = pd.Series([2, 10, 3, 4, 9, 10, 2, 7, 3])
'''
Desired output
array([1, 5, 7])
'''
# using pandas shift
local_max = ser[(ser.shift(1) < ser) & (ser.shift(-1) < ser)]
local_max.index
# using numpy
dd = np.diff(np.sign(np.diff(ser)))
dd
peak_locs = np.where(dd == -2)[0] + 1
peak_locs
29. How to replace missing spaces in a string with the least frequent character?
Replace the spaces in my_str with the least frequent character.
# input
my_str = 'dbc deb abed ggade'
'''
Desired Output
'dbccdebcabedcggade' # least frequent is 'c'
'''
# using Counter
from collections import Counter
my_str_ = my_str
Counter_ = Counter(list(my_str_.replace(" ", "")))
Counter_
minimum = min(Counter_, key = Counter_.get)
print(my_str.replace(" ", minimum))
# using pandas
ser = pd.Series(list(my_str.replace(" ", "")))
ser.value_counts()
minimum = list(ser.value_counts().index)[-1]
minimum
print(my_str.replace(" ", minimum))
30. How to create a time series of 10 consecutive Saturdays starting on '2000-01-01', with random numbers as values?
'''
Desired Output
values can be random
2000-01-01 4
2000-01-08 1
2000-01-15 8
2000-01-22 4
2000-01-29 4
2000-02-05 2
2000-02-12 4
2000-02-19 9
2000-02-26 6
2000-03-04 6
'''
dti = pd.Series(pd.date_range('2000-01-01', periods=10, freq='W-SAT'))
random_num = pd.Series([np.random.randint(1, 10) for i in range(10)])
df = pd.concat({"Time":dti, "Numbers":random_num}, axis = 1)
df
# for more about time series functionality
# https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases
# another solution just using pandas Series
ser = pd.Series(np.random.randint(1,10,10), pd.date_range('2000-01-01', periods=10, freq='W-SAT'))
ser
31. How to fill an intermittent time series so all missing dates show up with values of previous non-missing date?
ser has missing dates and values. Make all missing dates appear and fill up with value from previous date.
# input
ser = pd.Series([1,10,3,np.nan], index=pd.to_datetime(['2000-01-01', '2000-01-03', '2000-01-06', '2000-01-08']))
'''
Desired Output
2000-01-01 1.0
2000-01-02 1.0
2000-01-03 10.0
2000-01-04 10.0
2000-01-05 10.0
2000-01-06 3.0
2000-01-07 3.0
2000-01-08 NaN
'''
# Solution 1
# first let's fill the missing dates
indx = pd.date_range("2000-01-01", "2000-01-08")
# now let's reindex the series ser with the new index
# we have to reassign the result back to ser
ser = ser.reindex(indx)
# lastly let's populate the missing values
# (fillna(method="ffill") is deprecated since pandas 2.1; ffill() is the equivalent)
ser.ffill()
# Solution 2
ser = pd.Series([1,10,3,np.nan], index=pd.to_datetime(['2000-01-01', '2000-01-03', '2000-01-06', '2000-01-08']))
ser.resample('D').ffill() # fill with previous value
ser.resample('D').bfill() # fill with next value
ser.resample('D').bfill().ffill() # fill next else prev value
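Another route to the desired output, sketched here as an alternative rather than the author's method: `Series.asfreq` converts the index to daily frequency and can forward-fill the rows it creates. The name `filled` is illustrative.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 10, 3, np.nan],
                index=pd.to_datetime(['2000-01-01', '2000-01-03',
                                      '2000-01-06', '2000-01-08']))

# asfreq inserts the missing days; method='ffill' fills only the newly created rows,
# so the NaN that was already present on 2000-01-08 stays NaN
filled = ser.asfreq('D', method='ffill')
print(filled)
```

Because pre-existing NaN values are left alone, this matches the desired output above, where 2000-01-08 remains NaN.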
32. How to compute the autocorrelations of a numeric series?
Compute autocorrelations for the first 10 lags of ser. Find out which lag has the largest correlation.
# input
ser = pd.Series(np.arange(20) + np.random.normal(1, 10, 20))
'''
Desired Output
# values will change due to randomness
[0.29999999999999999, -0.11, -0.17000000000000001, 0.46000000000000002, 0.28000000000000003, -0.040000000000000001, -0.37, 0.41999999999999998, 0.47999999999999998, 0.17999999999999999]
Lag having highest correlation: 9
'''
# using pandas autocorr
# ser.autocorr(lag = 10)
# solution using list comprehension
autocorrelations = [ser.autocorr(i).round(2) for i in range(11)]
print(autocorrelations[1:])
print('Lag having highest correlation: ', np.argmax(np.abs(autocorrelations[1:]))+1)
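To make clear what `autocorr` measures, the sketch below compares it with the Pearson correlation between the series and a copy of itself shifted by the lag. The fixed seed and the names `rng`, `lag`, `manual` and `builtin` are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
ser = pd.Series(np.arange(20) + rng.normal(1, 10, 20))

lag = 3
# lag-k autocorrelation: correlate the series with itself shifted by k positions
manual = ser.corr(ser.shift(lag))
builtin = ser.autocorr(lag)
print(np.isclose(manual, builtin))  # True
```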
33. How to import only every nth row from a csv file to create a dataframe?
Import every 50th row of BostonHousing dataset as a dataframe.
# input
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# the data comes without headers, so we looked up the column names
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
# pure Python implementation
with open("/kaggle/input/boston-house-prices/housing.csv") as f:
    data = f.read()
nth_rows = []
for i, rows in enumerate(data.split("\n")):
    if i%50 == 0:
        nth_rows.append(rows)
# each entry of nth_rows is one line of whitespace-separated values
# the next list comprehension splits every line into its fields
nth_rows[0]
data_ = [nth_rows[i].split() for i in range(len(nth_rows))]
df = pd.DataFrame(data_, columns=names)
df
# other solutions
# Solution 2: Use chunks and for-loop
# (DataFrame.append was removed in pandas 2.0, so collect the rows and concatenate)
# df = pd.read_csv("/kaggle/input/boston-house-prices/housing.csv", chunksize=50)
# rows = []
# for chunk in df:
#     rows.append(chunk.iloc[[0]])
# df2 = pd.concat(rows)
# df2
# Solution 3: Use chunks and list comprehension
# df = pd.read_csv("/kaggle/input/boston-house-prices/housing.csv", chunksize=50)
# df2 = pd.concat([chunk.iloc[0] for chunk in df], axis=1)
# df2 = df2.transpose()
# df2
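`pd.read_csv` can also skip rows while parsing, because `skiprows` accepts a callable that receives the row number. The sketch below uses a small synthetic comma-separated file held in memory instead of the housing file, so `csv_text` and its contents are assumptions for illustration only.

```python
import io
import pandas as pd

# synthetic stand-in for the data file: 120 rows, 2 comma-separated columns
csv_text = "\n".join(f"{i},{i * 2}" for i in range(120))

# a row is skipped whenever the callable returns True for its row number
df = pd.read_csv(io.StringIO(csv_text), header=None,
                 skiprows=lambda i: i % 50 != 0)
print(df[0].tolist())  # [0, 50, 100]
```

For the whitespace-separated housing file, a matching `sep` argument and `names=names` would presumably be needed as well.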
34. How to change column values when importing csv to a dataframe?
Import the Boston housing dataset, but while importing change the 'MEDV' (median house value) column so that values below 25 become 'LOW' and values of 25 or more become 'HIGH'.
# input
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# first let's import using the previous code and save as a normal csv
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
with open("/kaggle/input/boston-house-prices/housing.csv") as f:
    data = f.read()
nth_rows = []
for i, rows in enumerate(data.split("\n")):
    nth_rows.append(rows)
data_ = [nth_rows[i].split() for i in range(len(nth_rows))]
df = pd.DataFrame(data_, columns=names)
df.head()
df.to_csv("housing_preprocessed.csv")
del df
# now let's start importing as normal and use converters to convert the values
# skipfooter drops the last row, which holds NaN values; index_col marks the first column as the index
# skipfooter is only supported by the python parser engine, so request it explicitly
df = pd.read_csv("housing_preprocessed.csv", index_col = 0, skipfooter=1, engine="python", converters = {"MEDV": lambda x: "HIGH" if float(x) >= 25 else "LOW"})
df
35. How to create a dataframe with rows as strides from a given series?
# input
L = pd.Series(range(15))
'''
Desired Output
array([[ 0, 1, 2, 3],
[ 2, 3, 4, 5],
[ 4, 5, 6, 7],
[ 6, 7, 8, 9],
[ 8, 9, 10, 11],
[10, 11, 12, 13]])
'''
# using slicing
# let's generate a list of indexes we need to use
# outputs array([ 0, 2, 4, 6, 8, 10, 12, 14])
index_ = np.arange(0, 15, 2)
index_
my_list = []
for i in range(6):
    my_list.append(list(L[index_[i]:index_[i+2]]))
np.array(my_list)
# above code as list comprehension
np.array([L[index_[i]:index_[i+2]] for i in range(6)])
# another solution
def gen_strides(a, stride_len=5, window_len=5):
    n_strides = ((a.size-window_len)//stride_len) + 1
    return np.array([a[s:(s+window_len)] for s in np.arange(0, a.size, stride_len)[:n_strides]])
gen_strides(L, stride_len=2, window_len=4)
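With NumPy 1.20 or later, the same windows can be built without an explicit loop. A sketch using `sliding_window_view`, which produces every length-4 window at stride 1; slicing with `[::2]` then keeps every second window. The name `windows` is illustrative.

```python
import numpy as np
import pandas as pd

L = pd.Series(range(15))

# all length-4 windows at stride 1, then every second window for a stride of 2
windows = np.lib.stride_tricks.sliding_window_view(L.to_numpy(), window_shape=4)[::2]
print(windows)
```

The result is a read-only view on the underlying array, so copy it with `windows.copy()` before modifying values.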