Source: Bioinformatics with Python Cookbook
The author's original code can be hard for beginners to read, so I'll use corrected code throughout to make it easier to follow.
Using pandas to process vaccine-adverse events
In the first part, we'll learn the basics of working with pandas.
import numpy as np
import pandas as pd
vdata = pd.read_csv("2021VAERSDATA.csv.gz", encoding="iso-8859-1")
First we import the packages we need and load the data.
vdata.head()
vdata.shape
vdata.dtypes
vdata.loc[0,'SYMPTOM_TEXT']
print(vdata.loc[:5,'SYMPTOM_TEXT'])
vdata.iloc[:5, 2]
vdata.AGE_YRS.max()
vdata = vdata.set_index("VAERS_ID")
Next we do some initial exploration of the data. (iloc selects by integer position, taking row and column positions as input; loc selects by index label.)
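To make the loc/iloc distinction concrete, here is a minimal sketch on toy data (not the VAERS file); note in particular that loc slices are end-inclusive while iloc slices are end-exclusive, which matters for expressions like `vdata.loc[:5, ...]` above.

```python
import pandas as pd

# Toy frame with non-trivial index labels to show the difference.
df = pd.DataFrame({"a": [10, 20, 30]}, index=[100, 200, 300])

print(df.iloc[0, 0])          # by integer position: first row -> 10
print(df.loc[200, "a"])       # by index label: row labeled 200 -> 20
print(len(df.loc[100:300]))   # loc slicing is end-inclusive -> 3 rows
print(len(df.iloc[0:2]))      # iloc slicing is end-exclusive -> 2 rows
```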
vax = pd.read_csv("2021VAERSVAX.csv.gz", encoding="iso-8859-1").set_index("VAERS_ID")
Then we load the vaccine information; our focus is the COVID19 vaccines.
vax.groupby("VAX_TYPE").size().sort_values()
VAX_TYPE
DTOX            1
DTPHEP          1
H5N1            1
PNC10           1
TDAPIPV         1
            ...
HPV9         1823
FLU4         5431
UNK          9458
VARZOS      14392
COVID19    742056
Length: 70, dtype: int64
You can see that COVID19 really did hit the American public hard.
vdata["is_dead"] = (vdata.DIED == "Y")
dead = vdata[vdata.is_dead]
vax19 = vax[vax.VAX_TYPE == "COVID19"]
dead19 = dead.join(vax19)
Filter the deaths and join them with the COVID19 vaccine records, giving a DataFrame of people who died after receiving a COVID19 vaccine (note: VAERS reports are unverified, so this does not establish causation).
len(dead19)
Check the count: 12,231 people.
dead19 = dead19[~dead19["VAX_LOT"].isin(["Unknown", "unknown"])]
baddy = dead19.groupby("VAX_LOT").size().sort_values(ascending=False)
for i, (lot, cnt) in enumerate(baddy.items()):
    state_count = dead19[dead19.VAX_LOT == lot]["STATE"].nunique()
    print(lot, cnt, state_count)
    if i == 10:
        break
Group the results by vaccine lot and count occurrences, storing the result in a Series. Then report the most frequent lots along with the number of states each one affected.
EN6201  158 33
EN5318  120 29
EN6200  120 22
EN6198  115 24
EL9261  111 22
EL3248  102 18
EM9810  100 21
EN6202   98 19
EL9269   97 19
039K20A  97 13
EL3249   83 21
Dealing with the pitfalls of joining pandas DataFrames
Honestly, I'm not sure exactly what the author is driving at in this part; it seems to be about the differences between inner, left, and right joins. Let's skip ahead to the third part.
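For reference, here is a minimal sketch of those join types on toy data (not the VAERS frames). It also explains the earlier `dead.join(vax19)`: `DataFrame.join` joins on the index and defaults to a left join.

```python
import pandas as pd

left = pd.DataFrame({"v": [1, 2]}, index=["a", "b"])
right = pd.DataFrame({"w": [10, 30]}, index=["a", "c"])

# inner: keep only index labels present in both frames -> just "a"
inner = left.join(right, how="inner")
# left: keep every label from the left frame -> "a", "b" (w is NaN for "b")
left_j = left.join(right, how="left")
# outer: keep labels from either frame -> "a", "b", "c"
outer = left.join(right, how="outer")

print(len(inner), len(left_j), len(outer))  # 1 2 3
```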
Reducing the memory usage of pandas DataFrames
In the third part, we'll look at ways to reduce the memory usage of a pandas DataFrame.
import numpy as np
import pandas as pd
vdata = pd.read_csv("2021VAERSDATA.csv.gz",encoding="iso-8859-1")
vdata.info(memory_usage="deep")
Using the same DataFrame as before, let's see how much memory it takes.
<class 'pandas.core.frame.DataFrame'>
Index: 753042 entries, 916600 to 2594447
Data columns (total 35 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   RECVDATE     753042 non-null  object
 1   STATE        637152 non-null  object
 2   AGE_YRS      671962 non-null  float64
 3   CAGE_YR      604338 non-null  float64
 .....................
 32  ER_ED_VISIT  90291 non-null   object
 33  ALLERGIES    298538 non-null  object
 34  is_dead      753042 non-null  bool
dtypes: bool(1), float64(5), int64(1), object(28)
memory usage: 1.4 GB
vdata.DIED = vdata.DIED.fillna(False).astype(bool)
vdata.info(memory_usage="deep")
Converting the DIED column to booleans noticeably shrinks that column, but barely dents the DataFrame's total memory; in other words, the real culprit is elsewhere.
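A toy demonstration (not the VAERS data) of why the object-to-bool conversion shrinks a column: an object column stores a Python object per row, each costing tens of bytes, while a bool column costs exactly one byte per row.

```python
import pandas as pd

s_obj = pd.Series(["Y", None] * 1000)   # object dtype: per-row Python objects
s_bool = (s_obj == "Y")                 # bool dtype: one byte per row

print(s_obj.memory_usage(index=False, deep=True))   # tens of KB
print(s_bool.memory_usage(index=False, deep=True))  # 2000 bytes (1 per row)
```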
vdata["STATE"] = vdata.STATE.str.upper()
states = list(vdata["STATE"].unique())
vdata["encoded_state"] = vdata.STATE.apply(lambda STATE: states.index(STATE))
vdata["encoded_state"] = vdata["encoded_state"].astype(np.uint8)
Now look at the STATE column. We first convert it to all uppercase and collect its unique values into a list, then use .apply() to run a lambda over every element of STATE. The lambda looks up each value's position in the states list and returns that position as the encoded value; this way, every state name maps to a unique integer, and the codes are stored in a new encoded_state column.
vdata["STATE"].memory_usage(index=False, deep=True)
vdata["encoded_state"].memory_usage(index=False, deep=True)
41300448
753042
The column's own footprint drops dramatically, but the DataFrame as a whole still uses a lot of memory.
PS: one question. Since the goal is to save memory, I would presumably delete the original STATE column; if a later analysis then needs the actual state, how do I turn the codes back into state names?
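One answer to this question, sketched on toy data: as long as you keep the `states` lookup list around, each code is just an index into it, so decoding is a reverse lookup. Alternatively, pandas' category dtype stores the integer codes and the label table together, giving you both directions for free.

```python
import pandas as pd

s = pd.Series(["CA", "FL", "CA", "TX"])

# Option 1: manual encoding, keeping the lookup list for decoding.
states = list(s.unique())
encoded = s.apply(states.index)                    # name -> code
decoded = encoded.apply(lambda code: states[code]) # code -> name
print(decoded.equals(s))  # True

# Option 2: the category dtype keeps codes and labels together.
cat = s.astype("category")
print(list(cat.cat.codes))        # the integer codes
print(list(cat.cat.categories))   # the label table
```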
for name in vdata.columns:
    col_bytes = vdata[name].memory_usage(index=False, deep=True)
    col_type = vdata[name].dtype
    print(name, col_type, col_bytes)
SYMPTOM_TEXT object 520244878
Inspecting each column's memory usage one by one, we find that SYMPTOM_TEXT alone accounts for an enormous share. The author's solution: simply don't read it in.
vdata = pd.read_csv("2021VAERSDATA.csv.gz",
encoding="iso-8859-1",
index_col="VAERS_ID",
converters={"DIED": lambda died: died == "Y",},
usecols=lambda name: name != "SYMPTOM_TEXT")
Accelerating pandas processing with Apache Arrow
What is Apache Arrow: a cross-language, columnar in-memory data format built for fast analytical processing. pyarrow is its Python implementation, and its CSV reader is generally faster and more memory-frugal than pandas' default.
import gzip
import pandas as pd
from pyarrow import csv
import pyarrow.compute as pc
Load the required libraries.
vdata_pd = pd.read_csv("2021VAERSDATA.csv.gz",
encoding="iso-8859-1",
usecols=lambda x: x != "SYMPTOM_TEXT")
columns = list(vdata_pd.columns)
vdata_arrow = csv.read_csv(
"2021VAERSDATA.csv.gz",
convert_options=csv.ConvertOptions(include_columns=columns))
vdata_pd.info(memory_usage="deep")
vdata_arrow.nbytes
Comparing memory usage with the 'SYMPTOM_TEXT' column excluded in both cases: Arrow clearly uses less memory.
vdata = vdata_arrow.to_pandas(self_destruct=True)
Our goal is to use Arrow to load the data into pandas, so we convert the Arrow table into a DataFrame (self_destruct=True frees the Arrow memory as the conversion proceeds).
Understanding NumPy as the engine behind Python data science and bioinformatics
贴一段作者原文:Most of your analysis will make use of NumPy, even if you don’t use it explicitly. NumPy is an array manipulation library that is behind libraries such as pandas, Matplotlib, Biopython, and scikit-learn, among many others. While much of your bioinformatics work may not require explicit direct use of NumPy, you should be aware of its existence as it underpins almost everything you do, even if only indirectly via the other libraries.
In short: NumPy is very powerful; it's the underlying array layer of Python data analysis, sitting beneath nearly everything we do here.
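A quick illustration of NumPy sitting underneath pandas, on toy data: a DataFrame column's data is a NumPy array, so vectorized NumPy operations apply directly to it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"AGE_YRS": [25.0, 61.0, 40.0]})
arr = df["AGE_YRS"].to_numpy()  # the column's data as a NumPy array

print(type(arr))   # <class 'numpy.ndarray'>
print(arr.dtype)   # float64
print(arr // 20)   # vectorized 20-year age bucketing: [1. 3. 2.]
```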
import numpy as np
import pandas as pd
vdata = pd.read_csv("2021VAERSDATA.csv.gz",
encoding="iso-8859-1",
index_col="VAERS_ID",
usecols=lambda name: name != "SYMPTOM_TEXT")
Import the required packages and data.
by_states = vdata.groupby("STATE").size().sort_values(ascending=False)
top_state = pd.DataFrame({"state": by_states.head(5).index,
                          "size": by_states.head(5).values})
top_state["rank"] = top_state.index
top_state = top_state.set_index("state")
top_state
       size  rank
state
CA    69740     0
FL    41342     1
TX    40921     2
NY    38910     3
PA    25965     4
First we build a DataFrame of the five states with the most vaccine-affected people, setting the state name as the index and storing each state's rank number in a column. CA - California, FL - Florida, TX - Texas, NY - New York, PA - Pennsylvania.
top_vdata = vdata[vdata["STATE"].isin(top_state.index)].copy()
top_vdata["state_code"] = top_vdata["STATE"].apply(lambda state: top_state["rank"].at[state]).astype(np.uint8)
top_vdata = top_vdata[top_vdata["AGE_YRS"].notna()]
top_vdata["AGE_YRS"] = top_vdata["AGE_YRS"].astype(int)
Next we filter the original DataFrame down to the reports from these five states and clean up the age data.
age_state = top_vdata[["state_code", "AGE_YRS"]]
state_code_arr = age_state["state_code"].values
type(state_code_arr), state_code_arr.shape, state_code_arr.dtype
age_state_mat = np.zeros((5, 6), dtype=np.uint64)
for row in age_state.itertuples():
    age_state_mat[row.state_code, row.AGE_YRS // 20] += 1
age_state_mat
array([[ 5535, 17607, 21406, 16680,  2662,    24],
       [ 1601,  7151, 11262, 14650,  2762,    11],
       [ 3199, 10568, 13286,  9253,  1334,     2],
       [ 2556,  9826, 11701,  9630,  1632,    14],
       [ 1591,  6213,  8124,  6958,  1052,     7]], dtype=uint64)
First create an all-zero matrix, then fill it in row by row using the state code and the 20-year age bucket. For example, a 30-year-old affected in California increments position (0, 1) of the matrix.
def compute_frac(arr_1d):
    return arr_1d / arr_1d.sum()

frac_age_stat_mat = np.apply_along_axis(compute_frac, 1, age_state_mat)
perc_age_stat_mat = frac_age_stat_mat * 100
perc_age_stat_mat = perc_age_stat_mat.astype(np.uint8)
perc_age_stat_mat
array([[ 8, 27, 33, 26,  4,  0],
       [ 4, 19, 30, 39,  7,  0],
       [ 8, 28, 35, 24,  3,  0],
       [ 7, 27, 33, 27,  4,  0],
       [ 6, 25, 33, 29,  4,  0]], dtype=uint8)
Converting the matrix to (truncated) integer percentages makes it much easier to read.
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot()
ax.matshow(perc_age_stat_mat, cmap=plt.get_cmap("Greys"))
ax.set_yticks(range(5))
ax.set_yticklabels(top_state.index)
ax.set_xticks(range(6))
ax.set_xticklabels(["0-19", "20-39", "40-59", "60-79", "80-99", "100-119"])
fig.savefig("matrix.png")
Heatmap of the age matrix.