I have had this problem for a long time: calls such as
sc.datasets.paul15()
sc.datasets.pbmc3k_processed()
sc.datasets.pbmc68k_reduced()
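always fail to download these datasets for me. This is especially true in Jupyter, where the download has almost never succeeded. A workaround is to download the file yourself and import it locally.
paul15 dataset
Copy the following URL into your browser: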
http://falexwolf.de/data/paul15.h5
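Then download the file and save it to a local folder. Downloading through the browser is actually very fast. To import the file, use the following code: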
import scanpy as sc
import h5py
import numpy as np  # needed below for np.intersect1d
import anndata as ad
filename="/Users/xiaokangyu/scanpy_dataset/paul15/paul15.h5"
with h5py.File(filename, 'r') as f:
    X = f['data.debatched'][()]
    gene_names = f['data.debatched_rownames'][()].astype(str)
    cell_names = f['data.debatched_colnames'][()].astype(str)
    clusters = f['cluster.id'][()].flatten().astype(int)
    infogenes_names = f['info.genes_strings'][()].astype(str)
# each row has to correspond to an observation, therefore transpose
adata = ad.AnnData(X.transpose(), dtype=X.dtype)
adata.var_names = gene_names
adata.row_names = cell_names
# names reflecting the cell type identifications from the paper
cell_type = 6 * ['Ery']
cell_type += 'MEP Mk GMP GMP DC Baso Baso Mo Mo Neu Neu Eos Lymph'.split()
adata.obs['paul15_clusters'] = [f'{i}{cell_type[i-1]}' for i in clusters]
# make string annotations categorical (optional)
#_utils.sanitize_anndata(adata)
# just keep the first of the two equivalent names per gene
adata.var_names = [gn.split(';')[0] for gn in adata.var_names]
# remove 10 corrupted gene names
infogenes_names = np.intersect1d(infogenes_names, adata.var_names)
# restrict data array to the 3461 informative genes
adata = adata[:, infogenes_names]
# usually we'd set the root cell to an arbitrary cell in the MEP cluster
# adata.uns['iroot'] = np.flatnonzero(adata.obs['paul15_clusters'] == '7MEP')[0]
# here, set the root cell as in Haghverdi et al. (2016)
# note that other than in Matlab/R, counting starts at 0
adata.uns['iroot'] = 840
print(adata)
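The cluster labels in the code above join each numeric cluster id to a cell type name. A minimal standalone sketch of that step follows. The cluster ids in example_clusters are made up for illustration; the real ones come from the 'cluster.id' entry of the HDF5 file.

```python
# Cell type names for clusters 1..19, exactly as in the import code.
cell_type = 6 * ['Ery']
cell_type += 'MEP Mk GMP GMP DC Baso Baso Mo Mo Neu Neu Eos Lymph'.split()

# Hypothetical cluster ids, for illustration only.
example_clusters = [1, 7, 8, 19]

# Cluster ids are 1-based while Python lists are 0-based, hence i - 1.
labels = [f'{i}{cell_type[i - 1]}' for i in example_clusters]
print(labels)  # ['1Ery', '7MEP', '8Mk', '19Lymph']
```

This is why the commented-out root cell line compares against the string '7MEP': cluster 7 is the MEP cluster.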
The adata built by the import code above is the same as the one returned by
adata=sc.datasets.paul15()
The same approach works for the other datasets.
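If you would rather script the manual download than use the browser, the sketch below fetches a file once and reuses the local copy afterwards. It uses only the Python standard library. The helper name fetch_if_missing is hypothetical and not part of scanpy, and whether this download is more reliable than scanpy's own will depend on your network.

```python
import urllib.request
from pathlib import Path


def fetch_if_missing(url, dest):
    """Download url to dest unless dest already exists; return dest as a Path.

    Hypothetical helper for illustration, not part of scanpy.
    """
    dest = Path(dest)
    if dest.exists():
        # Reuse the local copy; no network access is attempted.
        return dest
    # Create the target folder if needed, then download the file.
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(dest))
    return dest
```

For paul15, you could call fetch_if_missing("http://falexwolf.de/data/paul15.h5", "/Users/xiaokangyu/scanpy_dataset/paul15/paul15.h5") and then run the import code on the returned path. For datasets distributed as .h5ad files, reading the saved file with sc.read_h5ad(path) should be enough. Where each of those files is hosted is not covered here.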