BS6202 Assignment 2: Python

Attached to this assignment are data on the gene expression profiles of lymphoblastoid cells.

Dataset Description

1. “data.csv” – Gene expression profiles with rows representing the genes and columns the samples.

2. “meta_data.csv” – Metadata corresponding to the gene expression profiles, with rows representing the samples and columns the various clinical attributes such as age, treatment status, etc.

Task 1: Cluster the samples using the gene expression profile and evaluate the goodness of your clustering. Also, describe the rationale behind choosing a specific clustering algorithm.

We first apply PCA and then cluster the samples with K-means. The data.csv file contains the gene expression values for each person in the dataset. PCA is principal component analysis. It reduces the dimensionality of a large data set while preserving its main patterns and trends, which we need for data this complex. K-means is an algorithm that groups unlabeled data into clusters, which suits this task because the samples come without cluster labels. Plotting the result in Python gives two diagrams, each with PC1 on the x-axis and PC2 on the y-axis, with the samples colored by the group they fall into. The k-means value is around 0.05. PC1 ranges from about -40 to 80 and PC2 from about -40 to 100.
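The PCA-then-K-means workflow described above can be sketched as follows. This is a minimal illustration, not the code used for the assignment: it runs on synthetic data (two made-up groups of samples) because the real data.csv is not reproduced here, and the helper names pca_project, kmeans and silhouette are hypothetical. The silhouette coefficient is one common way to judge clustering quality; the text above does not say which measure was actually used. scikit-learn offers equivalent tools (PCA, KMeans, silhouette_score); plain NumPy is used here only to keep the sketch self-contained.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project samples (rows) onto the top principal components."""
    Xc = X - X.mean(axis=0)                       # center each gene (column)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)         # share of variance per PC
    return scores, explained[:n_components]

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns cluster labels and centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

def silhouette(X, labels):
    """Mean silhouette coefficient, in [-1, 1]; higher means tighter clusters."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    values = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                           # exclude the sample itself
        if not same.any():
            values.append(0.0)                    # singleton cluster
            continue
        a = D[i, same].mean()                     # mean distance within cluster
        b = min(D[i, labels == c].mean()          # nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        values.append((b - a) / max(a, b))
    return float(np.mean(values))

# Synthetic stand-in for the expression matrix: rows = samples, columns = genes.
# (data.csv stores genes as rows, so the real matrix would need transposing.)
rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 1.0, size=(10, 50))
group_b = rng.normal(5.0, 1.0, size=(10, 50))
X = np.vstack([group_a, group_b])

scores, explained = pca_project(X, n_components=2)
labels, _ = kmeans(scores, k=2)
sil = silhouette(scores, labels)
print("explained variance (PC1, PC2):", explained)
print("silhouette:", sil)
```

A scatter plot of scores[:, 0] against scores[:, 1], colored by labels, would give the PC1 versus PC2 diagram described in the answer.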

Task 2: Create a predictive model to predict “sex” using the given gene expression profile and evaluate your predictive model. Also, describe the rationale behind choosing a specific predictive algorithm.

For Task 2, the metadata contain additional personal information about each sample, such as sex. As in Task 1, we first use PCA to reduce the dimensionality of the gene expression data in data.csv while keeping its main patterns and trends, and K-means to group the unlabeled samples into clusters. I used Python to read the gene data and calculated the average values of the genes. We can then check which sex is closer to those average values and, where they are close, assign that sex to the sample. In my diagram, PC1 on the x-axis accounts for around 80% and PC2 on the y-axis for around 8%.
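One way to read the averaging idea above is as a nearest-centroid classifier: compute the mean expression profile for each sex, then assign a sample to the sex whose mean profile is closest. The sketch below follows that reading, which is an assumption about what the answer intends. It uses synthetic data in which the first five genes are artificially shifted for one group, and the function names are hypothetical. Leave-one-out accuracy is shown as one possible evaluation; the answer does not state how the model was evaluated.

```python
import numpy as np

def fit_centroids(X, y):
    """Mean expression profile for each class label."""
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def predict(centroids, X):
    """Assign each sample to the class with the nearest mean profile."""
    labels = list(centroids)
    dists = np.stack(
        [np.linalg.norm(X - centroids[label], axis=1) for label in labels],
        axis=1,
    )
    return np.array(labels)[dists.argmin(axis=1)]

def loo_accuracy(X, y):
    """Leave-one-out accuracy: each sample is predicted from all the others."""
    hits = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i             # hold out sample i
        centroids = fit_centroids(X[mask], y[mask])
        hits += int(predict(centroids, X[i:i + 1])[0] == y[i])
    return hits / len(X)

# Synthetic stand-in: 40 samples x 30 genes, labels as they might appear
# in a "sex" column of meta_data.csv (the actual label values are assumed).
rng = np.random.default_rng(1)
y = np.array(["F"] * 20 + ["M"] * 20)
X = rng.normal(0.0, 1.0, size=(40, 30))
X[y == "M", :5] += 3.0                            # pretend 5 genes are sex-linked

acc = loo_accuracy(X, y)
print("leave-one-out accuracy:", acc)
```

Holding out each sample before computing the averages matters: if a sample contributes to the mean it is later compared against, the accuracy estimate is optimistic.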

You may perform the above tasks in your groups using a variety of methods and strategies. However, each person is to take this preliminary analysis and further develop and refine it. Write it up as a short 2-4 page report and submit it individually.
