BS6202 Assignment 2 Python

wx_codinghelp

Published 2024-09-06 16:05:44

Article tags: programming languages

Article link: https://blog.csdn.net/wx_codinghelp/article/details/141962277

Assignment 2

BS6202

Attached to this assignment are data pertaining to the gene expression profiles of lymphoblastoid cells.

Dataset Description

1. “data.csv” – Gene expression profiles with rows representing the genes and columns the samples.

2. “meta_data.csv” – Metadata corresponding to the gene expression profiles, with rows representing the samples and columns the various clinical attributes such as age, treatment status, etc.

Task 1: Cluster the samples using the gene expression profile and evaluate the goodness of your clustering. Also, describe the rationale behind choosing a specific clustering algorithm.

We first apply PCA and then cluster the samples with K-means. The data.csv file contains the gene expression values for each person sampled. PCA is principal component analysis: it reduces the dimensionality of a large data set while maintaining its main patterns and trends, which is what we need for data as complex as these. K-means is an algorithm that groups unlabeled data into clusters. If we use Python to draw the diagrams, there should be two plots, each with PC1 on the x-axis and PC2 on the y-axis. The samples are shown in different colors according to the group they are assigned to. The k-means value is around 0.05. PC1 ranges from almost -40 to 80 on the x-axis and PC2 ranges from -40 to 100 on the y-axis.
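The PCA-then-K-means workflow described above can be sketched as follows. This is a minimal illustration, not the analysis behind the reported numbers: it uses only NumPy, a synthetic matrix stands in for data.csv, and the helper names (`pca_scores`, `kmeans`, `silhouette`) are hypothetical. The silhouette coefficient is used here as one common way to judge clustering quality; the text above does not say which measure was actually used.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Project samples (rows of X) onto the top principal components."""
    Xc = X - X.mean(axis=0)                       # centre every gene
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = S ** 2 / np.sum(S ** 2)           # variance ratio per component
    return scores, explained[:n_components]

def assign(X, centers):
    """Index of the nearest centre for every row of X."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=1)

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm, initialised with k random samples."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(X, centers)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign(X, centers), centers

def silhouette(X, labels):
    """Mean silhouette coefficient (between -1 and 1, higher is better)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue                              # singleton cluster scores 0
        a = d[i, same].sum() / (same.sum() - 1)   # mean distance within own cluster
        b = min(d[i, labels == c].mean() for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()

# Synthetic stand-in for data.csv: 50 genes (rows) x 40 samples (columns),
# built from two artificial groups of 20 samples each.
rng = np.random.default_rng(0)
expr = np.hstack([rng.normal(0.0, 1.0, size=(50, 20)),
                  rng.normal(3.0, 1.0, size=(50, 20))])
X = expr.T                                        # samples as rows for PCA

scores, explained = pca_scores(X)
labels, _ = kmeans(scores, k=2)
sil = silhouette(scores, labels)
print("explained variance (PC1, PC2):", explained)
print("silhouette:", sil)
```

With the real files, `expr` would be the matrix read from data.csv (genes in rows, samples in columns), and `scores[:, 0]` against `scores[:, 1]`, colored by `labels`, gives the PC1/PC2 scatter plot. scikit-learn provides equivalent `PCA`, `KMeans` and `silhouette_score` implementations.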

Task 2: Create a predictive model to predict “sex” using the given gene expression profile and evaluate your predictive model. Also, describe the rationale behind choosing a specific predictive algorithm.

For Task 2, the data sets also contain personal information such as sex. As in Task 1, we first apply PCA to reduce the dimensionality of the data while maintaining its patterns and trends, and then use K-means to group the unlabeled samples into clusters. I calculated the average values of the genes, using Python to read through the gene data. We can then check which sex is closer to those average values and, if they are close, select that sex. In my diagram, PC1 on the x-axis is around 80% and PC2 on the y-axis is around 8%.
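One way to read the averaging idea above is as a nearest-centroid classifier: compute the mean expression profile of each sex on a training set, then assign every held-out sample to the sex whose mean profile is closest. The sketch below is an assumption about how that could look, not the original code. The data are synthetic, the "F"/"M" labels and the three sex-linked genes are invented for illustration, and `fit_centroids` and `predict` are hypothetical helper names.

```python
import numpy as np

def fit_centroids(X, y):
    """Mean expression profile of each class in the training data."""
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def predict(X, centroids):
    """Assign each sample to the class with the nearest mean profile."""
    labels = list(centroids)
    dists = np.stack(
        [np.linalg.norm(X - centroids[label], axis=1) for label in labels],
        axis=1,
    )
    return np.array(labels)[dists.argmin(axis=1)]

# Synthetic stand-in for the real data: 60 samples x 20 genes.
rng = np.random.default_rng(1)
n_per_class, n_genes = 30, 20
female = rng.normal(0.0, 1.0, size=(n_per_class, n_genes))
male = rng.normal(0.0, 1.0, size=(n_per_class, n_genes))
male[:, :3] += 4.0            # assumption: three genes differ strongly by sex
X = np.vstack([female, male])
y = np.array(["F"] * n_per_class + ["M"] * n_per_class)

# Hold out 20 samples so the model is evaluated on data it has not seen.
idx = rng.permutation(len(X))
test_idx, train_idx = idx[:20], idx[20:]

centroids = fit_centroids(X[train_idx], y[train_idx])
pred = predict(X[test_idx], centroids)
accuracy = float((pred == y[test_idx]).mean())
print("held-out accuracy:", accuracy)
```

Evaluating on held-out samples rather than on the training samples is what makes the accuracy a fair estimate. With the real data, `X` would be the transposed matrix from data.csv and `y` the "sex" column of meta_data.csv, aligned by sample.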

You may perform the above tasks in your groups using a variety of methods and strategies. However, each person is to take this preliminary analysis and further develop and refine it. Write it up as a short 2-4 page report and submit it individually.