Gene-expression data integration to squamous cell lung ca...

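This work studies the gene-expression characteristics of four subtypes of lung squamous cell carcinoma (SqCC): basal, classical, primitive and secretory. Signature gene sets for each subtype are determined through multiple comparisons and cluster analysis, and the relationship of these subtypes to normal airway cell development and to lung SqCC cell lines is explored. In addition, the association between drug sensitivity and subtype is analysed, providing clues for the discovery of therapeutic targets.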


Signature genes for SqCC subtypes

The 178 samples include 43 basal, 65 classical, 27 primitive and 43 secretory samples with robust separation among the subtypes. The sample relationship in the SqCC subtype data is shown in the MDS plot (Figure 1A).

Figure 1

Differential expression analysis results of the 178 SqCC tumour samples. (A) Multidimensional scaling plot of the normalised data for the 178 SqCC subtype samples. (B) Venn diagrams of the DE analysis for the six pairwise comparisons among the four subtypes. The numbers represent the number of DE genes in each comparison (red for up and blue for down). The total for each diagram is the number of genes that are DE, up or down, in any of the three comparisons involving that subtype. Subtypes are abbreviated as b for basal, c for classical, p for primitive and s for secretory. The overlap of the three comparison results for a subtype defines the signature gene set for that subtype. (C) Heatmap showing the expression pattern of the signature gene sets (including up and down directions) of the four SqCC subtypes in the 178 tumour samples. No clustering method was used in this heatmap. Rows represent genes and columns represent samples.

To define a specific gene signature for each subtype, we performed a DE analysis of the RNAseq data (Smyth, 2004, 2005), comparing the gene-expression profiles among the SqCC subtypes in the samples examined above. After normalisation, the data set was fitted to a linear model in which subtype is a covariate. Two types of comparison were made: pairwise comparisons among the four SqCC subtypes (six pairwise comparisons, Figure 1B) and comparisons of each single subtype to the average of the other three subtypes, termed ‘1 vs others’ (four comparisons for four subtypes). More detail on the DE results, including heatmaps of the significant DE genes in each of the six pairs and the gene lists from the pairwise comparisons, is shown in Supplementary Figures 1–6 and Supplementary Tables 8–13. The significant DE gene sets from the six comparisons may overlap.

On the basis of the DE results from the six pairwise comparisons, we defined a signature gene set for each of the subtypes. A signature gene set captures what is unique to a subtype and has an important role in relating different data sets. In brief, genes were chosen as signature genes if they were consistently up- or downregulated in that subtype vs each of the other subtypes (Figure 1B, Table 1). This procedure (Lim et al, 2009) selects a set of signature genes that strongly characterise each subtype by their high or low transcriptional activity (heatmap in Figure 1C). The signature genes in each of the four subtypes are shown in Supplementary Tables 4–7. It is worth noting that there are more positive-signature genes than negative-signature genes in each of the subtypes. As shown in Figure 1C, each signature gene is consistently upregulated or consistently downregulated in its subtype, so the direction of regulation of every signature gene is unambiguous.
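The selection rule described above (a gene is a signature gene for a subtype if it is DE in the same direction against each of the other three subtypes) can be sketched as a set intersection. This is a minimal illustration and not the authors' code: the `de_calls` layout, the function name and the gene labels are hypothetical, and the DE calls themselves are assumed to come from an upstream analysis.

```python
SUBTYPES = ["basal", "classical", "primitive", "secretory"]


def signature_genes(de_calls, subtype, subtypes=SUBTYPES):
    """Return (up, down) signature gene sets for one subtype.

    de_calls[(a, b)] maps gene -> +1 (up in a vs b), -1 (down in a vs b)
    or 0 (not DE). Each pair may be stored in either orientation.
    """
    up, down = None, None
    for other in subtypes:
        if other == subtype:
            continue
        if (subtype, other) in de_calls:
            calls, sign = de_calls[(subtype, other)], 1
        else:
            # Pair stored the other way round, so flip the direction.
            calls, sign = de_calls[(other, subtype)], -1
        up_here = {g for g, c in calls.items() if c * sign > 0}
        down_here = {g for g, c in calls.items() if c * sign < 0}
        # Keep only genes that agree in every comparison seen so far.
        up = up_here if up is None else up & up_here
        down = down_here if down is None else down & down_here
    return up, down


# Hypothetical toy calls for the three comparisons involving basal.
toy_calls = {
    ("basal", "classical"): {"G1": 1, "G2": 1, "G3": -1, "G4": 0},
    ("basal", "primitive"): {"G1": 1, "G2": 0, "G3": -1, "G4": 1},
    ("basal", "secretory"): {"G1": 1, "G2": 1, "G3": -1, "G4": -1},
}
basal_up, basal_down = signature_genes(toy_calls, "basal")
```

In the toy data only G1 is up in all three comparisons and only G3 is down in all three, so those are the only genes kept; G2 and G4 are dropped because they are not consistent across comparisons.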

Table 1 The number of signature/DE genes based on SqCC data

For the second type of comparison (‘1 vs others’), we identified 618 upregulated and 305 downregulated genes when comparing the basal subtype to the other three subtypes. There are 626 upregulated and 785 downregulated genes in the primitive subtype, 702 upregulated and 866 downregulated genes in the classical subtype, and 1360 upregulated and 436 downregulated genes in the secretory subtype (Table 1).

Pathway analysis for SqCC subtypes

To better understand the four subtypes in terms of pathways, we performed a CAMERA gene set test (Wu and Smyth, 2012), which considers the expression of gene sets, such as pathways, instead of individual genes. We used the curated gene sets in a publicly available gene set database, the Category 2 in MsigDB (human version) (Subramanian et al, 2005), which comprises 3272 sets. We aimed to identify sets of genes that are differentially expressed among the SqCC subtypes.

Here, we focused on the basal and primitive subtypes using the ‘1 vs others’ comparisons. At a CAMERA FDR cutoff of 0.05, the basal subtype has seven significant gene sets in the up direction but none in the down direction (Supplementary Table 1), and the primitive subtype has 628 significant gene sets in the down direction but none in the up direction. All seven basal gene sets are also among the 628 primitive gene sets, but with opposite directions in the two SqCC subtypes.

Next, for the 628 gene sets that are down in the primitive subtype, we re-adjusted for multiple testing the P-values generated by CAMERA in the comparison of basal vs others. This focuses on the significance of the primitive-associated gene sets in the comparison of basal vs others. On the basis of this procedure, 26 of the 628 are significant in basal vs others, and all of them are upregulated in basal (Supplementary Table 2). The significance of the gene set CHARAFE-BREAST-CANCER-BASAL-VS-MESENCHYMAL-UP (adjusted P 0.01 in primitive down, 0.03 in basal up) with 115 genes suggests some shared genes in the breast cancer basal subtype and the SqCC basal subtype.
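Adjusting P-values within a pre-selected subset of gene sets, as described above, can be illustrated as follows. This is a sketch under assumptions: the text reports FDR values, so the Benjamini and Hochberg procedure is assumed here as the adjustment, and the gene set names and P-values are invented for illustration.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def adjust_within_subset(pvals_by_set, subset):
    """Adjust only the P-values of the gene sets named in `subset`."""
    names = [name for name in pvals_by_set if name in subset]
    return dict(zip(names, bh_adjust([pvals_by_set[n] for n in names])))


# Hypothetical nominal P-values from one comparison.
nominal = {"SET_A": 0.001, "SET_B": 0.004, "SET_C": 0.03,
           "SET_D": 0.2, "SET_E": 0.6}
all_sets_adjusted = dict(zip(nominal, bh_adjust(list(nominal.values()))))
subset_adjusted = adjust_within_subset(nominal, {"SET_A", "SET_B", "SET_C"})
```

Because the subset contains fewer tests, the adjusted P-value of a given gene set can only stay the same or shrink relative to adjusting over every gene set, which is why restricting attention to the 628 sets gives more power in the second comparison.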

Both of the gene sets RICKMAN-TUMOR-DIFFERENTIATED-WELL-VS-POORLY-DN (with 361 genes) and RICKMAN-TUMOR-DIFFERENTIATED-WELL-VS-MODERATELY-DN (with 108 genes) are significantly downregulated in primitive, and significantly upregulated in basal at adjusted P 0.006 and 0.011, respectively. On the other hand, the RICKMAN-TUMOR-DIFFERENTIATED-WELL-VS-POORLY-UP gene set (with 226 genes) is the most upregulated gene set in primitive, with nominal P-value 0.0015.

This gene set is also the second most downregulated gene set in basal vs others, with the nominal P-value 0.0001. In brief, genes overexpressed in the well-differentiated tumours are upregulated in primitive and genes underexpressed in the well-differentiated tumours are downregulated in primitive subtype. Therefore, the primitive subtype may originate from a later differentiation stage than the basal subtype.

Our pathway analysis results indicate that the basal and primitive subtypes are very different subtypes, as they do not share significant pathways in the same direction. The fact that the basal SqCC subtype and the primitive subtype display the most discordant overall survival in patients (Wilkerson et al, 2010) may be explained by the differences in signature genes and signature gene sets of the two subtypes.

Relationship between human bronchial epithelial normal cells at different culture time points and SqCC subtypes

SqCC has been considered to initiate in human bronchial epithelial cells but it is not clear when, how and in which subpopulation of the cells (Wilkerson et al, 2010). Investigators have compared the SqCC subtypes to several model systems of normal lung cell compartments. Here, we sought to evaluate this subtype to cell-type relationship using the TCGA cohort and alternative statistical methodology to that reported by Wilkerson et al (2010). We used a previously reported gene-expression data set from the HBEC-ALIC cell line in a time series of cultured normal, healthy, human bronchial epithelial cells (Ross et al, 2007; Wilkerson et al, 2010). The cells were collected at 11 different time points from day 0 to day 28.

The order of the time points of the samples can almost be recovered from the first dimension of the MDS plot (Figure 2). This suggests that the samples can be clustered according to time point. The heatmap of this data set further confirms that there are three clusters among the samples (Figure 2). For convenience, we defined the three clusters as days 0, 1, 2, 4 for the early stage, days 8, 10, 12 for the middle stage and days 14, 17, 21, 28 for the late stage.

Figure 2

Sample relationship in the normal airway time-course data. (A) Multidimensional scaling plot of the normal airway time-course data. The first dimension represents the HBEC-ALIC time points well and suggests that the samples can be clustered into three stages: early, middle and late. (B) Heatmap of the bronchial time-course data based on hierarchical clustering. The five hundred genes with the largest variability across samples were used. Columns represent samples and rows represent genes. The x-axis label gives the days in culture. This plot further supports clustering the 11 HBEC-ALIC time points into three clusters. Days 0, 1, 2, 4 are the early stage; days 8, 10, 12 are the middle stage and days 14, 17, 21, 28 are the late stage.

We used the linear models and empirical Bayes methods (Smyth, 2005) to perform the DE analysis of the bronchial time-course data. In the linear model, the difference among time points was considered. We also included the donor ID as a factor variable in a random effect model, in which the correlation among samples from the same donor was computed first before fitting the model. In fact, the variable of time points can be either taken as a continuous variable or as a categorical variable into three stages of early, middle and late (Figure 2).

Here, we show only the DE results from the three clusters of samples. The FDR was controlled globally using the Benjamini and Hochberg algorithm. Probes with FDR <0.05 and fold change >2 were judged to be differentially expressed. Comparing the middle stage to the early stage, there are 605 upregulated probe sets and 302 downregulated probe sets. Comparing the late stage to the middle stage, using the same criteria, there are 843 upregulated probe sets and 534 downregulated probe sets.

Following logic similar to that of prior work (Lim et al, 2009), signature scores of the subtypes and the bronchial epithelial culture time points were computed. The signature scores combine two pieces of information: the average logFC of signature genes in SqCC subtypes and the expression level at bronchial epithelial culture time points. A signature score is defined for each SqCC subtype and each bronchial sample. The higher the score, the more similar the subtype and the bronchial sample are. The signature scores were plotted according to subtypes and bronchial culture time points in Figure 3. A linear model was run to test the trend of the signature scores against the corresponding actual time points. The P-values obtained from the linear model are as follows: 4.13e-05 for basal with slope −0.014, 0.205 for classical with slope −0.001, 1.74e-05 for primitive with slope 0.010733 and 0.0519 for secretory with slope 0.0017. This suggests that there is a significant association between SqCC subtype signature scores and time points in the basal and primitive subtypes. There is marginal significance in the secretory subtype and no significance in the classical subtype. Regarding the slope, the classical and secretory subtypes also have much smaller slopes, which are 10–20% of the slopes of the other two subtypes.

In general, our results confirm what was previously published using different tumour cohorts (Wilkerson et al, 2010). The basal signature scores are highest in the early bronchial samples, whereas the primitive signature scores are highest in the late stage. Therefore, the basal SqCC subtype is most similar to early bronchial samples, in which there are predominantly basal cells, and the primitive subtype is most similar to late bronchial samples, in which there are many cell types and greater proliferation. The classical signature score is highest at 2d and 4d but lower at other early and late time points. The classical subtype may come from early time points, but later than the stage from which the basal subtype comes. The secretory subtype is similar to the middle-stage and late-stage culture, in which there are more secretory cells, although this association is not as extreme as the primitive trend in our signature score approach. This analysis discriminated the primitive and secretory subtypes clearly, further indicating that these subtypes have distinct biological properties.
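One way to picture the signature score and the trend test is shown here. The text does not give the exact formula, so the logFC-weighted mean used below is an assumption (in the spirit of Lim et al, 2009), and the example scores are invented; only the 11 culture days are taken from the text. The slope is an ordinary least-squares slope of score against day.

```python
def signature_score(log_fc, expression):
    """Assumed score: sum(logFC_g * expr_g) / sum(|logFC_g|).

    log_fc: gene -> logFC of a signature gene in the SqCC subtype.
    expression: gene -> log expression in one bronchial sample.
    Genes missing from the bronchial sample are ignored.
    """
    genes = [g for g in log_fc if g in expression]
    total_weight = sum(abs(log_fc[g]) for g in genes)
    return sum(log_fc[g] * expression[g] for g in genes) / total_weight


def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# The 11 culture days from the text; the scores are hypothetical.
days = [0, 1, 2, 4, 8, 10, 12, 14, 17, 21, 28]
toy_scores = [0.40, 0.38, 0.37, 0.33, 0.25, 0.22, 0.20, 0.15, 0.12, 0.05, -0.02]
toy_slope = ols_slope(days, toy_scores)  # negative: score falls over time
```

A negative slope, as reported for basal, means the score is highest at the early time points; a positive slope, as reported for primitive, means it is highest late in culture. Testing the slope for significance additionally needs its standard error, which is omitted from this sketch.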

Figure 3

Signature scores of the SqCC subtype signature genes in the normal bronchial time-course data. x-axis represents the days. The higher the scores are, the more similar the subtype and the time-course samples become.

To statistically confirm the conclusion we drew from the signature scores, we applied a self-contained gene set test, the rotation gene set test (ROAST) (Wu et al, 2010), to each subtype signature gene set in the comparisons between the middle and early stages, and between the late and middle stages. A self-contained gene set test has high power to relate the subtype data to the bronchial time series data, and it quantifies the relationship with a P-value. We used the average of the moderated t values as the summary statistic in ROAST. The results (ROAST P-values 0.001–0.004 in different comparisons, detail not shown here) confirmed the above conclusion and further suggested that the order of similarity to the early bronchial epithelial culture time, from high to low, is basal, classical, secretory and primitive, whereas the order of similarity to the late culture is reversed, being highest in the primitive subtype and lowest in the basal subtype.

Relationship between 28 SqCC human cell lines and the SqCC subtypes

To help direct efforts for the discovery of therapeutic targets in lung SqCCs, we classified 28 lung SqCC lines by expression subtypes. In our study, a microarray data set of 28 SqCC cell lines (Table 2) was obtained from the Broad-Novartis Cancer Cell Line Encyclopedia (CCLE) downloaded from www.broadinstitute.org/ccle in May 2013. In Wilkerson et al (2010), four SqCC cell lines, HCC-15, HCC-95, HCC-2450 and H-157, have been previously classified into one of the four SqCC subtypes. Among the four cell lines, HCC-15 and HCC-95 are the only two cell lines in the 28 SqCC cell lines in CCLE.

Table 2 Classification results of the 28 SqCC cell lines to SqCC subtypes

The signature scores of the four SqCC subtypes in each of the cell lines were computed (Figure 4). Here, these signature scores were calculated based on the general cutoff as fold-change 2 and FDR 0.05 to obtain the signature gene sets. The cell line ranks of the scores in the four subtypes remain similar even if a less stringent cutoff (fold-change 1.5 and FDR 0.1, detail unpublished) is used. Higher scores represent higher similarities between cell lines and SqCC subtypes.

Figure 4

Cell-line ranks of signature scores of the SqCC subtype signature genes in the 28 SqCC cell lines. x-axis represents the different SqCC cell lines. y-axis represents the ranks of the signature scores in each SqCC subtype.

A procedure was developed to generate reproducible classification results using signature scores. We rank the 28 cell lines based on their signature scores in a subtype, for example the basal subtype. The ranks run from 1 to 28. With one row per cell line in a data matrix, we have four columns of ranks (cell line rank per subtype), as seen in Supplementary Table 14, and the ranks are plotted per cell line in Figure 4. The top ranks have smaller rank numbers. Therefore, for each cell line, the subtype with the smallest rank number was considered the ‘1st subtype’ of that cell line, and the subtype with the second smallest rank number the ‘2nd subtype’, as listed under ‘subtype rank’ in Supplementary Table 14 and in the corresponding brief Table 2. Here, we use this procedure to determine the most similar subtype to a cell line based on signature scores. We also include the ‘2nd subtype’ to represent variability.
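The ranking procedure above can be sketched in a few lines. The cell line names and scores below are hypothetical, and the text does not say how tied ranks are resolved, so this sketch simply keeps the subtype listed first when two subtypes share a rank.

```python
def rank_descending(scores):
    """Map cell line -> rank, where rank 1 is the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def first_and_second_subtype(scores_by_subtype):
    """Return cell line -> (1st subtype, 2nd subtype) by smallest rank.

    scores_by_subtype: subtype -> {cell line -> signature score}.
    """
    ranks = {s: rank_descending(sc) for s, sc in scores_by_subtype.items()}
    cell_lines = next(iter(scores_by_subtype.values()))
    calls = {}
    for line in cell_lines:
        # Subtypes ordered by this cell line's rank within each subtype.
        by_rank = sorted(ranks, key=lambda s: ranks[s][line])
        calls[line] = (by_rank[0], by_rank[1])
    return calls


# Hypothetical signature scores for three cell lines.
toy_signature_scores = {
    "basal":     {"line_A": 0.9, "line_B": 0.2, "line_C": 0.5},
    "classical": {"line_A": 0.1, "line_B": 0.8, "line_C": 0.6},
    "primitive": {"line_A": 0.3, "line_B": 0.4, "line_C": 0.7},
    "secretory": {"line_A": 0.2, "line_B": 0.1, "line_C": 0.3},
}
toy_calls_by_line = first_and_second_subtype(toy_signature_scores)
```

Ranking within each subtype first, rather than comparing raw scores across subtypes, means that subtypes whose scores span a narrow range (as noted for primitive) are still comparable with the others.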

The cell line LUDLU-1 is similar to both basal and classical subtypes. The cell line LC-1/sq-SF is most similar to the classical subtype, and the similarity was ranked quite low in all the other three subtypes. HCC-95 is most similar to the classical subtype, the same as suggested in Wilkerson et al (2010). Although Wilkerson et al (2010) suggested HCC-15 to be a primitive subtype, it may have some mixed features of other subtypes for two reasons. First, the range of signature scores of the cell lines to the primitive subtypes is small; therefore, the difference among cell lines may be subtle. Second, the signatures of HCC-15 in other three subtypes are not very low. The brief results are shown in Table 2.

We assigned cell lines by two methods. Signature scores using signature genes of subtypes provide a general relationship between cell lines and subtypes. Some cell lines do not fit easily into a subtype, for example sq-1, Calu-1 and LUDLU-1 (Figure 4). The classification method we used next gave a clearer indication of subtypes.

As a complementary subtype assignment, a nearest-centroid classification method (ClaNC) classified the 28 cell lines into the four SqCC subtypes, with seven basal, seven classical, five primitive and nine secretory cell lines (first subtype of ClaNC in Table 2).

As ClaNC uses the distance between the 28 cell lines and the 4 subtype centroids, we output the 28 × 4 distance matrix. The nearest centroid defines the ‘1st subtype’, that is, the classification result. We also obtained the ‘2nd subtype’ for a cell line, that is, the second nearest centroid. To evaluate the uncertainty of this classification method, we first made an MDS plot (Supplementary Figure 7) with the centralised tumour samples together with the centralised cell line samples. This shows that the cell line samples tend to lie at the centre of all samples. We then permuted the subtype labels randomly 1000 times, generating four random centroids each time. The distance between cell lines and these random centroids is mostly on a much smaller scale (data not shown) than the observed distance, because the cell-line samples tend to lie at the centre of all samples. To correct this scaling bias, for each sample we converted each distance to a percentage of the sum of the four distances. This was carried out for both the permutation-based centroids and the observed centroids. P-values were computed as the probability that the observed percentage-distance is larger than the percentage-distance from the permutations. A smaller P-value indicates that the cell-line sample is significantly close to its nearest centroid. The same procedure was performed for the ‘2nd subtype’, which shares exactly the same permutations. This is how we assessed the uncertainty of the classification method ClaNC.
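The permutation scheme described above can be sketched as follows. Several details are assumptions made only for illustration: plain Euclidean distance stands in for the distance ClaNC actually uses, the nearest permuted centroid is the one compared against the observed nearest centroid, and the toy coordinates are invented.

```python
import math
import random


def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    return [sum(coord) / len(points) for coord in zip(*points)]


def pct_distances(sample, centroids):
    """Distances to each centroid as fractions of their sum."""
    dists = [math.dist(sample, c) for c in centroids]
    total = sum(dists)
    return [d / total for d in dists]


def permutation_pvalue(sample, tumours, labels, n_perm=1000, seed=0):
    """Fraction of label permutations whose nearest percentage-distance
    is smaller than the observed nearest percentage-distance."""
    rng = random.Random(seed)
    subtypes = sorted(set(labels))

    def centroids_for(current_labels):
        return [centroid([t for t, lab in zip(tumours, current_labels)
                          if lab == s]) for s in subtypes]

    observed = min(pct_distances(sample, centroids_for(labels)))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # group sizes are preserved
        permuted = min(pct_distances(sample, centroids_for(shuffled)))
        if observed > permuted:
            hits += 1
    return hits / n_perm


# Hypothetical 2-D data: four tight clusters of five tumours each.
offsets = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]
centres = {"basal": (0, 0), "classical": (10, 0),
           "primitive": (0, 10), "secretory": (10, 10)}
toy_tumours = [(cx + dx, cy + dy)
               for cx, cy in centres.values() for dx, dy in offsets]
toy_labels = [name for name in centres for _ in offsets]
toy_p = permutation_pvalue((0.2, 0.1), toy_tumours, toy_labels, n_perm=200)
```

Dividing by the sum of the four distances is what makes the permuted and observed values comparable: random centroids all fall near the middle of the data, so their raw distances are small for a reason unrelated to subtype membership.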

With the results from signature scores (first and second subtypes) and ClaNC (first and second subtypes and P-values), we determined the predicted subtype of a cell line. The criteria are as follows: if the permutation-based P-value for the ClaNC first subtype is <0.2, the ClaNC first subtype is taken as the prediction. If not, a majority vote is used among the four columns of first and second subtypes, and the prediction for these samples is highlighted in blue in Supplementary Table 14 as less certain (Table 2).
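The decision rule above can be written as a small function. The function name and argument layout are hypothetical, and the text does not say how a tied vote is resolved, so this sketch falls back on the first subtype to reach the top count.

```python
from collections import Counter


def predict_subtype(clanc_first, clanc_p, votes, p_cutoff=0.2):
    """Return (predicted subtype, is_confident).

    clanc_first: ClaNC first subtype for the cell line.
    clanc_p: permutation-based P-value for that first subtype.
    votes: the four calls (signature 1st and 2nd, ClaNC 1st and 2nd).
    """
    if clanc_p < p_cutoff:
        return clanc_first, True
    # Otherwise take the most frequent call among the four columns.
    # Ties keep the first subtype encountered (an assumption).
    top_subtype, _count = Counter(votes).most_common(1)[0]
    return top_subtype, False


confident_call = predict_subtype(
    "classical", 0.05, ["classical", "basal", "classical", "primitive"])
voted_call = predict_subtype(
    "basal", 0.5, ["classical", "basal", "classical", "secretory"])
```

The second element of the result plays the role of the blue highlighting in Supplementary Table 14: it marks predictions that rest on the vote and so carry less certainty.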

In the ClaNC results, both cell lines HCC-95 and HCC-15 were classified as classical samples. Therefore, HCC-95 is highly likely to be a classical cell line, being consistent with the previous classification (Wilkerson et al, 2010) and the results of signature scores. HCC-15 has a mixed background.

Drug target of cancer cell lines

To determine whether expression subtype may predict drug response in SqCC cell lines, we obtained publicly available drug-sensitivity data for SqCC cell lines (www.broadinstitute.org/ccle/data).

In the context of the pharmacological profiles for 24 anticancer drugs across 504 cancer cell lines (Barretina et al, 2012), we located the drug sensitivities for the different SqCC subtypes. Only 17 cell lines among the 28 SqCC cell lines were treated in this drug-response experiment (Barretina et al, 2012) (Table 2), with 24 drugs at eight dosages. According to ClaNC, the 17 cell lines comprise four basal, four classical, three primitive and six secretory SqCC cell lines. In Barretina et al (2012), a novel score termed ‘activity area’ was created to combine the information of the half maximal inhibitory concentration (IC50), the half maximal effective concentration (EC50) and maximum inhibited percentage (MIP). A large activity area comprising small IC50, small EC50 and larger MIP indicates high sensitivity of a cell line to a drug (Supplementary Table 3, previously published (Barretina et al, 2012)).

The response sensitivities, represented by activity area, for the 17 cell lines as shown in Figures 5A and B, are varied across drugs. Most cell lines responded to five of the drugs (Panobinostat, 17-AAG, Irinotecan, Topotecan and Paclitaxel). The drug targets of these five drugs are HDAC, HSP90, Topoisomerase-I, Topoisomerase-I and beta-tubulin, respectively. A basal cell line EBC-1 is also sensitive to three additional drugs (PF2341066 with target c-MET, AZD6244 with target MEK and PD-0325901 with target MEK).

Figure 5

Drug sensitivity to SqCC subtypes through CCLE. (A) Heatmap of the activity area score for 17 SqCC cell lines (four basal, four classical, three primitive and six secretory SqCC cell lines based on ClaNC) and the 24 drugs. The white blocks in the plot indicate missing data where a cell line was not treated with a drug. The rows represent the SqCC cell lines and the columns represent the 24 drugs. Both dimensions have been clustered by hierarchical clustering. The ClaNC results of SqCC subtype classification are shown for each cell line. (B) Scatter plot of the activity area score for 17 cell lines and 24 drugs. Colours represent the four subtypes. (C) On the left panel, proliferation scores for secretory SqCC samples or other SqCC samples (P-value 7.5e-05). On the right panel, the activity area score of all 24 drugs for secretory cell lines or others (P-value 0.014). The secretory SqCC subtype has lower proliferation scores and lower activity area scores of drug treatment. Two-sided P-values were obtained by Wilcoxon rank sum test (P-value 7.5e-05 on the left, 0.068 on the right). (D) Focusing on the five drugs (Panobinostat, 17-AAG, Irinotecan, Topotecan and Paclitaxel), this panel compares the area scores for secretory cell lines with the scores in each of the other three SqCC types of cell lines. Two-sided Wilcoxon mean rank test was used (secretory vs basal P-value 0.002, vs classical P-value 0.006 and vs primitive P-value 0.071).

Compared with most of the basal and classical cell lines, the secretory cell line NCIH-226 is least sensitive to the drugs Panobinostat, 17-AAG, Irinotecan, Topotecan, Paclitaxel, AZD6244 and PF2341066. For each cell line, we can calculate an average activity area across drugs by averaging the columns in the heatmap. The global mean of the average activity areas across the drugs for the 17 cell lines is 1.34 (s.d.=0.41).

On the basis of Figure 5B, there is a trend for the secretory cell lines (orange colour) to have lower activity area scores across drugs. In particular, all cell lines have at least a moderate response to Paclitaxel, whereas the secretory cell lines respond less to Paclitaxel than the others. As the secretory SqCC tumours have lower proliferation activity, we investigated a proliferation score for each tumour sample. We used a proliferation signature set of 43 genes (Whitfield et al, 2006) to obtain a proliferation score for a tumour sample, defined as the average log expression of these 43 genes in that sample. A higher proliferation score represents higher proliferation activity. Figure 5C (left) shows that the secretory SqCC samples have significantly (P 7.5e-05) lower proliferation scores than other subtypes. This might explain why the cell lines of the secretory SqCC subtype have lower activity area scores of drug treatment at P-value 0.014 (Figure 5C, right). Focusing on the five drugs (Panobinostat, 17-AAG, Irinotecan, Topotecan and Paclitaxel), we used a two-sided Wilcoxon mean rank test to test whether the average area scores for secretory cell lines are significantly different from the scores in each of the other three SqCC types of cell lines (Figure 5D). The secretory cell lines have lower scores than basal (P-value 0.002), classical (P-value 0.006) and primitive cell lines (P-value 0.071), although the last comparison does not reach the 0.05 level.
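The proliferation score and a two-group rank test can be sketched as follows. The score follows the definition in the text (average log expression over the signature genes). The test is an exact permutation form of the two-sided rank sum test, which is an assumption about the variant used; it enumerates all group assignments, so it only suits small groups such as the cell-line comparisons, and a normal approximation or a statistics library would be needed for the 178 tumour samples. Gene names are hypothetical.

```python
from itertools import combinations


def proliferation_score(log_expr, genes):
    """Average log expression of the signature genes in one sample."""
    return sum(log_expr[g] for g in genes) / len(genes)


def midranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def exact_rank_sum_p(x, y):
    """Exact two-sided P-value for the rank sum of group x."""
    ranks = midranks(list(x) + list(y))
    n, total = len(x), len(x) + len(y)
    expected = n * (total + 1) / 2
    observed = abs(sum(ranks[:n]) - expected)
    hits = count = 0
    # Enumerate every way of choosing which samples form group x.
    for chosen in combinations(range(total), n):
        count += 1
        if abs(sum(ranks[i] for i in chosen) - expected) >= observed - 1e-9:
            hits += 1
    return hits / count


toy_p_value = exact_rank_sum_p([1, 2, 3], [4, 5, 6])
toy_score = proliferation_score({"g1": 2.0, "g2": 4.0, "g3": 9.0}, ["g1", "g2"])
```

With three samples per group and complete separation, only 2 of the 20 possible assignments are as extreme as the observed one, so the smallest attainable two-sided P-value is 0.1; this illustrates why the small number of cell lines per subtype limits the significance that can be reached.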

Overall, the SqCC cell lines are not sensitive to any drug with EGFR or FGFR as the target. We report what we observed, and the results seem reasonable given the prior knowledge that EGFR-targeted drugs are not used for SqCC patients. Our results could be generalised if the current limitations on the number of cell lines and the number of drugs were lifted. Generally speaking, a larger number of SqCC cell lines and an increased number of profiled compounds may be required to draw more robust conclusions about drug repurposing for SqCC subtypes.