Genes that exist in old Affymetrix platform but not in the newer one

I am using two gene expression datasets, one from the Affy U95Av2 platform and one from the Affy U133 Plus 2.0 platform. When I map the Affy probe names to HUGO gene names, there are thousands of genes present in the newer Affy U133 Plus 2.0 dataset but not in the old Affy U95Av2 dataset, which is expected. But there are also 97 genes that exist on the old Affy U95Av2 platform but not on the Affy U133 Plus 2.0 platform. I would not expect that, because Affy U133 Plus 2.0 is a much newer platform and I would expect it to contain all genes that were measured by Affy U95Av2. What does that mean? Should I understand that those 97 gene measurements on the Affy U95Av2 platform were not reliable, and that's why they don't exist on Affy U133 Plus 2.0? Here are those 97 genes:

"ACSL4" "ACSM2A" "AP3S1" "AQP7" "ARPC3" "ATF4" "ATP5H" "BAK1" "BAK1P1" "CBX1" "CCL15" "CELP" "CFHR3" "CHEK2" "CLCNKA" "COL8A1" "CS" "CXorf40B" "CYP2D6" "DDI2" "EIF3F" "EIF3IP1" "EIF5AL1" "FCGR2A" "FCGR3A" "GBX1" "GPX1" "HAVCR1" "HBZ" "HIST1H2AH" "HIST1H2AI" "HIST1H2BC" "HIST1H2BJ" "HIST1H4I" "HOXA9" "HSPB1" "IFNA14" "IGF2" "IL9R" "ITGA1" "KAT7" "KRT33A" "KRTAP26-1" "LDHA" "MAGEA12" "MAP2K4P1" "MIA" "MKRN3" "MROH7" "MSX2P1" "MT1A" "MT1B" "NDUFV2" "OPHN1" "OR7E24" "PARP4" "PCDHA12" "PCDHA13" "PCDHGA12" "PCDHGB4" "PINK1-AS" "PMS2P3" "PSMC6" "PSME2" "RAB13" "RCN1" "RNF216P1" "RNF5" "RPL10A" "RPL18" "RPL27" "RPL35" "RPL37" "RPLP1" "RPS15A" "RPS26" "RPS29" "RPS5" "RPS9" "RSC1A1" "S100A7" "SAA1" "SAA4" "SNX29" "SPRR2D" "TOMM40" "UBC" "UBE2E3" "UBE2S" "UGT2B7" "UQCRFS1" "UQCRH" "VDAC2" "VENTXP7" "VOPP1" "XCL2" "ZNF799"

I used to work at Affymetrix when most of these arrays were designed. I was not on the design team itself, but maybe I can shed some light on this.

RNA array designs were built to cover anything that might possibly be a real transcript in the mix of EST collections, cDNAs, in silico gene predictions and miscellaneous entries in public databases. A lot of different people were trying to find genes as quickly as possible, and naturally a good deal of it was not real gene content. I'm sure there was a reasonable amount of contamination in the millions of transcripts we took in as well.

The team would find a good number of errors in the sequence databases. There is no way to submit these corrections in a meaningful way to most bioinformatics databases, by the way. Just a note :)

When a new design came out, the team would audit the content to see whether any of the transcripts had fallen out of favor with the evidence, and some of those 'genes' would be pared from the content.

This is useful because DNA hybridization tech is very high throughput for the dollar, but it has background noise: even a probeset with no corresponding transcript in the RNA sample will give non-zero numbers.

RNA-seq has similar issues, BTW, from assembly errors and from sensitivity limits set by read depth on the sample. There's no perfect solution as of yet.

BTW, sometimes genes get renamed. I didn't dig into your methods to see whether this is the case here, but it's something to keep in mind.

My experience is with Affymetrix probes for Drosophila, not H.sapiens, and only with one version. Nevertheless I'll describe the situation I encountered in case it is relevant to yours. Apologies if it is a red herring.

What I did with the Affymetrix data sheet was use it to construct my own SQL relational database containing probesetIDs and geneIDs (as well as the experimental data, of course). I was then able to make some 'housekeeping' queries to the database and was surprised (perhaps I shouldn't have been) to find the following:

  • Some genes were picked up by more than one probeset. No great worry: I just had to choose the probeset that gave the highest signal, unless it fell into the second category.
  • Some probesets picked up more than one gene. This was a problem, and meant I had to classify the probesets as ambiguous or unambiguous. An even greater problem was that for some genes there existed no unambiguous probesets.
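For anyone wanting to reproduce these housekeeping queries, here is a sketch using an in-memory SQLite table; the table and column names are made up for illustration, not my original schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mapping (probeset_id TEXT, gene_id TEXT)")
con.executemany("INSERT INTO mapping VALUES (?, ?)", [
    ("ps1", "geneA"), ("ps2", "geneA"),   # one gene hit by two probesets
    ("ps3", "geneB"), ("ps3", "geneC"),   # one probeset hitting two genes
    ("ps4", "geneD"),
])

# Category 1: genes picked up by more than one probeset
multi_ps = con.execute(
    "SELECT gene_id FROM mapping GROUP BY gene_id "
    "HAVING COUNT(DISTINCT probeset_id) > 1").fetchall()

# Category 2: probesets that pick up more than one gene (ambiguous)
ambiguous = con.execute(
    "SELECT probeset_id FROM mapping GROUP BY probeset_id "
    "HAVING COUNT(DISTINCT gene_id) > 1").fetchall()

print(multi_ps)    # geneA
print(ambiguous)   # ps3
```

The same two aggregation queries apply unchanged to a full-size probeset/gene table.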

Obviously, in designing the probesets Affymetrix thought they were producing unambiguous, gene-specific ones. When they updated the probesets to include new or corrected gene designations, one imagines they would try to deal with this problem (assuming it also existed in the human gene sets). It seems hard to believe, but could the genes you mention be refractory to the design of unambiguous probesets?

Comparative transcriptome analysis of embryonic and adult stem cells with extended and limited differentiation capacity

Recently, several populations of postnatal stem cells, such as multipotent adult progenitor cells (MAPCs), have been described that have broader differentiation ability than classical adult stem cells. Here we compare the transcriptome of pluripotent embryonic stem cells (ESCs), MAPCs, and lineage-restricted mesenchymal stem cells (MSCs) to determine their relationship.


Applying principal component analysis, non-negative matrix factorization and k-means clustering algorithms to the gene-expression data, we identified a unique gene-expression profile for MAPCs. Apart from the ESC-specific transcription factor Oct4 and other ESC transcripts, some of which are associated with maintaining ESC pluripotency, MAPCs also express transcripts characteristic of early endoderm and mesoderm. MAPCs do not, however, express Nanog or Sox2, two other key transcription factors involved in maintaining ESC properties. This unique molecular signature was seen irrespective of the microarray platform used and was very similar for both mouse and rat MAPCs. As MSC-like cells isolated under MAPC conditions are virtually identical to MSCs, and MSCs cultured in MAPC conditions do not upregulate MAPC-expressed transcripts, the MAPC signature is cell-type specific and not merely the result of differing culture conditions.
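The multivariate approach described (dimension reduction followed by clustering) can be sketched with a toy expression matrix and a NumPy-only PCA and k-means; this is an illustration of the technique, not the study's actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy expression matrix: 6 samples x 50 genes, two underlying cell types
X = np.vstack([rng.normal(0, 1, (3, 50)), rng.normal(3, 1, (3, 50))])

# PCA: center, then project onto the top 2 right-singular vectors
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T          # sample coordinates on PC1/PC2

# Naive k-means (k=2) on the PCA scores
centers = scores[[0, -1]]
for _ in range(10):
    labels = np.argmin(((scores[:, None] - centers) ** 2).sum(-1), axis=1)
    centers = np.array([scores[labels == k].mean(axis=0) for k in range(2)])

print(labels)  # the two synthetic groups separate into two clusters
```

In real analyses the clustering would of course be run on full expression matrices with library implementations (e.g. scikit-learn), but the structure is the same.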


Multivariate analysis techniques clustered stem cells on the basis of their expressed gene profile, and the genes determining this clustering reflected the stem cells' differentiation potential in vitro. This comparative transcriptome analysis should significantly aid the isolation and culture of MAPCs and MAPC-like cells, and form the basis for studies to gain insights into genes that confer on these cells their greater developmental potency.


The development started with early work on the underlying sensor technology. One of the first portable, chemistry-based sensors was the glass pH electrode, invented in 1922 by Hughes. [2] The basic concept of using exchange sites to create permselective membranes was used to develop other ion sensors in subsequent years. For example, a K+ sensor was produced by incorporating valinomycin into a thin membrane. [3]

In 1953, Watson and Crick announced their discovery of the now familiar double-helix structure of DNA and set the stage for genetics research that continues to the present day. [4] The development of sequencing techniques in 1977 by Gilbert [5] and Sanger [6] (working separately) enabled researchers to directly read the genetic codes that provide instructions for protein synthesis. This research showed how hybridization of complementary single oligonucleotide strands could be used as a basis for DNA sensing. Two additional developments enabled the technology used in modern DNA-based biochips. First, in 1983 Kary Mullis invented the polymerase chain reaction (PCR) technique, [4] a method for amplifying DNA concentrations, which made possible the detection of extremely small quantities of DNA in samples. Second, in 1986 Hood and co-workers devised a method to label DNA molecules with fluorescent tags instead of radiolabels, [7] thus enabling hybridization experiments to be observed optically.

Figure 1 shows the makeup of a typical biochip platform. The actual sensing component (or "chip") is just one piece of a complete analysis system. Transduction must be done to translate the actual sensing event (DNA binding, oxidation/reduction, etc.) into a format understandable by a computer (voltage, light intensity, mass, etc.), which then enables additional analysis and processing to produce a final, human-readable output. The multiple technologies needed to make a successful biochip—from sensing chemistry, to microarraying, to signal processing—require a true multidisciplinary approach, making the barrier to entry steep. One of the first commercial biochips was introduced by Affymetrix. Their "GeneChip" products contain thousands of individual DNA sensors for use in sensing defects, or single nucleotide polymorphisms (SNPs), in genes such as p53 (a tumor suppressor) and BRCA1 and BRCA2 (related to breast cancer). [8] The chips are produced using microlithography techniques traditionally used to fabricate integrated circuits (see below).

The microarray—the dense, two-dimensional grid of biosensors—is the critical component of a biochip platform. Typically, the sensors are deposited on a flat substrate, which may either be passive (e.g. silicon or glass) or active, the latter consisting of integrated electronics or micromechanical devices that perform or assist signal transduction. Surface chemistry is used to covalently bind the sensor molecules to the substrate medium. The fabrication of microarrays is non-trivial and is a major economic and technological hurdle that may ultimately decide the success of future biochip platforms. The primary manufacturing challenge is the process of placing each sensor at a specific position (typically on a Cartesian grid) on the substrate. Various means exist to achieve the placement, but typically robotic micro-pipetting [9] or micro-printing [10] systems are used to place tiny spots of sensor material on the chip surface. Because each sensor is unique, only a few spots can be placed at a time. The low-throughput nature of this process results in high manufacturing costs.

Fodor and colleagues developed a unique fabrication process (later used by Affymetrix) in which a series of microlithography steps is used to combinatorially synthesize hundreds of thousands of unique, single-stranded DNA sensors on a substrate one nucleotide at a time. [11] [12] One lithography step is needed per base type; thus, a total of four steps is required per nucleotide level. Although this technique is very powerful, in that many sensors can be created simultaneously, it is currently only feasible for creating short DNA strands (15–25 nucleotides). Reliability and cost factors limit the number of photolithography steps that can be done. Furthermore, light-directed combinatorial synthesis techniques are not currently possible for proteins or other sensing molecules.

As noted above, most microarrays consist of a Cartesian grid of sensors. This approach is used chiefly to map or "encode" the coordinate of each sensor to its function. Sensors in these arrays typically use a universal signalling technique (e.g. fluorescence), thus making coordinates their only identifying feature. These arrays must be made using a serial process (i.e. requiring multiple, sequential steps) to ensure that each sensor is placed at the correct position.

"Random" fabrication, in which the sensors are placed at arbitrary positions on the chip, is an alternative to the serial method. The tedious and expensive positioning process is not required, enabling the use of parallelized self-assembly techniques. In this approach, large batches of identical sensors can be produced sensors from each batch are then combined and assembled into an array. A non-coordinate based encoding scheme must be used to identify each sensor. As the figure shows, such a design was first demonstrated (and later commercialized by Illumina) using functionalized beads placed randomly in the wells of an etched fiber optic cable. [13] [14] Each bead was uniquely encoded with a fluorescent signature. However, this encoding scheme is limited in the number of unique dye combinations that can be used and successfully differentiated.

Microarrays are not limited to DNA analysis; protein microarrays, antibody microarrays, and chemical compound microarrays can also be produced using biochips. Randox Laboratories Ltd. launched Evidence, the first protein Biochip Array Technology analyzer, in 2003. In protein Biochip Array Technology, the biochip replaces the ELISA plate or cuvette as the reaction platform. The biochip is used to simultaneously analyze a panel of related tests in a single sample, producing a patient profile. The patient profile can be used in disease screening, diagnosis, monitoring disease progression or monitoring treatment. Performing multiple analyses simultaneously, described as multiplexing, allows a significant reduction in processing time and in the amount of patient sample required. Biochip Array Technology is a novel application of a familiar methodology, using sandwich, competitive and antibody-capture immunoassays. The difference from conventional immunoassays is that the capture ligands are covalently attached to the surface of the biochip in an ordered array rather than in solution.

In sandwich assays an enzyme-labelled antibody is used; in competitive assays an enzyme-labelled antigen is used. On antibody–antigen binding, a chemiluminescence reaction produces light. Detection is by a charge-coupled device (CCD) camera. The CCD camera is a sensitive and high-resolution sensor able to accurately detect and quantify very low levels of light. The test regions are located using a grid pattern, and the chemiluminescence signals are then analysed by imaging software to rapidly and simultaneously quantify the individual analytes.


Large microarray repositories like GEO and ArrayExpress focus on the archiving of expression data as used in specific publications. These archives play an essential role in biological science by allowing transparent replication of microarray analyses by other researchers. Experimenters using the same array platform often use different normalization methods for their analyses, so that data downloaded from different projects on GEO or ArrayExpress are unlikely to be directly comparable. GEO at NCBI provides GEO DataSets to alleviate this problem. A GEO DataSet contains a collection of biologically and statistically comparable microarray samples processed using the same platform. Unfortunately, there is a significant delay between when a sample is submitted to GEO and when it is available as a GEO DataSet. Only one-fifth of the number of samples in M3D were available from GEO DataSets (Figure 1A and B).

All of the available E. coli Affymetrix Antisense2 expression data for the transcription factor lexA and its known target recA were downloaded from NCBI GEO Profiles (A) and from the M3D compendium E_coli_v3_Build_1 (B and C). NCBI GEO Profile data is derived from NCBI GEO DataSets, which contain only a subset of the data in GEO; therefore many more samples were available for plotting from M3D (445) than from GEO (85). The correlation between lexA and its known target was higher when the raw data was uniformly normalized with RMA (C) rather than normalizing each microarray individually with MAS5 (A and B).

We have initially chosen to include only single-channel Affymetrix microarrays in M3D. The photolithography process used by Affymetrix allows all laboratories to start with a very consistent substrate for hybridization. In addition, the single-channel design eliminates the need for a common reference condition for all arrays. Thus, in contrast to two-color array designs, data from different laboratories and projects can be integrated without artifacts due to an inconsistent reference condition. The remaining systematic biases in the Affymetrix platform are due to researcher-specific differences in the RNA preparation and hybridization protocols. However, when the raw probe-level microarray data (CEL files) are normalized as a group with RMA (12), we find that these systematic researcher biases are small relative to the biological changes that occur across experimental conditions (7). In addition, the RMA-normalized data tends to have higher correlation between the expression of transcription factors and their known targets (Figure 1B and C).

To employ the RMA normalization approach in M3D, all expression profiles for a particular array design (e.g. the E. coli Antisense 2 array) are collected, uniformly normalized and deposited as a 'build'. Periodically, we add new expression profiles for a particular array design, renormalize all data, and release a new 'build'. This ensures that all experiments in any build are uniformly normalized and comparable across conditions. The renormalization process may result in small changes in the expression values of all profiles. Thus, all builds are labeled with a version number that references the underlying MySQL schema of the database and a build number that denotes the particular set of microarray data (e.g. E_coli_v3_Build_2 uses MySQL schema version 3 and is the second compendium built for E. coli). Builds are maintained in perpetuity. This system, like the build system used by the human genome assembly, allows computational researchers to specify the exact dataset used for a particular analysis.


We hereby present a comprehensive genome-wide validation of pooled genotyping on the higher-throughput SNP genotyping platforms. Using the complete Affymetrix 500 k array set as the basis of comparison, we have shown that the reliability and accuracy of pooled genotyping are as good as or better than on the previously tested 10 k and 100 k array sets. This comparison has been extended to the new SNP6.0 platform, which had not previously been shown to be useful for pooled genotyping. We believe this work reaffirms that SNP-MaP is a viable alternative to individually genotyping a large sample population.

Novel Pooling strategy

Strategies for pooled genotyping have classically used at least 3 identical replicate pools, with the intent of "averaging" out the error normally associated with pooling [16, 17, 22]. The novel pooling strategy presented in this paper does not aim to replace the tried and tested method of replicates, but is rather proposed as an alternative. Although it began as a thought experiment, the results exceeded our expectations. Our pooling strategy involved the creation of 3 overlapping pools from 3 sub-pools of 20 samples each. Comparing the accuracy of the allele frequency estimates of each of the sub-pools to the average obtained across all 3 sub-pools (Table 1) showed that the strategy of overlapping pools produced improvements in allele frequency estimation similar to those obtained from pooled replicates. As pooled replicates were not used for this part of our study, we compared the performance of our overlapping sub-pools with that of pooled replicates as reported by others. The average correlation of estimated allele frequencies with actual allele frequencies improved by nearly 1% when the pools in each study group were considered as a whole and averaged, and the average error in the allele frequency estimates was reduced by up to 0.01. These improvements compare well with those obtained from our replicate pools on the SNP6.0 platform, as well as with other studies in which even more chips were used [19]. While each of the samples was in effect replicated twice across 3 pools, the pools technically could not be considered replicates; accordingly, the estimated allele frequencies from each of the 3 pools within our study groups were not as highly correlated with each other as they were with the actual allele frequencies they were estimating.
Nonetheless, we showed that when the estimates obtained from the 3 pools were aggregated, they estimated allele frequencies with an accuracy comparable to that achieved by replicate pools [5, 11, 16]. This supports the finding that a sufficient number of replicates can control the pooling error to give results very similar to those obtained from individual genotyping.

Estimating Allele Frequencies

Individual genotyping classically produces genotype calls for each sample, from which an average allele frequency can be calculated. In pooled genotyping, however, the microarray software is unable to assign a genotype because of the heterogeneous nature of the pooled sample and the unequal hybridization to the various probes, so an algorithm to estimate allele frequencies from probe intensities is needed. To account for the unequal allelic amplification in pooled genotyping, relative allele signal (RAS) values, used together with a k-correction to improve the accuracy of estimates, were employed initially [23]. This algorithm was extensively validated on the Affymetrix 10 k microarrays by various groups [9, 11–15]; its relative simplicity and accuracy made it highly popular among researchers. So even when a new algorithm (polynomial-based probe-specific correction, or PPC), which improved on the popular RAS/k-correction method, was proposed and shown to give the best estimates of allele frequency from pooled genotyping on the Affymetrix 10 k platform [21], the tried and tested algorithm prevailed, with its usefulness further extended to the Affymetrix 100 K microarray set [16] as well as the 500 K microarray set [5, 19]. The main criticisms of the PPC algorithm were the time-consuming computation in Perl and R, and the need for all 3 genotypes in the reference samples, which limits the number of SNPs analysed [19]. Our group felt that with the rapid advancements in computing technology in recent years, the former criticism should not prevent usage of the more accurate PPC algorithm, even when considering the large volumes of data generated by the Affymetrix 500 k array set. The second criticism may not really be valid, depending on the sample data set used to train the algorithm.
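The RAS/k-correction idea can be sketched as follows. Published formulations vary; this follows one common form, in which k is the A/B signal ratio observed in known heterozygotes (where the true allelic ratio is 1:1), and the numbers are invented for illustration:

```python
def k_from_heterozygotes(het_signals):
    """Mean A/B intensity ratio over known heterozygous reference samples.
    het_signals: list of (A_intensity, B_intensity) pairs."""
    return sum(a / b for a, b in het_signals) / len(het_signals)

def corrected_freq(a, b, k):
    """Estimated frequency of allele A in a pooled sample, rescaling the
    A signal by k to undo unequal allelic hybridization."""
    return (a / k) / (a / k + b)

# Allele A happens to hybridize ~20% more strongly in these toy references
k = k_from_heterozygotes([(1.2, 1.0), (1.3, 1.1)])
print(round(corrected_freq(6.0, 4.0, k), 3))
```

With k = 1 (equal hybridization) the estimate reduces to the plain RAS value a/(a+b); with k > 1 the raw RAS overestimates the A frequency and the correction pulls it down.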

Choice of Reference Data Set Affecting Accuracy of Allele Frequency Estimates

Regardless of the method used to estimate allele frequencies from the probe intensity data of pooled genotyping, a set of reference samples is essential. In most situations, allele frequency data from reference samples (usually from an appropriate Hapmap population) are used as a benchmark against which to compare the allele frequency estimates. While the issue of reference samples was raised [13] in the context of differential hybridization of heterozygous SNPs affecting the accuracy of allele frequency estimation from pooled genotyping, no follow-up studies have attempted to quantify these differences. We have shown in this paper that the choice of reference samples does impact the accuracy of allele frequency estimates.

Our initial comparison of the accuracy of allele frequency estimates from pooled genotyping on the 500 k platform revealed that using a genetically homogeneous reference sample set, such as one from a particular ethnic group, produced estimated allele frequencies which were more accurate than using a more heterogeneous one. While our use of the same set of samples for individual and pooled genotyping provided a better indication of the capabilities of the 500 k platform in allelotyping, it might be thought that such a result would be expected given that the same samples were used for both. Our results from the first individual vs pool comparison (Table 3) were confirmed in the second comparison of actual and estimated allele frequencies (Table 4) from a completely different set of pooled samples, where we showed a similar high level of accuracy of the estimates.

This difference can possibly be attributed to the availability of samples with all 3 genotypes for SNPs in the reference sample set. For the RAS method of calculating allele frequencies, the presence of heterozygous samples together with both homozygotes allows the calculation of the k-correction, which helps improve the accuracy of allele frequency estimates. Similarly, for PPC, the heterozygous samples allow the derivation of second-degree polynomials, which increase the accuracy of estimated allele frequencies by accounting for unequal hybridization efficiencies of different SNPs [21]. So a reference sample data set with a greater proportion of SNPs with heterozygous samples would, in theory, produce better allele frequency estimates than one with fewer such SNPs. Furthermore, a genetically heterogeneous population should have more SNPs with heterozygous members. The 500 k Sample Data Set had 72.7% (364,140) of all SNPs with homozygous and heterozygous samples, while our individually typed samples had 63.3% (316,623) of SNPs with all 3 genotypes represented in the sample population. The difference in the number of SNPs with all 3 genotypes between the two sample data sets reflects their heterogeneity: the 500 k Sample Data Set was made up of representatives from the four major Hapmap populations, whereas our own set of individually typed samples were all ethnic Chinese. However, while this difference is expected given the ethnic differences between the two sample sets, the disparity in the accuracy of the allele frequency estimates they produce is not. When our individually typed samples were used to estimate the polynomials (beta values) for PPC, the estimated allele frequencies were closer to the actual allele frequencies by more than 3% (mean difference in allele frequency of up to 0.05) when compared with the estimates obtained from the 500 k Sample Data Set (Table 3 and Table 4).
These results indicate that a greater proportion of SNPs with 3 available genotypes in the reference sample set does not necessarily improve the accuracy of allele frequency estimates. The SNPs that are (or are not) variable in the study population may not be the same as those in the reference population; in that case, variability (the presence of homozygous and heterozygous samples for any particular SNP) in the reference population is not helpful in improving the accuracy of estimated allele frequencies.
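The PPC idea described above can be sketched as a per-SNP quadratic fit: reference samples with known genotypes (true A-allele frequencies 0, 0.5 and 1) define a probe-specific polynomial that maps the observed signal fraction to a frequency estimate. The values and the use of `numpy.polyfit` here are illustrative, not the published implementation:

```python
import numpy as np

# Observed A-signal fractions in reference samples of known genotype
obs  = np.array([0.08, 0.12, 0.58, 0.62, 0.91, 0.95])  # BB, BB, AB, AB, AA, AA
true = np.array([0.0,  0.0,  0.5,  0.5,  1.0,  1.0])

# Fit the SNP's probe-specific second-degree polynomial ("beta values")
beta = np.polyfit(obs, true, deg=2)

# Apply it to the signal fraction observed in a pooled sample
estimate = np.polyval(beta, 0.60)
print(round(float(estimate), 2))
```

This also makes the all-3-genotypes requirement concrete: without heterozygous reference samples, the middle of the curve is unconstrained and the quadratic cannot be fitted reliably.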

We believed that SNP variability was related to the ethnicity of the samples in the reference data set. While complete reference sample data sets from different ethnicities were not easily available for the 500 k platform, complete data for all 270 Hapmap samples was made available by Affymetrix when the SNP6.0 was released. This allowed us to compare the accuracy of allele frequencies from our pooled genotyping, calculated using beta values from the four major Hapmap populations, against the allele frequencies of those very populations. While we have yet to do individual genotyping on the SNP6.0 platform, such a comparison is still valid, as we have already shown that our Singapore Chinese samples are similar to the Hapmap Han Chinese (CHB) population (unpublished data). Our results (Table 5) confirmed our suspicion that the ethnicity of the reference data set is indeed important: higher levels of accuracy were observed when allele frequencies were estimated from beta values calculated using a reference population of similar ethnicity. While the accuracy of estimation improved when the four Hapmap populations were considered as a whole compared with the 500 k Sample Data Set, this could have been due to the greater number of samples (270 vs 48) in the reference set. Although the CEU and YRI data sets had significantly more informative SNPs with all 3 genotypes called (66.31% and 72.47%, respectively), the CHB population data set still produced better estimates of allele frequency with a relatively lower proportion (55%) of such SNPs. Neither the increased numbers in the CEU, YRI and combined data sets over the CHB or JPT reference data sets, nor the availability of heterozygous samples with both homozygotes, improved the accuracy of the allele frequency estimates.
While the differences in accuracy observed with the different reference sample sets may be due to the rather disparate variability among the various Hapmap populations [24], the most important property of the reference sample set affecting the accuracy of allele frequency estimates is its ethnic background and whether it is shared with the study population.

The importance of a reference sample set that is genetically homogeneous with the study population in genome-wide association studies using pooled genotyping might be taken to mean that researchers studying a population for which reference genotyping data is not available (most likely outside the 4 main Hapmap populations) would need to perform a round of individual genotyping to generate a set of reference data for subsequent pooling experiments. This greatly detracts from the benefits offered by pooled genotyping as a more economical and efficient way of performing an initial whole-genome scan as part of an association study. However, this is where genotyping repositories, as suggested by various authors [12, 13], would come in useful, by providing complete reference data sets for populations not currently covered by the International Hapmap Project.

Validation of Pooled Genotyping on High Throughput Platforms

In this paper, we reinforce the capabilities of SNP-MaP as an alternative to individual genotyping of hundreds or thousands of samples in a genome-wide case-control association study. While pooled genotyping had previously been validated on the smaller-scale Affymetrix 10 k and 100 k array sets, similarly detailed analysis had not been done on the 500 k or newer SNP genotyping platforms. Previous validation studies have shown accuracies of pooled genotyping ranging from 0.923 [11] to 0.987 [13] on the 10 k platform, and from 0.971 [16] to 0.983 [17] on the 100 k array set. While pooled genotyping seemed immensely popular on the relatively lower-throughput 10 k and 100 k genotyping platforms, researchers did not seem equally enthused with the newer, more efficient SNP genotyping chips [15]. This could have been due to apprehension about the 'trade-offs' associated with trying to squeeze more probes onto a microchip: while both the 10 k and 100 k chips had 40 probes for each SNP, the 500 k and SNP6.0 arrays reduced this to 24 and 6 probes per SNP respectively, with certain SNPs being represented by an extra 4 and 2 probes respectively.

Nonetheless, validation of pooled genotyping was indeed carried out on the 500 k arrays, with estimation accuracies ranging from 0.926 [5] to 0.983 [19]. While Wilkening et al. used only 40% of the SNPs (those found on the Nsp I chip of the 500 k array set), Docherty et al. evaluated the performance of almost all the SNPs (> 90%) in the array set. Building on Docherty et al.'s work, we chose to base our study on the full repertoire of 500,568 SNPs. The high level of accuracy we have shown (Pearson's correlation = 0.988) is comparable with that obtained by others. The estimated allele frequencies show minimal deviation from the actual allele frequencies (mean error = 0.036), which is similarly comparable to previous studies. Despite the apprehension about pooled genotyping on the 500 k platform, we have shown that allelotyping of pooled samples on this platform is both reliable and accurate. These results add to the work done by others to further affirm that pooled genotyping is extremely viable on this higher-throughput platform.

We took this analysis one step further by focusing on the currently available ultra-high-throughput SNP6.0 genotyping platform and the 906,600 SNPs it covers (the other 946,000 probes on the SNP6.0 chip are for the detection of copy number variations, which are outside the scope of this paper). Estimated allele frequencies from our pooling experiment closely matched those from our selected reference data set (Pearson's correlation = 0.989, mean error = 0.035). Despite the reduction in intensity data available per SNP, the SNP6.0 platform seems as well suited as its predecessors for SNP-MaP. Although our allele frequency estimates from pooled genotyping on the SNP6.0 platform were compared against individual genotyping data for the Hapmap CHB samples instead of the samples in the pools (as we did in our validation on the 500 k platform), we are still highly confident of their relevance, given the ethnic similarity of the Hapmap CHB population and our Singapore Chinese samples.

In the 10 k and 100 k arrays, relative allele signal (RAS) data was readily available, allowing the use of the RAS method to estimate allele frequencies together with the k-correction to account for unequal hybridization. While such data was not directly available for the 500 k arrays, various authors [12, 15] provided scripts or formulae to extract this information from the raw intensity data. In these three generations of SNP chips, both PM (Perfect Match) and MM (Mis-Match) probes were present, allowing relative signal intensities to be calculated. With the newer SNP6.0 chip, however, only PM probes are available, probably due to the increased coverage of genetic variants. With only PM signal intensities available (instead of RAS signals), PPC was the only method for estimating allele frequencies from pooled genotyping data that still accounts for unequal hybridization. Prior to this study, PPC had only been validated on the 10 k platform [19, 21]. Following our validation of pooled genotyping on the 500 k array set using PPC for allele frequency estimation, the current assessment of the performance of the SNP6.0 array in SNP-MaP is, to our knowledge, the first on such a high-density microarray.

Previous studies [5, 12, 18] have suggested that high estimates of the reliability of pooled genotyping are inflated by a variety of factors, such as the quality of genotype calls for certain SNPs and rare or non-polymorphic SNPs. Both these factors were examined to evaluate their relationship with the accuracy of allele frequency estimates. We found (Table 6) that SNPs with missing genotype calls in the reference data set did not affect the accuracy of estimated allele frequencies derived from beta values calculated from the reference samples, contrary to what was previously reported [12]. Excluding SNPs which were rare in the reference sample set (minor allele frequency < 5%) did reduce the accuracy of allele frequency estimates slightly, to 0.976 (Table 7); however, this difference is minor, unlike what was reported before [5], and should not be taken as an indication that the high levels of accuracy observed were in fact due to non-polymorphic SNPs in the populations. As a measure of the performance of allele frequency estimation, sensitivity and specificity were calculated for subsets of SNPs at various minor allele frequency cut-offs. The high specificity (95.4%, Table 8) of allele frequency estimates for common SNPs (minor allele frequency > 5%) indicates that Type I errors in the approximation of the true allele frequency are low, without substantially compromising the sensitivity of the test (sensitivity = 85.9%).

Regardless of whether we compared our pooled estimates of allele frequencies with the actual allele frequencies obtained from our individually typed samples or with known allele frequencies from HapMap CHB samples, the allele frequency estimates that we obtained proved to be highly reliable. With reliability and validity improvements over those previously demonstrated on the 10 k, 100 k and 500 k arrays, we have shown that both the 500 k and SNP6.0 platforms perform well in pooled genotyping.

While we have shown that pooled genotyping allows the estimation of allele frequencies that are highly accurate compared to the actual allele frequencies, it cannot completely replace individual genotyping: the actual genotype data obtained from individual genotyping allows a more detailed analysis and understanding of the genomic variability in the sample population, and also permits linkage and haplotype analysis within the population. Furthermore, while the errors introduced by pooling are usually minimal, and random errors due to the array itself can be corrected for by having multiple pooled replicates [22], systematic errors due to the array might go unnoticed unless individual genotyping is done. Therefore, pooled genotyping is best suited to settings where relative rather than absolute allele frequencies are desired, such as case-control association studies. Even then, pooled genotyping should always be followed up by individual genotyping, such as in a two-stage study design [3], to validate the observations from the pooled estimates.


Optimised consensus clustering of one or more heterogeneous datasets.

Or read below for an easy-to-use clust command line!

Clust is a fully automated method for identification of clusters (groups) of genes that are consistently co-expressed (well-correlated) in one or more heterogeneous datasets from one or multiple species.

Figure 1: Clust processes one gene expression dataset to identify (K) clusters of co-expressed genes. Clust automatically identifies the number of clusters (K).

The multiple datasets case:

Figure 2: Clust processes multiple gene expression datasets (X1, X2, ..., X(L)) to identify clusters of genes that are co-expressed (well-correlated) in each of the input datasets. The left-hand panel shows the gene expression profiles of all genes in each one of the input datasets, while the right-hand panel shows the gene expression profiles of the genes in the clusters (C1, C2, ..., C(k)). Note that the number of conditions or time points is different for each dataset.

No need to pre-process your data; clust automatically normalises the data.

No need to preset the number of clusters; clust finds this number automatically.

You can control the tightness of the clusters by varying a single parameter, -t.

It is okay if the datasets:

  • Were generated by different technologies (e.g. RNA-seq or microarrays)
  • Are from different species
  • Have different numbers of conditions or time points
  • Have multiple replicates for the same condition
  • Require different types of normalisation
  • Were generated in different years and laboratories
  • Have some missing values
  • Do not include every single gene in every single dataset

Clust generates the following output files:

  • A table of clustering statistics
  • A table listing genes included in each cluster
  • Pre-processed (normalised, summarised, and filtered) datasets' files
  • Plotted gene expression profiles of clusters (a PDF file)

Figure 3: Automatic Clust analysis pipeline

Install clust with pip (sudo pip install clust, or pip install --user clust if you do not have administrative privileges). Then run it from any directory as:

clust data_path

Clust is available on Bioconda as well! Install it with conda install -c bioconda clust, and then run it from any directory in the same way.

First, make sure you have all of the following Python packages installed:

  • numpy
  • scipy
  • matplotlib
  • scikit-learn
  • pandas
  • joblib
  • portalocker

Then, download the latest release file (clust-..*.tar.gz) from the release tab and run clust directly, without installation, by running the script in the top level directory of the source code:
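The script name below is an assumption (this page does not state it); substitute whatever script actually sits at the top level of the unpacked release:

```shell
# Unpack the release and run the top-level script without installing
tar -xzf clust-*.tar.gz
cd clust-*/
python clust.py data_path   # "clust.py" is an assumed script name
```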

Hint: you can check which packages you have installed by:
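With pip, for instance:

```shell
# List the installed Python packages and their versions
pip freeze
```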

Upgrade clust to a newer version

If you already have clust and you want to upgrade it, then based on the way you used to install clust (from the ways above), upgrade it by:

Way 1. sudo pip install clust --upgrade

Way 2. pip install --user clust --upgrade

Way 3. conda update -c bioconda clust

Way 4. Download the newer release file (clust-..*.tar.gz) and use it to run clust instead of the older one

Clust has not been thoroughly tested on Windows. If you try it, your feedback will be much appreciated.

We recommend that you download and install WinPython, which provides many of the Python packages that clust requires.

Open WinPython Powershell Prompt.exe from the directory in which you installed WinPython.

Then you can run clust by:
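As elsewhere, this is a sketch; "clust.py" is an assumed script name, not confirmed by this page:

```shell
# From the WinPython Powershell Prompt, in the clust source directory
python clust.py data_path
```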

For normalised homogeneous datasets, simply run:

Where data_path is either the path to a single data file (v1.8.5+), or a path to a directory including one or more data files. This command runs clust with default parameters. If the output directory is not provided using the -o option, clust creates a new directory for the results within the current working directory.
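As a minimal sketch (the directory name Data/ below is hypothetical):

```shell
# Run clust with default parameters over all data files in Data/;
# results go to a Results_[Date] directory in the current working directory
clust Data/

# Or choose the output directory explicitly with -o
clust Data/ -o my_results/
```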

For raw RNA-seq TPM, FPKM, or RPKM data, consider the Normalisation section below. Other sections below address handling replicates, handling data from multiple species, and handling microarray data (alone or mixed with RNA-seq data).

Each dataset is represented in a single TAB delimited (TSV) file in which the first column represents gene IDs, the first row represents unique labels of the samples, and the rest of the file includes numerical values, mainly gene expression values.

Figure 4: Snapshots of the first few lines of three data files X1.txt, X2.txt, and X3.txt.

  • When the same gene ID appears in different datasets, it is considered to refer to the same gene.
  • If more than one row in the same file has the same identifier, the rows are automatically summarised by summing up their values.
  • IMPORTANT: Gene names should not include spaces, commas, or semicolons.
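As an illustration of this format, the following sketch creates a minimal, hypothetical dataset file (the gene IDs, sample labels, and values are invented):

```shell
# TAB-delimited file: gene IDs in the first column, unique sample
# labels in the first row, expression values in the rest
printf 'Genes\tCond_A\tCond_B\tCond_C\n' >  X1.txt
printf 'Gene001\t5.2\t7.1\t0.4\n'        >> X1.txt
printf 'Gene002\t3.0\t2.9\t3.3\n'        >> X1.txt
```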


Clust applies data normalisation during its pre-processing step.

Version 1.7.0 and newer: Clust automatically detects the most suitable normalisation for each dataset unless otherwise stated by the user via the -n option. The normalisation codes that clust decides to apply are stored in the output file /Normalisation_actual

Version 1.6.0 and earlier: The required normalisation techniques should be stated by the user via the -n option. Otherwise, no normalisation is applied.

Tell clust how to normalise your data in one of two ways:

clust data_path -n code1 [code2 code3 ...] [...] (v1.7.0 and newer)

  • List one or more normalisation codes (from the table below) to be applied to your one or more datasets
  • Example: clust data_path -n 101 3 4 [...]

clust data_path -n normalisation_file [...]

  • Provide a file listing the normalisation codes for each dataset (see Fig. 5).
  • Each line of the file includes these elements in order:
    1. The name of the dataset file (e.g. X0.txt)
    2. One or more normalisation codes. The order of these codes defines the order of the application of normalisation techniques.
  • Delimiters between these elements can be spaces, TABs, commas, or semicolons.
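For example, a normalisation file for two hypothetical datasets X1.txt and X2.txt (using codes from the table below) could be written as:

```shell
# One line per dataset: file name, then normalisation codes in the
# order they should be applied (codes here are only illustrative)
printf 'X1.txt\t101\t3\t4\n' >  normalisation.txt
printf 'X2.txt\t101\t4\n'    >> normalisation.txt
```

It would then be passed to clust as clust data_path -n normalisation.txt.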

Figure 5: Normalisation file indicating the types of normalisation that should be applied to each of the datasets.

Codes suggested for commonly used datasets

  • RNA-seq TPM, FPKM, and RPKM data: 101 3 4
  • Log2 RNA-seq TPM, FPKM, and RPKM data: 101 4
  • One-colour microarray gene expression data: 101 3 4
  • Log2 one-colour microarray gene expression data: 101 4
  • Two-colour microarray gene expression data: 3 6
  • Log2 two-colour microarray gene expression data: 6
  • Log2 fold-changes: 4

If the codes suggested for your data include code 3 but your dataset has many zeros or some negative values, use 31 in place of 3. For example, for one-colour microarray data with many zeros or a few negative values, use 101 31 4 instead of 101 3 4.

Code Definition
0 No normalisation (Default in v1.6.0 and earlier)
1 Divide by the mean value of the row
2 Divide by the first value of the row
3 Log2
31 Set all values that are less than 1.0 to 1.0, then log2 (v1.7.0+)
4 Z-score: subtract the mean of the row and then divide by its standard deviation
5 Divide by the total (sum) of the row
6 Subtract the mean value of the row
7 Divide by the maximum value of the row
8 2 to the power X
9 Subtract the minimum value of the row
10 Rank across rows (1 for the lowest, up to N for N columns; average ranks at ties)
11 Rank across rows (1 for the lowest, up to N for N columns; order arbitrarily at ties)
12 Linear transformation to the [0, 1] range across rows (0.0 for the lowest and 1.0 for the highest)
13 Set all values of genes with low expression everywhere to zeros. The threshold of low expression is found by fitting a bimodal distribution to per-gene maximum expression values over all samples (v1.7.0+)
- -
101 Quantile normalisation
102 Column-wise mean subtraction
103 Subtract the global mean of the entire dataset
- -
1000 Automatic detection of suitable normalisation (Default in v1.7.0 and newer)

If multiple replicates exist for the same condition, include this information in a replicates file and provide it to clust via the -r option.

Each line in the replicates file relates to the replicates of a single condition or time point, and includes these elements in order:

  1. The name of the dataset file (e.g. X0.txt).
  2. A name for the condition or time point; this can be any label that the user chooses.
  3. One or more names of the replicates of this condition. These should match column names in the dataset file.
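A hypothetical replicates file following these rules (dataset name, condition label, then replicate column names, all invented) might look like:

```shell
# Each line: dataset file, condition label, replicate column names
printf 'X0.txt\tcond1\trep1_a\trep1_b\trep1_c\n' >  replicates.txt
printf 'X0.txt\tcond2\trep2_a\trep2_b\n'         >> replicates.txt
```

It is then provided to clust as clust data_path -r replicates.txt.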

Data from multiple species

If your datasets come from multiple species, you can include a mapping file that defines gene mapping across species.

The mapping file is a TAB delimited file in which the first row shows the names of the species and the first column shows the IDs of the orthologue groups (OGs). Each OG includes zero, one, or many orthologous genes in each species' column split by commas.

Figure 7: Mapping fission and budding yeast genes

Figure 8: Mapping rice, setaria, and maize genes. Notice that some OGs do not include genes in some species

You can use Orthofinder to identify the OGs across multiple species. Orthofinder's output file Orthogroups.csv can be provided directly to clust as the mapping file.

If some genes do not exist in some species (e.g. Figure 8), have a look at the section Genes missing from some datasets below.

Data from multiple technologies (e.g. mixing RNA-seq and microarrays)

Incorporating microarray data into the analysis, with or without RNA-seq data, is straightforward. The main point to take care of is to include the correct normalisation codes for the different datasets, as detailed in the Normalisation section above.

Also, if the first column of a microarray data file includes probe IDs which are not identical across datasets generated using different microarray/RNA-seq platforms, make sure that probe-gene mapping information is included in the mapping file described above.

For example, you may apply clust to tens of human and mouse datasets generated by these different technologies / platforms:

Platform / Format Technology Example identifier
Human RNA-seq reads (TPM) RNA-seq NM_000014.4
Mouse RNA-seq reads (TPM) RNA-seq NM_001166382.1
Affymetrix Human Genome U133+ 2.0 Microarray 1552258_at
Illumina Human WG-6 v3.0 Microarray ILMN_1825594
Illumina Mouse WG-6 v2.0 Microarray ILMN_1243094

In this case, provide clust with a mapping file (TAB delimited) which looks like this:

OG H_RNAseq M_RNAseq H_U133+ H_WG6 M_WG6
OG00001 NM_001105537.2 NM_001310668.1, NM_001310668.1 204474_at, 37586_at ILMN_1676745 ILMN_1236966
. . . . . .

Here, the probes/transcripts that represent the human gene ZNF142 or its mouse orthologue Znf142 from the five platforms are mapped to a single unique OrthoGroup (OG) identifier (OG00001).

This mapping file is provided to clust by the -m option:
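A sketch of building such a mapping file, using the OG00001 identifiers from the table above but keeping only two of the five platform columns for brevity:

```shell
# First row: platform names; first column: orthogroup (OG) IDs;
# multiple probes/transcripts in one cell are separated by commas
printf 'OG\tH_RNAseq\tH_U133+\n'                        >  mapping.txt
printf 'OG00001\tNM_001105537.2\t204474_at, 37586_at\n' >> mapping.txt
```

The file is then passed to clust as clust data_path -m mapping.txt.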

There are many reasons why some genes may be missing from some datasets:

  • Datasets are from multiple species and some genes do not exist in some species (see Figure 8 above for example)
  • Older platforms of microarrays did not include probes for some genes
  • Other reasons

Clust allows you to automatically discard genes that do not appear in all (or most) datasets by using the -d option. This option specifies the minimum number of datasets in which a gene has to be present for it to be included in the analysis.

For example, if you have 20 datasets, you can force clust to discard any gene that is not included at least in 17 datasets by:
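Assuming the data files are in a hypothetical Data/ directory, this would be:

```shell
# Discard any gene that appears in fewer than 17 of the 20 datasets
clust Data/ -d 17
```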

Handling genes with low expression

By default in v1.7.0+, Clust filters out genes with flat expression profiles (profiles with absolutely no change in expression) after summarising replicates and normalisation. To switch this option off, use the --no-fil-flat option.

Also, clust can automatically filter out genes with low expression values if you provide the three options -fil-v, -fil-c, and -fil-d to clust:

This discards any gene whose expression does not reach at least the value given by -fil-v in at least -fil-c conditions in at least -fil-d datasets. This filter is applied before normalisation but after summarising replicates and handling gene mapping across multiple species.
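For instance (the threshold and counts below are purely illustrative, and Data/ is a hypothetical directory):

```shell
# Exclude genes that never reach an expression value of 1.0
# in at least 3 conditions in at least 2 datasets
clust Data/ -fil-v 1 -fil-c 3 -fil-d 2
```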

Are you obtaining noisy clusters?

A tightness parameter -t controls how tight the clusters should be (tighter and smaller clusters versus less tight and larger clusters). This is a real positive number with the default value of 1.0. Values smaller than 1.0 (e.g. 0.5) produce less tight clusters, while values larger than 1.0 (e.g. 2.0, 5.0, 10.0, ...) produce tighter clusters.

Try larger values of -t to obtain tighter clusters:
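For example (Data/ is a hypothetical directory):

```shell
# Tighter clusters than the default (-t 1.0)
clust Data/ -t 5.0
```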

Parameter Definition
data_directory The path of the directory including all data files
- -
-n <file or integer list> Path of the normalisation file or a list of normalisation codes. See the Normalisation section above for details.
-r <file> Path of the replicates file
-m <file> Path to orthogroup mapping file
-o <directory> Custom path of the output directory
- -
-t <real number> (Cluster tightness) versus (cluster size) weight: a real positive number, where 1.0 means equal weights, values smaller than 1.0 means larger and less tight clusters, and values larger than 1.0 produce smaller and tighter clusters (default: 1.0).
-q3s <real number> Defines the threshold for outliers in terms of the number of Q3's (third quartiles). Smaller values lead to tighter clusters (default: 2.0).
- -
-fil-v <real number> Threshold of data values (e.g. gene expression). Any value lower than this will be set to 0.0. If a gene never exceeds this value at least in FILC conditions in at least FILD datasets, it is excluded from the analysis (default: -inf)
-fil-c <integer> Minimum number of conditions in a dataset in which a gene should exceed the data value FILV at least in FILD datasets to be included in the analysis (default: 0)
-fil-d <integer> Minimum number of datasets in which a gene should exceed the data value FILV at least in FILC conditions to be included in the analysis (default: 0)
--fil-abs -fil-v is used as a threshold for the absolute values of expression. Useful when the data has positive and negative values (e.g. log-ratio 2-colour microarray data). (default: not used).
--fil-perc -fil-v is a percentile of gene expression rather than an absolute expression value (e.g. -fil-v 25 sets the 25th percentile of all gene expression values as the threshold). (default: not used).
--fil-flat Filter out genes with flat expression profiles (constant expression over all samples in all datasets). (default: used).
--no-fil-flat Cancels the default --fil-flat option.
- -
-d <integer> Minimum number of datasets in which a gene has to be included for it to be considered in the clust analysis. If a gene is included only in fewer datasets than this, it will be excluded from the analysis (default: 1)
-cs <integer> Smallest cluster size (default: 11)
-K <integer> [<integer> ...] K values: refer to the publication for details (default: all even integers from 4 to 20 inclusively)
- -
--no-optimisation Skip the cluster optimisation step. Not recommended except to compare results before and after optimisation (default: optimisation is performed).
-basemethods <string> [<string> ...] One or more base clustering methods (default (v1.8.0+): k-means)
- -
-h, --help show the help message and exit

Raw expression data from multiple species

Example datasets are available in ExampleData/1_RawData. These are three datasets from two yeast species: two datasets from fission yeast and one from budding yeast.

That directory contains the datasets' files in a Data sub-directory, along with three other files specifying the replicates, the required normalisation, and the gene mapping across the datasets, i.e. the orthologous genes across the two yeast species.

Run clust over this data by providing the data directory together with the replicates file, the normalisation file, and the mapping file (via the -r, -n, and -m options). Alternatively, let clust automatically detect suitable normalisation (v1.7.0+) by omitting the -n option. You may also specify a tightness level -t other than the default, and an output directory other than the default with -o.
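A combined invocation might look like the sketch below; the replicates, normalisation, and mapping file names are assumptions, so substitute the actual files shipped in ExampleData/1_RawData:

```shell
# Full run with replicates, explicit normalisation, and gene mapping
clust ExampleData/1_RawData/Data \
  -r ExampleData/1_RawData/Replicates.txt \
  -n ExampleData/1_RawData/Normalisation.txt \
  -m ExampleData/1_RawData/MapIDs.txt \
  -t 2.0 -o my_results/

# Or drop -n to let clust detect suitable normalisation (v1.7.0+)
```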

Example datasets that come from one species, have no replicates, and are already normalised are available in ExampleData/2_Preprocessed, more specifically in the Data directory therein. These datasets require no pre-processing, so you can simply run clust over the directory "Data":
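Since these datasets need no pre-processing, the command reduces to the default invocation:

```shell
clust Data
```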

Find the results in the Results_[Date] directory that clust will have generated in your current working directory.

This runs clust with the default tightness -t value of 1.0. You can make the generated clusters tighter by increasing -t or less tight by decreasing it. For example, try -t 5.0 or -t 0.2:
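For example:

```shell
clust Data -t 5.0
clust Data -t 0.2
```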

You may also like to save results in an output directory of your choice by using -o:
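For example (the output directory name is arbitrary):

```shell
clust Data -o MyResultsDirectory/
```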

When publishing work that uses clust, please cite:

  1. Basel Abu-Jamous and Steven Kelly (2018) Clust: automatic extraction of optimal coexpressed gene clusters from gene expression data. Genome Biology 19:172.


The results shown here are, in part, based on data from multiple previously published studies. We acknowledge the investigators and patients who contributed to the acquisition and analysis of the data used in this study. This work was partially supported by research funding from National Natural Science Foundation of China (Grant no. 81472220), Shanghai Science and Technology Development Fund (the Domestic Science and Technology Cooperation Project, No. 14495800300) and Canhelp Genomics. We thank Yang Yang, Xinming Zhang, Yi Cai, and Minzhe Fang for excellent technical and operational assistance.


Department of Radiation Oncology, University of Michigan Medical School, Ann Arbor, MI, USA

S G Zhao, W C Jackson, V Kothari, M J Schipper, J R Evans, C Speers, D A Hamstra & F Y Feng

Department of Biostatistics, University of Michigan Medical School, Ann Arbor, MI, USA

GenomeDx Biosciences Inc., Vancouver, British Columbia, Canada

Comprehensive Cancer Center, University of Michigan Medical School, Ann Arbor, MI, USA

Michigan Center for Translational Pathology, University of Michigan Medical School, Ann Arbor, MI, USA

Department of Radiation Oncology, Harvard University, Boston, MA, USA

Departments of Urology, Oncology, and Pathology, Johns Hopkins University, Baltimore, MD, USA

Department of Radiation Oncology, Thomas Jefferson University, Philadelphia, PA, USA

Glickman Urological and Kidney Institute, Cleveland Clinic, Cleveland, OH, USA

Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, USA



DNA Microarrays and Genetic Testing

Lars Dyrskjøt, ..., Torben F. Ørntoft, in Molecular Diagnostics (Second Edition), 2010

Bladder Cancer

For the study of bladder cancer development and progression, microarray gene expression profiling has also been applied with success. In one of the first studies using clinical material, Affymetrix GeneChips with probes for approximately 5,000 human genes and ESTs were used to identify gene expression pattern changes between superficial and invasive tumors (Thykjaer et al., 2001). The identified genes encoded oncogenes, growth factors, proteinases, and transcription factors, together with proteins involved in cell cycle, cell adhesion, and immunology. This was the first study to identify genes that separated superficial from invasive bladder tumors. A later microarray-based study on bladder tumors showed advances in disease classification and outcome prediction (Dyrskjøt et al., 2003). The authors of this study also used GeneChips with probes for approximately 5,000 genes and ESTs to identify a 32-gene expression pattern, using 40 tumor samples, for classifying tumors according to disease stage. This stage classifier was successfully validated on an independent test set consisting of 68 bladder tumors analyzed on a different array platform. The stage classifier not only reproduced histopathological staging, but also added important information regarding subsequent disease progression.

Prediction of disease progression from non-muscle-invasive to invasive stage would be of great benefit in the clinical management of patients diagnosed with early stage bladder tumors. In one study, a 45-gene molecular classifier was developed by comparing 29 non-muscle-invasive tumors (13 without later progression and 16 with later progression) using custom Affymetrix GeneChip arrays (Dyrskjøt et al., 2005). The 45-gene classifier was tested on a series of 74 independent tumors using a two-color oligonucleotide array platform with only the genes of interest. The classification results showed a positive correlation with disease outcome (P < 0.03), with a positive predictive value of 0.3 and a negative predictive value of 0.95. The low positive predictive value may be explained by the fact that patients were continuously treated with transurethral resection and BCG installations. In another study of progression prediction (Wild et al., 2005), the authors used 42 Ta tumors, of which eight showed later progression to invasive bladder cancer and eight showed later CIS lesions, to delineate a gene set optimal for predicting progression. Using a cross-validation test, the predictor correctly classified 33 of the samples, giving a sensitivity of 86% and a specificity of 71%. No independent test set validation results have been reported for this gene set. The consensus gene set of 11 genes, derived from the genes most commonly used in cross-validation loops, showed no overlap with the 45-gene signature from Dyrskjøt and colleagues (Dyrskjøt et al., 2005). The progression signature reported by Dyrskjøt and colleagues was recently validated in a large retrospective study using bladder tumors from a cohort of 404 patients diagnosed with bladder cancer in hospitals in Denmark, Sweden, France, England, and Spain (Dyrskjøt et al., 2007).
The molecular progression classifier was highly significantly correlated with progression-free survival (P < 0.001) and cancer-specific survival (P = 0.001). Furthermore, multivariate Cox regression analysis showed the progression classifier to be an independent significant variable associated with disease progression after adjustment for known risk factors such as age, sex, stage, grade, and treatment (hazard ratio 2.3, P = 0.007). Consequently, the retrospective multi-center validation study confirmed the potential clinical utility of the molecular classifier to predict the outcome of patients initially diagnosed with non-muscle-invasive bladder cancer.

Gene expression profiles predictive of chemotherapy response have been published for several neoplasms. In a small study of muscle-invasive bladder cancer, the response to neoadjuvant (in advance of surgical treatment) chemotherapy was investigated using cDNA microarrays (Takata et al., 2005). Fourteen tumors were used to identify a signature of 14 predictive genes, which was validated on nine additional tumors. RT-PCR results showed good correlation with the microarray, warranting further validation in a larger series. In a more recent study, Als and colleagues (2007) identified 55 genes that correlated significantly with survival following chemotherapy. The authors validated two of the protein products (emmprin and survivin) using immunohistochemistry on an independent sample set of 124 tumors. Multivariate analysis identified emmprin expression (hazard ratio 2.23, P < 0.0001) and survivin expression (hazard ratio 2.46, P < 0.0001) as independent prognostic markers for poor outcome, together with the presence of visceral metastases (hazard ratio 2.62, P < 0.0001). In the clinically good prognostic group of patients without visceral metastases, both markers showed significant discriminating power as supplemental risk factors (P < 0.0001). Within this group of patients, the subgroups of patients with no positive, one positive, or two positive immunohistochemistry scores (emmprin and survivin) had estimated 5-year survival rates of 44.0%, 21.1%, and 0%, respectively. Response to chemotherapy could also be predicted, with odds ratios of 4.41 (95% confidence interval, 1.91–10.1) and 2.48 (95% confidence interval, 1.1–5.5) for emmprin and survivin, respectively. Consequently, emmprin and survivin were identified as strong independent prognostic factors for response and survival after chemotherapy in patients with advanced bladder cancer.

Author information


PROOF Centre of Excellence, Vancouver, BC, Canada

Casey P. Shannon, Robert Balshaw, Virginia Chen, Zsuzsanna Hollander, Bruce M. McManus, Raymond T. Ng & Scott J. Tebbutt

BC Centre for Disease Control, Vancouver, BC, Canada

Division of Cardiology, University of British Columbia, Vancouver, BC, Canada

Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, BC, Canada

Department of Computer Science, University of British Columbia, Vancouver, BC, Canada

Department of Medicine, Division of Respiratory Medicine, University of British Columbia, Vancouver, BC, Canada

J. Mark FitzGerald, Don D. Sin & Scott J. Tebbutt

Centre for Heart Lung Innovation, University of British Columbia, Vancouver, BC, Canada

Casey P. Shannon, Virginia Chen, Zsuzsanna Hollander, Bruce M. McManus, Don D. Sin, Raymond T. Ng & Scott J. Tebbutt

Institute for Heart and Lung Health, Vancouver, BC, Canada

Bruce M. McManus, J. Mark FitzGerald, Don D. Sin, Raymond T. Ng & Scott J. Tebbutt
