What value type would a chromosome position be in a database or form?

I wanted to create a tool for some fields like SIFT, Phenotype, etc. For example, I know Phenotype will have "Text" values and SIFT will have a fixed set of values from a drop-down list, but what about chromosome positions? What are some valid sample values, so I can decide which type to use for them?

Since it sounds like you are the one designing the database, you can model this in a number of ways. The simplest is probably to reduce it to two variables, likely two decimals.

See this hemoglobin example for a chromosomal locus example.

  • There are N chromosomes (23 for humans, if you like, sex chromosomes can be treated as a pair).
  • There are 2 chromatids per chromosome.
  • The part of the chromatid is either p or q (short or long arm).
  • Then there is the location on the portion of the chromatid (eg, 15.5).

Chromatid can be easily represented as a decimal, where the integer portion is the chromosome number, and the decimal portion corresponds to the chromatid and arm.

The chromosomal locus can be another decimal, such as 15.5 for the example above.

This is of course one way, and there are many other ways you could do this.
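As one of those other ways, the components listed above can be kept as separate fields rather than packed into decimals. The following is only a minimal sketch: the `Locus` class, the `parse_locus` helper, the human chromosome range, and the sample string "11p15.5" are all assumptions made for illustration, not part of any standard library.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern for loci written like "11p15.5" or "Xq28":
# chromosome, then arm (p or q), then the band on that arm.
LOCUS_RE = re.compile(r"^(?P<chrom>\d{1,2}|X|Y)(?P<arm>[pq])(?P<band>\d+(?:\.\d+)?)$")


@dataclass(frozen=True)
class Locus:
    chrom: str  # "1".."22", "X" or "Y" (human range is an assumption)
    arm: str    # "p" (short arm) or "q" (long arm)
    band: str   # e.g. "15.5"; kept as text so no precision is lost


def parse_locus(text: str) -> Locus:
    """Split a locus string into chromosome, arm and band fields."""
    match = LOCUS_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a recognised locus: {text!r}")
    chrom = match["chrom"]
    if chrom.isdigit() and not 1 <= int(chrom) <= 22:
        raise ValueError(f"chromosome out of range: {chrom}")
    return Locus(chrom, match["arm"], match["band"])


if __name__ == "__main__":
    print(parse_locus("11p15.5"))
```

Stored this way, the form can use a drop-down for the chromosome and arm and a validated text box for the band.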

How to crossover the parents when using a value encoding method in genetic algorithm?

There is a phase in a genetic algorithm where we cross over the parents' chromosomes to produce offspring.

This is easy to do in binary form.

But what should we do if the chromosomes use value encoding?

Let's say one gene in my chromosome is a DOUBLE value, say 0.99, whose range is 0 to 1 because it represents a probability.

How do I cross over this DOUBLE number?

Convert the value to binary, perform the crossover, then convert the result back.
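A rough sketch of that idea, assuming the probability is quantised to a 16-bit fixed-point integer (the bit width, the function names, and the single-point scheme are illustrative choices, not something the answer specifies):

```python
import random

BITS = 16                 # assumed resolution of the binary encoding
SCALE = (1 << BITS) - 1   # largest integer representable in BITS bits


def encode(value: float) -> int:
    """Map a value in [0, 1] to an integer in [0, SCALE]."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must lie in [0, 1]")
    return round(value * SCALE)


def decode(bits: int) -> float:
    """Map an integer in [0, SCALE] back to a value in [0, 1]."""
    return bits / SCALE


def crossover(a: float, b: float, point: int) -> tuple:
    """Single-point crossover: swap the lowest `point` bits of the parents."""
    mask = (1 << point) - 1
    ia, ib = encode(a), encode(b)
    child1 = (ia & ~mask) | (ib & mask)
    child2 = (ib & ~mask) | (ia & mask)
    return decode(child1), decode(child2)


if __name__ == "__main__":
    cut = random.randint(1, BITS - 1)  # random cut position inside the word
    print(cut, crossover(0.99, 0.25, cut))
```

Because both children are assembled from 16-bit words, they always decode to values in the 0 to 1 range, so the offspring remain valid probabilities.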

This format is used to provide called peaks of signal enrichment
based on pooled, normalized (interpreted) data. It is a BED6+3 format.

field       type    description
chrom       string  Name of the chromosome
chromStart  int     The starting position of the feature in the chromosome. The first base in a chromosome is numbered 0.
chromEnd    int     The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
name        string  Name given to a region (preferably unique). Use ‘.’ if no name is assigned.
score       int     Indicates how dark the peak will be displayed in the browser (1-1000). If ‘0’, the DCC will assign this based on signal value. Ideally average signalValue per base spread between 100-1000.
strand      char    +/- to denote strand or orientation (whenever applicable). Use ‘.’ if no orientation is assigned.
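The zero-based, end-exclusive coordinates described above are easy to get wrong, so here is a small sketch of reading the six fields from one tab-separated line. The `BedFeature` type and `parse_bed6` function are hypothetical names, and any columns beyond the first six are simply ignored.

```python
from typing import NamedTuple


class BedFeature(NamedTuple):
    chrom: str
    chrom_start: int  # zero-based, inclusive
    chrom_end: int    # zero-based, exclusive
    name: str
    score: int
    strand: str

    @property
    def length(self) -> int:
        # With a half-open interval, the length is simply end minus start.
        return self.chrom_end - self.chrom_start


def parse_bed6(line: str) -> BedFeature:
    """Parse the first six tab-separated fields of a BED-style line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        raise ValueError("expected at least 6 tab-separated fields")
    chrom, start, end, name, score, strand = fields[:6]
    feature = BedFeature(chrom, int(start), int(end), name, int(score), strand)
    if feature.chrom_start < 0 or feature.chrom_end < feature.chrom_start:
        raise ValueError("chromStart/chromEnd out of order")
    if strand not in {"+", "-", "."}:
        raise ValueError(f"unexpected strand: {strand!r}")
    return feature


if __name__ == "__main__":
    first_100 = parse_bed6("chr1\t0\t100\t.\t0\t.")
    print(first_100.length, first_100.chrom_end - 1)
```

For the first 100 bases of a chromosome, the feature length is chromEnd minus chromStart, and the last covered base is chromEnd minus one.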

This might help you. - from Narayana Vyas. It searches all columns of all tables in a given database. I have used it before and it works.

This is the Stored Proc from the above link - the only change I made was substituting the temp table for a table variable so you don't have to remember to drop it each time.

To execute the stored procedure :

If you need to run such a search only once, then you can probably go with any of the scripts already shown in other answers. Otherwise, I’d recommend using ApexSQL Search for this. It’s a free SSMS add-in and it really saved me a lot of time.

Before running any of the scripts, you should customize them based on the data type you want to search. If you know you are searching for a datetime column, then there is no need to search through nvarchar columns. This will speed up all of the queries above.

Based on bnkdev's answer I modified Narayana's Code to search all columns even numeric ones.

It'll run slower, but this version actually finds all matches not just those found in text columns.


The results of 822 PGT-A cycles were analyzed. Table 1 shows data relating to number of cases, medical indication for PGT-A and patient age, which ranged from 22 to 46 years (average: 38.8 ± 3.2 years; 95CI: 38.6–39.0). Table 2 shows the results of genetic testing for aneuploidies. Forty-six percent of blastocysts (1656 of 3565) were euploid, with incidence varying significantly according to PGT-A indication (Table 2). The remaining blastocysts were diagnosed as aneuploid (53.5%; 1909 of 3565). In 45.2% (1610 of 3565) of diagnosed blastocysts, one (29.2%) or more (16.0%) whole chromosomes were implicated in the aneuploidy.

In the case of segmental aneuploidy (Table 2), 8.4% of diagnosed blastocysts (299 of 3565) exhibited one or more segmental chromosome aneuploidies, some of which were associated with whole-chromosome aneuploidy, while others were not. Two-hundred and seventy-four of the 3565 blastocysts (7.7%) had a single segmental aneuploidy (SSA), associated (n = 115) or not (n = 159) with a whole-chromosome aneuploidy, whereas 0.7% of the remaining blastocysts (25 of 3565) showed segmental aneuploidies in two different chromosomes. Only one blastocyst was diagnosed as carrying three segmental aneuploidies located in three different chromosomes (multiple segmental aneuploidy). No more than three segmental aneuploidies per embryo, or one segment per chromosome and embryo, were observed.

Single segmental aneuploidies in the absence of whole-chromosome aneuploidies (pure-SSA) were detected in 159 blastocysts (4.5% of analyzed blastocysts), independently of the medical indication for the assisted reproductive technology (ART) cycle (P > 0.05; Table 2).

The frequency of pure-SSA (n = 159) was not related to day of blastocyst biopsy (day 5 vs day 6; P = 0.70) or blastocyst stage (P = 0.58), but it was related to the quality of the ICM and TE (P < 0.01). Thus, as shown in Table 3, a significantly higher percentage of pure-SSA was observed among blastocysts graded “C” (for TE and ICM) than among those with better TE and ICM quality.

From a qualitative point of view, we described the SSA population according to location of gains or losses on the p- or q-chromosome arms. In general, both gains (44.0%) and losses (56.0%) were equally represented in the pure-SSA population; however, they were more frequently located on the q- than on the p-chromosome arm (67.3% vs 32.7%, respectively). Moreover, SSA type, defined by combining both variables (gains/losses and arm location), was equally distributed in the blastocyst population (Table 4).

SSA type was not statistically affected by age (P = 0.51), clinical indication (P = 0.15), blastocyst stage (P = 0.54) or ICM and TE quality (P = 0.2 and P = 0.28, respectively), but it was significantly affected by day of biopsy (P = 0.007; Table 4). Thus, blastocysts biopsied on day 5 showed significantly higher percentages of gains on the q-chromosome arms (22.0%), whereas those biopsied on day 6 showed significantly higher percentages of SSA losses on the q-chromosome arm (22.0%). SSA affecting p-chromosome arms were equally distributed amongst blastocysts biopsied on day 5 or 6 of development (ranging from 5.0 to 11.3%; Table 4).

Qualitative description of SSA was also defined by the chromosome involved. The Kolmogorov-Smirnov test revealed that the frequency of the chromosomes with a SSA did not follow a normal distribution (Fig. 2; P < 0.001). In fact, our SSA population displayed an asymmetrical distribution of chromosome frequency: SSAs were located on chromosomes 1 to 9 in nearly two thirds of blastocysts, whereas 29.6% of SSA were located on the remaining autosomes and sex chromosomes. No SSA was observed on the Y chromosome or on autosomes 19, 21 or 22 (Fig. 2).

Percentage of blastocysts diagnosed as single segmental aneuploid (SSA) and types (losses on the small or large chromosome-arm: -p, −q, respectively or gains on the small or large chromosome-arm: +p, +q, respectively) according to chromosome carrier

Additionally, SSA affecting a particular chromosome was not statistically related with age (P = 0.92), medical indication (P = 0.24), day of biopsy (P = 0.25), blastocyst stage (P = 0.96) or ICM quality (which was constantly rated “b”). On the other hand, a significant relation was observed between TE quality and the affected chromosome (P = 0.04). No statistical analysis was performed to explore the relation of SSA size to the chromosome carrier due to the relatively low number of cases studied.

Interestingly, although current qualitative descriptions include both topographic location of gains/losses on the chromosome arm and the chromosome involved, our data showed that these two qualitative variables were not related to each other (P = 0.09; Fig. 2, Table 5).

Description of SSA from a quantitative point of view requires the study of DNA-sequence length (Additional file 1). The Kolmogorov-Smirnov test revealed that SSA size did not follow a normal frequency distribution (P < 0.001). Thus, this continuous variable (SSA size) was converted to a categorical one by re-grouping the sizes into quartiles in order to perform statistical comparisons with continuous variables such as patient age.

SSA size was not statistically related to age (P = 0.99), medical indication (P = 0.48), day of biopsy (P = 0.18), blastocyst stage (P = 0.40), or TE (P = 0.09) or ICM quality (constantly rated “b”). However, significant differences were observed according to SSA type (P = 0.003) and the chromosome involved (P = 0.007). Thus, gains and losses located on the p-arm had comparable average sizes (45.4 ± 30.6 Mb; 95CI: 36.9–53.9 Mb; P = 0.99) and were significantly shorter than gains on the q-arm (average: 74.8 ± 33.2 Mb; 95CI: 65.4–84.2 Mb; P < 0.03), whereas losses on the q-arm were of an intermediate size (average: 65.1 ± 36.9 Mb; 95CI: 55.3–74.9 Mb; Fig. 3a).

Average SSA size (Mb open circle) according to SSA type (Fig. 3a) and chromosome carrier (Fig. 3b). Error bars represent 95% confidence interval in Mb. Footnote: Different superscripts represent statistically significant differences (P < 0.05) between SSA types or affected chromosomes

Otherwise, SSA size was related to the chromosome involved (P = 0.003 Fig. 3b). However, since the analysis of SSA size in relation to chromosome rendered a relatively low number of cases, no further analysis was performed.

The above-mentioned relationship between TE quality and affected chromosome was also observed after grouping chromosomes according to the Denver classification. Thus, in fair TE quality blastocysts (rated “c”), significantly more pure-SSA were observed on acrocentric or small-sized chromosomes (Groups D-F) than in blastocysts with excellent or good TE quality (rated “a” or “b”), in which such chromosomes were rarely affected (P = 0.00006; Fig. 4). Pure SSA were more frequently located on large- or medium-sized sub-metacentric chromosomes (Group A-C), regardless of trophectoderm quality (average: 88.0%; P = 0.25; Fig. 4).

Percentage (mean, error bars: 95% confidence interval) of pure SSA located in each Denver chromosome classification group, according to excellent, good or fair trophectoderm quality scores (a, b and c, respectively Fig. 4a). Lower figure shows percentage (mean error bars: 95% confidence interval) of SSA located in chromosomes pertaining to groups A-C or D-F, according to trophectoderm quality scores (Fig. 4b)

Nevertheless, we re-grouped the SSA population according to the Denver Standard chromosome classification system [26]. As shown in Fig. 5a, pure SSAs were most frequently represented in group C (43.4%), followed by those in groups A (25.8%) and B (18.9%). The remaining blastocysts exhibited a SSA in chromosomes in groups D (5.0%) and E (6.3%), and only one blastocyst had SSA in chromosome 20 (group F: 0.6%). No SSA was detected in the small-sized metacentric chromosome 19 or acrocentric chromosomes 21, 22 and Y (Fig. 2 and Fig. 5a). Following the Denver Standard System, different SSA sizes were observed amongst the chromosome categories analyzed (P = 0.0001; Fig. 5b). Thus, although comparable sizes were observed amongst groups D and E (31.9 ± 11.3 Mb; 95CI: 26.3–37.5 Mb), they were significantly shorter than those in groups A-C and larger than the single SSA observed in chromosome 20 (group F: 21.2 Mb). Moreover, SSA sizes in group A were significantly larger (average: 77.6 ± 38.1 Mb; 95CI: 65.6–89.6 Mb) than those in medium- or small-sized chromosomes, regardless of the centromere position (groups C-G). The SSAs identified in large-sized sub-metacentric autosomes (group B: 69.0 ± 41.7 Mb; 95CI: 53.4–84.6 Mb) were of an intermediate size with respect to groups A (large-sized) and C (medium-sized sub-metacentric chromosomes; average: 57.5 ± 29.6 Mb; 95CI: 50.4–64.6 Mb), but were significantly larger than those quantified in small-sized chromosomes (groups E-F) and medium-sized acrocentric chromosomes (group D).

Frequency of single segmental aneuploid (SSA) blastocysts, with detailed types of SSA (Fig. 5a) and average mean SSA sizes (open circles Mb, Fig. 5b) according to the Denver classification system. Averaged mean size (open circles) of each SSA type classified by the Denver Standard System (Fig. 5c). Error bars represent 95% confidence interval in Mb. Footnote: Different superscripts represent statistically significant differences (P < 0.05) in SSA size between chromosome categories or SSA types within C-chromosome category

Additionally, we assessed the size of all four SSA types within each chromosome group (Fig. 5c). The results showed comparable SSA sizes regardless of SSA type in all groups except for group C. In this group, sequences corresponding to gains or losses on the q-arm were significantly larger (average: 68.6 ± 27.1 Mb; 95CI: 60.4–76.8 Mb) than losses on the p-arm (average: 39.2 ± 26.8 Mb; 95CI: 26.2–52.3 Mb). Gains on the p-arm were of an intermediate size with respect to SSA losses, whatever the chromosome arm affected (average: 34.4 ± 13.3 Mb; 95CI: 22.1–46.7 Mb).

Finally, we calculated the ratio of SSA size to the length of the entire chromosome, including the centromere (SSA:chromosome ratio). The results showed that the SSA:chromosome ratio was approximately constant across all the chromosome groups classified according to the Denver Standard System (P = 0.62), with an estimated average of 0.37 ± 0.19 (95CI: 0.37–0.40; Fig. 6a). However, the SSA:chromosome ratio was affected by SSA type (P < 0.001; Fig. 6b): SSAs on the p-arms had a significantly lower ratio (average: 0.27 ± 0.15; 95CI: 0.23–0.31) than gains on the q-arms (average: 0.46 ± 0.19; 95CI: 0.40–0.51). An intermediate ratio (0.37 ± 0.18; 95CI: 0.32–0.42) was calculated for losses on the q-chromosome arm.

Average SSA:chromosome ratio (open circles; range 0–1) according to the Denver classification system (Fig. 6a) and SSA type (Fig. 6b). Average SSA:arm ratio (open circles; range 0–1) according to the Denver classification system (Fig. 6c) and SSA type (Fig. 6d). Error bars represent 95% confidence interval in Mb. Footnote: Different superscripts represent statistically significant differences (P < 0.05) in the SSA:chromosome or SSA:arm ratio between chromosome categories or SSA type

The ratio of SSA size to arm length (SSA:arm ratio) was comparable in almost all the chromosome groups when the Denver Standard classification was employed: an average ratio of 0.72 ± 0.37 was obtained (95CI: 0.66–0.78; P = 0.71; Fig. 6c), with the exception of group D, in which it was significantly lower (average: 0.37 ± 0.11; 95CI: 0.27–0.46). SSA:arm ratios were also affected by SSA type (P = 0.005; Fig. 6d). Thus, losses on the q-arm showed significantly lower SSA:arm ratios (average: 0.27 ± 0.15; 95CI: 0.22–0.33) than those located on the p-arm (average: 0.37 ± 0.18; 95CI: 0.32–0.42), while gains displayed intermediate SSA:arm ratios, whatever the chromosome arm affected (average: 0.74 ± 0.32; 95CI: 0.66–0.82).


3.1 Genome assembly

The k-mer analysis of our sequenced genome of a female Salix dunnii plant indicated that the frequency of heterozygous sites in this diploid individual is low (0.79%) (Figures S2 and S3; Table S1). We generated 72 Gb (~180×) of ONT long reads, 60 Gb (~150×) of Illumina reads and 55 Gb (~140×) of Hi-C reads (Tables S5 and S6). After applying several different assembly strategies, we selected the one with the “best” contiguity metrics (smartdenovo with canu correction, Table S2). Polishing/correcting using Illumina short reads of the same individual yielded a 333-Mb genome assembly in 100 contigs (contig N50 = 10.1 Mb) (Table S2).

With the help of Hi-C scaffolding, we achieved a final chromosome-scale assembly of 328 Mb in 29 scaffolds (scaffold N50 = 17.28 Mb), about 325.35 Mb (99.17%) of which is anchored to 19 pseudochromosomes (Figure 1a, Table 2; Figure S4, Table S4), corresponding to the haploid chromosome number of the species. The mitochondrial and chloroplast genomes were assembled into circular DNA molecules of 711,422 and 155,620 bp, respectively (Figures S5 and S6). About 98.4% of our Illumina short reads were successfully mapped back to the genome assembly, and about 99.5% of the assembly was covered by at least 20× reads. Similarly, 98.9% of ONT reads mapped back to the genome assembly and 99.9% were covered by at least 20× reads. The assembly's LTR Assembly Index (LAI) score was 12.7, indicating that our assembly reached a high enough quality to achieve the rank of “reference” (Ou et al., 2018). busco (Simão et al., 2015) analysis identified 1392 (96.6%) of the 1440 highly conserved core proteins in the Embryophyta database, of which 1239 (86.0%) were single-copy genes and 153 (10.6%) were duplicate genes. A further 33 (2.3%) had fragmented matches to other conserved genes, and 37 (2.6%) were missing.

Total assembly size (Mb) 328
Total number of contigs 31
Total anchored size (Mb) 325.352
Maximum contig length (Mb) 35.892
Minimum contig length (kb) 68.49
Contig N50 length (Mb) 16.657
Contig L50 count 8
Contig N90 length (Mb) 12.795
Contig L90 count 17
Total number of scaffolds 29
Maximum scaffold length (Mb) 35.892
Minimum scaffold length (kb) 68.49
Scaffold N50 length (Mb) 17.281
Scaffold L50 count 8
Scaffold N90 length (Mb) 13.179
Scaffold L90 count 17
Gap number 2
GC content (%) 33.09
Gene number 31,501
Repeat content (%) 41.05
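For readers unfamiliar with the N50 and L50 statistics in the table above, the sketch below computes them under the usual definition: sort the sequences from longest to shortest and walk down the list until at least half of the total length is covered. The function name and the toy lengths are illustrative only and are not taken from the assembly.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a collection of contig or scaffold lengths.

    N50 is the length of the sequence at which the running total, taken
    from longest to shortest, first reaches half of the summed length.
    L50 is how many sequences were needed to get there.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("no lengths given")


if __name__ == "__main__":
    toy_lengths = [2, 3, 4, 5, 6, 7, 8]  # made-up values for illustration
    print(n50_l50(toy_lengths))
```

N90 and L90 follow the same procedure with the threshold set to 90% of the total length instead of half.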

3.2 Annotation of genes and repeats

In total, 134.68 Mb (41.0%) of the assembled genome consisted of repetitive regions (Table 2), close to the 41.4% predicted by findgse (Sun et al., 2018). LTR-RTs were the most abundant annotations, forming up to 19.1% of the genome, with Gypsy and Copia class I retrotransposon (RT) transposable elements (TEs) accounting for 13% and 5.85% of the genome, respectively (Table S7). All genomes so far studied in Salix species have considerable proportions of TE sequences, but the higher proportions of Gypsy elements in S. dunnii (Table S7) (Chen et al., 2019) suggested considerable expansion in this species. Based on estimated divergence per site (see Methods), most full-length LTR-RTs appear to have inserted at different times within the last 30 million years rather than in a recent burst (Figures S7-S9; Table S8). Divergence values across all chromosomes range from 0 to 0.2 (mean 0.041, median 0.027). The values for chromosome 7 alone are similar, ranging from 0 to 0.18, but the mean (0.0461) and median (0.035) are slightly higher than for the other chromosomes, mainly because of higher values (greater ages) in the X-linked region.

Using a comprehensive strategy combining evidence-based and ab initio gene prediction (see Methods), we then annotated the repeat-masked genome. We identified a total of 31,501 gene models, including 30,200 protein-coding genes, 650 transfer RNAs (tRNAs), 156 ribosomal RNAs (rRNA) and 495 unclassifiable noncoding RNAs (ncRNAs) (Table 2 Table S9). The average S. dunnii gene is 4095.84 bp long and contains 6.07 exons (Table S10). Most of the predicted protein-coding genes (94.68%) matched a predicted protein in a public database (Table S11). Among the protein-coding genes, 2053 transcription factor (TF) genes were predicted and classified into 58 gene families (Tables S12 and S13).

3.3 Comparative genomics and whole genome duplication events

We compared the S. dunnii genome sequence to four published willow genomes and Populus trichocarpa, as an outgroup, using 5950 single-copy genes to construct a phylogenetic tree of the species' relationships (Figure 1b). Consistent with published topologies (Wu et al., 2015), S. dunnii appears in our study as an early diverging taxon in sister position to the four Salix species of the Chamaetia-Vetrix clade.

To test for whole genome duplication (WGD) events, we examined the distribution of Ks values between paralogues within the S. dunnii genome, together with a dot plot to detect potentially syntenic regions. This revealed a Ks peak similar to that observed in Populus, confirming the previous conclusion that a WGD occurred before the two genera diverged (Ks around 0.3 in Figure S10) (Tuskan et al., 2006 ). A WGD is also supported by our synteny analysis within S. dunnii (Figure 1a Figure S11). Synteny and collinearity were nevertheless high between S. dunnii and S. purpurea on all 19 chromosomes, and between the two willow species and P. trichocarpa for 17 chromosomes (Figure 1c), with a previously known large interchromosomal rearrangement between chromosome 1 and chromosome 16 of Salix and Populus (Figure 1c).

3.4 Identification of the sex determination system

To infer the sex determination system in S. dunnii, we sequenced 20 females and 18 males from two wild populations by Illumina short-read sequencing (Table S1). After filtering, we obtained more than 10 Gb of clean reads per sample (Table S14) with average depths of 30× to 40× (Table S15), yielding 4,370,362 high-quality SNPs.

A GWAS revealed a small (1,067,232 bp) S. dunnii chromosome 7 region, between 6,686,577 and 7,753,809 bp, in which 101 SNPs were significantly associated with sex (Figure 2a,b; Table S16, Figure S12). More than 99% of these candidate sex-linked SNPs are homozygous in all the females, and 63.74% are heterozygous in all the males in our sample (Table S17).

Consistent with our GWAS, the CQ method, with 18 individuals of each sex, detected the same region, and estimated a somewhat larger region, between 6.2 and 8.75 Mb, with CQ > 1.6 (which includes all the candidate sex-linked SNPs), whereas other regions of chromosome 7 and the other 18 chromosomes and contigs have CQ values close to 1 (Figure 2c Figure S13). These results suggest that S. dunnii has a male heterogametic system, with a small completely sex-linked region on chromosome 7. Because these positions are based on sequencing a female, and the species has male heterogamety, we refer to this as the X-linked region (X-LR). We predicted (see Methods) that the chromosome 7 centromere lies between roughly 5.2 and 7.9 Mb, implying that the sex-linked region may be in a low recombination region near this centromere (Figure S1). Moreover, the analysis of LD using 20 females shows that the X-LR is located within a region of the X chromosome with lower recombination than the rest of chromosome 7, consistent with a centromeric or pericentromeric location (Figure S14). Without genetic maps, it is not yet clear whether this species has low recombination near the centromeres of all its chromosomes.

Genetic differentiation (estimated as FST) between our samples of male and female individuals further confirmed a 3.205-Mb X-LR region in the region detected by the GWAS. Between 5.675 and 8.88 Mb (21% of chromosome 7), changepoint analysis (see Methods) detected FST values significantly higher than those in the flanking regions, as expected for a completely X-linked region (Figure 2 Figure S15). The other 79% of the chromosome forms two PARs (see Figure 2). LD was substantially greater in the putatively fully sex-linked region than in the whole genome (Figure S16).

3.5 Gene content of the fully sex-linked region

We found 124 apparently functional genes in the X-LR (based on intact coding sequences) vs. 516 in PAR1 (defined as the chromosome 7 region from position 0 to 5,674,999 bp), and 562 in PAR2 in chromosome 7 (from 8,880,001 to 15,272,728 bp) (Figure 2e Tables S9 and S18). The X-LR gene numbers are only 10.3% of the functional genes on chromosome 7, vs. 21% of its physical size, suggesting either a low gene density or loss of function of genes, either of which could occur in a pericentromeric genome region. We also identified 183 X-linked pseudogenes. Including pseudogenes, X-LR genes form 17% of this chromosome's gene content, and therefore overall gene density is not much lower than in the PARs. Instead, pseudogenes form a much higher proportion (59%) than in the autosomes (31%), or the PARs (148 and 269 in PAR1 and in PAR2, respectively, or 28% overall, see Tables S19 and S20). In total, 41 genes within the X-linked region had no blast hits on chromosome 7 of either P. trichocarpa or S. purpurea (Table S18).
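The percentages in the paragraph above can be re-derived from the counts it quotes. The sketch below does only that arithmetic; the variable names are illustrative, the X-LR boundaries are taken as 5,675,000 to 8,880,000 bp, and the chromosome 7 length is assumed to equal the end coordinate given for PAR2.

```python
# Counts quoted in the text for chromosome 7.
functional = {"X-LR": 124, "PAR1": 516, "PAR2": 562}
pseudogenes = {"X-LR": 183, "PAR1": 148, "PAR2": 269}

# Coordinates in bp; CHR7_LEN is assumed to be the end of PAR2.
XLR_START, XLR_END, CHR7_LEN = 5_675_000, 8_880_000, 15_272_728


def percent(part, whole):
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


# Physical share of the chromosome occupied by the X-LR.
size_share = percent(XLR_END - XLR_START, CHR7_LEN)

# Share of functional genes that lie in the X-LR.
functional_share = percent(functional["X-LR"], sum(functional.values()))

# Share of all genes (functional plus pseudogenes) that lie in the X-LR.
xlr_all = functional["X-LR"] + pseudogenes["X-LR"]
chr7_all = sum(functional.values()) + sum(pseudogenes.values())
all_share = percent(xlr_all, chr7_all)

# Pseudogene proportion inside the X-LR and inside the two PARs combined.
xlr_pseudo_share = percent(pseudogenes["X-LR"], xlr_all)
par_pseudo = pseudogenes["PAR1"] + pseudogenes["PAR2"]
par_all = par_pseudo + functional["PAR1"] + functional["PAR2"]
par_pseudo_share = percent(par_pseudo, par_all)

if __name__ == "__main__":
    for label, value in [
        ("X-LR physical share", size_share),
        ("X-LR share of functional genes", functional_share),
        ("X-LR share of all genes", all_share),
        ("pseudogene proportion, X-LR", xlr_pseudo_share),
        ("pseudogene proportion, PARs", par_pseudo_share),
    ]:
        print(f"{label}: {value:.1f}%")
```

The results land on the reported figures to within rounding: roughly 21% of the physical length, 10.3% of functional genes, 17% of all genes, and pseudogene proportions of just under 60% in the X-LR against about 28% in the PARs.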

Our searches of the S. dunnii genome for complete or partial copies of the Potri.019G133600 sequence (the ARR17-like gene described above, and discussed further below, that is involved in sex-determination on several other Salicaceae) found copies on chromosomes 1, 3, 8, 13 and 19 (Table S21). Importantly, we found none on chromosome 7, and specifically no copy or pseudogene copy in the X-LR.

3.6 Molecular evolution of S. dunnii X-linked genes

Gene density is lower in the X-LR than the PARs, probably because LTR-Gypsy element density is higher (Figure 3a). Repetitive elements make up 70.58% of the X-LR, vs. 40.36% for the PARs and 40.78% for the 18 autosomes (Table 3). More than half (53.31%) of the intact LTR-Gypsy elements identified on chromosome 7 were from the X-LR (Figure 3b; Table S8).

We estimated Ka, Ks and Ka/Ks ratios for chromosome 7 genes that are present in both S. dunnii and S. purpurea (992 orthologue pairs) or S. dunnii and P. trichocarpa (1017 orthologue pairs). Both Ka and Ks values are roughly similar across the whole chromosome (Figures S17 and S18), and the Ka/Ks values did not differ significantly between the sex-linked region and the autosomes or PARs (Figure 3c,d Figure S19). However, the Ka and Ks estimates for PAR genes are both significantly higher than for autosomal genes, suggesting a higher mutation rate (Figure S17 shows the results for divergence from P. trichocarpa, and Figure S18 for S. purpurea).

Category X-LR PARs Autosomes
Genes 0.537 (16.77%) 4.679 (38.78%) 122.740 (39.58%)
Gypsy-LTR 1.429 (44.60%) 1.370 (11.36%) 39.321 (12.68%)
Copia-LTR 0.190 (5.94%) 0.844 (6.99%) 17.986 (5.80%)
Total repeats 2.262 (70.58%) 4.870 (40.36%) 126.465 (40.78%)

3.7 Sex-biased gene expression in reproductive and vegetative tissues

After quality control and trimming, more than 80% of our RNAseq reads mapped uniquely to the genome assembly across all samples (Table S22). In both the catkin and leaf data sets, there are significantly more male- than female-biased genes. In catkins, 3734 genes have sex differences in expression (2503 male- and 1231 female-biased genes). Only 43 differentially expressed genes were detected in leaf material (31 male- vs. 12 female-biased genes, mostly also differentially expressed in catkins; Figure S20, Table S23). Chromosome 7, as a whole, showed a similar enrichment for genes with male-biased expression (117 male-biased genes, out of 1112 that yielded expression estimates, or 10.52%), but male-biased genes form significantly higher proportions only in the PARs, and not in the X-linked region (Figure 4), which included only six male- and five female-biased genes, while the other 94 X-LR genes that yielded expression estimates (90%) were unbiased.

We divided genes into three groups according to their sex differences in expression, based on the log2FoldChange values. All the male biased X-LR genes are in the higher expression category, but higher expression female-biased genes are all from the PARs (Figure 4).


Probe development:

Previous reports of single-locus FISH detection suggest that the minimum genomic target that can be detected using FISH in maize is ∼3000 bp (Kato et al. 2006; Wang et al. 2006; Yu et al. 2007). To develop probes that could be routinely used, several approaches identified genomic targets >6000 bp that would be free of repetitive elements and that would be expected to be readily detectable. Such targets included genes organized in clusters, large cDNAs, genes without repetitive elements in their introns, and pooled unique sequences from BACs.

Gene clusters:

Certain types of plant genes tend to be organized as large gene clusters. Because a single probe will hybridize to the entire group, such clusters make excellent FISH targets. The classic examples of tandem gene detection are the ribosomal genes that have been used in karyotyping for many different species, including maize (Li and Arumuganathan 2001; Kato et al. 2004). Additionally, genes that mediate maize disease resistance (Webb et al. 2002; Smith et al. 2004) or that encode storage (Woo et al. 2001) and cell-wall proteins (Wu et al. 2001) are found in large clusters and have been used as FISH probes in maize (Bauer and Birchler 2006; Kato et al. 2006; Lamb and Birchler 2006; Valdivia et al. 2007). The 19-kDa zein genes are present in clusters at several loci, including on chromosomes 4S and 7S (Song and Messing 2002). Because the 19-kDa zein subfamily A gene cluster on chromosome 4 is readily detectable using FISH (Kato et al. 2006), the B subfamily gene cluster was a promising candidate for a chromosome 7 marker. A 19-kDa zein B subfamily sequence was PCR amplified and cloned using subfamily-specific primers that have been previously described (Song and Messing 2002). This clone was used in subsequent PCR reactions to produce template for the FISH labeling reaction. The probe was called α-zeinB and produced a signal exclusively on 7S (Figure 1).

Small-target FISH on somatic chromosomes from maize inbred line B73. Somatic chromosomes from inbred line B73 were hybridized with small-target probes (red) and with repetitive element probes (CentC, TAG microsatellite, and the 180-bp knob repeat), which in combination with size and arm-length ratios, allow each chromosome to be identified. CentC and TAG microsatellite signals are green and the 180-bp knob repeat signals are blue. Chromosomes from individual preparations were electronically cut out and arranged in rows. In each row, the merged image is presented to show the chromosomal position of each small-target probe. Below the merged image, the gray values are displayed for the small-target signals as follows: (A) dek1, (B) serk2+rf2e1, (C) 19-kDa α-zeinB gene family, (D) BAC8L, (E) BAC9S, (F) acc1/acc2, and (G) myo1. Red arrows indicate the positions of the signals.

Single genes:

Most maize genes are >3 kb (Haberer et al. 2005), the minimum size that is routinely detectable by FISH. Therefore, individual genes are candidates as single-locus FISH probes, although some will contain repetitive elements in their introns and be unsuitable. To develop probes for chromosome 5, PCR primer pairs were designed to amplify two genes located on 5L: the rf2e1 gene (4739 bp) and the serk2 gene (5484 bp) (Table 1). The rf2e1 gene was amplified in two fragments. The resulting PCR products produced FISH signals on 5L and showed no or low background. The PCR products from serk2 and rf2e1 genes were combined to produce a 5L probe, referred to as serk2+rf2e1 (Figure 1). On some chromosomes, two signals were produced, probably corresponding to the two genes (Figure 1 and supplemental Figure 1, A and B). The serk2 gene is placed on the GRAMENE Z. mays fingerprinted contig map on chromosome 5L, contig 234, position 133.32 Mb. A rf2e1 homolog, found by BLASTn analysis (Z. mays PCO083188_ov mRNA, accession AY107915), is mapped on chromosome 5L, contig 240, position 145.42 Mb. The distance between the serk2 and the rf2e1 sequences is 12 Mb.

PCR probe production

Large cDNAs:

Approximately 11% of maize genes contain repetitive sequences in their introns (Haberer et al. 2005) and will not be suitable as FISH probes without removal of the repeats. Because fully processed mRNAs do not contain introns and are therefore likely to contain less repetitive DNA, a database search was conducted for large maize cDNA sequences to use as FISH probes. Thirty-six candidate mRNA sequences >4000 bp were identified, including both mapped and unmapped genes. Of these, several with cDNA sequences >6000 bp were selected for further analysis, including Z. mays B73 calpain-like protein (dek1) (7110 bp), the unconventional myosin heavy chain (myo1) (5375 bp), and the acetyl-coenzyme A carboxylase (acc1) (7324 bp) genes. The myo1 gene had not previously been localized, and the maize genome contains two highly similar acc genes (Ashton et al. 1994) present on chromosome arms 2L (acc2) and 10L (acc1). The resulting PCR products were used as FISH probes (Figure 1). Probe dek1 labeled the expected interstitial position on chromosome 1S, and additional signal was seen at the NOR due to contaminating cDNA from the rDNA genes (supplemental Figure 2). By using RNA that is enriched for poly(A)-containing mRNA as the RT–PCR template, the NOR hybridization signal was eliminated (Figure 1; supplemental Figure 1, E and F), and purified mRNA was used as the template for subsequent RT–PCR reactions. The myo1 probe produced a signal on the distal end of chromosome 3L. The acc1/acc2 probe hybridized near the centromere on chromosome 2L and at an interstitial position on chromosome 10L (Figure 1). Thus, the position of the myo1 gene has been determined and the positions of the dek1, acc1, and acc2 genes have been detected using FISH.

Pooled PCR products from BACs:

Although extensive BAC libraries exist for maize, the abundance of dispersed repetitive elements prevents the direct use of maize BACs as FISH probes. Many BAC clones have been sequenced as part of the ongoing maize genome sequencing effort, allowing the identification of unique or genic regions using sequence analysis software. Pooling multiple low-copy sequences from a BAC sequence would allow FISH to a genomic target of sufficient size to be readily detectable and free of background signal.

To develop a FISH marker for chromosome 8, the 136.9-kb BAC clone sequence AC157487 was selected. Four unique regions with sizes of 7.5, 13.4, 7.0, and 8.4 kb were identified after RepeatMasker analysis (Figure 2A). Each region was analyzed using the BLASTn program, and sequences with homology to plant cDNAs or mRNAs were selected for primer design. These regions were expected to be conserved among maize varieties. Seven PCR products were labeled as FISH probes and individually tested on chromosome spreads. Three PCR products showing no background were combined to produce a probe totaling 8.7 kb in length that readily detected a specific region on chromosome 8L (Figure 1, Figure 2B, Table 1). Four PCR products showed nonspecific hybridization due to the presence of repetitive elements, which can be missed during RepeatMasker analysis. As additional repetitive elements are added to the RepeatMasker library, selection of maize unique sequences for FISH probe development will be more effective. We anticipate that many additional probes will be produced in this fashion and propose the following naming system. Probes will be designated by the chromosome arm and the GenBank accession number used in their design. Thus, the probe produced here by pooling the PCR products is named BAC8L–AC157487. Because only one probe on 8L is used in this study, an abbreviated form of the name, BAC8L, will be used.

Development of unique probes from BAC sequences. (A) Chromosome 8L anchored BAC AC157487 (136.9 kb) after RepeatMasker analysis: four long unique BAC regions were selected for primer design. (B) Seven PCR products were amplified and used as FISH probes separately. Three PCR products that showed low or no background were selected and combined as one probe.

A similar approach was applied to develop a probe for chromosome 9S using sequence AF448416 from a BAC clone containing the bz1 region (106.2 kb). RepeatMasker and BLASTn analyses were performed in the same manner as noted above. Because the bz1 region has been well characterized (Fu and Dooner 2002; Brunner et al. 2005), it was possible to select sequences shared among three different inbreds (B73, Mo17, and McC) for FISH probe development. Seven PCR primer pairs were designed. Five PCR products corresponding to regions of genes stk1, stc1, znf, tac7077, and uce2 showed low background as FISH probes and were pooled to produce a readily detectable 12.3-kb probe for the chromosome 9 bz region (Table 2, Figure 1). The other two PCR products were not used because they produced high background. This probe, BAC9S–AF448416, will be referred to as BAC9S in this report.

Karyotyping cocktail components


By combining the new probes produced for chromosomes 1S, 5L, 7S, 8L, and 9S with others described previously, including the p1 gene on 1S (Yu et al. 2007), the 5S ribosomal gene cluster on 2L (Kato et al. 2004), the rp3 disease resistance gene cluster on 3L (Kato et al. 2006), the Cent4 repeat cluster near the centromere of chromosome 4 (Kato et al. 2004), the expansin B11 gene cluster expB11 on 5L (Valdivia et al. 2007), the 45S (NOR) ribosomal gene cluster on 6S (Kato et al. 2004), the expansin B9 gene cluster expB9 on 9L (Valdivia et al. 2007), and the rp1 disease resistance gene cluster on 10S (Kato et al. 2006), a collection of single-locus FISH probes that includes at least one on each chromosome has been assembled (Table 2). By combining selected members of this collection, labeled in different colors, it is possible to identify each chromosome in the maize karyotype. The probe cocktail was successfully applied to chromosomes of inbred lines B73, Oh43, and KYS (KYS is shown in Figure 3). The position of each probe on the respective chromosome of inbred B73 was measured and an idiogram was constructed (Figure 4).

Inbred KYS chromosomes identified using small-target probes. FISH with small-target probes including dek1 (1S, red), p1-wr (1S, white), rp3 (3L, red), serk2+rf2e1 (5L, green), expB11 (5L, green), α-zeinB (7S, red), BAC8L (8L, green), BAC9S (9S, green), expB9 (9L, green), and rp1 (10S, red) and with probes to the repetitive elements 5S rDNA (2L, green), 45S rDNA (NOR, 6S, green), and Cent4 (green). See Table 1 for details on each probe. Bar, 10 μm.

Idiogram of Z. mays B73 chromosomes showing average relative chromosome lengths (as percentages) and positions of small-target probes on somatic chromosomes from inbred line B73. The asterisk by chromosome 4 indicates the position of a second site of hybridization to a 19-kDa zein gene probe and the asterisk by chromosome 9 indicates a minor site of hybridization to expB10. The colors used to indicate the probe positions correspond to the colors in Figure 3 except probes not used for karyotyping, which are shown in black.

Probe extension to other maize lines and relatives:

Because single-locus FISH probes detected B73 genes or gene clusters whose function is likely conserved, it was expected that they would hybridize in other maize lines and related species. To confirm this supposition, the probes dek1, serk2+rf2e1, BAC8L, and BAC9S were applied to the inbreds KYS and Oh43 and produced signals in the expected locations (supplemental Figure 1). Several of the probes in the single-locus collection, including α-zeinA, rp1, and rp3, were previously shown to hybridize to unique locations in Tripsacum and Z. diploperennis (Lamb and Birchler 2006). The remaining probes from the single-locus collection (dek1, serk2+rf2e1, BAC8L, BAC9S, expB11, α-zeinB, acc1/acc2, and myo1) were applied to chromosome spreads from F1 hybrids between maize and wild relatives, including Z. luxurians, Z. diploperennis, and T. dactyloides, and to a “tri-species hybrid” containing chromosomes from Z. mays, Z. diploperennis, and T. dactyloides. The presence of the maize chromosomes in these hybrids provides a positive control for ensuring that the conditions were optimal for signal detection. For all the probes, signal could be detected on the chromosomes from the wild relatives. Several examples are included in Figure 5 and supplemental Figure 2. The number of signals in the wild Zea species was the same as that in maize for each probe. In Tripsacum, the rp1, rp3, α-zeinA (Lamb and Birchler 2006), α-zeinB, myo1, and acc1/acc2 probes all produced the same number of signals per haploid genome as in maize (Figure 5, supplemental Figure 2). The other probes produced more signals per haploid genome in Tripsacum than in Zea: BAC8L (three signals), dek1 (two signals), and serk2+rf2e1 (three signals). Because BAC8L is a mixture of three detectable PCR products (3125, 2179, and 3353 bp long), the three sites could indicate three homologous regions or result from separation of the genomic locations relative to their positions in maize.

Extension of small-target FISH probes to wild relatives of Z. mays. Small-target probes are indicated by arrowheads. The gray-value images depict the small-target labeling alone. Bars, 10 μm. (A) Maize × Z. diploperennis F1 hybrid chromosomes labeled with BAC8L (red) and (B) ExpB11 (red). The Grande retrotransposon probe (green) hybridizes strongly to maize chromosomes and with intermediate intensity to Z. diploperennis chromosomes. (C) “Tri-species” hybrid containing a haploid set of chromosomes from maize (n = 9, as chromosome 2 is missing), Z. diploperennis (n = 10), and T. dactyloides (n = 18), labeled with myo1 (red) and a Tripsacum-specific retroelement probe, TC#25 (green). Three myo1 signals are observed, two on Zea chromosomes and one on a Tripsacum chromosome. (D) Maize × Z. luxurians F1 hybrid labeled with the serk2+rf2e1 probe (red) and the 180-bp knob probe (green). Knob signals in Z. luxurians are located at the ends of chromosomes whereas maize knob signals are interstitial.

General usage

As described on the UCSC Genome Browser website (see link below), the browser extensible data (BED) format is a concise and flexible way to represent genomic features and annotations. The BED format description supports up to 12 columns, but only the first 3 are required for the UCSC browser, the Galaxy browser and for bedtools. bedtools allows one to use the “BED12” format (that is, all 12 fields listed below). However, only intersectBed, coverageBed, genomeCoverageBed, and bamToBed will obey the BED12 “blocks” when computing overlaps, etc., via the “-split” option. For all other tools, the last six columns are not used for any comparisons by the bedtools. Instead, they will use the entire span (start to end) of the BED12 entry to perform any relevant feature comparisons. The last six columns will be reported in the output of all comparisons.

  1. chrom - The name of the chromosome on which the genome feature exists.
  • Any string can be used. For example, “chr1”, “III”, “myChrom”, “contig1112.23”.
  • This column is required.
  2. start - The zero-based starting position of the feature in the chromosome.
  • The first base in a chromosome is numbered 0.
  • The start position in each BED feature is therefore interpreted to be 1 greater than the start position listed in the feature. For example, start=9, end=20 is interpreted to span bases 10 through 20, inclusive.
  • This column is required.
  3. end - The one-based ending position of the feature in the chromosome.
  • The end position in each BED feature is one-based. See example above.
  • This column is required.
  4. name - Defines the name of the BED feature.
  • Any string can be used. For example, “LINE”, “Exon3”, “HWIEAS_0001:3:1:0:266#0/1”, or “my_Feature”.
  • This column is optional.
  5. score - The UCSC definition requires that a BED score range from 0 to 1000, inclusive. However, bedtools allows any string to be stored in this field in order to allow greater flexibility in annotation features. For example, strings allow scientific notation for p-values, mean enrichment values, etc. It should be noted that this flexibility could prevent such annotations from being correctly displayed on the UCSC browser.
  • Any string can be used. For example, 7.31E-05 (p-value), 0.33456 (mean enrichment value), “up”, “down”, etc.
  • This column is optional.
  6. strand - Defines the strand, either “+” or “-”.
  • This column is optional.
  7. thickStart - The starting position at which the feature is drawn thickly.
  8. thickEnd - The ending position at which the feature is drawn thickly.
  9. itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0).
  10. blockCount - The number of blocks (exons) in the BED line.
  11. blockSizes - A comma-separated list of the block sizes.
  12. blockStarts - A comma-separated list of block starts.
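The start/end convention above is a common source of off-by-one errors, so a minimal sketch may help. The helper names `bed_to_one_based` and `bed_length` are hypothetical and are not part of bedtools; they only restate the rule that a BED start is zero-based while a BED end is one-based.

```python
def bed_to_one_based(start: int, end: int) -> tuple[int, int]:
    """Convert a BED interval (zero-based start, one-based end)
    to a one-based, fully inclusive interval."""
    if start < 0 or end < start:
        raise ValueError(f"invalid BED interval: start={start}, end={end}")
    # Only the start shifts; the end is already one-based.
    return start + 1, end


def bed_length(start: int, end: int) -> int:
    """Number of bases covered by a BED interval."""
    return end - start


# The example from the text: start=9, end=20 spans bases 10 through 20.
first_base, last_base = bed_to_one_based(9, 20)
print(first_base, last_base)   # 10 20
print(bed_length(9, 20))       # 11
```

Because only the start is shifted, the length of a BED feature is simply end minus start, with no "+1" correction.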

bedtools requires that all BED input files (and input received from stdin) are tab-delimited. The following types of BED files are supported by bedtools:

  1. BED3: A BED file where each feature is described by chrom, start, and end.
  2. BED4: A BED file where each feature is described by chrom, start, end, and name.
  3. BED5: A BED file where each feature is described by chrom, start, end, name, and score.
  4. BED6: A BED file where each feature is described by chrom, start, end, name, score, and strand.
  5. BED12: A BED file where each feature is described by all twelve columns listed above.

For example: chr1 11873 14409 uc001aaa.3 0 + 11873 11873 0 3 354,109,1189, 0,739,1347,
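As an illustration of how the block columns relate to the feature span, the sketch below expands the example record into its individual blocks. The function name `bed12_blocks` is hypothetical and this is not how bedtools itself is implemented; it assumes the usual BED12 convention that block starts are offsets relative to the feature start. Real files are tab-delimited; the sketch splits on any whitespace so that it also accepts the example as printed here.

```python
def bed12_blocks(line: str) -> list[tuple[str, int, int]]:
    """Expand one BED12 record into (chrom, start, end) tuples, one per block.
    Block starts are taken to be offsets from the feature start."""
    fields = line.split()
    if len(fields) < 12:
        raise ValueError("a BED12 record needs 12 columns")
    chrom, start = fields[0], int(fields[1])
    block_count = int(fields[9])
    # Trailing commas are allowed in the size/start lists, so drop empty items.
    sizes = [int(x) for x in fields[10].split(",") if x]
    offsets = [int(x) for x in fields[11].split(",") if x]
    if not (len(sizes) == len(offsets) == block_count):
        raise ValueError("blockCount does not match blockSizes/blockStarts")
    return [(chrom, start + off, start + off + size)
            for size, off in zip(sizes, offsets)]


record = "chr1 11873 14409 uc001aaa.3 0 + 11873 11873 0 3 354,109,1189, 0,739,1347,"
for block in bed12_blocks(record):
    print(block)
# ('chr1', 11873, 12227)
# ('chr1', 12612, 12721)
# ('chr1', 13220, 14409)
```

The last block ends at 14409, the end of the whole feature, which is a quick consistency check on a BED12 record.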

BEDPE format

We have defined a new file format, the browser extensible data paired-end (BEDPE) format, in order to concisely describe disjoint genome features, such as structural variations or paired-end sequence alignments. We chose to define a new format because the existing “blocked” BED format (a.k.a. BED12) does not allow inter-chromosomal feature definitions. In addition, BED12 only has one strand field, which is insufficient for paired-end sequence alignments, especially when studying structural variation.

The BEDPE format is described below. The description is modified from:

  1. chrom1 - The name of the chromosome on which the first end of the feature exists.
  • Any string can be used. For example, “chr1”, “III”, “myChrom”, “contig1112.23”.
  • This column is required.
  • Use “.” for unknown.
  2. start1 - The zero-based starting position of the first end of the feature on chrom1.
  • The first base in a chromosome is numbered 0.
  • As with BED format, the start position in each BEDPE feature is therefore interpreted to be 1 greater than the start position listed in the feature.
  • This column is required.
  • Use -1 for unknown.
  3. end1 - The one-based ending position of the first end of the feature on chrom1.
  • The end position in each BEDPE feature is one-based.
  • This column is required.
  • Use -1 for unknown.
  4. chrom2 - The name of the chromosome on which the second end of the feature exists.
  • Any string can be used. For example, “chr1”, “III”, “myChrom”, “contig1112.23”.
  • This column is required.
  • Use “.” for unknown.
  5. start2 - The zero-based starting position of the second end of the feature on chrom2.
  • The first base in a chromosome is numbered 0.
  • As with BED format, the start position in each BEDPE feature is therefore interpreted to be 1 greater than the start position listed in the feature.
  • This column is required.
  • Use -1 for unknown.
  6. end2 - The one-based ending position of the second end of the feature on chrom2.
  • The end position in each BEDPE feature is one-based.
  • This column is required.
  • Use -1 for unknown.
  7. name - Defines the name of the BEDPE feature.
  • Any string can be used. For example, “LINE”, “Exon3”, “HWIEAS_0001:3:1:0:266#0/1”, or “my_Feature”.
  • This column is optional.
  8. score - The UCSC definition requires that a BED score range from 0 to 1000, inclusive. However, bedtools allows any string to be stored in this field in order to allow greater flexibility in annotation features. For example, strings allow scientific notation for p-values, mean enrichment values, etc. It should be noted that this flexibility could prevent such annotations from being correctly displayed on the UCSC browser.
  • Any string can be used. For example, 7.31E-05 (p-value), 0.33456 (mean enrichment value), “up”, “down”, etc.
  • This column is optional.
  9. strand1 - Defines the strand for the first end of the feature, either “+” or “-”.
  • This column is optional.
  • Use “.” for unknown.
  10. strand2 - Defines the strand for the second end of the feature, either “+” or “-”.
  • This column is optional.
  • Use “.” for unknown.
  11. Any number of additional, user-defined fields - bedtools allows one to add as many additional fields to the normal, 10-column BEDPE format as necessary. These columns are merely “passed through” pairToBed and pairToPair and are not part of any analysis. One would use these additional columns to add extra information (e.g., edit distance for each end of an alignment, or “deletion”, “inversion”, etc.) to each BEDPE feature.
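The following sketch shows one way the six required BEDPE columns and the "unknown" markers could be read. The names `BedpeEnd`, `parse_bedpe`, and `is_interchromosomal`, as well as the sample record, are hypothetical and made up for illustration; they are not taken from bedtools.

```python
from typing import NamedTuple, Optional


class BedpeEnd(NamedTuple):
    chrom: Optional[str]   # None when the file says "."
    start: Optional[int]   # None when the file says -1
    end: Optional[int]     # None when the file says -1


def _one_end(chrom: str, start: str, end: str) -> BedpeEnd:
    """Translate the BEDPE 'unknown' markers into None."""
    s, e = int(start), int(end)
    return BedpeEnd(None if chrom == "." else chrom,
                    None if s == -1 else s,
                    None if e == -1 else e)


def parse_bedpe(line: str) -> tuple[BedpeEnd, BedpeEnd, list[str]]:
    """Split a tab-delimited BEDPE record into its two ends plus the
    remaining (optional and user-defined) columns, left as strings."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        raise ValueError("BEDPE needs at least six columns")
    return _one_end(*fields[0:3]), _one_end(*fields[3:6]), fields[6:]


def is_interchromosomal(a: BedpeEnd, b: BedpeEnd) -> bool:
    """True when both chromosomes are known and differ."""
    return a.chrom is not None and b.chrom is not None and a.chrom != b.chrom


# Hypothetical record with one user-defined column ("deletion").
rec = "chr1\t100\t200\tchr5\t5000\t5100\tpair1\t37\t+\t-\tdeletion"
end1, end2, rest = parse_bedpe(rec)
print(end1, end2, is_interchromosomal(end1, end2), rest)
```

Keeping the two ends as separate values makes the inter-chromosomal case, which BED12 cannot express, straightforward to test for.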

Entries from a typical BEDPE file:

Entries from a BEDPE file with two custom fields added to each record:

GFF format

The GFF format is described on the Sanger Institute’s website. The GFF description below is modified from that definition. All nine columns in the GFF format description are required by bedtools.

  1. seqname - The name of the sequence (e.g. chromosome) on which the feature exists.
  • Any string can be used. For example, “chr1”, “III”, “myChrom”, “contig1112.23”.
  • This column is required.
  2. source - The source of this feature. This field will normally be used to indicate the program making the prediction, or if it comes from public database annotation, or is experimentally verified, etc.
  • This column is required.
  3. feature - The feature type name.
  • This column is required.
  4. start - The one-based starting position of the feature on seqname.
  • This column is required.
  • bedtools accounts for the fact that GFF uses a one-based position and BED uses a zero-based start position.
  5. end - The one-based ending position of the feature on seqname.
  • This column is required.
  6. score - A score assigned to the GFF feature. Like BED format, bedtools allows any string to be stored in this field in order to allow greater flexibility in annotation features. We note that this differs from the GFF definition in the interest of flexibility.
  7. strand - Defines the strand. Use “+”, “-”, or “.”.
  8. frame - The frame of the coding sequence. Use “0”, “1”, “2”, or “.”.
  9. attribute - Taken from the GFF definition: from version 2 onwards, the attribute field must have a tag-value structure following the syntax used within objects in a .ace file, flattened onto one line by semicolon separators. Free text values must be quoted with double quotes. Note: all non-printing characters in such free text value strings (e.g. newlines, tabs, control characters, etc.) must be explicitly represented by their C (UNIX) style backslash-escaped representation (e.g. newlines as ‘\n’, tabs as ‘\t’). As in ACEDB, multiple values can follow a specific tag. The aim is to establish consistent use of particular tags, corresponding to an underlying implied ACEDB model if you want to think that way (but ACEDB is not required).
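To make the coordinate difference between GFF and BED concrete, here is a minimal sketch that reduces a nine-column GFF line to BED3 fields. The function name `gff_line_to_bed3` and the sample line are hypothetical; the only rule applied is the one stated above, namely that GFF positions are one-based while the BED start is zero-based.

```python
def gff_line_to_bed3(line: str) -> tuple[str, int, int]:
    """Convert one tab-delimited GFF line to (chrom, start, end) in BED terms.
    GFF start and end are both one-based and inclusive, so only the start
    needs to be shifted down by one."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected nine tab-delimited GFF columns")
    seqname, gff_start, gff_end = fields[0], int(fields[3]), int(fields[4])
    if gff_start < 1 or gff_end < gff_start:
        raise ValueError("invalid GFF coordinates")
    return seqname, gff_start - 1, gff_end


# Hypothetical GFF line covering bases 10 through 20.
gff = "chr1\texample_source\texon\t10\t20\t.\t+\t.\tID=demo"
print(gff_line_to_bed3(gff))   # ('chr1', 9, 20)
```

The result matches the BED example given earlier, where start=9, end=20 spans bases 10 through 20.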

An entry from an example GFF file:

Genome file format

Some of the bedtools (e.g., genomeCoverageBed, complementBed, slopBed) need to know the size of the chromosomes for the organism on which your BED files are based. When using the UCSC Genome Browser, Ensembl, or Galaxy, you typically indicate which species/genome build you are working with. The way you do this for bedtools is to create a “genome” file, which simply lists the names of the chromosomes (or scaffolds, etc.) and their size (in base pairs).

Genome files must be tab-delimited and are structured as follows (this is an example for C. elegans):

bedtools includes pre-defined genome files for human and mouse in the /genomes directory included in the bedtools distribution.
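A rough sketch of why chromosome sizes matter: when an interval is extended by some padding, the result has to be clipped at 0 and at the chromosome length. The functions `parse_genome_file` and `pad_interval`, and the chromosome names and sizes below, are hypothetical illustrations and are not the bedtools implementation or a real genome build.

```python
def parse_genome_file(text: str) -> dict[str, int]:
    """Read tab-delimited 'name<TAB>size' lines into a dict of sizes."""
    sizes: dict[str, int] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        name, size = line.split("\t")
        sizes[name] = int(size)
    return sizes


def pad_interval(chrom: str, start: int, end: int, pad: int,
                 sizes: dict[str, int]) -> tuple[int, int]:
    """Extend a BED interval by `pad` bases on each side, clipped so that it
    never starts before 0 or ends past the chromosome size."""
    return max(0, start - pad), min(sizes[chrom], end + pad)


# Made-up chromosome names and sizes, for illustration only.
genome_text = "chrA\t1000\nchrB\t250\n"
sizes = parse_genome_file(genome_text)
print(sizes)                                   # {'chrA': 1000, 'chrB': 250}
print(pad_interval("chrB", 10, 240, 50, sizes))  # (0, 250)
```

Without the genome file there would be no way to know that 250 is the upper limit for the second chromosome.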


Upregulated DEGs were significantly enriched in cell cycle-related pathways

Many pipelines and strategies exist to aid in the interpretation of omics data. Firstly, we selected suitable datasets and performed canonical DEG screening to characterize ATC. Detailed sample information was listed in Table S1.

The data retrieval process for DEG screening is shown in Fig. 1A. Using the combined effect size method, we identified 661 DEGs, including 318 upregulated and 343 downregulated genes. Detailed information on the DEGs is provided in Table S2.

After DEG filtering, we performed gene enrichment analysis to characterize the relevant KEGG pathways of these DEGs. As illustrated in Figs. 1B & 1C, upregulated DEGs were significantly enriched in cell cycle-related pathways. Meanwhile, downregulated DEGs were primarily enriched in thyroid hormone synthesis pathway.

The above results indicated that the thyroid hormone synthesis pathway was significantly enriched in downregulated DEGs. We were not surprised to see this, as degenerative phenotypes are classic manifestations of ATC (Molinaro et al., 2017).

As indicated by previous literature (Evans et al., 2012; Pita et al., 2014), dysregulation of cell cycle-related pathways is an important feature and potential driver of ATC. Hence, in the present work, we primarily focused on cell cycle-related key genes. We further validated the enrichment of the KEGG pathway ‘Cell cycle’ using the flexible GSVA method. As illustrated in Fig. 1D, the pathway ‘Cell cycle’ was differentially enriched between ATC and normal thyroid tissue, with adjusted P value < 0.0001.

Detecting gene modules using WGCNA

Next, we decided to apply an unsupervised clustering algorithm WGCNA to explore the co-expression network and find if there was any gene cluster highly related to ATC. Using WGCNA (Langfelder & Horvath, 2008), we can identify the correlations among genes and cluster genes into ‘gene modules’. By quantifying the associations between these gene modules and ATC, we can filter out potential key gene modules for further analysis.

As an advanced data mining algorithm, WGCNA places high demands on sample size. To make full use of the data and produce more robust results, we re-screened and re-selected the data (Fig. 2A). Detailed sample information is listed in Table S1.

The top 5,000 genes with the highest variance were loaded for module detection. As shown in Fig. 2B, several gene modules were identified by WGCNA. We then calculated the correlations between these modules and ATC using each module’s eigengene. A total of five gene modules were identified as positively correlated with ATC (P < 0.05). Among them, module turquoise had the highest correlation coefficient.
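The variance filter described above can be sketched as follows. This is only an illustration using the Python standard library, not the authors' pipeline (WGCNA itself is an R package); the function name `top_variable_genes` and the toy expression values are hypothetical.

```python
from statistics import variance


def top_variable_genes(expr: dict[str, list[float]], n: int) -> list[str]:
    """Return the names of the n genes with the highest sample variance
    across samples, in descending order of variance."""
    ranked = sorted(expr, key=lambda gene: variance(expr[gene]), reverse=True)
    return ranked[:n]


# Toy expression matrix: gene -> expression values across four samples.
toy_expr = {
    "g1": [1.0, 1.0, 1.0, 1.0],     # constant, variance 0
    "g2": [0.0, 10.0, 0.0, 10.0],   # highly variable
    "g3": [1.0, 2.0, 3.0, 4.0],     # moderately variable
}
print(top_variable_genes(toy_expr, 2))   # ['g2', 'g3']
```

In the study the same idea is applied with n = 5,000, so that genes with little variation across samples do not enter module detection.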

Identifying module turquoise as a potential key cycle-related module

After module detection, key gene modules can be further uncovered by gene enrichment analysis focused on the genes’ involvement in pathways. As the above analysis revealed that upregulated genes were enriched in cell cycle-related pathways, we next explored whether any cell cycle-enriched gene module could be detected.

As illustrated in Fig. 3A, KEGG enrichment analysis revealed that cell cycle-related pathways were significantly enriched in genes of module turquoise. The GSVA method confirmed the enrichment (Fig. 3B) with adjusted P value < 0.0001. No other gene module relevant to ATC (P < 0.05, both positively and negatively correlated) showed enrichment of cell cycle-related pathways (Table S3). We therefore chose module turquoise as a cell cycle-related key gene module for further exploration.

Figure 3: Module turquoise was significantly enriched in cell cycle-related pathways.

Combining two pipelines to filter out potential cell cycle-related key genes

Genes interact with each other, forming a comprehensive network. For key genes occupying central positions in the regulatory network, even small changes may have a great impact. Hence, we explored gene-gene interactions among these DEGs to uncover DEGs with potential key functions. Based on the protein-protein interaction (PPI) network, we identified the top 50 hub DEGs with the highest prediction scores. Interestingly, all of the top 50 hub genes were clustered in module turquoise (Fig. 4).

Figure 4: Centrality of the top 50 PPI network-predicted hub DEGs in module turquoise.

The WGCNA algorithm calculates an eigengene to characterize each module. Module membership (MM) was defined as the absolute correlation coefficient between each gene’s expression and the corresponding module eigengene. A high MM value indicates high centrality of the gene in the subnetwork. Genes with MM > 0.85 were regarded as the module’s hub genes. According to this cut-off, we identified 31 genes predicted as key genes by both the PPI network-guided and WGCNA-guided prediction pipelines (Fig. 4). As both the upregulated DEGs and the genes of module turquoise were significantly enriched in cell cycle-related pathways, these key genes can be regarded as potential cell cycle-related key genes.
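The MM definition above can be sketched in a few lines. This is an illustration only: the module eigengene is taken as given (in WGCNA it is derived from the module's expression data), Pearson correlation is assumed as the correlation measure, and the function names and toy values are hypothetical rather than the authors' code.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def module_membership(expr: dict[str, list[float]],
                      eigengene: list[float]) -> dict[str, float]:
    """MM = absolute correlation between each gene and the module eigengene."""
    return {gene: abs(pearson(values, eigengene)) for gene, values in expr.items()}


def hub_genes(mm: dict[str, float], cutoff: float = 0.85) -> list[str]:
    """Genes whose MM exceeds the cutoff (0.85 in the text)."""
    return sorted(gene for gene, value in mm.items() if value > cutoff)


# Toy data: an assumed eigengene and three genes across five samples.
eigengene = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_expr = {
    "geneA": [2.0, 4.0, 6.0, 8.0, 10.0],   # tracks the eigengene
    "geneB": [5.0, 4.0, 3.0, 2.0, 1.0],    # anti-correlated, still high MM
    "geneC": [1.0, -1.0, 1.0, -1.0, 1.0],  # unrelated to the eigengene
}
print(hub_genes(module_membership(toy_expr, eigengene)))   # ['geneA', 'geneB']
```

Because MM uses the absolute value, a gene that is strongly anti-correlated with the eigengene counts as central just like a strongly correlated one.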

Further filtering of cell cycle-related key genes with cancer/testis expression pattern

Expression of some genes is restricted to germ cells under normal conditions, but may be reactivated and upregulated in tumors. These ‘cancer/testis’ genes have potential as therapeutic targets because they are both immunogenic and critical in tumorigenesis. Wang et al. recently performed a systematic identification of testis-specific genes (Wang et al., 2016). Based on their publication, we identified 10 of the 31 predicted key genes as having a cancer/testis expression pattern (Fig. 5A). Their expression levels across major organs under physiological conditions are illustrated in Fig. 5B. These genes were further regarded as putative key genes of ATC harboring therapeutic potential.

Figure 5: Identification of 10 genes with cancer/testis expression pattern as putative key genes of ATC harboring therapeutic potential.

We further validated their gene ontology (GO) ‘biological process (BP)’ classification using the ARCHS4 database. The top 10 GO terms with the highest Z scores for each putative key gene are recorded in Table S4. These annotated GO terms again demonstrated that these putative key genes play key roles in cell cycle-related pathways. Notably, GO annotation revealed that these putative key genes were primarily associated with chromosome segregation, which will be discussed later.

Key genes’ impact on disease-free survival among patients with differentiated thyroid cancer

Next, we investigated the association between the expression of those key genes and the clinical outcomes of thyroid cancer patients. Data from the THCA cohort of the TCGA project were used. The THCA cohort mainly includes differentiated thyroid cancers. Nevertheless, the tumorigenesis and progression of ATC have been widely acknowledged to be a multistep deterioration process that evolves from that of differentiated thyroid cancers (Molinaro et al., 2017). Hence, the THCA cohort can still provide valuable information on the functional characterization of key genes in ATC from a pan-thyroid cancer perspective.

As illustrated in Figs. 6A–6E, expression levels of TRIP13, TPX2, DLGAP5, KIF2C and TTK were associated with shorter disease-free survival (DFS) among differentiated thyroid cancer patients. As illustrated in Fig. 6F, patients with more key genes upregulated tended to have shorter DFS (logrank P = 0.0128) than patients with fewer key genes upregulated.

Figure 6: Putative key genes’ impact on disease free survival (DFS) among differentiated thyroid cancer patients.


In mammalian cells, the p-arms of many acrocentric chromosomes carry nucleolar organising regions (NORs), which contain genes coding for ribosomal RNA. This is true for all five pairs of acrocentrics in human cells.


A chromosome anomaly can be:

    e.g. 1: a constitutional anomaly having occurred in a parental gamete (e.g. + 21) will be found in each of the cells of the resulting child (homogeneous trisomy 21).

Note: In practice, when an acquired anomaly is said to be homogeneous, it only means that no normal cell was karyotyped within the scored sample.


    e.g. 1: A non-disjunction (e.g. + 21) having occurred in the zygote after a few cell divisions: Only some of the embryo cells (and later, of the child’s cells) will carry the anomaly (46, XY/47, XY, +21).

* A chromosome anomaly can be:


1 - Homogeneous due to meiotic non-disjunction (Figure)

    non disjunction in first meiotic division produces 4 unbalanced gametes.

Table: Zygotes produced for each type gamete: Empty boxes indicate a non-viable conceptus. Boxes XX and XY with ° are normal zygotes from normal gametes. Boxes with * are normal zygotes from unbalanced gametes.


2 - Homogenous due to a fertilisation anomaly

    digyny: non-expulsion of the 2nd polar body.

Note: Viability of the two daughter cells may differ. In the above-mentioned trisomy 21 example, the clone monosomic for 21 is non-viable and has disappeared.

Note: Mosaicism is frequent in malignancies, either because normal cells can still be karyotyped, or because the malignant clone produces sub-clones with additional anomalies (clonal evolution).

Visually, chromosomes can appear to break, and broken ends can rejoin in various ways:

retinoblastoma. Normal individuals carry 2 functional copies, but one of these can be inactivated by mutation or removal (loss of heterozygosity) and the cell continues normal function through the normal allele (which is now acting as a tumour suppressor gene). Loss of the second allele by removal (or mutation) leads to the formation of the tumour."

Note: Many of the structural aberrations formed are cell lethal, and are soon eliminated from the cell population. Of those that survive and are transmitted, the most frequent are translocations, small inversions and deletions.

Note: Rearranged chromosomes that are transmitted are called derivative chromosomes (der) and they are numbered according to the centromere they carry. Thus a reciprocal translocation between chromosome 7 and chromosome 14 will result in a der(7) and a der(14).

B - Main structural anomalies (Figure)

1 - Reciprocal translocation

Transmission to descendants (constitutional anomalies)

At meiosis, where there is pairing of homologous chromosome segments (normal chromosomes form a bivalent) followed by crossing-over, translocations may form a quadrivalent (tetravalent, in Greek), and this leads to segregation problems. At meiotic anaphase I, chromosomes separate without centromere separation; this separation occurs at anaphase II. Segregation of chromatids in the case of a quadrivalent (Figure) can occur according to the following:

Note: There will be no mechanical transmission problems at mitosis.

Note: Reciprocal and complex translocations can also occur in somatic cells at any time after birth; they are particularly frequent in cancer processes.
