Phenotype/disease specific gene ranking using curated, gene library and network based data structures

Publication Date:
October 20, 2020

Additional Information
- Patent Number:
  10,810,213
- Appl. No:
  15/723055
- Application Filed:
  October 02, 2017
- Abstract:
  The present invention relates to methods, systems and apparatus for capturing, integrating, organizing, navigating and querying large-scale data from high-throughput biological and chemical assay platforms. It provides a highly efficient meta-analysis infrastructure for performing research queries across a large number of studies and experiments from different biological and chemical assays, data types and organisms, as well as systems to build and add to such an infrastructure. According to various embodiments, methods, systems and interfaces are provided for identifying genes that are potentially associated with a biological, chemical or medical concept of interest.
- Inventors:
  Illumina, Inc. (San Diego, CA, US)
- Assignees:
  Illumina, Inc. (San Diego, CA, US)
- Claim:
  1. A computer system, comprising: one or more processors; system memory; one or more computer-readable storage media having stored thereon a database comprising a plurality of gene sets, wherein each gene set of the plurality of gene sets comprises a plurality of genes and a plurality of experimental values associated with the plurality of genes, and wherein the plurality of experimental values are correlated with a biological, chemical, or medical concept of interest in at least one experiment; and one or more computer-readable storage media storing program code that, when executed by the one or more processors, causes the computer system to implement a method for identifying genes associated with the biological, chemical, or medical concept of interest, said program code comprising: (a) code for selecting the plurality of gene sets from the database; (b) code for determining, for each gene set, one or more experimental gene scores for first one or more genes among the plurality of genes using one or more experimental values of the first one or more genes; (c) code for determining, for each gene set, one or more in silico gene scores for second one or more genes among the plurality of genes based at least in part on the first one or more genes' correlations with the second one or more genes, wherein: the first one or more genes' correlations with the second one or more genes are indicated in other gene sets in the database beside the plurality of gene sets, each of the other gene sets comprises correlation scores between one or more of the second one or more genes and one of the first one or more genes, an in silico gene score for each of the second one or more genes indicates a correlation between each of the second one or more genes and one or more of the first one or more genes, and the in silico gene score for each of the second one or more genes is obtained by aggregating the correlation scores for each of the second one or more genes across the other gene sets; (d) code for obtaining summary scores for the first and second one or more genes based at least in part on the one or more experimental gene scores for the first one or more genes determined in (b) and the one or more in silico gene scores for the second one or more genes determined in (c), wherein each summary score is aggregated across the plurality of gene sets; and (e) code for identifying the genes associated with the biological, chemical, or medical concept of interest using the summary scores of the first and second one or more genes.
- Claim:
  2. The computer system of claim 1 , wherein (c) comprises, for each gene set of the plurality of gene sets, (i) code for identifying a second plurality of gene sets from the database, each gene set of the second plurality of gene sets comprising a second plurality of genes and a second plurality of experimental values associated with the second plurality of genes, and wherein the second plurality of experimental values are correlated with a first gene among the first one or more genes; (ii) code for aggregating the experimental values across the second plurality of gene sets to obtain a vector of aggregated values for the first gene among the first one or more genes; (iii) code for applying (i) and (ii) to one or more other genes among the first one or more genes, thereby obtaining one or more vectors of aggregated values for the one or more other genes among the first one or more genes; and (iv) code for aggregating vectors of aggregated values for the first gene and the one or more other genes among the first one or more genes, thereby obtaining one compressed vector comprising the one or more in silico gene scores for the second one or more genes.
- Claim:
  3. The computer system of claim 1 , wherein said program code further comprises code for determining one or more gene-group scores for third one or more genes.
- Claim:
  4. The computer system of claim 1 , wherein said program code further comprises code for determining interactome scores respectively for fourth one or more genes.
- Claim:
  5. A method, implemented at a computer system that includes one or more processors and system memory, for identifying genes associated with a biological, chemical, or medical concept of interest, the method comprising: (a) selecting, by the one or more processors, a plurality of gene sets from a database, wherein each gene set of the plurality of gene sets comprises a plurality of genes and a plurality of experimental values associated with the plurality of genes, and wherein the plurality of experimental values are correlated with the biological, chemical, or medical concept of interest in at least one experiment; (b) determining, for each gene set and by the one or more processors, one or more experimental gene scores for first one or more genes among the plurality of genes using one or more experimental values of the first one or more genes; (c) determining, for each gene set and by the one or more processors, one or more in silico gene scores for second one or more genes among the plurality of genes based at least in part on the first one or more genes' correlations with the second one or more genes, wherein: the first one or more genes' correlations with the second one or more genes are indicated in other gene sets in the database beside the plurality of gene sets, each of the other gene sets comprises correlation scores between one or more of the second one or more genes and one of the first one or more genes, an in silico gene score for each of the second one or more genes indicates a correlation between each of the second one or more genes and one or more of the first one or more genes, and the in silico gene score for each of the second one or more genes is obtained by aggregating the correlation scores for each of the second one or more genes across the other gene sets; (d) obtaining, by the one or more processors, summary scores for the first and second one or more genes based at least in part on the one or more experimental gene scores for the first one or more genes determined in (b) and the one or more in silico gene scores for the second one or more genes determined in (c), wherein each summary score is aggregated across the plurality of gene sets; and (e) identifying, by the one or more processors, the genes associated with the biological, chemical, or medical concept of interest using the summary scores of the first and second one or more genes.
- Claim:
  6. The method of claim 5 , wherein (c) comprises, for each gene set of the plurality of gene sets, (i) identifying a second plurality of gene sets from the database, each gene set of the second plurality of gene sets comprising a second plurality of genes and a second plurality of experimental values associated with the second plurality of genes, and wherein the second plurality of experimental values are correlated with a first gene among the first one or more genes; (ii) aggregating the experimental values across the second plurality of gene sets to obtain a vector of aggregated values for the first gene among the first one or more genes; (iii) applying (i) and (ii) to one or more other genes among the first one or more genes, thereby obtaining one or more vectors of aggregated values for the one or more other genes among the first one or more genes; and (iv) aggregating vectors of aggregated values for the first gene and the one or more other genes among the first one or more genes, thereby obtaining one compressed vector comprising the one or more in silico gene scores for the second one or more genes.
- Claim:
  7. The method of claim 5 , further comprising, determining, before (d), one or more gene-group scores for third one or more genes.
- Claim:
  8. The method of claim 7 , wherein each gene-group score for a particular gene is determined using (i) gene memberships of one or more gene groups that each comprise a group of genes related to a group label, wherein the group of genes comprises the particular gene, and (ii) at least some of the one or more experimental values of the first one or more genes.
- Claim:
  9. The method of claim 8 , wherein (d) comprises obtaining the summary scores for the first and second one or more genes based at least in part on the gene-group scores for at least some of the third one or more genes, as well as the one or more experimental scores for the first one or more genes determined in (b) and the one or more in silico scores for the second one or more genes determined in (c).
- Claim:
  10. The method of claim 8 , wherein determining the one or more gene-group scores for the third one or more genes comprises: identifying, for a particular gene among the third one or more genes, the one or more gene groups that each comprise the particular gene; determining, for each gene group, a percentage of members of the gene group that are among the first one or more genes; aggregating, for each gene group, one or more experimental values of at least some of the first one or more genes that are members of the gene group, thereby obtaining a sum experimental value for the gene group; and determining, for the particular gene among the third one or more genes, a gene-group score using the percentage of members of the gene group that are among the first one or more genes and the sum experimental value for the gene group.
- Claim:
  11. The method of claim 10 , wherein determining the gene-group score using the percentage of members of the gene group that are among the first one or more genes and the sum experimental value for the gene group comprises: obtaining, for each gene group, a product of the percentage of members and the sum experimental value, thereby obtaining one or more products for the one or more gene groups; summing, across the one or more gene groups, the one or more products, thereby obtaining a summed product; and determining, for the particular gene among the third one or more genes, a gene-group score based on the summed product.
- Claim:
  12. The method of claim 5 , further comprising, before (d), determining interactome scores respectively for fourth one or more genes.
- Claim:
  13. The method of claim 12 , wherein each interactome score for a particular gene is determined using (i) connections between the particular gene and other genes connected to the particular gene in a network of genes and (ii) at least some of the one or more experimental values of the first one or more genes.
- Claim:
  14. The method of claim 13 , wherein (d) comprises obtaining the summary scores for at least the first one or more genes and the second one or more genes based at least in part on the interactome scores for at least some of the fourth one or more genes, as well as the one or more experimental gene scores for the first one or more genes determined in (b) and the one or more in silico gene scores for the second one or more genes determined in (c).
- Claim:
  15. The method of claim 13 , wherein the network of genes is based on interactions and relations among genes, proteins, and/or phospholipids.
- Claim:
  16. The method of claim 13 , wherein determining interactome scores respectively for the fourth one or more genes comprises: providing a network of genes, wherein each pair of genes in the network are connected by an edge, the genes of the network comprise the fourth one or more genes, which comprise at least some of the first one or more genes and/or the second one or more genes; defining, for each gene of the fourth one or more genes, a neighborhood of connected genes based on a connection distance from a particular gene as measured by the number of connection edges connecting two adjacent genes; and calculating, for each gene of the fourth one or more genes, an interactome score using (i) one or more connection distances between the particular gene and one or more other genes in the neighborhood of connected genes and (ii) summary scores of the one or more other genes in the neighborhood of connected genes, wherein the summary scores are based on experimental data.
- Claim:
  17. The method of claim 16 , wherein the interactome score is calculated as proportional to a sum of multiple fractions, each fraction being a summary score of another gene in the neighborhood of connected genes divided by a connection distance between the particular gene and the other gene in the neighborhood of connected genes.
- Claim:
  18. The method of claim 13 , wherein determining interactome scores respectively for the fourth one or more genes comprises: providing a network of genes, wherein the genes of the network have summary scores based on experimental data above a first threshold value, each pair of genes are connected by an edge, and the genes of the network comprise the fourth one or more genes, which comprise at least some of the first one or more genes and/or the second one or more genes; assigning, for each edge, a weight to the edge connecting two genes based on connection data for the two genes in at least one interactome knowledge base; and calculating, for each gene in the network, an interactome score using (i) weights of edges between a particular gene and all genes connected to the particular gene, and (ii) summary scores of all genes connected to the particular gene.
- Claim:
  19. The method of claim 18 , wherein calculating the interactome score comprises calculating the interactome score as $N_i'$: $N_i' = N_i + \sum_{n} \big( (N_i + N_n) \cdot \mathrm{edge\_weight}_n \big)$, wherein $N_i$ is the summary score of the particular gene $i$, $N_n$ is a summary score of gene $n$ connected to the particular gene, and $\mathrm{edge\_weight}_n$ is the weight of the edge connecting the particular gene $i$ and gene $n$.
- Claim:
  20. The method of claim 19 , wherein calculating the interactome score further comprises: saving N i ′ that are smaller than a second threshold in a first pass dictionary; and repeating the calculating of claim 19 for all genes in the first pass dictionary, thereby updating the interactome scores.
- Claim:
  21. The method of claim 20 , wherein calculating the interactome score further comprises repeating the operations of claim 20 for one or more passes.
- Claim:
  22. The method of claim 5 , wherein selecting the plurality of gene sets of (a) comprises selecting gene sets based on biotag scores assigned to biotags associated with the gene sets, wherein the biotag scores indicate levels of importance of the gene sets.
- Claim:
  23. The method of claim 22 , wherein the biotags are organized by categories selected from a group consisting of biosource, biodesign, tissue, disease, compound, gene, genemode, biogroup, and any combination thereof.
- Claim:
  24. The method of claim 5 , wherein the plurality of experimental values comprises a plurality of gene perturbation values.
- Claim:
  25. The method of claim 6 , wherein the plurality of experimental values indicates levels of RNA expression, protein expression, DNA methylation, transcription factor activity, and/or association in genome wide association study.
- Claim:
  26. The method of claim 5 , wherein the biological, chemical, or medical concept of interest comprises a phenotype.
- Claim:
  27. The method of claim 26 , wherein the phenotype comprises a disease-related phenotype.
- Claim:
  28. The method of claim 5 , wherein the summary scores of the first and second one or more genes are penalized based on how likely experimental values of the first and second one or more genes in one or more random gene sets are correlated with the biological, chemical, or medical concept of interest.
- Claim:
  29. A computer program product comprising a non-transitory machine readable medium storing program code that, when executed by one or more processors of a computer system, causes the computer system to implement a method for identifying genes associated with a biological, chemical, or medical concept of interest, said program code comprising: (a) code for selecting a plurality of gene sets from a database, wherein each gene set of the plurality of gene sets comprises a plurality of genes and a plurality of experimental values associated with the plurality of genes, and wherein the plurality of experimental values are correlated with the biological, chemical, or medical concept of interest in at least one experiment; (b) code for determining, for each gene set, one or more experimental gene scores for first one or more genes among the plurality of genes using one or more experimental values of the first one or more genes; (c) code for determining, for each gene set, one or more in silico gene scores for second one or more genes among the plurality of genes based at least in part on the first one or more genes' correlations with the second one or more genes, wherein: the first one or more genes' correlations with the second one or more genes are indicated in other gene sets in the database beside the plurality of gene sets, each of the other gene sets comprises correlation scores between one or more of the second one or more genes and one of the first one or more genes, an in silico gene score for each of the second one or more genes indicates a correlation between each of the second one or more genes and one or more of the first one or more genes, and the in silico gene score for each of the second one or more genes is obtained by aggregating the correlation scores for each of the second one or more genes across the other gene sets; (d) code for obtaining summary scores for the first and second one or more genes based at least in part on the one or more experimental gene scores for the first one or more genes determined in (b) and the one or more in silico gene scores for the second one or more genes determined in (c), wherein each summary score is aggregated across the plurality of gene sets; and (e) code for identifying the genes associated with the biological, chemical, or medical concept of interest using the summary scores of the first and second one or more genes.
- Patent References Cited:
  2007/0162411 July 2007 Kupershmidt et al.
  2009/0049019 February 2009 Su et al.
  2018/0080082 March 2018 Mougeot
  102855398 January 2013
- Other References:
  Komurov et al., “An integrated network platform for contextual prioritization of drugs and pathways”, 2015 (Year: 2015). cited by examiner
  Ghiassian, S. et al., “A DIseAse MOdule Detection (DIAMOnD) Algorithm Derived from a Systematic Analysis of Connectivity Patterns of Disease Proteins in the Human Interactome”, PLoS Comput Biol 11(4) E1004120 Doi:10.1371/journal.pcbi.1004120, 2015. cited by applicant
  Mukherjee, S. et al., “Gene Ranking Using Bootstrapped P-values”, Sigkdd Explorations vol. 5 (2), 2003, 14-20. cited by applicant
  Simoes, S. et al., “NERI: network-medicine based integrative approach for disease gene prioritization by relative importance”, BMC Bioinformatics 16(Suppl 9): S9, 2015. cited by applicant
  Bacon, C. et al., “Brain-specific Foxp1 deletion impairs neuronal development and causes autistic-like behaviour”, Molecular Psychiatry 20 2015, 623-639. cited by applicant
  Nava, C. et al., “Hypomorphic variants of cationic amino acid transporter 3 in males with autism spectrum disorders”, Amino Acids 47, 2015, 2647-2658. cited by applicant
  Shen, K. et al., “Meta-analysis for pathway enrichment analysis when combining multiple genomic studies”, Gene expression vol. 26 No. 10, 2010, 1316-1326. cited by applicant
  Shi, L. et al., “Whole-genome sequencing in an autism multiplex family”, Molecular Autism 4:8, 2013. cited by applicant
  Xiao, Y. et al., “Differential expression pattern-based prioritization of candidate genes through integrating disease-specific expression data”, Genomics 98, 2011, 64-71. cited by applicant
- Primary Examiner:
  Conyers, Dawaune A
- Attorney, Agent or Firm:
  Weaver Austin Villeneuve & Sampson LLP
- Accession Number:
  edspgr.10810213
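
Illustrative sketch for claims 2 and 6 (in silico gene scores). The claims describe a two-level aggregation: for each "first" gene, values are aggregated across the gene sets correlated with it to give one vector, and those vectors are then aggregated into a single compressed vector of in silico scores. The claims do not say which aggregation function is used, so the arithmetic mean below is an assumption, as are the function names `aggregate_vectors` and `in_silico_scores` and the dictionary layout of the input.

```python
from collections import defaultdict
from statistics import mean


def aggregate_vectors(vectors):
    """Average per-gene values across several gene -> value mappings.

    Assumption: the mean is used as the aggregation function; the claims
    only say that values are "aggregated".
    """
    collected = defaultdict(list)
    for vector in vectors:
        for gene, value in vector.items():
            collected[gene].append(value)
    return {gene: mean(values) for gene, values in collected.items()}


def in_silico_scores(first_genes, correlation_sets):
    """Return one compressed vector of in silico scores.

    correlation_sets (hypothetical layout) maps each first gene to a list
    of gene sets, each gene set being a dict of gene -> correlation value.
    """
    per_gene_vectors = [
        aggregate_vectors(correlation_sets[gene])   # steps (i)-(iii)
        for gene in first_genes
        if gene in correlation_sets
    ]
    return aggregate_vectors(per_gene_vectors)      # step (iv)


# Made-up example data, not taken from the patent.
correlation_sets = {
    "G1": [{"X": 0.75, "Y": 0.25}, {"X": 0.25}],
    "G2": [{"X": 0.25, "Z": 1.0}],
}
scores = in_silico_scores(["G1", "G2"], correlation_sets)
print(scores)
```

Genes "X", "Y" and "Z" play the role of the "second" genes: they receive a score only through their correlations with the first genes "G1" and "G2".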
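
Illustrative sketch for claims 10 and 11 (gene-group scores). For every gene group that contains the gene of interest, the claims take the share of group members that are among the "first" genes, multiply it by the summed experimental value of those members, and sum the products across groups. The sketch below assumes the share is expressed as a fraction between 0 and 1 and that the summed product is used directly as the score; claim 11 only says the score is "based on" the summed product. The name `gene_group_score` and the input layout are hypothetical.

```python
def gene_group_score(gene, gene_groups, experimental_values):
    """Score a gene from the groups it belongs to.

    gene_groups: group label -> set of member genes (hypothetical layout).
    experimental_values: gene -> experimental value for the "first" genes.
    """
    summed_product = 0.0
    for members in gene_groups.values():
        if gene not in members:
            continue  # only groups that contain the gene contribute
        # Members of this group that have an experimental value.
        scored_members = members.intersection(experimental_values)
        fraction = len(scored_members) / len(members)
        sum_value = sum(experimental_values[g] for g in scored_members)
        summed_product += fraction * sum_value
    return summed_product


# Made-up example data, not taken from the patent.
gene_groups = {
    "pathway_A": {"G1", "G2", "G3", "G4"},
    "pathway_B": {"G3", "G5"},
    "pathway_C": {"G6", "G7"},
}
experimental_values = {"G1": 2.0, "G2": 1.0, "G5": 4.0}

print(gene_group_score("G3", gene_groups, experimental_values))  # 3.5
```

In the example, "G3" has no experimental value of its own but inherits a score from its groups: 0.5 × 3.0 from pathway_A plus 0.5 × 4.0 from pathway_B.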
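
Illustrative sketch for claims 16 and 17 (distance-based interactome scores). Claim 17 describes the score as proportional to a sum of fractions, each being a neighbor's summary score divided by its connection distance from the gene. The sketch assumes an unweighted, undirected network stored as an adjacency mapping, a breadth-first search to measure connection distance in edges, a proportionality constant of 1, and a hypothetical `max_distance` parameter that bounds the neighborhood.

```python
from collections import deque


def neighborhood_distances(network, start, max_distance):
    """Breadth-first search returning gene -> edge distance from `start`,
    for all genes within `max_distance` edges (excluding `start`)."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_distance:
            continue  # do not expand beyond the neighborhood radius
        for neighbor in network.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    del distances[start]
    return distances


def interactome_score(network, summary_scores, gene, max_distance=2):
    """Sum of (neighbor summary score / connection distance).

    Assumption: proportionality constant of 1; neighbors without a
    summary score contribute 0.
    """
    distances = neighborhood_distances(network, gene, max_distance)
    return sum(summary_scores.get(n, 0.0) / d for n, d in distances.items())


# Made-up example network: A-B, A-C, B-D, D-E.
network = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B", "E"],
    "E": ["D"],
}
summary_scores = {"B": 4.0, "C": 2.0, "D": 6.0, "E": 10.0}

print(interactome_score(network, summary_scores, "A"))  # 9.0
```

With a radius of two edges, gene "A" receives 4.0/1 from "B", 2.0/1 from "C" and 6.0/2 from "D"; "E" is three edges away and is left out.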
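
Illustrative sketch for claim 19 (edge-weighted interactome score). The claim gives the update N_i′ = N_i + Σ((N_i + N_n) × edge_weight_n), summed over the genes n connected to gene i. The sketch below is a single pass of that formula only; the threshold-based repeat passes of claims 20 and 21 are not modeled. The function name `edge_weighted_score` and the nested-dictionary edge layout are assumptions.

```python
def edge_weighted_score(gene, summary_scores, weighted_edges):
    """Apply N_i' = N_i + sum((N_i + N_n) * edge_weight_n) for one gene.

    summary_scores: gene -> summary score N.
    weighted_edges: gene -> {connected gene -> edge weight}
                    (hypothetical layout; edges are listed in both directions).
    """
    n_i = summary_scores[gene]
    neighbors = weighted_edges.get(gene, {})
    return n_i + sum(
        (n_i + summary_scores[n]) * weight for n, weight in neighbors.items()
    )


# Made-up example data, not taken from the patent.
summary_scores = {"A": 2.0, "B": 3.0, "C": 1.0}
weighted_edges = {
    "A": {"B": 0.5, "C": 0.25},
    "B": {"A": 0.5},
    "C": {"A": 0.25},
}

updated = {
    gene: edge_weighted_score(gene, summary_scores, weighted_edges)
    for gene in summary_scores
}
print(updated)  # {'A': 5.25, 'B': 5.5, 'C': 1.75}
```

All updated scores are computed from the original summary scores, so the result does not depend on the order in which genes are visited.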
