Data archive

GeneMANIA interaction networks are available for download in plain text format at http://genemania.org/data/.

The data is organized into separate folders by release date under the “archive” folder, each named by the year, month, and day of release in the format “YYYY-MM-DD”. The most recent release is also available under the folder “current”.

Each individual release contains subfolders for every organism with the name “Genus_species”. Within the organism folders are files containing all the individual interaction networks, as well as additional files containing network metadata and identifier mapping tables.
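The folder layout described above can be sketched as a small URL builder. This is only an illustration: it assumes that the “current” and “archive” folders sit directly under http://genemania.org/data/ and that organism folders sit directly under each release folder. The function names `release_folder` and `organism_url` are hypothetical.

```python
from datetime import date
from typing import Optional

BASE_URL = "http://genemania.org/data"  # download location given in the text


def release_folder(release: Optional[date] = None) -> str:
    """Return 'current' for the latest release, else 'archive/YYYY-MM-DD'."""
    if release is None:
        return "current"
    # date.isoformat() yields the YYYY-MM-DD form used for release folders
    return f"archive/{release.isoformat()}"


def organism_url(genus: str, species: str, release: Optional[date] = None) -> str:
    """Build the URL of an organism folder named 'Genus_species'.

    Assumption: organism folders are direct children of the release folder.
    """
    return f"{BASE_URL}/{release_folder(release)}/{genus}_{species}/"


print(organism_url("Homo", "sapiens"))
```

Passing a `date` as the third argument selects a dated folder under “archive” instead of “current”.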

All files are plain US-ASCII tab-delimited text files with a single header row containing field names. The formats of the individual data files are described in detail below. The data is available as compressed zip archives at the organism level.

Interaction Networks

The interaction network files are named “network_group.network_name.txt”. Network names and group names correspond to those used on the GeneMANIA website, with spaces replaced by underscore characters. Each file contains a row for each interaction in the network, with three columns: Gene_A, Gene_B, and Weight. Ensembl Gene IDs are preferred for identifying genes in the first two columns, but other identifier types may also appear. The Weight column contains a floating-point value.
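As a minimal sketch of the naming rule, the helper below derives a file name from a network group name and a network name by replacing spaces with underscores. The function name `network_file_name` is hypothetical, and the sketch assumes no other characters are rewritten.

```python
def network_file_name(group_name: str, network_name: str) -> str:
    """Build 'network_group.network_name.txt' from the website names."""
    # Assumption: spaces are the only characters that get substituted.
    group = group_name.replace(" ", "_")
    network = network_name.replace(" ", "_")
    return f"{group}.{network}.txt"


print(network_file_name("Shared protein domains", "INTERPRO"))
```

For the group “Shared protein domains” and the network “INTERPRO”, this yields “Shared_protein_domains.INTERPRO.txt”.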

Each interacting pair of genes is present exactly once in the file (the symmetric record, with the two genes swapped, is not included). Non-interacting genes are not present. No assumptions are made regarding the order of the records in the file or the order of genes in a record. The following example includes genes identified by Entrez Gene IDs:

Gene_A  Gene_B  Weight
814707  814741  0.26
814691  814846  0.14
...
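Because each pair appears only once and the order of genes within a record is not guaranteed, a reader of these files may want to treat a pair as unordered. The sketch below is one way to do that with the Python standard library. The function name `read_network` is hypothetical, and the sample text simply repeats the two rows shown above.

```python
import csv
import io
from typing import Dict, FrozenSet, TextIO


def read_network(handle: TextIO) -> Dict[FrozenSet[str], float]:
    """Read a tab-delimited interaction file into {gene pair: weight}."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the single header row
    edges: Dict[FrozenSet[str], float] = {}
    for row in reader:
        if len(row) < 3:
            continue  # ignore blank or malformed lines
        gene_a, gene_b, weight = (field.strip() for field in row[:3])
        # A frozenset key makes (A, B) and (B, A) refer to the same edge.
        edges[frozenset((gene_a, gene_b))] = float(weight)
    return edges


sample = io.StringIO(
    "Gene_A\tGene_B\tWeight\n"
    "814707\t814741\t0.26\n"
    "814691\t814846\t0.14\n"
)
network = read_network(sample)
# Look the pair up in the opposite order from the one stored in the file.
print(network[frozenset(("814741", "814707"))])
```

For a real file, pass an open file handle instead of the `io.StringIO` sample.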

Network Metadata

Network metadata is contained in the file “networks.txt”. The file contains a row for each network belonging to the organism, with five columns: File_Name, Network_Group_Name, Network_Name, Source (such as GEO or PathwayCommons), and Pubmed_ID.

Example, columns 1 and 2:

File_Name                                         Network_Group_Name
Shared_protein_domains.INTERPRO.txt               Shared protein domains
Physical_interactions.Van_Leene-De_Jaeger-2007.txt Physical interactions
...

Example, columns 3, 4, and 5:

Network_Name               Source          Pubmed_ID
INTERPRO                   INTERPRO
Van Leene-De Jaeger-2007   PATHWAYCOMMONS  17426018
...
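The metadata file can be indexed by file name so that each network file can be matched to its group, source, and publication. The following is a rough sketch using `csv.DictReader`; the function name `read_network_metadata` is hypothetical, and the sample rows are assembled from the two example tables above.

```python
import csv
import io
from typing import Dict, TextIO


def read_network_metadata(handle: TextIO) -> Dict[str, Dict[str, str]]:
    """Index the rows of networks.txt by their File_Name column."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row["File_Name"]: dict(row) for row in reader}


sample = io.StringIO(
    "File_Name\tNetwork_Group_Name\tNetwork_Name\tSource\tPubmed_ID\n"
    "Shared_protein_domains.INTERPRO.txt\tShared protein domains\t"
    "INTERPRO\tINTERPRO\t\n"
    "Physical_interactions.Van_Leene-De_Jaeger-2007.txt\tPhysical interactions\t"
    "Van Leene-De Jaeger-2007\tPATHWAYCOMMONS\t17426018\n"
)
metadata = read_network_metadata(sample)
entry = metadata["Physical_interactions.Van_Leene-De_Jaeger-2007.txt"]
print(entry["Source"], entry["Pubmed_ID"])
```

A network without an associated publication, such as INTERPRO in the example, has an empty Pubmed_ID field.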

Identifier Mappings

A table of recognized identifiers is contained in “identifier_mappings.txt”. This file contains multiple rows for each gene recognized by GeneMANIA, with each row containing three columns: Preferred_Name, Name, and Source. The preferred name of a gene is the identifier that appears in the interaction network files for this gene. The preferred name also groups all records in the identifier mappings file that represent the same gene. The second column is an alternate identifier that GeneMANIA recognizes for the same gene; for example, the source data from which a network was constructed may have used one of these alternate names to identify the gene. The third column describes the type of the identifier in the second column, for example Entrez Gene ID or Uniprot ID. The preferred name also appears in a row with itself in the second column so that its own source can be specified.

The file contains all the identifier mappings recognized by GeneMANIA, not just those that appear in the interaction files. No assumptions are made about the order of the records in the file. The following example contains the records describing a pair of human genes: TSPAN-6 and TNMD.

Preferred_Name         Name             Source
ENSG00000000003        7105             Entrez Gene ID
ENSG00000000003        ENSG00000000003  Ensembl Gene ID
ENSG00000000003        ENSP00000362111  Ensembl Protein ID
ENSG00000000003        ENSP00000409517  Ensembl Protein ID
ENSG00000000003        NM_003270        RefSeq mRNA ID
ENSG00000000003        NP_003261        RefSeq Protein ID
ENSG00000000003        O43657           Uniprot ID
ENSG00000000003        T245             Synonym
ENSG00000000003        TM4SF6           Synonym
ENSG00000000003        TSN6_HUMAN       Uniprot ID
ENSG00000000003        TSPAN-6          Synonym
ENSG00000000003        TSPAN6           Gene Name
ENSG00000000005        64102            Entrez Gene ID
ENSG00000000005        BRICD4           Synonym
ENSG00000000005        CHM1-LIKE        Synonym
ENSG00000000005        CHM1L            Synonym
ENSG00000000005        ENSG00000000005  Ensembl Gene ID
ENSG00000000005        ENSP00000362122  Ensembl Protein ID
ENSG00000000005        NM_022144        RefSeq mRNA ID
ENSG00000000005        NP_071427        RefSeq Protein ID
ENSG00000000005        Q9H2S6           Uniprot ID
ENSG00000000005        TNMD             Name
ENSG00000000005        TNMD_HUMAN       Uniprot ID
ENSG00000000005        myodulin         Synonym
ENSG00000000005        tendin           Synonym
...
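One common use of this table is to translate an arbitrary identifier into the preferred name used in the interaction files. The sketch below builds such a lookup. The function name `read_identifier_mappings` is hypothetical, matching is exact and case-sensitive (an assumption), and each name maps to a set of preferred names in case one alternate name is shared by several genes. The sample rows are a subset of the example above.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Set, TextIO


def read_identifier_mappings(handle: TextIO) -> Dict[str, Set[str]]:
    """Map every recognized name to the preferred name(s) it belongs to."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    lookup: Dict[str, Set[str]] = defaultdict(set)
    for row in reader:
        # Assumption: names are compared exactly, without case folding.
        lookup[row["Name"].strip()].add(row["Preferred_Name"].strip())
    return dict(lookup)


sample = io.StringIO(
    "Preferred_Name\tName\tSource\n"
    "ENSG00000000003\t7105\tEntrez Gene ID\n"
    "ENSG00000000003\tENSG00000000003\tEnsembl Gene ID\n"
    "ENSG00000000003\tTSPAN6\tGene Name\n"
    "ENSG00000000005\t64102\tEntrez Gene ID\n"
    "ENSG00000000005\tTNMD_HUMAN\tUniprot ID\n"
)
lookup = read_identifier_mappings(sample)
print(lookup["TSPAN6"])
print(lookup["64102"])
```

Because the preferred name also appears in the Name column of its own row, looking up a preferred name returns itself.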

Combined Networks

A combined network integrates multiple individual GeneMANIA networks into a single large network. Currently, the set of default networks, combined using the GO-based Biological Process method, is available for each organism. GeneMANIA uses this network to find genes similar to a query gene set containing fewer than 6 genes.

The combined networks are packaged for download separately from the organism's set of individual networks. Because combined networks integrate the interactions from many individual networks, they can be large. Each combined network is represented by a pair of files, one containing the network itself and the other containing the weights used to produce it.

The integrated network is named “COMBINED.DEFAULT_NETWORKS.BP_COMBINING.txt”. The file is organized the same way as the individual networks described above, with each line containing a pair of interacting genes and the weight of their interaction.

The set of combination weights used to produce the integrated network is available in a file named “COMBINATION_WEIGHTS.DEFAULT_NETWORKS.BP_COMBINING.txt”. This file contains three columns: the network group name, the network name, and the weight given to the network in the combined result. The individual networks themselves are available separately as described above.
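The combination weights file can be read by column position, which avoids depending on its exact header names (they are not listed here). The following sketch ranks networks by their weight in the combined result. The function name `read_combination_weights` is hypothetical, and the header names and weight values in the sample are invented purely for illustration.

```python
import csv
import io
from typing import Dict, TextIO, Tuple


def read_combination_weights(handle: TextIO) -> Dict[Tuple[str, str], float]:
    """Read {(network group name, network name): weight} by column position."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the single header row
    weights: Dict[Tuple[str, str], float] = {}
    for row in reader:
        if len(row) < 3:
            continue  # ignore blank or malformed lines
        group, network, weight = (field.strip() for field in row[:3])
        weights[(group, network)] = float(weight)
    return weights


# Hypothetical sample: header names and weight values are made up.
sample = io.StringIO(
    "Group\tNetwork\tWeight\n"
    "Shared protein domains\tINTERPRO\t0.25\n"
    "Physical interactions\tVan Leene-De Jaeger-2007\t0.75\n"
)
weights = read_combination_weights(sample)
# Sort networks from the largest to the smallest combination weight.
ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
for (group, network), weight in ranked:
    print(group, network, weight)
```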