EcoliWiki:Gene Association File
EcoliWiki and EcoCyc contribute to the Gene Ontology (GO) by submitting our annotations monthly.
Format
The gene_association file format has two versions. The Gene Ontology's documentation page describes these quite well. Below are some notes, some of them specific to the EcoCyc/EcoliWiki version of the GAF.
Column Number | Name | Notes | Cardinality |
---|---|---|---|
1 | Database | This is the database that the identifiers come from. We are using EcoCyc identifiers, so this column will always be "EcoCyc." | 1 |
2 | db_object_id | Unique identifier from the database in column 1. We are using the EcoCyc database identifiers for historical reasons. These accessions are pulled out of the 'Gene_accessions_table' on the Gene page. These EcoCyc accessions must be up to date or the entire set of annotations for a particular protein can be lost. | 1 |
3 | db_object_symbol | A unique and valid symbol. This is the canonical name of the subject of the annotation (a gene or protein name, ORF, etc.). In this case it is the name of the gene product. We use the standard E. coli systematic nomenclature and capitalize the first letter of this symbol to make it clear this is a product. This is where a lot of collision between EcoliWiki and EcoCyc happens. After the two files of correct annotations from the two databases are combined, the names in this column usually conflict with each other -- meaning there is more than one db_object_symbol for a particular db_object_id. | 1 |
4 | Qualifier | The qualifier. These three are nonstandard and must be removed: | 0, 1, >1 |
5 | GO id | | 1 |
6 | Reference | A valid database prefix is required | 1 |
7 | Evidence | Three letter abbreviation | 1 |
8 | With/From | column of variable content, depending on the situation. See the Gene Ontology gaf2.0 specification page for more info. | 0, 1 |
9 | Aspect | one of three: P, F, C | 1 |
10 | db_object_name | name of the gene or gene product. free text | 0, 1 |
11 | db_object_synonym | a pipe-delimited list of synonyms for this subject. Comes from the synonyms field in the 'Product_nomenclature_table'. | 0, 1, >1 |
12 | db_object_type | A description of the type of gene product being annotated. | 1 |
13 | taxon | integer value of the taxon. It is not quite clear which taxon we are annotating to. See EcoliWiki:Gene Association File#Taxon Controversy. | 1 |
14 | date | Date on which the annotation was made. IEAs older than a year should be removed. | 1 |
15 | assigned by | the database which made the annotation. Not the person. One of the values in the GO list of valid database identifiers. | 1 |
16 | annotation extension | contains cross-references to other ontologies that can be used to qualify or enhance the annotation. prefaced by an appropriate GO relationship | 0, 1, >1 |
17 | isoform/product form | as the db_object_id (column 2) must be a canonical entry - something that has a 1:1 relationship to a gene - this field allows the annotation of "specific variants of that gene or gene product." EcoliWiki as of July 2010 has no way of annotating to a specific variant of a gene product. This column must contain a standard DB:ID identifier. | ? |
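The column layout above can be turned into a small checker. What follows is a minimal Python sketch, not the GO Consortium's validator: the names `GAF_COLUMNS`, `REQUIRED` and `parse_gaf_line` are made up for illustration, the required columns are simply those listed above with a cardinality of exactly 1, and the sample line is invented (its identifier, GO id, reference and date are placeholders, not a real annotation).

```python
# Column names in the order of the table above (names are illustrative).
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "reference", "evidence", "with_from", "aspect", "db_object_name",
    "db_object_synonym", "db_object_type", "taxon", "date",
    "assigned_by", "annotation_extension", "product_form",
]

# Columns whose cardinality in the table is exactly 1 (must not be empty).
REQUIRED = {
    "db", "db_object_id", "db_object_symbol", "go_id", "reference",
    "evidence", "aspect", "db_object_type", "taxon", "date", "assigned_by",
}


def parse_gaf_line(line):
    """Split one tab-delimited GAF line into a dict and list its problems."""
    fields = line.rstrip("\n").split("\t")
    problems = []
    if len(fields) != len(GAF_COLUMNS):
        problems.append(
            f"expected {len(GAF_COLUMNS)} columns, found {len(fields)}"
        )
    # Pad short lines so every column name has a value.
    fields += [""] * (len(GAF_COLUMNS) - len(fields))
    record = dict(zip(GAF_COLUMNS, fields))
    for name in GAF_COLUMNS:
        if name in REQUIRED and not record[name]:
            problems.append(f"missing required column: {name}")
    if record["aspect"] and record["aspect"] not in {"P", "F", "C"}:
        problems.append(f"aspect must be P, F or C, not {record['aspect']!r}")
    return record, problems


# Invented example line: every value here is a placeholder.
example = "\t".join([
    "EcoCyc", "EXAMPLE-ID-1", "AbcD", "", "GO:0000000", "PMID:0000000",
    "IDA", "", "F", "", "abcD|synA", "protein", "taxon:83333",
    "20100101", "EcoliWiki", "", "",
])
record, problems = parse_gaf_line(example)
print(record["db_object_symbol"], problems)
```

A checker like this only looks at the shape of a line. It does not know whether a GO id is obsolete or whether a database prefix is valid, which is what the GO scripts and the ontology files are for.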
Generation
The two sites' annotations are prepared by the EcoliWiki team (mostly Daniel) and submitted via CVS to the GO Consortium. The figure at right shows the four basic steps of this process:
- Obtain annotations from EcoCyc (quarterly)
- Obtain annotations from EcoliWiki (monthly)
- Merge them and resolve name-conflicts
- Submit to the GO
Taxon Identifier(s)
- EcoCyc consistently uses taxon:511145 (Escherichia coli str. K-12 substr. MG1655)[1].
- EcoliWiki consistently uses taxon:83333 (Escherichia coli K-12). We made this decision because we don't deal exclusively with MG1655. Now that we have other strains in the wiki (REL606, DH10B, and soon BL21), we'll have to decide what taxon to use for annotations to genes that are in different strains.
Peter Karp says:
- "It will cause great confusion to try to curate several strains together with no separation. You may have several strains in the wiki, but I believe they should be separated, and each strain should have its own set of GO annotations. Therefore I believe you should be using taxon:511145."
Jim Hu says:
- "The literature-based curation is based on a corpus of papers describing experiments that are done in a mix of E. coli lab strains, not just MG1655 and its direct descendants. Large fractions of the work on K-12 strains is based on W3110 derivatives. MG1655 and W3110 are both covered by taxid 83333. However, this excludes all of the work done with E. coli B derivatives, including, I believe, all of the early work from the CSHL phage group, all of the work from Engelsberg on arabinose, and most of the work from Fred Neidhardt's group.
- From the model organism perspective, the vast majority of the work we expect people to curate is inferred to apply to a more generic lab E. coli, not just the strain where the work was done. The exception would be studies comparing pathogens and nonpathogens, restriction-modification, or strains with or without lambda or F. For this reason, I think we should use 562 unless the annotation is based on a study of what makes K-12 different from other E. coli clades, or MG1655 different from other K-12 strains."
- Currently we leave the annotations alone, meaning there is a mix of annotations to E. coli K-12 (taxon:83333) and substr. MG1655 (taxon:511145).
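To see how the annotations are actually split between the taxon identifiers, the values in column 13 can be tallied. This is a minimal Python sketch under a few assumptions: the function name `count_taxa` and the helper `_row` are hypothetical, lines beginning with `!` are treated as header or comment lines, and the sample rows are placeholders rather than real annotations.

```python
from collections import Counter


def count_taxa(lines):
    """Count the values of column 13 (taxon) across GAF lines."""
    counts = Counter()
    for line in lines:
        # Skip blank lines and '!' header/comment lines.
        if not line.strip() or line.startswith("!"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 13:
            counts[fields[12]] += 1
    return counts


def _row(taxon):
    # Build a 17-column placeholder row; only column 13 matters here.
    fields = [""] * 17
    fields[12] = taxon
    return "\t".join(fields)


sample = [
    "!gaf-version: 2.0",
    _row("taxon:83333"),
    _row("taxon:511145"),
    _row("taxon:83333"),
]
taxon_counts = count_taxa(sample)
print(taxon_counts.most_common())
```

On a real file, the same function could be called with an open file handle, since it only iterates over lines.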
Detailed Description of File Generation
Step 0 - Setup the environment
The process requires some files and scripts from the GO Consortium. Instead of doing this by hand each month, Daniel wrote a bash script to handle this part of the process. To use it, copy this code into a file and make it executable. On our servers it is named gaf_setup.sh.
The first step is to run this script. Alternatively, you can follow the directions below to do this by hand.
<php>
#!/bin/bash
# -------------------------------------------------------------------------------
# gaf_setup.sh                                        Daniel Renfro 01/2010
# This script sets up a working directory for the monthly creation of the
# gene-associations file and the gp2protein file for EcoliWiki.
# -------------------------------------------------------------------------------

working_dir=$(date "+%Y-%m-%d")
prefix=`dirname "$0"`

# make the working directory
mkdir "$working_dir"
cd "$working_dir" || exit 1

# make some necessary directories
echo "Creating directories..."
mkdir ontology
mkdir doc

# download the stuff from GO
echo "Downloading files from the Gene Ontology..."
ncftp ftp://ftp.geneontology.org/pub/go/software/utilities/ <<END_SCRIPT
get filter-gene-association.pl
get check-abbr-ga-file.pl
quit
END_SCRIPT
ncftp ftp://ftp.geneontology.org/pub/go/ontology <<END_SCRIPT
get gene_ontology_edit.obo
quit
END_SCRIPT
ncftp ftp://ftp.geneontology.org/pub/go/doc/ <<END_SCRIPT
get GO.xrf_abbs
quit
END_SCRIPT

# put the files in the right places
mv gene_ontology_edit.obo ontology
mv GO.xrf_abbs doc

# set the permissions on the executable files.
chmod +x check-abbr-ga-file.pl filter-gene-association.pl

echo "setup complete."
</php>
Make a new directory to work in.
cd /usr/local/working/GAF/
mkdir YYYY-MM-DD
chmod 775 YYYY-MM-DD
cd YYYY-MM-DD
Look through the wiki for problems that can be fixed, such as GO terms whose names have changed and obsolete categories or annotations, and fix them.
php5 /usr/local/phpwikibots/trunk/find_GO_terms_with_changed_names.php -w /Library/WebServer/WebServer/Documents/ecoliwiki/colipedia
php5 /usr/local/phpwikibots/trunk/find_obsolete_categories_and_annotations.php -w /Library/WebServer/WebServer/Documents/ecoliwiki/ -C -A
Make some directories for use later
mkdir doc ontology
Download some scripts and some files. Here we are using the ncftp command-line tool, but you can use whichever tool you are comfortable with.
ncftp ftp://ftp.geneontology.org/pub/go/software/utilities/
get filter-gene-association.pl
get check-abbr-ga-file.pl
quit
ncftp ftp://ftp.geneontology.org/pub/go/ontology
get gene_ontology_edit.obo
quit
ncftp ftp://ftp.geneontology.org/pub/go/doc/
get GO.xrf_abbs
quit
Move some of those files into the directories.
mv gene_ontology_edit.obo ontology/
mv GO.xrf_abbs doc/
Step 1 - EcoCyc annotation
EcoCyc's quarterly releases contain their annotations in the gene_association file format. The file is typically named gene_associations.ecocyc.... (notice the plural on associations). To prepare their annotations, we must first check them for validity. Using the scripts we downloaded in the setup section, parse the annotation file from the most recent EcoCyc release. For a more detailed description of the scripts, see http://www.geneontology.org/GO.format.annotation.shtml#script. The scripts can be run with an option to print the erroneous lines with short descriptions of the problem. Some of these errors are easy to fix and have no biological significance; common examples are misspellings or disallowed entries in the Assigned_by column (column 15). Other errors are non-trivial and should be tracked for further curation.
Copy the downloaded EcoCyc file to our working directory
cp ./path/to/ecocyc/data/14.0/gene_associations.ecocyc-14.0-corrected ./ecocyc.annotations.all
Change the permissions to read-only.
chmod 444 ecocyc.annotations.all
Fix some very common errors with the EcoCyc file
/path/to/code/gene_ontology/fix_ecocyc_gene_associations_file.pl < ecocyc.annotations.all > ecocyc.annotations.valid
Validate
./check-abbr-ga-file.pl -i ecocyc.annotations.valid -p nocheck -w > a
mv a ecocyc.annotations.valid
./filter-gene-association.pl -i ecocyc.annotations.valid -p nocheck -w > a
mv a ecocyc.annotations.valid
Step 2 - EcoliWiki annotations
We use a custom-written script, mine_annotations.php, to pull annotations from the wiki. The output is then validated, and the errors are sent to our curators.
Get all annotations using PHP
php /path/to/mine_annotations.php -w ecoliwiki > ecoliwiki.annotations.all
Validate
./check-abbr-ga-file.pl -i ecoliwiki.annotations.all -p nocheck -w > ecoliwiki.annotations.valid
./filter-gene-association.pl -i ecoliwiki.annotations.valid -p nocheck -w > a
mv a ecoliwiki.annotations.valid
Step 3 - Merge
Problems arise when EcoliWiki and EcoCyc use different db_object_symbols/common-names for the same db_object_id. (Column 2 matches, but column 3 doesn't.) The GO filter script counts this as an error.
Combine the two sets of valid annotations.
cat *.valid > combined.annotations.all
The next step is to run the filter script, save the output and fix the file by hand.
./filter-gene-association.pl -i combined.annotations.all -p nocheck -e > name_conflicts.txt
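As a complement to the filter script's report, the conflicts described above (one db_object_id in column 2 carrying more than one db_object_symbol in column 3) can be listed directly. This is a minimal Python sketch, not part of the GO tooling: the function name `find_symbol_conflicts` is hypothetical, and the identifiers and symbols in the sample are invented.

```python
from collections import defaultdict


def find_symbol_conflicts(lines):
    """Map each db_object_id (column 2) to its symbols (column 3)
    when more than one distinct symbol is used for it."""
    symbols = defaultdict(set)
    for line in lines:
        # Skip blank lines and '!' header/comment lines.
        if not line.strip() or line.startswith("!"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        symbols[fields[1]].add(fields[2])
    return {
        object_id: sorted(names)
        for object_id, names in symbols.items()
        if len(names) > 1
    }


# Invented rows: only the first three columns are shown.
combined = [
    "EcoCyc\tEXAMPLE-ID-1\tAbcD",
    "EcoCyc\tEXAMPLE-ID-1\tAbcd",
    "EcoCyc\tEXAMPLE-ID-2\tXyzA",
    "EcoCyc\tEXAMPLE-ID-2\tXyzA",
]
conflicts = find_symbol_conflicts(combined)
for object_id, names in conflicts.items():
    print(object_id, " / ".join(names))
```

The sketch only reports conflicts. Deciding which symbol is correct is still a curation decision made by hand.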
After fixing naming issues, validate
./check-abbr-ga-file.pl -i combined.annotations.all -p nocheck -w > combined.annotations.valid
./filter-gene-association.pl -i combined.annotations.valid -p nocheck -w > a
mv a combined.annotations.valid
Step 4 - Submit
header information
Occasionally the header information in the gene_association file needs to be updated. This information can be found in the gene_association.ecocyc.conf file in the go/gene-associations/submission/ directory of the GO CVS. To change this information, edit the file in the CVS checkout and resubmit it. It will get prepended to the gene_association file automatically. As of March 2010 the file looks like this:
project_name=EcoCyc and EcoliWiki collaborative annotation for E. coli K-12
contact_email=biocyc-support@ai.sri.com
contact_email=ecoliwiki@gmail.com
project_url=http://ecoliwiki.net
project_url=http://ecocyc.org
funding_source=GM077678
funding_source=NIGMS U24GM088849
email_report=ecoliwiki@gmail.com
ecocyc_version=14.0
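The header file shown above is a list of key=value entries in which a key such as contact_email can appear more than once. When checking the file before resubmitting it, a small reader that keeps every value per key can help. This is a Python sketch that assumes one entry per line; the function name `parse_header_conf` is hypothetical, and nothing here reflects how the GO Consortium itself processes the file.

```python
from collections import defaultdict


def parse_header_conf(text):
    """Parse key=value lines; a key may repeat, so values are kept in lists."""
    values = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        # Split on the first '=' only, so values may contain '='.
        key, value = line.split("=", 1)
        values[key.strip()].append(value.strip())
    return dict(values)


# A few entries copied from the March 2010 file shown above.
conf_text = """\
contact_email=biocyc-support@ai.sri.com
contact_email=ecoliwiki@gmail.com
project_url=http://ecoliwiki.net
project_url=http://ecocyc.org
ecocyc_version=14.0
"""
header = parse_header_conf(conf_text)
print(header["contact_email"])
```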
gp2protein.ecocyc
Aside from the gene_association file, the gp2protein file needs to be submitted. This can be made on-the-fly each month from a custom script that pulls the necessary information from EcoliWiki.
submission via CVS
The files are sorted and de-duplicated with the standard Unix commands (sort and uniq), then gzipped and moved into the CVS directory.
cd ~/CVS/
CVS submit
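The sort, de-duplicate and gzip step can also be sketched in Python. This is an illustration rather than the commands used on our servers: the function name `sort_unique_gzip` and the file name `example.gaf.gz` are hypothetical, the input rows are placeholders, and Python's default string ordering may differ from what the Unix sort command produces under a given locale.

```python
import gzip
import os
import tempfile


def sort_unique_gzip(lines, out_path):
    """Write the sorted, de-duplicated lines to a gzip-compressed file.

    Returns the number of lines written.
    """
    unique = sorted({line.rstrip("\n") for line in lines if line.strip()})
    with gzip.open(out_path, "wt", encoding="utf-8") as handle:
        for line in unique:
            handle.write(line + "\n")
    return len(unique)


# Round-trip a few placeholder rows through a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.gaf.gz")
    written = sort_unique_gzip(["b\tline\n", "a\tline\n", "b\tline\n"], path)
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        round_trip = handle.read().splitlines()

print(written, round_trip)
```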
See Also
- How EcoliWiki incorporates EcoCyc's annotations
- GO Consortium's specification for the GAF - http://www.geneontology.org/GO.format.annotation.shtml
- gp2protein file notes - http://wiki.geneontology.org/index.php/Gp2protein