EcoliWiki:Gene Association File


EcoliWiki and EcoCyc contribute to the Gene Ontology (GO) by submitting our annotations monthly.

Format

The gene_association file (GAF) format has two versions. The Gene Ontology's documentation page describes them well. Below are some notes, some of them specific to the EcoCyc/EcoliWiki version of the GAF.

| Column Number | Name | Notes | Cardinality |
|---|---|---|---|
| 1 | Database | The database that the identifiers come from. We are using EcoCyc identifiers, so this column will always be "EcoCyc". | 1 |
| 2 | db_object_id | Unique identifier from the database in column 1. We are using the EcoCyc database identifiers for historical reasons. These accessions are pulled out of the 'Gene_accessions_table' on the Gene page. These EcoCyc accessions must be up to date or the entire set of annotations for a particular protein can be lost. | 1 |
| 3 | db_object_symbol | A unique and valid symbol. This is the canonical name of the subject of the annotation (a gene or protein name, ORF, etc.). In this case it is the name of the gene product. We use the standard E. coli systematic nomenclature and capitalize the first letter of this symbol to make it clear this is a product. This is where a lot of collision between EcoliWiki and EcoCyc happens. After the two files of correct annotations from the two databases are combined, the names in this column usually conflict with each other, meaning there is more than one db_object_symbol for a particular db_object_id. | 1 |
| 4 | Qualifier | The qualifier. The following are nonstandard and must be removed: (1) obsolote_go_term: flag; (2) under_review: we think this should be recurated; (3) deprecated: we've looked at it and it needs to be removed; (4) ...probably others... | 0, 1, >1 |
| 5 | GO id | | 1 |
| 6 | Reference | A valid database prefix is required. | 1 |
| 7 | Evidence | Two- or three-letter evidence code (for example IDA or IEA). | 1 |
| 8 | With/From | Column of variable content, depending on the situation. See the Gene Ontology gaf2.0 specification page for more info. | 0, 1 |
| 9 | Aspect | One of three: P, F, C. | 1 |
| 10 | db_object_name | Name of the gene or gene product. Free text. | 0, 1 |
| 11 | db_object_synonym | A pipe-delimited list of synonyms for this subject. Comes from the synonyms field in the 'Product_nomenclature_table'. | 0, 1, >1 |
| 12 | db_object_type | A description of the type of gene product being annotated. If column 17 is used, this column refers to that; otherwise, it refers to column 2. | 1 |
| 13 | taxon | Integer value of the taxon. It is not quite clear which taxon we are annotating to. See EcoliWiki:Gene Association File#Taxon Controversy. | 1 |
| 14 | date | Date on which the annotation was made. IEAs older than a year should be removed. | 1 |
| 15 | assigned by | The database which made the annotation, not the person. One of the values in the GO list of valid database identifiers. | 1 |
| 16 | annotation extension | Contains cross-references to other ontologies that can be used to qualify or enhance the annotation, prefaced by an appropriate GO relationship. | 0, 1, >1 |
| 17 | isoform/product form | As the db_object_id (column 2) must be a canonical entry (something that has a 1:1 relationship to a gene), this field allows the annotation of "specific variants of that gene or gene product." EcoliWiki as of July 2010 has no way of annotating to a specific variant of a gene product. This column must contain a standard DB:ID identifier. | ? |
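
The column rules in the table above can be spot-checked before running the GO scripts. The following is a minimal sketch, not a replacement for those scripts. It assumes a tab-delimited GAF 2.0 file with 17 columns, comment lines starting with `!`, and dates written as YYYYMMDD. The function name check_gaf_lines is hypothetical, and the cutoff date is passed in by hand rather than computed.

```shell
#!/bin/bash
# check_gaf_lines: hypothetical helper, not part of the GO tool set.
# $1 = GAF file
# $2 = cutoff date as YYYYMMDD; IEA annotations dated before it are reported
check_gaf_lines() {
    awk -F'\t' -v cutoff="$2" '
        /^!/ { next }    # skip header/comment lines
        NF != 17 { print "line " NR ": expected 17 columns, found " NF; next }
        {
            # columns listed with cardinality 1 in the table
            n = split("1 2 3 5 6 7 9 12 13 14 15", required, " ")
            for (i = 1; i <= n; i++)
                if ($(required[i]) == "")
                    print "line " NR ": required column " required[i] " is empty"
            # column 7 is the evidence code, column 14 the annotation date
            if ($7 == "IEA" && $14 < cutoff)
                print "line " NR ": IEA annotation dated " $14 " is older than " cutoff
        }
    ' "$1"
}
```

For example, `check_gaf_lines ecoliwiki.annotations.all 20090701` would report every IEA line dated before 1 July 2009, along with any line that has the wrong number of columns or an empty required column.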

Generation

Figure: the four basic steps of the process. See text for description.

The two sites' annotations are prepared by the EcoliWiki team (mostly Daniel) and submitted via CVS to the GO Consortium. The figure at right shows the four basic steps of this process:

  1. Obtain annotations from EcoCyc (quarterly)
  2. Obtain annotations from EcoliWiki (monthly)
  3. Merge them and resolve name-conflicts
  4. Submit to the GO


Taxon Identifier(s)

  • EcoCyc consistently uses taxon:511145 (Escherichia coli str. K-12 substr. MG1655)[1].
  • EcoliWiki consistently uses taxon:83333 (Escherichia coli K-12). We made this decision because we don't deal exclusively with MG1655. Now that we have other strains in the wiki (REL606, DH10B, and soon BL21), we'll have to decide what taxon to use for annotations to genes that are in different strains.

Peter Karp says:

"It will cause great confusion to try to curate several strains together with no separation. You may have several strains in the wiki, but I believe they should be separated, and each strain should have its own set of GO annotations. Therefore I believe you should be using taxon:511145."

Jim Hu says:

"The literature-based curation is based on a corpus of papers describing experiments that are done in a mix of E. coli lab strains, not just MG1655 and its direct descendants. Large fractions of the work on K-12 strains is based on W3110 derivatives. MG1655 and W3110 are both covered by taxid 83333. However, this excludes all of the work done with E. coli B derivatives, including, I believe, all of the early work from the CSHL phage group, all of the work from Engelsberg on arabinose, and most of the work from Fred Neidhardt's group.
From the model organism perspective, the vast majority of the work we expect people to curate is inferred to apply to a more generic lab E. coli, not just the strain where the work was done. The exception would be studies comparing pathogens and nonpathogens, restriction-modification, or strains with or without lambda or F. For this reason, I think we should use 562 unless the annotation is based on a study of what makes K-12 different from other E. coli clades, or MG1655 different from other K-12 strains."
  • Currently we leave the annotations alone, meaning there is a mix of annotations to E. coli K-12 and substr. MG1655.
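
To see how mixed the taxon usage is in a given file, the values in column 13 can be tallied. This is a minimal sketch, assuming a tab-delimited GAF whose comment lines start with `!`. The function name taxon_counts is hypothetical.

```shell
#!/bin/bash
# taxon_counts: hypothetical helper that tallies the values in column 13.
# $1 = GAF file
# Output: "<count> <taxon>" per distinct value, most frequent first.
taxon_counts() {
    grep -v '^!' "$1" | cut -f13 | sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}
```

Running it on the merged file should show counts for both taxon:83333 and taxon:511145 as long as the annotations are left alone.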


Detailed Description of File Generation

Step 0 - Set up the environment

The process requires some files and scripts from the GO Consortium. Instead of doing this by hand each month, Daniel wrote a script to handle this part of the process. To use this bash script, copy the code below into a file and make it executable. On our servers it is named gaf_setup.sh.

The first step is to run this script. Alternatively, you can follow the directions below to do this by hand.

<php>
#!/bin/bash
# -------------------------------------------------------------------------------
# gaf_setup.sh    Daniel Renfro 01/2010
# This script sets up a working directory for the monthly creation of the
# gene-associations file and the gp2protein file for EcoliWiki.
# -------------------------------------------------------------------------------

working_dir=$(date "+%Y-%m-%d")
prefix=`dirname "$0"`

# make the working directory
mkdir "$working_dir"
cd "$working_dir" || exit 1

# make some necessary directories
echo "Creating directories..."
mkdir ontology
mkdir doc

# download the stuff from GO
echo "Downloading files from the Gene Ontology..."
ncftp ftp://ftp.geneontology.org/pub/go/software/utilities/ <<END_SCRIPT
get filter-gene-association.pl
get check-abbr-ga-file.pl
quit
END_SCRIPT
ncftp ftp://ftp.geneontology.org/pub/go/ontology <<END_SCRIPT
get gene_ontology_edit.obo
quit
END_SCRIPT
ncftp ftp://ftp.geneontology.org/pub/go/doc/ <<END_SCRIPT
get GO.xrf_abbs
quit
END_SCRIPT

# put the files in the right places
mv gene_ontology_edit.obo ontology
mv GO.xrf_abbs doc

# set the permissions on the executable files
chmod +x check-abbr-ga-file.pl filter-gene-association.pl

echo "setup complete."
</php>

Make a new directory to work in.

cd /usr/local/working/GAF/
mkdir YYYY-MM-DD
chmod 775 YYYY-MM-DD
cd YYYY-MM-DD

Look through the wiki for problems that can be fixed, such as GO terms whose names have changed and obsolete categories or annotations, and fix them.

php5 /usr/local/phpwikibots/trunk/find_GO_terms_with_changed_names.php -w /Library/WebServer/WebServer/Documents/ecoliwiki/colipedia 
php5 /usr/local/phpwikibots/trunk/find_obsolete_categories_and_annotations.php -w /Library/WebServer/WebServer/Documents/ecoliwiki/ -C -A 

Make some directories for use later

mkdir doc ontology

Download some scripts and some files. Here we are using the ncftp command-line tool, but you can use whichever tool you are comfortable with.

ncftp ftp://ftp.geneontology.org/pub/go/software/utilities/
  get filter-gene-association.pl
  get check-abbr-ga-file.pl
  quit
ncftp ftp://ftp.geneontology.org/pub/go/ontology
  get gene_ontology_edit.obo
  quit
ncftp ftp://ftp.geneontology.org/pub/go/doc/
  get GO.xrf_abbs
  quit
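
If ncftp is not available, the same four files can be fetched with curl. This is a minimal sketch, assuming curl is installed and built with FTP support. The function names go_file_urls and fetch_go_files are hypothetical, and the URLs are simply the ones from the ncftp session above.

```shell
#!/bin/bash
# go_file_urls: prints the four URLs used in this step, one per line.
go_file_urls() {
    local base="ftp://ftp.geneontology.org/pub/go"
    printf '%s\n' \
        "$base/software/utilities/filter-gene-association.pl" \
        "$base/software/utilities/check-abbr-ga-file.pl" \
        "$base/ontology/gene_ontology_edit.obo" \
        "$base/doc/GO.xrf_abbs"
}

# fetch_go_files: downloads each URL into the current directory.
# -f fails on server errors, -sS hides progress but shows errors,
# -O keeps the remote file name.
fetch_go_files() {
    go_file_urls | while read -r url; do
        curl -fsS -O "$url" || echo "download failed: $url" >&2
    done
}
```

Call `fetch_go_files` from inside the working directory, then move the files as described in the next step.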

Move some of those files into the directories.

mv gene_ontology_edit.obo ontology/
mv GO.xrf_abbs doc/

Step 1 - EcoCyc annotation

EcoCyc's quarterly releases contain their annotations in the gene_association file format. The file is typically named gene_associations.ecocyc.... (note the plural "associations"). To prepare their annotations, we first check them for validity: using the scripts downloaded in the setup section, parse the annotation file from the most recent EcoCyc release. For a more detailed description of the scripts, see http://www.geneontology.org/GO.format.annotation.shtml#script. The scripts can be run with an option that prints the erroneous lines with short descriptions of the problem. Some of these errors are easy to fix and have no biological significance; common examples are misspellings or disallowed entries in the Assigned_by column (column 15). Other errors are non-trivial and should be tracked for further curation.

Copy the downloaded EcoCyc file to our working directory

cp ./path/to/ecocyc/data/14.0/gene_associations.ecocyc-14.0-corrected ./ecocyc.annotations.all 

Change the permissions to read-only.

chmod 444 ecocyc.annotations.all

Fix some very common errors with the EcoCyc file

/path/to/code/gene_ontology/fix_ecocyc_gene_associations_file.pl < ecocyc.annotations.all > ecocyc.annotations.valid

Validate

./check-abbr-ga-file.pl      -i ecocyc.annotations.valid -p nocheck -w > a; mv a ecocyc.annotations.valid;
./filter-gene-association.pl -i ecocyc.annotations.valid -p nocheck -w > a; mv a ecocyc.annotations.valid;
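
Misspelled Assigned_by values are easier to fix when they are listed on their own. The sketch below prints every column-15 value that is missing from a list of allowed abbreviations. It assumes a tab-delimited GAF and a plain text file with one allowed abbreviation per line. The function name unknown_assigned_by is hypothetical, and the comparison is an exact, case-sensitive match, which may be stricter than the GO scripts.

```shell
#!/bin/bash
# unknown_assigned_by: hypothetical helper, not part of the GO tool set.
# $1 = GAF file
# $2 = file with one allowed abbreviation per line
# Prints each distinct column-15 value that is not in the allowed list.
unknown_assigned_by() {
    grep -v '^!' "$1" | cut -f15 | sort -u | grep -vxFf "$2"
}
```

One way to build the allowed list, assuming doc/GO.xrf_abbs keeps each entry on a line beginning with `abbreviation:`, is `sed -n 's/^abbreviation: *//p' doc/GO.xrf_abbs > allowed.txt`. Check the actual layout of the file before relying on this.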

Step 2 - EcoliWiki annotations

We use a custom-written script, mine_annotations.php, to pull annotations from the wiki. The output is then validated, and the errors are sent to our curators.

Get all annotations using PHP

php /path/to/mine_annotations.php -w ecoliwiki > ecoliwiki.annotations.all

Validate

./check-abbr-ga-file.pl      -i ecoliwiki.annotations.all -p nocheck -w > ecoliwiki.annotations.valid
./filter-gene-association.pl -i ecoliwiki.annotations.valid -p nocheck -w > a; mv a ecoliwiki.annotations.valid;

Step 3 - Merge

Problems arise when EcoliWiki and EcoCyc use different db_object_symbols/common-names for the same db_object_id. (Column 2 matches, but column 3 doesn't.) The GO filter script counts this as an error.

Combine the two sets of valid annotations.

cat *.valid > combined.annotations.all 

The next step is to run the filter script, save the output, and fix the file by hand.

./filter-gene-association.pl -i combined.annotations.all -p nocheck -e > name_conflicts.txt
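
The conflicts described above (same column 2, different column 3) can also be listed directly, which helps when fixing the file by hand. This is a minimal sketch that complements the GO filter script rather than replacing it. It assumes a tab-delimited file such as combined.annotations.all, and the function name symbol_conflicts is hypothetical.

```shell
#!/bin/bash
# symbol_conflicts: hypothetical helper, not part of the GO tool set.
# $1 = GAF file
# Prints each db_object_id (column 2) that appears with more than one
# db_object_symbol (column 3), followed by the symbols seen.
symbol_conflicts() {
    grep -v '^!' "$1" | cut -f2,3 | LC_ALL=C sort -u |
    awk -F'\t' '
        # after sort -u every (id, symbol) pair occurs once,
        # so a count above 1 means the id has several symbols
        { count[$1]++; symbols[$1] = (symbols[$1] == "" ? $2 : symbols[$1] ", " $2) }
        END { for (id in count) if (count[id] > 1) print id ": " symbols[id] }
    ' | LC_ALL=C sort
}
```

Each output line names one identifier and its competing symbols, so the list doubles as a checklist for the manual fixes.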

After fixing naming issues, validate

./check-abbr-ga-file.pl      -i combined.annotations.all    -p nocheck -w > combined.annotations.valid
./filter-gene-association.pl -i combined.annotations.valid  -p nocheck -w > a; mv a combined.annotations.valid

Step 4 - Submit

header information

Occasionally the header information in the gene_association file needs to be updated. This information can be found in the gene_association.ecocyc.conf file in the go/gene-associations/submission/ directory of the GO CVS. To change it, edit the file in the CVS checkout and resubmit it. It will be prepended to the gene_association file automatically. As of March 2010 the file looks like this:

project_name=EcoCyc and EcoliWiki collaborative annotation for E. coli K-12
contact_email=biocyc-support@ai.sri.com
contact_email=ecoliwiki@gmail.com
project_url=http://ecoliwiki.net
project_url=http://ecocyc.org
funding_source=GM077678
funding_source=NIGMS U24GM088849
email_report=ecoliwiki@gmail.com
ecocyc_version=14.0

gp2protein.ecocyc

In addition to the gene_association file, the gp2protein file needs to be submitted. It can be generated on the fly each month by a custom script that pulls the necessary information from EcoliWiki.

submission via CVS

The files are sorted and uniqed using the appropriate Unix commands. They are then gzipped and moved into the CVS directory.

cd ~/CVS/
cvs commit -m "monthly gene_association update"
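
The sorting, de-duplication and compression that precede the commit can be sketched as follows. This is one possible approach, not necessarily the exact commands used on our servers. It assumes comment lines start with `!` and should stay at the top of the file. The function name finalize_gaf is hypothetical.

```shell
#!/bin/bash
# finalize_gaf: hypothetical helper.
# $1 = validated input file
# $2 = output file name (gzip replaces it with $2.gz)
finalize_gaf() {
    {
        grep '^!' "$1"                          # header/comment lines, original order
        grep -v '^!' "$1" | LC_ALL=C sort -u    # sorted, de-duplicated annotation lines
    } > "$2"
    gzip -f "$2"
}
```

For example, `finalize_gaf combined.annotations.valid gene_association.ecocyc` would leave gene_association.ecocyc.gz ready to move into the CVS directory. The output name here is an assumption based on the names of the .conf and gp2protein files.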

See Also

  • Personal correspondence. https://mail.google.com/mail/?hl=en&shva=1#label/gaf/1266d5c79b4c8bc7