Batch Creator Script

From EcoliWiki

<perl>

#!/usr/bin/perl -w
# ----------------------------------------------------------------------------------
# 2010-06-11 Version 1.0
#
# INTRODUCTION:
# This code takes a template and merges it with a data file to create a batch
# file for loading microarray data into SMD [smd.stanford.edu]. The purpose
# is to automate the process of making annotations by having this script fill
# in fields that are always the same.
#
# >>Template File Construction:
# This file should include 2 rows of data, with the first being the headers from
# "Result_Set_Name" to "PROBE_SET_ALGORITHM". The second row should contain data in
# any column that holds information that will never change for that template
# type. If you are loading multiple array types, you can make a separate
# template for each.
#
# >>Data File Construction:
# This file should contain at least 2 rows (one for the headers and the other
# for the row of data that was annotated), but can have as many rows as you want.
#
# >>Note about file types:
# You can make the template and data file in Excel, but they must be saved as
# tab-delimited text files if you want them to work properly with this code.
# The resulting batch file will be named GSE####_batch.txt (where #### will
# be replaced with the GSE number you give on the command line).
#
# HOW THE CODE WORKS:
# This code will first look at your data file and see if the cells are filled.
# If they are not, it will check the template. If a value is present in neither,
# this code will print an error telling you which column is missing data. Because
# the data file has priority over the template, you can override any template
# option without changing the template itself. However, if you find yourself
# overriding many of the template options, it would be best to make a new template.
#
# >>Assumptions in this code
# This code assumes that Columns C, O, and S are blank and does not read them
# from the template or data files.
#
# >>Rules
# The bottom of this code lists, at a glance, the rules this code follows.
# As a rule of thumb, fill in Columns D, K, N, P, T, and U for the template
# and Columns E, F, G, Q, and R for the data file.
#
# QUESTIONS
# Contact me at amandasupak@gmail.com
# ----------------------------------------------------------------------------------


use strict;

use Getopt::Long;
use Data::Dumper;

my ($template_file, $template_content, $data_file, $data_content, $gse_number, $result, @template_lines);

$result = GetOptions(
    "t|template=s"   => \$template_file,
    "d|data=s"       => \$data_file,
    "g|gse_number=s" => \$gse_number,
);
die "needs a template file\n" unless $template_file;
die "needs a data file\n"     unless $data_file;
die "needs a GSE number!\n"   unless $gse_number;

# Read template into a scalar (to ignore line endings)
{
    local $/;
    open(my $template_fh, '<', $template_file) or die "Couldn't open file: $!";
    $template_content = <$template_fh>;
    close $template_fh;
}

# Normalize Excel-style line endings (Windows CRLF or a bare CR) to \n
$template_content =~ s!\r\n?!\n!g;
@template_lines = split /\n/, $template_content;

# Print to the screen that the data is being processed
print "\nProcessing...\n";

# Strip "GSE" from "GSE#####" when it is typed at the "-g" prompt
$gse_number =~ s/GSE(\d+)/$1/;
print "$gse_number\n";

# Split the template into columns, only on the second line (the first is the headings)
my @template = split /\t/, $template_lines[1];

# Open a file to print to
open(OUTPUT, '>', 'GSE' . $gse_number . '_batch.txt') or die $!;

# Put the headers from the template into the output file
print OUTPUT $template_lines[0] . "\n";

# Read the data file into a scalar
{
    local $/;
    open(my $data_fh, '<', $data_file) or die "Couldn't open file: $!";
    $data_content = <$data_fh>;
    close $data_fh;
}

# Normalize line endings the same way as for the template
$data_content =~ s!\r\n?!\n!g;
my @data_lines = split /\n/, $data_content;

my @data;

my $i = 0;
foreach (@data_lines) {

    # Skip the header row
    if ( $i == 0 ) { $i++; next; }

    # Skip blank lines
    next if /^\s*$/;

    @data = split /\t/, $_;

    my @fields_to_print;

    # (A) result_set_name (strip any leading "GSE", in either case, before re-adding it)
    $gse_number =~ s/^GSE//i;
    $fields_to_print[0] = 'GSE' . $gse_number . '_ecolihub';

    # (B) result_set_description
    $fields_to_print[1] = $fields_to_print[0];

    # (C) add_to_exp (blank)
    $fields_to_print[2] = "";

    # (D) print_name
    $fields_to_print[3] = getColumn(3);

    # (E) experiment_category
    $fields_to_print[4] = getColumn(4);

    # (F) experiment_subcategory
    $fields_to_print[5] = getColumn(5);

    # (G) slide_name
    $fields_to_print[6] = getColumn(6);

    # (H) exp_file_location
    $fields_to_print[7] = getColumn(7);

    # (I) cel_file_location
    $fields_to_print[8] = $fields_to_print[6] . '.CEL';

    # (J) gene_file_location
    $fields_to_print[9] = $fields_to_print[6] . '.CEL.chp.txt';

    # (K) single_scan_file_location
    $fields_to_print[10] = getColumn(10);

    # (L) single_channel_description
    my $s = getColumn(17);
    $s = '' unless defined $s;
    if ( $s =~ /^"(.*)"$/ ) {
        # strip off the leading and trailing quotes added by Excel
        $s = $1;
    }
    if ( $s =~ /^(.*)<c-anno>/ ) {
        # grab the first part, before the <c-anno> tag
        $fields_to_print[11] = $1;
    }
    else {
        # leave the field empty rather than undefined so the join below stays clean
        $fields_to_print[11] = '';
        print STDERR "Please check the experiment description (Column R). It must include a description of the experiment before the <c-anno> tag.\n";
    }

    # (M) experiment_name
    $fields_to_print[12] = $fields_to_print[11];

    # (N) normalization
    $fields_to_print[13] = getColumn(13);

    # (O) norm_value (blank)
    $fields_to_print[14] = '';

    # (P) experimenter
    $fields_to_print[15] = getColumn(15);

    # (Q) date, which must look like YYYY-MM-DD
    my $date = getColumn(16);
    $date = '' unless defined $date;
    if ( $date !~ /^(?:19|20)\d\d-(?:0[1-9]|1[012])-(?:0[1-9]|[12][0-9]|3[01])$/ ) {
        print STDERR "Date in the wrong format! Please use YYYY-MM-DD format.\n";
        $date = '';
    }
    $fields_to_print[16] = $date;

    # (R) experiment_description
    $fields_to_print[17] = $s;    # from (L)

    # (S) collaborative_group (blank)
    $fields_to_print[18] = "";

    # (T) individual_user
    $fields_to_print[19] = getColumn(19);

    # (U) probe_set_algorithm
    $fields_to_print[20] = getColumn(20);

    # Now we have @fields_to_print: join it into one tab-delimited line, write it
    # to the batch file, and echo the slide name to the screen
    my $x = join("\t", @fields_to_print);
    print OUTPUT $x . "\n";
    print "$fields_to_print[6]\n";

    # Debugging aid (disabled): dump data, template, and merged value per column
    #for (my $j=0, $c=scalar(@fields_to_print); $j<=$c; $j++) {
    #    printf("%d,\t%s,\t%s,\t%s\n",$j, $data[$j], $template[$j], $fields_to_print[$j] );
    #}
    #die();

    $i++;
}
close OUTPUT;


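# Determines whether the data file or the template file supplies a column.
# The data file has priority; the template is used only when the data cell is empty.
sub getColumn {
    my $column_index = shift @_;

    # Test for "defined and non-empty" so that a literal 0 still counts as a value
    my $from_data     = $data[$column_index];
    my $from_template = $template[$column_index];
    return $from_data     if defined $from_data     && length $from_data;
    return $from_template if defined $from_template && length $from_template;

    # $column_index is zero-based, so index 3 is Column D
    print STDERR "Column " . $column_index . " doesn't have any values in data or template!\n";
    return '';
}

my $r = $i - 1;
print "Created $r row(s)\n";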

__END__

# Rules:
# Column A is built from user input (GSE####) plus the fixed suffix _ecolihub
# Column B is (Column A)
# Column H is taken from the template unless specified by the user (.EXP files)
# Column I is (Column G + .CEL)
# Column J is (Column G + .CEL.chp.txt)
# Column L is taken from the text of Column R before "<c-anno>"
# Column M is taken from Column L
</perl>
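
The priority rule described in the script header (a filled data cell wins, otherwise the template cell is used, otherwise the column is reported as missing) can be tried on its own, outside the script. The following is a minimal sketch and not the script's actual code: `pick_value` and `merge_row` are hypothetical helper names, and the two sample rows are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the data cell if it is filled, else the
# template cell, else undef so the caller can report a missing column.
sub pick_value {
    my ($data_cell, $template_cell) = @_;
    for my $cell ($data_cell, $template_cell) {
        return $cell if defined $cell && length $cell;
    }
    return;
}

# Hypothetical helper: merge one tab-delimited data row with the template row.
# Returns the merged fields and the zero-based indexes of columns that have
# no value in either row.
sub merge_row {
    my ($data_line, $template_line) = @_;

    # The -1 limit keeps trailing empty cells, which split drops by default.
    my @data     = split /\t/, $data_line,     -1;
    my @template = split /\t/, $template_line, -1;
    my $width    = @data > @template ? @data : @template;

    my (@merged, @missing);
    for my $i (0 .. $width - 1) {
        my $value = pick_value($data[$i], $template[$i]);
        push @missing, $i unless defined $value;
        push @merged, defined $value ? $value : '';
    }
    return (\@merged, \@missing);
}

# Made-up rows: the template fills column 0, the data file fills column 1
# and overrides column 2, and column 3 is empty in both.
my ($merged, $missing) = merge_row("\tslide_7\toverride\t", "tmpl_print\t\tdefault\t");
print join('|', @$merged), "\n";
print "missing columns: @$missing\n";
```

Testing for "defined and non-empty" rather than plain truthiness is a design choice in this sketch: it lets a cell containing a literal `0` count as filled.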