WORKFLOW FOR ANALYSIS OF CHIP-SEQ DATA INCLUDING ANALYSIS OF ENRICHED TRANSCRIPTION FACTOR BINDING SITES
-Gavin Schnitzler 02/11/2013
1) QUALITY CONTROL ON SEQUENCING RESULTS
Open the fastqc_report.html file.
Check per base sequence quality. This should be in the green range throughout. Somewhat lower quality is expected in the first ~8bp (due to technical aspects of the sequencing method), which is why these bases are often excluded from the analysis. If other 5’ or 3’ end bases fall out of the green range you may also want to exclude these with the caveats that you’ll need at least ~25 bp to map to a mammalian genome & you will want to perform the same trim on all samples.
Check per sequence quality scores. Ideally there will be a sharp peak in the 30’s & relatively few reads below this.
Other metrics often don’t link properly to the .html report, but you can access each of these by opening the images subfolder of the fastqc results folder:
duplication_levels.png shows the number of exact duplicate reads. If most of your reads have a duplication value greater than 1 this indicates that you are sequencing multiples of the same fragment that were generated by the PCR amplification step in library production. Avoiding this is the major reason why you need to start with ~3ng of ChIP fragments and why you want to limit amplification to ~18 cycles. If there is a single curve centered around 2 or 3 your data should still be fine – but will be about as accurate as having 1/3 as many unique sequences.
The kmer content values in the .html report & the kmer_profiles.png image describe short sequences that are overrepresented compared to what would be predicted by chance from the GC content. Usually TTTTT & AAAAA and a few others show up, which are just normal characteristics of mammalian DNA. Don’t worry about this unless certain sequences show up hundreds of times above background and/or the kmers in your input DNA sample differ greatly from those in your ChIP sample… in which case these kmers could indicate contamination by very high copy numbers of the same sequence – potentially a PCR fragment from some other experiment that contaminated your ChIP sample.
Per base GC content & sequence content .png files should show roughly straight lines with the exception of the first ~8 or 9 bp, which is expected (as mentioned above).
The per sequence GC content .png file should have a single peak centered over the expected %GC for the genome in question (~45% for mammals). Sharp peaks of any given GC content probably represent highly repeated sequences (resulting perhaps from contamination of your ChIP DNA with a PCR product of some specific sequence that was hanging around the lab). A second broad peak of higher GC content probably indicates bacterial DNA contamination. An example of contamination with a common soil bacterium that was probably growing in a ChIP buffer is shown here:
Note that none of these issues should completely prevent analysis of your data. For instance, non-mammalian sequences won’t map to the genome & over-amplified artifact sequences are removed by the recommended procedure of removing duplicate reads before calling peaks. These problems will, however, decrease your total number of reads and thus decrease the accuracy of peak calls. Similarly a library of too low a concentration will result in fewer than optimal numbers of sequenced clusters. What will prevent analysis is low sequence quality across all bases of reads, or high N (unknown base) calls, which will prevent any reliable mapping to your target genome. This sort of effect is rare & is likely to indicate a problem with the Illumina run itself, that you will want to contact your core about.
[Figure: per-sequence GC content plot showing a second, broader peak at higher GC, indicating bacterial contamination]
2) DOWNLOADING & UNPACKING ILLUMINA SEQUENCING RESULTS
#go to the web site provided by the core facility. Find the correct file ending in “fastq.gz”. Control click to get menu & select “copy link location”
#go to the Tufts computing cluster, either using a shell program like TeraTerm (Windows) or using “ssh your_account@cluster.uit.tufts.edu” from a Mac OSX terminal.
#Navigate to an appropriate folder in your shared directory & do…
wget {pasted URL copied from website}
#unpack using…
bsub gunzip FILE.fastq.gz
**Remember to create a backup of the QC and .fastq.gz files in someplace other than the cluster, such as an external hard drive.
3) CHIP-SEQ STEP 1: MAPPING READS TO TARGET GENOME WITH BOWTIE
#run bowtie using:
bsub -oo LiE_man2_bowtie.runinfo /cluster/shared/gschni01/bowtie-0.12.5/bowtie -n 1 -m 1 -5 8 -3 10 --best --strata mm9 FILE.fastq FILE.map
# “bsub -oo filename” submits the bowtie run as a batch job and records any output that would have been sent to the screen in the provided file; bsub is necessary for any job that will take more than a few seconds to run.
# “-n #” specifies the number of mismatches allowed in the first 25 bp of the read, “-m #” specifies the maximum number of different genomic locations a read can map to before being rejected, “-5 #” indicates the number of bp trimmed from the 5’ end (8 generally advised), “-3 #” the number of bp trimmed from the 3’ end, and “--best” & “--strata” are recommended parameters for finding the likely best fit in the genome.
#note, no commands to set path parameters are necessary to run bowtie, so long as the full path to the bowtie executable file is used. You can identify this path by going to that directory and typing “pwd”.
#Examine the bowtie .runinfo file & record the results, which will tell you what % of sequences mapped to the genome, what % failed to map & what % were suppressed due to the -m parameter (e.g. reads mapping to more than one genomic location if -m is set to 1).
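#To pull out just the mapping summary quickly: on bowtie 0.12.x the summary lines written to the runinfo file typically begin with “# reads”, so something like the following should work (a convenience only; you can also simply open the file):
grep "^# reads" LiE_man2_bowtie.runinfo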
# To install bowtie go to: [bowtie download site] & follow the instructions for installation. To install indexes for a genome of interest, right-click on the link for the pre-assembled index you want (along the right of the page) & select copy link location. In UNIX, go to the “indexes” folder in bowtie & do:
wget [pasted link location]
unzip [downloaded file]
4) CHIP-SEQ STEP 2: CONVERTING FROM MAP TO BED FORMAT
#MACs requires that the .map bowtie output be converted into a .bed file. Do:
awk 'OFS="\t" {print $4,$5,$5+length($6),$1,".",$3}' FILE.map > FILE.bed
# “\t” tells awk that entries within a line are separated by tabs, $4 means the 4th column, $5+length($6) means add the numeric value in the 5th column to the (text) length of the entry in column 6 & “.” means to insert a literal period in this tabbed position.
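#As a quick sanity check on the conversion, look at the first few lines of the output; each should show six tab-separated columns in the order chrom, start, end, read name, “.” (score placeholder), strand:
head -3 FILE.bed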
5) IDENTIFYING CHIP-SEQ PEAKS WITH MACS
#To prepare to run macs:
module load python/2.6.5
export PYTHONPATH=/cluster/shared/gschni01/lib/python2.6/site-packages:$PYTHONPATH
export PATH=/cluster/shared/gschni01/bin:$PATH
#note the paths used here depend on parameters specified when MACS was installed
#Running MACs using an input control (-c) and ChIP experimental (-t) files:
bsub -oo IPvINPUT.macsinfo macs14 --format=BED --bw=210 --tsize=33 --keep-dup=1 -B -S -c INPUT.bed -t IP.bed --name IPvINPUT
#--format=BED tells MACS that the input files are in .bed format.
#--bw=210 tells MACS the expected size of the sequenced fragments (before addition of linkers, which add an additional ~90 bp); from this value it attempts to build a model from sense and antisense sequence reads.
#--tsize=33 indicates that each read (after trimming in bowtie) is 33 bp long (not strictly necessary: MACS can figure this out on its own).
#--keep-dup=1 instructs MACS to consider only the first instance of a read starting at any given genomic base pair coordinate & pointing in the same direction – assuming that additional reads starting at the same base pair are due to amplified copies of the same ChIP fragment in the library. (By default MACS estimates the number of duplicates that are likely to arise by linear amplification of all fragments from a limited starting sample, and sets the threshold to cut out replicate reads with a much higher number – likely artifacts.)
#-B tells MACS to make a bedgraph file of read density at each base pair (which can be used to visualize the results on the UCSC browser), & -S tells MACS to make a single .bedgraph file instead of one for each chromosome.
#--name gives the prefix name for all output files.
# Carefully scan through the MACS output file & examine the results for:
INFO @ Tue, 21 Feb 2012 18:30:07: #1 total tags in treatment: 54992015
INFO @ Tue, 21 Feb 2012 18:30:07: #1 tags after filtering in treatment: 38314984
#(reads left after keeping only one of each repeated sequence based on keep-dup=1)
INFO @ Tue, 21 Feb 2012 18:30:07: #1 Redundant rate of treatment: 0.30
# also look at these numbers for your control file. A high redundant rate reduces your read count & your ability to detect peaks accordingly.
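# As a worked example using the treatment numbers above: 54,992,015 total tags x (1 - 0.30 redundant rate) = ~38.5M, consistent with the 38,314,984 tags reported after filtering.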
# MACS identifies + strand and – strand peaks, assumes that some of these come from left & right reads of binding site peak fragments & builds a model that determines the optimum separation between these peaks…
INFO @ Tue, 21 Feb 2012 18:31:19: #2 Build Peak Model...
INFO @ Tue, 21 Feb 2012 18:32:01: #2 number of paired peaks: 9916
INFO @ Tue, 21 Feb 2012 18:32:04: #2 predicted fragment length is 133 bps
# You ideally want >1000 peak pairs in the model, but MACS will run (giving a warning) so long as you have ~100. Importantly, the predicted fragment size should be within +/-10 or 20 bp of your estimate of the ChIP fragment size in your library (equal to your average final library fragment length minus the ~90 bp length of the adapters). If it is not, then the model is probably based on spurious read peaks resulting from, for instance, sonication sensitive sequences that get cleaved at high frequency.
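# Worked example using the numbers above: with --bw=210 the expected ChIP fragment is ~210 bp (i.e. a final library averaging ~300 bp minus the ~90 bp of adapters), so a predicted fragment length of 133 bp is well outside the +/-10-20 bp window and would be a flag to re-examine the model (see below).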
#You can visualize the quality of the model as follows: copy the “.r” file generated by MACS (e.g. LiE_man2.33_v_Li_Input_bw210_dup1_model.r) to your PC. Open R. Set the working directory to the folder containing the .r file using:
setwd("[file path]")
#Then generate a viewable .pdf from the .r file using:
source("filename.r")
#This should show two broad peaks for plus strand & minus strand reads and a 3rd for the read density center called by MACs based on its predicted fragment length. Sharp peaks separated by less than 50 bp are peak call errors due perhaps to sonication-sensitive sequences. If you also see broad peaks flanking these sharp peaks it means that it may be possible to identify the real peaks if you adjust the MACS parameters. In particular the --mfold=X,Y parameter tells MACS to use peaks that are between X and Y fold over background in calculating the model (default 10 & 30). Sharp false peaks are often very high over background, so reducing the max --mfold can often help (e.g. --mfold=10,20). If signal for real binding events is not strong, you can also reduce the lower parameter (I’ve had good results sometimes with --mfold=7,12) – with the caveat that this is more likely to call true noise as a peak.
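#For example, a rerun of the section 5 command with a tighter model range (hypothetical file names) might look like:
bsub -oo IPvINPUT_mf10_20.macsinfo macs14 --format=BED --bw=210 --tsize=33 --keep-dup=1 -B -S --mfold=10,20 -c INPUT.bed -t IP.bed --name IPvINPUT_mf10_20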
# Finally you get to the number of peaks called:
INFO @ Tue, 21 Feb 2012 18:38:19: #3 Finally, 55920 peaks are called!
# Then, as a control, MACS swaps treatment and control data, considering treatment as background and asking for the peaks detected in the control data.
INFO @ Tue, 21 Feb 2012 18:38:19: #3 find negative peaks by swapping treat and control
INFO @ Tue, 21 Feb 2012 18:39:08: #3 Finally, 1119 peaks are called!
# This is a measure of how many peaks result from random noise. Ideally you want “negative peaks” to be 10-fold or more lower than “positive peaks”. This is how MACS calculates empirical false discovery rate, so the lower the ratio of negative to positive peaks, the better your FDRs will be. If these values are roughly equal, this indicates that non-specific fragments in your ChIP DNA are more numerous than the fragments specifically brought down with your antibody & may suggest you need to optimize your ChIP conditions or try other antibodies.
# Keep tweaking --mfold ranges until you get the correct estimated fragment size & a high positive/negative peak ratio, if possible. Beware that the smaller the mfold range you allow, the fewer qualifying peaks there will be. If the model won’t resolve, you can tell MACS to forget the model & just use your estimate of fragment size by adding these parameters to your MACS command: “--bw=fragment_length --shiftsize=fragment_length/2 --nomodel” & hope that this gives a good positive/negative peaks ratio.
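#For example, with an estimated ChIP fragment length of 210 bp (so shiftsize=105), a no-model run might look like (hypothetical file names):
bsub -oo IPvINPUT_nomodel.macsinfo macs14 --format=BED --tsize=33 --keep-dup=1 -B -S --nomodel --bw=210 --shiftsize=105 -c INPUT.bed -t IP.bed --name IPvINPUT_nomodel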
#To install MACS 1st go to: liulab.dfci.harvard.edu/MACS/ & choose the download option & ctrl-click to copy the link location. If you need to login use default username=macs & password=chipseq
To download directly to the cluster, you will need to include login information in the wget command:
wget --http-user=macs --http-password=chipseq [paste url]
Follow the instructions for installation. To get it to work, you will probably need to change the default version of python with:
module add python/2.6.5
6) SUBSAMPLING BEFORE MACS, PLUSES and MINUSES
# I am told that the P-values and fold-enrichment values given by MACS are sensitive to the relative number of reads in the treatment vs control files, and if this number is very different (maybe > 1.5x different) MACs may not accurately report these numbers (even though the accuracy of peak calls should be mostly unaffected).
# One way to handle this is to run MACS with all of the data initially and compare the "tags after filtering" for treatment and control. Then take a subsample of the dataset with the higher number of reads & re-run MACS. E.g. if the control has 60M reads after filtering and the treatment has 20M, take a 33% subsample of the control & rerun MACS. The following small command-line perl script can be used to subsample reads; in this case ~33% of the lines in FILE.bed are written to FILE_33pct.bed.
perl -e 'open(F1,"FILE.bed"); open(FH2,">","FILE_33pct.bed"); while(<F1>){if(rand()>.666){print FH2 $_;} }; close F1; close FH2;'
#This will not give an exactly equal number of reads, especially if % redundancy is high, but may get you close – especially if you try a few iterations. To get a truly equal number requires deduplicating the reads _outside_ of MACS and feeding MACS an equal number of these reads. Unfortunately, this is trickier than it might seem, involving converting your bed into a .sam file, running two samtools programs and then converting back to .bed.
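#One possible route is sketched below, assuming bedtools and an older (0.1.x) samtools are available on the cluster and that you have a chromosome-sizes file (e.g. mm9.genome); the exact tools and versions on your system may differ:
bedtools bedtobam -i FILE.bed -g mm9.genome > FILE.bam
samtools sort FILE.bam FILE.sorted
samtools rmdup -s FILE.sorted.bam FILE.dedup.bam
bedtools bamtobed -i FILE.dedup.bam > FILE.dedup.bed
#The two samtools steps (sort, then rmdup) remove duplicate single-end reads; bamtobed then returns the deduplicated reads to .bed format, from which you can take an exactly matched number of reads for each sample.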
#Be aware that subsampling is a delicate trade-off which increases the effective noise & reduces the accuracy of the subsampled dataset in order to improve MACS fold enrichment and p.value output accuracy.
7) COMPARING PEAK CALLS TO UNDERLYING READ DENSITY DATA
#If you included the -B & -S parameters in MACS it will create subfolders containing treat and control bedgraph (.bdg) files, which map the read density base pair by base pair across the whole genome.
#Unfortunately MACS often extends predicted density a few bases past the UCSC-browser recognized chromosome ends, causing errors. To fix this you can modify and run a short perl program as follows:
vi trim_bdg_chrom_ends.pl
i
#cut and paste this content after adjusting the chromosome lengths to what the UCSC genome browser reports for the species and build you are using (you can change the number of base pairs per chromosome, add or subtract numbered chromosomes without problem, but you’ll need to make further adjustments to the program if you add a non-numbered chromosome other than X, Y and M):
=head1 Simple file to remove bed or bedgraph regions that exceed chromosome ends in mouse mm9
=head1 Usage
Input: At command line, type:
> perl trim_bdg_chrom_ends.pl input_bed_or_bedgraph_file.bdg output_filename.bdg
=head1 Version Information
Gavin Schnitzler 6/29/2012
=cut
my %chromhash=(
chr1=>197195432,
chr2=>181748087,
chr3=>159599783,
chr4=>155630120,
chr5=>152537259,
chr6=>149517037,
chr7=>152524553,
chr8=>131738871,
chr9=>124076172,
chr10=>129993255,
chr11=>121843856,
chr12=>121257530,
chr13=>120284312,
chr14=>125194864,
chr15=>103494974,
chr16=>98319150,
chr17=>95272651,
chr18=>90772031,
chr19=>61342430,
chrX=>166650296,
chrY=>15902555,
chrM=>16299
);
#print "Chr1: $chromhash{chr1}\n";
open (FH1, "<", $ARGV[0]) or die ("Could not open input file $ARGV[0]\n");
open (FH2, ">", $ARGV[1]) or die ("Could not open output file $ARGV[1]\n");
while(<FH1>){
# chomp;
@tmp_dat = split(/\t/, $_);
if($tmp_dat[0] =~/track/){print FH2 $_; next;}
if($tmp_dat[2]>=$chromhash{$tmp_dat[0]}){next;}
print FH2 $_;
}
close FH1;
close FH2;
#finish editing in vi by
[esc]
:wq
#then run the program with:
perl trim_bdg_chrom_ends.pl filename.bdg outputfilename.bdg
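#If you need the chromosome lengths for a different genome or build, one convenient source (assuming the cluster has web access; this is the usual UCSC download location, shown here for mm9) is the chrom.sizes file, e.g.:
wget http://hgdownload.cse.ucsc.edu/goldenPath/mm9/bigZips/mm9.chrom.sizes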
#Next compress your file for uploading to the browser with:
gzip outputfilename.bdg
#copy this file to your desktop using WinSCP (in windows) or sftp within a shell window (in mac OSX)
#open the genome browser, select your genome, click add custom tracks, browse for the file name & hit upload (it will take some time, but should eventually finish). Do the same with your input control .bdg file.
# click add custom tracks again & upload the MACS result file ending with peaks.bed
# if you are comparing ChIP results from more than one condition or with more than one antibody you can upload .bdg and .bed files for each one.
# Now scan along any chromosome & examine the peaks MACS called. Are they believably above background? Make note of the coordinates of believable and non-believable peaks, then open the peaks.xls file made by MACS into Excel on your desktop. Is there a p-value or fold-enrichment threshold that separates most good peaks from most bad peaks? If so, you may want to apply this(these) threshold(s) to create your final list of peaks.
# Note that the height of the browser graphs for each track will be proportional to the number of reads after filtering that MACS used for your treat and control samples. If these differ considerably, you can adjust for this easily enough by just mentally multiplying the axis values for the sample with more reads by the ratio of [reads for the smaller sample]/[reads for the larger sample] (e.g. in the example above, you’d multiply the control sample axis values by 1/3).
# A better approach is to normalize each .bdg entry by dividing it by the number of millions of reads after filtering (e.g. 20 for treat and 60 for control in the example above). This reads-per-million (RPM) normalization is easy enough to do in awk, e.g.:
awk 'OFS="\t" {print $1,$2,$3,$4/20}' treat.bdg > treat_normalized.bdg
#Note: the browser is fine with non-integral values in .bdg files
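#If your .bdg still has a UCSC track definition line at the top (as the trimming script above allows for), a variant that passes that line through unchanged might be:
awk 'OFS="\t" { if ($1 ~ /^track/) print $0; else print $1,$2,$3,$4/20 }' treat.bdg > treat_normalized.bdg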
8a) EXAMINING OVERLAP BETWEEN PEAKS AND OTHER GENOMIC FEATURES, INCLUDING TRANSCRIPTION START SITES OR OTHER CHIP-SEQ PEAKS
#To determine the degree to which two sets of bed regions overlap with each other use:
bsub perl /cluster/home/g/s/gschni01/perl_programs/overlap_1.2.txt File1.bed File2.bed -outfile File1_v_File2.overlap
#the output file will summarize the number of ranges in each input file & the average length of each range & then provide details of all ranges from file1 that overlapped 1bp or more with those in file 2.
#the program will also provide an estimate of the number of overlaps that would be expected by random chance, from the formula: (avg_length_file1+avg_length_file2)*regions_in_file1*regions_in_file2/genomesize
#this provides a reasonable first-pass estimate, which can be used in a binomial test (see section 12 below) structured as follows: hits=overlaps, tests=regions_in_file1, bkg_freq=overlaps_by_chance/regions_in_file1.
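#As a purely hypothetical worked example: with 20,000 peaks averaging 500 bp in file1, 25,000 regions averaging 1,000 bp in file2 & a ~2.5e9 bp genome, overlaps expected by chance = (500+1000)*20,000*25,000/2.5e9 = 300, so the binomial test would use tests=20,000 & bkg_freq=300/20,000=0.015.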
# this background estimate, however, fails to account for the fact that bed regions in each file are non-overlapping, and the fact that more than one bed region in file1 might overlap with a single region from file2. If you have relatively small numbers of not-too-long peaks in file1 & file2 (such that their combined total bp is less than 1% of the genome) this estimate is probably good enough. If not, it is better to establish a background overlap frequency empirically. This can be done by creating a randomly-distributed non-overlapping set of regions of equal length to those in file2 & repeating the overlap of file1 with this file2_background set. To do this use:
perl /cluster/home/g/s/gschni01/overlap_bkg_generator.pl file2.bed random_bkg_for_file2.bed
# Running the overlap program with file1 versus random_bkg_for_file2.bed gives an empirically-determined number of overlaps expected between regions in file1 and randomly distributed same-length regions from file2. If there are more than 5000 regions in file2 this should be pretty accurate. If there are fewer regions than this in file2, you may want to make multiple bkg set files, run the overlap program on each of them & take the average as your background hits.
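#For example, to make five independent background sets (hypothetical output names) and then average the overlap counts across them:
for i in 1 2 3 4 5; do perl /cluster/home/g/s/gschni01/overlap_bkg_generator.pl file2.bed random_bkg_for_file2_$i.bed; done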
#Note, if you want to generate any number of ranges of a given fixed size (e.g. for +/-50 kb from TSSes in the example below), create a raw feed file for overlap_bkg_generator.pl using the following template (where the chromosome number and exact bp positions are irrelevant, the only thing that matters is the distance between start & end).
perl -e 'for($x=1;$x<=[number_of_ranges];$x++){print "chr1\t10000000\t10100000\n";}' > raw_feed_file.bed
Insert a new column using Insert/Cells.
If desired, delete Name column & other extraneous ones.
Add adjusted_p column, multiply p.value by 585 (# of tests=# of motif matrices searched)*
* Alternatively, use the less conservative Benjamini-Hochberg correction. To do so, sort your raw p values from lowest to highest and give each a rank (with lowest=1, next=2, etc.). For each TFBS, adjusted_p = raw_p*585/rank.
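For example (hypothetical values), a TFBS with the 3rd-smallest raw p value of 1e-5 would get adjusted_p = 1e-5*585/3 = ~0.002.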
Next calculate enrichment ratios: enrichment = (fg_counts/fg_bp) / (bkg_counts/bkg_bp), i.e. foreground frequency divided by background frequency, which indicates how great the enrichment/anti-enrichment of each site is.
Once you know the enrichment ratio, sort first by p.value & then by this ratio to set up 2 lists of significant enrichment differences: sites that are significantly overenriched (adjusted p.value < 0.05 & fg/bkg ratio > 1) and sites that are underenriched relative to chance (fg/bkg ratio < 1).
=SUM(IF([range_of_site_counts_per_peak]>95th_percentile_bkg_value_cell,1,)) [hit ctrl shift enter to activate, should be surrounded by {} brackets when you click on the cell again]
#this is the number of ChIP seq peaks with more sites than the 95th percentile value
=INT(number_from_countif-(1-percentrank_value_from_bkg_set)*total_number_of_peaks)
# This is the number of sites in excess of those that would have this characteristic by random chance. If this number is high, especially if it is greater than say ~1% of your total peaks, it suggests that choosing all peaks with sites greater than the 95th percentile value is likely to identify peaks with functional enrichment of that site.
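# Worked example with hypothetical numbers: if the formula above returns 800 peaks above the 95th percentile value and you started with 11,975 peaks, the excess over chance is INT(800-(1-0.95)*11975) = INT(800-598.75) = 201, i.e. ~1.7% of all peaks, which would suggest genuine enrichment of that site.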
If you then perform a descending sort on the chosen column (making sure to sort ALL columns as well) and take those greater than the 95th percentile value, you’ve got a set of putative target locations.
#This analysis will let you identify binding site peaks for your ChIP’ed factor that are highly enriched in consensus motifs for another factor. These can serve as candidate regions to check by ChIP for the presence of the other factor and/or on which to test for the loss of binding of your ChIP’ed factor after knock down or inhibition of the other factor.
15) ALL-BY-ALL OVERLAPS ANALYSIS TO IDENTIFY CO-ENRICHED OR MUTUALLY-EXCLUSIVE TFBS SITES
Create a new column that makes a single-word-identifier for these bed regions using:
=CONCATENATE([chromosome_cell],":",[start_cell],"-",[end_cell])
Copy these putative target IDs and “paste special” as values, in columns, to a separate sheet, then copy this new table (capturing the longest column and including empty cells for shorter columns). If you ran both the f 0.85 and p 0.0005 storm analyses you can repeat the analysis below for each one, or (since many peaks will show site enrichment by both methods) simply combine both lists, removing duplicates.
First, modify the R commands below as follows: replace “45” with the number of columns you have and replace “11975” with the number of original ChIP-seq peaks used for the storm analysis.
Then, in R do: