University of Colorado Colorado Springs



Notes

Cygwin: free software that provides a Unix-like environment on Windows.

BWA vs. Bowtie2: both are used for indexing and alignment creation; BWA is better for aligning longer reads, while Bowtie2 is faster/better for short reads. BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. Bowtie is an ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of over 25 million 35-bp reads per hour. Bowtie indexes the genome with a Burrows-Wheeler index to keep its memory footprint small: typically about 2.2 GB for the human genome (2.9 GB for paired-end).

Samtools: SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments. SAM aims to be a format that:
- is flexible enough to store all the alignment information generated by various alignment programs;
- is simple enough to be easily generated by alignment programs or converted from existing alignment formats;
- is compact in file size;
- allows most operations on the alignment to work on a stream without loading the whole alignment into memory;
- allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus.
SAM Tools provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.

FASTA: a DNA and protein sequence alignment software package. A major focus of the package is the calculation of accurate similarity statistics, so that biologists can judge whether an alignment is likely to have occurred by chance, or whether it can be used to infer homology (shared ancestry). FASTA takes a given nucleotide or amino acid sequence and searches a corresponding sequence database by using local sequence alignment to find matches of similar database sequences. The FASTA programs find regions of local or global similarity between protein or DNA sequences, either by searching protein or DNA databases, or by identifying local duplications within a sequence.

Paired-end sequencing refers to the fact that the fragments sequenced were sequenced from both ends and not just the one (as was true for first generation sequencing). These fragments are prepared in 'libraries', which are fragmented DNA and which are size selected to about 600-900 bp. This is referred to as the library size. The insert size is the length of the DNA (or RNA) that you want to sequence and that is "inserted" between the adapters (so adapters excluded).

A binary file is a file stored in binary format. A binary file is computer-readable but not human-readable. All executable programs are stored in binary files, as are most numeric data files. In general, executable (ready-to-run) programs are often identified as binary files and given a file name extension of ".bin".

Paired-end sequencing allows users to sequence both ends of a fragment and generate high-quality, alignable sequence data. Paired-end sequencing facilitates detection of genomic rearrangements and repetitive sequence elements, as well as gene fusions and novel transcripts. Since paired-end reads are more likely to align to a reference, the quality of the entire data set improves.

Indexing the reference genome makes querying faster and compresses the size of the genome data. Indexing a genome can be explained by analogy to indexing a book. If you want to know on which page a certain word appears or a chapter begins, it is much more efficient to look it up in a pre-built index than to go through every page of the book until you find it. The same goes for alignments.
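The book-index analogy can be made concrete with a toy example. The sketch below builds a dictionary from every k-mer in a reference string to the positions where it starts, then uses a read's first k-mer to jump straight to candidate positions instead of scanning the whole reference. This is an illustration only: BWA and Bowtie use a Burrows-Wheeler-based index rather than a k-mer dictionary, and the function names and sequences here are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map each k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_positions(index, read, k):
    """Look up the read's first k-mer to get candidate start positions."""
    return index.get(read[:k], [])


def verify(reference, read, positions):
    """Keep only the candidates where the whole read matches exactly."""
    return [p for p in positions if reference[p:p + len(read)] == read]


# Toy data (made up for illustration).
reference = "ACGTACGTTAGCACGTTAGG"
read = "ACGTTAG"

index = build_kmer_index(reference, k=4)
candidates = candidate_positions(index, read, k=4)
hits = verify(reference, read, candidates)
print(candidates, hits)
```

The index is built once and reused for every read, which is where the time saving comes from: each lookup checks a handful of candidates rather than every position in the reference.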
Indices allow the aligner to narrow down the potential origin of a query sequence within the genome, saving both time and memory.

The hg19 genome is the human reference genome, also known as the GRCh37 assembly.

FASTQ format is a text-based format for storing both a biological sequence (usually a nucleotide sequence) and its corresponding quality scores.

Mapping single- and paired-read FASTQ files is used to perform alignment of the index and a set of sequencing read files.

A BAM file (*.bam) is the compressed binary version of a SAM file that is used to represent aligned sequences up to 128 Mb.

A thread, in the context of Java, is the path followed when executing a program.

A shell script is a computer program designed to be run by the Unix shell, a command-line interpreter. Shell scripting is writing a series of commands for the shell to execute.

Filtering a BAM file is used to remove low-quality mapped reads and unmapped reads, among others.

Converting a BAM file to a Hi-C input file is done in the medium file format: a text file describing mapped Hi-C reads that can be used as input to create a .hic file from raw fastq files derived from a Hi-C experiment. A hic format file is a binary file containing contact matrices at different resolutions (number of pixels) and normalized (mediated into a range) by different methods.

Samtools FLAG (-F): removes unmapped reads; the FLAG is a combination of bitwise flags, and -F excludes reads that have any of the given bits set; 0x4 represents segment unmapped. Bit 0x4 is the only reliable place to tell whether the read is unmapped.
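The filtering step can be sketched directly on SAM text lines. The example below is a minimal sketch, not how samtools is implemented: it drops a record if any excluded FLAG bit is set (the same idea as -F 0x4) and keeps it only if its mapping quality reaches a minimum (the same idea as -q). It relies on FLAG being column 2 and MAPQ being column 5 of a SAM record; the function name, the threshold, and the three sample records are made up for illustration.

```python
UNMAPPED = 0x4  # SAM FLAG bit: segment unmapped


def keep_alignment(sam_line, exclude_flags=UNMAPPED, min_mapq=0):
    """Return True if a SAM record passes the FLAG and MAPQ filters."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # column 2: FLAG
    mapq = int(fields[4])  # column 5: MAPQ
    if flag & exclude_flags:
        # At least one excluded bit is set, e.g. the read is unmapped.
        return False
    return mapq >= min_mapq


# Hypothetical records: mapped with high MAPQ, unmapped, mapped with MAPQ 0.
sam_lines = [
    "r1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r3\t16\tchr1\t200\t0\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
]

kept = [line.split("\t")[0] for line in sam_lines
        if keep_alignment(line, min_mapq=2)]
print(kept)
```

Testing the bit with `flag & exclude_flags` rather than comparing the whole FLAG value matters because a FLAG is a sum of several bits; an unmapped read in a pair, for example, has 0x4 set alongside other bits.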
If 0x4 is set, no assumptions can be made about MAPQ, 0x2 (each segment properly aligned according to the aligner), 0x100 (secondary alignment; marks the alignment not to be used in certain analyses when the tools in use are aware of this bit; it is typically used to flag alternative mappings when multiple mappings are presented in a SAM), and 0x800 (supplementary alignment; indicates that the corresponding alignment is part of a chimeric alignment).

Samtools MAPQ (-q): removes low-quality reads. MAPQ is the mapping quality; when reads map to multiple places on the genome, we are not sure where they originated; to improve the quality of our data, we can remove the low-quality reads; generally, select reads with MAPQ>1.

In the SAM format, each alignment line typically represents the linear alignment of a segment. Each line has 11 mandatory fields. Two of the fields are FLAG (Col 2) and MAPQ (Col 5).

2D

Converting mapped Hi-C reads to a hic format file is needed to create a binary hic format file, containing matrices at different resolutions and normalized by different methods, from a text file describing mapped Hi-C reads.

Five formats are acceptable for use in the conversion: short, short with score, medium, long, and 4DN DCIC; readname and strand are not currently stored within .hic files.
- Short: a whitespace-separated file that contains on each line str (strand; 0 for forward, anything else for reverse), chr (chromosome), pos (position), and frag (restriction site fragment) x2 (once per read end).
- Short with score: useful for reading in already processed files (those that have already been binned and/or normalized); can be used in conjunction with the -r flag to create a .hic file that contains a single resolution; each line contains str, chr, pos, and frag x2 plus score (the score imputed to this read).
- Medium: a whitespace-separated file containing str, chr, pos, frag x2 and mapq (mapping quality score) x2.
- Long: used by Juicer (a platform for analyzing kilobase-resolution Hi-C data); takes in the merged_nodups.txt file directly.
- 4DN DCIC: contains a header (##pairs format v1.0) with the first seven columns reserved (#columns: readID, chr and pos x2, strand x2); pairs format is a standard text format for pairs of genomic loci given as 1bp point positions.

A contact matrix is an N x N matrix in which the (i, j)th element tells you whether or not the ith and jth molecules from a set of N molecules are adjacent; additionally, a pair of molecules is adjacent if their centers of mass are within a certain cutoff and the two molecules have the same orientation.

Data from Hi-C is usually summarized by chromosomal contact maps; by binning the genome into equally sized bins, a contact map is a matrix whose elements store the population-averaged colocation frequencies between pairs of loci.

Matrix format: number of interactions between sets of regions.

Hi-C is an unbiased assay of chromatin conformation, resulting in even read coverage across the entire genome. This, coupled with the fact that most Hi-C reads describe interactions at close linear distance along the chromosomes, results in relatively sparse read coverage for interactions between individual restriction fragments separated by great distance. This makes it difficult to find high-resolution interactions across long distances. To help describe how distal and/or inter-chromosomal regions co-localize, two general approaches are utilized:
- Increase the size of the regions [i.e. decrease the resolution] to boost the number of Hi-C reads used for interaction analysis. This is the most common and simplest technique used by most approaches. It's much easier to identify significant inter-chromosomal interactions between regions that are 100kb in size than regions that are only 5kb in size (essentially 20x more Hi-C reads to work with).
- Compare the profiles of interactions that regions participate in when comparing regions.
The best example of this is calculating the correlation coefficient of the interaction profiles between two regions. The idea is that if two regions share interactions with several other regions, they are probably similar themselves (even if they do not have direct interaction evidence between them). This uses the transitive property to link regions. If A interacts with C and D, and B interacts with C and D, perhaps A and B interact as well.

Extracting contact matrices from a hic format file gives a sparse matrix format in a text file (each line represents a contact as three numbers separated by whitespace: <position1> <position2> <interaction_frequency>).

Normalize contact matrices in sparse matrix format: the matrix is normalized by the Iterative Correction and Eigenvector decomposition (ICE) method (a computational pipeline that integrates a strategy to map sequencing reads with a data-driven method for iterative/repetitive correction of biases, yielding genome-wide maps of relative contact probabilities.
Iterative correction leverages the unique pairwise and genome-wide structure of Hi-C data to decompose contact maps into a set of biases and a map of relative contact probabilities between any two genomic loci, achieving equal visibility across all genomic regions).

Visualizing a dataset in 2D format is used to create 2D graphical representations (heatmaps) of a contact matrix from an input file (sparse matrix format).

A heatmap is a graphical representation of contact data where numeric values in the input contact matrix are represented as colors according to a selected color gradient.

TANH: the type of data displayed on the heatmap; calculates the hyperbolic tangent of the pixels and has a defined range.

TAD: topologically associating domain; a self-interacting genomic region, meaning that DNA sequences within a TAD physically interact with each other more frequently than with sequences outside the TAD.

The Display Multiple TADs function allows TADs from different methods to be overlapped on the same display window; useful for comparing TADs of different methods for a dataset.

Square matrix: a full matrix representing all the contact regions.

Identify TAD gives the TAD with the best quality in BED format.

BED (Browser Extensible Data) format provides a flexible way to define the data lines that are displayed in an annotation track. BED lines have three required fields: chrom, chromStart, and chromEnd.

ClusterTAD: the default algorithm used for TAD identification from the input contact matrix.

Check TAD consistency gives a report of 4 cases:
1. The number of exact TADs found in both method 1 and method 2
2. The number of sub-TADs that exist between method 1 and method 2
3. The number of conflicting TADs
4. The number of TADs in method 1 but not found in method 2

3D

LorDG: a restraint-based method that is capable of reconstructing 3D genome structures utilizing both intra- and inter-chromosomal contact data.
The LorDG method is robust to noise and performed well in comparison with a panel of existing methods on a controlled simulated data set. On a real Hi-C data set of the human genome, this method produced chromosome and genome structures that are consistent with 3D FISH data and existing knowledge about the human chromosomes and genome, such as chromosome territories and the cluster of small chromosomes in the nucleus center, with the exception of chromosome 18.

Conversion factor: α in the formula d_ij = 1 / (IF_ij)^α, where IF_ij is the interaction frequency between regions i and j and d_ij is the corresponding distance. During reconstruction, the conversion factor being used to build the model and the current value of the objective function (higher is better) are displayed; after reconstruction is finished, the score is displayed (the lower the value, the better the model is).

3DMax: reconstructs the 3D structure of chromosomes from Hi-C data. 3DMax combines the maximum likelihood inference approach with a gradient ascent method. It constructs optimized structures for chromosomes. This tool transforms contact data of a chromosome or genome into an ensemble of probable 3D conformations. It permits the approximation of the dynamic 3D genome structures of a population of cells.

Genome scale system (gss) file format: allows for the visualization of a genome at multiple scales instead of having to view each PDB structure file individually.

Protein data bank (pdb) file format: a textual file format describing the three-dimensional structures of molecules.
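The conversion-factor formula noted above, d_ij = 1 / (IF_ij)^α, can be tried out on a few numbers. The sketch below is only a worked illustration of that formula, not code from LorDG or 3DMax; the function name, the dictionary layout, and the sample frequencies are assumptions. Pairs with zero interaction frequency are skipped, since the formula is undefined for them.

```python
def contacts_to_distances(contacts, alpha):
    """Convert interaction frequencies to distances with d_ij = 1 / IF_ij ** alpha.

    `contacts` maps (i, j) index pairs to interaction frequencies.
    Pairs with a frequency of zero are left out of the result.
    """
    distances = {}
    for (i, j), freq in contacts.items():
        if freq > 0:
            distances[(i, j)] = 1.0 / (freq ** alpha)
    return distances


# Made-up interaction frequencies between three regions.
contacts = {(0, 1): 16.0, (0, 2): 4.0, (1, 2): 0.0}
distances = contacts_to_distances(contacts, alpha=0.5)
print(distances)
```

With α = 0.5, a frequency of 16 gives a distance of 1/4 and a frequency of 4 gives 1/2: regions that interact more often are placed closer together, and a larger α makes distances fall off faster as frequency grows.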
The .pdb format is a standard for files containing atomic coordinates.

Chromatin loops in 3D models: a chromatin loop occurs when stretches of genomic sequence that lie on the same chromosome (configured in cis) are in closer physical proximity to each other than to intervening sequences; deals with gene expression and regulatory elements.

Model annotation is used to annotate 3D models with genomic elements.

Gene expression data visualization is a special case of model annotation that displays gene expression level along a 3D model.

The GCT file format is a tab-delimited file format that describes an expression dataset. It allows missing expression …

… compare 2 3D models in gss format to be superimposed, scaled, and visualized; Spearman's correlation and RMSE (root-mean-square error; a frequently used measure of the differences between values (sample or population values) predicted by a model or an estimator and the values observed) between the pairwise distances of the 2 models are calculated.

Questions

Why is UNIX/LINUX or MacOS needed instead of Windows? Because the tools for indexing are programmed for UNIX; UNIX gives more flexibility than Windows; the Windows processor count is lower (8 vs. 20), so it is slower.

Which resolution is better? The lower the resolution value, the higher the resolution (1 Mb vs. 40 kb: 40 kb is the higher resolution).

What normalization methods are used? ICE, KR normalization, square root…
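For the ICE entry in the last answer, the iterative-correction idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the published ICE pipeline: it works on a small dense symmetric matrix instead of the sparse text format, it does no filtering of low-coverage bins, and the function name and sample matrix are hypothetical. Each pass divides every entry by the relative coverage of its row and column, so that after enough passes all non-empty bins have the same total ("equal visibility").

```python
def iterative_correction(matrix, iterations=50):
    """Balance a symmetric contact matrix so all non-empty rows have equal sums.

    Returns the corrected matrix and the accumulated per-bin bias factors.
    """
    n = len(matrix)
    corrected = [row[:] for row in matrix]  # work on a copy
    bias = [1.0] * n
    for _ in range(iterations):
        row_sums = [sum(row) for row in corrected]
        nonzero = [s for s in row_sums if s > 0]
        if not nonzero:
            break  # nothing to balance in an all-zero matrix
        mean_sum = sum(nonzero) / len(nonzero)
        # Bins with above-average coverage get a factor > 1; empty bins are left alone.
        factors = [s / mean_sum if s > 0 else 1.0 for s in row_sums]
        for i in range(n):
            for j in range(n):
                corrected[i][j] /= factors[i] * factors[j]
        bias = [b * f for b, f in zip(bias, factors)]
    return corrected, bias


# Made-up 2 x 2 contact matrix in which bin 0 is over-represented.
raw = [[4.0, 1.0], [1.0, 1.0]]
balanced, bias = iterative_correction(raw)
row_sums = [sum(row) for row in balanced]
print(row_sums, bias)
```

The returned bias list is the "set of biases" part of the decomposition: each corrected entry equals the raw entry divided by the product of the two bins' bias factors.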