VCFv4.3 and BCFv2.2 27 Jul 2021

The Variant Call Format Specification

VCFv4.3 and BCFv2.2 22 Aug 2022

The master version of this document can be found at . This printing is version 8073fda from that repository, last modified on the date shown above.

Contents

1 The VCF specification
  1.1 An example
  1.2 Character encoding, non-printable characters and characters with special meaning
  1.3 Data types
  1.4 Meta-information lines
    1.4.1 File format
    1.4.2 Information field format
    1.4.3 Filter field format
    1.4.4 Individual format field format
    1.4.5 Alternative allele field format
    1.4.6 Assembly field format
    1.4.7 Contig field format
    1.4.8 Sample field format
    1.4.9 Pedigree field format
  1.5 Header line syntax
  1.6 Data lines
    1.6.1 Fixed fields
    1.6.2 Genotype fields

2 Understanding the VCF format and the haplotype representation
  2.1 VCF tag naming conventions

3 INFO keys used for structural variants

4 FORMAT keys used for structural variants

5 Representing variation in VCF records
  5.1 Creating VCF entries for SNPs and small indels
    5.1.1 Example 1
    5.1.2 Example 2
    5.1.3 Example 3
  5.2 Decoding VCF entries for SNPs and small indels
    5.2.1 SNP VCF record
    5.2.2 Insertion VCF record
    5.2.3 Deletion VCF record
    5.2.4 Mixed VCF record for a microsatellite
  5.3 Encoding Structural Variants
  5.4 Specifying complex rearrangements with breakends
    5.4.1 Inserted Sequence
    5.4.2 Large Insertions
    5.4.3 Multiple mates
    5.4.4 Explicit partners
    5.4.5 Telomeres
    5.4.6 Event modifiers
    5.4.7 Inversions
    5.4.8 Uncertainty around breakend location
    5.4.9 Single breakends
    5.4.10 Sample mixtures
    5.4.11 Clonal derivation relationships
    5.4.12 Phasing adjacencies in an aneuploid context
  5.5 Representing unspecified alleles and REF-only blocks (gVCF)

6 BCF specification
  6.1 Overall file organization
  6.2 Header
    6.2.1 Dictionary of strings
    6.2.2 Dictionary of contigs
  6.3 BCF2 records
    6.3.1 Site encoding
    6.3.2 Genotype encoding
    6.3.3 Type encoding
  6.4 Encoding a VCF record example
    6.4.1 Encoding CHROM and POS
    6.4.2 Encoding QUAL
    6.4.3 Encoding ID
    6.4.4 Encoding REF/ALT fields
    6.4.5 Encoding FILTER
    6.4.6 Encoding the INFO fields
    6.4.7 Encoding Genotypes
  6.5 BCF2 block gzip and indexing

7 List of changes
  7.1 Changes to VCFv4.3
  7.2 Changes between VCFv4.2 and VCFv4.3
  7.3 Changes between BCFv2.1 and BCFv2.2
  7.4 Changes between VCFv4.1 and VCFv4.2

1 The VCF specification

VCF is a text file format (most likely stored in a compressed manner). It contains meta-information lines (prefixed with "##"), a header line (prefixed with "#"), and data lines, each containing information about a position in the genome and genotype information on samples for that position (text fields separated by tabs). Zero-length fields are not allowed; a dot (".") must be used instead. To ensure interoperability across platforms, VCF-compliant implementations must support both LF ("\n") and CR+LF ("\r\n") newline conventions.
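The line structure described above can be illustrated with a minimal Python sketch. The function name `classify_line` and its return shape are hypothetical choices made for this illustration and are not part of the specification. It only distinguishes the three kinds of line, accepts both newline conventions, and rejects zero-length fields on data lines.

```python
def classify_line(raw: str):
    """Classify one line of VCF text as 'meta', 'header', or 'data'.

    Hypothetical helper for illustration; not defined by the specification.
    """
    # Accept both LF and CR+LF line endings.
    if raw.endswith("\r\n"):
        line = raw[:-2]
    elif raw.endswith("\n"):
        line = raw[:-1]
    else:
        line = raw

    if line.startswith("##"):
        # Meta-information line: return the text after the "##" prefix.
        return "meta", line[2:]
    if line.startswith("#"):
        # Header line: tab-separated column names after the "#" prefix.
        return "header", line[1:].split("\t")

    # Data line: tab-separated fields, none of which may be empty.
    fields = line.split("\t")
    if any(field == "" for field in fields):
        raise ValueError("zero-length field; use '.' for a missing value")
    return "data", fields
```

For instance, `classify_line("20\t14370\t.\r\n")` yields the same fields as the LF-terminated form of that line.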

1.1 An example

##fileformat=VCFv4.3
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=
##contig=
##phasing=partial
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##FILTER=
##FILTER=
##FORMAT=
##FORMAT=
##FORMAT=
##FORMAT=
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3

This example shows (in order): a good simple SNP, a possible SNP that has been filtered out because its quality is below 10, a site at which two alternate alleles are called, with one of them (T) being ancestral (possibly a reference sequencing error), a site that is called monomorphic reference (i.e. with no alternate alleles), and a microsatellite with two alternative alleles, one a deletion of 2 bases (TC), and the other an insertion of one base (T). Genotype data are given for three samples, two of which are phased and the third unphased, with per sample genotype quality, depth and haplotype qualities (the latter only for the phased samples) given as well as the genotypes. The microsatellite calls are unphased.
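To make the reading of the sample columns concrete, here is a small Python sketch that pairs a FORMAT string with one sample field and reports whether the genotype is phased. The helper names `decode_sample` and `allele_strings` are hypothetical. Two assumptions go beyond the text above: a sample field that is shorter than FORMAT (such as 0/0:41:3 under GT:GQ:DP:HQ in the example) is padded with "." for the missing trailing keys, and allele index 0 refers to REF while 1, 2, ... refer to the comma-separated ALT alleles in order.

```python
import re


def decode_sample(format_field: str, sample_field: str) -> dict:
    """Pair FORMAT keys with one sample's values (illustrative helper)."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    # Assumption: omitted trailing values are treated as missing (".").
    values += ["."] * (len(keys) - len(values))
    record = dict(zip(keys, values))

    genotype = record.get("GT", ".")
    # "|" separates phased alleles, "/" separates unphased alleles.
    record["phased"] = "|" in genotype
    record["alleles"] = [
        None if allele == "." else int(allele)
        for allele in re.split(r"[|/]", genotype)
    ]
    return record


def allele_strings(ref: str, alt: str, indices: list) -> list:
    """Map allele indices to bases, assuming 0 = REF and 1.. = ALT order."""
    table = [ref] + ([] if alt == "." else alt.split(","))
    return [None if i is None else table[i] for i in indices]
```

Applied to the microsatellite record, the second sample's field 0/2:17:2 decodes to allele indices 0 and 2, which `allele_strings("GTC", "G,GTCT", [0, 2])` maps to GTC and GTCT.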

1.2 Character encoding, non-printable characters and characters with special meaning

The character encoding of VCF files is UTF-8. UTF-8 is a multi-byte character encoding that is a strict superset of 7-bit ASCII and has the property that none of the bytes in any multi-byte character are 7-bit ASCII bytes. As a result, most software that processes VCF files does not have to be aware of the possible presence of multi-byte UTF-8 characters. VCF files must not contain a byte order mark. Note that the non-printable characters U+0000–U+0008, U+000B–U+000C, and U+000E–U+001F are disallowed. Line separators must be CR+LF or LF, and they are allowed only as line separators at the end of a line. Some characters have a special meaning when they appear (such as the field delimiters ";" in INFO or ":" in FORMAT fields), and for any other meaning they must be represented with the capitalized percent encoding:

%3A  :    (colon)
%3B  ;    (semicolon)
%3D  =    (equal sign)
%25  %    (percent sign)
%2C  ,    (comma)
%0D  CR
%0A  LF
%09  TAB
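The table above can be applied mechanically. The following Python sketch shows one way to do so; the names `vcf_encode` and `vcf_decode` are hypothetical, and the sketch covers only the eight characters listed in the table. Encoding works character by character so that a literal "%" is encoded exactly once, and decoding uses a single regular-expression pass so that "%253A" decodes to "%3A" rather than ":".

```python
import re

# Special characters and their capitalized percent encodings, from the table.
_ENCODE = {
    ":": "%3A",
    ";": "%3B",
    "=": "%3D",
    "%": "%25",
    ",": "%2C",
    "\r": "%0D",
    "\n": "%0A",
    "\t": "%09",
}
_DECODE = {code: char for char, code in _ENCODE.items()}
_CODE_RE = re.compile(r"%(?:3A|3B|3D|25|2C|0D|0A|09)")


def vcf_encode(text: str) -> str:
    """Replace each special character with its percent encoding."""
    return "".join(_ENCODE.get(char, char) for char in text)


def vcf_decode(text: str) -> str:
    """Replace each listed percent encoding with its character, in one pass."""
    return _CODE_RE.sub(lambda match: _DECODE[match.group(0)], text)
```

A value such as `a=b;c` would be written as `a%3Db%3Bc` when the "=" and ";" are meant literally rather than as delimiters.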

1.3 Data types

Data types supported by VCF are: Integer (32-bit, signed), Float (32-bit IEEE-754, formatted to match one of the regular expressions ^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$ or ^[-+]?(INF|INFINITY|NAN)$ case insensitively), Flag, Character, and String. For the Integer type, the values from -2^31 to -2^31 + 7 cannot be stored in the binary version and are therefore disallowed in both VCF and BCF; see section 6.3.3.
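As a rough illustration of these constraints, the Python sketch below checks a text value against the two Float regular expressions and against the permitted Integer range. The function names are hypothetical. The textual form assumed for integers (an optional sign followed by decimal digits) is an assumption of this sketch, since the passage above specifies only the range.

```python
import re

# The two Float patterns quoted in the text, matched case insensitively.
_FLOAT_RE = re.compile(r"^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$", re.IGNORECASE)
_FLOAT_SPECIAL_RE = re.compile(r"^[-+]?(INF|INFINITY|NAN)$", re.IGNORECASE)

# Assumed textual form of an Integer: optional sign, then decimal digits.
_INT_RE = re.compile(r"[-+]?[0-9]+")

# Values from -2^31 to -2^31 + 7 are disallowed, so the minimum is -2^31 + 8.
INT_MIN = -(2**31) + 8
INT_MAX = 2**31 - 1


def is_vcf_float(text: str) -> bool:
    """True if the text matches either Float pattern in its entirety."""
    return bool(_FLOAT_RE.fullmatch(text) or _FLOAT_SPECIAL_RE.fullmatch(text))


def is_vcf_integer(text: str) -> bool:
    """True if the text is a decimal integer within the permitted range."""
    if not _INT_RE.fullmatch(text):
        return False
    return INT_MIN <= int(text) <= INT_MAX
```

Note that under the first pattern a value such as `1.` is not a valid Float, because at least one digit must follow the decimal point, whereas `.5` is accepted.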

1.4 Meta-information lines

File meta-information lines start with "##" and must appear first in the VCF file, before the header line (section 1.5) and data record lines (section 1.6). They may be either unstructured or structured.

An unstructured meta-information line consists of a key (denoting the type of meta-information recorded) and a value (which may not be empty and must not start with a ` ................