ISO TC 46/SC 4 N 595

Date:   2006-02-6

ISO/WD XXXXX

ISO TC 46/SC 4/WG

Secretariat:   ANSI

Information and documentation — The WARC File Format


Warning

This document is not an ISO International Standard. It is distributed for review and comment. It is subject to change without notice and may not be referred to as an International Standard.

Recipients of this draft are invited to submit, with their comments, notification of any relevant patent rights of which they are aware and to provide supporting documentation.

Copyright notice

This ISO document is a working draft or committee draft and is copyright-protected by ISO. While the reproduction of working drafts or committee drafts in any form for use by participants in the ISO standards development process is permitted without prior permission from ISO, neither this document nor any extract from it may be reproduced, stored or transmitted in any form for any other purpose without prior written permission from ISO.

Requests for permission to reproduce this document for the purpose of selling it should be addressed as shown below or to ISO's member body in the country of the requester:

[Indicate the full address, telephone number, fax number, telex number, and electronic mail address, as appropriate, of the Copyright Manager of the ISO member body responsible for the secretariat of the TC or SC within the framework of which the working document has been prepared.]

Reproduction for sales purposes may be subject to royalty payments or a licensing agreement.

Violators may be prosecuted.

Contents

1 Scope

2 Normative references

3 Terms, definitions and acronyms

3.1 Terms and definitions

3.1.1 WARC record

3.1.2 WARC record content block

3.1.3 WARC record payload

3.1.4 WARC record header

3.1.5 WARC named fields

3.1.6 WARC logical record

3.2 Acronyms

4 File and record model

5 Named fields

5.1 General

5.2 WARC-Record-ID (mandatory)

5.3 Content-Length (mandatory)

5.4 WARC-Date (mandatory)

5.5 WARC-Type (mandatory)

5.6 Content-Type

5.7 WARC-Concurrent-To

5.8 WARC-Block-Digest

5.9 WARC-Payload-Digest

5.10 WARC-IP-Address

5.11 WARC-Refers-To

5.12 WARC-Target-URI

5.13 WARC-Truncated

5.14 WARC-Warcinfo-ID

5.15 WARC-Filename

5.16 WARC-Profile

5.17 WARC-Identified-Payload-Type

5.18 WARC-Segment-Number

5.19 WARC-Segment-Origin-ID

5.20 WARC-Segment-Total-Length

6 WARC Record Types

6.1 General

6.2 'warcinfo'

6.3 'response'

6.3.1 General

6.3.2 for 'http' and 'https' schemes

6.3.3 for other URI schemes

6.4 'resource'

6.4.1 General

6.4.2 for 'http' and 'https' schemes

6.4.3 for 'ftp' scheme

6.4.4 for 'dns' scheme

6.4.5 for other URI schemes

6.5 'request'

6.5.1 General

6.5.2 for 'http' and 'https' schemes

6.5.3 for other URI schemes

6.6 'metadata'

6.7 'revisit'

6.7.1 General

6.7.2 Profile: Identical Payload Digest

6.7.3 Profile: Server Not Modified

6.7.4 Other profiles

6.8 'conversion'

6.9 'continuation'

7 Record segmentation

8 Registration of MIME media types application/warc and application/warc-fields

8.1 General

8.2 application/warc

8.3 application/warc-fields

9 IANA considerations

Annex A (informative) Compression recommendations

A.1 General

A.2 Record-at-time compression

A.3 GZIP WARC file name suffix

Annex B (informative) WARC file size and name recommendations

Annex C (informative) Examples of WARC records

C.1 Example of 'warcinfo' record

C.2 Example of 'request' record

C.3 Example of 'response' record

C.4 Example of 'resource' record

C.5 Example of 'metadata' record

C.6 Example of 'revisit' record

C.7 Example of 'conversion' record

C.8 Example of segmentation ('continuation' record)

Annex D (informative) Use cases for writing WARC records

Foreword

ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies (ISO member bodies). The work of preparing International Standards is normally carried out through ISO technical committees. Each member body interested in a subject for which a technical committee has been established has the right to be represented on that committee. International organizations, governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.

International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.

The main task of technical committees is to prepare International Standards. Draft International Standards adopted by the technical committees are circulated to the member bodies for voting. Publication as an International Standard requires approval by at least 75 % of the member bodies casting a vote.

Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. ISO shall not be held responsible for identifying any or all such patent rights.

ISO/WD XXXXX was prepared by Technical Committee ISO/TC 46, Information and documentation, Subcommittee SC 4, Technical interoperability. It is derived from a working specification created in the context of an open-source software project and previously published in a series of drafts to prepare for publication as an Internet RFC.

Introduction

Web sites and web pages emerge and disappear from the World Wide Web every day. For the past ten years, memory organizations have tried to find the most appropriate ways to collect and keep track of this vast quantity of important material using web-scale tools such as web crawlers. A web crawler is a program that browses the web in an automated manner according to a set of policies: starting with a list of URLs, it saves each page identified by a URL, finds all the hyperlinks in the page (e.g. links to other pages, images, videos, scripting or style instructions), and adds them to the list of URLs to visit recursively. Storing and managing the billions of saved web page objects itself presents a challenge.

At the same time, those same organizations have a rising need to archive large numbers of digital files not necessarily captured from the web (e.g., entire series of electronic journals, or data generated by environmental sensing equipment). A general requirement that appears to be emerging is for a container format that permits one file simply and safely to carry a very large number of constituent data objects for the purpose of storage, management, and exchange. Those data objects (or resources) must be of unrestricted type (including many binary types for audio, CAD, compressed files, etc.), but fortunately the container needs only minimal knowledge of the nature of the objects.

The WARC (Web ARChive) file format offers a convention for concatenating multiple resource records (data objects), each consisting of a set of simple text headers and an arbitrary data block, into one long file. The WARC format is an extension of the ARC File Format [ARC] that has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web. Each capture in an ARC file is preceded by a one-line header that very briefly describes the harvested content and its length. This is directly followed by the retrieval protocol response messages and content. The original ARC file format has been used by the Internet Archive (IA) since 1996 for managing billions of objects, and by several national libraries.

The motivation to extend the ARC format arose from the discussion and experiences of the International Internet Preservation Consortium (IIPC) [IIPC], whose members include the national libraries of Australia, Canada, Denmark, Finland, France, Iceland, Italy, Norway, Sweden, The British Library (UK), The Library of Congress (USA), and the Internet Archive (IA). The California Digital Library and the Los Alamos National Laboratory also provided input on extending and generalizing the format.

The WARC format is expected to be a standard way to structure, manage and store billions of resources collected from the web and elsewhere. It will be used to build applications for harvesting (such as the open-source Heritrix [HERITRIX] web crawler), managing, accessing, and exchanging content. How WARC files are created, and how resources are stored and rendered, will depend on the software and application implementations.

The files that make up a website harvested from the Internet are carried as the payloads of WARC records in WARC files. However, the different pieces of the same website are not necessarily contained in the same WARC file.

To render an archived website for future users, access software may therefore need to request content from several WARC files. External indexes are recommended for quicker access to the archives.

Besides the primary content recorded in ARCs, the extended WARC format accommodates related secondary content, such as assigned metadata, abbreviated duplicate detection events, later-date transformations, and segmentation of large resources. The extension may also be useful for more general applications than web archiving. To aid the development of tools that are backwards compatible, WARC content is clearly distinguishable from pre-revision ARC content.

The WARC file format is made sufficiently different from the legacy ARC format files so that software tools can unambiguously detect and correctly process both WARC and ARC records; given the large amount of existing archival data in the previous ARC format, it is important that access and use of this legacy not be interrupted when transitioning to the WARC format.

Information and documentation — The WARC File Format

Scope

This International Standard specifies the Web ARChive (WARC) file format, which provides the following:

- ability to store both the payload content and control information from mainstream Internet application layer protocols, such as HTTP, FTP, NNTP, and SMTP;

- ability to store arbitrary metadata linked to other stored data (e.g. subject classifier, discovered language, encoding);

- support for data compression and maintenance of data record integrity;

- ability to store all control information from the harvesting protocol (e.g. request headers), not just response information;

- ability to store the results of data transformations linked to other stored data;

- ability to store a duplicate detection event linked to other stored data (to reduce storage in the presence of identical or substantially similar resources);

- ability to be extended without disruption to existing functionality;

- ability to store globally unique record identifiers;

- support for deterministic handling of long records (e.g. truncation, segmentation).

The WARC file format is sufficiently different from the legacy ARC format files that software tools can unambiguously detect and correctly process both WARC and ARC records; given the large amount of existing archival data in the previous ARC format, it is important that access and use of this legacy not be interrupted when transitioning to the WARC format.
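As an informal illustration of how a tool might tell the two formats apart, the following Python sketch inspects the first octets of an uncompressed file. It relies on the record version ("WARC/") beginning every WARC record; the "filedesc://" test for legacy ARC files is an assumption based on the customary first record of an ARC file, and the function name is hypothetical.

```python
def sniff_archive_format(first_bytes: bytes) -> str:
    """Guess the container format from the first octets of an uncompressed file."""
    # Every WARC record, and hence every WARC file, begins with the version.
    if first_bytes.startswith(b"WARC/"):
        return "warc"
    # Assumption: a legacy ARC file opens with a 'filedesc://' version block.
    if first_bytes.startswith(b"filedesc://"):
        return "arc"
    return "unknown"
```

A reader would typically call this on the first few octets after any decompression and then dispatch to the matching parser.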

Normative references

The following referenced documents are indispensable for the application of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies.

[ARC] Burner, M. and B. Kahle, “The ARC File Format,” September 1996.

[HERITRIX] “Heritrix Open Source Archival Web Crawler.”

[IIPC] “International Internet Preservation Consortium (IIPC).”

[W3CDTF] “Date and Time Formats (W3C profile of ISO8601).”

[DCMI] “DCMI Metadata Terms.”

[RFC1035] Mockapetris, P., “Domain names - implementation and specification,” STD 13, RFC 1035, November 1987.

[RFC1884] Hinden, R. and S. Deering, “IP Version 6 Addressing Architecture,” RFC 1884, December 1995.

[RFC1950] Deutsch, L. and J-L. Gailly, “ZLIB Compressed Data Format Specification version 3.3,” RFC 1950, May 1996.

[RFC1951] Deutsch, P., “DEFLATE Compressed Data Format Specification version 1.3,” RFC 1951, May 1996.

[RFC1952] Deutsch, P., Gailly, J-L., Adler, M., Deutsch, L., and G. Randers-Pehrson, “GZIP file format specification version 4.3,” RFC 1952, May 1996.

[RFC2045] Freed, N. and N. Borenstein, “Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies,” RFC 2045, November 1996.

[RFC2047] Moore, K., “MIME (Multipurpose Internet Mail Extensions) Part Three: Message Header Extensions for Non-ASCII Text,” RFC 2047, November 1996.

[RFC2048] Freed, N., Klensin, J., and J. Postel, “Multipurpose Internet Mail Extensions (MIME) Part Four: Registration Procedures,” BCP 13, RFC 2048, November 1996.

[RFC2119] Bradner, S., “Key words for use in RFCs to Indicate Requirement Levels,” BCP 14, RFC 2119, March 1997.

[RFC2540] Eastlake, D., “Detached Domain Name System (DNS) Information,” RFC 2540, March 1999.

[RFC2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P., and T. Berners-Lee, “Hypertext Transfer Protocol -- HTTP/1.1,” RFC 2616, June 1999.

[RFC2822] Resnick, P., “Internet Message Format,” RFC 2822, April 2001.

[RFC3548] Josefsson, S., “The Base16, Base32, and Base64 Data Encodings,” RFC 3548, July 2003.

[RFC3629] Yergeau, F., “UTF-8, a transformation format of ISO 10646,” STD 63, RFC 3629, November 2003.

[RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, “Uniform Resource Identifier (URI): Generic Syntax,” STD 66, RFC 3986, January 2005.

[RFC4027] Josefsson, S., “Domain Name System Media Types,” RFC 4027, April 2005.

[RFC4501] Josefsson, S., “Domain Name System Uniform Resource Identifiers,” RFC 4501, May 2006.

[RDF] “Resource Description Framework (RDF)” (HTML).

[RFC0822] Crocker, D., “Standard for the format of ARPA Internet text messages,” STD 11, RFC 822, August 1982.

[RFC1035] Mockapetris, P., “Domain names - implementation and specification,” STD 13, RFC 1035, November 1987.

[RFC1884] Hinden, R. and S. Deering, “IP Version 6 Addressing Architecture,” RFC 1884, December 1995.

[RFC1950] Deutsch, L. and J-L. Gailly, “ZLIB Compressed Data Format Specification version 3.3,” RFC 1950, May 1996. (TXT) ; (PS) ; (PDF) 

[RFC1951] Deutsch, P., “DEFLATE Compressed Data Format Specification version 1.3,” RFC 1951, May 1996. (TXT) ; (PS) ; (PDF) 

[RFC1952] Deutsch, P., Gailly, J-L., Adler, M., Deutsch, L., and G. Randers-Pehrson, “GZIP file format specification version 4.3,” RFC 1952, May 1996. (TXT) ; (PS) ; (PDF)

[RFC2045] Freed, N. and N. Borenstein, “Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies,” RFC 2045, November 1996.

[RFC2048] Freed, N., Klensin, J., and J. Postel, “Multipurpose Internet Mail Extensions (MIME) Part Four: Registration Procedures,” BCP 13, RFC 2048, November 1996. (TXT) (HTML) ; (XML) 

[RFC2141] Moats, R., “URN Syntax,” RFC 2141, May 1997. (TXT) ; (HTML) ; (XML) 

[RFC2234] Crocker, D., Ed. and P. Overell, “Augmented BNF for Syntax Specifications: ABNF,” RFC 2234, November 1997. (TXT) ; (HTML) ; (XML) 

[RFC2396] Berners-Lee, T., Fielding, R., and L. Masinter, “Uniform Resource Identifiers (URI): Generic Syntax,” RFC 2396, August 1998. (TXT) ; (HTML) ; (XML) 

[RFC2540] Eastlake, D., “Detached Domain Name System (DNS) Information,” RFC 2540, March 1999.

[RFC2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P., and T. Berners-Lee, “Hypertext Transfer Protocol -- HTTP/1.1,” RFC 2616, June 1999. (TXT) ; (HTML) ; (XML)

[RFC4027] Josefsson, S., “Domain Name System Media Types,” RFC 4027, April 2005.

[RFC4501] Josefsson, S., “Domain Name System Uniform Resource Identifiers,” RFC 4501, May 2006.

Terms, definitions and acronyms

1 Terms and definitions

For the purposes of this document, the following terms and definitions apply.

1 WARC record

Basic constituent of a WARC file, which consists of a sequence of WARC records.

2 WARC record content block

Part (zero or more octets) of a WARC record that follows the header and forms the main body of the record.

3 WARC record payload

Data object referred to, or contained by, a WARC record as a meaningful subset of the content block.

4 WARC record header

Beginning of a WARC record, consisting of a first line declaring the record to be in the WARC format with a given version number, followed by lines of named fields up to a blank line.

5 WARC named fields

Set of elements consisting of a name, a colon, and a value, with long values continued on indented lines.

6 WARC logical record

In the context of segmentation, a logical record may be composed of multiple segments, each represented by a WARC record.

2 Acronyms

ABNF Augmented Backus-Naur Form

ARC ARChive

CRLF Carriage Return Line Feed

HTTP Hypertext Transfer Protocol

IANA Internet Assigned Numbers Authority

IESG Internet Engineering Steering Group

IETF Internet Engineering Task Force

IIPC International Internet Preservation Consortium

IA Internet Archive

RFC Request for Comments

UR(I/L/N) Uniform Resource (Identifier/Locator/Name)

WARC Web ARChive

File and record model

A WARC format file is the simple concatenation of one or more WARC records. The first record usually describes the records to follow. In general, record content is either the direct result of a retrieval attempt — web pages, inline images, URL redirection information, DNS hostname lookup results, standalone files, etc. — or is synthesized material (e.g., metadata, transformed content) that provides additional information about archived content.

A WARC record consists of a record header followed by a record content block and two newlines (CRLF CRLF). The WARC record header consists of a first line declaring the record to be in the WARC format with a given version number, then a variable number of line-oriented named fields terminated by a blank line. With one major exception, the allowance of UTF-8 [RFC3629], the WARC record header format largely follows the tradition of HTTP/1.1 [RFC2616] and Internet Message Format [RFC2822] headers.

The top-level view of a WARC file can be expressed in an augmented Backus-Naur Form (BNF) grammar, reusing the augmented constructs defined in section 2.1 of HTTP/1.1 [RFC2616]. (In particular, note that to avoid the risk of confusion, where any WARC rule has the same name as an [RFC2616] rule, the definition here has been made the same, except in the case of the CHAR rule, which in WARC includes multibyte UTF-8 characters.)

warc-file = 1*warc-record

warc-record = header CRLF block CRLF CRLF

header = version warc-fields

version = "WARC/0.17" CRLF

warc-fields = *named-field CRLF

block = *OCTET

The record version appears first in every record and hence also begins the WARC file itself.

The WARC record relies heavily on named fields. Each named field consists of a name followed by a colon (":") and the field value. Field names are case-insensitive. The field value may be preceded by any amount of linear whitespace (LWS), though a single space is preferred. Header fields can be extended over multiple lines by preceding each extra line with at least one space or tab character.

Named fields may appear in any order and field values may contain any UTF-8 character. Both defined-fields and extension-fields follow the generic named-field format. Extension fields may be used in extensions of the core format.
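The named-field rules above (case-insensitive names, optional leading whitespace before the value, continuation lines that begin with a space or tab) can be illustrated with a minimal Python sketch. The function name `parse_named_fields` and the `X-Note` extension field in the usage example are hypothetical and not part of this specification; joining continuation lines with a single space is an assumption based on the usual treatment of linear whitespace.

```python
def parse_named_fields(header_text):
    """Parse named-field lines into a list of (lower-cased name, value) pairs.

    Illustrative sketch only: input is the text following the version line,
    with CRLF line endings, ending at the first blank line.
    """
    fields = []
    for line in header_text.split("\r\n"):
        if not line:
            break  # a blank line terminates the named fields
        if line[0] in " \t" and fields:
            # Continuation line: fold it into the previous field's value.
            name, value = fields[-1]
            fields[-1] = (name, value + " " + line.strip())
            continue
        name, sep, value = line.partition(":")
        if not sep:
            raise ValueError("named field without a colon: %r" % line)
        # Field names are case-insensitive, so normalise them to lower case.
        fields.append((name.strip().lower(), value.strip()))
    return fields
```

Because only the first colon separates name from value, field values that themselves contain colons (such as URIs) are preserved intact. A list of pairs is returned rather than a dictionary so that field order and any repeated names are kept.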

named-field = field-name ":" [ field-value ]

field-name = token

field-value = *( field-content | LWS ) ; further qualified

; by field definitions

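field-content = <the OCTETs making up the field-value and consisting of either *TEXT or combinations of token, separators, and quoted-string>

OCTET = <any 8-bit sequence of data>

token = 1*<any CHAR except CTLs or separators>

separators = "(" | ")" | "<" | ">" | "@"

| "," | ";" | ":" | "\" | <">

| "/" | "[" | "]" | "?" | "="

| "{" | "}" | SP | HT

quoted-string = ( <"> *(qdtext | quoted-pair ) <"> )

quoted-pair = "\" CHAR ; single-character quoting

uri = "<" <'URI' per RFC3986> ">"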

Although UTF-8 characters are allowed, the 'encoded-word' mechanism of [RFC2047] may also be used when writing WARC fields and shall also be understood by WARC reading software.

The rest of the WARC record grammar concerns defined-field parameters such as record identifier, record type, creation time, content length, and content type.

defined-field = WARC-Type

| WARC-Record-ID

| WARC-Date

| Content-Length

| Content-Type

| WARC-Concurrent-To

| WARC-Block-Digest

| WARC-Payload-Digest

| WARC-IP-Address

| WARC-Refers-To

| WARC-Target-URI

| WARC-Truncated

| WARC-Warcinfo-ID

| WARC-Filename ; warcinfo only

| WARC-Profile ; revisit only

| WARC-Identified-Payload-Type

| WARC-Segment-Origin-ID ; continuation only

| WARC-Segment-Number

| WARC-Segment-Total-Length ; continuation only

Every WARC record has a type, reported in the WARC-Type field. There are eight WARC record types: 'warcinfo', 'response', 'resource', 'request', 'metadata', 'revisit', 'conversion', and 'continuation'. The relevant fields for each record type are further described in WARC Record Types. Each field's meaning and legal value format are described in Named Fields.

The record block contains octet content interpreted based on the record type and other header values. All records shall include a Content-Length field to specify the length of the block.
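As an illustration of how the Content-Length field lets a reader locate the block and the next record boundary, the following minimal Python sketch reads one record from a binary stream. The function name `read_record` is hypothetical, and the sketch assumes well-formed input without continuation lines; it is not a complete or normative reader.

```python
import io


def read_record(stream):
    """Read one record: version line, named fields, block, trailing CRLF CRLF.

    Illustrative sketch only; continuation lines are not handled.
    """
    version = stream.readline().rstrip(b"\r\n").decode("utf-8")
    if not version.startswith("WARC/"):
        raise ValueError("not at a record boundary: %r" % version)
    fields = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b""):
            break  # the blank line ends the header
        name, _, value = line.decode("utf-8").partition(":")
        fields[name.strip().lower()] = value.strip()
    length = int(fields["content-length"])  # required on every record
    block = stream.read(length)
    if len(block) != length:
        raise ValueError("block is shorter than Content-Length")
    stream.read(4)  # consume the two record-ending CRLFs
    return version, fields, block
```

Calling `read_record` repeatedly on the same stream walks the file record by record, since each call leaves the stream positioned at the start of the next version line.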

Some record types (and possibly future record types) also define a payload, such as a meaningful subset of the block or content from a predecessor record. Some headers pertain to the payload of a record rather than the block directly.

For example, in a 'response' record with a content block consisting of HTTP headers and a data object, the payload would be the data object. All 'response', 'resource', 'conversion' and 'continuation' records may have a payload. 'warcinfo', 'request', 'metadata' and 'revisit' records shall not have a payload.

Content matching the warc-file rule has the MIME content-type "application/warc", registered below in section 7.1.

Content matching only the warc-fields rule is useful as a simple descriptive format, and has MIME content-type "application/warc-fields", registered below in section 7.2.

1 General

A WARC format file is the simple concatenation of one or more WARC records. A record consists of a record header followed by a record content block and two new lines. The WARC record header declares baseline identifying information about the current record, and allows additional per-record information. It consists of one first line of required positional fields, then a variable number of lines of named fields. Each record's content block contains zero or more bytes of data, interpreted according to the record type and any preceding headers.

Newlines are represented by CRLF, as per other internet standards. PRECISE WHICH ONE. The WARC record can be expressed in the following IETF Augmented Backus-Naur Form (ABNF) grammar specified in [RFC2234]. (All-caps "core" elements are as defined in RFC2234.)

warc-file = 1*warc-record

warc-record = header block CRLF CRLF

header = header-line CRLF *named-field CRLF

block = *OCTET

Elements of the ABNF grammar are further specified and explained in clauses 4 and 6.

2 Header-line

The record header-line is a newline-terminated sequence of whitespace-delimited text tokens representing parameters common to every record (whether or not captured from the web). Token order is significant.

header-line = warc-id vwsp data-length vwsp creation-date vwsp

record-id vwsp segment-status vwsp

vwsp = 1*SPACE

header-line = warc-id tsp data-length tsp record-type tsp subject-uri tsp creation-date tsp timestamp tsp content-type tsp record-id

tsp = 1*WSP

The amount of whitespace between header-line tokens is variable. This gives archive builders the flexibility to add padding and later adjust pre-written header parameters when final values are only completely known after the record content block has been written.

warc-id

A fixed pattern, "warc/JJ.NN", that appears first in every record and hence begins the WARC file itself. The pattern specifies the major (JJ) and minor (NN) version numbers of the WARC specification to which the WARC record conforms; JJ and NN are fixed two-digit strings, zero-padded on the left. This document specifies version 00.13, meaning a fixed pattern of,

warc/00.13

starts each record. The major and minor version numbers are both given as exactly two digits. They serve to identify the file format and version to outside inspection, and to assist error recovery when a process reading a WARC file fails to find the next record boundary where expected.

Occurrences of this string are not definitively the same as record boundaries, since the string may by chance occur inside a record. However, its fixed length and form may make it useful to locate such strings when attempting to recover from file corruption that has rendered one or more data-length parameters unreliable.

FUTURE The warc-id string may change in future versions, but should always begin "warc/", continue with version numbers, and end at whitespace.

data-length

The combined length of the header and block sections of this record, in octets, starting with the first letter ("w") of the first token, through to the end of the content block, but not including the two record-ending CRLF newlines. After proceeding this many octets from that first character of the record header, there should be two CRLF newlines and either the beginning of a new record or the end of the file. (WARC reading implementations may choose to tolerate more or fewer CRLF newlines at the end of a record.) The data-length is the most important header parameter for efficient bulk processing, which may need to skip entire records without parsing their contents.

If the next token does not match the first token of a WARC record ("warc/JJ.NN"), then the previous data-length should be considered in error; corrective action might include searching for a nearby occurrence of "warc/JJ.NN" and other character patterns indicative of a legal record beginning.
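The corrective search described above might be sketched in Python as follows. The helper name `next_candidate_boundary` and the exact regular expression are assumptions made for illustration; a match is only a candidate boundary, since the same string may occur by chance inside a content block.

```python
import re

# Assumed pattern: "warc/" followed by two-digit major and minor versions,
# then whitespace, matching the "warc/JJ.NN" form of the warc-id token.
WARC_ID = re.compile(rb"warc/[0-9]{2}\.[0-9]{2}[ \t]")


def next_candidate_boundary(data, start=0):
    """Return the offset of the next possible record start, or -1 if none."""
    match = WARC_ID.search(data, start)
    return match.start() if match else -1
```

A recovering reader would typically validate each candidate further, for example by checking that the remaining header-line tokens parse and that the declared data-length leads to another plausible record start.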

record-type

The type of WARC record. All record types are optional, though starting all WARC files with a 'warcinfo' record is recommended. Record types are defined in clause 5.

FIND ANOTHER NAME THAN SUBJECT (WHICH IS TOO MEANINGFUL FOR LIBRARIES) ? subject-uri

The original URI whose collection gave rise to the information content in this record. In the context of web harvesting, this is the URI that was the target of a crawler's retrieval request. Indirectly, such as for a 'revisit', 'metadata', or 'conversion' record, it is a copy of the subject-uri appearing in the original record to which the newer record pertains. For a 'warcinfo' record, this parameter is given a synthesized value for the creation name of the WARC file, as a URI.

The URI in this value should be properly escaped according to [RFC2396] and written with no internal whitespace.

creation-date

A 14-digit timestamp in the format YYYYMMDDhhmmss representing the GMT time when record creation began. Multiple records written as part of a single collection action may share the same creation-date, even though the times of their writing will not be exactly synchronized.

content-type

The MIME type [RFC2045] of the information contained in the record's content block. (Type and subtype only.) For content in HTTP request and response records, this should be "message/http"; in particular, it is not the content-type of any HTTP content body.

The content-type in this value should be written with no internal whitespace.

record-id

An identifier assigned to the record that is globally unique for its period of intended use. No identifier scheme is mandated by this specification, but each record-id should be a legal URI and clearly indicate a documented and registered scheme to which it conforms (e.g., via a URI scheme prefix such as "http:"). The record-id is a strong feature of the WARC in that it allows unique record reference (e.g., from other WARC records and from search indexes); on those occasions when unique reference is not important, the record-id may be specified as a hyphen (“-“). The record-id in this value should always be written with no internal whitespace.

segment-status

A token of the form CM, where C is a letter representing a segment code and M is a positive integer representing a segment number. Segment numbering starts with 1, and every logical record is considered to have at least one segment. A record is considered to end when its final segment is encountered with a segment code C different from “p”. Defined values for C are:

p partial – this segment (numbered M) is part of a still incomplete record

w whole -- this segment (M) ends the record, which is complete

t truncated – this segment (M) ends the record, truncated due to a time constraint

z truncated – this segment (M) ends the record, truncated due to a size constraint

x truncated – this segment (M) ends the record, truncated for an unspecified reason

The most common segment-status is “w1”, meaning the logical record is wholly contained in its first and only segment (“whole in one”); those applications that construct a record as one long string may wish to write “w1” in the header as an optimistic default, and later change the “w” to a “p” in the unusual case that the record will not fit in one segment. A complete logical record spanning 7 parts will have segments with this series of segment-status codes:

p1, p2, p3, p4, p5, p6, w7

To keep segments grouped with the appropriate logical record, it is a requirement that every non-initial segment contain the named field Segment-Origin-ID. To indicate truncation, the last segment’s number should be preceded by the code “t”, “z”, or “x”; for example, a 3 part record whose last segment was the result of a web capture that had insufficient time to finish will have the series “p1, p2, t3”.
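The segment-status rules above can be illustrated with a short Python sketch that parses a token of the form CM and checks whether a series of tokens describes a finished logical record. The function names are hypothetical, and the check covers only the numbering and code rules stated in this clause.

```python
import re

# One code letter (p, w, t, z or x) followed by a positive integer.
SEGMENT_STATUS = re.compile(r"^([pwtzx])([1-9][0-9]*)$")


def parse_segment_status(token):
    """Split a segment-status token such as 'p2' into ('p', 2)."""
    match = SEGMENT_STATUS.match(token)
    if not match:
        raise ValueError("malformed segment-status: %r" % token)
    return match.group(1), int(match.group(2))


def is_finished_series(tokens):
    """True if segments are numbered 1..n, all but the last are partial ('p'),
    and the last carries an ending code ('w', 't', 'z' or 'x')."""
    parsed = [parse_segment_status(t) for t in tokens]
    if not parsed:
        return False
    numbers_ok = [n for _, n in parsed] == list(range(1, len(parsed) + 1))
    partial_ok = all(code == "p" for code, _ in parsed[:-1])
    return numbers_ok and partial_ok and parsed[-1][0] != "p"
```

A series ending in 't', 'z' or 'x' counts as finished here because the record has ended, even though its content was truncated.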

3 Named fields following the header-line

Zero or more named fields follow the header-line. These fields have a line-oriented syntax very similar to that of email headers [RFC0822] but with unrestricted "text" values (none of its 13 reserved special characters). Essentially, an element consists of a name, a colon, and a value, where long values may be continued on indented lines after the name. Here, this format is called the "named field syntax" and is precisely specified by:

named-field = field-name ":" [field-body] CRLF

field-name = 1*

field-body = text [CRLF 1*LWSP-char field-body]

text = 1*

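; (Octal, Decimal.)

CHAR = <any ASCII character> ; (0-177, 0.-127.)

CR = <ASCII CR, carriage return> ; ( 15, 13.)

LF = <ASCII LF, linefeed> ; ( 12, 10.)

SPACE = <ASCII SP, space> ; ( 40, 32.)

HTAB = <ASCII HT, horizontal-tab> ; ( 11, 9.)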

CRLF = CR LF

LWSP-char = SPACE / HTAB ; semantics = SPACE

This International Standard defines a number of parameters that may appear as named fields. If there are no named fields present, the entire WARC record header is the line of positional parameters followed by one blank line (two consecutive CRLF newlines).

No named-field is required except for Segment-Origin-ID, which must occur in every non-initial segment of a multi-segment logical record. The 'type' and 'content-type' fields are strongly recommended at the beginning of every record (meaning the initial segment of a multi-segment logical record).

In principle, more than one instance of a named-field (bearing the same name) may occur in one record, but in practice this really only makes sense for the ‘Related-Resource’ field.

The rest of this section describes the currently defined named-fields.

record-type: type-specific-parameters

The type of WARC record. Although starting a WARC file with a 'warcinfo' record is recommended, any combination of record types may appear inside a WARC file. If no type is specified, it defaults to 'data'. Record types are defined in clause 5.

FIND ANOTHER NAME THAN SUBJECT (WHICH IS TOO MEANINGFUL FOR LIBRARIES) ? subject-uri

The original URI whose collection gave rise to the information content in this record. In the context of web harvesting, this is the URI that was the target of a crawler's retrieval request. Indirectly, such as for a 'revisit', 'metadata', or 'conversion' record, it is a copy of the subject-uri appearing in the original record to which the newer record pertains. For a 'warcinfo' record, this parameter is given a synthesized value for the creation name of the WARC file, as a URI.

The URI in this value should be properly escaped according to [RFC2396] and written with no internal whitespace.

content-type: string

The MIME type [RFC2045] of the information contained in the record's content block. This may be the fully structured MIME type with embedded spaces. For content in an HTTP request or response record, the content-type should be "application/http" as per Section 19.1 of [RFC2616] (or 'application/http; msgtype=request' or 'application/http; msgtype=response' respectively). In particular, it is not the value of the HTTP Content-Type header in an HTTP response but a MIME type to describe the content body (hence 'application/http' if the content body contains response headers and the response itself). If no content-type is specified it defaults to "application/octet-stream". An example of this field is:

content-type: application/http; msgtype=request

revisit: ref-uri comparison

An indication that the content block holds an empty or partial representation of the resource referenced by ref-uri, which usually identifies another WARC record (which may itself hold a partial representation) but which may be specified as “-“ if a resource identifier is unavailable. Typically, this field is used when the content visited was either a complete or substantial duplicate of material previously archived. An example of this field is:

revisit: urn:uuid:d7ae5c10-e6b3-4d27-967d-34780c58ba39 same

The comparison parameter may be one of "same" (resource was identical), "different" (resource was different and the content block, if non-empty, describes the differences in free text), or "patch" (resource was different and the content block represents the differences). The content block for this record is empty or contains a patch (if not specified, a content-type of "text/patch" is assumed), which is a set of machine-readable instructions that can be used (on Unix systems) automatically to construct a complete representation when applied to the referenced resource.

The purpose of this field is to permit reduction in redundant storage when repeatedly retrieving identical or little-changed content, while still recording that a revisit occurred, plus details about the current state of the visited content relative to the archived version. A 'revisit' field may not make sense with some record types and should only be used when interpreting the record requires consulting a previous resource. It is not required that any revisit of a previously-visited URI use 'revisit'.

note: text

This field can be used to enter any free text comment or observation about the WARC record.

IP-Address: ip-address

If the content block was obtained directly from an Internet host, this field can be used to hold the numeric Internet address of that host. An IPv4 address should be written as a "dotted quad"; an IPv6 address as per [RFC1884]. For example, in the case of an HTTP retrieval, this field would hold the IP address used at retrieval time corresponding to the hostname in the record's subject-uri. An example of this field is:

IP-Address: 137.227.232.150

Checksum: algorithm:value

This field can be used to indicate the name of a digest algorithm and the string representing the value that results from applying the algorithm to the content block. An example is:

Checksum: sha1:AB2CD3EF4GH5IJ6KL7MN8OPQ

No particular algorithm is recommended, FUTURE though a future recommendation is possible.

Related-Resource: relationship uri

A specified relationship to a resource referenced by uri, such as another WARC record. This field permits WARC records to relate to resources in ways other than specified elsewhere in this document (e.g., see the 'http-request' and 'conversion' record types). The relationship string should be taken from a vocabulary such as Dublin Core terms [DCMI] or given as "-" (undefined). An example of this field is:

Related-Resource: isRequiredBy

CLARIFY POSSIBLE RELATIONS ? A potential strategy, after choosing one record to be primary, is to extend its record-id as described in Annex A about record-id considerations. This creates satellite record-ids for related records that contain the primary record-id as an initial substring, which greatly optimizes the detection (and in some cases derivation) of related records.

Segment-Origin-ID: record-id

MENTION IT IS MANDATORY IN CASE OF SEGMENTATION. The identifier of the first segment of a multi-segment logical record. This field is required of every non-initial segment of a multi-segment logical record. An example of this field is:

Segment-Origin-ID:

Segment-Number: integer

MENTION IT IS MANDATORY IN CASE OF SEGMENTATION. In the first segment of a record that is completed in one or more later 'continuation' WARC records, this parameter is "1". In a 'continuation' record, this parameter is the sequence number of the current segment in the logical whole record, increasing by 1 in each next segment.

MISSING End-Length PARAMETER ?

Truncated: reason-token

When present, indicates that the current record ends before the apparent end of the source material, but no 'continuation' records are forthcoming. Possible values indicate the reason for the truncation:

- 'length' for exceeding a desired length limit;

- 'time' for exceeding a desired time limit during collection.

Warcinfo-ID: record-id

When present, indicates the record-id of the associated 'warcinfo' record for this record. Typically, the Warcinfo-ID parameter is used when the context of the applicable 'warcinfo' record is unavailable, such as after distributing single records into separate WARC files. WARC writing applications (such as web crawlers) may choose to record this parameter routinely (e.g., before computing checksums). The Warcinfo-ID parameter overrides any association with a previously occurring (in the WARC) 'warcinfo' record, thus providing a way to protect the true association when records are combined from different WARCs. FUTURE Use of this parameter in a record of type 'warcinfo' is undefined and reserved for possible future extension.


4 Content block following the header

A content block of zero or more octets follows the header. The block may contain arbitrary binary data, up through the remaining number of octets as specified in the previously given data-length parameter. After the content block, two CRLF newlines should be given, although they are not counted in the declared record data-length.

Named fields

1 General

Named fields within a WARC record provide information about the current record, and allow additional per-record information. WARC both reuses appropriate headers from other standards and defines new headers, all beginning "WARC-", for WARC-specific purposes.

Because new fields may be defined in extensions to the core WARC format, WARC processing software shall ignore fields with unrecognized names.


3 WARC-Record-ID (mandatory)

An identifier assigned to the current record that is globally unique for its period of intended use. No identifier scheme is mandated by this specification, but each record-id shall be a legal URI and clearly indicate a documented and registered scheme to which it conforms (e.g., via a URI scheme prefix such as "http:" or "urn:"). Care should be taken to ensure that this value is written with no internal whitespace.

WARC-Record-ID = "WARC-Record-ID" ":" uri

All records shall have a WARC-Record-ID field.

4 Content-Length (mandatory)

The number of octets in the block, similar to [RFC2616]. If no block is present, a value of '0' (zero) shall be used.

Content-Length = "Content-Length" ":" 1*DIGIT

All records shall have a Content-Length field.

5 WARC-Date (mandatory)

A 14-digit UTC timestamp formatted according to YYYY-MM-DDThh:mm:ssZ, described in the W3C profile of ISO8601 [W3CDTF]. The timestamp shall represent the instant that data capture for record creation began. Multiple records written as part of a single capture event (see section 5.7) shall use the same WARC-Date, even though the times of their writing will not be exactly synchronized.

WARC-Date = "WARC-Date" ":" w3c-iso8601

w3c-iso8601 = <YYYY-MM-DDThh:mm:ssZ>

All records shall have a WARC-Date field.
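As an illustration of the YYYY-MM-DDThh:mm:ssZ form, the following Python sketch formats and parses WARC-Date values with the standard library. The function names are hypothetical, and the formatter assumes it is given a timezone-aware datetime.

```python
from datetime import datetime, timezone

WARC_DATE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def format_warc_date(moment):
    """Render a timezone-aware datetime as a UTC WARC-Date value."""
    return moment.astimezone(timezone.utc).strftime(WARC_DATE_FORMAT)


def parse_warc_date(text):
    """Parse a WARC-Date value into a timezone-aware UTC datetime."""
    return datetime.strptime(text, WARC_DATE_FORMAT).replace(tzinfo=timezone.utc)
```

Converting to UTC before formatting keeps the trailing "Z" truthful even when the capture time was recorded in a local time zone.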

6 WARC-Type (mandatory)

The type of WARC record: one of 'warcinfo', 'response', 'resource', 'request', 'metadata', 'revisit', 'conversion', or 'continuation'. Other types of WARC records may be defined in extensions of the core format. Types are further described in WARC Record Types.

A WARC file need not contain any particular record types, though starting all WARC files with a "warcinfo" record is recommended.

WARC-Type = "WARC-Type" ":" record-type

record-type = "warcinfo" | "response" | "resource"

| "request" | "metadata" | "revisit"

| "conversion" | "continuation" | future-type

future-type = token

All records shall have a WARC-Type field.

WARC processing software shall ignore records of unrecognized type.

7 Content-Type

The MIME type [RFC2045] of the information contained in the record's block. For example, in HTTP request and response records, this would be 'application/http' as per section 19.1 of [RFC2616] (or 'application/http; msgtype=request' and 'application/http; msgtype=response' respectively). In particular, the content-type is not the value of the HTTP Content-Type header in an HTTP response but a MIME type to describe the full archived HTTP message (hence 'application/http' if the block contains request or response headers).

Content-Type = "Content-Type" ":" media-type

media-type = type "/" subtype *( ";" parameter )

type = token

subtype = token

parameter = attribute "=" value

attribute = token

value = token | quoted-string

All records with a non-empty block (non-zero Content-Length), except 'continuation' records, should have a Content-Type field. Only if the media type is not given by a Content-Type field, a reader may attempt to guess the media type via inspection of its content and/or the name extension(s) of the URI used to identify the resource. If the media type remains unknown, the reader should treat it as type "application/octet-stream".

8 WARC-Concurrent-To

The WARC-Record-IDs of any records created as part of the same capture event as the current record. A capture event comprises the information automatically gathered by a retrieval against a single target-URI; for example, it might be represented by a 'response' or 'revisit' record plus its associated 'request' record.

WARC-Concurrent-To = "WARC-Concurrent-To" ":" 1*uri

This field may be used to associate records of types 'request', 'response', 'resource', 'metadata', and 'revisit' with one another when they arise from a single capture event. (When so used, any WARC-Concurrent-To association shall be considered bidirectional even if the header only appears on one record.) The WARC-Concurrent-To field shall not be used in 'warcinfo', 'conversion', and 'continuation' records.

9 WARC-Block-Digest

An optional parameter indicating the algorithm name and calculated value of a digest applied to the full block of the record.

WARC-Block-Digest = "WARC-Block-Digest" ":" labelled-digest

labelled-digest = algorithm ":" digest-value

algorithm = token

digest-value = token

An example is a SHA-1 labeled Base32 ([RFC3548]) value:

WARC-Block-Digest: sha1:AB2CD3EF4GH5IJ6KL7MN8OPQ

This document recommends no particular algorithm.

Any record may have a WARC-Block-Digest field.
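A labelled digest of the kind shown in the example could be produced as in the following Python sketch, which uses SHA-1 and Base32 purely because the example above does; this document recommends no particular algorithm. The function name `labelled_sha1_digest` is hypothetical.

```python
import base64
import hashlib


def labelled_sha1_digest(block):
    """Return 'sha1:' followed by the Base32 encoding of the block's SHA-1."""
    digest = hashlib.sha1(block).digest()  # 20 raw bytes
    return "sha1:" + base64.b32encode(digest).decode("ascii")
```

A 20-byte SHA-1 digest encodes to exactly 32 Base32 characters with no padding, so the resulting value contains no "=" characters.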

10 WARC-Payload-Digest

An optional parameter indicating the algorithm name and calculated value of a digest applied to the payload referred to or contained by the record --- which is not necessarily equivalent to the record block.

WARC-Payload-Digest = "WARC-Payload-Digest" ":" labelled-digest

An example is a SHA-1 labeled Base32 ([RFC3548]) value:

WARC-Payload-Digest: sha1:3EF4GH5IJ6KL7MN8OPQAB2CD

This document recommends no particular algorithm.

The payload of an application/http block is its 'entity-body' (per [RFC2616]). In contrast to WARC-Block-Digest, the WARC-Payload-Digest field may also be used for data not actually present in the current record block, for example when a block is left off in accordance with a 'revisit' profile (see 'revisit').

The WARC-Payload-Digest field may be used on WARC records with a well-defined payload and shall not be used on records without a well-defined payload.

11 WARC-IP-Address

The numeric Internet address contacted to retrieve any included content. An IPv4 address shall be written as a "dotted quad"; an IPv6 address shall be written as per [RFC1884]. For an HTTP retrieval, this will be the IP address used at retrieval time corresponding to the hostname in the record's WARC-Target-URI.

WARC-IP-Address = "WARC-IP-Address" ":" (ipv4 | ipv6)

ipv4 = <"dotted quad">

ipv6 = <per RFC1884>

The WARC-IP-Address field may be used on 'response', 'resource', 'request', 'metadata', and 'revisit' records, but shall not be used on 'warcinfo', 'conversion' or 'continuation' records.

12 WARC-Refers-To

The WARC-Record-ID of a single record for which the present record holds additional content.

WARC-Refers-To = "WARC-Refers-To" ":" uri

The WARC-Refers-To field may be used to associate a 'metadata' record to another record it describes. The WARC-Refers-To field may also be used to associate a record of type 'revisit' or 'conversion' with the preceding record which helped determine the present record content. The WARC-Refers-To field shall not be used in 'warcinfo', 'response', 'resource', 'request', and 'continuation' records.

13 WARC-Target-URI

The original URI whose capture gave rise to the information content in this record. In the context of web harvesting, this is the URI that was the target of a crawler's retrieval request. For a 'revisit' record, it is the URI that was the target of a retrieval request. Indirectly, such as for a 'metadata', or 'conversion' record, it is a copy of the WARC-Target-URI appearing in the original record to which the newer record pertains. The URI in this value shall be properly escaped according to [RFC3986] and written with no internal whitespace.

WARC-Target-URI = "WARC-Target-URI" ":" uri

All 'response', 'resource', 'request', 'revisit', 'conversion' and 'continuation' records shall have a WARC-Target-URI field. A 'metadata' record may have a WARC-Target-URI field. A 'warcinfo' record shall not have a WARC-Target-URI field.

14 WARC-Truncated

For practical reasons, writers of the WARC format may place limits on the time or storage allocated to archiving a single resource. As a result, only a truncated portion of the original resource may be available for saving into a WARC record.

Any record may indicate that truncation of its content block has occurred and give the reason with a 'WARC-Truncated' field.

WARC-Truncated = "WARC-Truncated" ":" reason-token

reason-token = "length" ; exceeds configured max length

| "time" ; exceeds configured max time

| "disconnect" ; network disconnect

| "unspecified" ; other/unknown reason

| future-reason

future-reason = token

For example, if the capture of what appeared to be a multi-gigabyte resource was cut short after a transfer time limit was reached, the partial resource could be saved to a WARC record with this field.

The WARC-Truncated field may be used on any WARC record. The WARC field Content-Length shall still report the actual truncated size of the record block.

15 WARC-Warcinfo-ID

When present, indicates the WARC-Record-ID of the associated 'warcinfo' record for this record. Typically, the WARC-Warcinfo-ID field is used when the context of the applicable 'warcinfo' record is unavailable, such as after distributing single records into separate WARC files. WARC writing applications (such as web crawlers) may choose to always record this field.

WARC-Warcinfo-ID = "WARC-Warcinfo-ID" ":" uri

The WARC-Warcinfo-ID field value overrides any association with a previously occurring (in the WARC) 'warcinfo' record, thus providing a way to protect the true association when records are combined from different WARCs.

The WARC-Warcinfo-ID field may be used in any record type except 'warcinfo'.

16 WARC-Filename

The filename containing the current 'warcinfo' record.

WARC-Filename = "WARC-Filename" ":" ( TEXT | quoted-string )

The WARC-Filename field may be used in 'warcinfo' type records and shall not be used for other record types.

17 WARC-Profile

A URI signifying the kind of analysis and handling applied in a 'revisit' record. (Like an XML namespace, the URI may, but need not, return human-readable or machine-readable documentation.) If reading software does not recognize the given URI as a supported kind of handling, it shall not attempt to interpret the associated record block.

WARC-Profile = "WARC-Profile" ":" uri

The section 'revisit' defines two initial profile options for the WARC-Profile header for 'revisit' records.

The WARC-Profile field is mandatory on 'revisit' type records and undefined for other record types.

18 WARC-Identified-Payload-Type

The content-type of the record's payload as determined by an independent check. This string shall not be arrived at by blindly promoting an HTTP Content-Type value up from a record block into the WARC header without direct analysis of the payload, as such values may often be unreliable.

WARC-Identified-Payload-Type = "WARC-Identified-Payload-Type" ":" media-type

The WARC-Identified-Payload-Type field may be used on WARC records with a well-defined payload and shall not be used on records without a well-defined payload.
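One possible form of an "independent check" is to inspect the leading bytes of the payload. The sketch below is deliberately small and covers only a handful of well-known file signatures; the function name and the choice of signatures are assumptions made for illustration, and a production tool would use a fuller identification library.

```python
# (leading bytes, media type) pairs for a few widely documented file signatures.
SIGNATURES = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
]


def identify_payload_type(payload):
    """Return a media type derived from the payload bytes, or None if unrecognized."""
    for magic, media_type in SIGNATURES:
        if payload.startswith(magic):
            return media_type
    # No match: omit the field rather than copying the HTTP Content-Type.
    return None
```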

19 WARC-Segment-Number

Reports the current record's relative ordering in a sequence of segmented records.

WARC-Segment-Number = "WARC-Segment-Number" ":" 1*DIGIT

In the first segment of any record that is completed in one or more later 'continuation' WARC records, this parameter is mandatory. Its value there is "1". In a 'continuation' record, this parameter is also mandatory. Its value is the sequence number of the current segment in the logical whole record, increasing by 1 in each next segment.

See the section below, Record Segmentation, for full details on the use of WARC record segmentation.

20 WARC-Segment-Origin-ID

Identifies the starting record in a series of segmented records whose content blocks are reassembled to obtain a logically complete content block.

WARC-Segment-Origin-ID = "WARC-Segment-Origin-ID" ":" uri

This field is mandatory on all 'continuation' records, and shall not be used in other records. See the section below, Record segmentation, for full details on the use of WARC record segmentation.

21 WARC-Segment-Total-Length

In the final record of a segmented series, reports the total length of all segment content blocks when concatenated together.

WARC-Segment-Total-Length = "WARC-Segment-Total-Length" ":" 1*DIGIT

This field is mandatory on the last 'continuation' record of a series, and shall not be used elsewhere.

See the section below, Record segmentation, for full details on the use of WARC record segmentation.

WARC Record Types

1 General

The purpose and use of each defined record type is described below.

Because new record types that extend the WARC format may be defined in future standards, WARC processing software shall skip records of unknown type.
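A reader can satisfy this rule with a simple dispatch loop. The sketch below assumes records are dictionaries of header fields with the record type stored under a 'WARC-Type' key, and that `handlers` is a caller-supplied mapping from type name to function; all of these names are illustrative.

```python
KNOWN_TYPES = {
    "warcinfo", "response", "resource", "request",
    "metadata", "revisit", "conversion", "continuation",
}


def process_records(records, handlers):
    """Dispatch each record to its handler; skip unknown or unhandled types.

    Returns a (handled, skipped) pair of counts.
    """
    handled = skipped = 0
    for rec in records:
        rtype = rec.get("WARC-Type")
        handler = handlers.get(rtype)
        if rtype not in KNOWN_TYPES or handler is None:
            skipped += 1  # unknown type: skip it, do not treat it as an error
            continue
        handler(rec)
        handled += 1
    return handled, skipped
```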

2 'warcinfo'

A 'warcinfo' record describes the records that follow it, up through end of file, end of input, or until next 'warcinfo' record. Typically, this appears once and at the beginning of a WARC file. For a web archive, it often contains information about the web crawl which generated the following records.

The format of this descriptive record block may vary, though the use of the "application/warc-fields" content-type is recommended. Allowable fields include, but are not limited to, all [DCMI] plus the following field definitions. All fields are optional.

'operator'

Contact information for the operator who created this WARC resource. A name or name and email address is recommended.

'software'

The software and software version used to create this WARC resource. For example, "heritrix/1.12.0".

'robots'

The robots policy followed by the harvester creating this WARC resource. The string 'classic' indicates the 1994 web robots exclusion standard rules are being obeyed.

'hostname'

The hostname of the machine that created this WARC resource, such as "crawling17.".

'ip'

The IP address of the machine that created this WARC resource, such as "123.2.3.4".

'http-header-user-agent'

The HTTP 'user-agent' header usually sent by the harvester along with each request. Note that if 'request' records are used to save verbatim requests, this information is redundant. (If a 'request' or 'metadata' record reports a different 'user-agent' for a specific request, the more specific information should be considered more reliable.)

'http-header-from'

The HTTP 'From' header usually sent by the harvester along with each request. (The same considerations as for 'user-agent' apply.)

So that multiple record excerpts from inside WARC files are also valid WARC files, it is optional that the first record of a legal WARC be a 'warcinfo' description. Also, to allow the concatenation of WARC files into a larger valid WARC file, it is allowable for 'warcinfo' records to appear in the middle of a WARC file.

See annex C.1 below for an example of a 'warcinfo' record.
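For illustration, a 'warcinfo' record block written as "application/warc-fields" could be read along the following lines. The sketch assumes the block is UTF-8 text made of "name: value" lines and that a line starting with whitespace continues the previous value; the normative syntax is given where the format is defined, and the function name is hypothetical.

```python
def parse_warc_fields(block):
    """Parse a bytes block of 'name: value' lines into a list of (name, value) pairs."""
    fields = []
    for line in block.decode("utf-8").splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line[0] in " \t" and fields:
            # Continuation line: append to the previous field's value.
            name, value = fields[-1]
            fields[-1] = (name, value + " " + line.strip())
            continue
        name, sep, value = line.partition(":")
        if not sep:
            raise ValueError("not a named field: %r" % line)
        fields.append((name.strip(), value.strip()))
    return fields
```

A list of pairs is returned rather than a dictionary so that repeated field names and their order are preserved.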

3 'response'

1 General

A 'response' record contains a complete scheme-specific response, including network protocol information where possible. The exact contents of a 'response' record are determined not just by the record type but also by the URI scheme of the record's target-URI, as described below.

See annex C.2 below for an example of a 'response' record.

2 for 'http' and 'https' schemes

For a target-URI of the 'http' or 'https' schemes, a 'response' record block should contain the full HTTP response received over the network, including headers. That is, it contains the 'Response' message defined by section 6 of HTTP/1.1 (RFC2616), or by any previous or subsequent version of HTTP compatible with section 6 of HTTP/1.1 (RFC2616).

The WARC record's Content-Type field should contain the value defined by HTTP/1.1, "application/http;msgtype=response". When software bugs, network issues, or implementation limits cause response-like material to be collected that is not perfectly compliant with HTTP specifications, WARC writing software may record the problematic content using its best effort determination of the interesting material boundaries. That is, neither the use of the 'response' record with an 'http' target-URI nor the 'application/http' content-type serves as an absolute guarantee that the contained material is a legal HTTP response.

A WARC-IP-Address field should be used to record the network IP address from which the response material was received.

When a 'response' is known to have been truncated, this shall be noted using the WARC-Truncated field.

A WARC-Concurrent-To field (or fields) may be used to associate the 'response' to a matching 'request' record or concurrently-created 'metadata' record.

The payload of a 'response' record with a target-URI of scheme 'http' or 'https' is defined as its 'entity-body' (per [RFC2616]), with any transfer-encoding removed. If a truncated 'response' record block contains less than the full entity-body, the payload is considered truncated at the same position.
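The payload definition above can be sketched as follows. This is a simplified illustration, not a complete HTTP parser: it handles only the 'chunked' transfer-coding, ignores header folding and chunk trailers, and the function names are assumptions. A block truncated part-way through the body yields a payload truncated at the same position.

```python
def http_payload(block):
    """Return the entity-body of an HTTP message block with chunked coding removed."""
    head, sep, body = block.partition(b"\r\n\r\n")
    if not sep:
        return b""  # headers never completed, so no entity-body was captured
    headers = {}
    for line in head.split(b"\r\n")[1:]:  # skip the status line
        name, _, value = line.partition(b":")
        headers[name.strip().lower()] = value.strip().lower()
    if b"chunked" not in headers.get(b"transfer-encoding", b""):
        return body
    return _dechunk(body)


def _dechunk(data):
    """Concatenate chunk data, stopping at the last-chunk or at truncation."""
    out, pos = bytearray(), 0
    while True:
        eol = data.find(b"\r\n", pos)
        if eol < 0:
            break  # truncated before the next chunk-size line ended
        size = int(data[pos:eol].split(b";")[0], 16)  # drop chunk extensions
        if size == 0:
            break  # last-chunk reached
        start = eol + 2
        out += data[start:start + size]
        pos = start + size + 2  # skip the chunk data and its trailing CRLF
    return bytes(out)
```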

This document does not specify conventions for recording information about the 'https' secure socket transaction, such as certificates exchanged, consulted, or verified.

3 for other URI schemes

This document does not specify the contents of the 'response' record for other URI schemes.

4 'resource'

1 General

A 'resource' record contains a resource, without full protocol response information. For example: a file directly retrieved from a locally accessible repository, or the result of a networked retrieval where the protocol information has been discarded. The exact contents of a 'resource' record are determined not just by the record type but also by the URI scheme of the record's target-URI, as described below.

For all 'resource' records, the payload is defined as the record block.

A 'resource' record, with a synthesized target-URI, may also be used to archive other artifacts of a harvesting process inside WARC files.

See annex C.3 below for an example of a 'resource' record.

2 for 'http' and 'https' schemes

For a target-URI of the 'http' or 'https' schemes, a 'resource' record block shall contain the returned 'entity-body' (per [RFC2616], with any transfer-encodings removed), possibly truncated.

3 for 'ftp' scheme

For a target-URI of the 'ftp' scheme, a 'resource' record block shall contain the complete file returned by an FTP operation, possibly truncated.

4 for 'dns' scheme

For a target-URI of the 'dns' scheme ([RFC4501]), a 'resource' record shall contain material of content-type 'text/dns' (registered by [RFC4027] and defined by [RFC2540] and [RFC1035]) representing the results of a single DNS lookup as described by the target-URI.

5 for other URI schemes

This document does not specify the contents of the 'resource' record for other URI schemes.

5 'request'

1 General

A 'request' record holds the details of a complete scheme-specific request, including network protocol information where possible. The exact contents of a 'request' record are determined not just by the record type but also by the URI scheme of the record's target-URI, as described below.

See annex C.4 below for an example of a 'request' record.

2 for 'http' and 'https' schemes

For a target-URI of the 'http' or 'https' schemes, a 'request' record block should contain the full HTTP request sent over the network, including headers. That is, it contains the 'Request' message defined by section 5 of HTTP/1.1 (RFC2616), or by any previous or subsequent version of HTTP compatible with section 5 of HTTP/1.1 (RFC2616).

The WARC record's Content-Type field should contain the value defined by HTTP/1.1, "application/http;msgtype=request".

A WARC-IP-Address field should be used to record the network IP address to which the request material was directed.

A WARC-Concurrent-To field (or fields) may be used to associate the 'request' to a matching 'response' record or concurrently-created 'metadata' record.

The payload of a 'request' record with a target-URI of scheme 'http' or 'https' is defined as its 'entity-body' (per [RFC2616]), with any transfer-encoding removed. If a truncated 'request' record block contains less than the full entity-body, the payload is considered truncated at the same position.

This document does not specify conventions for recording information about the 'https' secure socket transaction, such as certificates exchanged, consulted, or verified.

3 for other URI schemes

This document does not specify the contents of the 'request' record for other URI schemes.

6 'metadata'

A 'metadata' record contains content created in order to further describe, explain, or accompany a harvested resource, in ways not covered by other record types. A 'metadata' record will almost always refer to another record of another type, with that other record holding original harvested or transformed content. (However, it is allowable for a 'metadata' record to refer to any record type, including other 'metadata' records.) Any number of metadata records may reference one specific other record.

The format of the metadata record block may vary. The "application/warc-fields" format, defined earlier, may be used. Allowable fields include all [DCMI] plus the following field definitions. All fields are optional.

'via'

The referring URI from which the archived URI was discovered.

'hopsFromSeed'

A symbolic string describing the type of each hop from a starting 'seed' URI to the current URI.

'fetchTimeMs'

Time in milliseconds that it took to collect the archived URI, starting from the initiation of network traffic.

A 'metadata' record may be associated with other records derived from the same capture event using the WARC-Concurrent-To header. A 'metadata' record may be associated with another record which it describes using the WARC-Refers-To header.

See annex C.5 below for an example of a 'metadata' record.

7 'revisit'

1 General

A 'revisit' record describes the revisitation of content already archived, and might include only an abbreviated content body which has to be interpreted relative to a previous record. Most typically, a 'revisit' record is used instead of a 'response' or 'resource' record to indicate that the content visited was either a complete or substantial duplicate of material previously archived.

Using a 'revisit' record instead of another type is optional, for when benefits of reduced storage size or improved cross-referencing of material are desired.

A 'revisit' record shall contain a WARC-Profile field which determines the interpretation of the record's fields and record block. Two initial values and their interpretation are described in the following sections. A reader which does not recognize the profile URI shall not attempt to interpret the enclosing record or associated content body.

The purpose of this record type is to reduce storage redundancy when repeatedly retrieving identical or little-changed content, while still recording that a revisit occurred, plus details about the current state of the visited content relative to the archived version.

See annex C.6 below for an example of a 'revisit' record.

2 Profile: Identical Payload Digest

This 'revisit' profile may be used whenever a subsequent consideration of a URI provides payload content which a strong digest function, such as SHA-1, indicates is identical to a previously recorded version.

To indicate this profile, use the URI:



To report the payload digest used for comparison, a 'revisit' record using this profile shall include a WARC-Payload-Digest field, with a value of the digest that was calculated on the payload.

A 'revisit' record using this profile may have no record block, in which case a Content-Length of zero shall be written. If a record block is present, it shall be interpreted the same as a 'response' record type for the same URI, but truncated to avoid storing the duplicate content. A WARC-Truncated header with reason 'length' shall be used for any identical-digest truncation.

For records using this profile, the payload is defined as the original payload content whose digest value was unchanged.

Using a WARC-Refers-To header to identify a specific prior record from which the matching content can be retrieved is recommended, to minimize the risk of misinterpreting the 'revisit' record.
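A writer applying this profile might proceed roughly as sketched below. Several details are assumptions made for illustration: the digest is written as an algorithm label followed by Base32, the record type is carried in a 'WARC-Type' field, the profile URI is passed in by the caller rather than hard-coded, and the sketch writes no record block at all.

```python
import base64
import hashlib


def payload_digest(payload):
    """Return a labelled SHA-1 digest of the payload, Base32-encoded (assumed form)."""
    return "sha1:" + base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")


def revisit_fields(payload, prior_digest, prior_record_id, profile_uri):
    """Return header fields for a block-less 'revisit' record, or None if content changed."""
    digest = payload_digest(payload)
    if digest != prior_digest:
        return None  # payload differs: write an ordinary 'response' or 'resource' record
    return {
        "WARC-Type": "revisit",
        "WARC-Profile": profile_uri,
        "WARC-Payload-Digest": digest,      # the digest used for the comparison
        "WARC-Refers-To": prior_record_id,  # recommended pointer to the prior record
        "Content-Length": "0",              # no record block is written
    }
```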

3 Profile: Server Not Modified

This 'revisit' profile may be used whenever a subsequent consideration of a URI encounters an assertion from the providing server that the content has not changed, such as an HTTP "304 Not Modified" response.

To indicate this profile, use the URI:



A 'revisit' record using this profile may have no content body, in which case a Content-Length of zero shall be written. If a content body is present, it should be interpreted the same as a 'response' record type for the same URI, truncated if desired.

For records using this profile, the payload is defined as the original payload content from which a 'Last-Modified' and/or 'ETag' value was taken.

Using a WARC-Refers-To header to identify a specific prior record from which the unmodified content can be retrieved is recommended, to minimize the risk of misinterpreting the 'revisit' record.

4 Other profiles

Other documents may define additional profiles to accomplish other goals, such as recording the apparent magnitude of difference from the previous visit, or to encode the visited content as a "diff" -- where "diff" is the file comparison utility that outputs the differences between two files -- of the content previously stored.

8 'conversion'

A 'conversion' record contains an alternative version of another record's content that was created as the result of an archival process. Typically, this is used to hold content transformations that maintain viability of content after widely available rendering tools for the originally stored format disappear. As needed, the original content may be migrated (transformed) to a more viable format in order to keep the information usable with current tools while minimizing loss of information (intellectual content, look and feel, etc.). Any number of 'conversion' records may be created that reference a specific source record, which may itself contain transformed content. Each transformation should result in a freestanding, complete record, with no dependency on survival of the original record.

Metadata records may be used to further describe transformation records. Wherever practical, a 'conversion' record should contain a 'WARC-Refers-To' field to identify the prior material converted.

For 'conversion' records, the payload is defined as the record block.

See annex C.7 below for an example of a 'conversion' record.

9 'continuation'

Record blocks from 'continuation' records must be appended to corresponding prior record block(s) (e.g., from other WARC files) to create the logically complete full-sized original record. That is, 'continuation' records are used when a record that would otherwise cause a WARC file size to exceed a desired limit is broken into segments. A 'continuation' record shall contain the named fields 'WARC-Segment-Origin-ID' and 'WARC-Segment-Number', and the last 'continuation' record of a series shall contain a 'WARC-Segment-Total-Length' field. The full details of WARC record segmentation are described in the section Record segmentation below. See also annex C.8 below for an example of a 'continuation' record.

Record segmentation

A record that will not fit into a single WARC file of desired maximum size may be broken into a number of separate records, called segments.

The first segment of a segmented series shall carry the original record-type (not 'continuation'), and a 'WARC-Segment-Number' field with a value of "1".

All subsequent segments shall have a record type of 'continuation', with an incremented 'WARC-Segment-Number' field. They shall also include a 'WARC-Segment-Origin-ID' field with a value of the WARC-Record-ID of the record containing the first segment of the set. All segments of a set shall have identical target-URI values. Segments may have individual WARC-Block-Digest fields.

The last segment shall contain a 'WARC-Segment-Total-Length' field specifying the total length, in bytes, of all segment content blocks if reassembled. The last segment may also contain a 'WARC-Truncated' field, if appropriate.

The WARC-Payload-Digest recorded in the first segment of a segmented record is the digest of the payload of the logical record.

Segments other than the first should not contain other optional fields, as segments merely serve to continue the record data block of the first record.

To reassemble all segments into the intended complete logical record, the content blocks of all records with the same 'WARC-Segment-Origin-ID' value are collected and appended, in 'WARC-Segment-Number' order, to the origin record's content block. The resulting assembled record adopts as its 'Content-Length' the 'WARC-Segment-Total-Length' value. It also adopts any 'WARC-Truncated' reason of the final segment.
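The reassembly procedure can be sketched as below. Records are modelled as dictionaries with a 'headers' mapping and a 'block' of bytes; that representation, the function name, and the consistency checks (which go slightly beyond what the text requires) are assumptions of this illustration.

```python
def reassemble(origin, continuations):
    """Rebuild the logical record from its origin segment and 'continuation' segments."""
    if not continuations:
        raise ValueError("a segmented record needs at least one continuation")
    origin_id = origin["headers"]["WARC-Record-ID"]
    ordered = sorted(continuations,
                     key=lambda r: int(r["headers"]["WARC-Segment-Number"]))
    block = bytearray(origin["block"])
    for expected, seg in enumerate(ordered, start=2):  # the origin is segment 1
        head = seg["headers"]
        if head["WARC-Segment-Origin-ID"] != origin_id:
            raise ValueError("segment belongs to a different origin record")
        if int(head["WARC-Segment-Number"]) != expected:
            raise ValueError("missing or duplicate segment %d" % expected)
        block += seg["block"]
    last = ordered[-1]["headers"]
    total = int(last["WARC-Segment-Total-Length"])
    if total != len(block):
        raise ValueError("reassembled length does not match the declared total")
    headers = dict(origin["headers"])
    headers["Content-Length"] = str(total)  # adopt the total length
    if "WARC-Truncated" in last:
        headers["WARC-Truncated"] = last["WARC-Truncated"]  # adopt the final reason
    return {"headers": headers, "block": bytes(block)}
```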

Segmentation shall not be used if there is another way to store the record within the desired WARC file target size. Specifically, if a record could be stored without segmentation by starting a new WARC file, segmentation shall not be used. Further, when segmentation is used, the size of the first segment shall be maximized. Specifically, the origin segment shall be placed in a new WARC file, preceded only by a 'warcinfo' record (if any).
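On the writing side, the rules in this section might be applied as in the following sketch. It is simplified: `max_block` bounds only the block bytes of each segment (header and 'warcinfo' overhead are ignored), `new_id` is a caller-supplied function returning fresh record identifiers, and the field names 'WARC-Type' and 'WARC-Target-URI' are assumed. Placing each segment in its own new WARC file is left to the caller.

```python
def split_into_segments(record_type, record_id, target_uri, block, max_block, new_id):
    """Split an oversized block into an origin segment plus 'continuation' segments."""
    if len(block) <= max_block:
        # The record fits in a fresh WARC file, so segmentation shall not be used.
        raise ValueError("record fits without segmentation")
    # Every piece except possibly the last is as large as allowed.
    pieces = [block[i:i + max_block] for i in range(0, len(block), max_block)]
    segments = []
    for number, piece in enumerate(pieces, start=1):
        if number == 1:
            headers = {"WARC-Type": record_type, "WARC-Record-ID": record_id}
        else:
            headers = {
                "WARC-Type": "continuation",
                "WARC-Record-ID": new_id(),
                "WARC-Segment-Origin-ID": record_id,
            }
        headers["WARC-Target-URI"] = target_uri  # identical on all segments
        headers["WARC-Segment-Number"] = str(number)
        headers["Content-Length"] = str(len(piece))
        if number == len(pieces):
            headers["WARC-Segment-Total-Length"] = str(len(block))
        segments.append({"headers": headers, "block": piece})
    return segments
```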

Segmentation may be applied to any original record type other than 'continuation', but its use on 'warcinfo', 'request', 'metadata' and 'revisit' records is not recommended.

There are 8 currently defined WARC record types: 'warcinfo', 'response', 'resource', 'request', 'metadata', 'revisit', 'conversion', and 'continuation'. The purpose and use of each type is described below.

FUTURE New record types that extend the WARC format may be defined in the future. WARC processing software should skip records of unknown type. If no record type is given, a default type of ‘data’ is assumed. Some record types take positional parameters; for example, a “uri” is required with an HTTP response:

type: http-response uri

Several record types have a content block that is expected to be structured. If no content-type parameter accompanies such a record, the content block is assumed to be structured according to the named field syntax following the WARC header-line.

type: warcinfo

A 'warcinfo' record describes the records that follow it up through the end of file (end of input), or until another 'warcinfo' record is reached. Typically, this appears once in a WARC file, usually at the beginning. For a web archive, it often contains a description of a web crawl (e.g., seeds, depth, timeout, purpose, maximum file size). If no content-type parameter accompanies this record, the content block is assumed to contain a set of named fields in the same format as those following the WARC header-line. An example record is:

warc/00.13 390 20070214235805 urn:uuid:d7ae5c10-e6b3-4d27-967d-34780c58ba39 w1

type: warcinfo

software: Heritrix 1.4.0

hostname: crawler017.

ip: 207.241.227.234

isPartOf: testcrawl-20050708

title: testcrawl with WARC output

creator: IA_Admin

http-header-user-agent: Mozilla/5.0

(compatible; heritrix/1.4.0 +)

OUTSIDE THE SCOPE The format of the description is outside the scope of this document, but may include such things as:

a subject-uri, a URI name, synthesized as necessary, which references the WARC file itself,

approximate maximum archive file size (e.g., 500MB),

rate of crawling,

site entry point URIs for a targeted crawl.

So that multiple record excerpts from inside WARC files may also be valid WARC files, it is not strictly required that the first record of a legal WARC be a 'warcinfo' description. Also, to allow the concatenation of WARC files into a larger valid WARC file, it is allowable for 'warcinfo' records to appear in the middle of a WARC file. The precise specification of the content block for this record type is outside the scope of this document.

FIND ANOTHER NAME THAN SUBJECT (WHICH IS TOO MEANINGFUL FOR LIBRARIES) ? subject-uri

The original URI whose collection gave rise to the information content in this record. In the context of web harvesting, this is the URI that was the target of a crawler's retrieval request. Indirectly, such as for a 'revisit', 'metadata', or 'conversion' record, it is a copy of the subject-uri appearing in the original record to which the newer record pertains. For a 'warcinfo' record, this parameter is given a synthesized value for the creation name of the WARC file, as a URI.

The URI in this value should be properly escaped according to [RFC2396] and written with no internal whitespace.

type: http-response uri

An 'http-response' record contains an entire HTTP protocol response, such as a full HTTP response including headers and content-body, from an Internet retrieval. The uri parameter, which names the target of the HTTP request, is required; it is the original URI whose capture gave rise to the information content in this record. The URI in this value should be properly escaped according to [RFC2396] and written with no internal whitespace.

Often the payload of such a response reflects the main collection objective of the archiving service, whose responsibility it is to distinguish payload from protocol headers during subsequent processing. A response record may often come with the named parameters 'IP-Address' and 'Related-Resource'. An example record is:

warc/00.13 7425 20050708010101 w1

IP-Address: 207.241.224.241

Checksum: sha1:2ZWC6JAT6KNXKD37F7MOEKXQMRY75YY4

HTTP/1.x 200 OK

Date: Fri, 08 Jul 2005 01:01:01 GMT

Server: Apache/1.3.33 (Debian GNU/Linux) PHP/5.0.4-0.3

Last-Modified: Sun, 12 Jun 2005 00:31:01 GMT

Etag: "914480-1b2e-42ab8245"

Accept-Ranges: bytes

Content-Length: 6958

Keep-Alive: timeout=15, max=100

Connection: Keep-Alive

Content-Type: image/jpeg

[6958 bytes of binary data here ]

'resource'

A 'resource' record contains a resource, without full protocol response information. For example: a file directly retrieved from a locally accessible repository, or the result of a networked retrieval where the protocol information has been discarded. A 'resource' record may often include the named parameter 'Related-Record-ID'.

type: http-request response-uri

An 'http-request' record holds the protocol request headers (e.g., GET or POST) associated with a particular HTTP request. The response-uri parameter, which references the HTTP response, is required, but may be given as "-" to leave it undefined. The URI in this value should be properly escaped according to [RFC2396] and written with no internal whitespace. (In a web crawling context, this would hold the HTTP request.) A request record may often come with the named parameter 'Related-Resource'. An example record is:

warc/00.13 289 20050708010101 uuid:f569983a-ef8c-4e62-b347-295b227c3e51 w1

type: http-request

IP-Address: 207.241.224.241

GET /images/logo.jpg HTTP/1.0

Host:

User-Agent: Mozilla/5.0 (compatible; crawler/1.4 +)

type: dns-response dns-uri

A ‘dns-response’ record is used to hold the results of an Internet Domain Name System “A” record lookup on a given dns-uri. Records of this type are often used in a web archiving context to hold the IP address of a hostname that may be the target of repeated HTTP requests. A request for DNS information can be summarized in a URI in accordance with an IETF Network Working Group draft proposal [RFC4501]. DNS information as retrieved can be represented in the formats specified by [RFC1035], [RFC2540], and [RFC4027]. An example record is:

warc/00.13 252 20060909004930 – w1

type: dns-response dns:ca.water.

content-type: text/dns

20060909004930

ca.water.. 60 IN A 137.227.232.150

ca.water.. 60 IN A 137.227.232.151

ca.water.. 60 IN A 137.227.232.152

If present, the IP-Address named field should be the address of the DNS server that provided the DNS record.

type: metadata ref-uri

A 'metadata' record contains content created in order to further describe, explain, or accompany another resource in ways not covered by other record types. The other resource is referenced by the ref-uri parameter, often identifying another WARC record or an Internet-accessible resource (however, it may be given as "-"). CLARIFY POSSIBLE RELATIONS ? A 'metadata' record will almost always refer to another record of another type, with that other record holding original harvested or transformed content. (However, it is allowable for a 'metadata' record to refer to any record type, including other 'metadata' records, or to refer to no other individual record at all.) Any number of metadata records may be created that reference the same resource (e.g., another WARC record). A metadata record may also come with the named parameter 'Related-Resource' to specify other related records.

OUTSIDE THE SCOPE The format of the metadata is outside the scope of this document, but potential formats include the named field syntax used earlier and [RDF] or other XML-based formats. If no content-type parameter accompanies this record, the content block is assumed to contain a set of named fields in the same format as those following the WARC header-line. An example record is:

warc/00.13 282 20070214235805 w1

type: metadata

erc:

who: Lederberg, Joshua

what: Studies of Human Families for Genetic Linkage

when: 1974

where:

A metadata record may often include the named parameter 'Related-Record-ID'.

revisit

A 'revisit' record describes the revisitation of content already archived, and includes only an abbreviated content block which shall be interpreted relative to a previous record. Most typically, a 'revisit' record should be used instead of 'response' or 'resource' record to indicate that the content visited was either a complete or substantial duplicate of material previously archived.

A 'revisit' record should only be used when interpreting the record requires consulting a previous record; other record types should be preferred if the current record is understandable standing alone. It is not required that any revisit of a previously-visited URI use 'revisit', only those which refer back to other records.

OUTSIDE THE SCOPE The format of a 'revisit' record's content block is outside the scope of this document, and may vary to accomplish different goals, such as recording the apparent magnitude of difference from the previous visit, or to encode the visited content as a "diff" of the content previously stored. The purpose of this record type is to reduce storage redundancy when repeatedly retrieving identical or little-changed content, while still recording that a revisit occurred, plus details about the current state of the visited content relative to the archived version. A 'revisit' record requires the named parameter 'Related-Record-ID'.

type: conversion ref-uri flag

A 'conversion' record contains an alternative version of another resource's content that was created as the result of an archival process. The other resource is referenced by the ref-uri parameter, often identifying another WARC record or an Internet-accessible resource. The flag parameter may be given as either "noenvelope" (to indicate that any protocol control information stored in the original resource was stripped off before conversion) or "-" (no flag specified). The flag is a place to signal the common expected case in which an http-response record was not actually converted, but only the content inside the protocol envelope (the "payload") was converted.

This record may also come with the named parameter 'Related-Resource' to specify other related records. An example record is:

warc/00.13 15153 20060909004930 w1

type: conversion noenvelope

content-type: image/jp2k

[ 14,984 bytes of binary image data here ]

Typically, a 'conversion' record is used to hold content transformations that maintain viability of content after widely available rendering tools for the originally stored format disappear. As needed, the original content may be migrated (transformed) to a more viable format in order to keep the information usable with current tools while minimizing loss of information (intellectual content, look and feel, etc.). Any number of transformation records may be created that reference a specific source record, which may itself contain transformed content. Each transformation should result in a freestanding, complete record, with no dependency on survival of the original record. Metadata records may be used to further describe transformation records. A conversion record requires the named parameter 'Related-Record-ID'.

OUTSIDE THE SCOPE Specification of the fields and metadata formats used to describe a 'conversion' record is outside the scope of this document.

type: data

A 'data' record contains digital content of unspecified type. This type is the default if no record type is given. Examples include a file directly retrieved from a locally accessible repository, or the result of a networked retrieval where the protocol information has been discarded.

'continuation'

A 'continuation' record needs to be logically appended to a prior record (e.g., from another WARC file) to create the logically complete full-sized record. This is used when a record that would otherwise cause the WARC file size to exceed a desired limit is broken into segments. See clause 8 on Truncated and Segmented Records for more information. A 'continuation' record requires the named parameters 'Segment-Origin-ID' and 'Segment-Number', and may often include the named parameter 'Related-Record-ID'.

Record header

General

The header-line parameters are:

warc-id = "warc/0.9"

data-length = 1*DIGIT

record-type = "warcinfo" / "response" / "request"

/ "metadata" / "revisit" / "conversion"

/ "continuation" / future-type

future-type = 1*VCHAR

subject-uri = uri

uri =

creation-date = timestamp

timestamp =

content-type = type "/" subtype

type =

subtype =

record-id = uri

Named parameters (also known as ANVL [ANVL]), if any, follow the positional parameters after the header-line. Normally, named parameters are optional and their order is insignificant; however, specific record types require that certain named parameters be present (and future extensions may have ordering requirements). [Editor's note: state precisely when named parameters are mandatory or optional.]
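
As an illustration only, the header-line grammar above can be read with a few lines of Python. This is a minimal sketch that assumes the seven positional parameters appear in the order listed, separated by whitespace, and that URIs contain no embedded spaces. The names `HeaderLine` and `parse_header_line`, and the example subject-uri, are hypothetical and not part of this specification.

```python
from typing import NamedTuple


class HeaderLine(NamedTuple):
    """The seven positional parameters of a header-line, in listed order."""
    warc_id: str
    data_length: int
    record_type: str
    subject_uri: str
    creation_date: str
    content_type: str
    record_id: str


def parse_header_line(line: str) -> HeaderLine:
    """Split a header-line on whitespace and check the simplest constraints."""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError(f"expected 7 positional parameters, got {len(fields)}")
    warc_id, length, rtype, uri, date, ctype, rid = fields
    if not warc_id.startswith("warc/"):
        raise ValueError("header-line must begin with 'warc/'")
    if not (length.isascii() and length.isdigit()):
        raise ValueError("data-length must be 1*DIGIT")
    if "/" not in ctype:
        raise ValueError("content-type must have the form type/subtype")
    return HeaderLine(warc_id, int(length), rtype, uri, date, ctype, rid)


# Hypothetical header-line; the subject-uri is an assumed placeholder.
example = ("warc/0.9 298 request http://example.org/images/logo.jpg "
           "20050708010101 message/http "
           "uuid:f569983a-ef8c-4e62-b347-295b227c3e51")
header = parse_header_line(example)
print(header.record_type, header.data_length)
```

A reader would use `data_length` to decide how many bytes to consume before looking for the next record; named parameters would be parsed separately from the lines that follow.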

Positional parameters

This clause describes each of the individual positional parameters of the WARC header-line.

Named parameters

Named parameters, also referred to as named fields, are optional except as noted otherwise. [Editor's note: clarify which named parameters are optional or mandatory, and in which cases.]

FUTURE Additional named parameters may be proposed by WARC users, who are urged to publicly document and discuss new named parameters with the WARC community before use.

Record content block

Each record's content block contains zero or more bytes of data, interpreted according to the record type and any preceding headers.

Registration of MIME media types application/warc and application/warc-fields

1 General

This section describes, as per [RFC2048], the MIME types associated with the WARC format.

2 application/warc

MIME media type name: application

MIME subtype names: warc

Required parameters: None

Optional parameters: None

Encoding considerations:

Content of this type is in 'binary' format.

Security considerations:

The WARC record syntax poses no direct risk to computers and networks. Implementors need to be aware of source authority and trustworthiness of information structured in WARC. Readers and writers subject themselves to all the risks that accompany normal operation of data processing services (e.g., message length errors, buffer overflow attacks).

Interoperability considerations: None

Published specification: TBD

Applications which use this media type: Large- and small-scale archiving

Additional information: None

Person and email address to contact for further information:

Gordon Mohr gojomo@, John Kunze jak@ucop.edu

Intended usage: COMMON

Author/Change controller: IESG

3 application/warc-fields

MIME media type name: application

MIME subtype names: warc-fields

Required parameters: None

Optional parameters: None

Encoding considerations:

Content of this type is in 'binary' format.

Security considerations:

The WARC field syntax poses no direct risk to computers and networks. Implementors need to be aware of source authority and trustworthiness of information structured in WARC. Readers and writers subject themselves to all the risks that accompany normal operation of data processing services (e.g., message length errors, buffer overflow attacks).

Interoperability considerations: None

Published specification: TBD

Applications which use this media type: Large- and small-scale archiving

Additional information: None

Person and email address to contact for further information:

Gordon Mohr gojomo@, John Kunze jak@ucop.edu

Intended usage: COMMON

Author/Change controller: IESG

IANA considerations

After IESG approval, IANA is expected to register the WARC type "application/warc" using the application provided in this document.

Truncated and segmented records

For practical reasons, users of the WARC format may place limits on the time or storage allocated to archiving a single resource. As a result, only a truncated portion of the original resource may be available for saving into a WARC record.

Additionally, users will often want to keep individual WARC files near or below some target size, such as 100MB or 500MB. If some records would be too large to be contained by a single WARC file of desired maximum size, those records will have to be split between multiple WARC files.

This clause defines mechanisms for indicating that a WARC record has been truncated or split into multiple records, called segments, across WARC files. This is based on the concept of a logical record that may span multiple WARC records, perhaps held in different WARC files. Each segment is represented by a separate WARC record.

FUTURE These mechanisms are provisional and subject to change. A superior method of indicating truncation and segmentation may be developed, which better allows the writing of records to begin without foreknowledge of their final length.

1 Record truncation

Any record may indicate that truncation has occurred and give the reason by using the segment-status code on the WARC header-line. A code of "t" means that truncation occurred because of a time constraint and a code of "z" means that truncation occurred because of a size constraint. The reason may be left unspecified by using a code of "x".

2 Record segmentation

A record that will not fit into a single WARC file of desired maximum size may be broken into any number of separate records, called segments. Together these segments comprise the logical record. As much as possible, segmentation should be avoided. When segmentation is needed, segments other than the first shall be of record-type 'continuation' and shall come with the named field Segment-Origin-ID to tie the logical record segments together.

The first segment shall carry the record-type (not 'continuation') that the record would have had were it not broken into segments, and a segment number of "1" in the header-line segment-status.

All subsequent segments shall have a record type of 'continuation' and an incremented segment number. They shall also include a 'Segment-Origin-ID' field with a value of the record-id of the record containing the first segment of the set. Segments other than the first should contain no other named fields, as they merely serve to continue the record data block of the first record. All segments of a set shall have identical subject-uri parameters.

The last segment may also contain an indication of truncation, if appropriate. For example, a 4-segment logical record that is truncated due to excessive size, and whose first segment's record-id is 54321, would have the series of segment-status and Segment-Origin-ID fields according to this table:

|Segment-Status |Segment-Origin-ID |
|p1 |undefined |
|p2 |54321 |
|p3 |54321 |
|z4 |54321 |

Example of series of parameters in a 4-segment logical record.

To reassemble all segments into the intended complete logical record, all records with the same 'Segment-Origin-ID' value shall be collected and appended, in segment number order, to the origin record.
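
The reassembly rule above can be sketched in Python as follows. This is an illustrative sketch, not a normative algorithm: it assumes each record has already been parsed into a dictionary with the hypothetical keys 'record_id', 'segment_status', 'origin_id' (None on the first segment) and 'block', and it treats the codes "t", "z" and "x" on the last segment as the truncation indication. The record-ids of the later segments in the example are invented for illustration.

```python
from collections import defaultdict

TRUNCATION_CODES = {"t", "z", "x"}  # time, size, unspecified


def parse_segment_status(status):
    """Split a segment-status such as 'p2' into its code and number."""
    return status[0], int(status[1:])


def reassemble(records):
    """Join the blocks of each logical record in segment-number order.

    Returns {origin record-id: (joined block, truncated flag)}.
    """
    firsts = {}
    later = defaultdict(list)
    for rec in records:
        code, number = parse_segment_status(rec["segment_status"])
        if rec["origin_id"] is None:
            firsts[rec["record_id"]] = (number, code, rec)
        else:
            later[rec["origin_id"]].append((number, code, rec))

    logical = {}
    for origin_id, first in firsts.items():
        parts = [first] + sorted(later[origin_id], key=lambda part: part[0])
        numbers = [number for number, _, _ in parts]
        if numbers != list(range(1, len(parts) + 1)):
            raise ValueError(f"segments missing or duplicated for {origin_id}")
        block = b"".join(rec["block"] for _, _, rec in parts)
        truncated = parts[-1][1] in TRUNCATION_CODES
        logical[origin_id] = (block, truncated)
    return logical


# The 4-segment example from the table, deliberately out of order.
segments = [
    {"record_id": "54324", "segment_status": "z4", "origin_id": "54321", "block": b"D"},
    {"record_id": "54321", "segment_status": "p1", "origin_id": None, "block": b"A"},
    {"record_id": "54323", "segment_status": "p3", "origin_id": "54321", "block": b"C"},
    {"record_id": "54322", "segment_status": "p2", "origin_id": "54321", "block": b"B"},
]
print(reassemble(segments)["54321"])  # (b'ABCD', True)
```

Checking that the segment numbers form an unbroken run starting at 1 is a design choice of this sketch; it guards against silently joining an incomplete set when one WARC file of the series is unavailable.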

(informative)

Considerations in choice of record ID

The WARC format differs significantly from the ARC format in requiring the record-id parameter. The record-id should be globally unique for its period of intended use. If that period is indefinite, the record-id should be maintained to a level appropriate for any persistent identifier, in which case identifier opaqueness is usually desirable.

There is no reason why the archiving institution may not choose record-ids that are also "actionable" (submittable as retrieval requests to widely available tools such as web browsers) as long as there are providers to service them. This specification does not dictate what identifier scheme to use; suitable schemes include URN [RFC2141], [ARK], [GUID], etc.

Also worth considering is the establishment of lexical conventions for record-ids that reveal or suggest relationships among content blocks. Some record types are already required to reference certain related records, and the 'Related-Record-ID' parameter required of 'metadata', 'revisit', and 'conversion' records is sufficient to convey relatedness in the context of a single WARC file. Nevertheless, great optimization can be had when relatedness can be inferred by third parties through identifier comparison rather than by lookup in a database or examination of the relevant WARC files.

These conventions are suggested by [RFC2396], formalized by the [ARK] scheme, and are applicable to such things as the summarizing of large search results from Internet-wide indexing engines. As an example of a convention that could be adopted by users of any identifier scheme, the "/" character could be reserved as a separator used to introduce an extension string that is appended to a primary record-id. If the record-id of a primary block of captured content were,



The convention could also reserve the extension strings "_s", "_d", and "_t" to indicate record-ids for secondary, duplicate, and transform blocks, respectively. Over time this might result in the assignment of record-ids such as,











(informative)

WARC file size and name recommendations

1GB (10^9 bytes) is recommended as a practical target size for WARC files, when record sizes allow. Oversized records may be truncated, segmented, or placed in oversized WARC files, at a project's discretion.

It is helpful to use practices within an institution that make it unlikely or impossible to duplicate aggregate WARC file names. The convention used inside the Internet Archive with ARC files is to name files according to the following pattern:

Prefix-Timestamp-Serial-Crawlhost.warc.gz

Prefix is an abbreviation usually reflective of the project or crawl that created this file. Timestamp is a 14-digit GMT timestamp indicating the time the file was initially begun. Serial is an increasing serial-number within the process creating the files, often (but not necessarily) unique with regard to the Prefix. Crawlhost is the domain name or IP address of the machine creating the file.
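
The naming pattern can be illustrated with a short Python sketch. The function name `warc_file_name` is hypothetical, and the five-digit zero-padded serial is an assumption made for this example; the pattern itself does not fix a serial width.

```python
from datetime import datetime, timezone


def warc_file_name(prefix, serial, crawlhost, started=None):
    """Build a name of the form Prefix-Timestamp-Serial-Crawlhost.warc.gz.

    'started' should be a timezone-aware datetime; it defaults to now.
    """
    started = started or datetime.now(timezone.utc)
    # 14-digit GMT timestamp: YYYYMMDDhhmmss
    timestamp = started.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    # Zero-padding the serial to five digits is an assumption of this sketch.
    return f"{prefix}-{timestamp}-{serial:05d}-{crawlhost}.warc.gz"


begun = datetime(2005, 7, 8, 1, 1, 1, tzinfo=timezone.utc)
print(warc_file_name("test", 1, "crawl017", begun))
# test-20050708010101-00001-crawl017.warc.gz
```

Because the timestamp, serial and host are all part of the name, two processes on different machines, or two files begun in the same second by one process, still receive distinct names.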

IIPC member institutions have expressed an interest in adopting a common naming strategy, with per-institution unique identifiers to assist in marking WARC files with their institution of origin. It is proposed that all such WARC file names adhering to this future convention begin "iipc".

This specification does not require any particular WARC file naming practice, but conventions similar to the above are recommended within WARC-creating institutions. The file name prefix "iipc" should not be used unless participating in a future IIPC naming registry.

(informative)

WARC application to specific protocols

1. HTTP and HTTPS

A full HTTP or HTTPS response, with protocol information and content-body (if any), can be saved verbatim into a WARC file as an 'http-response' type record, with a MIME content-type of "application/http" (or "application/http; msgtype=response").

A full HTTP or HTTPS request, including all request headers and content-body (if any), can similarly be saved verbatim into a WARC file as an 'http-request' type record, with a MIME content-type of "application/http" (or "application/http; msgtype=request").

For either a request or response, an 'IP-Address' field may be used to record the network IP address to which the request was directed, using the best available DNS information at the time.

Additional metadata about the HTTP or HTTPS transaction may be stored in a 'metadata' type record, in a format to be specified elsewhere. In particular, information about the secure session in which an HTTPS transaction occurs, such as certificates presented or consulted and authentication information exchanged, may be stored in one or more 'metadata' type records.

The multiple records which pertain to a single HTTP or HTTPS logical group of records will all have unique record-id values. In order to associate the records, all but one shall use 'Related-Record-ID' fields to refer to another record in the set.

As any mixture of record types may appear for a single collection event, and in any order, there is no specific record type which is automatically considered primary. Generally, all may refer back to the one record which appeared first, but this is not required. (A request record may refer to a response record or vice-versa; either could refer to a 'metadata' record or a 'metadata' record could refer to either.) Multiple and bidirectional 'Related-Record-ID' fields may appear.

In the case where resources from a website have been harvested or otherwise received without performing normal HTTP operations, or where HTTP protocol information has been lost, it may be appropriate to store the plain content in WARC 'data' type records, under their original subject-uri, but using the content MIME type in place of the "application/http" type.

2. DNS

A request for DNS information can be summarized in a URI in accordance with an IETF Network Working Group draft proposal [DNS-URI]. DNS information as retrieved can be represented in the formats specified by [RFC1035], [RFC2540], and [RFC4027].

The results of a DNS lookup can thus be straightforwardly archived in a WARC 'response' record under the appropriate DNS URI and MIME type.

3. Other resources with URIs, and other protocols

Any resource that can be identified with a URI, even if it is not retrieved via an Internet operation, may be archived in a WARC file under a 'data' type record. This includes files that have meaningful URIs retrieved from a locally accessible file system or other repository.

OUTSIDE THE SCOPE Specific conventions for other protocols and media types are expected to be defined as necessary. In general, the WARC format should be capable of archiving any digital resource which has a URI, a specific time of collection and a discrete length.

The 'http-request' and 'http-response' record types should be used for verbatim or lossless transcripts of collection activity, including protocol information. The 'data' record type should be used for content without any protocol-specific enveloping. Additional information about a resource or transaction can be supplied in a protocol- or media-appropriate manner with 'metadata' type records.

(informative)

Examples of WARC records

General

Examples of each record-type are provided in this annex. In some cases, illustrative data is shown where conventions have not yet been specified. Each record header-line is split over multiple lines for readability; continuations of the single line are indented, and a newline should only be considered to appear at the end of the last indented line. Declared record lengths are approximate, and unique IDs and checksums shown are plausible random filler.

Example of 'warcinfo' record

The following 'warcinfo' example includes an XML description of the enclosing WARC file that is loosely modelled after the descriptions currently used in Internet Archive ARC files. However, this is an abbreviated and speculative illustration; the referenced WARC-specific namespace "" has not been formally defined anywhere, and may not reflect eventual practice with WARC files.

warc/0.9 1012 warcinfo

filedesc:test-20050708010101-00001-crawl017..warc.gz

20050708010101 text/xml uuid:cbad35b7-e591-4b43-8a67-9d1d8f9ef4cd

Heritrix 1.4.0

crawling017.

207.241.227.234

testcrawl-20050708

testcrawl with WARC output

IA_Admin

Mozilla/5.0 (compatible; heritrix/1.4.0 +)

WARC file version 0.9



The first line (spread over three lines for readability) shows the required line of positional parameters. This record has no named fields, as evidenced by the single blank line following the header-line. The content block is "text/xml", as declared in the header-line. Two newlines follow the content block.

Example of 'request' record

A 'request' record captures the protocol request used to collect a resource. For example, to collect the resource "", the following 'request' record might be generated:

warc/0.9 298 request

20050708010101 message/http

uuid:f569983a-ef8c-4e62-b347-295b227c3e51

IP-Address: 207.241.224.241

GET /images/logo.jpg HTTP/1.0

Host:

User-Agent: Mozilla/5.0 (compatible; crawler/1.4 +)

Example of 'response' record

The archived response to the above request might look like the following.

warc/0.9 7583 response

20050708010101 message/http

uuid:a4b26b6b-f918-4136-af04-f859d75aebe5

IP-Address: 207.241.224.241

Related-Record-ID: uuid:f569983a-ef8c-4e62-b347-295b227c3e51

Checksum: sha1:2ZWC6JAT6KNXKD37F7MOEKXQMRY75YY4

HTTP/1.x 200 OK

Date: Fri, 08 Jul 2005 01:01:01 GMT

Server: Apache/1.3.33 (Debian GNU/Linux) PHP/5.0.4-0.3

Last-Modified: Sun, 12 Jun 2005 00:31:01 GMT

Etag: "914480-1b2e-42ab8245"

Accept-Ranges: bytes

Content-Length: 6958

Keep-Alive: timeout=15, max=100

Connection: Keep-Alive

Content-Type: image/jpeg

[6958 bytes of binary data here]

Note the 'Related-Record-ID' named field referring back to the generating 'request' record, and the creation-date identical to the previous record.

Example of 'resource' record

This same file, "logo.jpg", might be archived internally to an organization under its local filesystem name. This could result in a 'resource' record:

warc/0.9 7141 resource

20050710010101 image/jpeg

uuid:a6c3132b-49b8-4fd5-8072-45ce66d48a4b

Checksum: sha1:37F7MOEKXQMRY75YY42ZWC6JAT6KNXKD

[6958 bytes of binary data here]

Example of 'metadata' record

If some crawl-time metadata should be archived near the above response, a 'metadata' record could be used like the following (with a purely speculative XML format):

warc/0.9 395 metadata

20050708010101 text/xml

uuid:a4acff63-c213-4f35-9652-41a0e2dfc492

Related-Record-ID: uuid:a4b26b6b-f918-4136-af04-f859d75aebe5



565

Note again the same creation-date as the preceding related records. A relationship is declared to the preceding 'response' record, but declaring a relationship to the 'request' would also be legal.

Example of 'revisit' record

If the same URI is later revisited and the content is unchanged, a 'revisit' record like the following (again with a speculative content-type) could be generated:

warc/0.9 395 revisit

20050808010101 text/xml

uuid:ad522b3b-d68c-464a-b5e2-38149cfb511d

Related-Record-ID: uuid:a4b26b6b-f918-4136-af04-f859d75aebe5

HTTP/1.x 304 Not Modified

Date: Mon, 08 Aug 2005 01:01:01 GMT

Etag: "914480-1b2e-42ab8245"

Again, reference is made back to the original 'response' record. A new creation-date reflects the time of revisit. This content block hypothesizes including header excerpts from a server response to explain the results of the revisit. (In this case, the remote server indicated the resource was unchanged from the previous 'Etag' value.) The actual formats for describing the result of a revisit remain to be defined.

Example of 'conversion' record

At some future date, the "image/jpeg" format may no longer be considered viable, prompting a conversion of the original archive content into a hypothetical new format, "image/neoimg", which generates a 3098 byte version of the same image. This could be accommodated with a 'conversion' record:

warc/0.9 4111 conversion

image/neoimg

uuid:c631da8a-e8db-44a8-84c5-9cc848dff35a

Related-Record-ID: uuid:a4b26b6b-f918-4136-af04-f859d75aebe5

Checksum: sha1:XQMRY75YY42ZWC6JAT6KNXKD37F7MOEK

[3098 bytes of binary data here]

An accompanying 'metadata' record, referring to this 'conversion' record, could contain additional details about the transformation. (Alternatively, new named-fields in this record could serve this role.)

Example of 'continuation' record

If the 'response' above had been so large that it would not fit into a single WARC file of desired maximum size, it would have to be segmented into separate smaller records. The first record would be as before, except with one additional named field, 'Segment-Number', with a value of "1", indicating that the record was the beginning of a segmented record set.

The subsequent segment for that record would then look like this:

warc/0.9 39514322 continuation

message/http

uuid:c0d36ada-af8c-4608-8409-e60818b1d9e9

Segment-Number: 2

Segment-Origin-ID: uuid:a4b26b6b-f918-4136-af04-f859d75aebe5

[39514114 bytes of binary data here]

Note that the 'Segment-Origin-ID' refers to the first segment of the set, the one with the "Segment-Number: 1" named field.

(informative)

Compression recommendations

1. General

The WARC format defines no internal compression. Whether and how WARC files should be compressed is an external decision.

However, experience with the precursor ARC format at the Internet Archive has demonstrated that applying simple standard compression can result in significant storage savings, while preserving random access to individual records.

For this purpose, the GZIP format with customary "deflate" compression is recommended, as defined in [RFC1952], [RFC1950], and [RFC1951]. Freely available source code implementing this format is available, and the technique is free of patent encumbrances. The GZIP format is also widely used and supported across many free and commercial software packages and operating systems.

This clause documents recommended, but optional, practices for compressing WARC files with GZIP.

2. Record-at-a-time compression

Per section 2.2 of the GZIP specification, a valid GZIP file consists of any number of gzip "members", each independently compressed.

Where possible, this property should be exploited to compress each record of a WARC file independently. This results in a valid GZIP file whose per-record subranges also stand alone as valid GZIP files.

External indexes of WARC file content may then be used to record each record's starting position in the GZIP file, allowing for random access of individual records without requiring decompression of all preceding records.

Note that the application of this convention causes no change to the uncompressed contents of an individual WARC record. In particular, the declared record length remains the length of the uncompressed record.
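
A minimal Python sketch of record-at-a-time compression follows. It uses only the standard `gzip` and `zlib` modules; the function names and the in-memory offset list standing in for an external index are assumptions made for illustration.

```python
import gzip
import zlib


def compress_record_at_a_time(records):
    """Return (gzip bytes, member start offsets), one gzip member per record."""
    chunks, offsets, position = [], [], 0
    for record in records:
        member = gzip.compress(record)  # each call yields a complete member
        offsets.append(position)
        chunks.append(member)
        position += len(member)
    return b"".join(chunks), offsets


def read_record_at(data, offset):
    """Decompress only the gzip member that starts at the given offset."""
    # MAX_WBITS | 16 selects gzip framing; decompression stops at the end
    # of the first member, so later members are never inflated.
    decompressor = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    return decompressor.decompress(data[offset:])


records = [b"first record\r\n\r\n", b"second record\r\n\r\n", b"third record\r\n\r\n"]
data, offsets = compress_record_at_a_time(records)

# Random access to the second record without touching the first.
print(read_record_at(data, offsets[1]))
# The concatenation is still one valid GZIP file.
print(gzip.decompress(data) == b"".join(records))
```

The offsets list plays the role of the external index mentioned above: it is the only extra information needed to jump straight to a record.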

3. GZIP extra field: skip-lengths ('sl')

Customarily, GZIP members do not declare their compressed length. This presents a problem for WARC processing which, after reading a small portion of a record, would like to skip to the next full record. In the absence of an external, precalculated index, using only the WARC record's uncompressed length would require the complete current record to be decompressed to find the start of the next record.

Section 2.3.1.1 of the GZIP format specification makes an allowance for arbitrary extension fields, called "extra-fields". We define here a new GZIP extra-field, "skip-lengths", identified by the two byte id "sl" (0x73, 0x6C).

This field, when present, shall contain two 4-byte unsigned integer values, with least significant byte first (as per other multi-byte values in the GZIP format). The first integer, compressed-skip-length, is a number of compressed bytes that may be skipped, from the beginning of the current GZIP member, to reach a distinct following member. (This value may be the exact length of the current member, but may also indicate a length of several related concatenated members.) The second integer, uncompressed-skip-length, is the number of uncompressed bytes that will be passed over when skipping the compressed-skip-length bytes forward.

With the help of these values, a decompressor can often skip forward past large ranges of the compressed input that are not of interest, restarting decompression at the targeted next member, while retaining knowledge of exactly how many bytes of uncompressed data have been skipped.

If the skip-length value is zero, the field should be ignored as if it were not present. (Compressors writing this field may use a zero value to reserve space for an as-yet-unknown skip-length, filling in the value if possible later.)
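
The layout of the proposed extra-field can be illustrated with the following Python sketch. Python's `gzip` module does not write extra-fields, so the sketch assembles a GZIP member by hand from a raw deflate stream; the function names are hypothetical, and the choice to record the exact length of the single member as the compressed-skip-length is one of the options the text allows.

```python
import gzip
import struct
import zlib

SL_ID = b"sl"  # 0x73, 0x6C


def gzip_member_with_skip_lengths(payload):
    """Build one GZIP member carrying an 'sl' extra-field for itself."""
    compressor = zlib.compressobj(wbits=-zlib.MAX_WBITS)  # raw deflate
    deflated = compressor.compress(payload) + compressor.flush()
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload) & 0xFFFFFFFF)
    # 10 fixed header bytes + 2 (XLEN) + 12 (subfield) + data + 8 (trailer)
    total = 10 + 2 + 12 + len(deflated) + len(trailer)
    subfield = SL_ID + struct.pack("<H", 8) + struct.pack("<II", total, len(payload))
    header = (b"\x1f\x8b\x08\x04"      # magic, CM=deflate, FLG=FEXTRA
              + b"\x00\x00\x00\x00"    # MTIME not recorded
              + b"\x00\xff"            # XFL=0, OS=unknown
              + struct.pack("<H", len(subfield)))
    return header + subfield + deflated + trailer


def read_skip_lengths(member):
    """Return (compressed-skip-length, uncompressed-skip-length) or None."""
    if member[:2] != b"\x1f\x8b" or not member[3] & 0x04:
        return None
    (xlen,) = struct.unpack_from("<H", member, 10)
    pos, end = 12, 12 + xlen
    while pos + 4 <= end:
        subfield_id = member[pos:pos + 2]
        (length,) = struct.unpack_from("<H", member, pos + 2)
        if subfield_id == SL_ID and length == 8:
            return struct.unpack_from("<II", member, pos + 4)
        pos += 4 + length
    return None


first = gzip_member_with_skip_lengths(b"record one" * 100)
second = gzip_member_with_skip_lengths(b"record two")
data = first + second

skip, _ = read_skip_lengths(data)       # jump over the first member unread
print(gzip.decompress(data[skip:]))     # b'record two'
```

A reader that does not know the 'sl' field, such as `gzip.decompress` here, simply passes over it, which is the behaviour the extra-field mechanism is designed to give.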

This extra-field will be registered with the GZIP authors as appropriate.

4. GZIP WARC file name suffix

A WARC file compressed with the extra GZIP field conventions described in this document is a legal GZIP file. To ensure that it is properly recognized by GZIP tools, its name should have the customary ".gz" appended to it, making the complete suffix ".warc.gz". GZIP software that does not recognize the extra GZIP fields will simply pass over them without benefit or harm.

(informative)

Examples of WARC records

1. Example of 'warcinfo' record

WARC/0.17

WARC-Type: warcinfo

WARC-Date: 2006-09-19T17:20:14Z

WARC-Record-ID:

Content-Type: application/warc-fields

Content-Length: 381

software: Heritrix 1.12.0

hostname: crawling017.

ip: 207.241.227.234

isPartOf: testcrawl-20050708

description: testcrawl with WARC output

operator: IA_Admin

http-header-user-agent:

Mozilla/5.0 (compatible; heritrix/1.4.0 +)

format: WARC file version 0.17

conformsTo:



2. Example of 'request' record

WARC/0.17

WARC-Type: request

WARC-Target-URI:

WARC-Warcinfo-ID:

WARC-Date: 2006-09-19T17:20:24Z

Content-Length: 236

WARC-Record-ID:

Content-Type: application/http;msgtype=request

WARC-Concurrent-To:

GET /images/logoc.jpg HTTP/1.0

User-Agent: Mozilla/5.0 (compatible; heritrix/1.10.0)

From: stack@

Connection: close

Referer:

Host:

Cookie: PHPSESSID=009d7bb11022f80605aa87e18224d824

3. Example of 'response' record

WARC/0.17

WARC-Type: response

WARC-Target-URI:

WARC-Warcinfo-ID:

WARC-Date: 2006-09-19T17:20:24Z

WARC-Block-Digest: sha1:UZY6ND6CCHXETFVJD2MSS7ZENMWF7KQ2

WARC-Payload-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2

WARC-IP-Address: 207.241.233.58

WARC-Record-ID:

Content-Type: application/http;msgtype=response

WARC-Identified-Payload-Type: image/jpeg

Content-Length: 1902

HTTP/1.1 200 OK

Date: Tue, 19 Sep 2006 17:18:40 GMT

Server: Apache/2.0.54 (Ubuntu)

Last-Modified: Mon, 16 Jun 2003 22:28:51 GMT

ETag: "3e45-67e-2ed02ec0"

Accept-Ranges: bytes

Content-Length: 1662

Connection: close

Content-Type: image/jpeg

[image/jpeg binary data here]

4. Example of 'resource' record

WARC/0.17

WARC-Type: resource

WARC-Target-URI:

WARC-Date: 2006-09-19T17:20:24Z

WARC-Record-ID:

Content-Type: image/jpeg

WARC-Payload-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2

WARC-Block-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2

Content-Length: 1662

[image/jpeg binary data here]

5. Example of 'metadata' record

WARC/0.17

WARC-Type: metadata

WARC-Target-URI:

WARC-Date: 2006-09-19T17:20:24Z

WARC-Record-ID:

WARC-Concurrent-To:

Content-Type: application/warc-fields

WARC-Block-Digest: sha1:UZY6ND6CCHXETFVJD2MSS7ZENMWF7KQ2

Content-Length: 59

via:

hopsFromSeed: E

fetchTimeMs: 565

6. Example of 'revisit' record

WARC/0.17

WARC-Type: revisit

WARC-Target-URI:

WARC-Date: 2007-03-06T00:43:35Z

WARC-Profile:

WARC-Record-ID:

WARC-Refers-To:

Content-Type: message/http

Content-Length: 226

HTTP/1.x 304 Not Modified

Date: Tue, 06 Mar 2007 00:43:35 GMT

Server: Apache/2.0.54 (Ubuntu) PHP/5.0.5-2ubuntu1.4

Connection: Keep-Alive

Keep-Alive: timeout=15, max=100

Etag: "3e45-67e-2ed02ec0"

7. Example of 'conversion' record

WARC/0.17

WARC-Type: conversion

WARC-Target-URI:

WARC-Date: 2016-09-19T19:00:40Z

WARC-Record-ID:

WARC-Refers-To:

WARC-Block-Digest: sha1:XQMRY75YY42ZWC6JAT6KNXKD37F7MOEK

Content-Type: image/neoimg

Content-Length: 934

[image/neoimg binary data here]

8. Example of segmentation ('continuation' record)

Let us take the example of the 'response' record given earlier, and segment it to fit within a WARC file no larger than 2K. The first WARC file would contain the first segment, a record of type 'response' with a WARC-Segment-Number of 1. Note that the block-digest has changed, as the block is no longer the same as the standalone 'response' record, but the payload-digest has not changed, as the reassembled record will have the same internal payload.

WARC/0.17

WARC-Type: response

WARC-Target-URI:

WARC-Date: 2006-09-19T17:20:24Z

WARC-Block-Digest: sha1:2ASS7ZUZY6ND6CCHXETFVJDENAWF7KQ2

WARC-Payload-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2

WARC-IP-Address: 207.241.233.58

WARC-Record-ID:

WARC-Segment-Number: 1

Content-Type: application/http;msgtype=response

Content-Length: 1600

HTTP/1.1 200 OK

Date: Tue, 19 Sep 2006 17:18:40 GMT

Server: Apache/2.0.54 (Ubuntu)

Last-Modified: Mon, 16 Jun 2003 22:28:51 GMT

ETag: "3e45-67e-2ed02ec0"

Accept-Ranges: bytes

Content-Length: 1662

Connection: close

Content-Type: image/jpeg

[first 1360 bytes of image/jpeg binary data here]

The next file would contain the 'continuation' record, with fields to identify the start of the segmentation series (WARC-Segment-Origin-ID), to indicate this record's place in the series (WARC-Segment-Number), and to report that this is the last record and what the total size is (WARC-Segment-Total-Length).

WARC/0.17

WARC-Type: continuation

WARC-Target-URI:

WARC-Date: 2006-09-19T17:20:24Z

WARC-Block-Digest: sha1:T7HXETFVA92MSS7ZENMFZY6ND6WF7KB7

WARC-Record-ID:

WARC-Segment-Origin-ID:

WARC-Segment-Number: 2

WARC-Segment-Total-Length: 1902

WARC-Identified-Payload-Type: image/jpeg

Content-Length: 302

[last 302 bytes of image/jpeg binary data here]

(informative)

Collected ABNF for WARC

warc-file = 1*warc-record

warc-record = header block CRLF CRLF

header = header-line CRLF *named-field CRLF

block = *OCTET

header-line = warc-id vwsp data-length vwsp record-type vwsp subject-uri vwsp creation-date vwsp content-type vwsp record-id vwsp segment-status vwsp

vwsp = 1*WSPACE

warc-id = "warc/" DIGIT DIGIT "." DIGIT DIGIT

data-length = 1*DIGIT

record-type = "warcinfo" / "response" / "request" / "metadata" / "revisit" / "conversion" / "continuation" /

future-type

future-type = 1*VCHAR

subject-uri = uri

uri =

creation-date = timestamp

timestamp = ; Greenwich Mean Time

content-type = type "/" subtype

segment-status = SegCode SegNum

SegCode = "p" / "w" / "t" / "z"

SegNum = 1*DIGIT

type =

subtype =

record-id = uri

anvlnamed-field = defined-fields / field-name ":" [ field-body ] CRLF

defined-fields = "type: warcinfo" CRLF

/ "type:" vwsp "http-response" vwsp uri CRLF

/ "type:" vwsp "http-request" vwsp response-uri CRLF

/ "type:" vwsp "dns-request" vwsp dns-uri CRLF

/ "type:" vwsp "metadata" vwsp ( ref-uri / "-" ) CRLF

/ "type:" vwsp "conversion" vwsp ref-uri vwsp flag CRLF

/ "type:" vwsp "http-request" vwsp response-uri CRLF

/ "type:" vwsp "data" CRLF

/ "content-type:" vwsp CRLF ; per RFC 2045

/ "revisit:" vwsp ref-uri vwsp

("same" / "different" / "patch" ) CRLF

/ "note:" vwsp field-body CRLF

/ "IP-Address:" vwsp CRLF ; per RFC 1884

/ "Checksum:" vwsp "sha1:" field-body CRLF

/ "Related-Resource:" vwsp relationship vwsp uri CRLF

/ "Segment-Origin-ID:" vwsp warc-record-id CRLF

/ "Warcinfo-ID:" vwsp warc-record-id CRLF

response-uri = uri

dns-uri = uri

ref-uri = uri

warc-record-id = uri

flag = "noenvelope" / "-"

relationship = "-" /

field-name = 1*

field-body = text [CRLF 1*LWSP-char field-body]

text = 1*

; (Octal, Decimal.)

CHAR = ; ( 0-177, 0.-127.)

CR = ; ( 15, 13.)

LF = ; ( 12, 10.)

SPACE = ; ( 40, 32.)

HTAB = ; ( 11, 9.)

CRLF = CR LF

LWSP-char = SPACE / HTAB ; semantics = SPACE

Authors' Addresses

John A. Kunze (editor)

California Digital Library

415 20th St, 4th Floor

Oakland, CA 94612-3550

US

Fax: +1 510-893-5212

Email: jak@ucop.edu

Allan Arvidson

Kungliga biblioteket (National Library of Sweden)

Box 5039

Stockholm 10241

SE

Fax: +46 (0)8 463 4004

Email: allan.arvidson@kb.se

Gordon Mohr

Internet Archive

4 Funston Ave, Presidio

San Francisco, CA 94117

US

Email: gojomo@

Michael Stack

Internet Archive

4 Funston Ave, Presidio

San Francisco, CA 94117

US

Email: stack@

(informative)

WARC file name and size recommendations

It is helpful to use practices within an institution that make it unlikely or impossible to duplicate aggregate WARC file names. The convention used inside the Internet Archive with ARC files is to name files according to the following pattern:

Prefix-Timestamp-Serial-Crawlhost.warc.gz

Prefix is an abbreviation usually reflective of the project or crawl that created this file. Timestamp is a 14-digit GMT timestamp indicating the time the file was initially begun. Serial is an increasing serial-number within the process creating the files, often (but not necessarily) unique with regard to the Prefix. Crawlhost is the domain name or IP address of the machine creating the file.
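The naming pattern can be illustrated with a short sketch. The function name `warc_file_name`, the five-digit zero padding of the serial number, and the sample prefix and host name are assumptions made for illustration; the pattern itself only fixes the order of the four components and the 14-digit GMT timestamp.

```python
from datetime import datetime, timezone


def warc_file_name(prefix: str, started: datetime, serial: int,
                   crawlhost: str, serial_width: int = 5) -> str:
    """Build a name of the form Prefix-Timestamp-Serial-Crawlhost.warc.gz."""
    if started.tzinfo is None:
        raise ValueError("started must be timezone-aware so it can be expressed in GMT")
    # 14-digit GMT timestamp: YYYYMMDDhhmmss
    timestamp = started.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    # Zero padding of the serial number is an illustrative choice, not a requirement.
    return f"{prefix}-{timestamp}-{serial:0{serial_width}d}-{crawlhost}.warc.gz"


name = warc_file_name(
    "EXAMPLE",
    datetime(2006, 9, 19, 17, 20, 24, tzinfo=timezone.utc),
    1,
    "crawler01.example.org",
)
# name == "EXAMPLE-20060919172024-00001-crawler01.example.org.warc.gz"
```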

IIPC member institutions have expressed an interest in adopting a common naming strategy, with unique identifiers attributed to institutions to assist in marking WARC files with their institution of origin. It is proposed that all such WARC file names adhering to this future convention begin "iipc".

The WARC File Format specification does not require any particular WARC file naming practice, but recommends conventions similar to the above be adopted within WARC-creating institutions. The file name prefix "iipc" should be avoided unless participating in the IIPC naming registry.

1GB (10^9 bytes) is recommended as a practical target size of WARC files, when record sizes allow. Oversized records may be truncated, segmented, or simply placed in oversized WARC files, at a project's discretion.
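One way a writer might apply a target size is sketched below. The function `place_record` and its three return values are hypothetical, and the target is passed in as a parameter because the choice of size, and of what to do with an oversized record, is left to each project.

```python
def place_record(current_file_size: int, record_size: int, target_size: int) -> str:
    """Decide where a record of record_size octets should go.

    Returns "append" if the record fits in the current file, "new-file" if a
    new WARC file should be started first, and "oversized" if the record alone
    exceeds the target, in which case the project chooses between truncation,
    segmentation, or an oversized file.
    """
    if record_size > target_size:
        return "oversized"
    if current_file_size + record_size > target_size:
        return "new-file"
    return "append"


TARGET = 10**9  # example target in bytes; substitute the project's chosen size

decision_small = place_record(0, 2_000, TARGET)              # "append"
decision_full = place_record(999_999_000, 2_000, TARGET)     # "new-file"
decision_large = place_record(0, 2 * 10**9, TARGET)          # "oversized"
```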

(informative)

Registration of MIME media type application/warc

This Annex describes, as defined in [RFC2048], the MIME types associated with the WARC format.

MIME media type name: application

MIME subtype names: warc

Required parameters: None

Optional parameters: None

Encoding considerations:

Content of this type is in 'binary' format. UTF-8 is the default character encoding for the textual information defined by the WARC format. However, any binary data may be included as blocks within the WARC format, and so only the "8bit" and "binary" encodings are allowable.

Security considerations:

The WARC record syntax poses no direct risk to computers and networks. Implementors need to be aware of source authority and trustworthiness of information structured in WARC. Readers and writers subject themselves to all the risks that accompany normal operation of data processing services (e.g., message length errors, buffer overflow attacks).

Interoperability considerations: None

Published specification: TBD

Applications which use this media type: Large- and small-scale archiving

Additional information: None

Person and email address to contact for further information:

Gordon Mohr gojomo@, John Kunze jak@ucop.edu

Intended usage: COMMON

Author/Change controller: IESG

After IESG approval, IANA is expected to register the WARC type "application/warc" using the application provided in this document.

Bibliography

[ANVL] Kunze, J., Kahle, B., Masanes, J., and G. Mohr, “A Name-Value Language” (PDF).

[ARC] Burner, M. and B. Kahle, “The ARC File Format,” September 1996 (HTML).

[ARK] Kunze, J. and R. Rodgers, “The ARK Persistent Identifier Scheme,” August 2005 (PDF).

[DCMI] “DCMI Metadata Terms,” December 2006 (HTML).

[GUID] “Wikipedia: Globally Unique Identifiers” (HTML).

[HERITRIX] “Heritrix Open Source Archival Web Crawler” (HTML).

[IIPC] “International Internet Preservation Consortium (IIPC)” (HTML).

[DNS-URI] Josefsson, S., “Domain Name System Uniform Resource Identifiers,” May 2005 (TXT).

(informative)

Use cases for writing WARC records

The use cases below illustrate situations in which WARC files and WARC records may be generated. The solutions adopted for each use case are not the only possible ones; they are presented as examples.

The first column describes the use case and its different steps (some of them are hypothetical).

The second column indicates what kind of record is generated. To clarify their use, the content of four named fields is specified: WARC-Type (mandatory field), WARC-Date (mandatory field), WARC-Concurrent-To (optional field), and WARC-Refers-To (optional field).

Note: these WARC records are assumed to be written to already opened WARC files, each with a ‘warcinfo’ record describing the process during which they are created.

|Use case one: A crawler archives a file from the World Wide Web |

|Date: D |

|Request sent by the crawler to the server |WARC-Type: ‘request’ |

| |WARC-Date: D |

| |WARC-Concurrent-To: WARC-Record ID of the following ‘response’ |

| |record |

|Response received by the crawler from the server |WARC-Type: ‘response’ |

| |WARC-Date: D |

|Generation of metadata further describing the harvesting process / |WARC-Type: ‘metadata’ |

|the harvested record (if necessary) |WARC-Date: D |

| |WARC-Concurrent-To: WARC-Record ID of the previous ‘response’ record|

|Segmentation of the WARC record (if the file harvested on the web is|WARC-Type: ‘continuation’ |

|too big to be contained in a single WARC file) |WARC-Date: D |

| |

|Use case two: A file is archived in a WARC file through another process than a Web harvest |

|Date: D’ |

|Resource archived |WARC-Type: ‘resource’ |

| |WARC-Date: D’ |

|Generation of metadata further describing the archiving process / |WARC-Type: ‘metadata’ |

|the archived record (if necessary) |WARC-Date: D’ |

| |WARC-Concurrent-To: WARC-Record ID of the previous ‘resource’ record|

| |

|Use case three: A crawler archives a file from the World Wide Web that has not changed since the latest harvest |

|Date: D+1 |

|Request sent by the crawler to the server |WARC-Type: ‘request’ |

| |WARC-Date: D+1 |

| |WARC-Concurrent-To: WARC-Record ID of the following ‘revisit’ record|

|The crawler recognizes the file has not changed. The file is not |WARC-Type: ‘revisit’ |

|recorded to reduce storage redundancy |WARC-Date: D+1 |

| |WARC-Refers-To: WARC-Record ID of the already recorded file |

| |

|Use case four: Further metadata are deemed necessary to describe one or more WARC records |

|Date: D+2 |

|Generation of metadata describing each WARC record |WARC-Type: ‘metadata’ |

| |WARC-Date: D+2 |

| |WARC-Refers-To: WARC-Record ID of the described WARC record |

| |

|Use case five: A file format has become obsolete as it can’t be read anymore by the existing rendering tools. It appears necessary to |

|migrate a file in the obsolete format to a new format |

|Date: D+3 |

|Generation of a file in the new format |WARC-Type: ‘conversion’ |

| |WARC-Date: D+3 |

| |WARC-Refers-To: WARC-Record ID of the WARC record whose payload has |

| |been migrated |

|Generation of metadata describing the migration process (if |WARC-Type: ‘metadata’ |

|necessary) |WARC-Date: D+3 |

| |WARC-Refers-To: WARC-Record ID of the previous conversion record |

The use cases below illustrate situations in which WARC files and WARC records may be generated. These use cases correspond to the needs of the web archiving community.

N.B.: In a web harvesting context, the files constituting the websites are stored as WARC records in WARC files. Depending on the configuration of the web harvesting process, the different pieces of a website may not be contained in a single WARC file or in a set of WARC files, but may be spread out and stored alongside pieces of other harvested websites. Thus, to render the archive of a website to users, access software may have to extract files contained in WARC records from different WARC files. External indexes may be used for quicker access.

Other users may devise other use cases to meet their own needs. Moreover, the solutions adopted for each use case are not the only possible ones; they are presented as examples.

The first column describes the use case and its different steps.

The second column indicates what type of record is generated. To clarify their use, only the most complex named fields are specified: WARC-Type (mandatory field), WARC-Date (mandatory field), WARC-Concurrent-To (optional field), and WARC-Refers-To (optional field). The other mandatory or useful named fields are not presented here.

Note: these WARC records are assumed to be written to an already opened WARC file containing a ‘warcinfo’ record.

|Use case one: An archiving crawler fetches from the World Wide Web and writes it in a WARC |

|file. |

|Date: 2007-10-24 at 10:14:22 GMT |

|A request is sent by the crawler to the server hosting |WARC record created: |

| |WARC-Type: ‘request’ |

| |WARC-Date: 2007-10-24T10:14:22Z |

| |WARC-Concurrent-To: WARC-Record ID of the following ‘response’ record |

|A response is received by the crawler from the server |WARC record created: |

| |WARC-Type: ‘response’ |

| |WARC-Date: 2007-10-24T10:14:22Z |

|Metadata further describing the harvesting process / the harvested |WARC record created: |

|record are added (e.g. information coming from the log files) |WARC-Type: ‘metadata’ |

| |WARC-Date: 2007-10-24T10:14:22Z |

| |WARC-Concurrent-To: WARC-Record ID of the previous ‘response’ record |

|If the file harvested on the web is too big to be contained in a single |Second WARC record created: |

|WARC file (e.g. 1.5 GB), the WARC record is segmented and a second |WARC-Type: ‘continuation’ |

|record is created |WARC-Date: 2007-10-24T10:14:22Z |

| |

|Use case two: the XML version of the French Gazette of 2007-11-01 has been transferred to the National Library of France (via FTP or email). This |

|file is archived in a WARC file. |

|Date: 2007-11-02 at 15:20:44 GMT |

|The resource is archived |WARC record created: |

| |WARC-Type: ‘resource’ |

| |WARC-Date: 2007-11-02T15:20:44Z |

|Metadata further describing the archiving process / the archived record |WARC record created: |

|are added (e.g. information about the transfer) |WARC-Type: ‘metadata’ |

| |WARC-Date: 2007-11-02T15:20:44Z |

| |WARC-Concurrent-To: WARC-Record ID of the previous ‘resource’ record |

|Use case three: An archiving crawler fetches from the World Wide Web that has not changed |

|since the latest harvest |

|Date: 2007-11-24 at 18:28:24 GMT |

|A request is sent by the crawler to the server hosting |WARC record created: |

| |WARC-Type: ‘request’ |

| |WARC-Date: 2007-11-24T18:28:24Z |

| |WARC-Concurrent-To: WARC-Record ID of the following ‘revisit’ record |

|The crawler detects that the file is the same as previously archived and|WARC record created: |

|that it has not changed. The entire file is not recorded to avoid |WARC-Type: ‘revisit’ |

|duplicates and reduce storage redundancy |WARC-Date: 2007-11-24T18:28:24Z |

| |WARC-Refers-To: WARC-Record ID of the already written record |

|Use case four: After the end of the harvest, Jhove is used to validate the format of . It |

|produces validation results that have to be stored in a WARC file and linked to the corresponding record. |

|Date: 2007-11-01 at 20:54:02 GMT |

|Results of the validation process are added in another WARC file |WARC record created: |

| |WARC-Type: ‘metadata’ |

| |WARC-Date: 2007-11-01T20:54:02Z |

| |WARC-Refers-To: WARC-Record ID of the described WARC record |

|Use case five: file format has become obsolete as it cannot be read anymore by the existing |

|rendering tools. It is necessary to migrate this file from the obsolete format to a new format. |

|Date: 2020-01-23 at 16:14:32 GMT |

|A file in the new format is generated |WARC record created: |

| |WARC-Type: ‘conversion’ |

| |WARC-Date: 2020-01-23T16:14:32Z |

| |WARC-Refers-To: WARC-Record ID of the WARC record whose payload has been |

| |migrated |

|Metadata describing the migration process are added (e.g. tool used) |WARC record created: |

| |WARC-Type: ‘metadata’ |

| |WARC-Date: 2020-01-23T16:14:32Z |

| |WARC-Refers-To: WARC-Record ID of the previous conversion record |
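The linking fields used in use cases one and four can be illustrated with a short sketch. It builds only the four named fields shown in the tables plus a record ID. The helper names `new_record_id` and `make_headers` are hypothetical, and the use of `urn:uuid:` URIs as record IDs is an assumption made for illustration; any scheme that yields unique record identifiers would serve.

```python
import uuid
from typing import Dict, Optional


def new_record_id() -> str:
    """Return a fresh identifier; the urn:uuid scheme is an illustrative choice."""
    return f"urn:uuid:{uuid.uuid4()}"


def make_headers(warc_type: str, warc_date: str,
                 concurrent_to: Optional[str] = None,
                 refers_to: Optional[str] = None) -> Dict[str, str]:
    """Build the named fields discussed above for a single record."""
    headers = {
        "WARC-Type": warc_type,
        "WARC-Date": warc_date,
        "WARC-Record-ID": new_record_id(),
    }
    if concurrent_to is not None:
        headers["WARC-Concurrent-To"] = concurrent_to
    if refers_to is not None:
        headers["WARC-Refers-To"] = refers_to
    return headers


# Use case one: the 'response' ID is generated first so that the 'request'
# and 'metadata' records can point to it through WARC-Concurrent-To.
date = "2007-10-24T10:14:22Z"
response = make_headers("response", date)
request = make_headers("request", date, concurrent_to=response["WARC-Record-ID"])
metadata = make_headers("metadata", date, concurrent_to=response["WARC-Record-ID"])

# Use case four: later metadata points back to the described record
# through WARC-Refers-To.
validation = make_headers("metadata", "2007-11-01T20:54:02Z",
                          refers_to=response["WARC-Record-ID"])
```

Generating the ID of the 'response' record before writing the 'request' record is one way to satisfy the forward reference in the first row of the table; a writer could equally well buffer both records and assign the IDs together.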
