WIPO DELTA Dataset Specification



WORLD INTELLECTUAL PROPERTY ORGANIZATION
SPECIAL UNION FOR THE INTERNATIONAL PATENT CLASSIFICATION (IPC UNION)

Date            By                  Version  Status    Modification
March 21, 2018  Collioud            0.1      DRAFT     Created
April 3, 2018   Fievet              0.2      Revised   Ref to Standards and Conventions
April 3, 2018   Fievet              0.3      Revised   Limit number of files per directory or zip file
April 5, 2018   Approved by Fievet  1.0      Approved  FH / Promoted to version 1.0

Contact: WIPO: Patrick FIÉVET (patrick.fievet@wipo.int)

Table of Contents

1. Introduction
2. General conventions
2.1. Language codes
2.2. Country codes
2.3. Conventions used in this document
2.4. Dates
3. Files management and naming
4. File structure
4.1. document
4.1.1. country
4.1.2. date-publ
4.1.3. doc-id
4.1.4. family-id
4.1.5. kind
4.1.6. doc-number
4.2. ipc-postreform
4.3. ipc-from-cpc
4.4. classification-ipc
4.5. language-of-filing
4.6. language-of-publication
4.7. parties
4.8. applicants
4.9. inventors
4.10. invention-title
4.10.1. lang
4.11. abstract
4.11.1. abstract-source
4.11.2. lang
5. Links to EPO Espacenet
6. Examples
6.1. Examples of EPO Espacenet URL
6.2. Example of individual document XML file in the WIPO DELTA English collection

1. Introduction

The WIPO DELTA collection is primarily intended for research institutes working on artificial intelligence applied to automatic text categorization in the International Patent Classification (IPC). It can be used as a reference dataset for comparing the performance of various IT solutions in this domain.

This document describes the structure and content of the WIPO DELTA dataset collection.

The WIPO DELTA collection is made of language-dependent files in XML format. The English and French files are used as source data for the training of IPCCAT, the WIPO IT solution for neural text categorization in the IPC. This document also describes information found in the corresponding XML schema.

The WIPO DELTA dataset collection complements the WIPO alpha collection created and published by WIPO in 2004 for the same purpose.

It refers, wherever possible, to international standards, in particular WIPO ST.36 and ST.8.

The format of these files is XML, i.e. primarily an exchange format aiming at easy interfacing between IT systems of different types.

This product contains data sourced from EPO databases, © European Patent Organisation. WIPO DELTA files are produced through an extraction process taking the EPO product DOCDB XML as input. This extraction process, which is WIPO’s property, targets the creation of the largest collection of patent documents with the best possible IPC symbols, sometimes through the use of CPC symbols and their related concordance to the IPC.

The associated XSD is attached as annex 1.

2. General conventions

2.1. Language codes

Two-letter code according to ISO 639-1.

2.2. Country codes

Two-letter code according to WIPO ST.3.

For IPC symbols, the provisional country code “ZZ” is used in positions 41-42 of the symbol when the “generating office” is not known, e.g. in some CPC symbols.

2.3. Conventions used in this document

Specific terms, e.g. tag and attribute names, are indicated in italics.

2.4. Dates

YYYYMMDD with:
YYYY: year, 4-digit numerical value.
MM: month, 2-digit numerical value.
DD: day, 2-digit numerical value.

3. Files management and naming

The atomic element of the WIPO DELTA collection is one XML file containing the information of a given patent document in a given language.

These WIPO DELTA XML files are grouped by language in a file system structure such as

IPC_training_dataset_YYYYMMDD/LL/NNNN/IPC-CC-LL-doc_id-family_id-YYYYMMDD.xml

where:
YYYYMMDD: the extraction date
LL: uppercase language code, see paragraph 2 (matching the value of attribute lang of elements invention-title and abstract)
NNNN: sequential number to limit the number of files per directory
CC: country code, see paragraph 2 (matching the value of attribute country of element document)
doc_id: the document unique identifier (matching the value of attribute doc-id of element document)
family_id: the DOCDB family identifier (matching the value of attribute family-id of element document)

For each language LL, the IPC_training_dataset_YYYYMMDD/LL/NNNN directory is compressed into as many LL_WIPO_DELTA_training_dataset_YYYYMMDD_NNNN.zip files as needed.

An aggregated file LL_WIPO_DELTA_training_dataset_YYYYMMDD.zip may also be created to ease the download of WIPO DELTA in a given language.

4. File structure

A document in the WIPO DELTA dataset collection is essentially made of the following patent document information:
Bibliographic data to identify this document and its allotted IPC symbol(s),
Its title,
Its abstract.
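
As an illustration of the naming convention in paragraph 3, the following minimal Python sketch splits a file name into its components. It rests on several assumptions not stated in this specification: that "IPC" is a literal prefix, that doc_id and family_id are purely numerical, and that the fields are joined by single hyphens. The function name parse_delta_file_name and the sample file name are hypothetical.

```python
import re
from datetime import date, datetime
from typing import NamedTuple


class DeltaFileName(NamedTuple):
    country: str            # CC: WIPO ST.3 country code
    language: str           # LL: uppercase ISO 639-1 language code
    doc_id: str             # matches attribute doc-id of element document
    family_id: str          # matches attribute family-id of element document
    extraction_date: date   # YYYYMMDD: extraction date


# Assumption: "IPC" is a literal prefix and the fields are hyphen-separated.
_FILE_NAME = re.compile(
    r"^IPC-(?P<cc>[A-Z]{2})-(?P<ll>[A-Z]{2})"
    r"-(?P<doc_id>\d+)-(?P<family_id>\d+)-(?P<ymd>\d{8})\.xml$"
)


def parse_delta_file_name(name: str) -> DeltaFileName:
    """Split a WIPO DELTA file name into its components (hypothetical helper)."""
    match = _FILE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a WIPO DELTA file name: {name!r}")
    # strptime rejects impossible dates such as month 13.
    extraction = datetime.strptime(match["ymd"], "%Y%m%d").date()
    return DeltaFileName(
        match["cc"], match["ll"], match["doc_id"], match["family_id"], extraction
    )


# Hypothetical file name; the extraction date is invented for the example.
info = parse_delta_file_name("IPC-AP-EN-356867-34477-20180321.xml")
print(info.country, info.language, info.extraction_date.isoformat())
```

Keeping doc_id and family_id as strings avoids losing leading zeros, should any occur in the collection.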

4.1. document

Unique root element containing one occurrence of:
optional elements: ipc-postreform, ipc-from-cpc, language-of-filing, language-of-publication and parties,
required elements: invention-title and abstract.

The following attributes must also be specified:

4.1.1. country
Publishing country.
Value: a country code, see paragraph 2.

4.1.2. date-publ
Publication date.
Value: a date, see paragraph 2.

4.1.3. doc-id
Unique and stable identifier of the publication reference.
Value: a variable-length numerical value of up to 10 digits.

4.1.4. family-id
Unique identifier of the EPO simple patent family to which the publication belongs.
Value: a variable-length numerical value.

4.1.5. kind
Kind of document.
Value: 2-character alphanumerical code according to WIPO ST.16.

4.1.6. doc-number
Number of the document.
Value: a variable-length alphanumerical value.

4.2. ipc-postreform
This optional element must contain a list of one or more classification-ipc elements.

4.3. ipc-from-cpc
This optional element must contain a list of one or more classification-ipc elements (resulting from CPC to IPC concordance).

4.4. classification-ipc
This element must contain a character string: an IPC symbol on 50 positions according to WIPO ST.8.

4.5. language-of-filing
This element must contain a lowercase character string: the document filing language, see paragraph 2.

4.6. language-of-publication
This element must contain a lowercase character string: the document publication language, see paragraph 2.

4.7. parties
This optional element must contain one applicants element and/or one inventors element.

4.8. applicants
This element must contain a UTF-8 character string: a list of applicant name(s) separated by a dot followed by a space.

4.9. inventors
This element must contain a UTF-8 character string: a list of inventor name(s) separated by a dot followed by a space.

4.10. invention-title
This element must contain a UTF-8 character string.

The following attribute must also be specified:

4.10.1. lang
Language in which the title is written.
Value: a language code, see paragraph 2.

4.11. abstract
This element must contain a UTF-8 character string.

The following attributes must also be specified:

4.11.1. abstract-source
Source of the document abstract.
Value: one of these strings:
“EPO”: provided by EPO
“PAJ”: Patent Abstracts of Japan in English
“national office”: as filed by the national office
“translation”: translated to English
“transcript”: transliterated from a non-Latin character string

4.11.2. lang
Language in which the abstract is written.
Value: a language code, see paragraph 2.

5. Links to EPO Espacenet

Using the values of the document attributes described in paragraph 4.1 above, URLs similar to the following can be built to access the corresponding document in some of the EPO Espacenet databases. For English, French and German, the following syntax can be used:

? DB=<language>.worldwide.CC=<country>&NR=<doc-number><kind>&KC=<kind>&FT=D&locale=<language>_EP

where <language> is the language chosen to display the Espacenet user interface, i.e. “en” for English, “fr” for French or “de” for German.

See examples below.

6. Examples

6.1. Examples of EPO Espacenet URL

Example of EPO Espacenet URL for English:
Example of EPO Espacenet URL for French:

6.2. Example of individual document XML file in the WIPO DELTA English collection

<document doc-number="1051" doc-id="356867" date-publ="19870416" country="AP" family-id="34477" kind="A1">
  <ipc-postreform>
    <classification-ipc>B65D 5/06 20060101AFI20160315BHEP </classification-ipc>
    <classification-ipc>A41D 27/00 20060101AFI20160922BHIL </classification-ipc>
  </ipc-postreform>
  <ipc-from-cpc>
    <classification-ipc>A01N 1/01 20130101ALI20160307BHEP </classification-ipc>
  </ipc-from-cpc>
  <language-of-filing>sv</language-of-filing>
  <language-of-publication>en</language-of-publication>
  <parties>
    <applicants>TETRA PAK AB</applicants>
    <inventors>IGNELL ROLF</inventors>
  </parties>
  <invention-title lang="EN">A packaging container and a blank for the manufacture of the same.</invention-title>
  <abstract lang="EN">Packing container</abstract>
</document>
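
The file structure of paragraph 4 can be read with a standard XML parser. The sketch below is one possible approach in Python, not a reference implementation: it loads a document element, collects its attributes, and splits each classification-ipc string into fields. The field names given to the symbol parts (level, position, value, status, source, office) reflect an assumed reading of the WIPO ST.8 layout and should be checked against that standard; only the generating office in positions 41-42 is stated in this specification. The pattern tolerates variable whitespace instead of relying on fixed positions. The function names parse_ipc_symbol and parse_document are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumed ST.8 field order; whitespace is matched loosely so that both padded
# 50-position strings and whitespace-collapsed strings are accepted.
_ST8 = re.compile(
    r"^(?P<section>[A-H])(?P<cls>\d{2})(?P<subclass>[A-Z])\s*"
    r"(?P<main_group>\d{1,4})/(?P<subgroup>\d{1,6})\s+"
    r"(?P<version>\d{8})(?P<level>[A-Z])(?P<position>[A-Z])(?P<value>[A-Z])"
    r"(?P<action_date>\d{8})(?P<status>[A-Z])(?P<source>[A-Z])"
    r"(?P<office>[A-Z]{2})\s*$"
)


def parse_ipc_symbol(text):
    """Split one classification-ipc string into named fields (assumed layout)."""
    match = _ST8.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised IPC symbol: {text!r}")
    return match.groupdict()


def parse_document(xml_text):
    """Read one WIPO DELTA document element into a plain dictionary."""
    root = ET.fromstring(xml_text)
    if root.tag != "document":
        raise ValueError(f"unexpected root element: {root.tag!r}")

    def symbols(container):
        # Optional containers simply yield an empty list when absent.
        path = f"{container}/classification-ipc"
        return [parse_ipc_symbol(e.text or "") for e in root.findall(path)]

    def names(tag):
        # Names are separated by a dot followed by a space (paragraphs 4.8, 4.9).
        text = root.findtext(f"parties/{tag}")
        return text.split(". ") if text else []

    return {
        "attributes": dict(root.attrib),
        "ipc_postreform": symbols("ipc-postreform"),
        "ipc_from_cpc": symbols("ipc-from-cpc"),
        "language_of_filing": root.findtext("language-of-filing"),
        "language_of_publication": root.findtext("language-of-publication"),
        "applicants": names("applicants"),
        "inventors": names("inventors"),
        "title": root.findtext("invention-title"),
        "abstract": root.findtext("abstract"),
    }


# Shortened version of the example in paragraph 6.2.
sample = (
    '<document doc-number="1051" doc-id="356867" date-publ="19870416" '
    'country="AP" family-id="34477" kind="A1">'
    "<ipc-postreform><classification-ipc>B65D 5/06 20060101AFI20160315BHEP "
    "</classification-ipc></ipc-postreform>"
    "<parties><applicants>TETRA PAK AB</applicants></parties>"
    '<invention-title lang="EN">A packaging container</invention-title>'
    '<abstract lang="EN">Packing container</abstract></document>'
)
doc = parse_document(sample)
print(doc["attributes"]["country"], doc["ipc_postreform"][0]["office"])
```

Splitting names on a dot followed by a space follows the separator described in paragraphs 4.8 and 4.9, but it would also split a name that itself contains an abbreviation followed by a space, so the resulting lists should be treated as approximate.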