BABEL Final Report



June 30, 2020

Program for Cooperative Cataloging (PCC), Standing Committee on Standards (SCS), BIBFRAME And MARC Bibliographic Encoding for Languages (BABEL) Task Group

Contents

Introduction
Vocabulary Recommendation
Types of Language Information in Cataloging Records
Current and Reviewed Standards
Standards & Linked Data Framework
MARC language codes
ISO 639-2
ISO 639-3
ISO 639-5
IETF BCP-47
Glottolog
VIVO language ontology
Wikidata
Advantages of ISO 639-3
Implementing ISO 639-3
Mapping MARC Language Codes to ISO 639-3
Matching codes
Complexities with matching codes
Unmatched codes
Codes only in ISO 639-3
General Issues for Implementing ISO 639-3
Lack of cross-references
Number of language codes
Implementing ISO 639-3 for Languages Associated with the Resource in MARC Records
Implementing ISO 639-3 in BIBFRAME Records
Implementing BCP47 to tag literals in BIBFRAME
Steps toward Implementation
Appendix
MARC Documentation Related to Non-MARC Language Schemes
Field 041 examples
Language Code and Term Source Codes
PCC policies and documentation

Introduction

The BIBFRAME and MARC Bibliographic Encoding for Languages (BABEL) Task Group members include Charlene Morrison (OCLC, chair), Kelley McGrath (SCS), TJ Kao (SCA), Elaine Kim (LC), and Robert Rendall. The BABEL Task Group was charged on March 1, 2020 to make recommendations on language vocabularies for use in both BIBFRAME and MARC standards while taking into consideration the various language communities and their needs.

Per the charge, this final report makes a recommendation on a vocabulary to represent languages, scripts, and transliteration schemes in PCC cataloging, whether performed in the BIBFRAME or MARC environments. To accomplish its charge, the Task Group identified the following as stakeholders in the MARC, language, and linked data communities:

- Charles Riley, African language expert
- Larisa Walsh, Cyrillic language expert
- ALA ALCTS CaMMS Committee on Cataloging: Asian and African Materials
- LD4P2 Non-Latin Script Materials Affinity Group
- CORMOSEA (Committee on Research Materials on Southeast Asia), Southeast Asian language experts
- CEAL (Council of East Asian Libraries), East Asian language experts
- MELA Committee on Cataloging, Middle Eastern languages
- VIVO Ontology Group / Violeta Ilik
- PCC Linked Data Advisory Committee (LDAC)
- ILS Vendors

The Task Group would like to acknowledge the assistance of the following stakeholders in writing the report:

- Charles Riley, African language expert
- Andrew Cunningham, language coding systems expert
- Larisa Walsh, Cyrillic language expert
- PCC Linked Data Advisory Committee (LDAC)

Vocabulary Recommendation

After identifying and reviewing the various language coding standards, the Task Group recommends exploring and testing the use of ISO 639-3 codes in certain environments in MARC cataloging to support the cataloging community as it transitions from MARC to
BIBFRAME and a linked data environment. If it is determined that ISO 639-3 can be successfully implemented, we recommend that PCC require the use of 639-3 codes in PCC cataloging; however, we do not think it will be reasonable to forbid the use of other language codes (and specifically MARC language codes) in catalog records. It will also be challenging to convert all MARC language codes in existing records accurately to the correct and sometimes more specific ISO 639-3 codes (see discussion below). We will need to ensure that any language codes are clearly identified by the vocabulary to which they belong, and that local and shared systems will be able to display information derived from codes in different vocabularies to the extent desired and appropriate in a given context.

For BIBFRAME cataloging, the Task Group also recommends exploring the following possibilities:

- expanding the use of ISO 639-3 to replace MARC language codes in other areas (for example, for language of cataloging)
- use of ISO 15924 for script of the resource
- use of BCP 47 to tag literals with language and script, and potentially romanization scheme

Types of Language Information in Cataloging Records

The Task Group discussed the following types of language information in cataloging records:

- Language associated with the resource (e.g., language of the resource or of some component of the resource or the original language of the resource)
- Language of description (i.e., the language of the catalog record)
- Language of particular strings (e.g., title, summary)

The table below is given row by row; each row lists its value for the three columns "Language associated with a resource", "Language of cataloging description" and "Language of statements (literals) in description".

Types of information likely to be of interest
- Language associated with a resource: Language, script
- Language of cataloging description: Language, script
- Language of statements (literals) in description: Language, script, transliteration

MARC 21 Bibliographic
- Language associated with a resource: 008/35-37 Fixed-Length Data Elements / Language; 041 Language Code; 377 $a Language code; 775 $e Other Edition Entry / Language code
- Language of cataloging description: 040 $b Cataloging Source / Language of cataloging
- Language of statements (literals) in description: 242 $y Translation of Title by Cataloging Agency / Language code of translated title; 336 $2 Content Type / Source; 337 $2 Media Type / Source; 338 $2 Carrier Type / Source

MARC 21 Authority
- Language associated with a resource: 377 $a Associated Language / Language Code
- Language of cataloging description: 040 $b Cataloging Source / Language of cataloging
- Language of statements (literals) in description: (none)

MARC 21 Holdings
- Language associated with a resource: 008/22-24 Fixed-Length Data Elements / Language
- Language of cataloging description: 040 $b Record Source / Language of cataloging
- Language of statements (literals) in description: 337 $2 Media Type / Source; 338 $2 Carrier Type / Source

MARC 21 Classification
- Language associated with a resource: 084 $e Classification Scheme and Edition / Language code
- Language of cataloging description: 040 $b Cataloging Source / Language of cataloging
- Language of statements (literals) in description: (none)

MARC 21 Community Information
- Language associated with a resource: 008/12-14 Fixed-Length Data Elements / Language; 041 Language Code
- Language of cataloging description: 040 $b Cataloging Source / Language of cataloging
- Language of statements (literals) in description: (none)

BIBFRAME
- Language associated with a resource: language; script
- Language of cataloging description: descriptionLanguage
- Language of statements (literals) in description: Any literal with language text

MARC 21 vocabulary recommendation
- Language associated with a resource: ISO 639-3 in 041 and 377; MARC language codes for other fields listed; no implementation of script codes
- Language of cataloging description: MARC language codes
- Language of statements (literals) in description: MARC language codes where currently applicable

BIBFRAME vocabulary recommendation
- Language associated with a resource: ISO 639-3 (languages); ISO 15924 (scripts)
- Language of cataloging description: ISO 639-3
- Language of statements (literals) in description: Subset of BCP 47

In MARC 21 records, the most practical and promising path to improving language-related metadata is to move from MARC language codes to ISO 639-3 codes for describing the language of the resource. This would greatly expand our ability to identify specific languages, to make distinctions important for aural language, and to deal with certain types of language groupings. This recommendation is discussed in greater detail below. The Task Group recommends limiting this change to fields 041 and 377, which already include subfield $2, where a non-MARC language scheme can be identified as the source of the code. In BIBFRAME, we also recommend using ISO 639-3 for languages associated with the resource.

In MARC 21 records, the script of the resource is currently only recorded as free text in field 546 subfield $b.
Although it would be beneficial to record the script of the resource using a standard vocabulary, it is not clear that there is a practical way to incorporate this information into MARC. Field 546 is defined as a note field. Field 041 has few subfields left and it is not clear how one would effectively link the language and the script. It could possibly be incorporated into field 377, but use of field 377 in bibliographic records seems to be designed for records describing expressions without manifestation information and lacks the granularity of field 041. In BIBFRAME, we recommend using ISO 15924 for scripts associated with the resource.

In MARC 21 records, the Task Group does not recommend making any changes to coding the language of description (language of cataloging). Field 040 does not currently contain subfield $2, which would be necessary to identify language codes from a non-MARC scheme. It would be complicated to add subfield $2 to field 040 for this purpose. Since PCC libraries are unlikely to catalog in languages that would benefit from an expanded list of languages, there is not a compelling use case for making any change. In BIBFRAME, however, we recommend that ISO 639-3 be used when recording the language of the description, as BIBFRAME does not suffer from these constraints. It is not clear to us whether there is a strong use case for recording the script of the description in BIBFRAME, but if it is wanted, there does not currently seem to be a place for it.

MARC 21 currently does not support identifying information about strings, such as language or script, with one exception. In bibliographic records, field 242 subfield $y can contain the MARC code for the language of a title that has been translated by the cataloging agency. This field is not widely used in PCC cataloging and is not likely to be used to provide translations into less common languages.
It is difficult to envision a practical, cost-effective way to retrofit this capacity into MARC. In BIBFRAME, the Task Group recommends that PCC explore options allowing us to code this information more fully using BCP 47, as BIBFRAME supports identifying information about strings, such as language or script.

Because we expect to continue to work at least partly in MARC for some time to come, the Task Group feels that implementation of ISO 639-3 should be explored in our current MARC environment, to the extent possible. Although postponing all change in practice related to language codes until we have fully transitioned to a BIBFRAME environment might seem easier, the need for more granular and precise encoding of languages has been felt in some communities for a long time, and the benefits that would result from a successful implementation in MARC would be immediate and greatly appreciated.

Current and Reviewed Standards

The Task Group reviewed the standards specified in the charge as well as several additions. The charge identified three vocabularies: ISO 639-2, ISO 639-3, and IETF BCP 47. Along with these three, the Task Group also looked into ISO 639-5 and Glottolog, as well as two possible linked data options, aside from the LC Linked Data Service, for supporting ISO 639-3: the VIVO language ontology and Wikidata.

Standards & Linked Data Framework

MARC language codes

MARC language codes are currently used by libraries in the MARC and BIBFRAME cataloging environments. There are 516 codes, which include 31 discontinued codes. The codes correspond to ISO 639-2. Where there are both bibliographic and terminology codes for a single language in ISO 639-2, the MARC language codes use the bibliographic codes.
With labels only in English (which in some cases differ slightly from ISO 639-2’s English labels), the focus of the MARC code set is on languages most commonly found in library collections, and the vocabulary groups languages without their own code under collective codes. One example is that many Baltic languages are grouped under the code “bat” instead of having their own code. This set of codes does not include a "macrolanguage" concept, and the treatment of languages considered to be macrolanguages in ISO 639-3 varies; sometimes MARC has only a code for the macrolanguage (Chinese), sometimes only for the individual languages (Serbian, Croatian, etc.) and sometimes both (Norwegian and its variants). This vocabulary provides coding for what it considers the major languages of the world and groups other languages together to cover everything else.

ISO 639-2

ISO 639-2 includes 547 codes, some of which are duplicate codes that allow the use of different codes for the same language for bibliographic and terminology purposes. This vocabulary was based on the MARC code list and published in 1998. With labels in English, French, and German, this set also focuses on languages most commonly found in library collections, with collective codes used to cover less common languages. As with MARC language codes, one example of a collective code is “bat”, the code representing Baltic languages, although only the German label clarifies that this code is for other Baltic languages not covered by individual codes. Like the MARC codes, this set of codes does not include a "macrolanguage" concept, and the treatment of languages considered to be macrolanguages in ISO 639-3 varies; sometimes ISO 639-2 has only a code for the macrolanguage (Chinese), sometimes only for the individual languages (Serbian, Croatian, etc.) and sometimes both (Norwegian and its variants).
Also like the MARC language codes, this vocabulary provides coding for what are considered major world languages and groups other languages together to cover everything else.

ISO 639-3

ISO 639-3 contains 7,868 codes, including all of the individual language codes already accounted for in ISO 639-2. It also provides codes for macrolanguages (Serbo-Croatian) or individual languages within macrolanguages (Mandarin Chinese) not included in ISO 639-2. In addition, it includes codes derived from Ethnologue and LinguistList covering thousands of other living languages as well as extinct, ancient, historic, and constructed languages. Labels for these codes are in English, and the standard maps macrolanguage codes to individual language codes. However, it does not include any codes for collections of related languages, so any new language not already included would need to be added. While related to the MARC language codes and the ISO 639-2 codes, this vocabulary does not have a 1 to 1 mapping for some languages that changed over time, for example Old Spanish and Old Tamil, which have separate 639-3 codes but are included within Spanish and Tamil in the smaller vocabularies. Unlike the other two vocabularies, this one has a greater breadth and coverage of past and present languages.

ISO 639-5

ISO 639-5 contains 115 codes and supplements the coding of language groups and families in ISO 639-2. Out of the 115 codes, 65 match ISO 639-2/MARC language codes and are either language group codes or remainder group codes. The former groups two or more individual languages as a unit, while the latter groups languages with the exclusion of specific languages that have separate identifiers. The intent is to support the current ISO 639 standards instead of providing a scientific classification of the languages of the world.
Labels for these codes are in English. The codes are hierarchical in nature and are intended to identify membership in language families.

IETF BCP-47

IETF BCP 47 language tags are made up of a combination of subtags separated by hyphens. The structure includes primary language subtags, extended language subtags, script subtags, region subtags, variant subtags, extension subtags, and private use subtags. Subtags are based on various other standards such as ISO 639 (i.e., 639-1, 639-2, 639-3, and 639-5), ISO 15924, ISO 3166-1, and UN M.49. Language, extended language, script, region, and variant subtags are listed in the IANA Language Subtag Registry. While capitalization is not significant, capitalization conventions are used. Language, extended language, and variant codes are lower case, while script codes capitalize the first letter followed by lower case letters, and two-letter country codes are in upper case. The intent is to utilize current language codes to create more meaningful tags that can be either simple or complex.

Glottolog

Glottolog provides codes for all languoids. This would include language families, languages, and dialects. While it does match up the ISO 639-3 code to the Glottocode, there is not always a 1 to 1 correlation. Glottocodes are composed of a combination of 4 letters and numbers followed by 4 more numbers. Its focus seems to be on lesser known languages, including modern languages that are endangered and assumed languages. It does not provide coding for historical languages.

VIVO language ontology

The VIVO language ontology is an open source ontology that uses the ISO 639 language codes as its sources. While it currently incorporates ISO 639-1 and ISO 639-2, it doesn’t appear that ISO 639-3 has been fully integrated into the ontology at this time. The intent of this ontology is to allow for identification of both written and spoken languages.
Labels are already in English, French, and German, and the ontology supports the addition of labels in native languages and provides a model to convert ISO 639 language codes into a linked data framework.

There are several issues with using the VIVO Language Ontology. At this time the ontology is still in beta, so it may not provide the stability needed to support ISO 639-3; the URIs are not dereferenceable; and the ontology only contains a list of codes.

Wikidata

Wikidata, a “free and open knowledge base,” provides another option for creating IRIs to support use of ISO 639-3 in a linked data environment. A property, P220, already exists for ISO 639-3 and Wikidata does make use of this property in its data. One example is the entry for Tunisian Arabic, i.e. Q56240, which is linked to the ISO 639-3 code aeb.

Running a SPARQL query in the Wikidata Query Service to pull all of the codes used in Wikidata brings up 8243 results, which is more than the 7868 codes defined in ISO 639-3. Much of the discrepancy is due to the inclusion in Wikidata of codes that have been deprecated in ISO 639-3. For example, Aramanik language and Asa language are two separate Wikidata entities, even though in ISO 639-3, Aramanik has been deprecated in favor of Asa language. Some ISO 639-3 codes are used by more than one Wikidata entity and some Wikidata entities include more than one ISO 639-3 code. One explanation might be that there are duplicate entries for the same language in some cases. For example, Q7452602 and Q30732002 both represent the language Sera. These duplicate entries presumably should be merged in Wikidata. Another explanation might be that languages are broken down into more specific subsets for which no more specific code exists in ISO 639-3. For example, the Egyptian and Late Egyptian entities have the same ISO 639-3 language code, egy.
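The kind of check described above can be sketched in a few lines. This is only an illustration, not the query the Task Group ran: the SPARQL string assumes the `wdt:` prefix that the Wikidata Query Service predefines, the helper name `shared_codes` is invented here, and the sample rows use plain labels as stand-ins for entity identifiers except for Q56240, which is cited above.

```python
from collections import defaultdict

# SPARQL for the Wikidata Query Service: every item carrying an
# ISO 639-3 code (property P220). Shown for reference; not executed here.
QUERY = """
SELECT ?item ?code WHERE {
  ?item wdt:P220 ?code .
}
"""


def shared_codes(rows):
    """Return the ISO 639-3 codes attached to more than one entity.

    `rows` is an iterable of (entity, code) pairs, as the query above
    would return them. The result maps each shared code to the sorted
    list of entities that use it.
    """
    by_code = defaultdict(set)
    for entity, code in rows:
        by_code[code].add(entity)
    return {
        code: sorted(entities)
        for code, entities in by_code.items()
        if len(entities) > 1
    }


# Hypothetical sample; "Egyptian" and "Late Egyptian" are labels used
# as placeholders for the corresponding Wikidata entities.
sample_rows = [
    ("Q56240", "aeb"),
    ("Egyptian", "egy"),
    ("Late Egyptian", "egy"),
]
duplicates = shared_codes(sample_rows)
```

A report of this shape would separate true duplicates, which should be merged, from cases where Wikidata is simply more granular than ISO 639-3.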
In addition, at the time of this writing, 28 ISO 639-3 codes do not occur in Wikidata at all.

Advantages of ISO 639-3

Because ISO 639-3 contains only three-letter codes, it will be easier for some existing MARC-based systems to integrate. Prior to the introduction in 2001 of a method for encoding non-MARC language codes in MARC, only three-letter codes from the MARC list of languages were permitted and multiple languages were coded in a single subfield.

041 0# $d engfregerrus $e engfregerrus $h engfregerrus $g engfreger $h eng

Systems parsing these codes for display relied on the standard length of the codes to split out multiple codes in a single subfield. Many library databases may contain instances of this older method of language coding. Some systems still incorporate an expectation of three-letter codes into their parsing algorithm.

Because ISO 639-3 has more comprehensive coverage of spoken as well as written languages, it will allow specific coding of many languages that currently can only be recorded with group codes in the MARC vocabulary. With the availability of codes for languages within macrolanguages, it will also allow catalogers to make important distinctions (such as between Mandarin and Cantonese Chinese in film soundtracks) which cannot be encoded in a machine-readable way in MARC records that only use MARC language codes.

Implementing ISO 639-3

Mapping MARC Language Codes to ISO 639-3

There are 485 currently valid MARC language codes and 7867 currently valid ISO 639-3 codes. This section describes the process of mapping MARC language codes to ISO 639-3 and the potential alignment issues that we have identified.

Matching codes

Exact matches

419 of the 485 MARC language codes (86%) directly or indirectly match ISO 639-3 codes. In 306 cases, both the code and the label match. In 93 cases, the code matches, but the label does not. Most of these variations involve order or parenthetical qualifications and are obviously not substantive (e.g., "Ainu (Japan)" vs. "Ainu" or "Old English (ca. 450-1100)" vs. "English, Old (ca. 450-1100)"). The rest are intended to have equivalent meanings, although this possibly should be reviewed.

ISO 639-2B vs. ISO 639-2T terms

ISO 639-2 has two flavors: ISO 639-2B (bibliographic) and ISO 639-2T (terminology). The MARC language code list is the same as ISO 639-2B. The two ISO 639-2 lists are intended to be equivalent, but there are twenty cases where they use different codes for the same language. For example, the ISO 639-2B/MARC code for Chinese is “chi” while ISO 639-2T uses “zho.” In the cases where the language codes vary, ISO 639-3 uses the ISO 639-2T value. Therefore, although there is a 1:1 relationship between the MARC language code and the ISO 639-3 code, the values are not the same. This does not present a problem for display, but may be a challenge for search and indexing in some systems if they have to deal with databases that include values from both schemes. Systems would need a mechanism to collapse the equivalent ISO 639-2B/MARC and ISO 639-3 codes into a single value.

There are a couple of machine-readable sources that could help with these mappings. SIL International, the organization that maintains ISO 639-3, provides tab-delimited, UTF-8 text files for download that include mappings of ISO 639-3 to the other ISO 639 standards. Wikidata also appears to provide links between the two vocabularies.

Complexities with matching codes

There are a couple of situations where naive matching of equivalent ISO 639-3 and MARC language codes is potentially sub-optimal.

Macrolanguages

ISO 639-3 includes 62 codes for “macrolanguages.” 58 of these have an equivalent MARC language code. Macrolanguage codes are used to identify “clusters of closely-related language varieties that ...
can be considered distinct individual languages, yet in certain usage contexts a single language identity for all is needed.” Examples of macrolanguage codes include ara (Arabic), zho (Chinese), nor (Norwegian) and hbs (Serbo-Croatian).

There are three possible types of matches between MARC language codes and ISO 639-3 macrolanguages:

- MARC includes only the equivalent for the macrolanguage (e.g., Arabic, Chinese, Latvian)
- MARC includes both the equivalent for the macrolanguage and some or all of the individual languages contained in the macrolanguage (e.g., Norwegian)
- MARC includes only the equivalent for the individual languages contained in the macrolanguage (e.g., Serbo-Croatian)

There are three issues related to macrolanguages that need to be considered.

Impact on systems

Individual languages that are part of macrolanguages need to search and facet both as individual languages and as part of the macrolanguage for optimal usability. Users should be able to find things with the ISO 639-3 code “cmn” both when limiting to Mandarin (e.g., if they are searching for videos with Mandarin soundtracks) and when limiting to Chinese (e.g., if they want to see all the Chinese language resources recently acquired by the library or that address a certain topic). This collocation problem is similar to that discussed in the previous section on ISO 639-2B vs. ISO 639-2T, except that it requires a single value to be mapped to multiple values rather than multiple values mapping to a single value.

Policies for cataloging practice

Policies would need to be developed for when to use macrolanguage codes and when to use individual language codes and, possibly, when to use both. For example, the ISO 639-3 macrolanguage code lav (Latvian) corresponds to the MARC language code lav (Latvian), but ISO 639-3 also includes the more specific code lvs (Latvian, Standard), which is considered to be a language within that macrolanguage.
Either ISO 639-3 code could reasonably be applied to a resource in Latvian. Catalogers would need guidance on how to apply these codes and others like them going forward.

Conversion

Because there are no MARC language code equivalents for most of the individual languages that are part of ISO 639-3 macrolanguages, any attempt to automate incorporating the more specific codes into existing MARC records will be more complex.

Historic languages

ISO 639-3 includes 83 languages that it classifies as “historic.” These languages are “considered to be distinct from any modern languages that are descended from [them]: for instance, Old English and Middle English.” Only sixteen of these codes are also present in the MARC language list. Many of the remaining 67 ISO 639-3 historic languages are currently coded in MARC using the code that ISO 639-3 defines only for the modern, living language. For example, some things that are coded “spa” for Spanish in MARC should be coded “osp” (Old Spanish) in ISO 639-3. This is another area where conversion is not straightforward. However, this is not a new problem. Many current MARC bibliographic records, especially those retrospectively converted from cards, fail to use the existing MARC codes for historic languages when they should. It would also be useful to have a machine-actionable way to group the temporal variants for retrieval (e.g., Korean, Middle Korean, Old Korean). A potential issue for consistently applying these codes is that only the ones that match the MARC codes include date ranges in their labels.

Unmatched codes

There are 66 MARC codes that do not match ISO 639-3 codes. 65 of these match codes in ISO 639-5, which contains language families and groups. However, despite the fact that the same codes are used in both lists, some of them are defined differently and thus cannot be used interchangeably. Each of the ISO 639-5 codes describes a language grouping as a whole, including all of its members.
Some of the matching MARC codes are also defined this way and thus are presumably semantically equivalent (e.g., “aus” for Australian languages). However, some of the MARC language codes are defined to include only those languages in a group that don’t have their own individual code. The MARC language code “fiu” (Finno-Ugrian (Other)) is a collective code for Ingrian, Khanty, Livonian, Ludic, Mansi, Mordvin and Veps. The ISO 639-5 code “fiu” is labeled “Finno-Ugrian languages” and covers all the languages in this family, including not just the obscure ones, but also more well-known examples like Finnish and Hungarian.

Ideally, the MARC codes for language groupings would be converted to ISO 639-3 codes for the appropriate individual languages. However, this is challenging to do in an automated, scalable fashion. In some cases, specific languages could be identified in notes or subject headings, but some records will require manual review. Databases that include mixed practices where some records are coded for the specific language and some only for language groups will be frustrating and confusing for users. This could be mitigated by the method suggested above for macrolanguages where a search for the group code is designed to include the component languages.

One MARC language code (“him” for Western Pahari languages) lacks a corresponding code in 639-3 or 639-5, but this seems likely to be an oversight and it should have been mapped to 639-5.

Codes only in ISO 639-3

There are 7448 codes in ISO 639-3 that do not have an exact match in the MARC language code list. We have not identified any potential problems with adding these to MARC records other than the issues discussed above.

General Issues for Implementing ISO 639-3

Lack of cross-references

Like most language code vocabularies other than MARC, ISO 639-3 lacks a set of cross-references designed to help catalogers identify the appropriate code(s) for the resource they are describing.
For languages that they're not familiar with, catalogers would probably need to do research directly in Ethnologue, which is not free, or in Wikipedia or other sources that reproduce some of its content. This would include determining the correct 639-3 code for some of the language names that are listed as references in MARC but are apparently not in 639-3. For example, "Čakavian" is listed in MARC as a reference under the collective code for Slavic (Other), but turns out to be listed in 639-3 under a different spelling as a separate language, "Chakavian"; "Surzhyk" (mixed Russian-Ukrainian) is also listed in MARC under Slavic (Other), but is not considered to be a valid language in ISO 639-3 and is not included.

Number of language codes

Language codes are generally mapped to labels for use in discovery interfaces or dropdown lists in staff interfaces. A switch to ISO 639-3 or RFC 5646 would increase the size of this mapping table significantly. There are 485 currently valid MARC language codes, but over 7800 in ISO 639-3. This may be an issue for maintenance or for processing in some systems (e.g., Ex Libris’ Primo has a limit on the number of unique values that can be defined for a “static” facet). The Task Group believes that linked data-based systems should be better equipped to handle maintenance of these mappings, but the number of codes may still present a challenge for retrieval in some situations.

Implementing ISO 639-3 for Languages Associated with the Resource in MARC Records

The language of the resource is recorded in 008/35-37, field 041, and field 377 in the MARC bibliographic record.

Non-MARC language codes can be used in addition to or instead of MARC language codes in the 041 field in the bibliographic format and the 377 field in the bibliographic and authority formats. The source of the language term is identified in subfield $2 and ISO 639-3 already has a code defined.
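As a rough illustration of what reading field 041 involves once non-MARC codes are allowed, the sketch below handles both conventions mentioned in this report: codes whose scheme is named in subfield $2, and the older practice of running several three-letter MARC codes together in one subfield. The function name `parse_041` and the tuple layout are invented for this example, and the source code string `iso639-3` is an assumption that should be checked against the MARC list of language code and term source codes.

```python
def parse_041(subfields):
    """Return (subfield_code, language_code, scheme) triples for one 041 field.

    `subfields` is a list of (code, value) pairs in field order, for
    example [("a", "cmn"), ("2", "iso639-3")]. This input shape is a
    simplification chosen for the sketch, not a real MARC library API.
    """
    # Subfield $2 names the source of the codes. When it is absent,
    # the codes are taken to be MARC language codes.
    scheme = next((value for code, value in subfields if code == "2"), "marc")

    result = []
    for code, value in subfields:
        if code == "2":
            continue
        value = value.strip()
        if scheme == "marc" and len(value) > 3 and len(value) % 3 == 0:
            # Legacy practice: several three-letter codes in one subfield.
            parts = [value[i:i + 3] for i in range(0, len(value), 3)]
        else:
            parts = [value]
        result.extend((code, part, scheme) for part in parts)
    return result


legacy = parse_041([("d", "engfregerrus"), ("h", "eng")])
modern = parse_041([("a", "cmn"), ("2", "iso639-3")])
```

Keeping the scheme alongside each code lets a downstream index decide whether a value such as "chi" or "zho" needs to be normalized before display or faceting.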
Since the infrastructure is in place, the remaining difficulties are related to the ways that existing systems handle language data in MARC records.

The one implementation issue that is unique to the MARC environment is the inability to use codes from non-MARC language schemes in the language fixed field (008/35-37). Non-MARC language codes can only be used in field 041. If only non-MARC language codes are used in a record, the MARC format says to use fill characters in field 008. Although most systems likely use a combination of language codes from fields 008 and 041, it is possible that there are systems that only make use of the language code in field 008.

Implementing ISO 639-3 in BIBFRAME Records

The Task Group’s preliminary work looked at the ISO codes converted to URIs by the LC Linked Data Service, the VIVO Language Ontology, and Wikidata.

The LC Linked Data Service includes URIs for MARC language codes, ISO 639-1, ISO 639-2, and ISO 639-5. While it doesn’t include ISO 639-3 itself, a URI was created to represent the ISO 639-3 code source. Using only this service for BIBFRAME and other linked data applications would currently exclude the ISO 639-3 standard. Because SIL International, the organization that maintains ISO 639-3, has not published that vocabulary as linked data, many providers of linked data have used the URIs published by Lexvo. However, this website no longer seems to be functional.

In the case of replacing MARC language codes, the alternative should have RDF IRIs ready for consumption or easily generated. One option, the VIVO Language Ontology, was created to represent the languages in the ISO standards 639-1, 639-2, and 639-3. This ontology models language under two major classes, “continuants” and “occurrents”. Continuants are “recorded text or media,” while occurrents are performed works. The ontology acknowledges that an occurrent could result in a continuant. Another option would be using Wikidata to create IRIs for use.
The use of IRIs from either of these options might work to incorporate ISO 639-3 codes into bibliographic data, but further research is needed.

The LC Linked Data Service, ID., provides both interactive and machine access to commonly used ontologies, controlled vocabularies, and other lists for bibliographic description. It may be desirable for ID. to host a version of ISO 639-3 as linked data for use by the library community. This would avoid the challenge of trying to maintain a clean mapping with Wikidata or another source. It would also allow the incorporation of cross-references, and possibly usage guidelines, for the library community.

Implementing BCP47 to tag literals in BIBFRAME

It is not currently possible to identify in a machine-readable way the language, script, etc. of individual elements of a MARC record, and we do not recommend any changes to practice in MARC. BIBFRAME shows more promise in this respect, however, and we do recommend that, as part of BIBFRAME implementation, the possibility of encoding this sort of information, to the extent desired in different contexts, be explored.

Unlike MARC 21, which allows the identification of the language in only a small number of fields and subfields listed in the previous section, BIBFRAME makes it possible to identify the language, script and transliteration of data in nearly all elements. The RDF standard specifies that the language of literals can be identified with a language tag as defined by BCP 47. As of June 15, 2020, the Library of Congress’ internal version of BIBFRAME Editor provides two separate drop-down menus for identification of language and script of numerous elements, including title information, statement of responsibility, edition statement, transcribed provider statement, series statement, and various notes.
While both sets of values are from the IANA Language Subtag Registry, the first value consists of two letters representing the language and the second value consists of four letters representing the script. Following BCP 47’s convention, the combination of the two values connected by a hyphen expresses explicitly the language and script of the information, e.g. zh-Hant for Chinese written in traditional Chinese script, ja-Hani for Japanese written in Kanji, ko-Hang for Korean written in Hangul.

Below are examples of how BCP 47 is used to express the languages and scripts of a literal that might look identical but actually represents different languages and scripts:

日本人

Language   IANA code for script   Romanization   IANA code for romanization
Chinese    zh-Hant                Riben ren      zh-Latn
Japanese   ja-Hani                Nihonjin       ja-Latn
Korean     ko-Hani                Ilbonin        ko-Latn

If desired, use of BCP 47 could be expanded to indicate the romanization scheme used in strings containing romanized text, for example distinguishing between Korean text romanized according to the ALA-LC table and text romanized according to the Korean government's romanization system.

Below is an example of how BCP 47 language subtags are contained in the RDF of BIBFRAME:

    _:b0_b1 a bf:Title; rdfs:label "한국사"; bf:mainTitle "한국사"@ko-kore.
    _:b0_b2 a bf:VariantTitle; bf:mainTitle "韓國史"@ko-hani.

In the BCP 47 registry, some languages normally written in just one script are listed with a 'Suppress-Script' field indicating that a script subtag will not add distinguishing information for that language and should not be used. For example, the registry indicates that the subtag 'Latn' should not be used with the primary language 'en' because nearly all English documents are written in the Latin script. The implications of this restriction (if it needs to be followed) for the coding and retrieval of library data will require further investigation.
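The combination of language and script subtags described above, together with the 'Suppress-Script' behavior, can be illustrated with a minimal Python sketch. The names `SUPPRESS_SCRIPT`, `make_tag`, and `tagged_literal` are hypothetical and are not part of BIBFRAME Editor or any library. The suppression table holds only the 'en'/'Latn' pair mentioned in the text, and the validation is deliberately simplified (two- or three-letter language subtags, four-letter script subtags). Language tags are case-insensitive, so the lowercase form in the RDF example and the title-case script form used here denote the same tag.

```python
import re

# Hypothetical excerpt of the registry's Suppress-Script data.
# Only the 'en' -> 'Latn' pair comes from the report; a real
# application would load the full IANA Language Subtag Registry.
SUPPRESS_SCRIPT = {"en": "Latn"}


def make_tag(language, script=None, honor_suppress_script=True):
    """Join a language subtag and an optional script subtag with a hyphen."""
    lang = language.lower()
    if not re.fullmatch(r"[a-z]{2,3}", lang):
        raise ValueError(f"unexpected language subtag: {language!r}")
    if script is None:
        return lang
    if not re.fullmatch(r"[A-Za-z]{4}", script):
        raise ValueError(f"unexpected script subtag: {script!r}")
    scr = script.title()  # conventional casing, e.g. 'hant' -> 'Hant'
    # Drop the script when the registry says it adds no information.
    if honor_suppress_script and SUPPRESS_SCRIPT.get(lang) == scr:
        return lang
    return f"{lang}-{scr}"


def tagged_literal(text, tag):
    """Format a language-tagged literal in Turtle syntax.

    Only backslashes and double quotes are escaped in this sketch.
    """
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{tag}'


print(make_tag("zh", "Hant"))
print(make_tag("en", "Latn"))
print(tagged_literal("韓國史", make_tag("ko", "Hani")))
```

The `honor_suppress_script` flag is kept as a parameter because the report leaves open whether library data needs to follow this restriction.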
BIBFRAME will need to be able to support mixed practices, including non-Latin script and/or romanization, to meet the needs of different library communities and to accommodate both legacy and newly created data. However, current BIBFRAME cataloging practice is geared toward using less transliteration. In the current LC BIBFRAME phase 2 pilot program, access points are romanized, but other parts of the bibliographic description are recorded in the script of the resource.

While it should be possible to tag language, script and transliteration for any BIBFRAME element, we recommend tagging the following BIBFRAME elements in order to maximize resource discovery while remaining economical:

For text in its original script:
    Title Information
    Statement of Responsibility
    Edition Statement
    Transcribed provider Statement
    Series Statement
    Notes (contents, summary etc.)

For transliterated text:
    Creator
    Subject
    Form/Genre
    Contributor

We recognize that BCP 47 is a very complex standard. It will be challenging for cataloging interfaces to integrate and validate. However, we see considerable promise in continuing and expanding the use of this type of coding to tag literals in library data.

Steps toward Implementation

The Task Group recommends the following steps be taken for technical implementation of ISO 639-3, BCP 47, and ISO 15924.

Bidirectional text. Further investigation into coding bidirectional text strings is recommended for BIBFRAME and other linked data services. The Task Group has concerns about how the language subtags are put together. While MARC 21 supports bidirectional text through the use of markers (left to right, right to left, and Arabic), RDF does not currently have a way to indicate bidirectional text. More research and exploration of the best way to encode bidirectional text in BIBFRAME is needed.

Canonical URIs. We recommend exploring and testing use of ID.
to support the creation of URIs to represent the individual language codes within ISO 639-3 and possibly BCP 47 and ISO 15924 to support their use in BIBFRAME. Support and expertise could be provided by the PCC LDAC, LD4 community, and PCC Wikidata.

Community feedback. We recommend soliciting community feedback, including vendor and ILS provider feedback. As mentioned above, this change includes using two language codes in tandem in MARC during the transition to BIBFRAME, along with supporting a larger set of language codes. In particular, we should ascertain the extent to which vendors and ILS providers will be able to effectively support the incorporation of ISO 639-3 codes into discovery interfaces.

Conversion of codes. The BIBFRAME system will need to be able to convert ISO 639-3 codes into the appropriate MARC language codes when MARC records are created from BIBFRAME records. We also recommend investigating strategies for converting MARC language codes into ISO 639-3 codes in MARC records, either as a large-scale project in a database like WorldCat, or on a record-by-record basis, as with the OCLC Music Toolkit.

Policies and documentation. PCC policies and documentation listed in the Appendix should be updated as needed to account for the use of the ISO 639-3 and BCP 47 codes. We also recommend the creation of new policies and training documentation, especially in applying BCP 47 in BIBFRAME.

Scripts. Further investigation is needed into how ISO 15924 could be implemented in a linked data environment, since canonical URIs do not seem to be currently available for this vocabulary.

Appendix

MARC Documentation Related to Non-MARC Language Schemes

Field 041 examples

$2 - Source of code: Source of the language code scheme used in the field.
Code from: Language Code and Term Source Codes.

If a non-MARC code is used to express the predominant language in an item, field 008/35-37 is coded with three fill characters (|||).

If more than one code scheme is used in a record, repeat the field.

    008/35-37  |||
    041 07 $a en $a fr $a it $2 iso639-1

    008/35-37  eng
    041 0# $a eng $a fre
    041 07 $a en $a fr $2 iso639-1
    [Two language code schemes are used and field 041 is repeated.]

$r - Language code of accessible visual language (non-textual)

Language codes for visual language (non-textual) used to provide alternative access to the audio content of a resource. For example, signed languages.

    041 0# $a eng $r sgn
    041 07 $r ase $2 iso639-3
    [An English language resource where audio is the primary mode of access, but alternate access is provided with picture-in-picture American Sign Language.]

For resources where signed language is the primary mode of access, subfield $a should be used to record the language code for signed language.

    041 0# $a sgn
    041 07 $a ase $2 iso639-3
    [A resource where American Sign Language is the primary mode of access.]

Language Code and Term Source Codes

Language code   Language term
austlang        AUSTLANG (Australian Institute of Aboriginal and Torres Strait Islander Studies (AIATSIS))
din2335         Sprachenzeichen: DIN 2335 (Berlin: Beuth)
glotto          Glottolog
iso639-1        Codes for the representation of names of languages--Part 1: Alpha-2 code (ISO 639-1:2002) (Geneva: International Organization for Standardization)
iso639-2b       Codes for the representation of names of languages--Part 2: Alpha-3 code (ISO 639-2B:2002) (Geneva: International Organization for Standardization). [The bibliographic language codes are identical to both NISO Z39.53 and the MARC Code List for Languages.]
iso639-3        Codes for the representation of names of languages--Part 3: Alpha-3 code for comprehensive coverage of languages (Geneva: International Organization for Standardization)
knia            Kody naimenovanii i͡azykov: GOST 7.75-97 (Minsk: Mezhgosudarstvennyi sovet po standartizatsii, metrologii i sertifikatsii)
rfc3066         Tags for the identification of languages (January 2001) (The Internet Society) [replaced by RFC 4646 and RFC 4647]
rfc4646         Tags for identifying languages (September 2006) (The Internet Society) [In combination with RFC 4647, replaces RFC 3066. A language identifier as specified by the Internet Best Current Practice specification RFC4646. This document gives guidance on the use of ISO 639-1, ISO 639-2, and ISO 639-3 language identifiers with optional secondary subtags and extensions. Replaced by RFC 5646.]
rfc5646         Tags for Identifying Languages (September 2009) (The Internet Society) [In combination with RFC 5645, replaces RFC 4646. A language identifier as specified by the Internet Best Current Practice specification RFC5646. This document gives guidance on the use of ISO 639-1, ISO 639-2, ISO 639-3, and ISO-639-5 language identifiers.]
walso           The World atlas of language structures online

PCC policies and documentation

The Task Group identified the following PCC policies and documentation that would need to be revised:

    CONSER Editing Guide (CEG)
    CONSER Cataloging Manual (CCM)
    CONSER Standard Record (CSR) RDA Metadata Application Profile
    Descriptive Cataloging Manual (DCM), Section Z1
    Integrating Resources Cataloging Manual
    LC-PCC PS 6.11.1.3
    LC-PCC PS 7.13.2.3
    MARC 21 Encoding to Accommodate RDA Elements in 046, 3XX, 672, 673, and 678 Fields in NARs and SARs
    PCC Guidelines for Creating Bibliographic Records in Multiple Character Sets
    PCC Provider-Neutral E-Resource MARC Record Guidelines
    PCC RDA BIBCO Standard Record (BSR) Metadata Application Profile
    Training Materials for the Basic Serials Cataloging Workshop
    Training Materials for Serials Holdings Workshop
    Training Materials for the Integrating Resources Cataloging Workshop
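The field 041 patterns shown in the appendix examples, and the conversion from ISO 639-3 to MARC codes recommended under Steps toward Implementation, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not an implementation of any existing system: the function name `marc_language_fields` and the three-entry table `ISO639_3_TO_MARC` are hypothetical, the first indicator is fixed at 0 as in the appendix examples, and a production mapping would need to cover the full code lists, including the complex cases discussed in the report.

```python
# Hypothetical, illustrative subset of an ISO 639-3 -> MARC mapping.
# A real table would come from a maintained crosswalk.
ISO639_3_TO_MARC = {
    "eng": "eng",  # English: same code in both lists
    "fra": "fre",  # French: MARC uses 'fre'
    "ase": "sgn",  # American Sign Language -> MARC 'sgn' (sign languages)
}

FILL = "|||"  # fill characters for 008/35-37


def marc_language_fields(iso_codes, include_marc=True):
    """Return (008/35-37 value, list of 041 field strings).

    With include_marc=True, MARC and ISO 639-3 codes are recorded in
    tandem and field 041 is repeated, once per code scheme. With
    include_marc=False, only ISO 639-3 codes are recorded and
    008/35-37 is coded with fill characters.
    """
    if not iso_codes:
        raise ValueError("at least one language code is required")
    fields = []
    fixed = FILL
    if include_marc:
        marc_codes = []
        for code in iso_codes:
            marc = ISO639_3_TO_MARC[code]  # KeyError if the code is unmapped
            if marc not in marc_codes:     # several ISO codes may share one MARC code
                marc_codes.append(marc)
        fixed = marc_codes[0]
        # Second indicator '#': MARC language code.
        fields.append("041 0# " + " ".join(f"$a {c}" for c in marc_codes))
    # Second indicator '7': source of the code is specified in $2.
    iso_subfields = " ".join(f"$a {c}" for c in iso_codes)
    fields.append(f"041 07 {iso_subfields} $2 iso639-3")
    return fixed, fields


fixed, fields = marc_language_fields(["ase"])
print("008/35-37", fixed)
for field in fields:
    print(field)
```

For the input `["ase"]` the sketch reproduces the appendix example for a resource where American Sign Language is the primary mode of access. Calling it with `include_marc=False` shows the case where systems that read only field 008 would see fill characters instead of a language code.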