Course Notes



Prepared by: Nicole E. Stanbridge
ID #19700912T000 / S10491916 June 2012
Prepared for: MADLV11 Digitizing Cultural Heritage Material
Professor M. Dahlström
University of Borås, Sweden

Contents
Introduction and Project Overview
Pre-Digitization
Training and Research
Context Analysis
Preparation and Needs Assessment
Selection of Source Material
Observations
Copyright
Equipment, Hardware, and Software
Material Analysis and Handling
Memory Requirements
Digitization Costs
Image Capture
Calibration
Scanner
Monitor
Test Scan
Image-Quality and Benchmarking
Scanning
Observations
Comparative Analysis
Quality Control
Observations
Post-Digitization
References
Resources

Introduction and Project Overview
This collection and its accompanying documentation are the result of a small-scale digitization project for the course Digitizing Cultural Heritage Material at the University of Borås. To the extent possible, a critical digitization approach was adopted in which best practices were followed.
This is in contrast to the mass digitization mode adopted by projects like Google Books, the Million Books Project, and the Open Content Alliance, where accessibility, not “preservation quality scans”, is the primary outcome measure (Deegan & Sutherland, 2009, p.174). Critical digitization aims to meet the demands of scholarly research, preservation, and authentication using high-end capture methods (Dahlström et al., 2012).
Unfortunately, without a professional or academic setting, the capture quantity and quality were limited. To compensate, points of comparative analysis against high-end standards and policy benchmarks, as well as analytical observations, are made throughout the report.

Statement of Purpose
Administer a small-scale critical digitization project resulting in multiple formats (PDF, JPG, TIFF, and XML encoded to meet TEI P4 standards) presented via a dedicated website.

Pre-Digitization
The pre-digitization phase is paramount to a successful image capture considering the many procedural variables, all of which create distinct schemas. Pre-digitization activities provide the infrastructure necessary to execute a sustainable, cost-effective capture. Solidifying the purpose and goals for digitization is perhaps its most valuable outcome. The impetus for most digitizations falls into one of three categories: 1) research; 2) preservation; and 3) access (Deegan & Sutherland, 2009; Minerva Project, 2004; Rieger, 2008). If a capture is driven by a combination of these reasons, pre-planning is nearly a prerequisite. The dividends from this seemingly minor administrative task become apparent over the project’s life-cycle. Digital project planning can encompass issues including, but not limited to: workflow construction, copyright, staffing, training, research, technology needs, test-bed design, benchmarking, and risk assessment (Minerva Project, 2004, sec.3.2). Finally, properly documenting and publishing the pre-digitization stage demonstrates ‘digital validity’, i.e., the reliability and authenticity of the records produced therein (Duranti, 1995).
Without project provenance, procedural records, and adherence to standards, the researcher may deem the resultant records unprofessional and unreliable. The allocation of time was the most striking observation in this capture. Guidelines suggest that, on average, “one-third of the effort will be project planning; one-third archival description/indexing; and one-third actual digitization” (FADGI, 2009). This proved fairly accurate, though unanticipated.

Training and Research
An increasing number of staff and scholars are assuming key roles in digitizing materials of high value, demand, and complexity. Many are unqualified to carry out tasks that are often irreversible (Conway, 2011). Furthermore, the field is dynamic and constantly evolving, which makes ongoing training increasingly important.
Fortunately, current quality control standards and best-practice guidelines for digital conversion are plentiful (e.g., a bibliographic listing). The following guidelines were reviewed and utilized for this project:
ECHO Project, 2007
FADGI Guidelines for TIFF Metadata, 2009
FADGI Technical Guidelines for Digitizing Cultural Heritage Materials, 2010
Guidelines: Memory of Netherlands Project, 2003
Best Practice Guidelines for Digital Collections, 2007

In addition to staff training, researching similar projects provides insight into alternative methodologies, fosters creativity and ideas, avoids ‘reinventing the wheel’, and can even preempt mistakes (Minerva Project, 2004, sec.3.2). A considerable amount of time was spent researching both industry guidelines and other digitization projects, which proved invaluable.

Context Analysis
A comprehensive context analysis is, essentially, a survey of the collection’s current landscape, intended to inform and steer future project decisions. Although a full analysis was not feasible in the current project, a brief one is included. This stage may include a preservation assessment of the current physical storage environment, selection policies, accessibility levels, as well as relevant operations and governance. Context analysis also entails a preliminary assessment of the actual need for retention.
Responsible digital stewardship is imperative considering the sheer mass of today’s global corpus, which creates a proportionally limited opportunity for digital capture. In turn, selection criteria should consider the constraints imposed by this ‘digital window’. Random or permanent retention must give way to retention “for a period of continuing value” (Bearman, 1989, chap.1). Bearman’s impressive critique of the archival schema, in the face of the digital age, abandons the notion of retention based on continuing, evidential, and informational value. The guiding factor is not the benefit of retaining a record but instead “the probability of incurring unacceptable risks as a consequence of disposal” (Bearman, 1989, chap.1). That is, what is the risk of losing the record?
This is driven by the information shelf-life and “realistic assumptions about the length of time that [records] will be of value” (Bearman, 1989, chap.2). The continuing value of many collections is perhaps low where selection was driven by subjective, short-term, indiscriminate factors (Earnshaw et al., 2008, p.225). The funnel for cultural continuity has narrowed, and the selection ‘lens’ might benefit from this risk-based approach, as this capture will later demonstrate.

After retention is justified, demand and usage are addressed. Tools for gauging longitudinal usage and adherence might entail interviews with staff, researchers, and patrons, or an examination of retrieval rates and current availability. The physical risks imposed on the collection by the digitization process itself should also be assessed.
In addition, an unforeseen drop in post-digitization usage may occur, a form of “digital obsolescence” in which information artifacts are merged into today’s vast electronic corpus and seemingly lost (Weel, 2011), or retrieved only through premeditated searches.

Preparation and Needs Assessment
Selection of Source Material
PROFILE: Original Source Material
Medium: Hardcover book
Author: Sir William Chandler Roberts-Austen
Series: Griffin’s Metallurgical Series
Edition: 5th, Revised and Enlarged
Place: London
Publisher: Charles Griffin and Co Ltd
Date: 1902
Language: English
Physical Description: xv, 516 p., [9] leaves of plates : ill. ; 22 cm

Project goals and user demand usually determine selection, but this project’s criteria were dictated by broader circumstances (i.e., coursework and a non-institutional setting).
Typical criteria such as internal contracts, costs, equipment, staffing, legalities, preservation risks, and physical impediments were not applicable (Minerva Project, 2004, sec.3.3). In such settings, the “just in case” versus “just in time” approach (Rydberg-Cox, 2009; Schreibman et al., 2004) or Bearman’s “retention for continuing value” framework enters the selection decision-making process (Bearman, 1989). Nonetheless, a meaningful, value-added selection method was preferred over baseless randomization. In a 2010 survey, staff from the HathiTrust Digital Library identified root causes of the most critical digitization errors from 117,555 volumes ingested over four years (Conway, 2010). Pages exhibiting these characteristics were targeted for this digitization. These included:
Thick text [character fill, bolding, indistinguishable characters]
Broken text [character breakup, unresolved fonts]
Blurred text [movement]
Obscured text [portions not visible]
Warp [text alignment, skew]
Colorization [text bleed, low text-to-carrier contrast]
Illustrations with color imbalance, gradient shifts
Atypical font or glyphs [gothic, mathematical]
Using these criteria yielded the greatest range of material and depth of experience.

Observations
SELECTION: The original selection method was unsuccessful. It involved researching the methods used by mass digitization projects when selecting pages for test scans or pilots. Published material on this apparently narrow aspect of the process was not available.
DIGITAL STEWARDSHIP: The owner of the source material expressed no interest in its digitization because the content was outdated, particularly in the scientific domain where current data is imperative.
He stated that “the content is rather worthless” and might “only be accessed once a decade.” Bearman’s litmus test, retention for continuing value, was echoed by the end-user and offered a first-hand account of digital stewardship.

Copyright
Cultural institutions are primarily interested in two copyright issues: how to legally digitize material for which they may not hold the copyright, and how to ensure the material is used appropriately (tacit or otherwise) (NC ECHO Project, 2007). In a traditional digitization, the memory institution would perform copyright due diligence on every object. Unfortunately, rights issues are not always clear and become a matter of risk management rather than of right or wrong (NISO, 2007).
Factors influencing copyright include international complications, multiple copyrights, transfer of ownership, format variations, and publication status. Additional issues include rights of privacy, rights of publicity, and laws governing patents, trademarks, and trade secrets. The consequences of use without permission, and the impact infringement would have on the project or governing institution, make risk mitigation efforts an unquestionable necessity. This project’s source material was published in 1902. Under US copyright law, content published before 1923 is in the public domain due to copyright expiration (Cornell University, 2012). A plethora of copyright guidelines are available, like the one referenced in Table 1.

Table 1. Public domain determination (Carignan et al., 2007, p.48).

Equipment, Hardware, and Software
Image capture was done on a home-office scanner and an Apple MacBook Pro. The project’s website was first developed using WordPress on a local server environment (Apache, PHP, MySQL) through MAMP, then loaded to . A full equipment profile is provided later in the section “Image Capture”.

Assessing and selecting the required software was challenging. Several options were researched to align costs, learnability, and feasibility given the project’s minimal infrastructure. On the institutional level, many systems are available for digital libraries, ranging from technically advanced, feature-rich frameworks to more basic off-the-shelf varieties aimed at laypersons, with equally variant pricing from open-source to commercial grade, and with usage parameters from limited to expansive. An investigation of the current landscape revealed that the distinction between institutional repository software (e.g., Fedora, Digital Commons, DSpace, DigiTools), digital collections systems (e.g., CONTENTdm, Greenstone), and web content management systems (e.g., Islandora, Omeka, WordPress, Drupal/DrupalGardens, Joomla) is quickly blurring, and ‘hybrid’ software is the trend. Omeka could not display PDF, TIFF, and XML files, and its document viewer was an inadequate Google plugin. Omeka had greater metadata support, but the framework was developed around museum archives and exhibiting images. DrupalGardens and Islandora required more advanced technical knowledge. WordPress was repeatedly recommended in presentations, blogs, and discussion forums.
The talk Intro to CMSes designated it the best out-of-the-box system (Murray-John, 2011). With ever increasing modules, support for a multitude of media objects, as well as technical and semantic metadata, WordPress was deemed appropriate for the present project.

Material Analysis and Handling

This project demonstrated precisely the importance of material analysis and pre-processing. Printed in 1902, the book simply fell apart during capture. Repeatedly opening it on the flatbed scanner compromised the brittle binding and the pages detached completely. Although the book lacks any inherent value, a digital camera and cradle would have been preferable and would have greatly increased throughput. In a professional setting, the material analysis report for the book (Table 2) would likely have prompted such a conservational approach.

Table 2. Material Analysis Report.
Storage: Sealed plastic bag on owner’s bookshelf
Type of object: Book (1)
Format: Printed text, line drawings, photographic images (continuous tone)
Number of objects: 8 pages digitized from 626 page book
Physical condition of object:
- Brittle and fragile binding
- Pages: besides foxing and page browning, good condition with minimal markings or stains, minimal warping
- Torn hardcover
POST DIGITIZATION
- Backstrip split and fully separated from hardcover
- Exposed hinges but sewn binding intact
- Hardcover entirely separated from text block
Size of object (l x w x d in cm): 22 cm x 13.8 cm x 4.9 cm
Total number of pages: Front material: 20 pages, 1 tissue sheet; Main text: 516 pages; Back material: 90 pages
Size of pages (l x w in cm): Smallest: 21.2 cm x 13.5 cm; Biggest: 33.2 cm x 36.8 cm (fold-out table)
Size of margins: Top: 2.6 cm; Bottom: 2.2 cm; Left: 2.2 cm; Right: 2.1 cm
Height of the smallest letter "e" in mm: 0.9 mm
Height of the smallest letter "e" in footnotes in mm: 1.2 mm
Height of the biggest letter in mm: 6 mm
Does the material contain page numbers? Yes / No
Languages: English
Is the material bound? Yes / No
If bound, does the binding cause shadows on the pages? Yes / No
Does the material contain any fold-out pages? Yes (2 pages) / No
Does the material contain pages printed across the inner margin? Yes / No
Does the material contain images? Yes / No
When more than 1% of the material contains images, describe the images and the percentage of the total number of pages that contains images: Less than 1% / More than 1%; line images (25%) / halftone b/w (5%) / halftone color / color / other
Is the color of the paper constant throughout single items? Yes / No
Is the color of the paper constant throughout all items? Yes / No
What is the condition of the text?
- Faded text: Yes (15%) / No
- Text showing through: Yes / No
- Stained/damaged text: Yes (2%) / No
- Thick text: Yes (1%) / No
- Broken text: Yes (5%) / No
- Illustrations with color imbalance, gradient shifts: Yes / No
- Atypical font or glyphs: Yes (10% scientific) / No
Adapted from the Memory of the Netherlands Project Guidelines, 2003.

Memory Requirements

Several factors influence memory requirements, such as permanent or temporary storage needs, adding derivative files, and wasted space. Table 3 compares estimated to actual storage calculations for the current project (8 images), as well as projected memory needs for digitizing the entire book (626 images). Though inconsequential here, it shows how pivotal memory demands are in professional captures.

Table 3. Estimated vs. actual memory requirements.
Storage needs | Average file size per image | Memory (MB)
Estimated storage needs, 8 images** | 44 MB | TOTAL 440 MB
Actual storage needs, 8 images:
OCR/TXT | na | .02
JPG | na | 4.3
PDF | na | 11.4
TIFF | na | 354.3
TOTAL | | 370 MB
Estimated storage needs, 626 images:
OCR/TXT | 3 KB | 1.8
JPG | 536 KB | 328
PDF | 1000 KB | 611
TIFF | 44 MB | 27544
TOTAL | | 27 GB
**Calculation: number of image files x average file size x 1.25. (8 x 44MB x 1.25 = 440MB)

Digitization Costs

The final stage in pre-digitization is evaluating costs (Table 4). Variations in scan resolution, media, color depth, and dimensions should be calculated, as they greatly affect the cost per item. For example, doubling resolution can increase memory costs by four times (Fulton, 2012).

Table 4. Estimated costs for partial vs. entire source.
Estimated costs, 8 images | Value | Calculation
Average file size | 41.1 MB | 5in x 8in x 600dpi x 600dpi x 24-bit/8 = 43,200,000 bytes
Total memory needed (GB) | 0.4 GB | 8 images x 41.1MB = 328.8 MB = 0.32 GB x 1.25
TOTAL COST | incalculable | 0.4 GB x $2/GB
Based on 8 images, 5 x 8 inches, 600dpi, 24-bit color without compression at $2/GB plus a 25% allowance for derivatives.
Estimated costs, 626 images | Value | Calculation
Average file size | 41.1 MB | 5in x 8in x 600dpi x 600dpi x 24-bit/8 = 43,200,000 bytes
Total memory needed (GB) | 31.25 GB | 626 images x 41.1MB = 25,728 MB = 25 GB x 1.25
TOTAL COST | $62.50 | 31.25 GB x $2/GB
Based on 626 images, 5 x 8 inches, 600dpi, 24-bit color without compression at $2/GB plus a 25% allowance for derivatives (Cornell University, 2003).

Image Capture

Image capture is the core of the digitization process. 
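The arithmetic behind the pre-digitization estimates in Tables 3 and 4 can be reproduced with a short script. This is only a minimal sketch: the function names are hypothetical, and the 25% allowance and $2/GB price are simply the figures quoted in the tables, passed in as default parameters.

```python
# Sketch of the storage and cost arithmetic from Tables 3 and 4.
# Function names are illustrative assumptions, not part of any tool used in the project.

BYTES_PER_MB = 1024 * 1024


def uncompressed_bytes(width_in, height_in, dpi, bit_depth):
    """Uncompressed raster size: (width x dpi) x (height x dpi) pixels, bit_depth bits each."""
    pixels = (width_in * dpi) * (height_in * dpi)
    return pixels * bit_depth // 8


def storage_with_allowance(n_images, avg_size, allowance=1.25):
    """Total storage, in the units of avg_size, including the allowance for derivatives."""
    return n_images * avg_size * allowance


def storage_cost(gigabytes, price_per_gb=2.0):
    """Storage cost at a flat price per gigabyte."""
    return gigabytes * price_per_gb


page_bytes = uncompressed_bytes(5, 8, 600, 24)
print(page_bytes)                              # 43200000 bytes per page
print(round(page_bytes / BYTES_PER_MB, 1))     # 41.2 (Table 4 lists 41.1 MB)

# Doubling the resolution quadruples the uncompressed size.
print(uncompressed_bytes(5, 8, 1200, 24) // page_bytes)   # 4

print(storage_with_allowance(8, 44))           # 440.0 MB, as in Table 3
print(storage_cost(31.25))                     # 62.5 dollars, as in Table 4
```

Keeping the per-page size, the allowance, and the price as separate parameters makes it easy to see how a change in resolution or bit depth propagates to the total cost.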
This section details the workflow, adaptations, observations, and challenges encountered throughout.

PROFILE: Scanner, Software, and Hardware
Scanner: Hewlett-Packard PSC1610 Color Inkjet All-in-One
- 1200 x 4800 dpi optical resolution
- 48-bit color depth output
- Up to 8-bit grayscale (256 levels of gray)
- Twain 7.1 scanning compliant
Scanner Software: VueScan
Computer: Apple Macbook Pro (OS X 10.6.8)
OCR Software: ABBYY Fine Reader, Adobe Acrobat X Professional
Graphic Editing Software: Adobe Photoshop Elements 10
Ground Truth Software: Juxta, Aletheia
XML Editor: oXygen 13.2
Content Mgmt Software: WordPress and MAMP for OS X
URL / Web Hosting: via DreamHost

Calibration

Scanner and monitor calibration help to ensure that the whole range of analogue data is visible.

Scanner

The highest resolution and bit depth do not necessarily produce the best capture. These and other parameters depend on the feature profile of the original source document. Whether even the basic scanning parameters should be adjusted is debatable. The default parameters may produce the most neutral capture, but they may also produce poor results from an otherwise good scanner. To be thoroughly objective, test scans using both default and fine-tuned settings provide an additional layer of testing and calibration. The scanner used for this project was a low-end Hewlett-Packard PSC1610. The embedded software locked most settings, but the driver was TWAIN compliant, so VueScan scanning software was used instead. 
Professional-grade scanners would have provided lower noise levels, advanced software, and higher optical lens quality (Fulton, 2012). Another differentiating factor is their ability to deliver higher dynamic range (the tonal difference between the lightest light and the darkest dark) (Terras, 2008, chap.3).

Monitor

The monitor was calibrated using the OS X utility ColorSync and custom profiles. ColorSync also provides a graphic representation of color profiles. Figure 1 shows the color palettes for ProPhotoRGB (a), the monitor’s default AppleRGB profile (b), and the scanner’s default sRGB profile (c). The ProPhotoRGB space is noticeably bigger.

Figure 1. Color spaces for (a) ProPhotoRGB, (b) default for monitor, and (c) HP scanner.

ProPhotoRGB is a 16-bit color space with 4.5 billion brightness levels and trillions of colors. 
Though a higher profile is typically unnecessary, ProPhoto was set for the scanner, scanner software, and computer monitor. This ensured that colors would not be ‘clipped’ or compressed down to the low-end scanner space.

Test Scan

Two pages with content prone to scanning errors were identified from the book for testing. The first contained graphics and exhibited character distortion, joined glyphs, fading, and variable pitch, as shown in Figure 2.

Figure 2. Test page distortions.

This test page was primarily used to become familiar with the process. Next, four test scans were done with varying parameters (Table 5).

Table 5. Test scans.
Test scan | Color depth | Resolution | Media | Color balance
#1 | 1-bit B/W | 150dpi | Line art | NA
#2 | 1-bit B/W | 300dpi | Line art | NA
#3 | 24-bit RGB | 150dpi | Color | None
#4 | 24-bit RGB | 600dpi | Photo | None

The comparison of the corresponding OCR error rates in the four test scans is shown in Figure 3.

Figure 3. OCR error rates for the four test scans.

Image-Quality and Benchmarking

To verify calibration for optimal scanning, image-quality targets are often used to establish benchmarks for resolution, tonality, dynamic range, noise, and color. Benchmarking verifies the appropriateness of technical decisions for an optimal ‘capture workflow’. 
Documenting the steps and creating a subsequent checklist for staff helps establish consistency across the whole volume (Carignan et al., 2007). Large digitizations may rely on a representative selection from the entire corpus for assessing image quality (Rieger, 2008). 
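One simple way to draw such a representative selection is to take evenly spaced items across the corpus. The sketch below is illustrative only: the function name and the default sampling fraction are assumptions made for the example, and a real project would set the fraction from its own guidelines.

```python
import math


def evenly_spaced_sample(n_items, fraction=0.10):
    """Return zero-based indices of an evenly distributed sample.

    n_items  -- total number of items (e.g., page images) in the corpus
    fraction -- share of the corpus to inspect (assumed default of 10%)
    """
    size = max(1, math.ceil(n_items * fraction))
    step = n_items / size
    # Spread the sample across the whole corpus rather than taking the first items.
    return [int(i * step) for i in range(size)]


sample = evenly_spaced_sample(626)
print(len(sample))    # 63 items for a 626-page book
print(sample[:5])     # first few indices of the sample
```

Rounding the sample size up keeps the selection at or above the requested fraction, and the fixed step ensures that the front matter, main text, and back matter are all represented.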
Guidelines recommend that the sample include at least 10% of the total collection, distributed evenly (Koninklijke Bibliotheek, 2003). When formats vary (e.g., books, photos, and maps), an example is selected from each category. A quality manager or conservator may also be consulted.

Test targets can be assessed either manually (e.g., Macbeth ColorChecker) or automatically. Mass digitization often employs automated applications. Among them, the DICE tool, in development at the Library of Congress, will result in an assessment target and associated software for automated analysis. GoldenThread is high-end commercial inspection software created to assess scanning output. It allows users to set quality metrics and provides automated analysis of images and reports. GoldenThread includes two targets for two distinct situations: (1) fully qualifying an imaging device such as a camera or scanner; and (2) capturing information during the digitization. 
Objects are analyzed for performance against NARA and UCB for tone scale, L*a*b* for color accuracy, ISO 12233 for resolution, as well as other measures like noise, color encoding accuracy, white balance, and light uniformity (Conway & Williams, 2011).

Scanning

The capture profile for the master TIFF files is presented below. Specifications for derivative files are discussed later under “Post-Digitization”.

PROFILE: Image capture specifications
File | Resolution | Bit Depth | Color Space/Model | Color Profile | Image Format
Master | 600 dpi | 8-bit | RGB | ProPhoto RGB | TIFF (IBM byte order, uncompressed)

Observations

COLOR: Capturing the tonal values of the analogue original was a problem. The pages were tan with darker browning around the edges. Optimizing the settings took several passes and produced several failed scans along the way (e.g., block shifting mid-page, a 3D effect, extreme blurring).

PROCESSING TIME: Even more apparent was how slow and painstaking this capture method was. Every page required cropping, the slow scanner engine took a considerable length of time per page, and small particles and fragments fell on the glass.

BANDING: Bands of shadows appeared repeatedly in the scans, presumably from warping. A bitonal high-threshold scan confirmed this conjecture. The black regions in Figure 4a are warping. 
These correspond directly to the shadowed and blurred regions in the scans (Figure 4b at 150%).

Figure 4. Page warp (a); corresponding shadow and blurring (b).

Along with other settings, color balance was adjusted to “white balance” and “neutral” with little success. A clean sheet of printer paper revealed the same banding patterns of shadows. Therefore, the scanner’s low quality was the likely cause, given that the banding persisted across different source documents. Table 6 is a list of technical considerations, assembled from various sources, that were used to check the workflow (HathiTrust, 2007; Conway, 2011; Pletschacher, 2011).

Table 6. Workflow Technical Decisions
Image Enhancement Factors:
- Assign color profile (screen or print)
- Adjust/correct color
- Adjust/correct tone (histogram)
- Cropping/deskew
- Reverse polarity (negative to positive)
- Apply sharpen mask
- Remove scanner and film effects
- Resize for screen display (store master)
- Page splitting
- Border removal
- Dewarping (page curl, arbitrary warping)
- Noise removal
- Binarisation
Layout Analysis:
- Segmentation of regions, lines, words, characters
- Region classification
- Logical layout analysis

Comparative Analysis

A recent case study showed that the optimal capture level for OCR output was 400dpi 8-bit greyscale: 300dpi was sufficient, 400dpi was best, and 500dpi lowered accuracy (Antonacopoulos, 2011). These settings are optimal for large digitizations where access is the primary goal. A 600dpi capture was more consistent with the research goals and size of this digitization. Nonetheless, every project is unique, and the prevailing rule seems to be that there is no rule. 
Table 7 shows varying capture specifications for four mass digitization projects with the common goal of “expanding access to scholarly resources” (Rieger, 2008).

Table 7. Capture specifications for mass digitization projects.
Project | Resolution/Bit Depth/Image Format | Capture Device
Google Book Search, University of Michigan facility | Text: 600 dpi, 1 bit, TIFF ITU G4; Illustration: 300 dpi, JPEG2000 (a) | Custom-built digital scanning system by Kirtas (see below)
Microsoft Live Books, Cornell University facility | JPEG, 300-400 dpi, 8-24 bit | Kirtas APT 2400 or Scribe digital camera station
Open Content Alliance | JPEG 2000, 400 dpi, 12 bit | Scribe digital camera station
Million Book Project | TIFF ITU G4, 600 dpi, 1 bit | Minolta Scanner
(a) (Karle-Zenith, 2006)

As the pioneer in mass digitization, Google’s technical framework was researched in depth for comparing best practices. The analogue book is placed flat under two infrared cameras and an infrared projector (Figure 5a) that displays a special pattern (Figure 5b) on the surface of the book. The images captured by the cameras, positioned at opposing angles above, are then stereoscopically combined to create a three-dimensional map of the pattern, which represents the surface of the book. The scanning software adjusts for the page distortions (bulge), spatially calculates the location of the book’s spine, and de-warps the image to create a flat plane ready for OCR (Figure 5c). A video of the scanning device purportedly used by Google Books is available online.

Figure 5. Google scanning technology (Lefevere & Saric, 2009).

Replicating the scale of Google’s digitization project is impractical for most memory institutions. Nonetheless, as the ‘Alexandria of digital libraries’, Google is reaching important milestones for future widespread application.

Quality Control

Quality checks for images include sharpness, brightness, shadows, alignment, color, detail, and granularity. Checks for textual content include letter sharpness and clarity, legibility, and gaps due to cropping (Kenney & Rieger, 2000). The guidelines in Table 8, specified for 600dpi text/line art images without magnification, were used to visually inspect the images.

Table 8. Quality control visual assessment
- Full reproduction of the page, with skew under 2% from the original
- Sufficient contrast between text and background; uniform density across the image, consonant with the original
- Text legibility, including the smallest significant characters
- Absence of darkened borders at page edges
- Characters reproduced at the same size as the original; line widths (thick, medium, and thin) rendered faithfully
- Absence of wavy or distorted text
- Individual letters are clear and distinct
- Adjacent letters are separated
- Capture of the range of tones contained in the original
- Consistent rendering of detail in the light and dark portions of the image
- Even gradations across the image
- Absence of “noise” such as moiré patterns and other distorting elements
- Presence of significant fine detail contained in the original
(Cornell University, 1997)

Observations

OCR: Odd OCR errors occurred in areas where the text had no abnormalities (e.g., an underscore added at “corrections_to”). Spacing was missing between words in bitonal scans but not in color scans, for no apparent reason. Other words were spaced more closely but scanned correctly, so variable pitch could be ruled out. 
Lastly, the OCR engine returned the text “FIG” as lowercase, taking interpretive liberties that could distort semantic meaning, especially in scientific content (Duguid, 2007).

PAGINATION: Verifying page numbers against the file-naming schema is a vital QC step. Scanning software typically assigns sequential file names while scanning. This feature caused misalignment with the actual page numbers, and prompted further investigation into the solutions adopted by large-scale projects. One practice uses the file name to check for missing pages. For example, if a difference of six separates the file number from the page number, then all subsequent pages should show the same offset. 
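The offset check described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name and the sample spot-check data are hypothetical, and the printed page numbers are assumed to have been read off the scans by hand.

```python
def find_offset_breaks(spot_checks):
    """Report where the file-number/page-number offset changes.

    spot_checks -- list of (file_number, printed_page_number) pairs, in scan order
    Returns a list of (file_number, previous_offset, new_offset) tuples.
    """
    breaks = []
    previous = None
    for file_no, page_no in spot_checks:
        offset = file_no - page_no
        if previous is not None and offset != previous:
            # A changed offset suggests a skipped or duplicated page before this file.
            breaks.append((file_no, previous, offset))
        previous = offset
    return breaks


# Hypothetical spot checks: the offset is 6 until a duplicate scan shifts it to 7.
checks = [(7, 1), (27, 21), (58, 52), (60, 53)]
print(find_offset_breaks(checks))   # [(60, 6, 7)]
```

Because only the change in offset is reported, the check narrows the search for a missing or duplicated page to the interval between two consecutive spot checks.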
Performing this check at intermittent points in the scan can prevent time-consuming errors.

Post-Digitization

Common post-digitization steps may include creation of derivative files and an archival information package, data collection, ingestion into a repository system, provision of access for end users, and online publication (FADGI, 2010). Table 9 outlines the post-processing workflow that was implemented. Optimization posed problems and required several revisions. Creating derivatives became a trial-and-error process. The full version of ABBYY may have streamlined the steps, though the exercise was a valuable experience.

Table 9. Post-processing workflow for derivative files and publication
Step 1. TIFF to JPG (300 dpi, 8-bit, lossy) | PhotoShop Elements 10
Step 2. TIFF to PDF (searchable) | Adobe Acrobat Pro X
Step 3. PDF to TXT (OCR) | ABBYY FineReader (demo)
Step 4. TXT to DOCX | MSWord 14.1.4
Step 5. DOCX to XML/TEI** | OxGarage
Step 6. XML/TEI encoding | OxYgen 13.2

MASTER: All original TIFF files were saved separately as master files without alterations.

1. TIFF to JPG: PhotoShop was used to create JPG files from TIFF, with minimal cropping and color adjustments.

2. TIFF to PDF: Searchable PDF files were created using Adobe Acrobat with the ClearScan option. A problem was encountered on the first page: Adobe reported nearly 80% of the words as “suspects”. That is, the software had low confidence in matching the words to its lexicon. Manual inspection showed that the words had actually been identified correctly and that non-uniform spacing had triggered the error. Manual corrections were made, though this would be unacceptable in mass digitization work. The problem did not occur on subsequent pages. For evaluation purposes, searchable PDFs were also converted using ABBYY FineReader on two pages. ABBYY’s ability to designate content by type (text, picture, table) was advantageous, as was the ClearScan feature, which vectorizes the image and renders sharper glyphs at a fraction of the file size. A custom font is also created based on pixel formations from a new character set. The result is considerably better character resolution and legibility. Although the output was better for presentation, the page background was converted to low resolution and the color altered, thus diminishing authenticity.

3. PDF to TXT: OCR was performed from the master TIFF file using ABBYY FineReader. An informal analysis was made on the vectorized PDF created by Adobe’s ClearScan. Theoretically, ClearScan’s sharpening effect should have increased OCR performance. The test page showed that OCR from the master TIFF performed essentially the same (4 errors) as OCR from the ClearScan PDF (3 errors). Using ABBYY for the entire process had similar results (5 errors). With OCR accuracy being equal and Adobe creating the best PDFs for website rendering, using both Adobe and ABBYY presented the best workflow. REVISION: After closer inspection, Adobe’s standard setting was used. 
The ClearScan PDFs varied too much from the original and showed artifacts.

4. TXT to DOCX: MSWord was used to correct OCR errors and format the text to match the original source.

5. DOCX to XML/TEI: This step resulted from a problem encountered in the workflow. The time required for TEI encoding quickly escalated, especially for scientific documents laden with chemical formulae (e.g., p 364). Hand-coding anything larger seemed highly inefficient. OxGarage is a semi-automated application that converts MSWord documents to XML while preserving simple, though highly repetitive, markup (e.g., italics, underline, subscript). Only the <body> from the OxGarage-generated XML file was used. Some cleanup was still required, but most formatting was converted correctly and greater efficiency was restored. Although semantic elements still required encoding, this method could potentially translate to larger projects where students or staff familiar with MSWord are employed to undertake this menial component. Such was the case in the EBBA archive, which in turn prompted interest in alternative tools for the present project (NIH, 2007).

6. XML/TEI: Markup in XML followed TEI P4 encoding and guidelines from the Thomas MacGreevy Archive. Several questions and decisions arose during the encoding process. For example, should graphs be encoded or simply designated “Table X” for proper representation? 
Does <extent> in the <fileDesc> apply to the XML file or the TIFF file? A final observation concerns the importance of a proactive encoding and editorial plan; without one, cumbersome tracking and retroactive edits are required. For example, most of <editorialDecl> initially had to be marked “TBD” or given <!--comment--> placeholders for areas that required verification or completion, because the body had not yet been encoded. Completing the <teiHeader> after <text> seemed the preferable strategy.

References

Antonacopoulos, A. (2011) The Effect of Scanning Parameters on OCR Results. [Accessed 6 March 2012].
Bearman, D. (1989) Archival Methods: Archives & Museum Informatics Technical Report #9 [Internet]. [Accessed 2 November 2010].
Carignan, Y., Evander, J., Gueguen, G., Hanlon, A., Murray, K. & Roper, J. (2007) Best practice guidelines for digital collections at University of Maryland Libraries. University of Maryland Libraries. [Accessed 5 March 2012].
Conway, P. (2010) Measuring Content Quality in a Preservation Repository: HathiTrust and Large-Scale Book Digitization. In: Proceedings of 7th International Conference on Preservation of Digital Objects. iPres 2010. Vienna, Austria, Austrian Computer Society, pp.95–102. [Accessed 22 March 2012].
Conway, P. (2011) SI 675 Digitization for Preservation, Week 2 Lecture- Image Digitization and Guidelines. [Accessed 1 April 2012].
Conway, P. & Williams, D. (2011) Enhanced Education for Better Imaging Practices: A Case Study at the University of Michigan. [Accessed 14 March 2012].
Cornell University (2012) Copyright Term and the Public Domain in the United States [Internet]. [Accessed 23 March 2012].
Cornell University (1997) Model Request for Proposal (RFP) for Digital Imaging Services prepared by Cornell University Library. Research Libraries Group, Inc.
[Accessed 10 March 2012].
Cornell University (2003) Moving Theory into Practice. Digital Imaging Tutorial [Internet]. [Accessed 15 January 2012].
Dahlström, M., Hansson, J. & Kjellman, U. (2012) “As We May Digitize”—Institutions and Documents Reconfigured. LIBER Quarterly, 21 (3/4), pp.455–474. [Accessed 22 May 2012].
Deegan, M. & Sutherland, K. (2009) Transferred illusions: digital technology and the forms of print. Ashgate Publishing, Ltd.
Duguid, P. (2007) Inheritance and loss? A brief survey of Google Books. First Monday, 12 (8). [Accessed 15 January 2012].
Duranti (1995) Reliability and authenticity: the concepts and their implications. Archivaria, 39 (10).
Earnshaw, R.A., Vince, J. & Carr, R. (2008) Digital convergence?: libraries of the future. London, Springer.
FADGI (2009) Guidelines for TIFF Metadata, Recommended Elements and Format. Federal Agencies Digitization Initiative (FADGI). Available from: <guidelines/TIFF_Metadata_Final.pdf> [Accessed 14 March 2012].
FADGI (2010) Technical guidelines for digitizing cultural heritage materials: Creation of raster image master files. Federal Agencies Digitization Initiative (FADGI). Available from: < 24.pdf> [Accessed 14 March 2012].
Fulton, W. (2012) Scanning Basics 101 [Internet]. [Accessed 13 January 2012].
HathiTrust (2007) Image Quality Review Manual. Ann Arbor, MI, USA, University of Michigan Library. Available from: <documents/UM-QR-Manual.pdf> [Accessed 3 March 2012].
Karle-Zenith, A. (2006) Google Book Search and the University of Michigan. In: Charleston, SC USA, Libraries Unlimited. [Accessed 22 March 2012].
Koninklijke Bibliotheek (2003) Guidelines and procedures for the execution of projects for Memory of the Netherlands. Memory of the Netherlands Project Office.
Available from: <geheugenvannederland.nl/hgvn/webroot/files/File/PDF/richtlijnen/Microsoft%2520Word%2520-%2520Richtlijnen%2520en%2520procedures%2520%25204.0%2520engels.pdf> [Accessed 14 March 2012].
Lefevere, F. & Saric, M. (2009) Detection of grooves in scanned images. [Accessed 24 March 2012].
Minerva Project (2004) Good Practices Handbook. Minerva Working Group 6. [Accessed 15 January 2012].
Murray-John, P. (2011) Introduction to CMSes. WordPress, Omeka, and Drupal. [Accessed 12 February 2012].
NC ECHO Project (2007) Guidelines for Digitization 2007 Revised Edition. North Carolina, Institute of Museum and Library Services. [Accessed 23 March 2012].
NIH (2007) Roxburghe Ballad Archive: Preservation and Access. Washington, DC, National Institute for Humanities. [Accessed 5 February 2012].
NISO (2007) A Framework of Guidance for Building Good Digital Collections. 3rd ed. Baltimore, MD, USA, National Information Standards Organization. [Accessed 17 April 2012].
Pletschacher, S. (2011) IMPACT Evaluation Tools, ground truth and datasets. [Accessed 1 April 2012].
Rieger (2008) Preservation in the Age of Large-Scale Digitization. Washington, DC, Council on Library and Information Resources.
Rydberg-Cox, J.A. (2009) Digitizing Latin Incunabula: Challenges, Methods, and Possibilities. Digital Humanities Quarterly, 3 (1). [Accessed 5 February 2012].
Schreibman, S., Siemens, R.G. & Unsworth, J.M. (2004) A companion to digital humanities. Oxford, Wiley-Blackwell.
Terras, M.M. (2008) Digital images for the information professional. Burlington, VT, Ashgate Publishing, Ltd.
Weel, A. van der (2011) Changing Our Textual Minds: Towards a Digital Order of Knowledge. Manchester, England, Manchester University Press. [Accessed 13 March 2012].

Resources

An Absolute Beginner's Introduction to TEI P5 XML.
Workshop at NUI Galway, 13-17 April 2009
Attributes for Common Compression Techniques
Copyright Term & the Public Domain in the United States
Developing WordPress Locally With MAMP | Smashing WordPress
Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web
Google and Univ of Michigan Digitization Project Workflow Chart
Guides to Quality in Visual Resource Imaging
IMPACT
NC ECHO Digitization Manual Guidelines
Omeka
Repository Software Survey, Product Comparison Table
Representative Institutional Requirements for TEI by Example
Using Omeka to Build Digital Collections: The METRO Case Study
Videos from the IMPACT Final Conference 2011

Digital libraries used in research

Thomas MacGreevy Archive
English Broadside Ballad Archive (EBBA)
Walt Whitman Archive
Dred Scott Archive
EEBO-TCP
ECCO-TCP
Evans-TCP
The William Blake Archive
The Mark Twain Papers and Project

Specifications for digitization of original source material, Digital Metallurgy, by the Internet Archive.

Resolution: 600 dpi
Scanning mode/color space: Not available
File Format-Master: TIFF
OCR: ABBYY FineReader 8.0
Catalog Metadata: MARCXML
Details: Master and use copy. Digital master created according to Benchmark for Faithful Digital Reproductions of Monographs and Serials, Version 1. Digital Library Federation, December 2002.
(Source: OCLC Number 681415385)