


ESA/STAT/AC.84/9

6 July 2001

English only

Symposium on Global Review of 2000 Round of

Population and Housing Censuses:

Mid-Decade Assessment and Future Prospects

Statistics Division

Department of Economic and Social Affairs

United Nations Secretariat

New York, 7-10 August 2001

Adapting new technologies to census operations*

Arij Dekker**

* This document was reproduced without formal editing.

** The Netherlands. The views expressed in the paper are those of the author and do not imply the expression of any opinion on the part of the United Nations Secretariat.

Adapting new technologies to census operations

by

Arij Dekker

Specialist in Census Technology

Paper prepared for

The Expert Group Meeting on

Global Review of 2000 Round of Population and Housing Censuses:

Mid-Decade Assessment and Future Prospects

United Nations Statistics Division

New York, 7-10 August 2001

The views expressed are those of the author.

Contents

1. Introduction

2. Management, communication, logistics, quality assurance

3. Data capture

3.1 Intelligent Character Recognition (ICR)

3.2 Automatic coding

3.3 Outsourcing and decentralization

4. GIS, remote sensing and GPS

5. Data processing and storage

5.1 Census processing software

5.2 Data storage

6. Use of the Internet

6.1 The Internet for data collection

6.2 The Internet for data dissemination

7. Data dissemination – other issues

7.1 Statistical disclosure control

7.2 High-capacity physical media

7.3 Structured archives – the statistical data warehouse

8. How to choose appropriate technology

9. More

10. Conclusions

11. Discussion

References

Glossary

1. Introduction

It is commonly known that the art of population census taking goes back many centuries. Ever since the end of the 19th century, there have been efforts to take advantage of a succession of newly available technologies to make such large and costly statistical enquiries more efficient and effective. A census is labor-intensive, requiring large numbers of temporary staff. Personnel costs usually are the principal component of census budgets, with expenditure for information and communication technology coming second.

Even small improvements in the methodologies used, or in the effectiveness of the equipment, can result in important gains in the quality, and/or reductions in the cost, of the whole operation. Census budgets depend on national cost levels and the depth of the enquiry, but generally vary between a few dollars per capita in low-cost countries and as much as 30 dollars per capita in highly developed environments. A rough estimate of the total expense of the current Round of Censuses would put it between 30 and 50 billion dollars. Certainly an enticing target for those trying to improve value for money.

The name of Herman Hollerith stands out as an early adopter of modern technology for census work. He borrowed from the ideas of Joseph-Marie Jacquard, who had invented punched cards to control looms. Hollerith saw a way to use such cards in sorting and tabulation. By doing this he not only expedited the release of the results of the 1890 US census; he started an entire industry.

There have been many less-known census innovators who have put newly discovered methods and technology to good use. Information technology has usually been at the forefront of these efforts. Census data processing equipment has graduated from machines merely assisting tabulation work to indispensable tools in virtually all phases of census work. Computers are used for planning, to support mapping, in project management, in all stages of data capture, cleaning, coding, and reporting, and in demographic analysis [De97]. Many of the recent improvements in census taking have been possible thanks to the ever-growing capabilities of data processing equipment and of communication networks operating at local, national, and world-wide levels. For the sake of continuity it is important that the use of newer technology is embedded into, and builds upon, existing sound methodology [UN98].

There are presently several important efforts to bring co-ordination and focus to the innovation process in official statistics and census taking. One is the Paris21 initiative: Partnership in Statistics for Development in the 21st Century. The members of Paris21 – there are several hundred of them – are drawn from leading national and international statistical agencies, academic institutions, and similar bodies. One of the issues currently under review within the Paris21 initiative is how census work can be made more cost-effective. See the web site for details [PaWW].

The United Nations Statistics Division has a long history of furthering sound statistical principles and the sharing of know-how. A web site giving access to information on good statistical practices has recently been opened [UNWW]. On a regional scale, Eurostat has conducted a series of technical seminars under the names NTTS (New Techniques and Technologies for Statistics) and ETK (Exchange of Technology and Know-how). In 2001 these meetings were held in combined form in Crete, Greece, in June.

Noteworthy also is the Eurostat web site by the name of VIROS, Virtual Institute for Research in Official Statistics [EuWW1]. VIROS identifies and classifies areas of research where participating organizations may place the results of their studies and experiences, while remaining entirely responsible for them. Eurostat acts as a central co-ordinator, attempting to integrate the individual elements into a coherent set. The ultimate goal is to facilitate access to information on research activities and results. Eurostat is naturally interested in such issues, facing, as it does, the need to combine many statistical traditions and to overlay them where possible with state-of-the-art integration technology.

When considering the technological options before them, census offices face a number of questions. Some of these are:

• how to make an informed choice in selecting appropriate technology;

• how to maintain the integrity of the existing statistical and census systems;

• how to deal with the option of outsourcing, and with the management of outsourced tasks;

• confidentiality concerns relating to the preferred solutions.

This paper will look briefly at various areas where census work has recently benefited from new technology, and will discuss the issues referred to above. Definite answers to the questions raised can be formulated only by individual census organizations themselves.

2. Management, communication, logistics, quality assurance

A nation-wide census differs in many respects from day-to-day statistical work. It lacks the repetitive nature that allows collections with greater periodicity to be improved gradually. The level of expenditure and the number of staff are much higher than statistical managers are used to. Some governments therefore establish census offices separate from the national statistical agency. It may be necessary to recruit professional management, experienced in dealing with large but temporary organizations. Since a census can be seen as a large time-critical project with many interlocking operations, the use of modern project management software is of vital importance.

A census operation requires efficient communication between (many) thousands of persons, as well as procurement and storage of a large variety of items, most of which have to be distributed to all corners of the country, and then recollected.

Recent developments in mobile telephony (cell phones) have made person-to-person communication easier, even in countries with extensive and reliable fixed-line networks. But complete mobile coverage has not been accomplished in most developing countries, and census communication with remote areas continues to be problematic in some cases. It is still possible that satellite telephone systems, which function everywhere on earth, will fill this void. Some ambitious projects in this domain, such as the one known as “Iridium,” have not drawn enough initial subscribers, but with most of the enormous investment costs now written off, user prices are coming down. The ground stations, including antennas, are still rather bulky, but completely portable. Operations planners need to be cognizant of all communications options open to them, including the regional differences, and make arrangements accordingly.

Where printed or printable communication is required, fax technology is rapidly giving way to electronic mail. This is true for census operations too, but relying on e-mail entails vulnerability to Internet service interruptions, computer illiteracy, and virus attacks. It is important always to retain a fax capability as a backup.

Improved computer software and the wide availability of PCs have made managing the movement of goods much easier. Bar-code technology can be a key element in this. Using bar codes instead of printed numbers has advantages in avoiding transcription errors and speeding up processing. A combination of the two can be used where easy human reading of the codes is sometimes also required. Census managers, who are usually not logistics professionals, tend to overlook this established technology.

A typical application of bar-code technology is to label all items specific to a particular enumeration area (maps, enumerator ID, summary sheets, transport box) with their own bar codes. At the point where the materials are sent out, the codes are scanned, allowing automatic update of a database of items forwarded. The same process can be used to maintain a database of items retrieved from the field.
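
By way of illustration, here is a minimal sketch of such a dispatch-and-return register in Python (the table layout, field names, and bar-code format are hypothetical; a real operation would add operator IDs, batch numbers, and audit logging):

    import sqlite3
    from datetime import datetime, timezone

    # Hypothetical dispatch-and-return register for bar-coded census materials.
    conn = sqlite3.connect("census_logistics.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS items (
            barcode     TEXT PRIMARY KEY,   -- e.g. 'EA041502-BOX' (made-up format)
            ea_code     TEXT NOT NULL,      -- enumeration area the item belongs to
            item_type   TEXT NOT NULL,      -- 'map', 'summary sheet', 'box', ...
            sent_at     TEXT,               -- timestamp of the outbound scan
            returned_at TEXT                -- timestamp of the inbound scan
        )""")

    def register_item(barcode, ea_code, item_type):
        conn.execute("INSERT OR IGNORE INTO items (barcode, ea_code, item_type) "
                     "VALUES (?, ?, ?)", (barcode, ea_code, item_type))
        conn.commit()

    def record_scan(barcode, outbound):
        """Log one scan: outbound when materials leave, inbound on return."""
        column = "sent_at" if outbound else "returned_at"
        now = datetime.now(timezone.utc).isoformat()
        conn.execute("UPDATE items SET %s = ? WHERE barcode = ?" % column,
                     (now, barcode))
        conn.commit()

    def still_in_field():
        """Items sent out but not yet scanned back in."""
        return conn.execute("SELECT barcode, ea_code, item_type FROM items "
                            "WHERE sent_at IS NOT NULL "
                            "AND returned_at IS NULL").fetchall()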

Labeling individual questionnaires with unique codes can also be helpful, although the resulting administrative overhead is considerable. Such identifiers can protect against the fairly common problem that entire batches of questionnaires arrive back erroneously geocoded. Standard retail scanners, but also most intelligent character recognition systems (see Section 3.1), will read bar codes without difficulty.
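
To illustrate how such identifiers catch a mis-geocoded batch, a small sketch (the numbering scheme, with the enumeration-area code embedded in the questionnaire identifier, is a hypothetical choice):

    def audit_batch(batch_ea, scanned_ids):
        """Flag questionnaires whose embedded enumeration-area code
        disagrees with the area recorded for the batch they arrived in."""
        return [qid for qid in scanned_ids if not qid.startswith(batch_ea + "-")]

    # A batch labeled as area 041502 containing one stray form from area 041617.
    print(audit_batch("041502", ["041502-0001", "041502-0002", "041617-0118"]))
    # -> ['041617-0118']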

Quality assurance, including the use of scientifically sound sampling methods, should be an integral part of all census operations. Many of the methods in this field depend on statistical principles, and have been developed by statistical innovators [De86]. The census office must strive for a consistent level of assured quality throughout its operations, and cannot afford to disregard the techniques that help to achieve and verify it [SS01].

3. Data capture

3.1 Intelligent Character Recognition (ICR)

It is probably true to say that the current Round of Censuses has seen the breakthrough of ICR technology. In the 1985-1994 Round only about 20% of countries undertaking censuses used some form of character or mark recognition [De94]; the large majority still relied on keyboard data capture. In the current Round nearly all census offices of industrial market economies – and numerous others – apply imaging through scanners, recognition software, and whatever else is required to (partially) do away with manual data entry.

There is no doubt that recognition technology has made great strides in the last decade, but it seems true also that the example provided by census “pioneers” has made switching course easier for those organizations that might otherwise have hesitated. ICR offers a promise of greater efficiency, but is inherently riskier than keyboard data entry. For example: poorly designed or badly printed questionnaires are a nuisance in manual data entry, but may sink an anticipated ICR data capture operation. The need for elaborate pre-tests, already so obvious in traditional census taking, is even more apparent when scanning technology is to be used.

The main remaining fundamental problem is that handwritten characters are often poorly recognized when the writer is not already known to the recognition system. In censuses that use self-response or a large number of enumerators, this is obviously the case. To avoid the problem, it is possible to limit automatic recognition to marks or numeric digits only. But even digits cannot always be reliably interpreted, so quite a few manual data-entry personnel will still be required to fill the gaps.

Scattered information suggests that the ICR process does not always proceed as smoothly as anticipated. Experiences obtained during the final operations tests induced the US Bureau of the Census to move from a one-pass to a two-pass processing system, in which sample data from the long forms will only be computer-stored during a second capturing operation [Pr00]. This change of approach has had no effect on processing deadlines. Some European countries (for example: Estonia) have reported difficulties in recognizing handwritten alphabetic characters, requiring them to hire additional staff to assist the automatic recognition process. A recent meeting in Bangkok [UN01] heard about problems of varying severity in Thailand, the Philippines, China, Macao SAR, and Indonesia. For details of the problems experienced, retrieve the country papers from the Web site referred to.

In Thailand, earlier plans to establish 15 regional ICR centers for the April 2000 Census were cancelled after more sophisticated (and expensive) scanners and software turned out to be required. A single ICR complex now operates in Bangkok (Fujitsu 4099 scanners, TeleForm software). Some problems were reported with poorly written characters and scanner maintenance.

The May 1st, 2000 Census of the Philippines works with four decentralized capture centers, using Kodak 3590 scanners and Eyes and Hands software. One of the biggest problems here is that the print quality of some questionnaires is not in accordance with specifications, which causes the ICR software to tag them as unidentifiable. Another difficulty is illegible handwritten entries. The number of verification licenses, required to manually correct such rejects, had been underestimated. This has been a learning process. Experiences are sufficiently positive for ICR to be used again for the upcoming Census of Agriculture and Fisheries.

China, Macao SAR reports good results for its pilot operation for the 2001 Census. The paper contains an interesting table, obtained from a sample of 150,000 images of digits. The table does not immediately confirm the effectiveness of ICR as implemented. It would seem useful to train enumerators in how best to write certain numerals.

Digit                  0      1      2      3      4      5      6      7      8      9    All
Recognition rate (%)  94.83  96.83  94.92  91.11  96.00  94.95  97.29  97.72  90.43  81.74  95.64
Reject rate (%)        5.17   3.17   5.08   8.89   4.00   5.05   2.71   2.28   9.57  18.26   4.36
Accuracy rate (%)     99.38  99.89  99.78  99.73  99.89  99.41  99.79  99.59  99.12 100.00  99.72
Error rate (%)         0.62   0.11   0.28   0.27   0.11   0.59   0.21   0.41   0.88   0.00   0.28
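
To see what such rates imply for the workload, a rough calculation helps. The sketch below reads the “All” column against the sample of 150,000 digit images quoted above; the interpretation that the error rate applies only to the digits the system accepted is an assumption, though one consistent with the way each pair of columns sums to 100%.

    # Rough workload reading of the 'All' column of the table above.
    total = 150_000                # digit images in the sample
    rejected = total * 0.0436      # routed to manual keying: 6,540 digits
    accepted = total - rejected    # accepted automatically: 143,460 digits
    slipped = accepted * 0.0028    # wrong but undetected: about 402 digits
    print(f"{rejected:,.0f} rejects, {slipped:,.0f} undetected errors")
    # Lowering the pre-set confidence ("security") level shrinks the first
    # figure but inflates the second - the trade-off noted later in the text.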

ICR for the July 1st, 2000 Census of Indonesia is handled by 29 processing centers throughout the country, using Kodak DS 3500 scanners and NCS NestorReader recognition software embedded in the office’s own Visual Basic programs. The country paper reports many problems that hamper the census ICR operation. These include sub-standard questionnaire printing (despite elaborate quality controls), poor writing by enumerators, inadequate document handling in the field resulting in unusable forms, scanner maintenance problems, and complex file management. The authors deserve the highest praise for sharing these experiences for others to learn from. The massive nature of the operation in Indonesia, scattered civil unrest, financial constraints, and various logistics problems have obviously all been factors here. Despite the difficulties, CBS Indonesia is confident that the data capture operation will be completed successfully.

The October 2000 Census of Aruba (not reported in Bangkok) used Fujitsu M3079DG scanners and Eyes and Hands software. All data for this small country of about 100,000 people were captured by April 2001. The operation was quite carefully prepared, and proceeded smoothly, including the integrated computer-assisted coding work. There were no cost advantages compared to keyboard data entry.

Such problems as are reported can be divided into those that have to do with the recognition process itself, and all others. If the recognition rate is unacceptably low, this can usually be remedied by reducing the pre-set security level. But there is a price to pay: error rates will go up. Other problems may include unreliable paper transport in the scanners, which can have many causes, including dirt, the use of “white-out” on sheets, and damaged forms, possibly as a result of bad weather conditions. It is not unheard of for such difficulties to require large numbers of questionnaires to be transcribed, again increasing error rates.

As a general rule, success is often reported by census offices that went through a long and careful preparation process, including several pre-tests. Those that had to cut the groundwork short may become the source of less fortunate stories. Complete quality assurance management – for example in the printing process of the questionnaires – is of the essence here.

If recognition of handwritten text is now becoming a more reliable tool, it would be logical to think of speech recognition as the next step. After all, this is a more direct method of data collection. Speech recognition has broad economic potential, and is a topic of much research. Some commercial applications of this technology are appearing, especially in processing verbal instructions received by telephone, and in the automotive industry. But progress in this area has been slower than expected. Statistical applications are still rare.

3.2 Automatic coding

The usual purpose of recognizing verbal texts is to enable associated automatic coding. That is, the computer reads a text, for example the name of a geographic area, and then selects the applicable code from an associated file or database.

Such solutions, which ideally would allow completely automatic data capture and coding, depend on two prerequisites: (i) the recognition process must be sufficiently reliable, and (ii) the search algorithms must indeed lead from the recognized term(s) to the appropriate code. A 100% character recognition rate is not required, since the algorithm may still be successful with incomplete or partially mangled terms.

However, there are indeed problems with this process. First there is the recognition reject rate, as referred to above, which might require an unexpected level of human intervention. Next comes the difficulty of automatically determining the applicable codes, the severity of which depends on the nature of the variable concerned. Geographic terms are usually not too difficult to code automatically, except perhaps at the lowest level (e.g. village), where spelling may not be standardized and homonyms occur. Occupation and Industry tend to be more problematic. Despite the efforts of census field staff to extract full information from respondents, these variables will often be reported in terms that cannot easily be linked to ISCO, ISIC or NACE codebooks.

The issues of automatic and computer-assisted coding have been the subject of considerable research [Me97, Do99, Bl97]. The tasks are a challenge to those applying modern methods of artificial intelligence, neural networks, and fuzzy logic. But however elegant and advanced the matching algorithms are, once reporting from the field is ambiguous, too general, or otherwise inadequate, there is no easy way out. Many specialists feel that in those situations it is difficult to conceive automatic solutions that approach in quality the judgment of an experienced human coder. By letting the computer take care of the simpler cases, and routing the remainder to human coders, an efficiency gain can nevertheless be obtained.
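
This division of labor between machine and human coder can be sketched with simple string similarity (the mini-codebook, the ISCO-style codes, and the cutoff value are invented for illustration; the production systems cited above use far more sophisticated matching):

    import difflib

    # Hypothetical mini-codebook: occupation title -> ISCO-style code.
    CODEBOOK = {
        "primary school teacher": "2341",
        "secondary school teacher": "2330",
        "motor vehicle mechanic": "7231",
        "subsistence crop farmer": "6310",
    }

    def code_occupation(reported, cutoff=0.85):
        """Return a code for a confident match; otherwise None, so the
        case is routed to a human coder."""
        hit = difflib.get_close_matches(reported.lower(), CODEBOOK,
                                        n=1, cutoff=cutoff)
        return CODEBOOK[hit[0]] if hit else None

    print(code_occupation("primry school teachr"))  # tolerates mangled ICR text
    print(code_occupation("worker"))                # too general: human judgment

Raising the cutoff trades coder workload against the risk of wrong automatic codes, much as with the ICR security level discussed earlier.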

As to the coding of Industry, it may be noted that this can be improved by using a register of establishments or enterprises, and their known ISIC or NACE codes. Respondents may find it easier to report the name of their employer than to describe the principal economic activity of the company. This approach obviously requires the existence of a comprehensive national business register.

In conclusion: ICR in census has certainly not become an off-the-shelf technology. It requires careful design and extensive testing of questionnaires. The integration of ICR with associated operations, such as coding, needs ample prior thought and a clear strategy, again to be tested for effectiveness.

3.3 Outsourcing and decentralization

Census data entry, through ICR or otherwise, is a potential candidate for outsourcing. Since it is a one-time high-volume application, there might be contractors that possess equipment and skills allowing them to offer the census office conditions that it could not match in an in-house operation. Meanwhile, it should be noted that outsourcing brings responsibilities of contracting and monitoring that require resources too. Confidentiality concerns multiply where outside contractors dealing with individual data are concerned. Quality assurance, already a major consideration in any event, becomes even more crucial if outside contractors are involved (see, for example, [Wh01]). It would be attractive if the contractor could work within the census premises. In any event, contractor staff should be subject to confidentiality rules at least as severe as the ones imposed on temporary census staff.

It should be noted that managers with an excellent in-house management record may still have difficulty controlling outsourced work, which requires different skills. These include knowledge of the service market, awareness of legal issues, negotiating skills, and more. In a census situation one easily ends up in circumstances where the supplier is in control, since the census organization, even while unhappy with the services provided, cannot afford to walk away.

Sometimes government regulations put barriers in the way of outsourcing tasks that could better be assigned to specialized providers outside the census office. That situation obviously should be changed, but most likely the required reforms need to be implemented at a government level different from the one supervising national statistical services.

Decentralized data capture would allow the census organization to keep matters in its own hands, while obtaining advantages by spreading the work over its regional centers. The problems are somewhat comparable to those of outsourcing, although easier to manage. Much depends on the local situation: the magnitude of the task at hand, conditions of the labor market, efficiency of communication and transport, and so on. Assigning more work outside the capital may also have a social and public relations benefit. General guidelines in this domain are impossible to formulate.

4. GIS, remote sensing and GPS

A more comprehensive discussion of these issues can be found in the paper on “Identifying and resolving problems of census mapping,” also presented at this meeting. Since new mapping technology is an essential part of census innovation, brief remarks are included here.

Mapping technology has made great strides over the past decades. It has moved from an activity depending on field exploration and manual drawing, to one using remote sensing and computer-assisted map management.

While aerial photography from planes was used for census mapping (mostly for dense urban areas) before the era of satellite technology, the latter offers a much more cost-effective solution for remote sensing. Commercially available satellite pictures provide resolutions well beyond those required to identify individual buildings. Availability of such photographs greatly reduces – but certainly does not remove – the need for on-the-ground inspection.

The field work itself benefits from the now common availability of cheap hand-held global positioning systems (GPS), which again depend on satellite technology. Topographical maps and satellite pictures establish the starting platform for census field mapping. Cartographic staff armed with maps, pictures and GPS receivers can now complete and annotate the maps to produce excellent orientation material for enumerators.

Maps are now usually produced, stored, and updated using specialized computer systems and commercial software. The essential elements of satellite photographs or paper maps can be digitized by hand-tracing on digitizing tablets. Once the maps are finished, they can be (re)printed at will. The vector images are stored in computer files without the risk of degradation over time.

It is useful in this context to point to a growing tendency for national statistical agencies to establish basic statistical reporting areas independent of the administrative territorial organization [Ja99a], sometimes in the form of a grid of squares. The reporting areas should be large enough to maintain individual response confidentiality, yet small enough to allow regrouping of these statistical areas into the lowest level of administrative territorial units. The approach removes some of the problems of maintaining time-series in the context of ever-changing administrative borders.
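
The grid idea reduces, in essence, to a deterministic mapping from coordinates to cell identifiers; a sketch follows (the 1 km cell size, the metric projection, and the identifier format are arbitrary illustrative choices):

    def grid_cell_id(easting_m, northing_m, cell_size_m=1000):
        """Map a projected coordinate (in metres) to a stable square-cell ID,
        independent of any administrative boundary."""
        col = int(easting_m // cell_size_m)
        row = int(northing_m // cell_size_m)
        return "%dm_E%d_N%d" % (cell_size_m, col, row)

    # A dwelling's GPS fix, once projected, always lands in the same cell,
    # however often the administrative map is redrawn.
    print(grid_cell_id(155432.7, 463881.2))   # -> '1000m_E155_N463'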

The value of census information is enhanced if it is combined with underlying base maps that permit users to generate thematic maps of their choice. Several census offices now market integrated products – usually on CD-ROM – that provide this capability. Other offices adhere to the opinion that such a service goes beyond the task of national statistical agencies, and limit themselves to providing aggregated census data to commercial publishers. This should not be confused with outsourcing, since responsibility for the final product lies with those publishers. The census office is accountable only for supplying reliable data that respect the requirements of statistical disclosure control.

Many statistical agencies maintain one or more geographic information systems for their own use. At the same time it is widely accepted that the primary role of statisticians is to provide data of the best possible quality to users. In many cases the task of integrating information from various sources into complex GIS systems is best left to others. This is especially true if such GIS systems serve a specialized user community, such as urban planners or environmentalists.

Electronic maps have become indispensable and cost-effective tools for a wide range of operations in censuses and statistics.

5. Data processing and storage

5.1 Census processing software

Many countries, especially those in the developing world, have long relied on public-domain software for their census processing requirements. Such software was built and maintained by non-profit agencies, usually supported by subsidies from national or international donors.

It would appear that, overall, there has recently been less effort in this respect than at the time of previous Census Rounds. This can be explained partially by the growing capabilities of commercially available software. There may also be a case of donor fatigue. Donors tend to prefer to think in terms of projects with a clear beginning and end. Developing and maintaining a software system is a never-ending task, since changing hardware and software environments require on-going support and re-development efforts, which can be considerable.

Due to the relative scarcity of new (re-)development, some public-domain census or survey processing systems are starting to look a little obsolete. They may, for example, be completely or partially DOS-based. Even though that software might be as effective as ever, and perfectly able to do the job, the DOS interface is unfamiliar to a new generation of users. They may also find it difficult to convince their supervisors and peers that it is preferable to work in an apparently dated environment; using modern tools is better for a data processing person's professional repute. A consequence of these developments appears to be increasing use of alternative software, such as commercial statistical systems (SAS, SPSS, ...) and database application generators (MS Access).

Some recent announcements have improved the picture for non-profit software. The United States Bureau of the Census, through its International Programs Center, is now offering additional modules of its CSPro Census and Survey Processing system, which is being developed in co-operation with Macro International and Serpro S.A. [CBWW]. CELADE, the Population Division of the United Nations Economic Commission for Latin America and the Caribbean (ECLAC), continues work on more advanced versions of the statistical database system Redatam [CEWW].

Developing census processing applications in software not specifically intended for that purpose can be described as customizing that software for census purposes. It requires programming skills that are not always readily available. Some use of modern object-oriented programming languages is nearly unavoidable. There is also no particular place where scheduled training in such a specialized subject (developing census applications in general-purpose software) can be obtained. As a result, census organizations have relied on outside contractors, who did not necessarily fully understand the statistical issues involved. In this sense the current situation as regards census processing is more complex than that of the preceding Census Round.

On the other hand, where initially enough basic computer skills were available to the census office, census data processing staff may have received additional exposure to modern general-purpose software. This will have a spin-off into other statistical development work, or into their careers, or (hopefully) both.

The difficulty of customizing general-purpose software for census applications should not be underestimated. It can be considerably more complex than applying specialized census software. Outsourcing the assignment might only compound the problems. Contractors to be entrusted with the duty of developing census processing systems should have a proven record in such work. The census office will still need specific expertise to undertake the task of contracting and supervising the activities.

The broader issue of statistical disclosure control, including cell-suppressing software, required by all census offices to protect the confidentiality of individual responses, will be briefly discussed in Section 7.1 below.

5.2 Data storage

Census data often used to be stored simply as flat files. A principal concern was to make sure that the data and meta-information were properly preserved over time, in order to guarantee that additional computer analysis would remain possible at some later stage, for example on the occasion of the next census. Statistical agencies are now increasingly aware that data from various collections can have much added value if preserved, with associated meta-data, in a common storage structure, sometimes called a “data warehouse”. While this fashionable term may go as quickly as it came, the underlying principle is unchallenged. The relational storage model has been explored for depositing statistical information, but not always to complete satisfaction. More about this in Section 7.3 on Structured archives.

6. Use of the Internet

6.1 The Internet for data collection

While electronic mail has been fairly common since the late eighties, wide access to the contents of the Internet at reasonable transmission speeds was still unusual in the previous Round of Censuses. Problems with the use of paper-based questionnaires had become apparent long before. In many countries response rates on mailed questionnaires are declining, a consequence of respondent fatigue and, perhaps, a diminished sense of civic responsibility. Where enumerators still personally visit dwellings, the chances of finding respondents at home during working hours have become smaller due to modern lifestyles and smaller household sizes.

Census offices have proposed and/or used various measures to remedy these problems. These include more elaborate information campaigns and efforts to mobilize the co-operation of civil society, having enumerators work weekends and evening hours, approaching respondents by telephone, and sampling the initial set of non-respondents (thus, in a sense, giving up on complete coverage). While some successes have been reported, the efforts and costs required to obtain an acceptable response rate are now considerably greater than before.

Thus it is only logical that attention has focused on the Internet as a gateway into an increasing number of homes. Using “push” technology, it would be possible to deliver to each Internet-connected household a uniquely identified electronic questionnaire, possibly pre-filled with basic data obtained from the civil registry. Respondents would correct and complete the information, and then return it by data transmission to the census office, which would receive an electronic record, thus avoiding most of the data entry work.

Electronic data collection from establishments (including enterprises and government/ public sector agencies) has already become fairly common. If households and individuals are approached in the same way, one could use the methods of CASI (Computer Assisted Self-Interviewing) [Fi99, Ke99], which can render valuable assistance to respondents and prevent mistakes.

Unfortunately, there are as yet several problems that hold back electronic data collection from households:

• Incomplete coverage

While the number of households having access to the Internet grows rapidly nearly everywhere, there are only a few countries where the connection rate has surpassed 50%;

• Bias

Internet access is more common among affluent and younger households; wide use of this data collection channel might therefore result in a biased response pattern;

• Unstructured address system

As compared to the postal system or the telephone network, the Internet addressing system is much less regulated, which reflects the origins of the Internet. Subscribers largely invent their own addresses, and may change these at any time. They may have one or several addresses. It would be a major effort to assemble current e-mail addresses of households at any given point in time, and nearly impossible to maintain such a register with any degree of reliability. This essentially precludes the use of push technology for censuses at present.

• Attraction to hackers

There is little doubt that allowing respondents to use the Internet would attract hackers, who would consider it a challenge to be enumerated twice, to use someone else’s identification, or worse. Census offices understandably are not looking forward to facing such challenges.

Fig. 1. Swiss census data collection via the Internet (demo version, partial screen)

Notwithstanding these difficulties, several census offices, including those of Switzerland (Fig. 1), the United States and Singapore, have allowed electronic response during the current Round of Censuses [OF01, Ha00, Pr00, UN01]. This did not involve “push” technology. Rather, respondents had to take the initiative themselves, by downloading census forms or completing them while on-line with the census office. To avoid misuse, it is essential that each household can authenticate its response. This might involve certification with a unique identification code, unknown to others. That code then has to be delivered to the household, possibly by hand. Safe and reliable electronic delivery of authentication codes is again a problem that is difficult to tackle with the current state of technology. Electronic response also requires encryption on the browser side, since unprotected responses could be intercepted.
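
One hypothetical way to issue such codes is sketched below: a random body drawn from an alphabet without look-alike characters, plus a check character so that most mistyped codes are rejected immediately on entry (the format and checksum are illustrative only, and do not reflect any actual census system):

    import secrets

    ALPHABET = "23456789ACDEFGHJKLMNPQRSTUVWXYZ"   # no 0/O, 1/I, or B (look-alikes)

    def check_char(body):
        """Position-weighted checksum character over the code body."""
        total = sum((i + 1) * ALPHABET.index(c) for i, c in enumerate(body))
        return ALPHABET[total % len(ALPHABET)]

    def new_access_code(length=9):
        body = "".join(secrets.choice(ALPHABET) for _ in range(length))
        return body + check_char(body)

    def looks_valid(code):
        return len(code) > 1 and check_char(code[:-1]) == code[-1]

    code = new_access_code()
    print(code, looks_valid(code))   # e.g. 'X7KQ2NDM4R' True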

The United States census limited Internet response to the short form only, and did not undertake major efforts to publicize or recommend this method. In all three cases, demonstrating that the census organization is in tune with modern technology appears to have been a factor in opening up the Internet channel.

Since it is unlikely that the printed questionnaire can be abandoned at short notice, census data collection via the Internet requires seamless integration of the two data streams – three in the case of Singapore, where response via telephone (CATI) was a further alternative.

Several census offices, for example the Office for National Statistics of the United Kingdom, have reported that they decided not to use the Internet for data collection at this time, after having studied the dangers still present in those uncharted waters. Statistics Canada has been conducting an “Internet Test” in two distinct geographical areas for its May 15, 2001, Census.

In conclusion: a number of problems – of a varied nature – have so far prevented the wide use of electronic questionnaires for census purposes. The Internet needs to grow, and methods suitable for census data collection through it have to be developed and tested. Expectations are that the situation surrounding electronic response will have evolved significantly by the next Round of Censuses.

6.2 The Internet for data dissemination

The technology of dissemination of statistical information is undergoing a fundamental shift. The printed publication has certainly not disappeared, and remains important, for example to provide a permanent and continuously accessible record, and for easy browsing. But on-line consultation of statistical sites – with or without payment for the information obtained – is becoming the principal avenue of information dissemination. This takes place via the Internet, since independently managed bulletin boards, reached through point-to-point communication with the information provider, cannot offer comparable user comfort.

The challenge to statistical offices is considerable. Long used to the relative peace of carefully preparing a publication and then waiting for it to come into print, they must now adhere to a strict calendar of electronic releases. Users always want the data sooner, but will complain when the data later have to be revised or – worse – turn out to have contained errors.

Under these conditions, designing a dissemination strategy has not become any simpler. The user community rightfully expects statistical offices to make full use of new media, yet there continues to be substantial demand for paper publications. All this may have to be done in a situation of restricted funding and a shortage of technical skills. Statistical offices must not only formulate a strategy, but also revisit it periodically. Where costs dictate it, the use of dissemination outlets needs to be adjusted on the basis of reports on their use. Cost recovery may help to improve the situation.

Just like printed publications, publication in electronic form can be of varying cognitive quality, perhaps even more so. Furthermore, the rapid technological developments make providing the best possible interface a moving target. Eurostat in its NORIS (Nomenclature on Research in Statistics) [EuWW1] identifies the following examples of research in this area:

• Contributing to Internet-related standardization activities so that statistical requirements can be taken into account;

• Bandwidth-intensive applications: statistical queries, audio- and video-broadcasting;

• Use of intelligent agents (knowbots) for information interchange;

• Improving man-machine interfaces, including the use of virtual reality;

• Application of GIS technologies to improve the visualization of geographically-oriented statistical information.

The capability of a statistical organization’s web site is becoming ever more important. Statistical and census organizations nowadays are not only assessed on the quality and timeliness of their printed information, but also, and perhaps more, on the effectiveness of their web presence.

Appropriate measures should be taken. Web sites must be built and maintained by professionals. There should be, if at all possible, continuous monitoring of user satisfaction and of visitors’ browsing behavior. This serves to ease access to popular items, to spot signs of user confusion, and to support general continuous improvement of the site. If dynamic access to databases is offered, such applications should be reasonably bug-free and have reached sufficient maturity [UN01]. Launching a high-technology service that generates numerous disappointed users benefits no one.

The need to maintain a full range of up-to-date ICT capabilities, including web skills, in an environment where such qualities are in high demand, is a burden to many national statistical agencies. Outsourcing can be a solution, but since information dissemination is a core activity of official statistics, it is not an obvious alternative.

As an aside it may be mentioned here that the Internet offers excellent possibilities to disseminate and retrieve international standards and guidelines for statistical work. An example is the classifications server RAMON developed by Eurostat [EuWW2].

7. Data dissemination – other issues

7.1 Statistical disclosure control

As the mass of readily accessible statistical information increases, there is an urgent need to improve the protection of individual information provided by persons or establishments, using techniques known as statistical disclosure control. The odds here could be shifting in an unfavorable direction: statisticians need to provide more information faster, while ill-intentioned users attempting to filter out sensitive information have access to ever more powerful analytical computer tools, and time on their side. It has become impractical to visually inspect each table or data cube (see below) for potential risks, but automatic screening tools are coming to the rescue [Wi01, Gi99]. These will suppress, combine, or otherwise obscure potentially risky cell values.
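
The most elementary such screening rule, primary suppression of small frequency counts, can be sketched as follows (the threshold of 3 is illustrative; tools such as those cited also perform secondary suppression, so that hidden cells cannot be recomputed from marginal totals):

    def suppress_small_cells(table, threshold=3, marker="*"):
        """Blank out non-zero counts below the threshold, so that rare and
        potentially identifying combinations are not published."""
        return [[marker if 0 < cell < threshold else cell for cell in row]
                for row in table]

    # Persons by district (rows) and occupation group (columns).
    print(suppress_small_cells([[125, 2, 47],
                                [  0, 89, 1]]))
    # -> [[125, '*', 47], [0, 89, '*']]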

7.2 High-capacity physical media

Information dissemination on non-rewritable high-capacity media also remains an important delivery channel, especially for massive data that are not highly time-sensitive, such as most census information. Censuses nowadays routinely result in the production of many CDs, and the first DVD products of much higher capacity have appeared [CB01]. Data structures on CD-ROM and those underlying a web site can have much in common, including browsing through hyperlinks. Parallel development of the two applications is an efficient way to benefit from this.

7.3 Structured archives – the statistical data warehouse

As already mentioned in Section 5.2, storage of census data in a “warehouse” structure favors its use in conjunction with other statistical information kept there. This is, strictly speaking, not a census issue, since it addresses the broader subject of statistical information management. A warehouse might consist of a number of data cubes: n-dimensional spaces in which one dimension consists of observations and the others are selection dimensions. In a simple example dealing with a census cube, the observations could be the total numbers of males and females, and the selection dimensions age group, place of residence, ethnicity, occupation, and so on. Or see the diagram in Fig. 2 for an example in four dimensions (including three 3-dimensional sub-structures) from the area of business statistics.

Fig. 2. Data cube in four dimensions [Ba96]

Cubes require the existence of a superstructure that allows them to be approached via hierarchical menus (“drill-down”) and logical combinations of keywords. Meta-data need to be available too, preferably stored without redundancies. Storage formats other than the data cube should also be accommodated by the data warehouse.
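
In outline, a cube is an n-dimensional table of counts keyed by the selection dimensions, and drill-down amounts to summing over every dimension the user has not fixed. A toy sketch (categories and figures invented):

    # Toy census cube: (sex, age group, region) -> population count.
    cube = {
        ("male",   "0-14",  "north"): 1200, ("female", "0-14",  "north"): 1150,
        ("male",   "15-64", "north"): 3400, ("female", "15-64", "north"): 3550,
        ("male",   "0-14",  "south"):  900, ("female", "15-64", "south"): 2700,
    }

    def drill(cube, sex=None, age_group=None, region=None):
        """Aggregate over every dimension left as None."""
        wanted = (sex, age_group, region)
        return sum(n for key, n in cube.items()
                   if all(w is None or w == k for w, k in zip(wanted, key)))

    print(drill(cube, region="north"))   # everyone in the north: 9300
    print(drill(cube, sex="female"))     # all females: 7400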

While practical applications exist, and can be accessed through the web sites of several national statistical agencies, the subject remains a work in progress. Data warehouses are by no means restricted to statistics, and the topic is much broader than can be described here. For more information, see for example [Ka00], or explore the Internet.

Fig. 3. Organization Chart of Divisions for Business and Social Statistics, Statistics Netherlands

Taking this concept a step further, one could require that, after completion of a census or survey, the information gathered be stored in the data warehouse first, and that periodic or one-off publications be generated only by retrieving data from this central storage system.

This concept is illustrated by the example in Fig. 3 above [Ke00]. BaseLine is the final product of Registers and Surveys. It holds all data as supplied by primary and secondary data sources. MicroBase contains the data as they result from editing, imputation, translation and micro-integration. The output-aggregate database StatBase holds the results after estimation for (sub-)populations of statistical units. StatBase claims to contain all publishable data produced by Statistics Netherlands. The publication-data warehouse StatLine can be seen as a set of views on StatBase. It presents the total output of the Bureau as a structured set of multi-dimensional tables. StatLine is disseminated both on CD-ROM and on the Internet.

Several national statistical agencies have done important work on these issues, such as Statistics Canada (CANSIM II), Statistics Sweden (PC-Axis) and Statistics Netherlands. A trial version of the Stat-series of programs (the name of the full package is StatSuite) is downloadable [SNWW]; PC-Axis retrieval software can be obtained from the Web site of Statistics Sweden [SSWW]. Other developers may also be willing to provide test versions of their software if requested.

As regards the applicability of data cubes, it would seem that the principal problem is not so much their storage and retrieval, but logical design of these information containers. Formulating a comprehensive set of cubes that fulfill the various requirements of (i) easily accommodating the results of statistical collections, (ii) satisfying the requirements of a wide variety of users, and (iii) fully respecting confidentiality concerns, is not a simple assignment.

Whatever new or revised data dissemination product is being envisaged, the importance of extensive prototyping and of launching “beta” versions – among real and critically minded users – cannot be overemphasized. This point was also made, convincingly, by the recent ESCAP Workshop on Population Data Analysis, Storage and Dissemination Technologies. The report of this Workshop and several of its other papers constitute highly recommended reading [UN01].

8. How to choose appropriate technology

At this point it would have been useful to provide clear guidelines to census planners about how to make an informed choice of technology, and about approaches such as outsourcing and its associated risks. Unfortunately, this is impossible, as conditions and considerations vary widely not only from country to country, but also - and with increasing speed - over time.

There is not one preferable set of technologies for census operations. The best choice depends on the magnitude of the project, the availability of local skills, the funding situation, existing prior experience, available time for preparation, and many other factors. The current Round of Censuses shows a surprisingly wide spectrum of methods and techniques being used.

Informed choices are never possible without the information being available to the decision makers. Census planners need to acquaint themselves with the state of the art, both at the national level and internationally. They should preferably travel to comparable countries that have recently used methods and technology that may be of interest. The superiors of these planners need to recognize this need for exploration, and allocate the resources for it.

In deciding the parameters for a new census, one might want to look first at the preceding census. What worked well and what could use improvement? If an approach was satisfactory last time, the arguments for replacing it with something else need to be twice as strong.

Every decision has a financial angle. If census costs can be reduced significantly while maintaining or even improving quality, that certainly should be worthy of serious consideration.

Outsourcing is by no means the panacea that some make it out to be. Stories of success and failure are equally present. Here, as elsewhere, there is no substitute for solid fact finding, careful negotiating, minimizing the chances of misunderstanding, and a continuous quality assurance programme [Wh01].

The final and most important consideration should be: what effect do the available alternatives have on the quality of service provided to information users? Statistical offices and census organizations live by the grace of the service they render to others. They need to strive incessantly to provide better information, in terms of timeliness, data quality, ease of access, completeness, and pertinence. Any potential improvement there merits review.

9. More

There is little doubt that the ever-evolving technological environment in future will have an even more profound effect on census taking methods – perhaps moderated by legal requirements and confidentiality concerns.

Already it has become possible to uniquely identify individuals through certain physiological characteristics, such as fingerprint or iris patterns, facial identifiers, or vocal frequency sequences (voice prints) [Ja99b]. This technology has a bright future, with applications too numerous to list here: ATM transactions, building access control, payment authorizations, and so on. It can remove the burden placed on people by the requirement to constantly carry identification and credit cards, and to remember perhaps a series of personal identification codes.

Biometric identification in combination with access to a database (perhaps wireless) could remove the need for statisticians to repeatedly ask respondents the same questions about unchanging characteristics (date of birth, sex, ethnic origin, place of residence at the prior census, ...). In a broader sense, it would make it technically easier to establish and maintain electronic civil registers that would be more complete and current than those in existence today.

As an intermediate step one could think of a personal multi-purpose chipcard (“smartcard”), from which information could be copied without manual transcription. Applications at this level are already widespread in banking, library management, medical services, and more. In a rather far-reaching concept, such data could also be stored in personal “digital data vaults” on the Internet. Once authorized by the owner, information users such as census organizations would be able to retrieve from there the data items they require.

The use of biometrics or personal chipcards in civil registration or census so far has been experimental at most, and digital data vaults are just an idea that has recently been launched. But it is easy to foresee that these and similar techniques, with all their various implications, will become the subject of increasing debate in the not too distant future. It is part of what has been termed "pervasive computing," an ever growing presence of computer power and associated sensors and controls in daily life. Statistical organizations need to involve themselves in this debate, to make sure that new developments and standards take their requirements into consideration.

A few countries have dropped the door-to-door census in favor of what is called an “administrative” or “virtual” census [La01]. This may involve a comprehensive inspection and merging of various registers, to arrive at the national universe of dwellings and persons. Again, it is an advantage if the statistical office is a partner in the definition and maintenance of the principal registers. In other countries the “short form”, which contains the questions to be asked of everyone, has been reduced to a bare minimum.

In both cases – administrative census and minimal short form – additional information is usually gathered through sample surveys. These methods share methodological ground. Good statistical practice prescribes the use of sampling methods wherever the underlying universe is sufficiently known. In future Census Rounds sampling techniques will become increasingly important, and with them the need for statisticians to explain their ways to the world.

10. Conclusions

New technologies make their way into census work, but not always as quickly and broadly as might have been expected. Census planners need to be conservative, since they know that their solutions must be right the first time. Nevertheless, one sees innovation turn into standard practice. This includes digital mapping, ICR, and electronic publishing. The Internet has become an essential medium for information dissemination, and will grow in importance for data collection.

The conditions under which censuses are conducted differ greatly between countries. Even for most sub-tasks there is no single best technological approach. Technical awareness, a sense of the realistic, a methodical approach, and plenty of preparation time, are the principal requirements for census planners.

Census technology changes much more rapidly than the underlying statistical methods and principles. New technology should never endanger – and should if possible reinforce – the continuity of existing reporting systems.

Outsourcing raises many problems, including confidentiality concerns, but it can deliver economies and resolve bottlenecks. Again the local situation – including the management ability of the census office – determines whether it is a valid alternative. The solution is more evident for one-off special operations (e.g. census data entry) than for ongoing tasks, such as Web site management. Where outsourcing could offer advantages, but bureaucratic obstacles stand in the way, these should be removed.

The current Census Round shows substantial technological evolution from the preceding cycle; in the next Round the difference will only be greater.

11. Discussion

Here are some slightly provocative questions that symposium participants may wish to discuss:

• As compared to data capture by keyboard, ICR has advantages as well as drawbacks. Given technical problems experienced by some countries, is the move towards ICR justified by experience gained so far?

• There is no doubt that ICR equipment interprets characters less accurately than human operators. Can we call that progress?

• Do we still need technical assistance projects producing census software? Or can commercial software systems now fully cover census requirements?

• Why, with CD-ROMs and the Internet, continue to print costly census reports?

• Do “data cubes” present a valuable concept? Or is this just a solution looking for a problem? Are there more suitable storage formats for statistics?

• Will the Census of Population and Housing as we know it disappear, because of ever-advancing technology?

References

[Ba96] Basset P., and A. Stoyka: Statistics Canada’s aggregate output database – CANSIM II. Proceedings of the Conference on Output Databases, Voorburg, the Netherlands (1996)

[Bl97] Blum, Olivia: Editing and Coding Module, in New Census Technologies: The Israeli Experience. Proceedings of the Euro-Med Workshop (March 97)

[CB01] US Bureau of the Census, Public Information Office: Census Bureau breaks new Ground with Release of DVD Products. News release dated 6 February 2001.

[CBWW] CSPro Web site at

[CEWW] CELADE Web site at

[De86] Deming, W. Edwards: Out of the Crisis. Center for Advanced Engineering Study, MIT, Boston, USA (1986)

[De94] Dekker, Arij: Computer Methods in Population Census Data Processing. International Statistical Review, 62, 1, pp 55-70 (1994)

[De97] Dekker, Arij: Data Processing for Demographic Censuses and Surveys, with Special Emphasis on Methods Applicable to Developing Country Environments, UNFPA/NIDI, The Hague, ISBN 90-70990-67-9 (1997)

[Do99] Dopita, Patricia: Population Census Evaluation, 1996 Census Data Quality: Occupation, Australian Bureau of Statistics (1999)

[EuWW1] VIROS Virtual Institute for Research in Official Statistics Web site at

[EuWW2] RAMON Classifications Web site at

[Fi99] Figueiredo, José and Ana Lucas (National Institute of Statistics of Portugal): Potentials and Pitfalls of INE-P IS/IT Strategy on the Past Ten Years. Proceedings of the strategic reflection colloquium on IT issues for statistics, Eurostat, Luxemburg, September 1999

[FP01] UNFPA: Report of Joint Interagency Coordinating Committee on Censuses for sub-Saharan Africa and PARIS 21 Census Task Force Meetings. Eurostat, Luxemburg, October 2000

[Gi99] Giessing, Sarah: Transferable software for automated secondary cell suppression. Seminar on the Exchange of technology and know-how (ETK), sponsored by Eurostat, Prague (1999)

[Ha00] Haug, Werner and Marco Buscher: E-census, the Swiss Census 2000 on the Internet. INSEE/Eurostat Workshop “Census beyond 2001”, Paris, November 20-21, 2000.

[Ja99a] Jacob, Michel and Jean-François Royer: Le recensement de la population de 1999, in Les actualités du Conseil national de l’information statistique 30 (January 1999).

[Ja99b] Jain, Anil (editor), et al.: Biometrics: Personal Identification in Networked Society. Kluwer International Series in Engineering and Computer Science, Volume 479. Kluwer Academic Publishers, Dordrecht, the Netherlands, ISBN 0-7923-8345-1 (1999)

[Ka00] Kambayashi, Yahiko, Mukesh Mohania and A. Min Tjoa (Eds.): Data Warehousing and Knowledge Discovery, Second International Conference, DaWaK 2000, London, UK, September 4-6, 2000, ISBN 3-540-67980-4 (2000)

[Ke99] Keller, Wouter: Preparing for a New Era in Statistical Processing: How new technologies and methodologies will effect statistical processes and their organisation. Proceedings of the strategic reflection colloquium on IT issues for statistics, Eurostat, Luxemburg, September 1999

[Ke00] Keller, Wouter and Ad Willeboordse: Statistical Processing in the Internet Era: the Dutch View. Conference on Network of Statistics for Better European Compliance and Quality of Operation, Radenci, Slovenia, 13-15 November 2000 (this paper can be retrieved from the Web site of the Statistical Office of Slovenia at )

[La01] Laan, Paul van der, and Peter Everaers: The Dutch Virtual Census. Meeting 66, ISI 53rd Session, Seoul, 2001 (to be published).

[Me97] Meyer, Eric and Pascal Rivière: SICORE, un outil et une méthode pour le chiffrement automatique à l’INSEE. International Blaise Users Group, Paris (1997)

[OF01] Office Fédéral de la Statistique: Utilisation de e-census. On the web site of the Swiss Federal Statistical Office at

[PaWW] Web site of the Paris21 Partnership in Statistics for Development in the 21st Century, at

[Pr00] Prewitt, Kenneth: Prepared Statement before the Subcommittee on the Census, Committee on Government Reform, U.S. House of Representatives (March 8, 2000)

[SNWW] Principal Web site of Statistics Netherlands at

[SS01] Q2001 – International Conference on Quality in Official Statistics, organized by Statistics Sweden and Eurostat, May 14-15, 2001, Stockholm, Sweden. Web site at

[SSWW] Web site of Statistics Sweden at

[UN01] UN Economic and Social Commission for Asia and the Pacific: Report on the Workshop on Population Data Analysis, Storage and Dissemination Technologies, Bangkok, 27-30 March 2001. This report, and other workshop papers, can be retrieved from the ESCAP web site at

[UN98] UN Department of Economic and Social Affairs, Statistics Division: Principles and Recommendations for Population and Housing Censuses, Revision 1. Statistical Papers Series M No. 67/Rev. 1 (1998)

[UNWW] Good practices Web site at

[Wh01] Whitford, David and Jennifer Reichert: Quality Assurance Challenges in the United States’ Census 2000. Q2001 - International Conference on Quality in Official Statistics, Stockholm, 14-15 May 2001.

[Wi01] Willenborg, L. and T. de Waal: Elements of Statistical Disclosure Control. Springer Verlag, Berlin/Hamburg, ISBN 0-387-95121-0 (2001)

Glossary

Artificial intelligence, neural networks, fuzzy logic: various forms of innovative software techniques that often depend on non-deterministic (heuristic) methods

ATM transaction: a transaction through an Automated Teller Machine, or money dispenser

Automatic coding: the conversion, by unassisted computer, of verbal texts into applicable codes

Biometric identification: identification of individuals through one or more of their physical characteristics

Bulletin board: digital information service, often operated independently from the Internet

CASI (Computer Assisted Self-Interviewing): the technique whereby respondents independently complete electronic questionnaires, assisted only by specially-designed computer programs

CATI (Computer Assisted Telephone Interviewing): Respondents answer questions by telephone, interviewers key the responses directly into computers

Computer-assisted coding: coding activity whereby human coders decide and computer systems provide assistance

Data cube: multi-dimensional structure for storing statistical information

Data warehouse: the assembled data capital of enterprises or institutions, stored and managed in a way that favors access and analysis

Digital data vault: a space on the Internet where citizens can safely store, and eventually provide access to, personal data

DVD (Digital Versatile Disk, originally Digital Video Disk): the more capacious successor of the CD-ROM

GIS (Geographic Information System): an information system designed to capture, store, update, manipulate, analyze and display all forms of geographically referenced information

GPS (Global Positioning System): by now a common instrument that shows the geographic location of its carrier

ICR (Intelligent Character Recognition): the art of interpreting written or printed characters through image scanning and computer analysis. Used to be called Optical Character Recognition when the role of recognition engines was less crucial.

ICT: Information and Communication Technology

ISCO: International Standard Classification of Occupations

ISIC: International Standard Industrial Classification of All Economic Activities

Knowbot: (from Knowledge Robot) intelligent agent gathering information on the Internet; more specific than search engines

Meta-information: ancillary information clarifying statistical figures (definitions, standards, units, collection method, …)

NACE: Nomenclature Générale des Activités Economiques – statistical classification of economic activities used by the European Union

Object-oriented languages: languages for computer programming that attach code to objects and classes. Different from more monolithic procedural languages.

Outsourcing: delegating (part of) activities to an outside contractor

Pervasive computing: the omnipresence of computer power and associated sensors and controls in daily life

Point-to-point communication: (as used in the present text) electronic data communication by direct connection, not using the Internet

Push technology: using the Internet to deliver specific but unrequested information to selected e-mail addresses

Quality Assurance: a planned and systematic pattern of all the actions necessary to provide adequate confidence that a product will conform to established requirements.

Relational storage model: currently the most popular data model for general-purpose database systems; theoretical foundation formulated by E.F. Codd

Remote sensing: monitoring from a distance, as from aeroplanes or earth satellites

Satellite telephony: telephone communication through geo-stationary satellites, without land-based relay stations

Smartcard: electronic card carrying a computer chip, and providing (much) more than memory functionality

Statistical disclosure control: the complex of measures preventing unauthorized access to sensitive statistical information

Voice print: A stored digital model of an individual’s voice, used for identification purposes

-----------------------

[1] For a definition of this and many other terms used in this paper, refer to the Glossary.

[2] For example, recognition rates of handwritten characters might drop below 90%. This value should always be considered in connection with the security level, a pre-set parameter that decides how “confident” the recognition engine(s) must be before accepting a character as representing a particular symbol. Among the accepted characters there are usually mistakes (the “errors”). The rejected characters, on the other hand, include “confirms”: characters that would have been correctly recognized at a lower security level. The remaining rejects, the “corrects”, always require operator action.
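To make these terms concrete, here is a minimal Python sketch (purely illustrative; the names, sample data, and thresholds are invented, not taken from any actual recognition product) of how a single security level splits a batch of recognition results into accepts and rejects:

from dataclasses import dataclass

@dataclass
class Recognition:
    truth: str         # the character actually written on the form
    guess: str         # the engine's best interpretation
    confidence: float  # the engine's confidence in that guess (0..1)

def partition(results, security_level):
    # Accept a character only when the engine is "confident" enough.
    accepted = [r for r in results if r.confidence >= security_level]
    rejected = [r for r in results if r.confidence < security_level]
    errors   = [r for r in accepted if r.guess != r.truth]  # accepted, but wrong
    confirms = [r for r in rejected if r.guess == r.truth]  # rejected, yet right
    corrects = [r for r in rejected if r.guess != r.truth]  # must be keyed by hand
    return accepted, errors, confirms, corrects

# Invented sample: four handwritten digits with the engine's reading of each.
sample = [Recognition("7", "7", 0.98), Recognition("1", "7", 0.91),
          Recognition("4", "4", 0.62), Recognition("9", "0", 0.40)]

for level in (0.5, 0.95):
    accepted, errors, confirms, corrects = partition(sample, level)
    print(f"security level {level}: {len(accepted)} accepted "
          f"({len(errors)} errors), {len(confirms)} confirms, {len(corrects)} corrects")

Raising the security level weeds out errors but swells the reject stream that operators must verify or re-key; tuning that trade-off is precisely the purpose of the parameter.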

[3] Automatic coding can be seen as a form of translation, and uses methods similar to those applied in the popular but even more difficult research area of machine translation of natural languages.
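By way of illustration only – production coders such as SICORE [Me97] rely on far more elaborate matching – the core of an automatic coder can be pictured as a normalized dictionary look-up with a reject stream for computer-assisted follow-up. The mini-dictionary in this Python sketch is invented (the codes are loosely modeled on ISCO-style occupation codes) and all names are hypothetical:

def normalize(text):
    # Reduce trivial spelling variation before look-up; real systems go much
    # further (stemming, synonym tables, weighted partial matching, ...).
    return " ".join(text.lower().split())

CODING_INDEX = {                       # verbal text -> classification code
    "primary school teacher": "2331",
    "taxi driver": "8322",
    "subsistence farmer": "6210",
}

def code_batch(responses):
    coded, rejects = [], []
    for text in responses:
        code = CODING_INDEX.get(normalize(text))
        if code is not None:
            coded.append((text, code))  # coded by unassisted computer
        else:
            rejects.append(text)        # left for computer-assisted coding
    return coded, rejects

coded, rejects = code_batch(["Taxi  Driver", "fisherman"])
print(coded)    # [('Taxi  Driver', '8322')]
print(rejects)  # ['fisherman'] -> a human coder decides, the system assists

The reject stream is where the kinship with machine translation shows: resolving it automatically requires the same handling of ambiguity and context that makes natural-language translation hard.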

[4] A particularly fine square grid of 100 by 100 meters has been used since 1968 (!) by the Federal Statistical Office of Switzerland, principally for environmental and agricultural statistics.

-----------------------


An Invited Paper Meeting on this and related issues is to be conducted at the 53rd ISI Session, which takes place later this month in Seoul: Meeting 16 on “Internet and Innovative Data Dissemination,” called by Heli Jeskanen-Sundström of Finland.

An Invited Paper Meeting on this and related issues is also to be conducted at the 53rd ISI Session in Seoul: Meeting 17 on “Internet and Innovative Data Collection,” called by Warren Mitofsky of the USA.

• Old technologies never disappear entirely.

• Developing a new technology always takes longer than expected.

• Once it finally arrives, it takes off faster than was thought.
