Big Data Strategy Document - European Commission



EUROPEAN COMMISSION
EUROSTAT
Deputy Director-General
Task Force Big Data

Big Data Strategy Document
A common denominator for the ESS strategy related to big data

1. Scope

This document presents a long-term strategy for the incorporation of big data into the production of official statistics in the European Statistical System (ESS). Its target audience includes all stakeholders of the ESS, including keepers of data sources, users and knowledge partners.

The aim of the long-term big data strategy is threefold. First, a strategy creates focus and implies choices. It is important to have a common understanding of the way ahead among all ESS partners. Second, a strategy implies concrete implementation actions. A logical follow-up of the strategy is therefore to define such actions, in particular those which need a long preparation or execution time. Third, a strategy can be clearly communicated to stakeholders. This is extremely important because a big data approach without active involvement of stakeholders is a no-go.

By long-term we mean the period beyond 2020. This period derives from the ESS big data action plan and road map (BDAR) that was adopted by the ESS Committee (ESSC) in September 2014. The period until 2020, which represents the short and medium terms in the BDAR sense, is covered by actions that belong to the BIGD project. The BIGD project is part of the ESS Vision 2020 implementation portfolio. Key actions in the short and medium term include first and foremost a number of extensive pilot experiments to build hands-on experience in exploring big data sources, and in addition some stocktaking activities to better understand, for example, the legal, ethical and educational aspects of big data. In addition to the short and medium-term actions, a long-term strategy was announced in the BDAR. The current document aims at providing such a strategy.

Obviously, we need to define what we mean by big data in the context of this document.
It is difficult to give a precise and concise definition, though. Big data is a moving target where new sources emerge all the time. An often cited definition was given by Gartner in 2001, focusing on 3 V’s: Volume, Velocity and Variety. For the purpose of official statistics, in 2013 the UN/ECE project on big data came up with a definition that highlights new IT issues compared to more traditional data sources. The UN/ECE project also proposed a classification of big data sources that starts from the origin of data.

An important characteristic of big data is that the keepers of data sources are usually private companies, which poses a number of major challenges in addition to the IT-related issues mentioned in the definition above.

We explicitly exclude two important types of statistical data sources from the strategy:

- Questionnaire-based statistical data, collected either by survey samples or censuses.
- Administrative data collected by or on behalf of government institutions. The size of administrative sources can become quite large, but their well-behaved nature and predictability distinguish them from “true” big data sources according to the definitions given above.

Official statistics in the ESS are those statistics that are produced by ESS partners under a specific formal agreement, in most cases an EU regulation or other legal instrument. Thus, the strategy excludes specific national or regional statistics that are produced under a different mandate (although in many cases it is difficult to draw a clear borderline). In particular, the strategy does not cover statistics produced by third parties (unless commissioned officially) aiming at describing similar or the same phenomena as official statistics; think e.g. of inflation figures based on web scraping of prices (e.g.
the ‘Billion Prices Project’).

In this document we take a ‘ceteris paribus’ approach, which means that we assume that the basic ways of functioning of the ESS and the institutional arrangements for official statistics will to a large extent remain the same after 2020. In cases where we propose to modify these settings for the sake of big data, we clearly indicate that.

Long-term objectives of the Big Data Action Plan and Roadmap:

In the long term, big data sources should be integrated into the production of official statistics across the ESS. We expect that big data sources will be available for a wide range of statistical products. The respective legislation should enable the use of big data sources for official statistics, and we have to work towards a situation where the general public endorses the use of big data sources for statistical purposes.

- The European and national legislation is adapted in such a way that it is compatible with the ethical use of big data in official statistics.
- Big data sources are available to the ESS in such a way that business continuity is guaranteed.
- Big data sources are integrated in the official statistics production across the ESS.
- A large pool of statistics graduates with data science skills is available across the ESS.
- Methods, tools, IT infrastructures and quality frameworks are reviewed and adjusted to new requirements related to big data sources and official statistics.

2. Background

2.1 Data Revolution

"As our world is changing, we have to change with it." (ESS Vision 2020)

We live in an increasingly connected and digitalized world, where the digital transformation and the data revolution touch almost all the topics that official statistics usually deal with. New data sources and general developments will raise new tasks and issues and lead to a reassessment and potential extension of the role of national statistical institutes (NSIs).
On the one hand, current societal and IT developments create new user expectations and new objects and phenomena of interest on which official statistics cannot offer data so far. On the other hand, big data as a part of these developments offers the opportunity to find and set up measures for the aspects that are not covered by official statistics so far.

To reinforce the relevance of official statistics, catching up with current trends of the digital transformation is of special importance, in order:

- to be able to map these trends in official statistics, since they reflect the evolution of our society and thus should be a part of official statistics, too;
- to face the competitive pressure arising from new data producers, which may threaten the role of national statistical institutes as the most relevant providers of high-quality and reliable statistics, especially where official statistical institutions lack experience with the analysis of the new data.

2.2 Recent strategic decisions related to big data

The production of official statistics and the cooperation within the European Statistical System (ESS) are based on strict principles of quality defined in the ESS Code of Practice and on a common understanding of the future cooperation manifested in the joint ESS Vision 2020. By adopting the Scheveningen Memorandum on Big Data and Official Statistics in 2013, the ESS agreed to adapt to major developments in society by exploiting the potential of big data sources and reinforcing collaboration on big data projects at European and global level. The ESS Vision 2020 emphasises the importance of establishing partnerships with data owners, investing in new IT tools and methodological development, and considering organisational challenges related to the use of big data.
Subsequently, the ESS Big Data Roadmap and Action Plan 1.0 were proposed, which are also an integral part of the ESS Vision implementation portfolio.

2.3 Framework for production of official statistics

2.3.1 Legal basis and Code of Practice

The way official statistics are produced is supposed to be neutral, objective and scientifically independent, which is usually guaranteed by law. Each statistic rests on a legal basis that ensures high-quality statistical data for the needs of heterogeneous user groups. Because each statistic requires a legal basis, a strict prerequisite for the analysis of a new big data source is a legal mandate for collecting this type of data. In addition, confidentiality of personal data has to be guaranteed. A further legal aspect to be considered is the potential infringement of the property rights of the data owner when collecting and analysing data from easily accessible big data sources.

When considering the use of emerging data sources, the following aspects of the ESS Code of Practice are particularly affected:

- Commitment to quality
- Commitment to scientific approach
- Non-excessive burden on respondents and cost effectiveness
- Commitment to timeliness and punctuality

2.3.2 Statistical Production Process

Within the legal framework, different data sources are used for collecting data. Using data generated by big data sources might have an impact on the structure of the production process of official statistics. Depending on the big data source and the degree of homogeneity and structure of the data, either only a slight adaptation or a completely new production process with additional steps might be necessary. In many cases it is the analysis of patterns in a data set from a big data source (“data mining”) that reveals interesting insights about the underlying data set.
In contrast, in conventional statistical production the object of a statistic is usually determined first, before an adequate method for collecting the data is chosen.

3. Stakeholders

In addition to current stakeholders, there are a number of groups to be involved or considered for a long-term strategy related to big data and official statistics. These groups might not be completely new in the current environment, but they might change role, e.g. from respondent to data provider, or they might gain in importance, e.g. businesses offering statistical services.

The organisations include:

- Statistical offices and international statistical organisations
- Government and other public organisations (authorities)
- Data protection authorities
- Regulators in different sectors, e.g. telecommunication
- Academia, including research institutes and universities
- Private businesses
- Non-profit organisations active in digital services (open software community)
- Media
- General public / citizens

4. Skills and experience for working with big data

4.1 Data scientists and iStatisticians

With the continuing increase in new digital data sources, a further developed set of skills and expertise is necessary to produce high-quality official statistics.

Data scientists with skills and experience in statistics, analytics and IT will become much more important for working with often unstructured digital information, but these skills alone will not be enough to integrate new digital data sources into official statistics.

The Committee on Organizational Framework and Evaluation of the High-Level Group for the Modernization of Official Statistics developed the following list of big data team level competencies:

- Team work
- Interpersonal and communication skills
- Delivery of results
- Innovative and contextual awareness
- Specialist knowledge and expertise
- Statistical / IT skills
- Data analytical / visualization skills

The future ‘iStatistician’ will be part of a team.
The future large pool of ‘iStatisticians’ should be drawn from a wider academic field than statistics and from beyond the NSIs. Inclusion of new digital data sources into official statistics requires more than data science skills.

4.2 Working fields

First of all, the ESS should further develop its strong collaboration with universities for educating the next generation of data producers as well as data users. The European Master of Official Statistics (EMOS) is a step in the right direction. For the continuing training of NSI staff, the European Statistical Training Programme (ESTP) is an important tool. In the long run, internal training inside the NSIs will increasingly become an essential part of the educational process for building a large pool of staff with big data team competences. This includes the education of colleagues without an academic background.

4.3 Education concept for statisticians working in official statistics

Cooperation with universities, the ESTP and internal training are the instruments for educating statisticians in fast-changing times. With the EMOS there exists, for the first time, a well-scripted curriculum with up-to-date knowledge about official statistics. Its learning outcomes currently describe what newcomers should know about official statistics when they start their career at an NSI. This kind of concept is still missing for the ESTP and in many cases also for the internal training inside the NSIs. Often the different courses exist independently alongside one another. A designed curriculum to link the courses is missing.

Based on the results of current contractual activities, a comprehensive education concept should be worked out for the ESS for the time period after 2020. The EMOS learning outcomes could provide the frame for the content and be the starting point for this ESS education concept. The currently isolated programmes (EMOS, ESTP and internal training) should follow a common training concept.
The development of a set of ESS degrees in Official Statistics for academic and non-academic staff inside the ESS could start the process of bridging the currently isolated programmes.

The ESS should develop a MOOC (massive open online course) platform where webinars, slides and further training materials are accessible in one place for universities as well as NSIs, Eurostat and other interested institutions such as central banks. Cooperation with commercial consultants could also be considered at this stage.

5. Analysis of opportunities and risks

This analysis updates the UNECE SWOT for the official statistics community, and groups the opportunities and risks in three distinct areas:

- efficiency and resources,
- quality and business continuity,
- privacy, disclosure and the image of official statistics.

5.1 Opportunities and Benefits

5.1.1 Efficiency and resources

Big data can increase efficiency if statistical indicators can be produced at lower cost by reducing sample sizes or by (partially) replacing current data collections.

Reduction of the burden on respondents (businesses and individuals) is an additional potential gain. Big data techniques, tools and methods may also improve statistical processes when applied to traditional sources.

Use of big data reinforces the shift from a stovepipe to a multi-source and multi-output production structure for official statistics. Necessary resources, especially those requiring specific analytical skills, could be pooled in order to streamline statistical production.

5.1.2 Quality and business continuity

The use of big data sources can improve several quality dimensions of statistics, such as timeliness, relevance, detail and accuracy. Big data can increase the portfolio of products from already available data, e.g.
through application of machine learning to perform data mining and pattern recognition, statistical analysis, prediction, data visualization and visual analytics.

5.1.3 Privacy, disclosure and public image

Ensuring privacy and confidentiality may improve the reputation of statistical offices among stakeholders and the public.

5.2 Risks and Challenges

5.2.1 Efficiency and resources

Access to data for production is a frequently encountered risk.

Use of big data requires new IT resources and new skills (methodological, IT and metadata analysis), and there is a risk that experts are not available or are lost to more attractive conditions in other domains. The impact may be even higher if statistical offices have trained staff at their own cost.

The initial investment in the use of big data can be high (training and acquiring external resources) and may draw resources away from established segments at a stage where there is no guarantee of an effective gain from the reallocation of IT and human resources.

5.2.2 Quality and business continuity

The use of external and uncontrolled sources may be a risk for data quality, which is a key element of official statistics. This risk may be linked with the risk of losing reliability and the confidence of users and the public.

The data-generating process is often not fully disclosed to NSIs, and the lack of metadata is a major cause of loss of quality (or of the ability to assess quality), again with an impact on the image of NSIs.

Data from third parties, for example social network data or voluntarily contributed data, bear the risk of being manipulated, either by the data provider or by users of the system.

Changes in the commercial data sources, e.g.
due to technical or non-technical changes, or loss of access to data, constitute a risk for the sustainability of the source.

The large number of possible products may also lead to production being driven too much by technology rather than by user demands.

5.2.3 Security, Disclosure and Trust

Trust can be damaged or lost in case of security breaches or disclosure of confidential data that compromises the privacy of individuals or businesses. Likelihood and impact may increase in the age of big data, due to an increasing number of data sources, new flows of data between providers and NSIs, possible storage of data in the cloud, more possible links between data, greater attractiveness to intruders, or increasing pressure to disclose data.

Data security mechanisms and anonymisation techniques have to follow scientific progress and technological developments. Transparency of methods and procedures as well as independence of statistical offices are key elements for building and preserving trust.

6. New Roles for Statistical Offices / Official Statistics

The following possible new or extended roles have been identified:

- From data providers to data interpreters;
- Providing analytical services;
- More flash estimates and nowcasting, develop forecasting capabilities;
- Provision of short term policy support;
- Develop role as information broker;
- Certification of processes to derive statistics from big data.

Possible strategic actions to achieve new roles of statistical offices: to be defined.

Consequences of big data usage for statistical offices and resulting actions

- Multi-source, multi-purpose data processing: consequences for the structure of statistical offices
- How do surveys change? (e.g. providing auxiliary information for other data sources, compensating for selectivity)