Measuring Sustainability Reporting using Web Scraping and Natural Language Processing

Alessandra Sozzi (alessandra.sozzi@.uk)

Keywords: web scraping, NLP, natural language processing, SDG, sustainable development goals, sustainability, reporting, indicator

Introduction

In September 2015 the United Nations adopted the Sustainable Development Goals (SDGs). The Goals are followed up and reviewed using a set of global indicators, which result from aggregating the national-level indicators produced by each member state. However, there are many indicators for which national-level statistical estimates are not currently produced. This abstract highlights proof-of-concept research using web-based data for an indicator relating to SDG Target 12.6, which is "to encourage companies, especially large and transnational companies, to adopt sustainable practices and to integrate sustainability information into their reporting cycle" [1]. The proposed indicator is "Number of companies publishing sustainability reports, by turnover band, geography, national or global company, sector and number of employees".

This document outlines the steps taken to develop a web scraping program able to collect sustainability information from the websites of a sample of the 100 largest UK private companies (ranked by sales), followed by the use of natural language processing (NLP) techniques to process the collected data and extract additional insights.

Methods

Web scraping is a technique for extracting data from websites without the need for user interaction. The web scraping program, developed in Python, accesses each of the 100 company websites provided and looks for pages that contain text relating to sustainability. Text analysis of the HTML content of the selected pages is then used to extract additional insights.

The scraper

For each company, the scraper navigates through the company website, accessing every internal link. While recursively traversing a website, the scraper flags only the pages that suggest sustainability content, i.e. pages for which at least one of the predefined keywords is found in the URL of the page or in the text of the hyperlink leading to it. The list of manually chosen keywords is: csr, environment, sustainab, responsib, footprint. Once a page containing a keyword is found, it is cleaned before being saved in a database. A high-level overview of the architecture of the scraper is shown in Figure 1.

Figure 1. High-level overview of the web scraping program

The Content Extractor

Web pages are often cluttered with additional features around the main textual content, such as navigation panels, pop-ups and advertisements. These noisy parts tend to affect the performance of NLP tasks negatively. When the ContentExtractor component of the Item Pipeline receives the full HTML of a web page, it extracts just the main textual content using an ensemble of machine learning algorithms encapsulated in the Dragnet method [2]. The Item then proceeds to the MongoConnector, which checks its validity before sending it to a MongoDB database.
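The abstract does not reproduce the scraper's source code; the sketch below is a minimal illustration of the pipeline just described, i.e. flag pages whose URL or anchor text contains one of the keywords, extract the main content with Dragnet and store the result in MongoDB. It assumes the requests, beautifulsoup4, dragnet and pymongo packages; names such as crawl_company are hypothetical, and the production scraper is organised differently (as a crawler with an Item Pipeline, as Figure 1 suggests).

```python
# Minimal sketch (not the production scraper): flag sustainability pages by
# keyword, extract the main text with Dragnet and store it in MongoDB.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup
from dragnet import extract_content  # exact Dragnet API may differ by version
from pymongo import MongoClient

KEYWORDS = ("csr", "environment", "sustainab", "responsib", "footprint")


def looks_sustainability_related(url, anchor_text):
    """A page is flagged if any keyword appears in its URL or in the anchor text leading to it."""
    haystack = (url + " " + anchor_text).lower()
    return any(kw in haystack for kw in KEYWORDS)


def crawl_company(start_url, max_pages=500):
    """Recursively traverse the internal links of one company site, yielding flagged pages."""
    domain = urlparse(start_url).netloc
    seen, queue = set(), [(start_url, "")]
    while queue and len(seen) < max_pages:
        url, anchor = queue.pop()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        if looks_sustainability_related(url, anchor):
            # ContentExtractor step: keep only the main textual content.
            yield {"url": url, "text": extract_content(html)}
        # Follow internal links only.
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).netloc == domain:
                queue.append((link, a.get_text(" ", strip=True)))


if __name__ == "__main__":
    # MongoConnector step: basic validity check, then insert into MongoDB.
    pages = MongoClient()["sustainability"]["pages"]
    for item in crawl_company("https://www.example-company.co.uk"):
        if item["text"]:
            pages.insert_one(item)
```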
A comparison of four content extraction methods was performed on a sample of 30 web pages and led to the choice of Dragnet as the most suitable for the task. The four methods considered were Dragnet, Readability, BeautifulSoup's get_text() and plain <p> tags; the results are summarised in Table 1.

Table 1. Average precision, recall and F1 score for the sample of 30 web pages

            Dragnet   Readability   <p> Tags   get_text()
Precision   0.92      0.82          0.80       0.46
Recall      0.73      0.71          0.68       0.97
F1          0.76      0.73          0.71       0.59

For the purposes of the NLP task, high precision, i.e. high quality of the retrieved content, matters more than recall or the F1 score. The lower recall implies that the Dragnet method will probably retrieve less content; on the other hand, there is more assurance that the extracted content is relevant for the NLP task.

Topic Modelling and Latent Dirichlet Allocation

Topic models are a family of computer programs that extract topics from texts, where a topic is intended here as a list of words that occur together in statistically meaningful ways. Topic modelling algorithms do not require any prior annotation or labelling of the documents; instead, the topics emerge from the analysis of the original texts. Latent Dirichlet Allocation (LDA) is a special case of topic modelling [3]. Given a collection of documents, it assigns to each topic a distribution over the words of the entire corpus (topic-word distributions) and to each document a distribution over topics (document-topic distributions), in an entirely unsupervised way.

Results

A total of 563 sustainability-related web pages were collected from 59 companies. 35 companies did not publish any sustainability pages on their websites, so no pages were found by the scraper. The remaining 6 companies had specific website features that did not allow the scraper to identify the sustainability content. Two additional findings emerged:

- Considering the companies ranked by sales, 80% of the companies in the top half of the ranking have sustainability-related content on their websites, compared with only 45% in the bottom half.
- Companies at the top of the ranking are mainly construction or manufacturing companies and big retailers, whereas companies at the bottom mostly work in the service industry (recruitment consultancies, travel agencies and so on).

Visualising topics as distributions over words

The LDAvis package [4] allows visualising the topic-word distributions produced by the LDA algorithm. The left panel visualises the topics as circles in the two-dimensional plane, whose centres are determined by computing the Jensen-Shannon divergence between topics and then using multidimensional scaling to project the inter-topic distances onto two dimensions. Each topic's overall prevalence is encoded in the area of its circle. The right panel is a horizontal bar chart whose bars represent the individual terms that are the most useful for interpreting the currently selected topic on the left. A pair of overlaid bars represents both the corpus-wide frequency of a given term and the topic-specific frequency of that term. The λ slider allows ranking the terms according to term relevance. An interactive version of the visualisation is available online.

Figure 2. Topics as distributions over words

Visualising companies as distributions over topics

To understand how industry affects sustainability reporting, the document-topic distributions are grouped by company and averaged; these can be referred to as company-topic distributions. This allows comparing the topic distributions of companies that belong to the same industry.
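The abstract does not specify how the topic model was fitted or how the company-topic distributions were computed; the sketch below shows one plausible way to obtain them, assuming gensim and pandas are used, that the page texts have already been retrieved from MongoDB and tokenised (tokenised_docs), and that companies[i] records which company document i was scraped from. The variable names and the number of topics are illustrative assumptions, not the original settings.

```python
# Minimal sketch (illustrative settings): fit an LDA model, take the
# per-document topic distributions and average them per company.
import pandas as pd
from gensim import corpora
from gensim.models import LdaModel


def company_topic_distributions(tokenised_docs, companies, num_topics=15):
    """Return the fitted LDA model and a company-by-topic matrix of averaged proportions."""
    dictionary = corpora.Dictionary(tokenised_docs)
    bow_corpus = [dictionary.doc2bow(doc) for doc in tokenised_docs]

    # Unsupervised LDA: learns topic-word and document-topic distributions.
    lda = LdaModel(corpus=bow_corpus, id2word=dictionary,
                   num_topics=num_topics, passes=10, random_state=0)

    # Dense document-topic matrix (one row per scraped page).
    rows = []
    for bow in bow_corpus:
        probs = dict(lda.get_document_topics(bow, minimum_probability=0.0))
        rows.append([probs.get(t, 0.0) for t in range(num_topics)])
    doc_topics = pd.DataFrame(rows, index=companies)

    # Company-topic distributions: group the pages by company and average.
    return lda, doc_topics.groupby(level=0).mean()
```

An interactive view in the style of Figure 2 could then be produced with pyLDAvis, the Python port of LDAvis (e.g. pyLDAvis.gensim_models.prepare(lda, bow_corpus, dictionary)), although the original visualisation may well have been built with the R package itself.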
As an example, Fashion Retailers (on the left) and Transport companies (on the right) are shown in Figure 3. Each plot is constructed as follows:

- Topic numbers are on the x-axis; the numbering is taken from the topics visualisation in Figure 2.
- The y-axis measures the topic proportions.
- Each line represents a company belonging to that particular industry.
- The boxes show some of the most frequent terms for some of the topics.

Figure 3. Company-topic distributions for Fashion Retailers (on the left) and Transport companies (on the right)

The two fashion retailers split their distributions between topics that relate to the environment and topics related to people, charities and the community. Companies in the transport industry have more heterogeneous topic distributions, although they still show some commonalities.

Conclusions

The results show that there is potential for discerning the number of companies publishing sustainability information by scraping their websites. Overall, of the 65 companies that publish sustainability content on their websites, the process correctly identified 59, and for the 35 companies without such content the scraper rightly did not find any sustainability pages. This result is a good starting point, and improvements to the process can be incorporated in future research. LDA was used to identify topics in the web pages found by the scraper. The analysis of the text extracted from the web pages shows that the subject of sustainability is much more nuanced than a mere keyword analysis might suggest. The extent to which sustainability reporting is affected by the industry and size of a company is worth exploring further, especially with a larger number of companies.

References

[1] United Nations Statistics Division, Final list of proposed Sustainable Development Goal indicators.
[2] M. E. Peters and D. Lecocq, Content Extraction Using Diverse Feature Sets, Proceedings of the 22nd International Conference on World Wide Web (2013).
[3] D. M. Blei, A. Y. Ng and M. I. Jordan, Latent Dirichlet Allocation, Journal of Machine Learning Research 3 (2003), 993-1022.
[4] C. Sievert and K. E. Shirley, LDAvis: A method for visualizing and interpreting topics, Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces (2014), 63-70.