
USING LATENT DIRICHLET ALLOCATION FOR ANALYSIS ACROSS DIFFERENT SOCIAL MEDIA PLATFORMS: A CASE STUDY OF A MEDIA PUBLISHER COMPANY IN SINGAPORE

ERIC YEO PU ZHONG; TAN YONG SIANG; TANG SHING HEI
SINGAPORE MANAGEMENT UNIVERSITY

ABSTRACT

Collaborating with a media publisher company that focuses on travel and lifestyle stories for Singaporeans, this research study aims to analyse and identify opportunities to improve the quality of content curation across the company's platforms, namely its Instagram and Facebook pages. Because different metrics are used across platforms, it is hard for the company to gain an overview of performance across its social media postings; it is therefore important to identify topics derived from each post's description. This paper explores the use of Latent Dirichlet Allocation (LDA) to perform topic modelling and, in turn, analyse post performance across the social media sites based on the common topics identified. In addition to data provided by the company, we scraped data from Instagram as well as from their blog posts. A combination of Python scripts and tools in JMP Pro was used to clean the data and derive additional columns where required. Tableau was used for visualization during exploratory data analysis. The study first explores the use of Term Frequency-Inverse Document Frequency (TF-IDF) to reflect the importance of a term in posts. Subsequently, we reduced the dimensionality of the data using singular value decomposition, before calculating the engagement ratio of each topic on each platform. The engagement ratio is calculated using the formula ("Likes" + "Shares" + "Comments") / Number of followers. Lastly, our team uses these insights to provide concrete recommendations for our sponsor to improve post topics that perform below expectations and to create more posts on well-performing topics.

1. INTRODUCTION

Our client "XYZ" is an independent media publisher that focuses on travel and lifestyle stories for Singaporeans. Most of their content is published through their website, YouTube channel, Facebook page and Instagram. The website comprises an editorial section and a community section where members can upload their own reviews. Today, they reach over 3 million Singaporeans each month.

However, the company currently lacks the capability to perform cross-platform comparison, which means that management does not have an overview of content performance across the entire company. This affects XYZ's ability to develop a niche and to set sound advertising fees, and in turn affects XYZ's advertisement revenue.

Having published various content types over the years, the company knows which posts perform well and which do not. However, monitoring content engagement across platforms such as Facebook and Instagram is highly complex due to factors such as timing, content type and author. As such, they would like to perform a holistic, cross-platform analysis to identify performance differences across platforms.

As competition within the new media space intensifies, our client wishes to quickly gain an overall view of their content performance across the different platforms. This will allow them to prioritise content curation and, at the same time, justify advertisement fee increases so that overall revenue can grow.

After speaking with our client, we identified one main key performance indicator (KPI) that can be used across both Facebook and Instagram: the engagement ratio, defined as ("Likes" + "Shares" + "Comments") / Number of followers.
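As a worked illustration (ours, not the client's code), the ratio for a single post can be computed as follows; the figures are hypothetical:

```python
def engagement_ratio(likes: int, shares: int, comments: int, followers: int) -> float:
    """Engagement ratio = (likes + shares + comments) / followers."""
    return (likes + shares + comments) / followers

# e.g. a post with 1,200 likes, 80 shares and 45 comments on a page
# with 250,000 followers has an engagement ratio of 0.0053 (0.53%)
print(engagement_ratio(1200, 80, 45, 250_000))
```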
It is important that our client not only retains their current pool of followers but also attracts followers from a wider audience. Furthermore, with advertising being their main source of revenue, they have to ensure that their content is well liked so as to reach a wider pool of audience.

To choose the right model, we reviewed research literature and examples of cross-platform analysis. Two approaches were identified in this process: Latent Dirichlet Allocation (LDA), and ANOVA with the Tukey-Kramer HSD test. In the remainder of this paper, we discuss the methodology used throughout the project, elaborate on our data collection, preparation and transformation, and apply the respective models to gain analytical insights. Lastly, we describe how our research provides value to our client and highlight some limitations before ending with our conclusion.

2. LITERATURE REVIEW

2.1 LATENT DIRICHLET ALLOCATION

Latent Dirichlet Allocation (LDA) is a natural language processing statistical model that allows us to create "topics" from a set of documents. The following analysis is referenced from online tutorials by Barber and Besbes. LDA is a generative probabilistic model for collections of discrete data such as text corpora; in the context of text modelling, the topic probabilities provide an explicit representation of a document (Blei et al., 2003). We use the post descriptions to generate the text corpus for all the platforms.

[Figure: A graphical model of the LDA generative process]

In our model, we use the originally unsupervised LDA model for supervised classification, so as to benefit from the large amount of unlabelled data from social media posts and thereby enhance the model. Following Momtazi (2018), we define this process by introducing the following LDA notation:

- D denotes the number of documents in the entire corpus.
- The number of topics, denoted by T, is assumed to be known and fixed.
- Each topic φt, where 1 ≤ t ≤ T, is a distribution over a fixed vocabulary of terms, and φt,w is the proportion of term w in topic t.
- θd is the topic mixture of the d-th document, and θd,t is the proportion of topic t in document d.
- zd are the topic assignments for document d, where zd,n is the topic assignment for the n-th term in document d.
- wd are the terms occurring in document d, where wd,n is the n-th term in document d. All terms are elements of a fixed vocabulary.
- β is the Dirichlet prior on the topic-term distributions.
- α is the Dirichlet prior on the document-topic distributions.

Furthermore, Momtazi (2018) describes the workings of the generative process:

1. For each topic t, choose a multinomial distribution φt from a Dirichlet distribution with parameter β; i.e., φt ~ Dir(β), where 1 ≤ t ≤ T.
2. For each document d, choose a multinomial distribution θd from a Dirichlet distribution with parameter α; i.e., θd ~ Dir(α).
3. For each term in document d, pick a topic assignment zd,n from the distribution θd.
4. Pick a term wd,n from the distribution φzd,n.
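To make these four steps concrete, the following is a small NumPy simulation of the generative process (our illustration, not from the report); the vocabulary, topic count and prior values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

vocab = ["travel", "food", "hotel", "deal", "beach", "cafe"]  # toy vocabulary
T, D, alpha, beta = 2, 3, 0.5, 0.1   # topics, documents, Dirichlet priors
doc_lengths = [8, 6, 7]

# Step 1: one term distribution phi_t ~ Dir(beta) per topic
phi = rng.dirichlet([beta] * len(vocab), size=T)

for d in range(D):
    # Step 2: topic mixture theta_d ~ Dir(alpha) for document d
    theta = rng.dirichlet([alpha] * T)
    words = []
    for n in range(doc_lengths[d]):
        # Step 3: topic assignment z_{d,n} drawn from theta_d
        z = rng.choice(T, p=theta)
        # Step 4: term w_{d,n} drawn from phi_{z_{d,n}}
        words.append(rng.choice(vocab, p=phi[z]))
    print(f"doc {d}: {' '.join(words)}")
```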
2.2 ANOVA & TUKEY-KRAMER HSD

Across a variety of business applications and analyses, ANOVA has been used in conjunction with the Tukey-Kramer HSD test. Together, they are used to compare factor-level means and, at the same time, to understand how each independent variable affects the dependent variable.

In an experiment on the effectiveness of different rust inhibitors, four brands (A, B, C, D) were examined (Kutner, Nachtsheim, Neter, & Li, 2005). The research design was as follows:

- Draw 40 samples.
- Assign 10 units to each brand.
- Expose each unit to the same severe weather conditions.
- At the end of the experiment, record the coded value that measures the effectiveness of the rust inhibitor.

ANOVA was carried out on the sample set across the four brands. The single-factor study determined that the factor-level means differed, and a Tukey-Kramer HSD test was then required to establish which factor-level means were better, thereby identifying the best rust inhibitor.

We can apply ANOVA and the Tukey-Kramer HSD test in a similar fashion to investigate how each "topic" fares against the others. Subsequently, we aim to identify the top-performing topics and provide our client with actionable and valuable insights to improve content curation.

3. TOOLS AND LIBRARIES USED

During the course of our project, we used the following technologies to conduct our analysis:

Microsoft Excel
Our team used Excel functions to manipulate the data and extract only the data necessary for analysis.

Tableau
Tableau was the main analytical tool that enabled us to create charts and dashboards to better understand the dataset provided. Through visual representations and analytical charts, our team derived insights that were useful for our client.

Jupyter Notebook
Project Jupyter is an open-source project which supports interactive data science and scientific computing across many programming languages. We used this tool for data wrangling and for building and evaluating models.

JMP Pro
As statistical discovery software, JMP Pro was used in our project for joining and concatenating columns across multiple tabs in different Excel files, as well as for simple graph analysis. We also used it to identify missing-data patterns as part of data preparation.

Python
The Python libraries used are:
- pandas, a data-structure library for data wrangling
- NLTK, a natural language toolkit for Python
- stop_words, a Python package containing stop words
- gensim, a topic-modelling package containing the LDA model

4. METHODOLOGY

Figure 4.1: Process flow

Figure 4.1 shows the entire process flow for our project. We first collected data from our client's Facebook Insights and crawled our client's Instagram page. Subsequently, we performed the necessary cleaning and transformation on our data sets. Next, we conducted exploratory data analysis to better understand the data and to find appropriate models, and we verified the suitability of those models with statistical tests. Finally, we used LDA and ANOVA with Tukey-Kramer HSD as our models of choice to carry out further analysis.

5. DATASET

5.1 DATA COLLECTION & PREPARATION

5.1.1 INSTAGRAM DATA

Using NPM, we used instagram-profilecrawl to crawl XYZ's followers, posts, likes and comments, taking their public Instagram profile as input. From the command prompt, we ran $ instagram-profilecrawl <input> to begin crawling. This results in a JSON file being saved. Using Python (pandas), we transform the JSON data into a tabular structure for processing.
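A minimal sketch of this flattening step, assuming the crawler saves a JSON file whose post records carry the fields listed below (the file name and the nesting under a "posts" key are hypothetical):

```python
import json
import pandas as pd

# Load the JSON output saved by instagram-profilecrawl (hypothetical name)
with open("xyz_profile.json") as f:
    profile = json.load(f)

# Flatten the nested post records into one row per post
posts = pd.json_normalize(profile["posts"])
print(posts[["date", "numberLikes", "numberComments", "description"]].head())
```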
Figure 5.1: Snippet of DataFrame

After preparation, we have the following metadata:

url: Permalink for Instagram post
urlImage: Permalink for Instagram image
date: Date and time of posted content
width: Width of content
height: Height of content
numberLikes: Number of likes garnered
numberComments: Number of comments garnered
isVideo: Indicates if content is a video
multipleImage: Indicates if content contains multiple images
description: Caption of content
mentions: Users mentioned in the post
tags: Hashtags ("#") used in the post

5.1.2 FACEBOOK DATA

Data from three time periods were combined into one set. Only data from the tab "Lifetime: Number of stories created about your Page post, by action type (total count)" was used. The following metadata was used for analysis:

Figure 5.2: Facebook post metadata

There are empty cells in the "Type" column. Our team therefore sieved out the rows without a "Type" value and looked for a common characteristic among them. We realised that rows without a "Type" are posts that mention XYZ rather than posts originating from the Facebook Page, so we removed them from our analysis.

5.2 DATA CLEANING & TRANSFORMATION

5.2.1 DATA CLEANSING

For each post description, we tokenize the sentence and remove stop words and punctuation.

Figure 5.3: Code snippet for tokenizing text

5.2.2 TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF)

Term frequency-inverse document frequency (TF-IDF) is a numerical statistic that reflects how important a term is with respect to all the posts. The tf-idf of a term t in a post p is proportional to the number of times t appears in p, but is offset by the frequency of t across the platform's collection of posts. This is because some words appear frequently in all posts yet are irrelevant for identifying a topic.

Figure 5.4: Code snippet for TF-IDF matrix

We make use of sklearn (Python)'s TfidfVectorizer to create the matrix, where:
- the number of rows is the total number of post descriptions, and
- the number of columns is the total number of unique terms (tokens) across the posts.

6. MODELS & INSIGHTS

6.1 LATENT DIRICHLET ALLOCATION

6.1.1 CONSTRUCTING THE MODEL

Figure 6.1: Code snippet for LDA model

Using 25 as the number of topics, we obtain the following results for Instagram posts:

Figure 6.2: LDA model results

Based on these terms, we manually assign a topic to each and categorize the data for Instagram and Facebook. Results for YouTube show a vastly different array of topics from the Facebook and Instagram content, so we exclude it from the cross-platform analysis.

6.1.2 DIMENSION REDUCTION WITH SVD AND T-SNE

To visualize our results on a 2D plane, we first transform the matrix into 2-dimensional data using a combination of two dimension-reduction techniques: truncated singular value decomposition (SVD) and t-distributed Stochastic Neighbor Embedding (t-SNE). The SVD transformer performs linear dimensionality reduction by means of truncated singular value decomposition. t-SNE converts similarities between data points into joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data.
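Since the report's own snippets appear only as figures, here is a minimal end-to-end sketch of the steps described in Sections 5.2 and 6.1, assuming the NLTK, stop_words, scikit-learn and gensim libraries listed in Section 3; the sample descriptions and parameter values are illustrative, not the report's exact code:

```python
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE
from gensim import corpora, models

# Hypothetical post descriptions; in practice, every scraped caption is used
descriptions = [
    "Best nasi lemak spots in Singapore #food #sgeats",
    "Weekend getaway: 48 hours in Bangkok on a budget",
    "GIVEAWAY: win a staycation for two! Tag a friend",
    "5 hidden cafes in Tiong Bahru you need to try",
]

# 5.2.1 Data cleansing: lowercase, tokenize, drop punctuation and stop words
tokenizer = RegexpTokenizer(r"[a-z]+")
stop = set(get_stop_words("en"))
tokenized = [
    [w for w in tokenizer.tokenize(d.lower()) if w not in stop]
    for d in descriptions
]

# 5.2.2 TF-IDF matrix: rows = post descriptions, columns = unique terms
tfidf = TfidfVectorizer().fit_transform(" ".join(t) for t in tokenized)

# 6.1.1 LDA with gensim; the report used 25 topics on the full corpus
dictionary = corpora.Dictionary(tokenized)
bow_corpus = [dictionary.doc2bow(t) for t in tokenized]
lda = models.LdaModel(bow_corpus, num_topics=25, id2word=dictionary, passes=10)
for topic_id, terms in lda.show_topics(num_topics=3, num_words=5, formatted=False):
    print(topic_id, [term for term, _ in terms])

# 6.1.2 Dimension reduction: truncated SVD first, then t-SNE down to 2D
svd = TruncatedSVD(n_components=min(50, tfidf.shape[1] - 1))
coords = TSNE(
    n_components=2,
    perplexity=min(30.0, len(descriptions) - 1),  # must be < sample count
).fit_transform(svd.fit_transform(tfidf))
print(coords.shape)  # (n_posts, 2): x/y coordinates for the scatter plot
```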
Figure 6.3: Code snippet for SVD
Figure 6.4: Code snippet for t-SNE

Using these techniques, we obtain a 2-dimensional representation of the TF-IDF matrix for visualization, as shown:

Figure 6.5: Sample of processed data

6.1.3 VISUALIZING THE LDA MODEL

Using BokehJS for Python, we created a scatter plot which allows us to explore the data easily and inspect whether topics have been assigned correctly.

Figure 6.6: LDA topic visualization with BokehJS

In addition, we made use of pyLDAvis, a Python package designed to help users interpret the topics in a topic model fitted to a corpus of text data, to produce an interactive web-based visualization.

Figure 6.7: Topic visualization with pyLDAvis

Using this tool, we were able to understand the topics better and spot additional irrelevant terms, which we fed back into the tokenization step to recreate the LDA models.

6.2 DATA INSIGHTS

6.2.1 DATA CONSOLIDATION

Consolidating the data from Facebook and Instagram gives us the following data for analysis, with two additional columns, "type" and "topic":

Figure 6.8: Screenshot of consolidated data

6.2.2 DATA ANALYSIS

Figure 6.9: Engagement rate for different platforms across topics

As seen from the analysis, certain topics are identified on Instagram but not on Facebook. As defined earlier, the Engagement Rate is calculated as (Likes + Comments + Shares) / Number of followers. Instead of using three different measures, the Engagement Rate encapsulates the amount of "engagement" from the company's audience. The analysis shows that engagement rates on Instagram are higher than those on Facebook, despite Instagram not having a "share" feature. On Facebook, videos have relatively higher engagement rates than shared videos and posts.

Figure 6.10: Distribution of engagement rates across topics

By average Engagement Rate, the top three topics are: [Censored].

6.3 ANOVA & TUKEY-KRAMER HSD

H0: There are no significant differences in average engagement rates between platforms.
H1: There are significant differences in average engagement rates between platforms.

Figure 6.11: Comparison of all pairs (platform) using Tukey-Kramer HSD

From the results, we can see that there are significant differences (p-value < 0.05) in average engagement rates between all pairs except Facebook Posts and Facebook Shared Videos. From the Connecting Letters Report, we can conclude the following ranking of engagement rates in descending order:

1. Instagram Images
2. Instagram Videos
3. Facebook Videos
4. Facebook Posts / Shared Videos

Figure 6.12: Comparison of all pairs (topics) using Tukey-Kramer HSD

From the results in Figure 6.12, we can see that there are significant differences (p-value < 0.05) in average engagement rates between most pairs. From the Connecting Letters Report, we can conclude the following ranking of engagement rates in descending order:

1. [Censored]
2. Travel
3. Food / Giveaways / Festive
4. Local / Others / Deals
5. Video
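The report performs these comparisons in JMP Pro; a rough equivalent in Python, using statsmodels' Tukey HSD implementation (which applies the Tukey-Kramer correction for unequal group sizes), could look like the sketch below. The file and column names are hypothetical:

```python
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Consolidated Facebook + Instagram posts (hypothetical file/column names)
df = pd.read_csv("consolidated_posts.csv")

# Engagement rate = (likes + comments + shares) / followers;
# Instagram has no "share" feature, so missing shares count as 0
df["engagement_rate"] = (
    df["likes"] + df["comments"] + df["shares"].fillna(0)
) / df["followers"]

# Pairwise Tukey-Kramer comparisons of mean engagement rate by platform/type
result = pairwise_tukeyhsd(
    endog=df["engagement_rate"],   # dependent variable
    groups=df["platform_type"],    # e.g. "Instagram Image", "Facebook Video"
    alpha=0.05,
)
print(result.summary())  # pairs with reject=True differ significantly
```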
7. RECOMMENDATIONS

Platform
The company should continue posting quality content on Instagram, as it enjoys a generally higher engagement rate, and should think of ways to better engage the audience on Facebook.

Topics
The company should continue to create content on the topics identified above with higher engagement rates. By strategizing their resources around such topics, XYZ can focus production on content with more effective outreach to their audiences on the different platforms. For future business development, our sponsor might want to consider tiered pricing for better-performing posts, especially sponsored posts, to increase sponsorship revenue. The production team might also want to consider reducing or eliminating low-engagement posts that carry high production costs.

8. IMPLICATIONS / LIMITATIONS

There are various limiting factors in topic modelling with LDA, including the number of documents, the length of individual documents, the number of topics, and the Dirichlet parameters (Tang, n.d.). Documents, in our case, are the post descriptions from the various social media platforms. Other common LDA limitations are that the number of topics is fixed and must be known ahead of time; that the Dirichlet topic distribution cannot capture correlations between topics; and that LDA's bag-of-words representation does not model sentence structure. Furthermore, the topics predicted by LDA are not a hundred per cent accurate due to the ambiguous nature of the content. For example, a particular post may contain information pertaining to both "Food" and "Travel", but will only be categorised under the single topic with the higher probability.

In addition, the Engagement Rate has its own limitations despite taking into account the most important metrics: "Likes", "Comments", "Shares" and the number of "Followers". We observe that it is relatively more difficult to maintain a high Engagement Rate as the number of followers grows large; likewise, accounts with a very small follower base are likely to show a high engagement rate. Both the engagement rate and the follower count should therefore be read together to provide a more accurate view. Furthermore, Instagram currently does not have a "share" feature, unlike Facebook.

9. CONCLUSION

LDA, as a text-modelling tool, allows the team to derive common topics for cross-platform analysis across social media platforms. Previously, the company had expressed difficulty in analysing the performance of its posts at an overview level. We began by crawling data from Instagram, cleaning the data, and conducting exploratory data analysis. From our results, we recommend that the company continue posting quality content on Instagram, where engagement rates are generally higher, and improve their engagement strategy with the audience on Facebook. We also recommend a number of selected topics that we identified with the LDA model.

In conclusion, with the ever-changing preferences of social media viewers, our results will become outdated over time. It is therefore important for the sponsor to continue the process of data collection and re-execute our analysis in order to obtain the latest insights. This would enable the company to continually evolve and adapt their content to suit their audiences' preferences based on the better-performing topics.
10. REFERENCES

Barber, J. (n.d.). Latent Dirichlet Allocation (LDA) with Python. Retrieved April 08, 2018.
Besbes, A. (2017, March 15). How to mine newsfeed data and extract interactive insights in Python. Retrieved April 08, 2018.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.
Kutner, M. H., Nachtsheim, C. J., Neter, J., & Li, W. (2005). Applied Linear Statistical Models.
Momtazi, S. (2018). Unsupervised Latent Dirichlet Allocation for supervised question classification. Information Processing and Management, 54(3), 380-393.
Tang, J. (n.d.). Understanding the limiting factors of topic modeling via posterior contraction analysis. Retrieved April 08, 2018.

11. CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the authors at:

Eric Yeo Pu Zhong: eric.yeo.2014@sis.smu.edu.sg
Tan Yong Siang: ystan.2014@sis.smu.edu.sg
Tang Shing Hei: shtang.2014@sis.smu.edu.sg