
Virginia Tech
Department of Computer Science
Blacksburg, VA 24061

CS 4624: Multimedia, Hypertext, and Information Access Capstone
Spring 2021

Authoritative Venues Final Report

Ali Youssef, Bella Marku, Kyle Forst, Tanner Spicer

Clients: Harinni Kumar and Dr. Cliff Shaffer
Professor: Edward A. Fox

May 12, 2021

Table of Contents
Table of Figures
Table of Tables
1. Executive Summary
2. Introduction
  2.1. Background and Objectives
  2.2. Deliverables
  2.3. Clients
  2.4. Team
3. Requirements
4. Design
  4.1. Initial Research and Exploration
  4.2. Overview of Chosen Solution
5. Implementation
6. Testing/Evaluation/Assessment
7. User’s Manual
8. Developer’s Manual

Table of Figures
Figure 1. Format of conference data: key data columns are the conference at which the research was published, as well as the research’s title, abstract, entry_date, and authors.
Figure 2. Conference to venue category mappings, containing columns with the conference’s acronym, full name, and categorization by topic.
Figure 3. Output of the scraping program. The program attempts searching by venue acronym and then by full name.
Figure 4. Snippet of the Nginx configuration file. Shows configuration of the port and domain name, along with proxied HTTP headers.
Figure 5. The website that hosts the Venue Recommender. Displays input fields for publication title and abstract. Produces results upon clicking the “Recommend” button.
Figure 6. Results shown after running the Venue Recommender. Lists recommended venues within groups along with each individual venue’s h5-index.
Figure 7.
Breakdown of our steps for achieving goal 1.
Figure 8. Breakdown of our steps for achieving goal 2.
Figure 9. Breakdown of our steps for achieving goal 3.
Figure 10. Diagram of our overall workflow, with individual workflows consisting of steps covering each of our three goals.
Figure 11. The data produced by h5.py stored in a .xlsx file. Contains venue acronyms, venue full names, and their respective h5-indices.
Figure 12. Example of an input file for h5.py.
Figure 13. Example of changing the input file that is read in h5.py.
Figure 14. Screenshot of the virtual machine after activating the Python virtual environment. Shows how the command line’s prompt changes to add ‘(recommenderenv)’ to indicate usage of the virtual environment.

Table of Tables
Table 1: Describes our implementation-specific service names, input file names and IDs, output file names and IDs, and necessary libraries/functions/environments.
Table 2: Describes files utilized in the creation of classifier models.
Table 3: Describes files utilized and created in the data scraping process.

1. Executive Summary
The goal of the Authoritative Venues project was to use machine learning algorithms to create a web application that accurately recommends fitting venues for Computer Science researchers trying to publish their work. By providing a ranked list of publication venues related to a paper’s topic, we help researchers make more informed decisions about where to submit their work for publication. Our first of two clients, Dr. Cliff Shaffer of the Virginia Tech Department of Computer Science, initially proposed a project involving gathering citation and ranking data on various CS Education publication venues to determine which ones are most influential. The final project is a hybrid of this project and one initially started by Harinni Kumar, a graduate of Virginia Tech’s master’s degree program in Computer Science and Applications.
Over the course of the semester, our team expanded upon the initial framework code provided to us by Harinni in order to incorporate Dr. Shaffer’s requests. This original framework code used machine learning algorithms to recommend relevant venues, grouped by topic, when provided with a paper title and abstract. However, the application could only be run locally, and the output listed the venues in random order. To improve on this, we added a ranking system to the output in which each venue’s h5-index (the largest number h such that h articles published in the last five years have at least h citations each) is also listed. The h5-index is commonly used for measuring the productivity and impact of a particular venue, so we included it to help researchers choose venues with a high community impact. Additionally, we reserved and configured a virtual machine to host the application on the web at a Virginia Tech URL so that it can be accessed at any time from any computer. This report details the requirements for the project as outlined by our sponsors, our research and design process, the details of our implementation and testing, a user’s manual for those accessing the web application, a developer’s manual for those interested in working with the backend, and an overview of the project’s timeline and key takeaways.

2. Introduction

2.1. Background and Objectives
The computer science discipline has a unique controversy about the relationship between conferences and journals when it comes to paper publication. Both conferences and journals are considered valid venues for publication, as opposed to other fields where journals are the only (or most prestigious) venues available. Journals within the computer science discipline often decline to publish papers that have already been reviewed at conferences, which caused conferences to gain traction as a separate authoritative entity.
Both venue types have pros and cons: conferences have been criticized for a lack of thoroughness in the reviewing process, while journals have faced criticism for their slow reviewing time compared to conferences [1]. There are existing websites that rank various computer science publication venues, but we could find none that provide customized recommendations for where to submit one’s work. This is where our project fulfills a need: with our web application, researchers can easily provide a paper title and abstract and receive customized results listing and ranking the best venues for publication.

2.2. Deliverables
The final deliverables for this project are as follows:
- A functioning web application, hosted 24/7 on one of Virginia Tech’s virtual machines, which provides the output described above
- An open-source repository of project documentation hosted on GitHub, with source code written in Python
- A final report including information about the completion process, a User’s Manual, and a Developer’s Manual

2.3. Clients
The two clients for our project are Dr. Cliff Shaffer, Associate Department Head for Graduate Studies and Professor of Computer Science at Virginia Tech, and Harinni Kumar, a graduate of Virginia Tech’s master’s degree program in Computer Science and Applications who is now working as a software engineer at Walmart Labs. Dr. Shaffer’s original project proposal was to scrape the average citation count per paper for various conferences and journals within the computer science field and then compare these averages by venue. His goal was to use those citation counts to measure a venue’s level of influence within the field.
We found some overlap between that project and the project that Harinni began last year for her master’s thesis: a venue recommender that used neural networks to analyze a paper’s title and abstract and output publication venues that usually publish similar topics. Her application listed venues in random order and could only be run locally on a user’s machine. Therefore, for our overall project, we agreed on three major elements that combined the wishes of our clients and increased accessibility for our users: an improved neural network system to build on Harinni’s progress, a ranking system added to the output to measure the relevance of venues as requested by Dr. Shaffer, and a URL hosting the functioning application to allow for user access anytime, anywhere.

2.4. Team
Our team is composed of the following members: Bella Marku, Ali Youssef, Tanner Spicer, and Kyle Forst. We are all seniors majoring in Computer Science and graduating in Spring 2021. We decided to take the Multimedia, Hypertext, and Information Access capstone to learn more about media information, access, and systems, and to gain exposure to concepts like text scraping/processing, electronic publishing, and data architectures.

Bella Marku has a second major in Computational Modeling and Data Analytics and was interested in a project involving data collection and processing. She gained experience in these topics in her CMDA capstone course, which she took during Fall 2020, and her technology internship at Capital One during Summer 2020. She will be returning to Capital One in August for a full-time position in their Technology Development Program. For this project, her main contributions included designing the presentations, writing content for the reports, and researching and executing the hosting of the online web application.
Ali Youssef was also interested in a data processing project, as he wanted to delve into a type of project he had not done before. Although he had done projects with minor data processing portions in the past, he aimed to take part in a project primarily focused on collecting and utilizing data. For this project, his main contributions included collecting data on every venue that was utilized, developing the ranking system, and setting up the web server and Flask deployment of the online web application.

Tanner Spicer is also a Computational Modeling and Data Analytics major interested in participating in more projects involving data processing and manipulation. During his previous capstone, he gained experience working with the convolutional neural networks used in this project to compare different venues for publishing papers. Within this project, he has been testing models beyond convolutional neural networks, like recurrent neural networks, that lend themselves better to natural language processing.

Kyle Forst was also looking to gain experience in data processing, an area in which he had no real prior experience. He also wanted to learn more about what constitutes a useful publication venue and to interact with and learn from the data associated with determining such a metric. His primary contributions include helping to create the presentations, fixing minor issues with accessing the virtual machine, and working on the report.

3. Requirements
The requirements for this project were developed through a joint effort between our team members, Dr. Cliff Shaffer, Harinni Kumar, and our capstone professor, Dr. Edward Fox. Dr. Shaffer’s main requirement was to develop a system for measuring the impact of various computer science publication venues so that they could be easily compared.
He did not have any strict specifications for how to implement this, but did suggest looking into scraping Google Scholar to obtain average citations per paper or other similar impact metrics. Harinni had no specific requirements other than to improve on her existing application. She shared the code and data she developed when working on this project last year so that we could build on her progress. She also provided ideas for areas of improvement, such as refining the machine learning models to improve output accuracy and configuring the application to run in a web browser rather than on a local port. With Dr. Fox’s recommendations and guidance, we combined the feedback from both Dr. Shaffer and Harinni into a list of three concrete requirements:
1. Improve machine learning models to increase output accuracy.
2. Add a ranking system to the output for comparing venue impact.
3. Configure the recommender application to run in a web browser.

4. Design

4.1. Initial Research and Exploration
The majority of our research involved understanding neural networks and choosing a metric to distinguish different venues and compare their impact. The pre-existing code written by Harinni outputs venues using convolutional neural networks, so we needed a more thorough understanding of how they work and how they are implemented in Python in order to improve the accuracy of the output. Additionally, we did background research on the best metric for comparing the impact of venues. The h5-index “is the largest number h such that h articles published in the past 5 years have at least h citations each” [3]; we chose it because it incorporates both the number of articles and the number of citations per article, both of which are influential indicators of overall impact.

4.2.
Overview of Chosen Solution
The core functionality of the venue recommender application is split into two parts: the machine learning recommender and the Flask web application. Harinni split her original code across two repositories along this distinction, so we decided to build upon that organization rather than change the file layout and develop our own. All of her code was written in Python, so we continued in that language as well.

The first part of the application is contained within a GitHub repository called “Recommender_CL”. It contains all of the code related to machine learning and developing neural networks. It utilizes data about different computer science publication venues (their name, description, and topics of interest, among other things) collected by Harinni to create the neural network models. It then uses these models, when a new paper title and abstract is presented, to output venues known for relevant topics. The networks and output use a group-based model, in which each venue is grouped by its main topic, like “Robotics” or “Neuroscience”, and groups are returned as output rather than individual venues. When the code was passed along to us, the recommender was able to return relevant groups, but the venues within each group were listed in random order. To fulfill Dr. Shaffer’s request for a metric to compare venues, we designed and implemented a ranking system based on venues’ h5-indices, so that within each group the venues with the highest h5-indices are listed first.

The second part of the application is contained within a GitHub repository called “GroupRecommender_flask”. This repository utilizes the pre-trained CNN models from the Recommender_CL repository to accept user input and return output. To create a functioning user interface, the code utilizes Flask, a lightweight web application framework.
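The ranking step described above can be illustrated with a short sketch. The venue names and per-article citation counts below are invented for illustration; in the deployed application the h5-indices are scraped in advance rather than computed from raw citation data:

```python
def h5_index(citations):
    """Largest number h such that h of the given articles
    (published in the last five years) have at least h citations each."""
    counts = sorted(citations, reverse=True)
    h = 0
    for i, c in enumerate(counts, start=1):
        if c >= i:
            h = i
        else:
            break
    return h

# Invented per-article citation counts for two hypothetical venues.
venues = {"VENUE-A": [10, 8, 5, 4, 3], "VENUE-B": [3, 2, 2]}

# Within a group, list the venues with the highest h5-index first.
ranked = sorted(venues, key=lambda v: h5_index(venues[v]), reverse=True)
print(ranked)  # ['VENUE-A', 'VENUE-B']
```

Here VENUE-A gets an h5-index of 4 (four articles with at least four citations each) and VENUE-B gets 2, so VENUE-A is listed first.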
Initially, the application as given to us by Harinni functioned in a web browser only if the user had all the necessary files on their computer and navigated to “127.0.0.1:5000”. However, one of our design requirements was that the application be fully functional from a web URL on any computer, which we solved by hosting the application on a virtual machine. More details on the data collection, web scraping, ranking system, and virtual machine can be found in the following section on implementation.

5. Implementation
One benefit of continuing Harinni’s master’s thesis project is that the vast majority of the data needed for a project like this was already collected and cleaned for us. We were provided with numerous CSV files containing key information about conferences and venues. Harinni used this data to train convolutional neural networks and export models used for categorization of new data. Some examples of the data used to implement the models are as follows. Figure 1, adapted from “ACM Venue Recommendation System” [2], contains paper information for research published at conferences; the table format is similar for research published in journals. The key sections to note are the title and abstract, the conference field containing the name of the conference where the paper was presented, the entry date, and the authors. Figure 2, also adapted from “ACM Venue Recommendation System” [2], pairs conferences to the topic group they fit best, described in the “Venue Terms” column. The “Conf Acronym” column contains the short acronym used to describe the conference, the “Conference Full Name” column contains the unabbreviated version of the name, and the “Venue Terms” column matches the conference to one of our broad topic groups used for sorting venues, such as “Computational Linguistics” or “Computer Graphics”.

Figure 1.
Format of conference data: key data columns are the conference at which the research was published, as well as the research’s title, abstract, entry_date, and authors; adapted from [2]

Figure 2. Conference to venue category mappings, containing columns with the conference’s acronym, full name, and categorization by topic; adapted from [2]

In order to create a ranking system for the program, we needed to obtain the h5-indices of all the venues utilized. To do this, Ali created a web-scraping program in Python that scraped Google Scholar Metrics for venue data. The scraping program first attempts to scrape based on the venue’s acronym. If a match is found, the h5-index value of that venue is retrieved and stored; if no match is found, the program instead scrapes based on the full name of the venue. Without any additional parameters, the scraping program was initially able to retrieve data for 65% of the venues.

Figure 3. Output of the scraping program. The program attempts searching by venue acronym and then by full name.

Ali then added more search checks to increase the success rate of the scraping program. If the full name of the venue contains an “and”, the program queries with “&” in its place when the initial scrape is unsuccessful. Additionally, Google Scholar Metrics occasionally includes “ACM” in the full name of a venue, which in a few instances caused the scraping program to miss a match that existed [5]; Ali added a check to handle this case as well. After these checks were added, the scraping program was able to retrieve data for 88% of the venues. The remaining 12% had to be obtained manually; however, manual web searching recovered only an additional 4% of the venue data. The 8% of venues that still had missing data did not have an h5-index value available online.
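The fallback checks described above amount to trying an ordered list of query strings for each venue. A minimal sketch of that ordering follows; the function name and exact order are illustrative, not the project’s actual code:

```python
def query_candidates(acronym, full_name):
    """Return Google Scholar Metrics queries to try, in order:
    acronym first, then full name, then the extra fallback checks."""
    candidates = [acronym, full_name]
    if " and " in full_name:
        # Scholar sometimes writes the name with '&' instead of 'and'.
        candidates.append(full_name.replace(" and ", " & "))
    if not full_name.startswith("ACM "):
        # Scholar occasionally prefixes venue names with 'ACM'.
        candidates.append("ACM " + full_name)
    return candidates

print(query_candidates("TIST", "Transactions on Intelligent Systems and Technology"))
```

The scraper stops at the first candidate that yields a match, which is what raised the automatic success rate from 65% to 88%.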
These venues with missing h5-index data have a ‘Cannot find venue’ message in place of an h5-index value in the recommender application. The manual process included the following steps: querying the names/acronyms of each venue that was missing data in Google Scholar Metrics, Guide2Research, or Scopus; finding the h5-index value of each venue upon query completion; and then manually recording the h5-index value in an Excel spreadsheet [6][7].

The key to implementing our web application was securing a virtual machine through Virginia Tech’s Computer Science Research Virtual Machines Program. Instead of paying for a web hosting service, we were able to reserve a VM that keeps the website running 24/7. The virtual machine we requested has the following specifications:
- 4 cores
- x86_64 architecture
- 16 GB RAM
- 1 TB disk space
- CentOS 7

With guidance from Harinni and Dr. Fox, we determined that these specifications would allow the application to handle any level of computation required for a user to obtain their output. The VM does not contain a GPU: the application uses convolutional neural network models that were trained in advance and saved, so no expensive machine learning computation should be required for typical use. To host the website, we first needed to decide which web server to use. We debated between Nginx and Apache, and ultimately chose Nginx. Nginx is better suited to delivering static files, which was the end goal of the Flask deployment process. It is also typically much faster at handling requests and delivering content than Apache, which was appealing because we wanted the application to run as fast as possible. To set up Nginx, we needed to configure it to listen on port 80 (HTTP) and respond to our server’s domain name, authvenue.cs.vt.edu [4].
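A server block implementing this setup would look roughly like the sketch below; the proxied socket path follows the deployment layout described later in this report, but the exact production file may differ:

```nginx
server {
    listen 80;
    server_name authvenue.cs.vt.edu;

    location / {
        # Forward requests to the Gunicorn Unix socket.
        proxy_pass http://unix:/root/GroupRecommender_flask/recommender.sock;

        # Standard proxy headers describing the remote client connection.
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```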
In Figure 4, a snippet of the Nginx configuration file is shown containing these changes. It also shows how we set some standard proxying HTTP headers that provide information about the remote client connection.

Figure 4. Snippet of the Nginx configuration file. Shows configuration of the port and domain name, along with proxied HTTP headers.

We then needed a way to run the application on the web server and have the website handle multiple processes simultaneously. To do so, we opted to use Gunicorn, a web server gateway interface. Gunicorn allowed us to run multiple instances of the Flask application and handle multiple requests simultaneously. Gunicorn offers multiple ways to handle concurrency. The first involves ‘workers’, which are processes that run Python applications. The second is Gunicorn threading, which allows each worker process to have multiple threads [8]. The recommended maximum number of concurrent requests with Gunicorn is (2 * number of cores) + 1. Since our VM has four cores, we configured our Gunicorn instance to create three workers and three threads per worker, with each worker running an instance of the Flask application. As such, our website can handle at most nine concurrent requests. With Nginx and Gunicorn properly configured, the application is now hosted at authvenue.cs.vt.edu.

6. Testing/Evaluation/Assessment
After setting the application up on the chosen URL, we successfully tested it on a variety of operating systems and web browsers to ensure that users have a uniform experience with the app. We also provided the same input multiple times and examined the output to ensure that it was consistent across different runs.

7. User’s Manual

Website
The website serves as a way for users to easily utilize the venue recommender. It displays two input fields, one for the publication title and one for the publication abstract, as shown in Figure 5.
The user simply needs to enter information about their paper into these two fields and then hit the grey “Recommend” button. The website then displays venue recommendations produced by the recommender based on the information provided, along with the ranking information for these venues, as shown in Figure 6. Some venues do not have ranking information; for these, a ‘Cannot find venue’ message is displayed in place of an h5-index value.

Figure 5. The website that hosts the Venue Recommender. Displays input fields for publication title and abstract. Produces results upon clicking the “Recommend” button.

Figure 6. Results shown after running the Venue Recommender. Lists recommended venues within groups along with each individual venue’s h5-index.

8. Developer’s Manual

Methodology
The majority of our application’s users will likely be involved in research and academia. The purpose of the application is to provide information about relevant computer science publication venues, so examples of website users would be a researcher looking for a venue to publish their work in neurotechnology, a professor looking for a ranking of publication venues related to robotics, or a student looking for high-impact computer science venues where they could find works that are more highly cited on average. We have identified three primary goals that apply to each user of our website:
1. The website needs to provide customized venue recommendations based on a paper’s title and abstract; see Figure 7.
2. The website needs to provide a ranking system based on h5-indices to compare venues listed as output; see Figure 8.
3. The website needs to be available on any computer at any time; see Figure 9.

Figure 7. Breakdown of our steps for achieving goal 1.

Figure 8. Breakdown of our steps for achieving goal 2.

Figure 9.
Breakdown of our steps for achieving goal 3.

Goal 1: Provide customized venue recommendations based on a paper’s title and abstract; see Figure 7.
Sub-task: Generate convolutional neural network (CNN) models
- Input: Cleaned data that was given to us
- Output: Binary classifiers for each possible venue, which can be used to categorize new data
- Libraries, Functions, & Environments: Jupyter Notebook, Keras Python API
Sub-task: Run the algorithm using CNN models to match the input with known venues
- Input: Paper title and abstract; CNN models developed in sub-task 1
- Output: List of relevant venues, grouped by topic
- Libraries, Functions, & Environments: Jupyter Notebook, Keras Python Deep Learning API

Goal 2: Provide a ranking system to compare venues listed as output; see Figure 8.
Sub-task: Scrape h5-indices for all possible venues
- Input: Google Scholar Metrics website URLs containing h5-index information
- Output: List of all venues in the database and their h5-indices
- Libraries, Functions, & Environments: Python, Google Scholar, BeautifulSoup
Sub-task: Identify relevant publication venues
- Input: Paper title and abstract; CNN models developed in advance
- Output: List of relevant venues, grouped by topic
- Libraries, Functions, & Environments: Web browser, Flask
Sub-task: Assign an h5-index to each venue in the output
- Input: List of relevant venues; list of all possible venues and their corresponding h5-indices
- Output: The h5-index associated with each recommended venue
- Libraries, Functions, & Environments: Web browser, Jupyter Notebook, Flask
Sub-task: Output results on the recommender website
- Input: Website URL, user click
- Output: List of both the relevant venues and their corresponding h5-indices
- Libraries, Functions, & Environments: Web browser, Jupyter Notebook, Flask

Goal 3: Ensure the website is available on any computer at any time; see Figure 9.
Sub-task: Set up the virtual machine
- Input: Request for a VM with the given specs from CS Research Virtual Machines
- Output: VM allocation with the given specs
- Libraries, Functions, & Environments: CentOS 7
Sub-task: Run the server using Flask and Nginx
- Input: Flask web app framework code, Nginx configuration
- Output: Web server hosting our Flask application 24/7
- Libraries, Functions, & Environments: Flask, Nginx, CentOS 7
Sub-task: Navigate to the recommender website
- Input: Link to the website
- Output: Flask web application functioning in any browser
- Libraries, Functions, & Environments: Web browser, Flask

In Table 1, we describe our implementation-specific service names, along with their respective input file names, input file IDs, output file names, output file IDs, and libraries/functions/environments. Our overall workflow for the three goals is displayed in Figure 10. Each workflow consists of steps that showcase how we worked towards completing our goals.

- A1, Customized venue recommendations. Input files (B1): venue_info.csv, venuerecordcount.txt, venue_groups.txt, venue_aliases.csv. Output file (C1): group_classifiers.zip. Libraries, Functions, Environments: Python, Jupyter Notebook, Keras, GitHub.
- A2, h5-index ranking system. Input files (B2): InputParser.py, scraping_program.py. Output file (C2): venue_indices.txt. Libraries, Functions, Environments: Google Scholar, Python, BeautifulSoup, Jupyter Notebook, GitHub.
- A3, Web application. Input files (B3): recommender.py, run.py. Output file (C3): none; output appears on authvenue.cs.vt.edu. Libraries, Functions, Environments: Virtual machine, CentOS 7, Nginx, Gunicorn, Flask, web browser, GitHub.
Table 1: Describes our implementation-specific service names, input file names and IDs, output file names and IDs, and necessary libraries/functions/environments.

Figure 10. Diagram of our overall workflow, with individual workflows consisting of steps covering each of our three goals.

Model Creation
Three .ipynb files were used in the process of model creation. They are all Jupyter notebook files and are described in Table 2. The groupClassifierCreator.ipynb file was used to create, train, and test group classifier models.
The journalClassifier.ipynb and conferenceClassifier.ipynb files were used to create, train, and test classifier models for journal and conference venues, respectively.

- groupClassifierCreator.ipynb: Used to create, train, and test group classifier models.
- journalClassifier.ipynb: Used to create, train, and test classifier models for journal venues specifically.
- conferenceClassifier.ipynb: Used to create, train, and test classifier models for conference venues specifically.
Table 2: Describes files utilized in the creation of classifier models.

Scraping
The script we created to obtain h5-index data, h5.py, scrapes data from Google Scholar Metrics [5]. The program searches by venue acronym, then by full venue name, with a few additional fallback checks. Upon completion, the data is stored in a .xlsx file, formatted as shown in Figure 11.

Figure 11. The data produced by h5.py is stored in a .xlsx file, containing venue acronyms, venue full names, and their respective h5-indices.

In order to use the script, provide a .xlsx file containing a list of venue acronyms and venue full names; an example input file is shown in Figure 12. One column should be named ‘Venue’ and the second column should be named ‘FullName’. In our case, we used fullvenueinfo.xlsx, but the input file can be changed directly in the code, as shown in Figure 13. Afterwards, the script can be run by invoking ‘python h5.py’ on the command line or directly through a Python script editor. The script will run and produce output based on the input file given. Upon completion, the script saves the h5-index data in a file named venues_h5.xlsx.

Figure 12. Example of an input file for h5.py.

Figure 13. Example of changing the input file that is read in h5.py.

Datasets/Scraping
To scrape Google Scholar Metrics for publication venue data, a web scraper was written in Python. This web scraper is a one-file program named “h5.py”.
This program utilizes Python’s BeautifulSoup library to make venue search queries to Google Scholar Metrics, where it obtains and saves a given venue’s h5-index to an array of h5-indices [5]. It obtains the search queries from fullvenueinfo.xlsx, a data file containing the full name and acronym of every venue utilized by the recommender. It then outputs every venue and its corresponding h5-index from that array to a data file called venues_h5.xlsx. The files utilized in the scraping process are also described in Table 3. The scraping program maintains a fairly high success rate (80-90%) when querying for venues and obtaining their h5-indices, but it will fail when querying a venue that does not exist in Google Scholar Metrics’ database. The program can be run from the command line by issuing the command ‘python h5.py’. It queries approximately 4-5 venues per second and produces constant output as it runs. The output first prints the venue acronym currently being queried. If the program fails to find an h5-index using the acronym, it prints ‘Trying Full Name’ and attempts to query the full name of the venue instead. Once it succeeds in retrieving an h5-index for a venue, it prints output in the form ‘Found {venue name} h5-index: {h5-index value}’. Otherwise, if it is unable to retrieve an h5-index for a venue, it prints output in the form ‘Cannot find venue {venue name}’.

To manually fill in missing data, first open the scraper’s output file, venues_h5.xlsx. Filter the ‘h5-index’ column for rows containing a value of ‘Cannot find venue’. This gives the names of the venues for which the scraping program was not able to automatically find h5-index values. The missing data can be found by making manual search queries of the venue names to other h5-index databases, such as Guide2Research or Scopus [6][7]. Additionally, to add information about new venues, rows can be added to the fullvenueinfo.xlsx file in the form {Venue Acronym, Venue Full Name}.
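The extraction step can be sketched with BeautifulSoup on a toy page. The HTML and class names below are stand-ins invented for the example; Google Scholar Metrics’ real markup differs and changes over time, which is one reason scrapers like this eventually need maintenance:

```python
from bs4 import BeautifulSoup

# Toy stand-in for a Google Scholar Metrics result row; the real page
# layout and class names differ.
html = """
<table>
  <tr>
    <td class="venue-name">Example Venue</td>
    <td class="venue-h5">87</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
cell = soup.find("td", class_="venue-h5")
h5 = int(cell.get_text()) if cell else None
print(f"Found Example Venue h5-index: {h5}")
```

When `find` returns nothing, the scraper falls through to the next query candidate, and ultimately prints the ‘Cannot find venue’ message described above.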
Any questions regarding the scraping process can be directed to Ali Youssef at aligy@vt.edu.

File Name            Description
h5.py                Web-scraping program written in Python that scrapes Google Scholar Metrics for h5-index data.
fullvenueinfo.xlsx   Data containing each venue's acronym and full name. Used by h5.py when scraping h5-index data.
venues_h5.xlsx       Data containing the h5-indices of all venues utilized in the project. This file is automatically produced by h5.py.

Table 3: Files utilized and created in the data scraping process.

Virtual Machine

As described in Section 5, the VM used for this project has the following specifications:
- 4 cores
- x86_64 architecture
- 16 GB RAM
- 1 TB disk space
- CentOS 7

To work on the virtual machine, we first had to SSH to rlogin.cs.vt.edu, as the VM can only be reached from a Virginia Tech IP address. Once connected to rlogin, we could SSH to authvenue.cs.vt.edu and log in as the root user. Full access permissions were crucial when setting up our application.

Once logged in, the first step was transferring the Flask application files from our local system to the VM. We utilized rsync, a fast file-copying tool that allowed us to efficiently transfer all of the application files remotely. If rsync is not an option, the scp command can also be used to transfer files to the VM, though it may be slower.

The application files were transferred and placed into the /root/GroupRecommender_flask/ directory. The next step was installing the required components with the command 'yum install python-pip python-devel gcc nginx virtualenv'. Within the /GroupRecommender_flask/ directory, a Python virtual environment was created in /root/GroupRecommender_flask/recommenderenv/ by running the command 'virtualenv recommenderenv'. This virtual environment contains all the Python libraries required to run the Flask application.
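The file-transfer and installation steps above can be collected into a short deployment sketch. The hostnames, package list, and paths come from the text; this is an illustrative ops fragment, not a turnkey script, and is run partly on the local machine and partly on the VM:

```shell
# On the local machine (from a Virginia Tech IP, e.g. via rlogin):
# copy the Flask application files to the VM.
rsync -avz ./GroupRecommender_flask/ root@authvenue.cs.vt.edu:/root/GroupRecommender_flask/

# On the VM, as root: install the required system packages.
yum install python-pip python-devel gcc nginx virtualenv

# Create the Python virtual environment inside the application directory.
cd /root/GroupRecommender_flask/
virtualenv recommenderenv
```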
To activate the virtual environment, use the command 'source recommenderenv/bin/activate'. The command line prompt changes to indicate that a virtual environment is now in use, as shown in Figure 14.

Figure 14. Screenshot of the virtual machine after activating the Python virtual environment. The command line prompt changes to add '(recommenderenv)', indicating usage of the virtual environment.

Within the virtual environment, the following commands were issued in order to install all required Python libraries:
pip install tensorflow==2.2.0rc2
pip install spacy==2.2.3
pip install scikit-learn==0.22.1
pip install keras==2.3.1
pip install numpy==1.18.1
pip install pandas==1.0.3
pip install nltk
pip install matplotlib
pip install flask
pip install gunicorn

Within the /GroupRecommender_flask/ directory, wsgi.py was created to serve as the entry point for the application when run under Gunicorn. This file simply imports the Flask instance from the application and runs it. To configure the VM to start Gunicorn and run the application on boot, a service file, recommender.service, was created in /etc/systemd/system. This file specifies the working directory and executable start locations of the application (the /GroupRecommender_flask/ directory) and the Gunicorn instance (within the virtual environment folder), and creates and binds a Unix socket file within the /GroupRecommender_flask/ directory called recommender.sock. The service could then be started and enabled using the following commands:
systemctl start recommender
systemctl enable recommender

Enabling a service is necessary for it to start when the VM boots. Once the recommender service was up and running, Nginx was configured by editing the nginx.conf file located in /etc/nginx/. This file was configured so that Nginx listened on port 80 (HTTP) and responded to our server's domain name, authvenue.cs.vt.edu, as shown in Figure 4.
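A server block of the kind described, listening on port 80 for authvenue.cs.vt.edu and proxying to the Gunicorn socket, might look like the following sketch. The socket path matches the recommender.sock location given above; the specific header lines are standard choices for this setup, not copied from our actual nginx.conf:

```nginx
server {
    listen 80;
    server_name authvenue.cs.vt.edu;

    location / {
        # Standard proxying headers so Gunicorn sees the real client.
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_pass http://unix:/root/GroupRecommender_flask/recommender.sock;
    }
}
```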
Additionally, standard proxying HTTP headers were set within this file so that Gunicorn receives information about the remote client connection, as shown in Figure 4. Once Nginx was configured, it could then be started and enabled using the following commands:
sudo systemctl start nginx
sudo systemctl enable nginx

Nginx will now run on boot and serve the Flask application through Gunicorn. Because Gunicorn was installed within our Python virtual environment, the environment must be activated before updating it with 'pip install -U gunicorn'. Nginx, which was installed through yum rather than pip, can be updated with 'yum update nginx'.

Future Work

Addressing future development, one key improvement is the runtime performance of Harinni's models during categorization. Currently, the models take several minutes to classify a paper for a group of venues. We believe this is related to the number of neurons connecting each layer. However, adding dropout between any of the layers, while it has a significant effect on runtime performance, significantly degrades the classification abilities of our models. Changing the model architecture to a recurrent neural network further degrades runtime, though it does not hurt classification as much as dropout does. We note that model training is comparatively fast, probably because we had easy access to hardware acceleration during training that is not readily available for long-term web app use.

There is also more to be done in terms of testing the program and its output. We did not have time to implement it this semester, but potential next steps for testing and assessment would be building test programs with files of input/correct-output pairs, along with scripts that run our program and compare its output against them.

There are some improvements that can be made to the website.
Gunicorn can potentially be better configured to improve the speed of the application; currently the recommender takes between 3 and 5 minutes to produce results. The Gunicorn configuration file is located at /root/GroupRecommender_flask/gunicorn.conf.py. The UI of the application can be improved by editing the static files located in /root/GroupRecommender_flask/Venue_Recommender/static.

We also need to refine the scraping program to ensure that we obtain an index for each venue, because it currently does not obtain data for every venue that was used in model training and output. This results in occasional venue outputs containing the phrase "cannot find venue" where the scraping program was unable to find an h5-index.

The current UI is quite basic, with no features other than inputting a title/abstract and obtaining a recommendation. Potentially useful features that could be added include filtering the recommender's results, user registration to manage a user's recommendation history, and additional pages that provide information about the recommender or venues. An improved logging system could also be developed, as currently only information about the most recent usage of the recommender is saved, in the app.log file located in /GroupRecommender_flask/VenueRecommender/logs. Additionally, the security of the website could be improved by migrating it from HTTP to HTTPS.

Any questions regarding the virtual machine or website can be directed to Ali Youssef at aligy@vt.edu.

9. Lessons Learned

The original problem proposed by our first client, Dr. Shaffer, was to find a relevant ranking system for the impact of conferences and journals, in order to better inform prospective researchers about where publishing their work would have the most impact. The problem we encountered through our research was that such systems already exist [5][6][7].
This put us in a difficult situation, as we were uncertain how to proceed given that our problem appeared to have already been tackled by various other projects across the web. We consulted with the professor and were directed to our potential next client, Harinni Kumar, who had a relevant project that we could build on for our own. This put us on the path to the project we pursued.

A lesson to be taken away from this is one of flexibility and resilience. In the real world, things often do not go as planned, and it is important to adapt to these situations in order to move forward. We had to change both our project and our client in order to land on a project that seemed feasible, was related enough to the original problem to still be of interest to our original client Dr. Shaffer, and was substantial enough to be an appropriate semester-long project for this course.

10. Timeline

Jan 31-Feb 6
- Initial meeting with Dr. Shaffer
- Preliminary research on existing ranking systems and APIs

Feb 7-13
- Continued research on ranking systems
- Meeting with Dr. Shaffer to discuss research and possible ways forward
- Meeting with Dr. Fox to discuss a new direction after our research found pre-existing solutions
- Initial email communications with Harinni, the second client that Dr. Fox directed us to
- First presentation about our progress so far

Feb 14-20
- Continued communication between Dr. Shaffer, Dr. Fox, and Harinni on how to work between the two clients to best have a project for both of them
- Initial meeting with Harinni to discuss her project and what we might be able to do with it
- Meeting with Dr. Fox again to discuss how to proceed

Feb 21-27
- Made the decision to move forward with building off of Harinni's project while also trying to answer Dr. Shaffer's problem
- Further communication with our clients and Dr. Fox informing them of this decision
- Updated our project page to reflect the change in direction
- Received the data used for the project from Harinni

Feb 28-March 6
- Meeting with Dr. Shaffer to go over the new project direction and show him the dataset we are working with
- Added to the GitHub repository to access the project
- Meeting with Harinni for a walkthrough of her code in the project

March 7-13
- Initial work on notebook files in the project
- Group meetings in class to discuss work on the project

March 14-20
- Continued group meetings in class
- Continued work on models used in the recommender

March 21-27
- Continued communications with Dr. Shaffer on the progress of the project
- Meeting with Dr. Fox to discuss the ranking portion of our recommender

March 28-April 3
- Initial work on interim report
- Continued work on the ranking system of the recommender

April 4-10
- Completion of both methodology and interim reports
- Completion of ranking system for the recommender

April 11-17
- Development of web application
- Added h5-index scores to interface

April 18-24
- Deployment of Flask web application

April 25-May 1
- Meeting with Dr. Shaffer and Harinni to discuss project progress

May 2-7
- Completion of Final Report and Presentation

11. Acknowledgements

We would like to acknowledge Dr. Shaffer, our original client who arranged the project topic; Harinni Kumar, our additional client who provided us with her existing project; and our professor, Dr. Fox.

Dr. Cliff Shaffer
Associate Department Head for Graduate Studies and Professor of Computer Science at Virginia Tech
Email: shaffer@cs.vt.edu
Phone: (540) 231-4354
Office: 2000A Torgersen Hall at Virginia Tech

Harinni Kumar
Software engineer at Walmart Labs
Email: kkharinni35@

Dr. Edward Fox
Professor of Computer Science at Virginia Tech
Email: fox@vt.edu
Phone: (540) 231-5113
Office: 2160G Torgersen Hall at Virginia Tech

12. References

[1] Vrettas, G. and Sanderson, M. Conferences vs. Journals in Computer Science. Journal of the Association for Information Science and Technology, 66: 2674-2684.
Published in JASIST, Volume 66, Issue 12, December 2015. Date Accessed: January 28, 2021.

[2] Harinni Kodur Kumar. ACM Venue Recommendation System. Thesis for Master of Science in Computer Science and Applications, Virginia Tech Department of Computer Science, Blacksburg, VA, 2019. Date Accessed: February 9, 2021.

[3] Nancy Davenport. Scholarly Research Impact Metrics. American University, 2018. Date Accessed: April 6, 2021.

[4] Liam Crilly. Nginx Documentation. 2021. Date Accessed: March 15, 2021.

[5] Google Scholar. Google Scholar Metrics Top Publications. 2020. Date Accessed: March 8, 2021.

[6] Bouchrika, I., & Mirabel, C. Guide2Research: Leading Research Portal for Computer Science. 2021. Date Accessed: March 8, 2021.

[7] Scopus. Scopus Journal Rankings. 2021. Date Accessed: March 9, 2021.

[8] Benoit Chesneau. Gunicorn Configuration Overview. 2021. Date Accessed: March 15, 2021.