High Performance Computing Cookbook for MSIS Data …



High Performance Computing - Information for MSIS Researchers

BUS ADM 896 Independent Study - Spring 2019
Department of Management Science and Information Systems
College of Management
University of Massachusetts Boston

Don Jenkins (don.jenkins001@umb.edu) and Tristan Stull (Tristan.stull001@umb.edu)
Instructor: Professor Jean-Pierre Kuilboer

Contents

Introduction
    Motivation
    What is an HPC environment?
    Facilities
    Information and Resources
Getting Started
    Connecting to the HPC
    Linux Background
    Command Line File Transfer
    Text Editing
    Types of Queues: Batch Jobs vs Interactive, Short, and Long Sessions
    Good Neighbor Policy ("Fair Use")
    Bash Scripting
    GUI Based File Transfers
Designing Experiments
    Suitable Experiments
    Experiment Profile (as a reference)
    Scale-up Approach
Helpful Links with useful information

Introduction

Motivation

We have prepared this cookbook as part of our independent study to share what we have learned about the high-performance computing (HPC) environments available to researchers here at UMB, and about the types of research for which these environments are appropriate. At the beginning of this effort, it seemed that many in the department (doctoral students and faculty) were not aware of what resources were available, how to gain access, or how to use them. To create meaningful analytical results for research and publication, many Ph.D. students in the MSIS Data Science Track will need to work with large data sets or complex models requiring significant computational power.
To date, students have run their experiments on personal laptops or individual school desktops, limiting the scope and scale of the models they can build and test. Several students have been using these single machines to run algorithmic experiments on very large, sparse networks (thousands to tens of thousands of nodes), when these analyses would be better suited to the available HPC environments. In at least one case, a colleague has been running moderately sized experiments on a desktop computer that take four to five days of uninterrupted processing to finish.

What is an HPC environment?

HPC stands for high-performance computing, which is based on clusters of professional-grade computers linked together to execute computationally expensive experiments efficiently (or, in some cases, to make them possible at all). Figure 1 shows a visual representation of a typical HPC cluster. For a more detailed explanation, see this training from the UMass Medical School.

Facilities

The good news is that we have fantastic resources available to use, for free. There is a high-performance environment ('Chimera') located on the UMass Boston campus, and a large-scale facility (a 'supercomputer') called MGHPCC (in this document, 'Green') located in Holyoke, MA. The Green assets are shared with all UMass campuses and a number of other Massachusetts-based universities. Both facilities are available, standing by, and already fully paid for, yet at present they are significantly underutilized by UMass Boston researchers. The university also has two knowledgeable, experienced, and helpful administrators to support students and staff conducting research using these resources. The only hard requirement is that system users abide by the Good Neighbor Policies (see below).
We should also provide appropriate attribution or citation in any published works or presentations.

Our aim is to get the Data Science researchers in the MSIS program to use these resources, which will introduce a new group of users to the system. To help these new users work effectively and efficiently, we recommend the following for anyone aiming to use the HPC resources:

- Attend the two-hour introductory workshops offered periodically on campus
- Take time to read the wikis and knowledge bases (links below)
- Join the applicable mailing lists to stay up to date about system activity
- Exchange know-how, best practices, and lessons learned with each other
- Learn the rules and 'cultural etiquette' of using these resources
- Prepare batch jobs as assiduously as possible using the Scale-Up Approach (see below)
- Respect the batch-oriented nature of resource-intensive shared environment processing

Figure 1 - Depiction of typical HPC cluster. Source: UMass Medical School slides on MGHPCC Wiki

In using the HPC resources, everything is handled with "soft boundaries," meaning it is theoretically possible to overuse UMB's allotment (or even the UMass system allotment). If you do, this could cause problems for our institution and possibly even increased costs for the department. That said, the HPC administrators also take pains to say that users should be cautious, but not overly so. They monitor activity and potential overuse, so unless someone is doing something quite malicious, such as hacking, the worst that is likely to happen is that an administrator has to step in and kill your job. Proper experiment design following the Scale-Up Approach covered later should mitigate the risk of overusing resources.

| Facility | Overview |
|---|---|
| Chimera | Need to be on campus network or use VPN |
| | Access request info for new accounts |
| | System summary: 1 master node + 9 computing nodes; the computing nodes have 2-4 CPUs with 8 cores, each with up to 256 GB RAM |
| | Bandwidth: Fibre InfiniBand (i.e., very fast internally) |
| | More details here |
| MGHPCC ('Green') | Access from anywhere (no VPN needed) |
| | Form for new accounts |
| | System summary: 13,312 CPU cores; 68 / 670 racks of nodes for UMass, including many GPU nodes with various Tesla GPUs; ~4,000 TB of storage; a lot of RAM |
| | Bandwidth: Fibre InfiniBand (i.e., very fast internally) |
| | More details here |

Table 1 - Summary of Chimera and Green HPC resources

Information and Resources

Numerous resources are available to introduce researchers to the HPC environments, from beginner to expert level. Below is a short list of presently available resources with associated links or contact information.

| Resource | Type | Notes |
|---|---|---|
| HPC Web Page for UMass Boston | Overview, knowledge base | Knowledge base access and Linux knowledge. |
| MGHPCC Wiki | Knowledge base | A bit disorganized, but has some useful bits and pieces, including slides from prior training sessions. |
| Dr. Jeff Dusenberry | HPC Admin, UMass Boston | jeff.dusenberry@umb.edu. Jeff also teaches workshops on HPC and Linux. Check here for schedules (none scheduled at time of writing). |

Table 2 - Available HPC resources

Getting Started

Connecting to the HPC

After you get your account set up (return to Table 2 if you have not), you will connect to the HPC using Secure Shell (SSH). To connect to Green, you just need internet access. Connecting to Chimera requires being on the campus network or logged in via the UMass Boston VPN. To get VPN access, a student or staff member needs to submit a request to the IT department. The connection steps differ by operating system.

Mac OS

When using a Mac, getting into the HPC is simple. All that is required is to open the Terminal window and type "ssh" followed by your account. You will be prompted for your password before being allowed to connect.
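
The login step can be sketched as a short shell snippet. This is a minimal sketch rather than the official procedure: the account name jdoe001 and the function name are hypothetical, and chimera.umb.edu is the Chimera host name given in the FileZilla setup later in this document. The snippet only prints the command, so that it can be checked before it is run.

```shell
#!/bin/bash
# Minimal sketch: build the ssh login command for an HPC account.
# HPC_USER is a hypothetical placeholder; replace it with your own account name.
HPC_USER="jdoe001"
# Chimera host name as given in the FileZilla section of this cookbook.
HPC_HOST="chimera.umb.edu"

# Print the login command for a given user and host (hypothetical helper).
ssh_login_command() {
    printf 'ssh %s@%s\n' "$1" "$2"
}

ssh_login_command "$HPC_USER" "$HPC_HOST"
```

Typing the printed command into a terminal (on campus or over the VPN, in the case of Chimera) should then prompt for the account password.
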
Figure 2 shows the Mac terminal window during the initial login to the Chimera HPC resources.

Figure 2 - Terminal screenshot of Chimera login

Microsoft Windows

Setting up a Windows computer involves a few more steps. Most users will use an SSH client such as PuTTY to make the connection to the HPC and then enter their Linux commands in the resulting terminal window. The HPC Wiki provides instructions for connecting a Windows machine to both HPC environments at this handy link.

Linux Background

Use of HPC environments requires knowledge of Linux, as it is the operating system on which all of the environments run. Typically, one works in Linux from a command line interface (CLI), which is what SSH provides. However, there are also some client/server processes in which one uses a graphical user interface (GUI) on the client machine (e.g., your laptop computer). Secure FTP using FileZilla is an example of this (more on this below). Do not be afraid of Linux! Linux has a learning curve, but it is very rewarding. In fact, it changes your entire frame of reference with regard to computing. Finally, it has a cute penguin logo, as shown in Figure 3. That should put you at ease when you find yourself frustrated at some point in your HPC usage.

Figure 3 - Linux icon

While a full Linux tutorial is out of scope for this cookbook, we include some helpful Linux links below, and a short list of handy Linux commands is shown in Table 3.
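
The basic file commands from Table 3 can be tried safely before touching any real data. The following is a minimal practice sketch, assuming a bash shell with the standard mktemp utility; the file and directory names are only illustrations, and everything happens inside a throwaway scratch directory.

```shell
#!/bin/bash
# Practice the basic file commands in a throwaway directory.

# mktemp -d creates a fresh scratch directory, so nothing important is touched.
SCRATCH=$(mktemp -d)
cd "$SCRATCH"

mkdir newdir                 # make a new directory
echo "hello" > thisfile      # create a small file to work with
ls                           # list what is in the current directory
mv thisfile newdir/          # move the file into the new directory
cd newdir                    # change into the new directory
ls -l                        # long listing, including permissions
rm thisfile                  # delete the file (there is no undo)
cd "$SCRATCH"
ls newdir                    # prints nothing: the directory is now empty
```

The same commands work unchanged after logging in to Chimera or Green, since both run Linux.
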
There are many helpful resources available online that provide Linux tutorials. A reasonably good short list of the resources used in preparing this document includes:

- A tutorial for beginners, with a full series of tips, tricks, and tools
- Stack Overflow, for nearly unlimited questions and answers from beginners to experts
- Tutorials at multiple levels of experience
- The Linux Basics Course, a very good series of 54 videos from beginner to expert

When using the Linux command line, you can look up what a command does and all of its options by typing "man" followed by the command you are interested in. Be warned that the screen dump from "man" can be a bit overwhelming in some cases, but it should not be considered a roadblock.

| Command | Comments |
|---|---|
| ls | Lists the files in the current directory |
| cd | Changes the current directory to the directory you type after the command (e.g., cd newdir) |
| mkdir | Makes a new directory inside the current directory (e.g., mkdir newdir) |
| module avail | Shows what modules are available for you to use |
| module list | Shows what modules are currently loaded |
| module load | Loads the modules you need, such as gurobi, python, R, etc. (e.g., module load gurobi, or module load R/3.5.1; R). Be specific about which version of the software you are loading if multiple versions are available |
| mv | Moves a file from one directory to another (e.g., mv thisfile newdir/) |
| rm | Deletes a file. Use with extreme caution: there is no way to recover a file deleted in this manner (e.g., rm thisfile) |
| history | Shows the history of commands you have run. Extremely helpful for seeing what you did in past sessions and have since forgotten |
| emacs & vi | Text editors that allow you to modify text files within the terminal window. Use them for simple edits; otherwise we recommend that beginners edit in a local text editor and transfer the file to the server |

Table 3 - Handy Linux commands for beginners

Command Line File Transfer

One can use command line tools, such as PuTTY on Windows or the Terminal window on a Mac, or, as discussed later, helpful GUI tools (e.g., FileZilla) that streamline and simplify file transfer. The simplest command line file transfer looks like this:

    rsync -av -e ssh hello_world.sh ts17b@ghpcc06.:

rsync will prompt you for the password, and then do the copy. If successful, you will get output such as:

    building file list ... done
    hello
    sent 204 bytes  received 42 bytes  44.73 bytes/sec

Text Editing

Text editing is likely easier to do on your local computer, using a text editor you are familiar with, and then moving the files over using file transfer. That said, if you are comfortable working on the command line for simple scripts and the like, you can use built-in text editors such as emacs or vi. Movement within these editors is limited, and you need to remember a few key commands to save, edit, and exit. There are readily available references for both emacs hotkeys and vi hotkeys. To enter the editor, type the command followed by the name of the file you are opening, as shown in the two images in Figure 4.

Figure 4 - Looking at R script for Gurobi model as shown in emacs editor

Types of Queues: Batch Jobs vs Interactive, Short, and Long Sessions

All work processed on an HPC environment is submitted as a job. There are three types of jobs: interactive, short, and long. You can specify the type of job you are submitting. In each case, the work is assigned by a piece of software called the job scheduler (residing on the head node) to one or more processing nodes.
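
As a rough sketch of how the job type translates into a submission command on Green, the snippet below picks a batch queue from an expected run time, using the limits described in this section (short jobs up to four hours, long jobs up to 30 days). The function names are hypothetical; the queue names short and long and the bsub flags -q and -W are the ones that appear in the examples in this section. The script only prints the command rather than submitting anything.

```shell
#!/bin/bash
# Sketch: choose a batch queue on Green from the expected run time in minutes.
# pick_queue and print_bsub_command are hypothetical helper names.

pick_queue() {
    local minutes=$1
    if [ "$minutes" -le 240 ]; then        # four hours or less
        echo "short"
    elif [ "$minutes" -le 43200 ]; then    # up to 30 days (30 * 24 * 60 minutes)
        echo "long"
    else
        echo "run time exceeds the 30-day limit" >&2
        return 1
    fi
}

print_bsub_command() {
    local minutes=$1
    local script=$2
    local queue
    queue=$(pick_queue "$minutes") || return 1
    # -q selects the queue; -W sets the run-time limit in minutes
    echo "bsub -q $queue -W $minutes ./$script"
}

print_bsub_command 30 hello_world.sh     # a 30-minute job
print_bsub_command 600 hello_world.sh    # a 10-hour job
```

Keeping the choice in one small function makes it easy to adjust if the administrators change the queue limits.
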
An interactive job means that you take over a computing node and have it to yourself for the duration, which is capped by the administrators at six hours. A short job is a batch job that takes four hours or less. A long batch job is anything that takes longer than four hours (up to 30 days). Along with the type of job, one can also specify the resources to use when submitting the job on the command line. If resources are not specified, the system will assign a default set of resources.

Example batch jobs on Green (Chimera has slightly different commands):

Example 1:

    bsub "echo Hello MSIS! Welcome to HPC! > ~/firstjob.txt"

Trivially outputs the text to a file in the home directory.

Example 2:

    bsub -q short -W 1:00 -R "rusage[mem=2048]" -J "Myjob" -o hpc.out -e hpc.err wc -l reads.fastq

This job specifies the queue (short), a run-time limit of one hour (-W 1:00), how much memory to use (2048), a job name (Myjob), the file for standard output (hpc.out), and the file for any error messages (hpc.err). The "rusage" resource string can also be used to specify how many cores to use on a node. This is not typically required.

Example 3:

    bsub -q interactive -W 120 -Is bash

Submits an interactive job for 120 minutes, specifying the bash shell as the interface.

Most data experiments requiring HPC resources should be submitted as batch jobs. Interactive sessions, limited automatically by the system to about six hours or less, are intended only for actual, in-person, constant interaction with the analysis. This is considered a less common use case in the HPC environment. One example that comes to mind is working with a data set using an interactive analytic approach such as an IPython shell. The point is that batch jobs are easier to manage, whereas interactive sessions take up resources and lock others out. There are many resources online to help you gain a better understanding, but we recommend these two: Reference 1 and Reference 2 (from Slide 50).

Good Neighbor Policy ("Fair Use")

Don't lock others out!
For example, don't log in to an interactive processing session and leave it running all day while going off to do other things. Further, if you feel that your experiment is likely to consume enormous amounts of resources other than CPU and RAM, such as aggregate bandwidth or disk storage, you should reach out to the administrator to discuss it. The default storage allotment on Green is 50 GB, which is your home directory. Requests for larger storage allotments on Green should be made by the Principal Investigator (for students, this is generally the faculty advisor). There is no set limit on the amount of storage that can be made available. It may be possible to get as much as a petabyte for a limited period of time (say, one month).

Bash Scripting

"Bash" is the "Bourne Again Shell": a command line interpreter (which is a type of interface to the system). It is capable of processing 'simple' programs ('scripts'), which are generally oriented towards automating file and directory operations that would otherwise be annoyingly repetitive.

One writes a Bash script using any text editor (e.g., your local text editor, with the script then uploaded to the HPC). To execute a Bash script, you first have to make it executable in Linux (read up on Linux permissions). [Note: we noticed that Green is set up so that any file with the suffix ".sh" seems to be given execute privileges automatically in the user home directory.]

Example: a simple 'Hello world' script in Bash.

    #!/bin/bash
    # simple hello, world! script
    echo "Hello, World!"

The first line tells the operating system to use the bash interpreter. The second line is a comment. The third line simply outputs the text "Hello, World!" to the default output device (the terminal screen).

First, check the permissions 'mode' of the file.
We need to know if it will execute.

    ls -l
    -rw-rw-r-- 1 ts17b 52 Feb 8 11:06 hello

Note that it has read/write permission for the first two classes (owner and group), read permission for the third (everyone else), and no execute privileges at all (the missing 'x').

Change the mode to add execute privileges:

    chmod +x hello

Check that the privileges have been correctly set:

    ls -l
    -rwxrwxr-x 1 ts17b 52 Feb 8 11:06 hello

Run the script:

    [ts17b@ghpcc06 ~]$ ./hello
    Hello, World!

Not very exciting, but every journey starts with a first step. From here, it is all about basic programming skills and getting up the learning curve.

Next example: here is a slightly more useful script that takes one or more file names as input and counts the number of lines in each file (assuming plain text). It does this by calling a program called "cat", which simply reads a file and writes its contents to default output. Then the pipe symbol "|" is used to redirect that output to another program, "wc -l", which actually counts the lines for us. As you can see, the script loops over the file names given on the command line and reports the line count for each.

    #!/bin/bash
    # Count the lines in each file named on the command line.
    function num_lines {
        cat "$1" | wc -l
    }

    for file in "$@"
    do
        lines=$( num_lines "$file" )
        echo "The file $file has $lines lines"
    done

To learn more about these utility programs, type "man cat" and "man wc" at the Linux command line ('man' is Unix for "manual pages"). As you will see, "wc" is actually 'word count', but with the flag '-l' set it counts lines. There are hundreds of such useful programs in the Unix world. If you are interested in learning more, pick up an old copy of Unix Power Tools.

GUI Based File Transfers

Transferring files via the command line can be cumbersome if you aren't using the system on a daily basis. Fortunately, there are free software packages with a graphical user interface that make it easier to visualize your file structures and to transfer files to and from the HPC system.
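
For readers who still prefer the command line for routine transfers, the typing can be reduced by wrapping the command in small functions. This is a minimal sketch under stated assumptions: the account name jdoe001 and the function names are hypothetical, chimera.umb.edu is the Chimera host name used in the FileZilla steps of this document, and the rsync form follows the one in the Command Line File Transfer section. The functions only print the commands, so they can be reviewed before being run.

```shell
#!/bin/bash
# Sketch: print rsync commands for moving files to and from Chimera,
# following the "rsync -av -e ssh" form used earlier in this cookbook.
HPC_USER="jdoe001"            # hypothetical account name; use your own
HPC_HOST="chimera.umb.edu"    # Chimera host name from the FileZilla setup

# Command to copy a local file into the remote home directory.
# The trailing colon means "the home directory on the remote host".
upload_command() {
    printf 'rsync -av -e ssh %s %s@%s:\n' "$1" "$HPC_USER" "$HPC_HOST"
}

# Command to copy a file from the remote home directory into the current directory.
download_command() {
    printf 'rsync -av -e ssh %s@%s:%s .\n' "$HPC_USER" "$HPC_HOST" "$1"
}

upload_command hello_world.sh
download_command hpc.out
```
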
The software we used for creating this cookbook is called FileZilla and is available for Linux, Mac, and Windows operating systems. We include a quick walkthrough of how to download and set up FileZilla on a Mac, but the setup for Windows appears to be just as simple.

First, go to the FileZilla website and download the FileZilla client to your computer. Make sure to choose the client link, not the server link. There are easy-to-follow directions and choices for your operating system.

Figure 5 - Download page for FileZilla client

Once you have installed the FileZilla client, you will need to set it up to connect to the HPC. Below we demonstrate how to do it for Chimera, but the setup for Green is essentially the same with just a different host name. The steps to connect are:

1. Start FileZilla and open the Site Manager from File > Site Manager, as shown below.

   Figure 6 - Starting the Site Manager in FileZilla

2. Make the following changes in the Site Manager, as shown in Figure 7:
   - Change Protocol to SFTP - SSH File Transfer Protocol
   - Type in the server under Host. Chimera is simply chimera.umb.edu
   - Type in 22 for the Port
   - Set the Logon Type to Normal
   - Type in your username, without the @ and the info that follows
   - Type in the password for your Chimera access
   - Rename the connection (lower left button) and save it

   Figure 7 - Configuring HPC site in FileZilla

3. To go online, press Connect at the bottom of the screen and monitor the status to see when you are connected, as shown in Figure 8.

   Figure 8 - Connecting to the HPC site in FileZilla

As shown in Figure 9, you should see the local files on the left-hand side and the Chimera file structure on the right-hand side. At this point, you can manage your files on Chimera as simply as you would on your local machine.
Simply drag and drop files into the directories you would like them to be in, and do any housekeeping of your files on Chimera.

Figure 9 - FileZilla view when connected to HPC

That is all it takes to use the FileZilla client to transfer and manage your files on the HPC. There are other clients that you can download and use. This link provides a good discussion of the different software packages, with the advantages and disadvantages of each.

Designing Experiments

Suitable Experiments

HPC environments expect computationally intensive experiments. A good example is the type of experiment Dr. Pajouh and his group do with optimization models implemented on large, sparse graphs that can be of exponential complexity. Even with branch-and-bound or branch-and-cut modifications to reduce the problem size and complexity, these tasks generate a significant computational workload on even moderately sized data sets. Another example is text analysis of large corpora using an approach such as Expectation-Maximization. The algorithm may be asymptotically efficient, yet still require large processing resources on any "interestingly sized" data set.

A good use case for the HPC resources is a model that requires significant computational horsepower, such as a large-scale optimization problem run on Gurobi. A bad use case is harvesting enormous numbers of tweets from Twitter over an extended period of time, or doing large-scale Web scraping. These "bad" tasks can put a heavy burden on the campus's shared external bandwidth, which is a limited resource. Another bad example might be a process requiring intense, high-volume interaction with a GUI client. While VNC is available, it is not ideal for low-latency, high-volume visual work.

Experiment Profile (as a reference)

| Characteristic | Value (sample values) | Remarks |
|---|---|---|
| Experiment Name | (tristanstull001_UMS_optim_04012019) | Pick a name for the experiment. |
| Target Environment | (GHPCC) | E.g., Chimera HPC, GHPCC. |
| Batch or Interactive? | Batch | Specify whether the experiment will be submitted as a batch job or run interactively. |
| Description | (Multiobjective Linear Optimization of UMS data from April 2019 using Simplex algorithm via Gurobi Optimizer 8.1.0.) | A short description of the experiment. |
| Executable | (ums_1_simplex_constraintset001.sh) | Examples: experiment_name.sh, invoking R/3.4.0; experiment_name, a C++ program invoking gurobi810. |
| Target Queue | (Large.) | See information about queues here. |
| Initial Storage Volume | (~10 MB) | Amount of storage required before the program runs (input data). |
| Execution Storage Volume | (~10 MB) | Amount of storage required during and/or after the program has run (e.g., output data). |
| Peak Execution RAM | (150 GB) | Based on prior experience. Best approach: build up a history by analyzing the actual completion statistics and resources used on a series of smaller jobs. |
| Peak Execution CPU cores | (32) | Ibid. |
| GPU requirements | (None.) | Specify whether and how many GPU cores are required and, if applicable, what type. Note: jobs requiring GPU must be submitted to the GPU queue. |
| Expected Execution Time | (~30 minutes.) | Execution time helps determine which queue is appropriate. Longer experiments can be run for up to 30 days on the long queues. |

Table 4 - Experiment profile checklist

Scale-up Approach

1. Local Machine. Work out the model, code and debug, and test with a small data set.
2. Chimera. Upload, compile (if needed) to run on Linux, and work out the Linux commands (note: some software may be available on Green but not on Chimera). Run with data until it is clear that Chimera is no longer up to the task.
3. Green. Now, with all the code debugged and compiled for Linux, and the main experiment data set ready, you are ready to run the experiment on Green. First upload the code and data, and then set up an appropriate batch job.
   If your experiment is enormous, check current performance (while on campus or VPN), as well as current batch jobs ('bjobs' on Green, 'jobs' on Chimera), to see with whom you might be competing. Consider contacting these researchers and/or the administrator to coordinate, if needed.

Notes:

- If GPU is needed, it is a different kind of job. See Wiki.
- If parallel processing is needed, check whether the application supports it (e.g., R does not 'out of the box'; Python appears to, using the multiprocessing module; caution: the author has no experience with this to date).
- Software installed in the user's home directory (as opposed to a system-wide install) cannot be run in distributed mode (distributed across multiple nodes). If this is needed, work with the administrator to get the software installed. This is easier for Chimera than for Green (but not impossible).

Helpful Links with useful information

- Harvard MGHPCC reference material: Contains much information about the Harvard instance of Green, much of it applicable to UMass Boston students.
- Case Western HPCC reference material: Contains solid walkthroughs of how to use an HPC environment at Case Western, but much is applicable to UMass Boston students.
- Introduction to High Performance Scientific Computing: High-level overview of how to use HPC resources in scientific research.
- Linux Toolbox: A solid reference for using Linux.
- A Beginner's Guide to High-Performance Computing (Oregon State University): An entry-level discussion of HPC computing from a beginner's perspective.
- Introduction to Linux: Command Line Basics (video): An hour-long video with many of the basics needed for your first go at using Linux.