Large-Scale Cluster Computing Workshop



Large-Scale Cluster Computing Workshop, FNAL

Day 1: 5/22/2001

General information (Dane)

Mail presentations to lccws@

A cluster builders' handbook will be based on notes from the workshop

Goals (Alan)

HEPiX formed a Large Clusters SIG last year

SIG should run “special” meetings and this (LCCWS) is one of those.

Primary goal: gather practical experience and build the ‘cluster builders guide’

Welcome (Matthias)

D0 data handling: SAM – GRID enabled. Remote clients: Lyon, NIKHEF, Lancaster, Prague, Michigan, Arlington

LHC computing needs (Wolfgang)

The IEEE CS Task Force on Cluster Computing (TFCC) (Gropp)

(gropp@mcs.)

Set up standards. Involved with issues related to the design, analysis and development of cluster systems, as well as the applications that use them.

• Task force with short-term lifetime (2-3 years)

• Cluster computing is NOT just parallel, distributed, OSs or the Internet. It is a mix of them all.













• TFCC whitepaper:

• TFCC newsletter:

• TopClusters project. Numeric, I/O, web, Database and application level benchmarking of clusters

• TFCC: over 300 registered members. Over 450 on the TFCC mailing list tfcc-l@bucknell.edu

• TFCC future plans: cease as a task force and attain full Technical Committee status.

Scalable clusters

• List: 26 clusters with 128+ nodes (8 with 500+ nodes). Does not include MPP-like systems (IBM SP, SGI Origin, Compaq, Intel TFLOPS, etc.)

• Scalability, practical definition: operations complete “fast enough” -> 0.5 to 3 seconds for “interactive”.

• Operations are reliable: approach to scalability must not be fragile.

• Developing and testing cluster management tools: requires convenient access to large-scale system. Can it co-exist with production computing?

• Too many different tools

o Why not adopt the UNIX philosophy? Example solution: “Scalable Unix tools”. These are just parallel versions of common Unix commands like ps, ls, cp, …, with appropriate semantics. Designed for users but found useful by administrators (see the sketch after this list).

o The basic Unix commands (ls, …) are quintessential tools

o Commands carry “pt” names (e.g. ptcp, ptdisp)

o Performance of ptcp: copying a single 10MB file to 241 systems took 14 seconds (100BaseT)

o Based on a very simple daemon forking processes under the user’s id. Authentication relies on trusted hosts.

o Implementation layered on MPI.

o GUI tool: ptdisp

o Open Source. Get from

o Needs an MPI implementation with mpirun. Developed with Linux, MPICH, MPD, on Chiba City at Argonne.

• Chiba City: a resource available for research on scalability issues.

• Large programs:

o DOE Scientific Discovery through Advanced Computing (SciDAC)

o NSF Distributed Terascale Facility (DTF)

o OSCAR: goal is a “cluster in a box” CD. Target smaller clusters (sponsored by Intel)

o PVFS (Parallel Virtual File System)

o Commercial Efforts: Scyld, etc.

• Q: the model we have for a small Unix box does not necessarily scale. A: it is maybe not an elegant model, but it is a working model.

• Q (Tim): why not just ptexec? A: to ease the learning curve.
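
The “Scalable Unix tools” described above are parallel versions of ordinary commands layered on MPI. As a rough illustration of the idea only (not the actual ptools code, which is implemented in C on MPI), here is a minimal Python sketch using mpi4py; the command run and the hostname-prefix output format are assumptions:

    # pt_ps.py - toy "parallel ps" in the spirit of the Scalable Unix tools.
    # Run with e.g.:  mpirun -np 8 python pt_ps.py   (assumes mpi4py is installed)
    import socket
    import subprocess
    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    # Each rank runs the command locally and prefixes every output line with
    # its hostname, mirroring the semantics of ptps/ptls/ptcp.
    host = socket.gethostname()
    out = subprocess.run(["ps", "-ef"], capture_output=True, text=True).stdout
    tagged = ["%s: %s" % (host, line) for line in out.splitlines()]

    # Rank 0 gathers everything and prints it in one place.
    gathered = comm.gather(tagged, root=0)
    if comm.Get_rank() == 0:
        for block in gathered:
            for line in block:
                print(line)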

Linux farms at RCF

The Linux farms provide the majority of the CPU power in the RCF. Mass processing of raw data from the RHIC experiments. Mostly rack-mounted due to space restrictions.

• 338 Intel-based nodes (duals and quads), ranging from 200MHz to 800MHz. ~18,176 SpecInt95

• RH6.1 with 2.2.18 kernel; the newer kernel is taken advantage of to bind disks together

• OS upgrades over the network or initiated via bootable floppy. Master Linux image kept in a separate machine.

• AFS (mostly OpenAFS) and NFS servers for access to user software and data.

• CRS: Central Reconstruction System

o Interface with HPSS for access to STK silos

o Jobs submitted to batch server nodes are relayed to the batch master node, which interfaces with HPSS and the CRS nodes.

o Checks that HPSS and NFS are up, to minimize job losses

• CAS: Central Analysis

o Jobs run under LSF or interactively

o LSF queues separated by experiment and by priority

o LSF license cost is an issue

• Monitoring software:

o Web interfaces and automatic paging systems (also for users)

o System availability and critical services (NFS, NIS, SSH, LSF) are checked (see the sketch after this section)

o Data stored on disk and backed up

o Cluster management software collects vital hardware statistics (VA Linux)

o VACM for remote system management

• Security:

o Transitioning to OpenSSH. The few systems upgraded haven’t had any problems. The issue for farm deployment is Kerberos token passing, so that users don’t need to authenticate twice.

• CTS for trouble ticket system
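
The monitoring bullets above (availability checks for critical services plus automatic paging) follow a simple pattern. A minimal sketch, assuming a plain TCP port probe and an SMTP gateway for the pager; all host names, ports and addresses are placeholders, not the actual RCF setup:

    # probe_and_page.py - toy availability check with e-mail paging.
    # Host names, ports and mail addresses are hypothetical placeholders.
    import socket
    import smtplib
    from email.message import EmailMessage

    CHECKS = [
        ("nfs01.example.org", 2049, "NFS"),
        ("login01.example.org", 22, "SSH"),
    ]

    def port_open(host, port, timeout=5):
        """Return True if a TCP connection to host:port succeeds."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def page(subject, body):
        """Send a short alert mail to a pager gateway (placeholder addresses)."""
        msg = EmailMessage()
        msg["From"] = "farm-monitor@example.org"
        msg["To"] = "oncall-pager@example.org"
        msg["Subject"] = subject
        msg.set_content(body)
        with smtplib.SMTP("mailhost.example.org") as s:
            s.send_message(msg)

    for host, port, service in CHECKS:
        if not port_open(host, port):
            page("%s DOWN on %s" % (service, host),
                 "Automatic check failed for %s (%s:%d)" % (service, host, port))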

BaBar Clusters (Charles Young, SLAC)

• Migrating to Linux (and Solaris)

• Reconstruction cluster is Solaris based.

o Up to 200 CPU farm

o Not using batch but a customized job control

o Scaling issues: tightly coupled systems, serialization or choke points, reliability concerns when scaling up.

• Pursuing Linux farm nodes:

o VA Linux 1220 – 1 RU, 2 CPU, 866MHz, 1GB

o Some online code is Solaris specific

• MonteCarlo

o 100 CPU farm nodes. Mix of Linux and Solaris

o LSF

• Analysis Cluster

o Data stored primarily in Objectivity format

o Disk cache with HPSS back end

o Levels: Tag, Micro, Mini, Reco, Raw, etc.

o Varied access patterns

▪ High rate and many parallel jobs to Micro

▪ Lower rate and fewer jobs to Raw.

• Analysis farm: ~200 CPUs, Solaris. Gigabit Ethernet

• BaBar computing is divided into Tier A, B and C. Three Tier-A sites: SLAC, IN2P3, RAL

• Offline clusters are more likely to be GRID adaptable

o Distributed resources

o Distributed management

o Distributed users

Fermilab offline computing Farms (Stephen Wolbers)

• Batch system is the Fermilab-written FBSNG (Farms Batch System New Generation), a product which is the result of many years of evolution (ACP, cps, FBS, FBSNG)

• No real analysis farms.

• The farms are designed for a small number of large jobs and a small number of expert users

• 314 PCs + 124 more on order

• CDF: two I/O nodes (big SGI). Tape system directly attached to one of them

• Data volume per experiment per year roughly doubles every 2.4 years

• Future: disk farms may replace tapes. Analysis farms?

HPC in the Human Genome Project (James Cuff)

• The Sanger Centre was founded in 1993; >570 staff members now

• Data volume: 3000 MB database for Human Genome.

• IT farms

o >350 Compaq Alpha systems

o +440 nodes annotation farm

o -> 750 alphas

• Front-end compute servers to login to

• ATM backbone.

• LSF

• Computer systems architecture: Fiber channel/memory channel Tru64 (V5) clusters.

• Annotation farms: 8 racks of 40 Tru64 v5.0 nodes. Total 320GB memory, 19.2TB of internal (spinning) storage.

• Two network subnets (multicast and backbone)

• Highly available NFS (Tru64 CAA)

• Fast I/O (ATM -> switched full-duplex Ethernet)

• Socket data transfer (rdist, ..)

• Modular supercomputing

• Immediate future: looking heavily into SAN. Wide Area Clusters. GRID technology.

• Global compute engines

Linux clusters of the H1 experiment at DESY (Ralf Gerhards)

• ~50 nodes

• SuSE linux v3

• Batch system: PBS (some problems integrating it with AFS). No support contract with the PBS people.

• Data access: AFS, RFIO, Disk Cache

• H1 framework (for event distribution) is based on CORBA

• Conclusions:

o H1 computing fits well with Linux based desktop environment

o Investigate new data storage models

▪ Move processes to data

▪ Use linux file servers

PC Clusters at KEK (A.Manabe)

• Belle PC cluster:

o >400 CPUs by 3 clusters

o Cooperation with SUN servers for I/O

o Number of users is small. The code base is ~7M SLOC of C++; access is scheduled through LSF, but a set of core developers are allowed to log on to the servers directly. A full build takes ~24 hours.

• Network: 9 Cisco 6509 switches. Farm nodes are connected over 100BaseT.

• Q (Tim): who mounts NFS? A: mostly farm nodes, using the automounter with soft mounts. Have seen mount storms but no stale mounts. (Tim) we see stale mounts with soft mounts.

• Staffing:

o 7 sysadmin

o 3 mass storage

o 3 applications

o 1 batch

o 4 operations

o 0 operators! Works well.

o Same staff supports most UNIX desktops

• Approaching ~100 systems per staff member. To cope with this they always aim to reduce complexity.

• Issues: limited floor space. Power + cooling requirements go up per rack.

o With no operators, remote power and console management become important

o Console servers with up to 500 serial lines per server

o Installation. Haven’t been very successful with “burn-in”.

o Maintenance. Frequent problem with divergence from original models (e.g. physical memory)

o Need database and bar-code to keep track of machines

• System admin:

o Network install. Using KickStart and JumpStart. 256 machines in < 1hr

o Patches and configuration management done with a homemade system (~10yrs)

o Nightly maintenance

o System Ranger for monitoring (local tool)

o Report summarization: reporting becomes massive. A tool condenses reports to present common reports for groups of machines (see the sketch after this section)

• User Application issues:

o Workload scheduling starts to become complicated

o Startup effects. E.g. a user submits a job to 300 nodes and the loading of the executable kills the fileserver

o Mount-storms with AMD

• Q (Tim): does anybody do anything to dump Linux? No
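
The report-summarization item above (condense per-machine reports into common reports for groups of machines) can be illustrated as follows. A minimal sketch, assuming one report file per host under reports/ (a hypothetical layout, not the local tool):

    # condense_reports.py - group identical report lines by the set of hosts
    # that produced them, so 300 identical messages are printed only once.
    # Assumes one report file per host in reports/<hostname>.txt (hypothetical).
    import glob
    import os
    from collections import defaultdict

    hosts_by_line = defaultdict(set)

    for path in glob.glob("reports/*.txt"):
        host = os.path.splitext(os.path.basename(path))[0]
        with open(path) as f:
            for line in f:
                hosts_by_line[line.strip()].add(host)

    # Print each distinct message once, with the hosts (or just a count) affected.
    for line, hosts in sorted(hosts_by_line.items(), key=lambda kv: -len(kv[1])):
        if len(hosts) > 5:
            print("[%d hosts] %s" % (len(hosts), line))
        else:
            print("[%s] %s" % (",".join(sorted(hosts)), line))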

Farm Cluster observations (BNL, Thomas A. Yanuklis)

• Machine life-cycle: at RCF machines have a life span of 3 years before becoming obsolete

• Operational notes: no benchmarking. Only minor differences between interactive and batch systems (homogeneous clusters)

• Power and cooling are becoming a problem. Recently installed new (30 tons?) AC unit

• Space seems to be less of a problem given greater rack density. Tradeoff is power and cooling.

• Software upgrades: who proposes and approves the changes? Users or administrators?

• Have tested an IBM Cable Chain System that uses “System Management Processors”: PCI cards with onboard Ethernet; commands are issued to the cards over a private network. Interesting product.

PDSF computing model (Thomas Davis, ASG/NERSC, LBNL)

• NERSC is a supercomputer center

• PDSF originally came from the SSC (originally: Physics Detector Simulation Facility); it was crated and moved to Livermore.

• No new hardware, software for close to 8 years.

• PDSF is not an experiment-oriented cluster. Buy-in service model, e.g. STAR. Clients pay for fixed slices of CPU time/disk space; clients have no direct ownership of the cluster. PDSF admins have the right to move resources around when needed.

• Clients can get more CPU power than what they buy: if no one else is using CPUs in the cluster, the resources are available to other users.

• Do not buy hardware support or maintenance. Hardware is bought with warranty and when it ends the system is used until it dies.

• Expect to retire any Intel based system after 3 years

• Disk size is climbing even faster than Moore’s law.

• Memory constraints can also force early retirement of systems.

• Always trying to buy machines as late as possible.

• Software:

o LSF for batch queueing

o Control of software is rigorous; because PDSF hosts many experiments, any software changes are controlled

• No space, power or cooling problems. Brand-new $35M machine room. Sized for an SP/2 installation

Panel

• Acceptance tests with applications. (James Cuff) burn-in and acceptance test with a “kill cluster” csh script. (Tim) CERN puts in the tender that the vendor performs a specified acceptance test.

• Handling of differences in OS distributions. If an OS version or kernel gives a problem who deals with it and how? Only solved with highly competent administrators reporting the problems to the vendors. How can trusted relations be built?

• VA Linux puts a lot of effort into the OS/kernel/drivers to run efficiently on their hardware.

Parallel session B1: Data Access, Data Movement

• What do people think about SANs?

o FC

o iSCSI

• Common cluster & WAN or GRID

• Rather large storage in commodity clusters

• Transparent media

• Transport protocols

• Performance

Sanger Institute: James Cuff

Key project: Ensembl ()

Automatic annotation system for large eukaryotic genomes.

Data:

• Multi-gigabyte DNA and protein sequence data

• Flat file and binary index data formats: strings

• MySQL instances with single tables (files) >8Gb

• Total DB size of >25Gb, cache ca. >3Gb of feature space in memory

• All of this needs to be highly available, queried, scanned, …

• Sequence databases are propagated to the farm so that /data/… on every node stays in sync (dropbomb.sh): it does an rdist to each of the rack masters, which in turn rdist the data down to their nodes (see the sketch after this list)

• All binaries, configuration files, LSF and sequence databases are all local to the farm nodes (only home-directory is NFS)

• MDP – reliable multicast technology.

• Initial tests: MDP scaled to 40MB/s (40x1MB/s) over 100BaseT.

o Failed production tests

o Incomplete files, corrupted files

• Can only reliably multicast at 64Kb/s

• Key points:

o Need to move large files. Can move them but not in a sensible way.

• Q (Don): multicast problem due to switches? No

• Q: how much computation per file? A: the 4GB file is processed for all analyses once per month, but this is going to change for a faster turn-around
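
The dropbomb.sh propagation scheme above (head node rdists to the rack masters, each master rdists on to its own nodes) is a two-stage fan-out. A rough sketch under assumed host names; the rdist/ssh command lines are standard, but the actual script is not reproduced here:

    # fanout.py - two-stage distribution in the spirit of dropbomb.sh:
    # stage 1 pushes the data to each rack master, stage 2 asks every master
    # to push it on to the nodes in its own rack. All host names are invented.
    import subprocess

    SRC = "/data/blastdb"
    RACKS = {
        "rack01-master": ["rack01-n%02d" % i for i in range(1, 41)],
        "rack02-master": ["rack02-n%02d" % i for i in range(1, 41)],
    }

    def rdist(src, target):
        # rdist -c distributes the named path to host:dest
        subprocess.run(["rdist", "-c", src, target], check=True)

    # Stage 1: head node -> rack masters.
    for master in RACKS:
        rdist(SRC, "%s:%s" % (master, SRC))

    # Stage 2: each master -> its own nodes (rdist is run remotely via ssh).
    for master, nodes in RACKS.items():
        for node in nodes:
            subprocess.run(
                ["ssh", master, "rdist", "-c", SRC, "%s:%s" % (node, SRC)],
                check=True)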

The Difficulties of Distributed Data (Condor project, Doug Thain, Wisconsin)



Condor project established in 1985. Software for high-throughput cluster computing on sites ranging from 10->1000s of nodes. No master source of anyone’s data. Concentrate on software for loading, buffering and caching.

…. End of battery …

Parallel session A2: Installation, upgrading, testing

System installation & updates (A.Manabe, KEK)

• Network installation: may get ‘server too busy’

• Duplicate disk image (hardware work)

• Diskless PCs: troubleshooting becomes difficult

Idea: network disk cloning software (Dolly+), network booting with PXE, and remote power controllers.

Dolly+:

• Software to copy/clone files and/or disk images among many PCs through a network (originally from ETH Zürich).

• Sequential file& block file transfer

• Configuration file: needed only for server host

• Uses a ring topology, utilizing the maximum performance of full-duplex switch ports. Other systems use a server-to-many-clients topology, so the server becomes a bottleneck with many clients.

• Dolly+ uses pipelining & multi-threading (see the sketch after this block)

• Performance: 1 server -> 100 Nodes: 19min (EIDE), 9min (SCSI)

• A ring topology implies a break if one node fails; Dolly+ overcomes this with a shortcut

• Beta version:

• Q: how do you determine who is neighbor to whom? A: configuration on the server. The next node is identified by IP address (which implies the node is running). A batch installer (modified RH kickstart) is used for initial installation.
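
The ring/pipeline idea behind Dolly+ (every node writes the image to its own disk while simultaneously forwarding each chunk to the next node, keeping all full-duplex links busy) can be sketched as below. This is a simplified Python socket version, not the actual Dolly+ code; the port, chunk size and start-up order are arbitrary choices:

    # ring_clone.py - simplified ring-pipelined image copy (Dolly+-style).
    # Each node receives the image from its upstream neighbour, writes it to
    # disk and forwards every chunk downstream as soon as it arrives.
    # Nodes are started from the tail of the ring towards the server so that
    # each downstream listener already exists when its neighbour connects.
    # Usage (hypothetical): python ring_clone.py <next_host|LAST> <output_file>
    import socket
    import sys

    PORT = 9099          # arbitrary port used around the ring
    CHUNK = 1 << 20      # forward the image in 1 MB chunks

    def serve(next_host, out_path):
        # Wait for the connection from the upstream node.
        srv = socket.socket()
        srv.bind(("", PORT))
        srv.listen(1)
        upstream, _ = srv.accept()

        # Connect to the downstream node unless we are the last in the chain.
        downstream = None
        if next_host != "LAST":
            downstream = socket.create_connection((next_host, PORT))

        with open(out_path, "wb") as out:
            while True:
                chunk = upstream.recv(CHUNK)
                if not chunk:
                    break
                out.write(chunk)               # write locally ...
                if downstream:
                    downstream.sendall(chunk)  # ... and forward immediately

        if downstream:
            downstream.close()
        upstream.close()

    if __name__ == "__main__":
        serve(sys.argv[1], sys.argv[2])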

Quick overview of NPACI Rocks (Philip Papadopoulos, San Diego)

• Focus on small clusters

• Pitfalls:

o Installing each node “by hand”: difficult to keep software on nodes up to date

o Disk imaging techniques: Difficult to handle heterogeneous nodes. Treats OS as a single monolithic system

o Specialized installation programs (IBM’s LUI, RWCP’s multicast installer): let Linux packaging vendors do their job

• Penultimate

o RH Kickstart: need to fully automate to scale out (Rocks gets you there)

• Cluster-wide configuration files derived through reports from a MySQL database (DHCP, hosts, PBS nodes, …)

• NPACI Rocks toolkit:

• Nodes are 100% configured: use of DHCP, NIS for config.

• All software is delivered in a RedHat package (RPM)

• Never try to figure out if node software is consistent: if you ever ask yourself, re-install the node

• Re-installation (single HTTP server 100BaseT)

o One node: 10 minutes

o 32 nodes: 13 minutes

o Use multiple HTTP servers + IP-balancing switches for scale

• Leverage widely-used (standard) software wherever possible

• Rocks-dist:

o Integrate RH package from:

▪ Redhat (mirror) – base distribution + updates

▪ Contrib. Directory

▪ Locally produced packages

▪ Local contrib. (e.g. commercially bought code)

▪ Packages from rocks.npaci.edu

o Single updated distribution that resides on the front-end

o The kickstart file is a text description of what’s on a node. Rocks automatically produces front-end and node files.

• MAC extraction: the insert-ethers program parses a file (e.g., /var/log/messages) for DHCPDISCOVER messages (see the sketch at the end of this section)

o Extracts MAC address and, if not in table, adds MAC address and hostname to table

• Configuration derived from database (/etc/hosts, /etc/dhcpd.conf, pbs node list)

• CD/Floppy used to install the first time for the front-end node. Only CD for the other nodes.(?)

• By default, hard power cycling will cause a node to reinstall itself

o Addressable PDUs can do this on generic hardware

• With no serial (or KVM) console, one is able to watch a node as it installs (eKV), but …

o Can’t see BIOS messages at boot up

• Syslog for all nodes sent to a log host (and to local disk) -> can look at what happened on a node before it went offline. Always forwarded.

• Monitoring: pre-kickstart telnet to the node

• Monitoring: snmp status, Ganglia (UCB) – IP multicast based monitoring system. Still weak.

• No problem (security) with DHCP. Have seen NIS leaks

• Software upgrades run as a batch job
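
The insert-ethers mechanism described above (watch the DHCP log for DHCPDISCOVER lines, record unknown MACs in the database, regenerate configuration files from the database) can be sketched as follows. This toy version uses SQLite instead of the MySQL database Rocks actually uses, and the naming/addressing scheme is invented for illustration:

    # toy_insert_ethers.py - sketch of the insert-ethers idea: harvest MACs
    # from DHCPDISCOVER lines in /var/log/messages, store new ones in a small
    # database and regenerate host entries from that database.
    # Uses SQLite here; Rocks itself keeps the table in MySQL.
    import re
    import sqlite3

    LOG = "/var/log/messages"
    DISCOVER = re.compile(r"DHCPDISCOVER from ((?:[0-9a-f]{2}:){5}[0-9a-f]{2})", re.I)

    db = sqlite3.connect("nodes.db")
    db.execute("CREATE TABLE IF NOT EXISTS nodes (mac TEXT PRIMARY KEY, name TEXT, ip TEXT)")

    def next_name_and_ip():
        """Allocate the next node name and address (scheme is illustrative only)."""
        n = db.execute("SELECT COUNT(*) FROM nodes").fetchone()[0]
        return "compute-0-%d" % n, "10.1.0.%d" % (10 + n)

    with open(LOG) as f:
        for line in f:
            m = DISCOVER.search(line)
            if not m:
                continue
            mac = m.group(1).lower()
            if db.execute("SELECT 1 FROM nodes WHERE mac=?", (mac,)).fetchone():
                continue                      # MAC already known
            name, ip = next_name_and_ip()
            db.execute("INSERT INTO nodes VALUES (?,?,?)", (mac, name, ip))
            print("added %s as %s (%s)" % (mac, name, ip))
    db.commit()

    # Regenerate /etc/hosts-style entries from the database; dhcpd.conf and
    # the PBS node list would be produced from the same table.
    for mac, name, ip in db.execute("SELECT mac, name, ip FROM nodes ORDER BY name"):
        print("%s\t%s" % (ip, name))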

DataGRID WP4 installation task (Jaroslaw Polok, CERN)

• Is lcfg based on cfengine?

Panel

• LinuxBIOS from Los Alamos? VA Linux has played with it. Difficult to set up now, but it works. Correct philosophy: the network is a BIOS thing. There should also be an SSH server in the BIOS.

• Open Hardware project? No recent activity

• Open BIOS project? (seems to have died)

• SystemImager allows for system updates while the system is running (even the kernel) and the upgrade will happen when the executable is used next time.

• Scenario: a security hole is found in ftpd; how is it handled? NPACI: reinstall; KEK: rsync.

• Acceptance test: FNAL (seti@home, network tests, …)

• Interesting tools (used by VA Linux)

o Burn-in software used by VA (internal): CTCS (Cerberus Test Control System)

o BIOSwriter (Argonne), Eric Hendrix. With the read option it sucks the BIOS content into an image file; one moves that to another machine with the same (literally identical) BIOS and runs it with the write option. The images could be kept on an NFS server. (Available at SourceForge)

Day 3 24/5/2001

Clusters at JLAB-1 (Ian Bird)

• 2u IDE disk servers (500GB)

• Environment

o NetApp fileservers (NFS&CIFS)

o Centrally provided software apps.

o Available everywhere (farms, clusters, desktop)

• Locally written MSS, previously OSM.

o 150TB/year

o 2TB/day (in/out)

o Work file servers 10TB RAID-5

o Farm cache servers 4x400GB SCSI

• LSF (~$200/node academic price)

• Hierarchical share of resources

• Users don’t call LSF directly. They use a Java client (jsub, a thin wrapper around LSF, sketched after this section) that:

o Is available from any machine

o Provides missing functionality

o Plugs into MSS

o Web interface: applet which can access local disk

o Used for ~4 years without change

• Lattice QCD clusters use PBS (primarily because the MIT partners don’t want to buy LSF) with a JLAB-written scheduler (7 stages, mimicking LSF’s hierarchical behavior). Jobs can be submitted from a web portal, with means to get a certificate.

• Future: combine jsub & LQCD portal features to wrap both LSF and PBS. XML-based description language.

• Cluster mgmt:

o Kickstart + 2 post-install scripts driven by a floppy

o Looking at PXE – DHCP (available on newer motherboards)

o Alphas configured “by hand + kickstart”

o Upgrades: autorpm (especially for patches). New kernels – by hand with scripts.

o System monitoring:

▪ LM78 to monitor temp + fans via /proc

▪ Mon ()

▪ Mprime (prime number search) has checks on memory and arithmetic integrity (system burn-in)

o Performance monitoring

▪ Reports from LSF etc.

▪ Graphs from mrtg/rrd

o Space – very limited. No people in the same building as the machine room.
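
The jsub wrapper above follows a simple pattern: add the missing functionality (for example MSS staging) and then hand the job to LSF. A rough Python sketch of that pattern; only the standard bsub options -q and -o are used, while the mss-get staging command and the argument layout are hypothetical and not the real jsub:

    # toy_jsub.py - sketch of a thin batch wrapper in the spirit of jsub:
    # pre-stage the input from the mass-storage system, then submit to LSF.
    # "mss-get" is a hypothetical staging command.
    import subprocess
    import sys

    def stage_from_mss(mss_path, local_dir):
        """Placeholder for the MSS plug-in: fetch a file to local scratch."""
        subprocess.run(["mss-get", mss_path, local_dir], check=True)

    def submit(queue, logfile, command):
        """Hand the job to LSF (-q selects the queue, -o the output file)."""
        subprocess.run(["bsub", "-q", queue, "-o", logfile] + command, check=True)

    if __name__ == "__main__":
        # usage: toy_jsub.py <queue> <mss_input> <command...>
        queue, mss_input = sys.argv[1], sys.argv[2]
        stage_from_mss(mss_input, "/scratch")
        submit(queue, "job.%J.out", sys.argv[3:])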

Clusters at CCIN2P3 (Wojciech Wojcik, CCIN2P3)

• Must cope with multi-platform and multi-experiment clusters

• All groups share same interactive and batch (BQS) clusters and other types of services (disk servers, tape, HPSS and networking).

• Use AFS for /usr/local/*

• AFS for home directories

• Three compilers required: gcc 2.91.66, gcc 2.91.66 with a patch for Objectivity 5.2, and gcc 2.95.2

• 35 different high energy, astrophysics and nuclear physics experiments

• Six silos + small DLT for exchange

• ~35TB SCSI disk

• WAN in France: VPN on ATM with guaranteed bandwidth

• Batch system: BQS, a local development in use for 7 years. POSIX(?)-compliant client commands. Platform independent (portable). Provides resource definitions (e.g. platform/OS combinations). Extensible with user-defined resources (switches, counters), e.g. to allow only 20 running instances of the same job (see the sketch after this list).

• Running at about 80% load

• D0 data uses internal Enstore format

• Import/export problem
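
The user-defined counter resources mentioned for BQS (e.g. allowing only 20 running instances of the same job) boil down to the dispatcher checking a named counter before starting a job. A toy sketch of that logic only, with made-up counter names and limits, not BQS itself:

    # counter_scheduler.py - toy illustration of counter-style resources:
    # each job declares which counters it consumes, and the dispatcher only
    # starts it if every counter is below its limit. Names/limits are made up.
    LIMITS = {"myana": 20, "objy-licenses": 8}
    running = {name: 0 for name in LIMITS}

    def can_start(counters):
        return all(running[c] < LIMITS[c] for c in counters)

    def start(counters):
        for c in counters:
            running[c] += 1

    def finish(counters):
        for c in counters:
            running[c] -= 1

    # Dispatch sketch: 25 queued jobs all tagged with the "myana" counter;
    # only 20 may run at once, the rest stay queued.
    queue = [["myana"] for _ in range(25)]
    started = []
    for job in list(queue):
        if can_start(job):
            start(job)
            started.append(job)
            queue.remove(job)
    print("started %d jobs, %d held back by the 'myana' counter"
          % (len(started), len(queue)))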

Condor

• HTC not HPC

• Good for managing a large number of jobs

• DAGMan for inter-job dependencies (e.g. one job stages a tape, …)

• Condor daemons constantly monitor machines

• Fault-tolerance at all levels

• Manages ALL resources and jobs under one system. Easier for users and administrators.

• Condor is developed by former sysadmins

• Condor_master: periodically checks the time stamps on the binaries it is configured to spawn; if a binary is newer, it gracefully shuts down the currently running version and starts the new one (see the sketch after this section)

• Condor_eventd: administrators specify events in a config file (similar to a crontab):

o Date and time

o Checks and plans for moving off running jobs from the machines to allow for the intervention

• Supports about 14 different platforms

• Globally shared condor configuration file /etc/condor/ should be on a shared filesystem

• Local configuration file: local policy settings for a given owner. Can be installed locally, but it is recommended to keep them in a version control system on a shared filesystem (AFS).

• Daemon-specific configuration

• Authentication/authorization is currently host/IP based. Machines can be configured with different control levels. Future work: Kerberos and X.509 authentication (in beta). It will be integrated with the Condor tools to get rid of host/IP authorization, enable encrypted channels and allow passing of AFS tokens. All Condor binaries will be digitally signed, and condor_master will only spawn new daemons if they are properly signed.

• Machines in a pool can usually run different versions and communicate with each other


• Contracted support is available. Questions condor-admin@cs.wisc.edu

• Q: a future scaling issue is how quickly one can contact all submit machines for matchmaking.

• Q: scheduling with I/O (mass-storage systems) and shared network bandwidth are hot research topics in Condor.

• Q: with host/IP authorization on a multi-homed host, one has to specify the interface

• Q: fail-over of the central manager? A: the pool continues to work until the master comes back. There is currently no way to have another instance take over automatically.
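
The condor_master behaviour described above (compare the timestamp of the on-disk binary with the running copy and restart gracefully when the binary is newer) is easy to outline. A rough Python sketch with placeholder daemon paths and poll interval; it is not Condor's implementation:

    # master_watchdog.py - sketch of the condor_master idea: if a supervised
    # daemon's binary on disk becomes newer than the running copy, shut the
    # daemon down gracefully (SIGTERM) and start the new version.
    # The daemon path and the poll interval are placeholders.
    import os
    import signal
    import subprocess
    import time

    DAEMONS = ["/opt/daemons/worker_daemon"]   # binaries to supervise (example)
    procs = {}                                 # path -> (process, mtime at start)

    def start(path):
        procs[path] = (subprocess.Popen([path]), os.stat(path).st_mtime)

    for d in DAEMONS:
        start(d)

    while True:
        time.sleep(60)                         # periodic check
        for path, (proc, started_mtime) in list(procs.items()):
            if os.stat(path).st_mtime > started_mtime:
                proc.send_signal(signal.SIGTERM)   # graceful shutdown
                proc.wait()
                start(path)                        # spawn the new binary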

MetaProcessor Platform (United Devices)

• seti@home

• Largest distributed computing project in history

• Generalizations of seti

• Cancer research project participation

o 400K+ individual members

o 600K+ Machines

o 100M+ CPU hours

o ~163 (1GHz) CPU-years/day

• Designed to aggregate millions of devices

• Maximizes application results

• Encryption and code signing protect data and devices

• Application life-cycle control. Enables phased application testing and deployment.

• Unobtrusive agent (2MB install). Runs at priority lower than normal tasks

• Self-updating infrastructure: agents and tasks automatically upgrade to latest versions as devices connect

• Cross-platform (Linux, NT). Started on Linux but NT needed because it provides most of the resources.

• The agent must be able to communicate over the Internet, through firewalls etc.; requests go over HTTP. The agent utilizes idle resources. Everything is encrypted (communication, locally stored data).

• Redundant scheduling handles nodes being switched off: every job is run 2-3 times. This also allows for application validation (runs on different platforms).

• MetaProcessor SDK

o Lightweight C/C++ API with POSIX I/O

• Ideal applications:

o Coarse-grain parallelism

o High computation-data communication ratios

o Small footprint on the client

o Scheduling based on client characteristics

o Incremental transfer of new data

• Application: THINK – model interaction between proteins and potential drug molecules

o Input …
