High Performance Computing Group
Colorado School of Mines

FACULTY PROPOSALS FOR TIME ALLOCATION ON Wendian and BlueM

Mines' High Performance Computing Group invites faculty and associated research groups to submit proposals for the use of Mines' high performance computing (HPC) platforms, Wendian and BlueM.

The purpose of the proposal process is to ensure that Mines-owned resources are utilized in a beneficial and efficient manner. Collecting information regarding the technical aspects of the project helps HPC staff determine the appropriate hardware requirements for your project. Information such as funding sources and project purpose is collected for the purpose of articulating the benefits of HPC and ensuring that adequate resources are available for future use.

Wendian is Mines' new HPC platform, coming on line in the fall of 2018. It contains the latest generation of Intel processors, Nvidia GPUs, and OpenPower nodes. In total, it has 82 regular compute nodes plus 5 nodes with GPUs, combining to offer over 350 TFLOPS. It also has 3 administration nodes and 6 file system nodes serving 1,152 TB (raw) of storage at over 10 GB/s. Wendian runs CentOS 7 Linux. Parallel jobs are managed via the Slurm scheduler. The programming languages and models of choice include C, C++, Fortran, OpenMP, OpenACC, CUDA, and MPI. The overall specifications are:

Processor                     Cores  Memory (GB)  Nodes / Cards  Cores Total  Memory Total (GB)
Skylake 6154                   36       192            39           1,404          7,488
Skylake 6154                   36       384            39           1,404         14,976
Skylake 5118                   24       192             5             120            960
GPU cards for 5118s (Volta)     -        32            20               -            640
OpenPower 8                    16       256             2              32            512
OpenPower 9                    16       256             2              32            512
Totals                                                107           2,992         25,088

BlueM is Mines' legacy HPC platform, but it still allows researchers to run large simulations in support of the university's core research areas while operating on the forefront of algorithm development. BlueM is a unique high performance computing system from IBM. The overall specifications are:

Feature          Value
Teraflop rating  154 teraflops (roughly 7x RA)
Memory           17.4 terabytes
Nodes            656
Cores            10,496
Disk             480 terabytes

BlueM is unique in configuration. It contains two independent compute partitions that share a common file system. The two partitions are built using different architectures. The first partition, known as Mc2 (Energy), runs on an IBM BlueGene Q (BGQ). The second partition, known as AuN (Golden), uses the iDataPlex architecture. Each architecture is optimized for a particular type of parallel application.

Proposals can be submitted using the following instructions and electronic form. This procedure is to be used by Mines' faculty members to request a specified number of node-hours. Allocations will be made on a semi-annual basis, and allocation awards will be valid for six months from the award date. Proposals will be evaluated based on:

  • science theme;
  • reasonableness of number of node-hours requested (each node has 16 cores);
  • code scalability;
  • number of students and post docs associated with the project;
  • clear tie to requested or existing external funding;
  • previous history of bringing funding to CSM using HPC;
  • faculty publications which rely on HPC; and
  • faculty achievements, awards and honors associated with HPC.

Importance of listing Grants and Publications

Researchers who set up allocations in 2018 will be grandfathered onto Wendian. To check your allocation setup date, run the command:

    /opt/utility/accounts

on AuN or Mc2. Accounts set up this year will begin with 18*.
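For example, you can filter for this year's accounts directly (a minimal sketch, assuming the accounts utility prints one account record per line with the account number first):

    # List only accounts set up this year (those starting with "18"),
    # assuming /opt/utility/accounts prints one account per line.
    /opt/utility/accounts | grep '^18'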

In the form below you will see fields for entering information about your proposals, funded research, and HPC-related publications. This is important because this information will be used as part of the justification for future machine and infrastructure updates. This information will be reviewed, and additional information may be sought if your entries do not appear complete.

Calculation of node hours to request

Researchers with previous allocations on BlueM can determine their usage by running the following commands on either AuN or Mc2:


    /opt/utility/aunhrs ############
    /opt/utility/mc2hrs ############

where ############ is the account number. These commands will appear to report account information back to the beginning of last year, but since these accounts have only been active on our current accounting system since the fall, only the hours used since August will be reported.

You can see the accounts you are authorized to use by running the following command on either AuN or Mc2:

    sacctmgr list association cluster=mc2 user=$LOGNAME format=User,Account%20

This will return a list of your accounts. You can then see the association between your account number and your project title by running the following command, replacing ############ with the values from the previous command.

    sacctmgr list account ############ format=Account%15,Desc%80
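
If you have several accounts, the two steps can be combined (a minimal sketch; the -n and -P flags suppress the header and produce parsable output):

    # For each account you are authorized to use on mc2,
    # print its account number and project description.
    for acct in $(sacctmgr -n -P list association cluster=mc2 user=$LOGNAME format=Account); do
        sacctmgr list account $acct format=Account%15,Desc%80
    done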

The per-core performance of Wendian nodes will be roughly 2x-4x that of AuN.
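
As a rough guide when sizing a request, node-hours scale as nodes per run, times wall time per run, times the number of runs, discounted by the Wendian speedup. The sketch below illustrates the arithmetic only; all of the numbers are placeholders, not recommendations:

    # Hypothetical example: estimate node-hours for an allocation period.
    NODES_PER_RUN=4        # average nodes per operational run
    HOURS_PER_RUN=12       # wall time per run on AuN
    RUNS=50                # planned runs over the allocation period
    SPEEDUP=2              # assumed Wendian-vs-AuN speedup (2x-4x per core)
    echo $(( NODES_PER_RUN * HOURS_PER_RUN * RUNS / SPEEDUP )) node-hours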

Successful proposals will result in an award of a fixed number of node-hours, and faculty members will be able to track the node-hour usage of their group members through the commands given above. In the past we have been rather lax about enforcing limits on the time granted; because of various contractual obligations, we will now need to enforce limits. Users will be allowed to run past their allocated hours, but at a low priority.

Wendian Specific Policies

  1. People will be allowed to purchase nodes on Wendian.
  2. The cost to users of a Wendian node will be less than the "list" price (approximately $8,500 versus a "list" price of ~$11,500).
  3. A percentage of the original nodes will be for general use, not owned by research groups.
  4. When a percentage of the original nodes has been purchased, we will start to add new nodes purchased by research groups.
  5. When Kaby Lake processors become generally available, we will stop purchasing Skylake.
  6. Purchased nodes on Wendian will be managed as they are currently managed on Mio, with a minor exception. There will be three queue types:
    1. full - all nodes on the machine
    2. compute - all nodes that are not owned by someone (default)
    3. group - nodes owned by people
  7. Allocations on Wendian, Mc2, and AuN will be by proposal.
  8. People will be given a fixed core-hour allocation (which could be monthly, quarterly, or for the full year); after their allocation is used, they will be given a lower priority.
  9. If people are using their own nodes, the usage will not be charged against their allocation.
  10. If a user requests exclusive access, they will be charged for 36 cores even if they don't use all of them.
  11. Users are highly encouraged to specify memory requirements on runs; if not specified, the requirement will be set to a low but reasonable value. (See the example batch script after this list.)
  12. For nonexclusive access, users will be charged based on the higher of two metrics: the number of cores used or the amount of memory used. For example, a job that uses only 4 cores but half of a node's memory would be charged as if it had used half of the node's cores.
  13. Wendian will have approximately 1 PB (1,000 TB) of storage, with the majority of the storage in scratch. Research groups will have the opportunity to "purchase" some of the storage. Files stored in owned storage will not expire.
  14. Faculty will be called upon to provide research summaries and publications generated, and to participate in HPC meetings and promotional activities.
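
As referenced in items 6 and 11 above, the following is a minimal Slurm batch script illustrating how a user might select a queue type and state memory requirements. This is a sketch only; the job name, partition, node count, wall time, memory value, and application name are assumptions, not site defaults:

    #!/bin/bash
    #SBATCH --job-name=example      # hypothetical job name
    #SBATCH --partition=compute     # queue type: full, compute (default), or group
    #SBATCH --nodes=2               # number of nodes for the run
    #SBATCH --time=04:00:00         # wall-time limit for the job
    #SBATCH --mem=64G               # per-node memory requirement (item 11)
    srun ./my_application           # launch the (hypothetical) parallel application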

Mc2 and AuN Policies

  1. Compute time on Mc2 and AuN will be free. That is, time will not be charged against allocations.
  2. Allocations set up before 01/01/2018 will be disabled January 2019.
  3. Allocations on Wendian, Mc2, and AuN will be by proposal.

File system Policies

  1. Directories belonging to people who leave Mines will be deleted after 3 months. It is the PI's responsibility to archive any desired data before that time.
  2. Directories which have not been accessed for 1 year are subject to deletion. Directories may be deleted earlier as needed.

Mio Policies

  1. Students who are not supported by a researcher will be allowed to run on Mio. However, faculty will not be allowed to take authorship of papers based on research done on Mio unless they own nodes.
  2. Faculty, and students supported by faculty, will only be allowed to run on Mio if their research group has purchased nodes on Mio.
  3. Nodes on Mio that fail outside of warranty will be retired.
  4. No new research groups will be added to Mio.

For questions please contact Dr. Timothy Kaiser at: tkaiser@mines.edu

Important Dates:

Proposal Web form available
Friday, November 16, 2018
Proposals Due:
Friday, December 14, 2018
Allocations announced:
Friday, December 21, 2018
Old allocations expire:
January 4, 2019
Allocations go into effect:
January 4, 2019
Allocations expire:
July 1, 2019

Instructions

Fill in all of the information and hit "Submit". All fields not marked optional must be filled in. The larger text input boxes will scroll, but they can also be expanded. After you hit Submit you will see a summary of what you have entered. Please save this for your records.

There are two types of accounts available, normal and experimental. Experimental accounts are for people who are new to HPC and need to gain experience before submitting a request for a normal account. If you had a previous BlueM allocation but did not use a significant portion of it, you are expected to request an experimental allocation.

Faculty Data

Faculty Name:
Email:
Title of Project:
Academic Department:
Group's Technical Point of Contact:
The Technical Point of Contact is the person
you select as most knowledgeable about the
computational aspects of your project. They
will be a resource for others in your group
and for the people in the HPC group to help
them understand issues.

Allocation Request

Account Type (See Instructions):
Node-Hours Requested (each node has 16 cores):

Project Overview

Provide a concise, one paragraph summary of the project.

Clarify the earth/energy/environment science tie to the proposed use of Wendian and BlueM.

Explain the impact of the proposed research project.

Defend the number of node-hours requested. Be as quantitative as possible. Include record of usage on similar projects if possible.

Number of students and post docs associated with the project:

Identify any external funding requiring this computer time, and the total project award to CSM by year. Has this funding been received yet? Explain whether or not GECO was specifically mentioned in the proposal to this funding agency.

List of collaborators and their organizations.

Code information

Commercial codes or software packages to be used for this project
(must be supplied by faculty member)
Open Source codes or software packages to be used for this project

Brief description of codes (including web site if available):

Code Scalability information
Attach a PDF description of scalability (optional).
(Optional) Scaling information: choose a *.pdf file to upload:

Average number of nodes to be used in operational runs:

Estimated wall time (hours) for completion of operational runs: hours

Does your code have checkpoint/restart capability?

Is there a version of your code that uses GPUs?
This does not affect your allocation; it will
only be used for future planning.

Anticipated maximum temporary storage requirements: Gbytes

Data Archiving Information

Data on the SCRATCH space of BlueM is subject to purging, and users must archive any data that is not in immediate use. Please estimate your yearly archival requirements and plans for storing data.
Anticipated Data Archive requirements: Gbytes
and plan

RECORD OF HPC FUNDING AND PUBLICATIONS

Provide a list of recently funded projects and projects in the proposal review process. Please provide a list of recent publications related to this project, including the Digital Object Identifier number. Identify publications that have explicitly identified CSM HPC resources in the publication acknowledgements.

Proposals in Review and/or Recent (3 years) Funded Research
Title of Investigation | Status | Source of Funding | Amount ($K) | CSM Project # or CSM Proposal # | Start/Stop Dates of Investigation | # Students and Post Docs

HPC Related Publications
DOI # | CSM Resources Acknowledged | Citation

When you hit submit you will be shown a copy of your input.
If you see an error use the "Go Back" function on your browser,
correct the error and resubmit.

Modified November 21, 2018