HPC Research
I have always been interested in learning about cutting-edge technologies and enjoy applying logic to solve problems. During the summer and fall of 2017, I learned about high-performance computing and modified CQSim, the HPC queue simulator.
Project
Learning and Exploring Cobalt – the HPC Job Management Suite
Weekly Updates
Update One (7-17-17)
I spent this week getting familiar with Cobalt’s administrative and user commands. I have almost finished installing the simulation environment on a CentOS 6.4 VM running in VirtualBox. Alongside reading about the Cobalt platform, I read the following paper: Fault-Aware, Utility-Based Job Scheduling on Blue Gene/P Systems. I learned about the various utility functions that can optimize the job selection process, as well as the performance gains from fault-aware job allocation. I found it interesting that basic FCFS beats all of the proposed smart utility functions when it comes to large, time-consuming jobs. The real efficiency gain from smart utility functions comes from cutting down the average response time for shorter jobs. The paper also briefly explored QSim and Blue Gene systems, which helped my understanding of how the pieces depend on each other.
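To make the idea of utility-based job selection concrete, here is a minimal toy sketch of my own (not the paper's actual utility functions): queued jobs are scored with a utility function instead of being taken strictly in FCFS order, which tends to favor short jobs that have been waiting.

```python
# Toy job records: (job_id, submit_time, requested_walltime_seconds)
queue = [
    ("job_a", 100.0, 3600),
    ("job_b", 150.0, 600),
    ("job_c", 200.0, 7200),
]

def fcfs_key(job, now):
    # Plain FCFS: earliest submission first.
    _, submit, _ = job
    return submit

def utility_key(job, now):
    # Toy utility: favor jobs that have waited long relative to their size,
    # which tends to improve response time for short jobs.
    _, submit, walltime = job
    wait = now - submit
    return -(wait / walltime)   # more negative = picked earlier when sorted ascending

now = 1000.0
fcfs_order = sorted(queue, key=lambda j: fcfs_key(j, now))
utility_order = sorted(queue, key=lambda j: utility_key(j, now))
print("FCFS order:   ", [j[0] for j in fcfs_order])
print("Utility order:", [j[0] for j in utility_order])
```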
Update Two (7-24-17)
This week I read three papers: Utilization and Predictability in Scheduling the IBM SP2 with Backfilling (TPDS 61), Improving Batch Scheduling on Blue Gene/Q by Relaxing 5D Torus Network Allocation Constraints (IPDPS 15), and Experience and Practice of Batch Scheduling on Leadership Supercomputers at Argonne. The main focus of this week was exploring scheduling techniques and finally learning the scheduling policy in use at Argonne. The first paper, TPDS 61, was published over 20 years ago and aims to create a more stable and predictable scheduling system through conservative backfilling. It argues that aggressive (EASY) backfilling does not improve predictability because jobs behind the first job can hypothetically have an indefinite wait time. When a running job terminates earlier than expected, you retain the original schedule but compress it (a sketch of this follows the list below). This point is important because a combination or a variant of this method could be used with the dragonfly network, where the estimated run time is highly unpredictable. The second paper, IPDPS 15, describes a couple of allocation mechanisms to externally allocate network resources to applications: MeshSched and CFCA (contention-free and communication-aware). I have also been exploring the Cobalt code base, and alongside reading the code I am refreshing and adding to my knowledge of Python. The final paper explores some challenges Argonne faces moving forward:
1. For some data collection instruments, having on-demand computation capabilities would be helpful. Explore task-switching/time-sharing techniques to satisfy this immediate demand.
2. How to prepare the parallel machines at Argonne to handle queue depths in the millions, given the government's push for HTC workload support.
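To make the conservative-backfilling idea from the first paper concrete, here is a minimal Python sketch under my own simplified assumptions (one pool of nodes, a free-node step function instead of a real reservation data structure); it is not CQSim or Cobalt code. Every queued job gets a start-time reservation in queue order, and when a running job ends early the same-order reservations are rebuilt so they can only move earlier ("compression"), which keeps wait-time predictions valid.

```python
TOTAL_NODES = 8

def free_at(profile, time):
    """Free nodes at `time`; profile is a sorted [(time, free_nodes)] step function."""
    free = profile[0][1]
    for t, f in profile:
        if t <= time:
            free = f
        else:
            break
    return free

def earliest_start(profile, need, duration):
    """Earliest step time at which `need` nodes stay free for the whole `duration`."""
    for start, _ in profile:
        window = [f for t, f in profile if start <= t < start + duration]
        if min(window) >= need:
            return start
    return profile[-1][0]

def reserve(profile, start, need, duration):
    """Return a new profile with `need` nodes subtracted over [start, start+duration)."""
    end = start + duration
    times = sorted({t for t, _ in profile} | {start, end})
    return [(t, free_at(profile, t) - (need if start <= t < end else 0)) for t in times]

def build_schedule(queued, profile):
    """Reserve a start time for every queued job, in queue order (conservative)."""
    schedule = []
    for job_id, need, walltime in queued:
        start = earliest_start(profile, need, walltime)
        profile = reserve(profile, start, need, walltime)
        schedule.append((job_id, start))
    return schedule

# One running job holds 4 of 8 nodes and is expected to finish at t=100.
queued = [("big", 6, 50), ("small", 2, 10)]
print(build_schedule(queued, [(0, 4), (100, 8)]))  # big reserved at 100, small backfilled at 0
# The running job actually ends at t=30: compress by rebuilding with the new profile.
print(build_schedule(queued, [(0, 4), (30, 8)]))   # big's reservation moves earlier, to 30
```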
Update Three (7-31-17)
I read the following papers assigned by Professor Lan.
1. Watch Out for the Bully! Job Interference Study on Dragonfly Network: This paper studied the effects of random job placement coupled with adaptive routing on individual jobs. It found that less communication-intensive applications were affected negatively. Essentially, the goal is to minimize the worst case for less communication-intensive applications while keeping the performance of communication-intensive applications. The method tested was a hybrid random-contiguous allocation with adaptive routing: the less communication-intensive applications would be placed contiguously and the rest would be allocated randomly to balance load (a sketch of this split follows this list). While this method showed some improvement, the performance of less communication-intensive apps is still significantly affected, which is why minimizing their worst case remains an important task.
2. Job Scheduling HPC Clusters: This report establishes two classes of clusters: high-throughput computing (HTC) clusters and high-performance computing (HPC) clusters. For HTC clusters, reducing load imbalance to maximize the number of jobs completed is the critical goal, whereas minimizing communication overhead is a consideration that must be taken into account with any HPC cluster. Because there is active communication between nodes in an HPC cluster, communication speed is vital: it directly affects the application's performance and the time in which the job completes. The report further details the general architecture of HPC clusters and the terminology used to describe the metrics and approaches for job scheduling.
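Here is a toy version of the hybrid allocation idea from the Bully paper, written as my own simplification rather than the paper's allocator: communication-intensive jobs are scattered randomly across the machine to spread network load, while less communication-intensive jobs get a contiguous run of node ids.

```python
import random

TOTAL_NODES = 32
free_nodes = set(range(TOTAL_NODES))

def allocate(job_nodes, comm_intensive):
    """Return the set of node ids given to the job, or None if it cannot be placed."""
    if len(free_nodes) < job_nodes:
        return None
    if comm_intensive:
        chosen = set(random.sample(sorted(free_nodes), job_nodes))   # random scattering
    else:
        chosen = None
        ordered = sorted(free_nodes)
        for i in range(len(ordered) - job_nodes + 1):
            window = ordered[i:i + job_nodes]
            if window[-1] - window[0] == job_nodes - 1:              # consecutive node ids
                chosen = set(window)
                break
        if chosen is None:
            return None
    free_nodes.difference_update(chosen)
    return chosen

print("communication-intensive job:", sorted(allocate(6, comm_intensive=True)))
print("less intensive job:         ", sorted(allocate(4, comm_intensive=False)))
```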
Reading these two papers raised some questions about the dragonfly architecture itself and about ALPS (the Application Level Placement Scheduler), the internal scheduler used on Cray platforms. This led me to read the original paper on the dragonfly topology. It clarified the overall architecture, but I still need to know how exactly it is applied on the Theta system. Lastly, I learned that ALPS is used to create a level of abstraction between an external resource management model, like Cobalt, and the underlying hardware and operating system. I also read the user guides on Theta and MIRA, focusing mainly on the former.
Update Four (8-7-17)
Data center scheduling is done without knowing job durations in advance, so a hybrid scheduling method that is both suitable for HPC clusters and borrows the unknown-duration scheduling components of data center scheduling could help improve scheduling efficiency. These are the blog article and papers I was assigned this week. The blog post, The evolution of cluster scheduler architectures, describes five types of scheduler architectures for HTC clusters: monolithic, two-level scheduling (which involves partitioning), shared-state architectures, fully distributed architectures, and hybrid architectures, which combine the fully distributed approach with a monolithic or shared-state design. Essentially, a hybrid uses a distributed model for short tasks and low-priority batch workloads and a centralized one (monolithic or shared state?) for the rest; a toy sketch of this split appears at the end of this update.

By reading the user guide on Theta, "Early Evaluation of the Cray XC40 Xeon Phi System 'Theta' at Argonne", and the "Validation Study of CODES dragonfly network model against Theta ALCF system", I gained a better understanding of the capabilities and limitations of Theta. However, I have not yet pieced together how I could use this to build an appropriate scheduler; that remains a key focus of the coming week, though the material has a lot of useful information for when I build an experiment. Because the variability in MPI performance can be significant when there is sudden contention caused by an interfering job, many data points should be collected to ensure statistical accuracy. I also thought core specialization was an interesting idea. Running more than one thread per core does not improve performance, and degrades it beyond two threads. Lastly, the final paper showed the effect of scale on the benchmark tests and concluded that the CODES simulation is largely accurate. However, I noticed that the Bully paper (which I read last week) did not use the same configuration as this Theta system, but I suspect it does not matter much since the difference mainly affects latency, which I believe scales linearly.
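Going back to the hybrid architecture idea, here is a toy dispatcher in that spirit; the threshold, field names, and queue names are made up for illustration and are not from the blog post.

```python
# Short or low-priority jobs take a fast distributed path; everything else goes
# through the central scheduler for a full-quality placement decision.
SHORT_JOB_SECONDS = 300

def dispatch(job):
    """job: dict with 'walltime' (seconds) and 'priority' ('low'/'normal'/'high')."""
    if job["walltime"] <= SHORT_JOB_SECONDS or job["priority"] == "low":
        return "distributed-worker"
    return "central-scheduler"

print(dispatch({"walltime": 120, "priority": "normal"}))   # distributed-worker
print(dispatch({"walltime": 7200, "priority": "high"}))    # central-scheduler
```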
Update Five (8-28-17)
This week I explored papers about CODES, about using machine learning to reduce errors from underestimated runtimes, and in general various papers and slides from the CODES workshop. Trade-off between Prediction Accuracy and Underestimation Rate in Job Runtime Estimates takes advantage of the repetitive nature of job submission in HPC clusters, meaning the same jobs are submitted repeatedly; this is an example of a proper application of machine learning methods to HPC scheduling. There is also the problem of overestimation: for nearly 40% of jobs, the requested time is more than twice the actual running time. Nearly half the jobs exit abnormally, meaning the user underestimated the time and the resources essentially went to waste; however, current techniques overlook this factor and emphasize lowering the overestimation rate instead. This paper uses the Tobit model to predict job runtimes, which means the regression model can set a threshold based on the historical accuracy of the data and adjust it appropriately; this helps lower the underestimation rate. There is a relative improvement in system utilization and overall accuracy (though not compared against other ML techniques), and the overhead is low, at only about a second per scheduling turn. The paper points out that job wait time may not be as meaningful a metric for the condition it is measuring and improving: job wait time is influenced by queue length, and with heavy underestimation (more backfilling opportunities) the queue is shorter in the short run, yet more jobs are being handled overall because many have to be resubmitted, and the wait-time metric does not capture this.
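This is not the paper's Tobit-based predictor, but a toy sketch of the trade-off it describes: predicting a job's runtime from a user's recent history, where choosing a higher quantile lowers the underestimation rate at the cost of larger (over)estimation error.

```python
def predict_runtime(history, quantile=0.9):
    """history: past runtimes (seconds) of similar jobs from the same user."""
    ordered = sorted(history)
    idx = min(int(quantile * len(ordered)), len(ordered) - 1)
    return ordered[idx]

history = [580, 600, 610, 640, 900, 950]
for q in (0.5, 0.9):
    est = predict_runtime(history, q)
    under = sum(r > est for r in history) / len(history)
    print(f"quantile={q}: estimate={est}s, underestimation rate on history={under:.2f}")
```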
CODES (Codesign of Multi Layer Exascale Storage Architecture) is a simulator that takes the network architecture and node structure into account and focuses on communication overhead by replaying MPI traces. CODES improves its accuracy by building on the ROSS parallel discrete-event simulation engine, which combines conservative and optimistic scheduling and reduces the state-saving overhead. The paper Enabling Parallel Simulation of Large-Scale HPC Network Systems notes that HPC applications most frequently use nearest-neighbor and local communication patterns.
After learning about CODES, I started wondering what the purpose of CQSim will be once we adapt the platform to dragonfly, since CQSim will not be taking MPI traces, and whether a full dragonfly system is optimal for HPC clusters, given that applications with locally concentrated communication patterns are negatively affected.
Update Six (9-25-17)
I started learning about CQSim, so I read the manual and design reports. I started reading through the code base and found two versions, one of which should presumably serve as the basis for the next version I develop. CQSim_V13 seems to be intended for development (in Eclipse) because it has a .project file. It also contains elements such as Factory.py, an additional layer for accessing modules in a more robust manner, which the other version, Cqsim, does not have. Another detail I noticed was that the adapt files are set in this version, whereas in the Cqsim folder that is not the case.
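For reference, here is a purely hypothetical illustration of what a factory layer for accessing modules could look like; this is not the actual Factory.py from CQSim_V13, just the general pattern of constructing modules by name from a single place, with made-up class names.

```python
class ModuleFactory:
    """Registry that builds modules by name, so callers never import them directly."""
    def __init__(self):
        self._registry = {}

    def register(self, name, constructor):
        self._registry[name] = constructor

    def create(self, name, **kwargs):
        if name not in self._registry:
            raise KeyError(f"unknown module: {name}")
        return self._registry[name](**kwargs)

class DummyScheduler:            # stand-in for a real CQSim module
    def __init__(self, debug=0):
        self.debug = debug

factory = ModuleFactory()
factory.register("scheduler", DummyScheduler)
scheduler = factory.create("scheduler", debug=1)
print(type(scheduler).__name__, scheduler.debug)
```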
In Cqsim, some files sit outside the folders designated in the online documentation. This leads me to believe that it may be an older version, since it is missing Factory, has files in the wrong locations, and has duplicates in odd places. For example, the following files [Filter_job.py~, Filter_job_SWF.py~, SWF_filter.py~, config_n.set~, cqsim_main.py~, cqsim_path.py~, n_config.set~, opt_type_add.py~, optparse_add.py~, read_config.py~] are outside the src folder, and some of them have copies within the src folder. Why are these here? Also, within the Cqsim/src directory, why are there three different versions of the same file, with the extensions .py, .py~, and .pyc? The .py and .py~ files are identical, so why are both there? In addition, in this version the parameters for adapt mode are set without adapt mode itself ever being set or passed; this is not the case in Cqsim_V13. As a result, the Cqsim backfill does nothing with the adapt para_list that is passed, because adapt mode is not set.
Here are some other minor details in Cqsim/src/cqsim_main.py:
- module_info_collect is built differently in the two versions, with different parameters: in Cqsim it is Class_Info_collect.Info_collect(alg_module=module_alg, debug=module_debug), while in the other version (V13) it is module_info_collect = modules.info(avg_inter=[600,3600], debug=module_debug).
- Backfill: the 'ad_bf' adapt mode is not set.
- Start window: 'ad_win' does not exist, however 'ad_win_para' is set.
- Basic algorithm: 'ad_alg_para' is set but 'ad_alg' is not set.
I ended up going with the Cqsim folder instead of v13 despite my concerns because the most recent person to use it was Xu.
I also went to the STARS conference in Atlanta, Georgia, which had breakout workshops as well as a poster session. The workshops were geared towards increasing underrepresented groups’ participation in STEM. While the workshops taught me many ways to get involved and to engage underrepresented groups in a different manner, I found the poster session more valuable. It connected me to undergraduate and PhD students who are also doing research on HPC clusters. For example, I met Ms. Ghalami, a PhD candidate who wrote A Parallel Approximation Algorithm for Scheduling Parallel Identical Machines, in which she explores a parallel approximation algorithm for scheduling jobs on parallel identical machines.
This was a beneficial experience that gave me the opportunity to grow my network. Due to the relaxed environment among peers closer to my age, I gained more knowledge from the poster session than from the day-long Chameleon conference, where most of the information went over my head.
Update Seven (10-5-17)
This week I worked on understanding plan-based scheduling and simulated annealing in order to implement them in CQSim. For that, I read the following paper: Exploring Plan-Based Scheduling for Large-Scale Computing Systems. Plan-based scheduling creates permutations of possible execution plans and uses simulated annealing to select an approximation of the optimal execution plan. According to the paper, plan-based scheduling reduces job wait time by 40% and job response time by 30%, and a simulated annealing algorithm with a high initial temperature and a slow cooling scheme performs well. To understand simulated annealing, I simply read the Wikipedia article in detail. I learned that simulated annealing may accept a worse neighbor to avoid getting stuck in a local optimum, and that over time this acceptance probability decreases. I have the core algorithm coded in C++; I just need to integrate it with CQSim.
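As a reference for the integration, here is a minimal Python sketch of the same annealing idea applied to a job ordering. My actual implementation is in C++, and the cost function below is a stand-in (total waiting time on one machine), not the paper's objective.

```python
import math
import random

jobs = [("a", 30), ("b", 5), ("c", 120), ("d", 15)]   # (job_id, runtime)

def cost(plan):
    """Sum of waiting times if jobs run back to back in this order."""
    total, clock = 0, 0
    for _, runtime in plan:
        total += clock
        clock += runtime
    return total

def anneal(plan, t_start=1000.0, t_end=0.1, alpha=0.95):
    current, best = list(plan), list(plan)
    t = t_start
    while t > t_end:
        # Neighbor: swap two random positions in the plan.
        i, j = random.sample(range(len(current)), 2)
        neighbor = list(current)
        neighbor[i], neighbor[j] = neighbor[j], neighbor[i]
        delta = cost(neighbor) - cost(current)
        # Always accept improvements; accept worse plans with probability e^(-delta/t),
        # which shrinks as the temperature cools.
        if delta < 0 or random.random() < math.exp(-delta / t):
            current = neighbor
            if cost(current) < cost(best):
                best = list(current)
        t *= alpha   # slow cooling
    return best

print([j for j, _ in anneal(jobs)])   # typically the shortest-job-first order: b, d, a, c
```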
Update Eight (10-19-17)
I am working on making CQSim more efficient, focusing on unnecessary overhead: opening a file object every time a log line is printed, node_struc creating a file it does not need, and job_trace keeping everything in memory; instead, the array should store 100-1000 jobs at a time and write them to the result folder (deleting them from the array) once it reaches its limit. I am also working on integrating the plan-based scheduling code with CQSim, figuring out how to have Python scripts communicate with C++ executables or programs.
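One simple option I am considering for the Python/C++ integration is to run the C++ planner as a child process and exchange plain text over stdin/stdout. The executable name and message format below are placeholders, not a real interface.

```python
import subprocess

# Send one queued job per line as "id nodes walltime", read back the chosen order.
queued = [("job_a", 128, 3600), ("job_b", 32, 600)]
payload = "\n".join(f"{j} {n} {w}" for j, n, w in queued)

result = subprocess.run(
    ["./planner"],            # path to the compiled C++ executable (placeholder)
    input=payload,
    capture_output=True,
    text=True,
    check=True,
)
plan = result.stdout.split()  # e.g. ["job_b", "job_a"]
print(plan)
```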
Update Nine (11-2-17)
I have shifted my focus to working on the node structure addition, as currently it is just a list of nodes with no sense of structure (dragonfly, torus, etc.). I am working on developing a simple mesh network to understand the basics, and will eventually move on to supporting multiple network architectures. The paper Balancing Job Performance with System Performance via Locality-Aware Scheduling on Torus-Connected Systems briefly discusses solving the processor-ordering problem with a space-filling curve such as the Hilbert curve. The rest of the paper explores the concept of window-based job allocation.
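As a starting point for the node-structure work, here is a rough sketch of the kind of 2D mesh representation I have in mind; the names and layout are my own and are not CQSim's current node_struc. Each node knows its coordinates and neighbors, so an allocator can reason about locality instead of treating nodes as a flat list.

```python
class MeshNode:
    def __init__(self, x, y):
        self.x, self.y = x, y
        self.busy = False

class Mesh2D:
    def __init__(self, width, height):
        self.width, self.height = width, height
        self.nodes = {(x, y): MeshNode(x, y) for x in range(width) for y in range(height)}

    def neighbors(self, x, y):
        """Up/down/left/right neighbors inside the mesh (no wraparound)."""
        candidates = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
        return [self.nodes[c] for c in candidates if c in self.nodes]

    def allocate_block(self, w, h):
        """Find a free w-by-h rectangle (contiguous allocation) and mark it busy."""
        for x0 in range(self.width - w + 1):
            for y0 in range(self.height - h + 1):
                block = [self.nodes[(x, y)]
                         for x in range(x0, x0 + w) for y in range(y0, y0 + h)]
                if all(not n.busy for n in block):
                    for n in block:
                        n.busy = True
                    return [(n.x, n.y) for n in block]
        return None

mesh = Mesh2D(4, 4)
print(mesh.allocate_block(2, 2))   # [(0, 0), (0, 1), (1, 0), (1, 1)]
```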