Developing new algorithms: Ivo Filot at Penn State University, part 3
In the summer of 2016, Ivo Filot went to Penn State University (USA) to stay for two months in Adri van Duin’s research group. The goal of his stay was to develop a new algorithm for fitting ReaxFF potentials. In this blog, kept especially for MCEC, Ivo tells you all about his research, his professional as well as his personal findings, and the perks and peculiarities of working halfway across the world.
Week 3: Supercomputers (Monday July 18th – Sunday July 24th)
Last week I explained how I was able to parallelize my program so that we can utilize multiple computational cores to perform our calculations. In this post, I’ll explain what supercomputers are and how they are used in research.
Remember your first computer? It probably was a machine such as a Commodore 64, an 80286 or maybe a Pentium I or II. What these machines have in common is that they all use a single central processing unit, or CPU. Programs that ran on such computers were typically single-threaded: they had a single thread of operations to execute (remember the serial tasks of my last post?).
Ivo and his Commodore 64
With the advent of dual-core processors around 2005, such as the Intel Core 2 Duo and the AMD Athlon 64 X2, programs were able to use two computational cores at the same time. This enabled programmers to utilize more computational resources, but it came at a cost: first, you had to identify which parts of the program could be executed in parallel, and then you had to program a parallel algorithm for those parts.
Nowadays, it is very common for a processor to have multiple cores. I bet your phone has at least four. What does this have to do with supercomputers?
Let’s imagine you would like to build the biggest number-crunching machine the world has ever seen. You could, for instance, add more and more cores to a single processor, but eventually you would run into trouble removing heat from the system and your processor would melt. A better way is thus to connect multiple (efficiently cooled) machines together and let them cooperate on the same task over a network.
That is basically what a supercomputer is: a very big cluster of, in principle, independent machines that are able to cooperate on the same task by communicating over a very fast network. In academic research, we have the Dutch supercomputer Cartesius at our disposal. To get an idea of the scale of such facilities: Cartesius comprises 47,776 cores and has 130 TB of memory. It is currently still ranked in the top 100 (at position 97) of the fastest supercomputers.
My challenge for this week was to run my force field fitting program on Cartesius. To explain why this is difficult, I have to explain how processors can communicate in parallel algorithms. We distinguish two kinds of architectures for parallel programs: 1) shared memory and 2) distributed memory.
Figure 1: Shared memory architecture. All processors can ‘see’ and use the same memory.
Figure 2: Distributed memory architecture. All processors can only directly see their own share of memory. To see the memory of other processors, they have to communicate this data over the network.
Cartesius uses both shared and distributed memory in the following way.
Cartesius is built up out of nodes. Each node has a number of cores which all share the same memory. If you run a program on only a single node, all your threads can read instructions and data from the same memory. This is way easier to program, but in the end you are limited to the number of cores available on any single node.
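To illustrate the shared-memory case, here is a small, hypothetical Python sketch (my actual program does not look like this): four threads running on the same node all read from and write to one and the same array, with a lock guarding the writes.

```python
import threading

# Illustrative only: four "cores" (threads) sharing one memory space.
# Every thread writes into the very same list -- no copying needed.
results = [0.0] * 8          # shared memory, visible to all threads
lock = threading.Lock()      # guards concurrent writes

def worker(start, stop):
    # Each thread computes its own slice of the task...
    for i in range(start, stop):
        value = i * i        # stand-in for a real calculation
        with lock:
            results[i] = value   # ...and writes it into the shared array

threads = [threading.Thread(target=worker, args=(i * 2, i * 2 + 2))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)  # prints [0, 1, 4, 9, 16, 25, 36, 49]
```

Note that no data ever has to be sent anywhere: the moment one thread updates the array, all other threads can see the change.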
Since I wanted to develop my program so that it can run on multiple nodes, it needs support for distributed memory systems.
The challenge here is that you have to program the distribution of memory among the cores yourself.
Remember the analogy with building a house from last time? In a shared memory system, you can tell all the construction workers where the blueprint (or construction plan) is and they will all make use of it. If one worker adjusts the blueprint, the change is immediately visible to all the other workers, because they all use the same blueprint. In a distributed memory set-up, every worker has their own copy of the blueprint, so after any adjustment of the construction plan you have to make sure the updated plan is copied to all the other workers; otherwise, they might have different ideas about what the house should look like.
In the end, I managed to develop my program in such a way that it can make efficient use of distributed memory architectures. I benchmarked my program and it is indeed able to run in parallel on the Cartesius supercomputer. The next phase is to perform (low-level) optimization of my program. After you have written your program and checked that it compiles and runs, you typically still have a relatively crude program with lots of room for further improvement. You start reviewing your code and identifying the bottlenecks, which you then try to resolve. This is called optimization (for speed) and will be the topic of the next blog.
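As a small, made-up taste of what such an optimization can look like: the naive version below redoes identical work inside a loop, while the optimized version computes it once and reuses it — same answer, far less time. (This is just a toy; the real bottlenecks in a fitting code are of course more subtle.)

```python
import time

# Toy example of finding and fixing a bottleneck: the naive version
# recomputes the same sum on every iteration; the optimized one does not.

def naive(values, n):
    total = 0.0
    for _ in range(n):
        total += sum(values)         # bottleneck: recomputed n times
    return total

def optimized(values, n):
    s = sum(values)                  # computed once, reused n times
    return s * n

values = list(range(1000))
t0 = time.perf_counter()
slow = naive(values, 10_000)
t_naive = time.perf_counter() - t0

t0 = time.perf_counter()
fast = optimized(values, 10_000)
t_opt = time.perf_counter() - t0

assert slow == fast                  # same answer, much less work
print(f"naive: {t_naive:.4f}s, optimized: {t_opt:.6f}s")
```

Measuring before and after, as done here, is essential: an "optimization" that you cannot show to be faster (and correct) is not one.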