HPC 101 Tutorial and RQJ
What does RQJ do?
RQJ is a highly specialized software program that takes anywhere from a few to thousands of computers and makes them function as a single large computer. RQJ enables engineers and scientists to rapidly iterate on mathematical models for whatever problem they are working on. Essentially, a few engineers can do the work of many. Drawing on our past work, one scientist used a large cluster to finish in two weekends a computation that would have taken two years on a single computer. It freed him up to do other, more productive work. Another scientist doing nuclear modeling for the oilfield was able to perform a new downhole tool design analysis every couple of days. What used to take years and an extremely large team was completed in less than a year by a small team.
RQJ can handle tens of thousands of small jobs taking a few minutes each. It can also handle hundreds of large jobs with each job taking up to several weeks to complete.
High Performance Computing
High performance computing is the software technique of linking computers together so that they work as a single large computer. Most of the easy scientific and business developments have already been found. New breakthrough developments will almost certainly need lots of computer modeling, since that is the most cost-effective way to prove new theories.
What is a Batch Queue?
Often problems require numerous separate computations, sometimes thousands of them. It is not cost-effective to build a Cluster large enough to run all of these jobs at one time. The cost-effective solution is to store the jobs in a queue and run each one as space becomes available in the Cluster. This keeps the computers working nights, weekends and holidays.
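The idea of storing jobs and running them as space frees up can be sketched in a few lines of Python. This is a minimal first-in, first-out simulation, not how RQJ itself schedules work; the function name `simulate_batch_queue`, the notion of a "slot" (one job at a time per slot) and the job lengths are all assumptions made for illustration.

```python
import heapq
from collections import deque


def simulate_batch_queue(job_minutes, total_slots):
    """Run jobs first-in, first-out on a fixed number of slots.

    Returns (finish_times, makespan), where finish_times[i] is the minute
    at which job i completes and makespan is when the last job ends.
    """
    if total_slots < 1:
        raise ValueError("need at least one slot")
    waiting = deque(enumerate(job_minutes))  # jobs stored until a slot frees up
    running = []                             # min-heap of (finish_time, job_id)
    finish_times = [0] * len(job_minutes)
    clock = 0
    while waiting or running:
        # Fill every free slot from the front of the queue.
        while waiting and len(running) < total_slots:
            job_id, minutes = waiting.popleft()
            heapq.heappush(running, (clock + minutes, job_id))
        # Advance the clock to the next job completion.
        clock, job_id = heapq.heappop(running)
        finish_times[job_id] = clock
    return finish_times, clock


if __name__ == "__main__":
    # Five hypothetical jobs (in minutes) sharing two slots.
    finish, makespan = simulate_batch_queue([30, 10, 20, 10, 5], total_slots=2)
    print(finish, makespan)
```

With only two slots the five jobs still all complete; the later ones simply wait in the queue until an earlier job finishes.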
How does a Computer Cluster help my business?
Normally you have to conduct physical tests or construct mathematical models of whatever problem you are solving (Genetics, Seismic, Nuclear, Electrical, Mechanical, new Drugs, Chemical,...) in new product development. It is much safer for humans, animals and the environment if a proper mathematical model can be constructed for the problem at hand. If your team can find a much better solution numerically in months instead of guessing with numerous physical trials over years, your Company will have better Product-Market Fit at a much lower cost. RQJ allows your highly paid scientists and engineers to work much more productively.
What is a single core job?
The easiest numerical problem models run on a single CPU core. This is how most codes are first written. They start at the top of the computer code and run straight down to the end. They will never use more than a single CPU core. These jobs typically take much longer to run, because the computation load is not being shared among multiple CPU cores.
What is SMP?
Symmetric Multi-Processing (SMP) is a programming technique where two or more CPU cores in one computer are used to solve a single problem. SMP jobs speed up as CPU cores are added, but rarely in direct proportion. Each numerical code has its own scaling behavior. Very few problems will run N times faster with N cores.
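One common way to reason about why N cores rarely give an N-fold speedup is Amdahl's law, which is brought in here as an outside model and is not something the text above states. It assumes a fixed fraction of the run time can be spread across cores while the rest stays serial. The function name and the 90% figure below are illustrative assumptions; real codes have to be measured.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup under Amdahl's law.

    parallel_fraction: share of single-core run time that can be spread
                       across cores (0.0 to 1.0).
    cores:             number of CPU cores used (>= 1).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical code in which 90% of the work parallelizes.
    for n in (1, 2, 4, 8, 16, 64):
        print(n, round(amdahl_speedup(0.9, n), 2))
```

Under this model a code that is 90% parallel can never run more than 10 times faster, however many cores are added, because the serial 10% always remains.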
What is MPI?
Message Passing Interface (MPI) is a standard for passing messages between cooperating processes. It lets you link M computers, each with N CPU cores, to work on a single problem. Much larger problems can be solved with this setup. Typical problem types are weather prediction, large stock market prediction, and nuclear, seismic, electromagnetic and fluid flow problems. MPI jobs speed up as CPU cores are added, but rarely in direct proportion. Very few problems will run N times faster with N cores.
How do I size machines in the Cluster?
Often, relatively few scientists and engineers will account for the bulk of your computational needs. Each person should be able to accurately describe the type of machine (CPU speed, number of CPUs, RAM, disk space, network speeds...) that is required to solve their problems. Hopefully there is some overlap in specifications. SRS recommends that a separate Batch Queue be set up for each size of problem. Trying to run most large scientific codes on computers that are too small can take dramatically more wall clock time, since the CPU will spend more time in wait states than performing the actual calculations.
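The recommendation of one Batch Queue per problem size can be sketched as a simple routing rule: send each job to the smallest queue whose machines can hold it. The queue names, core counts and RAM limits below are hypothetical and are not RQJ configuration; they only illustrate the sizing idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueSpec:
    name: str
    max_cores: int    # cores available on one machine in this queue
    max_ram_gb: int   # RAM available on one machine in this queue


# Hypothetical queue sizes, for illustration only.
QUEUES = [
    QueueSpec("small", max_cores=4, max_ram_gb=16),
    QueueSpec("medium", max_cores=16, max_ram_gb=128),
    QueueSpec("large", max_cores=64, max_ram_gb=512),
]


def pick_queue(cores_needed, ram_gb_needed, queues=QUEUES):
    """Return the name of the smallest queue that fits the job."""
    for q in sorted(queues, key=lambda q: (q.max_cores, q.max_ram_gb)):
        if cores_needed <= q.max_cores and ram_gb_needed <= q.max_ram_gb:
            return q.name
    raise ValueError("no queue is large enough for this job")


if __name__ == "__main__":
    print(pick_queue(2, 8))     # fits the smallest machines
    print(pick_queue(2, 64))    # few cores, but needs more RAM
    print(pick_queue(32, 64))   # needs many cores
```

Routing by both cores and RAM keeps a memory-hungry job off a machine where it would spend most of its time waiting rather than calculating.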
Can I randomly mix and match machines?
Not really. See "How do I size machines in the Cluster?".
Can I use older desktops that have gone out of warranty?
Provided they meet the problem specifications, sure. We have done this several times at several different Companies with great success.
Can I just buy several new AMD Threadripper CPUs?
This option should be seriously considered. We REALLY like AMD's new CPU. Beasts!
Phases of Research and Development and Cluster Size
This is a synopsis of decades in the oilfield.
First, an engineer has to tune his math model to the problem. This can take some time. In this stage, the Cluster can be small. Make sure that excess capacity is available to help speed this phase along.
In the second stage, the engineer will start to apply his model to an increasing number of conditions and problems. The cluster will need to be sized accordingly.
In the third stage, the engineer will have created a repeatable process to model and analyze the results of the hundreds of separate computer runs. This is when the Cluster should be expanded rapidly.
Political Problems between Departments
There is almost always a disagreement between departments on how to divide up computer resources. Additionally, this causes a chargeback/billing nightmare. It really does simplify life to build a separate Cluster for each Department. Been there, done that. Also, it is relatively rare for two Departments to have similar computer sizing needs, so in practice you will effectively have two Clusters anyway. It does not matter if the Clusters are logically separated or physically separated. They still can't process the same problems.
What kind of Administrators do I need and how many?
Obviously, the larger the Cluster, the more support you need. The real question is: do you choose a truly complex queuing Software, or a next generation Software product (RQJ) that automates as many of the queuing problems as possible? The next question is whether you need support eight hours per day Monday-Friday, on weekend days and/or 24x7.
If you choose the complex legacy Software, plan on trying to find multiple Administrators. These types of individuals can normally go to cloud computer companies and have a very secure future. In Houston, Texas where there is a lot of HPC activity because of the Oilfield, it is difficult to find and keep these employees. We can only guess at how hard it is in other parts of the US.
If you choose a next generation package like RQJ, you can get by with junior administrators if you need to.
Are you really serious about your Open Source Claims?
Absolutely! We really like Open Source a lot, but not for HPC. Our principals have 50 years of HPC experience, wrote two queuing systems from scratch along with a research paper, and have decades of commercial Software development and Administration experience. We tried to compile the most likely Open Source project and spent a fair amount of time with nothing to show for it. We even chose one that we had previously gotten to work in years past and that is still running at a client site. Do you have equivalent skills on your team? Is it worth the wait? Can you fix any problems? Probably not.
We want to stress that these projects have not been maintained for years (often decades) and can't be compiled with current build environments and compilers. The good packages were written in Java. Java 7 and 8 have broken most Open Source Java projects. Not too long ago, 60%+ of Open Source projects could not compile under Java 8. Migrating from Java 8 to 11 is fraught with problems; see the comments from the JodaTime library Author. Over 90% of Java Developers are still at Java 8 and below. Here are three JVM usage surveys from developers: Survey 1, Survey 2 and Survey 3.
If you are not convinced yet, a lot of the documentation is located in really old newsgroups, and it is hard to find what you are looking for. If the help file is daunting (as it is for the good HPC Open Source packages), then you know supporting the Software is going to be bad.
RQJ supports fully patched Windows 10 and above. It also supports all current Windows Server Versions that are fully patched. A Windows High Performance Computing Cluster may only contain Windows boxes as cluster members. The main RQJ server *should be* a supported Windows Server Version.
We developed the code on Windows 7/10 boxes, and during development we discovered a number of problems with personal editions of Windows that affect cluster members:
- You are limited to 20 or so IP addresses being addressed concurrently (limits the number of boxes in the cluster)
- You are limited to 32 Tasks, which limits the number of jobs/box (you can buy a version with more tasks)
- The host IP address can change over time (there is no order to the loading of IP devices in the kernel, they are loaded at random -- unlike Linux)
- The Operating System will have more problems than Linux thus causing more reboots
Some of these issues also apply to Windows Server Versions.
It should be noted that ALL of the top 1000 HPC installations run Linux. This is not an accident. All this being said, if you need to run on Windows, or only on Windows 10, that is OK, but you will have a few extra issues from time to time.
We only support current Red Hat, CentOS and Ubuntu distributions. The OS must be at current patch levels. A Linux cluster may only contain Linux machines.
Multiple Cluster Costs
Since RQJ is licensed by the CPU Core, there are no additional charges for having as many clusters as you want. As a practical matter, each cluster should be able to run the same type of problems. This means similar CPUs, Memory, Disk and Networking cards.
All CPU cores in the cluster must be covered by your RQJ license. The software will refuse to run on a computer if there are not enough licenses to fully cover the CPU.
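The per-core rule can be illustrated with a small bookkeeping sketch. How RQJ actually counts and enforces licenses is not described here, so the function names, the in-order admission policy and the machine list below are all hypothetical; the only point carried over from the text is that a machine runs only if every one of its cores can be covered.

```python
def can_run_on(machine_cores, licensed_cores, cores_in_use):
    """True only if the remaining licenses cover every core of the machine."""
    if min(machine_cores, licensed_cores, cores_in_use) < 0:
        raise ValueError("core counts cannot be negative")
    return licensed_cores - cores_in_use >= machine_cores


def admit_machines(machines, licensed_cores):
    """Admit machines in order, skipping any that cannot be fully covered.

    machines: list of (name, core_count) pairs.
    Returns (admitted_names, cores_used).
    """
    admitted, used = [], 0
    for name, cores in machines:
        if can_run_on(cores, licensed_cores, used):
            admitted.append(name)
            used += cores
    return admitted, used


if __name__ == "__main__":
    # Hypothetical cluster: three machines and a 32-core license.
    cluster = [("node-a", 16), ("node-b", 32), ("node-c", 8)]
    print(admit_machines(cluster, licensed_cores=32))
```

In this example the 32-core machine is skipped because only 16 licensed cores remain after the first machine, which is why the license should be sized to the total core count of the cluster.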
What if I need Storm Rider to Create Scripts or Install the Software for us?
We offer remote and local installation services for a fee. We can write required scripts either remotely or on site for a fee. If you need additional consulting, this can also be arranged.
What kind of Guarantee do you offer?
Within the first 30 days, you can get a full refund.
To Purchase Contact:
Storm Rider Software LLC
4321 Kingwood Dr. #57
Kingwood Texas 77339