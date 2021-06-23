



The era of vector supercomputing may seem like ancient history to some, but it is still deeply rooted in major business and government institutions. While relying on old codes is neither optimal nor desirable in most cases, rewriting or porting codes that still do a heroic job has often been neither possible nor practical given the costs and challenges, especially in areas with demanding high performance computing applications.

The US Naval Research Laboratory is one of the organizations hoping to revive long-used vector codes on modern systems without high-overhead code refactoring. Specifically, its researchers looked at a deeply legacy Computational Fluid Dynamics (CFD) solver created at the US Air Force Research Laboratory, which was written in Fortran and extended over the years with Fortran 90 and MPI tweaks.

The FDL3DI code, which first appeared in the early 1990s, was designed for vector processors and is still used in aerospace and other fields, almost exclusively in government applications. Keith Obenschain of the Naval Research Laboratory, along with collaborators at NEC and Syntek Technologies, set out to gain the benefits of modern HPC hardware by running the code on the NEC Vector Engine, without the trauma of rewriting or porting it.

NEC's vector history dates back to 1983, as do some of the codes still in use today, and the company has carried that computational approach forward in thoroughly modern form with the Vector Engine (VE). Each Vector Engine has eight cores for a combined 2.15 teraflops of double-precision performance, along with what you would expect from other top processors (six HBM modules totaling 48 GB of memory, for example). The secret sauce is in NEC's scalar processing unit, which handles all non-vector instructions in a code, while vectorized C, C++, and Fortran with MPI run on the VE. These units are scalable, with each host managing up to eight VEs (in the case of the Naval Research Laboratory, they were housed in an HPE Apollo 6500 Gen10 system with eight VEs).

The aim was to assess how the performance and usability of NEC Vector Engines compare with existing CPU architectures using a legacy CFD solver, Obenschain and colleagues explain. FDL3DI was originally vectorized and optimized for efficient operation on vector processing machines. The NEC VE architecture, its high memory bandwidth, and its ability to compile Fortran were the main motivations for the evaluation.

With optimizations, the vector architecture was found to be three times faster on problems bound by main memory, while CPU architectures remained competitive on smaller problem sizes. Reaching this performance with well-known, standard optimization techniques is considered a key advantage of the architecture.

By profiling and modifying key compute kernels using both typical vector optimizations and ones specific to the NEC VE, the team enabled the code to use the vector engine hardware successfully with minimal code modification. Scalar code introduced later in FDL3DI's lifespan was replaced with vector-friendly implementations.
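As a rough illustration of what turning scalar code into a form that vector hardware can use might look like, the sketch below solves several independent tridiagonal systems in two ways. The example is hypothetical and not taken from FDL3DI; it was chosen only because line solves with a recurrence are common in implicit CFD schemes. The function names `thomas_single` and `thomas_batched` are invented for this sketch, and plain Python is used purely to show loop structure, not performance. In the batched form the innermost loop runs over independent systems, which is the kind of loop a Fortran or C compiler for a vector machine could typically vectorize.

```python
# Hypothetical illustration of a vector-friendly loop restructuring.
# Nothing here comes from FDL3DI; Python only shows the loop ordering.

def thomas_single(a, b, c, d):
    """Solve one tridiagonal system (a: sub-, b: main, c: super-diagonal).

    Each step depends on the previous one, so the loop over i is a
    recurrence and cannot be vectorized along i.
    """
    n = len(d)
    cp = [0.0] * n
    dp = [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = [0.0] * n
    x[n - 1] = dp[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def thomas_batched(a, b, c, d):
    """Solve m independent systems at once; arrays are indexed [i][s].

    The recurrence over i stays in the outer loop. The inner loop over s
    touches independent systems, so its iterations carry no dependency.
    """
    n = len(d)
    m = len(d[0])
    cp = [[0.0] * m for _ in range(n)]
    dp = [[0.0] * m for _ in range(n)]
    for s in range(m):
        cp[0][s] = c[0][s] / b[0][s]
        dp[0][s] = d[0][s] / b[0][s]
    for i in range(1, n):
        for s in range(m):  # independent iterations: the vector loop
            denom = b[i][s] - a[i][s] * cp[i - 1][s]
            cp[i][s] = c[i][s] / denom
            dp[i][s] = (d[i][s] - a[i][s] * dp[i - 1][s]) / denom
    x = [[0.0] * m for _ in range(n)]
    for s in range(m):
        x[n - 1][s] = dp[n - 1][s]
    for i in range(n - 2, -1, -1):
        for s in range(m):  # independent iterations: the vector loop
            x[i][s] = dp[i][s] - cp[i][s] * x[i + 1][s]
    return x
```

Both functions perform the same arithmetic per system; only the loop order and data layout differ. On a vector machine the benefit would depend on having many systems per batch and on a layout in which the system index is the contiguous one, which is an assumption of this sketch rather than something the study reports.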

The Naval Research Laboratory team found that the code could run without any changes, but performance improvements required some tuning. Their observations generalize and might be useful to anyone considering reviving old codes.

They explain that codes designed for vector machines have been optimized in various ways for different architectures over the years, and where one of those tricks was optimization through scalar code, more work is needed. Another observation is that the NEC VE works particularly well with codes limited by memory bandwidth, while performance is otherwise comparable to that of AMD Epyc, for example.
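One simple way to reason about why memory-bandwidth-limited codes favor a high-bandwidth part is a roofline-style bound: attainable performance is the smaller of the peak compute rate and the memory bandwidth multiplied by the code's arithmetic intensity (floating-point operations per byte moved). The sketch below is only a back-of-the-envelope model. The 2.15 teraflops peak comes from the article; the VE bandwidth figure and both numbers for the comparison CPU are assumed placeholders, not measurements or specifications from the study, and the output is not meant to reproduce the reported speedup.

```python
# Roofline-style estimate. Only VE_PEAK_GFLOPS comes from the article;
# every other number is an assumed placeholder for illustration.

VE_PEAK_GFLOPS = 2150.0      # 2.15 teraflops double precision (from the article)
VE_BANDWIDTH_GBS = 1200.0    # ASSUMED round figure, not from the article
CPU_PEAK_GFLOPS = 2000.0     # HYPOTHETICAL comparison CPU
CPU_BANDWIDTH_GBS = 200.0    # HYPOTHETICAL comparison CPU


def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Upper bound on performance: min(peak, bandwidth * intensity)."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity above which a code stops being bandwidth bound."""
    return peak_gflops / bandwidth_gbs


if __name__ == "__main__":
    for intensity in (0.25, 1.0, 4.0, 16.0):
        ve = attainable_gflops(VE_PEAK_GFLOPS, VE_BANDWIDTH_GBS, intensity)
        cpu = attainable_gflops(CPU_PEAK_GFLOPS, CPU_BANDWIDTH_GBS, intensity)
        print(f"intensity {intensity:5.2f} flop/byte: "
              f"VE bound {ve:7.1f} GF, CPU bound {cpu:7.1f} GF")
```

Under these assumed numbers, the two bounds differ most at low arithmetic intensity and converge once both processors become compute bound, which matches the qualitative observation above. The model ignores caches, so it says nothing about the small-problem regime.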

As the Naval Research Laboratory team continues its work to breathe new life into old codes, it will explore MPI scaling beyond a few VE units, but expects challenges from the MPI stack and global communication. The team will also look at other applications that might be suited to a boost from the NEC VE.

Some of the benchmark and performance results based on FDL3DI can be found here.

