I was once tasked with the same scenario when I was an Engineering IT Manager for an aerospace startup.
I would first ask for the definition of "bang for the buck". Are we referring to max TFLOPS / hardware cost, HPC utilization / total cost of ownership, value to the business (time to market, minimizing prototyping and tooling costs, material optimization) / total cost of ownership, or something else entirely? The best bang for the buck is usually a managed cloud HPC provider, but based on your post it sounds like you really want to build and maintain your own HPC. Given the ITAR nature of our business and the lack of ITAR-compliant cloud vendors at the time, we had to build ours. Below is the process we followed; feel free to modify it as needed.
As someone previously stated, the first key consideration is the workload. Explicit and implicit solvers scale very differently, as do different FEA and CFD codes, and most will plateau on certain interconnects. There is a very large difference in architecture between running NASA FUN3D, Ansys CFD, LS-DYNA, and Siemens Nastran, and the architecture will change depending on the requirements and performance goals.
Once you know the target solvers, consider the interconnect requirement next. Some codes do not scale well across multiple nodes, which eliminates the need for a fast interconnect, but others demand extremely low latency (e.g., InfiniBand). Those codes run very inefficiently without microsecond-latency RDMA capability.
Next, evaluate GPU compute compatibility. Be careful, as some vendors have only partially implemented GPU compute, and it is only used under certain circumstances.
When evaluating CPU choice, we always used Performance/Watt as the benchmark. Check spec.org for normalized performance comparisons and divide by TDP.
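As a rough sketch of that comparison, the snippet below ranks candidate CPUs by benchmark score divided by TDP. The CPU names and numbers are made-up placeholders, not real SPEC results; substitute the published scores and TDP values for the parts you are actually considering.

```python
from dataclasses import dataclass


@dataclass
class CpuCandidate:
    name: str
    benchmark_score: float  # normalized score you looked up (placeholder values here)
    tdp_watts: float        # vendor-published TDP in watts

    @property
    def perf_per_watt(self) -> float:
        # Performance/Watt as described above: score divided by TDP.
        return self.benchmark_score / self.tdp_watts


def rank_by_perf_per_watt(cpus):
    """Return candidates sorted from best to worst performance per watt."""
    return sorted(cpus, key=lambda c: c.perf_per_watt, reverse=True)


# Hypothetical candidates; these are NOT real benchmark numbers.
candidates = [
    CpuCandidate("cpu_a", benchmark_score=300.0, tdp_watts=200.0),
    CpuCandidate("cpu_b", benchmark_score=420.0, tdp_watts=350.0),
    CpuCandidate("cpu_c", benchmark_score=250.0, tdp_watts=125.0),
]

ranked = rank_by_perf_per_watt(candidates)
for cpu in ranked:
    print(f"{cpu.name}: {cpu.perf_per_watt:.2f} points/W")
```

Note that the highest raw score (cpu_b in this made-up data) does not necessarily win once power is factored in.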
Memory per system (node) is a function of: solution memory size * number of concurrent jobs needed / number of nodes * ~1.25, where the ~1.25 factor adds roughly 25% headroom.
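A minimal worked example of that sizing rule, assuming memory is measured in GB and the jobs spread evenly across nodes. The function name and the sample inputs (256 GB per solution, 4 concurrent jobs, 8 nodes) are hypothetical, chosen only to illustrate the arithmetic.

```python
def memory_per_node_gb(solution_mem_gb, concurrent_jobs, nodes, headroom=1.25):
    """Estimate RAM per node: solution size * concurrent jobs / nodes * ~1.25."""
    if nodes <= 0:
        raise ValueError("nodes must be a positive integer")
    return solution_mem_gb * concurrent_jobs / nodes * headroom


# Hypothetical inputs: 256 GB per solution, 4 concurrent jobs, 8 nodes.
required = memory_per_node_gb(solution_mem_gb=256, concurrent_jobs=4, nodes=8)
print(f"Plan for about {required:.0f} GB of RAM per node")
```

In practice you would round the result up to the next DIMM configuration the platform supports.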
Supermicro is by far the cheapest solution if you are going to integrate yourself.
Linux is the standard operating system, and most software vendors support most RHEL derivatives.
Don't forget to consider storage. Most HPC systems generate many TBs of data and, during a job, need high-bandwidth storage access, usually shared. Phase 1 for us was to build a storage server with 10 TB of SSDs and share the volume to all nodes over NFS. We dreamed of DDN but could not afford it.
Connect everything together, then determine your MPI stack and any interconnect RDMA requirements (OFED, etc.). Install the OS, configure a workload manager such as Slurm, write your submission scripts, and start testing. Plan for a very long testing cycle if you integrate yourself.
Experts are hard to find and can be expensive. Good luck, and I hope you have a team of rock-star Linux gurus.
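To give a feel for the submission-script step, here is a small sketch that renders a basic Slurm batch script from a few parameters. The job name, node counts, and solver command are hypothetical, and whether `srun` is the right launcher depends on the MPI stack you chose; treat this as a starting template, not a tested production script.

```python
def render_sbatch_script(job_name, nodes, tasks_per_node, walltime, solver_cmd):
    """Build the text of a minimal Slurm batch script."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={tasks_per_node}",
        f"#SBATCH --time={walltime}",
        "",
        # Assumption: the solver is launched through srun; some MPI stacks
        # and commercial solvers ship their own launcher instead.
        f"srun {solver_cmd}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical job: the solver binary and input file are placeholders.
script = render_sbatch_script(
    job_name="cfd_case01",
    nodes=4,
    tasks_per_node=32,
    walltime="12:00:00",
    solver_cmd="./my_solver --input case01.cfg",
)
print(script)
```

You would write the result to a file and submit it with `sbatch`.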