Infrastructure

The CONVECS computing infrastructure is centered on a primary hub site in Padua, complemented by secondary spoke sites hosted in the data centers of the Universities of Verona and Venice and at the INFN National Laboratory in Legnaro. The hub and spokes are interconnected through high-speed links to facilitate data exchange among the data centers and to give users at every location low-latency access to computing and storage resources.

The infrastructure provides users with computing services under several paradigms: high-performance non-interactive computing through job scheduling systems, virtual machine allocation for cloud computing applications, and remote desktop systems with preconfigured software for visualization and data analysis.
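
The source does not name a specific job scheduler, so purely as an illustration of the non-interactive paradigm, the sketch below submits a batch job through a Slurm-style sbatch command; the script contents and resource requests are hypothetical.

```python
# Minimal sketch of non-interactive job submission, assuming a Slurm-like
# scheduler (the infrastructure's actual scheduler is not specified here);
# the job script and its resource requests are illustrative only.
import subprocess
import tempfile

JOB_SCRIPT = """#!/bin/bash
#SBATCH --job-name=convecs-demo
#SBATCH --ntasks=1
#SBATCH --time=00:10:00

echo "running on $(hostname)"
"""

def submit_job(script_text: str) -> str:
    """Write the batch script to a temporary file and hand it to sbatch."""
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write(script_text)
        script_path = f.name
    result = subprocess.run(
        ["sbatch", script_path], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()  # e.g. "Submitted batch job 12345"

if __name__ == "__main__":
    print(submit_job(JOB_SCRIPT))
```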

The architecture is composed of four functional blocks:

  • Block 1 consists of servers specifically dedicated to delivering computing services.
  • Block 2 is responsible for controlling and managing all scientific computing nodes: for example, it handles the allocation of physical servers required for non-interactive computing and manages automatic failover for nodes experiencing operational issues (a sketch of such a health-check loop follows Figure 1).
  • Block 3 consists of a storage system used to maintain copies of virtual machine images and to provide space for storing data processed by the computing nodes.
  • Block 4 includes the high-performance networking devices and connections required to interconnect the hardware systems within a single site, ensuring high-speed communication and aggregation among the nodes.

Figure 1. Reference architecture diagram for data centers. HPC = High Performance Computing. VDI = Virtual Desktop Infrastructure. 
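
The management software behind Block 2 is not documented here, so the following is only a conceptual sketch of a health-check and failover loop; the node names, the HTTP health endpoint, and the thresholds are all assumptions.

```python
# Conceptual sketch of a node health-check / failover loop. CONVECS does not
# document its actual management stack, so every name and endpoint below is
# illustrative rather than real.
import time
import urllib.request
import urllib.error

NODES = ["node01.example.org", "node02.example.org"]  # hypothetical node names
CHECK_INTERVAL_S = 30   # seconds between health sweeps
MAX_FAILURES = 3        # consecutive failures before triggering failover

failures = {node: 0 for node in NODES}

def is_healthy(node: str) -> bool:
    """Probe a node's (assumed) health endpoint; any error counts as unhealthy."""
    try:
        with urllib.request.urlopen(f"http://{node}:8080/health", timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def fail_over(node: str) -> None:
    """Placeholder for the real action: drain the node from the scheduler
    and reschedule its workloads on a healthy server."""
    print(f"{node}: marked offline, rescheduling its workloads")

while True:
    for node in NODES:
        if is_healthy(node):
            failures[node] = 0
        else:
            failures[node] += 1
            if failures[node] >= MAX_FAILURES:
                fail_over(node)
                failures[node] = 0
    time.sleep(CHECK_INTERVAL_S)
```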

Computing hardware

The computing block consists of a series of nodes designed to meet the needs of advanced scientific processing. At full capacity, the computing hardware will comprise several types of nodes (a rough tally of the quoted figures follows the list):

  • 41 CPU-based nodes, totaling over 2,600 cores and 20 TB of RAM. These servers support scalable workloads and are particularly suited to mathematical simulations and high-dimensional data processing.
  • 25 mid-level GPU nodes (e.g., NVIDIA L40S), totaling over 450,000 CUDA cores and 1,000 GB of VRAM. These resources are dedicated to mid-level parallel computing tasks, such as molecular modeling, image analysis, and moderately parallelized algorithms.
  • 4 high-performance GPU nodes based on NVIDIA DGX systems, each equipped with 8 next-generation NVIDIA H100 GPUs operating in parallel; each node therefore provides over 116,000 CUDA cores and 640 GB of VRAM. These resources are dedicated to complex scientific simulations, including highly parallelized computations involving artificial intelligence algorithms based on deep learning and large language models (LLMs).
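
As a rough sanity check, the arithmetic below back-calculates per-GPU and tier-wide capacity from the figures quoted above; since each figure is stated as "over", the results are lower bounds rather than exact specifications.

```python
# Back-of-the-envelope tally of the GPU figures quoted above. Per-node and
# per-tier numbers come from the text; because the text says "over", every
# result below is a lower bound, and the per-GPU split is approximate.
dgx_nodes = 4
gpus_per_node = 8
cores_per_dgx_node = 116_000   # "over 116,000 CUDA cores" per DGX node
vram_per_dgx_node_gb = 640     # 8 GPUs x 80 GB each

print(f"Per H100 GPU: ~{cores_per_dgx_node // gpus_per_node:,} CUDA cores, "
      f"{vram_per_dgx_node_gb // gpus_per_node} GB VRAM")
print(f"DGX tier (lower bound): {dgx_nodes * cores_per_dgx_node:,} CUDA cores, "
      f"{dgx_nodes * vram_per_dgx_node_gb:,} GB VRAM")

l40s_nodes = 25
l40s_tier_cores = 450_000      # "over 450,000 CUDA cores" across the tier
l40s_tier_vram_gb = 1_000      # "over 1,000 GB of VRAM" across the tier
print(f"Per L40S node (lower bound): ~{l40s_tier_cores // l40s_nodes:,} CUDA cores, "
      f"{l40s_tier_vram_gb // l40s_nodes} GB VRAM")
```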

The storage block has been designed to meet diverse data storage and management needs, combining high capacity, fast access speeds, and reliability. A first category of storage is dedicated to the long-term archiving of data that does not require frequent access, such as historical datasets and backups; it offers a net capacity of 2.4 PB and is optimized to reduce operational costs through low-energy solutions. A second category is dedicated to data that requires frequent, low-latency access; with a capacity of 4.2 PB, it is optimized to support data-intensive real-time computations, complex simulations, and machine learning applications, ensuring high performance even under significant workloads.
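
The source does not describe how data moves between the two categories; as a purely illustrative sketch, the snippet below migrates files untouched for roughly six months from a hypothetical fast-tier mount point to a hypothetical archive mount point, based on last-access time.

```python
# Illustrative sketch of a cold-data tiering policy between the two storage
# categories described above. The mount points and the 180-day threshold are
# assumptions; CONVECS does not document its actual migration policy.
import shutil
import time
from pathlib import Path

FAST_TIER = Path("/fast")        # hypothetical mount point of the 4.2 PB tier
ARCHIVE_TIER = Path("/archive")  # hypothetical mount point of the 2.4 PB tier
COLD_AFTER_S = 180 * 24 * 3600   # files untouched for ~180 days count as cold

def migrate_cold_files() -> None:
    """Move files not accessed recently from the fast tier to the archive tier."""
    now = time.time()
    for path in FAST_TIER.rglob("*"):
        if path.is_file() and now - path.stat().st_atime > COLD_AFTER_S:
            target = ARCHIVE_TIER / path.relative_to(FAST_TIER)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(path), str(target))
            print(f"archived {path} -> {target}")

if __name__ == "__main__":
    migrate_cold_files()
```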
