Optimal Embedding
Embedding refers to the placement and routing of a netlist hypergraph into a 3-D chip layout, together with layout-related optimizations that minimize area, power, and delay metrics while satisfying layout constraints (design rules). In this topic, the ultimate goal is to scale the capacity and speed of optimal and near-optimal solvers for the hypergraph embedding problem. This must advance along two axes: (1) instance complexity (number of vertices and hyperedges), and (2) new objective function terms (timing, power, yield, reliability) and solution constraints (layout design rules, timing).
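As a minimal illustration of the problem input and objective, the Python sketch below models a netlist hypergraph and a toy 3-D wirelength cost. The class, field names, and weighting are illustrative only and do not correspond to any particular solver.

```python
from dataclasses import dataclass, field

@dataclass
class NetlistHypergraph:
    # Vertices are cells/macros; each hyperedge (net) connects an
    # arbitrary subset of vertices.
    vertices: list = field(default_factory=list)
    hyperedges: dict = field(default_factory=dict)  # net name -> list of vertex names

def embedding_cost(placement, hg, w_wl=1.0):
    # Toy objective: 3-D half-perimeter wirelength, a stand-in for the
    # area/power/delay terms a real embedder would trade off.
    # placement maps each vertex to an (x, y, z) location in the layout.
    total = 0.0
    for net, pins in hg.hyperedges.items():
        for axis in range(3):
            coords = [placement[p][axis] for p in pins]
            total += max(coords) - min(coords)
    return w_wl * total

# Example: two nets over three cells placed on two tiers (z = 0 or 1).
hg = NetlistHypergraph(vertices=["a", "b", "c"],
                       hyperedges={"n1": ["a", "b"], "n2": ["a", "b", "c"]})
place = {"a": (0, 0, 0), "b": (2, 1, 0), "c": (1, 3, 1)}
print(embedding_cost(place, hg))  # (2+1+0) + (2+3+1) = 9.0
```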
SMT-based Polygon-level Layout
Advances in lithography technology have led to a scenario where the minimum metal pitch (MP) becomes smaller than the contact poly pitch (CPP). This discrepancy has underscored the need to find an optimal ratio between CPP and MP, called the “gear ratio” (GR), during Design-Technology Co-optimization (DTCO) exploration. While existing automated cell synthesis approaches support uniform grids with limited GR options [1, 2], our SMT-based polygon-level layout approach, called “SMTCell”, provides a novel exploratory framework for cell layout generation that enables flexible GR choices.
To enable flexible GR, we construct a relative layered grid graph. GR is defined only between vertical layers, and vias are created at the crossing points between vertical and horizontal layers; accordingly, a column set is created for each crossing of every vertical-horizontal layer pair. This layer direction and connection concept is illustrated on the left side of Figure 1a. Using these column sets, a relative layered grid graph is constructed, which enables arbitrary GR settings. With this freedom, we can explore multiple GR settings and evaluate them at the block level, which was previously infeasible because of restrictions on GR settings. Block-level evaluation results are shown in Figure 1b.
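The sketch below illustrates the column-set idea under hypothetical pitch values: for each vertical-horizontal layer pair, candidate via columns are spaced by the vertical layer's pitch, so different vertical layers (e.g., poly at CPP and M2 at MP) naturally produce different column densities and hence an arbitrary gear ratio. This is a simplified illustration, not the SMTCell implementation; all layer names and pitches are assumptions.

```python
from itertools import product

def build_column_sets(vertical_layers, horizontal_layers, cell_width):
    # For each (vertical, horizontal) layer pair, collect the x-positions
    # where vias between the two layers may sit, spaced by the vertical
    # layer's pitch. Pitch values here are hypothetical.
    column_sets = {}
    for v, h in product(vertical_layers, horizontal_layers):
        n_cols = cell_width // v["pitch"] + 1
        column_sets[(v["name"], h["name"])] = [i * v["pitch"] for i in range(n_cols)]
    return column_sets

# Hypothetical pitches giving a CPP:MP gear ratio of 3:2.
vertical = [{"name": "poly", "pitch": 6}, {"name": "M2", "pitch": 4}]
horizontal = [{"name": "M1", "pitch": 5}]
cols = build_column_sets(vertical, horizontal, cell_width=24)
# cols[("poly", "M1")] -> [0, 6, 12, 18, 24]
# cols[("M2", "M1")]   -> [0, 4, 8, 12, 16, 20, 24]
```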
DG-RePlAce
Global placement is a fundamental step in VLSI physical design that determines the locations of standard cells and macros in a layout. However, emerging machine learning accelerators have introduced new challenges for global placement. On the one hand, machine learning accelerators with millions of standard cells and macros raise runtime concerns for the design closure process: for a large design with about 10M instances, the state-of-the-art global placer RePlAce [3] can take about 18 hours to finish global placement. On the other hand, machine learning accelerators featuring 2D processing element (PE) arrays, such as systolic arrays, have gained prominence due to their efficiency in convolutional neural network computations. The dataflow and datapath architectures of these accelerators differ substantially from those of traditional datapath designs, requiring dedicated treatment during global placement to achieve good Quality of Results (QoR).
DG-RePlAce is a dataflow-driven, GPU-accelerated analytical global placement framework for large-scale machine learning accelerators. We incorporate a physical hierarchy extraction approach into DG-RePlAce to capture dataflow information during global placement, and we pay special attention to the datapath regularity of machine learning accelerators. Additionally, GPU-accelerated density force and wirelength force computation algorithms speed up the global placement process.
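As a small illustration of the force computation that benefits from GPU acceleration, the sketch below uses PyTorch autograd to evaluate a log-sum-exp smooth wirelength surrogate and its gradients on a GPU when one is available. It is only a didactic stand-in for the dedicated CUDA kernels in DG-RePlAce (whose exact wirelength model may differ), and all sizes and tensors are synthetic.

```python
import torch

def lse_wirelength(x, y, net_pins, gamma=1.0):
    # Log-sum-exp smooth approximation of HPWL, a common surrogate in
    # analytical placement. net_pins is a list of pin-index tensors, one per net.
    total = x.new_zeros(())
    for pins in net_pins:
        for coord in (x[pins], y[pins]):
            total = total + gamma * (torch.logsumexp(coord / gamma, 0)
                                     + torch.logsumexp(-coord / gamma, 0))
    return total

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.rand(1000, device=device, requires_grad=True)
y = torch.rand(1000, device=device, requires_grad=True)
nets = [torch.randint(0, 1000, (4,), device=device) for _ in range(200)]
wl = lse_wirelength(x, y, nets)
wl.backward()  # x.grad, y.grad hold the wirelength forces acting on each pin
```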
Experimental results on a variety of machine learning accelerators show that, compared with RePlAce [3] and DREAMPlace [4], our approach achieves average reductions in routed wirelength of 10% and 7%, and in total negative slack (TNS) of 31% and 34%, respectively. On the two largest TILOS MacroPlacement Benchmarks [5] testcases, DG-RePlAce achieves much better timing metrics (WNS and TNS), measured after post-route optimization, than RePlAce and DREAMPlace; this suggests that the proposed dataflow-driven methodology is not limited to machine learning accelerators. The ongoing roadmap includes integration of congestion prediction and placement optimization (sizing/buffering), along with availability in the master release of the OpenROAD tool.
BlobPlace
Today’s place-and-route (P&R) flows are increasingly challenged by the complexity and scale of modern designs. Often, heuristics must trade off turnaround time against the quality of PPA outcomes. Clustering has long been seen as a solution to these challenges. However, traditional clustering heuristics optimize only a cutsize criterion and do not consider design information (logical hierarchy, timing, switching activity, etc.) that strongly affects PPA outcomes. Previous works demonstrate that clustering can either reduce runtime at the cost of PPA degradation, or improve PPA at the cost of runtime. By contrast, we propose a PPA-aware clustering methodology and an improved clustered placement approach based on machine learning (ML)-accelerated virtualized place-and-route (V-P&R), to improve both runtime and PPA relative to academic and commercial flat placement methods.
We have developed a PPA-aware clustering methodology that considers additional netlist information (logical hierarchy, timing criticality of paths, and switching activity of nets). It achieves noteworthy PPA benefits and outperforms traditional clustering methods when applied in OpenROAD and Cadence Innovus flows. In our seeded placement approach, a seed placement of clusters induces seed locations of instances, from which the flat P&R flow continues. Obtaining a high-quality seed placement of clusters requires two elements: how to form the clusters, and how to feed the clusters into a placer. To this end, we use our PPA-aware clusters together with a novel V-P&R framework that determines the cluster shapes (utilizations and aspect ratios) used in the cluster placement. We accelerate the V-P&R framework with a graph neural network (GNN)-based ML model that achieves a mean absolute error (MAE) of 0.131 (for label values in the range [0.564, 2.96]) and an R2 score of 0.638.
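To make the prediction step concrete, the sketch below shows a minimal cluster-level GNN regressor in PyTorch. The two-layer graph-convolution architecture, the eight-dimensional node features, and the training target are illustrative assumptions, not the actual BlobPlace model or labels.

```python
import torch
import torch.nn as nn

class ClusterGNN(nn.Module):
    # Minimal two-layer graph convolution over the cluster-level graph.
    # Node features (e.g., cluster area, timing criticality, switching
    # activity) and the regression target are illustrative stand-ins.
    def __init__(self, in_dim=8, hidden=32):
        super().__init__()
        self.lin1 = nn.Linear(in_dim, hidden)
        self.lin2 = nn.Linear(hidden, hidden)
        self.out = nn.Linear(hidden, 1)  # one predicted value per cluster

    def forward(self, x, adj):
        # adj: row-normalized adjacency (with self-loops) of the cluster graph
        h = torch.relu(self.lin1(adj @ x))
        h = torch.relu(self.lin2(adj @ h))
        return self.out(h).squeeze(-1)

# Toy usage on a random 16-cluster graph.
n = 16
adj = (torch.rand(n, n) < 0.2).float() + torch.eye(n)
adj = adj / adj.sum(dim=1, keepdim=True)
feats = torch.rand(n, 8)
pred = ClusterGNN()(feats, adj)
loss = nn.functional.l1_loss(pred, torch.rand(n))  # MAE-style regression loss
```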
With the open-source OpenROAD tool, our approach achieves up to 47% (average: 36%) improvement in global placement runtime with similar half-perimeter wirelength (HPWL), and up to 90% (average: 29%) improvement in post-route total negative slack (TNS). With the commercial Cadence Innovus tool, our methods achieve up to 3.92% (average: 1%) improvement in power and up to 99% (average: 49%) improvement in TNS.
References
[1] D. Park et al., SP&R: SMT-Based Simultaneous Place-and-Route for Standard Cell Synthesis of Advanced Nodes, IEEE Trans. on CAD 40(10) (2021), pp. 2142-2155.
[2] S. Choi et al., PROBE3.0: A Systematic Framework for Design-Technology Pathfinding with Improved Design Enablement, IEEE Trans. on CAD 43(4) (2024), pp. 1218-1231.
[3] C.-K. Cheng, A. B. Kahng, I. Kang and L. Wang, RePlAce: Advancing Solution Quality and Routability Validation in Global Placement, IEEE Trans. on CAD 38(9) (2019), pp. 1717-1730.
[4] Y. Lin, Z. Jiang, J. Gu, W. Li, S. Dhar et al., DREAMPlace: Deep Learning Toolkit-Enabled GPU Acceleration for Modern VLSI Placement, IEEE Trans. on CAD 40(4) (2021), pp. 748-761.
[5] TILOS MacroPlacement Benchmarks, https://github.com/TILOS-AI-Institute/MacroPlacement
[6] H. Esmaeilzadeh, S. Ghodrati, J. Gu, S. Guo et al., VeriGOOD-ML: An Open-Source Flow for Automated ML Hardware Synthesis, Proc. ICCAD, 2021, pp. 1-7.
Team Members
Andrew Kahng (1)
Farinaz Koushanfar (1)
David Pan (2)
Collaborators
Seokhyeong Kang (3)
1. UC San Diego
2. UT Austin
3. POSTECH