2024 DAC
G2PM: Performance Modeling for ACAP Architecture with Dual-Tiered Graph Representation Learning
Author:Tuo Dai; Bizhao Shi; Guojie Luo
Affiliation:Peking University
Abstract:
Performance estimation is a crucial component in the optimization processes of accelerator development on the Versal ACAP architecture. However, existing approaches present limitations - they are either too slow to facilitate efficient iterations, or they lack the necessary accuracy due to the specific AIE array architecture and two-level programming model of Versal ACAP. To tackle this challenge, we propose G2PM, a performance modeling technique based on a hierarchical graph representation centered on the AIE array. More specifically, we employ a hierarchical graph neural network to identify features of both kernel programs and dataflow programs, taking into account the hardware and software characteristics of the Versal ACAP architecture. In our evaluations, our method demonstrates significant improvements, achieving a mean error rate of less than 1.6% and providing a speed-up factor of 4165× compared to the simulation-based method.
2024 TCAS-II
Imbalanced Large Graph Learning Framework for FPGA Logic Elements Packing Prediction.
Author:Zhixiong Di, Runzhe Tao, Lin Chen, Qiang Wu, Yibo Lin
Affiliation:Center for Energy-Efficient Computing and Applications, School of Integrated Circuits, Peking University, Beijing, China; School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China
Abstract:
Packing is a required step in a typical FPGA CAD flow, with high impacts on FPGA placement and routing. Early prediction of packing results can guide design optimization and expedite design closure. In this brief, we propose an imbalanced large graph learning framework, ImLG, to predict whether logic elements will be packed after placement. Specifically, we propose dedicated feature extraction and aggregation methods to enhance the node representation learning of circuit graphs. With imbalanced distribution of packed and unpacked logic elements, we further propose techniques such as graph oversampling and mini-batch training for this imbalanced learning task in large circuit graphs. Experimental results demonstrate that our framework can improve the F1-score by 42.82% compared to the most recent Gaussian-based prediction method. Physical design results show that the proposed method can assist the placer in improving routed wirelength by 0.93%, and SLICE occupation by 0.89%.
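The two ideas this abstract relies on, rebalancing a skewed label distribution and scoring with F1, can be illustrated with a minimal sketch. It is not the ImLG method: it uses plain random oversampling over node labels rather than the graph oversampling the authors describe, and the function names and toy labels are assumptions made for illustration.

```python
import random

def oversample_minority(labels, seed=0):
    """Return sample indices with the minority class randomly duplicated
    until both classes have the same count (plain random oversampling)."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        # Nothing to duplicate: only one class is present.
        return list(range(len(labels)))
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels (hypothetical): 1 = packed, 0 = unpacked, heavily imbalanced.
labels = [0] * 8 + [1] * 2
balanced_idx = oversample_minority(labels)
```

F1 is the natural metric here because a predictor that always outputs the majority class scores high accuracy on imbalanced data but an F1 of zero for the minority class.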
2024 TCAS-I
NASA-F: FPGA-Oriented Search and Acceleration for Multiplication-Reduced Hybrid Networks.
Author:Huihong Shi, Yang Xu, Yuefei Wang, Wendong Mao, Zhongfeng Wang
Affiliation:Electronic Science and Engineering, Nanjing University, China; Sun Yat-sen University, Shenzhen, China
Abstract:
The costly multiplications challenge the deployment of modern deep neural networks (DNNs) on resource-constrained devices. To promote hardware efficiency, prior works have built multiplication-free models. However, they are generally inferior to their multiplication-based counterparts in accuracy, calling for multiplication-reduced hybrid models to marry the benefits of both approaches. To achieve this goal, recent works, i.e., NASA and NASA+, have developed Neural Architecture Search (NAS) and Acceleration frameworks to search for and accelerate such hybrid models via a tailored differentiable NAS (DNAS) engine and dedicated ASIC-based accelerators. In this paper, we delve deeper into the inherent advantages of FPGAs and present an enhanced approach called NASA-F, which focuses on FPGA-oriented search and acceleration for hybrid models. Specifically, on the algorithm level, we develop a tailored one-shot supernet-based NAS engine to streamline the search for hybrid models, eliminating the need for executing NAS for each deployment as well as additional training/finetuning steps. On the hardware level, we develop a chunk-based accelerator to fully leverage the diverse hardware resources available on FPGAs for the acceleration of heterogeneous layers in hybrid models, aiming to enhance both hardware utilization and throughput. Extensive experimental results consistently validate the superiority of our NASA-F framework, e.g., we can gain ↑0.67% top-1 accuracy over the prior work NASA on CIFAR100 even without additional training steps for searched models. Additionally, we can achieve up to ↑1.86× throughput and ↑2.16× FPS with ↑0.39% top-1 accuracy over the state-of-the-art multiplication-based system on Tiny-ImageNet. Codes are available at https://github.com/shihuihong214/NASA-F
2023 ASPDAC
Area-Driven FPGA Logic Synthesis Using Reinforcement Learning.
Author:Guanglei Zhou, Jason Helge Anderson
Affiliation: University of Toronto, Toronto, Canada
Abstract:
Logic synthesis involves a rich set of optimization algorithms applied in a specific sequence to a circuit netlist prior to technology mapping. A conventional approach is to apply a fixed “recipe” of such algorithms deemed to work well for a wide range of different circuits. We apply reinforcement learning (RL) to determine a unique recipe of algorithms for each circuit. Feature-importance analysis is conducted using a random-forest classifier to prune the set of features visible to the RL agent. We demonstrate conclusive learning by the RL agent and show significant FPGA area reductions vs. the conventional approach (resyn2). In addition to circuit-by-circuit training and inference, we also train an RL agent on multiple circuits, and then apply the agent to optimize: 1) the same set of circuits on which it was trained, and 2) an alternative set of “unseen” circuits. In both scenarios, we observe that the RL agent produces higher-quality implementations than the conventional approach. This shows that the RL agent is able to generalize, and perform beneficial logic synthesis optimizations across a variety of circuits.
2022 DAC
CNN-inspired analytical global placement for large-scale heterogeneous FPGAs.
Author:Huimin Wang, Xingyu Tong, Chenyue Ma, Runming Shi, Jianli Chen, Kun Wang, Jun Yu, Yao-Wen Chang
Affiliation: State Key Lab of ASIC & System, Fudan University, Shanghai 200433, China; Zhangjiang Fudan International Innovation Center, Fudan University, Shanghai 200433, China; Graduate Institute of Electronics Engineering, National Taiwan University, Taipei 10617, Taiwan; Department of Electrical Engineering, National Taiwan University, Taipei 10617, Taiwan
Abstract:
The fast-growing capacity and complexity are challenging for FPGA global placement. Besides, while many recent studies have focused on the eDensity-based placement for its great efficiency and quality, they suffer from redundant frequency translation. This paper presents a CNN-inspired analytical placement algorithm to effectively handle the redundant frequency translation problem for large-scale FPGAs. Specifically, we compute the density penalty by a fully-connected propagation and gradient to a discrete differential convolution backward. With the FPGA heterogeneity, vectorization plays a vital role in self-adjusting the density penalty factor and the learning rate. In addition, a pseudo net mode is used to further optimize the site constraints by establishing connections between blocks and their nearest available regions. Finally, we formulate a refined objective function and a degree-specific gradient preconditioning to achieve a robust, high-quality solution. Experimental results show that our algorithm achieves an 8% reduction on HPWL and 15% less global placement runtime on average over leading commercial tools.
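For readers unfamiliar with the HPWL metric reported above, the following is a minimal sketch of half-perimeter wirelength under the usual definition: for each net, the half-perimeter of the bounding box enclosing its pins, summed over all nets. The block names and coordinates are made up for illustration, and pins are assumed to sit at block origins.

```python
def hpwl(nets, positions):
    """Sum over nets of the half-perimeter of each net's bounding box.

    nets:      list of nets, each a list of block names
    positions: dict mapping block name -> (x, y)
    """
    total = 0.0
    for net in nets:
        if len(net) < 2:
            continue  # a net with fewer than two pins has no wirelength
        xs = [positions[b][0] for b in net]
        ys = [positions[b][1] for b in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

# Hypothetical placement of three blocks and two nets.
positions = {"a": (0, 0), "b": (3, 4), "c": (1, 1)}
nets = [["a", "b"], ["a", "b", "c"]]
total_wirelength = hpwl(nets, positions)
```

Analytical placers typically optimize a smooth approximation of this quantity, since the max and min terms are not differentiable.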
2022 ASPDAC
DREAMPlaceFPGA: An Open-Source Analytical Placer for Large Scale Heterogeneous FPGAs using Deep-Learning Toolkit.
Author:Rachel Selina Rajarathnam, Mohamed Baker Alawieh, Zixuan Jiang, Mahesh A. Iyer, David Z. Pan
Affiliation: Department of Electrical & Computer Engineering, The University of Texas at Austin, TX, USA; Intel Corporation, San Jose, CA, USA
Abstract:
Modern Field Programmable Gate Arrays (FPGAs) are large-scale heterogeneous programmable devices that enable high performance and energy efficiency. Placement is a crucial and computationally intensive step in the FPGA design flow that determines the physical locations of various heterogeneous instances in the design. Several works have employed GPUs and FPGAs to accelerate FPGA placement and have obtained significant runtime improvement. However, with these approaches, it is a non-trivial effort to develop optimized and algorithmic-specific kernels for GPU and FPGA to realize the best acceleration performance. In this work, we present DREAMPlaceFPGA, an open-source deep-learning toolkit-based accelerated placement framework for large-scale heterogeneous FPGAs. Notably, we develop new operators in our framework to handle heterogeneous resources and FPGA architecture-specific legality constraints. The proposed framework requires low development cost and provides an extensible framework to employ different placement optimizations. Our experimental results on the ISPD'2016 benchmarks show very promising results compared to prior approaches.
2020 ASPDAC
High-Definition Routing Congestion Prediction for Large-Scale FPGAs
Author:Mohamed Baker Alawieh; Wuxi Li; Yibo Lin; Love Singhal; Mahesh A. Iyer; David Z. Pan
Affiliation:ECE Department, UT Austin
Abstract:
To speed up the FPGA placement and routing closure, we propose a novel approach to predict the routing congestion map for large-scale FPGA designs at the placement stage. After reformulating the problem into an image translation task, our proposed approach leverages recent advancement in generative adversarial learning to address the task. Particularly, state-of-the-art generative adversarial networks for high-resolution image translation are used along with well-engineered features extracted from the placement stage. Unlike available approaches, our novel framework demonstrates a capability of handling large-scale FPGA designs. With its superior accuracy, our proposed approach can be incorporated into the placement engine to provide congestion prediction resulting in up to 7% reduction in routed wirelength for the most congested design in ISPD 2016 benchmark.
2020 ICCAD
flowTune: Practical multi-armed bandits in Boolean optimization
Author: Cunxi Yu.
Affiliation: University of Utah
Abstract:
Recent years have seen increasing employment of decision intelligence in electronic design automation (EDA), which aims to reduce the manual efforts and boost the design closure process in modern toolflows. However, existing approaches either require a large number of labeled data for training or are limited in practical EDA toolflow integration due to computation overhead. This paper presents a generic end-to-end and high-performance domain-specific, multi-stage multi-armed bandit framework for Boolean logic optimization. This framework addresses optimization problems on a) And-Inv-Graphs (# nodes), b) Conjunction Normal Form (CNF) minimization (# clauses) for Boolean Satisfiability, c) post static timing analysis (STA) delay and area optimization for standard-cell technology mapping, and d) FPGA technology mapping for 6-in LUT architectures. Moreover, the proposed framework has been integrated with ABC [1], Yosys [2], VTR [3], and industrial tools. The experimental results demonstrate that our framework outperforms both hand-crafted flows [1] and ML explored flows [4], [5] in quality of results, and is orders of magnitude faster compared to ML-based approaches [4], [5].
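The multi-armed bandit idea behind this abstract can be sketched with the textbook UCB1 rule. This is not the flowTune algorithm, which is described as domain-specific and multi-stage. The arms stand in for candidate logic transformations (for example ABC's `rewrite`, `refactor`, and `balance` commands), and the reward function is a hypothetical placeholder for a measured improvement such as normalized node reduction; no synthesis tool is actually invoked.

```python
import math

def ucb1(reward_fn, n_arms, rounds):
    """Run UCB1 for a number of rounds and return how often each arm was pulled.

    reward_fn(arm) must return a reward in [0, 1] for pulling that arm.
    """
    counts = [0] * n_arms   # pulls per arm
    sums = [0.0] * n_arms   # accumulated reward per arm
    for t in range(1, rounds + 1):
        if t <= n_arms:
            arm = t - 1  # initialization: pull every arm once
        else:
            # Pick the arm with the best mean reward plus exploration bonus.
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        sums[arm] += reward_fn(arm)
        counts[arm] += 1
    return counts

# Hypothetical fixed rewards, one per candidate transformation.
ARMS = ["rewrite", "refactor", "balance"]
ASSUMED_REWARD = [0.1, 0.9, 0.5]

pull_counts = ucb1(lambda arm: ASSUMED_REWARD[arm], len(ARMS), rounds=500)
best_arm = ARMS[pull_counts.index(max(pull_counts))]
```

The appeal for toolflow integration is that a bandit needs no labeled training data: it learns only from the rewards it observes while running.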
2019 TRETS
PIMap: A flexible framework for improving LUT-based technology mapping via parallelized iterative optimization
Author:Gai Liu and Zhiru Zhang.
Affiliation:School of Electrical and Computer Engineering, Cornell University, USA
Abstract:
2018 FPL
Machine-Learning Based Congestion Estimation for Modern FPGAs
Author:D. Maarouf, A. Alhyari, Z. Abuowaimer, T. Martin, A. Gunter, G. Grewal, S. Areibi, A. Vannelli
Affiliation:School of Engineering/School of Computer Science, University of Guelph, Guelph, Ontario, Canada
Abstract:
Avoiding congestion for routing resources has become one of the most important placement objectives. In this paper, we present a machine-learning model for accurately and efficiently estimating congestion during FPGA placement. Compared with the state-of-the-art machine-learning congestion-estimation model, our results show a 25% improvement in prediction accuracy. This makes our model competitive with congestion estimates produced using a global router. However, our model runs, on average, 291× faster than the global router.
2016 DATE
Adaptive Threshold Non-Pareto Elimination: Re-thinking machine learning for system level design space exploration on FPGAs
Author:Pingfan Meng, Alric Althoff, Quentin Gautier, and Ryan Kastner.
Affiliation:Department of Computer Science and Engineering, University of California
Abstract:
One major bottleneck of the system level OpenCL-to-FPGA design tools is their extremely time consuming synthesis process (including place and route). The design space for a typical OpenCL application contains thousands of possible designs even when considering a small number of design space parameters. It costs months of compute time to synthesize all these possible designs into end-to-end FPGA implementations. Thus, the brute force design space exploration (DSE) is impractical for these design tools. Machine learning is one solution that identifies the valuable Pareto designs by sampling only a small portion of the entire design space. However, most of the existing machine learning frameworks focus on improving the design objective regression accuracy, which is not necessarily suitable for the FPGA DSE task. To address this issue, we propose a novel strategy - Adaptive Threshold Non-Pareto Elimination (ATNE). Instead of focusing on regression accuracy improvement, ATNE focuses on understanding and estimating the inaccuracy. ATNE provides a Pareto identification threshold that adapts to the estimated inaccuracy of the regressor. This adaptive threshold results in a more efficient DSE. For the same prediction quality, ATNE reduces the synthesis complexity by 1.6 - 2.89× (hundreds of synthesis hours) against the other state of the art frameworks for FPGA DSE. In addition, ATNE is capable of identifying the Pareto designs for certain difficult design spaces which the other existing frameworks are incapable of exploring effectively.
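The elimination step described above can be illustrated with a simplified sketch: a design is discarded only when some other design beats it by more than a margin in every predicted objective. This is not ATNE itself. The margin `delta` is a fixed number here, whereas ATNE adapts the threshold to the estimated inaccuracy of the regressor. The objective tuples are hypothetical predictions, and both objectives are assumed to be minimized.

```python
def dominates_with_margin(q, p, delta):
    """True if q beats p by at least delta in every objective (minimization),
    and by strictly more than delta in at least one."""
    return all(qi + delta <= pi for qi, pi in zip(q, p)) and any(
        qi + delta < pi for qi, pi in zip(q, p)
    )

def surviving_designs(predicted, delta):
    """Indices of designs that are not confidently dominated.

    With delta = 0 this is the ordinary Pareto front of the predictions;
    a larger delta keeps more candidates when predictions are less trusted.
    """
    keep = []
    for i, p in enumerate(predicted):
        if not any(
            dominates_with_margin(q, p, delta)
            for j, q in enumerate(predicted)
            if j != i
        ):
            keep.append(i)
    return keep

# Hypothetical predicted (latency, area) pairs for five designs.
predicted = [(1, 5), (2, 2), (5, 1), (3, 3), (2.1, 2.1)]
strict_front = surviving_designs(predicted, delta=0.0)
tolerant_front = surviving_designs(predicted, delta=0.5)
```

In this toy example the design at index 4 is dropped by the strict front but retained once the margin exceeds its small predicted gap to index 1, which is the behavior one would want when the regressor's error is comparable to that gap.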
AI+EDA
FPGA synthesis, placement, and routing