LLM4HWDesign

Contest Problem

Introduction

Large Language Models (LLMs) have demonstrated remarkable capabilities in generating high-quality content from natural language prompts, sparking growing interest in their application to hardware design [1,2,3]. The potential of LLMs to streamline design flows and enhance hardware design accessibility for non-experts is significant. Initiatives like Architecture 2.0 [4] aim to transform the hardware design paradigm by leveraging artificial intelligence to create more advanced and efficient hardware systems while significantly reducing manual design overhead.

Despite this significant potential and community excitement, current state-of-the-art (SOTA) pretrained LLMs, such as OpenAI's GPT-4 [5], still struggle to produce practical hardware designs without extensive human intervention. In hardware code generation, for example, these models tend to either (1) generate non-synthesizable or non-functional code that requires human correction, or (2) produce overly simplistic or impractical implementations [3]. This issue stems primarily from the LLMs' limited exposure to hardware design data during pretraining. A pioneering effort, ChipNeMo [6], demonstrates that an in-house large-scale Verilog code dataset can effectively improve LLMs' Verilog code generation abilities; however, no comparable dataset is publicly available, which significantly limits the further development of LLM-assisted hardware design. Developing open-source, high-quality, hardware-specific code datasets is therefore essential for unlocking the full potential of LLM-assisted hardware design.

This year's contest seeks to address this challenge by asking you to help build a large-scale, high-quality Verilog code generation dataset. By open-sourcing this dataset, we aim to establish critical infrastructure for advancing LLM-assisted hardware design workflows. Winning participants will be invited to co-author a technical report summarizing our efforts, insights, and lessons learned, thereby paving the way for future initiatives.

Objective

The goal of this contest is to grow the current Verilog code dataset into a large-scale, high-quality open-source dataset that enables more effective LLM-assisted Verilog code generation through fine-tuning. Participants are asked to (1) collect or generate Verilog code samples and (2) enhance dataset quality through data cleaning and label generation techniques. Participants' contributions will be evaluated based on the improvement their data brings to the fine-tuned LLM.

Problem Definition

To achieve our goal of enriching the Verilog code generation dataset, we take one of the current SOTA datasets, MG-Verilog [7], as the starting point and improve both its scale and its quality. The contest runs in two phases: the first phase aims to increase the scale of the existing dataset, and the second to improve its quality. We define the problem for each phase below.

Phase I

In this phase, we aim to explore scalable methods for collecting and generating Verilog code and corresponding natural language instructions to increase the scale of the Verilog code dataset. Participants are asked to focus on the following areas:

1. Data Collection: Investigate methods to gather new Verilog code samples from various sources, including but not limited to open-source repositories, academic publications, and proprietary designs. All collected samples must be appropriately licensed for open-source and public use (see the collection sketch after this list).

2. Data Generation: Explore techniques that leverage existing LLMs or other tools to generate new Verilog code samples (see the generation sketch after this list). There are no specific restrictions on the approaches participants adopt, provided that the generated content is available for open-source and public use.
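As a concrete starting point for data collection, the sketch below queries GitHub's repository-search API for Verilog projects and keeps only those under permissive licenses. The license whitelist, page count, and the GITHUB_TOKEN environment variable are illustrative assumptions rather than contest requirements; always double-check each repository's license terms before redistributing its code.

```python
import os
import requests

# Illustrative whitelist of licenses commonly considered safe for
# open-source redistribution; verify against the contest's licensing rules.
PERMISSIVE = {"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause", "cc0-1.0"}

def find_permissive_verilog_repos(max_pages=2):
    """Search GitHub for Verilog repositories under permissive licenses."""
    headers = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}
    repos = []
    for page in range(1, max_pages + 1):
        resp = requests.get(
            "https://api.github.com/search/repositories",
            params={"q": "language:verilog", "per_page": 100, "page": page},
            headers=headers,
            timeout=30,
        )
        resp.raise_for_status()
        for item in resp.json()["items"]:
            lic = (item.get("license") or {}).get("key", "")
            if lic in PERMISSIVE:
                repos.append(item["full_name"])
    return repos

if __name__ == "__main__":
    for name in find_permissive_verilog_repos():
        print(name)  # clone these repositories and extract *.v / *.sv files
```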
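Likewise, existing LLMs can synthesize new instruction/code pairs directly. Below is a minimal generation sketch using the openai Python client; the model name, system prompt, and the SPEC example are our own illustrative choices, and any capable code model can be substituted.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical design specification used only to illustrate the prompt.
SPEC = "a parameterized 4-bit synchronous up-counter with active-low reset"

def generate_sample(spec: str) -> dict:
    """Ask an LLM for a Verilog module plus a matching instruction."""
    resp = client.chat.completions.create(
        model="gpt-4",  # illustrative; any capable code model works
        messages=[
            {"role": "system",
             "content": "You are a Verilog expert. Reply with synthesizable "
                        "Verilog-2001 code only, inside one module."},
            {"role": "user", "content": f"Implement {spec}."},
        ],
    )
    return {"instruction": f"Implement {spec}.",
            "code": resp.choices[0].message.content}

print(generate_sample(SPEC)["code"])
```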

Phase II

In this phase, we aim to explore automatic and effective methods to improve the quality of the MG-Verilog dataset. Specifically, participants are tasked with developing and applying innovative data filtering and labeling methods, thereby enhancing the performance of the fine-tuned LLM on the evaluation dataset. Specific areas to explore include:

1. Data Filtering: Develop techniques to automatically remove low-quality data samples from the dataset (see the compile-check sketch after this list), focusing on reducing the potential harm to performance caused by low-quality data collected in Phase I. Please note that our contest restricts data filtering to static methods, which remove a fixed subset of data samples from the dataset for the entire fine-tuning process. Dynamic data filtering, which adaptively includes or excludes samples across different fine-tuning epochs, is not allowed.

2. Accurate Descriptions: Develop techniques to automatically generate more accurate descriptions for the data samples (see the description-generation sketch after this list). Focus on bridging the gap between high-level instructions and the detailed implementations that LLMs are expected to produce during code generation.

3. Label Design: Create labeling strategies that facilitate the learning process of LLMs. Aim to narrow the gap between the knowledge acquired by LLMs during pretraining and the new knowledge needed during fine-tuning.
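For data filtering, one simple static criterion is whether a sample even compiles. The sketch below uses Icarus Verilog's null target as a cheap compile check; it assumes iverilog is installed and that each sample stores its source under a "code" key, which is our assumption about the data layout rather than the contest's.

```python
import os
import subprocess
import tempfile

def compiles(verilog_code: str) -> bool:
    """Return True if Icarus Verilog can elaborate the code.

    `iverilog -t null` parses and elaborates without emitting output,
    making it a cheap static syntax/semantics check.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".v", delete=False) as f:
        f.write(verilog_code)
        path = f.name
    try:
        result = subprocess.run(
            ["iverilog", "-t", "null", path],
            capture_output=True, timeout=30,
        )
        return result.returncode == 0
    finally:
        os.remove(path)

def filter_dataset(samples):
    """Static filtering: drop non-compiling samples once, before fine-tuning."""
    return [s for s in samples if compiles(s["code"])]
```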
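For description and label generation, a common approach is to prompt an LLM to document the code at several levels of granularity. A minimal sketch follows, again assuming the openai client; the prompt wording and the granularity names are illustrative only.

```python
from openai import OpenAI

client = OpenAI()

def describe(code: str, level: str = "detailed") -> str:
    """Generate a natural-language description for a Verilog sample.

    `level` selects the granularity ("high_level" or "detailed");
    both the levels and the prompt wording are our own choices.
    """
    style = ("one-sentence summary of the module's purpose"
             if level == "high_level"
             else "step-by-step description of ports, registers, and behavior")
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You document Verilog code accurately; never invent "
                        "signals that do not appear in the code."},
            {"role": "user", "content": f"Write a {style} for:\n{code}"},
        ],
    )
    return resp.choices[0].message.content
```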

Scoring

We evaluate participants' submissions in Phase I and Phase II separately, with both phases using the CodeLlama-7B-Instruct model as the target LLM and our in-house evaluation dataset as the target evaluation benchmark. To obtain a valid ranking, participants are expected to participate in both phases.
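Participants who want a local proxy for this evaluation before submitting can fine-tune the same base model themselves. Below is a minimal LoRA fine-tuning sketch using Hugging Face transformers, peft, and datasets; the hyperparameters, prompt format, and toy dataset are illustrative only, and the fine-tuning pipeline in the official starting toolkit takes precedence.

```python
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE = "codellama/CodeLlama-7b-Instruct-hf"  # target LLM named by the contest

tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship no pad token
model = AutoModelForCausalLM.from_pretrained(BASE, device_map="auto")

# LoRA keeps the fine-tune affordable; rank and target modules are illustrative.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05, task_type="CAUSAL_LM",
))

# Toy instruction/code pair standing in for the contest dataset.
texts = ["[INST] Implement a 2-to-1 multiplexer. [/INST]\n"
         "module mux2(input a, b, sel, output y);\n"
         "  assign y = sel ? b : a;\nendmodule"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

train_dataset = Dataset.from_dict({"text": texts}).map(
    tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="ckpt", per_device_train_batch_size=1,
        gradient_accumulation_steps=16, num_train_epochs=1,
        learning_rate=2e-4, bf16=True, logging_steps=10,
    ),
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```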

Submission Guidelines

Each participant is required to submit their data samples and generated labels, together with all code and materials needed to reproduce them. Submissions should be organized by phase.

Award-winning teams are expected to submit a technical report introducing their solutions before the award ceremony. Additionally, they are invited to attend the ICCAD conference to present their solutions and receive their awards in person. Detailed guidelines for the technical report and presentation format will be released soon.

Starting Toolkit

At the beginning of Phase I, we will release a starting toolkit that includes:

1. an existing dataset serving as the base dataset;

2. an example dataset of hardware code from external sources, illustrating the expected format of participants' submissions;

3. a codebase to fine-tune a specific LLM with the base dataset and the example submission dataset;

4. an evaluation script to measure how the example submission dataset mitigates the bias of the base dataset; and

5. the deduplication codebase we will use to remove repeated data samples.

Participants can simply replace the example submission dataset with their own collected datasets and use the resulting metric from the starting toolkit to iteratively improve their datasets during Phase I.
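The released deduplication codebase is authoritative for scoring. For intuition only, near-duplicate removal is commonly done with MinHash locality-sensitive hashing, as in this sketch using the datasketch library; the token-set fingerprint and the 0.8 similarity threshold are our illustrative choices.

```python
from datasketch import MinHash, MinHashLSH

def minhash(code: str, num_perm: int = 128) -> MinHash:
    """Fingerprint Verilog code by its set of whitespace-separated tokens."""
    m = MinHash(num_perm=num_perm)
    for tok in set(code.split()):
        m.update(tok.encode("utf-8"))
    return m

def deduplicate(samples, threshold: float = 0.8):
    """Keep one representative per cluster of near-duplicate samples."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for i, s in enumerate(samples):
        m = minhash(s["code"])
        if not lsh.query(m):      # no earlier sample is this similar
            lsh.insert(str(i), m)
            kept.append(s)
    return kept
```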

At the beginning of Phase II, we will add the following items to the starting toolkit:

1. the dataset collected from all participants in Phase I;

2. an example set of labels for part of the samples in the collected dataset; and

3. an evaluation script to measure how the example set of labels improves the quality of the collected dataset.
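For orientation while the example labels are pending, here is one hypothetical layout for a labeled sample, loosely inspired by MG-Verilog's multi-grained descriptions; every field name below is our assumption, and the released example set defines the actual schema.

```python
# Hypothetical labeled sample; all field names are illustrative assumptions.
sample = {
    "code": (
        "module counter(input clk, input rst_n, output reg [3:0] q);\n"
        "  always @(posedge clk or negedge rst_n)\n"
        "    if (!rst_n) q <= 4'd0; else q <= q + 4'd1;\n"
        "endmodule\n"
    ),
    "description": {
        # coarse label: what the module does, in one line
        "high_level": "A 4-bit synchronous up-counter with active-low reset.",
        # fine label: implementation details an LLM should reproduce
        "detailed": ("On each rising clock edge, the 4-bit register q "
                     "increments by one; when rst_n is low, q resets to 0."),
    },
}
```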

References

[1] Blocklove, J., Garg, S., Karri, R., & Pearce, H. (2023, September). Chip-chat: Challenges and opportunities in conversational hardware design. In 2023 ACM/IEEE 5th Workshop on Machine Learning for CAD (MLCAD) (pp. 1-6). IEEE.

[2] Liu, M., Pinckney, N., Khailany, B., & Ren, H. (2023, October). VerilogEval: Evaluating large language models for Verilog code generation. In 2023 IEEE/ACM International Conference on Computer-Aided Design (ICCAD) (pp. 1-8). IEEE.

[3] Fu, Y., Zhang, Y., Yu, Z., Li, S., Ye, Z., Li, C., ... & Lin, Y. C. (2023, October). GPT4AIGChip: Towards next-generation AI accelerator design automation via large language models. In 2023 IEEE/ACM International Conference on Computer-Aided Design (ICCAD) (pp. 1-9). IEEE.

[4] Reddi, V. J., & Yazdanbakhsh, A. (2023, July). Architecture 2.0: Challenges and Opportunities. In 2023 60th ACM/IEEE Design Automation Conference (DAC) (pp. 1-2). IEEE.

[5] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., ... & McGrew, B. (2023). GPT-4 technical report. arXiv preprint arXiv:2303.08774.

[6] Liu, M., Ene, T. D., Kirby, R., Cheng, C., Pinckney, N., Liang, R., ... & Ren, H. (2023). ChipNeMo: Domain-adapted LLMs for chip design. arXiv preprint arXiv:2311.00176.

[7] Zhang, Y., Yu, Z., Fu, Y., Wan, C., & Lin, Y. C. (2024, June). MG-Verilog: Multi-grained Dataset Towards Enhanced LLM-assisted Verilog Generation. In LAD 2024: International Workshop on LLM-Aided Design.