EDA and chip-design thesis database

From F-Si wiki

We present a database of theses related to chip design and Electronic Design Automation (EDA). The goal is to make such theses discoverable by topic and reusable by the free-silicon community. The theses were selected by using a Large Language Model (LLM) to analyze all the theses contained in the OpenAIRE database. Each selected thesis was then scored in 53 searchable sub-categories.

Access the Thesis Database now.

What the Thesis Database page shows

The Thesis Database is exposed through an interactive table where each row is a thesis and the columns include:

  1. bibliographic data (title, year, country, …)
  2. a “best category / best score” summary
  3. a set of fine-grained categories (analog layout, RTL/HDL & HLS design, timing & STA, place & route, formal verification, PDK & standard-cells, education, etc.).

You can:

  1. filter by score for each specific category
  2. search by keywords
  3. access the download page directly (if available)
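
The same kind of filtering can also be done offline on exported rows. The following is a minimal sketch, assuming each row is a dictionary with a `title` and a per-category `scores` mapping; these field names and the `filter_rows` helper are hypothetical and do not describe the actual table format.

```python
def filter_rows(rows, category, min_score, keyword=None):
    """Keep rows whose score in `category` is at least `min_score`
    and, if `keyword` is given, whose title contains it (case-insensitive)."""
    kw = keyword.lower() if keyword else None
    return [
        row for row in rows
        if row["scores"].get(category, 0.0) >= min_score
        and (kw is None or kw in row["title"].lower())
    ]


# Illustrative rows only; titles and scores are made up for this example.
rows = [
    {"title": "Timing closure in ASIC flows", "scores": {"timing & STA": 0.9}},
    {"title": "Analog layout automation", "scores": {"analog layout": 0.8}},
]
selected = filter_rows(rows, "timing & STA", 0.5, keyword="timing")
```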

Please note that this is not an exhaustive list because:

  1. the OpenAIRE database is not complete (especially for non-EU publications) and
  2. the LLM might have missed some relevant theses.

Still, this database should give the community a map of the landscape and a starting point for more detailed reading.

Motivation and origin of the project

Over the past decades, thousands of Master's and PhD theses have been written on topics directly related to integrated circuit design and Electronic Design Automation (EDA).

Such theses:

  1. are usually accompanied by context that provides a general overview of a topic;
  2. are usually written in simple terms;
  3. may contain know-how that does not qualify for a scientific paper but is still essential for understanding a topic;
  4. may be about exploratory topics which are not yet mature enough for other publications.

Theses are therefore well suited both for humans who want to gain a general understanding of a topic and for enhancing the knowledge of Large Language Models, e.g. through Retrieval-Augmented Generation (RAG) or Cache-Augmented Generation (CAG). A structured database is also useful for our own activities: when preparing the Free Silicon Conference (FSiC), we repeatedly needed to find authors in specific niches, and when looking for university groups to visit, we needed to identify groups by research topic and by priority.

Theses are, however, difficult to find: they are either scattered across many specialized repositories (e.g. the archives of university groups) or hidden in general publication databases whose size makes them difficult to handle. Standard search engines, moreover, typically fail to retrieve such theses exhaustively.

We therefore decided to try a more systematic approach by letting an LLM analyze large databases at once.

Methodology, reproducibility and Markdown conversion service

We let an LLM analyze the metadata (title, abstract, keywords, etc.) of all ~6.6M theses contained in the OpenAIRE publication database. Starting from the 12 September 2025 dump, we used a local LLM (Qwen3-Reranker-4B) to estimate, for each thesis, how relevant it is to the general field of chip design and EDA.

We then kept the theses with a score > 0.1, which resulted in 13’272 theses. Each of these was then scored a second time, using the same reranker, against 53 narrower categories (like “clock-tree synthesis” or “standard-cell libraries”).
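
The two-stage selection described above can be sketched as follows. This is only an illustration under stated assumptions: `toy_score` is a simple word-overlap stand-in for the reranker (it is not the Qwen3-Reranker-4B model or its API), and the `two_stage` helper, the record fields and the example data are hypothetical. Only the 0.1 threshold and the idea of a general-field pass followed by per-category scoring come from the text.

```python
def toy_score(query, text):
    """Stand-in for a reranker: fraction of query words that occur in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


def two_stage(theses, field_query, categories, score=toy_score, threshold=0.1):
    """Stage 1: keep theses whose general-field score exceeds `threshold`.
    Stage 2: score each kept thesis against every narrower category."""
    results = []
    for thesis in theses:
        text = f"{thesis['title']} {thesis['abstract']}"
        field_score = score(field_query, text)
        if field_score <= threshold:
            continue  # not relevant enough to the general field
        category_scores = {c: score(c, text) for c in categories}
        best = max(category_scores, key=category_scores.get)
        results.append({
            **thesis,
            "field_score": field_score,
            "category_scores": category_scores,
            "best_category": best,
            "best_score": category_scores[best],
        })
    return results


# Made-up metadata records, for illustration only.
theses = [
    {"title": "Clock tree synthesis for low-power chip design",
     "abstract": "We study clock tree buffering."},
    {"title": "Medieval trade routes",
     "abstract": "A history of commerce."},
]
categories = ["clock tree synthesis", "standard cell libraries"]
selected = two_stage(theses, "chip design", categories)
```

Passing the scorer as a parameter keeps the selection logic independent of the model, so a real reranker could be substituted for `toy_score` without changing the rest of the sketch.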

We also automatically downloaded all openly available theses, starting from the URL provided for each thesis by OpenAIRE, and converted them into Markdown using Marker and Docling.

The full pipeline can be downloaded here. We publish the source code so that others can replicate this approach for different research fields.

While we cannot provide a full thesis dump because of potential copyright restrictions, we would be happy to offer the Markdown conversion of the theses contained in the database free of charge to free-silicon/FOS-EDA developers, provided that the requester confirms that they hold the necessary rights to the document.

For queries please write to theses'at'f-si.org.

Acknowledgements

This work was co-funded by the Swiss State Secretariat for Education, Research and Innovation (SERI) under the NGI0 Commons Fund project. The NGI0 Commons Fund has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No. 101135429.
