Building a Local-LLM Datasheet Extractor for IC Driver Development

Ari Mahpour
Created: March 2, 2026

As the PCB design process moves faster and faster, the need for up-to-date device driver libraries keeps increasing. Designs that include many integrated circuits (ICs), each with a datasheet running tens (if not hundreds) of pages, put software engineers in an untenable position. Writing and testing device drivers is a full-time job on its own, and parsing hundreds of pages of datasheets on tight timelines can be almost impossible, especially with the dizzying number of “rapid prototype” designs spinning every few weeks.

As always these days, we turn to AI for help. Given the correct context, a large language model (LLM) can write well-organized and comprehensive device driver libraries to offload the struggle that everyday embedded engineers face.

In this article, we’re going to look at how to extract and structure the data an LLM needs in order for it to efficiently build robust device driver libraries for you.

Why LLMs Need Structured Datasheet Data

It can be tempting to upload a PDF to an LLM and say, “Build a driver library from this datasheet.” To be honest, we’ll probably have that sometime in the future (maybe sooner than we think). In the meantime, however, it’s impractical to assume an LLM can consume an unstructured document and make that much sense out of it.

To create a workflow for AI, we first need to think about the process a human goes through when “processing” a datasheet. When I first encounter a large datasheet (i.e. more than a dozen pages), I scan the first few pages and then head to the section that interests me most. In the case of this article, that means writing device drivers, so I may want to look at the SPI or I2C timing diagrams, register maps, and command structures. After reviewing the datasheet a few times, we build mental maps, or shortcuts, to the areas of interest and jump straight to those locations when we need the same data again (e.g. I’ll always go back to the register map, because that’s what I care about, rather than the absolute maximum ratings section).

LLMs aren’t much different. They also need an “overview” and a “memory map” of where to find the pertinent data. A generic PDF won’t provide this, so we have to break the PDF apart into consumable, bite-sized chunks that an LLM can easily navigate. Many LLM-based web tools process PDFs for a single session, but that output isn’t reusable later: if you want to reference the information again, you have to re-upload the file and process it all over. The second benefit of breaking datasheets down into structured data is that we can drop the chunks into a database for later consumption, where they can be recalled by a colleague or by an LLM writing code or reviewing a Pull Request.
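As a sketch of that storage-and-recall step, here is a minimal SQLite layer. The table name, column names, and chunk fields are all assumptions for illustration, not the project’s actual schema:

```python
import sqlite3


def store_chunks(db_path: str, chunks: list[dict]) -> None:
    """Persist extracted datasheet chunks so they can be recalled later."""
    con = sqlite3.connect(db_path)
    con.execute(
        """CREATE TABLE IF NOT EXISTS chunks (
               datasheet TEXT, section TEXT, body TEXT)"""
    )
    con.executemany(
        "INSERT INTO chunks VALUES (:datasheet, :section, :body)", chunks
    )
    con.commit()
    con.close()


def find_chunks(db_path: str, keyword: str) -> list[str]:
    """Naive keyword recall -- enough to 'jump straight to the register map'."""
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT body FROM chunks WHERE section LIKE ? OR body LIKE ?",
        (f"%{keyword}%", f"%{keyword}%"),
    ).fetchall()
    con.close()
    return [r[0] for r in rows]
```

A real deployment would likely add full-text or vector search, but even this keyword lookup lets a colleague (or an LLM tool call) pull only the sections that matter.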

In summary, PDFs need to be extracted into smaller chunks and organized in a way for both humans and LLMs to digest in parts.

Architecture

The good news for us is that there are already plenty of software libraries (open and closed source) that break apart PDF files for us. In my example project, I make use of the open source library called Docling. With the help of Docling, a local vision model (run with Ollama), and some custom schemas, we’re able to put together a pretty decent JSON document that LLMs can ingest and databases can store.
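A minimal sketch of that conversion step, assuming the current Docling API (`DocumentConverter`, `export_to_dict`). The `index_export` helper is my own illustration of building a small “memory map” over the export; the `texts`/`pictures` keys follow the shape of Docling’s dict export:

```python
def index_export(doc: dict) -> dict:
    """Build a tiny 'memory map' of a Docling-style export: counts of text
    items and pictures, so downstream tools know what there is to fetch."""
    return {
        "texts": len(doc.get("texts", [])),
        "pictures": len(doc.get("pictures", [])),
    }


def extract_datasheet(pdf_path: str) -> dict:
    """Convert a PDF with Docling (requires `pip install docling`)."""
    from docling.document_converter import DocumentConverter  # deferred import

    result = DocumentConverter().convert(pdf_path)
    return result.document.export_to_dict()  # text items + image references
```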

For starters, we can use Docling to extract everything, from text to images, into a large JSON file (with pointers to the image files). We apply Hybrid Chunking to ensure large bodies of text are broken into bite-sized pieces. Without specifying a chunking method, you risk exporting all of the text into a single JSON object, which makes it extremely difficult (and expensive) for an LLM to parse and digest.
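To illustrate why chunking matters, here is a crude word-budget splitter next to a sketch of Docling’s actual `HybridChunker` usage (API names per Docling’s documentation; the real chunker is tokenizer- and structure-aware, unlike the naive stand-in):

```python
def naive_chunks(text: str, max_words: int = 60) -> list[str]:
    """Crude stand-in for Hybrid Chunking: split on a fixed word budget so
    no single JSON object overwhelms the LLM's context window."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def docling_chunks(pdf_path: str) -> list[str]:
    """The real thing, per Docling's docs (requires `pip install docling`)."""
    from docling.chunking import HybridChunker
    from docling.document_converter import DocumentConverter

    doc = DocumentConverter().convert(pdf_path).document
    return [chunk.text for chunk in HybridChunker().chunk(doc)]
```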

When it comes to LLMs writing code (our primary goal), images can be helpful if processed by vision models, but they aren’t the preferred input. Providing a model with a very detailed description of an image, or translating an image (such as a waveform diagram) into data points, gets you much more mileage. Take, for example, the performance characteristics of a component over temperature. A graph in the datasheet shows how performance degrades across the temperature range, but pointing an LLM that can’t process images at a PNG file is useless. If a vision model processes that image, interpolates the curve into XY data points, and stores them, another LLM can take the text output of that interpolation and use it to answer questions, plan, write code, and so on. That is exactly what we do in this project. Since many of the extracted images are simple logos or pictures, we use a small local model, Moondream, for a first pass that classifies and briefly describes each image. We then filter out the simple images and pass the rest to a more sophisticated vision model that extracts detailed descriptions and stores them alongside the JSON file.
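A sketch of that two-pass triage. The category names, prompt wording, and the injected `classify`/`describe` callables are illustrative assumptions; the Ollama call follows the `ollama` Python client’s `chat` API, which accepts image paths in a message:

```python
TRIVIAL = {"logo", "photo", "decoration"}  # hypothetical category names


def triage(image_paths, classify, describe) -> dict:
    """First pass tags every image; only non-trivial images get the expensive
    second-pass description. classify/describe are injected callables so the
    cheap and sophisticated models can be swapped independently."""
    return {p: describe(p) for p in image_paths if classify(p) not in TRIVIAL}


def moondream_classify(path: str) -> str:
    """First-pass tag via a local Moondream model (requires `pip install
    ollama` and a running Ollama server); the prompt is illustrative."""
    import ollama

    reply = ollama.chat(
        model="moondream",
        messages=[{
            "role": "user",
            "content": "Answer with one word: logo, photo, graph, or diagram.",
            "images": [path],
        }],
    )
    return reply["message"]["content"].strip().lower()
```

Usage would be `triage(paths, moondream_classify, big_vision_describe)`, with the second-pass `describe` pointed at the more capable vision model.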

Speeding Up Development

Using the process we just discussed, we can now point an LLM to our JSON document (and the related image extraction and classification files). We can also store this data in a database for other users to point their LLMs at. Not only does this support us in writing device drivers, it also gives us quick access to “chat with our datasheets” while using much cheaper and faster models. That enables faster register definition lookups, cleaner init sequences, and easier review by LLMs. Troubleshooting or updating drivers with the assistance of an LLM also gets much easier because of its access to all the other detailed information.
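As a sketch of that “chat with our datasheets” flow, here is naive keyword retrieval feeding a small local model. The chunk shape, model name, and prompt format are assumptions for illustration; the actual project may retrieve differently:

```python
def build_context(chunks: list[dict], question: str, top_n: int = 3) -> str:
    """Rank chunks by naive keyword overlap with the question and join the
    best few into a context block for the prompt."""
    terms = question.lower().split()
    scored = sorted(
        chunks,
        key=lambda c: -sum(t in c["body"].lower() for t in terms),
    )
    return "\n---\n".join(c["body"] for c in scored[:top_n])


def ask_datasheet(chunks: list[dict], question: str, model: str = "llama3.2") -> str:
    """Answer against retrieved context using a cheap local model (requires
    `pip install ollama` and a running Ollama server)."""
    import ollama

    prompt = build_context(chunks, question) + "\n\nQuestion: " + question
    reply = ollama.chat(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return reply["message"]["content"]
```

Because only the relevant chunks reach the model, a small local model can answer register-map questions that would otherwise require shipping the whole PDF to a flagship model.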

Conclusion

In this article, we covered what datasheet extraction looks like from a practical, local-first perspective. For more details, I invite you to look at the example project (a work in progress) to see how it’s all implemented. While uploading a PDF to a flagship model and chatting or coding with it gets you going, it isn’t a sustainable approach, either in time or in cost. The workflow described here gives you something to build and store for the future, so everyone on the team can benefit.

About Author

Ari is an engineer with broad experience in designing, manufacturing, testing, and integrating electrical, mechanical, and software systems. He is passionate about bringing design, verification, and test engineers together to work as a cohesive unit.
