Building Neural Networks on FPGAs

Ari Mahpour
Created: August 8, 2024  |  Updated: August 9, 2024

In Understanding Neural Networks we looked at an overview of Artificial Neural Networks and used real-world examples to understand them at a higher level. In this article we’re going to look at how to train a neural network and then deploy it onto a Field Programmable Gate Array (FPGA) using an open source library called hls4ml.

The Model

In contrast to Understanding Neural Networks, we’re going to be looking at a much simpler model, for two main reasons. First, for some readers, moving from 28 x 28 pixel images to a full model was a huge leap. Although I explained the breakdown as a series of piecewise functions in a large multidimensional matrix, it was still hard to grasp. We often hear about “parameters,” but in image processing models they can seem abstract and complicated. Second, we want a simpler model because we’re going to synthesize it onto hardware. A Graphics Processing Unit (GPU), even a standard home desktop GPU, can quickly calculate and train models on the fly because it’s designed specifically for that task (i.e. number crunching). Think of how many calculations go into 3D games with their sophisticated physics engines and image rendering. GPUs have also been popular for mining various cryptocurrencies because of their ability to crunch numbers. And, at its core, training and running a neural network is just a series of matrix multiplications that parallelize well. While FPGAs do contain math blocks within their fabric, the software tools that compile these number-crunching sequences are not optimized the way GPU toolchains are (at least not yet).

For these two reasons I have chosen a much simpler model: the Iris classification model. While this isn’t as fun as a vision-based model (e.g. the MNIST database), it’s much more straightforward. The dataset used to train the model is a set of 150 observations of three different species of Iris flowers: Setosa, Versicolor, and Virginica. Each observation contains four measurements of an individual flower. They are:

  1. Sepal Length (cm)
  2. Sepal Width (cm)
  3. Petal Length (cm)
  4. Petal Width (cm)

With these measurements we’re able to create a model that can tell us (with a fairly high level of certainty) what species of flower it is. We provide those four inputs and it tells us whether the Iris is of type Setosa, Versicolor, or Virginica. The nice thing about this dataset is that there isn’t much overlap within these values. For example, Setosa, Versicolor, and Virginica all have fairly distinct sepal lengths. The same goes for the other three measurements. This is shown in the pairwise scatter plots of the dataset:


Figure 1: Pairwise Scatterplots of Iris Dataset
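
If you want to reproduce a plot like Figure 1 yourself, a minimal sketch using scikit-learn’s built-in copy of the Iris dataset and seaborn (neither of which is required anywhere else in this article) might look like this:

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

# Load the Iris dataset into a pandas DataFrame and add a readable species column
iris = load_iris(as_frame=True)
df = iris.frame
df['species'] = df['target'].map(dict(enumerate(iris.target_names)))

# Pairwise scatter plots of the four measurements, colored by species
sns.pairplot(df.drop(columns='target'), hue='species')
plt.show()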

When we spoke about why to use a neural network in Understanding Neural Networks, we discussed the concept of rule-based programming: a set of rules that the computer can follow (i.e. an algorithm). In this case, we most certainly could code this problem up as an algorithm; the scatter plots above could probably be captured by a handful of simple thresholds. We use this dataset more as a stepping stone toward the complexity that’s to come (think of the MNIST database). With that we move on to the next topic: implementation.
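
Before we do, here is what such a rule-based approach might look like - a rough sketch only, with thresholds eyeballed from the scatter plots above (not a trained model, and the Versicolor/Virginica boundary would need more care in practice):

def classify_iris(sepal_length, sepal_width, petal_length, petal_width):
    # Setosa petals are distinctly short, so petal length alone separates them
    if petal_length < 2.5:
        return 'Setosa'
    # Versicolor and Virginica overlap more; petal width is a rough tie-breaker
    if petal_width < 1.7:
        return 'Versicolor'
    return 'Virginica'

# Example: measurements (in cm) typical of a Setosa flower
print(classify_iris(5.1, 3.5, 1.4, 0.2))  # -> Setosa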

Why an FPGA?

As we prepare to implement this on an FPGA you’re probably asking, “Why even build a neural network on an FPGA?” We’ve already established that GPUs make more sense for this application. But GPUs are expensive, require lots of power, and generate a lot of heat when running at full capacity. When designing application-specific hardware (i.e. a chip), you optimize for space, power, heat, and cost, and a common first step toward an Application Specific Integrated Circuit (ASIC) is to prototype the design on an FPGA. If we ever plan to build neural networks into ASICs, this would be the first step.

How do we do it?

So now we’ve finally become convinced to give this a shot. We’ve seen examples of how to train a model, and we assume that if we simplify the model then implementing it on an FPGA should be pretty simple, right? Wrong. Remember what we talked about with the many matrices and series of equations? Trying to stuff all those computations onto a small FPGA can be extremely challenging. Luckily, there’s a really nice shortcut out there that the kind folks at the Fast Machine Learning Lab (and community) worked on in an open source project called hls4ml. This project takes existing models and converts them into high-level synthesis (HLS) code, which can then be converted into commonly used FPGA languages such as Verilog or VHDL (using FPGA synthesis tools such as Vivado from AMD Xilinx). That translation, in itself, cuts out an immense number of steps. Attempting to build a complete neural network directly in Verilog or VHDL (especially a complex one) would be extremely challenging. This tool really helps us get straight to the meat - but not without a few intermediate steps.

Optimizations

We’re at the point where we’ve trained our model and we’d like to run hls4ml and test this out on an FPGA. Not so fast. If you take the Iris dataset, train it with basic techniques, and then synthesize the code with hls4ml, you’ll probably get past synthesis. But head toward place and route and you’ll never be able to fit that model, as it stands, on a small FPGA. Remember, we also need all the logic around data handling and communication on our FPGA as well. The hls4ml tutorials use a Pynq-Z2 board to demonstrate running a model on an FPGA. This board was chosen because it contains not only an FPGA but also a microprocessor on the same chip (a Zynq 7000), which runs a full Jupyter Notebook web server. This means you can run hardware-accelerated functions on the FPGA while also getting a simple, easy-to-use interface to load and test your data. That interface, which acts as a transport layer between the operating system and the FPGA, still takes up space on the FPGA itself (just as it would if we had used a standalone FPGA like a Spartan or Virtex part).

The challenge, as mentioned above, is that FPGAs are limited in size and don’t have the same capacity that GPUs do. As a result we won’t be able to dump a full model on an FPGA - it will never fit. Even the simple Iris model couldn’t be placed initially on the Pynq-Z2. By Tutorial 4 you start to become more acquainted with optimization techniques (some of which are so esoteric that I haven’t even scratched the surface on understanding how they work under the hood). Once you apply these optimization techniques you should be able to get models (at least the simpler ones) onto an FPGA. In this repository we can look at BuildModel.py (specifically the model training function) and observe that we’re not just training a basic model but also optimizing it:

def train_model(self):
    """
    Build, train, and save a pruned and quantized neural network model using QKeras.

    This function constructs a Sequential model with QKeras layers. The model is pruned to reduce the number of parameters
    and quantized to use lower precision for the weights and activations. The training process includes setting up
    callbacks for updating pruning steps. After training, the pruning wrappers are stripped, and the final model is saved.

    Parameters:
    None

    Returns:
    None
    """
    # Initialize a Sequential model
    self.model = Sequential()

    # Add the first QDense layer with quantization and pruning
    self.model.add(
        QDense(
            64,  # Number of neurons in the layer
            input_shape=(4,),  # Input shape for the Iris dataset (4 features)
            name='fc1',
            kernel_quantizer=quantized_bits(6, 0, alpha=1),  # Quantize weights to 6 bits
            bias_quantizer=quantized_bits(6, 0, alpha=1),  # Quantize biases to 6 bits
            kernel_initializer='lecun_uniform',
            kernel_regularizer=l1(0.0001),
        )
    )

    # Add a quantized ReLU activation layer
    self.model.add(QActivation(activation=quantized_relu(6), name='relu1'))

    # Add the second QDense layer with quantization and pruning
    self.model.add(
        QDense(
            32,  # Number of neurons in the layer
            name='fc2',
            kernel_quantizer=quantized_bits(6, 0, alpha=1),  # Quantize weights to 6 bits
            bias_quantizer=quantized_bits(6, 0, alpha=1),  # Quantize biases to 6 bits
            kernel_initializer='lecun_uniform',
            kernel_regularizer=l1(0.0001),
        )
    )

    # Add a quantized ReLU activation layer
    self.model.add(QActivation(activation=quantized_relu(6), name='relu2'))

    # Add the third QDense layer with quantization and pruning
    self.model.add(
        QDense(
            32,  # Number of neurons in the layer
            name='fc3',
            kernel_quantizer=quantized_bits(6, 0, alpha=1),  # Quantize weights to 6 bits
            bias_quantizer=quantized_bits(6, 0, alpha=1),  # Quantize biases to 6 bits
            kernel_initializer='lecun_uniform',
            kernel_regularizer=l1(0.0001),
        )
    )

    # Add a quantized ReLU activation layer
    self.model.add(QActivation(activation=quantized_relu(6), name='relu3'))

    # Add the output QDense layer with quantization and pruning
    self.model.add(
        QDense(
            3,  # Number of neurons in the output layer (matches the number of classes in the Iris dataset)
            name='output',
            kernel_quantizer=quantized_bits(6, 0, alpha=1),  # Quantize weights to 6 bits
            bias_quantizer=quantized_bits(6, 0, alpha=1),  # Quantize biases to 6 bits
            kernel_initializer='lecun_uniform',
            kernel_regularizer=l1(0.0001),
        )
    )

    # Add a softmax activation layer for classification
    self.model.add(Activation(activation='softmax', name='softmax'))

    # Set up pruning parameters to prune 75% of weights, starting after 2000 steps and updating every 100 steps
    pruning_params = {"pruning_schedule": pruning_schedule.ConstantSparsity(0.75, begin_step=2000, frequency=100)}
    self.model = prune.prune_low_magnitude(self.model, **pruning_params)

    # Compile the model with Adam optimizer and categorical crossentropy loss
    adam = Adam(learning_rate=0.0001)
    self.model.compile(optimizer=adam, loss=['categorical_crossentropy'], metrics=['accuracy'])

    # Set up the pruning callback to update pruning steps during training
    callbacks = [
        pruning_callbacks.UpdatePruningStep(),
    ]

    # Train the model
    self.model.fit(
        self.X_train_val,
        self.y_train_val,
        batch_size=32,  # Batch size for training
        epochs=30,  # Number of training epochs
        validation_split=0.25,  # Fraction of training data to be used as validation data
        shuffle=True,  # Shuffle training data before each epoch
        callbacks=callbacks,  # Include pruning callback
    )

    # Strip the pruning wrappers from the model
    self.model = strip_pruning(self.model)
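
For reference, the excerpt above assumes imports roughly along these lines (a sketch only; exact module paths vary across TensorFlow, QKeras, and tensorflow_model_optimization versions, so check the repository for what it actually uses):

# Keras building blocks
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import l1

# QKeras quantized layers and quantizers
from qkeras import QDense, QActivation, quantized_bits, quantized_relu

# Pruning utilities from the TensorFlow Model Optimization toolkit
from tensorflow_model_optimization.python.core.sparsity.keras import prune, pruning_callbacks, pruning_schedule
from tensorflow_model_optimization.sparsity.keras import strip_pruning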

 

There’s an immense amount to digest here, but the gist of what’s happening (in addition to the training) is that through quantization, regularization, pruning, weight initialization, and the (Adam) optimizer we are reducing the size and complexity of the weights to make the model smaller and faster to run. Some of these methods are also there to preserve accuracy and keep the training process well behaved. Once we’ve run our model through these techniques it should be small enough to turn into FPGA code. Remember, there are plenty of ways to do this; this was the approach taken by the tutorial, so I stuck with what worked.
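
As a quick sanity check after training, you can count how many weights the pruning step actually zeroed out; with the ConstantSparsity(0.75, ...) schedule above you would expect roughly 75% zeros in each Dense kernel. A small sketch, assuming the trained (and stripped) Keras model is available in a variable called model (in the repository it lives on self.model):

import numpy as np

# Report the fraction of zero-valued weights in each layer's kernel
for layer in model.layers:
    weights = layer.get_weights()
    if weights:  # skip activation layers, which have no weights
        kernel = weights[0]
        sparsity = np.mean(kernel == 0)
        print(f"{layer.name}: {sparsity * 100:.1f}% of weights are zero")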

Synthesis

So now we’re ready to compile, synthesize, and place and route our design. At the end of the process we should end up with a bitstream file that, effectively, tells the FPGA how to configure itself. Since the Zynq has both a processor and an FPGA, it’s quite easy to program the FPGA via Python (which we’ll get to later). Using the referenced repository, we can define and compile our model using HLS and then build the bitstream file, all in a few lines of Python code. In the following code we also perform some extra validation to ensure that the HLS-generated model has the same (or close enough) accuracy as the original model.

def build_bitstream(self):
    """
    Builds the HLS bitstream for the trained model.

    This function converts the trained Keras model to an HLS model using hls4ml, compiles it, and generates the bitstream for FPGA deployment.
    It also validates the HLS model against the original model and prints the accuracy of both models.

    """
    # Create an HLS config from the Keras model, with the layer names granularity
    config = hls4ml.utils.config_from_keras_model(self.model, granularity='name')

    # Set precision for the softmax layer
    config['LayerName']['softmax']['exp_table_t'] = 'ap_fixed<18,8>'
    config['LayerName']['softmax']['inv_table_t'] = 'ap_fixed<18,4>'

    # Set the ReuseFactor for the fully connected layers to 512
    for layer in ['fc1', 'fc2', 'fc3', 'output']:
        config['LayerName'][layer]['ReuseFactor'] = 512

    # Convert the Keras model to an HLS model
    hls_model = hls4ml.converters.convert_from_keras_model(
        self.model, hls_config=config, output_dir='hls4ml_prj_pynq', backend='VivadoAccelerator', board='pynq-z2'
    )

    # Compile the HLS model
    hls_model.compile()

    # Predict using the HLS model
    y_hls = hls_model.predict(np.ascontiguousarray(self.X_test))
    np.save('package/y_hls.npy', y_hls)

    # Validate the HLS model against the original model
    y_pred = self.model.predict(self.X_test)
    accuracy_original = accuracy_score(np.argmax(self.y_test, axis=1), np.argmax(y_pred, axis=1))
    accuracy_hls = accuracy_score(np.argmax(self.y_test, axis=1), np.argmax(y_hls, axis=1))

    print(f"Accuracy of the original pruned and quantized model: {accuracy_original * 100:.2f}%")
    print(f"Accuracy of the HLS model: {accuracy_hls * 100:.2f}%")

    # Build the HLS model
    hls_model.build(csim=False, export=True, bitfile=True)


It’s hard to appreciate from a few lines of Python, but the amount of work hls4ml does to get from a Keras model to a bitstream file is immense. This makes the flow significantly more accessible to the average Joe (albeit the quantization, reduction, and optimization were not simple in the least).

The configuration settings are, again, borrowed from the tutorial. The ReuseFactor controls how many times each multiplier in the fabric is reused: a higher value saves FPGA resources at the cost of added latency. After that it’s a few calls to hls4ml and, voilà, you have a bitfile!
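
Once the build finishes, you can also sanity-check whether the design actually fits on the part by reading back the synthesis reports. hls4ml ships a helper for this; a sketch, assuming the project directory name used in the code above:

import hls4ml

# Print the Vivado synthesis reports (resource usage, latency, etc.)
# for the project directory created by build_bitstream() above
hls4ml.report.read_vivado_report('hls4ml_prj_pynq')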

Testing

So now that we have a bitfile we’d like to test it on the actual FPGA. Using the Pynq environment makes this easier. To get started we have to copy the bitfile, our test script, and the test vectors onto the Pynq board. This can be done with a simple SCP command or through the web interface. To make things easier (as done in the hls4ml tutorial) we dump everything into a single folder called “package” and then copy it over to the target device via SCP (see the BuildModel.py script in the repository for more details). What’s important to note here is an extra auto-generated library called axi_stream_driver.py. This file contains the helper functions that not only program the FPGA side of the Zynq but also handle the data transfers used to exercise the FPGA-based neural network model.
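
For a sense of how that driver gets used, here is a rough sketch in the spirit of the hls4ml tutorial. It assumes the generated axi_stream_driver.py exposes a NeuralNetworkOverlay class (as the VivadoAccelerator backend does) and that the bitfile and test vectors sit alongside the script; file names here are illustrative and may differ slightly from the repository:

import numpy as np
from axi_stream_driver import NeuralNetworkOverlay  # auto-generated by hls4ml

# Load the test vectors copied over in the "package" folder
X_test = np.load('X_test.npy')
y_test = np.load('y_test.npy')

# Program the FPGA with the bitstream and set up the AXI stream buffers
nn = NeuralNetworkOverlay('hls4ml_nn.bit', X_test.shape, y_test.shape)

# Run inference on the FPGA and save the hardware predictions for validation
y_hw, latency, throughput = nn.predict(X_test, profile=True)
np.save('y_hw.npy', y_hw)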

Once the files have been transferred to the Pynq board we can either open an SSH shell on the target or create a new notebook to run the code that lives in on_target.py. I prefer to run the script via the command line, but it requires root privileges, so you’ll need to run sudo -s after you’ve SSHed into your device. Once you’ve got a root shell on the Pynq board, navigate to jupyter_notebooks/iris_model_on_fpgas and run the test with the command python3 on_target.py. If everything ran correctly you should see the following output:


Figure 2: Expected output of the on_target.py script

Validation

So how do we know if everything even worked? Now we need to validate the outputs (i.e. predictions) that the FPGA generated and compare them against the model we trained locally on our GPU (or CPU). In layman's terms, for every set of sepal/petal length and width measurements we provide as input, we expect an Iris of type Setosa, Versicolor, or Virginica as the output (encoded as numbers, of course). We’re hoping that the FPGA is just as accurate as the model running on our local machine.

We’ll need to copy the output from the script back to our machine. You can run an SCP command again or just download via the Jupyter notebook interface. The output file will be called y_hw.npy. After copying the file we’ll need to run utilities/validate_model.py. The output should look something like this:


Figure 3: Results from validation script

As you can see, our optimized model (on the PC) and synthesized model (on the FPGA) both share the same level of accuracy: 96.67%. Out of all 30 test points we failed to predict just a single one - nice!
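
Under the hood, that comparison boils down to a few lines of numpy and scikit-learn. A minimal sketch of the idea, assuming the one-hot test labels and the FPGA predictions have been copied into the working directory (the repository’s validate_model.py handles the actual paths):

import numpy as np
from sklearn.metrics import accuracy_score

# Ground-truth one-hot labels and the predictions produced on the FPGA
y_test = np.load('y_test.npy')
y_hw = np.load('y_hw.npy')

# Compare predicted classes against the true classes
accuracy_fpga = accuracy_score(np.argmax(y_test, axis=1), np.argmax(y_hw, axis=1))
print(f"Accuracy of the FPGA model: {accuracy_fpga * 100:.2f}%")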

Conclusion

In this article, we took the Iris classification dataset and created a neural network model from it. Using hls4ml, we built a bitfile and the necessary libraries to run the same model on a Pynq-Z2 board. We also ran a validation script that compared the computer-based model against the FPGA-based model, and showed how they both stacked up against the original data. While the model and dataset were fairly trivial, this tutorial laid the foundation for what designing complex neural networks on FPGAs is all about.

Note: All the code for this project can be found in this repository.

About Author


Ari is an engineer with broad experience in designing, manufacturing, testing, and integrating electrical, mechanical, and software systems. He is passionate about bringing design, verification, and test engineers together to work as a cohesive unit.
