Custom DPU-PYNQ Overlay for the Sundance VCS³ Board
Extends the DPU-PYNQ framework with a custom accelerated platform, generating the required FPGA overlay binaries and a matching arch.json.
This project extends the DPU-PYNQ framework to support the Sundance VCS³ board, a Zynq UltraScale+ MPSoC platform that is not natively supported by the official PYNQ board list. The goal is to enable Vitis-AI DPU acceleration on the VCS³ board and allow users to run deep-learning inference directly from Python and Jupyter notebooks using the pynq_dpu package.
The project targets Vitis-AI version 3.5, which supports the DPUCZDX8G IP (v4.1) and is compatible with Vivado, Vitis, and PetaLinux 2023.1. Development is based on the design_contest_3.5 branch of the DPU-PYNQ repository, which provides updated infrastructure for recent platforms such as the Kria KR260 and KV260 SOMs.
To achieve this, a custom accelerated Vitis platform was created for the VCS³ board using Vivado. The platform integrates the Zynq UltraScale+ Processing System, clocking and reset infrastructure, AXI interconnects, interrupt support, and DDR memory access required by the DPU. The platform is exported as a Xilinx Shell Archive (XSA) and used by Vitis during kernel linking.
The DPUCZDX8G IP is configured using a custom dpu_conf.vh file, which defines the DPU architecture, including compute size, memory usage, parallelism, DSP utilization, and power configuration. From this hardware configuration, an arch.json file is generated, which accurately describes the DPU architecture to the Vitis-AI compiler. This file is essential for compiling neural-network models that are fully compatible with the hardware DPU implementation.
Using the custom platform, DPU configuration, and architecture description, the Vitis build flow generates the required overlay binaries (dpu.bit, dpu.hwh, and dpu.xclbin). These binaries can be deployed on the VCS³ board and used with the pynq_dpu runtime to execute Vitis-AI–compiled models from Python.
Overall, this project demonstrates how a non-standard Zynq UltraScale+ board can be successfully integrated into the DPU-PYNQ ecosystem, enabling accessible and high-performance AI acceleration on custom hardware.
Things Used In This Project
Hardware Components
1) Sundance VCS³ Development Kit
Software Components
1) AMD Vivado 2023.1
2) AMD Vitis 2023.1
3) AMD Vitis-AI 3.5
4) DPU-PYNQ (design_contest_3.5 branch)
5) PYNQ 3.0 with the pynq_dpu package
Introduction
The target platform used in this project is the VCS³ Development Kit, a compact embedded system based on an AMD Zynq™ SoC that combines integrated ARM CPUs with FPGA fabric. This architecture makes the VCS³ well suited for embedded computing and hardware-accelerated applications.
VCS³ 3D Kit Store Page: https://store.sundance.com/product/vcs3-dev-kit/
Documentation:
- Application Starters Guide: https://store.sundance.com/wp-content/uploads/2023/10/VCS3-Application-Starters-Guide-v1.1.pdf
- Development Kit – Getting Started Guide: https://store.sundance.com/wp-content/uploads/2025/01/VCS3-Development-Kit-Getting-Started-Guide.pdf
About This Project
This project is about porting and enabling the AMD/Xilinx DPU-PYNQ framework on the Sundance VCS³ board, which is not officially supported by the DPU-PYNQ repository out of the box.
DPU-PYNQ normally provides ready-to-use FPGA overlays that allow deep-learning inference (via Vitis-AI DPU) to run easily from Python/Jupyter notebooks on supported Zynq UltraScale+ platforms. This project extends that ecosystem to the custom Sundance VCS³ board, enabling generation of:
- dpu.bit – FPGA bitstream
- dpu.hwh – hardware handoff metadata
- dpu.xclbin – Vitis acceleration binary
- arch.json – defines the specific capabilities and configuration of the DPU; used by the Vitis-AI compiler
These files make it possible to run neural-network inference using the pynq_dpu Python package on the VCS³ hardware.
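As a sketch of the eventual deployment step (the board address and target directory below are illustrative, and assume a PYNQ image with the pynq_dpu package installed on the board), the three overlay files are copied side by side so the runtime can find them under one base name:
# Copy the generated overlay files to the board. pynq_dpu expects
# dpu.bit, dpu.hwh, and dpu.xclbin to sit in the same directory under one
# base name, so loading dpu.bit locates the matching .hwh and .xclbin.
scp dpu.bit dpu.hwh dpu.xclbin xilinx@192.168.2.99:/home/xilinx/vcs3_overlay/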
Motivation
There were two main motivations:
- Hardware limitation – The Sundance VCS³ board is a powerful Zynq UltraScale+ MPSoC platform, but it is not included in the official list of PYNQ-enabled boards. This prevents users from easily deploying AI workloads using DPU-PYNQ.
- Ease of AI deployment – By adding VCS³ support, the project allows users to:
– Use Python instead of low-level drivers
– Run inference notebooks
– Rapidly prototype AI applications on custom hardware
In short, the project removes a major usability barrier and makes AI acceleration on VCS³ accessible and reproducible.
How Does It Work?
The project works in several coordinated steps:
1. Extend the DPU-PYNQ board infrastructure
A new VCS³ board directory is created inside the DPU-PYNQ boards/ folder. This includes:
- A project configuration file (prj_config) describing AXI connectivity and kernel mapping
- A DPU configuration header (dpu_conf.vh) that defines the DPU architecture (compute size, memory usage, DSP usage, power settings, etc.)
This tells Vitis how the DPU should be built for the VCS³ FPGA.
2. Create a Vivado accelerated platform (XSA)
Because DPU-PYNQ requires an accelerated platform, a custom Vivado platform project is built targeting the VCS³ board:
- Zynq UltraScale+ MPSoC configured with board presets
- Multiple clocks for DPU and various IPs in the block design
- AXI master/slave interfaces for DDR access
- Interrupt controller and PL-to-PS IRQ wiring
The completed design is exported as a platform.xsa file, which the DPU-PYNQ build flow uses as its hardware foundation.
3. Package the DPU as a Vitis kernel and generate the FPGA overlay
Using Vitis scripts:
- The DPU RTL is packaged into a kernel object (.xo)
- AXI ports are mapped to the platform memory interfaces
- Kernel metadata is generated for system linking
The v++ linker:
- Integrates the DPU kernel with the VCS³ platform
- Synthesizes, places, and routes the design
- Produces the final dpu.xclbin, dpu.bit, dpu.hwh, and arch.json files
This step effectively creates the FPGA overlay used by PYNQ.
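For orientation, the link step that the board Makefile drives looks roughly like the following (a simplified sketch, not the exact Makefile invocation; the .xpfm name and paths are placeholders, and the real flow first wraps platform.xsa into a Vitis platform and derives file names automatically):
# Link the packaged DPU kernel against the custom VCS³ platform (illustrative)
v++ -l -t hw \
    --platform ./vcs3_platform/vcs3_platform.xpfm \
    --config ./prj_config \
    --save-temps \
    -o binary_container_1/dpu.xclbin \
    binary_container_1/DPUCZDX8G.xo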
Tool Versions:
This project targets Vitis-AI version 3.5, which supports the DPUCZDX8G IP version 4.1 and is compatible with Vivado, Vitis, and PetaLinux 2023.1. The design_contest_3.5 release of the DPU-PYNQ repository supports PYNQ 3.0 and Vitis-AI 3.5.0. This release also includes updates specifically targeting the Kria SOM platforms (KR260 and KV260).
DPUCZDX8G is the DPU used for edge devices. The acronym breaks down as follows:
DPU: Deep Learning Processing Unit
C: CNN applications
ZD: ZYNQ DDR hardware platform
X: DECENT Quantization
8: Quantization bitwidth of 8 bits
G: General purpose design target
1. Extend the DPU-PYNQ board infrastructure
The initial step is to clone the DPU-PYNQ repository and check out the design_contest_3.5 branch using the following command:
git clone --branch design_contest_3.5 --single-branch https://github.com/Xilinx/DPU-PYNQ.git
The repository has the following top-level structure:
├── boards
├── contest_patch.sh
├── host
├── LICENSE
├── MANIFEST.in
├── pynq_dpu
├── pyproject.toml
├── README.md
└── setup.py
Next, change into the boards directory:
cd DPU-PYNQ/boards
├── check_env.sh
├── DPUCZDX8G
├── gzu_5ev
├── kr260_som
├── kv260_som
├── Makefile
├── pynqzu
├── README.md
├── rfsoc2x2
├── rfsoc4x2
├── TySOM-3A-ZU19EG
├── TySOM-3-ZU7EV
├── Ultra96v1
├── Ultra96v2
├── ultrazed_eg_iocc_production
├── vermeo_t1_mpsoc
├── vermeo_t1_rfsoc
├── zcu102
├── zcu104
├── zcu106
├── zcu111
├── zcu1285
├── zcu208
├── zcu216
└── ZUBoard_1CG
Create a directory for the VCS³ board and change into it:
mkdir vcs-3 && cd vcs-3
Create a project configuration file named prj_config in the vcs-3 directory and populate it with the following contents. This file defines the clocking, kernel connectivity, and Vivado implementation strategy used during the Vitis linking process.
# /*
# * Copyright 2019 Xilinx Inc.
# *
# * Licensed under the Apache License, Version 2.0 (the "License");
# * you may not use this file except in compliance with the License.
# * You may obtain a copy of the License at
# *
# * http://www.apache.org/licenses/LICENSE-2.0
# *
# * Unless required by applicable law or agreed to in writing, software
# * distributed under the License is distributed on an "AS IS" BASIS,
# * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# * See the License for the specific language governing permissions and
# * limitations under the License.
# */
[clock]

[connectivity]
sp=DPUCZDX8G_1.M_AXI_GP0:HPC0
sp=DPUCZDX8G_1.M_AXI_HP0:HP0
sp=DPUCZDX8G_1.M_AXI_HP2:HP1
nk=DPUCZDX8G:1

[advanced]
misc=:solution_name=link

[vivado]
prop=run.impl_1.strategy=Performance_Explore
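A brief reading of this file: nk=DPUCZDX8G:1 instantiates a single DPU core, and the sp= directives map its AXI master interfaces onto the platform's PS memory ports (HPC0, HP0, and HP1). The empty [clock] section is intentional: the platform's default clock (set to clk_out_DPU_200MHz in step 2 below) is applied automatically during v++ linking.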
Next, a DPU configuration file (dpu_conf.vh) is created to define the architectural parameters of the DPU IP integrated into the Vivado project. This file specifies the hardware characteristics of the DPU, including compute size, memory usage, parallelism, DSP utilization, and power configuration. These parameters directly determine how neural network models are mapped onto the DPU hardware and therefore must be consistent with the models compiled using Vitis-AI.
In this configuration, the DPU is instantiated with a B1152 architecture, providing a balanced trade-off between performance and resource utilization. URAM and DRAM are disabled, indicating that only on-chip BRAM resources are used for feature maps, weights, and biases. The design enables channel augmentation and sets the ALU parallelism to level 2, improving computational throughput while maintaining compatibility with supported models. DSP resources are prioritized by enabling high DSP48 usage, and the design targets an MPSoC-based platform, consistent with the Zynq UltraScale+ architecture of the VCS³ board.
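As a rough capacity estimate, the B1152 designation corresponds to a nominal 1,152 operations per clock cycle, so at the 200 MHz DPU clock configured later in this guide the theoretical peak throughput is about 1,152 × 200 MHz ≈ 230 GOPS; achieved throughput is lower and depends on memory bandwidth and the layer mix of the model.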
Any modification to architecture-related parameters in this file—such as DPU size, memory configuration, or parallelism—requires regenerating the corresponding arch.json file and recompiling the neural network models to ensure software–hardware compatibility.
/*
* Copyright 2019 Xilinx Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
//Setting the arch of DPU. For more details, please read PG338.

/*====== Architecture Options ======*/
// 8 DPU sizes are supported: B512, B800, B1024, B1152, B1600, B2304, B3136, B4096.
// The size relates to the model: if it changes, the models must be recompiled.
`define B1152

// If the FPGA has URAM, you can define URAM_ENABLE (e.g. zcu104);
// otherwise define URAM_DISABLE (e.g. zcu102). No model update needed.
`define URAM_DISABLE

//config URAM
`ifdef URAM_ENABLE
    `define def_UBANK_IMG_N   5
    `define def_UBANK_WGT_N   17
    `define def_UBANK_BIAS    1
`elsif URAM_DISABLE
    `define def_UBANK_IMG_N   0
    `define def_UBANK_WGT_N   0
    `define def_UBANK_BIAS    0
`endif

// DRAM can be used if the FPGA has spare LUTs: DRAM_ENABLE / DRAM_DISABLE.
// No model update needed.
`define DRAM_DISABLE

//config DRAM
`ifdef DRAM_ENABLE
    `define def_DBANK_IMG_N   1
    `define def_DBANK_WGT_N   1
    `define def_DBANK_BIAS    1
`elsif DRAM_DISABLE
    `define def_DBANK_IMG_N   0
    `define def_DBANK_WGT_N   0
    `define def_DBANK_BIAS    0
`endif

// RAM usage: RAM_USAGE_HIGH / RAM_USAGE_LOW.
// Relates to the model: if it changes, the models must be recompiled.
`define RAM_USAGE_LOW

// Channel augmentation: CHANNEL_AUGMENTATION_ENABLE / CHANNEL_AUGMENTATION_DISABLE.
// Relates to the model: if it changes, the models must be recompiled.
`define CHANNEL_AUGMENTATION_ENABLE

// ALU parallelism: ALU_PARALLEL_DEFAULT / ALU_PARALLEL_1 / ALU_PARALLEL_2 /
// ALU_PARALLEL_4 / ALU_PARALLEL_8.
// Relates to the model: if it changes, the models must be recompiled.
`define ALU_PARALLEL_2

// CONV RELU type: CONV_RELU_RELU6 / CONV_RELU_LEAKYRELU_RELU6.
// Relates to the model: if it changes, the models must be recompiled.
`define CONV_RELU_LEAKYRELU_RELU6

// ALU RELU type: ALU_RELU_RELU6 / ALU_RELU_LEAKYRELU_RELU6.
// Relates to the model: if it changes, the models must be recompiled.
`define ALU_RELU_RELU6

// argmax or max: SAVE_ARGMAX_ENABLE / SAVE_ARGMAX_DISABLE.
// Relates to the model: if it changes, the models must be recompiled.
`define SAVE_ARGMAX_ENABLE

// DSP48 usage: DSP48_USAGE_HIGH / DSP48_USAGE_LOW.
// HIGH uses DSPs instead of LUTs in the conv unit. No model update needed.
`define DSP48_USAGE_HIGH

// Power: LOWPOWER_ENABLE / LOWPOWER_DISABLE. No model update needed.
`define LOWPOWER_DISABLE

// Device: MPSOC / ZYNQ7000. No model update needed.
`define MPSOC
2. Create a Vivado accelerated platform (XSA)
An AMD XSA (Xilinx Shell Archive) file is required to describe the hardware platform details of the custom Sundance VCS³ board. This file captures the board’s hardware configuration and serves as the input platform definition for Vivado and Vitis during accelerated application and DPU overlay generation.
Before creating the platform project, the Sundance board definition files must be made available to Vivado.
Copy Board Files
Clone the repository below to access the Sundance VCS³ board files.
git clone https://github.com/kwame-debug/vcs3.git
Copy the Sundance VCS³ board files to <Vivado installation path>/data/boards/ :
cp -r board_files /tools/Xilinx/Vivado/2023.1/data/boards/
This step enables Vivado to recognize the VCS³ board as a selectable target during project creation.
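To confirm that the copy succeeded, list the boards directory (the exact folder names depend on the repository contents):
ls /tools/Xilinx/Vivado/2023.1/data/boards/board_files/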
Launch Vivado
Source the Vivado environment and launch the Vivado workspace by executing the following commands in a Linux terminal:
source /tools/Xilinx/Vivado/2023.1/settings64.sh
vivado &
These commands initialize the Vivado environment and open the Vivado GUI.
Create the Vivado Platform Project
When creating the project:
- Select RTL Project as the project type.
- Ensure that the “Project is an extensible Vitis platform” option is enabled. This option is required to generate a platform compatible with the Vitis acceleration flow.
Select Target Board
Choose VCS-3 from the Boards list, then click Next followed by Finish.
Note: The VCS³ board will only appear in the list if the board files have been copied into the Vivado installation directory as described in the previous step.
Create the Block Design
Create a new block design from the Vivado Flow Navigator.
Add Zynq UltraScale+ MPSoC IP
Add the Zynq UltraScale+ MPSoC IP to the block design.
Apply Board Preset
Click Run Block Automation to apply the VCS³ board preset to the Processing System. Ensure the following options are selected:
- All Automation
- zynq_ultra_ps_e_0
- Apply Board Preset
Click OK to continue.
Add Clock Wizard
Add a Clock Wizard IP and configure three output clocks as follows:
- clk_out_REG_100MHz: 100 MHz
- clk_out_DPU_200MHz: 200 MHz
- clk_out_DSP_400MHz: 400 MHz
Ensure that Active Low Reset and Locked options are enabled. Click OK to close the Clock Wizard.
Add and Configure System Resets
Add three Processor System Reset IPs to the design and rename them as follows:
- proc_sys_reset_100MHz
- proc_sys_reset_200MHz
- proc_sys_reset_400MHz
Connect Clocks and Resets
Use Run Connection Automation to connect the clocks and reset modules:
- Click on the Run Connection Automation link to open the wizard.
- Set clk_in1 of clk_wiz_0 to zynq_ultra_ps_e_0/pl_clk0 (99 MHz)
- Assign reset clock sources:
– proc_sys_reset_100MHz → /clk_wiz_0/clk_out_REG_100MHz
– proc_sys_reset_200MHz → /clk_wiz_0/clk_out_DPU_200MHz
– proc_sys_reset_400MHz → /clk_wiz_0/clk_out_DSP_400MHz
- Set ext_reset_in for all reset modules to /zynq_ultra_ps_e_0/pl_resetn0
Repeat the configuration in the above image for ext_reset_in under proc_sys_reset_200MHz and proc_sys_reset_400MHz.
The figure below shows the final output of the fully configured block design. Different colours are used to highlight the respective clock and reset connections.
Finalize Clock Configuration
- Connect all dcm_locked pins to the Clock Wizard locked output.
- Open Window → Platform Setup, navigate to Clock Settings, and enable all three clocks.
- Set clk_out_DPU_200MHz as the default clock for kernel linking. During v++ linking, the default clock is connected to any IP blocks that have no explicit clock assignment in the link configuration.
Add Interrupt Support
Enable interrupt support by:
- Opening the Zynq UltraScale+ MPSoC configuration
- Enabling AXI HPM0 LPD (32-bit data width)
Note: Ensure that AXI HPM0 FPD and AXI HPM1 FPD are disabled.
- Enabling PL-to-PS IRQ0[0–7]
- Adding an AXI Interrupt Controller
- Configuring the interrupt controller for single interrupt output
Double-click the interrupt controller, select "Single" for the Interrupt Output Connection option as indicated in the figure below, and click OK to close the window.
- Connecting axi_intc_0/irq to pl_ps_irq0[0:0]
Click on the Run Connection Automation link to open the Run Connection Automation window. Ensure that axi_intc_0 and s_axi are both selected. Select /clk_wiz_0/clk_out_DPU_200MHz as the clock source for the Master interface and click OK to close the window.
Enable interrupt signals for the platform
Enable AXI Interfaces
Enable AXI master and slave interfaces to allow kernel access to DDR memory:
- Enable required AXI ports in the Processing System (zynq_ultra_ps_e_0)
- Enable M01_AXI through M07_AXI under ps8_0_axi_periph
Note: Ensure that the memport for S_AXI_HPC0_FPD and S_AXI_HPC1_FPD is set to S_AXI_HP. Leave the Memory column blank.
Click on the Platform Name option under Settings and enter the platform details as illustrated below.
Finalize and Validate Design
- Validate the design by pressing F6. Note: the critical warning shown below appears during validation; it can be safely ignored by clicking OK.
- Create the HDL wrapper (select Let Vivado manage wrapper and auto-update)
Right-click on vcs3_bd.bd under Design Sources and select Create HDL Wrapper. Maintain the "Let Vivado manage wrapper and auto-update" option and click OK.
- Generate the block design (select Global synthesis)
Click Generate Block Design in the Flow Navigator (under IP INTEGRATOR) and ensure that Global is selected as the Synthesis Option.
- Generate the bitstream
Next, click Generate Bitstream in the Flow Navigator to launch the implementation runs, then click OK to close the window.
Export Hardware Platform
Export the hardware platform with the following options:
- Platform Type: Hardware and Hardware Emulation
- Platform State: Pre-synthesis
- Include Bitstream: Enabled
After successful bitstream generation, export the hardware platform, ensuring that "Hardware and Hardware Emulation" is selected as the platform type. Copy the exported platform.xsa into the vcs-3 board directory, which should now contain:
.
├── dpu_conf.vh
├── platform.xsa
└── prj_config
3. Package the DPU as a Vitis kernel and generate the FPGA overlay
Before building the FPGA overlay, XRT (Xilinx Runtime) must be enabled so the build and runtime tools can correctly interface with the FPGA and manage accelerator kernels.
source /opt/xilinx/xrt/setup.sh
This step configures the environment with the required XRT drivers, libraries, and utilities needed to package the DPU as a Vitis kernel and generate a PYNQ-compatible FPGA overlay.
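As a quick sanity check that the XRT environment is active before building:
echo $XILINX_XRT    # should print /opt/xilinx/xrt
which xbutil        # the XRT utilities should now be on the PATH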
Next, navigate back to the boards directory and build the overlay:
cd ..
make BOARD=vcs-3
When the build completes, the vcs-3 directory contains the following:
├── binary_container_1
├── dpu.bit
├── dpu_conf.vh
├── dpu.hwh
├── dpu.xclbin
├── kernel_xml
├── packaged_kernel_DPUCZDX8G_hw_vcs-3
├── platform.xsa
├── prj_config
├── sample_link.ini
├── scripts
├── tmp_kernel_pack_DPUCZDX8G_hw_vcs-3
├── vivado.jou
├── vivado.log
└── xcd.log
The arch.json file is located at the path below:
./vcs-3/binary_container_1/link/vivado/vpl/prj/prj.gen/sources_1/bd/vcs3_bd/ip/vcs3_bd_DPUCZDX8G_1_0/arch.json
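This arch.json is then passed to the Vitis-AI compiler so that compiled models match the B1152 configuration defined in dpu_conf.vh. As an illustrative sketch using the TensorFlow2 variant of the compiler (the model and output names below are placeholders):
# Compile a quantized model against the VCS³ DPU architecture description
vai_c_tensorflow2 \
    --model quantized_model.h5 \
    --arch ./vcs-3/binary_container_1/link/vivado/vpl/prj/prj.gen/sources_1/bd/vcs3_bd/ip/vcs3_bd_DPUCZDX8G_1_0/arch.json \
    --output_dir compiled_model \
    --net_name vcs3_net
The resulting .xmodel file in compiled_model/ is what the pynq_dpu runtime loads on the board alongside the overlay files.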