This project aims to provide a graphics server system based on hardware-accelerated graphics, and an easy way to develop the graphics primitives. The accelerated versions run faster while using the exact same C code as the software version, through automatic translation (transpiling) to Verilog code.
As an example, let's see how to use and develop drawing primitives for solid rectangles and ellipses.
Ellipse case (see ellipse_fill32.cc file):
MODULE ellipse_fill32(
bus_master(bus),
const uint16& x0,
const uint16& x1,
const uint16& y0,
const uint16& y1,
const uint32& rgba, //color
const uint32& base, //pixel offset
const int16& xstride, //normally 1, but can run backwards
const int16& ystride //bytes to skip for next line (usually the framebuffer width * 4 bytes)
)
The bus argument is automatically handled, both in software and in hardware, by provided macros.
The rectangle primitive follows the same function signature.
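For reference, it might be declared like this (a sketch that simply mirrors the parameter list above; the actual function name in the sources may differ):
MODULE rect_fill32(
bus_master(bus),
const uint16& x0,
const uint16& x1,
const uint16& y0,
const uint16& y1,
const uint32& rgba, //color
const uint32& base, //pixel offset
const int16& xstride,
const int16& ystride
)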
You can call the function directly by compiling it with a normal C compiler:
ellipse_fill32(BUSMASTER_ARG, x0, x1, y0, y1, rgba, base, xstride, ystride);
The BUSMASTER_ARG macro is an automatic argument, defined in a provided header file. It's useful for the implementation of the hardware-accelerated primitive, as explained in the next section.
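To give an idea of how such a macro can work, a software build might define it roughly like this (a purely hypothetical sketch; the real definitions live in the provided header and may differ substantially):
// Hypothetical sketch only: how BUSMASTER_ARG could resolve when
// compiling with a normal C compiler. All names here are assumptions.
#ifdef SOFTWARE_BUILD
extern sim_bus g_bus;            // simulated memory bus for the host build
#define BUSMASTER_ARG g_bus      // software: pass the simulated bus object
#else
#define BUSMASTER_ARG bus        // hardware: the transpiler macros wire the real bus
#endif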
The following image is produced by the simulator by calling the software implementation of the primitives 1000 times, using random coordinates:
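A driver loop for such a test could look like this (a sketch; the screen constants are assumptions, while the call itself follows the signature shown above):
// Sketch: stress the primitive with 1000 randomly placed ellipses.
for (int i = 0; i < 1000; i++) {
    uint16 x0 = rand() % 640, x1 = rand() % 640;
    uint16 y0 = rand() % 480, y1 = rand() % 480;
    if (x0 > x1) { uint16 t = x0; x0 = x1; x1 = t; }  // keep coordinates ordered
    if (y0 > y1) { uint16 t = y0; y0 = y1; y1 = t; }
    uint32 rgba = rand();                              // random color
    uint32 base = y0*FRAME_PITCH + x0*sizeof(uint32);  // offset of the first pixel
    ellipse_fill32(BUSMASTER_ARG, x0, x1, y0, y1, rgba, base, 1, FRAME_PITCH);
}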
To target a hardware implementation on an FPGA, a Verilog file is automatically generated from the corresponding C file containing the drawing primitive algorithm, using the following external tools: the CflexHDL and Silice transpilers (see the Makefile.common target "c2v" for the invocation command details). Since this project aims to be transpiler-agnostic, support for the PipelineC transpiler is planned for a future version.
Below is a portion of the generated Verilog code, where some of the C expressions and the interactions with the memory bus can be readily appreciated:
4: begin
  if (_q_x<_q_rw) begin
    _t_xx = _q_x*_q_x;
    _t_xh = (_t_xx)*(_q_hh);
    if (_t_xh+_q_yw<_q_wh) begin
      _d_bus_dat_w = in_rgba;
      _d_bus_we = 1;
      _d_bus_stb = 1;
      _d_bus_cyc = 1;
      if (!((_d_bus_stb&&_d_bus_we)&&!(_d_bus_stb&&in_bus_ack&&_d_bus_we))) begin
        _d_bus_stb = 0;
        _d_bus_adr = (_q_bus_adr+(in_xstride));
        _d_x = _q_x+1;
      end
    end else begin
      _d_bus_adr = (_q_bus_adr+(in_xstride));
      _d_x = _q_x+1;
    end
    _d__idx_fsm0 = 4;
  end else begin
    _d__idx_fsm0 = 5;
  end
end
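Reading that state machine back into C gives roughly the following inner loop (a sketch reconstructed from the signal names above, not the project's verbatim source; bus_write stands for the stb/cyc/we handshake):
// Per-scanline inner loop recovered from the generated state:
while (x < rw) {
    xx = x * x;
    xh = xx * hh;
    if (xh + yw < wh)              // inside the ellipse: write the pixel
        bus_write(bus_adr, rgba);  // waits for the bus acknowledge
    bus_adr += xstride;            // advance to the next pixel
    x++;
}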
After generating a System on Chip (SoC) for the target FPGA using the LiteX framework and a provided script, which includes the generated Verilog files, the accelerator can be called as follows:
regs->x0 = x0;
regs->x1 = x1;
regs->y0 = y0;
regs->y1 = y1;
regs->base = VIDEO_FRAMEBUFFER_BASE + y0*FRAME_PITCH + x0*sizeof(rgba);
regs->xstride = SDRAM_BUS_BITS/8;
regs->ystride = FRAME_PITCH;
regs->rgba = rgba;
regs->run = 1; //start
while(!regs->done); //wait until done
As seen, you first set the memory-mapped registers to the desired values, then you start the core and wait until the done flag is set.
Each accelerator core gets mapped starting at a fixed address (default 0x80000000 for the first accelerator, 0x80000800 for the second, and so on, as provided by the corresponding macros). A C structure with the layout of the registers is conveniently provided too (each register is 32-bit aligned, even if smaller).
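Such a structure could look roughly like this (a sketch matching the description above; the actual field names and ordering come from the provided header and may differ):
#include <stdint.h>

// Hypothetical register layout: every field is 32-bit aligned, even when
// the underlying register is narrower than 32 bits.
typedef struct {
    uint32_t x0, x1, y0, y1;   // coordinates (low 16 bits used)
    uint32_t rgba;             // fill color
    uint32_t base;             // pixel offset into the framebuffer
    uint32_t xstride;          // signed 16-bit value in the low bits
    uint32_t ystride;          // signed 16-bit value in the low bits
    uint32_t run;              // write 1 to start the core
    uint32_t done;             // reads nonzero when the core has finished
} accel_regs_t;

volatile accel_regs_t *regs = (volatile accel_regs_t *)0x80000000;  // first core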
The resulting execution in hardware is as follows:
You can visually appreciate that both images look the same, but how can we be sure the generated Verilog behaves the same as the software implementation? The solution is to run the accelerated implementation and also call the compiled software implementation in the same SoC, then compare whether the results match. The test program does exactly that, reporting how many pixels are in error, if any:
You can see that the accelerator is about 3X faster, while it also frees the CPU for other tasks.
In case of any discrepancy, non-matching pixels are marked in red (this image was generated by inducing a coordinate error in the software implementation) and the number of pixels in error is reported.
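The comparison itself can be pictured as a simple loop (a sketch; the framebuffer pointers, dimensions and the red constant are assumptions):
#include <stdint.h>
#include <stdio.h>

// Sketch: count mismatches between the two rendered framebuffers and
// paint them red so they stand out on screen.
static uint32_t compare_framebuffers(volatile uint32_t *fb_hw, const uint32_t *fb_sw,
                                     uint32_t width, uint32_t height) {
    uint32_t errors = 0;
    for (uint32_t i = 0; i < width * height; i++) {
        if (fb_hw[i] != fb_sw[i]) {
            fb_hw[i] = 0x00FF0000;  // assumed red in this pixel format
            errors++;
        }
    }
    printf("Pixel errors: %u (screen should have no red pixels)\n", (unsigned)errors);
    return errors;
}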
The console output in this case would be:
Pixel errors: 320 (screen should have no red pixels)
==========================================
*** TESTS FAILED ***
==========================================
The acceleration is readily appreciated in the following video, where the software implementation runs prior to the hardware one. After that, a clock demo application is shown, using a combination of drawn rectangles and ellipses.
The code for that demo application can be compiled along with the main simulator code on the host machine (e.g. Linux) to see the results in the simulator window. This eases testing during development, since the program compiles in a few seconds; it can then be run on the hardware platform to check that it produces the same results. In the provided video, you can see that the hardware matches the simulator.
Accelerator cores get directly connected to the dynamic RAM of the respective boards (acting similarly to a DMA engine) and use caching to achieve high-speed access. That way, the accelerated cores are about 7X faster than the software version, as the current tests report (line drawing core on an ECP5 device):
Start software rendering
elapsed 9057225 us, ops/s: 828153 (2 FPS @640x480)
Switch to hardware rendering
elapsed 1229585 us, ops/s: 6100246 (19 FPS @640x480)
Just waiting a bit to evaluate image...
Pixel errors: 0 (screen should have no red pixels)
==========================================
TESTS PASSED
==========================================
The following video shows the corresponding images on the display (first the software renderer, then the accelerated renderer).
Currently the project supports the following boards:
Lambdaconcept ECPIX-5 with a Lattice ECP5 device. It currently utilizes an open source toolchain for building the bitstream. The board name used in the project is lambdaconcept_ecpix5.
Digilent Arty A7-35T with an AMD Artix-7 device. The board name used in the project is digilent_arty.
Just run:
make BOARD=lambdaconcept_ecpix5 run upload
Here the run target runs the simulator, and the upload target builds the bitstream and uploads it to the FPGA board.
The FPGA board is now capable of running a graphics-enabled micropython port that can control a PC monitor. Up to 8 bits per color channel are supported.
Example: accel_basic.py
Generate and upload the bitstream (use your own serial port location if different):
make BOARD=digilent_arty digilent_arty SERIAL_PORT=/dev/ttyUSB1
cd micropython/ports/litex
make
litex_term.py --kernel build/firmware.bin /dev/ttyUSB1
cd -
openFPGALoader -b arty build/digilent_arty/gateware/digilent_arty.bit
Then the .py file can be uploaded using the standard mpremote command:
mpremote connect /dev/ttyUSB1 run demos/micropython/accel_basic.py
This generates the following picture:
To run the clock demo, use accel_clock.py (in the same folder).
The same repository used for the port of micropython to the FPGA platform was used to target the project's CPU board (F133A SoC by Allwinner Tech).
Supported features are:
The extended micropython repository is linked as a git submodule in the main repo (see micropython@51dfc37b30 in the project's main repo); the submodule points to that micropython repo.
The sources for the f133 port are under the following folder:
A basic example to test micropython on the CPU is to use the REPL mode:
Under the f133 folder, there's a micropython script that shows how to access the video framebuffer: test/test_video.py
Execution of this produced the following result:
Note the compatibility of the micropython code with the one for the FPGA platform.
cd ports/f133
make
The make command will build the firmware and upload it to the board. To upload new firmware, you have to first push the reset button.
A brand-new board was designed using a graphics-capable CPU:
It's capable of running the clock demo, using the ellipse and rectangle fill graphics primitives, as sketched below.
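For illustration, the face and an hour mark might be composed from the two primitives like this (a sketch only: rect_fill32, the coordinates and the colors are assumptions, and base follows the computation shown in the register example earlier):
// Sketch: clock face as a filled ellipse, the 12 o'clock mark as a rectangle.
uint16 cx = 320, cy = 240, r = 200;
ellipse_fill32(BUSMASTER_ARG, cx - r, cx + r, cy - r, cy + r, 0x00FFFFFF,
               VIDEO_FRAMEBUFFER_BASE + (cy - r)*FRAME_PITCH + (cx - r)*4, 1, FRAME_PITCH);
rect_fill32(BUSMASTER_ARG, cx - 4, cx + 4, cy - r, cy - r + 20, 0x00FF0000,
            VIDEO_FRAMEBUFFER_BASE + (cy - r)*FRAME_PITCH + (cx - 4)*4, 1, FRAME_PITCH);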
Running the demo produces the following image:
cd target-cpu/f133-bare/
make
The make command will build the firmware and upload it to the board. To upload new firmware, you have to first push the reset button.
This milestone is conclusive proof that this framework is capable of running the drawing primitives as software or hardware, since the same code runs on the CPU as software and in the FPGA as a hardware core, producing matching visual results.
The CPU board can communicate over serial ports, which are typically used to control the board externally and to get debug messages.
See an example of the boot messages output from the default debug port (UART0):
See below an example interaction with micropython using the second port (UART1), accessible on the board at the 6-pin header (bottom left).
This is achieved in software by calling an initialization function, uart_probe(UART_COMM), to enable the second port. See main.c for an example usage and driver_uart.c for the implementation.
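A minimal use could look like this (a sketch: uart_probe and UART_COMM come from the text above, while the byte read/write helpers are assumptions):
// Sketch: enable the second port, then echo every received character.
uart_probe(UART_COMM);             // bring up UART1 on the 6-pin header
for (;;) {
    int c = uart_getc(UART_COMM);  // assumed blocking read helper
    uart_putc(UART_COMM, c);       // assumed write helper
}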
An I2C driver was added to the micropython port for the CPU board. It can access display information provided by PC monitors and determine available resolutions.
It works by configuring the I/O pins for open-drain behaviour, making them electrically compatible with the I2C standard. See mphalport.h for implementation details.
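The open-drain technique itself is generic: the line is never driven high; it is either pulled low or released so that the external pull-up raises it, as I2C requires. A sketch (the pin helpers below are assumptions, not the port's actual API):
// Sketch of open-drain emulation on an ordinary GPIO (names assumed).
static void sda_low(void)     { gpio_dir_out(SDA_PIN); gpio_write(SDA_PIN, 0); }
static void sda_release(void) { gpio_dir_in(SDA_PIN); }  // pull-up raises the line
static int  sda_read(void)    { return gpio_read(SDA_PIN); }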
The micropython script to run is: i2c_vga_ddc.py, which obtains the following information:
The script can be run from the Linux console with the mpremote command:
mpremote connect /dev/ttyUSB0 run demos/micropython/i2c_vga_ddc.py
The wiring can be done through the I/O board, using a 40-pin FPC cable and direct wiring to the VGA connector, as shown:
A new function to decompress JPEG files was implemented. It is based on the C model of a Verilog decompressor, so in that form it is useful for testing things in software before moving them to hardware. The original code was changed to avoid dynamic memory allocations, making it easier to run in the bare metal environment; see target-cpu/f133-bare for the sources.
A JPEG file is embedded in the firmware image by means of a direct include of the raw file data (the .incbin directive in the rawdata.S assembler source).
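On the C side, data embedded this way is typically reached through linker symbols, and the no-allocation design means the working buffers can all be static (a sketch under assumptions: the symbol names, buffer sizes and the jpeg_decode entry point are illustrative, not the project's actual API):
#include <stdint.h>
#include <stddef.h>

// Sketch: reference the .incbin'd data and decode it into static buffers.
extern const uint8_t jpg_start[], jpg_end[];  // symbols exported by rawdata.S (assumed names)
static uint8_t framebuf[640 * 480 * 3];       // 24-bit RGB output, ~921.6 KB as stated below
static uint8_t workspace[64 * 1024];          // fixed scratch area instead of malloc()

size_t jpg_size = (size_t)(jpg_end - jpg_start);
int ok = jpeg_decode(jpg_start, jpg_size, framebuf, workspace);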
The example image is 33.8 KB compressed and 921.6 KB decompressed (about a 27:1 ratio). Software decompression takes 195 ms (about 5 FPS); it is expected to reach 30 FPS with the planned video decoder accelerator (Verilog).
Working example on the bare metal environment:
The decompression algorithm is in the c_model_jpeg_test.cpp source.
This project is funded through the NGI0 Entrust Fund, a fund established by NLnet with financial support from the European Commission's Next Generation Internet programme, under the aegis of DG Communications Networks, Content and Technology under grant agreement No 101069594.