Crypto-RV: High-Efficiency FPGA-Based RISC-V Cryptographic Co-Processor for IoT Security


Cryptographic operations are critical for securing IoT, edge-computing, and autonomous systems. However, current RISC-V platforms lack efficient hardware support for comprehensive cryptographic algorithm families and post-quantum cryptography. This paper presents Crypto-RV, a RISC-V co-processor architecture that unifies support for SHA-256, SHA-512, SM3, SHA3-256, SHAKE-128, SHAKE-256, AES-128, HARAKA-256, and HARAKA-512 within a single 64-bit datapath. Crypto-RV introduces three key architectural innovations: a high-bandwidth internal buffer (128 × 64-bit), cryptography-specialized execution units with four-stage pipelined datapaths, and a double-buffering mechanism with adaptive scheduling optimized for large-hash workloads. Implemented on a Xilinx ZCU102 FPGA at 160 MHz with 0.851 W dynamic power, Crypto-RV achieves 165×–1,061× speedup over baseline RISC-V cores and 5.8×–17.4× better energy efficiency than high-performance CPUs. The design occupies only 34,704 LUTs, 37,329 FFs, and 22 BRAMs, demonstrating viability for high-performance, energy-efficient cryptographic processing in resource-constrained IoT environments.


💡 Research Summary

The paper introduces Crypto‑RV, a high‑efficiency RISC‑V co‑processor designed to accelerate a comprehensive set of cryptographic primitives required in Internet‑of‑Things (IoT), edge, and autonomous systems. The authors identify that existing RISC‑V platforms either rely on software implementations or provide limited instruction‑set extensions, which suffer from severe memory‑access bottlenecks and under‑utilized pipelines when processing hash functions and block ciphers. Crypto‑RV addresses these shortcomings through three architectural innovations.

First, a dedicated 128 × 64‑bit internal buffer is placed directly adjacent to the execution pipeline. By loading message blocks, constants, and intermediate states into this buffer once, the core eliminates the repetitive load/store operations that consume 70‑85 % of cycles in conventional designs. Custom data‑movement instructions enable bulk transfers of up to 128 words in a single operation, reducing load/store overhead by 17‑58× and allowing the cryptographic engines to sustain one pipeline iteration per clock after warm‑up.
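The effect of bulk transfers can be illustrated with a toy cycle model. The latencies below (per-word memory latency, bulk-transfer cost, compute cycles per block) are illustrative assumptions, not figures from the paper; the point is only that replacing per-word accesses with one bulk transfer into the 128 × 64-bit buffer collapses the load/store term:

```python
# Toy cycle model (assumed latencies, not from the paper): one bulk transfer
# into a 128 x 64-bit buffer vs. per-word load/store traffic in a baseline core.

WORDS = 128          # buffer capacity in 64-bit words
MEM_LATENCY = 3      # assumed cycles per individual load or store
BULK_LATENCY = 16    # assumed fixed cost of one bulk-transfer instruction
COMPUTE = 80         # assumed compute cycles per block (e.g., hash rounds)

def baseline_cycles(words=WORDS):
    """Every word is loaded (and later stored) individually."""
    return 2 * words * MEM_LATENCY + COMPUTE

def buffered_cycles(words=WORDS):
    """One bulk load and one bulk store cover the whole buffer."""
    return 2 * BULK_LATENCY + COMPUTE

if __name__ == "__main__":
    overhead_reduction = (2 * WORDS * MEM_LATENCY) / (2 * BULK_LATENCY)
    print(f"baseline: {baseline_cycles()} cycles, "
          f"buffered: {buffered_cycles()} cycles, "
          f"load/store overhead reduced {overhead_reduction:.0f}x")
```

Under these assumptions the load/store overhead shrinks 24×, which falls inside the 17‑58× range the paper reports; the exact factor depends on memory latency and block size.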

Second, the design integrates three unified, crypto‑specialized execution units. The SM3/SHA‑256/SHA‑512 unit shares a message expander, compressor, and rotator across 32‑bit and 64‑bit modes, using a four‑stage pipeline with a single adder per stage. In SHA‑512 mode it processes a 1024‑bit block each cycle; in SHA‑256/SM3 mode it runs two 512‑bit blocks in parallel, effectively doubling throughput while reusing >80 % of arithmetic resources. The AES‑128/HARAKA‑256/HARAKA‑512 unit also adopts a four‑stage pipeline (SubBytes, ShiftRows/MixColumns, AddRoundKey, output accumulation). By incorporating round‑constant generation logic, the engine fully accelerates the HARAKA sponge construction, eliminating the typical 70 % software bottleneck and achieving a three‑fold throughput increase over partial implementations. The SHA‑3/SHAKE‑128/SHAKE‑256 engine unrolls two Keccak rounds per clock cycle within a carefully optimized combinational datapath, halving the effective round count from 24 to 12 while preserving timing closure. Mode‑select multiplexers allow seamless switching between fixed‑output SHA‑3 and variable‑output SHAKE functions.
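The round-unrolling idea behind the Keccak engine can be sketched abstractly. The model below uses a stand-in mixing function rather than a real Keccak round (implementing Keccak-f is beyond this sketch); it only demonstrates that evaluating two rounds of combinational logic per clock edge halves the clock count from 24 to 12 while producing the identical final state:

```python
# Toy model of round unrolling (stand-in mixing function, NOT real Keccak):
# two rounds evaluated per clock edge halve the iteration count 24 -> 12.

ROUNDS = 24  # Keccak-f[1600] round count

def round_fn(state, rc):
    # Placeholder for one round: any deterministic 64-bit mixing step works.
    return (state * 6364136223846793005 + rc) & (2**64 - 1)

def permute(state, unroll=1):
    """Apply ROUNDS rounds, `unroll` of them per simulated clock edge."""
    assert ROUNDS % unroll == 0, "unroll factor must divide the round count"
    clocks = 0
    r = 0
    while r < ROUNDS:
        for _ in range(unroll):   # combinational logic for `unroll` rounds
            state = round_fn(state, r)
            r += 1
        clocks += 1               # one register update per unrolled group
    return state, clocks

if __name__ == "__main__":
    s1, c1 = permute(0x1234, unroll=1)
    s2, c2 = permute(0x1234, unroll=2)
    assert s1 == s2               # same final state, half the clock edges
    print(f"unroll=1: {c1} clocks, unroll=2: {c2} clocks")
```

In hardware the trade-off is a longer combinational path per stage, which is why the paper emphasizes that the unrolled datapath still meets timing closure at 160 MHz.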

Third, Crypto‑RV introduces hierarchical double‑buffering between a 1024 × 64‑bit data memory (DM) and the internal buffer. At startup the entire workload (constants, initial states, and message blocks) is pre‑loaded into DM. While the cryptographic engine processes data from the buffer, fresh blocks are streamed in from DM, achieving perfect overlap of compute and DMA transfers. This mechanism is crucial for large‑message or tree‑hash workloads (e.g., Merkle‑tree traversals, SPHINCS+ signatures), where it prevents the pipeline from stalling due to memory bandwidth limitations.
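The scheduling benefit of double buffering can be captured in a simple model. The per-block cycle counts below are assumptions for illustration, not measurements from the paper: with a single buffer, transfer and compute serialize; with ping-pong buffers, the DMA of block i+1 hides under the compute of block i, so the steady-state cost per block is max(compute, dma) rather than their sum:

```python
# Toy schedule model (assumed per-block cycle counts, not from the paper):
# double buffering overlaps DMA from data memory with compute on the buffer.

COMPUTE_CYCLES = 100  # assumed cycles to process one block from the buffer
DMA_CYCLES = 60       # assumed cycles to stream one block in from DM

def single_buffer(blocks):
    """One buffer: each block's transfer and compute serialize."""
    return blocks * (DMA_CYCLES + COMPUTE_CYCLES)

def double_buffer(blocks):
    """Ping-pong buffers: only the first fill is exposed; later DMAs
    overlap the previous block's compute."""
    return DMA_CYCLES + blocks * max(DMA_CYCLES, COMPUTE_CYCLES)

if __name__ == "__main__":
    n = 1000  # e.g., blocks of a long message or a Merkle-tree workload
    print(f"single buffer: {single_buffer(n)} cycles, "
          f"double buffer: {double_buffer(n)} cycles")
```

When compute dominates (as here), the DMA cost vanishes from the steady state entirely, which is why the mechanism matters most for long-message and tree-hash workloads where thousands of blocks stream through the pipeline.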

The implementation on a Xilinx ZCU102 FPGA runs at 160 MHz, consumes 0.851 W dynamic power, and occupies 34,704 LUTs, 37,329 flip‑flops, and 22 BRAMs. Performance results show speed‑ups of 660× (SHA‑256), 604× (SHA‑512), 789× (SM3), 965×–1,061× (AES‑128 and HARAKA), and 220× (SHAKE‑128/256) compared with a baseline RISC‑V core. Energy efficiency ranges from 62.76 to 187.08 Mbps/W, delivering 4‑12× better efficiency than high‑performance CPUs such as Intel i9‑10940X, i7‑12700H, and ARM Cortex‑A53.

In summary, Crypto‑RV demonstrates that a tightly coupled, buffer‑centric, and algorithm‑unified co‑processor can provide order‑of‑magnitude improvements in latency, throughput, and energy efficiency for a broad portfolio of cryptographic algorithms on resource‑constrained platforms. The authors suggest future work to extend the architecture to support post‑quantum hash‑based schemes (e.g., SPHINCS+) and to explore ASIC implementations for even lower power and area footprints.

