Accelerating llama.cpp with RISC-V Vector Extension
By Ahmad Tameem | December 13, 2023

The emergence of Large Language Models (LLMs) has significantly impacted the software industry. We are witnessing change not only inside the industry but in our day-to-day lives as well. But while these LLMs bring groundbreaking achievements, they also raise new challenges and obstacles. One of these is performance: only a year ago, the idea of running LLMs with billions of parameters on edge devices seemed far-fetched, but techniques such as vectorization and quantization have made it possible, and, more importantly, they now offer strikingly similar performance on CPUs compared to running on dedicated GPUs and expensive servers.

Amidst these advancements, we at 10xEngineers noticed that RISC-V has not kept pace in this space, so we took the initiative of running the LLaMa.cpp LLM on a RISC-V board. Beyond that, we also added RISC-V vectorization support to LLaMa.cpp and to its underlying library, GGML. This will not only enable LLaMa.cpp and GGML to run efficiently on RISC-V hardware with vector support, but also open the way to compare its performance with other SIMD implementations such as Intel AVX and Arm Neon in the future.

This article is not only about LLaMa.cpp; we will also delve into other topics such as a basic understanding of vector processors and GGML. So, let's get started with RISC-V and the difference between scalar and vector processors.

What is RISC-V ISA?

RISC-V is an open-source instruction set architecture (ISA) based on the reduced instruction set computer (RISC) design. In contrast, the x86 ISA used by Intel and AMD is based on a complex instruction set computer (CISC) design and is proprietary. The Arm ISA is also widely used; although it is RISC, it too is proprietary. Unlike these proprietary ISAs, RISC-V's open-source nature allows for flexibility and customization, enabling a wide range of applications from small embedded systems to large servers.
The open, easy-to-learn nature of RISC-V fosters innovation and collaboration in the development of processor technology, making it a popular choice for academic research, startups, and even established industry players looking to explore custom processor designs. As a result, RISC-V is rapidly gaining traction in the industry, with a growing ecosystem of developers and companies contributing to its evolution.
It is also important to distinguish between a scalar processor and a vector processor; an implementation of just the base ISA is usually referred to as a scalar processor. Let's delve into this in the following part!

Scalar vs. Vector Processor

A scalar processor executes one instruction at a time on a single data element; according to Flynn's taxonomy, this is classified as SISD (Single Instruction, Single Data). This contrasts with Single Instruction, Multiple Data (SIMD) architectures, which execute a single instruction on multiple data elements in parallel. The latter approach is what vector processors use.
In a vector processor, we can exploit this to perform computation on multiple independent array elements simultaneously, which will typically speed up an application by a factor of roughly 2 to 16 compared to scalar instructions. The exact speed-up depends on how independent the data is and on the specifications of the vector unit. The drawback is that this is not useful when data elements depend on each other, and the data also has to be loaded into the vector unit first. Vector processors are usually added to an SoC as a coprocessor that assists the scalar processor.
Vector units are an essential part of many modern processors and can significantly improve application performance when used correctly; for example, in graphics-intensive workloads they offload processing from the scalar CPU pipeline. They are also useful in image processing and AI applications.
It is also worth noting that most CPUs today have multiple cores, and since each core has its own vector unit, performance can be increased further with multi-threading, which is exactly what LLaMa.cpp does.

Vector Processors vs. GPUs

GPUs (graphics cards) are somewhat like vector processors, but instead of SIMD they are usually based on the SIMT concept, Single Instruction, Multiple Threads: in simple terms, they contain many small cores and use data parallelism to assist the CPU. GPUs are standalone processors with their own RAM and caches that run in parallel with the CPU. They consume much more power in exchange for much faster processing, and they are expensive. They are also harder to program, requiring specific frameworks such as CUDA, which is used for Nvidia GPUs.

Understanding RISC-V Vector Extension & Intrinsics

The RISC-V Vector Extension (RVV), ratified in 2021, marked a significant advancement in the RISC-V architecture by introducing vector (SIMD-style) processing. The extension brings several new features and instructions, including:

  • A set of 32 new vector registers for holding the data to be processed. Depending on the implementation, these registers can be 128 bits, 256 bits, 512 bits, or wider.

  • A wide array of new instructions, ranging from data loads and stores to multiplication, addition, logical operations, and many more. The base extension defines over 300 instructions.

  • Vector-length agnosticism: code written for an implementation with 128-bit vector registers also runs unmodified on one with 512-bit registers, and vice versa.

  • Register grouping: two or more vector registers can be combined to make more room for loading elements. Each instruction may then take more cycles, but the code becomes more compact and reusable.

  • Its most distinctive feature is the dynamic vector length: a program asks to process however many elements remain, and the hardware reports how many it will handle per iteration. This is a departure from traditional vector processing approaches that require a fixed number of elements per instruction, enabling more dynamic and efficient data handling (see the short sketch after this list).

  • A selectable element width, which in combination with register grouping and the dynamic vector length can be used to write complex vectorized algorithms.

  • Instructions are designed so that each one supports masking as well as tail-agnostic behavior.

  • Floating-point computation is supported, and if more advanced operations or specific features are required, several additional extensions are available; failing that, we can always add a custom extension, since the ISA is open.
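
Here is a minimal sketch of the dynamic vector length and register grouping in action. It assumes a toolchain that provides <riscv_vector.h> (the intrinsics are covered in more detail below); the exact numbers printed depend on the hardware's vector register width.

    #include <riscv_vector.h>
    #include <stddef.h>
    #include <stdio.h>

    int main(void) {
        size_t n = 1000;                          // elements we would like to process
        size_t vl_m1 = __riscv_vsetvl_e32m1(n);   // one register:   up to VLEN/32 elements per pass
        size_t vl_m4 = __riscv_vsetvl_e32m4(n);   // four grouped:   up to 4*VLEN/32 elements per pass
        printf("per-iteration elements: m1=%zu, m4=%zu\n", vl_m1, vl_m4);
        return 0;
    }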


But the question arises: how can we access and utilize a vector processor from our software applications?


Vectorization Methods

There are two ways to vectorize code: auto-vectorization, where the compiler does the work for us, and manual programming, where we add vector support to the application ourselves.

Auto-Vectorization 

Autovectorization, wherein the compiler automatically converts standard code into vector instructions, seems like a straightforward solution. However, its effectiveness can be limited, especially in more complex scenarios where manual vectorization may yield a much better optimization.
For instance, in LLaMa.cpp, manual vectorization for Intel AVX and Arm Neon is 2-4 times faster than compiler auto-vectorization. The reason is that designing a perfect compiler that handles every scenario is nearly impossible and would require enormous effort, so we have to rely on what the compiler can offer. In the compiler's defense, it is not bad at all: today we rarely have to worry about vectorization and leave the job to the compiler, except in some rare scenarios.
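
As a concrete illustration, here is the kind of plain scalar loop a compiler can auto-vectorize on its own. The build command in the comment is an assumption about a typical RISC-V invocation; actual RVV auto-vectorization support depends on the compiler version.

    /* Independent loop iterations make this a good auto-vectorization candidate.
     * Hypothetical build command (support varies by toolchain version):
     *   riscv64-unknown-linux-gnu-gcc -O3 -march=rv64gcv -mabi=lp64d -c scale.c
     */
    void scale(float *x, float a, int n) {
        for (int i = 0; i < n; i++) {
            x[i] = a * x[i];   // each iteration touches a different element
        }
    }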

Manual Vectorization

There are two further ways to integrate vector instructions into an application by hand: inline assembly and intrinsics.
Inline Assembly: This approach involves directly embedding assembly language within your C or C++ code. It allows a high degree of control and precision, enabling developers to write highly optimized code specific to the vector processor's capabilities. However, it is much more complex and time-consuming than the intrinsics method.
Intrinsics: These are functions available in C or C++ that the compiler maps directly to specific assembly instructions; in our case, to the corresponding vector instructions. This method is simpler than inline assembly thanks to high-level programming, but it sometimes offers less control and can produce a less optimized result. The compiler still applies its own optimizations to intrinsic code. While programming with the RISC-V vector intrinsics, we did notice that we had less control than with assembly, which sometimes offers a better way to optimize a program.

RISC-V Vector C Intrinsics

For RISC-V, we have the RISC-V vector C intrinsics library, a collection of more than 13,000 vector intrinsic functions that we can use easily in a software program. At first glance this number may seem overwhelming, but each intrinsic has to account for the C/C++ data types and for variants such as masking, which is why the library ends up so enormous. It is still more approachable than the Intel AVX family of intrinsics, which also has to account for legacy support and a more complex architecture. Both GCC (GNU Compiler Collection) and Clang (the C/C++ frontend for LLVM) support the RISC-V vector intrinsics. Work is still in progress to add complete support for version 1.0 of the RISC-V Vector extension in both compilers, but the currently available version of the library is enough for most use cases.
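
Here is a minimal sketch of manual vectorization with these intrinsics, computing c[i] = a[i] + b[i] with a strip-mined loop. It assumes a toolchain that ships <riscv_vector.h>; recent toolchains use the __riscv_ prefix shown here, while older ones accept unprefixed names such as vsetvl_e32m1.

    #include <riscv_vector.h>
    #include <stddef.h>

    void add_f32(const float *a, const float *b, float *c, size_t n) {
        while (n > 0) {
            size_t vl = __riscv_vsetvl_e32m1(n);            // elements handled this pass
            vfloat32m1_t va = __riscv_vle32_v_f32m1(a, vl); // load a strip of a
            vfloat32m1_t vb = __riscv_vle32_v_f32m1(b, vl); // load a strip of b
            vfloat32m1_t vc = __riscv_vfadd_vv_f32m1(va, vb, vl);
            __riscv_vse32_v_f32m1(c, vc, vl);               // store the result strip
            a += vl; b += vl; c += vl; n -= vl;
        }
    }

Note how the loop never assumes a fixed register width: whatever vl the hardware reports is how many elements are processed per pass, so the same code runs on any VLEN.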

GGML: An Overview of the Machine Learning Library behind LLaMa.cpp

GGML (https://github.com/ggerganov/ggml) is a machine learning library developed by Georgi Gerganov and written entirely in C. Like other ML libraries (PyTorch and TensorFlow), it is open-source and provides similar tensor operations such as multiplication, addition, and so on. But in contrast with TensorFlow and PyTorch, which are accessed through Python APIs and rely on GPUs for good performance, GGML is highly optimized for CPUs and requires far less computation to run LLMs. Remarkably, it is efficient enough to run an entire LLM with quantized weights on a smartphone.

Alongside the C implementation, GGML uses quantization and relies on extensive multi-threading and vectorization for its performance. It also performs zero memory allocations during runtime. Finally, it has no third-party dependencies, which makes it very easy to deploy and use anywhere.


GGML has found its primary applications in projects such as llama.cpp (https://github.com/ggerganov/llama.cpp) and whisper.cpp (https://github.com/ggerganov/whisper.cpp).

Accelerating GGML for RISC-V

When we want to optimize or accelerate code through vectorization, we typically target the most computationally intensive segments, which usually turn out to be loops consuming a substantial portion of the application's execution time, often 40 to 80% of it. Identifying and optimizing these critical regions can lead to significant performance improvements. They can easily be singled out with a profiler, after which we map the loops to vector instructions.
The same applies to accelerating GGML. Here the hot region is the dot-product function for quantized weights, which takes almost 80% of the execution time and was already vectorized for Intel AVX, Arm Neon, and other common SIMD implementations.
We therefore wrote vector intrinsics for this function that offload its inner loop to the RISC-V vector unit. Since this inner loop is executed millions of times, vectorizing it significantly reduces the number of instructions executed, leading to a speedup for applications using GGML on RISC-V hardware with a vector processor.
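
To give a feel for what such a kernel looks like, here is a simplified FP32 dot product written with the RVV intrinsics. The real GGML kernels operate on quantized blocks (see the PR linked below), so this is only an illustrative sketch assuming <riscv_vector.h> with __riscv_-prefixed intrinsics.

    #include <riscv_vector.h>
    #include <stddef.h>

    float dot_f32(const float *x, const float *y, size_t n) {
        size_t vlmax = __riscv_vsetvlmax_e32m1();
        // running sum kept in element 0 of a vector register
        vfloat32m1_t vsum = __riscv_vfmv_v_f_f32m1(0.0f, vlmax);
        while (n > 0) {
            size_t vl = __riscv_vsetvl_e32m1(n);
            vfloat32m1_t vx = __riscv_vle32_v_f32m1(x, vl);
            vfloat32m1_t vy = __riscv_vle32_v_f32m1(y, vl);
            vfloat32m1_t vp = __riscv_vfmul_vv_f32m1(vx, vy, vl);   // elementwise products
            // vsum[0] += vp[0] + ... + vp[vl-1]
            vsum = __riscv_vfredusum_vs_f32m1_f32m1(vp, vsum, vl);
            x += vl; y += vl; n -= vl;
        }
        return __riscv_vfmv_f_s_f32m1_f32(vsum);   // extract the scalar result
    }

A faster variant would keep a full vector accumulator and reduce only once at the end; the sketch above favors clarity over speed.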

For a detailed insight into the implementation, you can refer to the following pull request (PR),
PR Link: https://github.com/ggerganov/llama.cpp/pull/2929 

LLaMa.cpp

LLaMa.cpp is an open-source C/C++ implementation of the LLaMa Large Language Model (LLM), which was originally developed by Meta. For those unfamiliar, Large Language Models are artificial neural networks trained on a large corpus of data to predict the next token (a short sequence of characters) from the previous token history. They use attention mechanisms and the Transformer architecture, like other notable LLMs such as ChatGPT and Google Bard.

Originally LLaMa.cpp was meant only to run the LLM on a MacBook, but it quickly gained popularity and is now available for most popular architectures, even Android phones. It uses the machine learning operations defined in the GGML tensor library. Some key features of LLaMa.cpp:

  • High Performance: The design objective behind LLaMa.cpp is to run LLMs efficiently on CPUs, which is achieved with platform-specific optimizations and multi-threading that fully utilize the available CPU cores for maximum performance.
  • GGUF Format: It also uses the custom weight format (GGUF) for efficiently storing the model weights and then loading them on the RAM. To run models with LLaMa.cpp, conversion to GGUF format is necessary for compatibility.
  • No Dependencies: Similar to GGML, it has no dependencies, which makes deployment and execution very easy across various platforms.
  • Optimized for Multiple Architectures: Includes optimizations for various architectures like Intel, PowerPC, Apple Mac, Android, and ARM.
  • Mixed Precision: Supports mixed F16/F32 precision.
  • Quantization Support: Large Language Models typically rely on floating-point precision, such as FP16 and FP32, which demands significant RAM and computational resources. To address this, we can use quantization. LLaMa.cpp implements multiple quantization levels, converting model weights to N-bit representations. This significantly reduces the processing power and RAM required at a minimal cost in precision, and with advances in quantization techniques the accuracy trade-off has become small, so the result not only retains most of the accuracy but is also faster. LLaMa.cpp supports a range of quantization options, including 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit, accommodating various computational needs. For instance, 8-bit quantization typically loses less than 0.5% accuracy, while 2-bit is the fastest but discards most of the information and is not recommended. (A simplified sketch of the block-quantization idea follows this list.)
  • Minimum RAM Requirements: Thanks to the features above, LLaMa.cpp can run models with billions of parameters in very little RAM. For instance, the original LLaMa model was only runnable on expensive GPUs such as a 40 GB A100, whereas here we can run a 7-billion-parameter model with 4-bit quantized weights in only about 4 GB of RAM.
  • GPU Compatibility: Supports CUDA, Metal, and OpenCL for GPU acceleration. Although it is primarily designed for the CPU, we can offload some model layers to the GPU if it is available. 
  • Additional Functionalities: Beyond these capabilities, LLaMa.cpp also provides other features, including options for model training and fine-tuning.
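
To make the quantization idea concrete, here is a simplified sketch of 8-bit block quantization: values are grouped into blocks of 32, and each block stores one scale plus 32 signed 8-bit integers. This only illustrates the general scheme; it does not reproduce GGML's or LLaMa.cpp's exact formats (which, for example, store the scale in FP16 and pack smaller bit widths).

    #include <math.h>
    #include <stdint.h>

    #define BLOCK_SIZE 32

    typedef struct {
        float  scale;            /* per-block scale d               */
        int8_t q[BLOCK_SIZE];    /* quantized values, roughly x/d   */
    } block_q8;

    void quantize_block(const float *x, block_q8 *out) {
        float amax = 0.0f;                        /* largest magnitude in the block */
        for (int i = 0; i < BLOCK_SIZE; i++) {
            float ax = fabsf(x[i]);
            if (ax > amax) amax = ax;
        }
        float d = amax / 127.0f;                  /* map [-amax, amax] onto [-127, 127] */
        out->scale = d;
        for (int i = 0; i < BLOCK_SIZE; i++) {
            out->q[i] = (int8_t)roundf(d != 0.0f ? x[i] / d : 0.0f);
        }
    }

    /* Dequantize: x[i] is approximately scale * q[i]. */
    void dequantize_block(const block_q8 *in, float *x) {
        for (int i = 0; i < BLOCK_SIZE; i++) {
            x[i] = in->scale * (float)in->q[i];
        }
    }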

LLaMa.cpp relies on GGML for its computations, with one exception: instead of the general quantization technique available in GGML, LLaMa.cpp uses its own quantization method, informally known as k-quants. This means there is additional quantization-related functionality that we also had to port to the RISC-V vector extension. The original GGML quantization technique can still be used as a legacy feature, but k-quants offers a much better accuracy/performance trade-off. Both LLaMa.cpp and GGML are under active development, so significant changes can be expected over time; for example, this quantization method is now also available in GGML.

Accelerating LLaMa.cpp for RISC-V

The approach to accelerating LLaMa.cpp for RISC-V mirrors the method discussed for GGML. However, since the quantization used in LLaMa.cpp differs from the general technique in GGML, there is an extra layer of functionality that has to be vectorized to provide complete RISC-V vector support. This involves porting all the dot-product functions for quantized blocks, i.e., the 2-bit, 4-bit, and 8-bit variants. For an in-depth look at the implementation, please refer to the following pull request:

PR Link: https://github.com/ggerganov/llama.cpp/pull/3453 

Setting up LLaMa.cpp

To get LLaMa.cpp up and running on a RISC-V environment, you'll need to follow these steps:

  1. Clone the Repository: Start by cloning the LLaMa.cpp repository from GitHub (https://github.com/ggerganov/llama.cpp).
  2. Model Weight Selection: We have to choose the model weights appropriate for our hardware, and this choice is crucial for performance. These models are typically trained on vast corpora of data sourced from the internet. The original LLaMa LLM developed by Meta is available in four sizes, which can be requested and downloaded via the link Llama 2 - Meta AI. These versions are:
        a. 7 billion parameters (7B)
        b. 13 billion parameters (13B)
        c. 33 billion parameters (33B)
        d. 65 billion parameters (65B)
    Remember, we first have to convert the weights to GGUF; after that, we can quantize them to any level we want (conversion and quantization commands are sketched after this list). There is also a shortcut: pre-quantized weights can be downloaded directly, saving us from fetching very large files and converting them ourselves; here is the link for the 7B pre-quantized weights: 7B Weights Download. Which weights to select depends on how much RAM you have; in general the RAM required is determined by the size of the weight file, so a 4 GB model file will consume roughly 4 GB of RAM. Here is the table for reference from the LLaMa.cpp repo:
      Model    Original size    4-bit quantized size (approx. RAM needed)
      7B       13 GB            3.9 GB
      13B      24 GB            7.8 GB
      33B      60 GB            19.5 GB
      65B      120 GB           38.5 GB
  3. GCC and Make: If we are running on actual RISC-V hardware, we just have to make sure that it has the make utility and GCC installed.
  4. Cross Compilation (Optional): If you are setting up a cross-compiling environment or don't have direct access to RISC-V hardware, you have to install the RISC-V toolchain, which is used to compile and emulate programs for RISC-V on your native hardware. The toolchain includes the GCC cross-compiler, required for compiling, and QEMU, which lets us emulate a RISC-V environment on our current hardware. We don't have to build it from scratch; a pre-built toolchain is available from this link [toolchain link]. Make sure to select the Linux glibc version for your architecture, and don't forget to add it to your PATH.
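
For reference, a typical conversion and quantization flow (as of this writing) looks roughly like the commands below, run from the llama.cpp directory. The model directory path is a placeholder, and the exact script name and options may differ between versions:

    python3 convert.py ./models/llama-2-7b/ --outtype f16
    make quantize
    ./quantize ./models/llama-2-7b/ggml-model-f16.gguf ./models/llama-2-7b/ggml-model-q4_0.gguf q4_0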

Running LLaMa.cpp on RISC-V (Scalar only)

Through RISC-V Hardware:

To execute LLaMa.cpp on a RISC-V environment without a vector processor, follow these steps:

1. Compile the program:

First, go inside the llama.cpp folder and do either of the following to build the program.

    a. To build just the main program, run the following command:

    make main

    b. To build the complete project, use:

    make
2. Execute the program

Run the main executable by using the following command and don’t forget to update the model path in it:

    ./main -m ./path/to/model.gguf -p "Anything" -n 100

We tested LLaMa.cpp on a StarFive VisionFive 2 board via the Cloud-V platform, equipped with 8 GB of RAM and a quad-core processor but no vector unit.

Despite the quad-core CPU, it runs very slowly, even much slower than a Google Pixel 5 (link: https://github.com/ggerganov/llama.cpp#android). The reason is the limitations of the board and the lack of optimizations for RISC-V. This was our main motivation for adding manual vectorization for RISC-V, so that in the future LLaMa.cpp can run efficiently on RISC-V hardware with vector support.

Running through Emulator:

If you don’t have access to RISC-V hardware, you can use a cross-compiler to compile it and then use QEMU for emulation.

1. Compile the program:

Go inside the llama.cpp folder and run this command (make sure you have a cross-compiler in the $PATH variable).

make main CC="riscv64-unknown-linux-gnu-gcc -march=rv64gc -mabi=lp64d" CXX="riscv64-unknown-linux-gnu-g++ -march=rv64gc -mabi=lp64d"

2. Execute the program

Run the main executable on the emulator by using the following command and also update both the model path and system root path in it (System root can be found in the pre-built RISC-V toolchain directory):

qemu-riscv64 -L /path/to/sysroot/  -cpu rv64 ./main -m ./path/to/model.gguf -p "Anything" -n 100


Running LLaMa.cpp on RISC-V with Vector Extension

Through RISC-V Hardware:

The process for running LLaMa.cpp on RISC-V hardware equipped with a vector processor is quite similar to the scalar version.

1. Compile the Code:

Navigate to the LLaMa.cpp directory and do either of the following.

    a. To build just the main program, run:

    make main RISCV=1

    b. To build the complete project, use:

    make RISCV=1

2. Execute the program

Run the main executable by using the following command and don’t forget to update the model path in it:

./main -m ./path/to/model.gguf -p "Anything" -n 100

Unfortunately, as of now, we do not have direct access to RISC-V hardware with a vector processor for live testing. However, we can simulate this environment using a cross-compiler and an emulator.

Through Emulator:

If you don't have access to RISC-V hardware, you can use a cross-compiler to compile it and then QEMU for emulation.

1. Compile the Code:

Go inside the llama.cpp folder and run this command.

make RISCV_CROSS_COMPILE=1 RISCV=1
2. Execute the program

Use QEMU for emulation and run the main executable with the command below. Make sure to update the paths for both the model (./path/to/model.gguf) and the system root (/path/to/sysroot/):

qemu-riscv64 -L /path/to/sysroot/  -cpu rv64,v=true,vlen=256,elen=64,vext_spec=v1.0 ./main -m ./path/to/model.gguf -p "Anything" -n 50
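
One quick way to confirm that the cross-compiled binary actually contains RVV code is to disassemble it and count vector instructions such as vsetvli (this assumes the toolchain's objdump is on your PATH):

    riscv64-unknown-linux-gnu-objdump -d ./main | grep -c vsetvli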



Conclusion

In this article, we explored the enhancement of LLaMa.cpp using the RISC-V Vector Extension. Our journey included an examination of the capabilities of vector processors and a dive into the fundamentals of LLaMa.cpp and GGML. We also walked through the practical aspects of running LLaMa.cpp on RISC-V hardware, covering both the scalar and vector scenarios. We hope this will help drive wider utilization of, and research on, LLMs within the RISC-V ecosystem.
