What I Learned Trying to Convert and Quantize a Model to GGUF
Recently, I attempted to convert an NVIDIA-based model to a format compatible with Ollama. This post outlines the steps I took, the issues I encountered, and the necessary next steps to complete the conversion and import process.
I’m sharing this information in case it might be helpful to others facing a similar challenge.
The Original Model
The model in question is Nemotron-51B, developed by NVIDIA and based on LLaMA 3.1. It’s available on the Hugging Face model repository in the safetensors format.
How to Import a Model into Ollama
To import a model into Ollama, the first step is to convert the safetensors format into GGUF. For this, I tried using the llama.cpp tool. However, I ran into a problem: the conversion script only supports a fixed set of model architectures, and the architecture this NVIDIA model uses, DeciLMForCausalLM, had not yet been added to llama.cpp as of September 2024.
To check which architectures are supported, inspect the convert_hf_to_gguf.py file in the llama.cpp repository.
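If you just want a quick list of the supported architectures, the conversion classes in the version of the script I looked at are registered with a decorator, so a grep is enough (the decorator name may differ in newer checkouts, and the path depends on where you cloned llama.cpp):
```sh
# List the Hugging Face architectures convert_hf_to_gguf.py knows how to convert.
# "@Model.register" is the registration decorator as of September 2024;
# adjust the pattern if it has been renamed in your checkout.
grep -n '@Model.register' llama.cpp/convert_hf_to_gguf.py
```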
Below, I’ve shared the scripts I used, as well as the README.md detailing my attempt to convert and import the model into Ollama.
Scripts
README.md
# Prepare the NVIDIA Nemotron-51B model for ollama
IT DOESN’T WORK BECAUSE I CAN’T FIND A SCRIPT THAT SETS THE GGUF PARAMETERS TO CONVERT A **DeciLMForCausalLM** MODEL.
TO CHECK WHICH MODELS CAN BE CONVERTED BY LLAMA.CPP, LOOK AT THE convert_hf_to_gguf.py SCRIPT IN THE LLAMA.CPP FOLDER.
- [ollama: importing a model](https://github.com/ollama/ollama/blob/main/docs/import.md)
- [llama.cpp](https://github.com/ggerganov/llama.cpp)
- [Install ollama](https://ollama.com/download)
- [nvidia/Llama-3_1-Nemotron-51B-Instruct](https://huggingface.co/nvidia/Llama-3_1-Nemotron-51B-Instruct)
## Run
### 0. Install dependencies
- transformers python module
- llama.cpp
- ollama
- torch
```sh
pip3 install transformers
```
```sh
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
pip3 install -r requirements.txt
```
```sh
curl -fsSL https://ollama.com/install.sh | sh
```
To install torch, follow the instructions on this [page](https://pytorch.org/get-started/locally/).
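On many Linux machines the default wheel is enough to run the download script below; treat this as a sketch and prefer the selector on the PyTorch page if you need a specific CUDA build:
```sh
# Installs the default PyTorch wheel; pick a CUDA-specific command from
# pytorch.org/get-started/locally if you need GPU support.
pip3 install torch
```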
### 1. Download model
```sh
mkdir model
```
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "nvidia/Llama-3_1-Nemotron-51B-Instruct"
model_kwargs = {"torch_dtype": torch.bfloat16, "trust_remote_code": True, "device_map": "auto"}
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token_id = tokenizer.eos_token_id
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", trust_remote_code=True)
# Save the model and tokenizer to a local folder
tokenizer.save_pretrained("./model")
model.save_pretrained("./model")
```
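Once the files are saved, you can print the architecture declared in the downloaded config.json, which is what convert_hf_to_gguf.py keys on; for this model it reports DeciLMForCausalLM, which is why the conversion fails:
```sh
# Print the architecture name stored in the downloaded config.json
python3 -c "import json; print(json.load(open('model/config.json'))['architectures'])"
```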
### 2. Convert model to GGUF with llama.cpp
Run this command to convert the **Safetensors** model to **GGUF**. It needs all of the model files downloaded from Hugging Face in the previous step.
```bash
python3 ../llama.cpp/convert_hf_to_gguf.py model
```
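The script also accepts --outfile and --outtype flags to control where the GGUF is written and at what precision (flag names as of the version I used, so check `--help` in your checkout; the output filename below is just an example):
```sh
# Convert to a 16-bit GGUF at an explicit output path
python3 ../llama.cpp/convert_hf_to_gguf.py model \
  --outtype f16 \
  --outfile model/nemotron-51b-f16.gguf
```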
### 3. Create the ollama model, quantizing to q4_K_M
```sh
ollama create --quantize q4_K_M nvidia-nemotron-51b
```
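`ollama create` reads the `Modelfile` in the current directory; if you keep it somewhere else you can point at it explicitly with `-f` (same command as above, just more explicit):
```sh
ollama create nvidia-nemotron-51b --quantize q4_K_M -f ./Modelfile
```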
### 4. Test the model
```sh
ollama run nvidia-nemotron-51b
```
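`ollama run` drops you into an interactive chat; you can also pass a one-shot prompt on the command line to sanity-check the import:
```sh
ollama run nvidia-nemotron-51b "Summarize what the Nemotron-51B model is in one sentence."
```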
### 5. Upload the model to ollama
First create an ollama account by signing up at [ollama.com](https://ollama.com/signup).
Add your ollama public key in the [ollama account settings](https://ollama.com/settings/keys), following the instructions on that page.
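The key you paste into that page is the public key ollama generated at install time; the usual locations (taken from the ollama import docs, adjust for your install) are:
```sh
# macOS, or Linux when ollama runs under your own user
cat ~/.ollama/id_ed25519.pub
# Linux when ollama was installed as a systemd service
sudo cat /usr/share/ollama/.ollama/id_ed25519.pub
```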
```sh
ollama cp nvidia-nemotron-51b madcato/nvidia-nemotron-51b
ollama push madcato/nvidia-nemotron-51b
```
### 6. Test it
```sh
ollama run madcato/nvidia-nemotron-51b
```
download.py
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "nvidia/Llama-3_1-Nemotron-51B-Instruct"
model_kwargs = {"torch_dtype": torch.bfloat16, "trust_remote_code": True, "device_map": "auto"}

# Download the tokenizer and weights from Hugging Face
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token_id = tokenizer.eos_token_id
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", trust_remote_code=True)

# Save the model and tokenizer to a local folder
tokenizer.save_pretrained("./model")
model.save_pretrained("./model")
```
Modelfile
```
FROM model
```
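If the conversion ever succeeds, the Modelfile could point straight at the generated GGUF instead of the safetensors directory; the filename below is hypothetical and should match whatever convert_hf_to_gguf.py actually produces:
```
FROM ./model/nemotron-51b-f16.gguf
```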