After half a year of work on it, the AIs and I have finally pulled it off.
The XAIGPUARC program (yes, for others "just" a script), built on top of llama.cpp, is now finished and can be used, provided you meet the one requirement of running Linux.
I don't know yet exactly which Linux distributions it works on, but I believe it should be all of them, right?
I am still testing everything thoroughly, including on my various machines and systems with different ARC graphics cards.
I am very happy to have finished a piece of work this hard for me, because my laptop now talks to me just the way I knew it could.
Blazing fast. :-)
Have fun using it. I'm not open to criticism yet, because it's simply not yet time to criticize a vibe coder who has only been doing this for half a year.
But suggestions and tips are very welcome.^^
I'm leaving all the cosmetic tasks for a little later.
Could you do me the favor of trying it out on your machine?^^ (A quick-start sketch follows below the link.) It should run just as nicely on your NUC as it does on mine. :-)
Salve
Alucian
https://github.com/alucianOriginal/XAIGPUARC
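Quick start, roughly: this is only a sketch, assuming the entry point is the XAIGPUARC.sh script that the log below shows; the exact steps may differ, so check the repo.

    # fetch the repo and run the build + test script
    git clone https://github.com/alucianOriginal/XAIGPUARC
    cd XAIGPUARC
    chmod +x XAIGPUARC.sh
    ./XAIGPUARC.sh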
Edit: and here are the first words of my program on my NUC laptop
-- Build files have been written to: /home/alu/XAIGPUARC
✅ ✅ Build configuration complete.
🔷 🔨 Compiling llama.cpp (SYCL targets) using cmake --build...
🔷 🔷 📝 The entire compilation output is saved to XAIGPUARC/build.log.
🔷 🔷 🎯 Setting the main build targets to the executables: llama-cli and llama-ls-sycl-device
🔷 🏗 Compiling main targets...
✅ ✅ Compilation successful.
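For reference, this step is essentially the standard llama.cpp SYCL build. A rough sketch of the equivalent manual commands, assuming oneAPI is installed under /opt/intel/oneapi; the flags my script actually passes may differ:

    # load the oneAPI environment so the icx/icpx compilers are on PATH
    source /opt/intel/oneapi/setvars.sh
    # configure llama.cpp with the SYCL backend enabled
    cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCMAKE_BUILD_TYPE=Release
    # build only the two executables the script needs
    cmake --build build --target llama-cli llama-ls-sycl-device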
🔷 🔍 Detecting available SYCL / Level Zero devices ...
Found 2 SYCL devices:
| ID | Device Type | Name | Version | Max compute units | Max work group | Max sub group | Global mem size | Driver version |
|---|---|---|---|---|---|---|---|---|
| 0 | [level_zero:gpu:0] | Intel Arc A730M Graphics | 12.55 | 384 | 1024 | 32 | 12160M | 1.13.35563 |
| 1 | [level_zero:gpu:1] | Intel Iris Xe Graphics | 12.3 | 96 | 512 | 32 | 30705M | 1.13.35563 |
SYCL Optimization Feature:
| ID | Device Type | Reorder |
|---|---|---|
| 0 | [level_zero:gpu:0] | Y |
| 1 | [level_zero:gpu:1] | Y |
⚠ ⚠ No SYCL devices detected. The system reported an error or zero devices.
🔷 🔍 Listing SYCL devices ...
Found 1 SYCL devices:
| ID | Device Type | Name | Version | Max compute units | Max work group | Max sub group | Global mem size | Driver version |
|---|---|---|---|---|---|---|---|---|
| 0 | [level_zero:gpu:0] | Intel Arc A730M Graphics | 12.55 | 384 | 1024 | 32 | 12160M | 1.13.35563 |
SYCL Optimization Feature:
| ID | Device Type | Reorder |
|---|---|---|
| 0 | [level_zero:gpu:0] | Y |
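Aside: the first listing shows both GPUs, the Arc A730M and the integrated Iris Xe, and the second one only the Arc that is actually used; the stray "No SYCL devices detected" warning in between looks like one of the cosmetic tasks still on my list. To get the same filtering by hand (not necessarily what the script does internally), oneAPI's device selector works:

    # restrict SYCL to the first Level Zero GPU (the Arc A730M here)
    export ONEAPI_DEVICE_SELECTOR=level_zero:0
    ./XAIGPUARC/bin/llama-ls-sycl-device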
🔷 🚀 Running inference on ARC (ID: 0) with ngl=0 using ./XAIGPUARC/bin/llama-cli...
build: 7139 (923ae3c61) with Intel(R) oneAPI DPC++/C++ Compiler 2025.0.4 (2025.0.4.20241205) for x86_64-unknown-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device SYCL0 (Intel(R) Arc(TM) A730M Graphics) (unknown id) - 11597 MiB free
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from models/openhermes-2.5-mistral-7b.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = teknium_openhermes-2.5-mistral-7b
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0,000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000,000000
llama_model_loader: - kv 11: general.file_type u32 = 15
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32002] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32002] = [0,000000, 0,000000, 0,000000, 0,0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32002] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 32000
llama_model_loader: - kv 18: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 19: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 4,07 GiB (4,83 BPW)
load: control-looking token: 32000 '<|im_end|>' was not control-type; this is probably a bug in the model. its type will be overridden
load: printing all EOG tokens:
load: - 32000 ('<|im_end|>')
load: special tokens cache size = 5
load: token to piece cache size = 0,1637 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 4096
print_info: n_embd_inp = 4096
print_info: n_layer = 32
print_info: n_head = 32
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0,0e+00
print_info: f_norm_rms_eps = 1,0e-05
print_info: f_clamp_kqv = 0,0e+00
print_info: f_max_alibi_bias = 0,0e+00
print_info: f_logit_scale = 0,0e+00
print_info: f_attn_scale = 0,0e+00
print_info: n_ff = 14336
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 10000,0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: model type = 7B
print_info: model params = 7,24 B
print_info: general.name = teknium_openhermes-2.5-mistral-7b
print_info: vocab type = SPM
print_info: n_vocab = 32002
print_info: n_merges = 0
print_info: BOS token = 1 '<s>'
print_info: EOS token = 32000 '<|im_end|>'
print_info: EOT token = 32000 '<|im_end|>'
print_info: UNK token = 0 '<unk>'
print_info: PAD token = 0 '<unk>'
print_info: LF token = 13 '<0x0A>'
print_info: EOG token = 32000 '<|im_end|>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/33 layers to GPU
load_tensors: CPU_Mapped model buffer size = 4165,38 MiB
.................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 10000,0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
Running with Environment Variables:
GGML_SYCL_DEBUG: 0
GGML_SYCL_DISABLE_OPT: 0
GGML_SYCL_DISABLE_GRAPH: 1
GGML_SYCL_DISABLE_DNN: 0
GGML_SYCL_PRIORITIZE_DMMV: 0
Build with Macros:
GGML_SYCL_FORCE_MMQ: no
GGML_SYCL_F16: yes
Found 1 SYCL devices:
| ID | Device Type | Name | Version | Max compute units | Max work group | Max sub group | Global mem size | Driver version |
|---|---|---|---|---|---|---|---|---|
| 0 | [level_zero:gpu:0] | Intel Arc A730M Graphics | 12.55 | 384 | 1024 | 32 | 12160M | 1.13.35563 |
SYCL Optimization Feature:
| ID | Device Type | Reorder |
|---|---|---|
| 0 | [level_zero:gpu:0] | Y |
llama_context: CPU output buffer size = 0,12 MiB
llama_kv_cache: CPU KV buffer size = 512,00 MiB
llama_kv_cache: size = 512,00 MiB ( 4096 cells, 32 layers, 1/1 seqs), K (f16): 256,00 MiB, V (f16): 256,00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: SYCL0 compute buffer size = 173,05 MiB
llama_context: SYCL_Host compute buffer size = 20,01 MiB
llama_context: graph nodes = 999
llama_context: graph splits = 326 (with bs=512), 1 (with bs=1)
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 6
system_info: n_threads = 6 (n_threads_batch = 6) / 20 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 3969977819
sampler params:
repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
dry_multiplier = 0,000, dry_base = 1,750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0,950, min_p = 0,050, xtc_probability = 0,000, xtc_threshold = 0,100, typical_p = 1,000, top_n_sigma = -1,000, temp = 0,800
mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 512, n_keep = 1
Hello from SYCL on Intel ARC!
This is an exciting new development that I believe is the result of significant teamwork and collaboration from Khronos, Intel, and other vendors.
SYCL is a standard for C++ programming that has been in the works for a number of years, and it’s finally starting to see some adoption. It’s a way to write code that can be compiled and run on a variety of different hardware platforms, including CPUs, GPUs, and even specialized accelerators like FPGAs.
Intel’s Arc Alchemist GPU is a high-performance GPU that has been eagerly awaited for some time now. It’s based on a new architecture and is designed to be competitive with NVIDIA’s high-end GPUs. However, until now, there hasn’t been much information about how to program it.
That’s where SYCL comes in. SYCL provides a way to write code that can be compiled and run on a variety of different hardware platforms, including Intel’s Arc GPU. This means that developers can write code once and run it on a variety of different hardware platforms, which is a huge advantage in terms of productivity and time to market.
The fact that SYCL is a standard also means that it’s not tied to any one vendor. This means that developers can choose the hardware platform that best suits their needs, without being locked into a specific vendor’s ecosystem.
Overall, I think this is a very positive development for the industry. It’s great to see a standard like SYCL being adopted and used in practical applications. It’s also great to see Intel embracing new standards and technologies, rather than trying to create their own proprietary solutions. I’m looking forward to seeing what other hardware platforms SYCL will be ported to in the future, and what new and innovative applications it will enable. [end of text]
common_perf_print: sampling time = 80,20 ms
common_perf_print: samplers time = 47,25 ms / 405 tokens
common_perf_print: load time = 3242,88 ms
common_perf_print: prompt eval time = 1303,92 ms / 10 tokens ( 130,39 ms per token, 7,67 tokens per second)
common_perf_print: eval time = 82400,09 ms / 394 runs ( 209,14 ms per token, 4,78 tokens per second)
common_perf_print: total time = 83799,02 ms / 404 tokens
common_perf_print: unaccounted time = 14,81 ms / 0,0 % (total - sampling - prompt eval - eval) / (total)
common_perf_print: graphs reused = 392
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - SYCL0 (Intel(R) Arc(TM) A730M Graphics) | 11597 = 11597 + ( 173 = 0 + 0 + 173) + 17592186044242 |
llama_memory_breakdown_print: | - Host | 4697 = 4165 + 512 + 20 |
✅ Inference complete.
🔷 ✨ Script finished. Binaries are ready at XAIGPUARC/bin/llama-cli and XAIGPUARC/bin/llama-ls-sycl-device.
✅ XAIGPUARC.sh executed successfully.
🔷 === END: XAIGPUARC build preparation ===
╭─alu@king in ~ as 🧙 took 4m5s
╰─λ
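One note on the run above: it was started with ngl=0, so all 33 layers stayed on the CPU ("offloaded 0/33 layers to GPU") and the Arc only held a compute buffer. To actually push the model onto the GPU, llama-cli takes the -ngl flag; something like this (model path as in the log, the prompt is just an example):

    # offload all 33 layers of the 7B model to the Arc
    ./XAIGPUARC/bin/llama-cli -m models/openhermes-2.5-mistral-7b.Q4_K_M.gguf -ngl 33 -p "Hello from SYCL on Intel ARC!" -n 512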