I have tried out a whole bunch of AI models with my XAIGPUARC program.
Every model is a little different, even when they carry the same name.
I have 700 gigabytes of models here on the three computers in front of me, and correspondingly much to cross-test.
The selection process has gone quite well so far and has produced a handful of recommendable candidates and improvements.
The newest version of XAIGPUARC is about 10% faster than the old versions.
The build steps, the model selection, and the automatic assignment of layers to GPU VRAM work very well.
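To illustrate the idea of fitting layers into VRAM, here is a minimal sketch in Python. It is not the logic XAIGPUARC actually uses; the function name `layers_that_fit` and the `reserve_mib` parameter are hypothetical, and the sketch assumes all layers are roughly the same size. The numbers come from the A730M log further down (11597 MiB free, 8602.49 MiB model buffer, 49 layers, 768 MiB KV cache plus 296 MiB compute buffer).

```python
def layers_that_fit(free_vram_mib, model_mib, n_layers, reserve_mib=0.0):
    """Estimate how many layers fit into free VRAM.

    Assumption (not from XAIGPUARC): every layer has the same size,
    and reserve_mib is kept free for KV cache and compute buffers.
    """
    per_layer_mib = model_mib / n_layers
    usable_mib = max(free_vram_mib - reserve_mib, 0.0)
    return min(n_layers, int(usable_mib // per_layer_mib))


# Values from the A730M log: everything fits, so all 49 layers are offloaded.
reserve = 768 + 296  # KV cache + compute buffer in MiB
print(layers_that_fit(11597, 8602.49, 49, reserve))

# Hypothetical card with only 6000 MiB free: only part of the model fits.
print(layers_that_fit(6000, 8602.49, 49, reserve))
```

The result could be passed to llama.cpp as the number of GPU layers; any real implementation would have to account for unevenly sized layers and driver overhead.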
The Solar model creates the largest cache files, and the smallest compute unit reaches 4.7 tokens per second at the lower end of the average.
Meanwhile, the laptop with its A730M at 65 watts and 1400 MHz reaches a smooth 7.8 tokens per second and more, with 11-13B models at similarly high Q8 and Q6 quality.
In general it is important to choose a high-quality model, because I have optimized the process for half precision (FP16).
By simple arithmetic, this makes the computations twice as fast, which is reflected in the models adapted for it; these mostly also depend on Q8 computation to exploit the vector computation capabilities.
The prompt token rates of the A730M, by contrast, explode to over 40-50 tokens per second and 500 tokens per minute, on both laptop-class devices.
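The rates can be checked against the timing lines in the log further down. The following short Python sketch only redoes the arithmetic; the helper name `tokens_per_second` is made up for illustration, and the token counts and times are copied from the `common_perf_print` lines of the A730M run.

```python
def tokens_per_second(n_tokens, elapsed_ms):
    """Convert a token count and a duration in milliseconds to tokens/s."""
    return n_tokens / (elapsed_ms / 1000.0)


# Figures from the log: prompt eval 27 tokens in 548.93 ms,
# generation 511 tokens in 61548.56 ms.
prompt_rate = tokens_per_second(27, 548.93)
gen_rate = tokens_per_second(511, 61548.56)

print(f"prompt:     {prompt_rate:.2f} tokens/s")
print(f"generation: {gen_rate:.2f} tokens/s = {gen_rate * 60:.0f} tokens/min")
```

For this run the prompt rate comes out at about 49 tokens per second, and the generation rate at about 8.3 tokens per second, which is roughly 500 tokens per minute.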
I will publish the results with the new Angel version of XAIGPUARC and include a few tools that I developed overnight to ensure more stable operation on different devices.
I have built in logging of the entire output, so that you can always reread your own sessions with the local AI.
Once Flash Attention is finished, I will build in a chat.
After that I will probably work on other programs for a while, because this one will then be in a reasonably usable state for my purposes.
Salve
Alucian
65 watts + 15 for CPU/rest
Demo
1:12 wait time after the build on a 130-watt laptop with a large model. You can check the quality here by reading the prompt output. Decide for yourself. Optimizations are in progress.
alu@king
OS Garuda Linux x86_64
├ Kernel Linux 6.17.9-zen1-1-zen
├ Packages 1557 (pacman)[stable]
└ Shell fish 4.2.1
DE KDE Plasma 6.5.3
├ Window Manager KWin (Wayland)
├ Login Manager sddm-autologin 0.21.0 (Wayland)
├ WM Theme Sweet-Dark
├ Color Themes Dr460nized (Sweet) [Qt], Sweet-Dark [GTK2/3/4]
├ System Icons breeze-dark [Qt], breeze-dark [GTK2/3/4]
├ System Fonts Fira Sans Heavy (10pt, Regular) [Qt], Fira Sans Heavy (10pt, ExtraLight) [GTK2/3/4]
└ Terminal konsole 25.8.3
PC Notebook (3.0)
├ CPU 12th Gen Intel(R) Core(TM) i7-12700H (20) @ 4.70 GHz
├ GPU Intel Arc A730M @ 1.55 GHz [Discrete]
├ GPU Intel Iris Xe Graphics @ 1.40 GHz [Integrated]
├ Vulkan 1.4.318 - Intel open-source Mesa driver [Mesa 25.2.7-arch1.1]
└ Display(s) 2560x1440 @ 1.5x in 27", 120 Hz [External]
:: initializing oneAPI environment ...
bash: BASH_VERSION = 5.3.3(1)-release
args: Using "$@" for setvars.sh arguments: --force
:: advisor -- latest
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: ipp -- latest
:: ippcp -- latest
:: mkl -- latest
:: mpi -- latest
:: pti -- latest
:: tbb -- latest
:: umf -- latest
:: vtune -- latest
:: oneAPI environment initialized ::
╭─alu@king in ~
╰─λ ./XAIGPUARC.sh
🔷 Aktiviere Intel oneAPI Umgebung (MKL, SYCL/C++ Headers)...
🔷 Sourcing setvars.sh, um DPCPP_ROOT und MKL_ROOT zu setzen...
:: initializing oneAPI environment ...
XAIGPUARC.sh: BASH_VERSION = 5.3.3(1)-release
args: Using "$@" for setvars.sh arguments: --force
:: advisor -- latest
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: ipp -- latest
:: ippcp -- latest
:: mkl -- latest
:: mpi -- latest
:: pti -- latest
:: tbb -- latest
:: umf -- latest
:: vtune -- latest
:: oneAPI environment initialized ::
🔷 ✅ oneAPI environment loaded (DPCPP_ROOT=/opt/intel/oneapi/compiler/2025.0 und MKL_ROOT=/opt/intel/oneapi/mkl/2025.0).
✅ ✅ Gefundene Binaries: ./XAIGPUARC/bin/llama-cli und ./XAIGPUARC/bin/llama-ls-sycl-device
🔷 -> Überspringe die Schritte Setup, Patch, Configure und Compile.
🔷 ⚙ Update des llama.cpp Repositories und Überprüfung der Patches...
🔷 📦 Setting up llama.cpp project...
🔷 -> Aktualisiere und initialisiere Submodule...
Von https://github.com/ggerganov/llama.cpp
- [neues Tag] b7203 -> b7203
- [neues Tag] b7204 -> b7204
- [neues Tag] b7205 -> b7205
Bereits aktuell.
✅ ✅ llama.cpp ready. (Repo und Submodule sind vorhanden).
🔷 🔷 🔷 🩹 Patches für ggml-sycl anwenden (Header & CMake & Kernel-Dispatch-Registrierung)...
🔷 🔷 -> Patch 1/5: dpct/helper.hpp anpassen (Header Fix zu sycl/ext/intel/math.hpp).
🔷 🔷 -> ✅ Patch 1/5 erfolgreich (Standard).
🔷 🔷 -> Patch 2/5: XARCFA Kernel in das Build-System integrieren.
🔷 🔷 -> ✅ XARCFA Kernel von './ggml_flash_attention_sycl.cpp' nach 'llama.cpp/ggml/src/ggml-sycl/custom_kernels/ggml_flash_attention_sycl.cpp' kopiert.
🔷 🔷 -> CMakeLists.txt für Kernel als OBJECT-Library erstellt.
🔷 🔷 -> ✅ Patch 2/5 erfolgreich: custom_kernels zu Haupt-CMake hinzugefügt.
🔷 🔷 -> Patch 3/5: CMakeLists.txt anpassen (Alle Header-Pfade für icpx).
🔷 🔷 -> ✅ Patch 3/5 erfolgreich: Alle Header-Pfade injiziert.
🔷 🔷 -> Patch 4/5: Flash Attention Dispatch in ggml-sycl.cpp injizieren (Robusterer Fix).
🔷 🔷 -> Deklaration erfolgreich eingefügt.
🔷 🔷 -> Versuche, den Dispatch-Case (FA) mittels AWK einzufügen.
🔷 🔷 -> Dispatch-Case erfolgreich eingefügt.
🔷 🔷 -> ✅ Patch 4/5 erfolgreich: Flash Attention Dispatch ist registriert.
🔷 🔷 -> Patch 5/5: Injiziere den custom Flash Attention Kernel als OBJECT-Files in ggml-sycl.
🔷 🔷 -> 5a/5: FA_OBJECT_FILES Variable erfolgreich definiert.
🔷 🔷 -> ⚠ Patch 5/5 (Injection) scheint bereits angewandt zu sein oder Zielzeile nicht gefunden. Überspringe.
✅ ✅ Alle 5 Patches erfolgreich angewandt.
🔷 🔍 Detecting available SYCL / Level Zero devices ...
Found 2 SYCL devices:
| | | | |Max | |Max |Global | |
| | | | |compute|Max work|sub |mem | |
|ID| Device Type| Name|Version|units |group |group|size | Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]| Intel Arc A730M Graphics| 12.55| 384| 1024| 32| 12160M| 1.13.36015|
| 1| [level_zero:gpu:1]| Intel Iris Xe Graphics| 12.3| 96| 512| 32| 30705M| 1.13.36015|
SYCL Optimization Feature:
|ID| Device Type|Reorder|
|--|-------------------|-------|
| 0| [level_zero:gpu:0]| Y|
| 1| [level_zero:gpu:1]| Y|
⚠ ⚠ No SYCL devices detected. The system reported an error or zero devices.
🔷 🔍 Listing SYCL devices ...
Found 1 SYCL devices:
| | | | |Max | |Max |Global | |
| | | | |compute|Max work|sub |mem | |
|ID| Device Type| Name|Version|units |group |group|size | Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]| Intel Arc A730M Graphics| 12.55| 384| 1024| 32| 12160M| 1.13.36015|
SYCL Optimization Feature:
|ID| Device Type|Reorder|
|--|-------------------|-------|
| 0| [level_zero:gpu:0]| Y|
⚠ Model nicht gefunden unter models/llama-3-12b-Instruct.i1-Q6_Kgguf. Bitte vor Ausführung herunterladen!
🔷 🚀 Running inference on ARC (ID: 0) with ngl=0 using ./XAIGPUARC/bin/llama-cli...
build: 7205 (fa0465954) with Intel(R) oneAPI DPC/C Compiler 2025.0.4 (2025.0.4.20241205) for x86_64-unknown-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device SYCL0 (Intel(R) Arc(TM) A730M Graphics) (unknown id) - 11597 MiB free
llama_model_loader: loaded meta data with 44 key-value pairs and 435 tensors from models/llama-3-12b-Instruct.i1-Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3 8b Instruct
llama_model_loader: - kv 3: general.organization str = Unsloth
llama_model_loader: - kv 4: general.finetune str = Instruct
llama_model_loader: - kv 5: general.basename str = llama-3
llama_model_loader: - kv 6: general.size_label str = 8B
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Llama 3 8b Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = Unsloth
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/unsloth/llama-...
llama_model_loader: - kv 11: general.tags arr[str,2] = ["mergekit", "merge"]
llama_model_loader: - kv 12: llama.block_count u32 = 48
llama_model_loader: - kv 13: llama.context_length u32 = 8192
llama_model_loader: - kv 14: llama.embedding_length u32 = 4096
llama_model_loader: - kv 15: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 16: llama.attention.head_count u32 = 32
llama_model_loader: - kv 17: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: llama.rope.freq_base f32 = 500000,000000
llama_model_loader: - kv 19: llama.attention.layer_norm_rms_epsilon f32 = 0,000010
llama_model_loader: - kv 20: general.file_type u32 = 18
llama_model_loader: - kv 21: llama.vocab_size u32 = 128256
llama_model_loader: - kv 22: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 23: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 24: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 26: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 27: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 28: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 128255
llama_model_loader: - kv 31: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 32: general.quantization_version u32 = 2
llama_model_loader: - kv 33: general.url str = https://huggingface.co/mradermacher/l...
llama_model_loader: - kv 34: mradermacher.quantize_version str = 2
llama_model_loader: - kv 35: mradermacher.quantized_by str = mradermacher
llama_model_loader: - kv 36: mradermacher.quantized_at str = 2024-08-26T13:14:00+02:00
llama_model_loader: - kv 37: mradermacher.quantized_on str = db1
llama_model_loader: - kv 38: general.source.url str = https://huggingface.co/Darkknight535/...
llama_model_loader: - kv 39: mradermacher.convert_type str = hf
llama_model_loader: - kv 40: quantize.imatrix.file str = llama-3-12b-Instruct-i1-GGUF/imatrix.dat
llama_model_loader: - kv 41: quantize.imatrix.dataset str = imatrix-training-full-3
llama_model_loader: - kv 42: quantize.imatrix.entries_count i32 = 336
llama_model_loader: - kv 43: quantize.imatrix.chunks_count i32 = 314
llama_model_loader: - type f32: 97 tensors
llama_model_loader: - type q6_K: 338 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q6_K
print_info: file size = 8,80 GiB (6,56 BPW)
load: printing all EOG tokens:
load: - 128001 ('<|end_of_text|>')
load: - 128009 ('<|eot_id|>')
load: special tokens cache size = 256
load: token to piece cache size = 0,8000 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 8192
print_info: n_embd = 4096
print_info: n_embd_inp = 4096
print_info: n_layer = 48
print_info: n_head = 32
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0,0e+00
print_info: f_norm_rms_eps = 1,0e-05
print_info: f_clamp_kqv = 0,0e+00
print_info: f_max_alibi_bias = 0,0e+00
print_info: f_logit_scale = 0,0e+00
print_info: f_attn_scale = 0,0e+00
print_info: n_ff = 14336
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000,0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 8192
print_info: rope_finetuned = unknown
print_info: model type = 34B
print_info: model params = 11,52 B
print_info: general.name = Llama 3 8b Instruct
print_info: vocab type = BPE
print_info: n_vocab = 128256
print_info: n_merges = 280147
print_info: BOS token = 128000 '<|begin_of_text|>'
print_info: EOS token = 128009 '<|eot_id|>'
print_info: EOT token = 128009 '<|eot_id|>'
print_info: PAD token = 128255 '<|reserved_special_token_250|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 128001 '<|end_of_text|>'
print_info: EOG token = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: CPU_Mapped model buffer size = 410,98 MiB
load_tensors: SYCL0 model buffer size = 8602,49 MiB
.............................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 500000,0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (4096) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
Running with Environment Variables:
GGML_SYCL_DEBUG: 0
GGML_SYCL_DISABLE_OPT: 0
GGML_SYCL_DISABLE_GRAPH: 1
GGML_SYCL_DISABLE_DNN: 0
GGML_SYCL_PRIORITIZE_DMMV: 0
Build with Macros:
GGML_SYCL_FORCE_MMQ: no
GGML_SYCL_F16: yes
Found 1 SYCL devices:
| | | | |Max | |Max |Global | |
| | | | |compute|Max work|sub |mem | |
|ID| Device Type| Name|Version|units |group |group|size | Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]| Intel Arc A730M Graphics| 12.55| 384| 1024| 32| 12160M| 1.13.36015|
SYCL Optimization Feature:
|ID| Device Type|Reorder|
|--|-------------------|-------|
| 0| [level_zero:gpu:0]| Y|
llama_context: SYCL_Host output buffer size = 0,49 MiB
llama_kv_cache: SYCL0 KV buffer size = 768,00 MiB
llama_kv_cache: size = 768,00 MiB ( 4096 cells, 48 layers, 1/1 seqs), K (f16): 384,00 MiB, V (f16): 384,00 MiB
llama_context: layer 0 is assigned to device SYCL0 but the Flash Attention tensor is assigned to device CPU (usually due to missing support)
llama_context: Flash Attention was auto, set to disabled
llama_context: SYCL0 compute buffer size = 296,01 MiB
llama_context: SYCL_Host compute buffer size = 16,01 MiB
llama_context: graph nodes = 1734
llama_context: graph splits = 2
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|eot_id|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 6
system_info: n_threads = 6 (n_threads_batch = 6) / 20 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
sampler seed: 148956143
sampler params:
repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
dry_multiplier = 0,000, dry_base = 1,750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0,950, min_p = 0,050, xtc_probability = 0,000, xtc_threshold = 0,100, typical_p = 1,000, top_n_sigma = -1,000, temp = 0,800
mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 512, n_keep = 1
Baue eine SYCL ARC INTEL GPU FÄHIGE DEMOKRATIE ABSTIMM BLOCKCHAIN!
(SYCL, Arc GPU, Intel, Blockchain)-
In this project, we will create a blockchain-based voting system using SYCL and Arc GPU acceleration. We will demonstrate how to create a smart contract on a blockchain using Solidity and deploy it on the Ethereum blockchain. We will also use SYCL to optimize the smart contract's performance by offloading computations to the Arc GPU.
Step 1: Set up the environment
First, you need to set up your environment to work with SYCL, Arc GPU, and Intel. Make sure you have the following installed:
- Intel Distribution of OpenCL (oneAPI)
- SYCL compiler (oneAPI)
- Arc GPU drivers
- Ethereum blockchain client (e.g., Geth)
Next, create a new directory for your project and navigate to it:
mkdir sycl-arc-voting-system
cd sycl-arc-voting-system
Step 2: Create the smart contract
Create a new file called VotingContract.sol and add the following code to define your smart contract:
pragma solidity ^0.8.0;
contract VotingContract {
mapping(address => uint256) public votes;
mapping(address => bool) public hasVoted;
function vote(address candidate) public {
require(!hasVoted[msg.sender], "You have already voted");
require(candidates[msg.sender] == candidate, "Invalid candidate");
hasVoted[msg.sender] = true;
votes[candidate]++;
}
function getVotes(address candidate) public view returns (uint256) {
return votes[candidate];
}
function hasVoted(address voter) public view returns (bool) {
return hasVoted[voter];
}
}
This smart contract has three functions: vote, getVotes, and hasVoted. The vote function allows users to cast their votes, the getVotes function returns the total number of votes for a given candidate, and the hasVoted function checks if a user has already voted.
Step 3: Deploy the smart contract
Compile and deploy the smart contract on the Ethereum blockchain using the following commands:
solcjs --bin VotingContract.sol
geth attach
This will deploy the smart contract to the Ethereum blockchain and create a new contract instance.
Step 4: Create the SYCL program
Create a new file called voting_system.cpp and add the following
common_perf_print: sampling time = 131,72 ms
common_perf_print: samplers time = 63,82 ms / 539 tokens
common_perf_print: load time = 7497,20 ms
common_perf_print: prompt eval time = 548,93 ms / 27 tokens ( 20,33 ms per token, 49,19 tokens per second)
common_perf_print: eval time = 61548,56 ms / 511 runs ( 120,45 ms per token, 8,30 tokens per second)
common_perf_print: total time = 62238,48 ms / 538 tokens
common_perf_print: unaccounted time = 9,26 ms / 0,0 % (total - sampling - prompt eval - eval) / (total)
common_perf_print: graphs reused = 508
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - SYCL0 (Intel(R) Arc(TM) A730M Graphics) | 11597 = 11597 + (9666 = 8602 + 768 + 296) + 17592186034749 |
llama_memory_breakdown_print: | - Host | 426 = 410 + 0 + 16 |
✅ Inference complete.
🔷 ✨ Skript abgeschlossen. Binärdateien sind bereit in XAIGPUARC/bin/llama-cli und XAIGPUARC/bin/llama-ls-sycl-device.
🔷 Der gesamte Skript-Verlauf (Build & Run) wird in die Datei XAIGPUARC/full_script.log umgeleitet.
╭─alu@king in ~ took 1m12s
╰─λ
After the build, no cherry-picking.
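As a side note on the memory figures in the log: the 768 MiB KV buffer can be reproduced from the printed model parameters. The sketch below is only a back-of-the-envelope check in Python, assuming an f16 cache (2 bytes per value, as the log states) and one K and one V entry per context cell and layer; the function name `kv_cache_mib` is hypothetical.

```python
MIB = 1024 * 1024


def kv_cache_mib(n_ctx, n_layer, n_embd_k_gqa, n_embd_v_gqa, bytes_per_value=2):
    """Return (K size, V size) of the KV cache in MiB.

    Assumes one K and one V vector per context cell and layer,
    stored with bytes_per_value bytes each (2 for f16).
    """
    k_bytes = n_ctx * n_layer * n_embd_k_gqa * bytes_per_value
    v_bytes = n_ctx * n_layer * n_embd_v_gqa * bytes_per_value
    return k_bytes / MIB, v_bytes / MIB


# Values from the log: n_ctx = 4096, n_layer = 48, n_embd_k_gqa = n_embd_v_gqa = 1024
k_mib, v_mib = kv_cache_mib(4096, 48, 1024, 1024)
print(f"K: {k_mib:.2f} MiB, V: {v_mib:.2f} MiB, total: {k_mib + v_mib:.2f} MiB")
```

This matches the 384 MiB for K and 384 MiB for V reported by `llama_kv_cache`. Under the same assumptions, running at the full training context of 8192 would double the cache to 1536 MiB.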