- Use --allow-dirty for crates.io publish to avoid transient Cargo.lock blocking release
- Add git status/diff debug logs for easier triage
- Run Docker publish step even if crates.io publish fails (if: always())
- Added Docker image publishing to GHCR in release workflow
- Fixed issue #146: Docker images now published automatically
- Added proper GHCR authentication and multi-tag publishing
- Both versioned and latest Docker tags now available
- Fixed stop token truncation to respect UTF-8 character boundaries
- Prevents FromUtf8Error when multi-byte characters are split during truncation
- Added regression tests for Unicode handling in token streaming
- Ensures callbacks receive valid UTF-8 strings
The issue was that stop token removal could truncate strings in the middle
of multi-byte Unicode characters (like emojis), causing UTF-8 decoding errors.
Now we find the proper character boundary before the stop token.
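A minimal sketch of the boundary-safe truncation idea (function name and signature are illustrative, not shimmy's actual API):

```rust
/// Truncate `s` at byte offset `pos`, stepping back to the nearest UTF-8
/// character boundary so a multi-byte character (e.g. an emoji) is never
/// split mid-sequence.
fn truncate_at_char_boundary(s: &mut String, mut pos: usize) {
    if pos >= s.len() {
        return; // nothing to cut
    }
    // is_char_boundary() is false inside a multi-byte sequence; walk back
    // until we reach a valid boundary (offset 0 is always a boundary).
    while !s.is_char_boundary(pos) {
        pos -= 1;
    }
    s.truncate(pos); // safe: pos is now a char boundary
}
```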
* fix: group sharded model files in auto-discovery (#147)
- Add regex-based detection for sharded model patterns (model-XXXXX-of-XXXXX.ext); see the sketch below
- Group multiple sharded files into single model entries with aggregated size
- Display sharded models with count of additional files (e.g., '+3 more files')
- Use directory name as model name for grouped sharded models
- Add comprehensive regression tests for sharded model grouping scenarios
- Clean up unused imports and ensure zero warnings
This resolves the issue where sharded SafeTensors models were listed as
individual entries instead of being grouped together as a single model.
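A sketch of the detection and grouping idea, assuming the regex crate (pattern and names are illustrative; the real logic lives in shimmy's auto-discovery code):

```rust
use regex::Regex;
use std::collections::HashMap;

// Map a shard file name like "model-00001-of-00004.safetensors" to a
// shared group key ("model.safetensors"); non-shards get no key.
fn shard_group_key(file_name: &str) -> Option<String> {
    let re = Regex::new(r"^(?P<stem>.+)-\d{5}-of-\d{5}\.(?P<ext>\w+)$").ok()?;
    let caps = re.captures(file_name)?;
    Some(format!("{}.{}", &caps["stem"], &caps["ext"]))
}

// Collapse shards into single entries with aggregated size and file count.
fn group_shards(files: &[(String, u64)]) -> HashMap<String, (u64, usize)> {
    let mut groups = HashMap::new();
    for (name, size) in files {
        let key = shard_group_key(name).unwrap_or_else(|| name.clone());
        let entry = groups.entry(key).or_insert((0u64, 0usize));
        entry.0 += *size; // aggregated size
        entry.1 += 1;     // file count, e.g. "+3 more files" when > 1
    }
    groups
}
```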
* style: format code with cargo fmt
* fix: remove unused variable prefixes in AMD GPU detection test
- Fixed a compilation error where variables were declared with underscore prefixes but referenced without them
- This was causing test failures during release gate validation
Fix #142: AMD GPU detection on Windows
- Add configure_gpu_environment() method to set GGML_* environment variables before backend initialization
- Fix GPU layer assignment by ensuring OpenCL/Vulkan/CUDA backends have proper environment setup
- Add comprehensive regression tests for all GPU backend environment configuration
- Fix compiler warnings with documented #[allow(dead_code)] attributes for conditionally-used MoeConfig fields
Root cause: GPU backends require environment variables to be set before llama.cpp initialization.
AMD GPUs were detected by clinfo, but layers were assigned to the CPU because the GGML_OPENCL/GGML_VULKAN variables were missing.
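A minimal sketch of the idea behind configure_gpu_environment() (the backend parameter and variable values are assumptions, not shimmy's exact code):

```rust
use std::env;

// GGML_* variables must be exported before llama.cpp initializes its
// backends, or layer offload silently falls back to CPU.
fn configure_gpu_environment(backend: &str) {
    match backend {
        "opencl" => env::set_var("GGML_OPENCL", "1"),
        "vulkan" => env::set_var("GGML_VULKAN", "1"),
        "cuda" => env::set_var("GGML_CUDA", "1"),
        _ => {} // CPU: nothing to configure
    }
}
```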
Tests: All 6 release gates pass with zero warnings. Regression tests added for GPU backend validation.
* Fix compiler warnings in issue 142 regression test
- Prefix unused variables with underscores to suppress warnings
- Variables are conditionally used based on feature flags
- Remove extra fields (root, parent, permission) from /v1/models response
- Create separate ListModel struct for models endpoint that matches OpenAI spec exactly
- Update created timestamp to use proper system time instead of 0
- Update all tests to use the new ListModel structure
Root cause: Frontend applications like Open WebUI and AnythingLLM expect strict OpenAI API compliance. The extra fields in shimmy's Model struct were causing them to reject the API as incompatible.
Fix ensures /v1/models returns only: id, object, created, owned_by - matching OpenAI specification exactly.
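A sketch of what the spec-exact entry might look like with serde (field set from the commit; the owned_by value is an assumption):

```rust
use serde::Serialize;
use std::time::{SystemTime, UNIX_EPOCH};

// Exactly the four fields the OpenAI /v1/models spec defines; no extra
// fields for strict clients like Open WebUI or AnythingLLM to choke on.
#[derive(Serialize)]
struct ListModel {
    id: String,
    object: String, // always "model"
    created: u64,   // real Unix timestamp, not a hardcoded 0
    owned_by: String,
}

impl ListModel {
    fn new(id: impl Into<String>) -> Self {
        let created = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_secs())
            .unwrap_or(0);
        Self {
            id: id.into(),
            object: "model".into(),
            created,
            owned_by: "shimmy".into(), // assumed owner string
        }
    }
}
```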
- Add missing COPY benches/ ./benches/ to Dockerfile
- Update Rust version from 1.75-slim to 1.85-slim for lock file compatibility
- Add libclang-dev and cmake build dependencies
- Add Docker build regression test to release gates
Root cause: the Dockerfile was missing the benches/ directory copy, causing Cargo manifest parsing to fail, and lacked the build dependencies needed to compile llama.cpp-sys.
CRITICAL FIX - Prevents recurring release blocker:
- Added pre-commit hook to auto-update Cargo.lock when Cargo.toml changes
- Added pre-flight check to dry-run-release.sh to catch uncommitted Cargo.lock
- Committed current Cargo.lock with version 1.7.4
This has killed multiple release attempts. Never again.
Signed-off-by: Michael A. Kuykendall <michaelallenkuykendall@gmail.com>
- Cross.toml only used for ARM64 Linux builds
- Removed aarch64-apple-darwin config causing warnings
- macOS builds use native cargo, not cross
- Fixes ARM64 Linux release builds
Signed-off-by: Michael A. Kuykendall <michaelallenkuykendall@gmail.com>
- Add CROSS_NO_WARNINGS=1 to build environment
- Prevents unused Cross.toml key warnings from failing builds
- Cross.toml has macOS-specific config that warns on Linux ARM64 builds
- Fixes Issue #131 ARM64 release builds
Signed-off-by: Michael A. Kuykendall <michaelallenkuykendall@gmail.com>
- CUDA binaries include static CUDA runtime libraries (~26MB)
- Huggingface binaries are much smaller (~2.6MB)
- Gate 4 now rebuilds huggingface binary after Gate 2 CUDA build
- This ensures we validate the correct binary size for releases
- Fixes false-positive size limit failures in local validation
Signed-off-by: Michael A. Kuykendall <michaelallenkuykendall@gmail.com>
Added .pre-commit-config.yaml with 3 fast hooks:
1. cargo fmt (auto-formatting)
2. cargo clippy (code quality)
3. cargo check (compile validation)
Updated scripts/dry-run-release.sh:
- Added pre-flight format check (catches formatting before gates)
- This catches issues like the one we just hit
Removed duplicate scripts/validate-release.sh:
- Unnecessary duplication of dry-run-release.sh
Workflow separation:
- Pre-commit: Fast checks only (~5-10s)
- PR/Release: Full validation including 115 regression tests (~60-100s)
Regression tests run on:
- Pull requests (GitHub Actions CI)
- Release workflow (dry-run-release.sh Gate 5)
- Manual: cargo test --all-features
Setup: ./scripts/setup-precommit.sh
Signed-off-by: Michael A. Kuykendall <michaelallenkuykendall@gmail.com>
Created scripts/validate-release.sh to catch CI failures locally BEFORE push.
Mirrors CI checks exactly:
1. cargo fmt -- --check (formatting)
2. cargo clippy (code quality)
3. cargo test --all-features (all tests + 115 regression tests)
4. cargo deny check (security audit)
5. cargo build --release (release build)
Run before every push to avoid CI surprises.
This would have caught the formatting issue we just hit.
Signed-off-by: Michael A. Kuykendall <michaelallenkuykendall@gmail.com>
Cargo fmt re-indented test functions after mod issue_132_auto_stop_tokens
was renamed to mod tests. This was not caught locally because
'cargo fmt -- --check' was not run before pushing.
Root cause: the local validation workflow was missing a format check.
Signed-off-by: Michael A. Kuykendall <michaelallenkuykendall@gmail.com>
Problem:
- User @Slach requested ARM64 Linux support for the NVIDIA DGX Spark platform
- Release workflow only built x86_64 binaries
- ARM64 users had to build from source manually
Solution:
- Added aarch64-unknown-linux-gnu target to release workflow matrix
- Configured cross-rs for ARM64 cross-compilation on x86_64 runners
- Added shimmy-linux-aarch64 binary to release artifacts
- Used huggingface,llama features (CPU-only) for ARM64 builds
Implementation Details:
- Matrix entry: os=ubuntu-latest, target=aarch64-unknown-linux-gnu, use-cross=true
- Install cross tool conditionally when use-cross flag is set
- Build command checks use-cross flag and uses 'cross' instead of 'cargo'
- Release artifacts now include shimmy-linux-aarch64 alongside existing platforms
Testing:
- Added 11 regression tests in tests/regression/issue_131_arm64_ci_support.rs
- Tests verify: ARM64 target, cross-compilation config, artifact upload, naming
- All 115 regression tests passing
- Build: ✅ Clippy: ✅ Format: ✅
Platforms Now Supported:
- Linux x86_64 (existing)
- Linux ARM64 (NEW - Issue #131)
- Windows x86_64 (existing)
- macOS Intel (existing)
- macOS ARM64/Apple Silicon (existing)
Addresses #131
Signed-off-by: Michael A. Kuykendall <michaelallenkuykendall@gmail.com>
Auto-configure stop tokens for chat templates to prevent template marker leakage.
Fixes Issue #132 reported by @3588.
Signed-off-by: Michael A. Kuykendall <michaelallenkuykendall@gmail.com>
Problem:
- Models using chat templates (ChatML, Llama3, etc.) were outputting raw template tokens
- User @3588 reported gpt-oss-20b outputting <|im_end|> and <|im_start|> in responses
- Templates were applied to format prompts but stop tokens were not configured
- Generated text leaked template markers instead of stopping at them
Solution:
- Added stop_tokens field to GenOptions struct (Vec<String>)
- Added stop_tokens() method to TemplateFamily enum:
* ChatML → ["<|im_end|>", "<|im_start|>"]
* Llama3 → ["<|eot_id|>", "<|end_of_text|>"]
* OpenChat → [] (no special tokens)
- Modified OpenAI compatibility layer to auto-configure stop tokens based on template
- Added support for user-provided stop tokens that merge with template defaults
- Updated llama.rs generation loop to check for stop tokens and truncate output
- Stop token detection uses rfind() to find last occurrence and truncates there
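A condensed sketch of the two pieces (enum shape simplified from shimmy's TemplateFamily; apply_stop_tokens is an illustrative name):

```rust
enum TemplateFamily {
    ChatML,
    Llama3,
    OpenChat,
}

impl TemplateFamily {
    // Per-template stop tokens, as listed above.
    fn stop_tokens(&self) -> Vec<String> {
        match self {
            TemplateFamily::ChatML => vec!["<|im_end|>".into(), "<|im_start|>".into()],
            TemplateFamily::Llama3 => vec!["<|eot_id|>".into(), "<|end_of_text|>".into()],
            TemplateFamily::OpenChat => vec![], // no special tokens
        }
    }
}

// Truncate generated text at the last occurrence of any stop token,
// mirroring the rfind() logic in the generation loop.
fn apply_stop_tokens(output: &mut String, stop_tokens: &[String]) {
    for token in stop_tokens {
        if let Some(pos) = output.rfind(token.as_str()) {
            output.truncate(pos);
        }
    }
}
```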
Testing:
- Added 14 regression tests in tests/regression/issue_132_auto_stop_tokens.rs
- Tests cover template stop token configuration, user-provided tokens, truncation logic
- All 104 regression tests passing
- Build: ✅ Clippy: ✅ (no warnings) Format: ✅
Closes #132
Signed-off-by: Michael A. Kuykendall <michaelallenkuykendall@gmail.com>
User report: @D0wn10ad downloaded the Windows release binary, but gpu-info
showed all GPU features (CUDA/Vulkan/OpenCL) disabled.
Root Cause:
- Release workflow built binaries WITHOUT GPU features
- All platforms used default cargo build (CPU only)
- Users forced to compile from source to get GPU support
Fix:
- Windows builds: Added llama-vulkan for broad GPU compatibility
- macOS builds: Added mlx for Apple Silicon GPU acceleration
- Linux musl: Kept huggingface-only (avoids llama.cpp C++ issues)
- Platform-specific feature detection in release workflow
Testing:
- Created tests/regression/issue_129_precompiled_gpu_support.rs
- Validates release workflow YAML contains GPU features
- Tests platform-specific conditional logic
- All 88 regression tests passing
Impact:
- Future releases will have GPU support built-in for Windows/macOS
- Users can download and use GPU acceleration without compiling
- Closes gap between source code capabilities and distributed binaries
Closes #129
Signed-off-by: Michael A. Kuykendall <michaelallenkuykendall@gmail.com>
- Created tests/regression/issue_130_gpu_layer_offloading.rs
- Validates CPU backend returns 0 layers (no offload)
- Validates GPU backends (CUDA/Vulkan/OpenCL) return 999 layers (full offload)
- Made GpuBackend and gpu_layers() public for testing
- All 84 regression tests passing
This prevents regression of the fix where gpu_layers() was incorrectly
returning 999 for ALL backends including CPU, causing layers to not
properly offload to GPU even when compiled with GPU features.
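The core assertions, roughly (the import path and variant names are hypothetical):

```rust
use shimmy::engine::llama::GpuBackend; // hypothetical path

#[test]
fn issue_130_gpu_layer_offloading() {
    // CPU must not offload; every GPU backend offloads all layers.
    assert_eq!(GpuBackend::Cpu.gpu_layers(), 0);
    for backend in [GpuBackend::Cuda, GpuBackend::Vulkan, GpuBackend::OpenCl] {
        assert_eq!(backend.gpu_layers(), 999);
    }
}
```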
Signed-off-by: Michael A. Kuykendall <michaelallenkuykendall@gmail.com>
The gpu_layers() method was returning 999 (offload all layers) for ALL
backends including CPU. This caused confusion where Vulkan would be
detected but layers would still be assigned to CPU in llama.cpp.
Root cause: GpuBackend::gpu_layers() didn't check which backend was active.
Changes:
- Match on backend type: CPU returns 0, GPU backends return 999
- Removed incorrect #[allow(dead_code)] from actively used methods
- new_with_backend(), with_moe_config(), get_backend_info() all in use
This fixes:
- Issue #130: Vulkan compiled but layers not offloaded
- Issue #126: MoE models not detecting GPU (same root cause)
- Issue #129: Partial fix - still need release workflow update
Users building with --features llama-vulkan will now actually get
GPU layer offloading. Layers will show as assigned to VULKAN device
instead of CPU in llama.cpp logs.
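A sketch of the corrected method (enum and variant names simplified from shimmy's actual GpuBackend):

```rust
enum GpuBackend {
    Cpu,
    Cuda,
    Vulkan,
    OpenCl,
}

impl GpuBackend {
    // Before the fix this returned 999 unconditionally; now only real
    // GPU backends request full offload.
    fn gpu_layers(&self) -> u32 {
        match self {
            GpuBackend::Cpu => 0, // keep all layers on CPU
            GpuBackend::Cuda | GpuBackend::Vulkan | GpuBackend::OpenCl => 999,
        }
    }
}
```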
Testing:
- cargo build --features llama (compiles clean)
- cargo clippy --all-features (no warnings)
- GPU backends properly return 999 layers
- CPU backend properly returns 0 layers
Signed-off-by: Michael A. Kuykendall <michaelallenkuykendall@gmail.com>
MLX support is waiting on a new library implementation.
Issue #127 will remain open, but the feature is on hold.
Issue #128 (BackendAlreadyInitialized) was unrelated to MLX: it was a Windows-only backend singleton issue that has already been fixed.
Signed-off-by: Michael A. Kuykendall <michaelallenkuykendall@gmail.com>
- Implements OnceLock singleton for LlamaBackend initialization (sketched below)
- Ensures backend is initialized only once per process
- Removes owned backend from LlamaLoaded struct
- Adds regression test for backend reinitialization
- Fixes #128
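A sketch of the singleton pattern (the LlamaBackend type and its fallible init() are assumed from the llama.cpp bindings):

```rust
use std::sync::OnceLock;

static BACKEND: OnceLock<LlamaBackend> = OnceLock::new();

// get_or_init runs the closure at most once per process; later model
// loads reuse the existing backend instead of re-initializing it,
// which is what triggered BackendAlreadyInitialized on Windows.
fn global_backend() -> &'static LlamaBackend {
    BACKEND.get_or_init(|| LlamaBackend::init().expect("backend init failed"))
}
```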
Signed-off-by: Michael A. Kuykendall <michaelallenkuykendall@gmail.com>
- MLX is being replaced with a new library
- Changed to workflow_dispatch (manual trigger only)
- Will re-enable after new library integration is complete
Signed-off-by: Michael A. Kuykendall <michaelallenkuykendall@gmail.com>
- Fixed formatting in src/openai_compat.rs
- Fixed formatting in regression test files
- Ensures CI Code Quality checks pass
Signed-off-by: Michael A. Kuykendall <michaelallenkuykendall@gmail.com>
- 82 regression tests organized into tests/regression/ structure
- Simplified CI/CD integration for regression testing
- Auto-discovery script for individual test execution
- Fixed gitignore to exclude console/ and **/target/ directories
Signed-off-by: Michael A. Kuykendall <michaelallenkuykendall@gmail.com>
- Remove redundant regression-tests job (tests already run in main suite)
- Add explicit regression test step to main test job
- Runs cargo test --test regression (82 tests covering Issues #12-#128)
- Runs cargo test --test regression_tests (14 additional tests)
- Regression tests now part of normal CI flow, not separate script
- Cleaner workflow, same coverage, zero tolerance for regressions
Previously had dedicated job that ran scripts/run-regression-tests-auto.sh.
Now regression tests run alongside unit/integration tests for efficiency.
Signed-off-by: Michael A. Kuykendall <michaelallenkuykendall@gmail.com>
- Organized 20+ regression test files into tests/regression/ directory
- Each test file covers specific GitHub issues to prevent regressions
- Created tests/regression.rs to include all organized test modules (sketched below)
- Added automated test runner: scripts/run-regression-tests-auto.sh
- Updated .github/workflows/ci.yml to run regression tests before main suite
- Fixed .gitignore: **/target/ pattern + console/ directory exclusion
- Moved old scattered test files into organized structure (git detected renames)
- 82 regression tests now auto-discovered and run in CI/CD
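tests/regression.rs is a thin aggregator: each `mod` line resolves to a file under tests/regression/, so `cargo test --test regression` picks them all up. Module names here are illustrative:

```rust
// tests/regression.rs — one test target that pulls in every per-issue module.
mod issue_113_openai_compat;
mod issue_128_backend_singleton;
// ... one `mod` line per issue file in tests/regression/
```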
**Zero Tolerance Policy**: All regression tests MUST pass before PR/release.
Tests cover Issues: #12, #13, #51, #53, #63, #64, #68, #72, #101, #106, #108,
#110, #111, #112, #113, #114, #127, #128 + general packaging/versioning
**Testing:**
- cargo test --test regression --features llama: 82 passed
- CI/CD integration: Runs automatically on every PR
- Release gates: Blocks releases if any regression fails
Signed-off-by: Michael A. Kuykendall <michaelallenkuykendall@gmail.com>
- Add missing optional fields (permission, root, parent) to all Model test initializations
- Fixes compilation errors in test_models_response_structure and test_openai_response_structures
- All 24 openai_compat tests now passing
- Ensures regression test for Open WebUI/AnythingLLM compatibility works correctly
Completes fix for Issue #113
Signed-off-by: Michael A. Kuykendall <michaelallenkuykendall@gmail.com>
- Add optional OpenAI-standard fields to Model struct: permission, root, parent (sketched below)
- Use skip_serializing_if to omit null fields for cleaner JSON responses
- Populate root field with model name for standard OpenAI compatibility
- Update regression tests to verify enhanced Model structure
- Improves compatibility with frontend tools that expect full OpenAI API format
Resolves Issue #113: OpenAI API compatibility for frontends
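A sketch of the enhanced struct's shape (permission's element type is an assumption):

```rust
use serde::Serialize;

#[derive(Serialize)]
struct Model {
    id: String,
    object: String,
    created: u64,
    owned_by: String,
    // Optional OpenAI-standard fields: omitted from the JSON when None,
    // so responses stay clean for clients that don't expect nulls.
    #[serde(skip_serializing_if = "Option::is_none")]
    permission: Option<Vec<serde_json::Value>>, // assumed element type
    #[serde(skip_serializing_if = "Option::is_none")]
    root: Option<String>, // populated with the model name
    #[serde(skip_serializing_if = "Option::is_none")]
    parent: Option<String>,
}
```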
Signed-off-by: Michael A. Kuykendall <michaelallenkuykendall@gmail.com>