OktoBLAS

High-Performance Linear Algebra for AI Workloads

Powering the OktoSeek Ecosystem — Also available for Python developers

About OktoBLAS

OktoBLAS is a proprietary high-performance Basic Linear Algebra Subprograms (BLAS) implementation developed by OktoSeek. It was built to power the OktoEngine and the entire OktoSeek ecosystem, providing the computational foundation for OktoScript training pipelines.

As part of our mission to democratize AI, we've made OktoBLAS available as a standalone Python package, allowing developers worldwide to benefit from our research and optimizations — even outside the OktoSeek ecosystem.

Built from the ground up using Rust and hand-optimized CUDA PTX kernels, OktoBLAS has zero dependency on NVIDIA cuBLAS. This allows us to achieve unprecedented performance for the matrix operations that power modern AI.

Key Features:
  • Core of OktoEngine — Powers the entire OktoSeek training ecosystem
  • Zero cuBLAS Dependency — 100% proprietary implementation
  • Tensor Core Optimized — Full WMMA utilization for FP16
  • Fused Operations — Single-kernel attention computation
  • Available for Python — Shared with the community via pip
  • Energy Efficient — ~12% reduction in power consumption

🚀 Installation

For OktoScript users: OktoBLAS is already integrated into OktoEngine — no additional installation needed!

For Python developers: We've made OktoBLAS available on PyPI:

pip install oktoblas --upgrade

Python Package Requirements:

  • Python 3.9 - 3.13
  • NVIDIA GPU with CUDA support
  • CUDA Toolkit 11.0+
  • NumPy ≥ 1.20

📊 Performance Benchmarks

OktoBLAS outperforms the PyTorch baseline on the core operations benchmarked below. These benchmarks were conducted on an NVIDIA RTX 4070 Laptop GPU with GPU warmup for consistent results.

FP16 GEMM Performance

Matrix Size   OktoBLAS      PyTorch       Result
1024×1024     33.9 TFLOPS   30.0 TFLOPS   +13.1% 🔥
2048×2048     40.6 TFLOPS   33.7 TFLOPS   +20.6% 🔥🔥
4096×4096     42.1 TFLOPS   40.1 TFLOPS   +5.0% ✅

Fused Attention — Up to 3.8x Faster

Configuration   OktoBLAS      PyTorch       Speedup
B4 S256 D64     1.06 TFLOPS   0.28 TFLOPS   3.8x 🔥
B4 S512 D64     1.20 TFLOPS   0.93 TFLOPS   1.3x ✅
B8 S256 D64     1.17 TFLOPS   0.55 TFLOPS   2.1x ✅

📝 Note: Benchmarks performed with GPU warmed up. Results may vary based on hardware, driver version, and workload characteristics.
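For Python users who want to sanity-check these numbers on their own hardware, here is a minimal sketch of a warmed-up FP16 GEMM timing loop using CUDA events, comparing OktoBLAS against plain torch.matmul. It assumes the oktoblas.init() and oktoblas.gemm() APIs described later in this document; exact figures will depend on your GPU and driver.

import torch
import oktoblas

oktoblas.init()  # detect the CUDA device before any OktoBLAS call

N = 2048
a = torch.randn(N, N, dtype=torch.float16, device="cuda")
b = torch.randn(N, N, dtype=torch.float16, device="cuda")

def time_tflops(fn, warmup=10, iters=50):
    # Warm up so first-launch and compilation overhead do not skew the timing
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1000 / iters  # elapsed_time is in ms
    return 2 * N**3 / seconds / 1e12  # 2*N^3 FLOPs per square GEMM

print("OktoBLAS:", time_tflops(lambda: oktoblas.gemm(a, b)), "TFLOPS")
print("PyTorch: ", time_tflops(lambda: torch.matmul(a, b)), "TFLOPS")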

Quick Start (Python Package)

For developers using the Python package outside OktoScript, here's how to get started:

Basic Usage

import oktoblas

# Initialize OktoBLAS
oktoblas.init()

# Check device information
print(oktoblas.device_info())

# Perform optimized GEMM
result = oktoblas.gemm(matrix_a, matrix_b)

# Fused attention for transformers
attn_output = oktoblas.fused_attention(Q, K, V)

With PyTorch Integration

import torch
import oktoblas

# Initialize with CUDA
oktoblas.init()

# Create PyTorch tensors
a = torch.randn(2048, 2048, dtype=torch.float16, device='cuda')
b = torch.randn(2048, 2048, dtype=torch.float16, device='cuda')

# Use OktoBLAS for optimized matrix multiplication
result = oktoblas.gemm(a, b)

# Result is a standard PyTorch tensor
print(result.shape)  # torch.Size([2048, 2048])

Python API Reference

oktoblas.init()

Initialize OktoBLAS and detect CUDA devices. Must be called before using other functions.

oktoblas.init() # Initialize with default settings

oktoblas.device_info()

Returns information about the detected GPU and CUDA capabilities.

info = oktoblas.device_info()
print(info)
# GPU: NVIDIA GeForce RTX 4070
# CUDA: 12.0
# Tensor Cores: Yes

oktoblas.gemm(a, b)

Performs optimized General Matrix Multiplication (C = A × B).

# Supports FP16 and FP32
result = oktoblas.gemm(matrix_a, matrix_b)

# With optional transpose
result = oktoblas.gemm(matrix_a, matrix_b, trans_a=False, trans_b=True)
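As a quick sanity check that the transpose flags behave as documented, the sketch below compares gemm with trans_b=True against the equivalent torch.matmul call. It assumes gemm accepts PyTorch CUDA tensors as shown in the Quick Start above; the tolerances are loose because FP16 accumulation differences are expected.

import torch
import oktoblas

oktoblas.init()

a = torch.randn(1024, 512, dtype=torch.float16, device='cuda')
b = torch.randn(1024, 512, dtype=torch.float16, device='cuda')

# C = A × Bᵀ via the trans_b flag
c = oktoblas.gemm(a, b, trans_b=True)

# Reference result from PyTorch for comparison
ref = torch.matmul(a, b.transpose(0, 1))

# Compare with a loose tolerance to allow for FP16 rounding differences
print(torch.allclose(c, ref, atol=1e-2, rtol=1e-2))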

oktoblas.fused_attention(Q, K, V)

Computes fused attention in a single kernel launch. Significantly faster than separate operations.

# Q, K, V: [batch, heads, seq_len, head_dim]
attn_output = oktoblas.fused_attention(Q, K, V)

# With optional mask
attn_output = oktoblas.fused_attention(Q, K, V, mask=attention_mask)
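For context, the fused kernel replaces the usual three-step attention computation. The sketch below spells out that unfused reference (scaled QKᵀ, softmax, then the value product) in plain PyTorch; it illustrates the math being fused, not OktoBLAS internals, and the shapes are arbitrary examples.

import math
import torch

# Example shapes: [batch, heads, seq_len, head_dim]
Q = torch.randn(4, 8, 256, 64, dtype=torch.float16, device='cuda')
K = torch.randn(4, 8, 256, 64, dtype=torch.float16, device='cuda')
V = torch.randn(4, 8, 256, 64, dtype=torch.float16, device='cuda')

# Unfused reference: three separate operations (two matmuls plus a softmax)
scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(Q.shape[-1])
weights = torch.softmax(scores, dim=-1)
reference = torch.matmul(weights, V)

# oktoblas.fused_attention(Q, K, V) computes the same result in a single kernel launch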

🚀 Maximum Performance Guide

Follow these recommendations to get the best performance from OktoBLAS:

✅ Enable cuDNN Benchmark

Set torch.backends.cudnn.benchmark = True for optimized kernel selection.

✅ Use FP16 and Tensor Cores

FP16 operations leverage Tensor Cores for maximum throughput.

✅ Enable AMP

Use automatic mixed precision for optimal balance of speed and accuracy.
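Taken together, those three tips might look like the sketch below in a PyTorch training loop. The tiny linear model and synthetic batch are placeholders just to keep the example runnable; AMP is used via the standard torch.cuda.amp API.

import torch

# Tip 1: let cuDNN pick the fastest kernels for fixed input shapes
torch.backends.cudnn.benchmark = True

# Stand-in model and optimizer, only to make the sketch self-contained
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters())
scaler = torch.cuda.amp.GradScaler()  # Tip 3: AMP loss scaling

for _ in range(10):
    x = torch.randn(32, 1024, device='cuda')
    target = torch.randn(32, 1024, device='cuda')
    optimizer.zero_grad(set_to_none=True)

    # Tips 2 & 3: autocast runs matmuls in FP16, which maps onto Tensor Cores
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = torch.nn.functional.mse_loss(model(x), target)

    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()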

OktoScript Integration — Native Support

OktoBLAS was built for OktoScript. It's the computational backbone of OktoEngine, powering every matrix operation in your training pipelines. When you run OktoScript configurations, OktoBLAS is automatically used — no configuration needed.

# OktoBLAS powers all OktoScript training
PROJECT "fast-training"

MODEL {
    base: "oktoseek/base-7b"
    device: "cuda"
}

TRAIN {
    epochs: 5
    batch_size: 32
    # OktoBLAS handles all matrix operations
    # automatically with optimized kernels
}

This is where OktoBLAS shines — deep integration with the OktoSeek ecosystem for maximum performance.

Learn more about OktoScript: OktoScript Documentation →

⚡ Energy Savings & Environmental Impact

Faster operations don't just save time — they save energy. OktoBLAS optimizations result in an approximately 12% reduction in energy consumption for typical AI training workloads.

Impact at Scale

For organizations running large-scale training:

  • Lower electricity costs — Significant operational savings
  • Reduced carbon footprint — Supporting sustainable AI development
  • Extended hardware lifespan — Less thermal stress on GPUs
  • More accessible AI — Smaller organizations can train competitive models

Detailed analysis: Enterprise Savings Documentation →

🔬 OktoSeek Research Mission

OktoBLAS is part of OktoSeek's broader research mission: developing mathematical techniques and optimization strategies that make AI training faster and more efficient without compromising quality.

Our Goals:
  • Democratize AI — Make high-performance training accessible to everyone
  • Reduce Energy Consumption — More efficient operations for sustainable AI
  • Accelerate Research — Faster iteration enables more experimentation
  • Push Boundaries — Discover new optimization techniques

By reducing the time and energy required for training, we're making AI more accessible to researchers, startups, and organizations worldwide. Faster training means faster iteration, more experimentation, and ultimately better AI for everyone.

Get Started with OktoBLAS

Using OktoScript? OktoBLAS is already powering your training — no setup needed!

Python developer? Get OktoBLAS from PyPI and experience the performance:

pip install oktoblas --upgrade

OktoBLAS is developed and maintained by OktoSeek AI as part of the OktoSeek ecosystem.
