OktoBLAS

High-Performance Linear Algebra for AI Workloads

Powering the OktoSeek Ecosystem — Also available for Python developers

About OktoBLAS

OktoBLAS is a proprietary high-performance Basic Linear Algebra Subprograms (BLAS) implementation developed by OktoSeek. It was built to power the OktoEngine and the entire OktoSeek ecosystem, providing the computational foundation for OktoScript training pipelines.

As part of our mission to democratize AI, we've made OktoBLAS available as a standalone Python package, allowing developers worldwide to benefit from our research and optimizations — even outside the OktoSeek ecosystem.

Built from the ground up using Rust and hand-optimized CUDA PTX kernels, OktoBLAS has zero dependency on NVIDIA cuBLAS. This allows us to achieve unprecedented performance for the matrix operations that power modern AI.

Key Features:
  • Core of OktoEngine — Powers the entire OktoSeek training ecosystem
  • Zero cuBLAS Dependency — 100% proprietary implementation
  • Tensor Core Optimized — Full WMMA utilization for FP16
  • Fused Operations — Single-kernel attention computation
  • Available for Python — Shared with the community via pip
  • Energy Efficient — ~12% reduction in power consumption

🚀 Installation

For OktoScript users: OktoBLAS is already integrated into OktoEngine — no additional installation needed!

For Python developers: We've made OktoBLAS available on PyPI:

pip install oktoblas --upgrade

Python Package Requirements:

  • Python 3.9 - 3.13
  • NVIDIA GPU with CUDA support
  • CUDA Toolkit 11.0+
  • NumPy ≥ 1.20

📊 Performance Benchmarks

OktoBLAS outperforms the PyTorch baseline on the core operations benchmarked below. These benchmarks were conducted on an NVIDIA RTX 4070 Laptop GPU with GPU warmup for consistent results.

FP16 GEMM Performance

Matrix Size   OktoBLAS      PyTorch       Result
1024×1024     33.9 TFLOPS   30.0 TFLOPS   +13.1% 🔥
2048×2048     40.6 TFLOPS   33.7 TFLOPS   +20.6% 🔥🔥
4096×4096     42.1 TFLOPS   40.1 TFLOPS   +5.0% ✅

Fused Attention — Up to 3.8x Faster

Configuration   OktoBLAS      PyTorch       Speedup
B4 S256 D64     1.06 TFLOPS   0.28 TFLOPS   3.8x 🔥
B4 S512 D64     1.20 TFLOPS   0.93 TFLOPS   1.3x ✅
B8 S256 D64     1.17 TFLOPS   0.55 TFLOPS   2.1x ✅

📝 Note: Benchmarks performed with GPU warmed up. Results may vary based on hardware, driver version, and workload characteristics.
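For Python users who want to sanity-check these numbers on their own hardware, here is a minimal sketch of a warmed-up FP16 GEMM timing loop using CUDA events, comparing OktoBLAS against plain torch.matmul. It assumes the oktoblas.init() and oktoblas.gemm() APIs described later in this document; exact figures will depend on your GPU and driver.

import torch
import oktoblas

oktoblas.init()  # detect the CUDA device before any OktoBLAS call

N = 2048
a = torch.randn(N, N, dtype=torch.float16, device="cuda")
b = torch.randn(N, N, dtype=torch.float16, device="cuda")

def time_tflops(fn, warmup=10, iters=50):
    # Warm up so first-launch and compilation overhead do not skew the timing
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1000 / iters  # elapsed_time is in ms
    return 2 * N**3 / seconds / 1e12  # 2*N^3 FLOPs per square GEMM

print("OktoBLAS:", time_tflops(lambda: oktoblas.gemm(a, b)), "TFLOPS")
print("PyTorch: ", time_tflops(lambda: torch.matmul(a, b)), "TFLOPS")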

Quick Start (Python Package)

For developers using the Python package outside OktoScript, here's how to get started:

Basic Usage

import oktoblas

# Initialize OktoBLAS
oktoblas.init()

# Check device information
print(oktoblas.device_info())

# Perform optimized GEMM
result = oktoblas.gemm(matrix_a, matrix_b)

# Fused attention for transformers
attn_output = oktoblas.fused_attention(Q, K, V)

With PyTorch Integration

import torch
import oktoblas

# Initialize with CUDA
oktoblas.init()

# Create PyTorch tensors
a = torch.randn(2048, 2048, dtype=torch.float16, device='cuda')
b = torch.randn(2048, 2048, dtype=torch.float16, device='cuda')

# Use OktoBLAS for optimized matrix multiplication
result = oktoblas.gemm(a, b)

# Result is a standard PyTorch tensor
print(result.shape)  # torch.Size([2048, 2048])

Python API Reference

oktoblas.init()

Initialize OktoBLAS and detect CUDA devices. Must be called before using other functions.

oktoblas.init() # Initialize with default settings

oktoblas.device_info()

Returns information about the detected GPU and CUDA capabilities.

info = oktoblas.device_info()
print(info)
# GPU: NVIDIA GeForce RTX 4070
# CUDA: 12.0
# Tensor Cores: Yes

oktoblas.gemm(a, b)

Performs optimized General Matrix Multiplication (C = A × B).

# Supports FP16 and FP32
result = oktoblas.gemm(matrix_a, matrix_b)

# With optional transpose
result = oktoblas.gemm(matrix_a, matrix_b, trans_a=False, trans_b=True)
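As a quick sanity check that the transpose flags behave as documented, the sketch below compares gemm with trans_b=True against the equivalent torch.matmul call. It assumes gemm accepts PyTorch CUDA tensors as shown in the Quick Start above; the tolerances are loose because FP16 accumulation differences are expected.

import torch
import oktoblas

oktoblas.init()

a = torch.randn(1024, 512, dtype=torch.float16, device='cuda')
b = torch.randn(1024, 512, dtype=torch.float16, device='cuda')

# C = A × Bᵀ via the trans_b flag
c = oktoblas.gemm(a, b, trans_b=True)

# Reference result from PyTorch for comparison
ref = torch.matmul(a, b.transpose(0, 1))

# Compare with a loose tolerance to allow for FP16 rounding differences
print(torch.allclose(c, ref, atol=1e-2, rtol=1e-2))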

oktoblas.fused_attention(Q, K, V)

Computes fused attention in a single kernel launch. Significantly faster than separate operations.

# Q, K, V: [batch, heads, seq_len, head_dim]
attn_output = oktoblas.fused_attention(Q, K, V)

# With optional mask
attn_output = oktoblas.fused_attention(Q, K, V, mask=attention_mask)
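For context, the fused kernel replaces the usual three-step attention computation. The sketch below spells out that unfused reference (scaled QKᵀ, softmax, then the value product) in plain PyTorch; it illustrates the math being fused, not OktoBLAS internals, and the shapes are arbitrary examples.

import math
import torch

# Example shapes: [batch, heads, seq_len, head_dim]
Q = torch.randn(4, 8, 256, 64, dtype=torch.float16, device='cuda')
K = torch.randn(4, 8, 256, 64, dtype=torch.float16, device='cuda')
V = torch.randn(4, 8, 256, 64, dtype=torch.float16, device='cuda')

# Unfused reference: three separate operations (two matmuls plus a softmax)
scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(Q.shape[-1])
weights = torch.softmax(scores, dim=-1)
reference = torch.matmul(weights, V)

# oktoblas.fused_attention(Q, K, V) computes the same result in a single kernel launch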

🚀 Maximum Performance Guide

Follow these recommendations to get the best performance from OktoBLAS:

✅ Enable cuDNN Benchmark

Set torch.backends.cudnn.benchmark = True for optimized kernel selection.

✅ Use FP16 and Tensor Cores

FP16 operations leverage Tensor Cores for maximum throughput.

✅ Enable AMP

Use automatic mixed precision for optimal balance of speed and accuracy.
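Taken together, those three tips might look like the sketch below in a PyTorch training loop. The tiny linear model and synthetic batch are placeholders just to keep the example runnable; AMP is used via the standard torch.cuda.amp API.

import torch

# Tip 1: let cuDNN pick the fastest kernels for fixed input shapes
torch.backends.cudnn.benchmark = True

# Stand-in model and optimizer, only to make the sketch self-contained
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters())
scaler = torch.cuda.amp.GradScaler()  # Tip 3: AMP loss scaling

for _ in range(10):
    x = torch.randn(32, 1024, device='cuda')
    target = torch.randn(32, 1024, device='cuda')
    optimizer.zero_grad(set_to_none=True)

    # Tips 2 & 3: autocast runs matmuls in FP16, which maps onto Tensor Cores
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = torch.nn.functional.mse_loss(model(x), target)

    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()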

OktoScript Integration — Native Support

OktoBLAS was built for OktoScript. It's the computational backbone of OktoEngine, powering every matrix operation in your training pipelines. When you run OktoScript configurations, OktoBLAS is automatically used — no configuration needed.

# OktoBLAS powers all OktoScript training
PROJECT "fast-training"

MODEL {
    base: "oktoseek/base-7b"
    device: "cuda"
}

TRAIN {
    epochs: 5
    batch_size: 32
    # OktoBLAS handles all matrix operations
    # automatically with optimized kernels
}

This is where OktoBLAS shines — deep integration with the OktoSeek ecosystem for maximum performance.

Learn more about OktoScript: OktoScript Documentation →

⚡ Energy Savings & Environmental Impact

Faster operations don't just save time — they save energy. OktoBLAS optimizations result in an approximately 12% reduction in energy consumption for typical AI training workloads.

Impact at Scale

For organizations running large-scale training:

  • Lower electricity costs — Significant operational savings
  • Reduced carbon footprint — Supporting sustainable AI development
  • Extended hardware lifespan — Less thermal stress on GPUs
  • More accessible AI — Smaller organizations can train competitive models

Detailed analysis: Enterprise Savings Documentation →

🔬 OktoSeek Research Mission

OktoBLAS is part of OktoSeek's broader research mission: developing mathematical techniques and optimization strategies that make AI training faster and more efficient without compromising quality.

Our Goals:
  • Democratize AI — Make high-performance training accessible to everyone
  • Reduce Energy Consumption — More efficient operations for sustainable AI
  • Accelerate Research — Faster iteration enables more experimentation
  • Push Boundaries — Discover new optimization techniques

By reducing the time and energy required for training, we're making AI more accessible to researchers, startups, and organizations worldwide. Faster training means faster iteration, more experimentation, and ultimately better AI for everyone.

Get Started with OktoBLAS

Using OktoScript? OktoBLAS is already powering your training — no setup needed!

Python developer? Get OktoBLAS from PyPI and experience the performance:

pip install oktoblas --upgrade

OktoBLAS is developed and maintained by OktoSeek AI as part of the OktoSeek ecosystem.
