Athena is a multi-modal perception and reasoning engine intended as a foundation for functional machine intelligence. It maps input from any modality (text, audio, video, images) into a unified 512-dimensional feature space, builds a temporal knowledge graph, and answers queries through a two-stage GNN + LLM reasoning pipeline.

Core Principles

Universal Embedding

Every input becomes a 512-dimensional L2-normalized vector in a shared space, enabling cross-modal reasoning without modality-specific interfaces.
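As a minimal sketch of why L2 normalization matters here: once every vector has unit length, a plain dot product is the cosine similarity, so a text vector and an image vector can be compared directly. The random vectors below are hypothetical stand-ins for real projected embeddings, and the function names are illustrative rather than part of Athena.

```python
import math
import random

EMBED_DIM = 512  # shared feature-space size stated in the text


def l2_normalize(vec):
    """Scale a vector to unit L2 norm; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """For unit-length vectors the dot product equals the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical stand-ins for a projected text embedding and image embedding.
rng = random.Random(0)
text_vec = l2_normalize([rng.gauss(0, 1) for _ in range(EMBED_DIM)])
image_vec = l2_normalize([rng.gauss(0, 1) for _ in range(EMBED_DIM)])

similarity = cosine_similarity(text_vec, image_vec)
```

Because both vectors live in the same normalized space, the comparison code never needs to know which modality produced them.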

Structure + Semantics

A GNN handles graph-structured reasoning; an LLM handles semantic interpretation and natural-language output. Two stages, one pipeline.

Iterative Learning

Novel features are flagged automatically when their VAE reconstruction error is high. After human review they are folded back into the training loop, giving continuous open-world learning.
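The flagging step might look roughly like the sketch below. It assumes the VAE has already produced a reconstruction `x_hat` for each input `x`. The threshold rule (mean plus `k` standard deviations over a sliding window of recent errors), the warm-up period, and the `review_queue` list are illustrative assumptions, not Athena's documented behavior.

```python
import math
from collections import deque


class NoveltyDetector:
    """Flags inputs whose reconstruction error exceeds mean + k * std of recent errors."""

    def __init__(self, k=3.0, window=100, warmup=10):
        self.k = k
        self.warmup = warmup                 # errors needed before flagging starts
        self.errors = deque(maxlen=window)   # recent errors of familiar inputs
        self.review_queue = []               # (item_id, error) pairs awaiting human review

    @staticmethod
    def reconstruction_error(x, x_hat):
        """Mean squared error between an input and its reconstruction."""
        return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

    def threshold(self):
        n = len(self.errors)
        if n < self.warmup:
            return math.inf  # never flag until enough statistics exist
        mean = sum(self.errors) / n
        var = sum((e - mean) ** 2 for e in self.errors) / n
        return mean + self.k * math.sqrt(var)

    def observe(self, item_id, x, x_hat):
        err = self.reconstruction_error(x, x_hat)
        is_novel = err > self.threshold()
        if is_novel:
            self.review_queue.append((item_id, err))
        else:
            self.errors.append(err)  # only familiar inputs update the statistics
        return is_novel


detector = NoveltyDetector()
for i in range(20):
    x = [1.0, 0.0, 0.0, 0.0]
    x_hat = [1.0 - 0.01 * (i % 3), 0.0, 0.0, 0.0]  # small, slightly varying error
    detector.observe(f"familiar-{i}", x, x_hat)

# A reconstruction that misses badly stands in for an unfamiliar pattern.
flagged = detector.observe("unfamiliar", [1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0])
```

Keeping flagged items out of the running statistics is a deliberate choice in this sketch: otherwise a burst of novel inputs would raise the threshold and hide later ones.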

Modality-Agnostic

New input types follow the same pipeline: raw → encoder → projector (→512d) → graph → novelty → LLM.
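One way to picture the modality-agnostic front half of that pipeline is a registry in which each modality supplies only an encoder and a projector. In the sketch below, the `Pipeline` class, the toy encoders, and the fixed random linear projector are all hypothetical; the graph is reduced to a list, and the novelty and LLM stages are omitted.

```python
import math
import random

EMBED_DIM = 512  # shared feature-space size stated in the text


def make_projector(in_dim, seed):
    """Hypothetical projector: a fixed random in_dim x 512 linear map plus L2 normalization."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0, 1 / math.sqrt(in_dim)) for _ in range(EMBED_DIM)]
               for _ in range(in_dim)]

    def project(features):
        out = [0.0] * EMBED_DIM
        for f, row in zip(features, weights):
            for j, w in enumerate(row):
                out[j] += f * w
        norm = math.sqrt(sum(v * v for v in out)) or 1.0
        return [v / norm for v in out]

    return project


class Pipeline:
    """raw -> encoder -> projector -> graph; later stages are left out of this sketch."""

    def __init__(self):
        self.modalities = {}
        self.graph = []  # stand-in for the graph: (modality, embedding) pairs

    def register(self, name, encoder, in_dim, seed=0):
        self.modalities[name] = (encoder, make_projector(in_dim, seed))

    def ingest(self, name, raw):
        encoder, projector = self.modalities[name]
        embedding = projector(encoder(raw))
        self.graph.append((name, embedding))
        return embedding


def toy_text_encoder(text):
    """Toy encoder: counts characters in 8 buckets (not a real language model)."""
    feats = [0.0] * 8
    for ch in text:
        feats[ord(ch) % 8] += 1.0
    return feats


pipeline = Pipeline()
pipeline.register("text", toy_text_encoder, in_dim=8, seed=1)
# Adding a new input type needs only an encoder and its native dimension.
pipeline.register("sensor", lambda reading: list(reading), in_dim=3, seed=2)

text_emb = pipeline.ingest("text", "hello")
sensor_emb = pipeline.ingest("sensor", (0.5, 1.5, -2.0))
```

Everything after `ingest` sees only 512-dimensional unit vectors, which is what lets downstream stages stay unchanged when a modality is added.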

On-Device

The entire stack runs on an Apple M1 with 16 GB of RAM, using on-demand component loading. No cloud dependency.
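On-demand loading can be sketched as a small cache that builds a component the first time it is requested and evicts the least recently used one when a limit is reached. The `ComponentLoader` class, the limit of two resident components, and the string placeholders for models are assumptions made for illustration; the text does not describe Athena's actual loading policy.

```python
from collections import OrderedDict


class ComponentLoader:
    """Loads components lazily and keeps at most max_loaded of them resident."""

    def __init__(self, factories, max_loaded=1):
        self.factories = factories    # name -> zero-argument callable that builds the component
        self.max_loaded = max_loaded
        self.loaded = OrderedDict()   # insertion order doubles as recency order

    def get(self, name):
        if name in self.loaded:
            self.loaded.move_to_end(name)  # mark as most recently used
            return self.loaded[name]
        while len(self.loaded) >= self.max_loaded:
            # Drop the least recently used component. Memory is only reclaimed
            # once no other reference to the evicted object remains.
            self.loaded.popitem(last=False)
        component = self.factories[name]()
        self.loaded[name] = component
        return component


load_log = []


def make_factory(name):
    def factory():
        load_log.append(name)      # record each real load
        return f"<{name} model>"   # placeholder for a heavyweight model object
    return factory


loader = ComponentLoader(
    {name: make_factory(name) for name in ("encoder", "gnn", "llm")},
    max_loaded=2,
)
loader.get("encoder")
loader.get("gnn")
loader.get("encoder")  # cache hit, no reload
loader.get("llm")      # evicts "gnn", the least recently used component
```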

Pipeline

Athena captures camera and microphone input on a MacBook Air and streams it to a Mac Mini server over WebSocket. The server encodes text with all-MiniLM-L6-v2, images with CLIP ViT-B/32, and audio with Whisper, then projects each encoder's output into the shared 512d space. Observations are stored in a C++ ObservationGraph with 11 edge types, backed by a SQLite KnowledgeGraph.
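To make the storage layer concrete, here is a rough Python sketch of a SQLite-backed graph with typed edges. The table names, columns, and helper functions are hypothetical and are not taken from Athena's C++ ObservationGraph or its KnowledgeGraph schema. Because the text gives the number of edge types but not their names, edge types are plain integers from 0 to 10.

```python
import sqlite3
from array import array

NUM_EDGE_TYPES = 11  # count stated in the text; the type names are not given

conn = sqlite3.connect(":memory:")
conn.executescript(f"""
CREATE TABLE observations (
    id        INTEGER PRIMARY KEY,
    modality  TEXT NOT NULL,
    timestamp REAL NOT NULL,
    embedding BLOB NOT NULL          -- 512 float32 values, 2048 bytes
);
CREATE TABLE edges (
    src       INTEGER NOT NULL REFERENCES observations(id),
    dst       INTEGER NOT NULL REFERENCES observations(id),
    edge_type INTEGER NOT NULL
              CHECK (edge_type >= 0 AND edge_type < {NUM_EDGE_TYPES}),
    PRIMARY KEY (src, dst, edge_type)
);
""")


def add_observation(modality, embedding, timestamp):
    """Store one observation and return its row id."""
    blob = array("f", embedding).tobytes()  # pack as float32
    cur = conn.execute(
        "INSERT INTO observations (modality, timestamp, embedding) VALUES (?, ?, ?)",
        (modality, timestamp, blob),
    )
    return cur.lastrowid


def add_edge(src, dst, edge_type):
    conn.execute("INSERT INTO edges VALUES (?, ?, ?)", (src, dst, edge_type))


def neighbors(node_id, edge_type):
    """Return ids reachable from node_id over edges of one type."""
    rows = conn.execute(
        "SELECT dst FROM edges WHERE src = ? AND edge_type = ? ORDER BY dst",
        (node_id, edge_type),
    )
    return [row[0] for row in rows]


a = add_observation("text", [0.0] * 512, timestamp=1.0)
b = add_observation("image", [0.0] * 512, timestamp=2.0)
add_edge(a, b, edge_type=0)
linked = neighbors(a, edge_type=0)
```

Storing the timestamp alongside each observation is what would let queries over this table respect temporal order, in keeping with the temporal knowledge graph described earlier.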

A VAE novelty detector flags unfamiliar patterns against an adaptive threshold. A GraphSAGE GNN (2 layers, 11-type edge aggregation) reasons over the graph, conditioned on query attention. A drive-based reward system with REINFORCE learning steers the motivation vector.
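The text does not say how the 11 edge types enter the aggregation, so the sketch below makes an assumption: each layer averages incoming neighbor features separately per edge type, applies a per-type weight matrix, adds a self term, then applies ReLU and L2 normalization as in standard GraphSAGE. Query-attention conditioning is left out, the weights are random rather than trained, and the feature size is 4 instead of 512 to keep the example small.

```python
import math
import random

NUM_EDGE_TYPES = 11  # stated in the text
NUM_LAYERS = 2       # stated in the text


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def random_matrix(rows, cols, rng):
    scale = 1 / math.sqrt(cols)
    return [[rng.gauss(0, scale) for _ in range(cols)] for _ in range(rows)]


def mean_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def sage_layer(features, edges, w_self, w_types):
    """One layer. features: {node: vector}; edges: list of (src, dst, edge_type)."""
    updated = {}
    for node, h in features.items():
        out = matvec(w_self, h)
        for t in range(NUM_EDGE_TYPES):
            incoming = [features[s] for s, d, et in edges if d == node and et == t]
            if not incoming:
                continue  # this edge type contributes nothing for this node
            message = matvec(w_types[t], mean_vectors(incoming))
            out = [a + b for a, b in zip(out, message)]
        activated = [max(0.0, v) for v in out]                      # ReLU
        norm = math.sqrt(sum(v * v for v in activated)) or 1.0
        updated[node] = [v / norm for v in activated]               # L2 normalize
    return updated


rng = random.Random(0)
dim = 4  # toy size for readability
layers = [
    (random_matrix(dim, dim, rng),
     [random_matrix(dim, dim, rng) for _ in range(NUM_EDGE_TYPES)])
    for _ in range(NUM_LAYERS)
]

features = {0: [1.0, 0.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0, 0.0], 2: [0.0, 0.0, 1.0, 0.0]}
edges = [(0, 1, 0), (1, 2, 3), (2, 0, 10)]  # (src, dst, edge_type)

hidden = features
for w_self, w_types in layers:
    hidden = sage_layer(hidden, edges, w_self, w_types)
```

After two layers each node's vector reflects information from nodes up to two hops away, with separate weights deciding how much each edge type contributes.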

Current Status

