AI Could Go Rogue
The potential for AI systems to behave unpredictably or dangerously (“go rogue”) is a critical concern. Designing AI architectures that remain safe and controllable is essential for reliable operation.
See also:
• https://www.lesswrong.com/posts/fAW6RXLKTLHC3WXkS/shallow-review-of-technical-ai-safety-2024
• https://deepmind.google/discover/blog/taking-a-responsible-path-to-agi/
Foundational Capabilities (7)
Develop hardware-level governance mechanisms, including tamper-proof hardware, to enforce safety and compliance constraints on AI systems and keep them within robust operational limits (a minimal attestation-gate sketch appears after this list).
Study the neural basis of human social instincts to inform AI design, so that AI systems can safely interpret and emulate human social behavior.
Use AI to enhance the interpretability of other AI systems, creating tools that automatically explain and verify AI behavior (the shape of this loop is sketched after this list).
Develop and implement AI architectures with separable, auditable world models, in which safety is specified in terms of the model’s state space and proposed AI outputs come with proofs that they do not leave the safe region of that state space (a simplified gate of this kind is sketched after this list).
Develop robust data-integrity strategies, anomaly detection, and defensive training protocols to mitigate situations where indirect data poisoning could produce intentionally misaligned AI systems (not unlike “sleeper agents”); a minimal filtering sketch appears after this list.
Build digital fortresses that enable sensitive data to be processed in a controlled, privacy-preserving environment (a minimal sketch appears after this list).
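
For the hardware-governance capability, the following is a minimal sketch of one software-visible piece of the idea: a gate that refuses to launch an AI workload unless a signed attestation report verifies against an allow-listed firmware build. All names (TRUSTED_KEY, APPROVED_FIRMWARE, launch_workload) are hypothetical, and the HMAC-signed report is a stand-in for a real TPM or TEE quote.

```python
# Minimal sketch (assumed interfaces): a software-side gate that refuses to
# launch an AI workload unless a hardware attestation report verifies.
# In a real system the report would come from a TPM/TEE quote; the "device"
# and signing key here are mocked for illustration.
import hmac
import hashlib
import json

TRUSTED_KEY = b"shared-secret-provisioned-at-manufacture"   # stand-in for a device key
APPROVED_FIRMWARE = {"fw-1.4.2"}                             # allow-listed firmware builds

def verify_attestation(report: dict, signature: bytes) -> bool:
    """Check the report's signature and that its firmware build is allow-listed."""
    payload = json.dumps(report, sort_keys=True).encode()
    expected = hmac.new(TRUSTED_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, signature):
        return False
    return report.get("firmware") in APPROVED_FIRMWARE

def launch_workload(report: dict, signature: bytes) -> str:
    if not verify_attestation(report, signature):
        raise PermissionError("attestation failed: workload blocked")
    return "workload started under verified constraints"

# Example: a correctly signed report from approved firmware is allowed to run.
report = {"device_id": "accel-007", "firmware": "fw-1.4.2"}
sig = hmac.new(TRUSTED_KEY, json.dumps(report, sort_keys=True).encode(), hashlib.sha256).digest()
print(launch_workload(report, sig))
```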
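
For the AI-assisted interpretability capability, one common pattern is to collect a feature's top-activating inputs from a target model, ask a second explainer model to summarize them, and then verify the explanation. The sketch below mocks both models with stubs; get_feature_activations, explain_feature, and verify_explanation are hypothetical names, and the point is the shape of the loop rather than any particular tool.

```python
# Minimal sketch (hypothetical names throughout): the "model explains model"
# loop. A target model's activations are collected, the top-activating inputs
# for one feature are gathered, an explainer summarizes them, and the
# explanation is scored against held-out activations.
import numpy as np

def get_feature_activations(texts, feature_idx):
    # Stand-in for running the target model and reading one feature's activations.
    rng = np.random.default_rng(feature_idx)
    return rng.random(len(texts))

def explain_feature(top_examples):
    # Stand-in for an explainer model; a real system would prompt an LLM here.
    return "feature seems to respond to inputs like: " + "; ".join(top_examples)

def verify_explanation(explanation, texts, activations):
    # Stand-in for verification: score how well the explanation predicts
    # which inputs activate the feature (trivially zero in this mock).
    return 0.0

corpus = ["the cat sat", "stock prices fell", "def foo():", "2 + 2 = 4", "hello world"]
acts = get_feature_activations(corpus, feature_idx=17)
top = [corpus[i] for i in np.argsort(acts)[::-1][:3]]
explanation = explain_feature(top)
print(explanation)
print("verification score:", verify_explanation(explanation, corpus, acts))
```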
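
The auditable world-model capability can be illustrated with a toy gate: safety is a predicate on the model's state space, and a proposed action sequence is accepted only if the predicted trajectory stays inside the safe region. The bounded-horizon trace below stands in for the proof certificate a full system would attach; the dynamics and names are invented for illustration.

```python
# Minimal sketch (simplified stand-in): an output gate that accepts a proposed
# action plan only if the world model predicts the resulting trajectory stays
# inside an explicitly specified safe region of its state space. The returned
# trace is a toy "certificate" that an independent checker re-verifies.
from dataclasses import dataclass

@dataclass
class WorldModel:
    position: float
    def step(self, action: float) -> "WorldModel":
        # Toy dynamics: the action shifts the modeled state.
        return WorldModel(self.position + action)

def is_safe(state: WorldModel) -> bool:
    # Safety specified directly on the model's state space.
    return -1.0 <= state.position <= 1.0

def certify(model: WorldModel, actions: list[float]):
    """Return the predicted trace if every state stays safe, else None."""
    trace, state = [model], model
    for a in actions:
        state = state.step(a)
        if not is_safe(state):
            return None
        trace.append(state)
    return trace

def check_certificate(trace) -> bool:
    # An auditor can re-check the certificate without trusting the proposer.
    return all(is_safe(s) for s in trace)

proposal = [0.3, 0.4, 0.2]          # proposed output: a short action plan
cert = certify(WorldModel(0.0), proposal)
print("accepted" if cert and check_certificate(cert) else "rejected")
```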
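
For the data-poisoning capability, one minimal ingredient is an integrity filter that flags embedding-space outliers before they reach training. The sketch below uses a median/MAD threshold as an assumed, deliberately simple detector; real defenses would combine provenance checks, deduplication, and stronger anomaly detectors, but the gating pattern is the same.

```python
# Minimal sketch (illustrative only): a pre-training data-integrity filter that
# flags anomalous examples in embedding space before they enter the training set.
import numpy as np

def flag_outliers(embeddings: np.ndarray, k: float = 5.0) -> np.ndarray:
    """Return a boolean mask of suspected-poisoned rows using a robust threshold."""
    center = np.median(embeddings, axis=0)
    dists = np.linalg.norm(embeddings - center, axis=1)
    mad = np.median(np.abs(dists - np.median(dists))) + 1e-12
    return dists > np.median(dists) + k * mad

rng = np.random.default_rng(1)
clean = rng.normal(0.0, 1.0, size=(500, 32))
poison = rng.normal(8.0, 0.5, size=(5, 32))   # injected outliers far from the clean cluster
data = np.vstack([clean, poison])

mask = flag_outliers(data)
print(f"flagged {mask.sum()} of {len(data)} examples; kept {len(data) - mask.sum()} for training")
```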
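
Finally, the digital-fortress capability can be read as a controlled environment that never releases raw records, only noised aggregates. The DataFortress class below is a hypothetical sketch loosely in the style of differential privacy, not a hardened design; enclaves, access control, audit logging, and a proper privacy accountant are out of scope.

```python
# Minimal sketch (hypothetical design): a "fortress" object that keeps raw
# records private and only ever releases noised aggregate answers.
import random

class DataFortress:
    def __init__(self, records: list[float], epsilon: float = 1.0):
        self._records = list(records)   # raw data never leaves this object
        self._epsilon = epsilon

    def noisy_mean(self, lower: float, upper: float) -> float:
        """Release a clipped mean with Laplace noise scaled to its sensitivity."""
        clipped = [min(max(r, lower), upper) for r in self._records]
        true_mean = sum(clipped) / len(clipped)
        sensitivity = (upper - lower) / len(clipped)
        # Laplace(0, 1) sample as the difference of two Exp(1) draws.
        noise = random.expovariate(1) - random.expovariate(1)
        return true_mean + noise * sensitivity / self._epsilon

fortress = DataFortress([41.0, 37.5, 52.0, 48.5, 45.0])
print(round(fortress.noisy_mean(lower=0.0, upper=100.0), 2))
```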