- Published on
Learning Latin with E-graphs
E-graphs are a data structure prevalent in SMT solvers, but has recently gained traction as a tool for program optimization using rewrite rules to explore multiple optimization possibilities efficiently. This is mainly due to the popularity of the egg framework in the past couple years. This robust framework has allowed compiler engineers to utilize equality saturation to enable optimizations through supercharged term rewriting. I have been experimenting with egg over a particularly slow winter break, and wanted to apply the underlying concepts to domains other than compilers. I also took to learning Latin through Duolingo over break, and found the importance of word order in this language quite interesting. On face value, it seems quite flexible, but Latin is an inflected language, with a dynamic word order based on the importance of certain parts of speech in a given context. While SOV is the common order, if the object is more relevant to the overall phrase, OSV is a better word order choice. This does not necessarily invalidate the SOV order in this regard, but OSV would certainly be more optimal. This gave birth to the very cursed idea of using e-graphs to "optimize" a given Latin expression. Formally modelling the pragmatics of natural languages is obviously a ridiculously hard problem, but we can use arbitrary cost models to demonstrate the general idea at hand.
Let's start with creating defining a language in egg.
// lib.rs
use egg::{define_language, Id, Symbol};
pub type RecExpr = egg::RecExpr<POS>;
define_language! {
pub enum POS {
"sentence" = List(Box<[Id]>),
"sub" = Subject(Id),
"d_obj" = DirectObj(Id),
"i_obj" = IndirectObj(Id),
"verb" = Verb(Id),
"adj" = Adjective(Id),
"adv" = Adverb(Id),
"emp" = Emphasis(Id),
Word(Symbol),
}
}
Language terminals are the standard parts of speech, with the Symbol is a fallback operator and acts as the word, with the POS operators taking a word as the argument. We also have an emphasis operator that signifies which POS is the most important and thus needs to be prioritized.
// main.rs
use egg::{rewrite as rw, *};
use egg_latin::{RecExpr, POS};
fn main() {
let expr: RecExpr = "(sentence (sub ?s) (d_obj ?do) (verb ?v))"
.parse::<RecExpr>()
.unwrap();
println!("Testing expr: {}", expr);
struct MyCostFn;
impl CostFunction<POS> for MyCostFn {
type Cost = f64;
fn cost<C>(&mut self, enode: &POS, mut costs: C) -> Self::Cost
where
C: FnMut(Id) -> Self::Cost,
{
use POS::*;
let op_cost = match enode {
Emphasis(_) => 1.0,
_ => 100.0,
};
enode.fold(op_cost, |sum, id| sum + costs(id))
}
}
// Some sort of statistical analysis needed to find most important word in sentence based on
// context
let rules: &[Rewrite<POS, ()>] = &[
rw!("sov"; "(sentence (sub ?s) (verb ?v) (d_obj ?do))" => "(sentence (sub ?s) (d_obj ?do) (verb ?v))"),
rw!("osv"; "(sentence (sub ?s) (emp (d_obj ?do)) (verb ?v))" => "(sentence (d_obj ?do) (sub ?s) (verb ?v))"),
rw!("vso"; "(sentence (sub ?s) (d_obj ?do) (emp (verb ?v)))" => "(sentence (verb ?v) (sub ?s) (d_obj ?do))"),
];
let start = "(sentence (sub puella) (d_obj canem) (verb amat))"
.parse()
.unwrap();
let start_2 = "(sentence (sub puella) (emp (d_obj canem)) (verb amat))"
.parse()
.unwrap();
println!("Cost of start: {}", MyCostFn.cost_rec(&start));
println!("Cost of start_2: {}", MyCostFn.cost_rec(&start_2));
let runner = Runner::default().with_expr(&start).run(rules);
let extractor = Extractor::new(&runner.egraph, MyCostFn);
let (best_cost_sentence, best_expr_sentence) = extractor.find_best(runner.roots[0]);
println!("best_cost_sentence: {}", best_cost_sentence);
println!("best_expr_sentence: {}", best_expr_sentence);
let runner_2 = Runner::default().with_expr(&start_2).run(rules);
let extractor_2 = Extractor::new(&runner_2.egraph, MyCostFn);
let (best_cost_sentence_2, best_expr_sentence_2) = extractor_2.find_best(runner_2.roots[0]);
println!("best_cost_sentence_2: {}", best_cost_sentence_2);
println!("best_expr_sentence_2: {}", best_expr_sentence_2);
}