Minimum Message Length (MML)

Minimum message length (MML) [Wal05] inference was devised by Chris Wallace and David Boulton c. 1968 [WB68] and developed further by Wallace and many colleagues. MML is a Bayesian method of inference [Bay63]:

Bayes's Theorem:
  pr(H&D) = pr(H).pr(D|H) = pr(D).pr(H|D)
  pr(H|D) = pr(H&D) / pr(D) ∝ pr(H).pr(D|H)
Shannon:
  msgLen(E) = I(E) = -log2(pr(E)) bits
so
  msgLen(H&D) = msgLen(H) + msgLen(D|H) = msgLen(D) + msgLen(H|D)

for hypothesis H, data D, and event E. MML is a practical realisation [All05, All18] of Ockham's Razor [Spa99]. Some have assumed that MML is the same as maximum a posteriori (MAP) inference but in general it is not. Unlike the later minimum description length (MDL) principle, MML favours explicit priors and fully parameterised models.
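
To see the arithmetic in action, here is a minimal sketch (the priors and likelihoods are made-up illustrative numbers, not from any cited source) that scores two competing hypotheses by their two-part message lengths and picks the shorter:

    import math

    def msg_len(p):
        """Shannon information of an event of probability p, in bits."""
        return -math.log2(p)

    # Two competing hypotheses with illustrative (assumed) priors, and
    # the likelihood each assigns to the observed data D.
    hypotheses = {
        "H1": {"prior": 0.7, "likelihood": 0.01},
        "H2": {"prior": 0.3, "likelihood": 0.05},
    }

    for name, h in hypotheses.items():
        two_part = msg_len(h["prior"]) + msg_len(h["likelihood"])
        print(f"{name}: msgLen(H) + msgLen(D|H) = {two_part:.3f} bits")

    # MML prefers the hypothesis giving the shortest total message.
    best = min(hypotheses,
               key=lambda k: msg_len(hypotheses[k]["prior"])
                             + msg_len(hypotheses[k]["likelihood"]))
    print("shortest message:", best)

Note that minimising msgLen(H) + msgLen(D|H) maximises pr(H).pr(D|H) ∝ pr(H|D), so for purely discrete hypotheses with nothing to state to finite precision the shortest message coincides with the MAP choice; the two diverge once continuous parameters enter, as discussed next.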

Key points are that every continuous (real, floating point) variable has some limited measurement accuracy and that every continuous parameter has some optimal limited precision to which it should be inferred and stated [WF87]. A consequence is that even continuous data and continuous parameters have non-zero probabilities (and hence finite message lengths), not just probability densities, and therefore Bayes's theorem still applies as is. Interestingly, there are many cases where even a discrete parameter must be estimated to less precision than its discreteness would seem to allow.
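As a concrete illustration of optimal precision (an assumed toy setup in the spirit of [WF87], not taken from it): suppose a Normal mean mu is stated to precision delta under a flat prior over a range R, with known sigma and n data points. Stating mu costs about -log(delta/R) nats, while rounding mu to a delta-grid inflates the expected data part by roughly F(mu).delta^2/24, where the Fisher information F is n/sigma^2 here. The sketch below finds the minimising delta numerically and compares it with the analytic answer sqrt(12.sigma^2/n):

    import math

    # Extra message length, in nats, from stating a Normal mean to
    # precision delta and then paying the expected rounding penalty.
    # Assumptions: flat prior 1/R over a range R; expected rounding
    # penalty = F * delta^2 / 24 with F = n / sigma^2.
    n, sigma, R = 100, 1.0, 10.0

    def total_extra_len(delta):
        param_cost = -math.log(delta / R)            # -log(h * delta), h = 1/R
        rounding_cost = n * delta**2 / (24 * sigma**2)
        return param_cost + rounding_cost

    # Scan a log-spaced grid of precisions for the minimiser.
    deltas = [10 ** (k / 100) for k in range(-400, 101)]
    best = min(deltas, key=total_extra_len)
    print(f"numeric optimum delta ~ {best:.4f}")
    print(f"analytic sqrt(12 sigma^2 / n) = {math.sqrt(12 * sigma**2 / n):.4f}")

The minimiser is finite and non-zero: stating mu more precisely than about sigma.sqrt(12/n) wastes message length on digits the data cannot support.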

Some statistical models that have been MML-ed include:
Discrete:
    Binomial, 2-state (see the sketch following this list),
    Multinomial, k-state,
    Integer: Geometric, Poisson, Universal codes,
Continuous:
    Normal (Gaussian),
    Linear regression,
    von Mises-Fisher (vMF) and von Mises distributions,
    Student's t-distribution,
Structured:
    Mixture models (clustering, classification),
    (Hidden) Markov models, PFSA,
    Classification- (decision-) trees and graphs etc.,
    Regression- and Model-trees,
    Sequence segmentation,
    Megalithic stone circles! [PW82],
    Bayesian networks [OAK06],
Supervised learning,
Unsupervised learning,
Trees and Graphs,
amongst others.
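
As promised for the binomial (2-state) case above, here is a small sketch of a Wallace-Freeman-style two-part length for n Bernoulli trials with s successes and a uniform prior on p (constant terms dropped; the grid search is only for transparency). Its minimiser is (s + 1/2)/(n + 1), which differs from the MAP / maximum-likelihood estimate s/n under the same flat prior — a concrete instance of MML not being MAP:

    import math

    # Two-part message length for a binomial model, uniform prior on p.
    # Stating p at its optimal precision contributes (1/2) log F(p),
    # with Fisher information F(p) = n / (p (1 - p)); constants dropped.
    def mml_length(p, s, n):
        first_part = 0.5 * math.log(n / (p * (1 - p)))   # state p
        second_part = -(s * math.log(p)                  # then the data
                        + (n - s) * math.log(1 - p))
        return first_part + second_part

    s, n = 3, 10                          # 3 successes in 10 trials
    grid = [k / 10000 for k in range(1, 10000)]
    p_mml = min(grid, key=lambda p: mml_length(p, s, n))
    print(f"MML estimate ~ {p_mml:.4f}"
          f"   (analytic (s+1/2)/(n+1) = {(s + 0.5) / (n + 1):.4f})")
    print(f"ML/MAP under the same flat prior: {s / n:.4f}")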

MML has theoretical support in the form of Kolmogorov complexity [Kol68].

Strict MML (SMML) is a sort of MML "gold standard". Unfortunately, SMML is computationally intractable for all but the simplest problems but, happily, accurate and feasible approximations [WF87] to SMML exist.
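
To make the intractability claim tangible, here is a toy exact-SMML construction for the binomial problem with a uniform prior (so the marginal r(x) = 1/(n+1)). SMML partitions the data space {0, ..., n} into groups, each with one stated estimate, minimising the expected two-part length. The sketch assumes that an optimal partition uses contiguous groups — plausible in this one-dimensional case — so dynamic programming applies:

    import math
    from functools import lru_cache

    # Toy exact SMML for n Bernoulli trials, uniform prior on p,
    # restricted (by assumption) to contiguous groupings of {0,...,n}.
    n = 20
    r = [1.0 / (n + 1)] * (n + 1)                 # marginal pr of each x
    log_choose = [math.lgamma(n + 1) - math.lgamma(x + 1)
                  - math.lgamma(n - x + 1) for x in range(n + 1)]

    def group_cost(a, b):
        """Expected length contribution of coding x = a..b via one estimate."""
        q = sum(r[a:b + 1])                       # code-word probability
        p_hat = sum(x * r[x] for x in range(a, b + 1)) / (n * q)
        p_hat = min(max(p_hat, 1e-12), 1 - 1e-12) # guard the boundaries
        cost = -q * math.log2(q)                  # name the group...
        for x in range(a, b + 1):                 # ...then encode the data
            log_like = (log_choose[x] + x * math.log(p_hat)
                        + (n - x) * math.log(1 - p_hat))
            cost -= r[x] * log_like / math.log(2)
        return cost

    @lru_cache(None)
    def best(i):
        """Minimum expected message length covering data values 0..i."""
        if i < 0:
            return 0.0
        return min(best(j - 1) + group_cost(j, i) for j in range(i + 1))

    print(f"SMML expected message length: {best(n):.4f} bits")

The dynamic program is cheap here only because this data space is one-dimensional and naturally ordered; for richer data spaces the search over partitions explodes, which is exactly why the [WF87] approximation matters in practice.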


References

[All05] L. Allison, 'Models for Machine Learning and Data Mining in Functional Programming', Journal of Functional Programming, 15(1), pp.15-32, doi:10.1017/S0956796804005301, January 2005.
[All18] L. Allison, 'Coding Ockham's Razor', Springer, doi:10.1007/978-3-319-76433-7, 2018. (Accompanying software source code and documentation are also available.)
[Bay63] T. Bayes, 'An Essay Towards Solving a Problem in the Doctrine of Chances', Philosophical Trans. of the Royal Soc. of London, 53, pp.370-418, 1763. (Reprinted in Biometrika, 45(3/4), pp.293-315, Jstor:2333180, 1958.)
[Kol68] A. N. Kolmogorov, 'Logical Basis for Information Theory and Probability Theory', IEEE Trans. on Information Theory, 14(5), pp.662-664, doi:10.1109/TIT.1968.1054210, 1968.
[OAK06] R. T. O'Donnell, L. Allison and K. B. Korb, 'Learning Hybrid Bayesian Networks by MML', Springer, LNCS/LNAI, 4304, pp.192-203, doi:10.1007/11941439_23, 2006.
[PW82] J. Patrick and C. S. Wallace, 'Stone Circle Geometries: An Information Theory Approach', in Archaeoastronomy in the Old World, Cambridge University Press (CUP), doi:10.1017/CBO9780511898310.016, 1982.
[Spa99] P. V. Spade, 'The Cambridge Companion to Ockham', Cambridge University Press (CUP), doi:10.1017/CCOL052158244X, 1999.
[Wal05] C. S. Wallace, 'Statistical and Inductive Inference by Minimum Message Length', Springer, Series on Information Science and Statistics, doi:10.1007/0-387-27656-4, 2005.
[WB68] C. S. Wallace and D. M. Boulton, 'An Information Measure for Classification', the Computer Journal, 11(2), pp.185-194, doi:10.1093/comjnl/11.2.185, 1968.
[WF87] C. S. Wallace and P. R. Freeman, 'Estimation and Inference by Compact Coding', J. of the Royal Statistical Society, series B, 49(3), pp.240-265, Jstor:2985992, 1987.
