
582650 Information-Theoretic Modeling (5 cr)
Department of Computer Science, University of Helsinki
Fall 2014
An advanced course in the Algorithms and Machine Learning
subprogramme. Students in other programmes, as well as students
in mathematics and statistics, are welcome too.
This course provides an introduction to information theory from the
computer science perspective. Much of the course can be viewed as
consequences of Shannon's central result known as the Noiseless Source
Coding Theorem. The theoretical results will be illustrated by various
practical data compression systems from Huffman coding
to Rissanen's arithmetic coding. In order to demonstrate the wide
applicability of information-theoretic concepts in
Data Science,
we discuss information-theoretic principles in statistical
modeling, and especially the Minimum Description Length (MDL) principle.
 Jorma Rissanen, the
inventor of the MDL principle (left),
receiving a best paper award from Claude Shannon, the father of
information theory.


 Lecturer
 Teemu Roos, A332,
teemu.roos at cs.helsinki.fi
 Teaching assistant
 Jussi Määttä,
jussi.maatta at helsinki.fi
Language: All course material and lectures will be in
English.
Lectures: Period I: Sep 3–Oct 17, Wed & Fri 10–12
in C222.
Exercises: Sep 11–Oct 16, Thu 16–18 in C222.
Instructions for exercises (pdf)
(please read)
Final exam: Oct 22, 9AM
The course is followed by a related project in Period II. It is
recommended that you take both the course and the project.
Link to: project.
Prerequisites:
 Elementary calculus (limits and convergence, convexity)
 Basic probability (random variables, conditional and joint
distributions, expectation)
 Good programming skills (no specific language required)
Material:
 There is no course book. All the material will be contained in the
lecture slides (below).
 A recommended source for complementary material is Cover & Thomas
"Elements of Information Theory". There is an online version which you
can access from the University network: online version
 Additional material categorized per theme is available below.
Contents
The schedule below is tentative and subject to change.
 Lecture 1. (Wed, Sep 3)
 Introduction
(attendance strongly recommended)
• introduction to the course •
overview of contents
slides
exercises #1 
example solutions (pdf) 
solution code (zip) 
link to Szpankowski's slides
• Note: You should go through material under "Examples (with numbers)" starting on slide 105 by yourself.
 Lecture 2. (Fri, Sep 5)
 Mathematical Preliminaries
• calculus • probability •
law of large numbers • Jensen's and
Gibbs' inequalities
slides
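For reference, the two inequalities listed above can be stated compactly (f convex; p and q probability distributions over the same alphabet):

```latex
% Jensen's inequality (f convex):
f\bigl(\mathbb{E}[X]\bigr) \le \mathbb{E}\bigl[f(X)\bigr]
% Gibbs' inequality (equality iff p = q):
\sum_x p(x) \log \frac{p(x)}{q(x)} \;\ge\; 0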
 Lecture 3. (Wed, Sep 10)
 Source Coding: Theory
• entropy and information • asymptotic equipartition
property (AEP) • noiseless source coding theorem
slides
exercises #2 
example solutions (pdf) 
solution code (zip)
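As a concrete illustration of the entropy formula (an illustrative sketch, not part of the course material), computing H(X) for a discrete distribution takes only a few lines of Python:

```python
import math

def entropy(probs):
    """Shannon entropy H(X) = -sum p log2 p, in bits.
    Terms with p = 0 contribute nothing (0 log 0 := 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin carries exactly one bit of uncertainty:
print(entropy([0.5, 0.5]))   # 1.0
# A biased coin carries less:
print(entropy([0.9, 0.1]))   # ~0.469
```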
 Lecture 4. (Fri, Sep 12)
 Channel Coding and Error Correction
• error correcting codes • noisy channel coding theorem
• Hamming codes
slides
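To give a taste of the Hamming-code idea, here is a minimal sketch of the classic Hamming(7,4) code, which corrects any single-bit error; the bit layout and function names are illustrative choices, not the lecture's notation:

```python
def hamming74_encode(d):
    """Encode 4 data bits as 7 bits (Hamming(7,4)).
    Layout: p1 p2 d1 p3 d2 d3 d4, parity bits at positions 1, 2, 4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(c):
    """Locate and flip a single-bit error via the parity-check syndrome."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 * 1 + s2 * 2 + s3 * 4   # syndrome = 1-based error position
    if pos:
        c[pos - 1] ^= 1
    return c

code = hamming74_encode([1, 0, 1, 1])
code[2] ^= 1                          # flip one bit "in transit"
assert hamming74_correct(code) == hamming74_encode([1, 0, 1, 1])
```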
 Lecture 5. (Wed, Sep 17)
 Source Coding: Practice I
• symbol codes
• Kraft inequality • entropy coding
• Shannon(-Fano) codes
slides
exercises #3 
example solutions (pdf) 
solution code (zip)
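The Kraft inequality is easy to check numerically; a small illustrative sketch (the function name is my own):

```python
def kraft_sum(lengths):
    """Kraft sum: a prefix-free binary code with these codeword
    lengths exists iff the sum is at most 1."""
    return sum(2 ** -l for l in lengths)

# Lengths of a valid prefix code, e.g. {0, 10, 110, 111}:
print(kraft_sum([1, 2, 3, 3]))   # 1.0 (a "complete" code)
# These lengths cannot come from a prefix-free code:
print(kraft_sum([1, 1, 2]))      # 1.25 > 1
```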
 Lecture 6. (Fri, Sep 19)
 Source Coding: Practice II
• Huffman code • Shannon-Fano-Elias code
• arithmetic coding
slides
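The Huffman construction (repeatedly merge the two least probable symbols) can be sketched in a few lines; this toy version (not the course's reference code) computes only the codeword lengths, not the bit strings themselves:

```python
import heapq
from itertools import count

def huffman_lengths(freqs):
    """Codeword lengths of an optimal (Huffman) prefix code.
    freqs: dict mapping symbol -> weight."""
    tiebreak = count()  # keeps heap comparisons away from symbols
    heap = [(w, next(tiebreak), {s: 0}) for s, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        # merging two subtrees pushes every symbol one level deeper
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]

# More frequent symbols get shorter codewords:
print(huffman_lengths({'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}))
# {'a': 1, 'b': 2, 'c': 3, 'd': 3}
```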
 Lecture 7. (Wed, Sep 24)
 Universal Source Coding I
• universal models • two-part codes • covered slides 1–14
slides
exercises #4 
example solutions (pdf)
 Lecture 8. (Fri, Sep 26)
 Universal Source Coding II
• mixture codes • normalized maximum likelihood •
covered slides 15–46 (of Lecture 7)
 Lecture 9. (Wed, Oct 1)
 Universal Source Coding III
• prediction game • Bayes vs. worst-case • covered slides 1–14
slides
exercises #5
data200.txt 
example solutions (pdf) 
solution code (zip)
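The mixture idea behind universal prediction can be illustrated with the Krichevsky-Trofimov (KT) estimator for binary sequences; an illustrative sketch (the constant 1/2 is the KT choice of prior):

```python
import math

def kt_code_length(bits):
    """Code length (in bits) assigned to a binary sequence by the
    Krichevsky-Trofimov mixture code: each symbol is predicted with
    probability P(1) = (n1 + 1/2) / (n0 + n1 + 1) given past counts."""
    n0 = n1 = 0
    total = 0.0
    for b in bits:
        p1 = (n1 + 0.5) / (n0 + n1 + 1)
        total += -math.log2(p1 if b else 1.0 - p1)
        if b:
            n1 += 1
        else:
            n0 += 1
    return total

# Whatever the source parameter, the code length stays within about
# (1/2) log2(n) bits of the best Bernoulli model chosen in hindsight:
print(kt_code_length([1] * 100))     # ~4.1 bits (vs. 0 for best model)
print(kt_code_length([0, 1] * 50))   # ~103.6 bits (vs. 100)
```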
 Lecture 10. (Fri, Oct 3)
 MDL Principle: Introduction
• Occam's razor • model selection • old-style & modern MDL • covered slides 15–end (of Lecture 9)
 Lecture 11. (Wed, Oct 8)
 MDL Principle: Applications I
• encoding continuous data • differential entropy
• Gaussian models • covered slides 1–22
slides
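For reference, the differential entropy used here is defined for a density f by an integral, and for the Gaussian model it has a simple closed form:

```latex
h(X) = -\int f(x)\log f(x)\,dx,
\qquad
X \sim \mathcal{N}(\mu,\sigma^2)
\;\Rightarrow\;
h(X) = \tfrac{1}{2}\log\bigl(2\pi e\,\sigma^2\bigr)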
 Lecture 12. (Fri, Oct 10)
 MDL Principle: Applications II
• multinomial models • histogram density estimation
• clustering • Bayesian networks • covered slides 23–end (of Lecture 11)
additional slides
exercises #6
survey.txt
splitcfg.py 
example solutions (pdf) 
solution code (zip)
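The normalized maximum likelihood (NML) distribution can be computed by brute force for a small Bernoulli model; an illustrative sketch (function names are my own; for larger multinomial models one would use a linear-time algorithm instead of this direct sum):

```python
from math import comb, log2

def bernoulli_ml(n, k):
    """Maximum-likelihood probability of a length-n binary sequence
    with k ones, i.e. (k/n)^k ((n-k)/n)^(n-k)."""
    if k in (0, n):
        return 1.0
    return (k / n) ** k * ((n - k) / n) ** (n - k)

def nml_complexity(n):
    """Parametric complexity of the Bernoulli model: the normalizer
    C(n) = sum over k of C(n,k) * bernoulli_ml(n, k)."""
    return sum(comb(n, k) * bernoulli_ml(n, k) for k in range(n + 1))

def nml_code_length(bits):
    """NML code length = -log2 ML(data) + log2 C(n)."""
    n, k = len(bits), sum(bits)
    return -log2(bernoulli_ml(n, k)) + log2(nml_complexity(n))

print(nml_complexity(2))              # 2.5
print(nml_code_length([1, 0, 1, 1]))  # ~4.93 bits
```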
 Lecture 13. (Wed, Oct 15)
 Further Topics
• Kolmogorov complexity
slides
 Lecture 14. (Fri, Oct 17)
 Redundant Lecture
• summary of course contents • questions and answers before
final exam • introduction to project
Additional Material (by Lecture)
Here you can find recommended (but not required) additional material,
categorized according to (our) lectures.
 Lecture 1. Introduction


Claude Shannon - Father of the Information Age
 University of California Television (UCTV), YouTube, 30 min
 Lecture 3. Source Coding: Theory

 Cover & Thomas, Elements of Information Theory, Chapter 2
 (see course folder, C127)
 Claude E. Shannon, "A mathematical theory of communication",
The Bell System Technical Journal, Vol. 27, pp.
379–423, 623–656, July–October 1948.
 ⇒ a reprint (with corrections) from Bell Labs
 Joaquin Candela, Probability, Information Theory and Bayesian Inference,
on videolectures.net
 • part 1: probability
 • parts 2 & 3: entropy and information
 • part 4: Bayesian inference
 Lecture 4. Noisy Channel Coding

 Information and Entropy
 an MIT OpenCourseWare elementary level course
⇒ material on noise and errors (scroll down the page)
 Lectures 5 & 6. Source Coding: Practice

 Witten, Neal & Cleary, "Arithmetic coding for data compression", Communications of
the ACM, Vol. 30, No 6, pp. 520–540, 1987.
 Cover & Thomas, Elements of Information Theory, Chapter 5
 (see course folder, C127)
 Lelewer & Hirschberg, "Data compression",
ACM Computing Surveys,
Vol. 19, No. 3, pp. 261–296, 1987.
 Ken Huffman, "Huffman algorithm",
Huffman Coding Blog, Dec. 24, 2008.

a mostly informal blog entry written by David Huffman's nephew
• relates Huffman code to mobiles (no, not cell phones)
 David MacKay, Dasher:
Information-Efficient Text Entry, Google Tech Talk,
April 19, 2007.
 Dasher is a text-entry system based on arithmetic coding
 David Salomon, Data Compression: The Complete Reference,
Springer, 2004, 926 pages
 the title says it all: a comprehensive reference;
more recent editions available
 Lectures 7 & 8 & 9: Universal Source Coding

 Peter Grünwald, The Minimum Description Length
Principle, MIT Press, 2007, 504 pages
 • part 2: universal coding
 Lectures 10 & 11 & 12: MDL Principle

 Peter Grünwald, The Minimum Description Length
Principle, MIT Press, 2007, 504 pages
 • chapter 1: learning, regularity, and compression
⇒ PDF
available
 Peter Grünwald, MDL Tutorial, on videolectures.net
 Jorma Rissanen, Minimum Description Length Principle,
Scholarpedia, 3(8):6727.
 Kontkanen & Myllymäki, "A lineartime algorithm for computing the multinomial stochastic
complexity", Information Processing Letters, Vol. 103,
No. 6, pp. 227–233, 2007.
 a simple yet very useful result • quite complex proof
 T. Silander, T. Roos, P. Kontkanen, and P. Myllymäki,
(2008). Factorized NML criterion for learning Bayesian network
structures, in Proceedings of the 4th European Workshop on
Probabilistic Graphical Models (PGM2008).
 learns Bayesian network structures from data • uses the
multinomial NML as a building block
 Lecture 13: Kolmogorov Complexity (and Further Topics)

 Li & Vitanyi, Kolmogorov Complexity and Its Applications,
Springer, 1997.
 the book about Kolmogorov complexity
 R. Cilibrasi & P.M.B. Vitanyi, "Clustering by compression",
IEEE Transactions on Information Theory, Vol. 51, pp. 1523–1545,
2005.
 uses compression (gzip) to construct a universal similarity
metric
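The paper's normalized compression distance (NCD) can be approximated with any off-the-shelf compressor; a minimal sketch using Python's zlib (the paper itself experiments with gzip and other compressors):

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance:
    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(s) is the compressed length of s."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

a = b"the quick brown fox jumps over the lazy dog" * 20
b = b"pack my box with five dozen liquor jugs" * 20
# Similar strings compress well together, so their NCD is small:
print(ncd(a, a) < ncd(a, b))   # True
```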
 R. Cilibrasi & P.M.B. Vitanyi, "The Google similarity distance",
IEEE Transactions on Knowledge and Data Engineering, Vol. 19,
pp. 370–383, 2007.
 a follow-up to the previous • uses Google page counts to construct a semantic similarity metric
 John L. Kelly, Jr., "A new interpretation of information rate",
Bell System Technical Journal, Vol. 35, pp. 917–926,
1956.
 we didn't cover this topic in the course
• an argument for using logarithmic loss
functions ⇒
PDF
available
 Video Coding
Fundamentals & H.264, blog
 not covered in the course
•
image coding
• video coding



September 3 First lecture (C222); participation recommended.
September 8, 4.15pm Distinguished lecture by Wojciech Szpankowski.
Link
to (PPT) slides.
September 11 First exercise session (C222).
Note changed room (was D122).
September 12 Note the
new instructions for
exercises (pdf).
October 2 We've set up a Piazza
Q&A area for the course. All students have been sent an email invitation
to join. (If you didn't receive an invitation, send an email to Jussi.)
You can ask any questions related to the lectures, homework problems etc.
Anonymous queries are also allowed.
October 22, 9.00AM Course exam (A111/B123/CK112).
You are allowed to take a single double-sided
A4-sized "cheat sheet" with your own handwritten (not
photocopied from someone else) notes to the exam.
A calculator is allowed but not necessary.
October 29 First "lecture" of the
project (C222).
November 14 Results
of the course. (You need to have a CS department account to
access.)
November 21 Comments on the course exam and its grading
November 28, 4.00PM Separate exam (B123) in case you missed
or failed the first one. If you took part in the course and
completed at least 50% of the exercises, your exercise points
remain valid in case they improve your grade. Cheat sheet allowed (see above).
December 17 Results
of the separate exam. (You need to have a CS department account to access.) Exercise
points were taken into account in cases where they resulted in a better grade.

