Theory

Spring 2017

due Thu Jan 26

- Our first goal is to understand the concept of information entropy.
- Reading :
- Skim chapters 1 and 2 in Biggs, then study the entropy section in chapter 2. (We'll come back to the Huffman codes in chapter 2 next week.)
- At least browse Shannon's explanation of the same material in his classic paper which created the field. Entropy is defined in part I, particularly chapter 6.

- Goals :
- These are both math-ish texts, so "reading" them means studying the definitions and working through the formulas with enough examples to understand the ideas and to be able to code examples. In this course we will not be focusing on proofs but instead definitions and applications.
- There are a number of math concepts you'll need to do this material. If you haven't seen them before, now is the time to jump in, google the terms and/or find a math textbook, and get up to speed. If you have questions, ask - I'm open to going over any of this in class as needed. Concepts to get a handle on include :
- probability
- codes, alphabets, messages, bases (i.e. base 10, base 2), logarithms.
- markov chains
- prefix free codes
- information entropy

- Coding :
- As a concrete example we're going to analyze two files of 1's and 0's : stream1.txt and stream2.txt .
- For now, write some code to (a) read the files and store the data as a list of 1's and 0's, and (b) calculate the probabilities and conditional probabilities.

due Thu Feb 2

- Finish reading through chapter 3. Start in on chapter 4. (We'll do LZW and arithmetic soon.)
- Calculate the information entropy of the two files from the first assignment using several probability models, and discuss. What does that say about how much the files can be compressed?
- Using whatever seems appropriate to you for a symbol, construct Huffman codes for stream1.txt and stream2.txt. What compression ratio can be achieved with these codes. Discuss. (You do not need to consider the encoded symbol table itself in your compression, just the average bit length per symbol. Nor do you need to actually encode the file.)
- On page 33, do exercises 3.6, 3.7, and 3.8.
- Do exercise 3.13 on page 40, using Huffman coding, and discuss how close to the theoretical limit this code comes and why.
- Coming soon: with a big ascii text file, a) apply Huffman coding and from that say how much compression it gives, b) calculate the entropy, and c) compare the two and discuss. This is *not* due this week, but you may want to start getting your tools ready ...

due Thu Feb 9

- Moby dick - entropy and compression
- Here is the text of Moby Dick (from Project Gutenberg.)
- Compress this file using a command line tool like gzip or bzip2. By what factor is the compressed file smaller?
- Estimate the entropy using any of the probability models we looked at last week. If consecutive bytes are (a,b,c) then some of the models would be P(a), P(a|b), P(ab), and so on.
- Compare the entropy in appropriate units with the compression from the command tools, and discuss.

- lossless compression - continued
- Using our text and any other sources you can find, read about three more lossless compression algorithms in addition to Huffman : LZW, Arithmetic, and Burrows-Wheeler compression.
- Do exercise 4.3 on pg 49, a small Huffman Code example.
- Do exercise 4.16 on pg 65, an example of an arithmetic code.

due Thu Feb 16

- Do the dot product as discussed on Tue Feb 14.

due Thu Feb 23

- Do at least two of the following three :
- Play around with a toy version of something similar to jpeg compression :
- (I will discuss this one in class Tuesday to set things up.) Using the four \( \vec{b} \) vectors from the last homework, find a 16 "basis images" for 4x4 images, by forming an "outer product" of the b's, (b_01, b_02, b_02, ... b_23, b_33). (The DCT uses images similar these, but 8x8.) If you put the b's into a column and row matrix and multiply them, (4x1)*(1x4), this "outer" product is thir 4x4 matrix product. (The "inner" product is another name for the dot product, which is the (1x4)*(4x1) matrix product.) Transform a sample 4x4 image made up of 16 numbers [[0,0,0,0], [0,100,100,0], [0,100,100,0], [0,0,0,0]] into 16 coefficients in this basis. This is just like Tuesday's assignment but with a bigger basis. Then set the highest frequency (3,3) basis coefficient to zero (that's the lossy compression), and transform back. You should get something like the original but "blurrier".

- Discuss the bandwidth (bits per second) needed for something like human vision and human hearing. Compare with available bandwidths for ethernet and broadband. Look up the typical compression factors for mp3 audio and video, and again compare.
- Read about and summarize the ideas, algorithm, and compression ratios for a commonly used audio and video codecs and file container format, such as H.246 video and MP3 audio (i.e. MPEG-2 Audio Layer III).

due Thu Mar 2

- Read up on noise, as given in the references in my Feb 23 notes.
- Calculate all the interesting quantities including mutual information and channel capacity for at least two models:
- the symmetric binary channel, with X=(0,1), Y=(0,1), P(1|1)=P(0|0)=1-f, P(1|0)=P(0|1)=f where f is a fractional error rate like 0.01 or 0.1. (This is one of the sample problems in MacCay).
- the 3 symbol X=(1,2,3) Y=(a,b,c) example we started in class.

- Do problem 5.17 pg 87 of Biggs: A simplified keypad has four keys arranged in two rows of two). If the intention is to press key x, there is probability α of pressing the other key in the same row and probability α of pressing the other key in the same column (and consequently probability 1 − 2α of pressing x). Write down the channel matrix for this situation and find its capacity (i.e. the maximum entropy, or information per symbol, that can be sent with this keyboard).
- (You can do these calculations by hand, or write some code to do the math. Either way, make sure you understand what's going on.)

due Thu Mar 9

- Read chapters 6 and 7 in Biggs
- Do 6.6, i.e. prove that that hamming distance d(x,y) satisfies d(x,x)=0, d(x,y)=d(y,x), and d(x,z) <= d(x,y) + d(y,z).
- Do 6.15, 6.16, 6.17
- Do 6.12, 6.13, 6.14 on pg 102, on an 1-error correcting code with 5 elements from 6-bit words. (We may have started these in class.)
- Do 7.8, 7.11.

due Tue Apr 4

- Find matrices E and H for a one error correcting code with five data bits (i.e. k=5).
- What is the smallest value of n that might work?

- Consider the data word x = [1 0 1 0 1].
- Find the codeword y that corresponds to the data word x = [1 0 1 0 1] .
- Confirm that H * y = 0.
- Change one bit of that codeword, i.e. insert an error. What is H * y now?
- Change a different bit that codeword, i.e. insert an error somewhere else. What is H * y now?

- Do those same things for the data word x = [1 1 0 0 0].
- Show by brute force that all codewords for this code are at least hamming distance 3 apart.
- Are there any strings of n bits that are more than 1 hamming distance away from all the 2**k codewords without errors? If so, how many of those? Find one.

due Tue Apr 11

- Give examples that use the [8,4] extended Hamming code to
- encode a short message
- decode the coded message
- detect an 1-bit error in that message
- correct that 1-bit error

- Describe a Hamming code with a larger block size than the [7,4] code described in http://en.wikipedia.org/wiki/Hamming_code , and construct its check and generation matrices.
- How many errors does it correct?
- What is its rate? (bits of info per bit of data)
- Is there an extended version of this like the [8,4] code?

- Start thinking about the term project that we talked about Thursday.

due Tue Apr 18

- Using any of the CRC codes in the various articles that we discussed Thursday, work though an example of (a) starting with a data word, (b) calculating the checksum through (bitwise) long division, and (c) confirming that the dataword with checksum gives a "no error" result. You can either invent your own example (preferred) or study one of theirs. And you can do this with code or by hand.
- Make some choices about your term project. Pick and discuss your (a) which algorithms you will implement, and (b) what sizes of data bits, check bits, how many errors to correct, and error rate you will use.

due Fri May 5

- For an end-of-semester project, implement a quasi-real data transmission consisting of (compression, error encoding, error simulation,corrected, uncompressed) pipeline. More specifically,

```
I) start an ascii text file original.txt
II) compress it to compressed.xxx |
| prepare
III) error-correct encode to encoded.yyy |
IV) simulate transmission errors transmitted.yyy
by flipping random bits with
probability e. (For a really
good time, you could put in
burst errors or malicious errors.)
V) correct errors decoded.xxx |
| restore
VI) uncompress final.txt |
```

- All the details would be up to you :
- choice of which compression algorithm
- choice of which error-correction algorithm
- how realistic transmitted file is (i.e. ascii 1's and 0's or bits)

- You should submit the code, an example of what it does (input, output, intermdiate steps), and a paper discussing what's going on including the interesting parameters along the way (information entropy, data rate, etc.)

- Come to class Tues May 2 to show your in-progress work to the class.

- a place for the course grade

http://cs.marlboro.edu/ courses/ spring2017/info/ special/assignments

last modified Tuesday May 9 2017 4:32 pm EDT

last modified Tuesday May 9 2017 4:32 pm EDT