Some Unicode examples

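Below you'll see three glyphs that look the same but are encoded with four different code points: U+212B (Angstrom sign), U+00C5 (A with ring above), and U+0041 followed by U+030A (A plus combining ring above).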
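
To check that the look-alike glyphs really are different byte sequences, a minimal sketch along these lines prints the UTF-8 bytes of each encoding. It assumes a POSIX shell with od available; hex is a hypothetical helper name, and the characters are written as octal escapes so the example does not depend on your terminal's encoding.

```shell
# Hypothetical helper: print stdin as one continuous string of hex bytes.
hex() { od -An -tx1 | tr -d ' \n'; }

angstrom_sign=$(printf '\342\204\253' | hex)   # U+212B
a_with_ring=$(printf '\303\205' | hex)         # U+00C5 (precomposed)
a_plus_ring=$(printf 'A\314\212' | hex)        # U+0041 followed by U+030A

echo "U+212B        -> $angstrom_sign"
echo "U+00C5        -> $a_with_ring"
echo "U+0041 U+030A -> $a_plus_ring"
```

The three results are e284ab, c385 and 41cc8a: the same glyph on screen, three different byte strings on disk.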

Doubly-encoded text, or UTF-8 text interpreted as ISO 8859-n, often looks like this:

Incorrectly converted text sometimes uses the replacement character U+FFFD. It's still valid UTF-8!

Here's some text in Tagalog script; chances are you don't have this font. But it's also perfectly good UTF-8 text!

Here's some Laki text that uses U+200C, ZWNJ:

And the same text with the ZWNJs stripped out:
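
Stripping ZWNJ can be reproduced on the command line. The sketch below is one possible way, assuming a POSIX shell and sed; the string "ab", ZWNJ, "cd" is a hypothetical stand-in for real text, and the ZWNJ is built from its three UTF-8 bytes (E2 80 8C) so that it stays visible in the source.

```shell
zwnj=$(printf '\342\200\214')              # UTF-8 bytes of U+200C
with_zwnj=$(printf 'ab\342\200\214cd')     # hypothetical stand-in text
stripped=$(printf '%s\n' "$with_zwnj" | sed "s/$zwnj//g")

# The two strings look alike on screen but differ in length on disk.
bytes_before=$(printf '%s' "$with_zwnj" | wc -c | tr -d ' ')
bytes_after=$(printf '%s' "$stripped" | wc -c | tr -d ' ')
echo "before: $bytes_before bytes, after: $bytes_after bytes"
```

Here the byte count drops from 7 to 4, since the invisible ZWNJ takes three bytes in UTF-8.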

Know your bytes!

Inspired by Ken Church's great Unix for Poets tutorial (and The Clash). INNET students should already have the file randomsample.txt; everyone else can get it here (right-click or Ctrl-click the link and save).

List the files in the current directory.

$ ls

Where am I?

$ pwd

Return to your home directory.

$ cd

Dump the sample file to the screen.

$ cat randomsample.txt

To page through the output (with any of these commands), pipe it into "less" (space bar to page down, 'q' to quit less):

$ cat randomsample.txt | less

Count the number of lines in the file.

$ cat randomsample.txt | wc -l

Show only lines that start with "pt-".

$ cat randomsample.txt | egrep '^pt-'

Show lines that contain the Cyrillic letter 'а' (U+0430, not the Latin 'a').

$ cat randomsample.txt | egrep 'а'

Dump the sample file, but remove the language codes.

$ cat randomsample.txt | sed 's/^[A-Za-z-]*.//'

Save this as a separate file.

$ cat randomsample.txt | sed 's/^[A-Za-z-]*.//' > justtext.txt
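
To see what the sed expression does, here is a small self-contained sketch. It assumes each line consists of a language code, one separator character (a tab here), and then the text; the three sample lines and the temporary directory are hypothetical and only stand in for randomsample.txt.

```shell
tmp=$(mktemp -d)
# Hypothetical three-line sample: code, one tab, then the text.
printf 'pt-BR\tBom dia\nga\tDia dhuit\nen\tGood morning\n' > "$tmp/sample.txt"

# [A-Za-z-]* consumes the language code; the following . consumes the separator.
cat "$tmp/sample.txt" | sed 's/^[A-Za-z-]*.//' > "$tmp/justtext.txt"
cat "$tmp/justtext.txt"
```

Only the text after the separator is left on each line. Note that the pattern removes exactly one character after the code, so a file with a wider separator would need a different expression.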

Show lines that contain no ASCII letters (A-Z, a-z) once the language code is removed.

$ cat justtext.txt | egrep -v '[A-Za-z]'

Show lines containing four consecutive vowels a,e,i,o,u.

$ cat justtext.txt | egrep '[aeiou]{4}'

Show lines containing the Unicode right single quotation mark (U+2019, commonly used as an apostrophe), but replace it with a saltillo (U+A78C).

$ cat justtext.txt | egrep '’' | sed 's/’/ꞌ/g'

Show all "words" (contiguous sequences of letter characters), one per line. May not work on a Mac, or with non-GNU grep.

$ cat justtext.txt | egrep -o '[[:alpha:]]+'

Save the words in a separate file.

$ cat justtext.txt | egrep -o '[[:alpha:]]+' > words.txt

Count the number of words.

$ cat words.txt | wc -l

Show the 14-"letter" words (words that are exactly 14 characters long).

$ cat words.txt | egrep '^.{14}$'

Show all words in sorted order.

$ cat words.txt | sort

Sort them by endings (for investigating inflectional morphology, say).

$ cat words.txt | rev | sort | rev
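
The rev, sort, rev trick is easier to follow on a tiny input. A minimal sketch, assuming rev is installed (it ships with util-linux on most Linux systems) and using four hypothetical English words:

```shell
# Reverse each word, sort the reversed forms, then reverse back.
sorted_by_ending=$(printf '%s\n' walking talked singing walked | rev | sort | rev)
echo "$sorted_by_ending"
```

The words ending in "-ed" end up next to each other, followed by those ending in "-ing", because sorting happens on the reversed spellings.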

Create a word frequency list.

$ cat words.txt | sort | uniq -c | sort -r -n
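
Each stage of the frequency pipeline has a job: the first sort brings identical words together (uniq only merges adjacent lines), uniq -c counts each run, and sort -r -n puts the highest counts first. A small sketch with six hypothetical words; the awk step is an addition that only strips the padding uniq puts in front of the counts.

```shell
freq=$(printf '%s\n' the cat the dog the cat |
  sort | uniq -c | sort -r -n |
  awk '{ print $1, $2 }')     # normalise uniq's leading spaces
echo "$freq"
```

This lists "the" with a count of 3, then "cat" with 2, then "dog" with 1.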

Show all word bigrams.

$ cat words.txt | sed '1d' | paste words.txt -

Word bigram frequency list.
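
One way to build this list is sketched below, on the assumption that it simply combines the bigram pipeline with the sort, uniq -c, sort -r -n recipe from the word frequency list. The eight demo words and the temporary directory are hypothetical; with the real data you would use words.txt directly.

```shell
tmp=$(mktemp -d)
printf '%s\n' the cat sat the cat ran the cat > "$tmp/words.txt"

# Pair each word with its successor, then count identical pairs.
bigram_freq=$(cat "$tmp/words.txt" | sed '1d' | paste "$tmp/words.txt" - |
  sort | uniq -c | sort -r -n)
echo "$bigram_freq"

# Most frequent bigram, with whitespace normalised for easy comparison.
top=$(echo "$bigram_freq" | head -n 1 | awk '{ print $1, $2, $3 }')
echo "$top"
```

In this demo the top entry is "the cat" with a count of 3. The last word of the file has no successor, so it appears once paired with an empty second column.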

Find palindromes of length at least 4.

$ cat words.txt | rev | paste words.txt - | egrep '^(.{4,}).\1$'
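
The palindrome pipeline puts each word next to its reversal, separated by a tab, and keeps the lines where both halves are identical. A sketch on five hypothetical words follows; it uses grep -E, which is equivalent to egrep, and relies on back-references in extended regular expressions, which GNU grep supports but which are not guaranteed everywhere.

```shell
tmp=$(mktemp -d)
printf '%s\n' level noon word abba aba > "$tmp/words.txt"

# (.{4,}) captures the word, . matches the tab, \1 requires the same text again.
palindromes=$(cat "$tmp/words.txt" | rev | paste "$tmp/words.txt" - |
  grep -E '^(.{4,}).\1$' | cut -f 1)
echo "$palindromes"
```

"word" is dropped because its reversal differs, and "aba" is dropped because it is shorter than four characters.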

Keep only the Irish sample sentence and put it in a file.

$ cat randomsample.txt | egrep '^ga[^a-z-]' | sed 's/^ga.//' > irish.txt

Examine the bytes in the file.

$ cat irish.txt | xxd

Convert to ISO-8859-1.

$ cat irish.txt | iconv -f UTF-8 -t ISO-8859-1

Convert to ISO-8859-1 and look at the bytes.

$ cat irish.txt | iconv -f UTF-8 -t ISO-8859-1 | xxd

Show all encodings supported by iconv.

$ iconv -l

Convert to UTF-16 (big-endian) and examine the bytes.

$ cat irish.txt | iconv -f UTF-8 -t UTF-16BE | xxd
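
To compare the three encodings side by side without the sample file, a sketch like the following may help. It assumes iconv and od are installed; hex is a hypothetical helper name, and the word "fáilte" (written with octal escapes for the á) is only a stand-in for the contents of irish.txt.

```shell
# Hypothetical helper: print stdin as one continuous string of hex bytes.
hex() { od -An -tx1 | tr -d ' \n'; }

word=$(printf 'f\303\241ilte')     # "fáilte" in UTF-8

utf8_hex=$(printf '%s' "$word" | hex)
latin1_hex=$(printf '%s' "$word" | iconv -f UTF-8 -t ISO-8859-1 | hex)
utf16be_hex=$(printf '%s' "$word" | iconv -f UTF-8 -t UTF-16BE | hex)

echo "UTF-8:      $utf8_hex"
echo "ISO-8859-1: $latin1_hex"
echo "UTF-16BE:   $utf16be_hex"
```

The á takes two bytes in UTF-8 (c3 a1), one byte in ISO-8859-1 (e1), and two bytes in UTF-16BE (00 e1), where every other letter of this word also takes two bytes.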

Incorrectly interpret our nice UTF-8 Irish as ISO-8859-1 and convert to UTF-8.

$ cat irish.txt | iconv -f ISO-8859-1 -t UTF-8
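
This is how doubly-encoded text comes about: every byte of the UTF-8 input is treated as a separate ISO-8859-1 character and encoded again. The sketch below shows the effect at the byte level, again assuming iconv and od are available and using the hypothetical stand-in word "fáilte" and helper name hex.

```shell
# Hypothetical helper: print stdin as one continuous string of hex bytes.
hex() { od -An -tx1 | tr -d ' \n'; }

# Misread UTF-8 as ISO-8859-1: the two bytes of á become two characters.
mojibake_hex=$(printf 'f\303\241ilte' | iconv -f ISO-8859-1 -t UTF-8 | hex)

# Undo the damage by converting back the other way.
roundtrip_hex=$(printf 'f\303\241ilte' | iconv -f ISO-8859-1 -t UTF-8 |
  iconv -f UTF-8 -t ISO-8859-1 | hex)

echo "doubly encoded: $mojibake_hex"
echo "repaired:       $roundtrip_hex"
```

The bytes c3 a1 turn into c3 83 c2 a1, which displays as "Ã¡". Because ISO-8859-1 assigns a character to every byte value, the second conversion recovers the original bytes exactly.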

Convert the Bribri sentence to UTF-16 in the same way and examine the bytes (it uses U+A78C saltillo and U+0331 combining macron below).

$ cat randomsample.txt | egrep '^bzd' | sed 's/^bzd.//' | iconv -f UTF-8 -t UTF-16BE | xxd

Try converting Bribri to ISO-8859-1 (or anything other than a Unicode encoding!)

$ cat randomsample.txt | egrep '^bzd' | iconv -f UTF-8 -t ISO-8859-1
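
The conversion cannot succeed because ISO-8859-1 has no saltillo. A minimal sketch of what to look for, assuming iconv and od are installed: GNU iconv is expected to stop with an error and a non-zero exit status, while other implementations may behave differently (for example by substituting a placeholder). The helper name hex and the temporary file are hypothetical.

```shell
# Hypothetical helper: print stdin as one continuous string of hex bytes.
hex() { od -An -tx1 | tr -d ' \n'; }

tmp=$(mktemp -d)
status=0
# U+A78C (saltillo) as UTF-8 bytes; ISO-8859-1 has no such character.
printf '\352\236\214\n' | iconv -f UTF-8 -t ISO-8859-1 > "$tmp/out.txt" 2>/dev/null || status=$?

out_hex=$(hex < "$tmp/out.txt")
echo "iconv exit status: $status"
echo "bytes written: ${out_hex:-none}"
```

Whatever the implementation does, the saltillo's bytes (ea 9e 8c) cannot survive in the output, so checking the exit status before trusting converted text is a good habit.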