Some Unicode examples
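- Below you'll see three identical-looking glyphs, encoded with four code points: U+212B (the angstrom sign), U+00C5 (precomposed A with ring above), and U+0041 followed by U+030A (A plus a combining ring above).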
ÅÅÅ
- Doubly-encoded text, or UTF-8 text interpreted as ISO 8859-n, often looks like this:
Ins an am chéadna, áfach, bhà sé barúlach
- Incorrectly converted text sometimes uses the replacement character U+FFFD. It's still valid UTF-8!
Ins an am ch�adna, �fach, bh� s� bar�lach
- Here's some text in Tagalog script; chances are you don't have this font. But it's also perfectly good UTF-8 text!
ᜀᜅ᜔ ᜈᜄ᜔ ᜊᜌ᜔ᜊᜌ᜔ ᜈᜒᜆᜓ ᜀᜌ᜔ ᜇᜒ ᜈ᜔ᜌ ᜀᜎ
- Here's some Laki text that uses U+200C, ZWNJ:
بويه دلم بزيوه نهوروز خهندهي له ليوه خوش لهو باخو كيوه خوشي له من و ئيوه جهژني نهوروزمان پيروز ههر خوش بن ولات و هوز نهوروز جهژني گشتمانه جهژني پشتاوپشتمانه لهو روژهدا بوو كاوه رولهي چيني چهوساوه چهكوشي تولهي وهشاند ميشكي زوحاكي پژاند كوتايي هات شهوي تار ههلهات خوري پرشنگدار لهم روژهدا گشت ساليك گشت دليك گشت ماليك يادي ئهو روژه ئهكات كه ديليهتي دوايي هات ...
- And the same text with the ZWNJs stripped out:
بويه دلم بزيوه نهوروز خهندهي له ليوه خوش لهو باخو كيوه خوشي له من و ئيوه جهژني نهوروزمان پيروز ههر خوش بن ولات و هوز نهوروز جهژني گشتمانه جهژني پشتاوپشتمانه لهو روژهدا بوو كاوه رولهي چيني چهوساوه چهكوشي تولهي وهشاند ميشكي زوحاكي پژاند كوتايي هات شهوي تار ههلهات خوري پرشنگدار لهم روژهدا گشت ساليك گشت دليك گشت ماليك يادي ئهو روژه ئهكات كه ديليهتي دوايي هات ...
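The examples above all make the same point: text that looks alike on screen can differ in its bytes. A minimal shell sketch of that idea follows. The helper name hexbytes is made up for illustration, od is used as a POSIX stand-in for a hex dumper, and the short "abcd" string is an invented stand-in for the Laki text.

```shell
# Hypothetical helper (not part of any tool): print the bytes of its
# argument as lowercase hex with no separators.
hexbytes() { printf '%s' "$1" | od -An -tx1 | tr -d '[:space:]'; }

# Three encodings of what renders as the same glyph, written as octal
# escapes of their UTF-8 bytes so the script itself stays pure ASCII.
angstrom=$(printf '\342\204\253')    # U+212B ANGSTROM SIGN
precomposed=$(printf '\303\205')     # U+00C5 A WITH RING ABOVE
decomposed=$(printf 'A\314\212')     # U+0041 + U+030A COMBINING RING ABOVE

hexbytes "$angstrom";    echo   # e284ab
hexbytes "$precomposed"; echo   # c385
hexbytes "$decomposed";  echo   # 41cc8a

# ZWNJ (U+200C) is invisible but takes three bytes in UTF-8.
zwnj=$(printf '\342\200\214')
with_zwnj=$(printf 'ab%scd' "$zwnj")
without_zwnj=$(printf '%s' "$with_zwnj" | sed "s/$zwnj//g")

hexbytes "$with_zwnj";    echo  # 6162e2808c6364
hexbytes "$without_zwnj"; echo  # 61626364
```

A plain string comparison treats all three glyphs as different, and treats the text with and without ZWNJ as different, even though a reader may not see any change.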
Know your bytes!
Inspired by Ken Church's great Unix for Poets tutorial (and The Clash). INNET students should already have the file randomsample.txt; everyone else can get it here (right-click or Ctrl-click the link and save).
- List the files in the current directory.
$ ls
- Where am I?
$ pwd
- Return to your home directory.
$ cd
- Dump the sample file to the screen.
$ cat randomsample.txt
- To slow down the output (with any of these commands), pipe it into "less" (space bar to page down, 'q' to quit less):
$ cat randomsample.txt | less
- Count the number of lines in the file.
$ cat randomsample.txt | wc -l
- Show only lines that start with "pt-".
$ cat randomsample.txt | egrep '^pt-'
- Show lines that contain Cyrillic 'а'
$ cat randomsample.txt | egrep 'а'
- Dump the sample file, but remove the language codes.
$ cat randomsample.txt | sed 's/^[A-Za-z-]*.//'
- Save this as a separate file.
$ cat randomsample.txt | sed 's/^[A-Za-z-]*.//' > justtext.txt
- Show lines containing no ASCII letters (A-Za-z) after the language code.
$ cat justtext.txt | egrep -v '[A-Za-z]'
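For readers without randomsample.txt, here is a small self-contained sketch of the filtering steps so far. The three sample lines, the file name sample.txt, and the tab after each language code are assumptions made for illustration; the real file may be laid out differently. The sketch uses grep -E, which is what egrep runs.

```shell
tmp=$(mktemp -d)
# Made-up stand-in for randomsample.txt: language code, tab, sentence.
printf 'pt-BR\tBom dia\nga\tDia duit\nru\tпривет\n' > "$tmp/sample.txt"

# Lines whose language code starts with "pt-".
pt_lines=$(grep -E '^pt-' "$tmp/sample.txt")

# Strip the code and the single separator character that follows it.
sed 's/^[A-Za-z-]*.//' "$tmp/sample.txt" > "$tmp/justtext.txt"
first_text=$(head -n 1 "$tmp/justtext.txt")

# Lines with no ASCII letters at all, e.g. text in another script.
non_latin=$(grep -E -v '[A-Za-z]' "$tmp/justtext.txt")

printf '%s\n' "$pt_lines" "$first_text" "$non_latin"
rm -rf "$tmp"
```

Note that the sed pattern removes exactly one character after the code, so it only works as intended when the separator is a single character.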
- Show lines containing four consecutive vowels a,e,i,o,u.
$ cat justtext.txt | egrep '[aeiou]{4}'
- Show lines containing the Unicode right apostrophe (U+2019), replacing each one with a saltillo (U+A78C).
$ cat justtext.txt | egrep '’' | sed 's/’/ꞌ/g'
- Show all "words" (contiguous sequences of letter characters), one per line. May not work on a Mac, or with non-GNU grep.
$ cat justtext.txt | egrep -o '[[:alpha:]]+'
- Save the words in a separate file.
$ cat justtext.txt | egrep -o '[[:alpha:]]+' > words.txt
- Count the number of words.
$ cat words.txt | wc -l
- Show the words that are exactly 14 "letters" (characters) long.
$ cat words.txt | egrep '^.{14}$'
- Show all words in sorted order.
$ cat words.txt | sort
- Sort them by endings (for investigating inflectional morphology, say).
$ cat words.txt | rev | sort | rev
- Create a word frequency list.
$ cat words.txt | sort | uniq -c | sort -r -n
- Show all word bigrams.
$ cat words.txt | sed '1d' | paste words.txt -
- Word bigram frequency list.
$ cat words.txt | sed '1d' | paste words.txt - | sort | uniq -c | sort -r -n
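The frequency-list and bigram pipelines can be tried on a tiny word list. The seven words below are invented for illustration and do not come from the sample file.

```shell
tmp=$(mktemp -d)
# Invented word list, one word per line, standing in for words.txt.
printf '%s\n' the cat saw the dog the cat > "$tmp/words.txt"

# Unigram counts: sort so duplicates are adjacent, count them, then
# order by count, highest first. awk only tidies uniq's leading spaces.
top_word=$(sort "$tmp/words.txt" | uniq -c | sort -r -n \
  | head -n 1 | awk '{print $1, $2}')

# Bigrams: pair each line with the line after it by pasting the file
# against a copy of itself that has the first line deleted.
sed '1d' "$tmp/words.txt" | paste "$tmp/words.txt" - > "$tmp/bigrams.txt"
top_bigram=$(sort "$tmp/bigrams.txt" | uniq -c | sort -r -n \
  | head -n 1 | awk '{print $1, $2, $3}')
bigram_lines=$(wc -l < "$tmp/bigrams.txt" | tr -d ' ')

echo "$top_word"      # 3 the
echo "$top_bigram"    # 2 the cat
echo "$bigram_lines"  # 7
rm -rf "$tmp"
```

The paste output has as many lines as the word list; the last line pairs the final word with nothing, so a list of n words yields n-1 real bigrams.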
- Find palindromes of length at least 4.
$ cat words.txt | rev | paste words.txt - | egrep '^(.{4,}).\1$'
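The palindrome pattern relies on paste joining each word to its reversal with a tab: the group captures the word, the lone . matches the tab, and \1 requires the reversed copy to equal the original. A small sketch with invented words, assuming rev is installed and a grep whose -E mode accepts back-references (GNU grep does; POSIX does not require it):

```shell
tmp=$(mktemp -d)
# Invented word list; "aba" is a palindrome but shorter than 4 letters.
printf '%s\n' level noon aba word abba stats > "$tmp/words.txt"

# word<TAB>reversed-word, keep lines where both halves are identical
# and at least 4 characters long, then keep only the first column and
# join the survivors onto one line.
palindromes=$(rev "$tmp/words.txt" | paste "$tmp/words.txt" - \
  | grep -E '^(.{4,}).\1$' | cut -f 1 | paste -s -d ' ' -)

echo "$palindromes"   # level noon abba stats
rm -rf "$tmp"
```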
- Keep only the Irish sample sentence and put it in a file.
$ cat randomsample.txt | egrep '^ga[^a-z-]' | sed 's/^ga.//' > irish.txt
- Examine the bytes in the file.
$ cat irish.txt | xxd
- Convert to ISO-8859-1.
$ cat irish.txt | iconv -f UTF-8 -t ISO-8859-1
- Convert to ISO-8859-1 and look at the bytes.
$ cat irish.txt | iconv -f UTF-8 -t ISO-8859-1 | xxd
- Show all encodings supported by iconv.
$ iconv -l
- Convert to UTF-16 (big-endian, UTF-16BE) and examine the bytes.
$ cat irish.txt | iconv -f UTF-8 -t UTF-16BE | xxd
- Incorrectly interpret our nice UTF-8 Irish as ISO-8859-1 and convert to UTF-8.
$ cat irish.txt | iconv -f ISO-8859-1 -t UTF-8
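This mis-conversion is how doubly-encoded text arises: each byte of a multi-byte UTF-8 character is treated as a separate ISO-8859-1 character and re-encoded. A minimal sketch on one word from the Irish example, using od as a POSIX alternative to xxd; the helper name hexbytes is made up for illustration.

```shell
# Hypothetical helper (not part of any tool): print the bytes of its
# argument as lowercase hex with no separators.
hexbytes() { printf '%s' "$1" | od -An -tx1 | tr -d '[:space:]'; }

# "chéadna" in UTF-8; é (U+00E9) is the two bytes c3 a9.
word=$(printf 'ch\303\251adna')

# Wrongly declare the input to be ISO-8859-1 and convert to UTF-8.
mojibake=$(printf '%s' "$word" | iconv -f ISO-8859-1 -t UTF-8)

# Undo the damage by converting back the other way.
repaired=$(printf '%s' "$mojibake" | iconv -f UTF-8 -t ISO-8859-1)

hexbytes "$word";     echo   # 6368c3a961646e61
hexbytes "$mojibake"; echo   # 6368c383c2a961646e61
hexbytes "$repaired"; echo   # 6368c3a961646e61
```

The two bytes of é become four, which display as "Ã©"; converting from UTF-8 back to ISO-8859-1 reverses exactly one round of this damage.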
- Convert the Bribri sentence to UTF-16BE and examine the bytes (it contains U+A78C saltillo and U+0331 combining macron below).
$ cat randomsample.txt | egrep '^bzd' | sed 's/^bzd.//' | iconv -f UTF-8 -t UTF-16BE | xxd
- Try converting Bribri to ISO-8859-1 (or anything other than a Unicode encoding!)
$ cat randomsample.txt | egrep '^bzd' | iconv -f UTF-8 -t ISO-8859-1
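The conversion fails because ISO-8859-1 has no code for the saltillo. A short sketch of detecting that failure in a script through iconv's exit status follows. It assumes GNU iconv behavior (other implementations may substitute a placeholder character instead of failing), and the sample string is invented rather than taken from the Bribri sentence.

```shell
# "a", U+A78C LATIN SMALL LETTER SALTILLO, "b" as UTF-8 bytes.
sample=$(printf 'a\352\236\214b')

# Try the lossy target encoding and record iconv's exit status.
status=0
printf '%s\n' "$sample" | iconv -f UTF-8 -t ISO-8859-1 >/dev/null 2>&1 \
  || status=$?

if [ "$status" -ne 0 ]; then
  echo "not representable in ISO-8859-1 (iconv exit status $status)"
fi

# A Unicode encoding such as UTF-16BE can represent the same string.
utf16_status=0
printf '%s\n' "$sample" | iconv -f UTF-8 -t UTF-16BE >/dev/null 2>&1 \
  || utf16_status=$?
echo "UTF-16BE exit status: $utf16_status"
```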