Uniba.sk



Material for Lesson 1

Initial questions

The purpose of these questions is to find out what you know before this lesson, so do not search for the answers on the Web:

I1: How big (in bytes) will a text file containing only the Slovak word “košice” be? Give reasons for your answer.

I2: Why do some Web pages not correctly display characters that do not belong to the basic English alphabet?

I3: After copying a text file to a Web server, the size of the file has changed. What could be the reason(s)?

I4: By how many bytes will the free capacity of the disk shrink when you create a new file containing 100 bytes of information? Why?

Task 1a: In Notepad++, create three texts: the first contains only the Slovak word čaša (copy it from here and do not include a new line), the second contains only the word casa, and the third is empty (it contains no characters at all). Using the Convert to... options in the Encode menu, save each text in all 5 encodings, so that you get 15 files. Name them logically, so that you know which file is which.

Important note about the terminology:

ANSI in Notepad++ (on a computer with the Windows locale set to Slovakia) means "Windows code page 1250 (CP1250)".

Later in this text I refer to UCS-2 as "Unicode", even though this is not fully correct.

BE and LE mean "big endian" and "little endian".
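To make the terms "big endian" and "little endian" concrete, here is a minimal Python sketch. The function name `two_byte_layouts` and the sample value 0x1234 are chosen only for illustration and are not part of the lesson material. It shows the two possible byte orders of one 16-bit number.

```python
def two_byte_layouts(value: int) -> dict:
    """Return the big- and little-endian byte layout of a 16-bit value as hex text."""
    return {
        # Big endian: the most significant byte is stored first.
        "BE": value.to_bytes(2, "big").hex(" "),
        # Little endian: the least significant byte is stored first.
        "LE": value.to_bytes(2, "little").hex(" "),
    }


print(two_byte_layouts(0x1234))  # {'BE': '12 34', 'LE': '34 12'}
```

The same number is stored in both cases; only the order of its two bytes in the file differs.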

Find out the size of each file and enter it into the table below.

|Encoding |čaša |casa |empty |

|ANSI (Windows CP1250) | | | |

|UTF-8 | | | |

|UTF-8 BOM | | | |

|Unicode big endian (UCS-2 BE BOM) | | | |

|Unicode little endian (UCS-2 LE BOM) | | | |
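Once you have filled in the table from the real files, you can cross-check the sizes with a short Python sketch. The mapping from the Notepad++ labels to Python codec names below is an assumption made for this example (ANSI is taken as `cp1250`, as stated in the note above, and UCS-2 is approximated by UTF-16, which gives the same bytes for the characters used here). The names `ENCODINGS`, `encode_as` and `size_table` are hypothetical.

```python
import codecs

# Assumed mapping: Notepad++ label -> (Python codec, BOM bytes written first).
ENCODINGS = {
    "ANSI (Windows CP1250)": ("cp1250", b""),
    "UTF-8": ("utf-8", b""),
    "UTF-8 BOM": ("utf-8", codecs.BOM_UTF8),
    "UCS-2 BE BOM": ("utf-16-be", codecs.BOM_UTF16_BE),
    "UCS-2 LE BOM": ("utf-16-le", codecs.BOM_UTF16_LE),
}


def encode_as(text: str, label: str) -> bytes:
    """Return the bytes a file would contain for the given text and label."""
    codec, bom = ENCODINGS[label]
    return bom + text.encode(codec)


def size_table(words):
    """Map each encoding label to the list of byte counts, one per word."""
    return {label: [len(encode_as(w, label)) for w in words] for label in ENCODINGS}


# Print one row per encoding: sizes for čaša, casa and the empty text.
for label, sizes in size_table(["čaša", "casa", ""]).items():
    print(f"{label:24}", *sizes)
```

If a printed size differs from what Windows reports for your file, check whether the file contains a trailing new line or was saved in a different encoding than intended.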

Task 1b: State your conclusions about the size of text files in different encodings. What does the file size depend on? Why does a file with no content sometimes occupy a non-zero number of bytes?

Task 2a: Using Far Manager 3 (you can find it in the Start menu), display the content of each file in hexadecimal, byte by byte (press F3, then F4; if the file is shown as two-byte units rather than byte by byte, press F8), and record the content in the table:

|Encoding |čaša |casa |empty |

|ANSI (CP1250) | | | |

|UTF-8 | | | |

|UTF-8 BOM | | | |

|UCS-2 BE | | | |

|UCS-2 LE | | | |
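If Far Manager is not available, the same byte-by-byte view can be approximated with a few lines of Python. This is only a sketch: the function names `hex_bytes` and `hex_dump_file` and the demo file name are hypothetical, and the demonstration uses a throwaway file rather than one of the 15 files from Task 1a.

```python
import tempfile
from pathlib import Path


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated, two-digit uppercase hex values."""
    return data.hex(" ").upper()


def hex_dump_file(path) -> str:
    """Read a file in binary mode and return its content byte by byte in hex."""
    return hex_bytes(Path(path).read_bytes())


# Demonstration on a temporary file holding the two ASCII letters "AB".
with tempfile.TemporaryDirectory() as folder:
    demo = Path(folder) / "demo.txt"  # hypothetical file name
    demo.write_bytes(b"AB")
    print(hex_dump_file(demo))  # 41 42
```

To inspect your own files, call `hex_dump_file` with the path of each file you saved in Task 1a.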

Using your knowledge (of hexadecimal, of “big endian” and “little endian”, etc.), the encoding tables, and what you know about Unicode, the UTF-8 encoding and the BOM (study the links below), try to understand (explain to yourself) each value in the table above.



– search for “Latin_Extended-A”





Then try to encode the Slovak word šípka by hand in several encodings:

|Encoding / Content |šípka |

|ANSI (Windows CP1250) | |

|UCS-2 LE | |

|UCS-2 BE | |

|UTF-8 BOM | |

|UTF-8 | |
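After encoding the word by hand, you can check the UTF-8 rows against a small Python sketch that follows the UTF-8 bit layout step by step. The function name `utf8_by_hand` is hypothetical, and the sketch assumes valid code points (it does not reject the surrogate range U+D800 to U+DFFF). The sample characters é and € are not taken from the lesson.

```python
def utf8_by_hand(code_point: int) -> bytes:
    """Encode one Unicode code point to UTF-8 using explicit bit operations."""
    if code_point < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


print(utf8_by_hand(ord("é")).hex(" "))  # c3 a9
print(utf8_by_hand(ord("€")).hex(" "))  # e2 82 ac
```

Calling `utf8_by_hand(ord(ch))` for each character `ch` of your word gives the bytes to compare with your hand-made result.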

Task 2b: Write down your findings about the structure of text files in the various encodings. How do the CP1250, Unicode and UTF-8 encodings work? What do LE and BE mean in the names of the encodings?

Task 5b: What is still unclear to you about the encoding of text files? What would you like to learn in this area?

Final test

O1: How long will a text file containing only the Slovak word sôvä be?

O2: Why do some Web pages not correctly display characters that do not belong to the basic English alphabet?
