This page has links to websites or programs not trusted by Scratch or hosted by Wikipedia. Remember to stay safe while using the Internet, as we can't guarantee the safety of other websites.
UTF-8 has become the most popular character encoding standard in recent years.

A Character, commonly abbreviated as "char", is a computer symbol, letter, or number.[1] A keyboard is an input device that inputs a character when a key is pressed. In Scratch, characters are used in strings, arguments, and any situation in the Scratch editor or the playable project where text is required.

Computers use encoding sets to represent characters. Since computers only understand binary code, characters are identified by certain binary sequences. There are many variations and standards across the globe that have changed throughout history.[2]

Types of Characters

Letters

Letters are characters from an alphabet. In English, they consist of lowercase and uppercase characters ranging from the letter "A" to "Z". Combining letters can create words, and combining words can create sentences. "Character" is simply a more universal word that encompasses letters as well as other kinds of symbols.

Symbols

Computers have a wide range of recognizable symbols. Some are present on standard keyboards, while others may need to be entered via software rather than a hardware device. An example of a symbol is the common pound sign: "#". The pound sign is also known as a "hashtag" on social media websites and is one of the most commonly used symbols. The "&" symbol is also common and represents the word "and" with one character.

Emojis

The emoji section of an Android keyboard.

Emojis are small images and "smileys". They are recognized computer characters and can even be used in project names. Emojis have surged in popularity in the last decade due to their fun nature and easy accessibility on cell phones. To input an emoji into a project title, there are various methods:

  • Perform the input on a cell phone from the project page
  • Copy-and-paste an emoji from another Internet source
  • Use an on-screen keyboard with emoji support on a computer

In Windows 10, the default on-screen keyboard does not support emojis. However, there is a second on-screen keyboard called the "Touch Keyboard" that has emoji support. The Touch Keyboard can be used even without a touch screen; it has support for a traditional computer mouse. To enable it, right-click the task bar and select "Show touch keyboard button". From there, the Touch Keyboard icon will appear on the right side of the task bar. On the virtual keyboard, the "smiley" button displays the emoji options.

Numbers

Main article: Numbers

Numbers are also symbols, and they are unique symbols that can have mathematical operations performed on them. Single number characters can be combined to form larger or more precise numbers. The basic numbers range from "0" to "9". Decimal numbers often use the "." character to represent a decimal point. While the "." character alone is a symbol and not a number, it can be used with a number.

Non-Printable Characters

The "enter" key is a real character used twice in this text file when jumping down a line.

Some characters are "invisible", meaning computers do not display them on a screen.[3] An example of this is the character for the "escape" key. Other examples include the characters for the "enter" key and the "tab" key, and even the value "null". Null has no visual representation but is important in computer programming. In the language C, the "null" character is used to denote the end of a string.

Restrictions

Some computer programs may only allow certain characters to be used in specific circumstances. For instance, Scratch does not allow letter or symbol characters to be typed into a numeric input. It is up to the programmer to decide which characters are allowed. Many websites only allow letters, numbers, and a few symbols to be used for usernames. Passwords often allow a larger range of characters for enhanced security.

Strings

Main article: String

A string is a chain of characters. A phrase, word, or even random jumble of characters can be a string. The communication of ideas typically cannot be done with single characters, so multiple are used in unison. A string can consist of a single character, however. In Scratch, strings are commonly used in lists, blocks such as Say (), encoding and decoding cloud data, and more.

Retrieving a Character from a String

Main article: Letter () of () (block)

In Scratch, the letter () of [] block is used to retrieve a single character from a string. For instance, if the first letter of "Hello World" is to be obtained, arguments can be entered into the block to form letter (1) of [Hello World].
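For comparison, the same lookup can be sketched in Python. This is only an illustration: the helper name letter_of is hypothetical, and returning an empty string for an out-of-range position is an assumption about how the block behaves. The main point is that Scratch counts letters from 1, while Python counts from 0.

```python
def letter_of(index, text):
    """Mimic Scratch's letter () of [] block, which counts from 1."""
    # Assumption: positions outside the string give an empty result.
    if 1 <= index <= len(text):
        return text[index - 1]  # Python strings are indexed from 0
    return ""


print(letter_of(1, "Hello World"))  # H
```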

Encoding

A computer does not recognize characters like a human. A human sees a frowning emoji and interprets it as sadness. A human sees numbers and associates mathematics with them. A computer is merely a machine that represents characters with standardized formats known as encodings.[4] Basically, every character is assigned a code value. Usually the code values are organized for the programmer's ease. For instance, the letters' codes will be in alphabetical order. Numbers, likewise, will be in a simple order.
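The ordering of code values can be checked with a short Python sketch. Python's built-in ord() reports the code assigned to a character; the helper is_consecutive is a hypothetical name used only for this illustration.

```python
def is_consecutive(codes):
    """Return True if each code is exactly one more than the previous."""
    return all(b - a == 1 for a, b in zip(codes, codes[1:]))


letter_codes = [ord(ch) for ch in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"]
digit_codes = [ord(ch) for ch in "0123456789"]

print(letter_codes[:3])             # [65, 66, 67]
print(is_consecutive(letter_codes))  # True
print(is_consecutive(digit_codes))   # True
```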

ASCII

The American Standard Code for Information Interchange (ASCII) is an old encoding standard that is still in use. Each character is associated with an ASCII code and is stored in a single byte (8 bits).[5] Originally only 7 bits were used to represent ASCII characters, allowing 128 characters. They were still stored in a single byte, though, since computers work with whole bytes more easily than with groups of 7 bits. Eventually an extended family of character sets added the codes 128-255, becoming known as ANSI.

ASCII is a limited character set because of its history. Early computers handled data in small units of at most 8 bits, and ASCII used only 7 of them, so the character set was restricted to 128 characters (codes 0-127). This predominantly included characters most associated with the English language. The following table provides a snippet of some of the characters in the set:[6]

ASCII Values
Decimal Code Binary Value Character
Non-Printable Values
0 000 0000 NULL
1 000 0001 Start of heading
2 000 0010 Start of text
3 000 0011 End of text
4 000 0100 End of transmission
5 000 0101 Enquiry
6 000 0110 Acknowledgement
7 000 0111 Bell
8 000 1000 Backspace
9 000 1001 Horizontal Tab
10 000 1010 Line Feed
11 000 1011 Vertical Tab
12 000 1100 Form Feed
13 000 1101 Carriage Return
14 000 1110 Shift Out
15 000 1111 Shift In
16 001 0000 Data Link Escape
17 001 0001 Device Control 1 (XON)
18 001 0010 Device Control 2
19 001 0011 Device Control 3 (XOFF)
20 001 0100 Device Control 4
21 001 0101 Negative Acknowledgement
22 001 0110 Synchronous Idle
23 001 0111 End of Transmission Block
24 001 1000 Cancel
25 001 1001 End of Medium
26 001 1010 Substitute
27 001 1011 Escape
28 001 1100 File Separator
29 001 1101 Group Separator
30 001 1110 Record Separator
31 001 1111 Unit Separator
127 111 1111 Delete
Printable Values
32 010 0000 Space
33 010 0001 !
34 010 0010 "
35 010 0011 #
36 010 0100 $
37 010 0101 %
38 010 0110 &
39 010 0111 '
40 010 1000 (
41 010 1001 )
42 010 1010 *
43 010 1011 +
44 010 1100 ,
45 010 1101 -
46 010 1110 .
47 010 1111 /
48 011 0000 0
49 011 0001 1
50 011 0010 2
51 011 0011 3
52 011 0100 4
53 011 0101 5
54 011 0110 6
55 011 0111 7
56 011 1000 8
57 011 1001 9
58 011 1010 :
59 011 1011 ;
60 011 1100 <
61 011 1101 =
62 011 1110 >
63 011 1111 ?
64 100 0000 @
65 100 0001 A
66 100 0010 B
67 100 0011 C
68 100 0100 D
69 100 0101 E
70 100 0110 F
71 100 0111 G
72 100 1000 H
73 100 1001 I
74 100 1010 J
75 100 1011 K
76 100 1100 L
77 100 1101 M
78 100 1110 N
79 100 1111 O
80 101 0000 P
81 101 0001 Q
82 101 0010 R
83 101 0011 S
84 101 0100 T
85 101 0101 U
86 101 0110 V
87 101 0111 W
88 101 1000 X
89 101 1001 Y
90 101 1010 Z
91 101 1011 [
92 101 1100 \
93 101 1101 ]
94 101 1110 ^
95 101 1111 _
96 110 0000 `
97 110 0001 a
98 110 0010 b
99 110 0011 c
100 110 0100 d
101 110 0101 e
102 110 0110 f
103 110 0111 g
104 110 1000 h
105 110 1001 i
106 110 1010 j
107 110 1011 k
108 110 1100 l
109 110 1101 m
110 110 1110 n
111 110 1111 o
112 111 0000 p
113 111 0001 q
114 111 0010 r
115 111 0011 s
116 111 0100 t
117 111 0101 u
118 111 0110 v
119 111 0111 w
120 111 1000 x
121 111 1001 y
122 111 1010 z
123 111 1011 {
124 111 1100 |
125 111 1101 }
126 111 1110 ~
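
Rows like those in the table above can be reproduced with a few lines of Python. This is a minimal sketch; the function name ascii_row is hypothetical, and it only handles printable characters in the 0-127 range.

```python
def ascii_row(ch):
    """Format one table row: decimal code, 7-bit binary value, character."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not an ASCII character")
    bits = format(code, "07b")  # 7 binary digits, padded with zeros
    return f"{code} {bits[:3]} {bits[3:]} {ch}"


print(ascii_row("A"))  # 65 100 0001 A
print(ascii_row("z"))  # 122 111 1010 z
```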

ANSI

ANSI is an extension of the ASCII encoding, doubling the number of characters. It contains characters with codes ranging from 0-255. It differs from ASCII notably by using 8 bits instead of 7 bits to represent a single character.[7] In the present day, though, this difference is insignificant since ASCII characters are essentially stored as 8-bit values with the first bit always set to "0". Likewise, the values that ANSI adds onto the ASCII character set have the first bit set to "1".
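
The role of the first bit can be illustrated with a small Python sketch. The function name high_bit is hypothetical; it simply extracts the leading bit of an 8-bit code.

```python
def high_bit(code):
    """Return the first (most significant) bit of an 8-bit code."""
    return (code >> 7) & 1


print(format(65, "08b"), high_bit(65))    # 01000001 0
print(format(176, "08b"), high_bit(176))  # 10110000 1
```

Codes 0-127 all report 0, and codes 128-255 all report 1.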

Mapping Standards

In different countries, the same ANSI code can represent different characters. The encoding system itself (ANSI) uses the same logic, but which codes are associated with which symbols varies.[8] A mapping standard is a method of defining the codes for the desired characters. ISO-8859 and its variants are the most common mapping schemes for Western languages in ANSI.[9]
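
As a rough illustration, the Python sketch below decodes the same single byte with two ISO-8859 variants. The choice of byte 164 and of the two variants is only an example; the results are printed with ascii() so that the output does not depend on the terminal.

```python
raw = bytes([0xA4])  # one byte holding the code 164

as_latin1 = raw.decode("iso8859_1")   # currency sign in ISO-8859-1
as_latin9 = raw.decode("iso8859_15")  # euro sign in ISO-8859-15

print(ascii(as_latin1))  # '\xa4'
print(ascii(as_latin9))  # '\u20ac'
```

The byte is identical in both cases; only the mapping used to interpret it changes.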

Inputting Characters off the Keyboard

Keyboards only have a limited number of characters. If, for instance, one wants to enter the "°" symbol, the "alt" key can be held down while "0176" is typed on the right-hand number pad of the keyboard.[10] "176" is the respective code for the degree sign in ANSI. This functionality is built into Windows rather than into the keyboard itself.
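
The link between code 176 and the degree sign can be checked in Python. This sketch assumes Windows-1252 (Python's "cp1252" codec) as the ANSI mapping in use, which is typical for Western systems but not universal.

```python
# Assumption: the "ANSI" mapping is Windows-1252.
degree = bytes([176]).decode("cp1252")

print(ascii(degree))  # '\xb0', the degree sign
print(ord(degree))    # 176
```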

UTF-8

UTF-8 is a more modern standard that encompasses over a million characters without necessarily requiring multiple bytes per character. The standard can be used globally, allowing Chinese characters to be used in the same text as Spanish characters.[11] UTF-8 reserves the leading bits of a character's first byte to indicate how many bytes make up that character. For example, if a certain character has a very long code, a few bits act as a "flag" to alert the computer that it is a longer character. The computer then reads the following byte or bytes as part of the same single character.
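
The "flag" bits can be inspected with a short Python sketch. The function names utf8_bits and sequence_length are hypothetical; the second one reads only the leading bits of the first byte to work out how many bytes belong to the character.

```python
def utf8_bits(ch):
    """Return the UTF-8 bytes of a character as 8-digit binary strings."""
    return [format(b, "08b") for b in ch.encode("utf-8")]


def sequence_length(first_byte):
    """Work out the character's length in bytes from its first byte."""
    if first_byte >> 7 == 0b0:
        return 1  # 0xxxxxxx: a single-byte (ASCII) character
    if first_byte >> 5 == 0b110:
        return 2  # 110xxxxx: one more byte follows
    if first_byte >> 4 == 0b1110:
        return 3  # 1110xxxx: two more bytes follow
    if first_byte >> 3 == 0b11110:
        return 4  # 11110xxx: three more bytes follow
    raise ValueError("not a valid first byte")


print(utf8_bits("A"))       # ['01000001']
print(utf8_bits("\u00e9"))  # ['11000011', '10101001'] (the letter e with an acute accent)
print(sequence_length("\u00e9".encode("utf-8")[0]))  # 2
```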

Some characters are represented by fewer bytes than others in the encoding. This allows files to be smaller than they would be in an encoding that uses the same number of bytes for every character. The UCS-4 character encoding represents all characters with 4 bytes. While some characters in UTF-8 may be represented by 4 bytes, many are only represented by 1 or 2. The following chart shows the number of bytes required for a range of characters:[12]

Bytes Per Character
Min Character Code Max Character Code Bytes
0 127 1
128 2,047 2
2,048 65,535 3
65,536 1,114,111 4

Because of this setup, all the original ASCII characters (0-127) are still only 1 byte in UTF-8. The least common characters take up a larger number of bytes.
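
The byte counts in the chart can be verified with a brief Python sketch that encodes the character at each boundary code. The function name utf8_length is hypothetical.

```python
def utf8_length(code):
    """Return how many bytes UTF-8 uses for the given character code."""
    return len(chr(code).encode("utf-8"))


for code in (0, 127, 128, 2047, 2048, 65535, 65536, 1114111):
    print(code, utf8_length(code))
# 0 1
# 127 1
# 128 2
# 2047 2
# 2048 3
# 65535 3
# 65536 4
# 1114111 4
```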

Other Variants

UTF-16 and UTF-32 also exist but are less common than UTF-8. UTF-16 uses a minimum of 16 bits, or 2 bytes, for every character.[13] One might assume this always makes files larger than UTF-8, but some characters that are represented by 3 bytes in UTF-8 are represented by only 2 bytes in UTF-16. A character whose code needs 16 bits takes up 3 bytes in UTF-8 because some of the bits in each byte are used to signal that multiple bytes are necessary. In UTF-16, the same 16-bit code can be represented by 2 bytes.
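
The size difference can be seen in a small Python comparison. The function name encoded_sizes is hypothetical, and the "utf-16-le" codec is used so that no byte order mark is added to the count.

```python
def encoded_sizes(ch):
    """Return (bytes in UTF-8, bytes in UTF-16) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


print(encoded_sizes("A"))       # (1, 2): UTF-8 is smaller
print(encoded_sizes("\u4e2d"))  # (3, 2): UTF-16 is smaller (a Chinese character)
```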

Unicode

The Unicode Consortium is a non-profit organization that develops the Unicode standard of computer characters.

Unicode is a standardized character set that has room for over a million characters. UTF-8 encodes the Unicode character set. Unicode itself does not specify how to encode its data into binary; it is merely a large database that assigns code values to characters.[14] Unicode is regularly updated with new characters, as not all code values have been assigned yet. On May 18, 2017, the Emoji 5.0 set of characters was released.[15]
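
Python ships with a copy of the Unicode database in its standard unicodedata module, which makes it possible to look up the code value and official name of a character. The sketch below is only an illustration; the function name describe is hypothetical.

```python
import unicodedata


def describe(ch):
    """Return a character's Unicode code value and official name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe("A"))       # U+0041 LATIN CAPITAL LETTER A
print(describe("\u00b0"))  # U+00B0 DEGREE SIGN
```

Nothing here says how the character is stored in bytes; that is left to an encoding such as UTF-8.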

Usage in Scratch

Main article: Encoding and Decoding Cloud Data

Knowledge of computer character encoding standards can be beneficial when developing Scratch projects. In particular, using cloud variables to store more data than a counter or high score value requires a custom-made encoding system. Cloud variables are only capable of storing numbers, so if text is to be stored, it needs to be translated into numeric codes. This is in line with how computers work, as they translate text into sequences of "1"s and "0"s.

Similar to ASCII encoding, a system can be designed where each character is assigned a code, and the cloud variable contains a sequence of codes. When the data is to be read, it must be decoded by looking up the characters associated with their respective code values. Since cloud variables allow the digits 0-9, fewer digits can represent the same range of characters as ASCII: three decimal digits are enough for all 128 codes, whereas binary requires 7 digits (bits) per character.
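
A fixed-width scheme of this kind might look like the following Python sketch. It is not the method of any particular Scratch project: the names WIDTH, encode_fixed, and decode_fixed are hypothetical, ASCII codes are used as the character codes, and the stored value is treated as a string of digits.

```python
WIDTH = 3  # assumed number of decimal digits per character code


def encode_fixed(text):
    """Turn text into one long string of digits, WIDTH digits per character."""
    return "".join(str(ord(ch)).zfill(WIDTH) for ch in text)


def decode_fixed(digits):
    """Split the digits into WIDTH-sized codes and look up each character."""
    return "".join(
        chr(int(digits[i:i + WIDTH])) for i in range(0, len(digits), WIDTH)
    )


print(encode_fixed("Hi"))      # 072105
print(decode_fixed("072105"))  # Hi
```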

UTF-8 can also be imitated with cloud variables by using some digits to represent how many other digits are part of the same character before moving on to the next one. Suppose that the first digit in the cloud variable signifies how many following digits make up the code for the next character. If the cloud variable's encoded data is 3564298, then the first character code is "564", followed by "98". The "3" signifies that there are 3 digits in the first character code, and the "2" signifies that there are 2 digits in the second.
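
The length-prefixed scheme described above can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes every code has at most 9 digits so that its length fits in a single prefix digit.

```python
def decode_prefixed(digits):
    """Read codes where one digit gives the length of the code that follows."""
    codes = []
    i = 0
    while i < len(digits):
        length = int(digits[i])  # how many digits the next code uses
        codes.append(int(digits[i + 1:i + 1 + length]))
        i += 1 + length          # jump to the next length digit
    return codes


def encode_prefixed(codes):
    """Write each code preceded by its number of digits."""
    return "".join(f"{len(str(code))}{code}" for code in codes)


print(decode_prefixed("3564298"))  # [564, 98]
print(encode_prefixed([564, 98]))  # 3564298
```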

A list can then be used where the index corresponds to the character code. This type of system would be beneficial if a large number of characters is to be recognized by the project. Restrictions can be set by only allowing certain characters, using the simpler ASCII-based encoding system with a fixed number of digits per character. This, however, may cause issues if the project attempts to encode a username containing an "illegal" character into the cloud variable. More complex logic could account for such circumstances.
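
A list-based lookup with a restricted character set might be sketched in Python like this. The list ALLOWED and the function names are hypothetical, codes are numbered from 1 to mirror Scratch lists, and silently skipping "illegal" characters is just one possible way of handling them.

```python
# Assumed set of legal characters; a character's code is its position in the list.
ALLOWED = list("abcdefghijklmnopqrstuvwxyz0123456789_-")


def encode_name(name):
    """Encode a name as 2-digit codes, skipping characters not in ALLOWED."""
    digits = ""
    for ch in name.lower():
        if ch not in ALLOWED:
            continue  # one simple policy: drop illegal characters
        digits += str(ALLOWED.index(ch) + 1).zfill(2)  # codes start at 1
    return digits


def decode_name(digits):
    """Look up each 2-digit code in the list to rebuild the text."""
    return "".join(
        ALLOWED[int(digits[i:i + 2]) - 1] for i in range(0, len(digits), 2)
    )


print(encode_name("Cat!"))    # 030120
print(decode_name("030120"))  # cat
```

A real project could instead reject the name or substitute a placeholder character, depending on what suits it.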

References

  1. https://computerhope.com/jargon/c/charact.htm
  2. http://www.developerknowhow.com/1091/the-history-of-character-encoding
  3. https://techopedia.com/definition/29785/non-printable-characters
  4. https://techterms.com/definition/characterencoding
  5. http://www.asciitable.com/
  6. http://ascii.cl/
  7. http://www.sttmedia.com/unicode-ansi
  8. https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
  9. https://www.terena.org/activities/multiling/ml-docs/iso-8859.html
  10. https://support.office.com/en-us/article/Insert-degree-symbol-f1d062b6-577f-4fe2-8a51-c6f7a862a8b7
  11. http://www.codeguru.com/cpp/misc/misc/multi-lingualsupport/article.php/c10451/The-Basics-of-UTF8.htm
  12. http://www.fileformat.info/info/unicode/utf8.htm
  13. http://www.differencebetween.net/technology/difference-between-utf-8-and-utf-16/
  14. http://unicode.org/charts/
  15. http://blog.unicode.org/2017/05/unicode-emoji-50-specification-now-final.html