SketchUp uses UTF-8 character encoding. All the SketchUp API methods return UTF-8 strings. A UTF-8 character can consist of 1–4 bytes.
The first 128 characters in UTF-8 are identical to ASCII. Characters that lie outside the ASCII range consist of 2–4 bytes. (æøåüûùú are all two bytes in UTF-8.)
Ruby 1.8, which ships with SketchUp, is not aware of character encoding. A String is a series of 8-bit bytes, always! A "character" in the eyes of Ruby is a byte.
For English users there isn't a problem. All letters and numbers fall within the ASCII range, so the UTF-8 strings will only contain single-byte characters.
For non-English users it's a bit more tricky.
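To make the 1–4 byte rule concrete, here is a minimal sketch that maps a Unicode code point to the number of bytes UTF-8 needs for it, and cross-checks that against what Ruby's pack('U') produces. The helper name utf8_width is made up for this illustration and is not part of Ruby or the SketchUp API.

```ruby
# Number of bytes UTF-8 needs for a given Unicode code point.
# (utf8_width is a hypothetical helper name used only in this sketch.)
def utf8_width(codepoint)
  if codepoint < 0x80
    1 # ASCII range
  elsif codepoint < 0x800
    2 # e.g. æ ø å
  elsif codepoint < 0x10000
    3 # e.g. the Euro sign
  else
    4 # everything above the Basic Multilingual Plane
  end
end

# Cross-check against the bytes pack('U') actually emits.
[84, 230, 8364, 0x1F600].each do |cp|
  bytes = [cp].pack('U').unpack('C*')
  puts "U+%04X -> %d byte(s): %s" % [cp, utf8_width(cp), bytes.inspect]
end
```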
What works and what doesn't
When you deal with UTF-8 characters from SketchUp you must remember the following:
"Test æøå".length => 11
String.length returns the number of bytes, not characters.
Methods that modify strings will most likely mangle UTF-8 strings that contain multi-byte characters.
Consider this string, "Test æøå". If we investigate the bytes that make up this UTF-8 string we will see this:
"Test æøå".unpack('C*')
[84, 101, 115, 116, 32, 195, 166, 195, 184, 195, 165]
As you see, "æ" consists of the two bytes 195 and 166.
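The byte values themselves tell you where a character starts: in UTF-8 every continuation byte has the bit pattern 10xxxxxx, so 166 (0xA6) is a continuation byte while 195 (0xC3) is a lead byte. As a rough sketch, assuming the input is valid UTF-8, you can count characters by counting the bytes that are not continuation bytes. The helper name utf8_char_count is hypothetical.

```ruby
# Count UTF-8 characters by skipping continuation bytes (10xxxxxx).
# Assumes the string is valid UTF-8; utf8_char_count is a made-up name.
def utf8_char_count(str)
  count = 0
  str.unpack('C*').each do |byte|
    count += 1 unless (byte & 0xC0) == 0x80
  end
  count
end

puts utf8_char_count("Test æøå")
```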
Now observe what happens when we reverse the string using Ruby's built-in method:
"Test æøå".reverse
> ¥Ã¸Ã¦Ã tseT
The string is completely mangled. If we look at the bytes we see the reason for this:
"Test æøå".reverse.unpack('C*')
[165, 195, 184, 195, 166, 195, 32, 116, 115, 101, 84]
Ruby blindly processed each byte individually instead of each character.
Other methods:
"Test æøå".upcase
> TEST æøå
Ruby only seems to process the bytes within the ASCII range, ignoring everything else.
"Test æøå".chop
> Test æøÃ
One byte removed from the multibyte character "å" mangles the whole string.
As you can see, any manipulation of UTF-8 strings is a risky business as you could very easily split a multi-byte character.
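One way to notice that a string has been damaged like this is to ask Ruby to decode it as UTF-8 and see whether it complains. The sketch below relies on unpack('U*') (covered in the next section) raising ArgumentError on truncated or stray bytes. It is only a sanity check, not a full validator, and the name valid_utf8? is an assumption for this example. The broken strings are built from raw bytes so the result does not depend on how a particular Ruby version implements chop or reverse.

```ruby
# Returns false when the bytes cannot be decoded as UTF-8.
# (valid_utf8? is a hypothetical helper, not a Ruby or SketchUp method.)
def valid_utf8?(str)
  str.unpack('U*')
  true
rescue ArgumentError
  false
end

whole = "Test æøå"
bytes = whole.unpack('C*')

# Simulate the byte-wise chop and reverse shown above.
chopped  = bytes[0..-2].pack('C*')
reversed = bytes.reverse.pack('C*')

puts valid_utf8?(whole)
puts valid_utf8?(chopped)
puts valid_utf8?(reversed)
```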
How to deal with it
Fortunately, when writing Ruby scripts you don't have to manipulate strings that often. (In a web development environment this would be a true nightmare as you deal with strings all the time.)
Ruby 1.8 does have some Unicode awareness in the form of .pack and .unpack.
"Test æøå £$€".unpack('U*')
[84, 101, 115, 116, 32, 230, 248, 229, 32, 163, 36, 8364]
Here we get an array of integer values that correspond to the UTF-8 characters. These integer values are the Unicode code points for the characters. Notice that the Euro symbol, a relatively recent addition, has a high code point (8364).
With the array of code points we can process UTF-8 strings more reliably. To get the correct length we can use "Test æøå £$€".unpack('U*').length => 12
And we can extract characters from the string without risking splitting multi-byte characters.
"Test æøå £$€".unpack('U*')[5..7].pack('U*')
> æøå
This would have split the "ø" had you used the regular "Test æøå £$€"[5..7] to extract the substring.
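The same unpack/pack round trip can be wrapped in small helpers. This is a minimal sketch, assuming valid UTF-8 input. The names utf8_slice and utf8_chop are invented for illustration and do not come from any library.

```ruby
# Extract a substring by character positions instead of byte positions.
# Returns nil when the range lies outside the string, like Array#[] does.
def utf8_slice(str, range)
  codepoints = str.unpack('U*')[range]
  codepoints ? codepoints.pack('U*') : nil
end

# Remove the last character (not the last byte).
def utf8_chop(str)
  str.unpack('U*')[0..-2].pack('U*')
end

puts utf8_slice("Test æøå £$€", 5..7)
puts utf8_chop("Test æøå")
```

Both helpers decode the whole string on every call, which is fine for short names and labels but would be slow in a tight loop over long strings.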
What's more problematic is swapping upper and lower case letters. When you deal with Unicode characters there are a lot more rules. The same goes for sorting alphabetically.
I have begun to create a UTF-8 aware sub-class of the String object.
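To give a feel for the extra rules, here is a deliberately limited sketch that upper-cases only ASCII letters and the Latin-1 letters U+00E0 to U+00FE, where the upper-case form happens to sit 0x20 below the lower-case one. Even in that small range there is an exception (U+00F7 is the division sign, not a letter), and characters such as ß or ÿ do not fit the pattern at all, so they are left untouched here. The name utf8_upcase_latin1 is hypothetical, and this is not a general Unicode case mapping.

```ruby
# Upper-case ASCII and Latin-1 letters only; everything else is passed through.
# (utf8_upcase_latin1 is a made-up helper for this illustration.)
def utf8_upcase_latin1(str)
  codepoints = str.unpack('U*').map do |cp|
    ascii_lower  = cp >= 0x61 && cp <= 0x7A
    latin1_lower = cp >= 0xE0 && cp <= 0xFE && cp != 0xF7
    (ascii_lower || latin1_lower) ? cp - 0x20 : cp
  end
  codepoints.pack('U*')
end

puts utf8_upcase_latin1("Test æøå")
```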
# (SAMPLE CODE! - DO NOT USE OR DISTRIBUTE!)
# UTF-8 subclass of +String+
class UTF8 < String
  # Keep the original byte count available as .bytes
  alias :bytes :size # :doc:

  # Return the number of UTF-8 characters.
  def length
    return self.unpack('U*').length
  end

  # (!) Substitute the .each, .replace and sub-string functions.
end
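As an illustration of how the remaining methods might be filled in, here is a separate, self-contained sketch built on the same unpack('U*') approach. It is not the project's actual code: the class name UTF8Sketch and the choice of methods are assumptions made for this example.

```ruby
# Hypothetical String sub-class that works on characters instead of bytes.
class UTF8Sketch < String
  # Number of UTF-8 characters.
  def length
    unpack('U*').length
  end

  # Yield each character as its own (possibly multi-byte) string.
  def each_char
    unpack('U*').each { |cp| yield [cp].pack('U') }
    self
  end

  # Reverse character by character, keeping multi-byte sequences intact.
  def reverse
    self.class.new(unpack('U*').reverse.pack('U*'))
  end
end

s = UTF8Sketch.new("Test æøå")
puts s.length
puts s.reverse
s.each_char { |c| puts c }
```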
This will be populated with more methods that can deal with UTF-8 strings. I have started a Google Code project page: http://code.google.com/p/sketchup-unicode/ so that other people can contribute. Unicode support is a mammoth task and I have not set myself the goal of a complete rewrite of the String class, but will slowly add methods as I need them. I hope that other developers out there who write useful methods will feed them back to this project for the benefit of the plugin community.
Note that contributions to the project should not be dirty hacks - they should produce proper reliable output.
What fails horribly - File methods
There's one major problem with Ruby under Windows. All the File methods fail when the file name or path includes non-ASCII characters. Ruby calls the ANSI (non-Unicode) version of the Windows API and therefore fails miserably. It will return errors saying the file could not be found.
On OSX Ruby calls the correct system API and there are no problems at all.
I have found a way to do file operations under Windows with UTF-8 file names: by calling the Unicode-aware Windows API directly. The problem is that Ruby's File classes are many and extensive. As with the UTF-8 sub-class of the String object, I'm writing sub-classes FileW and FileTestW of the default File and FileTest. On OSX they are pure sub-classes of File and FileTest, but under Windows I override methods to call the Unicode-aware Windows API. Methods not overridden will behave as in their parent class.
So far I have a few proof-of-concept methods such as .exists?, .zero?, .size? and similar, which provide info about files. But I'm still not sure how to tackle reading and writing files, as I'm not entirely sure of the relationship with the IO class. I'm not sure if I can open a UTF-8 aware file stream and pass it on to the IO class, or if I have to sub-class the IO class as well.
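The Unicode-aware ("wide") Windows API functions expect null-terminated UTF-16LE strings, so a UTF-8 path has to be converted before it can be handed over. The following is a minimal sketch of that conversion step only, written without any encoding support beyond pack and unpack. The helper name utf8_to_utf16le is hypothetical, and how the result is then passed to the Windows API is outside the scope of this example.

```ruby
# Convert a UTF-8 string to a null-terminated UTF-16LE byte string.
# (utf8_to_utf16le is a made-up name; assumes valid UTF-8 input.)
def utf8_to_utf16le(str)
  units = []
  str.unpack('U*').each do |cp|
    if cp < 0x10000
      units << cp
    else
      # Code points above U+FFFF are split into a surrogate pair.
      cp -= 0x10000
      units << (0xD800 | (cp >> 10))
      units << (0xDC00 | (cp & 0x3FF))
    end
  end
  units << 0 # null terminator expected by C-style wide strings
  units.pack('v*') # 'v' = 16-bit unsigned, little-endian
end

wide = utf8_to_utf16le("æøå")
puts wide.unpack('C*').inspect
```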
The UnicodeEx project
I have started the UnicodeEx project as an effort to assist in dealing with UTF-8 strings on both platforms, and with File methods on the Windows platform.
It's still early on and many things can and will change, so please do not rely on this 100% yet. I make it available now in the hope that more people can contribute to this project, which is quickly growing in scope.
Repository: http://code.google.com/p/sketchup-unico ... e/checkout (latest source - most up to date)
Download .zip archive: http://code.google.com/p/sketchup-unico ... loads/list (latest build)
Documentation: http://workshop.thomthom.net/su-unicode/