[Code] UnicodeEx - (0.2.0a) Sketchup + Character Encoding

Postby thomthom » Fri Jul 17, 2009 8:48 pm

Overview
Sketchup uses UTF-8 character encoding. All the Sketchup API methods return UTF-8 strings. A UTF-8 character can consist of 1–4 bytes.
The first 128 characters in UTF-8 are identical to ASCII. Characters that lie outside the ASCII range consist of 2–4 bytes. (æøåüûùú are all two bytes in UTF-8.)
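
As a quick check of the byte counts above, here is a minimal sketch that uses only pack and unpack. The sample characters are picked for illustration and are built from their code points; the last one (U+1D11E) is there to show a four-byte case.

```ruby
# Count the bytes each character occupies in UTF-8.
# unpack('C*') returns the raw bytes, so its length is the byte count.
samples = {
  'A'      => [0x41].pack('U'),     # ASCII letter
  'ae'     => [0xE6].pack('U'),     # æ
  'euro'   => [0x20AC].pack('U'),   # €
  'g-clef' => [0x1D11E].pack('U')   # musical symbol outside the BMP
}

byte_counts = {}
samples.each { |name, ch| byte_counts[name] = ch.unpack('C*').length }
# byte_counts['A']      => 1
# byte_counts['ae']     => 2
# byte_counts['euro']   => 3
# byte_counts['g-clef'] => 4
```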

Ruby 1.8, which is shipped with Sketchup, is not aware of character encoding. A String is a series of 8-bit bytes - always! A "character" in the eyes of Ruby is a byte.

For English users this isn't a problem. All letters and numbers fall within the ASCII range, so the UTF-8 strings will only contain 1-byte characters.

For non-English users it's a bit more tricky.


What works and what doesn't
When you deal with UTF-8 characters from Sketchup you must remember the following:

"Test æøå".length
> 11
String.length returns the number of bytes, not characters.

Methods that modify strings will most likely mangle UTF-8 strings that contain multi-byte characters.
Consider the string "Test æøå". If we inspect the bytes that make up this UTF-8 string we see this:
"Test æøå".unpack('C*')
[84, 101, 115, 116, 32, 195, 166, 195, 184, 195, 165]

As you see, "æ" consists of the two bytes 195 and 166.
Now observe what happens when we reverse the string using Ruby's built-in method:

"Test æøå".reverse
> ¥Ã¸Ã¦Ã tseT

The string is completely mangled. If we look at the bytes we see the reason for this:
"Test æøå".reverse.unpack('C*')
[165, 195, 184, 195, 166, 195, 32, 116, 115, 101, 84]
Ruby blindly processed each byte individually instead of each character.
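
The byte values themselves show why byte-wise operations break: in UTF-8, every byte of the form 10xxxxxx (128 to 191) is a continuation byte that belongs to the lead byte before it. Here is a minimal sketch that counts characters by skipping continuation bytes. The helper name is hypothetical and the input is assumed to be valid UTF-8.

```ruby
# Count UTF-8 characters by ignoring continuation bytes (10xxxxxx).
# utf8_char_count is an illustrative name, not an existing method.
def utf8_char_count(str)
  str.unpack('C*').reject { |byte| (byte & 0xC0) == 0x80 }.length
end

bytes = [84, 101, 115, 116, 32, 195, 166, 195, 184, 195, 165] # "Test æøå"
utf8_char_count(bytes.pack('C*')) # => 8
```

Reversing the byte array puts the continuation bytes 166, 184 and 165 in front of their lead byte 195, which is why the reversed string is no longer valid UTF-8.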

Other methods:
"Test æøå".upcase
> TEST æøå

Ruby only seems to process the bytes within the ASCII range, ignoring everything else.

"Test æøå".chop
> Test æøÃ

Only one of the two bytes of the multi-byte character "å" is removed, which leaves a stray byte at the end of the string.

As you can see, any manipulation of UTF-8 strings is a risky business as you could very easily split a multi-byte character.


How to deal with it
Fortunately, when writing Ruby scripts you don't have to manipulate strings that often. (In a web development environment this would be a true nightmare as you deal with strings all the time.)

Ruby 1.8 does have some Unicode awareness in the form of .pack and .unpack with the 'U' directive.
"Test æøå £$€".unpack('U*')
[84, 101, 115, 116, 32, 230, 248, 229, 32, 163, 36, 8364]

Here we get an array of integer values that correspond to the UTF-8 characters. These integer values are the Unicode code points of the characters. Notice that the Euro symbol, which is a recent addition, has a high code point.

With the array of Code Points we can more reliably process the UTF-8 strings. To get the correct length we can use "Test æøå £$€".unpack('U*').length => 12

And we can extract characters from the string without risking splitting multi-byte characters.
"Test æøå £$€".unpack('U*')[5..7].pack('U*')
> æøå

This would have split the "ø" had you used the regular "Test æøå £$€"[5..7] to extract the substring.
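
The unpack/pack round trip above can be wrapped in a few small helpers. This is only a sketch: the method names are made up for illustration and are not part of Ruby or of any library. Note that working per code point still does not keep combining sequences together (a base letter followed by a combining accent, for example).

```ruby
# Character-safe helpers built on unpack('U*') / pack('U*').
# All names here are illustrative, not existing String methods.

# Number of code points rather than bytes.
def utf8_length(str)
  str.unpack('U*').length
end

# Substring by code point index; returns nil when out of range.
def utf8_slice(str, range)
  codepoints = str.unpack('U*')[range]
  codepoints ? codepoints.pack('U*') : nil
end

# Reverse code point by code point instead of byte by byte.
def utf8_reverse(str)
  str.unpack('U*').reverse.pack('U*')
end

# Drop the last code point, however many bytes it occupies.
def utf8_chop(str)
  str.unpack('U*')[0...-1].pack('U*')
end

# "Test æøå £$€" built from the code points listed earlier.
sample = [84, 101, 115, 116, 32, 230, 248, 229, 32, 163, 36, 8364].pack('U*')
utf8_length(sample)      # => 12
utf8_slice(sample, 5..7) # => "æøå"
```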

What's more problematic is swapping upper- and lower-case letters. When you deal with Unicode characters there are a lot more rules. The same goes for sorting alphabetically.
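
To give an idea of what those rules look like, here is a deliberately limited sketch of an upcase that works on code points. It only covers a-z and the Latin-1 lowercase letters à to þ (U+00E0 to U+00FE, skipping the division sign U+00F7), where the uppercase form sits 0x20 below the lowercase one. Everything else, including ß and ÿ, is left untouched, so this is far from complete Unicode case mapping. The method name is hypothetical.

```ruby
# Upcase ASCII and Latin-1 Supplement letters by code point arithmetic.
# Illustrative only: real Unicode case mapping needs full lookup tables.
def utf8_upcase_latin1(str)
  codepoints = str.unpack('U*').map do |cp|
    if cp >= 0x61 && cp <= 0x7A                     # a-z
      cp - 0x20
    elsif cp >= 0xE0 && cp <= 0xFE && cp != 0xF7    # à-þ, skipping ÷
      cp - 0x20
    else
      cp                                            # leave everything else alone
    end
  end
  codepoints.pack('U*')
end

sample = [84, 101, 115, 116, 32, 230, 248, 229].pack('U*') # "Test æøå"
utf8_upcase_latin1(sample) # => "TEST ÆØÅ"
```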

I have begun to create a UTF-8 aware sub-class of the String object.
Code: Select all
# (SAMPLE CODE! - DO NOT USE OR DISTRIBUTE!)
# UTF-8 subclass of +String+
class UTF8 < String

   # Keep the original byte count available as .bytes before .length is overridden
   alias :bytes :size # :doc:
   
   # Return the number of UTF-8 characters.
   def length
      return self.unpack('U*').length
   end
   
   # (!) Substitute the .each .replace and sub-string functions.

end


This will be populated with more methods that can deal with UTF-8 strings. I have started a Google Code project page, http://code.google.com/p/sketchup-unicode/, so that other people can contribute. Unicode support is a mammoth task and I have not set out to do a complete rewrite of the String class, but to slowly add methods as I need them. I hope that other developers who write useful methods will feed them back to this project for the benefit of the plugin community.
Note that contributions to the project should not be dirty hacks - they should produce proper reliable output.


What fails horribly - File methods
There's one major problem with Ruby under Windows. All the File methods fail when the file name or path includes non-ASCII characters. Ruby calls the ANSI ("A") version of the Windows API and therefore fails miserably. It will return errors saying the file could not be found.

On OSX Ruby calls the correct system API and there are no problems at all.

I have found a way to do file operations under Windows with UTF-8 file names: by calling the Unicode-aware Windows API directly. The problem is that the File classes in Ruby are many and extensive. As with the UTF-8 sub-class of the String object, I'm writing the sub-classes FileW and FileTestW of the default File and FileTest. On OSX they are pure sub-classes of File and FileTest. But under Windows I override methods to call the Unicode-aware Windows API. Methods not overridden will behave as in their parent class.

So far I have a few proof-of-concept methods such as .exists?, .zero?, .size? and similar, which provide info about files. But I'm still not sure how to tackle reading and writing files, as I'm not entirely sure of the relationship with the IO class. I'm not sure if I can open a UTF-8 aware file stream and pass it on to the IO class, or if I have to sub-class the IO class as well.
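
The wide ("W") Windows API functions take NUL-terminated UTF-16LE strings rather than UTF-8, so any such call needs a conversion step first. Below is a minimal sketch of that conversion in pure Ruby, using only unpack and pack. The helper name is hypothetical and the actual Win32API call is left out.

```ruby
# Convert a UTF-8 string to a NUL-terminated UTF-16LE byte string,
# the form expected by the wide ("W") Windows API functions.
# utf8_to_utf16le is an illustrative name, not an existing method.
def utf8_to_utf16le(str)
  units = []
  str.unpack('U*').each do |cp|
    if cp > 0xFFFF
      # Code points outside the BMP become a surrogate pair.
      cp -= 0x10000
      units << (0xD800 + (cp >> 10))
      units << (0xDC00 + (cp & 0x3FF))
    else
      units << cp
    end
  end
  units << 0          # terminating NUL code unit
  units.pack('v*')    # 'v' = 16-bit unsigned, little-endian
end

path = [80, 243, 322].pack('U*')   # "Pół"
utf8_to_utf16le(path).unpack('C*') # => [80, 0, 243, 0, 66, 1, 0, 0]
```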


The UnicodeEx project
I have started the UnicodeEx project as an effort to assist in dealing with UTF-8 strings on both platforms, and with File methods on the Windows platform.

It's still early on - many things can and will change. So please do not rely on this 100% yet. I make it available now in the hope that more people can assist with this project, which is quickly growing in scope.

Repository: http://code.google.com/p/sketchup-unico ... e/checkout (latest source - most up to date)
Download .zip archive: http://code.google.com/p/sketchup-unico ... loads/list (latest build)
Documentation: http://workshop.thomthom.net/su-unicode/
Last edited by thomthom on Fri Jul 24, 2009 11:27 am, edited 2 times in total.
Thomas Thomassen — SketchUp Monkey & Coding addict
List of my plugins and link to the CookieWare fund

Re: [Code] UnicodeEx - (0.1.0a) Sketchup + Character Encoding

Postby TIG » Fri Jul 17, 2009 8:59 pm

Thanks for taking this over and on to the next phase... my brain was melting !
TIG

Re: [Code] UnicodeEx - (0.1.0a) Sketchup + Character Encoding

Postby thomthom » Fri Jul 17, 2009 9:07 pm

Unicode is a beast of a topic. The deeper I get into this the more I feel I'm in over my head. I'd rather focus on CityGen or DoubleCut, but for CityGen I need Unicode support. :knockout:

TIG wrote:my brain was melting !

Same here! :? But I found that an ice cold pint does wonders. * :D


* To the head. Not the code...

Re: [Code] UnicodeEx - (0.1.0a) Sketchup + Character Encoding

Postby daiku » Sat Jul 18, 2009 1:55 pm

Very valuable research, Thom. I've got one user who is currently suffering with these problems, and I now feel armed to release a patch for him. Thanks much. CB.

Re: [Code] UnicodeEx - (0.2.0a) Sketchup + Character Encoding

Postby thomthom » Fri Jul 24, 2009 11:28 am

What kind of problems were there?

Re: [Code] UnicodeEx - (0.2.0a) Sketchup + Character Encoding

Postby tomasz » Fri Nov 27, 2009 2:29 pm

Thanks a lot Thomas for the library. SU2TK fails to export to folders and files containing Unicode characters. I will try the lib and maybe I will be able to help you improve it.
It is so irritating. I was furious when I couldn't figure out why Ruby didn't want to write to a C:\Chałupa\chałupa.xml when the folder existed. Now I know why.

Will definitely come back with comments.

Tomasz

Re: [Code] UnicodeEx - (0.2.0a) Sketchup + Character Encoding

Postby thomthom » Fri Nov 27, 2009 2:34 pm

Yea. I've not had the time to make methods to read and write files. Basically any IO functions are missing from my lib as they seem to require quite a bit of work to get fully working. It's been put on my backburner.

Re: [Code] UnicodeEx - (0.2.0a) Sketchup + Character Encoding

Postby Dan Rathbun » Fri Nov 27, 2009 10:48 pm

thomthom wrote:Overview
Sketchup uses UTF-8 character encoding. All the Sketchup API methods return UTF-8 strings.

Is that why the $KCODE (alias $-K) global is set by default to Unicode by Sketchup? (At least in SU ver 7.1.)

From the book Programming Ruby (for ver1.6.x) p137:
Command-Line Options

-K kcode

Specifies the code set to be used. This option is useful mainly when Ruby is used for Japanese-language processing. kcode may be one of:
  • e, E for EUC;
  • s, S for SJIS;
  • u, U for UTF-8
  • a, A, n, N for ASCII
I take the n, N to stand for 'None', meaning ASCII is the default.

same book Programming Ruby, (chap 'The Ruby Language')
Source Layout

Ruby programs are written in 7-bit ASCII.
[Ruby also has extensive support for Kanji, using the EUC, SJIS, or UTF-8 coding system. If a code set other than 7-bit ASCII is used, the KCODE option must be set appropriately, as shown on page 137.]

[later, in same chapter...]

Execution Environment Variables

$-K String [...alias $KCODE ]
    Sets the multibyte coding system for strings and regular expressions. Equivalent to the -K command-line option. See page 137.
[holds one of: EUC, SJIS, UTF-8, or ASCII ]

I looked at my rbconfig.rb file in the full install (ver 1.9.1), and on Win32 machines it's supposed to be blank (or 'None') by default. From:
"#{ENV['RUBYLIB']}/#{RUBY_PLATFORM}/rbconfig.rb", line 75:
CONFIG["DEFAULT_KCODE"] = ""
'Purty' sure it was the same in Ruby ver 1.8.6

SO... Ruby does have encoding support, but it's located in the 'enc' folder, under the platform folder.

If the 'bubble Ruby' with Sketchup doesn't have access to these files, how can changing $KCODE (or any value it might hold,) have any effect?

I have come to the conclusion, that SU Ruby [at least on Win32 PC,] really needs to have a 'Library' folder (under the program folder, not the Plugins folder,) that has many of the full Ruby library files and subfolders.

Thom, if you still think you need to do some Unicode translation, there MAY be a ready-made method within the JSON package, check out %RUBYLIB%/json/pure/generator.rb
He has built-in a UTF-8 to UTF-16 big endian converter. You'd need to modify it, as it outputs JSON strings.
    I'm not here much anymore. But a PM will fire email notifications.

    Re: [Code] UnicodeEx - (0.2.0a) Sketchup + Character Encoding

    Postby thomthom » Fri Nov 27, 2009 10:54 pm

    Isn't $KCODE related to Ruby 1.9?
    (And are you sure it's set by SU - and not some ruby plugin?)

    AFAIK the only Unicode support that 1.8 has is in the RegEx functions. Or have I missed something?
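
The regex support mentioned here can be shown with a small sketch: the /u flag (or $KCODE = 'u' in Ruby 1.8) makes "." match a whole UTF-8 character instead of a single byte. The helper name is illustrative only.

```ruby
# Split a UTF-8 string into characters with a UTF-8 aware regex.
# The /u flag marks the pattern as UTF-8; /m lets "." match newlines too.
# utf8_chars is an illustrative name, not an existing method.
def utf8_chars(str)
  str.scan(/./mu)
end

text = [84, 101, 115, 116, 32, 230, 248, 229].pack('U*') # "Test æøå"
utf8_chars(text).length # => 8
```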

    Dan Rathbun wrote:Thom, if you still think you need to do some Unicode translation, there MAY be a ready-made method within the JSON package, check out %RUBYLIB%/json/pure/generator.rb
    He has built-in a UTF-8 to UTF-16 big endian converter. You'd need to modify it, as it outputs JSON strings.

    Dealing with strings is one thing - I have not found that so critical as long as you are careful.

    The main problem is IO functions under Windows. It's a no-go. Some help here to be able to read, write files would be great. Either to create simple read, write methods. Or IO streams.

    Re: [Code] UnicodeEx - (0.2.0a) Sketchup + Character Encoding

    Postby Dan Rathbun » Fri Nov 27, 2009 11:10 pm

    thomthom wrote:Isn't $KCODE related to Ruby 1.9?
    No, it cannot be, because the book was written for Ruby ver 1.6.x

    thomthom wrote:(And are you sure it's set by SU - and not some ruby plugin?)
    I turned all my plugins OFF, while doing development. And it must be SU that sets it.
    I was having problems making Win32API calls, seemed the ANSI version was not being used, so I called it specifically, but didn't work at first. Then it did.. still don't know why. Anyhow I thought the default $KCODE setting was causing Windows to use the Unicode versions by default.
    I tried changing it to ASCII and it didn't seem to make any difference as far as the Win32API call. Every time it failed, and I made a change, I totally rebooted SU. And then it suddenly worked, not because of anything I did. Anyhow, weird glitch maybe... memory gremlins!

    What do you suggest? always use the Unicode Win32API calls?

    thomthom wrote:AFAIK the only Unicode support that 1.8 has is in the RegEx functions. Or have I missed something?
    Well... the book says "$-K Sets the multibyte coding system for strings and regular expressions."

      Re: [Code] UnicodeEx - (0.2.0a) Sketchup + Character Encoding

      Postby thomthom » Fri Nov 27, 2009 11:25 pm

      Dan Rathbun wrote:What do you suggest? always use the Unicode Win32API calls?

      When I call Win APIs I had to call directly to the W version of the APIs.

      For example, for the Kernel32 function FindFirstFile I must call FindFirstFileW directly, because trying to call FindFirstFile will use FindFirstFileA. At least in SU7.0. I have not tried this after 7.1.

      Re: [Code] UnicodeEx - (0.2.0a) Sketchup + Character Encoding

      Postby thomthom » Fri Nov 27, 2009 11:30 pm

      Dan Rathbun wrote:Well... the book says "$-K Sets the multibyte coding system for strings and regular expressions."

      hmm... when does Ruby 1.8 ever treat strings as multibyte? From all my testing I found it to always treat strings as sets of single bytes. Though, please enlighten me if I'm incorrect - as that would be very interesting.


      For treating strings I've been using pack('U*') and unpack('U*') - then using the source code for the original String methods to recreate them in Unicode.

      Re: [Code] UnicodeEx - (0.2.0a) Sketchup + Character Encoding

      Postby Dan Rathbun » Fri Nov 27, 2009 11:56 pm

      thomthom wrote:For example, for the Kernel32 function FindFirstFile I must call FindFirstFileW directly, because trying to call FindFirstFile will use FindFirstFileA. At least in SU7.0. I have not tried this after 7.1.

      This is not something that is caused by SU or Ruby... this is a Windows 'thang'. (Unless Ruby is somehow screwin' it up...)

      Sounds like you have an ANSI Windows version. Windows is 'supposed' to map the FindFirstFile call to either the ANSI version of the function (FindFirstFileA) or to the Wide version (FindFirstFileW) based on if the UNICODE flag is set at compile time.

      The MSDN website mentions 'extra' files are needed for Unicode support on Windows.

      I thought (maybe I'm wrong,) that most foreign sold Windows versions were specially compiled as Unicode versions.

      But like I said, I was having similar problems, seemed like it was the Wide versions that were being called for me, instead of the ANSI versions. This is strange...

        Re: [Code] UnicodeEx - (0.2.0a) Sketchup + Character Encoding

        Postby Dan Rathbun » Sat Nov 28, 2009 12:01 am

        thomthom wrote:hmm... when does Ruby 1.8 ever treat strings as multibyte? From all my testing I found it to always treat strings as sets of single bytes. Though, please enlighten me if I'm incorrect -

        Looks like you're right, in that respect (referencing your testing.)

        Maybe there's a hidden single/multi-byte flag or switch [for strings] setting we don't know about...

          Re: [Code] UnicodeEx - (0.2.0a) Sketchup + Character Encoding

          Postby thomthom » Sat Nov 28, 2009 12:03 am

          Dan Rathbun wrote:This is not something that is caused by SU or Ruby... this is a Windows 'thang'. (Unless Ruby is somehow screwin' it up...)

          Sounds like you have an ANSI Windows version. Windows is 'supposed' to map the FindFirstFile call to either the ANSI version of the function (FindFirstFileA) or to the Wide version (FindFirstFileW) based on if the UNICODE flag is set at compile time.

          No. That is set per application. If I had an ANSI version of Windows I'd have some big problems with my other applications.

          Re: [Code] UnicodeEx - (0.2.0a) Sketchup + Character Encoding

          Postby thomthom » Sat Nov 28, 2009 12:04 am

          Dan Rathbun wrote:Maybe there's a hidden single/multi-byte flag or switch [for strings] setting we don't know about...

          From all I read on this - multibyte support in String wasn't added until 1.9.


          Re: [Code] UnicodeEx - (0.2.0a) Sketchup + Character Encoding

          Postby thomthom » Sat Nov 28, 2009 12:13 am

          thomthom wrote:
          Dan Rathbun wrote:This is not something that is caused by SU or Ruby... this is a Windows 'thang'. (Unless Ruby is somehow screwin' it up...)

          Sounds like you have an ANSI Windows version. Windows is 'supposed' to map the FindFirstFile call to either the ANSI version of the function (FindFirstFileA) or to the Wide version (FindFirstFileW) based on if the UNICODE flag is set at compile time.

          No. That is set per application. If I had an ANSI version of Windows I'd have some big problems with my other applications.

          To elaborate:
          http://en.wikibooks.org/wiki/Windows_Pr ... ementation

          Applications need to define UNICODE
          Code: Select all
          #define UNICODE
          before including the windows headers - where the compiler then decides to map the API call to the A or W version.

          Re: [Code] UnicodeEx - (0.2.0a) Sketchup + Character Encoding

          Postby thomthom » Sat Nov 28, 2009 12:19 am

          thomthom wrote:
          thomthom wrote:
          Dan Rathbun wrote:This is not something that is caused by SU or Ruby... this is a Windows 'thang'. (Unless Ruby is somehow screwin' it up...)

          Sounds like you have an ANSI Windows version. Windows is 'supposed' to map the FindFirstFile call to either the ANSI version of the function (FindFirstFileA) or to the Wide version (FindFirstFileW) based on if the UNICODE flag is set at compile time.

          No. That is set per application. If I had an ANSI version of Windows I'd have some big problems with my other applications.

          To elaborate:
          http://en.wikibooks.org/wiki/Windows_Pr ... ementation

          Applications need to define UNICODE
          Code: Select all
          #define UNICODE
          before including the windows headers - where the compiler then decides to map the API call to the A or W version.

          Thinking about it, it might not be Sketchup that fails to define UNICODE. That would be odd considering SU deals with UTF-8 internally and can open SU models with Unicode characters.
          But I'm guessing it's the Ruby binaries that aren't compiled with that flag.

          Re: [Code] UnicodeEx - (0.2.0a) Sketchup + Character Encoding

          Postby tomasz » Mon Nov 30, 2009 9:22 am

          What I am trying to achieve is make Ruby create this file:
          file = File.new('C:\Półka\Test.xml',"w")
          instead of stopping a script execution with an error:
          Error: #<Errno::ENOENT: No such file or directory - C:\Półka\Test.xml>
          For the time being IO operations on a file are not a problem for me. Is there a way to convert the 'C:\Półka\Test.xml' string into something that will be recognized by Windows?

          Thanks
          Tomasz

          Re: [Code] UnicodeEx - (0.2.0a) Sketchup + Character Encoding

          Postby thomthom » Mon Nov 30, 2009 10:35 am

          But that is an IO error. You're trying to create a new file with Unicode characters in the path.

          You won't get around it by converting the string with the Unicode path to a different encoding, because the file is located under the folder named "Półka" and that's where you need to tell Windows to look. Which means you need to give a Unicode string, which the Ruby IO methods don't handle.
          What you need is to call the Unicode APIs that create a file.

          Re: [Code] UnicodeEx - (0.2.0a) Sketchup + Character Encoding

          Postby Dan Rathbun » Mon Nov 30, 2009 4:26 pm

          thomthom wrote:... Which means you need to give a Unicode string, which the Ruby IO methods don't handle.
          What you need is to call the Unicode APIs that create a file.

          OK I agree with that.

          It's the File and Dir classes that STILL seem to have problems on Windows, even for Ruby ver 1.9.1
          see this bug report
          (I'd think the easiest solution would be to add a new parameter to many of the File and Dir class methods, ie "ANSI|UNICODE" for the mswin32 edition, that would give ruby coders a 'high-level' ruby way of forcing which API call to use, [ie: Ansi or Wide] without having to resort to direct API calls.)

          By the way several people have created unicode libraries (extensions) for string and character.. also iconv is mentioned.
          An old (2005) unicode library, this may be obsolete
          A list of extensions or gems at rubyforge for unicode and unidecode

            Re: [Code] UnicodeEx - (0.2.0a) Sketchup + Character Encoding

            Postby tomasz » Tue Dec 01, 2009 9:14 am

            thomthom wrote:But that is an IO error. You're trying to create a new file with Unicode characters in the path.

            Can a file be created through WIN32ole.so and returned as a Ruby variable and could all writing to the file go through that extension?

            Re: [Code] UnicodeEx - (0.2.0a) Sketchup + Character Encoding

            Postby thomthom » Tue Dec 01, 2009 9:21 am

            I have no experience with .so files. :/