eryk sun

My feedback

  1. 1,509 votes

      eryk sun supported this idea  · 
    • 710 votes

        As Chip said, UTF-16 is rather baked into the stuff we do. The console host does use UTF-16 somewhere in there. :P We just have the matter of dealing with code pages throughout the history of computing existence that causes us heartburn every time we think about how to fix this. :( It’s definitely something that we would like to look into on our backlog.

        —Michael

        eryk sun supported this idea  · 
        eryk sun commented  · 

        Codepage 65001 (UTF-8) is still broken in the console. On the plus side, as of Windows 10, WriteFile to a console handle finally works correctly: it no longer confuses buffered writers by returning the number of UTF-16 code units written instead of the number of bytes written. But there is still no support for writing a UTF-8 encoded character that is split across two writes, which can happen with a buffered writer such as a C FILE stream.
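To illustrate the split-write case, here is a minimal Python sketch of how a consumer can carry an incomplete UTF-8 sequence from one write to the next. It uses the standard library's incremental decoder; the helper name decode_chunks is made up for this example, and it says nothing about how conhost.exe is actually implemented.

```python
import codecs

def decode_chunks(chunks):
    """Decode UTF-8 byte chunks, buffering any incomplete trailing
    sequence until the next chunk supplies the remaining bytes."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    pieces = [decoder.decode(chunk) for chunk in chunks]
    # final=True raises UnicodeDecodeError if bytes are still pending.
    pieces.append(decoder.decode(b"", final=True))
    return pieces

encoded = "é".encode("utf-8")            # two bytes: b'\xc3\xa9'
first_write = b"caf" + encoded[:1]       # ends in the middle of the character
second_write = encoded[1:] + b"\n"       # starts with the continuation byte

print(decode_chunks([first_write, second_write]))
```

The first call yields only "caf" and holds back the lead byte; the second call completes the character. A receiver that decodes each write independently would instead see two invalid sequences.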

        What's worse, the ReadConsoleA implementation in conhost.exe makes an incorrect assumption when calling WideCharToMultiByte in a Western locale. It assumes the current codepage is an ANSI codepage in which each UTF-16 code unit maps to a single byte, so it tries to encode N UTF-16 code units as N bytes. This fails if even one non-ASCII character is entered (since that takes at least 2 bytes when encoded as UTF-8), and the call reports to the client that it successfully read 0 bytes. CPython, for example, interprets this as EOF, so the REPL quietly quits and input() raises EOFError. Maybe the console could instead assume the worst case that each UTF-16 code unit maps to 4 UTF-8 bytes.
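The size mismatch is easy to see by comparing UTF-16 code unit counts with UTF-8 byte counts. The sketch below is only an illustration in Python (the helper names are made up here); it does not call ReadConsoleA or WideCharToMultiByte.

```python
def utf16_units(text):
    """Number of UTF-16 code units needed for text."""
    return len(text.encode("utf-16-le")) // 2

def utf8_bytes(text):
    """Number of bytes needed for text in UTF-8."""
    return len(text.encode("utf-8"))

# ASCII, Latin-1 range, rest of the BMP, and an astral character.
for sample in ["a", "é", "€", "😀"]:
    units, size = utf16_units(sample), utf8_bytes(sample)
    fits = "fits" if size <= units else "does not fit"
    print(f"U+{ord(sample):04X}: {units} code unit(s), {size} byte(s), "
          f"{fits} in an N-byte buffer")
```

Only the ASCII sample fits when the buffer is sized at one byte per code unit. In these samples the ratio never exceeds 3 bytes per code unit (the astral character uses 2 code units for its 4 bytes), so the 4-byte figure suggested above is a safe over-estimate.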

        On the subject of internationalization, the console shouldn't require a DBCS codepage to mix fullwidth (2 cells) and halfwidth (1 cell) glyphs. There should be a locale-neutral implementation based on Unicode character properties. It should also be able to render characters that require multiple UTF-16 code units, such as decomposed characters and the surrogate pairs used for astral characters such as emoji.
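As a rough idea of what a locale-neutral width calculation could look like, here is a Python sketch based on the East_Asian_Width and combining-class properties exposed by the standard unicodedata module. The function names are hypothetical and the rules are deliberately simplified: ambiguous-width characters are treated as 1 cell, and control characters and zero-width joiners are not handled.

```python
import unicodedata

def cell_width(ch):
    """Approximate number of console cells for one code point."""
    if unicodedata.combining(ch):
        return 0  # combining marks attach to the preceding base character
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # Wide or Fullwidth
    return 1      # everything else, including Ambiguous, as a simplification

def display_width(text):
    """Approximate total cell width of text, independent of the codepage."""
    return sum(cell_width(ch) for ch in text)

print(display_width("abc"))        # halfwidth Latin letters
print(display_width("日本語"))      # wide CJK ideographs
print(display_width("e\u0301"))    # decomposed: base letter plus combining accent
```

Python strings are indexed by code point, so an astral character counts as one element here; a UTF-16 implementation would first have to combine each surrogate pair before looking up its properties.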

      • 1,310 votes

          eryk sun supported this idea  · 
