eryk sun

My feedback

  1. 1,662 votes

    eryk sun supported this idea
  2. 774 votes

    As Chip said, UTF-16 is rather baked into the stuff we do. The console host does use UTF-16 somewhere in there. :P We just have the matter of dealing with code pages throughout the history of computing existence that causes us heartburn every time we think about how to fix this. :( It’s definitely something that we would like to look into on our backlog.

    —Michael

    eryk sun supported this idea
    eryk sun commented

    Codepage 65001 (UTF-8) is still broken in the console. On the plus side, as of Windows 10, WriteFile to a console handle finally works correctly: it no longer confuses buffered writers by returning the number of UTF-16 code units written instead of the number of bytes written. But there's still no support for a UTF-8 encoded character that is split across 2 writes, which can happen with a buffered writer such as a C FILE stream.
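
To illustrate the split-write problem, here is a minimal Python sketch. It is not the console's code, just an assumption about what "supporting a split character" would look like on the receiving side: the incomplete trailing bytes of one write are held back and completed by the next. The helper name `decode_chunks` is made up for this example.

```python
import codecs

def decode_chunks(chunks):
    """Decode byte chunks as UTF-8, carrying incomplete sequences across chunks."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    pieces = [decoder.decode(chunk) for chunk in chunks]
    # final=True raises UnicodeDecodeError if a sequence is still incomplete
    pieces.append(decoder.decode(b"", final=True))
    return pieces

e_acute = "é".encode("utf-8")  # two bytes: 0xC3 0xA9

# A buffered writer may flush in the middle of the 2-byte sequence.
split_writes = [b"caf" + e_acute[:1], e_acute[1:] + b"\n"]

# Decoding each write on its own mangles the split character.
naive = [chunk.decode("utf-8", errors="replace") for chunk in split_writes]

# Keeping decoder state between writes reassembles it.
buffered = decode_chunks(split_writes)
```

The naive version yields replacement characters for both halves of the split sequence, while the stateful version reassembles the original text.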

    What's worse, the ReadConsoleA implementation in conhost.exe makes an incorrect assumption when calling WideCharToMultiByte in a Western locale. It assumes the current codepage is a single-byte ANSI codepage, in which each UTF-16 code unit maps to one byte, so it tries to encode N UTF-16 code units into N bytes. This fails if even one non-ASCII character is entered (since that's at least 2 bytes when encoded as UTF-8), and the console then reports to the client that it successfully read 0 bytes. CPython, for example, interprets this as EOF, so the REPL quietly quits and input() raises EOFError. Maybe the console could instead assume the worst case, in which each UTF-16 code unit needs up to 3 UTF-8 bytes (a surrogate pair is 2 code units and encodes to 4 bytes).
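
The size mismatch is easy to check outside the console. The following Python sketch only does the arithmetic; it does not call ReadConsoleA or WideCharToMultiByte, and the helper names are made up for illustration. It compares the number of UTF-16 code units with the number of UTF-8 bytes for a few inputs and checks that a budget of 3 bytes per code unit (or anything larger) is always enough.

```python
def utf16_units(text):
    """Number of UTF-16 code units needed for text."""
    return len(text.encode("utf-16-le")) // 2

def utf8_bytes(text):
    """Number of bytes needed for text in UTF-8."""
    return len(text.encode("utf-8"))

# ASCII, Latin-1 range, rest of the BMP, and an astral character.
samples = ["abc", "é", "€", "😀"]

# (text, code units, bytes) for each sample
sizes = [(s, utf16_units(s), utf8_bytes(s)) for s in samples]

for text, units, nbytes in sizes:
    # An N-byte buffer for N code units is only enough for pure ASCII.
    fits_in_n_bytes = nbytes <= units
    # 3 bytes per code unit covers every case, surrogate pairs included.
    assert nbytes <= 3 * units
    print(f"{text!r}: {units} code unit(s), {nbytes} byte(s), fits in N bytes: {fits_in_n_bytes}")
```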

    On the subject of internationalization, the console shouldn't require a DBCS codepage to mix fullwidth (2 cells) and halfwidth (1 cell) glyphs. There should be a locale-neutral implementation based on Unicode character properties. It should also be able to render characters that require multiple UTF-16 code units, such as decomposed characters (a base character plus combining marks) and surrogate pairs for astral characters such as emoji.
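
As a rough sketch of what "based on Unicode character properties" could mean, the Python below estimates cell widths from the East Asian Width property and skips combining marks. This is a simplification under stated assumptions, not how any console actually does it: the `cell_width` helper is hypothetical, and it ignores control characters, ambiguous-width characters, and emoji sequences joined with ZWJ. Python strings iterate by code point, so a surrogate pair already counts as one character here.

```python
import unicodedata

def cell_width(text):
    """Estimate how many console cells text occupies, independent of locale."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            # Combining marks are drawn in the cell of the preceding base character.
            continue
        # "W" (wide) and "F" (fullwidth) characters take 2 cells; everything else 1.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width

examples = {
    "abc": cell_width("abc"),            # halfwidth ASCII
    "日本": cell_width("日本"),            # wide CJK ideographs
    "e\u0301": cell_width("e\u0301"),    # decomposed e + combining acute accent
    "\uff41": cell_width("\uff41"),      # fullwidth Latin small letter a
}
```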

  3. 1,446 votes

    eryk sun supported this idea
