eryk sun

My feedback

  1. 1,667 votes

    eryk sun supported this idea
  2. 779 votes

    As Chip said, UTF-16 is rather baked into the stuff we do. The console host does use UTF-16 somewhere in there. :P What gives us heartburn every time we think about how to fix this is having to deal with the code pages accumulated over the history of computing. :( It’s definitely something on our backlog that we would like to look into.

    —Michael

    eryk sun supported this idea
    eryk sun commented

    Codepage 65001 (UTF-8) is still broken in the console. On the plus side, as of Windows 10, WriteFile to a console handle finally works correctly: it no longer confuses buffered writers by returning the number of UTF-16 code units written instead of the number of bytes written. But there's still no support for writing a UTF-8 encoded character that is split across two writes, which can happen when using a buffered writer such as a C FILE stream.
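To illustrate the split-write problem, here is a minimal Python sketch. It does not call the console API. It only models a receiver that decodes each write on its own versus one that carries incomplete trailing bytes over to the next write. The function names are made up for this example.

```python
import codecs

def decode_chunks_naively(chunks):
    """Decode every write independently, with no state kept between writes."""
    return "".join(chunk.decode("utf-8", errors="replace") for chunk in chunks)

def decode_chunks_incrementally(chunks):
    """Keep an incomplete trailing sequence and finish it on the next write."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    parts = [decoder.decode(chunk) for chunk in chunks]
    parts.append(decoder.decode(b"", final=True))  # flush any leftover bytes
    return "".join(parts)

data = "na\u00efve".encode("utf-8")   # b'na\xc3\xafve'
chunks = [data[:3], data[3:]]         # the split lands inside the 2-byte "\u00ef"

naive = decode_chunks_naively(chunks)
incremental = decode_chunks_incrementally(chunks)
```

With the naive decoder both halves of the split character turn into U+FFFD replacement characters, while the incremental decoder returns the original text. A buffered writer can produce exactly this kind of split whenever its buffer fills in the middle of a multibyte sequence.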

    What's worse, the ReadConsoleA implementation in conhost.exe makes an incorrect assumption when calling WideCharToMultiByte in a Western locale. It assumes the current codepage is an ANSI one, in which each UTF-16 code unit maps to a single byte, so it tries to encode N UTF-16 code units into N bytes. This fails if even one non-ASCII character is entered (since that's at least 2 bytes when encoded as UTF-8), and the console then reports to the client that it successfully read 0 bytes. CPython, for example, interprets this as EOF, so the REPL quietly quits and input() raises EOFError. Maybe the console could instead assume the worst case, that each UTF-16 code unit maps to 4 UTF-8 bytes.
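The size mismatch is easy to check outside the console. The sketch below (plain Python, helper names invented for illustration) compares the number of UTF-16 code units in a string with the number of bytes it needs as UTF-8. As far as I know, 3 bytes per code unit is already enough for well-formed UTF-16, because a character outside the BMP takes 4 UTF-8 bytes but also 2 code units, so the 4-byte assumption is a safe over-estimate.

```python
def utf16_units(text):
    """Number of UTF-16 code units needed for text."""
    return len(text.encode("utf-16-le")) // 2

def utf8_bytes(text):
    """Number of bytes needed for text when encoded as UTF-8."""
    return len(text.encode("utf-8"))

def fits_one_byte_per_unit(text):
    """True only if an N-byte buffer is enough for N UTF-16 code units."""
    return utf8_bytes(text) <= utf16_units(text)

samples = {
    "ascii": "abc",               # 3 code units, 3 bytes
    "latin": "\u00e9",            # 1 code unit, 2 bytes
    "euro": "\u20ac",             # 1 code unit, 3 bytes
    "emoji": "\U0001F600",        # 2 code units (surrogate pair), 4 bytes
}

for name, text in samples.items():
    print(name, utf16_units(text), utf8_bytes(text), fits_one_byte_per_unit(text))
```

Only the pure ASCII sample fits in a buffer sized at one byte per code unit, which matches the failure described above for any non-ASCII input.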

    On the subject of internationalization, the console shouldn't require a DBCS codepage to mix fullwidth (2 cells) and halfwidth (1 cell) glyphs. There should be a locale-neutral implementation based on Unicode character properties. It also should be able to render characters that require multiple UTF-16 code units, such as decomposed characters and the surrogate pairs of astral characters such as emoji.
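As a rough sketch of what a locale-neutral width calculation could look like, the Python below uses the standard unicodedata module: the East Asian Width property decides between 1 and 2 cells, and combining marks take no cell of their own. This is an assumption about one possible approach, not how the console works, and the function names are invented here. A real implementation would also need grapheme cluster segmentation and handling for zero-width and control characters.

```python
import unicodedata

def cell_width(ch):
    """Approximate number of console cells for a single code point."""
    if unicodedata.combining(ch):
        return 0                  # combining mark is drawn over the previous cell
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2                  # Wide or Fullwidth
    return 1

def display_width(text):
    """Approximate number of console cells for a string."""
    return sum(cell_width(ch) for ch in text)

print(display_width("abc"))             # three halfwidth letters
print(display_width("\u65e5\u672c"))    # two wide CJK ideographs
print(display_width("e\u0301"))         # "e" plus a combining acute accent
```

Python iterates strings by code point, so a character stored as a surrogate pair in UTF-16 is measured once here rather than once per code unit.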

  3. 1,452 votes

    eryk sun supported this idea
