
UTF-8

I really, really want UTF-8. I do not use any legacy console apps from the '80s, so please, please, please make it happen :)

733 votes

    Alexandr Marchenko shared this idea
    on the backlog  ·  Microsoft Console Team responded

    As Chip said, UTF-16 is rather baked into the stuff we do. The console host does use UTF-16 somewhere in there. :P We just have the matter of dealing with code pages throughout the history of computing existence that causes us heartburn every time we think about how to fix this. :( It’s definitely something that we would like to look into on our backlog.

    —Michael

    21 comments

      • Jay Jech commented

        Fix this, Microsoft! This cannot remain a backlog item any longer; it's now 2018! UTF-8 has been around since the early 90s, and it's been the dominant character encoding worldwide for almost a decade. Look at all the comments you've received since 2014 (some of which are great suggestions). This is hindering your customers' productivity across the globe!!!

      • Roy Tinker commented

        Perhaps you guys could just create a new console host with UTF-8 support, without getting rid of the old one --- sort of like Edge is to IE11.

      • Bo commented

        Agree. After installing the Chinese language pack, I can no longer permanently change the font of the bash console....

        Please help!!!

      • Peter Laursen commented

        I can use chcp 65001

        .. BUT I cannot find a font that works properly in cmd/bash. "Lucida Console" is often recommended but does not work with Chinese, Arabic and more. I tried a few TT fonts, including Courier New. But the console does not work well with accented etc. TT fonts in t

        See this MySQL result set in SQLyog (a MySQL GUI client) in Windows

        id      english   native
        ------  --------  --------------------------
        1       danish    dansk
        2       russisk   ру́сский
        3       turkish   Türkçe
        4       mandarin  官話
        7       arabic    العَرَبِيَّة

        (the font I use in this program is Courier New, and it works here, as you can see)

        .. but there are lots of display issues in "cmd" with any font. Whether I run the 'mysql' client for Windows or the Linux build in 'bash', the output does not display correctly with any font, even if I use "chcp 65001".

        Bash is natively UTF-8, and the Windows console should work with that. Neither "cmd" nor "PowerShell" does. This exposes the primitiveness of "cmd" (and "PowerShell" too in this respect).

        It may be enough to extend the "Lucida Console" font to cover more languages. I believe this is only a font problem in the console(s). The byte sequences are probably correct, but what good is that if they cannot be displayed?

      • Sebastian Godelet commented

        The existing font selection code for the console host is a pain to configure; it doesn't work out of the box. Also, instead of having to type chcp 65001 all the time, why not make it the default? It is not as if anybody can rely on any default value at the moment anyway.

      • Ja commented

        Come on MS, it's 2016 and still no UTF-8? I can't input any CJK characters using the built-in input method. This is a real showstopper for developing with bash on Windows.

      • xilun commented

        Now that UTF-8 seems to be more or less supported (at least when using WSL bash in Insider builds), that the new console has new features, and that there is the concept of the legacy console, please consider switching the new console's default "MBCS" code page to UTF-8 instead of OEM. The existence of OEM is completely hellish when using redirections and/or outputting strings from various sources (exception .what(), FormatMessage, filenames, user input), and even more so in library code (where you can't change the CP and/or the locale).
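To illustrate the redirection problem described above, here is a minimal Python sketch. It assumes a Western system where the ANSI code page is 1252 and the OEM code page is 437. Those two values are an assumption chosen for illustration; the comment does not name specific code pages.

```python
# Assumption: ANSI code page 1252 and OEM code page 437 (a common US-English pairing).
ANSI = "cp1252"
OEM = "cp437"

text = "café"

# A program that writes ANSI-encoded bytes to a pipe or file...
ansi_bytes = text.encode(ANSI)

# ...produces mojibake when the reader interprets the same bytes as OEM.
misread = ansi_bytes.decode(OEM)

# If both sides agree on UTF-8, the text survives the round trip.
utf8_round_trip = text.encode("utf-8").decode("utf-8")

print(repr(text), "read back as", repr(misread))
print("UTF-8 round trip intact:", utf8_round_trip == text)
```

The bytes themselves are unchanged in the mismatched case. Only the interpretation differs, which is why the problem tends to surface with redirection and with strings coming from mixed sources.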

        It's doubtful that OEM still serves any purpose on 64-bit editions, because there is no DOS VM there. I could not find figures for the general case (only those published by Steam, but the gamer market is obviously biased in favor of 64-bit), but I would not be surprised if Windows 10 is at least 90% 64-bit on new computers (64-bit was already at least 50% in the Windows 7 era, according to MS's own figures). On 32-bit editions, the number of people using localized DOS programs should be quite small. And as I said, you can always keep the historical OEM as the default on the legacy console and/or use OEM just for DOS programs, so good compatibility should be possible for nearly all scenarios, including rare ones.

      • eryk sun commented

        Codepage 65001 (UTF-8) is still broken in the console. On the plus side, as of Windows 10, WriteFile to a console handle finally works correctly. It no longer confuses buffered writers by returning the number of UTF-16 codes written instead of the number of bytes written. But there's still no support for writing a UTF-8 encoded character split across 2 writes, which can happen when using a buffered writer such as a C FILE stream.
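The split-write case can be sketched in Python with the standard library's incremental decoder. This is only an illustration of how a consumer could carry an incomplete UTF-8 sequence from one write to the next; it is not a description of how conhost.exe works, and `decode_chunks` is a hypothetical helper name.

```python
import codecs

def decode_chunks(chunks):
    """Decode UTF-8 byte chunks, carrying incomplete sequences over to the next chunk."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    pieces = [decoder.decode(chunk) for chunk in chunks]
    # final=True raises UnicodeDecodeError if a sequence is still incomplete at the end.
    pieces.append(decoder.decode(b"", final=True))
    return pieces

encoded = "\u00e9".encode("utf-8")  # "é" is two bytes in UTF-8: 0xC3 0xA9

# Simulate a buffered writer flushing in the middle of the character.
chunks = [b"caf" + encoded[:1], encoded[1:] + b"!"]

pieces = decode_chunks(chunks)
print(pieces)

# Decoding each chunk on its own fails on the dangling lead byte.
try:
    chunks[0].decode("utf-8")
    stateless_ok = True
except UnicodeDecodeError:
    stateless_ok = False
print("stateless decode of first chunk succeeded:", stateless_ok)
```

The stateful decoder holds back the lead byte until its continuation byte arrives, so the reassembled text is intact even though neither write contained the whole character.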

        What's worse, the ReadConsoleA implementation in conhost.exe makes an incorrect assumption when calling WideCharToMultiByte in a Western locale. It assumes the current codepage is ANSI, in which a UTF-16 code maps to a single byte. So it tries to encode N UTF-16 codes as N bytes. This fails if even one non-ASCII character is entered (since that's at least 2 bytes when encoded as UTF-8), and it returns back to the client that it successfully read 0 bytes. With CPython, for example, this is interpreted as EOF, so the REPL quietly quits and input() raises EOFError. Maybe the console could instead assume the worst case that each UTF-16 code maps to 4 UTF-8 bytes.
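The size mismatch described above can be checked with a short Python sketch. The helper names (`utf16_units`, `utf8_bytes`, `worst_case_utf8_buffer`) are hypothetical and exist only for this illustration. The 4-bytes-per-code-unit bound is the one proposed in the comment; it is conservative, since a character in the Basic Multilingual Plane needs at most 3 UTF-8 bytes and a surrogate pair needs 4 bytes for 2 code units.

```python
def utf16_units(s):
    """Number of UTF-16 code units needed to represent s."""
    return len(s.encode("utf-16-le")) // 2

def utf8_bytes(s):
    """Number of bytes needed to represent s in UTF-8."""
    return len(s.encode("utf-8"))

def worst_case_utf8_buffer(n_units, bytes_per_unit=4):
    """Buffer size that is always large enough for n_units UTF-16 code units."""
    return n_units * bytes_per_unit

# ASCII, Latin-1 supplement, CJK, and an astral character (emoji).
samples = ["a", "\u00e9", "\u5b98", "\U0001F600"]

sizes = [(utf16_units(s), utf8_bytes(s)) for s in samples]
for s, (units, size) in zip(samples, sizes):
    fits = size <= units  # the "N code units fit in N bytes" assumption
    print(repr(s), "units:", units, "utf-8 bytes:", size, "fits in N bytes:", fits)
```

Only the ASCII sample fits under the N-units-to-N-bytes assumption, which matches the failure reported for any non-ASCII input.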

        On the subject of internationalization, the console shouldn't require a DBCS codepage to mix fullwidth (2 cells) and halfwidth (1 cell) glyphs. There should be a locale-neutral implementation based on Unicode character properties. It also should be able to render characters that require multiple UTF-16 codes, such as decomposed characters and surrogate pairs for astral characters such as emojis.
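A locale-neutral width calculation based on Unicode character properties might look roughly like the Python sketch below. It is a simplification under stated assumptions: it treats East Asian Wide and Fullwidth characters as 2 cells, combining marks as 0 cells, and everything else as 1 cell, and it ignores control characters, ambiguous-width characters, and emoji sequences. `cell_width` and `display_width` are hypothetical names.

```python
import unicodedata

def cell_width(ch):
    """Approximate number of console cells for a single code point."""
    if unicodedata.combining(ch):
        return 0  # combining marks stack on the previous character
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # Wide or Fullwidth
    return 1

def display_width(text):
    """Approximate number of console cells for a string."""
    return sum(cell_width(ch) for ch in text)

print(display_width("abc"))           # halfwidth Latin letters
print(display_width("\u5b98\u8a71"))  # two CJK ideographs
print(display_width("e\u0301"))       # "e" followed by a combining acute accent
```

Python iterates over code points rather than UTF-16 code units, so surrogate pairs do not need special handling here. An implementation working directly on UTF-16 would have to combine them first.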

      • mobluse commented

        I cannot enter @£${[]}\ and many other characters using a Swedish keyboard.

      • 欧文 commented

        Agreed. It's 2016! UTF-8 really needs to be the default for conhost. If you need to choose another codepage, then choose it as an option, but UTF-8 is the sensible default.

      • Shunichi Arai commented

        I'm a Japanese Ruby developer using a Windows environment. Currently it is a huge pain that cmd.exe supports only SJIS encoding, and its support for the Japanese input method is quite horrible. I would like good multilingual support with Bash on Windows.

      • Adetula commented

        UTF-8 support in Windows is the major issue. Adding support by simply converting UTF-8 to UTF-16, just as is done with ASCII, would solve a lot of problems when working with Windows.

      • Chip Locke commented

        What I want is for PowerShell to let me set a default text encoding when piping. And yes. That I want to be UTF8.

        This is a bigger "thing" in Windows because UTF16 is really heavily baked into (e.g.) .NET. Still, a lot of other OSS tools assume UTF8 or in some cases don't even understand UTF16 at all no matter what you do.
