How can I auto-detect the encoding of a text file?

Try the chardet Python module, which is available on PyPI:

pip install chardet

Then run chardetect myfile.txt.

Chardet is based on the detection code used by Mozilla, so it should give reasonable results, provided the input text is long enough for statistical analysis. Do read the project documentation.
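If you would rather call chardet from Python than from the command line, a minimal sketch could look like the one below. It assumes chardet is installed; the function name detect_file_encoding and the 64 KiB sample size are illustrative choices of mine, not part of chardet.

```python
from typing import Optional

try:
    import chardet  # third-party: pip install chardet
except ImportError:
    chardet = None  # keeps this sketch importable when chardet is missing


def detect_file_encoding(path: str, sample_size: int = 65536) -> Optional[dict]:
    """Guess a file's encoding from its first `sample_size` bytes.

    Returns the dict produced by chardet.detect (it includes the keys
    'encoding' and 'confidence'), or None if chardet is not installed.
    `detect_file_encoding` and `sample_size` are hypothetical names.
    """
    if chardet is None:
        return None
    # Open in binary mode: the detector needs raw bytes, not decoded text.
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    return chardet.detect(sample)
```

Calling detect_file_encoding("myfile.txt") returns a dict whose 'encoding' value is the guess and whose 'confidence' value is a score between 0 and 1. Reading only a sample keeps detection faster on large files, at the cost of possibly missing non-ASCII bytes that appear later.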

As mentioned in the comments, it is quite slow, but some distributions also ship the original C++ version, as @Xavier found in https://superuser.com/a/609056. There is also a Java version somewhere.

I would use this simple command:

encoding=$(file -bi myfile.txt)

Or, if you want just the actual character set (such as utf-8):

encoding=$(file -b --mime-encoding myfile.txt)
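The same call can be made from a Python script. The sketch below simply shells out to the file utility with the -b and --mime-encoding options shown above; the function name mime_encoding is my own, and it returns None when file is not on the PATH.

```python
import shutil
import subprocess
from typing import Optional


def mime_encoding(path: str) -> Optional[str]:
    """Return the charset printed by `file -b --mime-encoding`.

    Returns None if the `file` utility cannot be found on PATH.
    `mime_encoding` is a hypothetical helper name.
    """
    if shutil.which("file") is None:
        return None
    result = subprocess.run(
        ["file", "-b", "--mime-encoding", path],
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout.strip()
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.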

On Debian-based Linux, the uchardet package (Debian / Ubuntu) provides a command line tool. See the package description below:

 universal charset detection library - cli utility
 .
 uchardet is a C language binding of the original C++ implementation
 of the universal charset detection library by Mozilla.
 .
 uchardet is a encoding detector library, which takes a sequence of
 bytes in an unknown character encoding without any additional
 information, and attempts to determine the encoding of the text.
 .
 The original code of universalchardet is available at
 http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet
 .
 Techniques used by universalchardet are described at
 http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html