The reason you see so many complicated solutions to this problem is that, by definition, it is not solvable. The process of encoding a string of text is non-deterministic: different combinations of text and encoding can produce the same stream of bytes. Therefore, strictly logically, it is impossible to determine the encoding, character set, and text from the byte stream alone.
That said, it is possible to get results that are quite "close" using heuristics, since there is a finite set of encodings you will find in the wild, and with a sufficiently large sample a program can usually determine the encoding. Whether the results are good enough depends on the application.
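To illustrate the idea, here is a minimal Python sketch with a hand-picked candidate list; real detectors such as chardet use far richer statistics than this trivial trial-decode:

```python
# Minimal sketch: try a fixed list of encodings seen in the wild and
# return the first one under which the bytes decode without error.
# Several encodings may decode the same bytes, so "first match" is
# only a heuristic, not a proof of the true encoding.
CANDIDATES = ["utf-8", "cp1252", "iso-8859-1"]

def guess_encoding(data: bytes) -> str | None:
    for enc in CANDIDATES:
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None  # none of the candidates fit

print(guess_encoding(bytes([0xC3, 0x84])))  # 'utf-8' -- though cp1252 would also decode it
```

Note that iso-8859-1 accepts every possible byte sequence, so it acts as a catch-all at the end of the list; the order of the candidates encodes our prior belief about which encodings are more likely.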
I'd like to comment on the question of user data. All data posted from a web page has a known encoding (the POST arrives in the encoding the developer defined for that page). If the user pastes text into a form field, the browser interprets that text based on the encoding of the source data (as known to the operating system) and the encoding of the page, and transcodes it if necessary. It is too late to detect the encoding on the server, because the browser may already have transformed the byte stream based on the assumed encoding.
For example, if I type the letter Ä on my German keyboard into a page encoded as UTF-8, the two bytes xC3 x84 are sent to the server. This is a valid EBCDIC string representing the letters C and d. It is also a valid ANSI string representing the two characters Ã and „. However, no matter what I try, I cannot paste an ANSI-encoded string into a browser form and have it interpreted as UTF-8: since the operating system knows that I am pasting ANSI (I copied the text from Textpad, where I had created a text file with ANSI encoding), it transcodes it to UTF-8, and the resulting byte stream is xC3 x83 xE2 x80 x9E.
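The byte arithmetic in this example can be checked directly. Here is a Python sketch; I'm using cp500 as a stand-in for EBCDIC and cp1252 for ANSI, which is an assumption about the exact code pages involved:

```python
data = bytes([0xC3, 0x84])  # what the browser sends for Ä on a UTF-8 page

print(data.decode("utf-8"))   # 'Ä'  -- the intended interpretation
print(data.decode("cp500"))   # 'Cd' -- the same bytes read as EBCDIC
print(data.decode("cp1252"))  # 'Ã„' -- the same bytes read as ANSI

# The OS-level transcoding described above: treat the bytes as ANSI,
# then re-encode the resulting characters as UTF-8.
print(data.decode("cp1252").encode("utf-8").hex(" "))  # 'c3 83 e2 80 9e'
```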
My point is that if a user manages to submit garbage, it is quite possibly because it was already garbage at the time it was pasted into the browser form, for example because the client lacked proper character set support. Since character encoding is not deterministic, you cannot expect a trivial method to detect this situation.
Unfortunately, the problem remains for uploaded files. The only reliable solution I see is to show the user a section of the file and ask whether it is interpreted correctly, cycling through different encodings until it is.
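A rough sketch of that "show and ask" loop in Python, where the file name and candidate list are made up for illustration:

```python
# Decode the same byte sample under several candidate encodings and
# print a preview of each, so a human can pick the one that looks right.
CANDIDATES = ["utf-8", "cp1252", "iso-8859-1", "cp500"]

def show_previews(data: bytes, length: int = 120) -> None:
    for enc in CANDIDATES:
        try:
            preview = data[:length].decode(enc)
        except UnicodeDecodeError:
            preview = "(not a valid byte sequence in this encoding)"
        print(f"{enc:>12}: {preview}")

with open("upload.txt", "rb") as f:  # hypothetical uploaded file
    show_previews(f.read())
```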
Or, we could develop a heuristic that takes into account the probability of certain characters appearing in different languages. Say my uploaded text file contains the two bytes xC3 x84 and nothing else. Such a method could detect that the letter Ä is quite common in German text, while the characters Ã and „ together are unusual in any language, and thus determine that the encoding of my file is indeed UTF-8. This is roughly the level of sophistication such a heuristic has to deal with, and the more statistical and linguistic facts it can use, the more reliable its results will be.
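A toy version of such a heuristic in Python; the character table here is a stand-in for real per-language frequency statistics, which is the hard part this sketch skips:

```python
# Score each candidate decoding by how plausible its characters are for
# German text; pick the encoding whose decoding scores best.
PLAUSIBLE = set("abcdefghijklmnopqrstuvwxyzäöüß .,")  # toy frequency table

def plausibility(text: str) -> float:
    # Fraction of characters that look like ordinary German text;
    # oddballs like 'Ã' followed by '„' drag the score down.
    return sum(c.lower() in PLAUSIBLE for c in text) / max(len(text), 1)

def best_guess(data: bytes, candidates=("utf-8", "cp1252")) -> str:
    scores = {}
    for enc in candidates:
        try:
            scores[enc] = plausibility(data.decode(enc))
        except UnicodeDecodeError:
            scores[enc] = -1.0  # invalid decodings are ruled out entirely
    return max(scores, key=scores.get)

print(best_guess(bytes([0xC3, 0x84])))  # 'utf-8': 'Ä' beats 'Ã„'
```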