Determining a file's encoding in Ruby

I came up with a method to determine the encoding (or at least a guess at it) of a file that I pass in:

def encoding_type(file_path)
 File.read(file_path).encoding.name
end

The problem is that I have a 15 GB file, and this method reads the entire file into memory.

Is there a way to accomplish what this method does without reading the entire file into memory?

2 answers

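The method in your question does not do what you think it does. It does not inspect the bytes at all; the string is simply tagged with Encoding.default_internal or, if that is not set, Encoding.default_external. These are usually UTF-8. Encoding.default_internal is unset by default, so in practice you get Encoding.default_external.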
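
A quick way to check what File.read reports is to write bytes that are not valid UTF-8 and read them back. This is only a sketch: the helper name reported_encoding and the temp file are made up for illustration, and the printed values depend on your Encoding.default_external.

```ruby
require 'tempfile'

# Hypothetical helper: the same call as in the question.
def reported_encoding(file_path)
  File.read(file_path).encoding.name
end

Tempfile.create('latin1') do |f|
  f.binmode
  f.write("caf\xE9".b) # "café" encoded as ISO-8859-1; these bytes are not valid UTF-8
  f.flush

  puts reported_encoding(f.path)         # the default encoding, not a detected one
  puts File.read(f.path).valid_encoding? # false when the default is UTF-8
end
```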

There is no way to be 100% sure what encoding a file is in, so you are right that a guess is the best you can do.


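There are libraries that try to make this guess, for example this one: https://github.com/oleander/rchardet . You can also shell out from Ruby with system() on Linux and call the Linux file command.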

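Either way, you do not need to load the whole file to make a guess. With chardet you can pass in just the first X bytes, for example: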

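 require 'chardet19'

 # Read only the first 1000 bytes instead of the whole file
 first1000bytes = File.read(file, 1000)
 cd = CharDet.detect(first1000bytes)
 cd.encoding   # the guessed encoding
 cd.confidence # how confident the guess is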

You can also ask Ruby itself whether a string is valid in the encoding it is currently tagged with:

 str.valid_encoding?
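
For a file too large to load, the same check can be run one line at a time. The following is a sketch only: file_valid_in? is a made-up name, and it assumes an ASCII-compatible encoding and a file that actually contains line breaks (a file with no newlines would still be read as one huge line).

```ruby
# Hypothetical helper: true if every line of the file is valid in `encoding`.
# Lines are read lazily, so only one line is held in memory at a time,
# and the scan stops at the first invalid line.
def file_valid_in?(file_path, encoding = 'UTF-8')
  File.foreach(file_path, encoding: encoding).all?(&:valid_encoding?)
end
```

Splitting on "\n" is safe for UTF-8 and the ISO-8859 family because the newline byte never occurs inside a multibyte character in those encodings; that is not true of UTF-16 or UTF-32.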

Or you can try several encodings in turn and see which ones are valid:

 orig_encoding = str.encoding

 str.force_encoding("ISO-8859-1").valid_encoding?
 str.force_encoding("UTF-8").valid_encoding?

 str.force_encoding(orig_encoding) # put it back to what it was
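
Combined with reading only the start of the file, this trial-and-error check might look like the sketch below. The name guess_encoding, the candidate list, and the 1000-byte sample size are all assumptions for illustration, and the result is still only a guess.

```ruby
# Hypothetical helper: the first candidate encoding in which a sample of the
# file is valid. ISO-8859-1 accepts any byte sequence, so it goes last as a
# fallback rather than as evidence.
def guess_encoding(file_path, candidates = %w[UTF-8 ISO-8859-1], sample_size = 1000)
  # binread returns nil for an empty file when a length is given
  sample = File.binread(file_path, sample_size) || ''.b

  candidates.find do |name|
    # The sample may stop in the middle of a multibyte character, so also
    # accept it with up to 3 trailing bytes dropped.
    (0..3).any? do |drop|
      keep = sample.bytesize - drop
      keep >= 0 && sample.byteslice(0, keep).force_encoding(name).valid_encoding?
    end
  end
end
```

Because byteslice returns a new string, the sample itself is never retagged, so there is nothing to put back afterwards.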

Keep in mind that a valid result is not proof of the real encoding; it only rules other encodings out.

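If valid_encoding? returns false, you will probably want to replace the bad bytes. String.scrub does this in Ruby 2.1, and there is a pure-Ruby backport of String.scrub for older versions.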
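
As a small illustration (assuming Ruby 2.1 or later and a UTF-8 source file), scrub with an explicit replacement string looks like this:

```ruby
bad = "abc\xFFdef"         # the byte \xFF can never appear in valid UTF-8
puts bad.valid_encoding?   # false

clean = bad.scrub('?')     # replace each invalid byte sequence with "?"
puts clean                 # abc?def
puts clean.valid_encoding? # true
```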



This uses the os gem to detect the operating system.

On OSX or Linux, the file command with -i prints the MIME information:

file -i myfile

myfile: text/plain; charset=iso-8859-1

However, on Mac OSX the flag is a capital -I, so...

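require 'os'
require 'shellwords'

# Returns the charset reported by the file command, or nil if it cannot be determined.
def detect_charset(file_path)
  flag = if OS.mac?
    '-I' # Mac OSX spells the MIME flag with a capital I
  elsif OS.linux?
    '-i'
  end
  return nil unless flag

  # Escape the path so spaces and shell metacharacters are passed safely.
  `file #{flag} #{Shellwords.escape(file_path)}`.strip.split('charset=').last
rescue => e
  Rails.logger.warn "Unable to determine charset of #{file_path}"
  Rails.logger.warn "Error: #{e.message}"
  nil
end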
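
One thing to watch with split('charset=').last is that it returns the whole output when the line contains no charset= field, for example when file prints an error message. A stricter parse, sketched here with a made-up helper name, returns nil in that case:

```ruby
# Hypothetical helper: pull the charset out of one line of `file -i` output.
# Returns nil when the line has no charset= field.
def charset_from_file_output(output)
  match = output.match(/charset=([^\s;]+)/)
  match && match[1]
end

puts charset_from_file_output('myfile: text/plain; charset=iso-8859-1') # iso-8859-1
p charset_from_file_output('myfile: cannot open')                       # nil
```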