How can I get Ruby's Shellwords.shellescape to work with multibyte characters?

I am trying to call exec with an argument that contains multibyte characters that come from an environment variable in Windows but have not yet found a solution that works. Here is what I have been able to debug so far.

For simplicity, suppose I have a directory called "Seán" that I am trying to use as an argument to exec. If I just call

exec 'script', "Se\u00E1n".encode("IBM437") 

The exec'ed script cannot find the file, because the argument is mangled and the accented character is lost. The following works, but it is bad practice, since the argument should be escaped before it is passed to the shell.

 exec "script #{"Se\u00E1n".encode("IBM437")}" 

So I thought I would just use shellescape to make the call to exec safe.

 require 'shellwords'
 exec "script #{"Se\u00E1n".encode("IBM437").shellescape}"

But the problem is that it escapes the accented character, so the result looks like "Se\án". I tracked down where this happens; it comes from this regex:

 str.gsub!(/([^A-Za-z0-9_\-.,:\/@\n])/, "\\\\\\1") 

At first glance, it escapes every character that is not in a known-good set of shell-safe characters. Unfortunately, that set does not contain accented characters, so I run into problems.

I am looking for a regular expression that performs shell escaping without mangling accented characters, so that I can escape these arguments before passing them to exec.

2 answers

The regular expression /([^A-Za-z0-9_\-.,:\/@\n])/ treats only ASCII letters and digits as safe, not all Unicode letters. [^...] is a negated character class that matches every character other than those listed in the class. Thus characters such as á or Ą are escaped by this expression, since they do not match [A-Za-z].
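As a minimal sketch of that behaviour, the snippet below applies the same negated class to the directory name from the question. The names `ASCII_SAFE` and `ascii_only_escape` are hypothetical, introduced only for illustration, and the input is assumed to be a UTF-8 string.

```ruby
# The negated character class quoted in the question (hypothetical constant name).
ASCII_SAFE = /([^A-Za-z0-9_\-.,:\/@\n])/

# Hypothetical helper: prefix every character outside the ASCII
# whitelist with a backslash, as the quoted gsub! call does.
def ascii_only_escape(str)
  str.gsub(ASCII_SAFE, "\\\\\\1")
end

puts ascii_only_escape("Se\u00E1n")  # prints Se\án (the accented letter is escaped)
puts ascii_only_escape("plain.txt")  # prints plain.txt (nothing to escape)
```

The accented letter is a single character that is not listed in the class, so it is captured and prefixed with a backslash, which matches the symptom described in the question.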

You need to add Unicode property classes so that all Unicode letters and digits are excluded from escaping. To be safer still, we can add the mark class so that combining diacritics are also left untouched:

 str.gsub(/([^\p{L}\p{M}\p{N}_.,:\/@\n-])/, "\\\\\\1") 

Here \p{L} matches any Unicode letter, \p{M} matches any combining mark (diacritic), and \p{N} matches any Unicode number.

Note that a hyphen does not need to be escaped when it is placed at the beginning or end of a character class (or directly after a valid range or a shorthand character class).
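A rough sketch of how the Unicode-aware class could be wrapped in a helper follows. The names `UNICODE_UNSAFE` and `unicode_shellescape` are hypothetical, and the sketch assumes the string is still UTF-8 when it is escaped; how \p{...} behaves on strings already converted to a legacy encoding such as IBM437 is not covered here.

```ruby
# Characters that are NOT Unicode letters, marks, numbers, or one of the
# listed punctuation characters (hypothetical constant name).
UNICODE_UNSAFE = /([^\p{L}\p{M}\p{N}_.,:\/@\n-])/

# Hypothetical helper: backslash-escape everything outside the safe set.
# Assumes str is UTF-8 encoded.
def unicode_shellescape(str)
  str.gsub(UNICODE_UNSAFE, "\\\\\\1")
end

puts unicode_shellescape("Se\u00E1n")        # accented letter left as-is
puts unicode_shellescape("Se\u00E1n dir;x")  # space and semicolon escaped
```

One possible usage, under the same assumption, is to escape the UTF-8 string first and convert the result to the console encoding afterwards.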


TL;DR

Escaped characters

Metacharacters


Code

 String.class_eval do
   def escapeshell()
     # Escape shell special characters
     self.gsub!(/[#-&(-*;<>?\[-^`{-~\u00FF]/, '\\\\\0')
     # Escape unbalanced quotes (single and double quotes)
     self.gsub!(/(["'])(?:([^"']*(?:(?!\1)["'][^"']*)*)\1)?/) do
       if $2.nil?
         '\\' + $1
       else
         # and escape quotes inside (eg "x'x" or 'y"y')
         qt = $1
         qt + $2.gsub(/["']/, '\\\\\0') + qt
       end
     end
     self
   end
 end

 # Test it
 str = "(dir *.txt & dir \"\\some dir\\Sè\u00E1ñ*.rb\") | sort /R >Filé.txt 2>&1"
 puts 'String:'
 puts str
 puts "\nEscaped:"
 puts str.escapeshell

Output

 String:
 (dir *.txt & dir "\some dir\Sèáñ*.rb") | sort /R >Filé.txt 2>&1

 Escaped:
 \(dir \*.txt \& dir "\\some dir\\Sèáñ\*.rb"\) \| sort /R \>Filé.txt 2\>\&1



Description

Metacharacters

Given the shell metacharacters to be escaped:

 # & % ; ` | * ? ~ < > ^ ( ) [ ] { } $ \ \u00FF 

We can include each character in a character class:

 [#&%;`|*?~<>^()\[\]{}$\\\u00FF] 

Written with ranges, this is exactly the same as:

 /[#-&(-*;<>?\[-^`{-~\u00FF]/ 

Then we use gsub!() to add a backslash before any character in the class:

 str.gsub!(/[#-&(-*;<>?\[-^`{-~\u00FF]/, '\\\\\0') 
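The claim that the two character classes are the same can be checked with a short sketch like the one below. It compares both classes over printable ASCII plus U+00FF; the names `EXPLICIT`, `RANGED`, `candidates`, `mismatches`, and `matched` are hypothetical and exist only for this check.

```ruby
# Explicit list of metacharacters versus the range-based form.
EXPLICIT = /[#&%;`|*?~<>^()\[\]{}$\\\u00FF]/
RANGED   = /[#-&(-*;<>?\[-^`{-~\u00FF]/

# Printable ASCII plus U+00FF, each as a one-character UTF-8 string.
candidates = (0x20..0x7E).map { |cp| cp.chr(Encoding::UTF_8) } + ["\u00FF"]

# Characters on which the two classes disagree (expected: none).
mismatches = candidates.reject { |ch| EXPLICIT.match?(ch) == RANGED.match?(ch) }

# Characters that the range-based class escapes.
matched = candidates.select { |ch| RANGED.match?(ch) }

puts "mismatches: #{mismatches.inspect}"
puts "escaped set: #{matched.join(' ')}"
```

The range form is shorter but harder to audit, because a range such as `(-*` silently includes every code point between its endpoints; a check like this makes that visible.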


Quotes

Only unbalanced quotes should be escaped. This is important in order to preserve the arguments of the command. Using the following expression, we match balanced quotes:

 /(["'])([^"']*(?:(?!\1)["'][^"']*)*)\1/ 

To match unbalanced quotes as well, we make the last part optional:

 /(["'])(?:([^"']*(?:(?!\1)["'][^"']*)*)\1)?/ 

But we also need to escape quotes nested inside a pair of the other kind, that is, single quotes inside double quotes and vice versa. So we use another gsub() to replace them in the text matched inside the quotes ( $2 ):

 str.gsub!(/(["'])(?:([^"']*(?:(?!\1)["'][^"']*)*)\1)?/) do
   if $2.nil?
     '\\' + $1
   else
     qt = $1
     qt + $2.gsub(/["']/, '\\\\\0') + qt
   end
 end
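To see the quote handling in isolation, here is a small non-mutating sketch of the same idea. The names `QUOTED` and `escape_quotes` are hypothetical, and the inner replacement uses the block form of gsub instead of the '\\\\\0' replacement string, which is a substitution made for readability rather than the answer's exact code.

```ruby
# Same pattern as above: an opening quote, then optionally a body
# (which may contain quotes of the other kind) and the matching close.
QUOTED = /(["'])(?:([^"']*(?:(?!\1)["'][^"']*)*)\1)?/

# Hypothetical helper: returns a new string instead of mutating in place.
def escape_quotes(str)
  str.gsub(QUOTED) do
    quote, inner = $1, $2        # save before the nested gsub resets them
    if inner.nil?
      "\\" + quote               # unbalanced quote: escape it
    else
      # balanced pair: keep the outer quotes, escape quotes nested inside
      quote + inner.gsub(/["']/) { |q| "\\" + q } + quote
    end
  end
end

puts escape_quotes('say "hi"')  # balanced pair is kept
puts escape_quotes("it's")      # lone quote is escaped
puts escape_quotes("\"x'x\"")   # nested quote is escaped
```

Saving `$1` and `$2` in local variables first matters because the nested gsub call overwrites the match globals.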