How to count empty translations in .po using grep (or another LSB tool)?

I can search for empty translations in vim with a command like this:

/""\n\n 

But my task is to find the number of untranslated strings. Any ideas on how to do this using standard tools that every Linux distribution should have (without installing separate packages)?

Here is an example .po file containing 2 translated and 2 untranslated strings (long and short versions).

    msgid "translated string"
    msgstr "some translation"

    msgid "non-translated string"
    msgstr ""

    msgid ""
    "Some long translated string which starts from new line "
    "and can last for few lines"
    msgstr ""
    "Translation of some long string which starts from new line "
    "and lasts for few lines"

    msgid ""
    "Some long NON-translated string which starts from new line "
    "and can last for few lines"
    msgstr ""
+6
5 answers

Here is one way, using awk:

 awk '$NF == "msgstr \"\"" { c++ } END { print c }' FS="\n" RS= file 

Results:

 2 

Explanation:

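Setting RS to the empty string puts awk in paragraph mode, so each block separated by blank lines becomes one record, and FS="\n" makes each line of a block one field. The program then checks the last line of each block ($NF). If that line is exactly msgstr "", the block is counted, and the END rule prints the counter. If you later decide that you want to count the translated entries instead, just change == to != . HTH.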
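To see paragraph mode at work, here is a minimal sketch that writes the sample from the question to a temporary file and runs the same awk program on it. The temporary file and the variable names `po` and `untranslated` are illustration-only; `c+0` is a small addition so that zero matches print 0 rather than an empty line.

```shell
#!/bin/sh
# Write the sample .po content from the question to a temporary file (test data only).
po=$(mktemp)
cat > "$po" <<'EOF'
msgid "translated string"
msgstr "some translation"

msgid "non-translated string"
msgstr ""

msgid ""
"Some long translated string which starts from new line "
"and can last for few lines"
msgstr ""
"Translation of some long string which starts from new line "
"and lasts for few lines"

msgid ""
"Some long NON-translated string which starts from new line "
"and can last for few lines"
msgstr ""
EOF

# RS= (empty) makes each blank-line-separated block one record.
# FS="\n" makes each line of the block one field, so $NF is the last line of the block.
untranslated=$(awk '$NF == "msgstr \"\"" { c++ } END { print c+0 }' FS="\n" RS= "$po")
echo "untranslated entries: $untranslated"

rm -f "$po"
```

On this sample the script should report 2 untranslated entries: the short one and the long one whose block ends with msgstr "".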


From the comments below, to handle empty lines containing spaces:

You will need to use a regular expression, for example: RS="\n{2,}|\n([ \t]*\n)+|\n$" (maybe this can be simplified). However, note that allowing RS to be a regular expression is a GNU awk extension; other awk implementations may not handle multi-character record separators. Fortunately, the file format in question looks pretty strict, so handling separator lines that contain spaces should not be necessary.

If you do encounter separator lines that contain spaces, a quick fix is to call sed first:

 < file sed 's/^ *$//' | awk ... 
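As a rough illustration of that quick fix, the sketch below builds a small .po file whose separator lines contain only spaces, then counts with and without the sed step. The sample data, the helper name `awk_count` and the variable names are made up for this example; `[[:space:]]` is used instead of a literal space so that tabs are stripped too, and the awk variables are passed with -v, which should be equivalent to the trailing assignments used above.

```shell
#!/bin/sh
po=$(mktemp)
# printf makes the whitespace-only separator lines explicit (lines 3 and 6).
printf '%s\n' \
  'msgid "one"' \
  'msgstr ""' \
  '   ' \
  'msgid "two"' \
  'msgstr "zwei"' \
  ' ' \
  'msgid "three"' \
  'msgstr ""' > "$po"

# Hypothetical helper: the paragraph-mode counter, reading a file argument or stdin.
awk_count() {
  awk -v RS= -v FS='\n' '$NF == "msgstr \"\"" { c++ } END { print c+0 }' "$@"
}

# Without cleaning, the whitespace-only lines do not separate paragraphs.
raw=$(awk_count "$po")

# With cleaning, whitespace-only lines become truly empty before awk sees them.
cleaned=$(sed 's/^[[:space:]]*$//' "$po" | awk_count)

echo "raw count: $raw, cleaned count: $cleaned"
rm -f "$po"
```

The cleaned count should be 2 here, while the raw count comes out lower because awk sees the whole file as a single paragraph.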
+7

I suggest using the gettext tools instead of parsing .po files directly:

    $ msggrep -v -T -e "." test.po
    msgid "non-translated string"
    msgstr ""

    msgid ""
    "Some long NON-translated string which starts from new line and can last for "
    "few lines"
    msgstr ""

msggrep flags:

  • -v inverted match
  • -T apply the following pattern to msgstr
  • -e search pattern

i.e. show any msgstr that does not match /./ and therefore is empty.

Since msggrep does not have a -c option, a one-liner for counting is:

  msggrep -v -T -e "." test.po | grep -c ^msgstr 

(msggrep has been part of the gettext package since v0.11, released in January 2002. LSB Core, aka ISO/IEC 23360-1:2006(E), only requires the gettext and msgfmt binaries, but I haven't yet seen a system without msggrep, so I hope it satisfies your requirements.)

+4

Since a (nice) awk solution has already been posted, here are a few more ways:

All commands have been tested with your sample and a good .po file.

Using sed

    sed -ne '/msgstr ""/{N;s/\n$//p}' <poFile | wc -l
    2

Explanation: each time msgstr "" is found, the next line is appended to it ( N ). If the substitution s/\n$// succeeds, meaning the appended line was empty, the result is printed ( p ). Finally, wc -l counts the printed lines.

Bash only

Without using any binaries other than bash:

    total=0
    while read line;do
        if [ "$line" == 'msgstr ""' ] ;then
            read line
            [ -z "$line" ] && ((total++))
        fi
    done <poFile
    echo $total
    2

Explanation: each time msgstr "" is found, the next line is read; if that line is empty, the counter is incremented.

Another bash way

    mapfile -t line <poFile
    count=0
    for ((i=${#line[@]};i--;));do
        [ -z "${line[i]}" ] && [ "${line[i-1]}" == 'msgstr ""' ] && ((count++))
    done
    echo $count
    2

Explanation: read the entire .po file into a single array, then search the array for an empty element whose preceding element contains msgstr "" , incrementing the counter for each one, and finally print the counter.

Perl (in command line mode)

    perl -ne '$t++if/^$/&&$l=~/msgstr\s""\s*$/;$l=$_;END{printf"%d\n",$t}' <poFile
    2

Explanation: each time an empty line is found and the previous line (stored in the $l variable) contains msgstr "" , the counter is incremented.

Dash (not bash!)

    count=0
    while read line ; do
        [ "$line" = "" ] && [ "$prev" = 'msgstr ""' ] && true $((count=count+1))
        prev="$line"
    done <poFile
    echo $count
    2

Based on the Perl example, this works with both dash and bash.
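The line-pair approaches above count an empty msgstr only when an empty line follows it, so depending on the tool, a final untranslated entry at the very end of a file with no trailing blank line may be missed. The following is a minimal sketch of a variant, not one of the commands above: a small awk state machine that counts msgstr "" unless the next line is a continuation string, and that also checks the state at end of file. The names `po`, `pending` and `untranslated` are illustration-only, and plural forms such as msgstr[0] are deliberately not handled.

```shell
#!/bin/sh
po=$(mktemp)
# Sample from the question; note there is no blank line after the final msgstr "".
cat > "$po" <<'EOF'
msgid "translated string"
msgstr "some translation"

msgid "non-translated string"
msgstr ""

msgid ""
"Some long translated string which starts from new line "
"and can last for few lines"
msgstr ""
"Translation of some long string which starts from new line "
"and lasts for few lines"

msgid ""
"Some long NON-translated string which starts from new line "
"and can last for few lines"
msgstr ""
EOF

untranslated=$(awk '
  # The previous line was an empty msgstr and this line is not a continuation string.
  pending && !/^"/ { c++ }
  # Remember whether the current line is exactly an empty msgstr.
  { pending = ($0 == "msgstr \"\"") }
  # An empty msgstr on the last line of the file has nothing after it, so count it here.
  END { if (pending) c++; print c+0 }
' "$po")
echo "untranslated entries: $untranslated"

rm -f "$po"
```

On the sample this should report 2. A header entry (msgid "" followed by msgstr "" and continuation lines) is not counted, because its msgstr is followed by a quoted line.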

+2

Try:

 grep -c '^""$' 

It counts lines whose only content is two double quotes.

EDIT:

Following your comment, I see that the above does not meet your needs. To perform a multi-line match, you can use GNU grep as follows:

 grep -Pzo '^msgstr ""\n\n' en.po | grep -c msgstr 

This has been tested and found to work with GNU grep 2.14. However, I don't know whether requiring GNU grep is standard enough for you.

Explanation of 1st grep:

-P activate the Perl regular expression extension.

-z treat the input as NUL-terminated records instead of newline-terminated lines, which allows the pattern to match across newlines.

-o print only the matching part; this is required since -z is used, otherwise we would print the whole file.

Explanation of the second grep:

-c count the number of matching lines, in this case those containing msgstr. This has to be done in a separate grep invocation, since -c returns 1 when used together with -z .

+1
 grep -n ^msg your.po | grep -v '""' | uniq -D -f1 

This searches for lines starting with msg , drops those that contain an empty string ( "" ), and then uses uniq to find duplicate lines (ignoring the first field, which holds the line number and the msgid / msgstr keyword).

Example output from a CUPS file:

    $ grep -n ^msg /usr/share/locale/es/cups_es.po | grep -v '""' | uniq -D -f1
    3742:msgid "ParamCustominCutInterval"
    3743:msgstr "ParamCustominCutInterval"
    3745:msgid "ParamCustominTearInterval"
    3746:msgstr "ParamCustominTearInterval"
    3858:msgid "Quarto"
    3859:msgstr "Quarto"
    3967:msgid "Stylus Color Series"
    3968:msgstr "Stylus Color Series"
    3970:msgid "Stylus Photo Series"
    3971:msgstr "Stylus Photo Series"
    3973:msgid "Super A"
    3974:msgstr "Super A"
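To make clear what this pipeline reports, here is a small sketch on made-up data. As far as I can tell it finds entries whose msgstr is identical to the msgid (copied rather than translated), while entries with an empty msgstr are filtered out by grep -v. The sample entries and the variable names `po`, `dupes` and `copied` are hypothetical, and uniq -D is assumed to be available (it is a GNU coreutils option).

```shell
#!/bin/sh
po=$(mktemp)
cat > "$po" <<'EOF'
msgid "Quarto"
msgstr "Quarto"

msgid "Hello"
msgstr "Hola"

msgid "Bye"
msgstr ""
EOF

# grep -n prefixes each line with its number, so the first field is e.g. 1:msgid.
# uniq -f1 skips that field when comparing, and -D prints every line of a duplicate group.
dupes=$(grep -n '^msg' "$po" | grep -v '""' | uniq -D -f1)
printf '%s\n' "$dupes"

# Each copied entry contributes one msgid line and one msgstr line; count the msgstr lines.
copied=$(printf '%s\n' "$dupes" | grep -c ':msgstr ')
echo "entries with msgstr equal to msgid: $copied"

rm -f "$po"
```

Only the "Quarto" pair should be listed here, so this approach answers a slightly different question than counting empty translations.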
-1
