Creating CSV information extracted from file names in a given format

Question

I have a small script that lists the paths of all files in a directory and its subdirectories and parses each path with a Perl regular expression.

 #!/bin/sh
 find * -type f | while read j; do
     echo $j | perl -n -e '/\/(\d{2})\/(\d{2})\/(\d+).*-([a-zA-Z]+)(?:_(\d{1}))?/ && print "\"0\";\"$1$2$3\";\"$4\";\"$5\";$fl\""' >> bss.csv
     echo | readlink -f -n "$j" >>bss.csv
     echo \">>bss.csv
 done

Output:

 "0";"13957";"4121113";"2";"/home/root/dir1/bss/164146/13/95/7___/000240216___Abc-4121113_2.jpg"

I use readlink from GNU coreutils: -n suppresses the trailing newline, and -f canonicalizes the path by recursively following every symbolic link in it.
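To illustrate what those two options do, here is a minimal sketch. It assumes GNU readlink and mktemp are available; resolve_demo and the file names are made up for the demonstration.

```shell
#!/bin/sh
# Sketch: readlink -f resolves symlinks; -n omits the trailing newline.
# resolve_demo, target.txt and link.txt are hypothetical names.
resolve_demo() {
    dir=$(mktemp -d) || return 1
    : > "$dir/target.txt"                 # a regular file
    ln -s target.txt "$dir/link.txt"      # a symlink pointing at it
    resolved=$(readlink -f -n "$dir/link.txt")
    rm -rf "$dir"
    basename "$resolved"                  # name of the file the link resolved to
}

resolve_demo
```

The canonical path ends in target.txt rather than link.txt, because -f follows the link.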

The problem is that when an input line does not match the regular expression, the output contains only the line with the file path.

How can I add a condition so that the path is printed only if the regular expression matched, and nothing is printed otherwise? I have racked my brain over various combinations but have not found one that works.

shell perl

Sergii rechmp Jul 08 '14 at 12:59

2 answers

If I understand you correctly, you want to capture the following parts of the file name:

 /home/root/dir1/bss/164146/13/95/7___/000240216___Abc-4121113_2.jpg
                            ~~ ~~ ~                ~~~ ~~~~~~~ ~
                            1  2  3                4   5       6

But your Perl regex does not do that. Let me break it down for a better understanding.

 /\/(\d{2})\/(\d{2})\/(\d+).*-([a-zA-Z]+)(?:_(\d{1}))?/

Sliced into pieces, it reads:

\/(\d{2}) - a slash, then two digits (the digits are captured)
\/(\d{2}) - another slash and two more digits
\/(\d+) - another slash and one or more digits
.*- - any run of characters up to the final hyphen in the input line
([a-zA-Z]+) - one or more alphabetic characters
(?:_(\d{1}))? - an optional underscore followed by a single digit; the outer (?:...) group is non-capturing, but the inner (\d{1}) still captures the digit as $5, and the {1} is redundant

If you walk through your file name, you will see that nothing in the regex handles the second-to-last run of digits (4121113), which follows the hyphen where the regex expects letters.

I would do this using simpler tools. Sed, for example:

 [ghoti@pc ~]$ s="/home/root/dir1/bss/164146/13/95/7___/000240216___Abc-4121113_2.jpg"
 [ghoti@pc ~]$ echo "$s" | sed -rne 's/.*/"&"/;h;s:.*/([0-9]{2})/([0-9]{2})/([0-9]+)[^[a-zA-Z]]*[^-]+-([0-9]+)(_([0-9]+))?.*:"0";"\1\2\3";"\4";"\6":;G;s/\n/;/;p'
 "0";"13957";"4121113";"2";"/home/root/dir1/bss/164146/13/95/7___/000240216___Abc-4121113_2.jpg"
 [ghoti@pc ~]$

I'll break the sed script apart for readability:

s/.*/"&"/; - Put quotation marks around the file name.
h; - Save the file name in sed's hold space for later use ...
s: - Run a big substitution ...
- .*/([0-9]{2})/([0-9]{2})/([0-9]+)[^[a-zA-Z]]*[^-]+-([0-9]+)(_([0-9]+))?.* - This is the pattern that we want to match for the substitution. Much like what you did in Perl, but using ERE instead of PCRE.
- :"0";"\1\2\3";"\4";"\6":; - The replacement pattern, where \n is replaced by the nth parenthesized element of the RE. Note that \5 is skipped in the replacement string, as that subexpression is used only for grouping.
G; - Append the hold space to the pattern space,
s/\n/;/; - and replace the newline between them with a semicolon.
p - Print the result.

Note that this solution, as it stands, assumes that all input lines match the pattern you are looking for. If that is not the case, you may get unpredictable output and should add some pattern matching to the script.
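One possible way to add that check is sketched below. It is not the script from this answer: it uses a simplified pattern, saves the unquoted path first, and relies on the sed commands t (branch if a substitution succeeded) and d to drop non-matching lines. The function name csv_line is hypothetical, and a sed that accepts -E is assumed.

```shell
#!/bin/sh
# Sketch: emit a CSV line only when the path matches; print nothing otherwise.
# csv_line is a hypothetical helper; the pattern is a simplified variant.
csv_line() {
    printf '%s\n' "$1" | sed -En '
        h
        s:^.*/([0-9]{2})/([0-9]{2})/([0-9]+)[^-]*-([0-9]+)(_([0-9]+))?.*$:"0";"\1\2\3";"\4";"\6":
        t matched
        d
        :matched
        G
        s/\n(.*)/;"\1"/
        p'
}

csv_line "/home/root/dir1/bss/164146/13/95/7___/000240216___Abc-4121113_2.jpg"
csv_line "/home/root/notes.txt"    # invented non-matching path: no output
```

Here h stores the original path, the substitution builds the first four fields, and G plus the last substitution append the quoted path. When the substitution fails, t does not branch and d discards the line.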

ghoti Jul 08 '14 at 15:09

Palec · Accepted Answer · 2014-07-08T14:27:47+0000

Solution Description

In Perl, use if (/…/) {…} else {…} instead of /…/ && …. That way, you can print if the match succeeds and run other code otherwise.

If that is not the problem and you just want to stop appending the readlink output and the closing quotation mark separately, you can call readlink from Perl using backticks.

Result code

I turned everything into a single Perl program, used File::Find instead of the find command, assumed that $fl at the end of the print in Perl is a relic (and ignored it), and used Cwd::realpath() to get the canonical file path instead of readlink -f from GNU coreutils. If you still want to use readlink -f, feel free to change Cwd::realpath($_) to `readlink -f '$_'` (including the backticks!), but then it will not work for file names containing a single quote.

You should call this script as ./script-name starting-directory > bss.csv. If you put the script in the directory being scanned, the output will also contain the script itself, along with bss.csv.

 #!/usr/bin/perl
 # Usage: ./$0 [<starting-directory>...]
 use strict;
 use warnings;
 use File::Find;
 use Cwd;
 no warnings 'File::Find';

 sub handleFile() {
     return if not -f;    # skip directories and other non-regular files
     if ($File::Find::name =~ /\/(\d{2})\/(\d{2})\/(\d+).*-([a-zA-Z]+)(?:_(\d{1}))?/) {
         # Localize both separators so they do not leak into the STDERR message.
         local ($,, $\) = (';', "\n");
         # $5 is undefined when the optional "_<digit>" suffix is absent.
         print map "\"$_\"", 0, $1.$2.$3, $4, $5 // '', Cwd::realpath($_);
     } else {
         print STDERR "File $File::Find::name did not match\n";
     }
 }

 find(\&handleFile, @ARGV ? @ARGV : '.');

For reference, I also include a polished version of the original program. It calls readlink from Perl, as I suggested above, and makes real use of Perl's -n option, avoiding the while read loop.

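 #!/bin/sh
 # -l chomps each input line and appends a newline to every print.
 # Backticks are not interpolated inside qq{}, so the readlink call is concatenated.
 find . -type f | perl -l -n -e 'm{/(\d{2})/(\d{2})/(\d+).*-([a-zA-Z]+)(?:_(\d{1}))?} && print qq{"0";"$1$2$3";"$4";"$5";"} . `readlink -f -n '\''$_'\''` . qq{"}' > bss.csv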

Other source code notes

echo | before readlink does nothing and should be removed. readlink does not read its stdin.
Where does $fl at the end of the print in Perl come from? I guess it is a relic.
Using generic quoting operators such as qq{} and choosing delimiters deliberately (for regex matching and other quote-like operators) can save you from quoting hell. I have already used this tip above: /…/ → m{…} and "…" → qq{…}. Thanks, Slade! See the perlop manpage for more information.
