Bash: delete files of a given type that are not listed in an HTML file

Quite new to BASH and looking for some tips, as I'm afraid to even start with this.

I have a webpage that lists image downloads, for example

<img src="01.jpg" alt="" width="1920" height="1080" />
<img src="02.jpg" alt="" width="1920" height="1080" />
<img src="03.jpg" alt="" width="1920" height="1080" />

I would like to use Bash to read this web page (it's a local one), pick up the file names, i.e. 01.jpg, 02.jpg and 03.jpg, and then delete all the other .jpg files in the directory that do not match. So, for example, if there were a 04.jpg in the folder, that file would be deleted, since it is not on the web page.

Sorry that I haven't posted any code; I just don't know where to start.

Thank you in advance

+4
4 answers

Using Python with BeautifulSoup (an HTML parser for Python):

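python -c '
import sys, glob, bs4
# parse the page from stdin; naming the parser avoids a bs4 warning
soup = bs4.BeautifulSoup(sys.stdin.read(), "html.parser")
# jpg files in the current directory minus those named in <img src="...">
print("\n".join(
    set(glob.glob("*.jpg")) -
    set(e["src"] for e in soup.find_all("img", src=True))
))' < file.htm | xargs rm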

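How it works: take the set of jpg files in the current directory, subtract the names found in <img src=".."> attributes, and pass whatever is left to rm.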
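
For comparison, the same set difference can be sketched in plain shell. This is only a minimal sketch, not part of the answer above: it assumes a grep that supports -o, file names without newlines, and it builds a throwaway directory with mktemp so that nothing real is touched. The names demo, listed.txt and present.txt are illustrative.

```shell
#!/bin/bash
# Work in a scratch directory so no real files are affected.
demo=$(mktemp -d)
cd "$demo" || exit 1
touch 01.jpg 02.jpg 03.jpg 04.jpg
printf '<img src="%s" alt="" />\n' 01.jpg 02.jpg 03.jpg > file.htm

# Names referenced by src="..." attributes, one per line, sorted.
grep -o 'src="[^"]*"' file.htm | sed 's/^src="//; s/"$//' | sort -u > listed.txt
# jpg files actually present, sorted the same way.
printf '%s\n' *.jpg | sort -u > present.txt

# comm -13 prints lines found only in the second file:
# files that are present but not listed in the page.
orphans=$(comm -13 listed.txt present.txt)
echo "$orphans"
```

To delete rather than list, the names could be read one per line, for example with `while IFS= read -r f; do rm -- "$f"; done`.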

+3

Using find and grep:

find . -maxdepth 1 -name "*.jpg" -type f -exec bash -c \
    'f=${1#./}; if ! grep -qF "img src=\"$f\"" file.html; then rm -- "$1"; echo "Removed $f"; fi' _ {} \;
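
Before running a command that deletes files, it may help to preview what it would remove. A sketch of a dry run, assuming the same file.html name as above and using a scratch directory created with mktemp (the variable names demo and would_remove are illustrative):

```shell
#!/bin/bash
# Scratch directory with two listed files and one unlisted file.
demo=$(mktemp -d)
cd "$demo" || exit 1
touch 01.jpg 02.jpg 04.jpg
printf '<img src="%s" alt="" />\n' 01.jpg 02.jpg > file.html

# Same idea as the one-liner above, but print the name instead of calling rm.
# The file name is passed as $1 rather than spliced into the script text.
would_remove=$(find . -maxdepth 1 -name '*.jpg' -type f -exec bash -c '
    f=${1#./}
    grep -qF "img src=\"$f\"" file.html || printf "%s\n" "$f"
' _ {} \;)
echo "$would_remove"
```

Once the printed list looks right, the printf can be swapped back for rm.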
0

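Another approach: read the jpg files in the directory into an array, then loop over them and check whether each jpg name appears in the html file. The rm is commented out so that you can test first. A short script: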

#!/bin/bash

[ -z "$1" ] && {
    printf "error: insufficient input. usage:  %s path/to/file.html\n" "${0##*/}"
    exit 1
}

[ -r "$1" ] || {
    printf "error: invalid filename '%s'. usage:  %s path/to/file.html\n" "$1" "${0##*/}"
    exit 1
}

fname=${1##*/}  ## split filename/path
fpath=${1%/*}

[ "$fname" = "$fpath" ] && fpath="."        ## no directory part given

jpgarray=( "$fpath"/*.jpg )                 ## read jpg files in directory

for i in "${jpgarray[@]}"; do
    [ -e "$i" ] || continue                 ## skip unexpanded glob (no jpgs)
    tmp=${i##*/}
    if grep -qF -- "$tmp" "$1"; then        ## -F: match the name literally
        printf "    file: %s exists in %s -- don't delete\n" "$i" "$1"
    else
        printf "    file: %s does NOT exist in %s -- deleting\n" "$i" "$1"
        # rm -- "$i"                        ## uncomment to actually delete
    fi
done

exit 0

Example jpg files and html file

$ ls -1 dat/*.jpg
dat/01.jpg
dat/02.jpg
dat/03.jpg
dat/04.jpg
dat/05.jpg
dat/06.jpg

$ cat dat/jpgnames.html
<img src="01.jpg" alt="" width="1920" height="1080" />
<img src="02.jpg" alt="" width="1920" height="1080" />
<img src="03.jpg" alt="" width="1920" height="1080" />

Use/Output

$ bash findjpg.sh dat/jpgnames.html
    file: dat/01.jpg exists in dat/jpgnames.html -- don't delete
    file: dat/02.jpg exists in dat/jpgnames.html -- don't delete
    file: dat/03.jpg exists in dat/jpgnames.html -- don't delete
    file: dat/04.jpg does NOT exist in dat/jpgnames.html -- deleting
    file: dat/05.jpg does NOT exist in dat/jpgnames.html -- deleting
    file: dat/06.jpg does NOT exist in dat/jpgnames.html -- deleting
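
One caveat with matching a bare file name anywhere in the page: a name such as 1.jpg is also found inside 01.jpg, so an unlisted 1.jpg would be kept. A sketch of a stricter check follows. It is an illustration rather than part of the script above, it assumes bash 4 or later for associative arrays, and the names demo, page.html, keep and unlisted are made up for the example.

```shell
#!/bin/bash
# Scratch directory: 01.jpg is listed in the page, 1.jpg is not.
demo=$(mktemp -d)
cd "$demo" || exit 1
touch 01.jpg 1.jpg
printf '<img src="01.jpg" alt="" />\n' > page.html

# Collect the exact src values as keys of an associative array (bash 4+).
declare -A keep
while IFS= read -r name; do
    keep["$name"]=1
done < <(grep -o 'src="[^"]*"' page.html | sed 's/^src="//; s/"$//')

# A file is unlisted only if its whole name is missing from the keys.
unlisted=()
for f in *.jpg; do
    [[ -n ${keep["$f"]:-} ]] || unlisted+=("$f")
done
printf '%s\n' "${unlisted[@]}"
```

Here a plain `grep "1.jpg" page.html` would succeed because of the 01.jpg line, while the exact lookup reports 1.jpg as unlisted.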
0

This script only works if you have a single web page to check. There are more concise ways to write it, but I think this version is easier for beginners to understand:

#!/bin/bash
## loop through all the files in the image folder
for FILENAME in /path/to/image/folder/*; do

    # for each file, check (case insensitive) if it exists in your web page
    if grep -qiF -- "$(basename "$FILENAME")" /path/to/webpage.html
    then
        # image file found in webpage
        echo "$FILENAME found, not deleting"
    else
        # image file not found in webpage
        echo "$FILENAME not found, moving to trash"
        mv "$FILENAME" /path/to/trash/folder
    fi
done

It also moves files to the trash, in case you need to restore them.

-1