How can I run a recursive search and replace operation for multiple files in parallel?

Question

I am trying to replace text data in a git repository using git's filter-branch functionality.

I wrote a simple script to search for various terms and replace them. It ran very slowly. I had several lines of BASH code, each running its own search and replace operation. I knew my code was not very efficient, so I decided to try only my first line, which should be at least halfway efficient. It still takes forever to go through the code base.

Can I use BASH or another simple approach to search my files and perform search and replace operations in parallel to speed things up?

If not, are there any other suggestions on how best to deal with this?

Here's the git command:

 git filter-branch --tree-filter "sh /home/kurtis/.bin/redact.sh || true" \
   -- --all

And here is the code that my script essentially executes:

 find . -not -name "*.sql" -not -name "*.tsv" -not -name "*.class" \
   -type f -exec sed -i 's/01dPassw0rd\!/HIDDENPASSWORD/g' {} \;

+4

git bash parallel-processing find sed

Kurtis Jan 30 '13 at 10:02

4 answers

With GNU Parallel, you can run the sed jobs in parallel, by default one job per CPU core:

 find . -not -name "*.sql" -not -name "*.tsv" -not -name "*.class" \
   -type f -print0 \
   | parallel -q -0 sed -i 's/01dPassw0rd\!/HIDDENPASSWORD/g'

Read more: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

+2

Ole Tange Feb 03 '13 at 15:18

I found this problem interesting, so I played around with it a bit, and I'm sharing this partially working script. My original approach was a bit wrong, but it may still be fast(er).

I tried to improve performance by looking, in each commit, only at the modified files whose modification contains the string you want to replace, using git log -Sstring. But I forgot that if I change only those files, the modification shows up again in the next commit, so I had to run the script several times. Since it checks only the modified files rather than all files, running it several times may still be faster than your version, but I'm not sure how long filter-branch takes when it has nothing to do.

Maybe you can use parts of it, perhaps by getting the file names first with git log -S... You could also improve it by feeding the file names to sed through xargs instead of using the for loop, but during development I like this form better. I don't know how to get a commit's parents properly, so I did it this way and had to handle the initial commit case separately.

In any case, I am here to learn, so if you find a good way to deal with this problem, please share :)

 #!/bin/bash
 commit=$1
 pattern=$2
 replace=$3

 # Run sed on every file touched by commits in range $1 that add or remove $pattern.
 function replaceall() {
     for f in `git log -S"$pattern" --pretty="format:" --name-only $1 | egrep -v '\.sql$|\.class$|\.tsv$'`; do
         echo "FILE $f"
         sed -i "s/$pattern/$replace/g" "$f"
     done
 }

 # A root commit has no parents, so it is handled separately.
 parents=`git log --pretty=%P -n 1 $commit`
 if test -z "$parents"; then
     echo "ROOT"
     replaceall $commit
 else
     for p in $parents; do
         echo "PARENT $p"
         replaceall $p..$commit
     done
 fi

Usage:

 git filter-branch -f --tree-filter '/path/to/script.sh $commit 01dPassw0rd\! HIDDENPASSWORD' -- --all

I think the script should not live inside the git working directory, because the tree filter adds everything it finds there to the rewritten commit, but I'm not sure about that.

+1

tewe Jan 31 '13 at 19:22

You want the BFG Repo-Cleaner, a faster and simpler alternative to git-filter-branch that runs on the JVM and is explicitly designed for removing private data from Git repos. It is multi-threaded and optimized specifically for the task you are describing. It is typically 10-50 times faster than git-filter-branch, and the advantage grows with the size of your repo.

Download the Java jar, create a private.txt file that lists the passwords, etc. that you want to delete (one entry per line), and then run the following command:

 $ java -jar bfg.jar --replace-text private.txt my-repo.git

All files under a threshold size (1 MB by default) in your repo history will be scanned, and any matching string (that is not in your latest commit) will be replaced with the string "***REMOVED***". Then you can use git gc to delete the dead data:

 $ git gc --prune=now --aggressive
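
As a sketch of what the private.txt file mentioned above could look like for the password in the question: by default each line is treated as a literal string, and BFG's --replace-text help describes a ==> separator for choosing a replacement other than the default. The second entry is a made-up placeholder, and the exact password text (01dPassw0rd!) is an assumption read off the sed pattern in the question.

```shell
# Sketch: write a replacement list for `bfg --replace-text private.txt`.
# The quoted 'EOF' keeps the shell from expanding anything in the entries.
cat > private.txt <<'EOF'
01dPassw0rd!==>HIDDENPASSWORD
some-other-secret
EOF
```

The first entry would be replaced with HIDDENPASSWORD; the second, having no ==> part, would get the default replacement text.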

+1

Roberto Tyley Feb 02 '13 at 23:17

Josef Kufner · Accepted Answer · 2013-01-30T22:24:45+0000

git filter-branch cannot process commits in parallel, because it needs to know the hash (id) of the rewritten parent commit before it can calculate the hash of the current one.
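
To see that dependency concretely, here is a small sketch that builds a throwaway repository and prints a raw commit object. The repository, file name, identity, and the demo_commit helper are all made up for the demonstration.

```shell
#!/bin/bash
# Sketch: show that a commit object embeds its parent's hash.
# Everything runs in a throwaway repository created with mktemp.

demo_repo=$(mktemp -d)
cd "$demo_repo" || exit 1
git init -q .

# Hypothetical identity, passed per command so no global config is needed.
demo_commit() {
    git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"
}

echo one > file.txt && git add file.txt && demo_commit "first"
echo two > file.txt && git add file.txt && demo_commit "second"

# The raw object for HEAD lists its tree, its parent, the author and the message.
# The commit id is computed from this text (plus a short header), so rewriting
# the parent changes the "parent" line and therefore the id of every descendant.
git cat-file -p HEAD

parent_in_object=$(git cat-file -p HEAD | sed -n 's/^parent //p')
first_commit=$(git rev-parse HEAD~1)
```

The value on the parent line is the id of the first commit, which is why each rewritten commit has to wait for its parent to be finished.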

But you can speed up the processing of each commit:

Your code starts a separate sed process for each file. That is very slow. Use this instead:

 find . -not -name "*.sql" -not -name "*.tsv" -not -name "*.class" \
   -type f -print0 \
   | xargs -0 sed -i 's/01dPassw0rd\!/HIDDENPASSWORD/g'

This version works just like yours, but sed is invoked with as many files (arguments) as possible per call. find's -print0 and xargs's -0 mean "separate file names with a null byte", so there is no problem when a file name contains spaces, newlines, binary garbage, etc.
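
Since the question asks about parallelism, the same pipeline can also run several sed processes at once within a single commit. A minimal sketch, assuming GNU sed and an xargs that supports -P (GNU and BSD xargs do, but it is not in POSIX). The batch size of 100, the fallback of 4 jobs, and the function name redact_tree are arbitrary choices for illustration.

```shell
#!/bin/bash
# Sketch: run the same replacement with several sed processes at once.
# Assumes GNU sed (-i without a suffix) and an xargs that supports -P.

redact_tree() {
    local dir=$1
    local jobs
    # nproc is part of GNU coreutils; fall back to 4 jobs if it is missing.
    jobs=$(nproc 2>/dev/null || echo 4)
    # -n 100 hands each sed at most 100 files; -P runs up to $jobs seds at a time.
    find "$dir" -not -name "*.sql" -not -name "*.tsv" -not -name "*.class" \
        -type f -print0 \
      | xargs -0 -n 100 -P "$jobs" sed -i 's/01dPassw0rd\!/HIDDENPASSWORD/g'
}

# Usage: redact_tree .
```

Whether this helps depends on the tree: each sed call edits a disjoint set of files, so the jobs do not interfere, but on a small tree the time may be dominated by filter-branch checking out each commit rather than by sed.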
