wget improved

Posted on December 30, 2004 by Steve

Jeffrey Veen's much-furled introduction to wget is great as far as it goes, but it fails to respect the rule that 99% of what's on the internet is crap.

That "virtual radio station of hand-filtered new music" is going to need some culling before it becomes a listenable collection. I wrote a shell script that scrapes mp3 files into a timestamped folder for later review. It uses the "no clobber" switch to avoid downloading the same file twice: once a new mp3 has been copied to the dated folder, its original in the shells folder is overwritten with a zero-byte file, so the name stays behind as a placeholder (put a zero-byte file named empty.txt in the working directory with the script). A file named "go.txt" enables the script; each run renames it to "stop.txt" so the cronned job can't stuff the hard drive, and renaming it back allows another run. Put your list of favored sources in mp3blogs.txt.

If you want to vacuum up a huge pile of mp3s, change the level switch (-l1) to 2 or more and wget will range far and wide.
#!/bin/sh
todaysdate=`date +%y%m%d`
logfile="/home/steve/mp3/wget/logs/log_$todaysdate.txt"
workingdir="/home/steve/mp3/wget/"
#note: create a shells directory and a logs directory in the working directory
storagedir="/home/steve/mp3/wget/"

if [ -f "${workingdir}go.txt" ]
then
#the file was found; rename it so this won't run forever
mv "${workingdir}go.txt" "${workingdir}stop.txt"
else
echo "go.txt not found, exiting..." >> "$logfile"
exit
fi

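#free kilobytes on the filesystem that holds the working directory
#(-P keeps each filesystem on one line, so line 2 is always the data row)
freespace=`df -Pk "$workingdir" | awk 'NR==2 {print $4}'`

#insist on half a gig free before running
if test "$freespace" -gt 500000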
then
echo "Freespace is $freespace; beginning wget ..." >> $logfile

#### THE WGET
wget -nc -w2 -r -l1 -H -t1 -nv -nd -P "${workingdir}shells" -np -A.mp3 -erobots=off \
-i "${workingdir}mp3blogs.txt" -a "$logfile"
# -nc "no clobber" -- don't redownload a file that exists
# -w2 wait 2 secs between requests
# -r recursive
# -l1 one level deep
# -H enable spanning across hosts when doing recursive retrieving
# -t1 don't retry
# -nv not so verbose
# -nd don't create directories reflecting source dirs
# -P shells -- save files to shells dir
# -np don't ascend into parent dirs when recursing
# -N turns on timestamps (not used here)
# -A accept list: only .mp3 extensions
# -erobots=off ignore robots.txt
# -i input file
# -a append to the log (-o would overwrite the lines this script already wrote)
# -p: don't complain if today's directory already exists
mkdir -p "$workingdir$todaysdate"

# set field separator to newline character!
oldIFS="$IFS"; IFS="
";

#get list of files freshly downloaded (size > 1 byte), copy them to date dir,
#then overwrite the original with the zero-byte placeholder
for files in `ls -clt "$workingdir"shells/*mp3 | awk '{if ($5 > 1) print substr($0, index($0, $9))}'`
do
echo "doing copy--- cp \"$files\" $workingdir$todaysdate" >> "$logfile"
cp "$files" "$workingdir$todaysdate"
echo "doing overwrite--- cp ${workingdir}empty.txt \"$files\"" >> "$logfile"
cp "${workingdir}empty.txt" "$files"
done

else
echo "Freespace is low: $freespace; cancelling wget" >> $logfile
fi

IFS="$oldIFS"
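
A possible variation on the copy-and-overwrite loop, offered as a sketch rather than as part of the script above: parsing ls output depends on the column layout and needs the IFS juggling to cope with spaces in file names. A plain shell glob with quoted variables avoids both. The function name harvest_new and its two arguments are made up for this illustration, and it truncates the original in place instead of copying empty.txt over it, which should leave the same zero-byte shell behind.

```shell
#!/bin/sh
# harvest_new SRC DEST  (hypothetical helper, not in the original script)
# Copy every non-empty .mp3 in SRC into DEST, then truncate the original
# to zero bytes so "wget -nc" still sees the name and skips it next run.
harvest_new() {
    hn_src=$1
    hn_dest=$2
    mkdir -p "$hn_dest" || return 1
    for hn_file in "$hn_src"/*.mp3
    do
        # -f: regular file; -s: size greater than zero.
        # This skips existing shells and the literal pattern
        # that is left over when nothing matches.
        if [ -f "$hn_file" ] && [ -s "$hn_file" ]
        then
            # only truncate once the copy has succeeded
            cp "$hn_file" "$hn_dest"/ && : > "$hn_file"
        fi
    done
}
```

In the script above it would be called as harvest_new "${workingdir}shells" "$workingdir$todaysdate", assuming the same variables are set.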