13 Oct 2011

Using Perl to extract files from a large directory structure

When I work on Windows I use ActiveState ActivePerl for various tasks. Recently I had to copy about 60 files out of a large, messy directory structure (about 800 files) to a new location. It was boring work to do by hand, so I used Perl: I had the file names in a text file, which makes the job easy with a hash. Unfortunately I ran into a problem finishing the job, caused by the encoding of file path names on Windows.
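A minimal sketch of the hash idea, assuming the list holds one file name per line. The helper name `load_filter` and the sample names are hypothetical, and the demo reads from an in-memory file handle so it runs without any real files.

```perl
use strict;
use warnings;

# Hypothetical helper: build a lookup hash from a list of wanted file
# names, one name per line (the list format is an assumption).
sub load_filter {
    my ($fh) = @_;
    my %filter;
    while ( my $line = <$fh> ) {
        $line =~ s/\r?\n\z//;        # strip LF or CRLF line endings
        next if $line =~ /^\s*$/;    # skip blank lines
        $filter{$line} = 1;
    }
    return %filter;
}

# Demo: read from an in-memory file handle instead of a real file.
my $list = "report.doc\r\n\r\nmovie.avi\n";
open my $fh, '<', \$list or die "Cannot open in-memory list: $!";
my %filter = load_filter($fh);
close $fh;

print join( ", ", sort keys %filter ), "\n";
```

For a real list file you would open it with an encoding layer that matches how the file was saved, for example `'<:encoding(UTF-8)'`, so that the hash keys are character strings.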

Below is a simple script that demonstrates how to handle national characters in paths using Perl on Windows.

use strict;
use warnings;
use utf8;    # this script itself must be saved as UTF-8
use Encode;  # exports decode()
# Win32::Codepage::Simple is available in ActiveState Perl
use Win32::Codepage::Simple qw(get_codepage get_acp);
use File::Find;
use File::Copy;
use File::Path;
use File::Basename;
binmode STDOUT, ":utf8";
binmode STDERR, ":utf8";
my $dir  = "./copy_from_here/";
my $dest = "./copy_here/";
# Names of the files to copy (character strings, thanks to "use utf8")
my %filter = ( 'grzegrzółka.doc' => 1, 'słoń i łoś.avi' => 1);
# Polish Windows 7 :)
my $encoding = get_codepage();
if (defined $encoding and $encoding eq '1250') {
   $encoding = 'cp1250';
} else {
   # Any other code page: compare the names without decoding them
   $encoding = '';
}
# Make sure the destination directory exists
mkpath($dest) if not -d $dest;
find(
       { wanted =>
               sub {
               return if not -f $File::Find::name;
               # The file system returns bytes; decode the file name
               # before comparing it with the keys of %filter
               my $nfile  = basename($File::Find::name);
               my $mfname = $encoding ?
                              decode($encoding, $nfile)
                            : $nfile;
               return if not exists $filter{$mfname};
               if ( not -e $dest.'/'.$nfile ) {
                       copy($File::Find::name, $dest.'/'.$nfile)
                               or die "File cannot be copied: $!";
               } else {
                       print "File exists: ".$mfname."\n";
               }
               },
       follow => 1, no_chdir => 1, depth => 1
       }
       , $dir
);
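
Why the decode step matters can be shown without touching the file system. The sketch below simulates a cp1250 file name by encoding a known string and then looks it up in a hash, before and after decoding. The variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "słoń i łoś.avi" written with escapes, so this demo does not depend on
# how the source file itself is saved.
my $name = "s\x{142}o\x{144} i \x{142}o\x{15B}.avi";

# Hash keyed by character strings, like %filter in the script.
my %filter = ( $name => 1 );

# Simulate what a directory listing yields on a cp1250 system:
# raw bytes, not characters.
my $raw = encode( 'cp1250', $name );

# The raw byte string is a different string, so the lookup fails ...
print "raw match: ", ( exists $filter{$raw} ? "yes" : "no" ), "\n";

# ... while the decoded name matches the key.
my $decoded = decode( 'cp1250', $raw );
print "decoded match: ", ( exists $filter{$decoded} ? "yes" : "no" ), "\n";
```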

More information about national characters in path on Windows


  1. Does this fix Hebrew issues as well (Hebrew directory names)? I have yet to find a solution for that problem.

  2. This version probably does not fix Hebrew file name issues, but you can try changing the encoding from 'cp1250' to something like 'iso-8859-8' or 'windows-1255' and rerunning the script.

    You can also use convmv, a Perl tool that converts file names from one encoding to another, but this is rather risky.
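
Instead of hard-coding one encoding name, the numeric code page can be mapped to an Encode name and checked with `find_encoding`. This is only a sketch: `encoding_for_codepage` is a hypothetical helper, and on Windows its argument would come from `get_codepage()`. Whether this is enough for Hebrew names has not been tested here.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: turn a numeric Windows code page into the name of
# an Encode encoding, or return undef if Encode does not know it.
sub encoding_for_codepage {
    my ($cp) = @_;
    return unless defined $cp && $cp =~ /^\d+$/;
    my $enc = find_encoding("cp$cp");
    return $enc ? $enc->name : undef;
}

# 1250 is Central European, 1255 is Hebrew; 99999 is a made-up value
# used to show the fallback.
for my $cp ( 1250, 1255, 99999 ) {
    my $name = encoding_for_codepage($cp);
    print "$cp => ", ( defined $name ? $name : "(not supported)" ), "\n";
}
```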