13 Oct 2011

Using Perl to extract files from a large directory structure

When I work on Windows I use ActiveState ActivePerl for odd jobs. Recently I had to copy about 60 files out of a large, messy directory structure (about 800 files) to a new location. Doing it by hand would have been tedious, so I used Perl: the names of the wanted files were listed in a text file, which makes this an easy task to solve with a hash. Unfortunately the job did not complete cleanly at first, because of a problem with the encoding of file path names on Windows.

Below is a simple script demonstrating how to handle national characters in paths using Perl on Windows.

#!/usr/bin/perl
 
use strict;
use warnings;
 
use utf8;
use Encode;   # provides decode()
 
# Win32::Codepage::Simple is available in ActiveState Perl
use Win32::Codepage::Simple qw(get_codepage get_acp);
use File::Find;
use File::Copy;
use File::Path;
use File::Basename;
 
binmode STDOUT, ":utf8";
binmode STDERR, ":utf8";
 
my $dir  = "./copy_from_here/";
my $dest = "./copy_here/";
my %filter = ( 'grzegrzółka.doc' => 1, 'słoń i łoś.avi' => 1);
 
# Polish Windows 7 :)
my $encoding = get_codepage();
if ($encoding eq '1250') {
   $encoding = 'cp1250';
}
 
find(
       { wanted =>
               sub {
 
               return if not -f $File::Find::name;
 
               my $mfname = $encoding ? 
                              decode($encoding, $File::Find::name) 
                            : $File::Find::name;
               # %filter is keyed by bare file name, so compare the base name, not the full path
               return if not exists $filter{ basename($mfname) };
 
               my $nfile = basename($File::Find::name);
               if ( not -e $dest.'/'.$nfile ) {
                       copy($File::Find::name, $dest.'/'.$nfile) or die "File cannot be copied: $!";
               } else {
                       print "File exists: ".$nfile."\n";
               }
 
               },
       follow => 1, no_chdir => 1, depth => 1
       }
       , $dir
);
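
The decode step is the heart of the script, so here is a minimal sketch of why it is needed. It does not touch the file system: the raw name is simulated with encode(), on the assumption that the system returns file names as cp1250 bytes. The variable names $raw_name, $raw_hit and $decoded_hit are made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                      # this source file is assumed to be saved as UTF-8
use Encode qw(decode encode);

# Because of "use utf8", the hash key is a Perl character string.
my %filter = ( 'słoń i łoś.avi' => 1 );

# Simulate what a cp1250 system would hand back for that file:
# the same name, but as raw cp1250 bytes (an assumption for this sketch).
my $raw_name = encode('cp1250', 'słoń i łoś.avi');

# Looking up the raw bytes misses the key ...
my $raw_hit = exists $filter{$raw_name} ? 1 : 0;

# ... while the decoded character string finds it.
my $decoded_hit = exists $filter{ decode('cp1250', $raw_name) } ? 1 : 0;

binmode STDOUT, ':encoding(UTF-8)';
print "raw bytes match: $raw_hit, decoded match: $decoded_hit\n";
```

The same idea applies to any other code page: only the encoding name passed to decode() changes.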

More information about national characters in path on Windows

2 comments:

  1. Does it fix Hebrew issues as well (Hebrew directories)? I have yet to find a solution for this problem.

  2. This version probably does not fix Hebrew file name issues, but you can try changing the encoding from 'cp1250' to 'iso8859-8', 'windows-1255', or another Hebrew encoding, and rerun the script.

    You can also use a Perl tool named convmv, which converts file names from one encoding to another, but this is rather risky.
