19 Jan 2011

Group and count files by type in Perl (using file magic numbers)

Perl File::Find - passing parameters to the "wanted" callback - basics

Using object-oriented libraries with the File::Find module is not as obvious as it should be, because File::Find calls the "wanted" callback without any arguments. But there is a simple way around it: use closures. Now you are probably thinking, "stop the bla bla talk and show me the code!", so here it is:

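use strict;
use warnings;
use File::Find;

# $prefix is captured by the anonymous sub below (a closure)
my $prefix = "Found ";
find(
   {
      wanted => sub {
         print $prefix.$File::Find::name."\n";
      },
      follow   => 1,    # follow symbolic links
      no_chdir => 1,    # do not chdir() into each directory
   },
   "." # dir to search
); # find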
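The same trick works when the callback needs more than one piece of state. Here is a minimal sketch, assuming a helper called make_wanted (a made-up name, not part of File::Find) and a throwaway directory created with File::Temp so that it runs anywhere. The factory returns a callback that remembers its own prefix and result list:

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Hypothetical helper: returns a "wanted" callback that remembers
# $prefix and $found, although File::Find calls it without arguments.
sub make_wanted {
    my ( $prefix, $found ) = @_;
    return sub {
        return if not -f $File::Find::name;    # only files
        push @{$found}, $prefix . $File::Find::name;
    };
}

# Throwaway directory with two empty files (illustration only).
my $tmp = tempdir( CLEANUP => 1 );
for my $name (qw(a.txt b.txt)) {
    open my $fh, '>', "$tmp/$name" or die "Cannot create $tmp/$name: $!";
    close $fh;
}

my @found;
find( { wanted => make_wanted( "Found ", \@found ), no_chdir => 1 }, $tmp );
print "$_\n" for sort @found;
```

Each call to make_wanted creates a fresh pair of captured variables, so two searches can collect into two different lists without sharing globals.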

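How to determine a file's type, ignoring its extension?

The answer is simple: use magic numbers.

What is a magic number?

A magic number is a constant numerical or text value used to identify a file format or protocol. For example, every PDF file starts with the text %PDF, and every PNG file starts with the byte 0x89 followed by the letters PNG.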
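To make the idea concrete, here is a rough sketch of checking magic numbers by hand in plain Perl. The helper names guess_type and guess_file_type are made up for this illustration, and the table holds only four well-known signatures; a real detection module knows far more and handles trickier formats.

```perl
use strict;
use warnings;

# A few well-known signatures; a real detector knows many more.
my @signatures = (
    [ "\x89PNG\r\n\x1a\n", 'image/png' ],
    [ "\xFF\xD8\xFF",      'image/jpeg' ],
    [ "PK\x03\x04",        'application/zip' ],
    [ "%PDF-",             'application/pdf' ],
);

# Hypothetical helper: guess a MIME type from the leading bytes of $data.
sub guess_type {
    my ($data) = @_;
    for my $sig (@signatures) {
        my ( $magic, $type ) = @{$sig};
        return $type if index( $data, $magic ) == 0;    # must match at offset 0
    }
    return 'application/octet-stream';    # unknown content
}

# Hypothetical helper: read only the first bytes of a file and guess its type.
sub guess_file_type {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    read $fh, my $head, 16;
    close $fh;
    return guess_type( defined $head ? $head : '' );
}

print guess_type("%PDF-1.4\n"), "\n";               # application/pdf
print guess_type("\x89PNG\r\n\x1a\nrest"), "\n";    # image/png
print guess_type("hello"), "\n";                    # application/octet-stream
```

Note that the file name never enters the decision: guess_file_type looks only at the first 16 bytes, so a JPEG renamed to photo.txt would still be reported as image/jpeg.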

How to connect a file type with a magic number in Perl?

There is a variety of modules implementing this; for the sample application presented below I used the File::Type module. Installing this module on Ubuntu is extremely easy: apt-get install libfile-type-perl.

A tool for finding files of a certain type using magic numbers

#!/usr/bin/perl

# Author:  Tomasz Gaweda
# Date:    2011.01.19
# Purpose: Find files using magic number
#
# Usage:  ./findGroup.pl scanDir
#         ./findGroup.pl scanDir listFilesType
#

use strict;
use warnings;

use File::Find;
use File::Type;

# Globals
my %magic2File;
my $indent = "   ";
my $dir    = '.';

# CMD line parsing
die "Usage: $0 scanDir [listFilesType]\n" if scalar @ARGV < 1;
die $ARGV[0] . " is not a directory\n" if not -d $ARGV[0];

$dir = $ARGV[0];
my $ft = File::Type->new();

# Search
find(
   {
      wanted => sub {
         return if not -f $File::Find::name;    # only files
         my $type = $ft->checktype_filename($File::Find::name);

         #print $File::Find::name." (".$type.")\n";
         push @{ $magic2File{$type} }, $File::Find::name;
      },
      follow   => 1,
      no_chdir => 1,
   },
   $dir
); # find

# Sort by file type
my @aTypes = sort { $a cmp $b } keys %magic2File;

if ( scalar @ARGV <= 1 ) {
   print "Directory $dir contains:\n";
   for my $type (@aTypes) {
      print $indent. scalar( @{ $magic2File{$type} } ) . " $type\n";
   } # for
}
else {
   my $dumpType = $ARGV[1];
   my @matchMagic = grep /$dumpType/, @aTypes;
   for my $type (@matchMagic) {
      print scalar( @{ $magic2File{$type} } )
        . " files of type $type (matching: $dumpType) in $dir"
        . "\n$indent"
        . join( "\n" . $indent, sort { $a cmp $b } @{ $magic2File{$type} } )
        . "\n";
   } #for
}

You can download the source code of this program here: findGroup.pl

How to use the tool?

$ perl findGroup.pl . 
Directory . contains:
   17 application/octet-stream
   2 application/x-perl
   3 application/zip
   346 image/jpeg
$ perl findGroup.pl . perl
2 files of type application/x-perl (matching: perl) in .
   ./findGroup.pl
   ./test.pl
