Re: [SLUG] perl [pig] duplicate removal with a twist

From: baris nema (baris_nema@cigflorida.com)
Date: Mon Jul 24 2006 - 17:44:31 EDT


The input files I'm processing are currently on the order of 6 MB
(~100,000 lines), so I'm thinking an array is out. The output file is
usually smaller.

Would it be possible to search the output file without having to close
it (maybe using a different file handle)?
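
Something like this is what I had in mind (an untested sketch, assuming
the output handle is opened for appending, $outputfile holds the file
name, and the part I'm checking for sits at the start of each output
line):

use strict;
use warnings;
use IO::Handle;

my $outputfile = 'output.txt';     # placeholder name

open my $out_fh, '>>', $outputfile or die "Can't append to $outputfile: $!";
$out_fh->autoflush(1);             # so a read handle sees what's already written

sub already_in_output {
    my ($key) = @_;
    # separate read-only handle on the same file; the append handle stays open
    open my $check_fh, '<', $outputfile or return 0;   # no output yet
    while (my $line = <$check_fh>) {
        return 1 if index($line, $key) == 0;           # key at start of line
    }
    close $check_fh;
    return 0;
}

That still re-reads the whole output file for every input line, though,
so it isn't really better than shelling out to grep (see the hash idea
after the pseudo code below).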

I'm currently doing the processing in bash, but because of the number of
loops I want to port it to perl. I'm quite new to perl, so any code
examples would be helpful.

Currently it's (perlified pseudo code):

while (my $line = <$in_fh>) {        # read line in from input file
    my $criticalpart = ...;          # result of several operations on $line
    # trying to replace this grep with perl code:
    my $existstat = `grep -c "$criticalpart" "$outputfile"`;
    if ( $existstat == 0 ) {         # a stray ';' after the condition would make the block run unconditionally
        my $processedcriticalpart = ...;   # some more operations on $criticalpart
        print $out_fh "$processedcriticalpart\n";   # print to the output handle, not the file name
    }
}
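
Would something along these lines work instead? It's an untested sketch
that swaps the grep out for a hash of keys already written; the file
names are placeholders, and the key-is-the-first-field guess in the
pre-load loop may not match my real data:

#!/usr/bin/perl
use strict;
use warnings;

my $inputfile  = 'input.txt';      # placeholder names
my $outputfile = 'output.txt';

my %seen;

# If the output file already holds lines from earlier runs, pre-load their
# keys (guessing the key is the first whitespace-separated field).
if (open my $old_fh, '<', $outputfile) {
    while (my $old = <$old_fh>) {
        my ($key) = split ' ', $old;
        $seen{$key} = 1 if defined $key;
    }
    close $old_fh;
}

open my $in_fh,  '<',  $inputfile  or die "Can't read $inputfile: $!";
open my $out_fh, '>>', $outputfile or die "Can't append to $outputfile: $!";

while (my $line = <$in_fh>) {
    chomp $line;
    my $criticalpart = $line;        # stand-in for the several operations on the line
    next if $seen{$criticalpart}++;  # already written, skip it
    my $processedcriticalpart = $criticalpart;   # stand-in for the further operations
    print $out_fh "$processedcriticalpart\n";
}

close $in_fh;
close $out_fh;

A hash of ~100,000 short keys should fit in memory easily, and it avoids
re-reading the output file for every input line.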

Levi Bard wrote:
> On 7/24/06, baris nema <baris_nema@cigflorida.com> wrote:
>> I'm writing a perl program that takes input from one file, processes it
>> (line by line), and then puts it into another text file. I'm trying to
>> figure out how I can search the second text file to see if a particular
>> part of that line (at the beginning of the line) has already been put
>> into the file, and if it has, don't put it in.
>>
>> What I'm having trouble with is how to search the output file in perl
>> for that text before writing to it.
>> -any ideas?
>
> Unless these files have the potential to be huge, I'd read the input
> into a list (or array, whatever), and do the duplicate check with the
> list upon input. If the order of the input doesn't matter, you could
> do an insertion sort while reading, and make the duplicate check much
> quicker.
>
> If neither of those are viable, you're going to be
> opening/searching/closing the second file for each line of input.
>

-----------------------------------------------------------------------
This list is provided as an unmoderated internet service by Networked
Knowledge Systems (NKS). Views and opinions expressed in messages
posted are those of the author and do not necessarily reflect the
official policy or position of NKS or any of its employees.



This archive was generated by hypermail 2.1.3 : Fri Aug 01 2014 - 15:02:39 EDT