If you wanted to do it directly with a special-purpose routine, that would be fairly easy. Text::CSV_XS pulls CSV file rows into hashes with little effort, and you can then do what you like with them.
First, though, if your files are huge you should use the DB_File module to specify that your hash be stored on disk as a database; otherwise you can fill up memory and grind to a halt.
use DB_File;

my %theHash;
unlink '/tmp/translation.db';
sleep 2;
tie %theHash, 'DB_File', '/tmp/translation.db'
    or die "Can't open /tmp/translation.db: $!\n";
Then create the CSV objects:
use Text::CSV_XS;

my ( $data_csv, $log_csv, $output_csv ) =
    map { Text::CSV_XS->new( { allow_whitespace => 1,
                               eol              => "\015\012",
                               always_quote     => 1,
                               binary           => 1 } ) } 1 .. 3;
Note that I'm using DOS end-of-line characters ("\015\012", i.e. CRLF).
Then pull in the input header rows to set up the column names:
my @cols = @{ $data_csv->getline( $data_fh ) };
$data_csv->column_names( @cols );

@cols = @{ $log_csv->getline( $log_fh ) };
$log_csv->column_names( @cols );
where you've opened the files on the file handles $data_fh and $log_fh.
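The snippets here assume $data_fh and $log_fh (and, later, $output_fh) are already open. A minimal sketch of doing that follows; the file names are illustrative, and for a self-contained demonstration it opens a read handle on an in-memory string, which core Perl supports via a scalar reference:

```perl
use strict;
use warnings;

# In a real run, open the CSV files on disk (names are placeholders):
#   open my $data_fh,   '<', 'data.csv'   or die "data.csv: $!";
#   open my $log_fh,    '<', 'log.csv'    or die "log.csv: $!";
#   open my $output_fh, '>', 'output.csv' or die "output.csv: $!";

# For demonstration, open a read handle on a string instead.
my $data_text = "data_id,name\r\n1,pressure\r\n";
open my $data_fh, '<', \$data_text or die "open: $!";

my $header = <$data_fh>;    # reads the column-header row
```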
Decide what your output columns will be and write out a column header row:
my @output_cols = ( 'name', 'event_value' );
$output_csv->combine( @output_cols );
my $latest_row = $output_csv->string();
print $output_fh $latest_row, "\015\012";
Then build a data_id-to-name hash:
while ( my $log_csv_row = $log_csv->getline_hr( $log_fh ) ){
    $theHash{ $log_csv_row->{data_id} } = $log_csv_row->{name};
}
Then, as in your example, cycle through data.csv to pick out all of the rows whose data_id is '1':
my %outputHash;
$outputHash{name} = $theHash{1};
while ( my $data_csv_row = $data_csv->getline_hr( $data_fh ) ){
    next unless $data_csv_row->{data_id} == 1;
    $outputHash{event_value} = $data_csv_row->{event_value};
    $output_csv->combine( map { $outputHash{$_} } @output_cols );
    $latest_row = $output_csv->string();
    print $output_fh $latest_row, "\015\012";
}
This example code is the basis for all of the utility routines listed above; the hardcoded '1' is replaced with assorted arguments, or arrays of arguments that are collected into hashes.
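For instance, one common way to generalize the hardcoded '1' to an arbitrary set of data_ids is to put the wanted ids in a lookup hash and test membership inside the loop. A sketch with plain Perl data structures standing in for the getline_hr() rows (@wanted_ids and %wanted are illustrative names, not part of the code above):

```perl
use strict;
use warnings;

# Ids we want to keep -- in the real utility routines these would
# come from the caller's arguments rather than being hardcoded.
my @wanted_ids = ( 1, 3 );
my %wanted = map { $_ => 1 } @wanted_ids;   # O(1) membership test

# Inside the data.csv loop, the hardcoded test
#     next unless $data_csv_row->{data_id} == 1;
# becomes
#     next unless $wanted{ $data_csv_row->{data_id} };

# Demonstration with literal rows in place of getline_hr():
my @rows = (
    { data_id => 1, event_value => 'a' },
    { data_id => 2, event_value => 'b' },
    { data_id => 3, event_value => 'c' },
);
my @kept = grep { $wanted{ $_->{data_id} } } @rows;
# @kept now holds only the rows for data_ids 1 and 3
```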