0

I know the thread How can I inner join two csv files in R which has a merge option, which I do not want. I have two data CSV files. I am thinking how to query like them like SQL with R. Two CSV files where primary key is data_id. data.csv where OK to have IDs not found in log.csv (etc 4)

data_id, event_value
1, 777
1, 666
2, 111
4, 123 
3, 324
1, 245

log.csv where no duplicates in ID column but duplicates can be in name

data_id, name
1, leo
2, leopold
3, lorem

Pseudocode by partial PostgreSQL syntax

  1. Let data_id=1
  2. Show name and event_value from data.csv and log.csv, respectively

Pseudocode like partial PostgreSQL select

SELECT name, event_value 
    FROM data, log
    WHERE data_id=1;

Expected output

leo, 777
leo, 666 
leo, 245

R approach

file1 <- read.table("file1.csv", col.names=c("data_id", "event_value"))
file2 <- read.table("file2.csv", col.names=c("data_id", "name"))

# TODO here something like the SQL query 
# http://stackoverflow.com/a/1307824/54964

Possible approaches where I think sqldf can be sufficient here

  1. sqldf
  2. data.table
  3. dplyr

PostgreSQL Schema pseudocode to show what I am trying to do with CSV files

CREATE TABLE data (
        data_id SERIAL PRIMARY KEY NOT NULL,
        event_value INTEGER NOT NULL
);
CREATE TABLE log (
        data_id SERIAL PRIMARY KEY NOT NULL,
        name INTEGER NOT NULL
);

R: 3.3.3
OS: Debian 8.7
Related: PostgreSQL approach in the relevant thread How to SELECT with two CSV files/… on PostgreSQL?

4
  • Cross-site duplicate: stackoverflow.com/questions/7187372/… Commented May 13, 2017 at 13:15
  • 1
    The SQL here seems tangential and requires people to be familiar with both platforms to understand what you're asking. Your expected result looks like you want to merge and find keys with duplicates, or maybe merge and display just the results for a particular key, but the SQL doesn't seem to support either interpretation. Commented May 13, 2017 at 13:41
  • I still have no idea what you actually want; but my suspicion is that it's something really basic, like stackoverflow.com/questions/18463562/… Commented May 13, 2017 at 14:25
  • @tripleee Absolutely no. Please, see Possible approaches section where I think sqldf is most closest here, which I am thinking too. Commented May 13, 2017 at 14:35

3 Answers 3

1

R has a number of packages that come with the SQL level of convenience. The most convenient packages are

  • dplyr (modern, usually 10-100 times speed against the base functions) with SQL-inspired commands such as group-by and different joins

  • SparkR (if you need Spark support, not apparently here but this come with nice SQL-convenient commands) with SQL-inspired commands such as group-by and different joins

  • data-table provides more efficient functions than the base functions such as replace read.csv with fread.

  • The SQLDF is very unreliable and inefficient, yes you will get fatal error with the precedence error like above failing Rstudio.

where the packages won't teach you learn and debug SQL: in order to learn proper SQL, SQLDF is cerntainly not the most convenient tool for this. SQL Server Management Studio, Windows, may be easier to use for example with Azure Basic SQL DB, 5EUR/month or free tier here or some other DB -- or set up db yourself

  • source postgres database with src_postgres() command, more about database commands in R here

The below provides you demos, showing the proper SQL and fixing the bugs occurring in your Code cases. I also show the SQL convenience commands. It is good to learn SQL properly first so you know what to look for in the R packages.

DEMOS


Code 4 is failing because of wrong precedence. The where-clause and the USING must be after the join.

> file1 <- read.csv("test1.csv", header=TRUE, sep=",")
> file2 <- read.csv("test2.csv", header=TRUE, sep=",")

> sqldf("SELECT event_value, name
+       FROM file1
+       LEFT JOIN 
+       (SELECT data_id, name
+       FROM file2
+       WHERE data_id = 1)
+       USING(data_id)
+       WHERE data_id = 1")
  event_value name
1         777  leo
2         666  leo
3         245  leo

Other ways contain

Proper LEFT-JOIN approach

> df3 <- sqldf("SELECT event_value, name
+              FROM file1 a
+              LEFT JOIN file2 b ON a.data_id=b.data_id")
> 
> df3
  event_value    name
1         777     leo
2         666     leo
3         111 leopold
4         123    <NA>
5         324   lorem
6         245     leo
> df3 <- sqldf("SELECT a.event_value, b.name
+    FROM file1 a
+              LEFT JOIN file2 b ON a.data_id=b.data_id
+              WHERE a.data_id = 1")
> df3
  event_value name
1         777  leo
2         666  leo
3         245  leo

Merging the tables with where condition

> df4 <- sqldf("SELECT a.event_value, b.name
+    FROM file1 a, file2 b
+              WHERE a.data_id = 1
+              AND a.data_id=b.data_id")
> 
> df4
  event_value name
1         777  leo
2         666  leo
3         245  leo

SQL subqueries approach

> df5 <- sqldf("SELECT a.event_value, b.name
+    FROM 
+              (SELECT data_id, event_value FROM file1) a,
+              (SELECT data_id, name FROM file2) b
+              WHERE a.data_id = 1
+              AND a.data_id=b.data_id")
> df5
  a.event_value b.name
1           777    leo
2           666    leo
3           245    leo

WHAT ARE SQL-CONVENIENT WAYS TO USE R?


dplyr

  • SQL-style joins

    enter image description here

  • build_sql convenience function to execute SQL-style commands

    enter image description here

  • Case-when function is inspired by SQL-CASE-WHEN

  • Coalesce function is inpsired by SQL-COALESCE
  • translate_sql helps to convert R functions to SQL (more cases here)

    # Many functions have slightly different names
    translate_sql(x == 1 && (y < 2 || z > 3))
    #> <SQL> "x" = 1.0 AND ("y" < 2.0 OR "z" > 3.0)
    translate_sql(x ^ 2 < 10)
    #> <SQL> POWER("x", 2.0) < 10.0
    translate_sql(x %% 2 == 10)
    #> <SQL> "x" % 2.0 = 10.0
    
  • SQLLite and dplyr: install the sqlite package and try the NYC dataset with dplyr, more here

SparkR package

  • comes with SQL-style joins (inner joins, left-joins, etc) and group-by. More here.

    enter image description here

4
  • Assume you have 20 columns and 6, respectively in the tables. Which method would you use to join such tables and then do the request? Commented May 14, 2017 at 14:52
  • 1
    @LéoLéopoldHertz준영 because I have SQL Server (running on Windows server) available, I would use it and MMSM (Micsoft SQL Management Studio, free). If not, I would try try to find replacement for it. For very small data amount 6-20 cols (still little), I would try R and SQLDF with SELECT TOP 100 Col1, col2,... FROM TABLE1 directive with TOP100 (so not hanging up even with big amount of data) and/or dplyr if too slow. If I had no computer what-so-ever available (only chromebook or tablet or phone), I would do everything from Azure-SQL GUI: you can run SQL commands through the webinterface. Commented May 14, 2017 at 15:59
  • 1
    @LéoLéopoldHertz준영 Ubuntu has alternatives and DbVis looks promising, working in OSX/Linux/W. The free edition looks fine, driver to PostegreSQL but no scheduling. This looks similar to MMSM and you need to get the db set up first somewhere like your comp or Azure SQL. Commented May 14, 2017 at 16:09
  • @LéoLéopoldHertz준영 Moved the DbVis point here when you have no Windows os available. There may be other alternatives to MMSM. Commented May 14, 2017 at 21:35
0

sqldf approach.

One approach which shows a caveat with join approach - you cannot use WHERE data_id on both tables if you join by data_id. Code 1

file1 <- read.table("data.csv", col.names=c("data_id", "event_value"))
file2 <- read.table("log.csv", col.names=c("data_id", "name"))

library("sqldf")
df3 <- sqldf("SELECT event_value, name
   FROM file1
   LEFT JOIN file2 USING(data_id)")

df3

Output wrong because data_id = 1 should be active too

Loading required package: gsubfn
Loading required package: proto
Loading required package: RSQLite
Loading required package: tcltk
Warning message:
Quoted identifiers should have class SQL, use DBI::SQL() if the caller performs the quoting. 
  event_value    name
1 event_value    name
2         777     leo
3         666     leo
4         111 leopold
5         123    <NA>
6         324   lorem
7         245     leo

Code 2

Code

df3 <- sqldf("SELECT event_value, name
   FROM file1
   LEFT JOIN file2 USING(data_id)
   WHERE data_id = 1")

Output blank because join applied already

[1] event_value name       
<0 rows> (or 0-length row.names)

Code 3

Do WHERE earlier

df3 <- sqldf("SELECT event_value, name
   FROM file1
   WHERE data_id = 1
   LEFT JOIN file2 USING(data_id)")

Output error because two tables are of different size so WHERE should be applied on both tables

Error in rsqlite_send_query(conn@ptr, statement) : 
  near "LEFT": syntax error
Calls: sqldf ... initialize -> initialize -> rsqlite_send_query -> .Call
In addition: Warning message:
Quoted identifiers should have class SQL, use DBI::SQL() if the caller performs the quoting. 
Execution halted

Code 4

Using two SELECTs with JOIN

df3 <- sqldf("SELECT event_value, name
   FROM file1
   WHERE data_id = 1
   LEFT JOIN 
       (SELECT data_id, name
       FROM file2
       WHERE data_id = 1)
   USING(data_id)")

Output error

Error in rsqlite_send_query(conn@ptr, statement) : 
  near "LEFT": syntax error
Calls: sqldf ... initialize -> initialize -> rsqlite_send_query -> .Call
In addition: Warning message:
Quoted identifiers should have class SQL, use DBI::SQL() if the caller performs the quoting. 
Execution halted

Maybe, a syntax error with the second SELECT and its attachment to JOIN.

0

I make extensive use of the Perl language module Text::CSV_XS for heavy duty, ad-hoc manipulation of CSV files. Using this module I've built four small basic Perl programs to use as building blocks for whatever I want to do.

  1. Filter - Filter inputFile filterFile field
  2. Reject - Reject inputFile filterFile field
  3. Stripper - Stripper inputFile field [field2 field3…]
  4. Swap - Swap inputFile swapFile matchField outfield

The filterFile has a regEx pattern on each line. Anything matching one of these patterns is matched for the purposes of acceptance or rejection. The assorted "field"s are column header names.

So in your example I just put "1" in filterFile and go:

perl Filter.pm data.csv filter.txt data_id >One.csv
perl Stripper.pm One.csv data_id event_value >Two.csv
perl Swap.pm Two.csv log.csv data_id name >Three.csv

If we also wanted Leopold's events filter.txt would have two lines with eponymous contents:

1
2 

I have assorted mutant versions of all four building block routines that do things like take input from STDIN or or post output to a specific URL.

If you wanted to do it directly with a special-purpose routine that would be fairly easy. Text::CSV_XS easily pulls CSV file rows into hashes and then you can do what you like with them.

First though, if your files are huge you should use the DB_File module to specify that your hash should be stored on disk as a database. Otherwise you can fill up memory and grind to a halt.

use DB_File;

my %theHash;
unlink '/tmp/translation.db';
sleep 2;
tie ( %theHash, 'DB_File', '/tmp/translation.db' )
    or die "Can't open /tmp/translation.db\n";

Then create CSV objects

map{ $_ = Text::CSV_XS->new( { allow_whitespace => 1,
        eol =>"\015\012",
        always_quote => 1, binary => 1 })}
   ( $data_csv, $log_csv, $output_csv );

Note that I'm using DOS EOL characters.

Then pull in the input header rows to set up the column names

@cols = @{$data_csv->getline( $data_fh )};
$data_csv->column_names( @cols );
@cols = @{$log_csv->getline( $log_fh )};
$log_csv->column_names( @cols );

where you've opened the files on the file handles $data_fh and $log_fh.

Decide what your output columns will be and write out a column header row

@output_cols = ( 'name', 'event_value' );
$output_csv->combine( @output_cols );
$latest_row = $output_csv->string();
print $output_fh, $latest_row;

Then make up a data_id to name hash.

while ( $log_csv_row = $log_csv->getline_hr( $log_fh ) ){
    $theHash{ $log_csv_row->{data_id} } = $log_csv_row->{name};
}

Then, as in your example, cycle through data.csv to get all of the '1's.

$outputHash{name} = $theHash{1};

while ( $data_csv_row = $data_csv->getline_hr( $data_fh ) ){

    next unless $data_csv_row->{data_id} == 1;

    $outputHash{data_id} = $data_csv_row->{data_id};
    $output_csv->combine( map { $outputHash{$_} } @output_cols );
    $latest_row = $output_csv->string();
    print $output_fh "$latest_row";
}

This example code is the basis for all of the utility routines listed above where the hardcoded '1' is replaced with assorted arguments or arrays of arguments that are put in hashes.

3
  • Can you give some examples with the pseudocode provided in the body? I feel the answer now a stub. Commented May 20, 2017 at 8:05
  • Not sure what you mean here. Do you mean examples of the utility routines I mention above in data pipelines or do you mean code using the 'TEXT::CSV_XS' module such as these utility routines? The answer was just intended to be a stub since you said you just wanted possible approaches but I can expand on it if you like. Commented May 20, 2017 at 16:43
  • Yes, it would be nice to get it more expanded with the examples about the utility routines and/or TEXT::CSV_XS module, for the pseudocode in the body. . Commented May 20, 2017 at 19:32

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.