
I'm very confused by this and I'm sure it's something simple; hopefully someone can point me in the right direction.

I am working on a text mining project with the tm package. When I run the code line by line in the console it works perfectly, but when I call the function itself, the final output is empty.

Here's some sample code:

func <- function(filename, count=100, full=FALSE){

  packages <- c("ggplot2", "tm")
  if (length(setdiff(packages, rownames(installed.packages()))) > 0) {
    install.packages(setdiff(packages, rownames(installed.packages())))  
  }
  library(tm)
  library(ggplot2)

  ## get data
  data <- read.csv(filename) 

  ##Create corpus and remove formatting from text
  Tickets <- Corpus(DataframeSource(data))
  Tickets = tm_map(Tickets, removePunctuation)
  Tickets = tm_map(Tickets, tolower)

  ##Create stopwords vector to remove complete list from data
  stopwords <- read.csv("stopwords.csv", header=FALSE)
  stopwords <- as.character(stopwords[,1])
  stopwords <- c(stopwords("english"), stopwords)

  ## create full analysis of whole data, if selected by user
  if(full==TRUE){
    Tickets = tm_map(Tickets, PlainTextDocument) ##convert back to a text document we can analyse
    Tickets.TDM <- TermDocumentMatrix(Tickets) ## create matrix for analysis
    TDM.frame <- data.frame(as.matrix(Tickets.TDM))
    write.csv(TDM.frame, "Full_queue_analysis.csv")
  }

  ## Remove Stopwords and irrelevant data then convert to TDM for analysis
  Tickets = tm_map(Tickets, removeWords, stopwords)
  Tickets = tm_map(Tickets, removeNumbers)
  Tickets = tm_map(Tickets, stripWhitespace)
  Tickets = tm_map(Tickets, PlainTextDocument)
  Tickets.TDM <- TermDocumentMatrix(Tickets)

  ## matrix to frame for additional calculations
  TDM.frame <- data.frame(as.matrix(Tickets.TDM))

  ## count each word once per entry and keep only those whose count exceeds the user-specified amount
  Counts.df <- data.frame(rowSums(TDM.frame > 0))
  colnames(Counts.df) <- "count"
  Counts.df <- subset(Counts.df, count > count)

  ## create csv file for final counts
  write.csv(Counts.df, "Queue_analysis.csv")

  ##Print basic analysis based on user option
  cat("Terms which appear more than",count,"times:")
  findFreqTerms(Tickets.TDM, count)
}

Things seem to go wrong when initialising the Counts.df data frame. I can run this perfectly through the console and it populates with the correct data, but when run in the function it is completely empty, though it does exist.

There are no errors and the function ends as expected, but when I open the csv file it is empty, with just the "count" header.

Thanks for any advice!

Edit - added function itself, sorry!

  • You need to show the actual function and how it is called. Commented Mar 30, 2015 at 19:39
  • Silly me, you're totally right :D I've completed it now, thanks! Commented Mar 30, 2015 at 19:43

1 Answer

What do you think that count > count is doing? How can something be larger than itself? Both sides of the comparison refer to the same object, so whatever value count takes, the condition passed to subset() can never be TRUE. With the default of 100, for example:

> 100 > 100
[1] FALSE

hence you are dropping all rows.

E.g.:

> subset(mtcars, 100 > 100)
 [1] mpg  cyl  disp hp   drat wt   qsec vs   am   gear carb
<0 rows> (or 0-length row.names)

You have a logic error in your code, but I can't make out what you intended, so you'll need to solve that yourself. Which vector of counts did you want to retain elements from when they are greater than count (which you set by default to 100 in the function definition)? Did you mean

Counts.df <- subset(Counts.df, Counts.df[,1] > count)

Ahh, penny drops. This is just a problem relating to evaluation of the count in subset(). Here is a simple example of what is wrong:

> foo <- function(foo, cyl = 3) {
+ subset(mtcars, cyl > cyl)
+ }
> foo()
 [1] mpg  cyl  disp hp   drat wt   qsec vs   am   gear carb
<0 rows> (or 0-length row.names)

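Here both occurrences of cyl in the subset() call resolve to the same object. subset() evaluates its condition with the data frame's columns first in scope, so the column cyl masks the function argument cyl = 3 and the test becomes mtcars$cyl > mtcars$cyl, which is FALSE for every row. Hence you get the same result as subset(mtcars, FALSE).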

So how about you name the column in Counts.df Count instead of count (i.e. colnames(Counts.df) <- "Count"), and then have that line of code be:

Counts.df <- subset(Counts.df, Count > count)

Which should work.
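
If you would rather keep the column name as it is, another option is to skip subset() and index with `[`, which uses ordinary evaluation. This is only a minimal sketch: filter_counts and term_counts are made-up names standing in for the relevant part of your function, and the numbers are invented.

```r
# Hypothetical stand-in for the filtering step; `filter_counts` and
# `term_counts` are illustrative names, not taken from the question.
filter_counts <- function(term_counts, count = 100) {
  # One row per term, with a column that is still called "count".
  counts_df <- data.frame(count = term_counts)
  # `counts_df$count` is explicitly the column, while the bare `count`
  # is the function argument, so the two can no longer be confused.
  # drop = FALSE keeps the result a data frame with its row names.
  counts_df[counts_df$count > count, , drop = FALSE]
}

# Made-up term counts purely for illustration.
result <- filter_counts(c(alpha = 250, beta = 40, gamma = 101), count = 100)
print(result)
```

Only the rows whose count exceeds the threshold are kept, and the row names (the terms) survive, so write.csv() would still label each row.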


1 Comment

Ah... I see what you mean, my vector naming needs improvement.. the first count is the name of the column I want it to subset.
