Friday, February 27, 2015

Automating PDF Data Export to R

The Challenge

I am emailed a large daily text PDF report that has data I’d like to automatically clean, rearrange in a new report, and email out. My issues were the following:
  1. Transfer PDF email into Google Drive.
  2. Export data from PDF into text file.
  3. Clean data and build report in R.
  4. Email report to higher ups.
There is the TM package in R that supposedly does this, but I couldn’t get it to work, so I came up with the following work around.

Step 1

The first step is to get the PDF file from my email account into a folder on my computer. To achieve this, I used the following recipe from ifttt.com.
This automatically transfers any email attachment I receive into a folder based on the sender in Google Drive.

Step 2

The second step was to export the PDF data into R. To do this I needed to do the following:
  • download xpdf command scripts from here.
  • Unzip file into Program Files folder
  • Copy pdftotext.exe, located in bin32 folder into the Google Drive folder that is storing the PDF reports.
  • Use Windows Task Scheduler to run a command script daily executes the pdftotext script on the daily PDF report.
    • Open Task Scheduler and Create New Task
    • Under Action Tab: Use path to cmd.exe for your Program/Script
    • Use /c “pdftotext -layout yourFile.pdf” in the Add Arguments field
    • Use path to where your PDF is stored in the Start in field
  • Set your Triggers for whenever you want the PDF to text transformation to occur.
Now you should have a text file in your Google Drive folder.

Step 3

Now that we have our PDF data in text format we can simply used readLines(“filename.txt”) in R. From here we can clean and build the wanted report using the rmarkdown package.

Step 4

Once we have the report built in R, we need to email it out to the higher ups. For this we need to use the mailR package. You may need to reduce your Gmail security to allow R to access your account.
##  library(mailR)
##  subject_line <- "GIVE ME A RAISE!!!"
##  send.mail(from = "myEmail@work.com",
##      to = c("bossman1@work.com",
##          "bossman2@work.com",
##          "bossman3@work.com"),
##    subject = subject_line,
##      body = paste("Daily Report For", yesterday, sep = " "),
##      smtp = list(host.name = "smtp.gmail.com", port = 465,
##      user.name = "myEmail@work.com",
##      passwd = "********", ssl = TRUE),
##      attach.files = "myFile.pdf",
##      authenticate = TRUE, send = TRUE)
Next we save the above script as myEmail.R and schedule the script to run daily in Windows Task Scheduler.
  • Open Task Scheduler and Create a new task
  • Under Action tab: use path to Rscript.exe in Program/Script field
  • Use myEmail.R in the Add Attributes field
  • Use path to myEmail.R in the Start in field

Complete