M1 S&E - Big data analytics

M1 in Economics and Economics and Statistics, Toulouse School of Economics

Applied multivariate data analysis - Big data analytics (2014-…, 15 hours)

Material

A few hints

  • Worksheet 1, exercise 1, question 1:

For the last worksheet, if you are using the package rmr2 on Windows, apply the following patch just after loading the package and setting the backend to "local":

library("rmr2")
rmr.options(backend="local")

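# Patched version of dfs.tempfile: returns a function giving the path of a
# temporary file, with a finalizer that removes the file when it is no longer used.
dfs.tempfile = function(pattern = "file", 
                        tmpdir = {if (rmr.options("backend") == "hadoop") 
                        rmr.options("hdfs.tempdir") else tempdir()}) {
  if (.Platform$OS.type == "windows") {
    # Windows: keep the path returned by tempfile() unchanged
    fname  = tempfile(pattern, tmpdir)
  } else {
    # other systems: normalize the path and, if it contains ":",
    # keep only the second ":"-separated piece
    fname  = rmr.normalize.path(tempfile(pattern, tmpdir))
    subfname = strsplit(fname, ":")
    if (length(subfname[[1]]) > 1) fname = subfname[[1]][2]
  }
  namefun = function() {fname}
  # delete the file when this environment is garbage collected or when R exits
  # (unless the code is running inside a task)
  reg.finalizer(environment(namefun), function(e) {
    fname = eval(expression(fname), envir = e)
    if (!in.a.task() && dfs.exists(fname)) dfs.rmr(fname)
  }, onexit = TRUE)
  namefun
}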

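# replace the version of dfs.tempfile stored in the rmr2 namespace
assignInNamespace("dfs.tempfile", dfs.tempfile, ns="rmr2")

# helper functions used by the finalizer above
task.id = function() default(Sys.getenv("mapreduce_task_id"),
                             Sys.getenv('mapred_task_id'), "")
in.a.task = function() task.id() != ""
# returns value when x is "bad" (NULL by default), x otherwise
default = function(x, value, bad.value = is.null) {
  test = if (is.function(bad.value)) bad.value(x) else identical(bad.value, x)
  if(test) value else x
}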
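
As a quick sanity check of the helper functions used by the patch, the following sketch (plain R, rmr2 is not needed) redefines them and shows that in.a.task() is FALSE in an ordinary R session and becomes TRUE once one of the task id environment variables is set. The value "attempt_demo" is a made-up identifier used only for illustration.

```r
# Copies of the helper functions from the patch.
default <- function(x, value, bad.value = is.null) {
  test <- if (is.function(bad.value)) bad.value(x) else identical(bad.value, x)
  if (test) value else x
}
task.id <- function() default(Sys.getenv("mapreduce_task_id"),
                              Sys.getenv("mapred_task_id"), "")
in.a.task <- function() task.id() != ""

# Remember the current values so that they can be restored afterwards.
vars <- c("mapreduce_task_id", "mapred_task_id")
old <- Sys.getenv(vars, unset = NA)

# Ordinary R session: neither variable is set, so we are not in a task.
Sys.unsetenv(vars)
outside <- in.a.task()

# Simulate a Hadoop task with a hypothetical task id.
Sys.setenv(mapreduce_task_id = "attempt_demo")
inside <- in.a.task()

# Restore the original environment.
Sys.unsetenv(vars)
previously_set <- old[!is.na(old)]
if (length(previously_set) > 0) do.call(Sys.setenv, as.list(previously_set))

print(c(outside = outside, inside = inside))
```

Outside Hadoop both variables are normally unset, which is why the finalizer in the patch is allowed to remove the temporary file.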

Practical information

At this link, you can find the project group to which you have been assigned and check the classes in which you have been reported absent.

  • Project group number 3 in Group 1 and project group number 2 in Group 2 are scheduled to send me the answers to exercise 2, questions 4 to 6 (worksheet 1) by Tuesday, February 9th at the latest. The solution must include R code and comments on the results in a few slides (PDF format). It must be sent to me by email and I will acknowledge receipt.
  • Project group number 2 in Group 1 and project group number 3 in Group 2 are scheduled to send me the answers to exercise 2, questions 1, 3 and 4 (worksheet 2) by Tuesday, February 16th at the latest. The solution must include R code and comments on the results in a few slides (PDF format). It must be sent to me by email and I will acknowledge receipt.
  • Project group number 5 in Group 1 and project group number 4 in Group 2 are scheduled to send me the answers to exercise 2, questions 1 to 4 (worksheet 3) by Tuesday, March 15th at the latest. The solution must include R code and comments on the results in a few slides (PDF format). It must be sent to me by email and I will acknowledge receipt.
  • Project group number 1 in Group 1 and project group number 1 in Group 2 are scheduled to send me the answers to exercise 2, questions 1 to 3 (worksheet 4) by Tuesday, March 22nd at the latest. The solution must include R code and comments on the results in a few slides (PDF format). It must be sent to me by email and I will acknowledge receipt.
  • Project group number 4 in Group 1 is scheduled to send me the answers to exercise 2, questions 1, 2 and 4 (worksheet 4) by Tuesday, March 22nd at the latest. The solution must include R code and comments on the results in a few slides (PDF format). It must be sent to me by email and I will acknowledge receipt.

Final exam

The final exam will be held on March 18th, 5-8 pm. The exam will cover the practical part of this course. All documents will be allowed, as well as access to documents on the internet. However, any kind of communication with another person during the exam is strictly forbidden. Any attempt to cheat will be severely penalized.

About R

For this course, I will make intensive use of the R programming language. If you do not feel comfortable with R, I advise you to check out this material:
  • these slides, especially lessons 1 to 4 and lesson 6;
  • the R package swirl (instructions for the installation of R, RStudio and swirl are provided below). The package can be loaded into R using:
    library(swirl)
    and a swirl teaching session is started using:
    swirl()
    Choose the course "1: R Programming" and follow the instructions: you have 12 short interactive lessons that will help you become more familiar with R;
  • (not mandatory) if you want to go further, you can enroll in this online course.

For this course, the following (CRAN) packages must be installed (if you want to use your own computer): boot, mlbench, rpart, ipred, class, randomForest, e1071, foreach, doMC (unix-like OS users only) or doParallel (Windows users only).
In addition, the package rmr2, developed by Revolution Analytics, must also be installed. This is done by following the tutorial provided at this link (in English).
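
The CRAN package list above can be checked against your local library with a short script. This is only a minimal sketch: the object names (needed, parallel_backend, to_install) are illustrative, and the installation call runs only in an interactive session, so that sourcing the script non-interactively does not trigger any download. rmr2 is installed separately, as described above, and is not handled here.

```r
# CRAN packages required for the course (list taken from the text above).
needed <- c("boot", "mlbench", "rpart", "ipred", "class",
            "randomForest", "e1071", "foreach")

# Parallel backend: doParallel on Windows, doMC on unix-like systems.
parallel_backend <- if (.Platform$OS.type == "windows") "doParallel" else "doMC"
needed <- c(needed, parallel_backend)

# Packages from the list that are not yet in the local library.
to_install <- setdiff(needed, rownames(installed.packages()))

# Install the missing ones (only when R is used interactively).
if (interactive() && length(to_install) > 0) install.packages(to_install)
```

After running it, to_install lists the packages that were missing before the installation step.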

How to install R?

R can be downloaded for free from the official repository website. I also advise you to install RStudio, a simple graphical user interface for R that is very handy in many situations. Finally, packages can be installed either through the menu "Tools/Install packages" in RStudio or directly with the command line:
install.packages("swirl")
in an R console.