---
title: "Predicting flight delays - Ruby Meetup"
author: "Brigitte"
date: "April 27, 2016"
output:
  html_document:
    keep_md: yes
    theme: cosmo
---
```{r}
set.seed(100)  # for reproducible results
setwd("~/GitHub/Ruby_Talk_Material")
library(caret)

# Read the prepared training and test sets
trainData <- read.csv('train.csv', sep = ',', header = TRUE)
testData <- read.csv('test.csv', sep = ',', header = TRUE)

# The outcome ARR_DEL15 (was the arrival delayed?) and DAY_OF_WEEK are
# categorical, so convert them to factors
trainData$ARR_DEL15 <- as.factor(trainData$ARR_DEL15)
testData$ARR_DEL15 <- as.factor(testData$ARR_DEL15)
trainData$DAY_OF_WEEK <- as.factor(trainData$DAY_OF_WEEK)
testData$DAY_OF_WEEK <- as.factor(testData$DAY_OF_WEEK)

# Drop the row-index column left over from writing the CSVs
trainData$X <- NULL
testData$X <- NULL
```
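Why the factor conversion matters: left numeric, `DAY_OF_WEEK` would impose an artificial ordering (day 7 treated as "seven times" day 1), while a factor gives each day its own indicator column in the model matrix. A minimal base-R sketch with made-up codes:

```{r}
dow <- c(1, 7, 7, 2)   # numeric day-of-week codes
f <- as.factor(dow)
levels(f)              # one level per distinct day: "1" "2" "7"
model.matrix(~ f)      # intercept plus one dummy column per non-baseline level
```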
Now we train the model. We use a rather simple algorithm first to do the classification. If the performance is not good enough, we can move to ensemble algorithms, which are usually better. Even better would be to select the most important variables from the data, include additional predictor variables, or do feature engineering.
We choose logistic regression to start with: basically a regression that predicts a binary value.
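As a quick illustration of the idea (not part of the analysis itself), logistic regression pushes the linear predictor through the logistic link, which squashes any real number into a probability between 0 and 1:

```{r}
# Logistic (sigmoid) link: maps any real-valued score to a probability
sigmoid <- function(x) 1 / (1 + exp(-x))
sigmoid(c(-5, 0, 5))  # near 0, exactly 0.5, near 1
```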
```{r}
library(caret)
# '.' in the formula means "all remaining columns as predictors".
# method = 'glm' is a generalized linear model; setting family = 'binomial'
# makes it a logistic regression.
logisticRegModel <- train(ARR_DEL15 ~ ., data = trainData,
                          method = 'glm', family = 'binomial')
```
Now we can use the model and the test data to check how well we predict flight arrival delays.
```{r}
logRegPrediction <- predict(logisticRegModel, testData)
logRegConfMat <- confusionMatrix(logRegPrediction, testData[,"ARR_DEL15"])
logRegConfMat
```
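For reference: `confusionMatrix` takes the first factor level (`"0"`, not delayed) as the positive class, so specificity here is the fraction of actually delayed flights the model catches. A hand-computed toy example (made-up labels, not our data):

```{r}
# Specificity = TN / (TN + FP) with "0" as the positive class, i.e. the
# proportion of actual delays ("1") that were predicted as delays.
reference  <- factor(c(0, 0, 0, 1, 1, 1, 1), levels = c(0, 1))
prediction <- factor(c(0, 0, 1, 1, 0, 0, 0), levels = c(0, 1))
tn <- sum(prediction == 1 & reference == 1)  # delayed, predicted delayed
fp <- sum(prediction == 0 & reference == 1)  # delayed, predicted on time
specificity <- tn / (tn + fp)
specificity  # 1 of 4 actual delays caught: 0.25
```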
Specificity is really low, so we improve the model. See what is available with `names(getModelInfo())` and then try the boosted tree model `gbm`:
see http://topepo.github.io/caret/training.html
```{r}
# 10-fold cross-validation, repeated 10 times
fitControl <- trainControl(method = 'repeatedcv', number = 10, repeats = 10)
gbmFit1 <- train(ARR_DEL15 ~ ., data = trainData, method = 'gbm',
                 trControl = fitControl, verbose = FALSE)
gbmPrediction <- predict(gbmFit1, testData)
gbmConfMat <- confusionMatrix(gbmPrediction, testData[,"ARR_DEL15"])
gbmConfMat
#gbmFit1
#plot(gbmFit1)
#plot(gbmFit1, metric = "Kappa")
```
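If the default parameter search is not sufficient, `train` also accepts a `tuneGrid`. The grid below is only a sketch with illustrative values (the column names follow caret's `gbm` tuning parameters); it would be passed as `train(..., tuneGrid = gbmGrid)`:

```{r}
# Each row of the grid is one candidate parameter combination to evaluate
gbmGrid <- expand.grid(interaction.depth = c(1, 3, 5),
                       n.trees = c(50, 100, 150),
                       shrinkage = 0.1,
                       n.minobsinnode = 10)
nrow(gbmGrid)  # 3 depths x 3 tree counts = 9 candidate models
```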