---
title: "Predicting flight delays - Ruby Meetup"
author: "Brigitte"
date: "April 27, 2016"
output:
  html_document:
    keep_md: yes
    theme: cosmo
---
```{r}
set.seed(100)  # for reproducible results
setwd("~/GitHub/Ruby_Talk_Material")
library(caret)

# Read the prepared training and test sets
trainData <- read.csv('train.csv', sep = ',', header = TRUE)
testData <- read.csv('test.csv', sep = ',', header = TRUE)

# The outcome ARR_DEL15 (was the arrival delayed?) and DAY_OF_WEEK are
# categorical, so convert them to factors
trainData$ARR_DEL15 <- as.factor(trainData$ARR_DEL15)
testData$ARR_DEL15 <- as.factor(testData$ARR_DEL15)
trainData$DAY_OF_WEEK <- as.factor(trainData$DAY_OF_WEEK)
testData$DAY_OF_WEEK <- as.factor(testData$DAY_OF_WEEK)

# Drop the row-index column left over from writing the CSVs
trainData$X <- NULL
testData$X <- NULL
```
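Why the factor conversion matters: left numeric, `DAY_OF_WEEK` would impose an artificial ordering (day 7 treated as "seven times" day 1), while a factor gives each day its own indicator column in the model matrix. A minimal base-R sketch with made-up codes:

```{r}
dow <- c(1, 7, 7, 2)   # numeric day-of-week codes
f <- as.factor(dow)
levels(f)              # one level per distinct day: "1" "2" "7"
model.matrix(~ f)      # intercept plus one dummy column per non-baseline level
```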
Now we train the model. We use a rather simple algorithm first to do the classification. If the performance is not good enough, we can move to ensemble algorithms, which are usually better. Even better would be to select the most important variables from the data, include additional predictor variables, or do feature engineering.
We choose logistic regression to start with: basically a regression that predicts a binary value.
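As a quick illustration of the idea (not part of the analysis itself), logistic regression pushes the linear predictor through the logistic link, which squashes any real number into a probability between 0 and 1:

```{r}
# Logistic (sigmoid) link: maps any real-valued score to a probability
sigmoid <- function(x) 1 / (1 + exp(-x))
sigmoid(c(-5, 0, 5))  # near 0, exactly 0.5, near 1
```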
```{r}
library(caret)
# '.' in the formula means "all remaining columns as predictors".
# method = 'glm' is a generalized linear model; setting family = 'binomial'
# makes it a logistic regression.
logisticRegModel <- train(ARR_DEL15 ~ ., data = trainData,
                          method = 'glm', family = 'binomial')
```
Now we can use the model and the test data to check how well we predict flight arrival delays.
```{r}
logRegPrediction <- predict(logisticRegModel, testData)
logRegConfMat <- confusionMatrix(logRegPrediction, testData[,"ARR_DEL15"])
logRegConfMat
```
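For reference: `confusionMatrix` takes the first factor level (`"0"`, not delayed) as the positive class, so specificity here is the fraction of actually delayed flights the model catches. A hand-computed toy example (made-up labels, not our data):

```{r}
# Specificity = TN / (TN + FP) with "0" as the positive class, i.e. the
# proportion of actual delays ("1") that were predicted as delays.
reference  <- factor(c(0, 0, 0, 1, 1, 1, 1), levels = c(0, 1))
prediction <- factor(c(0, 0, 1, 1, 0, 0, 0), levels = c(0, 1))
tn <- sum(prediction == 1 & reference == 1)  # delayed, predicted delayed
fp <- sum(prediction == 0 & reference == 1)  # delayed, predicted on time
specificity <- tn / (tn + fp)
specificity  # 1 of 4 actual delays caught: 0.25
```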
Specificity is really low, so we improve the model. See what is available with `names(getModelInfo())` and then try the boosted tree model `gbm`:
see http://topepo.github.io/caret/training.html
```{r}
# 10-fold cross-validation, repeated 10 times
fitControl <- trainControl(method = 'repeatedcv', number = 10, repeats = 10)
gbmFit1 <- train(ARR_DEL15 ~ ., data = trainData, method = 'gbm',
                 trControl = fitControl, verbose = FALSE)
gbmPrediction <- predict(gbmFit1, testData)
gbmConfMat <- confusionMatrix(gbmPrediction, testData[,"ARR_DEL15"])
gbmConfMat
#gbmFit1
#plot(gbmFit1)
#plot(gbmFit1, metric = "Kappa")
```
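If the default parameter search is not sufficient, `train` also accepts a `tuneGrid`. The grid below is only a sketch with illustrative values (the column names follow caret's `gbm` tuning parameters); it would be passed as `train(..., tuneGrid = gbmGrid)`:

```{r}
# Each row of the grid is one candidate parameter combination to evaluate
gbmGrid <- expand.grid(interaction.depth = c(1, 3, 5),
                       n.trees = c(50, 100, 150),
                       shrinkage = 0.1,
                       n.minobsinnode = 10)
nrow(gbmGrid)  # 3 depths x 3 tree counts = 9 candidate models
```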