Decision Trees with R Code
A decision tree can be used for both classification and regression problems.
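As a quick illustration (a minimal sketch using the built-in iris and mtcars datasets), the same rpart() function used later in this section handles both tasks: method = "class" grows a classification tree and method = "anova" grows a regression tree.

library(rpart)

# Classification: predict a categorical target (iris species)
class_tree <- rpart(Species ~ ., data = iris, method = "class")

# Regression: predict a continuous target (fuel consumption in mtcars)
reg_tree <- rpart(mpg ~ ., data = mtcars, method = "anova")

print(class_tree)
print(reg_tree)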
Terminology Related to Decision Trees
•Root Node – the topmost node, representing the entire population or sample before any split.
•Splitting – the process of dividing a node into two or more sub-nodes.
•Decision Node – a sub-node that splits into further sub-nodes.
•Leaf/Terminal Node – a node that does not split any further; it carries the final prediction.
•Pruning – removing sub-nodes of a decision node to reduce the size of the tree and avoid overfitting.
•Branch/Sub-Tree – a subsection of the entire tree.
•Parent/Child Node – a node that is divided into sub-nodes is the parent; the sub-nodes are its children.
•Algorithms used in decision trees – (the Gini, information-gain, and variance-reduction calculations are illustrated in the sketch after this list)
•Gini Index – the sum of squared class probabilities within a node. The higher the Gini score, the higher the homogeneity. CART (Classification and Regression Trees) uses the Gini method to create binary splits.
•Chi-Square – the higher the Chi-Square value, the higher the statistical significance of the difference between a sub-node and its parent node.
•Information Gain – the decrease in entropy after a split; the split that yields the largest information gain is chosen.
•Reduction in Variance – used for continuous target variables (regression problems); the split that most reduces the weighted variance of the child nodes is chosen.
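To make these criteria concrete, here is a minimal sketch (with toy class counts and toy target values chosen purely for illustration) that computes the Gini score, the entropy-based information gain, and the reduction in variance for a candidate split.

# Toy split: a parent node with 10 positives and 10 negatives,
# divided into a left child (8 pos, 2 neg) and a right child (2 pos, 8 neg)
gini_score <- function(p) sum(p^2)                            # higher = more homogeneous
entropy    <- function(p) -sum(ifelse(p > 0, p * log2(p), 0))

parent <- c(10, 10) / 20
left   <- c(8, 2) / 10
right  <- c(2, 8) / 10

# Weighted Gini score of the split: 0.68 versus 0.50 for the parent,
# so the split increases homogeneity
split_gini <- 0.5 * gini_score(left) + 0.5 * gini_score(right)

# Information gain = parent entropy minus weighted child entropy (about 0.28 bits)
info_gain <- entropy(parent) - (0.5 * entropy(left) + 0.5 * entropy(right))

# Reduction in variance for a regression split, using toy target values
pop_var <- function(x) mean((x - mean(x))^2)
y_left  <- c(5, 6, 7)
y_right <- c(14, 15, 16)
y_all   <- c(y_left, y_right)
var_reduction <- pop_var(y_all) -
  (length(y_left) / length(y_all)) * pop_var(y_left) -
  (length(y_right) / length(y_all)) * pop_var(y_right)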
If we can use logistic regression for classification problems and linear regression for regression problems, why do we need trees?
The best algorithm depends on the type of problem we are solving:
•If the relationship between the dependent and independent variables is well approximated by a linear model, linear regression will outperform a tree-based model.
•If the relationship between the dependent and independent variables is highly non-linear and complex, a tree model will outperform a classical regression method, as the sketch below demonstrates.
•If you need to build a model that is easy to explain to people, a decision tree will usually do better than a linear model.
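As a rough demonstration of the non-linearity point (a minimal sketch on simulated data, not a benchmark), a regression tree tracks a sinusoidal signal far better than a straight-line fit:

library(rpart)

# Simulated data with a strongly non-linear signal
set.seed(42)
x <- runif(300, 0, 10)
y <- sin(x) + rnorm(300, sd = 0.2)
d <- data.frame(x = x, y = y)

lin  <- lm(y ~ x, data = d)                      # straight-line fit
tree <- rpart(y ~ x, data = d, method = "anova") # regression tree

# In-sample RMSE: the tree's error is far lower on this non-linear data
sqrt(mean(residuals(lin)^2))
sqrt(mean((d$y - predict(tree, d))^2))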
CART decision tree algorithm
# rpart ships with R; the plotting helpers must be installed once
install.packages('rattle')
install.packages('rpart.plot')
install.packages('RColorBrewer')
library(rpart)
library(rattle)
library(rpart.plot)
library(RColorBrewer)
# Grow a classification tree; train is assumed to be the Kaggle Titanic
# training data frame, already loaded into the session
fit <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Embarked,
             data = train, method = "class")
# fit is our decision tree model; let us visualize the tree
plot(fit)             # base-graphics skeleton of the tree
text(fit)             # add split labels to the base plot
fancyRpartPlot(fit)   # much more readable rendering from rattle
# Make predictions on the test set with the decision tree model
Prediction <- predict(fit, test, type = "class")
# Write the predictions out in Kaggle submission format
submit <- data.frame(PassengerId = test$PassengerId, Survived = Prediction)
write.csv(submit, file = "myfirstdtree.csv", row.names = FALSE)
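The walkthrough stops before pruning, which the terminology list introduces above. A minimal sketch using rpart's built-in complexity-parameter table would look like this: printcp() and plotcp() report the cross-validated error for each subtree size, and prune() cuts the tree back at a chosen cp value.

# Inspect the cross-validated error (xerror) for each complexity parameter (cp)
printcp(fit)
plotcp(fit)

# Prune back to the cp value with the lowest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
fancyRpartPlot(pruned)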