Basics of statistics for Data science: Descriptive statistics with R code

## Load the data

data("iris")

# View the data

View(iris)

# There are 5 variables in the data frame: Sepal.Length, Sepal.Width,
# Petal.Length, Petal.Width and Species (the species of the flower)

head(iris)
# The above command prints the first 6 rows of the data frame.

# Let us look at the structure of the data
str(iris)
# Sepal.Length, Sepal.Width, Petal.Length and Petal.Width are numeric variables
# Species is a factor variable

# Let us calculate descriptive statistics for this data

# Measures of central tendency - describe the most typical value in the data
# Mean - the arithmetic average of the values; sensitive to outliers
# The command for calculating the mean is mean()

# The command below calculates the mean of the first column
mean(iris[,1])

# An alternative is to select the column by name
mean(iris$Sepal.Length)   # The $ sign selects a particular column (variable) from a data frame.
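# As a small illustration, the forms below all select the same column, and
# colMeans() extends the idea to every numeric column at once. This is only a
# sketch; the names by_position, by_dollar, by_name, numeric_cols and col_means
# are illustrative and not part of the original script.

```r
data("iris")

# Three equivalent ways to pick out the Sepal.Length column
by_position <- mean(iris[, 1])
by_dollar   <- mean(iris$Sepal.Length)
by_name     <- mean(iris[["Sepal.Length"]])

# Keep only the numeric columns, then take the mean of each one
numeric_cols <- iris[sapply(iris, is.numeric)]
col_means <- colMeans(numeric_cols)
col_means
```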

#	Median – midpoint of distribution values / arrange ascending ordre
# The command for calculating median is medain()
median(iris$Sepal.Length)
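# To see why the mean is described as vulnerable to outliers, the sketch below
# appends one made-up extreme value (100, which is not in the iris data) and
# compares how far the mean and the median move. The names x_outlier,
# mean_shift and median_shift are illustrative.

```r
data("iris")
x <- iris$Sepal.Length

# Append one made-up extreme value (hypothetical, not part of iris)
x_outlier <- c(x, 100)

# Shift in each statistic caused by that single value
mean_shift   <- mean(x_outlier) - mean(x)
median_shift <- median(x_outlier) - median(x)

c(mean_shift = mean_shift, median_shift = median_shift)
```

# The mean moves noticeably while the median barely changes, which is why the
# median is often preferred for skewed data.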
#	Mode - the value that appears most often.

(iris$Sepal.Length)

# Measures of dispersion - describe the shape and spread of the data set
# A frequency distribution shows the number (or percentage) of occurrences of each value or set of values
# This is especially relevant for categorical/factor variables
# Species is the categorical variable in the data frame

table(iris$Species)
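# table() returns counts only. One possible way to get the percentage form of
# the frequency distribution is prop.table(), sketched below; the names counts
# and percentages are illustrative.

```r
data("iris")

counts <- table(iris$Species)

# Convert counts to proportions, then to percentages rounded to one decimal
percentages <- round(100 * prop.table(counts), 1)
percentages
```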
# Range - the minimum and maximum values in a set of numbers
range(iris$Sepal.Length)
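# range() returns two numbers. If a single number for the spread is wanted,
# one option is to take the difference between them, as in this short sketch
# (limits and range_width are illustrative names).

```r
data("iris")
x <- iris$Sepal.Length

limits <- range(x)           # c(minimum, maximum)
range_width <- diff(limits)  # maximum minus minimum
range_width
```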

# Standard deviation measures how far values typically lie from the mean, in the units of the data.
# For roughly bell-shaped (normal) data, about 68% of values fall within one standard deviation of the mean
sd(iris$Sepal.Length)
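# To make the definition concrete, the sketch below rebuilds the standard
# deviation by hand from the sample formula (dividing by n - 1) and compares it
# with sd(). The names sample_variance and sd_by_hand are illustrative.

```r
data("iris")
x <- iris$Sepal.Length
n <- length(x)

# Squared distances from the mean, summed and divided by n - 1
# (the sample formula, which is the one sd() uses)
sample_variance <- sum((x - mean(x))^2) / (n - 1)
sd_by_hand <- sqrt(sample_variance)

# Compare the hand-computed value with the built-in function
all.equal(sd_by_hand, sd(x))
```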
# The command for calculating variance is var(); the variance is the square of the standard deviation
var(iris$Sepal.Length)

# Dividing the standard deviation by the mean gives a different quantity, the coefficient of variation
cv <- sd(iris$Sepal.Length) / mean(iris$Sepal.Length)
cv

# For a full set of descriptives, use the summary() command
summary(iris$Sepal.Length)
# This gives the min, 1st Quartile, median, mean, 3rd Quartile and maximum value
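# The quartiles reported by summary() can also be requested directly. The
# sketch below uses quantile() and IQR() from base R; the names quartiles and
# iqr are illustrative.

```r
data("iris")
x <- iris$Sepal.Length

# Quartiles computed directly; summary() reports the same cut points
quartiles <- quantile(x, probs = c(0.25, 0.5, 0.75))
quartiles

# Interquartile range: distance between the 3rd and 1st quartiles
iqr <- IQR(x)
iqr
```

# Calling summary(iris) on the whole data frame gives the same statistics for
# each numeric column and the counts for Species.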

# Another way to get a full set of descriptives is the describe() command
# describe() is part of the psych package, so install that package first (needed only once)

install.packages("psych")
library(psych)

describe(iris$Sepal.Length)

# This gives vars, n, mean, sd, median, trimmed, mad, min, max, range, skew, kurtosis and se
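# For readers unfamiliar with the less common columns, the base R sketch below
# computes three of them. It assumes that the trimmed mean in describe() uses a
# 10% trim by default; check the psych documentation before relying on that.
# The names trimmed_mean, mad_value and std_error are illustrative.

```r
data("iris")
x <- iris$Sepal.Length

trimmed_mean <- mean(x, trim = 0.1)   # mean after dropping 10% of values at each end (assumed default)
mad_value <- mad(x)                   # median absolute deviation (scaled by 1.4826 by default)
std_error <- sd(x) / sqrt(length(x))  # standard error of the mean

c(trimmed = trimmed_mean, mad = mad_value, se = std_error)
```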

# Descriptives across groups

describeBy(iris$Sepal.Length, iris$Species)
# This will give descriptives for each species group
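# If installing psych is not an option, grouped descriptives can also be
# sketched in base R with tapply() and aggregate(). This is an alternative to
# describeBy(), not a reproduction of its output; group_means and group_stats
# are illustrative names.

```r
data("iris")

# Mean sepal length within each species
group_means <- tapply(iris$Sepal.Length, iris$Species, mean)
group_means

# Several statistics per species in one call
group_stats <- aggregate(Sepal.Length ~ Species, data = iris,
                         FUN = function(v) c(mean = mean(v), sd = sd(v)))
group_stats
```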
