Basic Statistical Analysis

Author

Hassan Ghayas

Published

April 27, 2026

📦 Load Libraries

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(ggplot2)
library(palmerpenguins)
Warning: package 'palmerpenguins' was built under R version 4.4.3

Load data

For basic statistical analysis, we will use the palmerpenguins R package, which provides the penguins dataset.

penguin_data <- penguins
head(penguin_data)
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>
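The fourth row of the preview above contains NA values, so it can help to count the missing values before computing any statistics. The following is a minimal sketch using base R's `is.na()`, `colSums()`, and `complete.cases()`; the names `na_counts` and `n_complete` are illustrative and not part of the original analysis.

```r
library(palmerpenguins)

penguin_data <- penguins

# Count missing values in each column
na_counts <- colSums(is.na(penguin_data))
na_counts

# Number of rows that have no missing values in any column
n_complete <- sum(complete.cases(penguin_data))
n_complete
```

Columns with a non-zero count are the ones where `na.rm = TRUE` (or an equivalent option) matters in the sections that follow.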

Summary Statistics

Descriptive statistics are used to summarize and describe the main features of a dataset. Common descriptive statistics include:

  • Mean

  • Median

  • Standard deviation

  • Minimum and maximum values

mean(penguin_data$body_mass_g, na.rm = TRUE)
[1] 4201.754
median(penguin_data$body_mass_g, na.rm = TRUE)
[1] 4050
sd(penguin_data$body_mass_g, na.rm = TRUE)
[1] 801.9545
summary(penguin_data$body_mass_g)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   2700    3550    4050    4202    4750    6300       2 
penguin_data %>%
  group_by(species) %>%
  summarise(mean_mass = mean(body_mass_g, na.rm = TRUE))
# A tibble: 3 × 2
  species   mean_mass
  <fct>         <dbl>
1 Adelie        3701.
2 Chinstrap     3733.
3 Gentoo        5076.
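The statistics listed at the start of this section can also be computed per species in a single `summarise()` call. This is a sketch under the assumption that rows with a missing body mass should be dropped first; the column names (`n`, `mean_mass`, `sd_mass`, and so on) are chosen here for illustration.

```r
library(dplyr)
library(palmerpenguins)

mass_summary <- penguins %>%
  filter(!is.na(body_mass_g)) %>%   # drop rows with missing body mass
  group_by(species) %>%
  summarise(
    n           = n(),              # number of penguins per species
    mean_mass   = mean(body_mass_g),
    median_mass = median(body_mass_g),
    sd_mass     = sd(body_mass_g),
    min_mass    = min(body_mass_g),
    max_mass    = max(body_mass_g)
  )
mass_summary
```

Filtering once up front avoids repeating `na.rm = TRUE` in every summary function.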

t-test

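A t-test compares the means of two groups to determine whether the difference between them is statistically significant. By default, R's t.test() runs Welch's version of the test, which does not assume equal variances in the two groups.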

Required:

  • One numeric variable

  • One grouping variable with two categories

t.test(body_mass_g ~ sex, data = penguin_data)

    Welch Two Sample t-test

data:  body_mass_g by sex
t = -8.5545, df = 323.9, p-value = 4.794e-16
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
 -840.5783 -526.2453
sample estimates:
mean in group female   mean in group male 
            3862.273             4545.685 
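The object returned by `t.test()` can be stored and its components read directly, which is handy when reporting results. The sketch below also shows the classical Student's t-test through `var.equal = TRUE`. Whether equal variances is a reasonable assumption for these data is not examined here, and the names `welch` and `student` are illustrative.

```r
library(palmerpenguins)

penguin_data <- penguins

# Welch's t-test (the default); rows with NA in either variable are dropped
welch <- t.test(body_mass_g ~ sex, data = penguin_data)

# Student's t-test, which assumes equal variances in both groups
student <- t.test(body_mass_g ~ sex, data = penguin_data, var.equal = TRUE)

welch$statistic   # t value
welch$p.value     # p-value
welch$conf.int    # confidence interval for the difference in means
student$parameter # degrees of freedom of the Student's test
```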

ANOVA

ANOVA (analysis of variance) is used to compare the means among three or more groups, for example, comparing body mass among species.

Required:

  • One numeric variable

  • One categorical grouping variable with three or more groups

anova_result <- aov(body_mass_g ~ species, data = penguin_data)
summary(anova_result)
             Df    Sum Sq  Mean Sq F value Pr(>F)    
species       2 146864214 73432107   343.6 <2e-16 ***
Residuals   339  72443483   213698                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2 observations deleted due to missingness
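A significant ANOVA result indicates that at least one group mean differs, but not which ones. One common follow-up, shown here as a sketch and not as part of the original analysis, is Tukey's Honest Significant Differences test via `TukeyHSD()` from base R's stats package.

```r
library(palmerpenguins)

penguin_data <- penguins

anova_result <- aov(body_mass_g ~ species, data = penguin_data)

# Pairwise comparisons between species with adjusted p-values
tukey <- TukeyHSD(anova_result)
tukey
```

Each row of the output corresponds to one pair of species and gives the estimated difference in means, a confidence interval, and an adjusted p-value.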

Correlation

Correlation measures the strength and direction of the linear relationship between two numeric variables (cor() computes the Pearson coefficient by default). Example: Does bill length increase with flipper length?

cor(penguin_data$bill_length_mm,
    penguin_data$flipper_length_mm,
    use = "complete.obs")  # use only rows where both values are present
[1] 0.6561813
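`cor()` returns only the coefficient. To also get a p-value and a confidence interval, base R provides `cor.test()`, which uses only complete pairs of observations. The following is a minimal sketch; the name `ct` is illustrative.

```r
library(palmerpenguins)

penguin_data <- penguins

# Pearson correlation test (the default method)
ct <- cor.test(penguin_data$bill_length_mm,
               penguin_data$flipper_length_mm)

ct$estimate  # correlation coefficient
ct$p.value   # p-value for the null hypothesis of zero correlation
ct$conf.int  # 95 percent confidence interval for the coefficient
```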

Linear Regression

Linear regression models the relationship between:

  • one predictor variable (X)

  • one outcome variable (Y)

It helps predict values and identify trends.

model <- lm(flipper_length_mm ~ bill_length_mm, data = penguin_data)
summary(model)

Call:
lm(formula = flipper_length_mm ~ bill_length_mm, data = penguin_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-43.708  -7.896   0.664   8.650  21.179 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    126.6844     4.6651   27.16   <2e-16 ***
bill_length_mm   1.6901     0.1054   16.03   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.63 on 340 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.4306,    Adjusted R-squared:  0.4289 
F-statistic: 257.1 on 1 and 340 DF,  p-value: < 2.2e-16
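The fitted model can be used to predict flipper length for new bill lengths with `predict()`. The sketch below assumes the same model as above; the bill lengths in `new_bills` are hypothetical values chosen only for illustration.

```r
library(palmerpenguins)

penguin_data <- penguins

model <- lm(flipper_length_mm ~ bill_length_mm, data = penguin_data)

# Hypothetical bill lengths (mm), not taken from the dataset
new_bills <- data.frame(bill_length_mm = c(40, 45, 50))

# Predicted flipper lengths with 95 percent confidence intervals
predicted <- predict(model, newdata = new_bills, interval = "confidence")
predicted
```

The `fit` column holds the predicted flipper length, and `lwr` and `upr` give the bounds of the confidence interval for the mean response.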

Plot with Regression Line

ggplot(penguin_data,
       aes(x = bill_length_mm,
           y = flipper_length_mm)) +
  geom_point() +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
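The warnings above come from the two rows with missing measurements. One way to avoid them, sketched here as an optional variation, is to filter those rows out before plotting and to state the smoothing formula explicitly. The axis labels and the name `plot_data` are illustrative additions.

```r
library(dplyr)
library(ggplot2)
library(palmerpenguins)

penguin_data <- penguins

# Keep only rows where both plotted variables are present
plot_data <- penguin_data %>%
  filter(!is.na(bill_length_mm), !is.na(flipper_length_mm))

p <- ggplot(plot_data,
            aes(x = bill_length_mm,
                y = flipper_length_mm)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +  # explicit formula silences the message
  labs(x = "Bill length (mm)", y = "Flipper length (mm)")
p
```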