R/inspect_imb.R
inspect_imb.Rd
For a single dataframe, summarise the most common level in each categorical column. If two dataframes are supplied, compare the most common levels of categorical features appearing in both dataframes. For grouped dataframes, summarise the levels of categorical columns in the dataframe split by group.
inspect_imb(df1, df2 = NULL, include_na = FALSE)
df1
A dataframe.
df2
An optional second dataframe for comparing columnwise imbalance. Defaults to NULL.
include_na
Logical flag, whether to include missing values as a unique level. Default is FALSE, which ignores NA values.
A tibble summarising and comparing the imbalance for each categorical column in one or a pair of dataframes.
For a single dataframe, the tibble returned contains the columns:
col_name, a character vector containing column names of df1.
value, a character vector containing the most common categorical level in each column of df1.
pcnt, the relative frequency of each column's most common categorical level expressed as a percentage.
cnt, the number of occurrences of the most common categorical level in each column of df1.
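The value, pcnt and cnt columns can be illustrated for a single vector with a minimal base R sketch. The helper top_level() below is hypothetical and not part of the package. It assumes that pcnt is taken over every row, including rows with missing values, which is what the example output further down suggests (66 occurrences reported as 75.9%), and that include_na = TRUE simply makes NA eligible as the most common level.

```r
# Hypothetical helper, not part of inspectdf: summarise one categorical vector.
top_level <- function(x, include_na = FALSE) {
  # table() drops NA unless useNA asks for it to be counted
  tab <- table(x, useNA = if (include_na) "ifany" else "no")
  cnt <- as.integer(max(tab))
  data.frame(
    value = names(tab)[which.max(tab)],
    # Assumption: the denominator is the full length of x, NAs included
    pcnt = 100 * cnt / length(x),
    cnt = cnt
  )
}

x <- c("a", "b", "a", NA, NA, NA, NA)
r1 <- top_level(x)                     # NA ignored when picking the top level
r2 <- top_level(x, include_na = TRUE)  # NA treated as a level of its own
print(r1)
print(r2)
```

With include_na = FALSE the most common level is picked among non-missing values only. With include_na = TRUE a column dominated by missing values reports NA as its value.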
For a pair of dataframes, the tibble returned contains the columns:
col_name, a character vector containing names of the unique columns in df1 and df2.
value, a character vector containing the most common categorical level in each column of df1.
pcnt_1, pcnt_2, the percentage occurrence of value in the column col_name for each of df1 and df2, respectively.
cnt_1, cnt_2, the number of occurrences of value in the column col_name for each of df1 and df2, respectively.
p_value, the p-value associated with the null hypothesis that the true rate of occurrence is the same for both dataframes. Small values indicate stronger evidence of a difference in the rate of occurrence.
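A null hypothesis of this kind can be tested in base R with a two-sample test of equal proportions. The sketch below uses prop.test() on counts taken from the gender row of the paired example further down; whether inspect_imb() computes p_value in exactly this way is an assumption, not something this page states. The total of 87 rows for the first dataframe is also an assumption, inferred from 66 occurrences being reported as 75.9%.

```r
# Counts of the level "masculine" in each dataframe (from the paired example)
masculine <- c(66, 18)
# Assumed totals: 87 rows in starwars, 20 rows in starwars[1:20, ]
totals <- c(87, 20)

# Two-sample test of equal proportions with the default continuity correction.
# suppressWarnings() hides the small-expected-count warning for this 2 x 2 table.
res <- suppressWarnings(prop.test(x = masculine, n = totals))
p <- res$p.value
print(p)
```

Under these assumptions the result should land close to the p_value reported for gender in the paired example.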
For a grouped dataframe, the tibble returned is as for a single dataframe, but where the first k columns are the grouping columns. There will be as many rows in the result as there are unique combinations of the grouping variables.
# Load dplyr for starwars data & pipe
library(dplyr)
# Single dataframe summary
inspect_imb(starwars)
#> # A tibble: 8 × 4
#>   col_name   value      pcnt   cnt
#>   <chr>      <chr>     <dbl> <int>
#> 1 gender     masculine 75.9     66
#> 2 sex        male      69.0     60
#> 3 hair_color none      42.5     37
#> 4 species    Human     40.2     35
#> 5 eye_color  brown     24.1     21
#> 6 skin_color fair      19.5     17
#> 7 homeworld  Naboo     12.6     11
#> 8 name       Ackbar     1.15     1
# Paired dataframe comparison
inspect_imb(starwars, starwars[1:20, ])
#> # A tibble: 8 × 7
#>   col_name   value     pcnt_1 cnt_1 pcnt_2 cnt_2 p_value
#>   <chr>      <chr>      <dbl> <int>  <dbl> <int>   <dbl>
#> 1 gender     masculine  75.9     66     90    18  0.277
#> 2 sex        male       69.0     60     70    14  1.00
#> 3 hair_color none       42.5     37     NA    NA NA
#> 4 species    Human      40.2     35     65    13  0.0786
#> 5 eye_color  brown      24.1     21     NA    NA NA
#> 6 skin_color fair       19.5     17     35     7  0.231
#> 7 homeworld  Naboo      12.6     11     NA    NA NA
#> 8 name       Ackbar      1.15     1     NA    NA NA
# Grouped dataframe summary
starwars %>% group_by(gender) %>% inspect_imb()
#> # A tibble: 21 × 5
#> # Groups:   gender [3]
#>    gender    col_name   value       pcnt   cnt
#>    <chr>     <chr>      <chr>      <dbl> <int>
#>  1 feminine  sex        female     94.1     16
#>  2 feminine  species    Human      52.9      9
#>  3 feminine  hair_color brown      35.3      6
#>  4 feminine  skin_color light      35.3      6
#>  5 feminine  eye_color  blue       35.3      6
#>  6 feminine  homeworld  Naboo      17.6      3
#>  7 feminine  name       Adi Gallia  5.88     1
#>  8 masculine sex        male       90.9     60
#>  9 masculine hair_color none       47.0     31
#> 10 masculine species    Human      39.4     26
#> # … with 11 more rows
#> # ℹ Use `print(n = ...)` to see more rows