For a single dataframe, summarise the levels of each categorical column. If two dataframes are supplied, compare the levels of categorical features that appear in both dataframes. For grouped dataframes, summarise the levels of categorical features separately for each group.
inspect_cat(df1, df2 = NULL, include_int = FALSE)
df1
A dataframe.
df2
An optional second dataframe for comparing categorical levels. Defaults to NULL.
include_int
Logical flag: whether to treat integer columns as categories. Default is FALSE.
A tibble summarising the categorical features of a single dataframe, or comparing the categorical features of a pair of dataframes.
For a single dataframe, the tibble returned contains the columns:
col_name, a character vector containing column names of df1.
cnt, an integer column containing the count of unique levels found in each column, including NA.
common, a character column containing the name of the most common level.
common_pcnt, the percentage of each column occupied by the most common level shown in common.
levels, a named list containing relative frequency tibbles for each feature.
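To make the meaning of cnt, common and common_pcnt concrete, the following base R sketch computes the same three quantities for one character vector. It is an illustration only: `summarise_levels` is a hypothetical helper, not part of the package, and its handling of ties (the first level in sorted order wins) is an assumption rather than documented behaviour.

```r
# Hypothetical helper: summarise the levels of a single categorical vector.
summarise_levels <- function(x) {
  # Count each level, keeping NA as a level of its own
  tab <- table(x, useNA = "ifany")
  top <- which.max(tab)  # index of the first most frequent level
  list(
    cnt         = length(tab),                 # number of unique levels, including NA
    common      = names(tab)[top],             # name of the most common level
    common_pcnt = 100 * max(tab) / length(x)   # share of the column it occupies, in percent
  )
}

res <- summarise_levels(c("a", "b", "b", NA))
```

Here `res$cnt` counts three levels because NA is included, which matches the description of cnt above.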
For a pair of dataframes, the tibble returned contains the columns:
col_name, a character vector containing names of columns appearing in both df1 and df2.
jsd, a numeric column containing the Jensen-Shannon divergence. This measures the difference in relative frequencies of levels in a pair of categorical features. Values near 0 indicate agreement of the distributions, while 1 indicates disagreement.
pval, the p-value corresponding to a null hypothesis test that the true frequencies of the categories are equal. A small p-value indicates evidence that the two sets of relative frequencies are actually different. The test is based on a modified Chi-squared statistic.
lvls_1, lvls_2, the relative frequency of levels in each of df1 and df2.
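As a rough illustration of the jsd column, the sketch below computes a Jensen-Shannon divergence between two vectors of relative frequencies in base R. It assumes the base-2 logarithm, which bounds the value between 0 and 1 as described above; whether the package uses exactly this form is an assumption, and `jsd_sketch` is a hypothetical name, not a package function.

```r
# Hypothetical helper: Jensen-Shannon divergence (log base 2) between two
# frequency vectors defined over the same set of levels, in the same order.
jsd_sketch <- function(p, q) {
  stopifnot(length(p) == length(q))
  p <- p / sum(p)  # normalise to relative frequencies
  q <- q / sum(q)
  m <- (p + q) / 2 # mixture distribution
  # Kullback-Leibler divergence, skipping zero-probability terms
  kl <- function(a, b) {
    nz <- a > 0
    sum(a[nz] * log2(a[nz] / b[nz]))
  }
  0.5 * kl(p, m) + 0.5 * kl(q, m)
}

# Align the levels of two columns before comparing them
x  <- c("a", "a", "b", "c")
y  <- c("a", "b", "b", "d")
lv <- union(x, y)
p  <- as.numeric(table(factor(x, levels = lv))) / length(x)
q  <- as.numeric(table(factor(y, levels = lv))) / length(y)
d  <- jsd_sketch(p, q)
```

Identical frequency vectors give 0 and vectors with no levels in common give 1, so `d` for the partially overlapping columns above falls strictly between the two.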
For a grouped dataframe, the tibble returned is as for a single dataframe, but where the first k columns are the grouping columns. There will be as many rows in the result as there are unique combinations of the grouping variables.
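The shape of the grouped result can be mimicked in base R by splitting a column on the grouping variable and summarising each piece. This is only a sketch on a made-up dataframe with one grouping column `g` and one categorical column `colour`; it is not how the package is implemented.

```r
# Made-up data: one grouping column and one categorical column
df <- data.frame(
  g      = c("x", "x", "y", "y", "y"),
  colour = c("red", "blue", "blue", "blue", NA),
  stringsAsFactors = FALSE
)

# Summarise the categorical column separately within each group
by_group <- lapply(split(df$colour, df$g), function(v) {
  tab <- table(v, useNA = "ifany")  # level counts, NA kept as a level
  data.frame(
    col_name    = "colour",
    cnt         = length(tab),
    common      = names(tab)[which.max(tab)],
    common_pcnt = 100 * max(tab) / length(v),
    stringsAsFactors = FALSE
  )
})

# Grouping column first, then the summary columns: one row per group and column
grouped <- cbind(g = names(by_group), do.call(rbind, by_group))
```

With several categorical columns, each group contributes one row per column, which is why the grouped output has more rows than the ungrouped summary.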
# Load dplyr for starwars data & pipe
library(dplyr)
#>
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#>
#> filter, lag
#> The following objects are masked from ‘package:base’:
#>
#> intersect, setdiff, setequal, union
# Single dataframe summary
inspect_cat(starwars)
#> # A tibble: 8 × 5
#> col_name cnt common common_pcnt levels
#> <chr> <int> <chr> <dbl> <named list>
#> 1 eye_color 15 brown 24.1 <tibble [15 × 3]>
#> 2 gender 3 masculine 75.9 <tibble [3 × 3]>
#> 3 hair_color 13 none 42.5 <tibble [13 × 3]>
#> 4 homeworld 49 Naboo 12.6 <tibble [49 × 3]>
#> 5 name 87 Ackbar 1.15 <tibble [87 × 3]>
#> 6 sex 5 male 69.0 <tibble [5 × 3]>
#> 7 skin_color 31 fair 19.5 <tibble [31 × 3]>
#> 8 species 38 Human 40.2 <tibble [38 × 3]>
# Paired dataframe comparison
inspect_cat(starwars, starwars[1:20, ])
#> # A tibble: 8 × 5
#> col_name jsd pval lvls_1 lvls_2
#> <chr> <dbl> <dbl> <named list> <named list>
#> 1 eye_color 0.0936 7.08e- 1 <tibble [15 × 3]> <tibble [8 × 3]>
#> 2 gender 0.0387 3.38e- 1 <tibble [3 × 3]> <tibble [2 × 3]>
#> 3 hair_color 0.261 9.04e- 4 <tibble [13 × 3]> <tibble [10 × 3]>
#> 4 homeworld 0.394 2.21e- 2 <tibble [49 × 3]> <tibble [11 × 3]>
#> 5 name 0.573 9.35e-11 <tibble [87 × 3]> <tibble [20 × 3]>
#> 6 sex 0.0526 5.19e- 1 <tibble [5 × 3]> <tibble [4 × 3]>
#> 7 skin_color 0.288 1.58e- 1 <tibble [31 × 3]> <tibble [10 × 3]>
#> 8 species 0.300 7.86e- 2 <tibble [38 × 3]> <tibble [6 × 3]>
# Grouped dataframe summary
starwars %>% group_by(gender) %>% inspect_cat()
#> # A tibble: 21 × 6
#> # Groups: gender [3]
#> gender col_name cnt common common_pcnt levels
#> <chr> <chr> <int> <chr> <dbl> <named list>
#> 1 feminine eye_color 6 blue 35.3 <tibble [6 × 3]>
#> 2 feminine hair_color 6 brown 35.3 <tibble [6 × 3]>
#> 3 feminine homeworld 11 Naboo 17.6 <tibble [11 × 3]>
#> 4 feminine name 17 Adi Gallia 5.88 <tibble [17 × 3]>
#> 5 feminine sex 2 female 94.1 <tibble [2 × 3]>
#> 6 feminine skin_color 9 light 35.3 <tibble [9 × 3]>
#> 7 feminine species 8 Human 52.9 <tibble [8 × 3]>
#> 8 masculine eye_color 13 brown 22.7 <tibble [13 × 3]>
#> 9 masculine hair_color 10 none 47.0 <tibble [10 × 3]>
#> 10 masculine homeworld 44 Tatooine 12.1 <tibble [44 × 3]>
#> # … with 11 more rows
#> # ℹ Use `print(n = ...)` to see more rows