For a single dataframe, summarise the numeric columns. If two dataframes are supplied, compare numeric columns appearing in both dataframes. For grouped dataframes, summarise numeric columns separately for each group.
inspect_num(df1, df2 = NULL, breaks = 20, include_int = TRUE)
df1: A dataframe.
df2: An optional second dataframe for comparing numeric columns. Defaults to NULL.
breaks: Integer number of breaks used for histogram bins, passed to graphics::hist(..., breaks). Defaults to 20. See ?hist for more details.
include_int: Logical flag, whether to include integer columns in numeric summaries. Defaults to TRUE.
A tibble containing statistical summaries of the numeric columns of df1, or comparing the histograms of df1 and df2.
For a single dataframe, the tibble returned contains the columns:
col_name, a character vector containing the column names in df1.
min, q1, median, mean, q3, max and sd, the minimum, lower quartile, median, mean, upper quartile, maximum and standard deviation for each numeric column.
pcnt_na, the percentage of each numeric feature that is missing.
hist, a named list of tibbles containing the relative frequency of values falling in bins determined by breaks.
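As a rough illustration of what these columns hold, the base R sketch below computes the same kinds of quantities for a single numeric vector. The helper `summarise_numeric()` and the column names `lower`, `upper` and `prop` are hypothetical and chosen for this example only; this is not the package's internal implementation, and the layout of the real hist tibbles may differ.

```r
# Hypothetical helper for illustration only; not part of inspectdf.
summarise_numeric <- function(x, breaks = 20) {
  qs <- stats::quantile(x, probs = c(0.25, 0.5, 0.75),
                        na.rm = TRUE, names = FALSE)
  # hist() drops non-finite values and treats a single number given as
  # `breaks` as a suggestion, so the number of bins may not be exactly 20.
  h <- graphics::hist(x, breaks = breaks, plot = FALSE)
  n_breaks <- length(h$breaks)
  list(
    stats = data.frame(
      min = min(x, na.rm = TRUE), q1 = qs[1], median = qs[2],
      mean = mean(x, na.rm = TRUE), q3 = qs[3], max = max(x, na.rm = TRUE),
      sd = stats::sd(x, na.rm = TRUE),
      pcnt_na = 100 * mean(is.na(x))  # percentage of missing values
    ),
    hist = data.frame(
      lower = h$breaks[-n_breaks],    # left edge of each bin
      upper = h$breaks[-1],           # right edge of each bin
      prop  = h$counts / sum(h$counts)  # relative frequency per bin
    )
  )
}

x <- c(66, 150, 167, 180, 191, 264, NA, NA)
res <- summarise_numeric(x)
res$stats
res$hist
```

Because `hist()` picks "pretty" break points, the number of rows in the histogram table generally differs from the value passed as `breaks`.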
For a pair of dataframes, the tibble returned contains the columns:
col_name, a character vector containing the column names in df1 and df2.
hist_1, hist_2, list columns containing the histograms for df1 and df2. Where a column appears in both dataframes, the bins used for df1 are reused to calculate histograms for df2.
jsd, a numeric column containing the Jensen-Shannon divergence. This measures the difference between the distributions of a pair of binned numeric features. Values near 0 indicate agreement of the distributions, while values near 1 indicate disagreement.
pval, the p-value corresponding to a null hypothesis test (NHT) that the true frequencies of histogram bins are equal. A small p-value indicates evidence that the two sets of relative frequencies are actually different. The test is based on a modified Chi-squared statistic.
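The idea of reusing one set of bins and then comparing the binned frequencies can be sketched in base R as follows. This is a minimal illustration, not the package's code: the helpers `jsd()` and `rel_freq()` are hypothetical, the open-ended end bins are an assumption made here so that values outside the first sample's range are still counted, and log base 2 is assumed so that the divergence lies between 0 and 1.

```r
# Hypothetical helper: Jensen-Shannon divergence between two vectors of
# relative frequencies. Log base 2 (an assumption) bounds the result in [0, 1].
jsd <- function(p, q) {
  m <- (p + q) / 2
  # Kullback-Leibler term, treating 0 * log(0) as 0
  kl <- function(a, b) sum(ifelse(a > 0, a * log2(a / b), 0))
  (kl(p, m) + kl(q, m)) / 2
}

set.seed(1)
x1 <- rnorm(500, mean = 0)
x2 <- rnorm(500, mean = 0.5)

# Bins are chosen from the first sample only, then reused for the second.
brks1 <- graphics::hist(x1, breaks = 20, plot = FALSE)$breaks
# Assumption: open-ended end bins catch values outside the range of x1.
grid <- c(-Inf, brks1, Inf)

# Hypothetical helper: relative frequencies of x on the shared grid.
rel_freq <- function(x) {
  counts <- table(cut(x, breaks = grid))
  as.numeric(counts) / sum(counts)
}

p <- rel_freq(x1)
q <- rel_freq(x2)

jsd(p, p)  # identical histograms give 0
jsd(p, q)  # shifted sample gives a value strictly between 0 and 1
```

Counting both samples on the same grid is what makes the two frequency vectors comparable bin by bin; with separately chosen bins the divergence would not be well defined.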
For a grouped dataframe, the tibble returned is as for a single dataframe, except that the first k columns are the grouping columns. The result contains one row per numeric column for each unique combination of the grouping variables.
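To illustrate the shape of a grouped result, the base R sketch below summarises each numeric column separately within each group of a toy dataframe. The data and the names `grp`, `a` and `b` are hypothetical, only two summary columns are computed, and this is not how the package itself handles grouping.

```r
# Toy data; `grp`, `a` and `b` are hypothetical names.
df <- data.frame(
  grp = c("x", "x", "y", "y", "y"),
  a   = c(1, 3, 10, 20, NA),
  b   = c(5, 7, 2, 4, 6)
)
num_cols <- names(df)[sapply(df, is.numeric)]

# Build one output row per (group, numeric column) pair.
rows <- lapply(split(df, df$grp), function(g) {
  data.frame(
    grp      = g$grp[1],
    col_name = num_cols,
    mean     = sapply(num_cols, function(nm) mean(g[[nm]], na.rm = TRUE)),
    pcnt_na  = sapply(num_cols, function(nm) 100 * mean(is.na(g[[nm]]))),
    row.names = NULL
  )
})
grouped <- do.call(rbind, rows)
rownames(grouped) <- NULL
grouped
```

With two groups and two numeric columns the result has four rows, with the grouping column first.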
# Load dplyr for starwars data & pipe
library(dplyr)
# Single dataframe summary
inspect_num(starwars)
#> # A tibble: 3 × 10
#> col_name min q1 median mean q3 max sd pcnt_na hist
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <named list>
#> 1 height 66 167 180 174. 191 264 34.8 6.90 <tibble>
#> 2 mass 15 55.6 79 97.3 84.5 1358 169. 32.2 <tibble>
#> 3 birth_year 8 35 52 87.6 72 896 155. 50.6 <tibble>
# Paired dataframe comparison
inspect_num(starwars, starwars[1:20, ])
#> # A tibble: 3 × 5
#> col_name hist_1 hist_2 jsd pval
#> <chr> <named list> <named list> <dbl> <dbl>
#> 1 height <tibble [21 × 2]> <tibble [21 × 2]> 0.181 0.310
#> 2 mass <tibble [28 × 2]> <tibble [28 × 2]> 0.0236 0.701
#> 3 birth_year <tibble [18 × 2]> <tibble [18 × 2]> 0.0343 0.252
# Grouped dataframe summary
starwars %>% group_by(gender) %>% inspect_num()
#> # A tibble: 9 × 11
#> gender col_n…¹ min q1 median mean q3 max sd pcnt_na hist
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <named >
#> 1 masculine height 66 171. 183 177. 193 264 37.6 6.06 <tibble>
#> 2 masculine mass 15 75 80 106. 88 1358 185. 25.8 <tibble>
#> 3 masculine birth_… 8 31.9 52.5 97.8 82 896 173. 48.5 <tibble>
#> 4 feminine height 96 162. 166. 165. 172 213 23.6 5.88 <tibble>
#> 5 feminine mass 45 50 55 54.7 56.2 75 8.59 47.1 <tibble>
#> 6 feminine birth_… 19 44.5 47.5 47.2 50.5 72 15.0 52.9 <tibble>
#> 7 NA height 178 180. 183 181. 183 183 2.89 25 <tibble>
#> 8 NA mass 48 48 48 48 48 48 NA 75 <tibble>
#> 9 NA birth_… 62 62 62 62 62 62 NA 75 <tibble>
#> # … with abbreviated variable name ¹col_name