Summary and comparison of numeric columns

For a single dataframe, summarise the numeric columns. If two dataframes are supplied, compare numeric columns appearing in both dataframes. For grouped dataframes, summarise numeric columns separately for each group.

inspect_num(df1, df2 = NULL, breaks = 20, include_int = TRUE)

Arguments

df1: A dataframe.
df2: An optional second dataframe for comparing numeric columns. Defaults to NULL.
breaks: Integer number of breaks used for histogram bins, passed to graphics::hist(..., breaks). Defaults to 20. See ?hist for more details.
include_int: Logical flag, whether to include integer columns in numeric summaries. Defaults to TRUE.

Value

A tibble containing statistical summaries of the numeric columns of df1, or comparing the histograms of df1 and df2.

Details

For a single dataframe, the tibble returned contains the columns:

col_name, a character vector containing the column names in df1
min, q1, median, mean, q3, max and sd, the minimum, lower quartile, median, mean, upper quartile, maximum and standard deviation for each numeric column.
pcnt_na, the percentage of each numeric feature that is missing
hist, a named list of tibbles containing the relative frequency of values falling in bins determined by breaks.
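
As a rough illustration of what the hist column holds, relative frequencies per bin can be computed in base R as below. This is a minimal sketch using graphics::hist() on the built-in mtcars data, not the package's internal code. The names x, h, rel_freq and bins are illustrative assumptions, and the tibbles returned by inspect_num() may be laid out differently.

```r
# Illustrative only: relative frequencies of binned values in base R
x <- mtcars$mpg                 # any numeric column
x <- x[!is.na(x)]               # drop missing values before binning

# breaks = 20 is a suggestion to hist(); the actual bin count may differ
h <- hist(x, breaks = 20, plot = FALSE)

# Proportion of observations falling in each bin
rel_freq <- h$counts / sum(h$counts)

# One row per bin: lower edge, upper edge, relative frequency
bins <- data.frame(
  lower = head(h$breaks, -1),
  upper = tail(h$breaks, -1),
  prop  = rel_freq
)
print(bins)
```

The proportions sum to 1 by construction, so histograms built from samples of different sizes remain comparable.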

For a pair of dataframes, the tibble returned contains the columns:

col_name, a character vector containing the column names in df1 and df2
hist_1, hist_2, a list column for histograms of each of df1 and df2. Where a column appears in both dataframes, the bins used for df1 are reused to calculate histograms for df2.
jsd, a numeric column containing the Jensen-Shannon divergence. This measures the difference in distribution of a pair of binned numeric features. Values near to 0 indicate agreement of the distributions, while 1 indicates disagreement.
pval, the p-value corresponding to a null hypothesis test that the true frequencies of histogram bins are equal. A small p-value indicates evidence that the two sets of relative frequencies are actually different. The test is based on a modified Chi-squared statistic.
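
The bin reuse and the Jensen-Shannon divergence described above can be sketched in base R as follows. This is a hedged illustration rather than the package's implementation. It assumes base-2 logarithms, which bound the divergence between 0 and 1 and so match the range described for jsd. The helper js_divergence() and the variables x1, x2, shared_breaks, counts_1 and counts_2 are hypothetical names introduced for this example.

```r
# Illustrative only: Jensen-Shannon divergence between two binned samples
js_divergence <- function(p, q) {
  # Normalise counts to relative frequencies
  p <- p / sum(p)
  q <- q / sum(q)
  m <- (p + q) / 2              # mixture distribution

  # Kullback-Leibler divergence in bits; empty bins contribute zero
  kl <- function(a, b) {
    keep <- a > 0
    sum(a[keep] * log2(a[keep] / b[keep]))
  }
  0.5 * kl(p, m) + 0.5 * kl(q, m)
}

x1 <- mtcars$mpg                # stands in for a column of df1
x2 <- mtcars$mpg[1:16]          # stands in for the same column of df2

# Bins are chosen from the first sample, then reused for the second.
# Reusing breaks this way requires x2 to lie within the range of the breaks.
shared_breaks <- hist(x1, breaks = 20, plot = FALSE)$breaks
counts_1 <- hist(x1, breaks = shared_breaks, plot = FALSE)$counts
counts_2 <- hist(x2, breaks = shared_breaks, plot = FALSE)$counts

jsd <- js_divergence(counts_1, counts_2)
print(jsd)
```

Under these assumptions, identical histograms give a divergence of 0 and histograms with no overlapping bins give 1.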

For a grouped dataframe, the tibble returned is as for a single dataframe, but where the first k columns are the grouping columns. There will be as many rows in the result as there are unique combinations of the grouping variables.

Author

Alastair Rushworth

Examples

# Load inspectdf, plus dplyr for the starwars data and the pipe
library(inspectdf)
library(dplyr)

# Single dataframe summary
inspect_num(starwars)
#> # A tibble: 3 × 10
#>   col_name     min    q1 median  mean    q3   max    sd pcnt_na hist        
#>   <chr>      <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl> <named list>
#> 1 height        66 167      180 174.  191     264  34.8    6.90 <tibble>    
#> 2 mass          15  55.6     79  97.3  84.5  1358 169.    32.2  <tibble>    
#> 3 birth_year     8  35       52  87.6  72     896 155.    50.6  <tibble>    

# Paired dataframe comparison
inspect_num(starwars, starwars[1:20, ])
#> # A tibble: 3 × 5
#>   col_name   hist_1            hist_2               jsd  pval
#>   <chr>      <named list>      <named list>       <dbl> <dbl>
#> 1 height     <tibble [21 × 2]> <tibble [21 × 2]> 0.181  0.310
#> 2 mass       <tibble [28 × 2]> <tibble [28 × 2]> 0.0236 0.701
#> 3 birth_year <tibble [18 × 2]> <tibble [18 × 2]> 0.0343 0.252

# Grouped dataframe summary
starwars %>% group_by(gender) %>% inspect_num()
#> # A tibble: 9 × 11
#>   gender    col_n…¹   min    q1 median  mean    q3   max     sd pcnt_na hist    
#>   <chr>     <chr>   <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl>   <dbl> <named >
#> 1 masculine height     66 171.   183   177.  193     264  37.6     6.06 <tibble>
#> 2 masculine mass       15  75     80   106.   88    1358 185.     25.8  <tibble>
#> 3 masculine birth_…     8  31.9   52.5  97.8  82     896 173.     48.5  <tibble>
#> 4 feminine  height     96 162.   166.  165.  172     213  23.6     5.88 <tibble>
#> 5 feminine  mass       45  50     55    54.7  56.2    75   8.59   47.1  <tibble>
#> 6 feminine  birth_…    19  44.5   47.5  47.2  50.5    72  15.0    52.9  <tibble>
#> 7 NA        height    178 180.   183   181.  183     183   2.89   25    <tibble>
#> 8 NA        mass       48  48     48    48    48      48  NA      75    <tibble>
#> 9 NA        birth_…    62  62     62    62    62      62  NA      75    <tibble>
#> # … with abbreviated variable name ¹col_name
