For a single dataframe, summarise the numeric columns. If two dataframes are supplied, compare numeric columns appearing in both dataframes. For grouped dataframes, summarise numeric columns separately for each group.

inspect_num(df1, df2 = NULL, breaks = 20, include_int = TRUE)

Arguments

df1

A dataframe.

df2

An optional second dataframe for comparing numeric columns with df1. Defaults to NULL.

breaks

Integer number of breaks used for histogram bins, passed to graphics::hist() as hist(..., breaks). Defaults to 20. See ?hist for more details.

include_int

Logical flag, whether to include integer columns in numeric summaries. Defaults to TRUE.

Value

A tibble containing statistical summaries of the numeric columns of df1, or comparing the histograms of df1 and df2.

Details

For a single dataframe, the tibble returned contains the columns:

  • col_name, a character vector containing the column names in df1

  • min, q1, median, mean, q3, max and sd, the minimum, lower quartile, median, mean, upper quartile, maximum and standard deviation for each numeric column.

  • pcnt_na, the percentage of each numeric feature that is missing

  • hist, a named list of tibbles containing the relative frequency of values falling in bins determined by breaks.
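As a rough illustration of what each element of hist holds, the relative frequency per bin can be sketched in base R. This is a minimal sketch, not the package's internal code: the helper name rel_freq_sketch and the output column names lower, upper and prop are hypothetical and may differ from the tibbles that inspect_num() actually returns.

```r
# Hypothetical helper: relative frequency of values falling in each bin,
# with the binning delegated to graphics::hist().
rel_freq_sketch <- function(x, breaks = 20) {
  # plot = FALSE returns the bin edges and counts without drawing anything;
  # hist() drops missing and non-finite values before counting
  h <- hist(x, breaks = breaks, plot = FALSE)
  data.frame(
    lower = head(h$breaks, -1),        # left edge of each bin
    upper = tail(h$breaks, -1),        # right edge of each bin
    prop  = h$counts / sum(h$counts)   # share of observed values in the bin
  )
}

# Example with a built-in dataset
out <- rel_freq_sketch(mtcars$mpg, breaks = 5)
out
```

Note that hist() treats an integer breaks value as a suggestion, so the number of bins returned may not match it exactly.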

For a pair of dataframes, the tibble returned contains the columns:

  • col_name, a character vector containing the column names in df1 and df2

  • hist_1, hist_2, list columns containing the histograms for df1 and df2, respectively. Where a column appears in both dataframes, the bins used for df1 are reused to calculate histograms for df2.

  • jsd, a numeric column containing the Jensen-Shannon divergence. This measures the difference in distribution of a pair of binned numeric features. Values near 0 indicate that the distributions agree, while values near 1 indicate that they differ.

  • pval, the p-value corresponding to a null hypothesis test (NHT) that the true frequencies of histogram bins are equal. A small p-value indicates evidence that the two sets of relative frequencies are actually different. The test is based on a modified Chi-squared statistic.
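To make the jsd column more concrete, the following is a minimal sketch of the standard Jensen-Shannon divergence for two sets of bin frequencies defined over the same bins. It assumes a base-2 logarithm, which bounds the result between 0 and 1 as described above; the function name jsd_sketch is hypothetical, and the exact computation used inside inspect_num() may differ.

```r
# Hypothetical helper: Jensen-Shannon divergence between two vectors of
# bin counts (or relative frequencies) over an identical set of bins.
jsd_sketch <- function(p, q) {
  stopifnot(length(p) == length(q))
  p <- p / sum(p)            # normalise to relative frequencies
  q <- q / sum(q)
  m <- (p + q) / 2           # mixture of the two distributions
  # Kullback-Leibler divergence in bits, treating 0 * log(0) as 0
  kl <- function(a, b) {
    keep <- a > 0
    sum(a[keep] * log2(a[keep] / b[keep]))
  }
  (kl(p, m) + kl(q, m)) / 2
}

# Bin two samples using a shared set of breaks, then compare them
x1 <- mtcars$mpg[mtcars$am == 0]
x2 <- mtcars$mpg[mtcars$am == 1]
brks <- pretty(range(c(x1, x2)), n = 10)
h1 <- hist(x1, breaks = brks, plot = FALSE)$counts
h2 <- hist(x2, breaks = brks, plot = FALSE)$counts

jsd_same <- jsd_sketch(h1, h1)   # identical histograms
jsd_diff <- jsd_sketch(h1, h2)   # histograms from two different samples
```

Sharing the breaks between the two samples mirrors the behaviour described for hist_1 and hist_2, where the bins from df1 are reused for df2.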

For a grouped dataframe, the tibble returned is as for a single dataframe, except that the first k columns are the grouping columns. The result contains one row for each numeric column within each unique combination of the grouping variables.

Author

Alastair Rushworth

Examples

# Load dplyr for starwars data & pipe
library(dplyr)

# Single dataframe summary
inspect_num(starwars)
#> # A tibble: 3 × 10
#>   col_name     min    q1 median  mean    q3   max    sd pcnt_na hist        
#>   <chr>      <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl> <named list>
#> 1 height        66 167      180 174.  191     264  34.8    6.90 <tibble>    
#> 2 mass          15  55.6     79  97.3  84.5  1358 169.    32.2  <tibble>    
#> 3 birth_year     8  35       52  87.6  72     896 155.    50.6  <tibble>    

# Paired dataframe comparison
inspect_num(starwars, starwars[1:20, ])
#> # A tibble: 3 × 5
#>   col_name   hist_1            hist_2               jsd  pval
#>   <chr>      <named list>      <named list>       <dbl> <dbl>
#> 1 height     <tibble [21 × 2]> <tibble [21 × 2]> 0.181  0.310
#> 2 mass       <tibble [28 × 2]> <tibble [28 × 2]> 0.0236 0.701
#> 3 birth_year <tibble [18 × 2]> <tibble [18 × 2]> 0.0343 0.252

# Grouped dataframe summary
starwars %>% group_by(gender) %>% inspect_num()
#> # A tibble: 9 × 11
#>   gender    col_n…¹   min    q1 median  mean    q3   max     sd pcnt_na hist    
#>   <chr>     <chr>   <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl>   <dbl> <named >
#> 1 masculine height     66 171.   183   177.  193     264  37.6     6.06 <tibble>
#> 2 masculine mass       15  75     80   106.   88    1358 185.     25.8  <tibble>
#> 3 masculine birth_…     8  31.9   52.5  97.8  82     896 173.     48.5  <tibble>
#> 4 feminine  height     96 162.   166.  165.  172     213  23.6     5.88 <tibble>
#> 5 feminine  mass       45  50     55    54.7  56.2    75   8.59   47.1  <tibble>
#> 6 feminine  birth_…    19  44.5   47.5  47.2  50.5    72  15.0    52.9  <tibble>
#> 7 NA        height    178 180.   183   181.  183     183   2.89   25    <tibble>
#> 8 NA        mass       48  48     48    48    48      48  NA      75    <tibble>
#> 9 NA        birth_…    62  62     62    62    62      62  NA      75    <tibble>
#> # … with abbreviated variable name ¹​col_name