Getting started with DataSum

Why DataSum?

DataSum is built for the first serious look at a dataset. Before modeling, teaching, or publication, analysts need to know what is missing, what is unusual, which variables are skewed, whether normality checks are meaningful, and which columns need closer inspection.

library(DataSum)

Summarize one variable

summarize_vector(c(1, 2, 2, NA, 10), name = "score")
#>   variable    type n n_complete n_missing missing_pct n_unique mode mode_count
#> 1    score numeric 5          4         1          20        3    2          2
#>   mode_ties mean median       sd variance minimum  q25 q75 maximum range  iqr
#> 1     FALSE 3.75      2 4.193249 17.58333       1 1.75   4      10     9 2.25
#>      mad  skewness excess_kurtosis outlier_count outlier_pct normality_test
#> 1 0.7413 0.7209456        -1.70475             1          25   Shapiro-Wilk
#>   normality_statistic normality_p_value normality_alpha
#> 1           0.7252874        0.02203226            0.05
#>           normality_decision warning
#> 1 Evidence against normality    <NA>
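Several columns in this output can be cross-checked with base R. The sketch below is not DataSum's implementation. It assumes the package uses R's default quantile definition (type 7), the default `mad()` scaling constant, and Tukey's 1.5 x IQR fences for outliers. Under those assumptions the values agree with the q25, q75, iqr, mad, outlier_count, and outlier_pct columns above.

```r
# Same input as above; drop the missing value before computing anything.
x <- c(1, 2, 2, NA, 10)
x_complete <- x[!is.na(x)]

# Quartiles with R's default (type 7) definition.
q <- quantile(x_complete, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Tukey fences (assumed rule): flag values more than 1.5 * IQR beyond the quartiles.
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr
n_outliers <- sum(x_complete < lower_fence | x_complete > upper_fence)
outlier_pct <- 100 * n_outliers / length(x_complete)

# Median absolute deviation, scaled by the default constant 1.4826.
mad_value <- mad(x_complete)

c(q25 = q[1], q75 = q[2], iqr = iqr, mad = mad_value,
  outlier_count = n_outliers, outlier_pct = outlier_pct)
```

Note that outlier_pct is computed against the four complete values, not the five supplied, which is why the single flagged value (10) counts as 25 percent.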

Summarize a data frame

summarize_data(iris)
#>       variable    type   n n_complete n_missing missing_pct n_unique
#> 1 Sepal.Length numeric 150        150         0           0       35
#> 2  Sepal.Width numeric 150        150         0           0       23
#> 3 Petal.Length numeric 150        150         0           0       43
#> 4  Petal.Width numeric 150        150         0           0       22
#> 5      Species  factor 150        150         0           0        3
#>                            mode mode_count mode_ties     mean median        sd
#> 1                             5         10     FALSE 5.843333   5.80 0.8280661
#> 2                             3         26     FALSE 3.057333   3.00 0.4358663
#> 3                      1.4, 1.5         13      TRUE 3.758000   4.35 1.7652982
#> 4                           0.2         29     FALSE 1.199333   1.30 0.7622377
#> 5 setosa, versicolor, virginica         50      TRUE       NA     NA        NA
#>    variance minimum q25 q75 maximum range iqr     mad   skewness
#> 1 0.6856935     4.3 5.1 6.4     7.9   3.6 1.3 1.03782  0.3086407
#> 2 0.1899794     2.0 2.8 3.3     4.4   2.4 0.5 0.44478  0.3126147
#> 3 3.1162779     1.0 1.6 5.1     6.9   5.9 3.5 1.85325 -0.2694109
#> 4 0.5810063     0.1 0.3 1.8     2.5   2.4 1.5 1.03782 -0.1009166
#> 5        NA      NA  NA  NA      NA    NA  NA      NA         NA
#>   excess_kurtosis outlier_count outlier_pct normality_test normality_statistic
#> 1      -0.6058125             0    0.000000   Shapiro-Wilk           0.9760903
#> 2       0.1387047             4    2.666667   Shapiro-Wilk           0.9849179
#> 3      -1.4168574             0    0.000000   Shapiro-Wilk           0.8762681
#> 4      -1.3581792             0    0.000000   Shapiro-Wilk           0.9018349
#> 5              NA             0          NA           <NA>                  NA
#>   normality_p_value normality_alpha            normality_decision
#> 1      1.018116e-02            0.05    Evidence against normality
#> 2      1.011543e-01            0.05 No evidence against normality
#> 3      7.412263e-10            0.05    Evidence against normality
#> 4      1.680465e-08            0.05    Evidence against normality
#> 5                NA            0.05                    Not tested
#>                                                warning
#> 1                                                 <NA>
#> 2                                                 <NA>
#> 3                                                 <NA>
#> 4                                                 <NA>
#> 5 Normality requires at least 3 finite numeric values.
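The normality columns can be compared against base R's `shapiro.test()`. This is a minimal sketch that assumes DataSum's "Shapiro-Wilk" label refers to the same test; it only uses base R and the built-in `iris` data.

```r
# Keep only the numeric columns; Species is a factor and is not tested.
numeric_cols <- iris[vapply(iris, is.numeric, logical(1))]

# Run Shapiro-Wilk on each column and collect the statistic and p-value.
checks <- t(vapply(
  numeric_cols,
  function(v) {
    res <- shapiro.test(v)
    c(statistic = unname(res$statistic), p_value = res$p.value)
  },
  c(statistic = 0, p_value = 0)
))

# Apply the same 0.05 threshold shown in the normality_alpha column.
data.frame(
  variable = rownames(checks),
  statistic = checks[, "statistic"],
  p_value = checks[, "p_value"],
  rejects_normality = checks[, "p_value"] < 0.05,
  row.names = NULL
)
```

With `iris`, only Sepal.Width stays above the 0.05 threshold, which matches the normality_decision column in the summary.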

Pass a grouping column through the by argument to get the same summary within each group. Grouped summaries are useful for teaching and comparative research workflows.

summarize_data(iris, by = "Species")
#> Warning in data.frame(..., check.names = FALSE): row names were found from a
#> short variable and have been discarded
#> Warning in data.frame(..., check.names = FALSE): row names were found from a
#> short variable and have been discarded
#> Warning in data.frame(..., check.names = FALSE): row names were found from a
#> short variable and have been discarded
#>       Species     variable    type  n n_complete n_missing missing_pct n_unique
#> 1      setosa Sepal.Length numeric 50         50         0           0       15
#> 2      setosa  Sepal.Width numeric 50         50         0           0       16
#> 3      setosa Petal.Length numeric 50         50         0           0        9
#> 4      setosa  Petal.Width numeric 50         50         0           0        6
#> 5  versicolor Sepal.Length numeric 50         50         0           0       21
#> 6  versicolor  Sepal.Width numeric 50         50         0           0       14
#> 7  versicolor Petal.Length numeric 50         50         0           0       19
#> 8  versicolor  Petal.Width numeric 50         50         0           0        9
#> 9   virginica Sepal.Length numeric 50         50         0           0       21
#> 10  virginica  Sepal.Width numeric 50         50         0           0       13
#> 11  virginica Petal.Length numeric 50         50         0           0       20
#> 12  virginica  Petal.Width numeric 50         50         0           0       12
#>             mode mode_count mode_ties  mean median        sd   variance minimum
#> 1         5, 5.1          8      TRUE 5.006   5.00 0.3524897 0.12424898     4.3
#> 2            3.4          9     FALSE 3.428   3.40 0.3790644 0.14368980     2.3
#> 3       1.4, 1.5         13      TRUE 1.462   1.50 0.1736640 0.03015918     1.0
#> 4            0.2         29     FALSE 0.246   0.20 0.1053856 0.01110612     0.1
#> 5  5.5, 5.6, 5.7          5      TRUE 5.936   5.90 0.5161711 0.26643265     4.9
#> 6              3          8     FALSE 2.770   2.80 0.3137983 0.09846939     2.0
#> 7            4.5          7     FALSE 4.260   4.35 0.4699110 0.22081633     3.0
#> 8            1.3         13     FALSE 1.326   1.30 0.1977527 0.03910612     1.0
#> 9            6.3          6     FALSE 6.588   6.50 0.6358796 0.40434286     4.9
#> 10             3         12     FALSE 2.974   3.00 0.3224966 0.10400408     2.2
#> 11           5.1          7     FALSE 5.552   5.55 0.5518947 0.30458776     4.5
#> 12           1.8         11     FALSE 2.026   2.00 0.2746501 0.07543265     1.4
#>      q25   q75 maximum range   iqr     mad    skewness excess_kurtosis
#> 1  4.800 5.200     5.8   1.5 0.400 0.29652  0.11297784      -0.4508724
#> 2  3.200 3.675     4.4   2.1 0.475 0.37065  0.03872946       0.5959507
#> 3  1.400 1.575     1.9   0.9 0.175 0.14826  0.10009538       0.6539303
#> 4  0.200 0.300     0.6   0.5 0.100 0.00000  1.17963278       1.2587179
#> 5  5.600 6.300     7.0   2.1 0.700 0.51891  0.09913926      -0.6939138
#> 6  2.525 3.000     3.4   1.4 0.475 0.29652 -0.34136443      -0.5493203
#> 7  4.000 4.600     5.1   2.1 0.600 0.51891 -0.57060243      -0.1902555
#> 8  1.200 1.500     1.8   0.8 0.300 0.22239 -0.02933377      -0.5873144
#> 9  6.225 6.900     7.9   3.0 0.675 0.59304  0.11102862      -0.2032597
#> 10 2.800 3.175     3.8   1.6 0.375 0.29652  0.34428489       0.3803832
#> 11 5.100 5.875     6.9   2.4 0.775 0.66717  0.51691747      -0.3651161
#> 12 1.800 2.300     2.5   1.1 0.500 0.29652 -0.12181190      -0.7539586
#>    outlier_count outlier_pct normality_test normality_statistic
#> 1              0           0   Shapiro-Wilk           0.9776985
#> 2              2           4   Shapiro-Wilk           0.9717195
#> 3              4           8   Shapiro-Wilk           0.9549768
#> 4              2           4   Shapiro-Wilk           0.7997645
#> 5              0           0   Shapiro-Wilk           0.9778357
#> 6              0           0   Shapiro-Wilk           0.9741333
#> 7              1           2   Shapiro-Wilk           0.9660044
#> 8              0           0   Shapiro-Wilk           0.9476263
#> 9              1           2   Shapiro-Wilk           0.9711794
#> 10             3           6   Shapiro-Wilk           0.9673905
#> 11             0           0   Shapiro-Wilk           0.9621864
#> 12             0           0   Shapiro-Wilk           0.9597715
#>    normality_p_value normality_alpha            normality_decision warning
#> 1       4.595132e-01            0.05 No evidence against normality    <NA>
#> 2       2.715264e-01            0.05 No evidence against normality    <NA>
#> 3       5.481147e-02            0.05 No evidence against normality    <NA>
#> 4       8.658573e-07            0.05    Evidence against normality    <NA>
#> 5       4.647370e-01            0.05 No evidence against normality    <NA>
#> 6       3.379951e-01            0.05 No evidence against normality    <NA>
#> 7       1.584778e-01            0.05 No evidence against normality    <NA>
#> 8       2.727780e-02            0.05    Evidence against normality    <NA>
#> 9       2.583147e-01            0.05 No evidence against normality    <NA>
#> 10      1.808960e-01            0.05 No evidence against normality    <NA>
#> 11      1.097754e-01            0.05 No evidence against normality    <NA>
#> 12      8.695419e-02            0.05 No evidence against normality    <NA>
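For a single variable, the grouped figures can be reproduced by hand with `split()`. The sketch below uses only base R and is meant as a cross-check of the n, mean, median, and sd columns for Sepal.Length, not as a description of how DataSum groups data internally.

```r
# Split one numeric column by the grouping factor.
groups <- split(iris$Sepal.Length, iris$Species)

# Compute a few per-group statistics; row.names = NULL keeps plain 1..k row names.
group_stats <- data.frame(
  Species = names(groups),
  n = unname(lengths(groups)),
  mean = unname(vapply(groups, mean, numeric(1))),
  median = unname(vapply(groups, median, numeric(1))),
  sd = unname(vapply(groups, sd, numeric(1))),
  row.names = NULL
)

group_stats
```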

Profile a dataset

profile <- profile_data(iris)
profile$dataset
#>   rows columns complete_rows duplicated_rows total_missing missing_pct
#> 1  150       5           150               1             0           0
#>          type_profile
#> 1 factor=1, numeric=4
profile$warnings
#>       variable        level
#> 1    <dataset>   duplicates
#> 2 Sepal.Length    normality
#> 3 Petal.Length    normality
#> 4  Petal.Width    normality
#> 5      Species data-quality
#>                                                           message
#> 1                                   Duplicate rows were detected.
#> 2 Normality test suggests evidence against a normal distribution.
#> 3 Normality test suggests evidence against a normal distribution.
#> 4 Normality test suggests evidence against a normal distribution.
#> 5            Normality requires at least 3 finite numeric values.
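The dataset-level counts can be confirmed with base R, which also shows which row triggers the duplicates warning. This sketch assumes that "duplicated rows" means rows identical to an earlier row across all columns, as `duplicated()` defines it.

```r
# Rows that repeat an earlier row exactly.
dup_rows <- which(duplicated(iris))

# Missing values per column and the number of fully observed rows.
n_missing <- colSums(is.na(iris))
n_complete <- sum(complete.cases(iris))

length(dup_rows)   # number of duplicated rows
sum(n_missing)     # total missing values
n_complete         # rows without any missing value

# Inspect the duplicated rows before deciding whether to drop them.
iris[dup_rows, ]
```

Whether a duplicate is an error or a genuine repeated measurement depends on the data, so the warning is a prompt to inspect the rows rather than an instruction to remove them.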

Create a report scaffold

report_path <- datasum_report(iris, format = "qmd", render = FALSE)
file.exists(report_path)
#> [1] TRUE

The generated Quarto source contains the dataset overview, variable diagnostics, warnings, formula definitions, and interpretation notes. Rendering it to HTML, PDF, or DOCX requires the optional quarto package and the Quarto CLI.
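One way to render the scaffold only when the tooling is present is sketched below. The helper name `render_if_available()` is hypothetical and not part of DataSum. It relies on `quarto::quarto_path()` and `quarto::quarto_render()` from the quarto R package, and returns FALSE without rendering when the file, the package, or the CLI is missing.

```r
# Hypothetical helper (not part of DataSum): render a .qmd file only when
# the quarto R package and the Quarto CLI can both be found.
render_if_available <- function(path, format = "html") {
  if (!file.exists(path)) {
    return(FALSE)
  }
  if (!requireNamespace("quarto", quietly = TRUE)) {
    message("The quarto package is not installed; skipping render.")
    return(FALSE)
  }
  if (is.null(quarto::quarto_path())) {
    message("The Quarto CLI was not found; skipping render.")
    return(FALSE)
  }
  quarto::quarto_render(path, output_format = format)
  TRUE
}
```

With the report_path created above, `render_if_available(report_path)` would attempt an HTML render; passing `format = "pdf"` or `format = "docx"` selects the other output formats, and PDF output typically also needs a LaTeX installation.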