Title: | Quantitative Text Kit |
---|---|
Description: | Support package for the textbook "An Introduction to Quantitative Text Analysis for Linguists: Reproducible Research Using R" (Francom, 2024) <doi:10.4324/9781003393764>. Includes functions to acquire, clean, and analyze text data as well as functions to document and share the results of text analysis. The package is designed to be used in conjunction with the book, but can also be used as a standalone package for text analysis. |
Authors: | Jerid Francom [aut, cre, cph] |
Maintainer: | Jerid Francom <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.0.0 |
Built: | 2024-11-04 02:48:19 UTC |
Source: | https://github.com/qtalr/qtkit |
This function adds a package to a BibTeX file. It uses the knitr::write_bib function to write the package's citation entry to the file.
add_pkg_to_bib(pkg_name, bib_file = "packages.bib")
pkg_name |
The name of the package to add to the BibTeX file. |
bib_file |
The name of the BibTeX file to write to. |
my_bib_file <- tempfile(fileext = ".bib")
add_pkg_to_bib("dplyr", my_bib_file)
readLines(my_bib_file) |> cat(sep = "\n")
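The general idea of writing a package citation to a BibTeX file can be sketched in base R. This is not the package's implementation: it swaps knitr::write_bib for utils::citation() and utils::toBibtex(), the helper name append_pkg_citation is hypothetical, and appending (rather than overwriting) is an assumption made for illustration.

```r
# Hypothetical helper (not part of qtkit): append the BibTeX entry
# for an installed package to a .bib file using base R only.
append_pkg_citation <- function(pkg_name, bib_file) {
  # citation() builds the citation object; toBibtex() renders it as BibTeX lines
  entry <- as.character(utils::toBibtex(utils::citation(pkg_name)))
  # Assumption: entries are appended, separated by a blank line
  cat(entry, "", file = bib_file, sep = "\n", append = TRUE)
  invisible(bib_file)
}

bib_file <- tempfile(fileext = ".bib")
append_pkg_citation("stats", bib_file)
append_pkg_citation("utils", bib_file)

bib_lines <- readLines(bib_file)
cat(bib_lines, sep = "\n")
```

Unlike knitr::write_bib, this sketch does not generate citation keys based on the package name, so the entries would need keys added before use in a document.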
This function calculates various association metrics (PMI, Dice's Coefficient, G-score) for bigrams in a given corpus.
calc_assoc_metrics(data, doc_index, token_index, type, association = "all", verbose = FALSE)
data |
A data frame containing the corpus. |
doc_index |
Column in 'data' which represents the document index. |
token_index |
Column in 'data' which represents the token index. |
type |
Column in 'data' which represents the tokens or terms. |
association |
A character vector specifying which metrics to calculate. Can be any combination of 'pmi', 'dice_coeff', 'g_score', or 'all'. Default is 'all'. |
verbose |
A logical value indicating whether to keep the intermediate probability columns. Default is FALSE. |
A data frame with one row per bigram and columns for each calculated metric.
data_path <- system.file("extdata", "bigrams_data.rds", package = "qtkit")
data <- readRDS(data_path)
calc_assoc_metrics(data, doc_index, token_index, type)
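To make two of the metrics concrete, the following base R sketch computes PMI and Dice's Coefficient for bigrams in a toy corpus. It is an illustration under stated assumptions, not the package's code: the toy data are made up, the marginal frequencies are counted positionally (a type as first or second element of a bigram), and PMI uses log base 2. The G-score is omitted because its formula is not given here.

```r
# Toy corpus (made up): one row per token
tokens <- data.frame(
  doc_index   = c(1, 1, 1, 1, 2, 2, 2),
  token_index = c(1, 2, 3, 4, 1, 2, 3),
  type        = c("the", "old", "dog", "barked", "the", "old", "cat"),
  stringsAsFactors = FALSE
)

# Make sure tokens are in reading order within each document
tokens <- tokens[order(tokens$doc_index, tokens$token_index), ]

# Pair each token with the next token, keeping only pairs within one document
n_tok <- nrow(tokens)
same_doc <- tokens$doc_index[-n_tok] == tokens$doc_index[-1]
bigrams <- data.frame(
  x = tokens$type[-n_tok][same_doc],
  y = tokens$type[-1][same_doc],
  count = 1,
  stringsAsFactors = FALSE
)

# Joint and marginal frequencies (assumption: positional marginals)
counts <- aggregate(count ~ x + y, data = bigrams, FUN = sum)
names(counts)[3] <- "f_xy"
total <- nrow(bigrams)
f_x <- table(bigrams$x)
f_y <- table(bigrams$y)
counts$f_x <- as.vector(f_x[counts$x])
counts$f_y <- as.vector(f_y[counts$y])

# PMI: log2 of observed joint probability over expected under independence
counts$pmi <- log2((counts$f_xy / total) /
  ((counts$f_x / total) * (counts$f_y / total)))

# Dice's Coefficient: 2 * f(xy) / (f(x) + f(y))
counts$dice_coeff <- 2 * counts$f_xy / (counts$f_x + counts$f_y)

print(counts)
```

In this toy corpus "the old" always occurs together, so its Dice's Coefficient is 1, the maximum value.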
This function calculates type metrics for tokenized text data.
calc_type_metrics(data, type, document, frequency = NULL, dispersion = NULL)
data |
A data frame containing the tokenized text data |
type |
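The variable in data that contains the types. |
document |
The variable in data that contains the document identifiers. |
frequency |
A character vector indicating which frequency metrics to use. If NULL (default), only the type and n columns are returned. |
dispersion |
A character vector indicating which dispersion metrics to use. If NULL (default), only the type and n columns are returned. |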
A data frame with columns:
type: The unique types from the input data.
n: The frequency of each type across all documents.
Optionally (based on the frequency and dispersion arguments):
rf: The relative frequency of each type across all documents.
orf: The observed relative frequency (per 100) of each type across all documents.
df: The document frequency of each type.
idf: The inverse document frequency of each type.
dp: Gries' Deviation of Proportions of each type.
Gries, Stefan Th. (2023). Statistical Methods in Corpus Linguistics. In Readings in Corpus Linguistics: A Teaching and Research Guide for Scholars in Nigeria and Beyond, pp. 78-114.
data_path <- system.file("extdata", "types_data.rds", package = "qtkit")
data <- readRDS(data_path)
calc_type_metrics(
  data = data,
  type = type,
  document = document,
  frequency = c("rf", "orf"),
  dispersion = c("df", "idf")
)
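The metrics listed above can be worked through by hand on a small example. The base R sketch below is illustrative only: the toy data are made up, the natural logarithm is assumed for idf, and DP is computed as half the sum of absolute differences between a type's distribution over documents and the documents' shares of the corpus. The package's own computation may differ in these details.

```r
# Toy tokenized data (made up): one row per token
tokens <- data.frame(
  document = c("d1", "d1", "d1", "d1", "d2", "d2"),
  type     = c("a", "a", "b", "c", "a", "b"),
  stringsAsFactors = FALSE
)

# Type-by-document frequency table
tdm <- table(tokens$type, tokens$document)

n   <- rowSums(tdm)          # frequency of each type
rf  <- n / sum(n)            # relative frequency
orf <- rf * 100              # observed relative frequency (per 100)
df  <- rowSums(tdm > 0)      # number of documents containing the type
idf <- log(ncol(tdm) / df)   # assumption: natural log, no smoothing

# Deviation of Proportions
doc_share  <- colSums(tdm) / sum(tdm)  # each document's share of the corpus
type_share <- tdm / n                  # each type's distribution over documents
dp <- 0.5 * rowSums(abs(sweep(type_share, 2, doc_share)))

metrics <- data.frame(
  type = names(n),
  n = as.vector(n),
  rf = as.vector(rf),
  orf = as.vector(orf),
  df = as.vector(df),
  idf = as.vector(idf),
  dp = as.vector(dp),
  stringsAsFactors = FALSE
)
print(metrics)
```

Type "a" is spread across the two documents in exact proportion to their sizes, so its DP is 0. Type "c" occurs only in the larger document and receives the highest DP of the three.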
This function takes a data frame and creates a data dictionary. The data dictionary includes the variable name, a human-readable name, the variable type, and a description. If a model is specified, the function uses OpenAI's API to generate the information based on the characteristics of the data frame.
create_data_dictionary(data, file_path, model = NULL, sample_n = 5, grouping = NULL, force = FALSE)
data |
A data frame to create a data dictionary for. |
file_path |
The file path to save the data dictionary to. |
model |
The ID of the OpenAI chat completion model to use for generating descriptions. If NULL (default), no model is used. |
sample_n |
The number of rows to sample from the data frame to use as input for the model. Default is 5. |
grouping |
A character vector of column names to group by when sampling rows from the data frame for the model. Default NULL. |
force |
If TRUE, overwrite the file at file_path if it already exists. Default is FALSE. |
A data frame containing the variable name, human-readable name, variable type, and description for each variable in the input data frame.
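As a rough picture of what a data dictionary without model-generated descriptions might look like, the base R sketch below builds a skeleton with one row per variable. The helper name make_dictionary_skeleton and the column names variable, name, type, and description are assumptions chosen for illustration, not the package's confirmed output format.

```r
# Hypothetical helper (not part of qtkit): build a data dictionary skeleton
make_dictionary_skeleton <- function(data) {
  data.frame(
    variable = names(data),
    # Human-readable name: replace dots and underscores with spaces
    name = gsub("[._]+", " ", names(data)),
    # Variable type: first class of each column
    type = vapply(data, function(col) class(col)[1], character(1)),
    # Descriptions are left empty to be filled in by hand
    description = NA_character_,
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

dict <- make_dictionary_skeleton(iris)
print(dict)

# Save the skeleton so descriptions can be added later
dict_file <- tempfile(fileext = ".csv")
write.csv(dict, dict_file, row.names = FALSE)
```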
This function creates a data frame with attributes about the data origin, writes it to a CSV file, and optionally returns it.
create_data_origin(file_path, return = FALSE, force = FALSE)
file_path |
File path where the data origin file should be saved. |
return |
Logical value indicating whether the data origin should be returned. |
force |
Logical value indicating whether to overwrite the file if it already exists. |
A data frame containing the data origin information.
tmp_file <- tempfile(fileext = ".csv")
create_data_origin(tmp_file)
read.csv(tmp_file)
This function identifies outliers in a numeric variable of a data.frame using the interquartile range (IQR) method.
find_outliers(data, variable_name)
data |
A data.frame object. |
variable_name |
A symbol representing a numeric variable in data. |
A data.frame containing the outliers in variable_name. If no outliers are found, the function returns NULL. The function also prints diagnostic information about the variable and the number of outliers found.
data(mtcars)
find_outliers(mtcars, mpg)
find_outliers(mtcars, wt)
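The IQR method itself can be sketched in a few lines of base R. This sketch assumes the conventional fences of 1.5 times the IQR below the first quartile and above the third quartile, since the multiplier used by the package is not stated here. The helper iqr_outliers is hypothetical and takes the column name as a string rather than a symbol.

```r
# Hypothetical helper (not part of qtkit): rows whose value lies outside
# [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
iqr_outliers <- function(data, variable) {
  x <- data[[variable]]
  q <- stats::quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- 1.5 * (q[2] - q[1])  # assumption: conventional 1.5 multiplier
  is_outlier <- !is.na(x) & (x < q[1] - fence | x > q[2] + fence)
  data[is_outlier, , drop = FALSE]
}

# Toy data (made up): quartiles are 2.25 and 4.75, so the fences are -1.5 and 8.5
toy <- data.frame(id = 1:6, score = c(1, 2, 3, 4, 5, 100))
outliers <- iqr_outliers(toy, "score")
print(outliers)
```

Only the row with a score of 100 falls outside the fences. Unlike find_outliers, this sketch returns an empty data frame rather than NULL when nothing is flagged.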
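This function downloads a compressed archive file from a URL and unarchives it in the target directory. Possible file types include .zip, .gz, .tar, and .tgz.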
get_archive_data(url, target_dir, force = FALSE, confirmed = FALSE)
url |
A character vector representing the full url to the compressed file |
target_dir |
The directory where the archive file should be downloaded |
force |
An optional argument which forcefully overwrites existing data |
confirmed |
If |
NULL. As a side effect, the archive file is unarchived in the target directory.
## Not run:
data_dir <- file.path(tempdir(), "data")
url <- "https://raw.githubusercontent.com/qtalr/qtkit/main/inst/extdata/test_data.zip"
get_archive_data(
  url = url,
  target_dir = data_dir,
  confirmed = TRUE
)
## End(Not run)
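Handling several archive types usually comes down to choosing an extraction routine from the file extension. The base R sketch below shows one way to do that with utils::unzip() and utils::untar(). The helper names archive_kind and extract_archive are hypothetical, and treating every .gz file as a gzip-compressed tar archive is an assumption; the package may handle these cases differently.

```r
# Hypothetical helper (not part of qtkit): map a file extension to an extractor
archive_kind <- function(path) {
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    zip = "zip",
    gz = ,   # assumption: .gz files are gzip-compressed tar archives
    tar = ,
    tgz = "tar",
    stop("Unsupported archive type: ", ext)
  )
}

# Hypothetical helper: extract a local archive into target_dir
extract_archive <- function(archive, target_dir) {
  dir.create(target_dir, recursive = TRUE, showWarnings = FALSE)
  if (archive_kind(archive) == "zip") {
    utils::unzip(archive, exdir = target_dir)
  } else {
    utils::untar(archive, exdir = target_dir)
  }
  invisible(NULL)
}

# Classify a few example file names
kinds <- vapply(
  c("corpus.zip", "corpus.tar.gz", "corpus.tgz", "corpus.tar"),
  archive_kind,
  character(1)
)
print(kinds)
```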
Retrieves works from Project Gutenberg based on specified criteria and saves the data to a CSV file. This function is a wrapper for the gutenbergr package.
get_gutenberg_data(target_dir, lcc_subject, birth_year = NULL, death_year = NULL, n_works = 100, force = FALSE, confirmed = FALSE)
target_dir |
The directory where the CSV file will be saved. |
lcc_subject |
A character vector specifying the Library of Congress Classification (LCC) subjects to filter the works. |
birth_year |
An optional integer specifying the minimum birth year of authors to include. |
death_year |
An optional integer specifying the maximum death year of authors to include. |
n_works |
An integer specifying the number of works to retrieve. Default is 100. |
force |
A logical value indicating whether to overwrite existing data if it already exists. |
confirmed |
If |
This function retrieves Gutenberg works based on the specified LCC subjects and optional author birth and death years. It checks if the data already exists in the target directory and provides an option to overwrite it. The function also creates the target directory if it doesn't exist. If the number of works is greater than 1000 and the 'confirmed' parameter is not set to TRUE, it prompts the user for confirmation. The retrieved works are filtered based on public domain rights in the USA and availability of text. The resulting works are downloaded and saved as a CSV file in the target directory.
For more information on Library of Congress Classification (LCC) subjects, refer to the Library of Congress Classification Guide.
A message indicating whether the data was acquired or already existed on disk. As a side effect, the data files are written to the specified target directory.
## Not run:
data_dir <- file.path(tempdir(), "data")
get_gutenberg_data(
  target_dir = data_dir,
  lcc_subject = "JC",
  n_works = 5,
  confirmed = TRUE
)
## End(Not run)
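The filtering steps described above (LCC subject, author birth and death years, public domain rights, text availability, and a cap on the number of works) can be sketched on a toy metadata table in base R. Everything here is illustrative: the data are made up, and the column names and the helper select_works are hypothetical rather than taken from the package or from gutenbergr.

```r
# Toy metadata (made up); column names are assumptions for illustration
works <- data.frame(
  title = c("Work A", "Work B", "Work C", "Work D"),
  lcc = c("JC", "JC", "PR", "JC"),
  birth_year = c(1806, 1588, 1775, 1712),
  death_year = c(1873, 1679, 1817, 1778),
  rights = c(
    "Public domain in the USA.", "Public domain in the USA.",
    "Public domain in the USA.", "Copyrighted."
  ),
  has_text = c(TRUE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Hypothetical helper: apply the filters, then keep at most n_works rows
select_works <- function(works, lcc_subject, birth_year = NULL,
                         death_year = NULL, n_works = 100) {
  keep <- works$lcc %in% lcc_subject &
    works$rights == "Public domain in the USA." &
    works$has_text
  # birth_year is a minimum, death_year a maximum; both are optional
  if (!is.null(birth_year)) keep <- keep & works$birth_year >= birth_year
  if (!is.null(death_year)) keep <- keep & works$death_year <= death_year
  utils::head(works[keep, , drop = FALSE], n_works)
}

all_jc <- select_works(works, lcc_subject = "JC")
modern_jc <- select_works(works, lcc_subject = "JC", birth_year = 1700)
print(modern_jc)
```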
This function is a wrapper around ggsave from the ggplot2 package that allows you to write a ggplot object as part of a knitr document as an output for later use. It is designed to be used in a code block. The file name, if not specified, will be the label of the code block.
write_gg(gg_obj = NULL, file = NULL, target_dir = NULL, device = "pdf", theme = NULL, ...)
gg_obj |
The ggplot to be written. If not specified, the last ggplot created will be written. |
file |
The name of the file to be written. If not specified, the label of the code block will be used. |
target_dir |
The directory where the file will be written. If not specified, the current working directory will be used. |
device |
The device to be used for saving the ggplot. Options include "pdf" (default), "png", "jpeg", "tiff", and "svg". |
theme |
The ggplot2 theme to be applied to the ggplot. Default is the theme specified in the ggplot2 options. |
... |
Additional arguments to be passed to the ggsave function. |
The path of the written file.
## Not run:
library(ggplot2)
plot_dir <- file.path(tempdir(), "plot")
# Write a ggplot object as a PDF file
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()
write_gg(
  gg_obj = p,
  file = "plot_file",
  target_dir = plot_dir,
  device = "pdf"
)
# Remove the directory and its contents (requires recursive = TRUE)
unlink(plot_dir, recursive = TRUE)
## End(Not run)
This function is a wrapper around save_kable from the kableExtra package that allows you to write a kable object as part of a knitr document as an output for later use. It is designed to be used in a code block. The file name, if not specified, will be the label of the code block.
write_kbl(kbl_obj, file = NULL, target_dir = NULL, device = "pdf", bs_theme = "bootstrap", ...)
kbl_obj |
The knitr_kable object to be written. |
file |
The name of the file to be written. If not specified, the name will be based on the current knitr code block label. |
target_dir |
The directory where the file will be written. If not specified, the current working directory will be used. |
device |
The device to be used for saving the file. Options include "pdf" (default), "html", "latex", "png", and "jpeg". Note that a Chromium-based browser (e.g., Google Chrome, Chromium, Microsoft Edge, or Brave) is required on your system for all options except "latex". If a suitable browser is not available, the function will stop and return an error message. |
bs_theme |
The Bootstrap theme to be applied to the kable object (only applicable for HTML output). Default is "bootstrap". |
... |
Additional arguments to be passed to the save_kable function. |
The path of the written file.
## Not run:
library(knitr)
table_dir <- file.path(tempdir(), "table")
mtcars_kbl <- kable(
  x = mtcars[1:5, ],
  format = "html"
)
# Write a kable object as a PDF file
write_kbl(
  kbl_obj = mtcars_kbl,
  file = "kable_pdf",
  target_dir = table_dir,
  device = "pdf"
)
# Write a kable as an HTML file with a custom Bootstrap theme
write_kbl(
  kbl_obj = mtcars_kbl,
  file = "kable_html",
  target_dir = table_dir,
  device = "html",
  bs_theme = "flatly"
)
# Remove the directory and its contents (requires recursive = TRUE)
unlink(table_dir, recursive = TRUE)
## End(Not run)
This function is a wrapper around dput that allows you to write an R object as part of a knitr document as an output for later use. It is designed to be used in a code block. The file name, if not specified, will be the label of the code block. Use the standard dget function to read the file back into an R session.
write_obj(obj, file = NULL, target_dir = NULL, ...)
obj |
The R object to be written. |
file |
The name of the file to be written. If not specified, the label of the code block will be used. |
target_dir |
The directory where the file will be written. If not specified, the current working directory will be used. |
... |
Additional arguments to be passed to dput. |
The path of the written file.
## Not run:
obj_dir <- file.path(tempdir(), "obj")
# Write a data frame as a file
write_obj(
  obj = mtcars,
  file = "mtcars_data",
  target_dir = obj_dir
)
# Read the file back into an R session
my_mtcars <- dget(file.path(obj_dir, "mtcars_data"))
# Remove the directory and its contents (requires recursive = TRUE)
unlink(obj_dir, recursive = TRUE)
## End(Not run)
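Since write_obj is described as a wrapper around dput, the underlying round trip can be tried directly in base R without the package. The sketch below writes an object with dput, reads it back with dget, and checks that the two match. The directory and file names are arbitrary choices for illustration.

```r
# Write an R object as text with dput() and restore it with dget()
obj_dir <- file.path(tempdir(), "obj_demo")
dir.create(obj_dir, showWarnings = FALSE)
obj_file <- file.path(obj_dir, "mtcars_head")

original <- head(mtcars, 3)
dput(original, file = obj_file)

# The file contains R code that recreates the object
restored <- dget(obj_file)

# all.equal() tolerates tiny numeric differences from the text representation
round_trip_ok <- isTRUE(all.equal(original, restored))
print(round_trip_ok)

# Clean up the directory and its contents
unlink(obj_dir, recursive = TRUE)
```

Because dput stores the object as plain R code, the file is human-readable and can be inspected or versioned, at the cost of being larger and slower to read than a binary format such as RDS.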