! | Reverses the meaning of the following condition. It is a logical negation and has a similar meaning as 'Not'. |
all | Return TRUE if the condition is satisfied for all the rows or a group of rows. |
any | Return TRUE if the condition is satisfied for any of the rows or a group of rows. |
duplicated | Returns a logical value (TRUE or FALSE) indicating whether a value or row is duplicated. |
ifelse | Returns different values based on the conditional result. |
is.character | Returns true if a given object is character type. |
is.double | Returns true if a given object is double type. |
is.factor | Returns true if a given object is factor type. |
is.list | Returns true if a given object is list type. |
is.logical | Returns true if a given object is logical type. |
is.matrix | Returns true if a given object is matrix type. |
is.na | Returns TRUE when it is NA. |
is.null | Returns TRUE when it is null. |
is.numeric | Returns true if a given object is numeric type. |
isTRUE | Returns true if and only if a given value is a length-one logical vector whose only element is TRUE. |
which | Returns the TRUE indices of a logical object or based on the condition. |
xor | Performs an exclusive-or check across two values and returns TRUE when only one of them is TRUE. |
is.Date | Returns true if a given object is Date type. |
is.difftime | Returns true if a given object is difftime type. |
is.duration | Returns true if a given object is duration type. |
is.POSIXct | Returns true if a given object is POSIXct type. |
str_detect | Returns TRUE or FALSE based on whether text data contains a given text or not. |
if_else | Returns values based on whether a given condition is satisfied or not. |
case_when | Returns values based on multiple conditions. This is similar to CASE WHEN conditions in SQL. You can use a two-sided formula to define the condition and the mapped values, and use TRUE as the ELSE condition. See examples below. |
coalesce | Replace the missing values with a given value or the values that are at the same position in other columns. |
na_if | Replaces the matching values to NA. |
near | Compares two numeric vectors (columns). This is a safe way of comparing if two vectors of floating point numbers are (pairwise) equal. This is safer than using ==, because it has a built-in tolerance. |
recode_factor | Recodes (or replaces) the original values with given values and creates a factor column. You can replace numeric values either by the name with backticks or based on their position, and character values by their name. |
parse_logical | Convert data to Logical (Boolean) data type. |
is.hms | Returns true if a given object is hms type. |
str_logical | Converts the data type of a column to logical type, transforming values into TRUE or FALSE. If the "true_value" argument is not specified, it converts "true", "Yes", "yes", "1" to TRUE, and "false", "No", "no", "0" to FALSE. If you want to make a specific value TRUE and all other values FALSE, specify the original value in true_value. Even if the target column for logical conversion contains a mix of uppercase and lowercase letters, the conversion ignores case and focuses only on the string content. |
impute_na | Impute NA by mean / median / mode values, manually entered values, or predicted values. It will build a linear regression model for the prediction. |
get_mode | Returns the most frequent value (mode) in a column. |
summarize_group | Summarize (or aggregate) column values with specified aggregate functions. If grouping columns are provided then summarize the values for each group. |
summarize_row | Summarize (or aggregate) column values across columns for each row with the specified aggregate function. |
mutate_group | Create new columns or update existing columns with specified functions such as Window Calculations. If grouping columns are provided then it performs calculation for each group. |
confint_ratio | Returns a confidence interval range (half-width of confidence interval) of TRUE ratio from given logical data. |
calc_confint_ratio | Returns a confidence interval range (half-width of confidence interval) of ratio from a size and a TRUE ratio of sample data. |
abbreviate | Abbreviate strings to at least minlength characters, such that they remain unique (if they were). First all spaces at the beginning of the string are stripped. Then (if necessary) any other spaces are stripped. Next, lower case vowels are removed (starting at the right) followed by lower case consonants. Finally if the abbreviation is still longer than minlength upper case letters are stripped. |
c | This is a generic function which combines its arguments. |
iconv | Convert a given text from one encoding to another. |
nchar | Returns the number of letters or the size of a given value. |
stri_compare | Compares two strings lexicographically and determines their order. The result is returned as a numeric value: -1 (comes before), 0 (equal), or 1 (comes after). |
stri_datetime_format | Converts a date-time column to a string based on the specified format. |
stri_escape_unicode | Converts non-ASCII characters (such as Japanese multibyte characters or special symbols) in a string to Unicode escape sequences (\uXXXX format). ASCII characters (alphanumeric characters and basic symbols) are preserved as is. |
stri_numbytes | Returns the number of bytes required to store text data. In UTF-8 encoding, one character can be represented by 1-4 bytes, so the number of characters and the number of bytes may differ. |
stri_rand_strings | Generates random strings of specified length. Various character types can be used for generation, including ASCII characters and Unicode characters. |
regex | Modifier functions control the meaning of the pattern argument to stringr functions |
str_c | Concatenates text values from multiple columns. |
str_conv | Convert text to a specified encoding. e.g. UTF-8 |
str_count | Returns the number of letters or words in text data. You can also set a pattern to count only the matched pattern or text. |
str_detect | Returns TRUE or FALSE based on whether text data contains a given text or not. |
str_dup | Repeat text values. |
str_extract | Extracts only the letters that match given letters or patterns. |
str_extract_all | Extracts all characters that match given letters or patterns. Can be combined with the list_to_text function to convert extracted strings into a concatenated text. |
str_length | Returns the number of letters in text data. |
str_pad | Adds white spaces to text values to make the total length exactly a specified number. |
str_replace | Replaces the first occurrence of letters that match given letters or expressions. |
str_replace_all | Replaces all occurrences of letters that match given letters or expressions. |
str_replace_na | Replaces NA values with a specific text. |
str_split | Splits a given text into multiple texts by a given separator. It returns a list, so you may want to unnest the newly created column with the unnest() function. |
str_sub | Extracts letters from text values based on position. |
str_to_lower | Convert text to lowercase |
str_to_title | Convert text to Title-case |
str_to_upper | Convert text to UPPERCASE |
str_trim | Trim (or remove) white spaces in Text values |
word | Extract First / Last Nth Word in Text data |
extract_numeric | Extract only the number from a given text. This is useful for strings that are supposed to be numbers with extra formatting (e.g. $1,200.34, -12%, X1, etc.). |
url_parameters | Returns decoded query parameters from url. |
coalesce | Replace the missing values with a given value or the values that are at the same position in other columns. |
na_if | Replaces the matching values to NA. |
recode | Recodes (or replaces) the original values with given values. You can replace numeric values either by the name with backticks or based on their position, and character values by their name. |
parse_number | Parse characters, extract numeric values, and convert to number data type, including values with extra formatting such as "$1,200" or "15%". |
str_squish | Remove leading and trailing white spaces and convert multiple consecutive white spaces into a single space. |
str_trunc | Truncate text to make the maximum length to be the specified number. |
str_remove | Remove letters that match with given letters or expressions. |
str_remove_all | Remove letters that match with given letters or expressions |
str_remove_word | Remove a word from a sentence. |
str_replace_word | Replace a word in a sentence with given text. |
str_remove_inside | Removes text inside symbols with given begin and end symbols. |
str_replace_inside | Replaces text inside symbols with given begin and end symbols. |
str_remove_url | Removes a URL inside text. |
str_replace_url | Replaces a URL inside text with provided text. |
str_extract_url | Extracts a URL from inside text. |
str_remove_emoji | Removes emoji from text. |
anonymize | Anonymize values by hashing algorithms. |
str_clean | Cleans up text by removing escape characters (e.g. \n, \t), extra white spaces, extra period, and leading and trailing spaces. |
str_count_all | Count patterns from texts |
str_normalize | Normalize text by replacing full-width alphabets, numbers, special characters with regular alphabets, numbers, special characters, replacing half-width Kana characters with full-width Kana characters, and applying other normalization rules. It follows the rule of Unicode Normalization Forms. This is a wrapper function around stringi::stri_trans_nfkc function. |
str_extract_inside | Extract text inside symbols with given begin and end symbols. If multiple strings are extracted, they will be returned as a list type, so please wrap them with the list_to_text function. |
str_logical | Converts the data type of a column to logical type, transforming values into TRUE or FALSE. If the "true_value" argument is not specified, it converts "true", "Yes", "yes", "1" to TRUE, and "false", "No", "no", "0" to FALSE. If you want to make a specific value TRUE and all other values FALSE, specify the original value in true_value. Even if the target column for logical conversion contains a mix of uppercase and lowercase letters, the conversion ignores case and focuses only on the string content. |
get_stopwords | Returns stop words like "a", "the", "and", etc. |
word_to_sentiment | Returns sentiment types of positive or negative based on word(s). |
get_sentiment | Returns the sentiment score for a text sentence. The positive score indicates the positive sentence and the negative score indicates the opposite. 0 means 'neutral.' |
stem_word | Stems words so that nearly identical words with slightly different spellings can be recognized as the same word. |
is_stopword | Returns TRUE if a word is included in a list of the stop words defined by one of the dictionaries. |
is_empty | Returns TRUE if a text is an empty string or NA. |
is_alphabet | Returns TRUE if a text contains only alphabetic characters. |
ip_to_country | Returns country names from IP addresses. |
url_domain | Returns domain (ex. "exploratory.io") from url. |
url_fragment | Returns fragment from url. |
url_path | Returns path from url. |
url_port | Returns port from url. |
url_scheme | Returns scheme (ex. "http", "https") from url. |
url_suffix | Returns suffix (ex. "com", "org") from url. |
url_subdomain | Returns subdomain (ex. "www", "blog") from url. |
url_tld | Returns top-level domain (ex. "com", "co") from url. |
url_param | Returns decoded query parameter from url. |
countrycode | Converts country names or country codes to other formats of country codes or names (country names, continent names, region names, etc.). This is useful for standardizing country codes from different data sources when conducting international data analysis or creating reports. |
statecode | This function takes a column that has US State information and returns either US State names, abbreviations, numeric codes, division, or region, based on 'output_type' argument setting. The US State information can be either US State names, US State abbreviations (FIPS / ANSI codes), or US State numeric code (FIPS) in any combination. |
countycode | Generate US county code (FIPS - Federal Information Processing Standard) based on US State and County names. |
impute_na | Impute NA by mean / median / mode values, manually entered values, or predicted values. It will build a linear regression model for the prediction. |
get_mode | Returns the most frequent value (mode) in a column. |
summarize_group | Summarize (or aggregate) column values with specified aggregate functions. If grouping columns are provided then summarize the values for each group. |
summarize_row | Summarize (or aggregate) column values across columns for each row with the specified aggregate function. |
mutate_group | Create new columns or update existing columns with specified functions such as Window Calculations. If grouping columns are provided then it performs calculation for each group. |
str_remove_before | Removes the text before a specified delimiter or regex match. |
str_remove_after | Removes the text after a specified delimiter or regex match. |
str_remove_range | Removes characters from a specified range in text data. |
str_replace_before | Replaces the text before a specified delimiter or regex match with a specified string. |
str_replace_after | Replaces the text after a specified delimiter or regex match with a specified string. |
str_replace_range | Replaces characters in a specified range of text data with another string. |
str_extract_before | Extracts the text before a specified delimiter or regex match. |
str_extract_after | Extracts the text after a specified delimiter or regex match. |
str_conv_zenhan | Converts between full-width (zenkaku) and half-width (hankaku) characters in a string. |
str_conv_hirakana | Converts between hiragana and katakana characters in a string. |
separate_japanese_address | Splits a Japanese address string into prefecture, city, and street components. |
convert_jyear | Converts Japanese era years (Reiwa, Heisei, Showa, etc.) to Western calendar years. Supports both kanji notation (令和三年) and alphabetic notation (R3). |
convert_prefecture | Converts prefecture names between kanji format (e.g., 東京都) and romanized format (e.g., Tokyo-to). Useful for normalizing and standardizing address data. |
convert_prefecture_from_kana | Converts prefecture names written in hiragana or katakana to kanji notation. Useful for normalizing address data and name matching processes. |
harmonize_prefecture_name | Standardizes prefecture names to a unified format. Absorbs differences such as the presence/absence of "都", "府", "県" suffixes, as well as abbreviated and formal name variations, to provide consistent notation. |
is_zipcode | Determines whether a string is in the correct format for a Japanese postal code. Supports both 7-digit numeric format and hyphenated format. |
str_jconv | Converts between different Japanese character types (hiragana, katakana, full-width, half-width, etc.). By specifying the conversion function and target format, various character type conversions are supported. |
str_jnormalize | Normalizes strings to a standardized format. Converts full-width alphanumeric characters to half-width, unifies space characters, removes duplicate spaces, and more. |
zipcode_spacer | Formats postal code strings. Can convert between numeric-only format and hyphenated format. Useful for data standardization and display format unification. |
abs | Returns the absolute value. |
acos | Returns the inverse cosine of a value, in radians |
acosh | Returns the inverse hyperbolic cosine of a number. |
asin | Returns the inverse sine of a value, in radians. |
asinh | Returns the inverse hyperbolic sine of a number. |
atan | Returns the inverse tangent of a value, in radians. |
atan2 | Returns the angle between the x-axis and a line segment from the origin (0,0) to specified coordinate pair (`x`,`y`), in radians. |
atanh | Returns the inverse hyperbolic tangent of a number. |
ceiling | Rounds numeric values up to the specified decimal place. |
cos | Returns the cosine of an angle provided in radians. |
cosh | Returns the hyperbolic cosine of any real number. |
cospi | Returns cos(pi * x) for a given value x. |
cut | Divides the range of values into intervals and return the values in x according to which interval they fall. |
exp | Returns Euler's number, e (~2.718) raised to a power. |
floor | Rounds a number down to the nearest integer. floor(3.475) is 3. |
log | Returns logarithms, by default natural logarithms. |
log10 | Computes common (i.e., base 10) logarithms. |
log1p | Computes log(1+x) accurately also for |x| << 1. |
log2 | Computes binary (i.e., base 2) logarithms. |
pmax | Returns the parallel maxima of the input values. |
pmin | Returns the parallel minima of the input values. |
round | Rounds a number to a certain number of decimal places. |
sign | Returns either 1, 0, or -1 depending on whether the input value is positive, zero, or negative, respectively. |
signif | Rounds the values in its first argument to the specified number of significant digits. |
sin | Returns the sine of an angle provided in radians. |
sinh | Returns the hyperbolic sine of any real number. |
sinpi | Returns sin(pi * x) for a given value x. |
sqrt | Returns the square root of the values. |
tan | Returns the tangent of an angle provided in radians. |
tanh | Returns the hyperbolic tangent of any real number. |
tanpi | Returns tan(pi * x) for a given value x. |
trunc | Truncates the numbers by omitting digits. |
coalesce | Replace the missing values with a given value or the values that are at the same position in other columns. |
near | Compares two numeric vectors (columns). This is a safe way of comparing if two vectors of floating point numbers are (pairwise) equal. This is safer than using ==, because it has a built-in tolerance. |
recode | Recodes (or replaces) the original values with given values. You can replace numeric values either by the name with backticks or based on their position, and character values by their name. |
normalize | Centers and/or scales the numeric values of a column. |
impute_na | Impute NA by mean / median / mode values, manually entered values, or predicted values. It will build a linear regression model for the prediction. |
detect_outlier | Detects outliers in a numeric column and returns labels indicating if values are outliers ('Upper' or 'Lower') or not ('Normal'). This function is useful for labeling outliers or for easily removing outliers when used in combination with a filter. |
cluster | Build K-means clustering model and label each row with cluster id. |
get_mode | Returns the most frequent value (mode) in a column. |
summarize_group | Summarize (or aggregate) column values with specified aggregate functions. If grouping columns are provided then summarize the values for each group. |
summarize_row | Summarize (or aggregate) column values across columns for each row with the specified aggregate function. |
mutate_group | Create new columns or update existing columns with specified functions such as Window Calculations. If grouping columns are provided then it peforms calculation for each group. |
confint_mean | Returns a confidence interval range (half-width of confidence interval) of mean from given numeric data. |
calc_confint_mean | Returns a confidence interval range (half-width of confidence interval) of mean from a size and a standard deviation of sample data. |
cumsum_decayed | Calculates the cumulative sum of decaying effects. It is the same as cumsum when r (the second argument) is 1. |
ts_lag | Returns the value of the previous period. |
ts_diff | Returns differences in values in a time series data since the specified lag time ago. |
ts_diff_ratio | Returns the differences of the values in ratio compared to the base values from the lag time ago in a time series. |
likert_sigma | Returns Likert's sigma values given a raw data column of survey answers with selected levels such as 1="Strongly disagree", 2="Disagree", 3="Neutral", 4="Agree", 5="Strongly agree". |
logistic | Logistic function. |
all | Return TRUE if the condition is satisfied for all the rows or a group of rows. |
any | Return TRUE if the condition is satisfied for any of the rows or a group of rows. |
length | Returns the length of the values. |
max | Returns the maximum value in a numeric column. |
mean | Returns the numerical average (mean) value. |
min | Returns the minimum value in a numeric column. |
prod | Returns the product value by multiplying a series of numbers in a given data together. |
sum | Returns the sum of all the values. |
unique | Returns the unique values. |
n | Returns the count of the rows for each group. |
n_distinct | Returns the count of unique values. |
nth | Returns the nth value of all the values. |
IQR | Computes the interquartile range of the values. |
mad | Returns the median absolute deviation of the values. |
median | Returns the numerical median value. |
quantile | Returns sample quantiles corresponding to the given probabilities. The smallest observation corresponds to a probability of 0 and the largest to a probability of 1. |
sd | Returns the standard deviation of the values. |
var | Returns the variance of the values. |
get_mode | Returns the most frequent value (mode) in a column. |
summarize_group | Summarize (or aggregate) column values with specified aggregate functions. If grouping columns are provided then summarize the values for each group. |
sum_if | Returns the sum of all the values that satisfy given conditions. |
sum_if_ratio | Returns the ratio of the sum of all the values that satisfy given conditions to the sum of all the values. |
count_if | Summarize the data by counting number of rows that satisfy given conditions. |
count_if_ratio | Returns the ratio of the count of rows that satisfy given conditions to the total count of rows. |
count_unique_if | Counts the number of unique values of a column that come from rows that satisfy given conditions. |
count_unique_if_ratio | Returns the ratio of the number of unique values of a column that come from rows that satisfy given conditions to the total number of unique values. |
mean_if | Returns the numerical average (mean) of the values that satisfy given conditions. |
average_if | Returns the numerical average (mean) of the values that satisfy given conditions. This is an alias of mean_if. |
median_if | Returns the numerical median of the values that satisfy given conditions. |
max_if | Returns the maximum value from a numeric or date column that meets the specified condition. |
min_if | Returns the minimum value from a numeric or date column that meets the specified condition. |
count_rows | Returns the count of the rows for each group. |
count_unique | Returns the count of unique values. |
as.character | Convert a given data to character data type. |
as.Date | Convert a given data to Date data type. Date data type doesn't include Time. The default formats follow the rules of the ISO 8601 international standard which expresses a day as "2001-02-03". If the given data is number as 'internal system number (# of days since some origin)' then you want to provide which date to be used as 'origin'. |
as.double | Convert data to double precision number data type. It is identical to numeric. |
as.factor | Encode a column as a factor. |
as.integer | Convert data to integer number data type. |
as.logical | Converts the target column to logical type (TRUE/FALSE). For numeric values, 0 is converted to FALSE, and all non-zero values are converted to TRUE. For character strings, "TRUE", "True", "true", "T", "t" are converted to TRUE, and "FALSE", "False", "false", "F", "f" are converted to FALSE. Any other strings result in NA. The difference with the str_logical function is that str_logical can also convert Yes/No values to TRUE/FALSE, and allows you to directly specify which values should be converted to TRUE. |
as.numeric | Convert data to numeric data type. |
as.POSIXct | Convert data to Date/Time data type. |
factor | Encode a column as a factor. |
parse_date_time | Convert data to Date/Time data type. |
stri_datetime_format | Converts a date-time column to a string based on the specified format. |
parse_number | Parse characters, extract numeric values, and convert to number data type, including values with extra formatting such as "$1,200" or "15%". |
parse_double | Convert data to double precision number data type. |
parse_euro_double | Convert data in European number format (comma as the decimal mark) to double precision number data type. |
parse_integer | Convert data to integer number data type. |
parse_time | Converts character strings containing time values into time objects (hms format). When combined with as.character(), it can effectively transform time data from various formats into standardized hms objects. |
parse_character | Convert a given data to character data type. |
parse_factor | Parse data and convert to factor data type. |
parse_logical | Convert data to Logical (Boolean) data type. |
as_date | Convert a given POSIXct to Date. |
as_datetime | Convert a given data to date/time (POSIXct). |
as.hms | Converts numeric values representing seconds into hms (hours:minutes:seconds) time format. |
excel_numeric_to_date | Converts Excel numeric date values (e.g., 42370) to date format (e.g., 2016-01-01) |
excel_numeric_to_datetime | Convert number of date/time from excel to POSIXct. |
unixtime_to_datetime | Convert unix time numeric values to POSIXct. |
chartr | Translates each character in x that is specified in old to the corresponding character specified in new. |
list | Construct a list |
rep | Repeats given numbers. |
rev | Reverses the entries in a given data. |
desc | Change the sorting order to a descending order. e.g. 9 to 1 instead of 1 to 9, z to a instead of a to z. |
cummax | Returns the cumulative maxima. |
cummin | Returns the cumulative minima. |
cumprod | Returns the cumulative products. |
cumsum | Returns the cumulative sums. |
cumall | Returns TRUE if all the values up to the current position are TRUE. |
cumany | Returns TRUE if any of the values up to the current position is TRUE. |
cume_dist | Cumulative distribution. Proportion of all values less than or equal to the current rank. Missing values are left as is. |
cummean | Returns the cumulative mean (average). |
dense_rank | Ranks with no gaps. Missing values are left as is. |
first | returns the first value of all the values. |
lag | Provides access to a row at a given physical offset prior to that position. |
last | returns the last value of all the values. |
lead | Provides access to a row at a given physical offset that follows the current row. |
min_rank | Ranks. Ties get min rank. Missing values are left as is. |
ntile | Divides data into a specified number (n) of "equal frequency" groups (with approximately the same number of rows in each group) and returns an integer value (from 1 to n) indicating which group each row belongs to. Note that this "equal frequency division" distributes rows evenly across groups, which differs from equal width (same value range) division. |
percent_rank | Ranks rescaled to [0, 1]. Missing values are left as is. |
row_number | Returns row numbers. Equivalent to Rank. |
roll_max | Returns the rolling (or moving) max value. |
roll_mean | Calculate a moving average for a column. By specifying a window size, it is possible to return the moving average over the specified interval. |
roll_median | Returns the rolling (or moving) median value. |
roll_min | Returns the rolling (or moving) min value. |
roll_prod | Returns the rolling (or moving) product value. |
roll_sd | Returns the rolling (or moving) standard deviation value. |
roll_sum | Returns the rolling (or moving) sum value. |
roll_var | Returns the rolling (or moving) variance value. |
mutate_group | Create new columns or update existing columns with specified functions such as Window Calculations. If grouping columns are provided then it performs calculation for each group. |
difftime | Calculates the difference between two given dates/times. |
months | Convert a number to month periods so you can use it for further calculations. |
ceiling_date | Rounds date/time up to the nearest integer value of the specified time unit. |
date_decimal | Converts numeric data calculated as a fraction of the year to date data type. |
day | Extract Day |
days | Convert a number to day periods so you can use it for further calculations. |
decimal_date | Converts date data to numeric data type by calculating the date to a fraction of the year. |
dmy | Convert Character or Number to Date / Period when data contains only Date, but not Time. |
dmy_h | Convert Character or Number to Date / Period when data contains Date and Time. |
dmy_hm | Convert Character or Number to Date / Period when data contains Date and Time. |
dmy_hms | Convert Character or Number to Date / Period when data contains Date and Time. |
duration | Create a duration |
dym | Convert Character or Number to Date / Period when data contains only Date, but not Time. |
floor_date | Rounds date/time down to the nearest integer value of the specified time unit. |
force_tz | Re-registers the data with a given timezone. |
here | Get current time in your local timezone |
hm | Convert text to Period data type if a given column holds values that look like Hours and Minutes |
hms | Create hms data. It is a difftime handled as seconds. |
hour | Extract hour |
hours | Convert a number to hour periods so you can use it for further calculations. |
interval | Return intervals between two dates. You can use this result to get the duration by either dividing by a given unit such as days, weeks, etc., or converting it to numeric data type with "as.numeric()". |
isoweek | Extract week numbers of the year. Weeks start from Monday. |
mday | Extract Day of Month |
mdy | Convert Character or Number to Date / Period when data contains only Date, but not Time. |
mdy_h | Convert Character or Number to Date / Period when data contains Date and Time. |
mdy_hm | Convert Character or Number to Date / Period when data contains Date and Time. |
mdy_hms | Convert Character or Number to Date / Period when data contains Date and Time. |
milliseconds | Convert a number to milliseconds periods so you can use it for further calculations. |
minute | Extract minute |
minutes | Convert a number to minute periods so you can use it for further calculations. |
month | Extract Month |
ms | Convert text to Period data type if a given column holds values that look like Minutes and Seconds |
myd | Convert Character or Number to Date / Period when data contains only Date, but not Time. |
now | Returns current date and time. A similar function 'today()' returns only date. |
parse_date_time | Convert data to Date/Time data type. |
qday | Extract Day of Quarter |
quarter | Extract Quarter |
rollback | Calculates the last day of the previous month or the first day of the current month based on a given date. |
round_date | Rounds date/time to the nearest integer value of the specified time unit. |
second | Extract second |
seconds | Convert a number to second periods so you can use it for further calculations. |
time_length | Calculate the exact time length between two dates. |
today | Returns current date without time. A similar function 'now()' returns date including time. |
wday | Extract day of the week as a decimal number (01-07, Sunday is 1) or as text (e.g. Sunday). The text can be full or abbreviated, e.g. Sunday vs. Sun. |
week | Extract week numbers of the year. The 1st day of the 1st week always starts from January 1st regardless of the day of the week. |
weeks | Convert a number to week periods so you can use it for further calculations. |
with_tz | Returns Date / Time in a given time zone |
yday | Extract Day of Year |
ydm | Convert Character or Number to Date / Period when data contains only Date, but not Time. |
ydm_h | Convert Character or Number to Date / Period when data contains Date and Time. |
ydm_hm | Convert Character or Number to Date / Period when data contains Date and Time. |
ydm_hms | Convert Character or Number to Date / Period when data contains Date and Time. |
year | Extract Year |
years | Convert a number to year periods so you can use it for further calculations. |
ymd | Convert Character or Number to Date / Period when data contains only Date, but not Time. |
ymd_h | Convert Character or Number to Date / Period when data contains Date and Time. |
ymd_hm | Convert Character or Number to Date / Period when data contains Date and Time. |
ymd_hms | Convert Character or Number to Date / Period when data contains Date and Time. |
coalesce | Replace the missing values with a given value or the values that are at the same position in other columns. |
recode | Recodes (or replaces) the original values with given values. You can replace numeric values either by the name with backticks or based on their position, and character values by their name. |
parse_time | Converts character strings containing time values into time objects (hms format). When combined with as.character(), it can effectively transform time data from various formats into standardized hms objects. |
ym | Convert Character or Number to Date / Period when data contains only Date, but not Time. |
my | Convert Character or Number to Date / Period when data contains only Date, but not Time. |
yq | Convert text to Date data type if a given column holds values that look like Year and Quarter. |
epiweek | Extract week numbers of the year. Weeks start from Sunday. |
as_date | Convert a given POSIXct to Date. |
as.hms | Converts numeric values representing seconds into hms (hours:minutes:seconds) time format. |
impute_na | Impute NA by mean / median / mode values, manually entered values, or predicted values. It will build a linear regression model for the prediction. |
get_mode | Returns the most frequent value (mode) in a column. |
summarize_group | Summarize (or aggregate) column values with specified aggregate functions. If grouping columns are provided then summarize the values for each group. |
summarize_row | Summarize (or aggregate) column values across columns for each row with the specified aggregate function. |
mutate_group | Create new columns or update existing columns with specified functions such as Window Calculations. If grouping columns are provided then it performs calculation for each group. |
weekend | Returns either Weekday or Weekend based on the provided date column value. |
is_japanese_holiday | Alias to is_jholiday. Returns TRUE if the provided date is a Japanese Holiday. |
get_week_of_month | Extract Week of Month. For example, if the given date is in the 1st week of the month, it returns `1`. |
years_between | Calculate period between two dates in years. |
months_between | Calculate period between two dates in months. |
weeks_between | Calculate period between two dates in weeks. |
days_between | Calculate period between two dates in days. |
hours_between | Calculate period between two dates in hours. |
minutes_between | Calculate period between two dates in minutes. |
seconds_between | Calculate period between two dates in seconds. |
last_date | Returns the last date of the specified period (e.g., month) that the original date belongs to. |
ts_lag | Returns the value of the previous period. |
ts_diff | Returns differences in values in a time series data since the specified lag time ago. |
ts_diff_ratio | Returns the differences of the values in ratio compared to the base values from the lag time ago in a time series. |
is_jholiday | Returns TRUE if the provided date is a Japanese Holiday. |
find_date_by_wday | Retrieves the date for a specific weekday in a given year and month. For example, it can find "the 3rd Tuesday of May 2023." Useful for holiday settings and recurring schedule management. |
intersect | Returns rows that appear in both of two data frames. |
setdiff | Returns rows that appear in the first data frame but not in the second. |
union | Returns rows that appear in either of two data frames, removing duplicates. |
anti_join | Returns rows from the first data frame that have no matching rows in the second. |
arrange | Sort rows by given column(s). |
bind_cols | Bind multiple data frames by column. |
bind_rows | Bind multiple data frames by row. |
distinct | Select distinct/unique rows. Only the first row will be preserved. Set .keep_all argument to TRUE to keep all the original columns. |
filter | Select rows with conditions. |
full_join | Joins two data frames, keeping all rows from both and filling unmatched values with NA. |
group_by | Converts the data frame into a grouped data frame where the following operations will be performed based on the groups. Grouping should be done before you want to aggregate values. |
inner_join | Joins two data frames, keeping only rows that have matches in both. |
left_join | Joins two data frames, keeping all rows from the first and adding matching columns from the second. |
mutate | Mutate creates new column(s) with given expressions and preserves existing columns. |
rename | Rename existing column names. |
right_join | Joins two data frames, keeping all rows from the second and adding matching columns from the first. |
sample_frac | Sample n fraction of rows from the data frame. |
sample_n | Sample n rows from the data frame. |
select | Select column(s) by column names. |
semi_join | Returns rows from the first data frame that have matching rows in the second, without adding columns from the second. |
slice | Select rows by positions. This is not supported for relational databases, in which case you should use filter() with row_number() function. |
summarize | Summarize column values to a single value by using aggregate functions. Make sure you have already grouped the data frame using group_by(). |
top_n | Select the top or bottom N data points within each group based on a specified column. This function allows you to filter the dataset to keep only the top or bottom N data points for each group according to a given measure. |
transmute | Transmute adds new columns and drops existing columns. |
ungroup | Ungroup existing grouping |
predict | Returns data augmented with fitted values and residuals. |
complete | Complete a data frame by adding missing combinations of data. |
expand | Expand a data frame by adding all the combination of given columns. |
fill | Fills missing values in using the previous entry. This is convenient especially when values are not presented in some cells to avoid duplicated entries. |
gather | Takes multiple columns and collapses into key-value pairs, duplicating all other columns as needed. |
nest | Nest a set of columns together as a list column. |
separate | Separates a column with delimited values into multiple columns. |
spread | Spread a key-value pair across multiple columns. Missing values will be replaced with NA. |
unite | Unite multiple columns together into one column with given uniting characters. It will concatenate values with "_" and remove the original columns by default. Multiple columns specified in the list can be combined together at once. |
unnest | Unnest a list column or a list of data frames by making each element of the list to be presented in its own row. |
select_if | Select column(s) with a predicate (conditional) function. |
mutate_all | Apply functions to all the columns. |
mutate_at | Apply functions to specified columns. |
mutate_if | Apply functions to only the columns that match with a given condition. |
summarize_all | Apply functions to all the columns. |
summarize_at | Apply functions to specified columns. |
summarize_if | Apply functions to only the columns that match with a given condition. |
rename_with | Batch converts multiple column names by applying a function. |
separate_rows | Separates a column with delimited values into multiple rows. |
drop_na | Drop rows that have NA value. |
type_convert | Heuristically guess the data type for each column by reading the first 1000 rows, parse the data, and set appropriate data types for all the columns of the data frame. |
model_info | Returns a summary information of a given model in a tidy (normalized data frame) format. |
remove_empty | Removes empty rows or columns. Useful for removing rows or columns that are empty (NA, NULL, blank) from a data frame. |
clean_names | Make column names clean by using underscores (_), lowercase letters, and numbers for easier data handling and standardization. |
get_dupes | Extracts rows that are duplicated in the specified column(s). If no columns are specified, it returns rows that are duplicated across all columns. |
tabyl | Create frequency table for a column in a data frame. Useful for quickly understanding data distribution with percentage calculations and NA handling. |
do_svd.kv | Calculates coordinates by reducing dimensionality using SVD (Singular Value Decomposition). |
do_svd | Calculates coordinates by reducing dimensionality using SVD (Singular Value Decomposition). |
build_lm | Builds a linear regression model (lm) and store it in a data frame. |
build_lr | Builds logistic regression model and store it in a data frame. |
build_glm | Builds generalized linear models (glm) and store it in a data frame. |
build_multinom | Builds multinomial logistic regression model and store it in a data frame. |
prediction | Returns a data frame with regression information about a model. |
prediction_binary | Returns a data frame with binary classification information about a model. |
prediction_coxph | Returns a data frame with predicted values of Cox Proportional Hazard Model. |
model_coef | Returns a data frame with 'Parameter Estimates (Coefficients)' information about a model including the below. * term - Term in lm that is estimated. * estimate - Estimated coefficients. The larger the number, the more positive effect the variable has. * std_error - Standard Error. How much prediction error there is in the variable. * t_ratio - estimate / std_error. This value in t distribution is p_value. * p_value - The probability that the variable is **not** effective. * conf_low - The lower bound of the confidence interval. * conf_high - The upper bound of the confidence interval. |
model_stats | Returns a data frame with 'Summary of Fit' information about a model including the below. * r_square - R Square. * r_square_adj - Adjusted R Square. * root_mean_square_error - Root mean square error. * f_ratio - F ratio. * p_value - P value. * df - Degree of freedom. * null_deviance - Null deviance. * df_for_null_model - Degree of freedom for null model. * log_likelihood - Log likelihood. * deviance - Deviance. * AIC - Akaike's information criterion. * BIC - Bayesian information criterion. * residual_df - Residual degree of freedom. |
model_anova | Returns a data frame with ANOVA test information about a model including the below. * df - Degree of freedom. * sum_of_squares * mean_square * f_ratio * p_value * deviance * residual_df * residual_deviance |
evaluate_regression | Returns a data frame with evaluation scores of regression including the below. * r_squared * explained_variance * mean_square_error * misclassification_error * root_mean_square_error * mean_absolute_error * mean_absolute_percentage_error |
evaluate_binary | Returns a data frame with evaluation scores of binary classification including the below. * AUC * f_score * accuracy * misclassification_rate * precision * recall * specificity * true_positive - Number of positive predictions that actually are positive. * false_positive - Number of positive predictions that actually are negative. * true_negative - Number of negative predictions that actually are negative. * false_negative - Number of negative predictions that actually are positive. * test_size - The number of tested data. * threshold - Threshold value for prediction. |
evaluate_multi | Returns a data frame with evaluation scores of multi-class classification including the below. * micro_f_score * macro_f_score * accuracy * misclassification_rate |
do_roc | Returns coordinates of the ROC curve. |
build_kmeans.cols | Builds a clustering model (k-means) from variable columns and returns the summary of the model or the augmented data depending on the parameter value. |
build_kmeans.kv | Builds a clustering model (k-means) from key-value columns and store the model into a generated data frame or augment the original data with the clustered ID. |
build_kmeans | Builds a clustering model (k-means). Stores the model into a generated data frame or the augmented data depending on the parameter value. |
do_t.test | Execute t-test, which checks differences of means of variables. |
do_var.test | Execute F-test, which checks the differences of variances between groups. |
do_chisq.test | Execute chi-squared contingency table tests and goodness-of-fit tests. |
do_apriori | Finds rules about which items tend to occur at the same time in transaction data. |
do_anomaly_detection | Detect anomaly in time series data frame. |
do_prophet | Add forecast data to time series data frame. |
do_market_impact | Estimate impact of an event (advertisement, etc.) on a market, by using other markets to form a synthetic control. |
row_as_header | Use a row as column names. |
pivot | Pivot columns into rows and columns. Values are count of pairs of rows and columns or aggregation of another column. |
do_cor.cols | Calculates correlations for all the pairs of the variables (columns). |
do_cor.kv | Calculates correlations for all the pairs of subject columns. |
do_cor | Calculates correlations for all the pairs of the variables or subjects. |
do_dist.kv | Calculate the distances between each of the pairs. |
do_dist.cols | Calculate distances of each of the pairs. |
do_dist | Calculate distances of each of the pairs of the variables or subjects. |
do_cmdscale | Execute multidimensional scaling (MDS). Calculates approximate coordinates from distances of entity pairs. |
build_model | Create data frame with models from input data frame by model function and arguments. |
one_hot | One-hot encodes a categorical column, producing separate columns for each categorical value, each of which has values of 1 or 0 indicating whether a row has the value the column represents. |
sample_rows | Sample n rows from the data frame. This is the same as sample_n, except that it handles the case where the number of rows in the data is fewer than the specified n without throwing an error. |
exp_balance | Balances imbalanced data using SMOTE (Synthetic Minority Oversampling Technique). Creates synthetic data for the minority class and appropriately adjusts the majority class data to help build unbiased machine learning models. The target column must be limited to 0/1 numeric values, binary categorical data, or logical type columns. |
levels | Returns the levels of a given factor. |
coalesce | Replace the missing values with a given value or the values that are at the same position in other columns. |
recode | Recodes (or replaces) the original values with given values. You can replace numeric values either by the name with backticks or based on their position, and character values by their name. |
as_factor | Convert a given column to a factor. Compared to base R's as.factor, this function creates levels in the order in which they appear, which will be the same on every platform (base R sorts in the current locale which can vary from place to place). |
fct_anon | Anonymize factor levels. Replaces factor levels with arbitrary numeric identifiers. Neither the values nor the order of the levels are preserved. |
fct_expand | Add additional levels to a factor |
fct_drop | Drop unused levels. |
fct_explicit_na | This gives missing values an explicit factor level, ensuring that they appear in the charts. |
fct_lump | Moves least/most common factor levels into "Other" category. |
fct_other | Creates "Other" category by moving specified values to "Other", or by keeping only specified values from "Other". |
fct_inorder | Reorder factor levels by first appearance. |
fct_infreq | Reorder factor levels by frequency (from the most frequent to the least). |
fct_relevel | Sets the level of a given factor column by moving any number of levels to any location. |
fct_reorder | Sets the order of values based on sorted values from another column and converts the data type to factor. |
fct_rev | Reverse order of factor levels. |
impute_na | Impute NA by mean / median / mode values, manually entered values, or predicted values. It will build a linear regression model for the prediction. |
cluster | Build K-means clustering model and label each row with cluster id. |
get_mode | Returns the most frequent value (mode) in a column. |
summarize_group | Summarize (or aggregate) column values with specified aggregate functions. If grouping columns are provided then summarize the values for each group. |
summarize_row | Summarize (or aggregate) column values across columns for each row with the specified aggregate function. |
mutate_group | Create new columns or update existing columns with specified functions such as Window Calculations. If grouping columns are provided then it performs calculation for each group. |
exp_cut_by_step | Divides numeric data into equal intervals (e.g., in steps of 500) and returns the corresponding group name. Can be combined with format_cut_output function to customize the display format of the intervals. |
exp_cut | Divides numeric data into equal width groups and returns the corresponding group name. Can be combined with format_cut_output function to customize the display format of the intervals. |
range | Returns the min and max values of a given data. |
between | Returns TRUE or FALSE based on whether a given value is within a given range (between). |
cor | Computes the correlation of two column values. The covariance divided by the product of the standard deviations of the two column values. |
cov | Computes the covariance of two column values. |
str_detect | Returns TRUE or FALSE based on whether text data contains a given text or not. |
list_extract | Extract a value of a list data type column based on a position or name when the data inside the list is a data frame. |
list_to_text | Concatenates texts from all the elements of a list data type column. |
list_concat | Concatenates values from multiple columns into a list. |
list_n | Returns number of elements inside a list data type column for each row. |
full_seq | Generate a sequence of numbers based on the values in a given column. |
nesting | Fix multiple columns so that only the combination of actually existing values in those columns will be used for complete() or expand() operation. |
param_remove | Removes parameters from a given URL. |
url_decode | Decodes a given encoded URL. It assumes the character encoding is UTF-8. |
url_encode | Encodes a given URL. It assumes the character encoding is UTF-8. |
url_parameters | Returns decoded query parameters from url. |
url_domain | Returns domain (ex. "exploratory.io") from url. |
url_fragment | Returns fragment from url. |
url_path | Returns path from url. |
url_port | Returns port from url. |
url_scheme | Returns scheme (ex. "http", "https") from url. |
url_suffix | Returns suffix (ex. "com", "org") from url. |
url_subdomain | Returns subdomain (ex. "www", "blog") from url. |
url_tld | Returns top-level domain (ex. "com", "co") from url. |
url_param | Returns decoded query parameter from url. |
- | Removes column(s). It can be used along with any column(s) selection operation like select(), gather(), etc. |
starts_with | Returns the column names that start with a given text. |
ends_with | Returns the column names that end with a given text. |
contains | Returns the column names that contain a given text. |
matches | Returns the column names that match a given text. |
num_range | Returns the column names that start with a given text followed by numbers within a given range. It's useful especially when you have column names like X1, X2, X3, X4, etc. |
any_of | Returns the column names that are any of the given names. |
everything | Returns all the column names. It's useful when you want to have particular column(s) first before everything else. |
where | Returns the columns for which the specified column selection function (e.g. is.numeric) returns TRUE. |
across | This function is used for selecting columns for other functions such as summarize_row. |
vars | Returns a list of user selected columns. |
as_date | Convert a given POSIXct to Date. |
weekend | Returns either Weekday or Weekend based on the provided date column value. |
is_japanese_holiday | Alias to is_jholiday. Returns TRUE if the provided date is a Japanese Holiday. |
is_jholiday | Returns TRUE if the provided date is a Japanese Holiday. |
xgboost_reg | Create extreme gradient boosting model for regression. |
xgboost_binary | Create extreme gradient boosting model for binary classification. |
xgboost_multi | Create extreme gradient boosting model for multiclass classification. |
randomForestReg | Create random forest model for regression. |
randomForestBinary | Create random forest model for binary classification. |
randomForestMulti | Create random forest model for multi class classification. |
do_survfit | Calculates Survival Curve from survival time and survival status. |
prediction_survfit | Simulates Survival Curve for specified cohort based on a survival model. |
build_coxph | Builds Cox Proportional Hazard Model for survival analysis and store it in a data frame. |
prediction_coxph | Returns a data frame with predicted values of Cox Proportional Hazard Model. |
ip_to_country | Returns country names from IP addresses. |
countrycode | Converts country names or country codes to other formats of country codes or names (country names, continent names, region names, etc.). This is useful for standardizing country codes from different data sources when conducting international data analysis or creating reports. |
is_prefecture | Determines whether a string is a valid prefecture name. Returns FALSE if the string contains additional text such as city names or if it's an invalid prefecture name. |
do_tokenize | Returns one token (e.g. word) per row after tokenizing a text. |
pair_count | Count pairs of words (tokens) that co-occur within a group. |
do_tfidf | Calculates TF-IDF for each term against a group. TF-IDF is a weighting mechanism that calculates the importance of each word to each document by increasing the importance based on the term frequency while decreasing the importance based on the document frequency. |
do_ngram | Create columns of n-grams connected in sentences. |
do_cosine_sim.kv | Calculates the similarity between each pair of the documents using the cosine similarity algorithm. Cosine similarity measures the cosine of the angle between two vectors in the multi-dimensional space. |
summarize_group | Summarize (or aggregate) column values with specified aggregate functions. If grouping columns are provided then summarize the values for each group. |
mutate_group | Create new columns or update existing columns with specified functions such as Window Calculations. If grouping columns are provided then it performs calculation for each group. |
exp_balance | Balances imbalanced data using SMOTE (Synthetic Minority Oversampling Technique). Creates synthetic data for the minority class and appropriately adjusts the majority class data to help build unbiased machine learning models. The target column must be limited to 0/1 numeric values, binary categorical data, or logical type columns. |
exp_cut_by_step | Divides numeric data into equal intervals (e.g., in steps of 500) and returns the corresponding group name. Can be combined with format_cut_output function to customize the display format of the intervals. |
exp_cut | Divides numeric data into equal width groups and returns the corresponding group name. Can be combined with format_cut_output function to customize the display format of the intervals. |
convert_jdate | Converts Japanese era date strings (e.g., 令和3年1月1日, R3/1/1) to Western calendar dates (e.g., 2021-01-01). For Japanese dates in the Gregorian calendar, such as "yyyy年mm月" or "yyyy年mm月dd日", use the lubridate functions (ym or ymd) according to the date order (year, month, day) instead of the convert_jdate function. The lubridate functions can also properly process Japanese date units such as 年 (year), 月 (month), and 日 (day). |
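The functions above are typically combined inside a dplyr pipeline. The following is a minimal sketch of that usage, not taken from the reference itself; the data frame sales_df and the columns Region and Sales are hypothetical examples.
# sales_df, Region, and Sales are hypothetical example names
sales_df %>%
  filter(!is.na(Sales)) %>%
  mutate(size = case_when(Sales > 1000 ~ "Large", Sales > 500 ~ "Medium", TRUE ~ "Small")) %>%
  group_by(Region) %>%
  summarize(total = sum(Sales), rows = n())
This keeps rows whose Sales is not NA, labels each row with case_when, and aggregates the labeled data per Region with summarize.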
Summary
Reverses the meaning of the following condition. It is a logical negation and has a similar meaning as 'Not'.
Syntax
!
Return Value
Logical
Example
filter(!is.na(Population))
Keep the row if Population value is NOT NA
filter(!str_detect(City, "Francisco"))
Keep the row if City value does NOT include text 'Francisco'.
filter(!Sales > 500)
Keep the row if Sales value is NOT greater than 500.
Summary
Abbreviate strings to at least minlength characters, such that they remain unique (if they were). First all spaces at the beginning of the string are stripped. Then (if necessary) any other spaces are stripped. Next, lower case vowels are removed (starting at the right) followed by lower case consonants. Finally if the abbreviation is still longer than minlength upper case letters are stripped.
Syntax
abbreviate(<column>
, minlength = <number>
, dot = <logical>
)
Arguments
Return Value
Character
Example
abbreviate("Exploratory")
Returns "Expl"
abbreviate("Exploratory", 3)
Returns "Exp"
Summary
Returns the absolute value.
Syntax
abs(<column_num>
)
Return Value
Numeric
Example
mutate(abs = abs(ARR_DELAY))
Summary
Returns the inverse cosine of a value, in radians.
Syntax
acos(<column_num>
)
Return Value
Numeric
Example
mutate(a = acos(X1))
Create a new column for calculating the acos.
Summary
Returns the inverse hyperbolic cosine of a number.
Syntax
acosh(<column_num>
)
Return Value
Numeric
Example
mutate(a = acosh(X1))
Create a new column for calculating the acosh.
Summary
Return TRUE if the condition is satisfied for all the rows or a group of rows.
Syntax
all(<condition>
, na.rm = <logical>
)
Arguments
Return Value
Logical
Example
summarize(is_matched_all = all(userid == username))
Return TRUE if all the userid values match the corresponding username values, otherwise FALSE.
summarize(is_matched_all = all(revenue > 500))
Return TRUE if all the revenue values are greater than 500, otherwise FALSE.
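The na.rm argument removes NA values before the check; without it, an NA in the condition makes the result NA. For example:
summarize(is_matched_all = all(revenue > 500, na.rm = TRUE))
Return TRUE if all the non-NA revenue values are greater than 500, otherwise FALSE.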
Summary
Return TRUE if the condition is satisfied for any of the rows or a group of rows.
Syntax
any(<condition>
, na.rm = <logical>
)
Arguments
Return Value
Logical
Example
summarize(is_matched_any = any(userid == username))
Return TRUE if any of the userid values match the corresponding username values, otherwise FALSE.
summarize(is_matched_any = any(revenue > 500))
Return TRUE if any of the revenue values are greater than 500, otherwise FALSE.
Summary
Convert a given data to character data type.
Syntax
as.character(<column>
)
Return Value
Character
Example
as.character(123.11)
Returns "123.11" as character.
Summary
Convert a given data to Date data type. Date data type doesn't include Time. The default formats follow the rules of the ISO 8601 international standard, which expresses a day as "2001-02-03". If the given data is a number representing an internal system number (the number of days since some origin), you need to provide which date to use as the 'origin'.
Syntax
as.Date(<column>
, format = <date_format>
, tz = <timezone>
, origin = <text>
)
Arguments
Return Value
Date
Example
as.Date("2015-10-10")
Returns "2015-10-10"
as.Date("2015-10-10", tz = "America/Los_Angeles")
Returns "2015-10-10"
as.Date("10-10-10", format = "%d-%m-%y")
Returns "2010-10-10"
as.Date(35981, origin = "1899-12-30") # This is recommended for Windows Excel data.
Returns "1998-07-05"
as.Date(34519, origin = "1904-01-01") # This is recommended for Mac Excel data.
Returns "1998-07-05"
Summary
Convert data to double precision number data type. It is identical to numeric.
Syntax
as.double(<column>
)
Return Value
Double
Example
as.double("12345")
Returns 12345
as.double("12345.10")
Returns 12345.1
Summary
Encode a column as a factor.
Syntax
as.factor(<column>
)
Arguments
Return Value
Factor
Example
columnA - "iMac", "iPod", "iPhone", "iPod", "iPhone"
as.factor(columnA)
Returns iMac, iPod, iPhone, iPod, iPhone (It has Levels information of : iMac iPhone iPod)
Summary
Convert data to integer number data type.
Syntax
as.integer(<column>
)
Return Value
Integer
Example
as.integer("12345")
Returns 12345
as.integer("12345.10")
Returns 12345
Summary
Converts the target column to logical type (TRUE/FALSE). For numeric values, 0 is converted to FALSE, and all non-zero values are converted to TRUE.
For character strings, "TRUE", "True", "true", "T", "t" are converted to TRUE, and "FALSE", "False", "false", "F", "f" are converted to FALSE. Any other strings result in NA.
The difference with the str_logical function is that str_logical can also convert Yes/No values to TRUE/FALSE, and allows you to directly specify which values should be converted to TRUE.
Syntax
as.logical(<column_to_convert>
)
Arguments
* column_to_convert - Specifies the column you want to convert to logical type.
Return Value
Logical
Example
as.logical(numeric_column)
Converts a numeric column to logical type. 0 is converted to FALSE, and all non-zero values are converted to TRUE.
numeric_column | converted |
---|---|
0 | FALSE |
1 | TRUE |
1 | TRUE |
0 | FALSE |
as.logical(text_column)
Converts a text column to logical type. Only "TRUE"/"FALSE" and their variations are converted; all other strings become NA.
text_column | converted |
---|---|
TRUE | TRUE |
False | FALSE |
t | TRUE |
Yes | NA |
No | NA |
Summary
Convert data to numeric data type. When the data is a difftime, the 'units' argument controls the unit of the returned number.
Syntax
as.numeric(<column>
, units = <difftime_units>
)
Arguments
Return Value
Numeric
Example
as.numeric("12345")
Returns 12345
as.numeric("12345.10")
Returns 12345.1
as.numeric(as.Date("2015-01-30") - as.Date("2015-01-15"), units = "days")
Returns 15
as.numeric(as.Date("2015-01-30") - as.Date("2015-01-15"), units = "weeks")
Returns 2.142857
Summary
Convert data to Date/Time data type.
Syntax
as.POSIXct(<column>
, format = <date_time_format>
, tz = <timezone>
, origin = <text>
)
Arguments
Return Value
POSIXct
Example
as.POSIXct("2015-10-10 13:10:05")
Returns "2015-10-10 13:10:05 PDT"
as.POSIXct("2015-10-10 13:10:05", tz = "America/Los_Angeles")
Returns "2015-10-10 13:10:05 PDT"
as.POSIXct("2015-10-10 13:10:05", tz = "Asia/Tokyo")
Returns "2015-10-10 13:10:05 JST"
as.POSIXct("05-10-15T13:10:05", format = "%d-%m-%yT%H:%M")
Returns "2015-10-05 13:10:00 PDT"
Summary
Returns the inverse sine of a value, in radians.
Syntax
asin(<column_num>
)
Return Value
Numeric
Example
mutate(a = asin(X1))
Create a new column for calculating the asin.
Summary
Returns the inverse hyperbolic sine of a number.
Syntax
asinh(<column_num>
)
Return Value
Numeric
Example
mutate(a = asinh(X1))
Create a new column for calculating the asinh.
Summary
Returns the inverse tangent of a value, in radians.
Syntax
atan(<column_num>
)
Return Value
Numeric
Example
mutate(a = atan(X1))
Create a new column for calculating the atan.
Summary
Returns the angle between the x-axis and a line segment from the origin (0,0) to a specified coordinate pair (x, y), in radians.
Syntax
atan2(<column_num_y>
, <column_num_x>
)
Return Value
Numeric
Example
mutate(a = atan2(X1, Y1))
Create a new column for calculating the atan2.
Summary
Returns the inverse hyperbolic tangent of a number.
Syntax
atanh(<column_num>
)
Return Value
Numeric
Example
mutate(a = atanh(X1))
Create a new column for calculating the atanh.
Summary
This is a generic function which combines its arguments.
Syntax
c(<value1>
, <value2>
, ... , recursive = <logical>
)
Arguments
Example
c(1:10)
Returns 1 2 3 4 5 6 7 8 9 10
c(1:10, 1:3)
Returns 1 2 3 4 5 6 7 8 9 10 1 2 3
c("a", "b", "c")
Returns "a" "b" "c"
Summary
Rounds numeric values up to the specified decimal place.
Syntax
ceiling(<column_num>
, digits = <number>
)
Arguments
Return Value
Numeric
Example
ceiling(X1, digits=2)
Rounds X1 up to the second decimal place.
X1 | NEW |
---|---|
-0.2344 | -0.23 |
0.91098 | 0.92 |
0.311785 | 0.32 |
Summary
Translates each character in x that is specified in old to the corresponding character specified in new.
Syntax
chartr(<old_text>
, <new_text>
, <column_text>
)
Arguments
Return Value
Character
Example
x = "abcdef 123456"
chartr("abc", "xyz", x)
Returns "xyzdef 123456"
chartr("a-f", "u-z", x)
Returns "uvwxyz 123456"
chartr("a-f", "xxxxxx", x)
Returns "xxxxxx 123456"
chartr("123", "000", x)
Returns "abcdef 000456"
Summary
Returns the cosine of an angle provided in radians.
Syntax
cos(<column_num>
)
Return Value
Numeric
Example
mutate(a = cos(X1))
Create a new column for calculating the cos.
Summary
Returns the hyperbolic cosine of any real number.
Syntax
cosh(<column_num>
)
Return Value
Numeric
Example
mutate(a = cosh(X1))
Create a new column for calculating the cosh.
Summary
Returns cos(pi * x), the cosine of pi times the given value.
Syntax
cospi(<column_num>
)
Return Value
Numeric
Example
mutate(a = cospi(X1))
Create a new column for calculating the cospi.
Summary
Returns the cumulative maxima.
Syntax
cummax(<column_num>
)
Return Value
Numeric
Example
// X = c(1, 2, 3, 2, 1)
cummax(X)
Returns 1, 2, 3, 3, 3
Summary
Returns the cumulative minima.
Syntax
cummin(<column_num>
)
Return Value
Numeric
Example
// X = c(1, 2, 3, 2, 1)
cummin(X)
Returns 1, 1, 1, 1, 1
Summary
Returns the cumulative products.
Syntax
cumprod(<column_num>
)
Return Value
Numeric
Example
// X = c(1, 2, 3, 2, 1)
cumprod(X)
Returns 1, 2, 6, 12, 12
Summary
Returns the cumulative sums.
Syntax
cumsum(<column_num>
)
Return Value
Numeric
Example
// X = c(1, 2, 3, 2, 1)
cumsum(X)
Returns 1, 3, 6, 8, 9
Summary
Divides the range of values into intervals and returns, for each value in x, the interval it falls into.
Syntax
cut(<column_num>
, breaks = <number>
, labels = <text>
, include.lowest = <logical>
, right = <logical>
, dig.lab = <number>
, ordered_result = <logical>
)
Arguments
Example
//x = c(1,2,3,4,5,6,7,8,9,10)
cut(x, breaks = 5)
Returns (0.991,2.8] (0.991,2.8] (2.8,4.6] (2.8,4.6] (4.6,6.4] (4.6,6.4] (6.4,8.2] (6.4,8.2] (8.2,10] (8.2,10]
cut(x, breaks = 2)
Returns (0.991,5.5] (0.991,5.5] (0.991,5.5] (0.991,5.5] (0.991,5.5] (5.5,10] (5.5,10] (5.5,10] (5.5,10] (5.5,10]
cut(x, breaks = 2, labels = c(1,2))
Returns 1 1 1 1 1 2 2 2 2 2
cut(x, breaks = 2, labels = c("A", "B"))
Returns A A A A A B B B B B
mutate(a = cut(FL_DATE, breaks = 5))
Returns one of the five buckets based on the date value
Summary
Calculates the difference between two given dates/times.
Syntax
difftime(<column_date>
, <column_date>
, tz = <timezone>
, units = <difftime_units>
)
Arguments
Return Value
Difftime
Example
difftime("2015-12-10", "2015-12-30"))
Returns Time difference of -20 days.
difftime("2015-12-10", "2015-12-30", unit = "hours")
Returns Time difference of -480 hours.
as.numeric(difftime("2015-12-10", "2015-12-30"))
Returns -20.
as.numeric(difftime("2015-12-10", "2015-12-30"), units = "hours")
Returns -480.
Summary
Returns a logical value (TRUE or FALSE) indicating if it is a duplicated value or rows.
Syntax
duplicated(<column(s)>
, incomparables = <logical>
)
Arguments
Return Value
Logical
Example
// x <- c("a", "b", "c", "a", "b")
duplicated(x)
Returns FALSE FALSE FALSE TRUE TRUE
duplicated(x, incomparables = "a")
Returns FALSE FALSE FALSE FALSE TRUE
mutate(a = duplicated(ARR_TIME))
Returns TRUE or FALSE for each row
filter(duplicated(ARR_TIME, ARR_DELAY))
Keep only the duplicated rows
summarise(a = sum(!duplicated(ARR_TIME)))
Returns # of NOT duplicated values for each group. This is same as the below.
summarise(a = sum(length(unique(ARR_TIME))))
Summary
Returns Euler's number, e (~2.718) raised to a power.
Syntax
exp(<column_num>
)
Return Value
Numeric
Example
mutate(revenue_exp = exp(revenue))
Returns the exponential value of revenue
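As a quick check of the definition:
exp(1)
Returns 2.718282, the value of e itself.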
Summary
Encode a column as a factor.
Syntax
factor(<column>
, levels = <text>
, labels = <text>
, exclude = <text>
, ordered = <logical>
, nmax = <number>
)
Arguments
Return Value
Factor
Example
mutate(color = factor(color, levels = c("Gold","Silver","Bronze"), ordered=TRUE))
Before:
color | counts |
---|---|
Bronze | 8 |
Silver | 12 |
Gold | 15 |
After:
color | counts |
---|---|
Gold | 15 |
Silver | 12 |
Bronze | 8 |
Summary
Rounds a number down to the nearest integer. floor(3.475) is 3.
Syntax
floor(<column_num>
, digits = <number>
)
Arguments
Return Value
Numeric
Example
floor(3.475, digits=2)
Returns 3.47
Summary
Convert a given text from one encoding to another.
Syntax
iconv(<column>
, from = <encoding>
, to = <encoding>
, sub = <text>
, mark = <logical>
, toRaw = <logical>
)
Arguments
Return Value
Character
Example
iconv(x, "ISO_8859-1", "UTF-8")
Converts a given text from "ISO_8859-1" to "UTF-8".
Summary
Returns different values based on the conditional result.
Syntax
ifelse(<condition>
, <return_value_when_TRUE>
, <return_value_when_FALSE>
)
Arguments
Return Value
Example
//x <- 1:5
ifelse(x > 3, "Bigger", "Smaller")
Returns "Smaller" "Smaller" "Smaller" "Bigger" "Bigger"
Overview
Extracts only the rows that exist in both dataframes. Equivalent to the "intersection" operation in set theory.
Syntax
intersect(<dataframe>
, force_data_type = <TRUE|FALSE>
)
Arguments
* dataframe - Specifies the dataframes to find common rows between.
* force_data_type (optional) - When set to TRUE, it first converts all column data types to character type, and after processing, converts them back to inferred values.
Examples
Source Data1:
| ID | product_name | price |
|----|--------------|-------|
| 1 | apple | 100 |
| 2 | orange | 80 |
| 3 | banana | 120 |
Source Data2:
| ID | product_name | price |
|----|--------------|-------|
| 2 | orange | 80 |
| 3 | banana | 120 |
| 4 | kiwi | 150 |
intersect(Source Data1, Source Data2)
Extracts only the rows that exist in both Data1 and Data2.
| ID | product_name | price |
|----|--------------|-------|
| 2 | orange | 80 |
| 3 | banana | 120 |
Summary
Returns true if a given object is character type.
Syntax
is.character
Return Value
Logical
Example
mutate_if(is.character, str_to_lower)
Convert all the letters to lowercase for all the character type columns.
Summary
Returns true if a given object is double type.
Syntax
is.double
Return Value
Logical
Example
mutate_if(is.double, round)
Round all the double type columns.
mutate_if(is.double, funs(. * 0.2))
Multiply all the double columns by 0.2 overriding existing columns.
mutate_if(is.double, funs(calc = . * 0.2))
Multiply all the double columns by 0.2 as new columns.
summarize_if(is.double, mean)
Calculate the mean (average) for all the double columns.
Summary
Returns true if a given object is factor type.
Syntax
is.factor
Return Value
Logical
Example
mutate_if(is.factor, as.character)
Convert all the factor type columns to character type.
Summary
Returns true if a given object is list type.
Syntax
is.list
Return Value
Logical
Example
select_if(is.list)
Select only list type columns.
Summary
Returns true if a given object is logical type.
Syntax
is.logical
Return Value
Logical
Example
mutate_if(is.logical, as.numeric)
Convert the logical values (TRUE, FALSE) to numeric values (1,0).
summarize_if(is.logical, sum)
Count the number of TRUE values for all the logical type columns.
Summary
Returns true if a given object is matrix type.
Syntax
is.matrix
Return Value
Logical
Example
mutate_if(is.matrix, scale)
Scale and center the columns of every matrix type column.
Summary
Returns TRUE when it is NA.
Syntax
is.na(<column>
)
Return Value
Logical
Example
filter(is.na(ARR_DELAY))
Keep only rows where ARR_DELAY is NA
filter(!is.na(ARR_DELAY))
Keep only rows where ARR_DELAY is NOT NA
Summary
Returns TRUE when it is null.
Syntax
is.null(<column>
)
Return Value
Logical
Example
filter(is.null(ARR_DELAY))
Keep only rows where ARR_DELAY is null
filter(!is.null(ARR_DELAY))
Keep only rows where ARR_DELAY is NOT null
Summary
Returns true if a given object is numeric type.
Syntax
is.numeric
Return Value
Logical
Example
mutate_if(is.numeric, funs(. * 0.2))
Multiply all the numeric columns by 0.2 overriding existing columns.
mutate_if(is.numeric, funs(calc = . * 0.2))
Multiply all the numeric columns by 0.2 as new columns.
summarize_if(is.numeric, mean)
Calculate the mean (average) for all the numeric columns.
Summary
Returns true if and only if a given value is a length-one logical vector whose only element is TRUE.
Syntax
isTRUE(<value>
)
Return Value
Logical
Example
isTRUE(TRUE)
Returns TRUE
isTRUE(FALSE)
Returns FALSE
summarise(a = isTRUE(mean(ARR_DELAY) > 120))
Returns TRUE or FALSE based on the average of ARR_DELAY for each group.
Summary
Returns the length of the values.
Syntax
length(<column>
)
Return Value
Numeric
Example
summarize(text_length = length(TAIL_NUM))
Returns # of values for each group
summarize(text_length = length(unique(TAIL_NUM)))
Returns # of unique values of 'TAIL_NUM' for each group
Summary
Returns the levels of a given factor.
Syntax
levels(<column>
)
Example
// x <- as.factor(c("a", "b", "c", "d", "e", "a", "b", "a"))
levels(x)
Returns "a" "b" "c" "d" "e"
summarise(n = length(levels(as.factor(flight$CARRIER))))
Returns 16
Summary
Construct a list.
Syntax
list(<column>
)
Return Value
List
Example
list(sales = 0, profit = 10)
Create a named list of [sales = 0, profit = 10]
complete(year, product, fill = list(sales = 0))
Set 0 for 'sales' column for missing values.
Summary
Returns logarithms, by default natural logarithms.
Syntax
log(<column_num>
, base = <number>
)
Arguments
Return Value
Numeric
Example
mutate(Sales_log = log(Sales))
Summary
Computes common (i.e., base 10) logarithms.
Syntax
log10(<column_num>
)
Arguments
Return Value
Numeric
Example
mutate(Sales_log = log10(Sales))
Summary
Computes log(1+x) accurately also for |x| << 1.
Syntax
log1p(<column_num>
)
Arguments
Return Value
Numeric
Example
mutate(Sales_log = log1p(Sales))
Summary
Computes binary (i.e., base 2) logarithms.
Syntax
log2(<column_num>
)
Arguments
Return Value
Numeric
Example
mutate(Sales_log = log2(Sales))
Summary
Returns the maximum value in a numeric column.
Syntax
max(<column>
, na.rm = <logical>
)
Arguments
Example
summarize(revenue_max = max(revenue))
Create a new column for calculating the max value of revenue for each group.
Summary
Returns the numerical average (mean) value.
Syntax
mean(<column_num_logic_date>
, na.rm = <logical>
, trim = <number>
)
Arguments
Return Value
Numeric
Example
summarize(revenue_mean = mean(revenue))
Create a new column for calculating the average value of revenue for each group.
Summary
Returns the minimum value in a numeric column.
Syntax
min(<column>
, na.rm = <logical>
)
Arguments
Example
summarize(revenue_min = min(revenue))
Create a new column for calculating the minimum of revenue for each group.
Summary
Convert a number to month periods so you can use it for further calculations.
Syntax
months(<column_num>
)
Return Value
Period
Example
months(1)
returns "1m 0d 0H 0M 0S"
ymd_hms("2015-10-01 06:15:30") + months(1)
returns "2015-11-01 06:15:30 UTC"
Summary
Returns # of letters or size of a given value.
Syntax
nchar(<column>
, type = "chars"|"bytes"|"width"
, allowNA = <logical>
)
Arguments
Return Value
Numeric
Example
nchar("Exploratory")
Returns 11
nchar("西田")
Returns 2
nchar("西田", type = "bytes")
Returns 6
Summary
Returns the parallel maxima of the input values.
Syntax
pmax(<column_num_logic_date>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
x <- c(1, 2, 3, 4, 5)
y <- c(3, 3, 3, 3, 3)
pmax(x, y)
Returns 3 3 3 4 5
mutate(a = pmax(ARR_DELAY, DEP_DELAY, na.rm = TRUE))
Returns the bigger of the ARR_DELAY and DEP_DELAY values for each row.
Summary
Returns the parallel minima of the input values.
Syntax
pmin(<column_num_logic_date>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
x <- c(1, 2, 3, 4, 5)
y <- c(3, 3, 3, 3, 3)
pmin(x, y)
Returns 1 2 3 3 3
mutate(a = pmin(ARR_DELAY, DEP_DELAY, na.rm = TRUE))
Returns the smaller of the ARR_DELAY and DEP_DELAY values for each row.
Summary
Returns the product value by multiplying a series of numbers in a given data together.
Syntax
prod(<column_num_logic>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
// x <- 1:5
prod(x)
Returns 120
Summary
Returns the min and max values of a given data.
Syntax
range(<column>
, na.rm = <logical>
)
Arguments
Return Value
Example
// x <- c(1, 2, 3, 4, 5)
range(x)
Returns 1 5
filter(ARR_DELAY %in% range(DEP_DELAY, na.rm = TRUE))
Keep only rows with ARR_DELAY is in a range of DEP_DELAY. This is same as the below.
filter(between(ARR_DELAY, range(DEP_DELAY, na.rm = TRUE)[1], range(DEP_DELAY, na.rm = TRUE)[2]))
Summary
Repeats given numbers.
Syntax
rep(<column>
, times = <number>
, length.out = <number>
, each = <number>
)
Arguments
Return Value
Numeric
Example
// x <- c(1, 2, 3, 4, 5)
rep(x, 2)
Returns 1 2 3 4 5 1 2 3 4 5
rep(x, times = 2)
Returns 1 2 3 4 5 1 2 3 4 5
rep(x, length.out = 8)
Returns 1 2 3 4 5 1 2 3
rep(x, times = 2, each = 2)
Returns 1 1 2 2 3 3 4 4 5 5 1 1 2 2 3 3 4 4 5 5
rep(x, length.out = 8, each = 2)
Returns 1 1 2 2 3 3 4 4
Summary
Reverses the entries in a given data.
Syntax
rev(<column>
)
Return Value
Example
// x <- c(1, 2, 3, 4, 5)
rev(x)
Returns 5 4 3 2 1
Summary
Rounds a number to a certain number of decimal places.
Syntax
round(<column_num>
, digits = <number>
)
Arguments
Return Value
Numeric
Example
round(3.475, digits=2)
Returns 3.48
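Note that R rounds a 5 in the last position to the nearest even digit (IEC 60559 'round half to even'):
round(2.5)
Returns 2, while round(3.5) returns 4.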
Summary
Returns a sequence of numbers.
Syntax
seq(from = <number>
, to = <number>
, by = <number>
, length.out = <number>
, along.with = <column>
)
Arguments
Return Value
Numeric
Example
seq(1, 5)
Returns 1 2 3 4 5
seq(1, 5, by = 2)
Returns 1 3 5
seq(1:5)
Returns 1 2 3 4 5
seq(0, 1, length.out = 11)
Returns 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
seq(ymd("2017-04-01"), by = "month", length.out = 6)
Returns "2017-04-01" "2017-05-01" "2017-06-01" "2017-07-01" "2017-08-01" "2017-09-01" - It will generate date values for the next 6 months.
Overview
Extracts rows that don't exist in common between two dataframes. Equivalent to the "set difference" operation in set theory.
Syntax
setdiff(<dataframe>
, force_data_type = <TRUE|FALSE>
)
Arguments
* dataframe - Specifies the dataframes to find non-common rows between.
* force_data_type (optional) - When set to TRUE, it first converts all column data types to character type, and after processing, converts them back to inferred values.
Examples
Data1:
| ID | product_name | price |
|----|--------------|-------|
| 1 | apple | 100 |
| 2 | orange | 80 |
| 3 | banana | 120 |
Data2:
| ID | product_name | price |
|----|--------------|-------|
| 2 | orange | 80 |
| 3 | banana | 120 |
| 4 | kiwi | 150 |
setdiff(Data1, Data2)
Extracts only the rows that exist in Data1 but not in Data2.
| ID | product_name | price |
|----|--------------|-------|
| 1 | apple | 100 |
Summary
Returns either 1, 0, or -1 depending on whether the input value is positive, zero, or negative, respectively.
Syntax
sign(<column_num>
)
Return Value
Numeric
Example
// x <- c(1, 2, 10, 0, -1, -2, -10)
sign(x)
Returns 1 1 1 0 -1 -1 -1
Summary
Rounds the values in its first argument to the specified number of significant digits.
Syntax
signif(<column_num>
, digits = <number>
)
Arguments
Return Value
Numeric
Example
signif(3.475, digits=2)
Returns 3.5
Summary
Returns the sine of an angle provided in radians.
Syntax
sin(<column_num>
)
Return Value
Numeric
Example
mutate(a = sin(X1))
Create a new column for calculating the sin.
Summary
Returns the hyperbolic sine of any real number.
Syntax
sinh(<column_num>
)
Return Value
Numeric
Example
mutate(a = sinh(X1))
Create a new column for calculating the sinh.
Summary
Returns sin(pi * x), the sine of pi times the given value.
Syntax
sinpi(<column_num>
)
Return Value
Numeric
Example
mutate(a = sinpi(X1))
Create a new column for calculating the sinpi.
Summary
Returns the square root of the values.
Syntax
sqrt(<column_num>
)
Return Value
Numeric
Example
mutate(revenue_sq = sqrt(revenue))
Returns the square root of revenue values
Summary
Returns the sum of all the values.
Syntax
sum(<column_num_logic>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
summarize(total = sum(revenue))
Create a new column for calculating the sum of revenue for each group.
Summary
Returns the tangent of an angle provided in radians.
Syntax
tan(<column_num>
)
Return Value
Numeric
Example
mutate(a = tan(X1))
Create a new column for calculating the tan.
Summary
Returns the hyperbolic tangent of any real number.
Syntax
tanh(<column_num>
)
Return Value
Numeric
Example
mutate(a = tanh(X1))
Create a new column for calculating the tanh.
Summary
Returns tan(pi * x), the tangent of pi times the given value.
Syntax
tanpi(<column_num>
)
Return Value
Numeric
Example
mutate(a = tanpi(X1))
Create a new column for calculating the tanpi.
Summary
Truncates numbers toward zero by dropping the fractional digits.
Syntax
trunc(<column_num>
)
Return Value
Numeric
Example
trunc(3.475)
Returns 3
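Unlike floor, trunc drops digits toward zero, which matters for negative values:
trunc(-3.475)
Returns -3, whereas floor(-3.475) returns -4.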
Overview
Combines multiple dataframes, keeping rows that exist in either or both dataframes. Equivalent to the "union" operation in set theory.
Syntax
union(<dataframe>
, force_data_type = <TRUE|FALSE>
)
Arguments
* dataframe - Specifies the dataframes to combine by rows. Multiple dataframes can be specified. You can also specify multiple dataframes in list format.
* force_data_type (optional) - When set to TRUE, it first converts all column data types to character type, and after combining, converts them back to inferred values.
Examples
Data1:
| ID | product_name | price |
|----|--------------|-------|
| 1 | apple | 100 |
| 2 | orange | 80 |
| 3 | banana | 120 |
Data2:
| ID | product_name | price |
|----|--------------|-------|
| 2 | orange | 80 |
| 3 | banana | 120 |
| 4 | kiwi | 150 |
union(Data1, Data2)
Combines rows that exist in either Data1 or Data2, or both.
| ID | product_name | price |
|----|--------------|-------|
| 1 | apple | 100 |
| 2 | orange | 80 |
| 3 | banana | 120 |
| 4 | kiwi | 150 |
Summary
Returns the unique values of a given data.
Syntax
unique(<column>
, incomparables = <logical>
, fromLast = <logical>
, nmax = <number>
)
Arguments
Return Value
Same type as the input column
Example
summarize(text_length = length(unique(TAIL_NUM)))
Returns # of unique values of 'TAIL_NUM' for each group
Summary
Returns the TRUE indices of a logical object or based on the condition.
Syntax
which(<condition>
, arr.ind = <logical>
, useNames = <logical>
)
Arguments
Return Value
Numeric
Example
x <- c("a", "b", "c", "d", "e")
which(x == "c")
Returns 3
select(which(sapply(., is.numeric)))
Returns only numeric columns.
Summary
Performs an exclusive-or check across two values and returns TRUE when only one of them is TRUE.
Syntax
xor(<value1>
, <value2>
)
Return Value
Logical
Example
xor(TRUE, TRUE)
Returns FALSE
xor(FALSE, TRUE)
Returns TRUE
xor(FALSE, FALSE)
Returns FALSE
Overview
Returns only those rows from the current dataframe that do not have matching values in the target dataframe. Only columns from the current dataframe are retained.
Syntax
anti_join(<target_dataframe>
, by = join_by(<join_condition>
))
Arguments
Complex join conditions with join_by
Examples
Source data:
| customer_id | customer_name | order_amount |
|-------------|---------------|--------------|
| 101 | Smith | 5000 |
| 102 | Johnson | 3000 |
| 103 | Brown | 7000 |
Customer attributes data:
| customer_id | region | age |
|-------------|------------|-----|
| 101 | New York | 35 |
| 102 | California | 42 |
| 104 | Texas | 28 |
anti_join(customer_attributes, by = join_by(customer_id == customer_id))
From the source dataframe, returns only rows where "customer_id" does not have a match in the target dataframe. Customer_id=103 is included in the result because it doesn't exist in the customer attributes data.
customer_id | customer_name | order_amount |
---|---|---|
103 | Brown | 7000 |
Source data (employees):
| employee_id | job_code | name |
|-------------|----------|----------|
| 1 | A01 | Thompson |
| 2 | A02 | Wilson |
| 3 | B01 | Davis |
| 4 | C01 | Brown |
Job data:
| job_code | job_title |
|----------|-------------|
| A01 | Sales |
| A02 | Development |
| B01 | Finance |
anti_join(job_data, by = join_by(job_code == job_code))
From the source dataframe, returns only rows where "job_code" does not have a match in the target dataframe. Job code "C01" is included in the result because it doesn't exist in the job data.
employee_id | job_code | name |
---|---|---|
4 | C01 | Brown |
Source data (product data):
| product_id | product_name | price |
|------------|--------------|-------|
| P001 | Laptop | 80000 |
| P002 | Mouse | 3000 |
| P003 | Keyboard | 5000 |
Sale period data:
| product_id | sale_start_date | sale_end_date | discount_rate |
|------------|-----------------|---------------|---------------|
| P001 | 2023-06-01 | 2023-06-15 | 10 |
| P001 | 2023-07-01 | 2023-07-15 | 15 |
| P002 | 2023-06-10 | 2023-06-20 | 20 |
| P004 | 2023-06-01 | 2023-06-30 | 25 |
anti_join(sale_period_data, by = join_by(product_id == product_id, between(as.Date("2023-06-05"), sale_start_date, sale_end_date)))
From the source dataframe, returns only rows where "product_id" does not match in the target dataframe or "2023-06-05" does not fall within the sale period. Product_id=P002 (sale starts after 2023-06-05) and product_id=P003 (no sale period) are included in the result.
product_id | product_name | price |
---|---|---|
P002 | Mouse | 3000 |
P003 | Keyboard | 5000 |
Source data (order data):
| order_id | order_date | amount |
|----------|------------|--------|
| 1001 | 2023-01-15 | 3000 |
| 1002 | 2023-02-20 | 5000 |
| 1003 | 2023-03-10 | 8000 |
| 1004 | 2023-04-05 | 4000 |
Exchange rate data:
| rate_date | usd_jpy_rate |
|------------|--------------|
| 2023-01-01 | 130.5 |
| 2023-02-01 | 132.0 |
| 2023-03-01 | 133.5 |
anti_join(exchange_rate_data, by = join_by(closest(order_date >= rate_date)))
From the source dataframe, returns only rows that do not meet the condition with rows in the target dataframe. Extracts orders that don't have a closest rate date that is on or before the order date. Order with id=1004 is included because there's no corresponding rate date.
order_id | order_date | amount |
---|---|---|
1004 | 2023-04-05 | 4000 |
Source data (meeting schedule):
| meeting_id | start_time | end_time | room |
|------------|------------------|------------------|--------|
| M001 | 2023-06-01 10:00 | 2023-06-01 11:30 | Room A |
| M002 | 2023-06-01 13:00 | 2023-06-01 14:00 | Room B |
| M003 | 2023-06-02 09:30 | 2023-06-02 12:00 | Room A |
| M004 | 2023-06-03 10:00 | 2023-06-03 11:00 | Room C |
Reservation data:
| reservation_id | start_time | end_time | reserved_by |
|----------------|------------------|------------------|-------------|
| R001 | 2023-06-01 09:00 | 2023-06-01 10:30 | Smith |
| R002 | 2023-06-01 11:00 | 2023-06-01 12:00 | Johnson |
| R003 | 2023-06-01 13:30 | 2023-06-01 15:00 | Brown |
| R004 | 2023-06-02 10:00 | 2023-06-02 11:00 | Wilson |
anti_join(reservation_data, by = join_by(overlaps(start_time, end_time, start_time, end_time)))
From the source dataframe, returns only rows that do not meet the condition with rows in the target dataframe. Extracts meetings that don't have overlapping times with reservations. Meeting with id=M004 is included because it doesn't overlap with any reservation.
meeting_id | start_time | end_time | room |
---|---|---|---|
M004 | 2023-06-03 10:00 | 2023-06-03 11:00 | Room C |
Source data (store hours):
| store_id | opening_time | closing_time |
|----------|--------------|--------------|
| S001 | 09:00 | 21:00 |
| S002 | 10:00 | 22:00 |
| S003 | 08:30 | 20:00 |
| S004 | 07:00 | 15:00 |
Event data:
| event_id | start_time | end_time | description |
|----------|------------|----------|-------------|
| E001 | 08:00 | 10:00 | Morning Mtg |
| E002 | 11:00 | 13:00 | Seminar |
| E003 | 15:00 | 19:00 | Exhibition |
anti_join(event_data, by = join_by(within(opening_time, closing_time, start_time, end_time)))
From the source dataframe, returns only rows that do not meet the condition with any row in the target dataframe, i.e., stores where no event completely fits within the store's opening hours. In this example only store S004 remains.
store_id | opening_time | closing_time |
---|---|---|
S004 | 07:00 | 15:00 |
Summary
Sort rows by given column(s).
Syntax
arrange(<column(s)>
)
Used Together
desc
Example
arrange(POPULATION)
Sort by POPULATION values in ascending order.
arrange(desc(POPULATION))
Sort by POPULATION values in descending order.
arrange(POPULATION, PRODUCTION)
Sort by POPULATION, PRODUCTION columns in ascending order.
arrange(-POPULATION, -PRODUCTION)
Sort by POPULATION, PRODUCTION columns, both in descending order. This is equivalent to the following:
arrange(desc(POPULATION), desc(PRODUCTION))
Summary
Returns TRUE or FALSE based on whether a given value is within a given range (between the left and right values, inclusive).
Syntax
between(<column_num_date>
, <left>
, <right>
)
Arguments
Return Value
Logical
Example
filter(between(revenue, 1000, 6000))
Keep data whose revenue is between 1000 and 6000.
Summary
Bind multiple data frames by column.
Syntax
bind_cols(<data_set(s)>
)
Example
bind_cols(ECONOMY)
Add columns from a data frame 'ECONOMY' to the existing data frame.
Summary
Bind multiple data frames by row.
Syntax
bind_rows(<dataframe1>
, <dataframe2>
, ..., .id = <id_column_name>
, current_df_name = <current_dataframe_name>
, force_data_type = <TRUE|FALSE>
, use_col_index_as_col_name = <TRUE|FALSE>
, encoding = <encoding>
)
Arguments
Example
Existing data:
DATA2013
date | sales | product_id |
---|---|---|
11/05/2013 | 5000 | 1 |
12/20/2013 | 2000 | 2 |
Additional data to bind:
DATA2014
date | sales | product_id |
---|---|---|
01/05/2014 | 1000 | 1 |
02/20/2014 | 2000 | 2 |
DATA2015
date | sales | product_id |
---|---|---|
01/05/2015 | 3000 | 1 |
02/20/2015 | 3000 | 2 |
bind_rows(DATA2014)
Add a data frame 'DATA2014' to the existing data.
date | sales | product_id |
---|---|---|
11/05/2013 | 5000 | 1 |
12/20/2013 | 2000 | 2 |
01/05/2014 | 1000 | 1 |
02/20/2014 | 2000 | 2 |
bind_rows(DATA2014, DATA2015)
Add rows from a data frame 'DATA2014' and 'DATA2015' to the existing data.
date | sales | product_id |
---|---|---|
11/05/2013 | 5000 | 1 |
12/20/2013 | 2000 | 2 |
01/05/2014 | 1000 | 1 |
02/20/2014 | 2000 | 2 |
01/05/2015 | 3000 | 1 |
02/20/2015 | 3000 | 2 |
bind_rows(DATA2014, DATA2015, force_data_type = TRUE)
Add rows from a data frame 'DATA2014' and 'DATA2015' to the existing data and force all columns data types as character, then reevaluate columns types after the merge.
For example, if the product_id column only has character values in DATA2015, its data type is Character. If you try to bind this to other data frames whose product_id column data type is Integer, it gives an error because of this data type mismatch.
date | sales | product_id |
---|---|---|
01/05/2015 | 3000 | a |
02/20/2015 | 3000 | b |
In this case you can set the force_data_type argument to TRUE. This will create a data frame like below:
date | sales | product_id |
---|---|---|
11/05/2013 | 5000 | 1 |
12/20/2013 | 2000 | 2 |
01/05/2014 | 1000 | 1 |
02/20/2014 | 2000 | 2 |
01/05/2015 | 3000 | a |
02/20/2015 | 3000 | b |
bind_rows(DATA2014, DATA2015, current_df_name = "DATA2013", id_column_name = "df_name")
Add rows from the data frames 'DATA2014' and 'DATA2015' to DATA2013. This creates a column called 'df_name' that uses "DATA2013", "DATA2014", and "DATA2015" to mark which data frame each row came from.
df_name | date | sales | product_id |
---|---|---|---|
DATA2013 | 11/05/2013 | 5000 | 1 |
DATA2013 | 12/20/2013 | 2000 | 2 |
DATA2014 | 01/05/2014 | 1000 | 1 |
DATA2014 | 02/20/2014 | 2000 | 2 |
DATA2015 | 01/05/2015 | 3000 | 1 |
DATA2015 | 02/20/2015 | 3000 | 2 |
Dataframe1 (with product-related column names)
| product_name | price | stock |
|--------------|-------|-------|
| Laptop | 80000 | 10 |
| Mouse | 3000 | 50 |
| Keyboard | 5000 | 30 |
Dataframe2 (with similar but different column names)
| product_name | price | quantity |
|--------------|-------|----------|
| Monitor | 25000 | 15 |
| USB drive | 2000 | 100 |
| Headphones | 8000 | 20 |
Normal column name-based binding
bind_rows(Dataframe1, Dataframe2)
Because the column names are different, they are not recognized as common columns, and all columns are combined as separate columns.
product_name | price | stock | product_name | price | quantity |
---|---|---|---|---|---|
Laptop | 80000 | 10 | NA | NA | NA |
Mouse | 3000 | 50 | NA | NA | NA |
Keyboard | 5000 | 30 | NA | NA | NA |
NA | NA | NA | Monitor | 25000 | 15 |
NA | NA | NA | USB drive | 2000 | 100 |
NA | NA | NA | Headphones | 8000 | 20 |
Column position-based binding
bind_rows(Dataframe1, Dataframe2, use_col_index_as_col_name = TRUE)
Combines based on column position, ignoring column names. The column names from the first dataframe (Dataframe1) are used in the combined result.
product_name | price | stock |
---|---|---|
Laptop | 80000 | 10 |
Mouse | 3000 | 50 |
Keyboard | 5000 | 30 |
Monitor | 25000 | 15 |
USB drive | 2000 | 100 |
Headphones | 8000 | 20 |
Summary
Summarize the data by either calling n() or sum() for each group.
Syntax
count(<column(s)>
, wt = <column_num_or_base_aggregate_expression>
, sort = <logical>
)
Arguments
Example
count(CARRIER)
Groups by 'CARRIER' and returns the number of rows for each 'CARRIER'.
count(CARRIER, sort = TRUE)
Groups by 'CARRIER' and returns the number of rows for each 'CARRIER' from the highest number to the lowest.
count(CARRIER, wt = ARR_DELAY)
Groups by 'CARRIER' and returns the sum total of 'ARR_DELAY'.
count(CARRIER, TAIL_NUM, wt = ARR_DELAY)
Groups by 'CARRIER' and 'TAIL_NUM', and returns the sum total of 'ARR_DELAY'.
count(CARRIER, wt = ARR_DELAY, sort = TRUE)
Groups by 'CARRIER' and returns the sum total of 'ARR_DELAY' and sorted by the total number.
count(CARRIER, wt = mean(ARR_DELAY, na.rm = TRUE))
Groups by 'CARRIER' and returns the average of 'ARR_DELAY' after removing NA values.
Summary
Returns TRUE if all the values up to the current position are TRUE.
Syntax
cumall(<condition>
)
Arguments
Return Value
Logical
Example
// X = TRUE, TRUE, FALSE, TRUE, FALSE
cumall(X)
Returns TRUE, TRUE, FALSE, FALSE, FALSE
Summary
Returns TRUE if any of the values up to the current position is TRUE.
Syntax
cumany(<condition>
)
Arguments
Return Value
Logical
Example
// X = TRUE, TRUE, FALSE, TRUE, FALSE
cumany(X)
Returns TRUE, TRUE, TRUE, TRUE, TRUE
Summary
Cumulative distribution. Proportion of all values less than or equal to the current rank. Missing values are left as is.
Syntax
cume_dist(<column_num>
)
Return Value
Numeric
Example
mutate(a = cume_dist(population))
Create a new column to have the cumulative distribution number based on 'population' values.
filter(cume_dist(population) < 0.25)
Select rows whose cumulative distribution number is less than 0.25.
Summary
Returns the cumulative mean (average).
Syntax
cummean(<column_num>
)
Arguments
Return Value
Numeric
Example
// X = 1, 2, 3, 2, 1
cummean(X)
Returns 1, 1.5, 2, 2, 1.8
Summary
Ranks with no gaps. Missing values are left as is.
Syntax
dense_rank(<column_num>
)
Return Value
Numeric
Example
mutate(population_rank = dense_rank(population))
Create a new column to have the rank number based on 'population' values.
filter(dense_rank(population) < 10)
Select rows whose rank numbers are less than 10.
Summary
Changes the sorting order to descending, e.g. 9 to 1 instead of 1 to 9, z to a instead of a to z.
Syntax
desc(<column>
)
Example
arrange(desc(POPULATION))
Sort by POPULATION values in a descending order.
arrange(desc(POPULATION), desc(PRODUCTION))
Sort by POPULATION, PRODUCTION columns in both descending order.
Summary
Select distinct/unique rows. Only the first row will be preserved. Set .keep_all argument to TRUE to keep all the original columns.
Syntax
distinct(<column(s)>
, .keep_all = <logical>
)
Arguments
Example
distinct()
Keep only unique/distinct rows based on all the columns.
distinct(CARRIER, TAIL_NUM, FL_NUM)
Keep only unique/distinct rows based on CARRIER, TAIL_NUM, FL_NUM columns.
distinct(CARRIER, .keep_all = TRUE)
Keep only unique/distinct rows based on CARRIER and return all the columns.
Summary
Select rows with conditions.
Syntax
filter(<condition>
)
Used Together
is.na, between
Example
filter(revenue < 6000)
Keep data whose revenue is less than 6000.
filter(revenue < 6000 & region == "West")
Keep data whose revenue is less than 6000 AND region is "West".
filter(revenue < 6000 | region == "West")
Keep data whose revenue is less than 6000 OR region is "West".
filter(between(revenue, 1000, 6000))
Keep data whose revenue is between 1000 and 6000.
filter(is.na(revenue))
Keep data whose revenue is NA.
filter(!is.na(revenue))
Keep data whose revenue is NOT NA.
Summary
Returns the first value of all the values.
Syntax
first(<column>
, order_by = <column(s)>
, default = <text>
)
Arguments
Example
summarize(total = first(name))
Create a new column for returning the first value for each group.
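A sketch of the order_by argument (the 'joined_date' column here is illustrative):
summarize(first_name = first(name, order_by = joined_date))
Returns the 'name' value from the row with the earliest 'joined_date' for each group.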
Overview
Performs a full outer join. Returns all rows from both the current dataframe (left side) and the target dataframe (right side), and all columns from both dataframes. When a row in one dataframe has no matching row in the other dataframe, columns from the non-matching side will contain NA values. If there are multiple matches, all combinations will be returned.
Syntax
full_join(<target_dataframe>
, by = join_by(<join_condition>
), suffix = c(<source_dataframe_suffix>
, <target_dataframe_suffix>
), na_matches = <NA_handling_method>
, ignore_case = <TRUE|FALSE>
)
Arguments
Complex join conditions with join_by
Examples
Source data:
| customer_id | customer_name | order_amount |
|-------------|---------------|--------------|
| 101 | Smith | 5000 |
| 102 | Johnson | 3000 |
| 103 | Brown | 7000 |
Customer attributes data:
| customer_id | region | age |
|-------------|------------|-----|
| 101 | New York | 35 |
| 102 | California | 42 |
| 104 | Texas | 28 |
full_join(customer_attributes, by = join_by(customer_id == customer_id))
Joins all rows from both dataframes. For customer_id=104 which has no match in the source dataframe, customer_name and order_amount columns will contain NA values. For customer_id=103 which has no match in the target dataframe, region and age columns will contain NA values.
customer_id | customer_name | order_amount | region | age |
---|---|---|---|---|
101 | Smith | 5000 | New York | 35 |
102 | Johnson | 3000 | California | 42 |
103 | Brown | 7000 | NA | NA |
104 | NA | NA | Texas | 28 |
Source data (employees):
| employee_id | job_code | name |
|-------------|----------|----------|
| 1 | A01 | Thompson |
| 2 | A02 | Wilson |
| 3 | NA | Davis |
Job data:
| job_code | job_title |
|----------|-------------|
| A01 | Sales |
| A02 | Development |
| A03 | Finance |
| NA | Unassigned |
full_join(job_data, by = join_by(job_code == job_code), na_matches = "na")
Joins all rows from both dataframes. With na_matches="na", NA values are considered matching. For job_code=A03 which has no match in the source dataframe, employee_id and name columns will contain NA values.
employee_id | job_code | name | job_title |
---|---|---|---|
1 | A01 | Thompson | Sales |
2 | A02 | Wilson | Development |
3 | NA | Davis | Unassigned |
NA | A03 | NA | Finance |
Source data (user data):
| user_id | name |
|---------|-------|
| 1 | john |
| 2 | MARY |
| 3 | David |
Profile data:
| user_id | name | age |
|---------|-------|-----|
| 1 | John | 25 |
| 2 | mary | 30 |
| 4 | Steve | 35 |
full_join(profile_data, by = join_by(name == name), ignore_case = TRUE)
Joins all rows from both dataframes, ignoring case differences. For name=Steve (user_id=4) which has no match in the source dataframe, the source user_id column will contain NA value. For name=David (user_id=3) which has no match in the target dataframe, the target user_id and age columns will contain NA values.
user_id.x | name | user_id.y | age |
---|---|---|---|
1 | john | 1 | 25 |
2 | MARY | 2 | 30 |
3 | David | NA | NA |
NA | Steve | 4 | 35 |
Source data (product data):
| product_id | product_name | price |
|------------|--------------|-------|
| P001 | Laptop | 80000 |
| P002 | Mouse | 3000 |
| P003 | Keyboard | 5000 |
Sale period data:
| product_id | sale_start_date | sale_end_date | discount_rate |
|------------|-----------------|---------------|---------------|
| P001 | 2023-06-01 | 2023-06-15 | 10 |
| P001 | 2023-07-01 | 2023-07-15 | 15 |
| P002 | 2023-06-10 | 2023-06-20 | 20 |
| P004 | 2023-06-01 | 2023-06-30 | 25 |
full_join(sale_period_data, by = join_by(product_id == product_id, between(as.Date("2023-06-05"), sale_start_date, sale_end_date)))
Joins all rows from both dataframes. For combinations where "product_id" matches and "2023-06-05" falls within the sale period, all data is included. For other combinations, columns from the side that doesn't satisfy the condition will contain NA values.
product_id | product_name | price | sale_start_date | sale_end_date | discount_rate |
---|---|---|---|---|---|
P001 | Laptop | 80000 | 2023-06-01 | 2023-06-15 | 10 |
P001 | Laptop | 80000 | 2023-07-01 | 2023-07-15 | 15 |
P002 | Mouse | 3000 | 2023-06-10 | 2023-06-20 | 20 |
P003 | Keyboard | 5000 | NA | NA | NA |
P004 | NA | NA | 2023-06-01 | 2023-06-30 | 25 |
Source data (order data):
| order_id | order_date | amount |
|----------|------------|--------|
| 1001 | 2023-01-15 | 3000 |
| 1002 | 2023-02-20 | 5000 |
| 1003 | 2023-03-10 | 8000 |
Exchange rate data:
| rate_date | usd_jpy_rate |
|------------|--------------|
| 2023-01-01 | 130.5 |
| 2023-02-01 | 132.0 |
| 2023-03-01 | 133.5 |
| 2023-04-01 | 135.2 |
full_join(exchange_rate_data, by = join_by(closest(order_date >= rate_date)))
Joins all rows from both dataframes. For each order date, finds the closest preceding exchange rate. For rate_date=2023-04-01 which has no corresponding order, order_id and amount columns will contain NA values.
order_id | order_date | amount | rate_date | usd_jpy_rate |
---|---|---|---|---|
1001 | 2023-01-15 | 3000 | 2023-01-01 | 130.5 |
1002 | 2023-02-20 | 5000 | 2023-02-01 | 132.0 |
1003 | 2023-03-10 | 8000 | 2023-03-01 | 133.5 |
NA | NA | NA | 2023-04-01 | 135.2 |
Source data (meeting schedule):
| meeting_id | start_time | end_time | room |
|------------|------------------|------------------|--------|
| M001 | 2023-06-01 10:00 | 2023-06-01 11:30 | Room A |
| M002 | 2023-06-01 13:00 | 2023-06-01 14:00 | Room B |
| M003 | 2023-06-02 09:30 | 2023-06-02 12:00 | Room A |
Reservation data:
| reservation_id | start_time | end_time | reserved_by |
|----------------|------------------|------------------|-------------|
| R001 | 2023-06-01 09:00 | 2023-06-01 10:30 | Smith |
| R002 | 2023-06-01 11:00 | 2023-06-01 12:00 | Johnson |
| R003 | 2023-06-01 13:30 | 2023-06-01 15:00 | Brown |
| R004 | 2023-06-02 10:00 | 2023-06-02 11:00 | Wilson |
| R005 | 2023-06-03 10:00 | 2023-06-03 12:00 | Taylor |
full_join(reservation_data, by = join_by(overlaps(start_time, end_time, start_time, end_time)))
Joins all rows from both dataframes. For combinations where meeting times overlap with reservation times, all data is included. For reservation_id=R005 which has no overlapping meeting, meeting_id and room columns will contain NA values.
meeting_id | start_time | end_time | room | reservation_id | reserved_by |
---|---|---|---|---|---|
M001 | 2023-06-01 10:00 | 2023-06-01 11:30 | Room A | R001 | Smith |
M001 | 2023-06-01 10:00 | 2023-06-01 11:30 | Room A | R002 | Johnson |
M002 | 2023-06-01 13:00 | 2023-06-01 14:00 | Room B | R003 | Brown |
M003 | 2023-06-02 09:30 | 2023-06-02 12:00 | Room A | R004 | Wilson |
NA | NA | NA | NA | R005 | Taylor |
Source data (store hours):
| store_id | opening_time | closing_time |
|----------|--------------|--------------|
| S001 | 09:00 | 21:00 |
| S002 | 10:00 | 22:00 |
| S003 | 08:30 | 20:00 |
Event data:
| event_id | start_time | end_time | description |
|----------|------------|----------|-------------|
| E001 | 08:00 | 10:00 | Morning Mtg |
| E002 | 11:00 | 13:00 | Seminar |
| E003 | 15:00 | 19:00 | Exhibition |
| E004 | 19:30 | 22:30 | Party |
full_join(event_data, by = join_by(within(opening_time, closing_time, start_time, end_time)))
Joins all rows from both dataframes. For combinations where events completely fit within store opening hours, all data is included. For event_id=E001 and event_id=E004 which don't fit within any store's hours, store_id, opening_time, and closing_time columns will contain NA values.
store_id | opening_time | closing_time | event_id | start_time | end_time | description |
---|---|---|---|---|---|---|
S001 | 09:00 | 21:00 | E002 | 11:00 | 13:00 | Seminar |
S002 | 10:00 | 22:00 | E002 | 11:00 | 13:00 | Seminar |
S003 | 08:30 | 20:00 | E002 | 11:00 | 13:00 | Seminar |
S001 | 09:00 | 21:00 | E003 | 15:00 | 19:00 | Exhibition |
S002 | 10:00 | 22:00 | E003 | 15:00 | 19:00 | Exhibition |
S003 | 08:30 | 20:00 | E003 | 15:00 | 19:00 | Exhibition |
NA | NA | NA | E001 | 08:00 | 10:00 | Morning Mtg |
NA | NA | NA | E004 | 19:30 | 22:30 | Party |
Summary
funs provides a flexible way to generate a named list of functions for input to other functions like summarise_each.
Syntax
funs(<function_name(s)>
)
Arguments
A list of functions specified by:
* Their name: "mean"
* The function itself: mean
* A call to the function with . as a dummy parameter: mean(., na.rm = TRUE)
Example
funs(mean, "mean", mean(., na.rm = TRUE))
# Override default names
funs(m1 = mean, m2 = "mean", m3 = mean(., na.rm = TRUE))
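A sketch of feeding the result to summarize_each (the ARR_DELAY column is illustrative):
summarize_each(funs(m = mean(., na.rm = TRUE)), ARR_DELAY)
Calculates the NA-removed mean of ARR_DELAY, naming the output column with the 'm' suffix.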
Summary
Converts the data frame into a grouped data frame where the following operations will be performed for each group. Grouping should be done before aggregating values.
Syntax
group_by(<column(s)>
)
Example
group_by(COUNTRY)
Group the data by COUNTRY.
group_by(REGION, COUNTRY)
Group the data by REGION and COUNTRY.
group_by(FIRSTNAME + LASTNAME)
Group by a dynamically generated value of 'FIRSTNAME + LASTNAME'
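Grouping typically pairs with a following aggregation step, for example (the SALES column is illustrative):
group_by(COUNTRY)
summarize(total_sales = sum(SALES, na.rm = TRUE))
Returns one row per COUNTRY with the total of SALES.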
Overview
Performs an inner join. Returns rows that match in both the current dataframe (left side) and the target dataframe (right side), and all columns from both dataframes. Rows without matches are excluded from the result. If there are multiple matches, all combinations will be returned.
Syntax
inner_join(<target_dataframe>
, by = join_by(<join_condition>
), suffix = c(<source_dataframe_suffix>
, <target_dataframe_suffix>
), na_matches = <NA_handling_method>
, ignore_case = <TRUE|FALSE>
)
Arguments
Complex join conditions with join_by
Examples
Source data:
| customer_id | customer_name | order_amount |
|-------------|---------------|--------------|
| 101 | Smith | 5000 |
| 102 | Johnson | 3000 |
| 103 | Brown | 7000 |
Customer attributes data:
| customer_id | region | age |
|-------------|------------|-----|
| 101 | New York | 35 |
| 102 | California | 42 |
| 104 | Texas | 28 |
inner_join(customer_attributes, by = join_by(customer_id == customer_id))
Returns only rows where "customer_id" matches in both dataframes. Customer_id=103 and customer_id=104 are excluded from the result because they exist in only one of the dataframes.
customer_id | customer_name | order_amount | region | age |
---|---|---|---|---|
101 | Smith | 5000 | New York | 35 |
102 | Johnson | 3000 | California | 42 |
Source data (employees):
| employee_id | job_code | name |
|-------------|----------|----------|
| 1 | A01 | Thompson |
| 2 | A02 | Wilson |
| 3 | NA | Davis |
Job data:
| job_code | job_title |
|----------|-------------|
| A01 | Sales |
| A02 | Development |
| A03 | Finance |
| NA | Unassigned |
inner_join(job_data, by = join_by(job_code == job_code), na_matches = "na")
Returns only rows where "job_code" matches in both dataframes. With na_matches="na", NA values are considered matching. Job_code=A03 is excluded from the result because it exists only in the job data.
employee_id | job_code | name | job_title |
---|---|---|---|
1 | A01 | Thompson | Sales |
2 | A02 | Wilson | Development |
3 | NA | Davis | Unassigned |
Source data (user data):
| user_id | name |
|---------|-------|
| 1 | john |
| 2 | MARY |
| 3 | David |
Profile data:
| user_id | name | age |
|---------|-------|-----|
| 1 | John | 25 |
| 2 | mary | 30 |
| 4 | Steve | 35 |
inner_join(profile_data, by = join_by(name == name), ignore_case = TRUE)
Returns only rows where "name" matches in both dataframes, ignoring case differences. Name="john" matches name="John", and name="MARY" matches name="mary". Name="David" and name="Steve" are excluded because they exist in only one of the dataframes.
user_id.x | name | user_id.y | age |
---|---|---|---|
1 | john | 1 | 25 |
2 | MARY | 2 | 30 |
Source data (product data):
| product_id | product_name | price |
|------------|--------------|-------|
| P001 | Laptop | 80000 |
| P002 | Mouse | 3000 |
| P003 | Keyboard | 5000 |
Sale period data:
| product_id | sale_start_date | sale_end_date | discount_rate |
|------------|-----------------|---------------|---------------|
| P001 | 2023-06-01 | 2023-06-15 | 10 |
| P001 | 2023-07-01 | 2023-07-15 | 15 |
| P002 | 2023-06-10 | 2023-06-20 | 20 |
| P004 | 2023-06-01 | 2023-06-30 | 25 |
inner_join(sale_period_data, by = join_by(product_id == product_id, between(as.Date("2023-06-05"), sale_start_date, sale_end_date)))
Returns only rows where "product_id" matches and "2023-06-05" falls within the sale period. The second period for product_id=P001 (July) and the period for product_id=P002 (starting from 2023-06-10) are excluded because 2023-06-05 is not within those periods. Product_id=P003 has no sale period, and product_id=P004 doesn't exist in the product data, so they're also excluded.
product_id | product_name | price | sale_start_date | sale_end_date | discount_rate |
---|---|---|---|---|---|
P001 | Laptop | 80000 | 2023-06-01 | 2023-06-15 | 10 |
Source data (order data):
| order_id | order_date | amount |
|----------|------------|--------|
| 1001 | 2023-01-15 | 3000 |
| 1002 | 2023-02-20 | 5000 |
| 1003 | 2023-03-10 | 8000 |
Exchange rate data:
| rate_date | usd_jpy_rate |
|------------|--------------|
| 2023-01-01 | 130.5 |
| 2023-02-01 | 132.0 |
| 2023-03-01 | 133.5 |
| 2023-04-01 | 135.2 |
inner_join(exchange_rate_data, by = join_by(closest(order_date >= rate_date)))
Returns only rows where there's a matching condition between dataframes. For each order date, finds the closest preceding exchange rate. Rate_date=2023-04-01 is excluded because there's no corresponding order.
order_id | order_date | amount | rate_date | usd_jpy_rate |
---|---|---|---|---|
1001 | 2023-01-15 | 3000 | 2023-01-01 | 130.5 |
1002 | 2023-02-20 | 5000 | 2023-02-01 | 132.0 |
1003 | 2023-03-10 | 8000 | 2023-03-01 | 133.5 |
Source data (meeting schedule):
| meeting_id | start_time | end_time | room |
|------------|------------------|------------------|--------|
| M001 | 2023-06-01 10:00 | 2023-06-01 11:30 | Room A |
| M002 | 2023-06-01 13:00 | 2023-06-01 14:00 | Room B |
| M003 | 2023-06-02 09:30 | 2023-06-02 12:00 | Room A |
Reservation data:
| reservation_id | start_time | end_time | reserved_by |
|----------------|------------------|------------------|-------------|
| R001 | 2023-06-01 09:00 | 2023-06-01 10:30 | Smith |
| R002 | 2023-06-01 11:00 | 2023-06-01 12:00 | Johnson |
| R003 | 2023-06-01 13:30 | 2023-06-01 15:00 | Brown |
| R004 | 2023-06-02 10:00 | 2023-06-02 11:00 | Wilson |
| R005 | 2023-06-03 10:00 | 2023-06-03 12:00 | Taylor |
inner_join(reservation_data, by = join_by(overlaps(start_time, end_time, start_time, end_time)))
Returns only rows where meeting times overlap with reservation times. Reservation_id=R005 is excluded because it doesn't overlap with any meeting.
meeting_id | start_time | end_time | room | reservation_id | reserved_by |
---|---|---|---|---|---|
M001 | 2023-06-01 10:00 | 2023-06-01 11:30 | Room A | R001 | Smith |
M001 | 2023-06-01 10:00 | 2023-06-01 11:30 | Room A | R002 | Johnson |
M002 | 2023-06-01 13:00 | 2023-06-01 14:00 | Room B | R003 | Brown |
M003 | 2023-06-02 09:30 | 2023-06-02 12:00 | Room A | R004 | Wilson |
Source data (store hours):
| store_id | opening_time | closing_time |
|----------|--------------|--------------|
| S001 | 09:00 | 21:00 |
| S002 | 10:00 | 22:00 |
| S003 | 08:30 | 20:00 |
Event data:
| event_id | start_time | end_time | description |
|----------|------------|----------|-------------|
| E001 | 08:00 | 10:00 | Morning Mtg |
| E002 | 11:00 | 13:00 | Seminar |
| E003 | 15:00 | 19:00 | Exhibition |
| E004 | 19:30 | 22:30 | Party |
inner_join(event_data, by = join_by(within(opening_time, closing_time, start_time, end_time)))
Returns only rows where events completely fit within store opening hours. Event_id=E001 and event_id=E004 are excluded because they don't satisfy the condition with any store.
store_id | opening_time | closing_time | event_id | start_time | end_time | description |
---|---|---|---|---|---|---|
S001 | 09:00 | 21:00 | E002 | 11:00 | 13:00 | Seminar |
S002 | 10:00 | 22:00 | E002 | 11:00 | 13:00 | Seminar |
S003 | 08:30 | 20:00 | E002 | 11:00 | 13:00 | Seminar |
S001 | 09:00 | 21:00 | E003 | 15:00 | 19:00 | Exhibition |
S002 | 10:00 | 22:00 | E003 | 15:00 | 19:00 | Exhibition |
S003 | 08:30 | 20:00 | E003 | 15:00 | 19:00 | Exhibition |
Summary
Provides access to a row at a given physical offset prior to the current row.
Syntax
lag(<column>
, n = <number>
, default = <text>
, order_by = <column(s)>
)
Arguments
Example
// x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
lag(x)
Returns NA 1 2 3 4 5 6 7 8 9
lag(x, 2)
Returns NA NA 1 2 3 4 5 6 7 8
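A sketch of the order_by argument (the SALES and DATE columns are illustrative):
mutate(prev_sales = lag(SALES, 1, order_by = DATE))
Returns the previous SALES value when rows are ordered by DATE, with NA for the first row.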
Summary
Returns the last value of all the values.
Syntax
last(<column>
, order_by = <column(s)>
, default = <text>
)
Arguments
Example
summarize(total = last(name))
Create a new column for returning the last value for each group.
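A sketch of the order_by argument (the 'joined_date' column here is illustrative):
summarize(last_name = last(name, order_by = joined_date))
Returns the 'name' value from the row with the latest 'joined_date' for each group.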
Summary
Provides access to a row at a given physical offset that follows the current row.
Syntax
lead(<column>
, n = <number>
, default = <text>
, order_by = <column(s)>
)
Arguments
Example
// x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
lead(x)
Returns 2 3 4 5 6 7 8 9 10 NA
lead(x, 2)
Returns 3 4 5 6 7 8 9 10 NA NA
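A sketch of the order_by argument (the SALES and DATE columns are illustrative):
mutate(next_sales = lead(SALES, 1, order_by = DATE))
Returns the next SALES value when rows are ordered by DATE, with NA for the last row.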
Overview
Performs a left outer join. Returns all rows from the current dataframe (left side) and all columns from both the current dataframe and the target dataframe (right side). When a row in the left dataframe has no matching row in the right dataframe, the columns added from the right side will contain NA values. If there are multiple matches, all combinations will be returned.
Syntax
left_join(<target_dataframe>
, by = join_by(<join_condition>
), suffix = c(<source_dataframe_suffix>
, <target_dataframe_suffix>
), na_matches = <NA_handling_method>
, ignore_case = <TRUE|FALSE>
)
Arguments
Complex join conditions with join_by
Examples
Source data: | customer_id | customer_name | order_amount | |-------------|---------------|--------------| | 101 | Smith | 5000 | | 102 | Johnson | 3000 | | 103 | Brown | 7000 |
Customer attributes data: | customer_id | region | age | |-------------|-------------|-----| | 101 | New York | 35 | | 102 | California | 42 | | 104 | Texas | 28 |
left_join(customer_attributes, by = join_by(customer_id == customer_id))
Keeps all rows from the source dataframe and joins with the "customer_attributes" dataframe using the "customer_id" column as the key. For customer_id=103 which has no match in the target dataframe, region and age columns will contain NA values.
customer_id | customer_name | order_amount | region | age |
---|---|---|---|---|
101 | Smith | 5000 | New York | 35 |
102 | Johnson | 3000 | California | 42 |
103 | Brown | 7000 | NA | NA |
Source data (employees): | employee_id | job_code | name | |-------------|----------|----------| | 1 | A01 | Thompson | | 2 | A02 | Wilson | | 3 | NA | Davis |
Job data: | job_code | job_title | |----------|--------------| | A01 | Sales | | A02 | Development | | NA | Unassigned |
left_join(job_data, by = join_by(job_code == job_code), na_matches = "na")
Keeps all rows from the source dataframe and joins with the "job_data" dataframe using the "job_code" column as the key. NA values are also used as join keys since na_matches="na".
employee_id | job_code | name | job_title |
---|---|---|---|
1 | A01 | Thompson | Sales |
2 | A02 | Wilson | Development |
3 | NA | Davis | Unassigned |
Source data (user data): | user_id | name | |---------|-------| | 1 | john | | 2 | MARY | | 3 | David |
Profile data: | user_id | name | age | |---------|-------|-----| | 1 | John | 25 | | 2 | mary | 30 | | 4 | Steve | 35 |
left_join(profile_data, by = join_by(name == name), ignore_case = TRUE)
Keeps all rows from the source dataframe and joins with the "profile_data" dataframe using the "name" column as the key, ignoring case differences.
user_id.x | name | user_id.y | age |
---|---|---|---|
1 | john | 1 | 25 |
2 | MARY | 2 | 30 |
3 | David | NA | NA |
Source data (product data): | product_id | product_name | price | |------------|--------------|-------| | P001 | Laptop | 80000 | | P002 | Mouse | 3000 | | P003 | Keyboard | 5000 |
Sale period data: | product_id | sale_start_date | sale_end_date | discount_rate | |------------|-----------------|---------------|---------------| | P001 | 2023-06-01 | 2023-06-15 | 10 | | P001 | 2023-07-01 | 2023-07-15 | 15 | | P002 | 2023-06-10 | 2023-06-20 | 20 |
left_join(sale_period_data, by = join_by(product_id == product_id, between(as.Date("2023-06-05"), sale_start_date, sale_end_date)))
Keeps all rows from the source dataframe and joins with matching products where "2023-06-05" falls within the sale period.
product_id | product_name | price | sale_start_date | sale_end_date | discount_rate |
---|---|---|---|---|---|
P001 | Laptop | 80000 | 2023-06-01 | 2023-06-15 | 10 |
P002 | Mouse | 3000 | NA | NA | NA |
P003 | Keyboard | 5000 | NA | NA | NA |
Source data (order data): | order_id | order_date | amount | |----------|------------|--------| | 1001 | 2023-01-15 | 3000 | | 1002 | 2023-02-20 | 5000 | | 1003 | 2023-03-10 | 8000 |
Exchange rate data: | rate_date | usd_jpy_rate | |------------|--------------| | 2023-01-01 | 130.5 | | 2023-02-01 | 132.0 | | 2023-03-01 | 133.5 | | 2023-04-01 | 135.2 |
left_join(exchange_rate_data, by = join_by(closest(order_date >= rate_date)))
Keeps all rows from the source dataframe and joins with the closest exchange rate that is on or before each order date.
order_id | order_date | amount | rate_date | usd_jpy_rate |
---|---|---|---|---|
1001 | 2023-01-15 | 3000 | 2023-01-01 | 130.5 |
1002 | 2023-02-20 | 5000 | 2023-02-01 | 132.0 |
1003 | 2023-03-10 | 8000 | 2023-03-01 | 133.5 |
Source data (meeting schedule): | meeting_id | start_time | end_time | room | |------------|----------------------|----------------------|----------| | M001 | 2023-06-01 10:00 | 2023-06-01 11:30 | Room A | | M002 | 2023-06-01 13:00 | 2023-06-01 14:00 | Room B | | M003 | 2023-06-02 09:30 | 2023-06-02 12:00 | Room A |
Reservation data: | reservation_id | start_time | end_time | reserved_by | |----------------|----------------------|----------------------|-------------| | R001 | 2023-06-01 09:00 | 2023-06-01 10:30 | Smith | | R002 | 2023-06-01 11:00 | 2023-06-01 12:00 | Johnson | | R003 | 2023-06-01 13:30 | 2023-06-01 15:00 | Brown | | R004 | 2023-06-02 10:00 | 2023-06-02 11:00 | Wilson |
left_join(reservation_data, by = join_by(overlaps(start_time, end_time, start_time, end_time)))
Keeps all rows from the source dataframe and joins with reservations that overlap with meeting times.
meeting_id | start_time | end_time | room | reservation_id | reserved_by |
---|---|---|---|---|---|
M001 | 2023-06-01 10:00 | 2023-06-01 11:30 | Room A | R001 | Smith |
M001 | 2023-06-01 10:00 | 2023-06-01 11:30 | Room A | R002 | Johnson |
M002 | 2023-06-01 13:00 | 2023-06-01 14:00 | Room B | R003 | Brown |
M003 | 2023-06-02 09:30 | 2023-06-02 12:00 | Room A | R004 | Wilson |
Source data (store hours): | store_id | opening_time | closing_time | |----------|--------------|--------------| | S001 | 09:00 | 21:00 | | S002 | 10:00 | 22:00 | | S003 | 08:30 | 20:00 |
Event data: | event_id | start_time | end_time | description | |----------|------------|----------|---------------| | E001 | 08:00 | 10:00 | Morning Mtg | | E002 | 11:00 | 13:00 | Seminar | | E003 | 15:00 | 19:00 | Exhibition | | E004 | 19:30 | 22:30 | Party |
left_join(event_data, by = join_by(within(opening_time, closing_time, start_time, end_time)))
Keeps all rows from the source dataframe and joins with events that completely fit within store hours.
store_id | opening_time | closing_time | event_id | start_time | end_time | description |
---|---|---|---|---|---|---|
S001 | 09:00 | 21:00 | E002 | 11:00 | 13:00 | Seminar |
S001 | 09:00 | 21:00 | E003 | 15:00 | 19:00 | Exhibition |
S002 | 10:00 | 22:00 | E002 | 11:00 | 13:00 | Seminar |
S002 | 10:00 | 22:00 | E003 | 15:00 | 19:00 | Exhibition |
S003 | 08:30 | 20:00 | E002 | 11:00 | 13:00 | Seminar |
S003 | 08:30 | 20:00 | E003 | 15:00 | 19:00 | Exhibition |
Summary
Returns ranks. Ties get the minimum rank. Missing values are left as is.
Syntax
min_rank(<column_num>
)
Return Value
Numeric
Example
mutate(population_rank = min_rank(population))
Create a new column to have the rank number based on 'population' values. The smaller the numbers are the higher the ranks are.
mutate(population_rank = min_rank(desc(population)))
Create a new column to have the rank number based on 'population' values. The bigger the numbers are the higher the ranks are.
filter(min_rank(population) < 10)
Select rows whose rank numbers are less than 10.
Summary
Mutate creates new column(s) with given expressions and preserves existing columns.
Syntax
mutate(<new_column>
= <expression>
, ...)
Example
mutate(profit = revenue - expense)
Create 'profit' column based on the calculation result of 'revenue - expense'.
Summary
Returns the count of the rows for each group.
Syntax
n()
Arguments
Return Value
Numeric
Example
summarize(total = n())
Create a new column for calculating the number of entries for each group.
Summary
Returns the count of unique values.
Syntax
n_distinct(<column(s)>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
summarize(total = n_distinct(TAIL_NUM))
Create a new column for calculating the count of unique values of TAIL_NUM for each group.
summarize(total = n_distinct(STATE, CITY))
Create a new column for calculating the count of unique combinations of STATE and CITY for each group.
Summary
Returns the nth value of all the values.
Syntax
nth(<column>
, n = <number>
, order_by = <column(s)>
, default = <text>
)
Arguments
Example
summarize(total = nth(name, 5))
Returns the fifth value for each group.
summarize(total = nth(name, -5))
Returns the fifth value from the end for each group.
Summary
Divides data into a specified number (n) of "equal frequency" groups (with approximately the same number of rows in each group) and returns an integer value (from 1 to n) indicating which group each row belongs to.
Note that this "equal frequency division" distributes rows evenly across groups, which differs from equal width (same value range) division.
Syntax
ntile(<column>
, <number>
)
Arguments
Return Value
Numeric
Example
ntile(sales_amount, 4)
Divides "sales_amount" into 4 equal frequency groups. This splits the data so each group contains the same number of rows. For example, if there are 8 rows total, each group will contain 2 rows.
| ID | sales_amount | quartile_group |
|----|-------------|---------------|
| 1 | 10000 | 1 |
| 2 | 15000 | 1 |
| 3 | 20000 | 2 |
| 4 | 25000 | 2 |
| 5 | 30000 | 3 |
| 6 | 35000 | 3 |
| 7 | 40000 | 4 |
| 8 | 45000 | 4 |
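The quartile_group column in the table above would typically be created with mutate(). A sketch using the same column names:
mutate(quartile_group = ntile(sales_amount, 4))
Create a 'quartile_group' column assigning each row to one of 4 equal frequency groups.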
Summary
Ranks rescaled to [0, 1]. Missing values are left as is.
Syntax
percent_rank(<column_num>
)
Return Value
Numeric
Example
mutate(population_rank = percent_rank(population))
Create a new column to have the percent rank number based on 'population' values.
filter(percent_rank(population) < .25)
Select rows whose percent rank numbers are less than .25.
Summary
Rename existing column names.
Syntax
rename(<new_column_name>
= <column>
, ...)
Example
rename(state = X1)
Rename X1 to 'state'.
rename(state = X1, country = X2)
Rename X1 to 'state' and X2 to 'country'.
Overview
Performs a right outer join. Returns all rows from the target dataframe (right side) and all columns from both the current dataframe (left side) and the target dataframe. When a row in the right dataframe has no matching row in the left dataframe, the columns added from the left side will contain NA values. If there are multiple matches, all combinations will be returned.
Syntax
right_join(<target_dataframe>
, by = join_by(<join_condition>
), suffix = c(<source_dataframe_suffix>
, <target_dataframe_suffix>
), na_matches = <NA_handling_method>
, ignore_case = <TRUE|FALSE>
)
Arguments
Complex join conditions with join_by
Examples
Source data: | customer_id | customer_name | order_amount | |-------------|---------------|--------------| | 101 | Smith | 5000 | | 102 | Johnson | 3000 | | 103 | Brown | 7000 |
Customer attributes data: | customer_id | region | age | |-------------|-------------|-----| | 101 | New York | 35 | | 102 | California | 42 | | 104 | Texas | 28 |
right_join(customer_attributes, by = join_by(customer_id == customer_id))
Keeps all rows from the target dataframe (customer_attributes) and joins with the source dataframe using the "customer_id" column as the key. For customer_id=104 which has no match in the source dataframe, customer_name and order_amount columns will contain NA values.
customer_id | customer_name | order_amount | region | age |
---|---|---|---|---|
101 | Smith | 5000 | New York | 35 |
102 | Johnson | 3000 | California | 42 |
104 | NA | NA | Texas | 28 |
Source data (employees): | employee_id | job_code | name | |-------------|----------|----------| | 1 | A01 | Thompson | | 2 | A02 | Wilson | | 3 | NA | Davis |
Job data: | job_code | job_title | |----------|--------------| | A01 | Sales | | A02 | Development | | A03 | Finance | | NA | Unassigned |
right_join(job_data, by = join_by(job_code == job_code), na_matches = "na")
Keeps all rows from the target dataframe (job_data) and joins with the source dataframe using the "job_code" column as the key. NA values are also used as join keys since na_matches="na". For job_code=A03 which has no match in the source dataframe, employee_id and name columns will contain NA values.
employee_id | job_code | name | job_title |
---|---|---|---|
1 | A01 | Thompson | Sales |
2 | A02 | Wilson | Development |
NA | A03 | NA | Finance |
3 | NA | Davis | Unassigned |
Source data (user data): | user_id | name | |---------|-------| | 1 | john | | 2 | MARY | | 3 | David |
Profile data: | user_id | name | age | |---------|-------|-----| | 1 | John | 25 | | 2 | mary | 30 | | 4 | Steve | 35 |
right_join(profile_data, by = join_by(name == name), ignore_case = TRUE)
Keeps all rows from the target dataframe (profile_data) and joins with the source dataframe using the "name" column as the key, ignoring case differences. For name=Steve (user_id=4) which has no match in the source dataframe, the source user_id column will contain an NA value.
user_id.x | name | user_id.y | age |
---|---|---|---|
1 | john | 1 | 25 |
2 | MARY | 2 | 30 |
NA | Steve | 4 | 35 |
Source data (product data): | product_id | product_name | price | |------------|--------------|-------| | P001 | Laptop | 80000 | | P002 | Mouse | 3000 | | P003 | Keyboard | 5000 |
Sale period data: | product_id | sale_start_date | sale_end_date | discount_rate | |------------|-----------------|---------------|---------------| | P001 | 2023-06-01 | 2023-06-15 | 10 | | P001 | 2023-07-01 | 2023-07-15 | 15 | | P002 | 2023-06-10 | 2023-06-20 | 20 | | P004 | 2023-06-01 | 2023-06-30 | 25 |
right_join(sale_period_data, by = join_by(product_id == product_id, between(as.Date("2023-06-05"), sale_start_date, sale_end_date)))
Keeps all rows from the target dataframe (sale_period_data) and joins with matching products where "2023-06-05" falls within the sale period. For rows that don't satisfy the condition or have no matching product_id, product_name and price columns will contain NA values.
product_id | product_name | price | sale_start_date | sale_end_date | discount_rate |
---|---|---|---|---|---|
P001 | Laptop | 80000 | 2023-06-01 | 2023-06-15 | 10 |
P001 | NA | NA | 2023-07-01 | 2023-07-15 | 15 |
P002 | NA | NA | 2023-06-10 | 2023-06-20 | 20 |
P004 | NA | NA | 2023-06-01 | 2023-06-30 | 25 |
Source data (order data): | order_id | order_date | amount | |----------|------------|--------| | 1001 | 2023-01-15 | 3000 | | 1002 | 2023-02-20 | 5000 | | 1003 | 2023-03-10 | 8000 |
Exchange rate data: | rate_date | usd_jpy_rate | |------------|--------------| | 2023-01-01 | 130.5 | | 2023-02-01 | 132.0 | | 2023-03-01 | 133.5 | | 2023-04-01 | 135.2 |
right_join(exchange_rate_data, by = join_by(closest(order_date >= rate_date)))
Keeps all rows from the target dataframe (exchange_rate_data) and joins with the closest order that is on or after each rate date. For rate_date=2023-04-01 which has no matching order date, order_id and amount columns will contain NA values.
order_id | order_date | amount | rate_date | usd_jpy_rate |
---|---|---|---|---|
1001 | 2023-01-15 | 3000 | 2023-01-01 | 130.5 |
1002 | 2023-02-20 | 5000 | 2023-02-01 | 132.0 |
1003 | 2023-03-10 | 8000 | 2023-03-01 | 133.5 |
NA | NA | NA | 2023-04-01 | 135.2 |
Source data (meeting schedule): | meeting_id | start_time | end_time | room | |------------|----------------------|----------------------|----------| | M001 | 2023-06-01 10:00 | 2023-06-01 11:30 | Room A | | M002 | 2023-06-01 13:00 | 2023-06-01 14:00 | Room B | | M003 | 2023-06-02 09:30 | 2023-06-02 12:00 | Room A |
Reservation data: | reservation_id | start_time | end_time | reserved_by | |----------------|----------------------|----------------------|-------------| | R001 | 2023-06-01 09:00 | 2023-06-01 10:30 | Smith | | R002 | 2023-06-01 11:00 | 2023-06-01 12:00 | Johnson | | R003 | 2023-06-01 13:30 | 2023-06-01 15:00 | Brown | | R004 | 2023-06-02 10:00 | 2023-06-02 11:00 | Wilson | | R005 | 2023-06-03 10:00 | 2023-06-03 12:00 | Taylor |
right_join(reservation_data, by = join_by(overlaps(start_time, end_time, start_time, end_time)))
Keeps all rows from the target dataframe (reservation_data) and joins with meetings that overlap with reservation times. For reservation_id=R005 which has no overlapping meeting, meeting_id and room columns will contain NA values.
meeting_id | start_time | end_time | room | reservation_id | reserved_by |
---|---|---|---|---|---|
M001 | 2023-06-01 10:00 | 2023-06-01 11:30 | Room A | R001 | Smith |
M001 | 2023-06-01 10:00 | 2023-06-01 11:30 | Room A | R002 | Johnson |
M002 | 2023-06-01 13:00 | 2023-06-01 14:00 | Room B | R003 | Brown |
M003 | 2023-06-02 09:30 | 2023-06-02 12:00 | Room A | R004 | Wilson |
NA | NA | NA | NA | R005 | Taylor |
Source data (store hours): | store_id | opening_time | closing_time | |----------|--------------|--------------| | S001 | 09:00 | 21:00 | | S002 | 10:00 | 22:00 | | S003 | 08:30 | 20:00 |
Event data: | event_id | start_time | end_time | description | |----------|------------|----------|---------------| | E001 | 08:00 | 10:00 | Morning Mtg | | E002 | 11:00 | 13:00 | Seminar | | E003 | 15:00 | 19:00 | Exhibition | | E004 | 19:30 | 22:30 | Party |
right_join(event_data, by = join_by(within(opening_time, closing_time, start_time, end_time)))
Keeps all rows from the target dataframe (event_data) and joins with stores where events completely fit within store hours. For events that don't fit within any store's hours (e.g., E001 and E004), store_id, opening_time, and closing_time columns will contain NA values.
store_id | opening_time | closing_time | event_id | start_time | end_time | description |
---|---|---|---|---|---|---|
NA | NA | NA | E001 | 08:00 | 10:00 | Morning Mtg |
S001 | 09:00 | 21:00 | E002 | 11:00 | 13:00 | Seminar |
S002 | 10:00 | 22:00 | E002 | 11:00 | 13:00 | Seminar |
S003 | 08:30 | 20:00 | E002 | 11:00 | 13:00 | Seminar |
S001 | 09:00 | 21:00 | E003 | 15:00 | 19:00 | Exhibition |
S002 | 10:00 | 22:00 | E003 | 15:00 | 19:00 | Exhibition |
S003 | 08:30 | 20:00 | E003 | 15:00 | 19:00 | Exhibition |
NA | NA | NA | E004 | 19:30 | 22:30 | Party |
Summary
Returns row numbers. Equivalent to rank with ties broken by row order.
Syntax
row_number()
Arguments
Return Value
Numeric
Example
mutate(rowid = row_number())
Returns a unique row number for each row.
Summary
Sample a given fraction of rows from the data frame.
Syntax
sample_frac(<number>
, weight=<column_num>
, replace=<logical>
)
Arguments
Example
sample_frac(.2)
Randomly select 20% of the rows.
sample_frac(.2, weight = ARR_DELAY)
Randomly select 20% of the rows, weighted by the ARR_DELAY column.
Summary
Sample n rows from the data frame.
Syntax
sample_n(<number>
, weight=<column_num>
, replace=<logical>
)
Arguments
Example
sample_n(100)
Randomly select 100 rows.
sample_n(100, weight = ARR_DELAY)
Randomly select 100 rows, weighted by the ARR_DELAY column.
Summary
Select column(s) by column names.
Syntax
select(<column(s)>
,...)
select(-<column(s)>
,...)
select(starts_with(<text>
, ignore.case = <logical>
))
select(ends_with(<text>
, ignore.case = <logical>
))
select(contains(<text>
, ignore.case = <logical>
))
select(matches(<text>
, ignore.case = <logical>
))
select(num_range(<text>
, <start_num>:<end_num>
))
select(any_of(<text1>
, <text2>
, ...))
select(<column>
, everything())
Arguments
Used Together
starts_with, ends_with, contains, matches, num_range, any_of, everything
Example
select(X, Y)
Select column X and Y.
select(-X, -Y)
Remove columns X and Y.
select(starts_with("X", ignore.case = TRUE))
Select only columns whose name starts with "X"
select(ends_with("ABR", ignore.case = TRUE))
Select only columns whose name ends with "ABR"
select(contains("ABR", ignore.case = TRUE))
Select only columns whose name contains "ABR"
select(matches("[:digit:]+", ignore.case = TRUE))
Select only columns whose names contain digits (regular expression)
select(num_range("X", 1:6))
Selects all columns (numerically) from X1 to X6.
select(any_of(c("x", "y", "z")))
Selects any of the listed columns that exist, without erroring when some are missing.
select(X, everything())
Reorder columns: keep the column "X" in the front
select_if(is.numeric)
Select only numeric columns.
Overview
Returns only those rows from the current dataframe that have matching values in the target dataframe. Only columns from the current dataframe are returned; columns from the target dataframe are not included. Unlike inner_join, duplicates do not occur even if there are multiple matching rows.
Syntax
semi_join(<target_dataframe>
, by = join_by(<join_condition>
))
Arguments
Complex join conditions with join_by
Examples
Source data: | customer_id | customer_name | order_amount | |-------------|---------------|--------------| | 101 | Smith | 5000 | | 102 | Johnson | 3000 | | 103 | Brown | 7000 |
Customer attributes data: | customer_id | region | age | |-------------|-------------|-----| | 101 | New York | 35 | | 102 | California | 42 | | 104 | Texas | 28 |
semi_join(customer_attributes, by = join_by(customer_id == customer_id))
From the source dataframe, returns only rows where "customer_id" has a match in the target dataframe. Columns from the target dataframe are not included.
customer_id | customer_name | order_amount |
---|---|---|
101 | Smith | 5000 |
102 | Johnson | 3000 |
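To illustrate the note above about duplicates: if the target dataframe hypothetically contained two rows for customer_id=101, the two joins would differ as follows.
semi_join(customer_attributes, by = join_by(customer_id == customer_id))
Still returns a single row for Smith, because semi_join only checks for the existence of a match.
inner_join(customer_attributes, by = join_by(customer_id == customer_id))
Returns two rows for Smith, one for each matching row in the target dataframe.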
Source data (employees): | employee_id | job_code | name | |-------------|----------|----------| | 1 | A01 | Thompson | | 2 | A02 | Wilson | | 3 | B01 | Davis | | 4 | C01 | Brown |
Job data: | job_code | job_title | |----------|--------------| | A01 | Sales | | A02 | Development | | B01 | Finance |
semi_join(job_data, by = join_by(job_code == job_code))
From the source dataframe, returns only rows where "job_code" has a match in the target dataframe. Job code "C01" doesn't exist in the job data, so it's excluded from the results.
employee_id | job_code | name |
---|---|---|
1 | A01 | Thompson |
2 | A02 | Wilson |
3 | B01 | Davis |
Source data (product data): | product_id | product_name | price | |------------|--------------|-------| | P001 | Laptop | 80000 | | P002 | Mouse | 3000 | | P003 | Keyboard | 5000 |
Sale period data: | product_id | sale_start_date | sale_end_date | discount_rate | |------------|-----------------|---------------|---------------| | P001 | 2023-06-01 | 2023-06-15 | 10 | | P001 | 2023-07-01 | 2023-07-15 | 15 | | P002 | 2023-06-10 | 2023-06-20 | 20 | | P004 | 2023-06-01 | 2023-06-30 | 25 |
semi_join(sale_period_data, by = join_by(product_id == product_id, between(as.Date("2023-06-05"), sale_start_date, sale_end_date)))
From the source dataframe, returns only rows where "product_id" matches in the target dataframe and "2023-06-05" falls within the sale period. The period 2023-06-01 to 2023-06-15 for product_id=P001 satisfies this condition.
product_id | product_name | price |
---|---|---|
P001 | Laptop | 80000 |
Source data (order data): | order_id | order_date | amount | |----------|------------|--------| | 1001 | 2023-01-15 | 3000 | | 1002 | 2023-02-20 | 5000 | | 1003 | 2023-03-10 | 8000 | | 1004 | 2023-04-05 | 4000 |
Exchange rate data: | rate_date | usd_jpy_rate | |------------|--------------| | 2023-01-01 | 130.5 | | 2023-02-01 | 132.0 | | 2023-03-01 | 133.5 |
semi_join(exchange_rate_data, by = join_by(closest(order_date >= rate_date)))
From the source dataframe, returns only rows that meet the condition with rows in the target dataframe. Extracts orders that have a closest rate date that is on or before the order date. Order with id=1004 is excluded because there's no corresponding rate date.
order_id | order_date | amount |
---|---|---|
1001 | 2023-01-15 | 3000 |
1002 | 2023-02-20 | 5000 |
1003 | 2023-03-10 | 8000 |
Source data (meeting schedule): | meeting_id | start_time | end_time | room | |------------|----------------------|----------------------|----------| | M001 | 2023-06-01 10:00 | 2023-06-01 11:30 | Room A | | M002 | 2023-06-01 13:00 | 2023-06-01 14:00 | Room B | | M003 | 2023-06-02 09:30 | 2023-06-02 12:00 | Room A | | M004 | 2023-06-03 10:00 | 2023-06-03 11:00 | Room C |
Reservation data: | reservation_id | start_time | end_time | reserved_by | |----------------|----------------------|----------------------|-------------| | R001 | 2023-06-01 09:00 | 2023-06-01 10:30 | Smith | | R002 | 2023-06-01 11:00 | 2023-06-01 12:00 | Johnson | | R003 | 2023-06-01 13:30 | 2023-06-01 15:00 | Brown | | R004 | 2023-06-02 10:00 | 2023-06-02 11:00 | Wilson |
semi_join(reservation_data, by = join_by(overlaps(start_time, end_time, start_time, end_time)))
From the source dataframe, returns only rows that meet the condition with rows in the target dataframe. Extracts meetings that have overlapping times with reservations. Meeting with id=M004 is excluded because it doesn't overlap with any reservation.
meeting_id | start_time | end_time | room |
---|---|---|---|
M001 | 2023-06-01 10:00 | 2023-06-01 11:30 | Room A |
M002 | 2023-06-01 13:00 | 2023-06-01 14:00 | Room B |
M003 | 2023-06-02 09:30 | 2023-06-02 12:00 | Room A |
Source data (store hours): | store_id | opening_time | closing_time | |----------|--------------|--------------| | S001 | 09:00 | 21:00 | | S002 | 10:00 | 22:00 | | S003 | 08:30 | 20:00 | | S004 | 07:00 | 15:00 |
Event data: | event_id | start_time | end_time | description | |----------|------------|----------|---------------| | E001 | 08:00 | 10:00 | Morning Mtg | | E002 | 11:00 | 13:00 | Seminar | | E003 | 15:00 | 19:00 | Exhibition |
semi_join(event_data, by = join_by(within(opening_time, closing_time, start_time, end_time)))
From the source dataframe, returns only rows that meet the condition with rows in the target dataframe. Extracts stores where at least one event fits completely within the store's opening hours; store S004 is excluded in this result.
store_id | opening_time | closing_time |
---|---|---|
S001 | 09:00 | 21:00 |
S002 | 10:00 | 22:00 |
S003 | 08:30 | 20:00 |
Summary
Select rows by position. This is not supported for relational databases, in which case you should use filter() with the row_number() function.
Syntax
slice(<numeric>
)
slice(<numeric>:<numeric>
)
Arguments
Example
slice(1) : equivalent to -> filter(row_number() == 1)
Select only the first row.
slice(5:n()) : equivalent to -> filter(between(row_number(), 5, n()))
Select only rows starting from the 5th to the last.
slice(n()) : equivalent to -> filter(row_number() == n())
Select only the last row
slice((n()-9):n())
Select only the last 10 rows
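On relational databases, where slice() is not supported, the same last-10-rows selection can be written with filter(). A sketch:
filter(row_number() > n() - 10)
Select only the last 10 rows.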
Summary
Summarize column values to a single value by using aggregate functions. Make sure you have already grouped the data frame using group_by().
Syntax
summarize(<new_column>
= <aggregate_expression>
, ...)
Example
summarise(avg = mean(revenue))
Create a new column 'avg' to store average values of 'revenue' for each group.
summarize(avg = mean(revenue), total = sum(revenue))
Create new columns 'avg' for 'average of revenue' and 'total' for 'sum of revenue' for each group.
Summary
Summarize the data by either calling n() or sum() on top of the grouped data. The group_by() operation should have been done beforehand.
Syntax
tally(wt = <column>
, sort = <logical>
)
Arguments
Example
tally()
Returns the number of rows for each group.
tally(wt = ARR_DELAY)
Returns the sum total of 'ARR_DELAY' for each group.
tally(wt = ARR_DELAY, sort = TRUE)
Returns the sum total of 'ARR_DELAY' and sorted by the total number.
Summary
Select the top or bottom N data points within each group based on a specified column.
This function allows you to filter the dataset to keep only the top or bottom N data points for each group according to a given measure.
Syntax
top_n(<number>
, <column_num>
)
top_n(<number>
, desc(<column_num>
))
Arguments
Used Together
desc
Example
Original Data
Country | Population (millions) | GDP (trillion yen) |
---|---|---|
Japan | 126 | 500 |
USA | 331 | 2100 |
China | 1400 | 1500 |
India | 1380 | 300 |
Germany | 83 | 400 |
top_n(3, `Population (millions)`)
Select the top 3 countries with the highest population.
Country | Population (millions) | GDP (trillion yen) |
---|---|---|
China | 1400 | 1500 |
India | 1380 | 300 |
USA | 331 | 2100 |
top_n(-2, `Population (millions)`)
Select the bottom 2 countries with the lowest population. This is equivalent to top_n(2, desc(`Population (millions)`)).
Country | Population (millions) | GDP (trillion yen) |
---|---|---|
Germany | 83 | 400 |
Japan | 126 | 500 |
group_by(Region) %>% top_n(1, `Sales (10,000 yen)`)
Select the store with the highest sales in each region.
Region | Store | Sales (10,000 yen) |
---|---|---|
Tokyo | Store C | 1500 |
Osaka | Store D | 1100 |
Nagoya | Store G | 900 |
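Note that top_n() keeps all tied rows, so more than N rows can be returned when there are ties. In recent dplyr versions, slice_max() is the recommended alternative; a sketch using the same hypothetical columns:
group_by(Region) %>% slice_max(`Sales (10,000 yen)`, n = 1)
Select the store with the highest sales in each region; ties are also kept by default.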
Summary
Transmute adds new columns and drops existing columns.
Syntax
transmute(<new_column>
= <expression>
, ...)
Example
transmute(profit = revenue - expense)
Create 'profit' column based on the calculation result of 'revenue - expense', and drop 'revenue' and 'expense' columns.
Summary
Ungroup existing grouping
Syntax
ungroup()
Example
ungroup()
Ungroup the existing groups.
Summary
Rounds date/time up to the nearest integer value of the specified time unit.
Syntax
ceiling_date(<column_date>
, unit = <unit>
)
Arguments
Return Value
POSIXct
Example
# A <- ymd_hms("2015-10-01 06:10:15")
ceiling_date(A, "second")
returns "2015-10-01 06:10:15 UTC"
ceiling_date(A, "minute")
returns "2015-10-01 06:11:00 UTC"
ceiling_date(A, "hour")
returns "2015-10-01 07:00:00 UTC"
ceiling_date(A, "day")
returns "2015-10-02 UTC"
ceiling_date(A, "week")
returns "2015-10-04 UTC"
ceiling_date(A, "month")
returns "2015-11-01 UTC"
ceiling_date(A, "quarter")
returns "2016-01-01 UTC"
ceiling_date(A, "year")
returns "2016-01-01 UTC"
Summary
Converts numeric data calculated as a fraction of the year to date data type.
Syntax
date_decimal(<column_num>
, tz = <timezone>
)
Arguments
Return Value
POSIXct
Example
date_decimal(2015.953)
Returns "2015-12-14 20:16:47 UTC"
Summary
Extract Day
Syntax
day(<column_date>
)
Return Value
Numeric
Example
day("2015-10-01 06:15:30")
returns 1
Summary
Convert a number to day periods so you can use it for further calculations.
Syntax
days(<column_num>
)
Return Value
Period
Example
days(1)
returns "1d 0H 0M 0S"
ymd_hms("2015-10-01 06:15:30") + days(1)
returns "2015-10-02 06:15:30 UTC"
Summary
Converts date data to numeric data type by calculating the date to a fraction of the year.
Syntax
decimal_date(<column_date>
)
Arguments
Return Value
Numeric
Example
decimal_date(ymd("2015-12-15"))
Returns 2015.953
Summary
Convert Character or Number to Date / Period when data contains only Date, but not Time.
Syntax
dmy(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
Date
Example
dmy("10-01-2015")
returns "2015-10-01"
dmy("10/01/2015")
returns "2015-10-01"
dmy("Created on 10 1 2015")
returns "2015-10-01"
dmy("1 Oct, 2015", locale = "English")
returns "2015-10-01"
Summary
Convert Character or Number to Date / Period when data contains Date and Time.
Syntax
dmy_h(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
POSIXct
Example
dmy_h("01-10-2015 06:10")
returns "2015-10-01 06:10:00 UTC"
dmy_h("Created on 1 10 2015, at 6:10AM")
returns "2015-10-01 06:10:00 UTC"
dmy_h("10-01-2015 06", truncated = 1)
returns "2015-10-01 06:00:00 UTC"
dmy_h("Oct 01 2015 06", locale = "English")
returns "2015-10-01 06:00:00 UTC"
Summary
Convert Character or Number to Date / Period when data contains Date and Time.
Syntax
dmy_hm(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
POSIXct
Example
dmy_hm("01-10-2015 06:10")
returns "2015-10-01 06:10:00 UTC"
dmy_hm("Created on 1 10 2015, at 6:10AM")
returns "2015-10-01 06:10:00 UTC"
dmy_hm("01-10-2015 06", truncated = 1)
returns "2015-10-01 06:00:00 UTC"
dmy_hm("01 Oct 2015 06:00",locale = "English)
returns "2015-10-01 06:00:00 UTC"
Summary
Convert Character or Number to Date / Period when data contains Date and Time.
Syntax
dmy_hms(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
POSIXct
Example
dmy_hms("01-10-2015 06:10:15")
returns "2015-10-01 06:10:15 UTC"
dmy_hms("01/10/2015 06:10:15")
returns "2015-10-01 06:10:15 UTC"
dmy_hms("Created on 1 10 2015, at 06:10:15 AM")
returns "2015-10-01 06:10:15 UTC"
dmy_hms("01-10-2015 06")
returns NA, because there is not enough data for Date and Time parsing.
dmy_hms("01-10-2015 06", truncated = 2)
returns "2015-10-01 06:00:00 UTC" as expected
dmy_hms("01 Oct 2015 06:00:00", locale = "English")
returns "2015-10-01 06:00:00 UTC"
Summary
Create a duration
Syntax
duration(<column_num>
, units = <units>
)
Arguments
Return Value
Duration
Example
duration(1.5, "minutes")
returns 90s
duration(mins = 1.5)
returns 90s
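Because a duration is an exact number of seconds, it can be added directly to date/time values. A sketch:
ymd_hms("2015-10-01 06:15:30") + duration(90, "seconds")
returns "2015-10-01 06:17:00 UTC"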
Summary
Convert Character or Number to Date / Period when data contains only Date, but not Time.
Syntax
dym(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
Date
Example
dym("01-2015-10")
returns "2015-10-01"
dym("01/2015/10")
returns "2015-10-01"
dym("Created on 1 2015 10")
returns "2015-10-01"
dym("1 2015, Oct", locale = "English")
returns "2015-10-01"
Summary
Rounds date/time down to the nearest integer value of the specified time unit.
Syntax
floor_date(<column_date>
, unit = <unit>
)
Arguments
Return Value
POSIXct
Example
# A <- ymd_hms("2015-10-01 06:10:15")
floor_date(A, "second")
returns "2015-10-01 06:10:15 UTC"
floor_date(A, "minute")
returns "2015-10-01 06:10:00 UTC"
floor_date(A, "hour")
returns "2015-10-01 06:00:00 UTC"
floor_date(A, "day")
returns "2015-10-01 UTC"
floor_date(A, "week")
returns "2015-09-27 UTC"
floor_date(A, "month")
returns "2015-10-01 UTC"
floor_date(A, "quarter")
returns "2015-10-01 UTC"
floor_date(A, "year")
returns "2015-01-01 UTC"
Summary
Re-registers the data with a given timezone.
Syntax
force_tz(<column_date>
, tzone = <timezone>
)
Arguments
Return Value
POSIXct
Example
as.POSIXct("2014-01-01 00:15:00", tz = "UTC")
Returns "2014-01-01 00:15:00 UTC".
force_tz(as.POSIXct("2014-01-01 00:15:00", tz = "UTC"), tzone = "America/Los_Angeles")
Returns "2014-01-01 00:15:00 PST". Notice that the timezone is now "PST" instead of the original "UTC", while the clock time is unchanged.
Summary
Get current time in your local timezone
Syntax
here()
Return Value
POSIXct
Example
here()
returns "2015-12-31 00:04:59 PST"
Summary
Convert text to Period data type if a given column holds values that look like Hours and Minutes
Syntax
hm(<column>
, tz = <timezone>
)
Arguments
Return Value
Period
Example
hm("06:10")
returns "6H 10M 0S"
hm("06:10", "UTC")
returns "6H 10M 0S"
Summary
Create hms data. It is a difftime stored as seconds.
Syntax
hms(seconds = <column_num>
, minutes = <column_num>
, hours = <column_num>
, days = <column_num>
)
Arguments
Return Value
hms
Example
hms(seconds = 20, minutes = 1, hours = 5)
returns 20 + 1 * 60 + 5 * 60 * 60 = 18080 seconds difftime
Summary
Extract hour
Syntax
hour(<column_date>
)
Return Value
Integer
Example
hour("2015-10-01 06:15:30")
returns 6
Summary
Convert a number to hour periods so you can use it for further calculations.
Syntax
hours(<column_num>
)
Return Value
Period
Example
hours(1)
returns "1H 0M 0S"
ymd_hms("2015-10-01 06:15:30") + hours(1)
returns "2015-10-01 07:15:30 UTC"
Summary
Return the interval between two dates. You can get the duration from the result either by dividing it by a given unit such as days or weeks, or by converting it to numeric data type with as.numeric().
Syntax
interval(<column_date>
, <column_date>
)
Return Value
Interval
Example
interval("2015-10-01", "2015-10-18")
returns 2015-10-01 UTC--2015-10-18 UTC
interval("2015-10-01", "2015-10-18") %/% days(1)
returns 17
as.numeric(interval("2015-10-01", "2015-10-18"), units = "days")
returns 17
interval("2015-10-01", "2015-10-18") %/% weeks(1)
returns 2
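interval() can also be used inside mutate() on date columns. A minimal sketch, assuming hypothetical 'start_date' and 'end_date' columns:
mutate(days_between = interval(start_date, end_date) %/% days(1))
Create a 'days_between' column holding the number of whole days between the two dates.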
Summary
Returns true if a given object is Date type.
Syntax
is.Date
Arguments
Return Value
Logical
Example
mutate_if(is.Date, wday)
Extract day of the week information from Date columns.
mutate_if(is.Date, funs(week_day = wday))
Extract day of the week information from Date columns and create new columns.
mutate_if(is.Date, wday, label = TRUE)
Extract day of the week information using wday function with 'label = TRUE' argument from Date columns.
Summary
Returns true if a given object is difftime type.
Syntax
is.difftime
Arguments
Return Value
Logical
Example
mutate_if(is.difftime, as.numeric)
Convert difftime type columns to numeric type.
Summary
Returns true if a given object is duration type.
Syntax
is.duration
Arguments
Return Value
Logical
Example
mutate_if(is.duration, as.numeric)
Convert duration type columns to numeric type.
Summary
Returns true if a given object is POSIXct type.
Syntax
is.POSIXct
Arguments
Return Value
Logical
Example
mutate_if(is.POSIXct, wday)
Extract day of the week information from POSIXct columns.
mutate_if(is.POSIXct, funs(week_day = wday))
Extract day of the week information from POSIXct columns and create new columns.
mutate_if(is.POSIXct, wday, label = TRUE)
Extract day of the week information using the wday function with the 'label = TRUE' argument from POSIXct columns.
Summary
Extract week numbers of the year. Weeks start from Monday.
Syntax
isoweek(<column_date>
)
Return Value
Numeric
Example
isoweek(ymd(c("2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04", "2020-01-05", "2020-01-06", "2020-01-07", "2020-01-08")))
returns 1,1,1,1,1,2,2,2 (2020-01-06 is Monday)
Summary
Extract Day of Month
Syntax
mday(<column_date>
)
Return Value
Numeric
Example
mday("2015-10-01 06:15:30")
returns 1
Summary
Convert Character or Number to Date / Period when data contains only Date, but not Time.
Syntax
mdy(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
Date
Example
mdy("01-10-2015")
returns "2015-10-01"
mdy("01/10/2015")
returns "2015-10-01"
mdy("Created on 1 10 2015")
returns "2015-10-01"
mdy("Oct, 1 2015", locale = "English")
returns "2015-10-01"
Summary
Convert Character or Number to Date / Period when data contains Date and Time.
Syntax
mdy_h(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
POSIXct
Example
mdy_h("10-01-2015 06")
returns "2015-10-01 06:00:00 UTC"
mdy_h("Created on 10 1 2015 6 AM")
returns "2015-10-01 06:00:00 UTC"
mdy_h("Oct 1, 2015 06 AM", locale = "English")
returns "2015-10-01 06:00:00 UTC"
Summary
Convert Character or Number to Date / Period when data contains Date and Time.
Syntax
mdy_hm(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
POSIXct
Example
mdy_hm("10-01-2015T06:10")
returns "2015-10-01 06:10:00 UTC"
mdy_hm("Created on 10 1 2015, at 6:10 AM")
returns "2015-10-01 06:10:00 UTC"
mdy_hm("Oct 1, 2015 06:10 AM", locale = "English")
returns "2015-10-01 06:10:00 UTC"
Summary
Convert Character or Number to Date / Period when data contains Date and Time.
Syntax
mdy_hms(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
POSIXct
Example
mdy_hms("10-01-2015T06:10:15")
returns "2015-10-01 06:10:15 UTC"
mdy_hms("10/01/2015T06:10:15")
returns "2015-10-01 06:10:15 UTC"
mdy_hms("Created on 10 1 2015")
returns "2015-10-01 06:10:15 UTC"
mdy_hms("10-01-2015 06")
returns NA, because there is not enough data for Date and Time parsing.
mdy_hms("10-01-2015 06", truncated = 2)
returns "2015-10-01 06:00:00 UTC" as expected
mdy_hms("Oct 1, 2015 06:10:15 AM", locale = "English")
returns "2015-10-01 06:10:15 UTC"
Summary
Convert a number to milliseconds periods so you can use it for further calculations.
Syntax
milliseconds(<column_num>
)
Return Value
Period
Example
milliseconds(1)
returns "0.001S"
ymd_hms("2015-10-01 06:15:30") + milliseconds(1)
returns "2015-10-01 06:15:30 UTC" (the added millisecond is below the default display precision)
Summary
Extract minute
Syntax
minute(<column_date>
)
Return Value
Integer
Example
minute("2015-10-01 06:15:30")
returns 15
Summary
Convert a number to minute periods so you can use it for further calculations.
Syntax
minutes(<column_num>
)
Return Value
Period
Example
minutes(1)
returns "1M 0S"
ymd_hms("2015-10-01 06:15:30") + minutes(1)
returns "2015-10-01 06:16:30 UTC"
Summary
Extract Month
Syntax
month(<column_date>
, label = <logical>
, abbr = <logical>
, locale = <locale>
)
Arguments
Return Value
Numeric or Text
Example
month("2015-10-01 06:15:30")
returns 10
month("2015-10-01 06:15:30", label = TRUE)
returns October
month("2015-10-01 06:15:30", label = TRUE, abbr = TRUE)
returns Oct
Summary
Convert text to Period data type if a given column holds values that look like Minutes and Seconds
Syntax
ms(<column>
, tz = <timezone>
)
Arguments
Return Value
Period
Example
ms("10:15")
returns "10M 15S"
ms("10:15", "UTC")
returns "10M 15S"
Summary
Convert Character or Number to Date / Period when data contains only Date, but not Time.
Syntax
myd(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
Date
Example
myd("01-2015-10")
returns "2015-01-10"
myd("01/2015/10")
returns "2015-01-10"
myd("Created on 1 2015 10")
returns "2015-01-10"
myd("Oct, 2015 1 ", locale = "English")
returns "2015-10-01"
Summary
Returns current date and time. A similar function 'today()' returns only date.
Syntax
now(tzone = <timezone>
)
Arguments
Return Value
POSIXct
Example
now()
returns "2015-12-31 00:04:59 PST"
now("GMT")
returns "2015-12-31 08:04:59 GMT"
now("America/Los_Angeles")
returns "2015-12-31 00:04:59 PST"
Summary
Convert data to Date/Time data type.
Syntax
parse_date_time(<column>
, orders = <date_time_order>
, locale = <locale>
, tz = <timezone>
, truncated = <integer>
, quiet = <logical>
, exact = <logical>
, lt = <logical>
)
Arguments
orders - Specify the order of the date/time components in the original data (e.g. "ymd", "mdy HM").
locale - The default is the session default locale. Sets the date/time related configuration like time zone, encoding, decimal mark, and day/month names.
tz (Optional) - Specify which time zone to parse the date with. You can find a list of time zones.
truncated (Optional) - The default is 0. Number of components that can be missing.
quiet (Optional) - The default is FALSE. If TRUE, it suppresses the error when the format is wrong.
exact (Optional) - The default is FALSE. If TRUE, the orders parameter must be an explicit format (e.g. "%m/%d/%y %I:%M").
lt (Optional) - The default is FALSE. If FALSE, it returns POSIXct. If TRUE, it returns POSIXlt.
Return Value
POSIXct
Example
parse_date_time("03/15/05 1:30 PM", "mydHMp")
Returns "2015-03-05 13:30:00 PDT"
lubridate::parse_date_time("1:30 PM", "HMp")
Returns "0000-01-01 13:30:00 PDT. The default to each parameters are 0 and it always returns POSIXct"
Summary
Extract Day of Quarter
Syntax
qday(<column_date>
)
Return Value
Numeric
Example
qday("2015-10-01 06:15:30")
returns 1, because October 1st is the first day of the fourth quarter
Summary
Extract Quarter
Syntax
quarter(<column_date>
)
Return Value
Numeric
Example
quarter("2015-08-01 06:15:30")
returns 3
quarter("2015-06-16 07:42:29")
returns 2
Summary
Rolls a date back to the last day of the previous month or, optionally, to the first day of the current month.
Syntax
rollback(<column_date>
, roll_to_first = <logical>
, preserve_hms = <logical>
)
Arguments
Return Value
POSIXct
Example
rollback("2015-03-03 06:15:30")
returns "2010-02-28 06:15:30 UTC"
rollback("2015-03-03 06:15:30", roll_to_first = TRUE)
returns "2010-03-01 06:15:30 UTC"
rollback("2015-03-03 06:15:30", preserve_hms = FALSE)
returns "2010-02-28 UTC"
Summary
Rounds date/time to the nearest integer value of the specified time unit.
Syntax
round_date(<column_date>
, unit = <unit>
)
Arguments
Return Value
POSIXct
Example
round_date("2015-10-01 06:10:15 UTC", "second")
returns "2015-10-01 06:10:15 UTC"
round_date("2015-10-01 06:10:15 UTC", "minute")
returns "2015-10-01 06:10:00 UTC"
round_date("2015-10-01 06:10:15 UTC", "hour")
returns "2015-10-01 06:00:00 UTC"
round_date("2015-10-01 06:10:15 UTC", "day")
returns "2015-10-01 UTC"
round_date("2015-10-01 06:10:15 UTC", "week")
returns "2015-10-04 UTC"
round_date("2015-10-01 06:10:15 UTC", "month")
returns "2015-10-01 UTC"
round_date("2015-10-01 06:10:15 UTC", "quarter")
returns "2015-10-01 UTC"
round_date("2015-10-01 06:10:15 UTC", "year")
returns "2015-01-01 UTC"
Summary
Extract second
Syntax
second(<column_date>
)
Return Value
Integer
Example
second("2015-10-01 06:15:30")
returns 30
Summary
Convert a number to second periods so you can use it for further calculations.
Syntax
seconds(<column_num>
)
Return Value
Period
Example
seconds(1)
returns "1S"
ymd_hms("2015-10-01 06:15:30") + seconds(1)
returns "2015-10-01 06:15:31 UTC"
Summary
Calculate the exact time length between two dates.
Syntax
time_length(<interval_or_duration>
, unit = <unit>
)
Arguments
Return Value
Numeric
Example
time_length(interval("2015-10-01", "2015-10-18"), "day")
returns 17
time_length(interval("2015-10-01", "2015-10-18"), "month")
returns 0.5483871
trunc(time_length(interval("2015-10-01", "2015-10-18"), "month"))
returns 0
round(time_length(interval("2015-10-01", "2015-10-18"), "month"))
returns 1
Summary
Returns current date without time. A similar function 'now()' returns date including time.
Syntax
today(tzone = <timezone>
)
Arguments
Return Value
Date
Example
today("GMT")
returns "2015-12-31"
today("America/Los_Angeles")
returns "2015-12-31"
Summary
Extract day of the week as a decimal number (01-07, Sunday is 1) or as text (e.g. Sunday). The text can be either full or abbreviated, e.g. Sunday vs. Sun.
Syntax
wday(<column_date>
, label = <logical>
, abbr = <logical>
, week_start = <number>
, locale = <locale>
)
Arguments
Return Value
Numeric or Text
Example
wday("2015-10-01 06:15:30")
returns 5
# A <- ymd_hms("2015-10-01 06:15:30")
wday(A, label = TRUE)
returns Thu
wday(A, label = TRUE, abbr = FALSE)
returns Thursday
Summary
Extract week numbers of the year. The 1st day of the 1st week always starts from January 1st regardless of the day of the week.
Syntax
week(<column_date>
)
Return Value
Numeric
Example
week(ymd(c("2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04", "2020-01-05", "2020-01-06", "2020-01-07", "2020-01-08")))
returns 1,1,1,1,1,1,1,2
Summary
Convert a number to week periods so you can use it for further calculations.
Syntax
weeks(<column_num>
)
Return Value
Period
Example
weeks(1)
returns "7d 0H 0M 0S"
ymd_hms("2015-10-01 06:15:30") + weeks(1)
returns "2015-10-08 06:15:30 UTC"
Summary
Returns Date / Time in a given time zone
Syntax
with_tz(<column_date>
, tzone = <timezone>
)
Arguments
Return Value
POSIXct
Example
with_tz(as.POSIXct("2014-01-01 00:15:00", tz = "UTC"), tzone = "America/Los_Angeles")
Returns "2013-12-31 16:15:00 PST". The original data is registered in UTC, and with_tz() function is returning it in Pacific Timezone.
with_tz(as.POSIXct("2014-01-01 00:15:00", tz = "UTC"))
Returns "2013-12-31 16:15:00 PST". with_tz() function returns the value in the default timezone (e.g. "America/Los_Angeles").
Summary
Extract Day of Year
Syntax
yday(<column_date>
)
Return Value
Numeric
Example
yday("2015-10-01 06:15:30")
returns 274
Summary
Convert Character or Number to Date / Period when data contains only Date, but not Time.
Syntax
ydm(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
Date
Example
ydm("2015-10-01")
returns "2015-01-10"
ydm("2015/10/01")
returns "2015-01-10"
ydm("Created on 2015 10 1")
returns "2015-01-10"
ydm("2015, 1 Oct", locale = "English")
returns "2015-10-01"
Summary
Convert Character or Number to Date / Period when data contains Date and Time.
Syntax
ydm_h(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
POSIXct
Example
ydm_h("2015-01-10T06")
returns "2015-10-01 06:00:00 UTC"
ydm_h("2015/01/10 06")
returns "2015-10-01 06:00:00 UTC"
ydm_h("Created on 15-01-10 at 6 AM")
returns "2015-10-01 06:00:00 UTC"
ydm_h("2015 01 Oct, at 6 AM", locale = "English")
returns "2015-10-01 06:00:00 UTC"
Summary
Convert Character or Number to Date / Period when data contains Date and Time.
Syntax
ydm_hm(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
POSIXct
Example
ydm_hm("2015-01-10 06:10")
returns "2015-10-01 06:10:00 UTC"
ydm_hm("2015/01/10 06:00 UTC")
returns "2015-10-01 06:10:15 UTC"
ydm_hm("Created on 15-01-10 at 06:10 AM")
returns "2015-10-01 06:10:00 UTC"
ydm_hm("2015 01 Oct, at 6:10 AM", locale = "English")
returns "2015-10-01 06:10:00 UTC"
Summary
Convert Character or Number to Date / Period when data contains Date and Time.
Syntax
ydm_hms(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
POSIXct
Example
ydm_hms("2015-01-10T06:10:15")
returns "2015-10-01 06:10:15 UTC"
ydm_hms("2015/01/10 06:10:15 UTC")
returns "2015-10-01 06:10:15 UTC"
ydm_hms("Created on 15-01-10 at 06:10:15 AM")
returns "2015-10-01 06:10:15 UTC"
ydm_hms("2015-01-10 06")
returns NA, because there is not enough data for Date and Time parsing.
ydm_hms("2015-01-10 06", truncated = 2)
returns "2015-10-01 06:00:00 UTC" as expected
ydm_hms("2015 01 Oct, at 6:10:15 AM", locale = "English")
returns "2015-10-01 06:10:15 UTC"
Summary
Extract Year
Syntax
year(<column_date>
)
Return Value
Numeric
Example
year("2015-10-01 06:15:30")
returns 2015
Summary
Convert a number to year periods so you can use it for further calculations.
Syntax
years(<column_num>
)
Return Value
Period
Example
ymd_hms("2015-10-01 06:15:30") + years(1)
returns "2016-10-01 06:15:30 UTC"
Summary
Convert Character or Number to Date / Period when data contains only Date, but not Time.
Syntax
ymd(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
Date
Example
ymd("2015-10-01")
returns "2015-10-01"
ymd("2015/10/01")
returns "2015-10-01"
ymd("Created on 2015 10 1")
returns "2015-10-01"
ymd("2015, Oct 1", locale = "English")
returns "2015-10-01"
Summary
Convert Character or Number to Date / Period when data contains Date and Time.
Syntax
ymd_h(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
POSIXct
Example
ymd_h("2015-10-01T06")
returns "2015-10-01 06:00:00 UTC"
ymd_h("2015/10/01 06")
returns "2015-10-01 06:00:00 UTC"
ymd_h("Created on 15-10-01 at 6 AM")
returns "2015-10-01 06:00:00 UTC"
ymd_h("2015, Oct 1 6AM", locale = "English")
returns "2015-10-01 06:00:00 UTC"
Summary
Convert Character or Number to Date / Period when data contains Date and Time.
Syntax
ymd_hm(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
POSIXct
Example
ymd_hm("2015-10-01 06:10")
returns "2015-10-01 06:10:00 UTC"
ymd_hm("2015/10/01 06:00 UTC")
returns "2015-10-01 06:10:15 UTC"
ymd_hm("Created on 15-10-01 at 06:10 AM")
returns "2015-10-01 06:10:00 UTC"
ymd_hm("2015, Oct 1 06:10 AM", locale = "English")
returns "2015-10-01 06:10:00 UTC"
Summary
Convert Character or Number to Date / Period when data contains Date and Time.
Syntax
ymd_hms(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
POSIXct
Example
ymd_hms("2015-10-01T06:10:15")
returns "2015-10-01 06:10:15 UTC"
ymd_hms("2015/10/01 06:10:15 UTC")
returns "2015-10-01 06:10:15 UTC"
ymd_hms("Created on 15-10-01 at 06:10:15 AM")
returns "2015-10-01 06:10:15 UTC"
ymd_hms("2015-10-01 06")
returns NA, because there is not enough data for Date and Time parsing.
ymd_hms("2015-10-01 06", truncated = 2)
returns "2015-10-01 06:00:00 UTC" as expected
ymd_hms("2015, Oct 1 06:10:15 AM", locale = "English")
returns "2015-10-01 06:10:15 UTC"
Summary
Computes the correlation of two column values, i.e. the covariance divided by the product of the standard deviations of the two columns.
Syntax
cor(<column_num>
, <column_num>
, use = <cor_na_operation>
, method = <cor_method>
)
Arguments
Example
summarize(correlation = cor(ARR_DELAY, DEP_DELAY))
Return the correlation value for ARR_DELAY and DEP_DELAY columns.
Summary
Computes the covariance of two column values.
Syntax
cov(<column_num>
, <column_num>
, use = <cor_na_operation>
, method = <cor_method>
)
Arguments
Example
summarize(covariance = cov(ARR_DELAY, DEP_DELAY))
Return the covariance value for ARR_DELAY and DEP_DELAY columns.
Summary
Computes the interquartile range of the values.
Syntax
IQR(<column_num_logic_date>
, na.rm = <logical>
, type = 1|2|3|4|5|6|7|8|9
)
Arguments
Return Value
Numeric
Example
# x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
IQR(x)
Returns 4.5
Summary
Returns the median absolute deviation of the values.
Syntax
mad(<column_num_logic_date>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
summarize(mad = mad(revenue))
Create a new column for calculating the median absolute deviation of revenue for each group.
Summary
Returns the numerical median value.
Syntax
median(<column_num_logic_date>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
summarize(revenue_median = median(revenue))
Create a new column for calculating the median value of revenue for each group.
Summary
Returns data augmented with fitted values and residuals.
Syntax
predict(<column_model>
, data = <column_data_set>
, newdata = <data_set>
, type.predict = "link"|"response"
)
Arguments
Example
predict(model) - assumes that the 'model' column contains a fitted model of 'lm', 'glm' or 'kmeans'. If it is lm, the output is like below.
.rownames | ARR_DELAY | DEP_DELAY | .fitted | .se.fit | .resid | .hat | .sigma | .cooksd | .std.resid |
---|---|---|---|---|---|---|---|---|---|
1 | -15 | -9 | -13.934 | 0.0247 | -1.066 | 0.0000035503 | 13.0886 | 7.8503e-9 | -0.0814 |
2 | -28 | -19 | -22.4116 | 0.035 | -5.5884 | 0.0000071333 | 13.0886 | 4.3348e-7 | -0.427 |
Summary
Returns sample quantiles corresponding to the given probabilities. The smallest observation corresponds to a probability of 0 and the largest to a probability of 1.
Syntax
quantile(<column_num_logic_date>
, probs = <percent_number>
, na.rm = <logical>
, [type = 1|2|3|4|5|6|7|8|9
])
Arguments
Return Value
Numeric
Example
# x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
quantile(x, 0.25)
Returns 3.25
quantile(x, .5)
Returns 5.5
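Multiple probabilities can be requested at once by passing a vector to probs. A sketch with the same x:
quantile(x, probs = c(0.25, 0.5, 0.75))
Returns 3.25, 5.5, 7.75.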
Summary
Returns the standard deviation of the values.
Syntax
sd(<column_num_logic_date>
, na.rm = <TRUE|FALSE>
)
Arguments
Return Value
Numeric
Example
summarize(sd = sd(revenue))
Create a new column for calculating the standard deviation of revenue for each group.
Summary
Returns the variance of the values.
Syntax
var(<column_num_logic_date>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
summarize(variance = var(revenue))
Create a new column for calculating the variance of revenue for each group.
Summary
Compares two strings lexicographically and determines their order. The result is returned as a numeric value: -1 (comes before), 0 (equal), or 1 (comes after).
Syntax
stri_compare(<text_column1>
, <text_column2>
, ..., opts_collator = NULL)
Arguments
Example
stri_compare(X1, Y1)
Compares strings in columns X1 and Y1 lexicographically, returning -1 if X1 comes before Y1, 0 if they're equal, and 1 if X1 comes after Y1.
X1 | Y1 | NEW |
---|---|---|
apple | banana | -1 |
banana | banana | 0 |
cat | banana | 1 |
stri_compare(X1, Y1, opts_collator = stri_opts_collator(strength = 2))
Compares strings without distinguishing between uppercase and lowercase.
X1 | Y1 | NEW |
---|---|---|
APPLE | apple | 0 |
banana | BANANA | 0 |
Cat | cat | 0 |
Summary
Converts a date-time column to a string based on the specified format.
Syntax
stri_datetime_format(<date_time_column>
, format = <format>
, tz = <timezone>
, locale = <locale>
)
Arguments
Example
stri_datetime_format(X1, format="MM/dd/uuuu")
Converts values in date-time column X1 to strings in month/day/year format.
X1 | NEW |
---|---|
2024-01-01 09:00:00 | 01/01/2024 |
2024-01-02 14:30:00 | 01/02/2024 |
2024-01-03 18:15:00 | 01/03/2024 |
Summary
Converts non-ASCII characters (such as Japanese multibyte characters or special symbols) in a string to Unicode escape sequences (\uXXXX format). ASCII characters (alphanumeric characters and basic symbols) are preserved as is.
Syntax
stri_escape_unicode(<string_column>
)
Arguments
Return Value
Character
Example
stri_escape_unicode(X1)
Converts non-ASCII characters in the strings in column X1 to Unicode escape sequences.
X1 | NEW |
---|---|
Hello | Hello |
日本語 | \u65E5\u672C\u8A9E |
♥ | \u2665 |
Summary
Returns the number of bytes required to store text data. In UTF-8 encoding, one character can be represented by 1-4 bytes, so the number of characters and the number of bytes may differ.
Syntax
stri_numbytes(<string>
)
Arguments
Example
stri_numbytes(X1)
Returns the number of bytes needed to store the strings in column X1.
X1 | NEW |
---|---|
hello | 5 |
こんにちは | 15 |
123 | 3 |
Summary
Generates random strings of specified length. Various character types can be used for generation, including ASCII characters and Unicode characters.
Syntax
stri_rand_strings(n = <number_of_strings>
, length = <length_of_each_string>
, pattern = <character_pattern>
)
Arguments
Example
stri_rand_strings(X1, length=5, pattern="[A-Z]")
Generates random strings of uppercase letters with a length of 5 characters based on the values in column X1.
X1 | NEW |
---|---|
1 | KXMPQ |
2 | RNWVY |
3 | LBFTH |
stri_rand_strings(X1, length=8, pattern="[0-9]")
Generates random strings of digits with a length of 8 characters based on the values in column X1.
X1 | NEW |
---|---|
1 | 47295836 |
2 | 91634507 |
3 | 25803914 |
stri_rand_strings(X1, length=6, pattern="[A-Za-z0-9]")
Generates random alphanumeric strings with a length of 6 characters based on the values in column X1.
X1 | NEW |
---|---|
1 | kR9nP2 |
2 | mX4vL8 |
3 | tB7hJ5 |
Summary
Modifier functions control the meaning of the pattern argument to stringr functions.
Syntax
regex(pattern, ignore_case = <logical>
, multiline = <logical>
, comments = <logical>
, dotall = <logical>
, locale = <character>
)
Arguments
Return Value
A stringr modifier object, i.e. a character vector with parent S3 class stringr_pattern.
Example
regex("Capital", ignore_case = TRUE)
Summary
Concatenates text values from multiple columns.
Syntax
str_c(<column(s)>
, sep = <text>
, collapse = <text>
)
Arguments
Return Value
Character
Example
str_c(CARRIER, TAIL_NUM)
Concatenate CARRIER and TAIL_NUM values.
str_c(CARRIER, TAIL_NUM, sep = "-")
Use dash as a separator to concatenate CARRIER and TAIL_NUM values.
summarize(combined_text = str_c(CARRIER, collapse = ","))
Use comma ',' as a separator to combine all the CARRIER names into one text for each group.
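A minimal sketch assuming stringr, contrasting sep (element-wise joining) with collapse (combining the whole vector into one string); the values are illustrative:
library(stringr)
carriers <- c("AA", "DL", "UA")
str_c(carriers, "01", sep = "-")   # element-wise: "AA-01" "DL-01" "UA-01"
str_c(carriers, collapse = ", ")   # single string: "AA, DL, UA"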
Summary
Convert text to a specified encoding, e.g. UTF-8.
Syntax
str_conv(<column_text>
, <encoding>
)
Arguments
Return Value
Character
Example
str_conv("Exploratory", "UTF-8")
Summary
Returns the number of letters or words in text data. You can also set a pattern to count only the matching text.
Syntax
str_count(<column_text>
, pattern = [<regular_expression>]
)
str_count(<column_text>
, pattern = [boundary("word")]
)
Arguments
Return Value
Numeric
Example
str_count("I'm traveling to New York on 1/1.")
returns 33, including punctuation and spaces
str_count("I'm traveling to New York on 1/1.", "New")
returns 1 by matching text 'New'
str_count("I'm traveling to New York on 1/1.", "[:alpha:]")
returns 22 counting only alphabet letters
str_count("I'm traveling to New York on 1/1.", "[:digit:]")
returns 2 counting only digit letters
str_count("I'm traveling to New York.", boundary("word"))
returns 5 counting as words
str_count("I'm traveling to New York on 1/1.", ".")
returns 33 because the regular expression dot matches any character
str_count("I'm traveling to New York on 1/1.", fixed("."))
returns 1 because fixed() searches for the dot as a literal character
Summary
Return TRUE or FALSE based on whether Text data contains a given text or not
Syntax
str_detect(<column_text>
, <pattern>
)
Arguments
Return Value
Logical
Example
str_detect("New York", "New")
returns TRUE because there is 'New' in the data.
str_detect("New York on 1/1.", "[:alpha:]")
returns TRUE because there are alphabet letters in the data.
str_detect("New York on 1/1.", "[:digit:]")
returns TRUE because there are numeric (digit) letters in the data.
str_detect("New York on 1/1.", "^New")
returns TRUE because there is "New" text at the beginning of the data.
str_detect("New York on 1/1.", "New$")
returns FALSE because there is no "New" text at the end of the data.
str_detect("New York on 1/1.", "N.w")
returns TRUE because 'New' starts with 'N' and ends with 'w'
str_detect("New York on 1/1.", fixed("N.w"))
returns FALSE because 'N.w' exact matching letters don't exist
str_detect("New York on 1/1.", c("New", "Old"))
returns TRUE for "New" and FALSE for "Old" because each pattern is checked against the text element-wise
Summary
Repeat text values.
Syntax
str_dup(<column_text>
, times = <number>
)
Arguments
Return Value
Character
Example
str_dup("wow", 2)
returns "wowwow".
Summary
Extracts the first substring that matches given letters or patterns
Syntax
str_extract(<column_text>
, <pattern>
)
Arguments
Return Value
Character
Example
str_extract(X1, "[A-Za-z]+")
Extracts the first sequence of alphabetic characters from column X1.
X1 | NEW |
---|---|
ABC123 | ABC |
123DEF | DEF |
A1B2C3 | A |
str_extract(X1, "[:alpha:]+")
Extracts the first sequence of alphabetic characters from column X1.
X1 | NEW |
---|---|
ABC123 | ABC |
123DEF | DEF |
A1B2C3 | A |
str_extract(X1, "[0-9]+")
Extracts the first sequence of digits from column X1.
X1 | NEW |
---|---|
ABC123 | 123 |
456DEF | 456 |
A1B2C3 | 1 |
str_extract(X1, "[:digit:]+")
Extracts the first sequence of digits from column X1.
X1 | NEW |
---|---|
ABC123 | 123 |
456DEF | 456 |
A1B2C3 | 1 |
Summary
Extracts all substrings that match given letters or patterns. Can be combined with the list_to_text function to convert the extracted strings into one concatenated text.
Syntax
str_extract_all(<column_text>
, <pattern>
)
Arguments
Return Value
List
Example
list_to_text(str_extract_all(X1, "[:alpha:]+"))
Extracts all letters from column X1.
X1 | NEW |
---|---|
Hello world | Hello, world |
ABC123DEF | ABC, DEF |
R2D2C3PO | R, D, C, PO |
list_to_text(str_extract_all(X1, "[a-zA-Z]+"))
Extracts all alphabetic sequences from column X1.
X1 | NEW |
---|---|
Hello123World | Hello, World |
R2D2C3PO | R, D, C, PO |
Data_Science_2023 | Data, Science |
list_to_text(str_extract_all(X1, "[:digit:]+"))
Extracts all numbers from column X1.
X1 | NEW |
---|---|
ABC123DEF456 | 123, 456 |
Phone: 555-1234 | 555, 1234 |
Year2023Month01 | 2023, 01 |
list_to_text(str_extract_all(X1, "[a-zA-Z0-9]+"))
Extracts alphabetic characters and numbers from column X1.
X1 | NEW |
---|---|
Hello_123_World! | Hello,123,World |
R2D2_C3PO | R2D2,C3PO |
Data-Science-2023 | Data,Science,2023 |
list_to_text(str_extract_all(X1, "[:alnum:]+"))
Extracts all characters and numbers from column X1.
X1 | NEW |
---|---|
Hello_123_World! | Hello, 123, World |
TYO-Tokyo-2023 | TYO, Tokyo, 2023 |
Data@Science#2023 | Data, Science, 2023 |
Summary
Returns the number of characters in text data.
Syntax
str_length(<column_text>
)
Return Value
Numeric
Example
str_length("I'm traveling to New York.")
Returns 26, including punctuation and spaces
Summary
Adds white spaces (or a given pad text) to text values to make the total length exactly a specified number.
Syntax
str_pad(<column_text>
, width = <number>
, side = "left"|"right"|"both"
, pad = <text>
)
Arguments
Return Value
Character
Example
str_pad("Exploratory", 3)
returns "Exploratory" because 3 is less than number of the original given text.
str_pad("Exploratory", 30, "left")
returns " Exploratory" to make the length of the text to be 30 by adding white spaces.
Summary
Replaces the first match of given letters or patterns with a replacement text
Syntax
str_replace(<column_text>
, <pattern>
, <replacement>
)
Arguments
Return Value
Character
Example
str_replace("I am traveling to New York!!", "York", "Jersey")
returns "I am traveling to New Jersey"
str_replace("I am traveling to New York!!", "[:punct:]", "")
returns "I am traveling to New York!" replacing only the first matching punctuation. You want to use str_replace_all() to replace them all.
str_replace("New York New", "^New", "Old")
returns "Old York New" by finding "New" at the beginning and replace it with "Old"
str_replace("New York New.", "New$", "Old")
returns "New York Old" by finding "New" at the end and replace it with "Old"
str_replace("New York", str_sub("New York", 4, 8), "")
returns "New".
Summary
Replaces all matches of given letters or patterns with a replacement text
Syntax
str_replace_all(<column_text>
, <pattern>
, <replacement>
)
Arguments
Return Value
Character
Example
str_replace_all("I am traveling to New York!!", "[:punct:]", "")
returns "I am traveling to New York"
Summary
Replaces NA values with a specific text
Syntax
str_replace_na(<column_text>
, <replacement>
)
Arguments
Return Value
Character
Example
str_replace_na(ARR_DELAY, "Not Available")
Returns "Not Available" for NA in ARR_DELAY column.
Summary
Splits a given text into multiple pieces by a given separator. It returns a list, so you may want to unnest the newly created column with the unnest() function.
Syntax
str_split(<column_text>
, pattern = <pattern>
, n = <numeric>
)
Arguments
Return Value
List
Example
str_split("a, b, c", ",")
returns "a" " b" " c" in a list format.
str_split("a, b, c", ",", 2)
returns "a" and " b, c" in a list format.
mutate(y = str_split(text_a, ",")) %>% unnest()
First, split a given text column; this will create a list column. Then, the unnest() function will break the list up by giving each element of the list its own row.
unnest(y = str_split(text_a, ","))
This is equivalent to the above command.
Summary
Extracts a substring from text values based on position
Syntax
str_sub(<column_text>
, <start>
, <end>
)
Arguments
Return Value
Character
Example
str_sub("Kevin Spacey", start = 7, end = 12)
returns "Spacey". Starts from 7 to 12.
str_sub("Kevin Spacey", start = 7)
returns "Spacey". Starts from 7 to the end.
str_sub(Text, start = -6)
returns "Spacey". Last 6 letters.
Summary
Convert text to lowercase
Syntax
str_to_lower(<column_text>
, locale = <locale>
)
Arguments
Return Value
Character
Example
str_to_lower("CALIFORNIA")
Returns "california"
Summary
Convert text to Title-case
Syntax
str_to_title(<column_text>
, locale = <locale>
)
Arguments
Return Value
Character
Example
str_to_title("CALIFORNIA")
Returns "California"
Summary
Convert text to UPPERCASE
Syntax
str_to_upper(<column_text>
, locale = <locale>
)
Arguments
Return Value
Character
Example
str_to_upper("california")
Returns "CALIFORNIA"
str_to_upper("i", "en")
Setting the English locale returns 'I'
str_to_upper("i", "tr")
Setting the Turkish locale returns 'İ' (capital dotted I)
Summary
Trims (or removes) leading and/or trailing white spaces in text values
Syntax
str_trim(<column_text>
, side = "both"|"left"|"right"
)
Arguments
Return Value
Character
Example
str_trim(" String with trailing and leading white space\t")
returns "String with trailing and leading white space".
str_trim("\n\nString with trailing and leading white space\n\n")
returns "String with trailing and leading white space".
Summary
Extract First / Last Nth Word in Text data
Syntax
word(<column_text>
, start = <number>
, end = <number>
, sep = <text>
)
Arguments
Return Value
Character
Example
word(text_column, 1, sep = " ")
Extracts the first word from the text column using space as separator.
text_column | new_column |
---|---|
I am traveling | I |
Hello World | Hello |
Data Analysis | Data |
word(text_column, 1, 3, sep = " ")
Extracts the 1st to 3rd words from the text column using space as separator.
text_column | new_column |
---|---|
I am traveling to | I am traveling |
This is good | This is good |
A B C D | A B C |
word(text_column, -1, sep = " ")
Extracts the last word from the text column using space as separator.
text_column | new_column |
---|---|
I am traveling | traveling |
Hello World | World |
Data Analysis | Analysis |
word(text_column, -2, -1, sep = " ")
Extracts the second-to-last and last words from the text column using space as separator.
text_column | new_column |
---|---|
I am traveling | am traveling |
Hello World | Hello World |
Big Data Analysis | Data Analysis |
word(text_column, 1, sep = "-")
Extracts the first word from the text column using hyphen as separator.
text_column | new_column |
---|---|
web-development | web |
user-interface-design | user |
back-end-server-configuration | back |
Summary
Complete a data frame by adding missing combinations of data.
Syntax
complete(<column(s)>
, fill = <list>
)
Arguments
Used Together
full_seq, nesting
Example
Sample data:
year | product | type | sales |
---|---|---|---|
2013 | iPhone | 5 | 100 |
2013 | iPad | mini | 50 |
2015 | iPhone | 6 | 200 |
complete(year, product)
Returns all the combinations of year and product values.
year | product | type | sales |
---|---|---|---|
2013 | iPad | mini | 50 |
2013 | iPhone | 5 | 100 |
2015 | iPad | NA | NA |
2015 | iPhone | 6 | 200 |
complete(year, product, fill = list(sales = 0))
Returns all the combinations of year and product values and fill with 0 for NA for sales column.
year | product | type | sales |
---|---|---|---|
2013 | iPad | mini | 50 |
2013 | iPhone | 5 | 100 |
2015 | iPad | NA | 0 |
2015 | iPhone | 6 | 200 |
complete(year = full_seq(year, period = 1), product)
Returns all the combinations of year and product values plus fill missing years (2014).
year | product | type | sales |
---|---|---|---|
2013 | iPad | mini | 50 |
2013 | iPhone | 5 | 100 |
2014 | iPad | NA | NA |
2014 | iPhone | NA | NA |
2015 | iPad | NA | NA |
2015 | iPhone | 6 | 200 |
complete(year, product, type)
Returns all the combinations of year, product, and type values.
year | product | type | sales |
---|---|---|---|
2013 | iPad | 5 | NA |
2013 | iPad | 6 | NA |
2013 | iPad | mini | 50 |
2013 | iPhone | 5 | 100 |
2013 | iPhone | 6 | NA |
2013 | iPhone | mini | NA |
2015 | iPad | 5 | NA |
2015 | iPad | 6 | NA |
2015 | iPad | mini | NA |
2015 | iPhone | 5 | NA |
2015 | iPhone | 6 | 200 |
2015 | iPhone | mini | NA |
complete(year, nesting(product, type))
By using the nesting() function you can restrict the combinations for some columns to those that actually exist in the data.
year | product | type | sales |
---|---|---|---|
2013 | iPad | mini | 50 |
2013 | iPhone | 5 | 100 |
2013 | iPhone | 6 | NA |
2015 | iPad | mini | NA |
2015 | iPhone | 5 | NA |
2015 | iPhone | 6 | 200 |
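A minimal runnable sketch assuming tidyr, rebuilding the sample data above:
library(tidyr)
df <- data.frame(
  year = c(2013, 2013, 2015),
  product = c("iPhone", "iPad", "iPhone"),
  type = c("5", "mini", "6"),
  sales = c(100, 50, 200)
)
# All year x product combinations; missing sales filled with 0
complete(df, year, product, fill = list(sales = 0))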
Summary
Expand a data frame by adding all the combination of given columns.
Syntax
expand(<column(s)>
)
Arguments
Used Together
full_seq, nesting
Example
Sample data:
year | product | type | sales |
---|---|---|---|
2013 | iPhone | 5 | 100 |
2013 | iPad | mini | 50 |
2015 | iPhone | 6 | 200 |
expand(year, product)
Returns all the combinations of year and product values.
year | product |
---|---|
2013 | iPad |
2013 | iPhone |
2015 | iPad |
2015 | iPhone |
expand(year = full_seq(year, period = 1), product)
Returns all the combinations of year and product values plus fill missing years (2014).
year | product |
---|---|
2013 | iPad |
2013 | iPhone |
2014 | iPad |
2014 | iPhone |
2015 | iPad |
2015 | iPhone |
expand(year, product, type)
Returns all the combinations of year, product, and type values.
year | product | type |
---|---|---|
2013 | iPad | 5 |
2013 | iPad | 6 |
2013 | iPad | mini |
2013 | iPhone | 5 |
2013 | iPhone | 6 |
2013 | iPhone | mini |
2015 | iPad | 5 |
2015 | iPad | 6 |
2015 | iPad | mini |
2015 | iPhone | 5 |
2015 | iPhone | 6 |
2015 | iPhone | mini |
expand(year, nesting(product, type))
By using the nesting() function you can fix a set of columns so that only the combinations of values that actually exist in the data are used.
year | product | type |
---|---|---|
2013 | iPad | mini |
2013 | iPhone | 5 |
2013 | iPhone | 6 |
2015 | iPad | mini |
2015 | iPhone | 5 |
2015 | iPhone | 6 |
Summary
Extract only the number from a given text. This is useful for strings that are supposed to be numbers with extra formatting (e.g. $1,200.34, -12%, X1, etc.).
Syntax
extract_numeric(<column>
)
Return Value
Numeric
Example
extract_numeric("$56,500.00")
Returns 56500
extract_numeric("-45%")
Returns -45
mutate(total_num = extract_numeric(total))
Create a new column that holds only the numbers extracted from the original column (the column names here are illustrative).
Summary
Fills missing values using the previous entry. This is especially convenient when values are omitted from some cells to avoid duplicated entries.
Syntax
fill(<column(s)>
, .direction = <fill_direction>
)
Arguments
Example
fill(POPULATION)
Fills using the previous appearing value for POPULATION column.
fill(POPULATION, .direction = "up")
Fills using the next appearing value for POPULATION column.
fill(POPULATION, PRODUCTION)
Fills using the previous appearing value for POPULATION and PRODUCTION columns.
fill(-UNEMPLOYMENT)
Fills using the previous appearing value for all the columns other than UNEMPLOYMENT column.
Summary
Generate a sequence of numbers based on the values in a given column.
Syntax
full_seq(<column_num>
, period = <number>
)
Arguments
Used Together
complete, expand
Example
Sample data:
year | product | type | sales |
---|---|---|---|
2013 | iPhone | 5 | 100 |
2013 | iPad | mini | 50 |
2015 | iPhone | 6 | 200 |
expand(year = full_seq(year, period = 1), product)
Returns all the combinations of year and product values plus fill missing years (2014).
year | product |
---|---|
2013 | iPad |
2013 | iPhone |
2014 | iPad |
2014 | iPhone |
2015 | iPad |
2015 | iPhone |
Without full_seq() function.
expand(year, product)
Returns all the combinations of year and product values.
year | product |
---|---|
2013 | iPad |
2013 | iPhone |
2015 | iPad |
2015 | iPhone |
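full_seq() on its own just fills in the gaps of a numeric sequence; a minimal sketch assuming tidyr:
library(tidyr)
full_seq(c(2013, 2015), period = 1)   # returns 2013 2014 2015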
Summary
Takes multiple columns and collapses into key-value pairs, duplicating all other columns as needed.
Syntax
gather(<new_key_column>
, <new_value_column>
, <column(s)>
, na.rm = <logical>
, convert = <logical>
)
Arguments
Example
gather(COUNTRY, POPULATION, 2:50, na.rm = FALSE)
Takes the 2nd column to the 50th column into two new columns, COUNTRY and POPULATION.
gather(COUNTRY, POPULATION, -YEAR)
Takes all the columns other than YEAR column into two new columns, COUNTRY and POPULATION.
gather(COUNTRY, POPULATION, starts_with("country"))
Takes the columns whose names start with "country" into two new columns, COUNTRY and POPULATION.
Summary
Nest a set of columns together as a list column.
Syntax
nest(<column(s)>
, .key = <text>
)
Arguments
Example
Sample Data
year | product | type | sales |
---|---|---|---|
2013 | iPhone | 5 | 100 |
2013 | iPad | mini | 50 |
2015 | iPhone | 6 | 200 |
nest(product, type, sales, .key = a)
This will create a new column that is a list of data frames and each row of the data frame holds 'product', 'type', 'sales' values.
year | a |
---|---|
2013 | list() |
2015 | list() |
Summary
Fixes multiple columns so that only the combinations of values that actually exist in those columns are used in a complete() or expand() operation.
Syntax
nesting(<column(s)>
)
Arguments
Used Together
complete, expand
Example
Sample data:
year | product | type | sales |
---|---|---|---|
2013 | iPhone | 5 | 100 |
2013 | iPad | mini | 50 |
2015 | iPhone | 6 | 200 |
expand(year, nesting(product, type))
By using the nesting() function you can fix a set of columns so that only the combinations of values that actually exist in the data are used.
year | product | type |
---|---|---|
2013 | iPad | mini |
2013 | iPhone | 5 |
2013 | iPhone | 6 |
2015 | iPad | mini |
2015 | iPhone | 5 |
2015 | iPhone | 6 |
Without nesting():
expand(year, product, type)
Returns all the combinations of year, product, and type values.
year | product | type |
---|---|---|
2013 | iPad | 5 |
2013 | iPad | 6 |
2013 | iPad | mini |
2013 | iPhone | 5 |
2013 | iPhone | 6 |
2013 | iPhone | mini |
2015 | iPad | 5 |
2015 | iPad | 6 |
2015 | iPad | mini |
2015 | iPhone | 5 |
2015 | iPhone | 6 |
2015 | iPhone | mini |
Summary
Separates a column with delimited values into multiple columns.
Syntax
separate(<column_text_date>
, into = c("<new_column_name(s)>
"), sep = <text>
, remove=<logical>
, convert=<logical>
, extra = "warn"|"drop"|"merge"
, fill = "warn"|"right"|"left"
)
Arguments
Example
separate(NAME, into = c("FIRSTNAME", "LASTNAME"), extra="drop")
Separates NAME column into FIRSTNAME and LASTNAME columns by non-alphanumeric values. It will drop the rest if any.
separate(NAME, into = c("FIRSTNAME", "LASTNAME"), remove = FALSE)
Separates NAME column into FIRSTNAME and LASTNAME columns by non-alphanumeric values. It will keep NAME column.
separate(NAME, into = c("FIRSTNAME", "LASTNAME"), extra="merge")
Separates NAME column into FIRSTNAME and LASTNAME columns by non-alphanumeric values. It will merge the rest into the last column.
separate(NAME, into = c("FIRSTNAME", "LASTNAME"), sep = "-")
Separates NAME column into FIRSTNAME and LASTNAME columns by dash "-".
separate(NAME, into = c("FIRSTNAME", "LASTNAME"), sep = 2)
Separates NAME column into FIRSTNAME and LASTNAME by a position of 2.
separate(DATE, into = c("Y", "M", "D"), convert = TRUE)
Separate DATE into Year, Month, and Day columns and set the data type for each column based on the heuristics.
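A minimal runnable sketch assuming tidyr; the NAME values are illustrative:
library(tidyr)
df <- data.frame(NAME = c("John Smith", "Jane Ann Doe"))
# extra = "merge" keeps anything beyond the first split in the last column
separate(df, NAME, into = c("FIRSTNAME", "LASTNAME"), extra = "merge")
# FIRSTNAME: "John", "Jane"   LASTNAME: "Smith", "Ann Doe"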
Summary
Spread a key-value pair across multiple columns. Missing values will be replaced with NA.
Syntax
spread(<key_column>
, <value_column>
, fill = <value>
, convert = <logical>
, drop = <logical>
, sep=<text>
)
Arguments
Example
spread(COUNTRY, POPULATION)
Spread a COUNTRY-POPULATION pair across multiple columns of COUNTRY.
spread(COUNTRY, POPULATION, fill = "NA")
Spread a COUNTRY-POPULATION pair across multiple columns of COUNTRY, and fill with "NA" text for missing values.
Summary
Unite multiple columns together into one column with given uniting characters. It will concatenate values with "_" and remove the original columns by default. Multiple columns specified in the list can be combined together at once.
Syntax
unite(<new_column>
, <column(s)>
, sep = <text>
, remove = <logical>
)
Arguments
Example
unite(NAME, FIRSTNAME, LASTNAME)
Unite FIRSTNAME and LASTNAME into a new column called NAME using _ as a uniting character.
unite(NAME, FIRSTNAME, LASTNAME, sep = "-")
Unite FIRSTNAME and LASTNAME into a new column called NAME using - as a uniting character.
Summary
Unnest a list column or a list of data frames by making each element of the list to be presented in its own row.
Syntax
unnest(<column_list>
, .drop = <logical>
, .id = <text>
, .sep = <text>
)
Arguments
Example
Sample Data with a list column
issue_id | assignee |
---|---|
720 | list(Kan, Kei) |
721 | list(Kan, Marie, Simon) |
unnest() or unnest(assignee)
Unnest 'assignee' column which is a list. This will give each value of the list its own row.
issue_id | assignee |
---|---|
720 | Kan |
720 | Kei |
721 | Kan |
721 | Marie |
721 | Simon |
Sample Data with a list of Data Frames:
issue_id | assignee |
---|---|
720 | list(name = Kan, id = 105, country = US ) |
721 | list(name = Kei, id = 106, country = Japan ) |
unnest()
Unnest 'assignee' column which is a list of data frames. This will create columns for each column of the nested data frame.
issue_id | name | id | country |
---|---|---|---|
720 | Kan | 105 | US |
721 | Kei | 106 | Japan |
unnest(.sep="-")
With the .sep argument, the resulting column names are the original list column name and the nested column name connected by the .sep value.
issue_id | assignee-name | assignee-id | assignee-country |
---|---|---|---|
720 | Kan | 105 | US |
721 | Kei | 106 | Japan |
Sample Data with two list columns
issue_id | assignee | category |
---|---|---|
720 | list(Kan, Kei) | list(Critical, Security) |
721 | list(Kan, Marie, Simon) | list(Critical, Minor Issue) |
unnest(assignee, .drop = FALSE)
Unnest 'assignee' column, which is a list, giving each value of the list its own row. With .drop = FALSE the other list column 'category' is kept.
issue_id | assignee | category |
---|---|---|
720 | Kan | list(Critical, Security) |
720 | Kei | list(Critical, Security) |
721 | Kan | list(Critical, Minor Issue) |
721 | Marie | list(Critical, Minor Issue) |
721 | Simon | list(Critical, Minor Issue) |
Summary
Removes parameters from a given URL.
Syntax
param_remove(<column_text>
, keys = <text>
)
Arguments
Return Value
Character
Example
param_remove(urls = "https://github.com/hideaki/tam/issues?q=is%3Aopen&ticket=open", keys = c("q","ticket"))
Returns "https://github.com/hideaki/tam/issues"
Summary
Decodes a given encoded URL. It assumes the character encoding is UTF-8.
Syntax
url_decode(<column_text>
)
Arguments * column - column that you want to decode.
Return Value
Character
Example
url_decode("http://exploratory.io%3fname%3dflight%20data%20analysis%20%28V1%29")
Returns "http://exploratory.io?name=flight data analysis (V1)"
Summary
Encodes a given URL. It assumes the character encoding is UTF-8.
Syntax
url_encode(<column_text>
)
Arguments * column - column that you want to encode.
Return Value
Character
Example
url_encode("http://exploratory.io?name=flight data analysis (V1)")
Returns "http://exploratory.io%3fname%3dflight%20data%20analysis%20%28V1%29"
Summary
Returns decoded query parameters from url.
Syntax
url_parameters(<column_text>
)
Arguments
Return Value
Character
Example
url_parameters("https://exploratory.io/?debug=%3Ctrue%3E#test")
returns "debug=<true>".
Summary
Returns the rolling (or moving) max value.
Syntax
roll_max(<column_num>
, n = <numeric>
, weights = NULL, by = <numeric>
, fill = <numeric>
, align = <alignment>
, normalize = <logical>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
rolling_max = roll_max(revenue, n = 3, align="center", fill = 0)
Returns a max value among three values starting from one previous and one next value, setting NA to 0.
year | revenue | rolling_max |
---|---|---|
2012 | 100 | 0 |
2013 | 200 | 300 |
2014 | 300 | 400 |
2015 | 400 | 500 |
2016 | 500 | 0 |
rolling_max = roll_max(revenue, n = 3, align="left", fill = 0)
Returns a max value among three values starting from the current value and the next two values, setting NA to 0.
year | revenue | rolling_max |
---|---|---|
2012 | 100 | 300 |
2013 | 200 | 400 |
2014 | 300 | 500 |
2015 | 400 | 0 |
2016 | 500 | 0 |
Summary
Calculate a moving average for a column. By specifying a window size, it is possible to return the moving average over the specified interval.
Syntax
roll_mean(<column_num>
, n = <numeric>
, weights = NULL, by = <numeric>
, fill = <numeric>
, align = <alignment>
, normalize = <logical>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
rolling_mean = roll_mean(revenue, n = 3, align="center", fill = 0)
Returns a mean value among three values starting from one previous and one next value, setting NA to 0.
year | revenue | rolling_mean |
---|---|---|
2012 | 100 | 0 |
2013 | 200 | 200 |
2014 | 300 | 300 |
2015 | 400 | 400 |
2016 | 500 | 0 |
rolling_mean = roll_mean(revenue, n = 3, align="left", fill = 0)
Returns a mean value among three values starting from the current value and the next two values, setting NA to 0.
year | revenue | rolling_mean |
---|---|---|
2012 | 100 | 200 |
2013 | 200 | 300 |
2014 | 300 | 400 |
2015 | 400 | 0 |
2016 | 500 | 0 |
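A minimal sketch reproducing the centered example above, assuming the RcppRoll package, which provides these rolling functions:
library(RcppRoll)
revenue <- c(100, 200, 300, 400, 500)
# Window of 3 centered on each value; positions without a full window get fill = 0
roll_mean(revenue, n = 3, align = "center", fill = 0)
# returns 0 200 300 400 0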
Summary
Returns the rolling (or moving) median value.
Syntax
roll_median(<column_num>
, n = <numeric>
, weights = NULL, by = <numeric>
, fill = <numeric>
, align = <alignment>
, normalize = <logical>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
rolling_median = roll_median(revenue, n = 3, align="center", fill = 0)
Returns a median value among three values starting from one previous and one next value, setting NA to 0.
year | revenue | rolling_median |
---|---|---|
2012 | 100 | 0 |
2013 | 200 | 200 |
2014 | 300 | 300 |
2015 | 400 | 400 |
2016 | 500 | 0 |
rolling_median = roll_median(revenue, n = 3, align="left", fill = 0)
Returns a median value among three values starting from the current value and the next two values, setting NA to 0.
year | revenue | rolling_median |
---|---|---|
2012 | 100 | 200 |
2013 | 200 | 300 |
2014 | 300 | 400 |
2015 | 400 | 0 |
2016 | 500 | 0 |
Summary
Returns the rolling (or moving) min value.
Syntax
roll_min(<column_num>
, n = <numeric>
, weights = NULL, by = <numeric>
, fill = <numeric>
, align = <alignment>
, normalize = <logical>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
rolling_max = roll_min(revenue, n = 3, align="center", fill = 0)
Returns a min value among three values starting from one previous and one next value, setting NA to 0.
year | revenue | rolling_min |
---|---|---|
2012 | 100 | 0 |
2013 | 200 | 100 |
2014 | 300 | 200 |
2015 | 400 | 300 |
2016 | 500 | 0 |
rolling_max = roll_min(revenue, n = 3, align="left", fill = 0)
Returns a min value among three values starting from the current value and the next two values, setting NA to 0.
year | revenue | rolling_min |
---|---|---|
2012 | 100 | 100 |
2013 | 200 | 200 |
2014 | 300 | 300 |
2015 | 400 | 0 |
2016 | 500 | 0 |
Summary
Returns the rolling (or moving) prod value.
Syntax
roll_prod(<column_num>
, n = <numeric>
, weights = NULL, by = <numeric>
, fill = <numeric>
, align = <alignment>
, normalize = <logical>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
rolling_prod = roll_prod(revenue, n = 3, align="center", fill = 0)
Returns a prod value among three values starting from one previous and one next value, setting NA to 0.
year | revenue | rolling_prod |
---|---|---|
2012 | 1 | 0 |
2013 | 2 | 6 |
2014 | 3 | 24 |
2015 | 4 | 60 |
2016 | 5 | 0 |
rolling_prod = roll_prod(revenue, n = 3, align="left", fill = 0)
Returns a prod value among three values starting from the current value and the next two values, setting NA to 0.
year | revenue | rolling_prod |
---|---|---|
2012 | 1 | 6 |
2013 | 2 | 24 |
2014 | 3 | 60 |
2015 | 4 | 0 |
2016 | 5 | 0 |
Summary
Returns the rolling (or moving) standard deviation value.
Syntax
roll_sd(<column_num>
, n = <numeric>
, weights = NULL, by = <numeric>
, fill = <numeric>
, align = <alignment>
, normalize = <logical>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
rolling_sd = roll_sd(revenue, n = 3, align="center", fill = 0)
Returns a standard deviation value among three values starting from one previous and one next value, setting NA to 0.
year | revenue | rolling_sd |
---|---|---|
2012 | 100 | 0 |
2013 | 200 | 100 |
2014 | 300 | 100 |
2015 | 400 | 100 |
2016 | 500 | 0 |
rolling_sd = roll_sd(revenue, n = 3, align="left", fill = 0)
Returns a standard deviation value among three values starting from the current value and the next two values, setting NA to 0.
year | revenue | rolling_sd |
---|---|---|
2012 | 100 | 100 |
2013 | 200 | 100 |
2014 | 300 | 100 |
2015 | 400 | 0 |
2016 | 500 | 0 |
Summary
Returns the rolling (or moving) sum value.
Syntax
roll_sum(<column_num>
, n = <numeric>
, weights = NULL, by = <numeric>
, fill = <numeric>
, align = <alignment>
, normalize = <logical>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
rolling_sum = roll_sum(revenue, n = 3, align="center", fill = 0)
Returns a sum value among three values starting from one previous and one next value, setting NA to 0.
year | revenue | rolling_sum |
---|---|---|
2012 | 100 | 0 |
2013 | 200 | 600 |
2014 | 300 | 900 |
2015 | 400 | 1200 |
2016 | 500 | 0 |
rolling_sum = roll_sum(revenue, n = 3, align="left", fill = 0)
Returns a sum value among three values starting from the current value and the next two values, setting NA to 0.
year | revenue | rolling_sum |
---|---|---|
2012 | 100 | 600 |
2013 | 200 | 900 |
2014 | 300 | 1200 |
2015 | 400 | 0 |
2016 | 500 | 0 |
Summary
Returns the rolling (or moving) variance value.
Syntax
roll_var(<column_num>
, n = <numeric>
, weights = NULL, by = <numeric>
, fill = <numeric>
, align = <alignment>
, normalize = <logical>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
rolling_var = roll_var(revenue, n = 3, align="center", fill = 0)
Returns a variance value among three values starting from one previous and one next value, setting NA to 0.
year | revenue | rolling_var |
---|---|---|
2012 | 100 | 0 |
2013 | 200 | 10000 |
2014 | 300 | 10000 |
2015 | 400 | 10000 |
2016 | 500 | 0 |
rolling_var = roll_var(revenue, n = 3, align="left", fill = 0)
Returns a variance value among three values starting from the current value and the next two values, setting NA to 0.
year | revenue | rolling_var |
---|---|---|
2012 | 100 | 10000 |
2013 | 200 | 10000 |
2014 | 300 | 10000 |
2015 | 400 | 0 |
2016 | 500 | 0 |
Summary
Removes column(s). It can be used along with any column(s) selection operation like select(), gather(), etc.
Syntax
-
Return Value
Example
select(-Country)
Remove Country column.
select(-starts_with("user"))
Remove all the columns whose names start with "user".
gather(Year, Population, -Country)
Gather all the columns except Country column into Year and Population columns.
Summary
Select column(s) with a predicate (conditional) function.
Syntax
select_if(<predicate_condition>
,...)
Arguments
Example
select_if(is.numeric)
Select only numeric data type columns.
select_if(is.character)
Select only character data type columns.
select_if(function(x) is.numeric(x) && mean(x) > 5)
Select only the columns that are numeric data type and their mean values are greater than 5.
Summary
Returns the column names that start with a given text.
Syntax
starts_with(<text>
, ignore.case = <logical>
)
Arguments
Example
select(starts_with("X", ignore.case = TRUE))
Select only columns whose name starts with "X"
Summary
Returns the column names that end with a given text.
Syntax
ends_with(<text>
, ignore.case = <logical>
)
Arguments
Example
select(ends_with("ABR", ignore.case = TRUE))
Select only columns whose name ends with "ABR"
Summary
Returns the column names that contain a given text.
Syntax
contains(<text>
, ignore.case = <logical>
)
Arguments
Example
select(contains("ABR", ignore.case = TRUE))
Select only columns whose name contains "ABR"
Summary
Returns the column names that match a given text or regular expression.
Syntax
matches(<text>
, ignore.case = <logical>
)
Arguments
Example
select(matches("[:digit:]+", ignore.case = TRUE))
Select only columns whose names contain digits (regular expression)
Summary
Returns the column names that start with a given text followed by numbers within a given range. It's useful especially when you have column names like X1, X2, X3, X4, etc.
Syntax
num_range(<text>
, <start_num>
:<end_num>
)
Arguments
Example
select(num_range("X", 1:5))
Selects all columns (numerically) from X1 to X5.
Summary
Returns the column names that are any of the given names.
Syntax
any_of(c(<text1>
, <text2>
, ...))
Arguments
Example
select(any_of("x", "y", "z”))
Selects columns whose names are "x", "y", and "z".
Summary
Returns all the column names. It's useful when you want to have particular column(s) first before everything else.
Syntax
everything()
Arguments
Example
select(X, everything())
Reorder columns: keep the column "X" before all the other columns.
Summary
Returns the columns for which the specified column selection function (e.g. is.numeric) returns TRUE.
Syntax
where(<columns_select_function>
)
Arguments
Example
select(where(is.numeric))
Select all numeric columns.
Summary
This function is used for selecting columns for other functions such as summarize_row.
Syntax
across(<column(s)>)
across(<columns_select_function>)
across(starts_with(<text>, ignore.case = <logical>))
across(ends_with(<text>, ignore.case = <logical>))
across(contains(<text>, ignore.case = <logical>))
across(matches(<text>, ignore.case = <logical>))
across(num_range(<text>, <start_num>:<end_num>))
Arguments
Example
mutate(Total = summarize_row(across(c(Sales_Consumer, Sales_Business, Sales_Government)), sum, na.rm = TRUE))
Sums up values from the specified columns for each row, and stores the sums in the Total column.
mutate(Total = summarize_row(across(where(is.numeric)), sum, na.rm = TRUE))
Sums up values from all numeric columns for each row, and stores the sums in the Total column.
Summary
Apply functions to all the columns.
Syntax
mutate_all(funs(<function(s)>
), ...)
Arguments
Example
mutate_all(funs(. * 0.5))
Multiply all the column values by 0.5 and override the existing columns.
mutate_all(funs(half = . * 0.5))
Multiply all the column values by 0.5 and create new columns whose names will end with "_half".
mutate_all(funs(min_rank, dense_rank))
Apply min_rank and dense_rank functions to all the column values and create new columns whose names will end with the function name (min_rank, dense_rank).
Summary
Apply functions to specified columns.
Syntax
mutate_at(vars(<column(s)>
), funs(<function(s)>
), ...)
Arguments
Example
mutate_at(vars(ends_with("DELAY")), min_rank)
Apply the 'min_rank' function to the columns whose names end with "DELAY".
mutate_at(vars(ends_with("DELAY")), funs(. * .5))
Multiply the columns whose names end with "DELAY" by 0.5.
mutate_at(c("ARR_DELAY", "DEP_DELAY"), min_rank)
Apply 'min_rank' function to "ARR_DELAY" and "DEP_DELAY" columns.
mutate_at(c(1,3), min_rank)
Apply 'min_rank' function to 1st and 3rd columns.
Summary
Apply functions to only the columns that match with a given condition.
Syntax
mutate_if(<predicate_condition>
, funs(<function(s)>
), ...)
Arguments
Example
mutate_if(is_numeric, mean)
Apply 'mean' function to all the columns that are numeric data type.
mutate_if(is.numeric, funs(. * 0.2))
Multiply all the numeric columns by 0.2 overriding existing columns.
mutate_if(is.numeric, funs(calc = . * 0.2))
Multiply all the numeric columns by 0.2 as new columns.
Summary
Apply functions to all the columns.
Syntax
summarize_all(funs(<function(s)>
), ...)
Arguments
Example
summarize_all(mean)
Apply mean function to all the columns values per group and override the existing columns.
summarize_all(mean, na.rm = TRUE)
Apply mean function with argument "na.rm = TRUE" to all the columns values per group.
summarize_all(funs(min, max))
Apply min and max functions to all the columns values per group and create new columns.
summarize_all(funs(average = mean), na.rm = TRUE)
Apply mean function to all the columns values per group and create new columns whose names ending with 'average'.
Summary
Apply functions to specified columns.
Syntax
summarize_at(vars(<column(s)>
), funs(<function(s)>
), ...)
Arguments
Example
summarize_at(vars(ends_with("DELAY")), mean)
Apply the 'mean' function to the columns whose names end with "DELAY".
summarize_at(vars(ends_with("DELAY")), mean, na.rm = TRUE)
Apply 'mean' function with "na.rm = TRUE" argument to the columns whose name ending with "DELAY".
summarize_at(c("ARR_DELAY", "DEP_DELAY"), mean)
Apply 'mean' function to "ARR_DELAY" and "DEP_DELAY" columns.
summarize_at(c(1,3), mean)
Apply 'mean' function to 1st and 3rd columns.
Summary
Apply functions to only the columns that match with a given condition.
Syntax
summarize_if(<predicate_condition>
, funs(<function(s)>
), ...)
Arguments
Example
summarize_if(is_numeric, mean)
Apply 'mean' function to all the columns that are numeric data type for each group.
Summary
Returns a list of user selected columns.
Syntax
vars(<column(s)>
)
Arguments
Example
vars(starts_with("ARR"))
Select columns whose name starting with "ARR".
vars(ARR_DELAY, ARR_TIME)
Select "ARR_DELAY" and "ARR_TIME" columns.
Summary
Returns values based on whether a given condition is satisfied or not.
Syntax
if_else(<condition>
, <return_value_when_TRUE>
, <return_value_when_FALSE>
, missing = NULL)
Arguments
Return Value
Vector
Example
mutate(category = if_else(price >= 500, "Expensive", "Cheap"))
price | category |
---|---|
1000 | Expensive |
500 | Expensive |
100 | Cheap |
50 | Cheap |
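A minimal sketch assuming dplyr, showing the missing argument that handles NA inputs:
library(dplyr)
price <- c(1000, 100, NA)
if_else(price >= 500, "Expensive", "Cheap", missing = "Unknown")
# returns "Expensive" "Cheap" "Unknown"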
Summary
Returns values based on multiple conditions. This is similar to CASE WHEN conditions in SQL. You can use a two-sided formula to define the condition and the mapping values, and use TRUE as the ELSE condition. See examples below.
Syntax
case_when(...)
Arguments
Return Value
Vector
Example
mutate(category = case_when(price > 900 ~ "Super Expensive", price >= 500 ~ "Expensive", price >= 100 ~ "Mild")))
price | category |
---|---|
1000 | Super Expensive |
500 | Expensive |
100 | Mild |
50 | NA |
mutate(category = case_when(price > 900 ~ "Super Expensive", price >= 500 ~ "Expensive", price >= 100 ~ "Mild", TRUE ~ "Cheap"))
price | category |
---|---|
1000 | Super Expensive |
500 | Expensive |
100 | Mild |
50 | Cheap |
Summary
Replace the missing values with a given value or the values that are at the same position in other columns.
Syntax
coalesce(<column>
, <column(s)>
)
Arguments
Return Value
Vector
Example
Original data:
price | price_old |
---|---|
1500 | 1000 |
NA | 500 |
NA | 100 |
50 | 10 |
mutate(price = coalesce(price, 0))
price | price_old |
---|---|
1500 | 1000 |
0 | 500 |
0 | 100 |
50 | 10 |
mutate(price = coalesce(price, price_old))
price | price_old |
---|---|
1500 | 1000 |
500 | 500 |
100 | 100 |
50 | 10 |
Summary
Replaces the matching values to NA.
Syntax
na_if(<column>
, <column_value>
)
Arguments
Return Value
Vector
Example
mutate(after = na_if(before, 999))
before | after |
---|---|
A | A |
G | G |
999 | NA |
Summary
Compares two numeric vectors (columns). This is a safe way of comparing if two vectors of floating point numbers are (pairwise) equal. This is safer than using ==, because it has a built in tolerance.
Syntax
near(<column_num>
, <column_num>
, tol = <numeric>
)
Arguments
Return Value
Logical
Example
sqrt(2) ^ 2 == 2
Returns FALSE.
near(sqrt(2) ^ 2, 2)
Returns TRUE.
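A minimal sketch assuming dplyr, with another classic floating point case:
library(dplyr)
0.1 + 0.2 == 0.3       # FALSE: binary floating point rounding error
near(0.1 + 0.2, 0.3)   # TRUE: equal within the built-in tolerance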
Summary
Recodes (or replaces) the original values with given values. You can replace numeric values either by the name with backticks or based on their position, and character values by their name.
Syntax
recode(<column>
, ..., .default = <text>
, .missing = <text>
)
Arguments
Return Value
Vector
Example
Original data is numeric:
ID |
---|
4716 |
4755 |
2432 |
mutate(name = recode(ID, `4716` = "Oracle", `4755` = "Rakuten"))
ID | name |
---|---|
4716 | Oracle |
4755 | Rakuten |
2432 | NA |
Since the original value 2432 doesn't have a replacement value and its data type (numeric) is different from the replacement values (character), its replacement value becomes NA. If the data types were the same, the original value would be used.
mutate(name = recode(ID, `4716` = "Oracle", `4755` = "Rakuten", .default = "Not Known"))
You can use '.default' argument to set the default value.
ID | name |
---|---|
4716 | Oracle |
4755 | Rakuten |
2432 | Not Known |
Original data is character:
ID |
---|
A |
G |
M |
mutate(name = recode(ID, "A" = "Apple", "G" = "Google"))
ID | name |
---|---|
A | Apple |
G | Google |
M | M |
Since the original value 'M' doesn't have a replacement value and its data type (character) is the same as the replacement values (character), the original value 'M' is used.
mutate(name = recode(ID, A = "Apple", G = "Google", .default = "Not known"))
You can use '.default' argument to set the default value.
ID | name |
---|---|
A | Apple |
G | Google |
M | Not known |
Summary
Recodes (or replaces) the original values with given values and create a factor column. You can replace numeric values either by the name with backticks or based on their position, and character values by their name.
Syntax
recode_factor(<column>
, ..., .default = <text>
, .missing = <text>
, .ordered = <logical>
)
Arguments
Return Value
Factor
Example
Original Data:
ID |
---|
M |
T |
S |
mutate(name = recode_factor(ID, S = "Sunday", M = "Mondy", T = "Tuesday", .ordered = TRUE))
ID | name |
---|---|
M | Monday |
T | Tuesday |
S | Sunday |
Summary
Batch converts multiple column names by applying a function.
Syntax
rename_with(.data = <dataframe_name>
, .fn = <function_to_transform_column_names>
, .cols = <columns_to_change>
)
Arguments * .data - A data frame, or a data frame extension like tibble. * .fn - The function to apply. Specifies a function to transform the selected column names. * .cols - Specifies which columns to change. Default is everything(), which selects all columns.
Example
rename_with(iris, ~ toupper(.x))
Converts all column names of the iris dataframe to uppercase.
SEPAL.LENGTH | SEPAL.WIDTH | PETAL.LENGTH | PETAL.WIDTH | SPECIES |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
Summary
Separates a column with delimited values into multiple rows.
Syntax
separate_rows(<column_text_date>
, sep = <text>
, convert=<logical>
)
Arguments
Example
Original data:
company | investors |
---|---|
Uber | Lowercase Capital, Benchmark Capital, Google Ventures |
Xiaomi | Digital Sky Technologies, QiMing Venture Partners, Qualcomm Ventures |
separate_rows(investors, sep=", ")
company | investors |
---|---|
Uber | Lowercase Capital |
Uber | Benchmark Capital |
Uber | Google Ventures |
Xiaomi | Digital Sky Technologies |
Xiaomi | QiMing Venture Partners |
Xiaomi | Qualcomm Ventures |
Summary
Unnest a list column by turning each element of a list-column into a column.
Syntax
unnest_wider(col = <column_list>
, names_sep = <text>
, names_repair = <text>
, simplify = <logical>
, ptype = <list>
, transform = <list>
)
Arguments
names_repair (Optional) - Used to check that output data frame has valid names. Must be one of the following options
simplify (Optional) - If TRUE, will attempt to simplify lists of length-1 vectors to an atomic vector.
ptype (Optional) - A named list of prototypes declaring the desired output type of each component. Use this argument if you want to check each element has the types you expect when simplifying.
transform (Optional) - A named list of transformation functions applied to each component. Use this argument if you want to transform or parse individual elements as they are hoisted.
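Since no example is listed above, here is a minimal sketch assuming tidyr and tibble, with illustrative data:
library(tidyr)
df <- tibble::tibble(
  issue_id = c(720, 721),
  assignee = list(
    list(name = "Kan", id = 105),
    list(name = "Kei", id = 106)
  )
)
# Each element of the list-column becomes its own column
unnest_wider(df, assignee)
# issue_id | name | id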
Summary
Unnest a list column by turning each element of a list-column into a row.
Syntax
unnest_longer(col = <column_list>
, values_to = <text>
, indices_to = <text>
, indices_include = <logical>
, names_repair = <text>
, simplify = <logical>
, ptype = <list>
, transform = <list>
)
Arguments
names_repair (Optional) - Used to check that output data frame has valid names. Must be one of the following options
simplify (Optional) - If TRUE, will attempt to simplify lists of length-1 vectors to an atomic vector.
ptype (Optional) - A named list of prototypes declaring the desired output type of each component. Use this argument if you want to check each element has the types you expect when simplifying.
transform (Optional) - A named list of transformation functions applied to each component. Use this argument if you want to transform or parse individual elements.
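Since no example is listed above, here is a minimal sketch assuming tidyr and tibble, with illustrative data:
library(tidyr)
df <- tibble::tibble(
  issue_id = c(720, 721),
  assignee = list(c("Kan", "Kei"), c("Kan", "Marie", "Simon"))
)
# Each element of the list-column becomes its own row
unnest_longer(df, assignee)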
Summary
hoist() allows you to selectively pull components of a list-column out into their own top-level columns.
Syntax
hoist(.col = <column_list>
, .remove = <logical>
, .simplify = <logical>
, .ptype = <list>
, .transform=<list>
)
Arguments
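A minimal sketch assuming tidyr and tibble, with illustrative data:
library(tidyr)
df <- tibble::tibble(
  issue_id = c(720, 721),
  assignee = list(
    list(name = "Kan", country = "US"),
    list(name = "Kei", country = "Japan")
  )
)
# Pull only 'name' out as a top-level column; the rest stays in assignee
hoist(df, assignee, name = "name")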
Summary
Drop rows that have NA values.
Syntax
drop_na(<column(s)>
,...)
drop_na(-<column(s)>
,...)
drop_na(starts_with(<text>
, ignore.case = <logical>
))
drop_na(ends_with(<text>
, ignore.case = <logical>
))
drop_na(contains(<text>
, ignore.case = <logical>
))
drop_na(matches(<text>
, ignore.case = <logical>
))
drop_na(num_range(<text>
, <start_num>:<end_num>
))
drop_na(one_of(<text1>
, <text2>
, ...))
drop_na(<column>
, everything())
Arguments
Example
Original data:
location | year | plant_date | harvest_date |
---|---|---|---|
Als | 1900 | NA | 35.02899 |
Als | 1901 | 10.00000 | 24.02899 |
Als | 1902 | 12.58863 | 45.02899 |
Bea | 1900 | 5.00000 | 18.00000 |
Bea | 1901 | 3.31119 | 10.00000 |
Bea | 1902 | 29.12891 | NA |
Bor | 1900 | 12.52136 | 18.14896 |
Bor | 1901 | NA | 10.64896 |
Bor | 1902 | 10.23056 | 20.64896 |
drop_na(ends_with("date"))
Returns a data frame without NA values in plant_date and harvest_date columns.
location | year | plant_date | harvest_date |
---|---|---|---|
Als | 1901 | 10.00000 | 24.02899 |
Als | 1902 | 12.58863 | 45.02899 |
Bea | 1900 | 5.00000 | 18.00000 |
Bea | 1901 | 3.31119 | 10.00000 |
Bor | 1900 | 12.52136 | 18.14896 |
Bor | 1902 | 10.23056 | 20.64896 |
Summary
Add rows with given column values.
Syntax
add_row()
Arguments
Return Value
Data Frame
Example
Original Data:
ID | name |
---|---|
4716 | Oracle |
4755 | Rakuten |
2432 | DeNA |
add_row(ID = 4689)
ID | name |
---|---|
4716 | Oracle |
4755 | Rakuten |
2432 | DeNA |
4689 | NA |
add_row(ID = 4689, name = "Yahoo")
ID | name |
---|---|
4716 | Oracle |
4755 | Rakuten |
2432 | DeNA |
4689 | Yahoo |
Summary
Parses characters, extracts numeric values, and converts them to the number data type.
Syntax
parse_number(<column>
, na = c("", "NA"), locale = <locale>
)
Arguments
Return Value
Numeric
Example
parse_number("12345")
Returns 12345
parse_number("12345.10")
Returns 12345.1
parse_number("$12,345.10")
Returns 12345.1
parse_number("$12.20M")
Returns 12.2
parse_number(as.Date("2015-01-30") - as.Date("2015-01-15"))
Returns 15
Summary
Convert data to double precision number data type.
Syntax
parse_double(<column>
, na = c("", "NA"), locale = <locale>
)
Arguments
Return Value
Double
Example
parse_double("12345")
Returns 12345
parse_double("12345.10")
Returns 12345.1
Summary
Convert data in the European number format (comma as the decimal separator) to double precision number data type.
Syntax
parse_euro_double(<column>
, na = c("", "NA"))
Arguments
Return Value
Double
Example
parse_euro_double("12345")
Returns 12345
parse_euro_double("12345.10")
Returns 12345.1
Summary
Convert data to integer number data type.
Syntax
parse_integer(<column>
, na = c("", "NA"), locale = <locale>
)
Arguments
Return Value
Integer
Example
parse_integer("12345")
Returns 12345
parse_integer("12345.10")
Returns 12345
Summary
Converts character strings containing time values into time objects (hms format). When combined with as.character(), it can effectively transform time data from various formats into standardized hms objects.
Syntax
parse_time(<column>
, format = <time_format>
, locale = <locale>
)
Arguments
Return Value
time (hms type time object)
Example
parse_time(as.character(time_column))
Converts standard time format strings to hms type time.
time_column | converted_time |
---|---|
13:10:05 | 13:10:05 |
09:45:30 | 09:45:30 |
23:59:59 | 23:59:59 |
R_Start parse_time(as.character(time_column), format = "%I:%M:%S %p") R_End
Converts 12-hour format (with AM/PM) time strings to 24-hour format hms type.
time_column | converted_time |
---|---|
10:20:15 PM | 22:20:15 |
08:45:00 AM | 08:45:00 |
12:00:00 PM | 12:00:00 |
parse_time(as.character(time_column), locale = locale(tz = "America/Los_Angeles"))
Converts time strings with a specific time zone specified.
time_column | converted_time |
---|---|
13:10:05 | 13:10:05 |
09:45:30 | 09:45:30 |
R_Start parse_time(as.character(time_column), locale = locale("ja")) R_End
Converts time strings in Japanese locale (including AM/PM notation).
time_column | converted_time |
---|---|
10:20:15 午後 | 22:20:15 |
08:45:00 午前 | 08:45:00 |
R_Start parse_time(as.character(time_column), format = "%H.%M") R_End
Converts period-separated time format (hour.minute) to standard time type.
time_column | converted_time |
---|---|
13.45 | 13:45:00 |
09.30 | 09:30:00 |
22.15 | 22:15:00 |
R_Start parse_time(as.character(time_column), format = "%H時%M分%S秒") R_End
Converts Japanese format time notation (hour-minute-second) to standard time type.
time_column | converted_time |
---|---|
13時45分30秒 | 13:45:30 |
09時15分00秒 | 09:15:00 |
23時59分59秒 | 23:59:59 |
Summary
Convert a given data to character data type.
Syntax
parse_character(<column>
, na = c("", "NA"), locale = <locale>
)
Arguments
Return Value
Character
Example
parse_character(123.11)
Returns "123.11" as character.
Summary
Parse data and convert to factor data type.
Syntax
parse_factor(x, levels, ordered = <logical>
, na = c("", "NA"), locale = <locale>
)
Arguments
Return Value
Factor
Example
columnA - "iMac", "iPod", "iPhone", "iPod", "iPhone"
parse_factor(columnA)
Returns iMac, iPod, iPhone, iPod, iPhone (It has Levels information of : iMac iPhone iPod)
Summary
Convert data to Logical (Boolean) data type.
Syntax
parse_logical(<column>
, na = c("", "NA"), locale = <locale>
)
Arguments
Return Value
Logical
Example
parse_logical("TRUE")
Returns TRUE
parse_logical("T")
Returns TRUE
Summary
Heuristically guess the data type for each column by reading the first 1000 rows, parse the data, and set appropriate data types for all the columns of the data frame.
Syntax
type_convert()
Arguments
Return Value
Data Frame
Example
type_convert()
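A minimal sketch assuming readr, with illustrative character data:
library(readr)
df <- data.frame(
  x = c("1", "2", "3"),
  d = c("2015-01-01", "2015-01-02", "2015-01-03"),
  stringsAsFactors = FALSE
)
# Character columns are parsed and retyped by the guessed type
str(type_convert(df))   # x becomes double, d becomes Date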
Summary
Convert Character or Number in year-month order to Date / Period when data contains only Date, but not Time.
Syntax
ym(<column>
, tz = <timezone>
, locale = <locale>
)
Arguments
Return Value
Date
Example
ym("2015-10")
returns "2015-10-01"
ym("2015/10")
returns "2015-10-01"
ym("Created on 2015 10")
returns "2015-10-01"
ym("2015, Oct", locale = "English")
returns "2015-10-01"
Summary
Convert Character or Number in month-year order to Date / Period when data contains only Date, but not Time.
Syntax
my(<column>
, tz = <timezone>
, locale = <locale>
)
Arguments
Return Value
Date
Example
returns "2015-01-01"
my("01/2015")
returns "2015-01-01"
my("Created on 1 2015")
returns "2015-01-01"
my("Oct, 2015", locale = "English")
returns "2015-10-01"
Summary
Convert text to Date data type if a given column holds values that look like Year and Quarter.
Syntax
yq(<column>
, tz = <timezone>
, locale = <locale>
)
Arguments
Return Value
Date
Example
yq("2016.2")
returns "2016-04-01"
yq("2016-03")
returns "2016-07-01"
Summary
Extract week numbers of the year. Weeks start from Sunday.
Syntax
epiweek(<column_date>
)
Return Value
Numeric
Example
epiweek(ymd(c("2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04", "2020-01-05", "2020-01-06", "2020-01-07", "2020-01-08")))
returns 1,1,1,1,2,2,2,2 (2020-01-05 is Sunday)
Summary
Convert a given POSIXct to Date.
Syntax
as_date(<column_date>
)
Arguments
Return Value
Date
Example
as_date(as_datetime("2015-10-10 13:10:05"))
Returns "2015-10-10"
Summary
Convert given data to date time (POSIXct).
Syntax
as_datetime(<column_date>
, format = <date_time_format>
, tz = <timezone>
, origin = <text>
)
Arguments
Return Value
POSIXct
Example
as_datetime("2015-10-10 13:10:05")
Returns "2015-10-10 13:10:05 PDT"
as_datetime("2015-10-10 13:10:05", tz = "America/Los_Angeles")
Returns "2015-10-10 13:10:05 PDT"
as_datetime("2015-10-10 13:10:05", tz = "Asia/Tokyo")
Returns "2015-10-10 13:10:05 JST"
as_datetime("05-10-15T13:10:05", format = "%d-%m-%yT%H:%M")
Returns "2015-10-05 13:10:00 PDT"
Summary
Remove leading and trailing white spaces and convert multiple consecutive white spaces into a single space.
Syntax
str_squish(<column_text>
)
Arguments
Return Value
Character
Example
str_squish(X1)
Remove leading and trailing white spaces and convert multiple consecutive white spaces into a single space.
X1 | NEW |
---|---|
String with spaces | String with spaces |
Hello World | Hello World |
Multiple white spaces | Multiple white spaces |
Summary
Truncate text to make the maximum length to be the specified number.
Syntax
str_trunc(<column_text>
, width = <number>
, side = "right"|"left"|"center"
, ellipsis = <text>
)
Arguments
Return Value
Character
Example
str_trunc("Exploratory", 6)
returns "Exp..."
str_trunc("Exploratory", 6, side = "left")
returns "...ory"
Summary
Remove letters that match with given letters or expressions.
Syntax
str_remove(<column_text>
, <pattern>
)
Arguments
Return Value
Character
Example
str_remove("I am traveling to New York!!", "New")
returns "I am traveling to York!!"
Summary
Remove letters that match with given letters or expressions.
Syntax
str_remove_all(<column_text>
, <pattern>
)
Arguments
Return Value
Character
Example
str_remove_all("I am traveling to New York and New Jersey!!", "new")
returns "I am traveling to York and Jersey!!"
Summary
Remove a word from a sentence.
Syntax
str_remove_word(<column_text>
, start = <number>
, end = <number>
, sep = <text>
)
Arguments
Return Value
Character
Example
str_remove_word(text_column, start = 1, sep = " ")
Removes the first word from the text column.
text_column | new_column |
---|---|
Hello World | World |
R Programming | Programming |
Data Analysis | Analysis |
str_remove_word(text_column, start = -1, sep = " ")
Removes the last word from the text column.
text_column | new_column |
---|---|
Hello World | Hello |
R Programming | R |
Data Analysis | Data |
str_remove_word(text_column, end = -1, sep = "-")
Removes the last hyphen-separated word from the text column.
text_column | new_column |
---|---|
user-name-id | user-name |
first-last-middle | first-last |
product-color-size | product-color |
Summary
Replace a word in a sentence with the given text.
Syntax
str_replace_word(<column_text>
, start = <number>
, end = <number>
, sep = <text>
, rep = <text>
)
Arguments
NOTE: For now, only the first and the last words work.
Return Value
Character
Example
str_replace_word(text_column, start = -1, sep = " ", rep = "England")
Replaces the last word in the text column with "England".
text_column | new_column |
---|---|
Hello World | Hello England |
R Programming | R England |
Text Analysis | Text England |
str_replace_word(text_column, start = 1, sep = " ", rep = "Data")
Replaces the first word in the text column with "Data".
text_column | new_column |
---|---|
Hello World | Data World |
R Programming | Data Programming |
Text Analysis | Data Analysis |
str_replace_word(text_column, start = 1, sep = "-", rep = "PREFIX")
Replaces the first hyphen-separated word in the text column with "PREFIX".
text_column | new_column |
---|---|
user-name-id | PREFIX-name-id |
first-last-middle | PREFIX-last-middle |
product-color-size | PREFIX-color-size |
Summary
Remove text enclosed by the given begin and end characters.
Syntax
str_remove_inside(<column_text>
, begin = <text>
, end = <text>
, all = <logical>
, include_special_chars = <logical>
)
Arguments
Return Value
Character
Example
exploratory::str_remove_inside("I am traveling to New York (2020)", begin = "(", end = ")")
returns "I am traveling to New York "
Summary
Replace text enclosed by the given begin and end characters with the provided text.
Syntax
str_replace_inside(<column_text>
, begin = <text>
, end = <text>
, rep = <text>
, all = <logical>
, include_special_chars = <logical>
)
Arguments
Return Value
Character
Example
exploratory::str_replace_inside("I am traveling to New (York)", begin = "(", end = ")", rep = "England")
returns "I am traveling to New England"
Summary
Remove a URL from text.
Syntax
str_remove_url(<column_text>
, position = <text>
)
Arguments
Return Value
Character
Example
exploratory::str_remove_url("I am traveling to New York http://example.com", position = "any")
returns "I am traveling to New York "
Summary
Replace an URL inside of characters with provided text.
Syntax
str_replace_url(<column_text>
, rep = <text>
)
Arguments
Return Value
Character
Example
exploratory::str_replace_url("I am traveling to New http://example.com/ny", rep = "York")
returns "I am traveling to New York"
Summary
Extracts a URL from text.
Syntax
str_extract_url(<column_text>
)
Arguments
Return Value
Character
Example
exploratory::str_extract_url("I am traveling to New http://example.com/ny")
returns "http://example.com/ny"
Summary
Removes emojis from text.
Syntax
str_remove_emoji(<column_text>
)
Arguments
Return Value
Character
Example
exploratory::str_remove_emoji("I am traveling to New York 🤩")
returns "I am traveling to New York "
Summary
Returns a summary information of a given model in a tidy (normalized data frame) format.
Syntax
model_info(<column_list>
, output = <model_info_output_type>
, ...)
Arguments
* output (Optional) - The default is "summary". This changes the type of output. It can be "summary", "variables", or "data", as shown in the examples below.
Returned Values
Example
The following examples assume that the 'model' column contains an 'lm' model fitted by the do() command.
model_info(model, output = "summary")
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|---|---|---|---|
0.8879 | 0.8879 | 13.0886 | 1716441.58 | 0 | 3 | -1729152.62 | 3458313.24 | 3458357.15 | 74228528.48 | 433295 |
model_info(model, output = "variables")
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -3.0966 | 0.0341 | -90.749 | 0 |
DEP_DELAY | 1.003 | 0.0005 | 1851.9463 | 0 |
DISTANCE | -0.0037 | 0 | -114.3364 | 0 |
model_info(model, output = "data")
.rownames | ARR_DELAY | DEP_DELAY | .fitted | .se.fit | .resid | .hat | .sigma | .cooksd | .std.resid |
---|---|---|---|---|---|---|---|---|---|
1 | -15 | -9 | -13.934 | 0.0247 | -1.066 | 0.0000035503 | 13.0886 | 7.8503e-9 | -0.0814 |
2 | -28 | -19 | -22.4116 | 0.035 | -5.5884 | 0.0000071333 | 13.0886 | 4.3348e-7 | -0.427 |
Summary
Converts numeric values representing seconds into hms (hours:minutes:seconds) time format.
Syntax
as.hms(<column_num>
)
Arguments
Return Value
hms
Example
as.hms(X1)
Converts the numerical values in column X1 to hms type, handling them in seconds.
X1 | NEW |
---|---|
20 | 00:00:20 |
3600 | 01:00:00 |
7320 | 02:02:00 |
Summary
Returns true if a given object is hms type.
Syntax
is.hms(<column>
)
Arguments
Return Value
Logical
Example
mutate_if(is.hms, as.character)
Convert hms columns into character columns. An example of the value is "05:01:20".
Summary
Removes empty rows or columns. Useful for removing rows or columns that are empty (NA, NULL, blank) from a data frame.
Syntax
remove_empty(<data_frame>
, which = <"rows"|"cols">
, quiet = <TRUE|FALSE>
)
Arguments
* data_frame - The data frame from which to remove empty rows or columns.
* which - Specifies what to remove:
  * "rows" - Removes empty rows
  * "cols" - Removes empty columns
  * c("rows", "cols") - Removes both empty rows and columns
* quiet (Optional) - Specifies whether to display information messages about removed rows or columns. Default is FALSE (displays messages).
Return Value
data.frame
Example
Original data:
name | age | city |
---|---|---|
Smith | 25 | London |
NA | NA | NA |
Johnson | 30 | Paris |
remove_empty(data_frame, which = "rows")
Removes rows where all columns are NA, NULL, or blank values.
name | age | city |
---|---|---|
Smith | 25 | London |
Johnson | 30 | Paris |
Original data:
name | age | city | notes |
---|---|---|---|
Smith | 25 | London | NA |
Johnson | 30 | Paris | NA |
Williams | 28 | Berlin | NA |
remove_empty(data_frame, which = "cols")
Removes columns where all rows are NA, NULL, or blank values.
name | age | city |
---|---|---|
Smith | 25 | London |
Johnson | 30 | Paris |
Williams | 28 | Berlin |
Original data:
name | age | city | notes |
---|---|---|---|
Smith | 25 | London | NA |
NA | NA | NA | NA |
Johnson | 30 | Paris | NA |
remove_empty(data_frame, which = c("rows", "cols"))
Removes both empty rows and columns.
name | age | city |
---|---|---|
Smith | 25 | London |
Johnson | 30 | Paris |
Summary
Make column names clean by using underscores (_), lowercase letters, and numbers for easier data handling and standardization.
Syntax
clean_names(<data_frame>
, case = <"Format">
)
Arguments
Return Value
data.frame
Example
clean_names(data_frame)
Original data:
Item Name | .transaction |
---|---|
banana | 1 |
apple | 1 |
carrot | 2 |
apple | 2 |
orange | 3 |
NA | 3 |
Returns data with clean column names using default settings (snake_case):
item_name | transaction |
---|---|
banana | 1 |
apple | 1 |
carrot | 2 |
apple | 2 |
orange | 3 |
NA | 3 |
clean_names(data_frame, case = "upper_camel")
Using case="upper_camel" (UpperCamelCase):
ItemName | Transaction |
---|---|
banana | 1 |
apple | 1 |
carrot | 2 |
apple | 2 |
orange | 3 |
NA | 3 |
Summary
Extracts rows that are duplicated in the specified column(s). If no columns are specified, it returns rows that are duplicated across all columns.
Syntax
get_dupes(<column(s)>
)
Arguments
Return Value
data.frame (containing only the duplicated rows from the original data frame, with an added "dupe_count" column showing the number of duplicates)
Example
get_dupes(data_frame, name)
Original data:
name | n |
---|---|
banana | 1 |
apple | 1 |
apple | 1 |
apple | 2 |
Returns rows that are duplicated in the name column (with a dupe_count column added):
name | n | dupe_count |
---|---|---|
apple | 1 | 3 |
apple | 1 | 3 |
apple | 2 | 3 |
get_dupes(data_frame, name, n)
Returns rows that are duplicated across both name and n columns:
name | n | dupe_count |
---|---|---|
apple | 1 | 2 |
apple | 1 | 2 |
Summary
Converts Excel numeric date values (e.g., 42370) to date format (e.g., 2016-01-01)
Syntax
excel_numeric_to_date(<column_num>
, date_system = "modern"|"mac pre-2011"
)
Arguments
Example
Original data:
user | last_login |
---|---|
Lisa | 42370 |
Emily | 42318 |
John | 42489 |
excel_numeric_to_date(last_login)
Converts Excel numeric date values to date format as shown below.
user | last_login |
---|---|
Lisa | 2016-01-01 |
Emily | 2015-11-10 |
John | 2016-04-29 |
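If the file came from Excel for Mac 2008 or earlier, which counts days from 1904 instead of 1900, you can pass the documented date_system argument. A minimal sketch:
excel_numeric_to_date(last_login, date_system = "mac pre-2011")
Converts the same numeric values using the 1904-based Mac date system, so the resulting dates differ from the "modern" results above.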
Return Value
Date
Summary
Create frequency table for a column in a data frame. Useful for quickly understanding data distribution with percentage calculations and NA handling.
Syntax
tabyl(<column_name>
, sort = <TRUE|FALSE>
, show_na = <TRUE|FALSE>
, show_missing_levels = <TRUE|FALSE>
)
Arguments
Return Value
data.frame (frequency table)
Example
tabyl(data_frame, item_name)
Original data:
item_name | transaction_id |
---|---|
banana | 1 |
apple | 1 |
NA | NA |
carrot | 2 |
apple | 2 |
NA | NA |
orange | 3 |
NA | 3 |
Shows frequency table for item_name column (including NA values):
item_name | n | percent | valid_percent |
---|---|---|---|
apple | 2 | 0.250 | 0.40 |
banana | 1 | 0.125 | 0.20 |
carrot | 1 | 0.125 | 0.20 |
orange | 1 | 0.125 | 0.20 |
NA | 3 | 0.375 | NA |
tabyl(data_frame, item_name, sort = TRUE)
Sorted by frequency (high to low):
item_name | n | percent | valid_percent |
---|---|---|---|
NA | 3 | 0.375 | NA |
apple | 2 | 0.250 | 0.40 |
banana | 1 | 0.125 | 0.20 |
carrot | 1 | 0.125 | 0.20 |
orange | 1 | 0.125 | 0.20 |
tabyl(data_frame, item_name, show_na = FALSE)
Excluding NA values:
item_name | n | percent |
---|---|---|
apple | 2 | 0.40 |
banana | 1 | 0.20 |
carrot | 1 | 0.20 |
orange | 1 | 0.20 |
Summary
Convert a given column to a factor. Compared to base R's as.factor, this function creates levels in the order in which they appear, which will be the same on every platform (base R sorts in the current locale which can vary from place to place).
Syntax
as_factor(<column_char_num_date>
)
Arguments
Example
x <- c("FR", "US", "JP")
as_factor(x)
Returns a factor with FR, US, JP with levels of FR, US, JP.
Summary
Anonymize factor levels. Replaces factor levels with arbitrary numeric identifiers. Neither the values nor the order of the levels are preserved.
Syntax
fct_anon(<column_char_num_factor>
, prefix = <text>
)
Arguments
Example
x <- factor(c("FR", "US", "JP"))
fct_anon(x)
Returns a factor with 1, 2, 3.
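The documented prefix argument can be used to label the anonymized levels. A minimal sketch:
fct_anon(x, prefix = "country_")
Returns a factor with levels such as country_1, country_2, country_3. Note that the assignment of identifiers to the original levels is randomized.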
Summary
Add additional levels to a factor
Syntax
fct_expand(<column_char_num_factor>
, <text>
)
Arguments
Example
x <- factor(c("FR", "US"))
fct_expand(x, "JP", "CA")
Returns a factor with levels, FR US JP CA.
Summary
Drop unused levels.
Syntax
fct_drop(<column_char_num_factor>
, only = <text>
)
Arguments
Example
x <- factor(c("FR", "US"), levels = c("FR", "US", "JP", "CA"))
fct_drop(x)
Returns a factor with levels of FR, US.
fct_drop(x, only = "CA")
Returns a factor with levels of FR, US, JP.
Summary
This gives missing values an explicit factor level, ensuring that they appear in charts.
Syntax
fct_explicit_na(<column_char_num_factor>
, na_level = <text>
)
Arguments
Example
x <- factor(c("FR", NA, "US", NA, "JP"))
fct_explicit_na(x, "No value")
Returns a factor with FR, No value, US, No value, JP and with levels of FR JP US No value
Summary
Moves least/most common factor levels into "Other" category.
Syntax
fct_lump(<column_char_num_factor>
, n = <number>
, prop = <number>
, other_level = <text>
, ties.method = <ties_method_type>
, w = <column_num>
)
Arguments
If both n and prop are missing, fct_lump lumps together the least frequent levels into "other", while ensuring that "other" is still the smallest level.
Example
mutate(STATE = fct_lump(STATE))
This keeps most of the values in the STATE column, lumping the least frequent values into the 'Other' level so that 'Other' still has the smallest count compared to all the other levels.
mutate(ORIGIN_STATE_ABR = fct_lump(ORIGIN_STATE_ABR, n = 10))
This keeps only the 10 most frequent States in the ORIGIN_STATE_ABR column and lumps all the others into the 'Other' level.
mutate(ORIGIN_STATE_ABR = fct_lump(ORIGIN_STATE_ABR, prop = .1))
This keeps only the States that account for more than 10% of all the entries in the ORIGIN_STATE_ABR column and lumps all the others into the 'Other' level.
mutate(ORIGIN_STATE_ABR = fct_lump(ORIGIN_STATE_ABR, n = 5, w = DISTANCE))
This keeps only the 5 States with the largest weight, calculated as the sum of DISTANCE for each State, in the ORIGIN_STATE_ABR column and lumps all the others into the 'Other' level.
Summary
Creates "Other" category by moving specified values to "Other", or by keeping only specified values from "Other".
Syntax
fct_other(<column_char_num_factor>
, keep = c("<column_value(s)>
"), drop = c("<column_value(s)>
"), other_level = <text>
)
Arguments
Example
mutate(STATE = fct_other(STATE, keep = c("CA", "AK", "AZ")))
This keeps only the levels for the specified values in the STATE column and moves all other values into the 'Other' level.
mutate(STATE = fct_other(STATE, drop = c("CA", "AK", "AZ")))
This keeps the levels that are not specified here and moves the specified values into the 'Other' level.
Summary
Reorder factor levels by first appearance.
Syntax
fct_inorder(<column_char_num_factor>
, ordered = <logical>
)
Arguments
Example
x <- factor(c("FR", "US", "US", "JP", "US", "JP"))
fct_inorder(x)
Returns a factor with levels in the following order, FR US JP.
x <- c("FR", "US", "US", "JP", "US", "JP") -- it can also be used on a Character data type column.
fct_inorder(x)
Returns a factor with levels in the following order, FR US JP.
Summary
Reorder factor levels by frequency (from the most frequent to the least).
Syntax
fct_infreq(<column_char_num_factor>
, ordered = <logical>
)
Arguments
Example
x <- factor(c("FR", "US", "US", "JP", "US", "JP"))
fct_infreq(x)
Returns a factor with levels in the following order, US JP FR.
x <- c("FR", "US", "US", "JP", "US", "JP") -- it can also be used on a Character data type column.
fct_infreq(x)
Returns a factor with levels in the following order, US JP FR.
Summary
Sets the level of a given factor column by moving any number of levels to any location.
Syntax
fct_relevel(<column_char_num_factor>
, ...)
Arguments
Example
x <- c("France", "Japan", "Australia", "United States") -- creating a character column.
fct_relevel(x, "Japan")
Returns a factor with levels in the order Japan, Australia, France, United States. The character column is converted to a factor with the default level setting, which is alphabetical order.
x <- factor(c("France", "Japan", "Australia", "United States"), levels = c("France", "Japan", "Australia", "United States")) -- creating a factor column with explicit levels.
fct_relevel(x, "Japan")
Returns a factor with levels in the order Japan, France, Australia, United States.
fct_relevel(x, "Japan", "United States")
Returns a factor with levels in the order Japan, United States, France, Australia.
Summary
Sets the order of values based on sorted values from another column and converts the data type to factor.
Syntax
fct_reorder(<column_char_num_factor>
, <column_num>
, .fun = <aggregate_function>
, ..., .desc = <logical>
)
Arguments
Example
fct_reorder(Country, Sales)
Sets the Country column levels by the median values of the Sales column in ascending order. Since the median Sales values are France(500) < Japan(1000) < USA(2000), the levels are set in this order.
Country | Sales | NEW |
---|---|---|
Japan | 1000 | Japan |
Japan | 1200 | Japan |
Japan | 800 | Japan |
USA | 2000 | USA |
USA | 2200 | USA |
USA | 1800 | USA |
France | 500 | France |
France | 600 | France |
France | 400 | France |
fct_reorder(Country, Sales, .desc = TRUE)
Sets the Country column levels by the median values of the Sales column in descending order. Since the median Sales values are USA(2000) > Japan(1000) > France(500), the levels are set in this order.
Country | Sales | NEW |
---|---|---|
Japan | 1000 | Japan |
Japan | 1200 | Japan |
Japan | 800 | Japan |
USA | 2000 | USA |
USA | 2200 | USA |
USA | 1800 | USA |
France | 500 | France |
France | 600 | France |
France | 400 | France |
fct_reorder(Country, Sales, .fun = sum, .desc = TRUE, na.rm = TRUE)
Sets the Country column levels by the sum of the Sales column in descending order. Since the sum of Sales values are USA(6000) > Japan(3000) > France(900), the levels are set in this order.
Country | Sales | NEW |
---|---|---|
Japan | 1000 | Japan |
Japan | 1200 | Japan |
Japan | 800 | Japan |
USA | 2000 | USA |
USA | 2200 | USA |
USA | 1800 | USA |
France | 500 | France |
France | NA | France |
France | 400 | France |
Summary
Reverse order of factor levels.
Syntax
fct_rev(<column_factor>
)
Arguments
Example
x <- factor(c("FR", "US", "JP"))
This creates a factor column with levels in the order FR, JP, US (alphabetical order by default).
fct_rev(x)
Returns a factor with levels in the order US, JP, FR.
Summary
Create extreme gradient boosting model for regression.
Syntax
xgboost_reg(watchlist_rate = <numeric>
, output_type = <xgb_reg_output_type>
, nrounds = <integer>
, booster = <xgboost_booster_type>
, eval_metric = <xgb_reg_evaluation_type>
, weight = <column_num>
, early_stopping_rounds = <integer>
)
Arguments
Return Value
Model
Example
Original data:
CARRIER | ARR_DELAY | DEP_DELAY | DISTANCE |
---|---|---|---|
UA | 10 | 7 | 300 |
UA | 20 | 30 | 1000 |
UA | -5 | 0 | 200 |
AA | 20 | 20 | 500 |
AA | -5 | 3 | 700 |
AA | 10 | 8 | 1000 |
%>% group_by(CARRIER)
%>% build_model(model_func = xgboost_reg, formula = DEP_DELAY ~ ARR_DELAY + DISTANCE, test_rate = 0.1)
Returns a data frame that stores an xgb.Booster model based on a formula defined as 'Predict DEP_DELAY based on ARR_DELAY and DISTANCE.' It also returns a column of original data.
CARRIER | source_data | model | .test_index |
---|---|---|---|
UA | source dataframe | xgb.Booster model | c(1) |
AA | source dataframe | xgb.Booster model | c(2) |
You can use model_coef or model_stats function to get the summary information about the models. Also, you can use prediction function to predict with the data that is stored in the same data frame.
Summary
Create extreme gradient boosting model for binary classification.
Syntax
xgboost_binary(watchlist_rate = <numeric>
, output_type = <xgb_binary_output_type>
, nrounds = <integer>
, booster = <xgboost_booster_type>
, eval_metric = <xgb_binary_evaluation_type>
, weight = <column_num>
, early_stopping_rounds = <integer>
)
Arguments
Return Value
Model
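Example
A minimal sketch following the same build_model pattern shown for xgboost_reg above, assuming a logical IS_DELAYED column as the target:
%>% group_by(CARRIER)
%>% build_model(model_func = xgboost_binary, formula = IS_DELAYED ~ DEP_DELAY + DISTANCE, test_rate = 0.1)
Returns a data frame that stores an xgb.Booster model for each CARRIER group. You can use model_coef or model_stats function to get the summary information about the models, and prediction_binary function to predict with the data that is stored in the same data frame.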
Summary
Create extreme gradient boosting model for multi class classification.
Syntax
xgboost_multi(watchlist_rate = <numeric>
, output_type = <xgb_multi_output_type>
, nrounds = <integer>
, booster = <xgboost_booster_type>
, eval_metric = <xgb_multi_evaluation_type>
, weight = <column_num>
, early_stopping_rounds = <integer>
)
Arguments
Return Value
Model
Example
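A minimal sketch following the same build_model pattern shown for xgboost_reg above, assuming a categorical CARRIER column as the target:
%>% build_model(model_func = xgboost_multi, formula = CARRIER ~ DEP_DELAY + DISTANCE, test_rate = 0.1)
Returns a data frame that stores an xgb.Booster model. You can use the prediction function to predict with the data that is stored in the same data frame.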
Summary
Create random forest model for regression.
Syntax
randomForestReg(subset = <column_logical>
, na.action = <'na_action_type'>
, ntree = <integer>
, mtry = <numeric>
, replace = <logical>
, strata = <column>
, sampsize = <integer>
, nodesize = <integer>
, maxnodes = <integer>
, importance = <logical>
, localImp = <logical>
, nPerm = <integer>
, proximity = <logical>
, oob.prox = <logical>
, keep.forest = <logical>
, corr.bias = <logical>
, keep.inbag = <logical>
)
Arguments
Return Value
Model
Example
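A minimal sketch following the same build_model pattern shown for xgboost_reg above, assuming the same flight delay data:
%>% build_model(model_func = randomForestReg, formula = ARR_DELAY ~ DEP_DELAY + DISTANCE, test_rate = 0.1)
Returns a data frame that stores a randomForest model. You can use model_stats function to get the summary information about the model, and prediction function to predict with the data that is stored in the same data frame.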
Summary
Create random forest model for binary classification.
Syntax
randomForestBinary(subset = <column_logical>
, na.action = <'na_action_type'>
, ntree = <integer>
, mtry = <numeric>
, replace = <logical>
, classwt = <numeric>
, cutoff = <numeric>
, strata = <column>
, sampsize = <integer>
, nodesize = <integer>
, maxnodes = <integer>
, importance = <logical>
, localImp = <logical>
, nPerm = <integer>
, proximity = <logical>
, oob.prox = <logical>
, norm.votes = <logical>
, keep.forest = <logical>
, corr.bias = <logical>
, keep.inbag = <logical>
)
Arguments
Return Value
Model
Example
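A minimal sketch following the same build_model pattern shown for xgboost_reg above, assuming a logical IS_DELAYED column as the target:
%>% build_model(model_func = randomForestBinary, formula = IS_DELAYED ~ DEP_DELAY + DISTANCE, test_rate = 0.1)
Returns a data frame that stores a randomForest model. You can use prediction_binary function to predict with the data that is stored in the same data frame.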
Summary
Create random forest model for multi class classification.
Syntax
randomForestMulti(subset = <column_logical>
, na.action = <'na_action_type'>
, ntree = <integer>
, mtry = <numeric>
, replace = <logical>
, classwt = <numeric>
, cutoff = <numeric>
, strata = <column>
, sampsize = <integer>
, nodesize = <integer>
, maxnodes = <integer>
, importance = <logical>
, localImp = <logical>
, nPerm = <integer>
, proximity = <logical>
, oob.prox = <logical>
, norm.votes = <logical>
, keep.forest = <logical>
, corr.bias = <logical>
, keep.inbag = <logical>
)
Arguments
Return Value
Model
Example
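A minimal sketch following the same build_model pattern shown for xgboost_reg above, assuming a categorical CARRIER column as the target:
%>% build_model(model_func = randomForestMulti, formula = CARRIER ~ DEP_DELAY + DISTANCE, test_rate = 0.1)
Returns a data frame that stores a randomForest model. You can use the prediction function to predict with the data that is stored in the same data frame.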
Summary
Extracts a value from a list data type column based on a position, or based on a column name and row number when the data inside the list is a data frame.
Syntax
list_extract(<column_list>
, position = <numeric_text>
, rownum = <numeric>
)
Arguments
Return Value
Character
Example
1) Column is List of Character
mutate(text = list_extract(categories, 1))
Returns the following.
category | text |
---|---|
c("Doctors", "Health & Medical") | Doctors |
c("Bars", "American (New)", "Nightlife") | Bars |
mutate(text = list_extract(categories, -1))
Returns the value of the 1st element counted from the last.
category | text |
---|---|
c("Doctors", "Health & Medical") | Health & Medical |
c("Bars", "American (New)", "Nightlife") | Nightlife |
2) Column is List of Data Frame
mutate(text = list_extract(labels, 1))
Returns the value of the 1st column and 1st row. The default is 1st row when not specified.
labels | text |
---|---|
list(name = c("Regression", "UI"), color = c("5319e7", "207de5")) | Regression |
mutate(text = list_extract(labels, -1))
Returns the value of the last column and the 1st row.
labels | text |
---|---|
list(name = c("Regression", "UI"), color = c("5319e7", "207de5")) | 5319e7 |
mutate(text = list_extract(labels, 1, 2))
Returns the value of the 1st column and 2nd row.
labels | text |
---|---|
list(name = c("Regression", "UI"), color = c("5319e7", "207de5")) | UI |
mutate(text = list_extract(labels, "name"))
Returns the value of a column, 'name', and the 1st row. The default is 1st row when not specified.
labels | text |
---|---|
list(name = c("Regression", "UI"), color = c("5319e7", "207de5")) | Regression |
Summary
Concatenates texts from all the elements of a list data type column.
Syntax
list_to_text(<column_list>
, sep = <text>
)
Arguments
Return Value
character
Example
mutate(category_text = list_to_text(category))
Returns the following.
category | category_text |
---|---|
c("Doctors", "Health & Medical") | Doctors, Health & Medical |
Nightlife | Nightlife |
c("Active Life", "Mini Golf", "Golf") | Active Life, Mini Golf, Golf |
c("Bars", "American (New)", "Nightlife") | Bars, American (New), Nightlife |
c("Bars", "American (Traditional)", "Nightlife") | Bars, American (Traditional), Nightlife |
Summary
Concatenates values from multiple columns into a list.
Syntax
list_concat(<column(s)>
, collapse = <logical>
)
Arguments
Return Value
list
Example
mutate(items = list_concat(item1, item2))
Returns the following.
Before:
name | item1 | item2 |
---|---|---|
Cathy | c("Banana", "Apple") | "Apple" |
Mary | "Lemon" | c("Lemon", "Banana") |
Lisa | c("Lemon", "Banana") | c("Banana", "Apple") |
After:
name | item1 | item2 | items |
---|---|---|---|
Cathy | c("Banana", "Apple") | "Apple" | c("Banana", "Apple", "Apple") |
Mary | "Lemon" | c("Lemon", "Banana") | c("Lemon", "Lemon", "Banana") |
Lisa | c("Lemon", "Banana") | c("Banana", "Apple") | c("Lemon", "Banana", "Banana", "Apple") |
summarize(categories = list_concat(category, collapse = TRUE))
Returns the following.
Before:
grouped by "region"
region | category |
---|---|
East | c("Doctors", "Health & Medical") |
East | Nightlife |
East | c("Active Life", "Mini Golf", "Golf") |
West | c("Bars", "American (New)", "Nightlife") |
West | c("Bars", "American (Traditional)", "Nightlife") |
After:
region | categories |
---|---|
East | c("Doctors", "Health & Medical", "Nightlife", "Active Life", "Mini Golf", "Golf") |
West | c("Bars", "American (New)", "Nightlife", "Bars", "American (Traditional)", "Nightlife") |
Summary
Returns number of elements inside a list data type column for each row.
Syntax
list_n(<column_list>
)
Arguments
Return Value
Numeric
Example
mutate(count = list_n(categories))
Returns the following.
category | count |
---|---|
c("Doctors", "Health & Medical") | 2 |
Nightlife | 1 |
c("Active Life", "Mini Golf", "Golf") | 3 |
c("Bars", "American (New)", "Nightlife", "Lounges", "Restaurants") | 5 |
c("Bars", "American (Traditional)", "Nightlife", "Restaurants") | 4 |
Summary
Centers and/or scales the numeric values of a column.
Syntax
normalize(<column_num>
, center = <logical>
, scale = <logical>
)
Arguments
* scale (Optional) - The default is TRUE. When it is TRUE and center is TRUE, scaling is done by dividing the centered values by the standard deviation, which is same as calculating (x - mean(x)) / sd(x). When it is TRUE and center is FALSE, scaling is done by dividing by the root-mean-square, which is same as calculating x / sqrt(sum(x^2)/(length(x)-1)). When it is a numeric value, input values are divided by this numeric value. When it is FALSE, no scaling is done.
Return Value
Numeric
Example
x <- c(1, 2, 3, 4, 5, 100)
normalize(x)
Returns -0.4584610 -0.4332246 -0.4079882 -0.3827518 -0.3575154 2.0399410. These are the same as the below.
(x - mean(x)) / sd(x)
normalize(x, center = FALSE)
Returns 0.02229944 0.04459888 0.06689832 0.08919776 0.11149720 2.22994404. This is the same as the below.
x / sqrt(sum(x^2)/(length(x)-1))
normalize(x, center = 0, scale = max(x))
Returns 0.01 0.02 0.03 0.04 0.05 1.00, which is x divided by max(x) (100).
Summary
Anonymize values by hashing algorithms.
Syntax
anonymize(<column>
, algo = <hash_algorithms>
, seed = <integer>
, chars = <character>
, n_chars = <integer>
)
Arguments
Example
Original data:
user | login |
---|---|
Lisa | 1482148710 |
Lisa | 1484148710 |
Emily | 1479913692 |
John | 1481128318 |
mutate(user = anonymize(user, algo = "murmur32"))
Returns anonymized values.
user | login |
---|---|
6fe21ea2 | 1482148710 |
6fe21ea2 | 1484148710 |
2f968d4a | 1479913692 |
9b5f08f7 | 1481128318 |
Return Value
Character
Summary
Converts Excel numeric date/time values to POSIXct.
Syntax
excel_numeric_to_datetime(<column_num>
, tz = <timezone>
)
Arguments
Example
mutate(Last_Login_Time_Converted = excel_numeric_to_datetime(Last_Login_Time))
Converts Last_Login_Time to POSIXct.
Customer | Last_Login_Time | Last_Login_Time_Converted |
---|---|---|
Lisa | 42370.5 | 2016-01-01 12:00:00 |
Emily | 42318.25 | 2015-11-10 06:00:00 |
John | 42489.75 | 2016-04-29 18:00:00 |
Return Value
POSIXct
Summary
Convert unix time numeric values to POSIXct.
Syntax
unixtime_to_datetime(<column_num>
)
Example
Original data:
user | last_login |
---|---|
Lisa | 1482148710 |
Emily | 1479913692 |
John | 1481128318 |
mutate(last_login = unixtime_to_datetime(last_login))
Returns a data frame with the last_login column converted to POSIXct.
user | last_login |
---|---|
Lisa | 2016-12-19 11:58:30 GMT |
Emily | 2016-11-23 15:08:12 GMT |
John | 2016-12-07 16:31:58 GMT |
Return Value
Date, POSIXct
Summary
Calculates coordinates by reducing dimensionality using SVD (Singular Value Decomposition).
Syntax
do_svd.kv(<subject_column>
, <key_column>
, <value_column>
, type = <svd_type>
, fill = <numeric>
, fun.aggregate = <aggregate_function>
, n_component = <integer>
, centering = <logical>
, output = <svd_output>
)
Arguments
Return Value
Data frame
Example
Original data:
author_name | keywords | tfidf_value |
---|---|---|
Cathy | party | 3.2 |
Cathy | exciting | 1.3 |
Cathy | friends | 0.8 |
Mary | study | 2.4 |
Mary | exam | 2.1 |
Mary | hard | 1.5 |
Lisa | sports | 1.9 |
Lisa | exciting | 1.8 |
Lisa | hard | 1.6 |
do_svd.kv(author_name, keywords, tfidf_value, n_component=2)
Reducing the number of the dimensions (keywords) to 2 and returning the result in a 'long' data form.
author_name | new.dimension | value |
---|---|---|
Cathy | 1 | -0.7131255 |
Cathy | 2 | 0.3976414 |
Lisa | 1 | 0.0121952 |
Lisa | 2 | -0.8164055 |
Mary | 1 | 0.7009303 |
Mary | 2 | 0.4187641 |
do_svd.kv(author_name, keywords, tfidf_value, n_component=2, output="wide")
Reducing the number of the dimensions (keywords) to 2 and returning the result in a 'wide' data form.
author_name | axis1 | axis2 |
---|---|---|
Cathy | -0.7131255 | 0.3976414 |
Lisa | 0.0121952 | -0.8164055 |
Mary | 0.7009303 | 0.4187641 |
do_svd.kv(author_name, keywords, tfidf_value, n_component=2, type="dimension")
Reducing the number of the dimensions (keywords) to 2 and returning the result by 'dimensions' instead of by 'group'.
top_key_words | new.dimension | value |
---|---|---|
exam | 1 | 0.4163953 |
exam | 2 | 0.3259924 |
exciting | 1 | -0.2560436 |
exciting | 2 | -0.3531243 |
friends | 1 | -0.1613866 |
friends | 2 | 0.1179235 |
hard | 1 | 0.3029449 |
hard | 2 | -0.2513704 |
party | 1 | -0.6455466 |
party | 2 | 0.4716940 |
sports | 1 | 0.0065547 |
sports | 2 | -0.5750138 |
study | 1 | 0.4758803 |
study | 2 | 0.3725628 |
do_svd.kv(author_name, keywords, tfidf_value, n_component=2, type="variance")
Reducing the number of the dimensions (keywords) to 2 and returning the 'variance' values for each dimension
new.dimension | value |
---|---|
1 | 3.534991 |
2 | 2.697623 |
Summary
Calculates coordinates by reducing dimensionality using SVD (Singular Value Decomposition).
Syntax
do_svd(<column(s)>
, skv = <character_vector>
, type = <svd_type>
, fill = <numeric>
, fun.aggregate = <aggregate_function>
, n_component = <integer>
, centering = <logical>
, output = <svd_output>
)
Arguments
Return Value
Data frame
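Example
A minimal sketch, assuming the same author_name / keywords / tfidf_value data shown for do_svd.kv above:
do_svd(skv = c("author_name", "keywords", "tfidf_value"), n_component = 2)
Returns the same kind of result as do_svd.kv(author_name, keywords, tfidf_value, n_component = 2), with the subject, key, and value columns passed as a character vector through the skv argument.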
Summary
Builds a linear regression model (lm) and stores it in a data frame.
Syntax
build_lm(<formula>
, test_rate = <numeric>
, subset = <numeric_vector>
, weights = <numeric_vector>
, na.action = <na_action_type>
, method = <lm_method_type>
, model = <logical>
, x = <logical>
, y = <logical>
, qr = <logical>
, singular.ok = <logical>
, contrasts = <list>
, offset = <numeric_vector>
)
Arguments
Return Value
Data frame
Example
Original data:
CARRIER | ARR_DELAY | DEP_DELAY | DISTANCE |
---|---|---|---|
UA | 10 | 7 | 300 |
UA | 20 | 30 | 1000 |
UA | -5 | 0 | 200 |
AA | 20 | 20 | 500 |
AA | -5 | 3 | 700 |
AA | 10 | 8 | 1000 |
%>% group_by(CARRIER)
%>% build_lm(ARR_DELAY ~ DEP_DELAY + DISTANCE, test_rate = 0.1)
Returns a data frame that stores an lm (Linear Regression) model based on a formula defined as 'Predict ARR_DELAY based on DEP_DELAY and DISTANCE.' It also returns a column of original data.
CARRIER | source_data | model | .test_index |
---|---|---|---|
UA | source dataframe | lm model | c(1) |
AA | source dataframe | lm model | c(2) |
You can use model_coef or model_stats function to get the summary information about the models. Also, you can use prediction function to predict with the data that is stored in the same data frame.
Summary
Calculates Survival Curve from survival time and survival status.
Syntax
do_survfit(<column_num>
, <column_num_logic>
, weights = <numeric_vector>
, subset = <numeric_vector>
, na.action = <na_action_type>
, id = <id_column>
, istate = <column>
, type = <survival_curve_type>
, error = <survfit_error_type>
, conf.type = <survfit_conf_int_type>
, conf.lower = <survfit_lower_conf_int_type>
, conf.int = <numeric>
, se.fit = <logical>
, influence = <logical>
)
Arguments
Return Value
Data frame
Example
Original data:
status | time |
---|---|
1 | 40 |
1 | 30 |
0 | 35 |
0 | 8 |
do_survfit(time, status)
Returns a data frame as follows.
time | n_risk | n_event | n_censor | estimate | std_error | conf_high | conf_low |
---|---|---|---|---|---|---|---|
8 | 4 | 0 | 1 | 1.0000000 | 0.0000000 | 1 | 1.0000000 |
30 | 3 | 1 | 0 | 0.6666667 | 0.4082483 | 1 | 0.2995071 |
35 | 2 | 0 | 1 | 0.6666667 | 0.4082483 | 1 | 0.2995071 |
40 | 1 | 1 | 0 | 0.0000000 | Inf | NA | NA |
Summary
Simulates Survival Curve for specified cohort based on a survival model.
Syntax
prediction_survfit(newdata = <cohort_data_set>
, na.action = <na_action_type>
, type = <coxph_survival_curve_type>
, conf.type = <survfit_conf_int_type>
, conf.int = <numeric>
, se.fit = <logical>
, censor = <logical>
, start.time = <numeric>
)
Arguments
Return Value
Data frame
Example
Original data: model data frame.
prediction_survfit(newdata = data.frame(age = c(50, 60), sex = c(0, 1)))
Returns a data frame as follows.
time | n_risk | n_event | n_censor | age_sex | estimate | std_error | conf_high | conf_low |
---|---|---|---|---|---|---|---|---|
8 | 4 | 0 | 1 | 50_0: | 1.0000000 | 0.0000000 | 1 | 1.0000000 |
30 | 3 | 1 | 0 | 50_0: | 0.6666667 | 0.4082483 | 1 | 0.2995071 |
35 | 2 | 0 | 1 | 50_0: | 0.6666667 | 0.4082483 | 1 | 0.2995071 |
40 | 1 | 1 | 0 | 50_0: | 0.0000000 | Inf | NA | NA |
Summary
Builds a logistic regression model and stores it in a data frame.
Syntax
build_lr(<formula>
, test_rate = <numeric>
, weights = <numeric_vector>
, subset = <numeric_vector>
, na.action = <na_action_type>
, start = <numeric_vector>
, etastart = <numeric_vector>
, mustart = <numeric_vector>
, offset = <numeric_vector>
, epsilon = <numeric>
, maxit = <integer>
, trace = <logical>
, model = <logical>
, method = <glm_method_type>
, contrasts = <list>
, x = <logical>
, y = <logical>
)
Arguments
Example
Original data:
CARRIER | IS_DELAYED | DEP_DELAY | DISTANCE |
---|---|---|---|
UA | TRUE | 7 | 300 |
UA | TRUE | 30 | 1000 |
UA | FALSE | 0 | 200 |
AA | TRUE | 20 | 500 |
AA | FALSE | 3 | 700 |
AA | TRUE | 8 | 1000 |
%>% group_by(CARRIER)
%>% build_lr(IS_DELAYED ~ DEP_DELAY + DISTANCE, test_rate = 0.1)
Returns a data frame that stores a GLM model based on a formula defined as 'Predict IS_DELAYED based on DEP_DELAY and DISTANCE.'
CARRIER | source_data | model | .test_index |
---|---|---|---|
UA | source dataframe | glm model | c(1) |
AA | source dataframe | glm model | c(2) |
You can use model_coef or model_stats function to get the summary information about the models. Also, you can use prediction function to predict with the data that is stored in the same data frame.
Summary
Builds generalized linear models (glm) and stores them in a data frame.
Syntax
build_glm(<formula>
, test_rate = <numeric>
, family = <glm_family_type>
, weights = <numeric_vector>
, subset = <numeric_vector>
, na.action = <na_action_type>
, start = <numeric_vector>
, etastart = <numeric_vector>
, mustart = <numeric_vector>
, offset = <numeric_vector>
, epsilon = <numeric>
, maxit = <integer>
, trace = <logical>
, model = <logical>
, method = <glm_method_type>
, contrasts = <list>
, x = <logical>
, y = <logical>
)
Arguments
Example
Original data:
CARRIER | ARR_DELAY | DEP_DELAY | DISTANCE |
---|---|---|---|
UA | 10 | 7 | 300 |
UA | 20 | 30 | 1000 |
UA | -5 | 0 | 200 |
AA | 20 | 20 | 500 |
AA | -5 | 3 | 700 |
AA | 10 | 8 | 1000 |
%>% group_by(CARRIER)
%>% build_glm(ARR_DELAY ~ DEP_DELAY + DISTANCE, test_rate = 0.1)
Returns a data frame that stores a GLM model based on a formula defined as 'Predict ARR_DELAY based on DEP_DELAY and DISTANCE.'
CARRIER | source_data | model | .test_index |
---|---|---|---|
UA | source dataframe | glm model | c(1) |
AA | source dataframe | glm model | c(2) |
You can use model_coef or model_stats function to get the summary information about the models. Also, you can use prediction function to predict with the data that is stored in the same data frame.
Summary
Builds a multinomial logistic regression model and stores it in a data frame.
Syntax
build_multinom(<formula>
, test_rate = <numeric>
, subset = <numeric_vector>
, weights = <numeric_vector>
, na.action = <na_action_type>
, Hess = <logical>
, summ = <multinom_summ_type>
, censored = <logical>
, model = <logical>
, contrasts = <list>
)
Arguments
Example
Original data:
CARRIER | ARR_DELAY | DEP_DELAY | DISTANCE |
---|---|---|---|
UA | 10 | 7 | 300 |
UA | 20 | 30 | 1000 |
DL | -5 | 0 | 200 |
AA | 20 | 20 | 500 |
AA | -5 | 3 | 700 |
AA | 10 | 8 | 1000 |
%>% build_multinom(CARRIER ~ ARR_DELAY + DEP_DELAY, test_rate = 0.1)
Returns a data frame that stores a multinom model based on a formula defined as 'Predict CARRIER based on ARR_DELAY and DEP_DELAY.'
source_data | model | .test_index |
---|---|---|
source dataframe | multinom model | c(1) |
You can use model_coef or model_stats function to get the summary information about the models. Also, you can use prediction function to predict with the data that is stored in the same data frame.
Summary
Builds Cox Proportional Hazard Model for survival analysis and store it in a data frame.
Syntax
build_coxph(<formula>
, test_rate = <numeric>
, subset = <numeric_vector>
, weights = <numeric_vector>
, na.action = <na_action_type>
, init = <numeric>
, ties = <coxph_tie_type>
, singular.ok = <logical>
, model = <logical>
, x = <logical>
, y = <logical>
, tt = <function>
)
Arguments
Example
Original data:
time | status | age | sex |
---|---|---|---|
40 | 2 | 55 | female |
30 | 2 | 68 | male |
35 | 1 | 59 | male |
8 | 1 | 80 | female |
build_coxph(survival::Surv(time, status) ~ age + sex)
Builds a Cox Proportional Hazard Model that estimates survival of the subjects based on age and sex. The resulting model is stored in the returning data frame. You can use model_coef or model_stats function to get the summary information about the models. Also, you can use prediction_coxph function to predict with the data that is stored in the same data frame.
Summary
Returns a data frame with regression information about a model.
Syntax
prediction(data = "training"|"test"|"newdata"
, data_frame = <data_set>
)
Arguments
* data (Optional) - The default is "training". Type of data to use in the prediction.
* data_frame (Optional) - This works only when data = "newdata". A name of another data frame to apply the model.
Return Value
Data Frame
Example
Original data: model data frame.
prediction(data = "test")
Returns a data frame with predicted values.
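To apply the model to another data frame, a minimal sketch using the documented data_frame argument (flight2018 is a hypothetical data frame name):
prediction(data = "newdata", data_frame = flight2018) # flight2018 is hypothetical
Returns a data frame with predicted values for the rows of flight2018.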
Summary
Returns a data frame with binary classification information about a model.
Syntax
prediction_binary(<predicted_probability_column>
, <actual_value_column>
, data = "training"|"test"|"newdata"
, threshold = <binary_metrics>
, data_frame = <data_set>
)
Arguments
Return Value
Data Frame
Example
Original data: binary classification model data frame.
prediction_binary(data = "test", threshold = 0.2)
Returns a data frame with predicted values of binary classification.
Summary
Returns a data frame with predicted values of Cox Proportional Hazard Model.
Syntax
prediction_coxph(data = "training"|"test"
, type.predict = <coxph_predict_type>
, type.residuals = <coxph_residuals_type>
)
Arguments
* data (Optional) - The default is "training". Type of data to use in the prediction.
* type.predict (Optional) - The default is "lp". The type of prediction. This can be "lp", "risk", or "expected".
* type.residuals (Optional) - The default is "martingale". The type of residuals. This can be "martingale", "deviance", "score", "schoenfeld", "scaledsch", "dfbeta", or "dfbetas".
Return Value
Data Frame
Example
Original data: coxph model data frame.
prediction_coxph(data = "test")
Returns a data frame with predicted values.
Summary
Returns a data frame with 'Parameter Estimates (Coefficients)' information about a model including the below.
Syntax
model_coef(conf_int = <conf_int_type>
, conf.level = <numeric>
)
Arguments
Return Value
Data Frame
Example
Original data: lm model or glm model
model_coef(conf_int = "default", conf.level = 0.90)
Returns a data frame with 'Parameter Estimates' information.
Summary
Returns a data frame with 'Summary of Fit' information about a model including the below.
Syntax
model_stats()
Example
Original data: lm model or glm model
model_stats()
Returns a data frame with 'Summary of Fit' information.
Return Value
Data Frame
Summary
Returns a data frame with anova test information about a model including the below.
Syntax
model_anova()
Return Value
Data Frame
Example
Original data: lm model or glm model
model_anova()
Returns a data frame with anova test information.
Summary
Returns a data frame with evaluation score of regression including the below.
Syntax
evaluate_regression(<predicted_value_column>
, <actual_value_column>
)
Arguments
Return Value
Data Frame
Example
Original data:
ARR_DELAY | predicted_value |
---|---|
10 | 7 |
20 | 30 |
-5 | 0 |
20 | 20 |
-5 | 3 |
10 | 8 |
evaluate_regression(predicted_value, ARR_DELAY)
Returns a data frame with evaluation scores of regression.
Summary
Returns a data frame with evaluation score of binary classification including the below.
Syntax
evaluate_binary(<predicted_probability_column>
, <actual_value_column>
, threshold = <binary_metrics>
)
Arguments
Return Value
Data Frame
Example
Original data:
IS_DELAYED | predicted_probability |
---|---|
TRUE | 0.82 |
TRUE | 0.67 |
FALSE | 0.34 |
TRUE | 0.98 |
FALSE | 0.19 |
TRUE | 0.26 |
evaluate_binary(predicted_probability, IS_DELAYED, threshold = "accuracy")
Returns a data frame with evaluation scores of binary classification.
Summary
Returns a data frame with evaluation score of multi classification including the below.
Syntax
evaluate_multi(<predicted_label_column>
, <actual_value_column>
)
Arguments
Return Value
Data Frame
Example
Original data:
CARRIER | predicted_label |
---|---|
AA | AA |
9E | 9E |
AS | AA |
9E | 9E |
AA | AA |
AS | FL |
evaluate_multi(predicted_label, CARRIER)
Returns a data frame with evaluation scores of multi classification.
micro_f_score | macro_f_score | accuracy | misclassification_rate |
---|---|---|---|
0.6666667 | 0.45 | 0.6666667 | 0.3333333 |
Summary
Returns coordinates of roc curve.
Syntax
do_roc(<predicted_probability_column>
, <actual_value_column>
)
Arguments
Return Value
Data Frame
Example
Original data:
IS_DELAYED | predicted_probability |
---|---|
TRUE | 0.82 |
TRUE | 0.67 |
FALSE | 0.34 |
TRUE | 0.98 |
FALSE | 0.19 |
TRUE | 0.26 |
do_roc(predicted_probability, IS_DELAYED)
Returns a data frame with true_positive_rate and false_positive_rate.
true_positive_rate | false_positive_rate |
---|---|
0.00 | 0.0 |
0.25 | 0.0 |
0.50 | 0.0 |
0.75 | 0.0 |
0.75 | 0.5 |
1.00 | 0.5 |
1.00 | 1.0 |
With this data, you can assign "false_positive_rate" to X axis and "true_positive_rate" to Y axis to draw ROC curve.
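For instance, a minimal sketch with ggplot2, assuming the do_roc result is stored in a data frame named roc_df (an illustrative name):
library(ggplot2)
# ROC curve with a dashed diagonal as the random-classifier reference
ggplot(roc_df, aes(x = false_positive_rate, y = true_positive_rate)) +
  geom_line() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed")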
Summary
Builds a clustering model (k-means) from variable columns and returns the summary of the model or the augmented data depending on the parameter value.
Syntax
build_kmeans.cols(<column(s)>
, centers = <integer>
, fill = <numeric>
, fun.aggregate = <aggregate_function>
, seed = <integer>
, iter.max = <integer>
, nstart = <integer>
, algorithm = <kmeans_algorithms>
, trace = <logical>
, normalize_data = <logical>
, augment = <logical>
, keep.source = <logical>
)
Arguments
augment_kmeans(model, source.data)
Return Value
Data frame
Example
Original data:
location | 1900 | 1901 | 1902 |
---|---|---|---|
Als | 35.02899 | 24.02899 | 45.02899 |
Bea | 18.00000 | 10.00000 | 27.00000 |
Bor | 18.14896 | 10.64896 | 20.64896 |
build_kmeans.cols(-location, centers=2, augment=TRUE)
Builds a K-means clustering model selecting all the columns other than 'location' and setting 2 as the number of the clusters, and scoring the original data right away. If you want to evaluate the model set 'augment' to FALSE.
location | X1900 | X1901 | X1902 | .cluster |
---|---|---|---|---|
Als | 35.02899 | 24.02899 | 45.02899 | 2 |
Bea | 18.00000 | 10.00000 | 27.00000 | 1 |
Bor | 18.14896 | 10.64896 | 20.64896 | 1 |
Summary
Builds a clustering model (k-means) from key-value columns and store the model into a generated data frame or augment the original data with the clustered ID.
Syntax
build_kmeans.kv(<subject_column>
, <key_column>
, <value_column>
, centers = <integer>
, seed = <integer>
, iter.max = <integer>
, nstart = <integer>
, algorithm = <kmeans_algorithms>
, trace = <logical>
, normalize_data = <logical>
, augment = <logical>
, keep.source = <logical>
)
Arguments
augment_kmeans(model, source.data)
Return Value
Data frame
Example
Original data:
location | year | harvest_date |
---|---|---|
Als | 1900 | 35.02899 |
Als | 1901 | 24.02899 |
Als | 1902 | 45.02899 |
Bea | 1900 | 18.00000 |
Bea | 1901 | 10.00000 |
Bea | 1902 | 27.00000 |
Bor | 1900 | 18.14896 |
Bor | 1901 | 10.64896 |
Bor | 1902 | 20.64896 |
build_kmeans.kv(location, year, harvest_date, centers=2, augment=TRUE)
Builds a K-means clustering model selecting 'location' as the subject to cluster, 'year' as the dimension, and 'harvest_date' as the value, and setting 2 as the number of the clusters, and scoring the original data right away. If you want to evaluate the model set 'augment' to FALSE.
location | year | harvest_date | .cluster |
---|---|---|---|
Als | 1900 | 35.02899 | 2 |
Als | 1901 | 24.02899 | 2 |
Als | 1902 | 45.02899 | 2 |
Bea | 1900 | 18.00000 | 1 |
Bea | 1901 | 10.00000 | 1 |
Bea | 1902 | 27.00000 | 1 |
Bor | 1900 | 18.14896 | 1 |
Bor | 1901 | 10.64896 | 1 |
Bor | 1902 | 20.64896 | 1 |
Summary
Builds a clustering model (k-means). Stores the model into a generated data frame or the augmented data depending on the parameter value.
Syntax
build_kmeans(<column(s)>
, skv = <character_vector>
, fill = <numeric>
, fun.aggregate = <aggregate_function>
, centers = <integer>
, iter.max = <integer>
, nstart = <integer>
, algorithm = <kmeans_algorithms>
, trace = <logical>
, normalize_data = <logical>
, seed = <integer>
, augment = <logical>
, keep.source = <logical>
)
Arguments
augment_kmeans(model, source.data)
Return Value
Data frame
Example
Original data:
location | 1900 | 1901 | 1902 |
---|---|---|---|
Als | 35.02899 | 24.02899 | 45.02899 |
Bea | 18.00000 | 10.00000 | 27.00000 |
Bor | 18.14896 | 10.64896 | 20.64896 |
build_kmeans(-location, centers=2, augment=TRUE)
Builds a K-means clustering model selecting all the columns other than 'location' and setting 2 as the number of the clusters, and scoring the original data right away. If you want to evaluate the model set 'augment' to FALSE.
location | X1900 | X1901 | X1902 | .cluster |
---|---|---|---|---|
Als | 35.02899 | 24.02899 | 45.02899 | 2 |
Bea | 18.00000 | 10.00000 | 27.00000 | 1 |
Bor | 18.14896 | 10.64896 | 20.64896 | 1 |
Original data:
location | year | harvest_date |
---|---|---|
Als | 1900 | 35.02899 |
Als | 1901 | 24.02899 |
Als | 1902 | 45.02899 |
Bea | 1900 | 18.00000 |
Bea | 1901 | 10.00000 |
Bea | 1902 | 27.00000 |
Bor | 1900 | 18.14896 |
Bor | 1901 | 10.64896 |
Bor | 1902 | 20.64896 |
build_kmeans(skv = c("location", "year", "harvest_date"), centers = 2, augment = TRUE)
Builds a K-means clustering model selecting 'location' as the subject to cluster, 'year' as the dimension, and 'harvest_date' as the value, and setting 2 as the number of the clusters, and scoring the original data right away. If you want to evaluate the model set 'augment' to FALSE.
location | year | harvest_date | .cluster |
---|---|---|---|
Als | 1900 | 35.02899 | 2 |
Als | 1901 | 24.02899 | 2 |
Als | 1902 | 45.02899 | 2 |
Bea | 1900 | 18.00000 | 1 |
Bea | 1901 | 10.00000 | 1 |
Bea | 1902 | 27.00000 | 1 |
Bor | 1900 | 18.14896 | 1 |
Bor | 1901 | 10.64896 | 1 |
Bor | 1902 | 20.64896 | 1 |
Summary
Execute t-test, which checks the differences of means between groups.
Syntax
do_t.test(<value_column>
, <group_column>
, alternative = <alternative_type>
, mu = <integer>
, paired = <logical>
, var.equal = <logical>
, conf.level = <numeric>
, subset = <column_logical>
, na.action = <na_action_type>
)
Arguments
Return Value
Data frame
Example
Original data:
location | year | harvest_date |
---|---|---|
Als | 1900 | 35.02899 |
Als | 1901 | 24.02899 |
Als | 1902 | 45.02899 |
Bor | 1900 | 18.14896 |
Bor | 1901 | 10.64896 |
Bor | 1902 | 20.64896 |
do_t.test(harvest_date, location)
Returns a data frame with one row that has t-test result.
Summary
Execute F-test, which checks the differences of variances between groups.
Syntax
do_var.test(<value_column>
, <group_column>
, ratio = <numeric>
, alternative = <alternative_type>
, conf.level = <numeric>
, na.action = <na_action_type>
)
Arguments
Return Value
Data frame
Example
Original data:
location | year | harvest_date |
---|---|---|
Als | 1900 | 35.02899 |
Als | 1901 | 24.02899 |
Als | 1902 | 45.02899 |
Bor | 1900 | 18.14896 |
Bor | 1901 | 10.64896 |
Bor | 1902 | 20.64896 |
do_var.test(harvest_date, location)
Returns a data frame with one row that has the F-test result.
Summary
Execute chi-squared contingency table tests and goodness-of-fit tests.
Syntax
do_chisq.test(<column(s)>
, correct = <logical>
, p = <column_num>
, rescale.p = <logical>
, simulate.p.value = <logical>
, B = <numeric>
)
Arguments
Return Value
Data frame
Example
Original data:
clarity | GIA | HRD | IGI |
---|---|---|---|
IF | 6 | 4 | 34 |
VS1 | 61 | 13 | 7 |
VS2 | 36 | 15 | 2 |
VVS1 | 15 | 23 | 14 |
VVS2 | 33 | 24 | 21 |
do_chisq.test(-clarity)
Returns a data frame with one row that has chi-squared test result.
statistic | p.value | parameter | method |
---|---|---|---|
112.7472 | 0 | 8 | Pearson's Chi-squared test |
Summary
Find rules about which items tend to occur together in transaction data.
Syntax
do_apriori(<subject_column>
, <key_column>
, minlen = <integer>
, maxlen = <integer>
, min_support = <numeric>
, max_support = <numeric>
, min_confidence = <numeric>
, lhs = <character_vector>
, rhs = <character_vector>
, max_basket_items = <numeric>
)
Arguments
Return Value
Data frame
Example
Original data:
item_name | transaction_id |
---|---|
banana | 1 |
apple | 1 |
carrot | 2 |
apple | 2 |
orange | 3 |
carrot | 3 |
apple | 4 |
carrot | 4 |
do_apriori(item_name, transaction_id)
Returns association rules about item names.
lhs | rhs | support | confidence | lift |
---|---|---|---|---|
| apple | 0.75 | 0.7500000 | 1.0000000 |
| carrot | 0.75 | 0.7500000 | 1.0000000 |
banana | apple | 0.25 | 1.0000000 | 1.3333333 |
orange | carrot | 0.25 | 1.0000000 | 1.3333333 |
apple | carrot | 0.50 | 0.6666667 | 0.8888889 |
carrot | apple | 0.50 | 0.6666667 | 0.8888889 |
do_apriori(item_name, transaction_id, minlen=2)
Returns association rules where the total number of item names in lhs and rhs are more than 2.
lhs | rhs | support | confidence | lift |
---|---|---|---|---|
banana | apple | 0.25 | 1.0000000 | 1.3333333 |
orange | carrot | 0.25 | 1.0000000 | 1.3333333 |
apple | carrot | 0.50 | 0.6666667 | 0.8888889 |
carrot | apple | 0.50 | 0.6666667 | 0.8888889 |
do_apriori(item_name, transaction_id, lhs=c( "apple", "orange"))
Returns association rules where apple or orange come to lhs.
lhs | rhs | support | confidence | lift |
---|---|---|---|---|
orange | carrot | 0.25 | 1.0000000 | 1.3333333 |
apple | carrot | 0.50 | 0.6666667 | 0.8888889 |
Summary
Detect anomaly in time series data frame.
Syntax
do_anomaly_detection(<column_date>
, <column_num>
, direction = <anomaly_direction_type>
, e_value = <logical>
, max_anoms = <numeric>
, alpha = <numeric>
, only_last = <anomaly_last_type>
, threshold = <anomaly_threshold_type>
, longterm = <logical>
, piecewise_median_period_weeks = <integer>
)
Arguments
Return Value
Data frame
Example
Original data:
dateHour | newUsers |
---|---|
... | ... |
2017-01-20 09:00:00 | 4 |
2017-01-20 10:00:00 | 2 |
2017-01-20 11:00:00 | 5 |
2017-01-20 12:00:00 | 31 |
2017-01-20 13:00:00 | 3 |
2017-01-20 14:00:00 | 3 |
... | ... |
do_anomaly_detection(dateHour, newUsers, direction = "both", e_value = TRUE)
Returns data frame with columns about anomaly data.
dateHour | newUsers | pos_anomaly | pos_value | neg_anomaly | neg_value | expected_value |
---|---|---|---|---|---|---|
... | ... | ... | ... | ... | ... | ... |
2017-01-20 09:00:00 | 4 | FALSE | NA | FALSE | NA | 4 |
2017-01-20 10:00:00 | 2 | FALSE | NA | FALSE | NA | 2 |
2017-01-20 11:00:00 | 5 | FALSE | NA | FALSE | NA | 5 |
2017-01-20 12:00:00 | 31 | TRUE | 31 | FALSE | NA | 5 |
2017-01-20 13:00:00 | 3 | FALSE | NA | FALSE | NA | 3 |
2017-01-20 14:00:00 | 3 | FALSE | NA | FALSE | NA | 3 |
... | ... | ... | ... | ... | ... | ... |
Summary
Add forecast data to time series data frame.
Syntax
do_prophet(<column_date>
, <column_num>
, <numeric>
, time_unit = <prophet_time_unit>
, include_history = <logical>
, fun.aggregate = <aggregate_function>
, growth = "linear"|"logistics"
, seasonality.prior.scale = <numeric>
, yearly.seasonality = <logical>
, weekly.seasonality = <logical>
, n.changepoints = <numeric>
, changepoint.prior.scale = <numeric>
, changepoints = <list>
, holidays.prior.scale = <numeric>
, holidays = <data_set>
, mcmc.samples =
Arguments
Return Value
Data frame
Example
Original data:
date | pageviews |
---|---|
... | ... |
2017-03-19 | 2383 |
2017-03-20 | 4441 |
2017-03-21 | 3147 |
2017-03-22 | 456 |
... | ... |
do_prophet(date, pageviews, 2, time_unit = "day")
Returns data frame with forecasted rows and columns about trend and seasonality.
date | pageviews | forecasted_value | forecasted_value_high | forecasted_value_low | trend | trend_high | trend_low | seasonal | seasonal_low | seasonal_high | yearly | yearly_low | yearly_high | weekly | weekly_low | weekly_high |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2017-03-19 | 2383 | 3200.6539 | 4150.7993 | 2199.57 | 1131.882 | 1131.882 | 1131.882 | 2068.7719 | 2068.7719 | 2068.7719 | 1014.3165 | 1014.3165 | 1014.3165 | -177.1076 | -177.1076 | -177.1076 |
2017-03-20 | 4441 | 3401.953 | 4330.5388 | 2401.9095 | 1089.3266 | 1089.3266 | 1089.3266 | 2312.6264 | 2312.6264 | 2312.6264 | 1258.171 | 1258.171 | 1258.171 | -174.3324 | -174.3324 | -174.3324 |
2017-03-21 | 3147 | 2723.3344 | 3655.037 | 1841.2349 | 1046.7712 | 1046.7712 | 1046.7712 | 1676.5631 | 1676.5631 | 1676.5631 | 622.1077 | 622.1077 | 622.1077 | 15.2078 | 15.2078 | 15.2078 |
2017-03-22 | 456 | 1209.1918 | 2152.4326 | 298.3122 | 1004.2159 | 1004.2159 | 1004.2159 | 204.976 | 204.976 | 204.976 | -849.4794 | -849.4794 | -849.4794 | 143.4983 | 143.4983 | 143.4983 |
2017-03-23 | NA | 654.9152 | 1581.3289 | -276.3799 | 876.5498 | 876.5498 | 876.5497 | -221.6346 | -221.6346 | -221.6346 | -1276.09 | -1276.09 | -1276.09 | 107.7315 | 107.7315 | 107.7315 |
2017-03-24 | NA | 2090.3595 | 3049.9388 | 1133.799 | 833.9944 | 833.9944 | 833.9944 | 1256.3651 | 1256.3651 | 1256.3651 | 201.9097 | 201.9097 | 201.9097 | 77.2003 | 77.2003 | 77.2003 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Summary
Estimate impact of an event (advertisement, etc.) on a market, by using other markets to form a synthetic control.
Syntax
do_market_impact(<column_date>
, <column_num>
, <column_text>
, target_market = <text>
, time_unit = <market_impact_time_unit>
, fun.aggregate = <aggregate_function>
, event_time = <text>
, output_type = "series"|"model_stats"|"model_coef"|"predictor_market_candidates"
, na_fill_type = "spline"|"interpolate"|"previous"|"value"
, na_fill_value = <numeric>
, distance_weight = <numeric>
, alpha = <numeric>
, niter = <numeric>
, standardize.data = <logical>
, prior.level.sd = <numeric>
, nseasons = <numeric>
, season.duration = <numeric>
, dynamic.regression = <logical>
)
Arguments
Return Value
Data frame
Example
Original data:
date | pageviews | country |
---|---|---|
... | ... | ... |
2017-03-19 | 2383 | Japan |
2017-03-19 | 3625 | UK |
2017-03-19 | 3825 | US |
2017-03-20 | 4441 | Japan |
2017-03-20 | 4721 | UK |
2017-03-20 | 8367 | US |
2017-03-21 | 383 | Japan |
2017-03-21 | 2693 | UK |
2017-03-21 | 3147 | US |
... | ... | ... |
do_market_impact(date, pageviews, country, target_market = "Japan", event_time = "2017-03-20")
Returns data frame with pageviews of target market, synthetic control calculated from pageviews of other markets, and estimated impact of the event which happened only to the target market.
Output data frame columns:
Summary
Use a row as column names.
Syntax
row_as_header(row_index = <integer>
, prefix = <character>
, clean_names = <logical>
)
Arguments
Return Value
Data frame
Example
Original data:
V1 | V2 |
---|---|
NA | NA |
item.name | transaction.id |
banana | 1 |
apple | 1 |
carrot | 2 |
apple | 2 |
orange | 3 |
carrot | 3 |
apple | 4 |
carrot | 4 |
row_as_header(row_index = 2, prefix = "c_", clean_names = TRUE)
Returns a data frame using the 2nd row as column names, with the prefix added and dots replaced by underscores.
c_item_name | c_transaction_id |
---|---|
NA | NA |
banana | 1 |
apple | 1 |
carrot | 2 |
apple | 2 |
orange | 3 |
carrot | 3 |
apple | 4 |
carrot | 4 |
Summary
Pivot columns into rows and columns. Values are the count of each row and column pair, or an aggregation of another column.
Syntax
pivot(<formula>
, value = <column>
, fill = <numeric>
, fun.aggregate = <character>
, na.rm = <logical>
)
Arguments
Return Value
Data frame
Example
Original data:
clarity | certification | colour | price |
---|---|---|---|
VVS2 | HRD | F | 10796 |
VS1 | GIA | F | 1551 |
VS1 | GIA | F | 4291 |
VVS2 | GIA | F | 5881 |
VS2 | GIA | H | 4585 |
VVS2 | GIA | H | 5193 |
pivot(clarity+colour~certification, value = price, fill=0, fun.aggregate = sum)
Returns data frame with rows of combinations of clarity and colour and with columns of certification. Values are sum of price and missing values are filled by 0.
clarity_colour | GIA | HRD |
---|---|---|
VS1_F | 5842 | 0 |
VS2_H | 4585 | 0 |
VVS2_F | 5881 | 10796 |
VVS2_H | 5193 | 0 |
Original data:
FL_DATE | ARR_DELAY | CARRIER |
---|---|---|
2018-10-01 | 10 | AA |
2018-10-02 | 20 | UA |
2018-10-03 | 30 | AA |
2018-10-04 | 40 | UA |
2018-11-01 | 20 | UA |
2018-11-02 | 10 | AA |
2018-11-03 | 20 | UA |
2018-11-04 | 30 | AA |
pivot(floor_date(FL_DATE, unit="month") ~ CARRIER, value = ARR_DELAY, fun.aggregate = mean)
Returns data frame with rows of FL_DATE (floored to month) and ARR_DELAY with columns of each CARRIER. Values are mean of ARR_DELAY.
FL_DATE | AA | UA |
---|---|---|
2018-10-01 | 20 | 30 |
2018-11-01 | 20 | 20 |
Summary
Calculates correlations for all the pairs of the variables (columns).
Syntax
do_cor.cols(<column(s)>
, use = <cor_na_operation>
, method = <cor_method>
, distinct = <logical>
, diag = <logical>
)
Arguments
Return Value
Data frame
Example
Original data:
year | Als | Bea | Bor |
---|---|---|---|
1900 | 35.02899 | 18 | 18.14896 |
1901 | 24.02899 | 10 | 10.64896 |
1902 | 45.02899 | 27 | 20.64896 |
do_cor.cols(-year)
pair.name.1 | pair.name.2 | value |
---|---|---|
Als | Bea | 0.9981135 |
Als | Bor | 0.9680283 |
Bea | Als | 0.9981135 |
Bea | Bor | 0.9508014 |
Bor | Als | 0.9680283 |
Bor | Bea | 0.9508014 |
Summary
Calculates correlations for all the pairs of subject columns.
Syntax
do_cor.kv(<subject_column>
, <key_column>
, <value_column>
, use = <cor_na_operation>
, method = <cor_method>
, distinct = <logical>
, diag = <logical>
, fill = <numeric>
, fun.aggregate = <aggregate_function>
)
Arguments
Return Value
Data frame
Example
Original data:
location | year | harvest_date |
---|---|---|
Als | 1900 | 35.02899 |
Als | 1901 | 24.02899 |
Als | 1902 | 45.02899 |
Bea | 1900 | 18.00000 |
Bea | 1901 | 10.00000 |
Bea | 1902 | 27.00000 |
Bor | 1900 | 18.14896 |
Bor | 1901 | 10.64896 |
Bor | 1902 | 20.64896 |
do_cor.kv(location, year, harvest_date)
location.x | location.y | value |
---|---|---|
Als | Bea | 0.9981135 |
Als | Bor | 0.9680283 |
Bea | Als | 0.9981135 |
Bea | Bor | 0.9508014 |
Bor | Als | 0.9680283 |
Bor | Bea | 0.9508014 |
Summary
Calculates correlations for all the pairs of the variables or subjects.
Syntax
do_cor(<column(s)>
, skv = <character_vector>
, use = <cor_na_operation>
, method = <cor_method>
, distinct = <logical>
, diag = <logical>
, fill = <numeric>
, fun.aggregate = <aggregate_function>
)
Arguments
Return Value
Data frame
Example
Original data:
year | Als | Bea | Bor |
---|---|---|---|
1900 | 35.02899 | 18 | 18.14896 |
1901 | 24.02899 | 10 | 10.64896 |
1902 | 45.02899 | 27 | 20.64896 |
do_cor(-year)
pair.name.1 | pair.name.2 | value |
---|---|---|
Als | Bea | 0.9981135 |
Als | Bor | 0.9680283 |
Bea | Als | 0.9981135 |
Bea | Bor | 0.9508014 |
Bor | Als | 0.9680283 |
Bor | Bea | 0.9508014 |
Original data:
location | year | harvest_date |
---|---|---|
Als | 1900 | 35.02899 |
Als | 1901 | 24.02899 |
Als | 1902 | 45.02899 |
Bea | 1900 | 18.00000 |
Bea | 1901 | 10.00000 |
Bea | 1902 | 27.00000 |
Bor | 1900 | 18.14896 |
Bor | 1901 | 10.64896 |
Bor | 1902 | 20.64896 |
do_cor( skv = c("location", "year", "harvest_date"))
location.x | location.y | value |
---|---|---|
Als | Bea | 0.9981135 |
Als | Bor | 0.9680283 |
Bea | Als | 0.9981135 |
Bea | Bor | 0.9508014 |
Bor | Als | 0.9680283 |
Bor | Bea | 0.9508014 |
Summary
Calculates the distance between each pair of the subjects.
Syntax
do_dist.kv(<subject_column>
, <key_column>
, <value_column>
, distinct = <logical>
, diag = <logical>
, method = <dist_method>
, fill = <numeric>
, fun.aggregate = <aggregate_function>
, p = <numeric>
)
Arguments
Return Value
Data frame
Example
Original data:
date | name | product | num |
---|---|---|---|
2015-06-12 | Lisa | orange | 5 |
2015-06-12 | Emily | apple | 3 |
2015-06-12 | John | carrot | 3 |
2015-06-13 | Emily | apple | 4 |
2015-06-13 | John | apple | 6 |
2015-06-16 | Lisa | orange | 2 |
2015-06-16 | John | carrot | 4 |
2015-06-16 | Emily | orange | 7 |
do_dist.kv(name, product, num)
This considers a matrix like the one below. Duplicated entries are aggregated with mean by default; you can change this with the fun.aggregate argument.
Emily | John | Lisa | |
---|---|---|---|
apple | 3.5 | 6.0 | 0.0 |
carrot | 0.0 | 3.5 | 0.0 |
orange | 7.0 | 0.0 | 3.5 |
Then it calculates the Euclidean distance between each pair of columns.
name.x | name.y | value |
---|---|---|
Emily | John | 8.215838 |
Emily | Lisa | 4.949747 |
John | Emily | 8.215838 |
John | Lisa | 7.778175 |
Lisa | Emily | 4.949747 |
Lisa | John | 7.778175 |
do_dist.kv(name, product, num, fun.aggregate=sum, method="manhattan")
This considers a matrix like the one below. Duplicated entries are aggregated with sum.
Emily | John | Lisa | |
---|---|---|---|
apple | 7 | 6 | 0 |
carrot | 0 | 7 | 0 |
orange | 7 | 0 | 7 |
Then it calculates the Manhattan distance between each pair of columns.
name.x | name.y | value |
---|---|---|
Emily | John | 15 |
Emily | Lisa | 7 |
John | Emily | 15 |
John | Lisa | 20 |
Lisa | Emily | 7 |
Lisa | John | 20 |
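The two steps described above (spread the key-value pairs into a subject-by-key matrix, then compute pairwise distances) can be sketched with standard R. This illustrates the mechanics only, not do_dist.kv's implementation:
library(dplyr)
library(tidyr)

purchases <- tribble(
  ~name,   ~product, ~num,
  "Lisa",  "orange", 5,
  "Emily", "apple",  3,
  "John",  "carrot", 3,
  "Emily", "apple",  4,
  "John",  "apple",  6,
  "Lisa",  "orange", 2,
  "John",  "carrot", 4,
  "Emily", "orange", 7
)

m <- purchases %>%
  group_by(product, name) %>%
  summarize(num = sum(num), .groups = "drop") %>%   # fun.aggregate = sum
  pivot_wider(names_from = name, values_from = num,
              values_fill = 0)                      # missing pairs become 0

dist(t(as.matrix(m[-1])), method = "manhattan")     # distances between persons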
Summary
Calculates the distance between each pair of the variables (columns).
Syntax
do_dist.cols(<column(s)>
, distinct = <logical>
, diag = <logical>
, method = <dist_method>
, p = <numeric>
)
Arguments
Return Value
Data frame
Example
Original data:
year | Als | Bea | Bor |
---|---|---|---|
1900 | 35.02899 | 18 | 18.14896 |
1901 | 24.02899 | 10 | 10.64896 |
1902 | 45.02899 | 27 | 20.64896 |
do_dist.cols(-year)
pair.name.1 | pair.name.2 | value |
---|---|---|
Als | Bea | 28.492868 |
Als | Bor | 32.532238 |
Bea | Als | 28.492868 |
Bea | Bor | 6.385847 |
Bor | Als | 32.532238 |
Bor | Bea | 6.385847 |
Summary
Calculates the distance between each pair of the variables or subjects.
Syntax
do_dist(<column(s)>
, skv = <character_vector>
, distinct = <logical>
, diag = <logical>
, method = <dist_method>
, p = <numeric>
, fill = <numeric>
, fun.aggregate = <aggregate_function>
)
Arguments
Return Value
Data frame
Example
Original data:
year | Als | Bea | Bor |
---|---|---|---|
1900 | 35.02899 | 18 | 18.14896 |
1901 | 24.02899 | 10 | 10.64896 |
1902 | 45.02899 | 27 | 20.64896 |
do_dist(-year)
pair.name.1 | pair.name.2 | value |
---|---|---|
Als | Bea | 28.492868 |
Als | Bor | 32.532238 |
Bea | Als | 28.492868 |
Bea | Bor | 6.385847 |
Bor | Als | 32.532238 |
Bor | Bea | 6.385847 |
Original data:
date | name | product | num |
---|---|---|---|
2015-06-12 | Lisa | orange | 5 |
2015-06-12 | Emily | apple | 3 |
2015-06-12 | John | carrot | 3 |
2015-06-13 | Emily | apple | 4 |
2015-06-13 | John | apple | 6 |
2015-06-16 | Lisa | orange | 2 |
2015-06-16 | John | carrot | 4 |
2015-06-16 | Emily | orange | 7 |
do_dist(skv = c("name", "product", "num"))
This considers a matrix like the one below. Duplicated entries are aggregated with mean by default; you can change this with the fun.aggregate argument.
Emily | John | Lisa | |
---|---|---|---|
apple | 3.5 | 6.0 | 0.0 |
carrot | 0.0 | 3.5 | 0.0 |
orange | 7.0 | 0.0 | 3.5 |
Then it calculates the Euclidean distance between each pair of columns.
name.x | name.y | value |
---|---|---|
Emily | John | 8.215838 |
Emily | Lisa | 4.949747 |
John | Emily | 8.215838 |
John | Lisa | 7.778175 |
Lisa | Emily | 4.949747 |
Lisa | John | 7.778175 |
do_dist(skv = c("location", "year", "harvest_date"), fun.aggregate=sum, method="manhattan")
This considers a matrix like the one below. Duplicated entries are aggregated with sum.
Emily | John | Lisa | |
---|---|---|---|
apple | 7 | 6 | 0 |
carrot | 0 | 7 | 0 |
orange | 7 | 0 | 7 |
Then it calculates the Manhattan distance between each pair of columns.
name.x | name.y | value |
---|---|---|
Emily | John | 15 |
Emily | Lisa | 7 |
John | Emily | 15 |
John | Lisa | 20 |
Lisa | Emily | 7 |
Lisa | John | 20 |
Summary
Cleans up text by removing escape characters (e.g. \n, \t), extra white spaces, extra periods, and leading and trailing spaces.
Syntax
str_clean(<column_text>
)
Arguments
Return Value
Character
Example
str_clean(" Exploratory ..io ")
returns "Exploratory.io".
Summary
Counts patterns in texts.
Syntax
str_count_all(<column_text>
, patterns = <character_vector>
, remove.zero = <logical>
)
Arguments
Return Value
List of data frame
Example
str_count_all("I ate banana, apple, and peach yesterday, and banana, peach today.", patterns=c("apple", "banana"), remove.zero=TRUE)
Returns a list column of data frames with 'apple' and 'banana' columns.
Summary
Normalizes text by replacing full-width alphanumeric and special characters with regular (half-width) ones, replacing half-width Kana characters with full-width Kana characters, and applying other normalization rules. It follows the Unicode Normalization Forms. This is a wrapper function around the stringi::stri_trans_nfkc function.
Syntax
str_normalize(<column_text>
)
Arguments
Return Value
Character
Example
str_normalize("ABC123+ー=") returns "ABC123+-=".
Summary
Extracts text enclosed by the given begin and end symbols. If multiple strings are extracted, they are returned as a list type, so wrap the call with the list_to_text function.
Syntax
str_extract_inside(<column_text>
, begin = <text>
, end = <text>
, include_special_chars = <TRUE|FALSE>
, all = <TRUE|FALSE>
)
Arguments
Return Value
Character
Example
list_to_text(str_extract_inside(company, begin = "(", end = ")", include_special_chars = FALSE, all = TRUE))
list_to_text(str_extract_inside(company, begin = "{", end = "}", include_special_chars = FALSE, all = TRUE))
list_to_text(str_extract_inside(company, begin = "[", end = "]", include_special_chars = FALSE, all = TRUE))
- Original Data: "Apple [Cupertino]"
- Result: Cupertino
list_to_text(str_extract_inside(company, begin = "'", end = "'", include_special_chars = FALSE, all = TRUE))
list_to_text(str_extract_inside(company, begin = '"', end = '"', include_special_chars = FALSE, all = TRUE))
Summary
Converts the data type of a column to logical type, transforming values into TRUE or FALSE.
If the "true_value" argument is not specified, it converts "true", "Yes", "yes", "1" to TRUE, and "false", "No", "no", "0" to FALSE.
If you want to make a specific value TRUE and all other values FALSE, specify the original value in true_value.
Even if the target column for logical conversion contains a mix of uppercase and lowercase letters, the conversion ignores case and focuses only on the string content.
Syntax
str_logical(<column_text>
, true_value = <text>
)
Arguments
Return Value
Logical
Example
Original data:
'Status' column is Character data type.
Status |
---|
yes |
no |
No |
NO |
YES |
Yes |
mutate(New_Status = str_logical(Status))
Result:
'New_Status' column is Logical data type.
Status | New_Status |
---|---|
yes | TRUE |
no | FALSE |
No | FALSE |
NO | FALSE |
YES | TRUE |
Yes | TRUE |
Original data:
'Status' column is Character data type.
Status |
---|
TRUE |
FalSE |
True |
False |
NA |
tRUE |
FALSE |
mutate(New_Status = str_logical(Status))
Result:
'New_Status' column is Logical data type.
Status | New_Status |
---|---|
TRUE | TRUE |
FalSE | FALSE |
True | TRUE |
False | FALSE |
NA | NA |
tRUE | TRUE |
FALSE | FALSE |
Original data:
'Status' column is Numeric data type.
Status |
---|
1 |
0 |
1 |
0 |
NA |
1 |
0 |
mutate(New_Status = str_logical(Status))
Result:
'New_Status' column is Logical data type.
Status | New_Status |
---|---|
1 | TRUE |
0 | FALSE |
1 | TRUE |
0 | FALSE |
NA | NA |
1 | TRUE |
0 | FALSE |
Original data:
'Status' column is Character data type.
Status |
---|
Sign Up |
Not Yet |
Sign Up |
Not Yet |
NA |
sign Up |
Not yet |
mutate(New_Status = str_logical(Status, "Sign Up")
Result:
'New_Status' column is Logical data type.
Status | New_Status |
---|---|
Sign Up | TRUE |
Not Yet | FALSE |
Sign Up | TRUE |
Not Yet | FALSE |
NA | NA |
sign Up | TRUE |
Not yet | FALSE |
Summary
Returns stop words like "a", "the", "and", etc.
Syntax
get_stopwords(lang = <language>
, include = <character_vector>
, exclude = <character_vector>
)
Arguments
Return Value
Character
Example
get_stopwords()
Returns a list of stop words like "a", "the", and "and".
Summary
Returns sentiment types of positive or negative based on word(s).
Syntax
word_to_sentiment(<column_text>
, lexicon = <sentiment_lexicon>
)
Arguments
Return Value
If lexicon argument is "bing" (default), Character.
If lexicon argument is "AFINN", Numeric.
If lexicon argument is "nrc", List.
Example
word_to_sentiment("good")
Return "positive".
word_to_sentiment("bad", lexicon="AFINN")
Return -3.
word_to_sentiment("cry", lexicon="nrc")
Return c("negative" "sadness").
Summary
Returns the sentiment score for a text sentence. A positive score indicates a positive sentence and a negative score indicates the opposite; 0 means 'neutral.'
Syntax
get_sentiment(<column_text>
)
Arguments
Return Value
Numeric
Example
get_sentiment("I'm happy")
Return 0.7071068.
get_sentiment("I'm sick.")
Return -0.7071068.
get_sentiment("I'm not so good.")
Return -0.5000000.
Summary
Stems words so that variations of a word with slightly different spellings can be recognized as the same word.
Syntax
stem_word(<column_text>
, language = <stem_language>
)
Arguments
Return Value
Character
Example
stem_word(c("stand","stands", "cheerful", "cheering"))
Return c("stand","stand","cheer","cheer").
Summary
Returns TRUE if a word is included in a list of the stop words defined by one of the dictionaries.
Syntax
is_stopword(<column_text>
, lang = <language>
, include = <character_vector>
, exclude = <character_vector>
)
Arguments
Return Value
Logical
Example
is_stopword(c("a", "and", "stopword", "the"))
Returns c(TRUE, TRUE, FALSE, TRUE).
Summary
Returns TRUE if a text is empty string or NA.
Syntax
is_empty(<column_text>
)
Arguments
Return Value
Logical
Example
is_empty(c("", NA, "Hello"))
Returns c(TRUE, TRUE, FALSE).
mutate(NEW = is_empty(X1))
Determines whether each value in column X1 is an empty string or NA and returns a logical value (TRUE or FALSE).
X1 | NEW |
---|---|
 | TRUE |
NA | TRUE |
Hello | FALSE |
filter(!is_empty(X1))
Keeps only the rows where X1 is neither an empty string nor NA.
Summary
Returns TRUE if a text contains only alphabetic characters.
Syntax
is_alphabet(<column_text>
)
Arguments
Return Value
Logical
Example
is_alphabet(c("1", "132", "32MB", "hello", "Hello World"))
Returns c(FALSE, FALSE, FALSE, TRUE, FALSE).
This is intended to be used on tokens; if applied to whole sentences, it returns FALSE.
Summary
Returns country names from IP addresses.
Syntax
ip_to_country(<column_text>
)
Arguments
Return Value
Character
Example
ip_to_country("133.43.96.45")
returns "Japan".
Summary
Returns domain (ex. "exploratory.io") from url.
Syntax
url_domain(<column_text>
)
Arguments
Return Value
Character
Example
url_domain("https://exploratory.io/login/")
returns "exploratory.io".
Summary
Returns fragment from url.
Syntax
url_fragment(<column_text>
)
Arguments
Return Value
Character
Example
url_fragment("https://exploratory.io/?debug=true#test")
returns "test".
Summary
Returns path from url.
Syntax
url_path(<column_text>
)
Arguments
Return Value
Character
Example
url_path("https://exploratory.io/reference/")
returns "reference/".
Summary
Returns port from url.
Syntax
url_port(<column_text>
)
Arguments
Return Value
Character
Example
url_port("https://exploratory.io:443")
returns "443".
Summary
Returns scheme (ex. "http", "https") from url.
Syntax
url_scheme(<column_text>
)
Arguments
Return Value
Character
Example
url_scheme("https://exploratory.io")
returns "https".
Summary
Returns suffix (ex. "com", "org") from url.
Syntax
url_suffix(<column_text>
)
Arguments
Return Value
Character
Example
url_suffix("https://exploratory.io")
returns "io".
url_suffix("http://sample.edu.co")
returns "edu.co".
Summary
Returns subdomain (ex. "www", "blog") from url.
Syntax
url_subdomain(<column_text>
)
Arguments
Return Value
Character
Example
url_subdomain("https://blog.exploratory.io")
returns "blog".
Summary
Returns top-level domain (ex. "com", "co") from url.
Syntax
url_tld(<column_text>
)
Arguments
Return Value
Character
Example
url_tld("http://sample.edu.co")
returns "co".
Summary
Returns decoded query parameter from url.
Syntax
url_param(<column_text>
, parameter_name = <character>
)
Arguments
* parameter_name - Name of the query parameter. If the url ends with "?user=ferret", the name is "user".
Return Value
Character
Example
url_param("http://example.com/?user=ferret", "user")
returns "ferret".
Summary
Converts country names or country codes to other formats of country codes or names (country names, continent names, region names, etc.). This is useful for standardizing country codes from different data sources when conducting international data analysis or creating reports.
Syntax
countrycode(<column of country names or country codes>
, origin = <format of country codes or country names before conversion>
, destination = <format of country codes or country/region names after conversion>
)
Arguments
Possible values for origin and destination arguments:
Return Value
Country names, country codes, continent names, or region names after conversion (depending on the specified destination format)
Example
countrycode(country_name, origin = "country.name", destination = "iso2c")
Converts country names in the country_name column to ISO 3166-1 alpha-2 (2-letter country codes). Abbreviations and common names are automatically recognized.
country_name | ISO2_code |
---|---|
U.S. | US |
United States | US |
Japan | JP |
China | CN |
South Korea | KR |
countrycode(ISO2_code, origin = "iso2c", destination = "country.name")
Converts 2-letter country codes in the ISO2_code column to official English country names.
ISO2_code | country_name |
---|---|
US | United States |
JP | Japan |
GB | United Kingdom |
DE | Germany |
FR | France |
countrycode(COW_code, origin = "cowc", destination = "country.name")
Converts Correlates of War string format country codes in the COW_code column to official English country names.
COW_code | country_name |
---|---|
USA | United States |
JPN | Japan |
GBR | United Kingdom |
RUS | Russia |
CHN | China |
countrycode(COW_numeric_code, origin = "cown", destination = "country.name")
Converts Correlates of War numeric codes in the COW_numeric_code column to official English country names.
COW_numeric_code | country_name |
---|---|
2 | United States |
4 | United Kingdom |
81 | Japan |
365 | Russia |
710 | China |
countrycode(ISO3_code, origin = "iso3c", destination = "iso2c")
Converts 3-letter country codes in the ISO3_code column to 2-letter country codes.
ISO3_code | ISO2_code |
---|---|
USA | US |
JPN | JP |
GBR | GB |
DEU | DE |
FRA | FR |
countrycode(country_name, origin = "country.name", destination = "continent")
Converts country names in the country_name column to the continent names to which they belong.
country_name | continent_name |
---|---|
United States | Americas |
Japan | Asia |
Germany | Europe |
South Africa | Africa |
Australia | Oceania |
countrycode(country_name, origin = "country.name", destination = "region")
Converts country names in the country_name column to region names classified into 7 regions.
country_name | region_name |
---|---|
United States | Northern America |
Japan | Eastern Asia |
Germany | Western Europe |
Brazil | South America |
Australia | Australia and New Zealand |
Summary
This function takes a column that has US State information and returns US State names, abbreviations, numeric codes, divisions, or regions, based on the 'output_type' argument. The US State information can be US State names, US State abbreviations (FIPS / ANSI codes), or US State numeric codes (FIPS), in any combination.
Syntax
statecode(<state_column>
, output_type = <statecode_type>
)
Arguments
Output Types:
Return Value
State name, State code, etc.
Example
statecode("CA", "name")
Return California.
statecode("CA", "num_code")
Return 06.
statecode("California", "num_code")
Return 06.
statecode("CA", "region")
Return West.
Summary
Generate US county code (FIPS - Federal Information Processing Standard) based on US State and County names.
Syntax
countycode(state = <state_column>
, county = <county_column>
)
Arguments
Return Value
County code
Example
countycode("California", "San Mateo")
Return 06081.
countycode("CA", "San Mateo")
Return 06081.
countycode("CA", "San Mateo County")
Return 06081.
Summary
Returns one token (e.g. word) per row after tokenizing a text.
Syntax
do_tokenize(<column_text>
, token = <token_type>
, keep_cols = <logical>
, to_lower = <logical>
, drop = <logical>
, output = <new_column_name>
, with_id = <logical>
, pattern = <pattern>
)
Arguments
Return Value
Data frame
Example
Original data:
index | text |
---|---|
First | It was a great party. |
Second | She has just left. She will be off tomorrow. |
do_tokenize(text)
index | document_id | sentence_id | token |
---|---|---|---|
First | 1 | 1 | it |
First | 1 | 1 | was |
First | 1 | 1 | a |
First | 1 | 1 | great |
First | 1 | 1 | party |
Second | 2 | 1 | she |
Second | 2 | 1 | has |
Second | 2 | 1 | just |
Second | 2 | 1 | left |
Second | 2 | 2 | she |
Second | 2 | 2 | will |
Second | 2 | 2 | be |
Second | 2 | 2 | off |
Second | 2 | 2 | tomorrow |
do_tokenize(text, token="sentences")
index | token |
---|---|
First | it was a great party. |
Second | she has just left. |
Second | she will be off tomorrow. |
Summary
Counts pairs of words (tokens) that co-occur within a group.
Syntax
pair_count(group = <column>
, value = <column>
, distinct = <logical>
, diag = <logical>
, sort = <logical>
)
Arguments
Return Value
Data frame
Example
pair_count(group=title, value=word, distinct=FALSE, diag = FALSE, sort=TRUE)
Returns a data frame with word.x and word.y columns, whose values come from the "word" column, and a value column that holds the count.
Summary
Calculates TF-IDF for each term against a group. TF-IDF is a weighting mechanism that calculates the importance of each word to each document by increasing the importance based on the term frequency while decreasing the importance based on the document frequency.
Syntax
do_tfidf(<document_column>
, <token_column>
, tf_weight = <tf_weight_type>
, idf_log_scale = <function>
, norm = "l2"|"l1"|FALSE
)
Arguments
* tf_weight - How to weight the term frequency. The log-scale option computes 1 + log(count of a term in a document).
* idf_log_scale - The log function used for the IDF part, computed as log_scale_function((the total number of documents) / (the number of documents which have the token)). This expresses how rare the token is in the set of documents. It might be worth trying log2 or log10; log2 increases the value faster and log10 increases it more slowly.
* norm - How to normalize the values for each document: "l2" (default), "l1", or FALSE.
Return Value
Data frame
Example
Original data:
document_id | token |
---|---|
1 | this |
1 | is |
1 | what |
1 | it |
1 | is |
2 | which |
2 | is |
2 | better |
do_tfidf(document_id, token)
document_id | token | count_per_doc | count_of_docs | tfidf |
---|---|---|---|---|
1 | is | 2 | 2 | 0.0000000 |
1 | it | 1 | 1 | 0.5773503 |
1 | this | 1 | 1 | 0.5773503 |
1 | what | 1 | 1 | 0.5773503 |
2 | better | 1 | 1 | 0.7071068 |
2 | is | 1 | 2 | 0.0000000 |
2 | which | 1 | 1 | 0.7071068 |
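The numbers in this table can be reproduced by hand. A minimal sketch, assuming the default tf weight is the raw term count, the default idf uses the natural log, and the default norm is "l2":
library(dplyr)

tokens <- tibble(
  document_id = c(1, 1, 1, 1, 1, 2, 2, 2),
  token = c("this", "is", "what", "it", "is", "which", "is", "better")
)

n_docs <- n_distinct(tokens$document_id)

tokens %>%
  count(document_id, token, name = "count_per_doc") %>%
  group_by(token) %>%
  mutate(count_of_docs = n_distinct(document_id)) %>%   # document frequency
  group_by(document_id) %>%
  mutate(tfidf = count_per_doc * log(n_docs / count_of_docs),
         tfidf = tfidf / sqrt(sum(tfidf^2))) %>%        # l2-normalize per document
  ungroup()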
Summary
Creates columns of n-grams connected within sentences.
Syntax
do_ngram(<token_column>
, <document_column>
, <sentence_column>
, maxn = <numeric>
, sep = <text>
)
Arguments
Return Value
Data frame
Example
Original data:
document_id | sentence_id | token |
---|---|---|
1 | 1 | it |
1 | 1 | is |
1 | 1 | good |
2 | 1 | she |
2 | 1 | left |
2 | 2 | she |
2 | 2 | will |
2 | 2 | come |
2 | 2 | tomorrow |
do_ngram(token, document_id, sentence_id, maxn=3)
document_id | sentence_id | gram | token |
---|---|---|---|
1 | 1 | 1 | it |
1 | 1 | 1 | is |
1 | 1 | 1 | good |
1 | 1 | 2 | it_is |
1 | 1 | 2 | is_good |
1 | 1 | 3 | it_is_good |
2 | 1 | 1 | she |
2 | 1 | 1 | left |
2 | 1 | 2 | she_left |
2 | 2 | 1 | she |
2 | 2 | 1 | will |
2 | 2 | 1 | come |
2 | 2 | 1 | tomorrow |
2 | 2 | 2 | she_will |
2 | 2 | 2 | will_come |
2 | 2 | 2 | come_tomorrow |
2 | 2 | 3 | she_will_come |
2 | 2 | 3 | will_come_tomorrow |
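For intuition, the 2-gram rows above can be derived with dplyr's lead() within each sentence; a minimal sketch (do_ngram itself generates all lengths from 1 up to maxn):
library(dplyr)

tokens <- tibble(
  document_id = c(1, 1, 1, 2, 2, 2, 2, 2, 2),
  sentence_id = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  token = c("it", "is", "good", "she", "left", "she", "will", "come", "tomorrow")
)

tokens %>%
  group_by(document_id, sentence_id) %>%
  mutate(bigram = if_else(is.na(lead(token)), NA_character_,
                          paste(token, lead(token), sep = "_"))) %>%
  ungroup()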
Summary
Calculates the similarity between each pair of the documents using the cosine similarity algorithm. Cosine similarity measures the cosine of the angle between two vectors in the multi-dimensional space.
Syntax
do_cosine_sim.kv(<subject_column>
, <key_column>
, <value_column>
, distinct = <logical>
, diag = <logical>
, fun.aggregate = <aggregate_function>
)
Arguments
Return Value
Data frame
Example
Original data:
document_id | token | value |
---|---|---|
Lisa | it | 0.8966972 |
Lisa | was | 0.2655087 |
Lisa | good | 0.3721239 |
Emily | she | 0.5728534 |
Emily | is | 0.9082078 |
Emily | nice | 0.2016819 |
John | she | 0.8983897 |
John | is | 0.9446753 |
John | good | 0.6607978 |
do_cosine_sim.kv(document_id, token, value)
document_id.x | document_id.y | value |
---|---|---|
Lisa | Emily | 0.0000000 |
Lisa | John | 0.1671574 |
Emily | Lisa | 0.0000000 |
Emily | John | 0.8595770 |
John | Lisa | 0.1671574 |
John | Emily | 0.8595770 |
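The values above follow the usual cosine formula, dot(x, y) / (|x| * |y|), treating tokens a document does not have as 0. A minimal sketch with standard R that reproduces the table:
library(dplyr)
library(tidyr)

d <- tribble(
  ~document_id, ~token, ~value,
  "Lisa",  "it",   0.8966972,
  "Lisa",  "was",  0.2655087,
  "Lisa",  "good", 0.3721239,
  "Emily", "she",  0.5728534,
  "Emily", "is",   0.9082078,
  "Emily", "nice", 0.2016819,
  "John",  "she",  0.8983897,
  "John",  "is",   0.9446753,
  "John",  "good", 0.6607978
)

wide <- d %>%
  pivot_wider(names_from = document_id, values_from = value, values_fill = 0)
m <- as.matrix(wide[-1])            # rows = tokens, columns = documents
norms <- sqrt(colSums(m^2))
crossprod(m) / outer(norms, norms)  # cosine similarity matrix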
Summary
Executes multidimensional scaling (MDS). Calculates approximate coordinates from the distances of entity pairs.
Syntax
do_cmdscale(<name1_column>
, <name2_column>
, <value_column>
, k = <integer>
)
Arguments
Return Value
Data frame
Example
Original data:
name.x | name.y | value |
---|---|---|
cow | horse | 36 |
cow | turtle | 82 |
horse | turtle | 48 |
do_cmdscale(name.x, name.y, value)
name | V1 | V2 |
---|---|---|
cow | -38.915551 | -3.435417 |
horse | -4.163866 | 5.962475 |
turtle | 43.079417 | -2.527058 |
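This is classical MDS, available in base R as cmdscale; a minimal sketch (coordinates are unique only up to rotation and reflection, so signs may differ from the table above):
d <- matrix(0, 3, 3, dimnames = list(c("cow", "horse", "turtle"),
                                     c("cow", "horse", "turtle")))
d["cow", "horse"]    <- d["horse", "cow"]    <- 36
d["cow", "turtle"]   <- d["turtle", "cow"]   <- 82
d["horse", "turtle"] <- d["turtle", "horse"] <- 48

cmdscale(as.dist(d), k = 2)  # approximate 2-D coordinates from the distances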
Summary
Impute NA by mean / median / mode values, manually entered values, or predicted values. It will build a linear regression model for the prediction.
Syntax
impute_na(<target_column>
, type = <impute_na_type>
, val = <column>
, <column(s)>
)
Arguments
* target_column - Column whose NA values should be imputed
* type - How to impute: "mean", "median", "mode", "value", or "predict"
* val - The value to impute with when type is "value"
* <column(s)> - Predictor columns used to build the linear regression model when type is "predict"
Example
mutate(New = impute_na(DEP_DELAY, type = "mean"))
Imputes missing values in the DEP_DELAY column with the mean value of that column.
CARRIER | DEP_DELAY | NEW |
---|---|---|
UA | 7 | 7 |
UA | NA | 8.75 |
UA | 0 | 0 |
AA | 20 | 20 |
AA | NA | 8.75 |
AA | 8 | 8 |
mutate(New = impute_na(DEP_DELAY, type = "median"))
Imputes missing values in the DEP_DELAY column with the median value of that column.
CARRIER | DEP_DELAY | NEW |
---|---|---|
UA | 7 | 7 |
UA | NA | 7.5 |
UA | 0 | 0 |
AA | 20 | 20 |
AA | NA | 7.5 |
AA | 8 | 8 |
mutate(New = impute_na(product_category, type = "mode"))
Imputes missing values in the product_category column with the most frequently occurring value in that column.
product_ID | product_category | NEW |
---|---|---|
001 | Food | Food |
002 | NA | Food |
003 | Clothing | Clothing |
004 | Food | Food |
005 | NA | Food |
006 | Daily Items | Daily Items |
mutate(New = impute_na(DEP_DELAY, type = "value", val = 10))
Imputes missing values in the DEP_DELAY column with the value 10.
CARRIER | ARR_DELAY | DEP_DELAY | DISTANCE | NEW |
---|---|---|---|---|
UA | 10 | 7 | 300 | 7 |
UA | 20 | NA | 1000 | 10 |
UA | -5 | 0 | 200 | 0 |
AA | 20 | 20 | 500 | 20 |
AA | -5 | NA | 700 | 10 |
AA | 10 | 8 | 100 | 8 |
mutate(New = impute_na(DEP_DELAY, type = "predict", CARRIER, ARR_DELAY, DISTANCE))
Imputes missing values in the DEP_DELAY column with predicted values using linear regression based on the CARRIER, ARR_DELAY, and DISTANCE columns.
CARRIER | ARR_DELAY | DEP_DELAY | DISTANCE | NEW |
---|---|---|---|---|
UA | 10 | 7 | 300 | 7 |
UA | 20 | NA | 1000 | 3.47 |
UA | -5 | 0 | 200 | 0 |
AA | 20 | 20 | 500 | 20 |
AA | -5 | NA | 700 | 3.59 |
AA | 10 | 8 | 100 | 8 |
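The prediction-based imputation can be sketched with base R's lm; this illustrates the approach only, since the model impute_na actually builds, and therefore the predicted values, may differ:
df <- data.frame(
  CARRIER   = c("UA", "UA", "UA", "AA", "AA", "AA"),
  ARR_DELAY = c(10, 20, -5, 20, -5, 10),
  DEP_DELAY = c(7, NA, 0, 20, NA, 8),
  DISTANCE  = c(300, 1000, 200, 500, 700, 100)
)

# fit on the complete rows, then predict only where DEP_DELAY is missing
fit <- lm(DEP_DELAY ~ CARRIER + ARR_DELAY + DISTANCE, data = df)
df$NEW <- ifelse(is.na(df$DEP_DELAY),
                 predict(fit, newdata = df),
                 df$DEP_DELAY)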
Summary
Detects outliers in a numeric column and returns labels indicating if values are outliers ('Upper' or 'Lower') or not ('Normal'). This function is useful for labeling outliers or for easily removing outliers when used in combination with a filter.
Syntax
detect_outlier(<column_num>
, type = <detect_outlier_type>
, threshold = <numeric>
)
Arguments
Example
mutate(New = detect_outlier(DISTANCE, type = "iqr"))
Detects outliers in the data using IQR and labels them as "Upper", "Normal", or "Lower".
CARRIER | DISTANCE | NEW |
---|---|---|
UA | 800 | Normal |
UA | 5000 | Upper |
UA | 600 | Normal |
AA | 500 | Normal |
AA | 700 | Normal |
AA | 10 | Lower |
filter(detect_outlier(population, "iqr") == "Normal")
Removes outliers detected using IQR on both sides.
state | population |
---|---|
California | 3979576 |
Texas | 2693976 |
mutate(outlier_label = detect_outlier(population, type = "percentile", threshold = 0.95))
Labels outliers using the 95th percentile method.
state | population | outlier_label |
---|---|---|
New York | 8336817 | Upper |
California | 3979576 | Normal |
Texas | 2693976 | Normal |
filter(detect_outlier(population, type = "percentile", threshold = 0.99) == "Normal")
Removes outliers detected using the 99th percentile.
state | population |
---|---|
New York | 8336817 |
California | 3979576 |
Texas | 2693976 |
mutate(outlier_label = detect_outlier(population, type = "standard_deviation", threshold = 2))
Labels values exceeding 2 standard deviations as outliers.
state | population | outlier_label |
---|---|---|
New York | 8336817 | Upper |
California | 3979576 | Normal |
Texas | 2693976 | Normal |
Vermont | 193976 | Lower |
filter(detect_outlier(population, type = "standard_deviation", threshold = 3) == "Normal")
Removes outliers detected using 3 standard deviations.
state | population |
---|---|
California | 3979576 |
Texas | 2693976 |
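For the "iqr" type, a minimal sketch of the rule, assuming the conventional 1.5 * IQR fences (an assumption, since detect_outlier's exact defaults are not documented here); it reproduces the labels in the first table above:
iqr_label <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  dplyr::case_when(
    x > q[2] + 1.5 * iqr ~ "Upper",
    x < q[1] - 1.5 * iqr ~ "Lower",
    TRUE ~ "Normal"
  )
}

iqr_label(c(800, 5000, 600, 500, 700, 10))
# "Normal" "Upper" "Normal" "Normal" "Normal" "Lower"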
Summary
Builds a K-means clustering model and labels each row with a cluster id.
Syntax
cluster(<column(s)>
, n_cluster = <integer>
)
Arguments
Example
Original data:
CARRIER | ARR_DELAY | DEP_DELAY | DISTANCE |
---|---|---|---|
UA | 10 | 7 | 800 |
UA | 20 | NA | 5000 |
UA | -5 | 0 | 600 |
AA | 20 | 20 | 500 |
AA | -5 | NA | 700 |
AA | 10 | 8 | 10 |
mutate(cluster = cluster(DEP_DELAY, DISTANCE))
Labels each row with a cluster id (factor) based on the DEP_DELAY and DISTANCE column values.
CARRIER | ARR_DELAY | DEP_DELAY | DISTANCE | cluster |
---|---|---|---|---|
UA | 10 | 7 | 800 | 2 |
UA | 20 | NA | 5000 | NA |
UA | -5 | 0 | 600 | 1 |
AA | 20 | 20 | 500 | 1 |
AA | -5 | NA | 700 | NA |
AA | 10 | 8 | 10 | 3 |
Summary
Creates a data frame with models built from the input data frame, using the given model function and arguments.
Syntax
build_model(model_func = <model_function>
, seed = <integer>
, test_rate = <numeric>
, <model_parameters>
)
Arguments
Example
Original data:
CARRIER | ARR_DELAY | DEP_DELAY | DISTANCE | WEIGHTS |
---|---|---|---|---|
UA | 10 | 7 | 300 | 0.8 |
UA | 20 | 30 | 1000 | 0.5 |
UA | -5 | 0 | 200 | 1.2 |
AA | 20 | 20 | 500 | 0.8 |
AA | -5 | 3 | 700 | 0.9 |
AA | 10 | 8 | 1000 | 0.5 |
%>% build_model(model_func = lme4::lmer, formula = ARR_DELAY ~ DEP_DELAY + (DISTANCE|CARRIER), test_rate = 0.1, weights = WEIGHTS)
Returns a data frame that stores a linear mixed-effects model. It also returns a column holding the original data.
source_data | model | .test_index |
---|---|---|
source dataframe | lme4 model | c(1) |
You can use the model_coef or model_stats function to get summary information about the models. You can also use the prediction function to predict with the data stored in the same data frame.
Summary
One-hot encodes a categorical column, producing a separate column for each categorical value; each new column holds 1 or 0 to indicate whether the row has the value that the column represents.
Syntax
one_hot(<column>
)
Example
Original data:
ARR_DELAY | CARRIER |
---|---|
10 | UA |
20 | UA |
-5 | UA |
20 | AA |
-5 | AA |
10 | AA |
one_hot(CARRIER)
ARR_DELAY | CARRIER_UA | CARRIER_AA |
---|---|---|
10 | 1 | 0 |
20 | 1 | 0 |
-5 | 1 | 0 |
20 | 0 | 1 |
-5 | 0 | 1 |
10 | 0 | 1 |
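The same encoding can be sketched with base R's model.matrix; column names and order differ slightly from one_hot's output:
df <- data.frame(ARR_DELAY = c(10, 20, -5, 20, -5, 10),
                 CARRIER   = c("UA", "UA", "UA", "AA", "AA", "AA"))

# "- 1" drops the intercept so every carrier gets its own indicator column
cbind(df["ARR_DELAY"], model.matrix(~ CARRIER - 1, df))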
Summary
Returns the most frequent value (mode) in a column.
Syntax
get_mode(<column>
, na.rm = <logical>
)
Arguments
* column - Column to get the mode of
* na.rm (Optional) - Whether NA should be ignored even if it is the most frequent value. The default is FALSE.
Example
Original data:
ARR_DELAY | CARRIER |
---|---|
10 | UA |
20 | UA |
-5 | UA |
20 | AA |
-5 | AA |
summarize(ARR_DELAY_mode = get_mode(ARR_DELAY), CARRIER_mode = get_mode(CARRIER))
ARR_DELAY_mode | CARRIER_mode |
---|---|
20 | UA |
Summary
Summarize (or aggregate) column values with specified aggregate functions. If grouping columns are provided then summarize the values for each group.
Syntax
summarize_group(group_cols=<column_list>
, group_funs = <function(s)>
, ...)
Arguments
* group_cols - Columns to group by
* group_funs - Aggregate functions (e.g. "none", "month") to apply to the group-by columns.
* ... - A pair of a newly created column name and an aggregation function with a column name. You can give multiple pairs to aggregate on multiple columns.
Example
Original data:
ARR_DELAY | CARRIER |
---|---|
10 | UA |
20 | UA |
30 | UA |
-20 | UA |
20 | AA |
-5 | AA |
-9 | AA |
summarize_group(group_cols=c("CARRIER"), group_funs=("none"), ARR_DELAY_mean = mean(ARR_DELAY))
CARRIER | ARR_DELAY_mean |
---|---|
UA | 10 |
AA | 2 |
Original data:
ARR_DELAY | DEP_DELAY | CARRIER |
---|---|---|
10 | 20 | UA |
20 | 15 | UA |
30 | -5 | UA |
-20 | 30 | UA |
20 | 10 | AA |
-5 | 20 | AA |
-9 | 30 | AA |
summarize_group(group_cols=c("CARRIER"), group_funs=("none"), ARR_DELAY_mean = mean(ARR_DELAY), DEP_DEALY_sum = sum(DEP_DELAY))
CARRIER | ARR_DELAY_mean | DEP_DELAY_sum |
---|---|---|
UA | 10 | 60 |
AA | 2 | 60 |
Original data:
FL_DATE | ARR_DELAY | DEP_DELAY |
---|---|---|
06/01/2018 | 10 | 20 |
06/12/2018 | 20 | 15 |
06/18/2018 | 30 | -5 |
06/21/2018 | -20 | 30 |
07/14/2018 | 20 | 10 |
07/16/2018 | -5 | 20 |
07/25/2018 | -9 | 30 |
summarize_group(group_cols=c("FL_DATE"), group_funs=("month"), ARR_DELAY_mean = mean(ARR_DELAY), DEP_DEALY_sum = sum(DEP_DELAY))
FL_DATE_month | ARR_DELAY_mean | DEP_DELAY_sum |
---|---|---|
6 | 10 | 60 |
7 | 2 | 60 |
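The last example corresponds to the following dplyr sketch (assuming CRAN dplyr/lubridate, not summarize_group itself):
library(dplyr)
library(lubridate)

flights <- tibble(
  FL_DATE = mdy(c("06/01/2018", "06/12/2018", "06/18/2018", "06/21/2018",
                  "07/14/2018", "07/16/2018", "07/25/2018")),
  ARR_DELAY = c(10, 20, 30, -20, 20, -5, -9),
  DEP_DELAY = c(20, 15, -5, 30, 10, 20, 30)
)

flights %>%
  group_by(FL_DATE_month = month(FL_DATE)) %>%   # group_funs = ("month")
  summarize(ARR_DELAY_mean = mean(ARR_DELAY),
            DEP_DELAY_sum  = sum(DEP_DELAY))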
Summary
Summarize (or aggregate) column values across columns for each row with the specified aggregate function.
Syntax
summarize_row(across(c(<column(s)>)), <aggregate_function>, ...)
summarize_row(across(where(<columns_select_function>)), <aggregate_function>, ...)
summarize_row(across(starts_with(<text>, ignore.case = <logical>)), <aggregate_function>, ...)
summarize_row(across(ends_with(<text>, ignore.case = <logical>)), <aggregate_function>, ...)
summarize_row(across(contains(<text>, ignore.case = <logical>)), <aggregate_function>, ...)
summarize_row(across(matches(<text>, ignore.case = <logical>)), <aggregate_function>, ...)
summarize_row(across(num_range(<text>, <start_num>:<end_num>)), <aggregate_function>, ...)
Arguments
Example
mutate(Total = summarize_row(across(c(Sales_Consumer, Sales_Business, Sales_Government)), sum))
Sums up values from the specified columns for each row, and stores the sums in the Total column. If any of the column values are NA, the result will be NA too.
mutate(Total = summarize_row(across(c(Sales_Consumer, Sales_Business, Sales_Government)), sum, na.rm = TRUE))
Same as the above example, except that even if some of the column values are NA, the result will be calculated from the rest of the values.
mutate(Total = summarize_row(across(where(is.numeric)), sum, na.rm = TRUE))
Sums up values from all numeric columns for each row, and stores the sums in the Total column.
Summary
Creates new columns or updates existing columns with specified functions such as window calculations. If grouping columns are provided, it performs the calculation for each group.
Syntax
mutate_group(group_cols=<column_list>
, group_funs = <function(s)>
, sort_cols=<column_list>
, sort_funs = <function(s)>
, ...)
Arguments
Example
Original data:
ARR_DELAY | CARRIER |
---|---|
10 | UA |
20 | UA |
30 | UA |
20 | UA |
20 | AA |
5 | AA |
9 | AA |
mutate_group(group_cols=c("CARRIER"), group_funs=("none"), ARR_DELAY_cumsum = cumsum(ARR_DELAY))
CARRIER | ARR_DELAY_cumsum |
---|---|
UA | 10 |
UA | 30 |
UA | 60 |
UA | 80 |
AA | 20 |
AA | 25 |
AA | 34 |
Original data:
ARR_DELAY | DEP_DELAY | CARRIER |
---|---|---|
10 | 20 | UA |
20 | 15 | UA |
30 | -5 | UA |
-20 | 30 | UA |
20 | 10 | AA |
-5 | 20 | AA |
-9 | 30 | AA |
mutate_group(group_cols=c("CARRIER"), group_funs=("none"), ARR_DELAY_dfprev = ARR_DELAY - lag(ARR_DELAY), DEP_DEALY_dfprev = DEP_DELAY - lag(DEP_DELAY))
CARRIER | ARR_DELAY_dfprev | DEP_DELAY_dfprev |
---|---|---|
UA | NA | NA |
UA | 10 | -5 |
UA | 10 | -20 |
UA | -50 | 35 |
AA | NA | NA |
AA | -25 | 10 |
AA | -4 | 10 |
Summary
Returns the sum of all the values that satisfy given conditions.
Syntax
sum_if(<column_num_logic>
, <condition>
,... , na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
summarize(total = sum_if(revenue, product == 'PC' & country %in% c("Japan", "United States", "United Kingdom")))
Creates a new column calculating the sum of revenue where product is 'PC' and country is "Japan", "United States", or "United Kingdom", for each group.
summarize(total = sum_if(quantity, priority == 'P1' | country == "United States"))
Creates a new column calculating the sum of quantity where priority is 'P1' or country is "United States".
Summary
Returns the ratio of the sum of all the values that satisfy given conditions to the sum of all the values.
Syntax
sum_if_ratio(<column_num_logic>
, <condition>
,... , na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
summarize(total = sum_if_ratio(revenue, product == 'PC' & country %in% c("Japan", "United States", "United Kingdom")))
Creates a new column calculating the ratio of the sum of revenue where product is 'PC' and country is "Japan", "United States", or "United Kingdom" to the total sum of revenue, for each group.
summarize(total = sum_if_ratio(quantity, priority == 'P1' | country == "United States"))
Creates a new column calculating the ratio of the sum of quantity where priority is 'P1' or country is "United States" to the total sum of quantity.
Summary
Summarizes the data by counting the number of rows that satisfy given conditions.
Syntax
count_if(<column>
, <condition>
, ... , na.rm = <logical>
)
Arguments
Example
summarize(country_count = count_if(country, product == 'PC' | quantity > 10))
Returns the number of rows where product is 'PC' or quantity is greater than 10, for each group.
Summary
Returns the ratio of the number of rows that satisfy given conditions to the total number of rows.
Syntax
count_if_ratio(<column>
, <condition>
, ... , na.rm = <logical>
)
Arguments
Example
summarize(country_count_ratio = count_if_ratio(country, product == 'PC' | quantity > 10))
Returns the ratio of the number of rows where product is 'PC' or quantity is greater than 10 to the total number of rows, for each group.
Summary
Counts the number of unique values of a column among the rows that satisfy given conditions.
Syntax
count_unique_if(<column>
, <condition>
, ... , na.rm = <logical>
)
Arguments
Example
summarize(country_count = count_unique_if(country, product == 'PC' & quantity > 10))
Returns the number of unique countries among rows where product is 'PC' and quantity is greater than 10, for each group.
Summary
Returns the ratio of the number of unique values of a column among the rows that satisfy given conditions to the total number of unique values.
Syntax
count_unique_if_ratio(<column>
, <condition>
, ... , na.rm = <logical>
)
Arguments
Example
summarize(country_count = count_unique_if_ratio(country, product == 'PC' & quantity > 10))
Returns the ratio of the number of unique countries among rows where product is 'PC' and quantity is greater than 10 to the total number of unique countries, for each group.
Summary
Returns the numerical average (mean) of the values that satisfy given conditions.
Syntax
mean_if(<column_num_logic_date>
, <condition>
, ... , na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
summarize(revenue_mean = mean_if(revenue, product == 'PC' | quantity > 10))
Creates a new column calculating the average revenue where product is 'PC' or quantity is greater than 10, for each group.
Summary
Returns the numerical average (mean) of the values that satisfy given conditions. This is an alias of mean_if.
Syntax
average_if(<column_num_logic_date>
, <condition>
, ... , na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
summarize(revenue_mean = average_if(revenue, product == 'PC' | quantity > 10))
Creates a new column calculating the average revenue where product is 'PC' or quantity is greater than 10, for each group.
Summary
Returns the numerical median of the values that satisfy given conditions.
Syntax
median_if(<column_num_logic_date>
, <condition>
, ... , na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
summarize(revenue_median = median_if(revenue, product == 'PC' | quantity > 10))
Creates a new column calculating the median revenue where product is 'PC' or quantity is greater than 10, for each group.
Summary
Returns the maximum value from a numeric or date column that meets the specified condition.
Syntax
max_if(<column>
, <condition>
, ... , na.rm = <logical>
)
Arguments
Return Value
Numeric, Date, or DateTime (same type as the input column)
Example
max_if(purchase_amount, customer_category == "premium")
Returns the maximum "purchase_amount" for customers whose "customer_category" is "premium".
order_id | customer_category | purchase_amount | result |
---|---|---|---|
O001 | regular | 15000 | 58000 |
O002 | premium | 58000 | 58000 |
O003 | premium | 42000 | 58000 |
O004 | regular | 25000 | 58000 |
max_if(purchase_date, product_name == "Smartphone X")
Returns the latest date (most recent date) when "Smartphone X" was purchased.
order_id | product_name | purchase_date | result |
---|---|---|---|
O001 | Tablet Y | 2023-06-15 | 2023-05-22 |
O002 | Smartphone X | 2023-02-10 | 2023-05-22 |
O003 | Laptop PC Z | 2023-04-05 | 2023-05-22 |
O004 | Smartphone X | 2023-05-22 | 2023-05-22 |
Summary
Returns the minimum value from a numeric or date column that meets the specified condition.
Syntax
min_if(<column>
, <condition>
, ..., na.rm = <logical>
)
Arguments
Return Value
Numeric, Date, or DateTime (same type as the input column)
Example
min_if(price, product_category == "electronics")
Returns the minimum "price" for products whose "product_category" is "electronics".
product_id | product_category | price | result |
---|---|---|---|
E001 | electronics | 25000 | 12000 |
F001 | food | 5000 | 12000 |
E002 | electronics | 12000 | 12000 |
C001 | clothing | 8000 | 12000 |
min_if(purchase_date, product_name == "Laptop A")
Returns the earliest date when "Laptop A" was purchased.
order_id | product_name | purchase_date | result |
---|---|---|---|
O001 | Tablet B | 2023-03-15 | 2023-01-20 |
O002 | Laptop A | 2023-01-20 | 2023-01-20 |
O003 | Smartphone C | 2023-02-10 | 2023-01-20 |
O004 | Laptop A | 2023-04-05 | 2023-01-20 |
Summary
Returns the count of the rows for each group.
Syntax
count_rows()
Arguments
Return Value
Numeric
Example
summarize(total = count_rows())
Creates a new column for calculating the number of entries for each group.
Summary
Returns the count of unique values.
Syntax
count_unique(<column(s)>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
summarize(total = count_unique(TAIL_NUM))
Creates a new column calculating the count of unique values of TAIL_NUM for each group.
summarize(total = count_unique(STATE, CITY))
Creates a new column calculating the count of unique combinations of STATE and CITY for each group.
Summary
Returns either Weekday or Weekend based on the provided date column value.
Syntax
weekend(<column_date>
)
Arguments
* column - Date Column to get Weekday or Weekend
Return Value
Factor (either Weekday or Weekend)
Example
weekend("2015-10-01")
returns Weekday
weekend("2015-10-02")
returns Weekend
Summary
Alias of is_jholiday. Returns TRUE if the provided date is a Japanese Holiday.
Syntax
is_japanese_holiday(<column_date>
)
Arguments
* column - Date column to check whether it is a Japanese Holiday
Return Value
Logical (either TRUE or FALSE)
Example
is_japanese_holiday("2020-01-01")
returns TRUE
is_japanese_holiday("2020-01-05")
returns FALSE
Summary
Extracts the week of month. For example, if the given date is in the 1st week of the month, it returns 1.
Syntax
get_week_of_month(<column_date>
)
Return Value
Numeric
Example
get_week_of_month(as.Date("2019-06-01")) returns 1 get_week_of_month(as.Date("2019-06-25")) returns 5
Summary
Samples n rows from the data frame. This is the same as sample_n, except that it handles the case where the number of rows in the data is fewer than the specified n without throwing an error.
Syntax
sample_rows(<number>
, weight=<column_num>
, replace=<logical>
, seed=<number>
)
Arguments
Example
sample_rows(100)
Selects 100 random rows.
sample_rows(100, weight = ARR_DELAY)
Selects 100 random rows from the data, weighted by the ARR_DELAY column.
Summary
Returns a confidence interval range (half-width of confidence interval) of mean from given numeric data.
Syntax
confint_mean(<column_num>
, level=<numeric>
)
Arguments
Return Value
Numeric
Example
Original data:
DEPARTMENT | MONTHLY_INCOME |
---|---|
Sales | 2200 |
Sales | 3000 |
Sales | 5000 |
Sales | : |
R&D | 4000 |
R&D | 3200 |
R&D | 2200 |
R&D | : |
confint = confint_mean(MONTHLY_INCOME)
DEPARTMENT | confint |
---|---|
Sales | 25.32468 |
R&D | 17.9226 |
Summary
Returns a confidence interval range (half-width of confidence interval) of TRUE ratio from given logical data.
Syntax
confint_ratio(<column_logical>
, level=<numeric>
)
Arguments
Return Value
Numeric
Example
Original data:
DEPARTMENT | ATTRITION |
---|---|
Sales | TRUE |
Sales | FALSE |
Sales | FALSE |
Sales | : |
R&D | FALSE |
R&D | TRUE |
R&D | FALSE |
R&D | : |
confint = confint_ratio(ATTRITION)
DEPARTMENT | confint |
---|---|
Sales | 0.02065466 |
R&D | 0.02683791 |
Summary
Returns a confidence interval range (half-width of confidence interval) of mean from a size and a standard deviation of sample data.
Syntax
calc_confint_mean(<column_num>
, <column_num>
, level=<numeric>
)
Arguments
Return Value
Numeric
Example
Original data:
DEPARTMENT | SD | NUMBER_OF_PEOPLE |
---|---|---|
Sales | 577.4946 | 2000 |
R&D | 288.8194 | 1000 |
confint = calc_confint_mean(SD, NUMBER_OF_PEOPLE)
DEPARTMENT | SD | NUMBER_OF_PEOPLE | confint |
---|---|---|---|
Sales | 577.4946 | 2000 | 25.32468 |
R&D | 288.8194 | 1000 | 17.9226 |
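The confint values above match the usual t-based half-width, t * SD / sqrt(n); a quick check in base R (assuming the default level is 0.95):
qt(0.975, df = 2000 - 1) * 577.4946 / sqrt(2000)  # 25.32468 (Sales)
qt(0.975, df = 1000 - 1) * 288.8194 / sqrt(1000)  # 17.9226  (R&D)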
Summary
Returns a confidence interval range (half-width of confidence interval) of ratio from a size and a TRUE ratio of sample data.
Syntax
calc_confint_ratio(<column_num>
, <column_num>
, level=<numeric>
)
Arguments
Return Value
Numeric
Example
Original data:
DEPARTMENT | ATTRITION_RATIO | NUMBER_OF_PEOPLE |
---|---|---|
Sales | 0.33 | 2000 |
R&D | 0.25 | 1000 |
confint = calc_confint_ratio(ATTRITION_RATIO, NUMBER_OF_PEOPLE)
DEPARTMENT | ATTRITION_RATIO | NUMBER_OF_PEOPLE | confint |
---|---|---|---|
Sales | 0.33 | 2000 | 0.02065466 |
R&D | 0.25 | 1000 | 0.02683791 |
Summary
Calculates the cumulative sum of decaying effects. It is the same as cumsum when r (the second argument) is 1.
Syntax
cumsum_decayed(<column_num>
, <numeric>
)
Arguments
Return Value Numeric
Example
Original data:
Ad_Unit |
---|
100 |
200 |
100 |
100 |
mutate(Effective_Ad_Unit = cumsum_decayed(Ad_Unit, 0.1))
Ad_Unit | Effective_Ad_Unit |
---|---|
100 | 100 |
200 | 210 |
100 | 121 |
100 | 112.1 |
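Each output value is the current input plus r times the previous output, i.e. e[t] = x[t] + r * e[t-1]; a one-line check of the table above:
Reduce(function(acc, x) x + 0.1 * acc, c(100, 200, 100, 100), accumulate = TRUE)
# 100.0 210.0 121.0 112.1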
Summary
Calculates the period between two dates in years.
Syntax
years_between(<column_date>
, <column_date>
)
Arguments
Return Value
numeric
Example
years_between(ymd("2020-10-01), ymd("2021-10-01"))
returns 1
Summary
Calculates the period between two dates in months.
Syntax
months_between(<column_date>
, <column_date>
)
Arguments
Return Value
numeric
Example
months_between(ymd("2020-10-01), ymd("2021-10-01"))
returns 12
Summary
Calculates the period between two dates in weeks.
Syntax
weeks_between(<column_date>
, <column_date>
)
Arguments
Return Value
numeric
Example
weeks_between(ymd("2020-10-01), ymd("2021-10-01"))
returns 52.14286
Summary
Calculates the period between two dates in days.
Syntax
days_between(<column_date>
, <column_date>
)
Arguments
Return Value
numeric
Example
days_between(ymd("2020-10-01), ymd("2021-10-01"))
returns 365
Summary
Calculates the period between two datetimes in hours.
Syntax
hours_between(<column_date>
, <column_date>
)
Arguments
Return Value
numeric
Example
hours_between(ymd_hms("2020-10-01 05:00:00"), ymd_hms("2020-10-01 15:00:00"))
returns 10
Summary
Calculates the period between two datetimes in minutes.
Syntax
minutes_between(<column_date>
, <column_date>
)
Arguments
Return Value
numeric
Example
minutes_between(ymd_hms("2020-10-01 05:00:00"), ymd_hms("2020-10-01 15:00:00"))
returns 600
Summary
Calculates the period between two datetimes in seconds.
Syntax
seconds_between(<column_date>
, <column_date>
)
Arguments
Return Value
numeric
Example
seconds_between(ymd_hms("2020-10-01 05:00:00"), ymd_hms("2020-10-01 15:00:00"))
returns 36000
Summary
Returns the last date of the specified period (e.g., month) that the original date belongs to.
Syntax
last_date(<column_date>
, unit = "week"|"month"|"quarter"|"year"
, previous = <logical>
, week_start = <number>
)
Arguments
Return Value
Date
Example
last_date(ymd("2022-10-01"), unit = "month") returns "2022-10-31"
Summary
Returns the value of the previous period.
Syntax
ts_lag(<column_date>
, <column_num>
, unit = "day"|"week"|"month"|"quarter"|"year"
, n = <integer>
, na_fill_type = "previous"|"next"|"none"
)
Arguments
Return Value
numeric - The values with the time lag.
Example
ts_lag(ymd(c("2020-01-01","2021-01-01","2022-01-01")), c(1,2,3), unit = "year", n = 1) returns c(NA, 1, 2)
Summary
Returns the differences of the values in a time series compared to the values from the specified lag time ago.
Syntax
ts_diff(<column_date>
, <column_num>
, unit = "day"|"week"|"month"|"quarter"|"year"
, n = <integer>
, na_fill_type = "previous"|"next"|"none"
)
Arguments
Return Value
numeric
Example
ts_diff(ymd(c("2020-01-01","2021-01-01","2022-01-01")), c(1,2,3), unit = "year", n = 1) returns c(NA, 1, 1)
Summary
Returns the differences of the values as ratios, compared to the base values from the specified lag time ago in a time series.
Syntax
ts_diff_ratio(<column_date>
, <column_num>
, unit = "day"|"week"|"month"|"quarter"|"year"
, n = <integer>
, na_fill_type = "previous"|"next"|"none"
)
Arguments
Return Value
numeric
Example
ts_diff_ratio(ymd(c("2020-01-01","2021-01-01","2022-01-01")), c(1,2,3), unit = "year", n = 1) returns c(NA, 1, 0.5)
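On a regular, gap-free series this reduces to (value - lag(value)) / lag(value); a minimal dplyr sketch (ts_diff_ratio additionally aligns rows by the calendar unit and fills gaps per na_fill_type):
library(dplyr)

v <- c(1, 2, 3)         # yearly values for 2020, 2021, 2022
(v - lag(v)) / lag(v)   # NA 1.0 0.5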
Summary
Returns Likert's sigma values given a raw data column of survey answers with selected levels such as 1="Strongly disagree", 2="Disagree", 3="Neutral", 4="Agree", 5="Strongly agree".
Syntax
likert_sigma(<column_num>
)
Arguments
Return Value
Numeric
Example
Original data:
NAME | LIKES_ICE_CREAM |
---|---|
Joe | 5 |
Sally | 4 |
Bob | 2 |
John | 5 |
sigma = likert_sigma(LIKES_ICE_CREAM)
NAME | LIKES_ICE_CREAM | sigma |
---|---|---|
Joe | 5 | 0.7978846 |
Sally | 4 | -0.3246628 |
Bob | 2 | -1.2711063 |
John | 5 | 0.7978846 |
Summary
Removes the text before a specified delimiter or regex match.
Syntax
str_remove_before(<text_column>
, sep = <delimiter>
, include_sep = <TRUE|FALSE>
)
Arguments
Return Value
Character
Example
str_remove_before(X1, sep = ",")
Removes text before the comma in column X1, including the comma.
X1 | NEW |
---|---|
New York, USA | USA |
Tokyo, Japan | Japan |
Paris, France | France |
str_remove_before(X1, sep = ",", include_sep = FALSE)
Removes text before the comma in column X1, but keeps the comma.
X1 | NEW |
---|---|
New York, USA | , USA |
Tokyo, Japan | , Japan |
Paris, France | , France |
Summary
Removes the text after a specified delimiter or regex match.
Syntax
str_remove_after(<text_column>
, sep = <delimiter>
, include_sep = <TRUE|FALSE>
)
Arguments
Return Value
Character
Example
str_remove_after(X1, sep = ",")
Removes text after the comma in column X1, including the comma.
X1 | NEW |
---|---|
New York, USA | New York |
Tokyo, Japan | Tokyo |
Paris, France | Paris |
str_remove_after(X1, sep = ",", include_sep = FALSE)
Removes text after the comma in column X1, but keeps the comma.
X1 | NEW |
---|---|
New York, USA | New York, |
Tokyo, Japan | Tokyo, |
Paris, France | Paris, |
Summary
Removes characters from a specified range in text data.
Syntax
str_remove_range(<text_column>
, start = <start_position>
, end = <end_position>
)
Arguments
Return Value
Character
Example
str_remove_range(X1, start = 1, end = 3)
Removes characters from the 1st to the 3rd position in column X1.
X1 | NEW |
---|---|
Hello | lo |
World | ld |
Analysis | lysis |
str_remove_range(X1, start = -3)
Removes the last 3 characters from column X1.
X1 | NEW |
---|---|
Hello | He |
World | Wo |
Analysis | Analy |
Summary
Replaces the text before a specified delimiter or regex match with a specified string.
Syntax
str_replace_before(<text_column>
, sep = <delimiter>
, rep = <replacement_string>
, include_sep = <TRUE|FALSE>
)
Arguments
Return Value
Character
Example
str_replace_before(X1, sep = ",", rep = "City")
Replaces text before the comma in column X1 with "City".
X1 | NEW |
---|---|
New York, USA | City USA |
Tokyo, Japan | City Japan |
Paris, France | City France |
str_replace_before(X1, sep = ",", rep = "City", include_sep = FALSE)
Replaces text before the comma in column X1 with "City", but keeps the comma.
X1 | NEW |
---|---|
New York, USA | City, USA |
Tokyo, Japan | City, Japan |
Paris, France | City, France |
Summary
Replaces the text after a specified delimiter or regex match with a specified string.
Syntax
str_replace_after(<text_column>
, sep = <delimiter>
, rep = <replacement_string>
, include_sep = <TRUE|FALSE>
)
Arguments
Return Value
Character
Example
str_replace_after(X1, sep = ",", rep = "Country")
Replaces text after the comma in column X1 with "Country".
X1 | NEW |
---|---|
New York, USA | New York Country |
Tokyo, Japan | Tokyo Country |
Paris, France | Paris Country |
str_replace_after(X1, sep = ",", rep = "Country", include_sep = FALSE)
Replaces text after the comma in column X1 with "Country", but keeps the comma.
X1 | NEW |
---|---|
New York, USA | New York, Country |
Tokyo, Japan | Tokyo, Country |
Paris, France | Paris, Country |
Summary
Replaces characters in a specified range of text data with another string.
Syntax
str_replace_range(<text_column>
, start = <start_position>
, end = <end_position>
, replacement = <replacement_string>
)
Arguments
Return Value
Character
Example
str_replace_range(X1, start = 1, end = 3, replacement = "NEW")
Replaces characters from the 1st to the 3rd position in column X1 with "NEW".
X1 | NEW |
---|---|
Hello | NEWlo |
World | NEWld |
Analysis | NEWlysis |
str_replace_range(X1, start = -3, replacement = "END")
Replaces the last 3 characters in column X1 with "END".
X1 | NEW |
---|---|
Hello | HeEND |
World | WoEND |
Analysis | AnalyEND |
Summary
Extracts the text before a specified delimiter or regex match.
Syntax
str_extract_before(<text_column>
, sep = <delimiter>
, include_sep = <TRUE|FALSE>
)
Arguments
Return Value
Character
Example
str_extract_before(X1, sep = ",")
Extracts text before the comma in column X1, excluding the comma.
X1 | NEW |
---|---|
New York, USA | New York |
Tokyo, Japan | Tokyo |
Paris, France | Paris |
str_extract_before(X1, sep = ",", include_sep = TRUE)
Extracts text before the comma in column X1, including the comma.
X1 | NEW |
---|---|
New York, USA | New York, |
Tokyo, Japan | Tokyo, |
Paris, France | Paris, |
Summary
Extracts the text after a specified delimiter or regex match.
Syntax
str_extract_after(<text_column>
, sep = <delimiter>
, include_sep = <TRUE|FALSE>
)
Arguments
Return Value
Character
Example
str_extract_after(X1, sep = ",")
Extracts text after the comma in column X1, excluding the comma.
X1 | NEW |
---|---|
New York, USA | USA |
Tokyo, Japan | Japan |
Paris, France | France |
str_extract_after(X1, sep = ",", include_sep = TRUE)
Extracts text after the comma in column X1, including the comma.
X1 | NEW |
---|---|
New York, USA | , USA |
Tokyo, Japan | , Japan |
Paris, France | , France |
Summary
Balances imbalanced data using SMOTE (Synthetic Minority Oversampling Technique). Creates synthetic data for the minority class and appropriately adjusts the majority class data to help build unbiased machine learning models.
The target column must be limited to 0/1 numeric values, binary categorical data, or logical type columns.
Syntax
exp_balance(<imbalanced column>
, target_minority_perc = <target percentage of minority data>
, target_size = <target number of rows>
, max_synth_perc = <maximum percentage increase for minority data>
, k = <number of neighbors>
, seed = <seed value>
)
Arguments
Return Value
Data Frame
Example
Original data class distribution (flight delay data):
Flight_ID | Delayed |
---|---|
1001 | TRUE |
1002 | FALSE |
1003 | FALSE |
1004 | FALSE |
1005 | FALSE |
1006 | FALSE |
1007 | FALSE |
1008 | TRUE |
Since the proportion of Delayed = TRUE is small, creating a machine learning model with this data might result in predictions biased toward FALSE.
exp_balance(Delayed, target_minority_perc = 50, target_size = 10, max_synth_perc = 200, k = 5, seed = 0)
Adjusts the proportion of minority data (TRUE) to 50% with a target dataset size of 10 rows.
Flight_ID | Delayed | Synthesized |
---|---|---|
1001 | TRUE | FALSE |
1002 | FALSE | FALSE |
1003 | FALSE | FALSE |
1004 | FALSE | FALSE |
1005 | FALSE | FALSE |
1006 | FALSE | FALSE |
2001 | TRUE | TRUE |
2002 | TRUE | TRUE |
2003 | TRUE | TRUE |
2004 | TRUE | TRUE |
Synthetic data (Synthesized = TRUE) is added, increasing the amount of Delayed = TRUE data, thus balancing the dataset.
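Conceptually, SMOTE creates each synthetic minority row by interpolating between a minority row and one of its k nearest minority neighbors. A toy sketch of that interpolation step, using hypothetical numeric features (not exp_balance's implementation):
set.seed(0)
x  <- c(dep_delay = 35, distance = 500)  # a minority-class row (hypothetical)
nb <- c(dep_delay = 50, distance = 800)  # one of its k nearest minority neighbors
x + runif(1) * (nb - x)                  # synthetic row on the segment between them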
Summary
Divides numeric data into equal intervals (e.g., in steps of 500) and returns the corresponding group name. Can be combined with format_cut_output function to customize the display format of the intervals.
Syntax
exp_cut_by_step(<numeric_column>
, step = <interval_width>
, lower.range = <minimum_value>
, upper.range = <maximum_value>
, right = <TRUE|FALSE>
, include.outside.range = <TRUE|FALSE>
, dig.lab = <display_digits>
, as.factor = <TRUE|FALSE>
)
Arguments
* numeric_column - Numeric data to be divided into intervals
* step - Specifies the width of the interval (e.g., 1000, 5000, etc.)
* lower.range (optional) - Specifies the lower limit of the range. If NA, the minimum value of the data is used
* upper.range (optional) - Specifies the upper limit of the range. If NA, the maximum value of the data is used
* right (optional) - Specifies how to handle the endpoints of each interval. TRUE includes the right endpoint; FALSE includes the left endpoint. Default is FALSE
* include.outside.range (optional) - Specifies how to handle values outside the specified range. TRUE includes them, creating intervals for values below the lower limit and above the upper limit; FALSE treats them as NA. Default is TRUE
* dig.lab (optional) - Specifies the number of digits to display in the interval labels. Default is 3
* as.factor (optional) - Specifies the data type of the return value. TRUE returns a factor (preserves order); FALSE returns a string. Default is TRUE
Return Value
Factor
Example
format_cut_output(exp_cut_by_step(sales_amount, step = 10000), right = FALSE, decimal.digits = 0)
Divides the sales amount into equal intervals in steps of 10000. Classifies by including the lower limit of each group and returns the corresponding group name in integer notation.
sales_amount | interval |
---|---|
5000 | < 5,000 |
15000 | 5,000 - 15,000 |
25000 | 15,000 - 25,000 |
format_cut_output(exp_cut_by_step(sales_amount, step = 5000, lower.range = 0, upper.range = 20000, right = TRUE), right = TRUE, decimal.digits = 0)
Restricts the sales amount to a range from 0 to 20000 and divides it into equal intervals in steps of 5000. Classifies by including the upper limit of each group and returns the corresponding group name in integer notation.
sales_amount | interval |
---|---|
5000 | 0 - 5,000 |
15000 | 10,000 - 15,000 |
25000 | 20,000 < |
Summary
Divides numeric data into equal width groups and returns the corresponding group name. Can be combined with format_cut_output function to customize the display format of the intervals.
Syntax
exp_cut(<numeric_column>
, breaks = <number_of_intervals>
, lower.range = <minimum_value>
, upper.range = <maximum_value>
, right = <TRUE|FALSE>
, include.outside.range = <TRUE|FALSE>
, dig.lab = <display_digits>
, as.factor = <TRUE|FALSE>
)
Arguments
* numeric_column - Numeric data to be divided into intervals
* breaks - Number of intervals to create
* lower.range (optional) - Lower limit of the range. If NA, the minimum value of the data is used
* upper.range (optional) - Upper limit of the range. If NA, the maximum value of the data is used
* right (optional) - How to handle interval endpoints. TRUE includes the right endpoint of each interval; FALSE includes the left endpoint. Default is FALSE
* include.outside.range (optional) - How to handle values outside the specified range. TRUE creates intervals for values below the lower limit and above the upper limit; FALSE treats such values as NA. Default is TRUE
* dig.lab (optional) - Number of digits to display in the interval labels. Default is 3
* as.factor (optional) - Data type of the return value. TRUE returns a factor (preserves order); FALSE returns a character string. Default is TRUE
Return Value
Factor
Example
format_cut_output(exp_cut(sales_amount, breaks = 5), right = TRUE, decimal.digits = 2)
Divides the sales amount values into 5 equal-width groups based on the minimum and maximum values, and returns which group each value belongs to. Group labels are displayed with up to 2 decimal places.
sales_amount | group |
---|---|
1 | 1 - 5,000 |
4800 | 1 - 5,000 |
13000 | 10,000 - 15,000 |
25000 | 20,000 - 25,000 |
format_cut_output(exp_cut(sales_amount, breaks = 4, lower.range = 0, upper.range = 20000, right = TRUE), right = TRUE, decimal.digits = 2)
Divides the sales amount into 4 equal-width groups within the range of 0 to 20000, and returns which group each value belongs to. Each group includes its upper limit, and the group labels are displayed with up to 2 decimal places.
sales_amount | group |
---|---|
1 | 0.00 - 5,000.00 |
4800 | 0.00 - 5,000.00 |
13000 | 10,000.00 - 15,000.00 |
25000 | 20,000.00 < |
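Likewise, the fixed-count grouping parallels base R's cut() when breaks is given as a single number. A rough sketch of the analogy:
cut(c(1, 4800, 13000, 25000), breaks = 5)
This produces 5 equal-width bins spanning the data's minimum-to-maximum range.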
Summary
Applies the logistic (sigmoid) function, 1 / (1 + exp(-x)), mapping numeric values into the range (0, 1).
Syntax
logistic(<column_num>
)
Return Value
Numeric
Example
mutate(Y = logistic(X))
Creates a new column Y containing the result of the logistic function applied to column X.
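For reference, the same transformation can be written in base R. A minimal sketch reusing the column X from the example above:
mutate(Y = 1 / (1 + exp(-X)))
For inputs -2, 0, and 2 this yields approximately 0.119, 0.5, and 0.881.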
Summary
Returns TRUE if the provided date is a Japanese holiday.
Syntax
is_jholiday(<column_date>
)
Arguments
Return Value
Logical (either TRUE or FALSE)
Example
is_jholiday("2020-01-01")
returns TRUE
is_jholiday("2020-01-05")
returns FALSE
Summary
Converts between full-width (zenkaku) and half-width (hankaku) characters in a string.
Syntax
str_conv_zenhan(<text_column>
, to = "zenkaku"|"hankaku"
)
Arguments
Return Value
Character
Example
str_conv_zenhan(X1, "zenkaku")
Converts half-width characters to full-width characters.
X1 | NEW |
---|---|
ｶﾞｯ | ガッ |
ﾊﾛｰ | ハロー |
ABC | ＡＢＣ |
str_conv_zenhan(X1, to = "hankaku")
Converts full-width characters to half-width characters.
X1 | NEW |
---|---|
ガッ | ｶﾞｯ |
ハロー | ﾊﾛｰ |
ＡＢＣ | ABC |
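A comparable half-width to full-width conversion can be approximated with the stringi package's ICU transliterator. A sketch of the approximation, not the function's actual implementation:
stringi::stri_trans_general(c("ｶﾞｯ", "ABC"), "Halfwidth-Fullwidth")
This returns "ガッ" and "ＡＢＣ"; the "Fullwidth-Halfwidth" transliterator performs the reverse direction.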
Summary
Converts between hiragana and katakana characters in a string.
Syntax
str_conv_hirakana(<text_column>
, to = "hiragana"|"katakana"
)
Arguments
Return Value
Character
Example
str_conv_hirakana(X1, to = "hiragana")
Converts katakana characters to hiragana.
X1 | NEW |
---|---|
アイウエオ | あいうえお |
カタカナ | かたかな |
ヤマダ | やまだ |
str_conv_hirakana(X1, to = "katakana")
Converts hiragana characters to katakana.
X1 | NEW |
---|---|
あいうえお | アイウエオ |
かたかな | カタカナ |
やまだ | ヤマダ |
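A comparable conversion is also available through stringi's ICU transliterators. A sketch of the approximation, not the function's actual implementation:
stringi::stri_trans_general("カタカナ", "Katakana-Hiragana")
This returns "かたかな"; the "Hiragana-Katakana" transliterator performs the reverse direction.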
Summary
Splits a Japanese address string into prefecture, city, and street components.
Syntax
separate_japanese_address(<address_column>
, prefecture_col = <prefecture_column_name>
, city_col = <city_column_name>
, street_col = <street_column_name>
)
Arguments
Return Value
Character
Example
separate_japanese_address(address)
Splits the address string into prefecture, city, and street components.
address | prefecture | city | street |
---|---|---|---|
東京都千代田区丸の内1-1-1 | 東京都 | 千代田区 | 丸の内1-1-1 |
大阪府大阪市中央区北浜1-1-1 | 大阪府 | 大阪市中央区 | 北浜1-1-1 |
北海道札幌市中央区北1条西2丁目 | 北海道 | 札幌市中央区 | 北1条西2丁目 |
separate_japanese_address(address, prefecture_col = "prefecture", city_col = "city", street_col = "street_address")
Splits the address string with custom column names for prefecture, city, and street components.
address | prefecture | city | street_address |
---|---|---|---|
東京都千代田区丸の内1-1-1 | 東京都 | 千代田区 | 丸の内1-1-1 |
大阪府大阪市中央区北浜1-1-1 | 大阪府 | 大阪市中央区 | 北浜1-1-1 |
北海道札幌市中央区北1条西2丁目 | 北海道 | 札幌市中央区 | 北1条西2丁目 |
Summary
Converts Japanese era years (Reiwa, Heisei, Showa, etc.) to Western calendar years. Supports both kanji notation (令和三年) and alphabetic notation (R3).
Syntax
convert_jyear(<japanese_era_year>
, legacy = <TRUE|FALSE>
)
Arguments
Return Value
Numeric
Example
convert_jyear(japanese_era_year)
Converts Japanese era years to Western calendar years.
japanese_era_year | western_year |
---|---|
令和3年 | 2021 |
R3 | 2021 |
平成元年 | 1989 |
昭和64年 | 1989 |
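The conversion amounts to adding a fixed offset per era (Reiwa 1 = 2019, Heisei 1 = 1989, Showa 1 = 1926). A minimal arithmetic sketch for the Reiwa case:
reiwa_year <- 3
reiwa_year + 2018
This yields 2021, matching convert_jyear("令和3年") in the table above.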
Summary
Converts prefecture names between kanji format (e.g., 東京都) and romanized format (e.g., Tokyo-to). Useful for normalizing and standardizing address data.
Syntax
convert_prefecture(<prefecture_name_column>
, to = <"kanji"|"roman">
)
Arguments
Return Value
Character
Example
convert_prefecture(prefecture, to = "kanji")
Converts romanized prefecture names to kanji format.
prefecture | converted_prefecture |
---|---|
tokyo-to | 東京都 |
osaka-fu | 大阪府 |
hokkaido | 北海道 |
Summary
Converts prefecture names written in hiragana or katakana to kanji notation. Useful for normalizing address data and name matching processes.
Syntax
convert_prefecture_from_kana(<kana_prefecture_name>
)
Arguments
Return Value
Character
Example
convert_prefecture_from_kana(prefecture_hiragana)
Converts prefecture names in hiragana to kanji notation.
prefecture_hiragana | prefecture_kanji |
---|---|
とうきょうと | 東京都 |
おおさかふ | 大阪府 |
ホッカイドウ | 北海道 |
Summary
Retrieves the date for a specific weekday in a given year and month. For example, it can find "the 3rd Tuesday of May 2023." Useful for holiday settings and recurring schedule management.
Syntax
find_date_by_wday(<year>
, <month>
, <weekday>
, <occurrence>
)
Arguments
Return Value
Date
Example
find_date_by_wday(2023, 5, 2, 3)
Retrieves the date of the 3rd Tuesday in May 2023.
Input | Output |
---|---|
(2023, 5, 2, 3) | 2023-05-16 |
(2023, 1, 1, 1) | 2023-01-02 |
(2023, 12, 7, 4) | 2023-12-24 |
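For intuition, the lookup can be reproduced with a small base-R helper. A minimal sketch assuming the ISO weekday convention (1 = Monday through 7 = Sunday), which matches the Tuesday and Sunday rows above; the helper name nth_wday is hypothetical:
nth_wday <- function(year, month, wday, n) {
  first <- as.Date(sprintf("%04d-%02d-01", year, month))
  days <- seq(first, by = "day", length.out = 28)  # 4 full weeks cover occurrences 1 through 4
  days[as.integer(format(days, "%u")) == wday][n]  # %u is the ISO weekday (1 = Monday)
}
nth_wday(2023, 5, 2, 3)
This returns "2023-05-16", the 3rd Tuesday of May 2023.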
Summary
Standardizes prefecture names to a unified format. Absorbs differences such as the presence/absence of "都", "府", "県" suffixes, as well as abbreviated and formal name variations, to provide consistent notation.
Syntax
harmonize_prefecture_name(<prefecture_name>
, to = "long"|"short"
)
Arguments
Return Value
Character
Example
harmonize_prefecture_name(prefecture_abbreviated, to = "long")
Standardizes abbreviated prefecture names (short) to formal name format (long).
prefecture_abbreviated | prefecture_formal |
---|---|
東京 | 東京都 |
大阪府 | 大阪府 |
京都 | 京都府 |
Summary
Determines whether a string is a valid prefecture name. Returns FALSE if the string contains additional text such as city names or if it's an invalid prefecture name.
Syntax
is_prefecture(<string>
)
Arguments
Return Value
Logical
Example
is_prefecture(prefecture_name)
Evaluates whether the string is a valid prefecture name.
prefecture_name | is_valid_prefecture |
---|---|
東京都 | TRUE |
大阪 | TRUE |
東京都新宿区 | FALSE |
とうきょう | FALSE |
Summary
Determines whether a string is in the correct format for a Japanese postal code. Supports both 7-digit numeric format and hyphenated format.
Syntax
is_zipcode(<string_or_number>
)
Arguments
Return Value
Logical
Example
is_zipcode(zipcode)
Evaluates whether the string is in a valid postal code format.
zipcode | is_valid_zipcode |
---|---|
100-0001 | TRUE |
1000001 | TRUE |
100-001 | FALSE |
abc-defg | FALSE |
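The format check itself is a simple pattern match. An approximate regex sketch (an assumed equivalence, not the function's actual rule):
grepl("^[0-9]{3}-?[0-9]{4}$", c("100-0001", "1000001", "100-001", "abc-defg"))
This returns TRUE, TRUE, FALSE, FALSE, matching the table above.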
Summary
Converts between different Japanese character types (hiragana, katakana, full-width, half-width, etc.). By specifying the conversion function and target format, various character type conversions are supported.
Syntax
str_jconv(<string>
, fun = <conversion_function>
, to = <target_format>
)
Arguments
Return Value
Character
Example
str_jconv("カタカナの列", str_conv_hirakana, to = "hiragana")
Converts katakana to hiragana.
katakana_column | hiragana_column |
---|---|
カタカナ | かたかな |
アイウエオ | あいうえお |
Summary
Normalizes strings to a standardized format. Converts full-width alphanumeric characters to half-width, unifies space characters, removes duplicate spaces, and more.
Syntax
str_jnormalize(<string>
)
Arguments
Return Value
Character
Example
str_jnormalize(string)
Normalizes strings to a standardized format.
string | normalized_string |
---|---|
ＡＢＣ　１２３ | ABC 123 |
デ ー タ | データ |
Ｒ＆Ｄ部門 | R&D部門 |
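Part of this normalization corresponds to Unicode NFKC. A minimal stringi sketch of that piece (the full function additionally unifies and deduplicates spaces, which NFKC alone does not):
stringi::stri_trans_nfkc("ＡＢＣ　１２３")
This returns "ABC 123", converting the full-width alphanumerics and the ideographic space to half-width.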
Summary
Formats postal code strings. Can convert between numeric-only format and hyphenated format. Useful for data standardization and display format unification.
Syntax
zipcode_spacer(<zipcode>
, remove = <TRUE|FALSE>
)
Arguments
Return Value
Character
Example
zipcode_spacer(zipcode, remove = FALSE)
Adds hyphens to 7-digit postal codes; values that already contain a hyphen are kept in the hyphenated format.
zipcode | formatted_zipcode |
---|---|
1000001 | 100-0001 |
100-0001 | 100-0001 |
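The hyphen insertion can be approximated with a base-R substitution. A sketch of the approximation, not the function's actual implementation:
sub("^([0-9]{3})([0-9]{4})$", "\\1-\\2", c("1000001", "100-0001"))
This returns "100-0001" for both inputs, since already-hyphenated codes do not match the pattern and pass through unchanged.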
Summary
Converts Japanese era date strings (e.g., 令和3年1月1日, R3/1/1) to Western calendar dates (e.g., 2021-01-01). For Japanese-formatted dates that already use the Gregorian calendar, such as "yyyy年mm月" or "yyyy年mm月dd日", use the lubridate functions (ym or ymd) that match the date order (year, month, day) instead of convert_jdate; the lubridate parsers correctly handle the Japanese markers 年 (year), 月 (month), and 日 (day).
Syntax
convert_jdate(<japanese_date_string>
, legacy = <TRUE|FALSE>
)
Arguments
Return Value
Date
Example
convert_jdate(japanese_date_string)
Converts Japanese era dates to Western calendar dates.
japanese_date_string | western_date |
---|---|
令和3年1月1日 | 2021-01-01 |
R3/1/1 | 2021-01-01 |
平成31年4月30日 | 2019-04-30 |
Gregorian Year-Month Pattern (e.g., 2024年1月, 2021年09月)
mutate(year_month_column = ym(year_month_column))
Gregorian Year-Month-Day Pattern (e.g., 2024年1月1日, 2021年09月15日)
mutate(date_column = ymd(date_column))