! | Reverses the meaning of the following condition. It is a logical negation and has a similar meaning as 'Not'. |
all | Return TRUE if the condition is satisfied for all the rows or a group of rows. |
any | Return TRUE if the condition is satisfied for any of the rows or a group of rows. |
duplicated | Returns a logical value (TRUE or FALSE) indicating whether a value or row is duplicated. |
ifelse | Returns different values based on the conditional result. |
is.character | Returns true if a given object is character type. |
is.double | Returns true if a given object is double type. |
is.factor | Returns true if a given object is factor type. |
is.list | Returns true if a given object is list type. |
is.logical | Returns true if a given object is logical type. |
is.matrix | Returns true if a given object is matrix type. |
is.na | Returns TRUE when it is NA. |
is.null | Returns TRUE when it is null. |
is.numeric | Returns true if a given object is numeric type. |
isTRUE | Returns true if and only if a given value is a length-one logical vector whose only element is TRUE. |
which | Returns the TRUE indices of a logical object or based on the condition. |
xor | Performs an exclusive-or check across two values and returns TRUE when only one of them is TRUE. |
is.Date | Returns true if a given object is Date type. |
is.difftime | Returns true if a given object is difftime type. |
is.duration | Returns true if a given object is duration type. |
is.POSIXct | Returns true if a given object is POSIXct type. |
str_detect | Returns TRUE or FALSE based on whether text data contains a given text or not. |
if_else | Returns values based on whether a given condition is satisfied or not. |
case_when | Returns values based on multiple conditions. This is similar to CASE WHEN conditions in SQL. You can use a two-sided formula to define the condition and the mapped values, and use TRUE as the ELSE condition. See examples below. |
coalesce | Replace the missing values with a given value or the values that are at the same position in other columns. |
na_if | Replaces the matching values to NA. |
near | Compares two numeric vectors (columns). This is a safe way of comparing if two vectors of floating point numbers are (pairwise) equal. This is safer than using ==, because it has a built-in tolerance. |
recode_factor | Recodes (or replaces) the original values with given values and creates a factor column. You can replace numeric values either by the name with backticks or based on their position, and character values by their name. |
parse_logical | Convert data to Logical (Boolean) data type. |
is.hms | Returns true if a given object is hms type. |
str_logical | Converts the data type of a column to logical type, transforming values into TRUE or FALSE. If the "true_value" argument is not specified, it converts "true", "Yes", "yes", "1" to TRUE, and "false", "No", "no", "0" to FALSE. If you want to make a specific value TRUE and all other values FALSE, specify the original value in true_value. Even if the target column for logical conversion contains a mix of uppercase and lowercase letters, the conversion ignores case and focuses only on the string content. |
impute_na | Impute NA by mean / median / mode values, manually entered values, or predicted values. It will build a linear regression model for the prediction. |
get_mode | Returns the most frequent value (mode) in a column. |
summarize_group | Summarize (or aggregate) column values with specified aggregate functions. If grouping columns are provided then summarize the values for each group. |
summarize_row | Summarize (or aggregate) column values across columns for each row with the specified aggregate function. |
mutate_group | Create new columns or update existing columns with specified functions such as Window Calculations. If grouping columns are provided then it performs calculation for each group. |
confint_ratio | Returns a confidence interval range (half-width of confidence interval) of TRUE ratio from given logical data. |
calc_confint_ratio | Returns a confidence interval range (half-width of confidence interval) of ratio from a size and a TRUE ratio of sample data. |
abbreviate | Abbreviate strings to at least minlength characters, such that they remain unique (if they were). First all spaces at the beginning of the string are stripped. Then (if necessary) any other spaces are stripped. Next, lower case vowels are removed (starting at the right) followed by lower case consonants. Finally if the abbreviation is still longer than minlength upper case letters are stripped. |
c | This is a generic function which combines its arguments. |
iconv | Convert a given text from one encoding to another. |
nchar | Returns the number of letters or the size of a given value. |
stri_compare | Compares two strings lexicographically and determines their order. The result is returned as a numeric value: -1 (comes before), 0 (equal), or 1 (comes after). |
stri_datetime_format | Converts a date-time column to a string based on the specified format. |
stri_escape_unicode | Converts non-ASCII characters (such as Japanese multibyte characters or special symbols) in a string to Unicode escape sequences (\uXXXX format). ASCII characters (alphanumeric characters and basic symbols) are preserved as is. |
stri_numbytes | Returns the number of bytes required to store text data. In UTF-8 encoding, one character can be represented by 1-4 bytes, so the number of characters and the number of bytes may differ. |
stri_rand_strings | Generates random strings of specified length. Various character types can be used for generation, including ASCII characters and Unicode characters. |
regex | Modifier functions control the meaning of the pattern argument to stringr functions |
str_c | Concatenates text values from multiple columns. |
str_conv | Convert text to a specified encoding. e.g. UTF-8 |
str_count | Returns the number of letters or words in text data. You can also set a pattern to count only the matched pattern or text. |
str_detect | Returns TRUE or FALSE based on whether text data contains a given text or not. |
str_dup | Repeat text values. |
str_extract | Extracts only the letters that match given letters or patterns. |
str_extract_all | Extracts all characters that match given letters or patterns. Can be combined with the list_to_text function to convert extracted strings into a concatenated text. |
str_length | Returns the number of letters in text data. |
str_pad | Adds white spaces to text values to make the total length exactly a specified number. |
str_replace | Replaces the first occurrence of letters that match given letters or expressions. |
str_replace_all | Replaces all occurrences of letters that match given letters or expressions. |
str_replace_na | Replaces NA values with a specific text. |
str_split | Splits a given text into multiple texts by a given separator. It returns a list, so you may want to unnest the newly created column with the unnest() function. |
str_sub | Extracts letters from text values based on position. |
str_to_lower | Convert text to lowercase |
str_to_title | Convert text to Title-case |
str_to_upper | Convert text to UPPERCASE |
str_trim | Trim (or remove) white spaces in Text values |
word | Extract First / Last Nth Word in Text data |
extract_numeric | Extract only the number from a given text. This is useful for strings that are supposed to be numbers with extra formatting (e.g. $1,200.34, -12%, X1, etc.). |
url_parameters | Returns decoded query parameters from url. |
coalesce | Replace the missing values with a given value or the values that are at the same position in other columns. |
na_if | Replaces the matching values to NA. |
recode | Recodes (or replaces) the original values with given values. You can replace numeric values either by the name with backticks or based on their position, and character values by their name. |
parse_number | Parse characters, extract numeric values, and convert to number data type, including values with extra formatting such as "$1,200" or "15%". |
str_squish | Remove leading and trailing white spaces and convert multiple consecutive white spaces into a single space. |
str_trunc | Truncate text to make the maximum length to be the specified number. |
str_remove | Remove letters that match with given letters or expressions. |
str_remove_all | Remove letters that match with given letters or expressions |
str_remove_word | Remove a word from a sentence. |
str_replace_word | Replace a word in a sentence with given text. |
str_remove_inside | Removes text inside symbols with given begin and end symbols. |
str_replace_inside | Replaces text inside symbols with given begin and end symbols. |
str_remove_url | Removes a URL inside text. |
str_replace_url | Replaces a URL inside text with provided text. |
str_extract_url | Extracts a URL from inside text. |
str_remove_emoji | Removes emoji from text. |
anonymize | Anonymize values by hashing algorithms. |
str_clean | Cleans up text by removing escape characters (e.g. \n, \t), extra white spaces, extra period, and leading and trailing spaces. |
str_count_all | Count patterns from texts |
str_normalize | Normalize text by replacing full-width alphabets, numbers, special characters with regular alphabets, numbers, special characters, replacing half-width Kana characters with full-width Kana characters, and applying other normalization rules. It follows the rule of Unicode Normalization Forms. This is a wrapper function around stringi::stri_trans_nfkc function. |
str_extract_inside | Extract text inside symbols with given begin and end symbols. If multiple strings are extracted, they will be returned as a list type, so please wrap them with the list_to_text function. |
str_logical | Converts the data type of a column to logical type, transforming values into TRUE or FALSE. If the "true_value" argument is not specified, it converts "true", "Yes", "yes", "1" to TRUE, and "false", "No", "no", "0" to FALSE. If you want to make a specific value TRUE and all other values FALSE, specify the original value in true_value. Even if the target column for logical conversion contains a mix of uppercase and lowercase letters, the conversion ignores case and focuses only on the string content. |
get_stopwords | Returns stop words like "a", "the", "and", etc. |
word_to_sentiment | Returns sentiment types of positive or negative based on word(s). |
get_sentiment | Returns the sentiment score for a text sentence. The positive score indicates the positive sentence and the negative score indicates the opposite. 0 means 'neutral.' |
stem_word | Stems words so that nearly identical words with slightly different spellings can be recognized as the same word. |
is_stopword | Returns TRUE if a word is included in a list of the stop words defined by one of the dictionaries. |
is_empty | Returns TRUE if a text is an empty string or NA. |
is_alphabet | Returns TRUE if a text contains only alphabetic characters. |
ip_to_country | Returns country names from IP addresses. |
url_domain | Returns domain (ex. "exploratory.io") from url. |
url_fragment | Returns fragment from url. |
url_path | Returns path from url. |
url_port | Returns port from url. |
url_scheme | Returns scheme (ex. "http", "https") from url. |
url_suffix | Returns suffix (ex. "com", "org") from url. |
url_subdomain | Returns subdomain (ex. "www", "blog") from url. |
url_tld | Returns top-level domain (ex. "com", "co") from url. |
url_param | Returns decoded query parameter from url. |
countrycode | Converts country names or country codes to other formats of country codes or names (country names, continent names, region names, etc.). This is useful for standardizing country codes from different data sources when conducting international data analysis or creating reports. |
statecode | This function takes a column that has US State information and returns either US State names, abbreviations, numeric codes, division, or region, based on 'output_type' argument setting. The US State information can be either US State names, US State abbreviations (FIPS / ANSI codes), or US State numeric code (FIPS) in any combination. |
countycode | Generate US county code (FIPS - Federal Information Processing Standard) based on US State and County names. |
impute_na | Impute NA by mean / median / mode values, manually entered values, or predicted values. It will build a linear regression model for the prediction. |
get_mode | Returns the most frequent value (mode) in a column. |
summarize_group | Summarize (or aggregate) column values with specified aggregate functions. If grouping columns are provided then summarize the values for each group. |
summarize_row | Summarize (or aggregate) column values across columns for each row with the specified aggregate function. |
mutate_group | Create new columns or update existing columns with specified functions such as Window Calculations. If grouping columns are provided then it performs calculation for each group. |
str_remove_before | Removes the text before a specified delimiter or regex match. |
str_remove_after | Removes the text after a specified delimiter or regex match. |
str_remove_range | Removes characters from a specified range in text data. |
str_replace_before | Replaces the text before a specified delimiter or regex match with a specified string. |
str_replace_after | Replaces the text after a specified delimiter or regex match with a specified string. |
str_replace_range | Replaces characters in a specified range of text data with another string. |
str_extract_before | Extracts the text before a specified delimiter or regex match. |
str_extract_after | Extracts the text after a specified delimiter or regex match. |
str_conv_zenhan | Converts between full-width (zenkaku) and half-width (hankaku) characters in a string. |
str_conv_hirakana | Converts between hiragana and katakana characters in a string. |
separate_japanese_address | Splits a Japanese address string into prefecture, city, and street components. |
convert_jyear | Converts Japanese era years (Reiwa, Heisei, Showa, etc.) to Western calendar years. Supports both kanji notation (令和三年) and alphabetic notation (R3). |
convert_prefecture | Converts prefecture names between kanji format (e.g., 東京都) and romanized format (e.g., Tokyo-to). Useful for normalizing and standardizing address data. |
convert_prefecture_from_kana | Converts prefecture names written in hiragana or katakana to kanji notation. Useful for normalizing address data and name matching processes. |
harmonize_prefecture_name | Standardizes prefecture names to a unified format. Absorbs differences such as the presence/absence of "都", "府", "県" suffixes, as well as abbreviated and formal name variations, to provide consistent notation. |
is_zipcode | Determines whether a string is in the correct format for a Japanese postal code. Supports both 7-digit numeric format and hyphenated format. |
str_jconv | Converts between different Japanese character types (hiragana, katakana, full-width, half-width, etc.). By specifying the conversion function and target format, various character type conversions are supported. |
str_jnormalize | Normalizes strings to a standardized format. Converts full-width alphanumeric characters to half-width, unifies space characters, removes duplicate spaces, and more. |
zipcode_spacer | Formats postal code strings. Can convert between numeric-only format and hyphenated format. Useful for data standardization and display format unification. |
abs | Returns the absolute value. |
acos | Returns the inverse cosine of a value, in radians |
acosh | Returns the inverse hyperbolic cosine of a number. |
asin | Returns the inverse sine of a value, in radians. |
asinh | Returns the inverse hyperbolic sine of a number. |
atan | Returns the inverse tangent of a value, in radians. |
atan2 | Returns the angle between the x-axis and a line segment from the origin (0,0) to specified coordinate pair (`x`,`y`), in radians. |
atanh | Returns the inverse hyperbolic tangent of a number. |
ceiling | Rounds numeric values up to the specified decimal place. |
cos | Returns the cosine of an angle provided in radians. |
cosh | Returns the hyperbolic cosine of any real number. |
cospi | Returns cos(pi * x) for a given value x. |
cut | Divides the range of values into intervals and return the values in x according to which interval they fall. |
exp | Returns Euler's number, e (~2.718) raised to a power. |
floor | Rounds a number down to the nearest integer. floor(3.475) is 3. |
log | Returns logarithms, by default natural logarithms. |
log10 | Computes common (i.e., base 10) logarithms. |
log1p | Computes log(1+x) accurately also for |x| << 1. |
log2 | Computes binary (i.e., base 2) logarithms. |
pmax | Returns the parallel maxima of the input values. |
pmin | Returns the parallel minima of the input values. |
round | Rounds a number to a certain number of decimal places. |
sign | Returns either 1, 0, or -1 depending on whether the input value is positive, zero, or negative, respectively. |
signif | Rounds the values in its first argument to the specified number of significant digits. |
sin | Returns the sine of an angle provided in radians. |
sinh | Returns the hyperbolic sine of any real number. |
sinpi | Returns sin(pi * x) for a given value x. |
sqrt | Returns the square root of the values. |
tan | Returns the tangent of an angle provided in radians. |
tanh | Returns the hyperbolic tangent of any real number. |
tanpi | Returns tan(pi * x) for a given value x. |
trunc | Truncates the numbers by omitting digits. |
coalesce | Replace the missing values with a given value or the values that are at the same position in other columns. |
near | Compares two numeric vectors (columns). This is a safe way of comparing if two vectors of floating point numbers are (pairwise) equal. This is safer than using ==, because it has a built-in tolerance. |
recode | Recodes (or replaces) the original values with given values. You can replace numeric values either by the name with backticks or based on their position, and character values by their name. |
normalize | Centers and/or scales the numeric values of a column. |
impute_na | Impute NA by mean / median / mode values, manually entered values, or predicted values. It will build a linear regression model for the prediction. |
detect_outlier | Detects outliers in a numeric column and returns labels indicating if values are outliers ('Upper' or 'Lower') or not ('Normal'). This function is useful for labeling outliers or for easily removing outliers when used in combination with a filter. |
cluster | Build K-means clustering model and label each row with cluster id. |
get_mode | Returns the most frequent value (mode) in a column. |
summarize_group | Summarize (or aggregate) column values with specified aggregate functions. If grouping columns are provided then summarize the values for each group. |
summarize_row | Summarize (or aggregate) column values across columns for each row with the specified aggregate function. |
mutate_group | Create new columns or update existing columns with specified functions such as Window Calculations. If grouping columns are provided then it peforms calculation for each group. |
confint_mean | Returns a confidence interval range (half-width of confidence interval) of mean from given numeric data. |
calc_confint_mean | Returns a confidence interval range (half-width of confidence interval) of mean from a size and a standard deviation of sample data. |
cumsum_decayed | Calculates the cumulative sum of decaying effects. It is the same as cumsum when r (the second argument) is 1. |
ts_lag | Returns the value of the previous period. |
ts_diff | Returns differences in values in a time series data since the specified lag time ago. |
ts_diff_ratio | Returns the differences of the values in ratio compared to the base values from the lag time ago in a time series. |
likert_sigma | Returns Likert's sigma values given a raw data column of survey answers with selected levels such as 1="Strongly disagree", 2="Disagree", 3="Neutral", 4="Agree", 5="Strongly agree". |
logistic | Logistic function. |
all | Return TRUE if the condition is satisfied for all the rows or a group of rows. |
any | Return TRUE if the condition is satisfied for any of the rows or a group of rows. |
length | Returns the length of the values. |
max | Returns the maximum value in a numeric column. |
mean | Returns the numerical average (mean) value. |
min | Returns the minimum value in a numeric column. |
prod | Returns the product value by multiplying a series of numbers in a given data together. |
sum | Returns the sum of all the values. |
unique | Returns the unique values. |
n | Returns the count of the rows for each group. |
n_distinct | Returns the count of unique values. |
nth | Returns the nth value of all the values. |
IQR | Computes the interquartile range of the values. |
mad | Returns the median absolute deviation of the values. |
median | Returns the numerical median value. |
quantile | Returns sample quantiles corresponding to the given probabilities. The smallest observation corresponds to a probability of 0 and the largest to a probability of 1. |
sd | Returns the standard deviation of the values. |
var | Returns the variance of the values. |
get_mode | Returns the most frequent value (mode) in a column. |
summarize_group | Summarize (or aggregate) column values with specified aggregate functions. If grouping columns are provided then summarize the values for each group. |
sum_if | Returns the sum of all the values that satisfy given conditions. |
sum_if_ratio | Returns the ratio of the sum of all the values that satisfy given conditions to the sum of all the values. |
count_if | Summarize the data by counting number of rows that satisfy given conditions. |
count_if_ratio | Returns the ratio of the count of rows that satisfy given conditions to the total count of rows. |
count_unique_if | Counts the number of unique values of a column that come from rows that satisfy given conditions. |
count_unique_if_ratio | Returns the ratio of the number of unique values of a column that come from rows that satisfy given conditions to the total number of unique values. |
mean_if | Returns the numerical average (mean) of the values that satisfy given conditions. |
average_if | Returns the numerical average (mean) of the values that satisfy given conditions. This is an alias of mean_if. |
median_if | Returns the numerical median of the values that satisfy given conditions. |
max_if | Returns the maximum value from a numeric or date column that meets the specified condition. |
min_if | Returns the minimum value from a numeric or date column that meets the specified condition. |
count_rows | Returns the count of the rows for each group. |
count_unique | Returns the count of unique values. |
as.character | Convert a given data to character data type. |
as.Date | Convert a given data to Date data type. Date data type doesn't include Time. The default formats follow the rules of the ISO 8601 international standard which expresses a day as "2001-02-03". If the given data is number as 'internal system number (# of days since some origin)' then you want to provide which date to be used as 'origin'. |
as.double | Convert data to double precision number data type. It is identical to numeric. |
as.factor | Encode a column as a factor. |
as.integer | Convert data to integer number data type. |
as.logical | Converts the target column to logical type (TRUE/FALSE). For numeric values, 0 is converted to FALSE, and all non-zero values are converted to TRUE. For character strings, "TRUE", "True", "true", "T", "t" are converted to TRUE, and "FALSE", "False", "false", "F", "f" are converted to FALSE. Any other strings result in NA. The difference with the str_logical function is that str_logical can also convert Yes/No values to TRUE/FALSE, and allows you to directly specify which values should be converted to TRUE. |
as.numeric | Convert data to numeric data type. |
as.POSIXct | Convert data to Date/Time data type. |
factor | Encode a column as a factor. |
parse_date_time | Convert data to Date/Time data type. |
stri_datetime_format | Converts a date-time column to a string based on the specified format. |
parse_number | Parse characters, extract numeric values, and convert to number data type, including values with extra formatting such as "$1,200" or "15%". |
parse_double | Convert data to double precision number data type. |
parse_euro_double | Convert data in European number format (comma as the decimal mark) to double precision number data type. |
parse_integer | Convert data to integer number data type. |
parse_time | Converts character strings containing time values into time objects (hms format). When combined with as.character(), it can effectively transform time data from various formats into standardized hms objects. |
parse_character | Convert a given data to character data type. |
parse_factor | Parse data and convert to factor data type. |
parse_logical | Convert data to Logical (Boolean) data type. |
as_date | Convert a given POSIXct to Date. |
as_datetime | Convert a given data to date/time (POSIXct). |
as.hms | Converts numeric values representing seconds into hms (hours:minutes:seconds) time format. |
excel_numeric_to_date | Converts Excel numeric date values (e.g., 42370) to date format (e.g., 2016-01-01) |
excel_numeric_to_datetime | Convert number of date/time from excel to POSIXct. |
unixtime_to_datetime | Convert unix time numeric values to POSIXct. |
chartr | Translates each character in x that is specified in old to the corresponding character specified in new. |
list | Construct a list |
rep | Repeats given numbers. |
rev | Reverses the entries in a given data. |
desc | Change the sorting order to a descending order. e.g. 9 to 1 instead of 1 to 9, z to a instead of a to z. |
cummax | Returns the cumulative maxima. |
cummin | Returns the cumulative minima. |
cumprod | Returns the cumulative products. |
cumsum | Returns the cumulative sums. |
cumall | Returns TRUE if all the values up to the current position are TRUE. |
cumany | Returns TRUE if any of the values up to the current position is TRUE. |
cume_dist | Cumulative distribution. Proportion of all values less than or equal to the current rank. Missing values are left as is. |
cummean | Returns the cumulative mean (average). |
dense_rank | Ranks with no gaps. Missing values are left as is. |
first | returns the first value of all the values. |
lag | Provides access to a row at a given physical offset prior to that position. |
last | returns the last value of all the values. |
lead | Provides access to a row at a given physical offset that follows the current row. |
min_rank | Ranks. Ties get min rank. Missing values are left as is. |
ntile | Divides data into a specified number (n) of "equal frequency" groups (with approximately the same number of rows in each group) and returns an integer value (from 1 to n) indicating which group each row belongs to. Note that this "equal frequency division" distributes rows evenly across groups, which differs from equal width (same value range) division. |
percent_rank | Ranks rescaled to [0, 1]. Missing values are left as is. |
row_number | Returns row numbers. Equivalent to Rank. |
roll_max | Returns the rolling (or moving) max value. |
roll_mean | Calculate a moving average for a column. By specifying a window size, it is possible to return the moving average over the specified interval. |
roll_median | Returns the rolling (or moving) median value. |
roll_min | Returns the rolling (or moving) min value. |
roll_prod | Returns the rolling (or moving) product value. |
roll_sd | Returns the rolling (or moving) standard deviation value. |
roll_sum | Returns the rolling (or moving) sum value. |
roll_var | Returns the rolling (or moving) variance value. |
mutate_group | Create new columns or update existing columns with specified functions such as Window Calculations. If grouping columns are provided then it performs calculation for each group. |
difftime | Calculates the difference between two given dates/times. |
months | Convert a number to month periods so you can use it for further calculations. |
ceiling_date | Rounds date/time up to the nearest integer value of the specified time unit. |
date_decimal | Converts numeric data calculated as a fraction of the year to date data type. |
day | Extract Day |
days | Convert a number to day periods so you can use it for further calculations. |
decimal_date | Converts date data to numeric data type by calculating the date to a fraction of the year. |
dmy | Convert Character or Number to Date / Period when data contains only Date, but not Time. |
dmy_h | Convert Character or Number to Date / Period when data contains Date and Time. |
dmy_hm | Convert Character or Number to Date / Period when data contains Date and Time. |
dmy_hms | Convert Character or Number to Date / Period when data contains Date and Time. |
duration | Create a duration |
dym | Convert Character or Number to Date / Period when data contains only Date, but not Time. |
floor_date | Rounds date/time down to the nearest integer value of the specified time unit. |
force_tz | Re-registers the data with a given timezone. |
here | Get current time in your local timezone |
hm | Convert text to Period data type if a given column holds values that look like Hours and Minutes |
hms | Create hms data. It is a difftime handled as seconds. |
hour | Extract hour |
hours | Convert a number to hour periods so you can use it for further calculations. |
interval | Return intervals between two dates. You can use this result to get the duration by either dividing by a given unit such as days, weeks, etc., or converting it to numeric data type with "as.numeric()". |
isoweek | Extract week numbers of the year. Weeks start from Monday. |
mday | Extract Day of Month |
mdy | Convert Character or Number to Date / Period when data contains only Date, but not Time. |
mdy_h | Convert Character or Number to Date / Period when data contains Date and Time. |
mdy_hm | Convert Character or Number to Date / Period when data contains Date and Time. |
mdy_hms | Convert Character or Number to Date / Period when data contains Date and Time. |
milliseconds | Convert a number to milliseconds periods so you can use it for further calculations. |
minute | Extract minute |
minutes | Convert a number to minute periods so you can use it for further calculations. |
month | Extract Month |
ms | Convert text to Period data type if a given column holds values that look like Minutes and Seconds |
myd | Convert Character or Number to Date / Period when data contains only Date, but not Time. |
now | Returns current date and time. A similar function 'today()' returns only date. |
parse_date_time | Convert data to Date/Time data type. |
qday | Extract Day of Quarter |
quarter | Extract Quarter |
rollback | Calculates the last day of the previous month or the first day of the current month based on a given date. |
round_date | Rounds date/time to the nearest integer value of the specified time unit. |
second | Extract second |
seconds | Convert a number to second periods so you can use it for further calculations. |
time_length | Calculate the exact time length between two dates. |
today | Returns current date without time. A similar function 'now()' returns date including time. |
wday | Extract day of the week as a decimal number (01-07, Sunday is 1) or as text (e.g. Sunday). The text can be full or abbreviated, e.g. Sunday vs. Sun. |
week | Extract week numbers of the year. The 1st day of the 1st week always starts from January 1st regardless of the day of the week. |
weeks | Convert a number to week periods so you can use it for further calculations. |
with_tz | Returns Date / Time in a given time zone |
yday | Extract Day of Year |
ydm | Convert Character or Number to Date / Period when data contains only Date, but not Time. |
ydm_h | Convert Character or Number to Date / Period when data contains Date and Time. |
ydm_hm | Convert Character or Number to Date / Period when data contains Date and Time. |
ydm_hms | Convert Character or Number to Date / Period when data contains Date and Time. |
year | Extract Year |
years | Convert a number to year periods so you can use it for further calculations. |
ymd | Convert Character or Number to Date / Period when data contains only Date, but not Time. |
ymd_h | Convert Character or Number to Date / Period when data contains Date and Time. |
ymd_hm | Convert Character or Number to Date / Period when data contains Date and Time. |
ymd_hms | Convert Character or Number to Date / Period when data contains Date and Time. |
coalesce | Replace the missing values with a given value or the values that are at the same position in other columns. |
recode | Recodes (or replaces) the original values with given values. You can replace numeric values either by the name with backticks or based on their position, and character values by their name. |
parse_time | Converts character strings containing time values into time objects (hms format). When combined with as.character(), it can effectively transform time data from various formats into standardized hms objects. |
ym | Convert Character or Number to Date / Period when data contains only Date, but not Time. |
my | Convert Character or Number to Date / Period when data contains only Date, but not Time. |
yq | Convert text to Date data type if a given column holds values that look like Year and Quarter. |
epiweek | Extract week numbers of the year. Weeks start from Sunday. |
as_date | Convert a given POSIXct to Date. |
as.hms | Converts numeric values representing seconds into hms (hours:minutes:seconds) time format. |
impute_na | Impute NA by mean / median / mode values, manually entered values, or predicted values. It will build a linear regression model for the prediction. |
get_mode | Returns the most frequent value (mode) in a column. |
summarize_group | Summarize (or aggregate) column values with specified aggregate functions. If grouping columns are provided then summarize the values for each group. |
summarize_row | Summarize (or aggregate) column values across columns for each row with the specified aggregate function. |
mutate_group | Create new columns or update existing columns with specified functions such as Window Calculations. If grouping columns are provided then it performs calculation for each group. |
weekend | Returns either Weekday or Weekend based on the provided date column value. |
is_japanese_holiday | Alias to is_jholiday. Returns TRUE if the provided date is a Japanese Holiday. |
get_week_of_month | Extract Week of Month. For example, if the given date is in the 1st week of the month, it returns `1`. |
years_between | Calculate period between two dates in years. |
months_between | Calculate period between two dates in months. |
weeks_between | Calculate period between two dates in weeks. |
days_between | Calculate period between two dates in days. |
hours_between | Calculate period between two dates in hours. |
minutes_between | Calculate period between two dates in minutes. |
seconds_between | Calculate period between two dates in seconds. |
last_date | Returns the last date of the specified period (e.g., month) that the original date belongs to. |
ts_lag | Returns the value of the previous period. |
ts_diff | Returns differences in values in a time series data since the specified lag time ago. |
ts_diff_ratio | Returns the differences of the values in ratio compared to the base values from the lag time ago in a time series. |
is_jholiday | Returns TRUE if the provided date is a Japanese Holiday. |
find_date_by_wday | Retrieves the date for a specific weekday in a given year and month. For example, it can find "the 3rd Tuesday of May 2023." Useful for holiday settings and recurring schedule management. |
intersect | Returns rows that appear in both of two data frames. |
setdiff | Returns rows that appear in the first data frame but not in the second. |
union | Returns rows that appear in either of two data frames, removing duplicates. |
anti_join | Returns rows from the first data frame that have no matching rows in the second. |
arrange | Sort rows by given column(s). |
bind_cols | Bind multiple data frames by column. |
bind_rows | Bind multiple data frames by row. |
distinct | Select distinct/unique rows. Only the first row will be preserved. Set .keep_all argument to TRUE to keep all the original columns. |
filter | Select rows with conditions. |
full_join | Joins two data frames, keeping all rows from both and filling unmatched values with NA. |
group_by | Converts the data frame into a grouped data frame where the following operations will be performed based on the groups. Grouping should be done before you want to aggregate values. |
inner_join | Joins two data frames, keeping only rows that have matches in both. |
left_join | Joins two data frames, keeping all rows from the first and adding matching columns from the second. |
mutate | Mutate creates new column(s) with given expressions and preserves existing columns. |
rename | Rename existing column names. |
right_join | Joins two data frames, keeping all rows from the second and adding matching columns from the first. |
sample_frac | Sample n fraction of rows from the data frame. |
sample_n | Sample n rows from the data frame. |
select | Select column(s) by column names. |
semi_join | Returns rows from the first data frame that have matching rows in the second, without adding columns from the second. |
slice | Select rows by positions. This is not supported for relational databases, in which case you should use filter() with row_number() function. |
summarize | Summarize column values to a single value by using aggregate functions. Make sure you have already grouped the data frame using group_by(). |
top_n | Select the top or bottom N data points within each group based on a specified column. This function allows you to filter the dataset to keep only the top or bottom N data points for each group according to a given measure. |
transmute | Transmute adds new columns and drops existing columns. |
ungroup | Ungroup existing grouping |
predict | Returns data augmented with fitted values and residuals. |
complete | Complete a data frame by adding missing combinations of data. |
expand | Expand a data frame by adding all the combination of given columns. |
fill | Fills missing values in using the previous entry. This is convenient especially when values are not presented in some cells to avoid duplicated entries. |
gather | Takes multiple columns and collapses into key-value pairs, duplicating all other columns as needed. |
nest | Nest a set of columns together as a list column. |
separate | Separates a column with delimited values into multiple columns. |
spread | Spread a key-value pair across multiple columns. Missing values will be replaced with NA. |
unite | Unite multiple columns together into one column with given uniting characters. It will concatenate values with "_" and remove the original columns by default. Multiple columns specified in the list can be combined together at once. |
unnest | Unnest a list column or a list of data frames by making each element of the list to be presented in its own row. |
select_if | Select column(s) with a predicate (conditional) function. |
mutate_all | Apply functions to all the columns. |
mutate_at | Apply functions to specified columns. |
mutate_if | Apply functions to only the columns that match with a given condition. |
summarize_all | Apply functions to all the columns. |
summarize_at | Apply functions to specified columns. |
summarize_if | Apply functions to only the columns that match with a given condition. |
rename_with | Batch converts multiple column names by applying a function. |
separate_rows | Separates a column with delimited values into multiple rows. |
drop_na | Drop rows that have NA value. |
type_convert | Heuristically guess the data type for each column by reading the first 1000 rows, parse the data, and set appropriate data types for all the columns of the data frame. |
model_info | Returns a summary information of a given model in a tidy (normalized data frame) format. |
remove_empty | Removes empty rows or columns. Useful for removing rows or columns that are empty (NA, NULL, blank) from a data frame. |
clean_names | Make column names clean by using underscores (_), lowercase letters, and numbers for easier data handling and standardization. |
get_dupes | Extracts rows that are duplicated in the specified column(s). If no columns are specified, it returns rows that are duplicated across all columns. |
tabyl | Create frequency table for a column in a data frame. Useful for quickly understanding data distribution with percentage calculations and NA handling. |
do_svd.kv | Calculates coordinates by reducing dimensionality using SVD (Singular Value Decomposition). |
do_svd | Calculates coordinates by reducing dimensionality using SVD (Singular Value Decomposition). |
build_lm | Builds a linear regression model (lm) and store it in a data frame. |
build_lr | Builds logistic regression model and store it in a data frame. |
build_glm | Builds generalized linear models (glm) and store it in a data frame. |
build_multinom | Builds multinomial logistic regression model and store it in a data frame. |
prediction | Returns a data frame with regression information about a model. |
prediction_binary | Returns a data frame with binary classification information about a model. |
prediction_coxph | Returns a data frame with predicted values of Cox Proportional Hazard Model. |
model_coef | Returns a data frame with 'Parameter Estimates (Coefficients)' information about a model including the below. * term - Term in lm that is estimated. * estimate - Estimated coefficients. The larger the number, the more positive effect the variable has. * std_error - Standard Error. How much prediction error there is in the variable. * t_ratio - estimate / std_error. This value in t distribution is p_value. * p_value - The probability that the variable is **not** effective. * conf_low - The lower bound of the confidence interval. * conf_high - The upper bound of the confidence interval. |
model_stats | Returns a data frame with 'Summary of Fit' information about a model including the below. * r_square - R Square. * r_square_adj - Adjusted R Square. * root_mean_square_error - Root mean square error. * f_ratio - F ratio. * p_value - P value. * df - Degree of freedom. * null_deviance - Null deviance. * df_for_null_model - Degree of freedom for null model. * log_likelihood - Log likelihood. * deviance - Deviance. * AIC - Akaike's information criterion. * BIC - Bayesian information criterion. * residual_df - Residual degree of freedom. |
model_anova | Returns a data frame with ANOVA test information about a model including the below. * df - Degree of freedom. * sum_of_squares * mean_square * f_ratio * p_value * deviance * residual_df * residual_deviance |
evaluate_regression | Returns a data frame with evaluation scores of regression including the below. * r_squared * explained_variance * mean_square_error * misclassification_error * root_mean_square_error * mean_absolute_error * mean_absolute_percentage_error |
evaluate_binary | Returns a data frame with evaluation scores of binary classification including the below. * AUC * f_score * accuracy * misclassification_rate * precision * recall * specificity * true_positive - Number of positive predictions that actually are positive. * false_positive - Number of positive predictions that actually are negative. * true_negative - Number of negative predictions that actually are negative. * false_negative - Number of negative predictions that actually are positive. * test_size - The number of tested data. * threshold - Threshold value for prediction. |
evaluate_multi | Returns a data frame with evaluation scores of multi-class classification including the below. * micro_f_score * macro_f_score * accuracy * misclassification_rate |
do_roc | Returns coordinates of the ROC curve. |
build_kmeans.cols | Builds a clustering model (k-means) from variable columns and returns the summary of the model or the augmented data depending on the parameter value. |
build_kmeans.kv | Builds a clustering model (k-means) from key-value columns and store the model into a generated data frame or augment the original data with the clustered ID. |
build_kmeans | Builds a clustering model (k-means). Stores the model into a generated data frame or the augmented data depending on the parameter value. |
do_t.test | Execute t-test, which checks differences of means of variables. |
do_var.test | Execute F-test, which checks the differences of variances between groups. |
do_chisq.test | Execute chi-squared contingency table tests and goodness-of-fit tests. |
do_apriori | Finds rules about which items tend to occur at the same time in transaction data. |
do_anomaly_detection | Detect anomaly in time series data frame. |
do_prophet | Add forecast data to time series data frame. |
do_market_impact | Estimate impact of an event (advertisement, etc.) on a market, by using other markets to form a synthetic control. |
row_as_header | Use a row as column names. |
pivot | Pivot columns into rows and columns. Values are count of pairs of rows and columns or aggregation of another column. |
do_cor.cols | Calculates correlations for all the pairs of the variables (columns). |
do_cor.kv | Calculates correlations for all the pairs of subject columns. |
do_cor | Calculates correlations for all the pairs of the variables or subjects. |
do_dist.kv | Calculate the distances between each of the pairs. |
do_dist.cols | Calculate distances of each of the pairs. |
do_dist | Calculate distances of each of the pairs of the variables or subjects. |
do_cmdscale | Execute multidimensional scaling (MDS). Calculates approximate coordinates from distances of entity pairs. |
build_model | Create data frame with models from input data frame by model function and arguments. |
one_hot | One-hot encodes a categorical column, producing separate columns for each categorical value, each of which has values of 1 or 0 indicating whether a row has the value the column represents. |
sample_rows | Sample n rows from the data frame. This is the same as sample_n, except that it handles the case where the number of rows in the data is fewer than the specified n without throwing an error. |
exp_balance | Balances imbalanced data using SMOTE (Synthetic Minority Oversampling Technique). Creates synthetic data for the minority class and appropriately adjusts the majority class data to help build unbiased machine learning models. The target column must be limited to 0/1 numeric values, binary categorical data, or logical type columns. |
levels | Returns the levels of a given factor. |
coalesce | Replace the missing values with a given value or the values that are at the same position in other columns. |
recode | Recodes (or replaces) the original values with given values. You can replace numeric values either by the name with backticks or based on their position, and character values by their name. |
as_factor | Convert a given column to a factor. Compared to base R's as.factor, this function creates levels in the order in which they appear, which will be the same on every platform (base R sorts in the current locale which can vary from place to place). |
fct_anon | Anonymize factor levels. Replaces factor levels with arbitrary numeric identifiers. Neither the values nor the order of the levels are preserved. |
fct_expand | Add additional levels to a factor |
fct_drop | Drop unused levels. |
fct_explicit_na | This gives missing values an explicit factor level, ensuring that they appear in the charts. |
fct_lump | Moves least/most common factor levels into "Other" category. |
fct_other | Creates "Other" category by moving specified values to "Other", or by keeping only specified values from "Other". |
fct_inorder | Reorder factor levels by first appearance. |
fct_infreq | Reorder factor levels by frequency (from the most frequent to the least). |
fct_relevel | Sets the level of a given factor column by moving any number of levels to any location. |
fct_reorder | Sets the order of values based on sorted values from another column and converts the data type to factor. |
fct_rev | Reverse order of factor levels. |
impute_na | Impute NA by mean / median / mode values, manually entered values, or predicted values. It will build a linear regression model for the prediction. |
cluster | Build K-means clustering model and label each row with cluster id. |
get_mode | Returns the most frequent value (mode) in a column. |
summarize_group | Summarize (or aggregate) column values with specified aggregate functions. If grouping columns are provided then summarize the values for each group. |
summarize_row | Summarize (or aggregate) column values across columns for each row with the specified aggregate function. |
mutate_group | Create new columns or update existing columns with specified functions such as Window Calculations. If grouping columns are provided then it performs calculation for each group. |
exp_cut_by_step | Divides numeric data into equal intervals (e.g., in steps of 500) and returns the corresponding group name. Can be combined with format_cut_output function to customize the display format of the intervals. |
exp_cut | Divides numeric data into equal width groups and returns the corresponding group name. Can be combined with format_cut_output function to customize the display format of the intervals. |
range | Returns the min and max values of a given data. |
between | Returns TRUE or FALSE based on whether a given value is within a given range (between). |
cor | Computes the correlation of two column values. The covariance divided by the product of the standard deviations of the two column values. |
cov | Computes the covariance of two column values. |
str_detect | Returns TRUE or FALSE based on whether text data contains a given text or not. |
list_extract | Extract a value of a list data type column based on a position or name when the data inside the list is a data frame. |
list_to_text | Concatenates texts from all the elements of a list data type column. |
list_concat | Concatenates values from multiple columns into a list. |
list_n | Returns number of elements inside a list data type column for each row. |
full_seq | Generate a sequence of numbers based on the values in a given column. |
nesting | Fix multiple columns so that only the combination of actually existing values in those columns will be used for complete() or expand() operation. |
param_remove | Removes parameters from a given URL. |
url_decode | Decodes a given encoded URL. It assumes the character encoding is UTF-8. |
url_encode | Encodes a given URL. It assumes the character encoding is UTF-8. |
url_parameters | Returns decoded query parameters from url. |
url_domain | Returns domain (ex. "exploratory.io") from url. |
url_fragment | Returns fragment from url. |
url_path | Returns path from url. |
url_port | Returns port from url. |
url_scheme | Returns scheme (ex. "http", "https") from url. |
url_suffix | Returns suffix (ex. "com", "org") from url. |
url_subdomain | Returns subdomain (ex. "www", "blog") from url. |
url_tld | Returns top-level domain (ex. "com", "co") from url. |
url_param | Returns decoded query parameter from url. |
- | Removes column(s). It can be used along with any column(s) selection operation like select(), gather(), etc. |
starts_with | Returns the column names that start with a given text. |
ends_with | Returns the column names that end with a given text. |
contains | Returns the column names that contain a given text. |
matches | Returns the column names that match a given text. |
num_range | Returns the column names that start with a given text followed by numbers within a given range. It's useful especially when you have column names like X1, X2, X3, X4, etc. |
any_of | Returns the column names that are any of the given names. |
everything | Returns all the column names. It's useful when you want to have particular column(s) first before everything else. |
where | Returns the columns for which the specified column selection function (e.g. is.numeric) returns TRUE. |
across | This function is used for selecting columns for other functions such as summarize_row. |
vars | Returns a list of user selected columns. |
as_date | Convert a given POSIXct to Date. |
weekend | Returns either Weekday or Weekend based on the provided date column value. |
is_japanese_holiday | Alias to is_jholiday. Returns TRUE if the provided date is a Japanese Holiday. |
is_jholiday | Returns TRUE if the provided date is a Japanese Holiday. |
xgboost_reg | Create extreme gradient boosting model for regression. |
xgboost_binary | Create extreme gradient boosting model for binary classification. |
xgboost_multi | Create extreme gradient boosting model for multiclass classification. |
randomForestReg | Create random forest model for regression. |
randomForestBinary | Create random forest model for binary classification. |
randomForestMulti | Create random forest model for multi class classification. |
do_survfit | Calculates Survival Curve from survival time and survival status. |
prediction_survfit | Simulates Survival Curve for specified cohort based on a survival model. |
build_coxph | Builds Cox Proportional Hazard Model for survival analysis and store it in a data frame. |
prediction_coxph | Returns a data frame with predicted values of Cox Proportional Hazard Model. |
ip_to_country | Returns country names from IP addresses. |
countrycode | Converts country names or country codes to other formats of country codes or names (country names, continent names, region names, etc.). This is useful for standardizing country codes from different data sources when conducting international data analysis or creating reports. |
is_prefecture | Determines whether a string is a valid prefecture name. Returns FALSE if the string contains additional text such as city names or if it's an invalid prefecture name. |
do_tokenize | Returns one token (e.g. word) per row after tokenizing a text. |
pair_count | Count pairs of words (tokens) that co-occur within a group. |
do_tfidf | Calculates TF-IDF for each term against a group. TF-IDF is a weighting mechanism that calculates the importance of each word to each document by increasing the importance based on the term frequency while decreasing the importance based on the document frequency. |
do_ngram | Create columns of n-grams connected in sentences. |
do_cosine_sim.kv | Calculates the similarity between each pair of the documents using the cosine similarity algorithm. Cosine similarity measures the cosine of the angle between two vectors in the multi-dimensional space. |
summarize_group | Summarize (or aggregate) column values with specified aggregate functions. If grouping columns are provided then summarize the values for each group. |
mutate_group | Create new columns or update existing columns with specified functions such as Window Calculations. If grouping columns are provided then it performs calculation for each group. |
exp_balance | Balances imbalanced data using SMOTE (Synthetic Minority Oversampling Technique). Creates synthetic data for the minority class and appropriately adjusts the majority class data to help build unbiased machine learning models. The target column must be limited to 0/1 numeric values, binary categorical data, or logical type columns. |
exp_cut_by_step | Divides numeric data into equal intervals (e.g., in steps of 500) and returns the corresponding group name. Can be combined with format_cut_output function to customize the display format of the intervals. |
exp_cut | Divides numeric data into equal width groups and returns the corresponding group name. Can be combined with format_cut_output function to customize the display format of the intervals. |
convert_jdate | Converts Japanese era date strings (e.g., 令和3年1月1日, R3/1/1) to Western calendar dates (e.g., 2021-01-01). For Japanese dates in the Gregorian calendar, such as "yyyy年mm月" or "yyyy年mm月dd日", use the lubridate functions (ym or ymd) according to the date order (year, month, day) instead of the convert_jdate function. The lubridate functions can also properly process Japanese date units such as 年 (year), 月 (month), and 日 (day). |
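The functions above are typically combined inside a dplyr pipeline. The following is a minimal sketch of that usage, not taken from the reference itself; the data frame sales_df and the columns Region and Sales are hypothetical examples.
# sales_df, Region, and Sales are hypothetical example names
sales_df %>%
  filter(!is.na(Sales)) %>%
  mutate(size = case_when(Sales > 1000 ~ "Large", Sales > 500 ~ "Medium", TRUE ~ "Small")) %>%
  group_by(Region) %>%
  summarize(total = sum(Sales), rows = n())
This keeps rows whose Sales is not NA, labels each row with case_when, and aggregates the labeled data per Region with summarize.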
Summary
Reverses the meaning of the following condition. It is a logical negation and has a similar meaning as 'Not'.
Syntax
!
Return Value
Logical
Example
filter(!is.na(Population))
Keep the row if Population value is NOT NA
filter(!str_detect(City, "Francisco"))
Keep the row if City value does NOT include text 'Francisco'.
filter(!Sales > 500)
Keep the row if Sales value is NOT greater than 500.
Summary
Abbreviate strings to at least minlength characters, such that they remain unique (if they were). First all spaces at the beginning of the string are stripped. Then (if necessary) any other spaces are stripped. Next, lower case vowels are removed (starting at the right) followed by lower case consonants. Finally if the abbreviation is still longer than minlength upper case letters are stripped.
Syntax
abbreviate(<column>
, minlength = <number>
, dot = <logical>
)
Arguments
Return Value
Character
Example
abbreviate("Exploratory")
Returns "Expl"
abbreviate("Exploratory", 3)
Returns "Exp"
Summary
Returns the absolute value.
Syntax
abs(<column_num>
)
Return Value
Numeric
Example
mutate(abs = abs(ARR_DELAY))
Summary
Returns the inverse cosine of a value, in radians.
Syntax
acos(<column_num>
)
Return Value
Numeric
Example
mutate(a = acos(X1))
Create a new column for calculating the acos.
Summary
Returns the inverse hyperbolic cosine of a number.
Syntax
acosh(<column_num>
)
Return Value
Numeric
Example
mutate(a = acosh(X1))
Create a new column for calculating the acosh.
Summary
Return TRUE if the condition is satisfied for all the rows or a group of rows.
Syntax
all(<condition>
, na.rm = <logical>
)
Arguments
Return Value
Logical
Example
summarize(is_matched_all = all(userid == username))
Return TRUE if all the userid values match the corresponding username values, otherwise FALSE.
summarize(is_matched_all = all(revenue > 500))
Return TRUE if all the revenue values are greater than 500, otherwise FALSE.
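The na.rm argument removes NA values before the check; without it, an NA in the condition makes the result NA. For example:
summarize(is_matched_all = all(revenue > 500, na.rm = TRUE))
Return TRUE if all the non-NA revenue values are greater than 500, otherwise FALSE.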
Summary
Return TRUE if the condition is satisfied for any of the rows or a group of rows.
Syntax
any(<condition>
, na.rm = <logical>
)
Arguments
Return Value
Logical
Example
summarize(is_matched_any = any(userid == username))
Return TRUE if any of the userid values match the corresponding username values, otherwise FALSE.
summarize(is_matched_any = any(revenue > 500))
Return TRUE if any of the revenue values are greater than 500, otherwise FALSE.
Summary
Convert a given data to character data type.
Syntax
as.character(<column>
)
Return Value
Character
Example
as.character(123.11)
Returns "123.11" as character.
Summary
Convert a given data to Date data type. Date data type doesn't include Time. The default formats follow the rules of the ISO 8601 international standard, which expresses a day as "2001-02-03". If the given data is a number representing an internal system number (the number of days since some origin), you need to provide which date to use as the 'origin'.
Syntax
as.Date(<column>
, format = <date_format>
, tz = <timezone>
, origin = <text>
)
Arguments
Return Value
Date
Example
as.Date("2015-10-10")
Returns "2015-10-10"
as.Date("2015-10-10", tz = "America/Los_Angeles")
Returns "2015-10-10"
as.Date("10-10-10", format = "%d-%m-%y")
Returns "2010-10-10"
as.Date(35981, origin = "1899-12-30") # This is recommended for Windows Excel data.
Returns "1998-07-05"
as.Date(34519, origin = "1904-01-01") # This is recommended for Mac Excel data.
Returns "1998-07-05"
Summary
Convert data to double precision number data type. It is identical to numeric.
Syntax
as.double(<column>
)
Return Value
Double
Example
as.double("12345")
Returns 12345
as.double("12345.10")
Returns 12345.1
Summary
Encode a column as a factor.
Syntax
as.factor(<column>
)
Arguments
Return Value
Factor
Example
columnA - "iMac", "iPod", "iPhone", "iPod", "iPhone"
as.factor(columnA)
Returns iMac, iPod, iPhone, iPod, iPhone (It has Levels information of : iMac iPhone iPod)
Summary
Convert data to integer number data type.
Syntax
as.integer(<column>
)
Return Value
Integer
Example
as.integer("12345")
Returns 12345
as.integer("12345.10")
Returns 12345
Summary
Converts the target column to logical type (TRUE/FALSE). For numeric values, 0 is converted to FALSE, and all non-zero values are converted to TRUE.
For character strings, "TRUE", "True", "true", "T", "t" are converted to TRUE, and "FALSE", "False", "false", "F", "f" are converted to FALSE. Any other strings result in NA.
The difference with the str_logical function is that str_logical can also convert Yes/No values to TRUE/FALSE, and allows you to directly specify which values should be converted to TRUE.
Syntax
as.logical(<column_to_convert>
)
Arguments
* column_to_convert - Specifies the column you want to convert to logical type.
Return Value
Logical
Example
as.logical(numeric_column)
Converts a numeric column to logical type. 0 is converted to FALSE, and all non-zero values are converted to TRUE.
numeric_column | converted |
---|---|
0 | FALSE |
1 | TRUE |
1 | TRUE |
0 | FALSE |
as.logical(text_column)
Converts a text column to logical type. Only "TRUE"/"FALSE" and their variations are converted; all other strings become NA.
text_column | converted |
---|---|
TRUE | TRUE |
False | FALSE |
t | TRUE |
Yes | NA |
No | NA |
Summary
Convert data to numeric data type. When the data is a difftime, the 'units' argument controls the unit of the returned number.
Syntax
as.numeric(<column>
, units = <difftime_units>
)
Arguments
Return Value
Numeric
Example
as.numeric("12345")
Returns 12345
as.numeric("12345.10")
Returns 12345.1
as.numeric(as.Date("2015-01-30") - as.Date("2015-01-15"), units = "days")
Returns 15
as.numeric(as.Date("2015-01-30") - as.Date("2015-01-15"), units = "weeks")
Returns 2.142857
Summary
Convert data to Date/Time data type.
Syntax
as.POSIXct(<column>
, format = <date_time_format>
, tz = <timezone>
, origin = <text>
)
Arguments
Return Value
POSIXct
Example
as.POSIXct("2015-10-10 13:10:05")
Returns "2015-10-10 13:10:05 PDT"
as.POSIXct("2015-10-10 13:10:05", tz = "America/Los_Angeles")
Returns "2015-10-10 13:10:05 PDT"
as.POSIXct("2015-10-10 13:10:05", tz = "Asia/Tokyo")
Returns "2015-10-10 13:10:05 JST"
as.POSIXct("05-10-15T13:10:05", format = "%d-%m-%yT%H:%M")
Returns "2015-10-05 13:10:00 PDT"
Summary
Returns the inverse sine of a value, in radians.
Syntax
asin(<column_num>
)
Return Value
Numeric
Example
mutate(a = asin(X1))
Create a new column for calculating the asin.
Summary
Returns the inverse hyperbolic sine of a number.
Syntax
asinh(<column_num>
)
Return Value
Numeric
Example
mutate(a = asinh(X1))
Create a new column for calculating the asinh.
Summary
Returns the inverse tangent of a value, in radians.
Syntax
atan(<column_num>
)
Return Value
Numeric
Example
mutate(a = atan(X1))
Create a new column for calculating the atan.
Summary
Returns the angle between the x-axis and a line segment from the origin (0,0) to a specified coordinate pair (x, y), in radians.
Syntax
atan2(<column_num_y>
, <column_num_x>
)
Return Value
Numeric
Example
mutate(a = atan2(X1, Y1))
Create a new column for calculating the atan2.
Summary
Returns the inverse hyperbolic tangent of a number.
Syntax
atanh(<column_num>
)
Return Value
Numeric
Example
mutate(a = atanh(X1))
Create a new column for calculating the atanh.
Summary
This is a generic function which combines its arguments.
Syntax
c(<value1>
, <value2>
, ... , recursive = <logical>
)
Arguments
Example
c(1:10)
Returns 1 2 3 4 5 6 7 8 9 10
c(1:10, 1:3)
Returns 1 2 3 4 5 6 7 8 9 10 1 2 3
c("a", "b", "c")
Returns "a" "b" "c"
Summary
Rounds numeric values up to the specified decimal place.
Syntax
ceiling(<column_num>
, digits = <number>
)
Arguments
Return Value
Numeric
Example
ceiling(X1, digits=2)
Rounds X1 up to the second decimal place.
X1 | NEW |
---|---|
-0.2344 | -0.23 |
0.91098 | 0.92 |
0.311785 | 0.32 |
Summary
Translates each character in x that is specified in old to the corresponding character specified in new.
Syntax
chartr(<old_text>
, <new_text>
, <column_text>
)
Arguments
Return Value
Character
Example
x = "abcdef 123456"
chartr("abc", "xyz", x)
Returns "xyzdef 123456"
chartr("a-f", "u-z", x)
Returns "uvwxyz 123456"
chartr("a-f", "xxxxxx", x)
Returns "xxxxxx 123456"
chartr("123", "000", x)
Returns "abcdef 000456"
Summary
Returns the cosine of an angle provided in radians.
Syntax
cos(<column_num>
)
Return Value
Numeric
Example
mutate(a = cos(X1))
Create a new column for calculating the cos.
Summary
Returns the hyperbolic cosine of any real number.
Syntax
cosh(<column_num>
)
Return Value
Numeric
Example
mutate(a = cosh(X1))
Create a new column for calculating the cosh.
Summary
Returns cos(pi * x), the cosine of pi times the given value.
Syntax
cospi(<column_num>
)
Return Value
Numeric
Example
mutate(a = cospi(X1))
Create a new column for calculating the cospi.
Summary
Returns the cumulative maxima.
Syntax
cummax(<column_num>
)
Return Value
Numeric
Example
// X = c(1, 2, 3, 2, 1)
cummax(X)
Returns 1, 2, 3, 3, 3
Summary
Returns the cumulative minima.
Syntax
cummin(<column_num>
)
Return Value
Numeric
Example
// X = c(1, 2, 3, 2, 1)
cummin(X)
Returns 1, 1, 1, 1, 1
Summary
Returns the cumulative products.
Syntax
cumprod(<column_num>
)
Return Value
Numeric
Example
// X = c(1, 2, 3, 2, 1)
cumprod(X)
Returns 1, 2, 6, 12, 12
Summary
Returns the cumulative sums.
Syntax
cumsum(<column_num>
)
Return Value
Numeric
Example
// X = c(1, 2, 3, 2, 1)
cumsum(X)
Returns 1, 3, 6, 8, 9
Summary
Divides the range of values into intervals and returns, for each value in x, the interval it falls into.
Syntax
cut(<column_num>
, breaks = <number>
, labels = <text>
, include.lowest = <logical>
, right = <logical>
, dig.lab = <number>
, ordered_result = <logical>
)
Arguments
Example
//x = c(1,2,3,4,5,6,7,8,9,10)
cut(x, breaks = 5)
Returns (0.991,2.8] (0.991,2.8] (2.8,4.6] (2.8,4.6] (4.6,6.4] (4.6,6.4] (6.4,8.2] (6.4,8.2] (8.2,10] (8.2,10]
cut(x, breaks = 2)
Returns (0.991,5.5] (0.991,5.5] (0.991,5.5] (0.991,5.5] (0.991,5.5] (5.5,10] (5.5,10] (5.5,10] (5.5,10] (5.5,10]
cut(x, breaks = 2, labels = c(1,2))
Returns 1 1 1 1 1 2 2 2 2 2
cut(x, breaks = 2, labels = c("A", "B"))
Returns A A A A A B B B B B
mutate(a = cut(FL_DATE, breaks = 5))
Returns one of the five buckets based on the date value
Summary
Calculates the difference between two given dates/times.
Syntax
difftime(<column_date>
, <column_date>
, tz = <timezone>
, units = <difftime_units>
)
Arguments
Return Value
Difftime
Example
difftime("2015-12-10", "2015-12-30"))
Returns Time difference of -20 days.
difftime("2015-12-10", "2015-12-30", unit = "hours")
Returns Time difference of -480 hours.
as.numeric(difftime("2015-12-10", "2015-12-30"))
Returns -20.
as.numeric(difftime("2015-12-10", "2015-12-30"), units = "hours")
Returns -480.
Summary
Returns a logical value (TRUE or FALSE) indicating if it is a duplicated value or rows.
Syntax
duplicated(<column(s)>
, incomparables = <logical>
)
Arguments
Return Value
Logical
Example
// x <- c("a", "b", "c", "a", "b")
duplicated(x)
Returns FALSE FALSE FALSE TRUE TRUE
duplicated(x, incomparables = "a")
Returns FALSE FALSE FALSE FALSE TRUE
mutate(a = duplicated(ARR_TIME))
Returns TRUE or FALSE for each row
filter(duplicated(ARR_TIME, ARR_DELAY))
Keep only the duplicated rows
summarise(a = sum(!duplicated(ARR_TIME)))
Returns # of NOT duplicated values for each group. This is same as the below.
summarise(a = sum(length(unique(ARR_TIME))))
Summary
Returns Euler's number, e (~2.718) raised to a power.
Syntax
exp(<column_num>
)
Return Value
Numeric
Example
mutate(revenue_exp = exp(revenue))
Returns the exponential value of revenue
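As a quick check of the definition:
exp(1)
Returns 2.718282, the value of e itself.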
Summary
Encode a column as a factor.
Syntax
factor(<column>
, levels = <text>
, labels = <text>
, exclude = <text>
, ordered = <logical>
, nmax = <number>
)
Arguments
Return Value
Factor
Example
mutate(color = factor(color, levels = c("Gold","Silver","Bronze"), ordered=TRUE))
Before:
color | counts |
---|---|
Bronze | 8 |
Silver | 12 |
Gold | 15 |
After:
color | counts |
---|---|
Gold | 15 |
Silver | 12 |
Bronze | 8 |
Summary
Rounds a number down to the nearest integer. floor(3.475) is 3.
Syntax
floor(<column_num>
, digits = <number>
)
Arguments
Return Value
Numeric
Example
floor(3.475, digits=2)
Returns 3.47
Summary
Convert a given text from one encoding to another.
Syntax
iconv(<column>
, from = <encoding>
, to = <encoding>
, sub = <text>
, mark = <logical>
, toRaw = <logical>
)
Arguments
Return Value
Character
Example
iconv(x, "ISO_8859-1", "UTF-8")
Converts a given text from "ISO_8859-1" to "UTF-8".
Summary
Returns different values based on the conditional result.
Syntax
ifelse(<condition>
, <return_value_when_TRUE>
, <return_value_when_FALSE>
)
Arguments
Return Value
Example
//x <- 1:5
ifelse(x > 3, "Bigger", "Smaller")
Returns "Smaller" "Smaller" "Smaller" "Bigger" "Bigger"
Overview
Extracts only the rows that exist in both dataframes. Equivalent to the "intersection" operation in set theory.
Syntax
intersect(<dataframe>
, force_data_type = <TRUE|FALSE>
)
Arguments
* dataframe - Specifies the dataframes to find common rows between.
* force_data_type (optional) - When set to TRUE, it first converts all column data types to character type, and after processing, converts them back to inferred values.
Examples
Source Data1:
| ID | product_name | price |
|----|--------------|-------|
| 1 | apple | 100 |
| 2 | orange | 80 |
| 3 | banana | 120 |
Source Data2:
| ID | product_name | price |
|----|--------------|-------|
| 2 | orange | 80 |
| 3 | banana | 120 |
| 4 | kiwi | 150 |
intersect(Source Data1, Source Data2)
Extracts only the rows that exist in both Data1 and Data2.
| ID | product_name | price |
|----|--------------|-------|
| 2 | orange | 80 |
| 3 | banana | 120 |
Summary
Returns true if a given object is character type.
Syntax
is.character
Return Value
Logical
Example
mutate_if(is.character, str_to_lower)
Convert all the letters to lowercase for all the character type columns.
Summary
Returns true if a given object is double type.
Syntax
is.double
Return Value
Logical
Example
mutate_if(is.double, round)
Round all the double type columns.
mutate_if(is.double, funs(. * 0.2))
Multiply all the double columns by 0.2 overriding existing columns.
mutate_if(is.double, funs(calc = . * 0.2))
Multiply all the double columns by 0.2 as new columns.
summarize_if(is.double, mean)
Calculate the mean (average) for all the double columns.
Summary
Returns true if a given object is factor type.
Syntax
is.factor
Return Value
Logical
Example
mutate_if(is.factor, as.character)
Convert all the factor type columns to character type.
Summary
Returns true if a given object is list type.
Syntax
is.list
Return Value
Logical
Example
select_if(is.list)
Select only list type columns.
Summary
Returns true if a given object is logical type.
Syntax
is.logical
Return Value
Logical
Example
mutate_if(is.logical, as.numeric)
Convert the logical values (TRUE, FALSE) to numeric values (1,0).
summarize_if(is.logical, sum)
Count the number of TRUE values for all the logical type columns.
Summary
Returns true if a given object is matrix type.
Syntax
is.matrix
Return Value
Logical
Example
mutate_if(is.matrix, scale)
Scale and center the columns of every matrix type column.
Summary
Returns TRUE when it is NA.
Syntax
is.na(<column>
)
Return Value
Logical
Example
filter(is.na(ARR_DELAY))
Keep only rows where ARR_DELAY is NA
filter(!is.na(ARR_DELAY))
Keep only rows where ARR_DELAY is NOT NA
Summary
Returns TRUE when it is null.
Syntax
is.null(<column>
)
Return Value
Logical
Example
filter(is.null(ARR_DELAY))
Keep only rows where ARR_DELAY is null
filter(!is.null(ARR_DELAY))
Keep only rows where ARR_DELAY is NOT null
Summary
Returns true if a given object is numeric type.
Syntax
is.numeric
Return Value
Logical
Example
mutate_if(is.numeric, funs(. * 0.2))
Multiply all the numeric columns by 0.2 overriding existing columns.
mutate_if(is.numeric, funs(calc = . * 0.2))
Multiply all the numeric columns by 0.2 as new columns.
summarize_if(is.numeric, mean)
Calculate the mean (average) for all the numeric columns.
Summary
Returns true if and only if a given value is a length-one logical vector whose only element is TRUE.
Syntax
isTRUE(<value>
)
Return Value
Logical
Example
isTRUE(TRUE)
Returns TRUE
isTRUE(FALSE)
Returns FALSE
summarise(a = isTRUE(mean(ARR_DELAY) > 120))
Returns TRUE or FALSE based on the average of ARR_DELAY for each group.
Summary
Returns the length of the values.
Syntax
length(<column>
)
Return Value
Numeric
Example
summarize(text_length = length(TAIL_NUM))
Returns # of values for each group
summarize(text_length = length(unique(TAIL_NUM)))
Returns # of unique values of 'TAIL_NUM' for each group
Summary
Returns the levels of a given factor.
Syntax
levels(<column>
)
Example
// x <- as.factor(c("a", "b", "c", "d", "e", "a", "b", "a"))
levels(x)
Returns "a" "b" "c" "d" "e"
summarise(n = length(levels(as.factor(flight$CARRIER))))
Returns 16
Summary
Construct a list.
Syntax
list(<column>
)
Return Value
List
Example
list(sales = 0, profit = 10)
Create a named list of [sales = 0, profit = 10]
complete(year, product, fill = list(sales = 0))
Set 0 for 'sales' column for missing values.
Summary
Returns logarithms, by default natural logarithms.
Syntax
log(<column_num>
, base = <number>
)
Arguments
Return Value
Numeric
Example
mutate(Sales_log = log(Sales))
Summary
Computes common (i.e., base 10) logarithms.
Syntax
log10(<column_num>
)
Arguments
Return Value
Numeric
Example
mutate(Sales_log = log10(Sales))
Summary
Computes log(1+x) accurately also for |x| << 1.
Syntax
log1p(<column_num>
)
Arguments
Return Value
Numeric
Example
mutate(Sales_log = log1p(Sales))
Summary
Computes binary (i.e., base 2) logarithms.
Syntax
log2(<column_num>
)
Arguments
Return Value
Numeric
Example
mutate(Sales_log = log2(Sales))
Summary
Returns the maximum value in a numeric column.
Syntax
max(<column>
, na.rm = <logical>
)
Arguments
Example
summarize(revenue_max = max(revenue))
Create a new column for calculating the max value of revenue for each group.
Summary
Returns the numerical average (mean) value.
Syntax
mean(<column_num_logic_date>
, na.rm = <logical>
, trim = <number>
)
Arguments
Return Value
Numeric
Example
summarize(revenue_mean = mean(revenue))
Create a new column for calculating the average value of revenue for each group.
Summary
Returns the minimum value in a numeric column.
Syntax
min(<column>
, na.rm = <logical>
)
Arguments
Example
summarize(revenue_min = min(revenue))
Create a new column for calculating the minimum of revenue for each group.
Summary
Convert a number to month periods so you can use it for further calculations.
Syntax
months(<column_num>
)
Return Value
Period
Example
months(1)
returns "1m 0d 0H 0M 0S"
ymd_hms("2015-10-01 06:15:30") + months(1)
returns "2015-11-01 06:15:30 UTC"
Summary
Returns # of letters or size of a given value.
Syntax
nchar(<column>
, type = "chars"|"bytes"|"width"
, allowNA = <logical>
)
Arguments
Return Value
Numeric
Example
nchar("Exploratory")
Returns 11
nchar("西田")
Returns 2
nchar("西田", type = "bytes")
Returns 6
Summary
Returns the parallel maxima of the input values.
Syntax
pmax(<column_num_logic_date>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
x <- c(1, 2, 3, 4, 5)
y <- c(3, 3, 3, 3, 3)
pmax(x, y)
Returns 3 3 3 4 5
mutate(a = pmax(ARR_DELAY, DEP_DELAY, na.rm = TRUE))
Returns the bigger of the ARR_DELAY and DEP_DELAY values for each row.
Summary
Returns the parallel minima of the input values.
Syntax
pmin(<column_num_logic_date>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
x <- c(1, 2, 3, 4, 5)
y <- c(3, 3, 3, 3, 3)
pmin(x, y)
Returns 1 2 3 3 3
mutate(a = pmin(ARR_DELAY, DEP_DELAY, na.rm = TRUE))
Returns the smaller of the ARR_DELAY and DEP_DELAY values for each row.
Summary
Returns the product value by multiplying a series of numbers in a given data together.
Syntax
prod(<column_num_logic>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
// x <- 1:5
prod(x)
Returns 120
Summary
Returns the min and max values of a given data.
Syntax
range(<column>
, na.rm = <logical>
)
Arguments
Return Value
Example
// x <- c(1, 2, 3, 4, 5)
range(x)
Returns 1 5
filter(ARR_DELAY %in% range(DEP_DELAY, na.rm = TRUE))
Keep only rows with ARR_DELAY is in a range of DEP_DELAY. This is same as the below.
filter(between(ARR_DELAY, range(DEP_DELAY, na.rm = TRUE)[1], range(DEP_DELAY, na.rm = TRUE)[2]))
Summary
Repeats given numbers.
Syntax
rep(<column>
, times = <number>
, length.out = <number>
, each = <number>
)
Arguments
Return Value
Numeric
Example
// x <- c(1, 2, 3, 4, 5)
rep(x, 2)
Returns 1 2 3 4 5 1 2 3 4 5
rep(x, times = 2)
Returns 1 2 3 4 5 1 2 3 4 5
rep(x, length.out = 8)
Returns 1 2 3 4 5 1 2 3
rep(x, times = 2, each = 2)
Returns 1 1 2 2 3 3 4 4 5 5 1 1 2 2 3 3 4 4 5 5
rep(x, length.out = 8, each = 2)
Returns 1 1 2 2 3 3 4 4
Summary
Reverses the entries in a given data.
Syntax
rev(<column>
)
Return Value
Example
// x <- c(1, 2, 3, 4, 5)
rev(x)
Returns 5 4 3 2 1
Summary
Rounds a number to a certain number of decimal places.
Syntax
round(<column_num>
, digits = <number>
)
Arguments
Return Value
Numeric
Example
round(3.475, digits=2)
Returns 3.48
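Note that R rounds a 5 in the last position to the nearest even digit (IEC 60559 'round half to even'):
round(2.5)
Returns 2, while round(3.5) returns 4.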
Summary
Returns a sequence of numbers.
Syntax
seq(from = <number>
, to = <number>
, by = <number>
, length.out = <number>
, along.with = <column>
)
Arguments
Return Value
Numeric
Example
seq(1, 5)
Returns 1 2 3 4 5
seq(1, 5, by = 2)
Returns 1 3 5
seq(1:5)
Returns 1 2 3 4 5
seq(0, 1, length.out = 11)
Returns 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
seq(ymd("2017-04-01"), by = "month", length.out = 6)
Returns "2017-04-01" "2017-05-01" "2017-06-01" "2017-07-01" "2017-08-01" "2017-09-01" - It will generate date values for the next 6 months.
Overview
Extracts rows that don't exist in common between two dataframes. Equivalent to the "set difference" operation in set theory.
Syntax
setdiff(<dataframe>
, force_data_type = <TRUE|FALSE>
)
Arguments
* dataframe - Specifies the dataframes to find non-common rows between.
* force_data_type (optional) - When set to TRUE, it first converts all column data types to character type, and after processing, converts them back to inferred values.
Examples
Data1:
| ID | product_name | price |
|----|--------------|-------|
| 1 | apple | 100 |
| 2 | orange | 80 |
| 3 | banana | 120 |
Data2:
| ID | product_name | price |
|----|--------------|-------|
| 2 | orange | 80 |
| 3 | banana | 120 |
| 4 | kiwi | 150 |
setdiff(Data1, Data2)
Extracts only the rows that exist in Data1 but not in Data2.
| ID | product_name | price |
|----|--------------|-------|
| 1 | apple | 100 |
Summary
Returns either 1, 0, or -1 depending on whether the input value is positive, zero, or negative, respectively.
Syntax
sign(<column_num>
)
Return Value
Numeric
Example
// x <- c(1, 2, 10, 0, -1, -2, -10)
sign(x)
Returns 1 1 1 0 -1 -1 -1
Summary
Rounds the values in its first argument to the specified number of significant digits.
Syntax
signif(<column_num>
, digits = <number>
)
Arguments
Return Value
Numeric
Example
signif(3.475, digits=2)
Returns 3.5
Summary
Returns the sine of an angle provided in radians.
Syntax
sin(<column_num>
)
Return Value
Numeric
Example
mutate(a = sin(X1))
Create a new column for calculating the sin.
Summary
Returns the hyperbolic sine of any real number.
Syntax
sinh(<column_num>
)
Return Value
Numeric
Example
mutate(a = sinh(X1))
Create a new column for calculating the sinh.
Summary
Returns sin(pi * x), the sine of pi times the given value.
Syntax
sinpi(<column_num>
)
Return Value
Numeric
Example
mutate(a = sinpi(X1))
Create a new column for calculating the sinpi.
Summary
Returns the square root of the values.
Syntax
sqrt(<column_num>
)
Return Value
Numeric
Example
mutate(revenue_sq = sqrt(revenue))
Returns the square root of revenue values
Summary
Returns the sum of all the values.
Syntax
sum(<column_num_logic>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
summarize(total = sum(revenue))
Create a new column for calculating the sum of revenue for each group.
Summary
Returns the tangent of an angle provided in radians.
Syntax
tan(<column_num>
)
Return Value
Numeric
Example
mutate(a = tan(X1))
Create a new column for calculating the tan.
Summary
Returns the hyperbolic tangent of any real number.
Syntax
tanh(<column_num>
)
Return Value
Numeric
Example
mutate(a = tanh(X1))
Create a new column for calculating the tanh.
Summary
Returns tan(pi * x), the tangent of pi times the given value.
Syntax
tanpi(<column_num>
)
Return Value
Numeric
Example
mutate(a = tanpi(X1))
Create a new column for calculating the tanpi.
Summary
Truncates numbers toward zero by dropping the fractional digits.
Syntax
trunc(<column_num>
)
Return Value
Numeric
Example
trunc(3.475)
Returns 3
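Unlike floor, trunc drops digits toward zero, which matters for negative values:
trunc(-3.475)
Returns -3, whereas floor(-3.475) returns -4.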
Overview
Combines multiple dataframes, keeping rows that exist in either or both dataframes. Equivalent to the "union" operation in set theory.
Syntax
union(<dataframe>
, force_data_type = <TRUE|FALSE>
)
Arguments
* dataframe - Specifies the dataframes to combine by rows. Multiple dataframes can be specified. You can also specify multiple dataframes in list format.
* force_data_type (optional) - When set to TRUE, it first converts all column data types to character type, and after combining, converts them back to inferred values.
Examples
Data1:
| ID | product_name | price |
|----|--------------|-------|
| 1 | apple | 100 |
| 2 | orange | 80 |
| 3 | banana | 120 |
Data2:
| ID | product_name | price |
|----|--------------|-------|
| 2 | orange | 80 |
| 3 | banana | 120 |
| 4 | kiwi | 150 |
union(Data1, Data2)
Combines rows that exist in either Data1 or Data2, or both.
| ID | product_name | price |
|----|--------------|-------|
| 1 | apple | 100 |
| 2 | orange | 80 |
| 3 | banana | 120 |
| 4 | kiwi | 150 |
Summary
Returns the unique values of a given data.
Syntax
unique(<column>
, incomparables = <logical>
, fromLast = <logical>
, nmax = <number>
)
Arguments
Return Value
Same type as the input column
Example
summarize(text_length = length(unique(TAIL_NUM)))
Returns # of unique values of 'TAIL_NUM' for each group
Summary
Returns the TRUE indices of a logical object or based on the condition.
Syntax
which(<condition>
, arr.ind = <logical>
, useNames = <logical>
)
Arguments
Return Value
Numeric
Example
x <- c("a", "b", "c", "d", "e")
which(x == "c")
Returns 3
select(which(sapply(., is.numeric)))
Returns only numeric columns.
Summary
Performs an exclusive-or check across two values and returns TRUE when only one of them is TRUE.
Syntax
xor(<value1>
, <value2>
)
Return Value
Logical
Example
xor(TRUE, TRUE)
Returns FALSE
xor(FALSE, TRUE)
Returns TRUE
xor(FALSE, FALSE)
Returns FALSE
Overview
Returns only those rows from the current dataframe that do not have matching values in the target dataframe. Only columns from the current dataframe are retained.
Syntax
anti_join(<target_dataframe>
, by = join_by(<join_condition>
))
Arguments
Complex join conditions with join_by
Examples
Source data:
| customer_id | customer_name | order_amount |
|-------------|---------------|--------------|
| 101 | Smith | 5000 |
| 102 | Johnson | 3000 |
| 103 | Brown | 7000 |
Customer attributes data:
| customer_id | region | age |
|-------------|------------|-----|
| 101 | New York | 35 |
| 102 | California | 42 |
| 104 | Texas | 28 |
anti_join(customer_attributes, by = join_by(customer_id == customer_id))
From the source dataframe, returns only rows where "customer_id" does not have a match in the target dataframe. Customer_id=103 is included in the result because it doesn't exist in the customer attributes data.
customer_id | customer_name | order_amount |
---|---|---|
103 | Brown | 7000 |
Source data (employees):
| employee_id | job_code | name |
|-------------|----------|----------|
| 1 | A01 | Thompson |
| 2 | A02 | Wilson |
| 3 | B01 | Davis |
| 4 | C01 | Brown |
Job data:
| job_code | job_title |
|----------|-------------|
| A01 | Sales |
| A02 | Development |
| B01 | Finance |
anti_join(job_data, by = join_by(job_code == job_code))
From the source dataframe, returns only rows where "job_code" does not have a match in the target dataframe. Job code "C01" is included in the result because it doesn't exist in the job data.
employee_id | job_code | name |
---|---|---|
4 | C01 | Brown |
Source data (product data):
| product_id | product_name | price |
|------------|--------------|-------|
| P001 | Laptop | 80000 |
| P002 | Mouse | 3000 |
| P003 | Keyboard | 5000 |
Sale period data:
| product_id | sale_start_date | sale_end_date | discount_rate |
|------------|-----------------|---------------|---------------|
| P001 | 2023-06-01 | 2023-06-15 | 10 |
| P001 | 2023-07-01 | 2023-07-15 | 15 |
| P002 | 2023-06-10 | 2023-06-20 | 20 |
| P004 | 2023-06-01 | 2023-06-30 | 25 |
anti_join(sale_period_data, by = join_by(product_id == product_id, between(as.Date("2023-06-05"), sale_start_date, sale_end_date)))
From the source dataframe, returns only rows where "product_id" does not match in the target dataframe or "2023-06-05" does not fall within the sale period. Product_id=P002 (sale starts after 2023-06-05) and product_id=P003 (no sale period) are included in the result.
product_id | product_name | price |
---|---|---|
P002 | Mouse | 3000 |
P003 | Keyboard | 5000 |
Source data (order data):
| order_id | order_date | amount |
|----------|------------|--------|
| 1001 | 2023-01-15 | 3000 |
| 1002 | 2023-02-20 | 5000 |
| 1003 | 2023-03-10 | 8000 |
| 1004 | 2023-04-05 | 4000 |
Exchange rate data:
| rate_date | usd_jpy_rate |
|------------|--------------|
| 2023-01-01 | 130.5 |
| 2023-02-01 | 132.0 |
| 2023-03-01 | 133.5 |
anti_join(exchange_rate_data, by = join_by(closest(order_date >= rate_date)))
From the source dataframe, returns only rows that do not meet the condition with rows in the target dataframe. Extracts orders that don't have a closest rate date that is on or before the order date. Order with id=1004 is included because there's no corresponding rate date.
order_id | order_date | amount |
---|---|---|
1004 | 2023-04-05 | 4000 |
Source data (meeting schedule):
| meeting_id | start_time | end_time | room |
|------------|------------------|------------------|--------|
| M001 | 2023-06-01 10:00 | 2023-06-01 11:30 | Room A |
| M002 | 2023-06-01 13:00 | 2023-06-01 14:00 | Room B |
| M003 | 2023-06-02 09:30 | 2023-06-02 12:00 | Room A |
| M004 | 2023-06-03 10:00 | 2023-06-03 11:00 | Room C |
Reservation data:
| reservation_id | start_time | end_time | reserved_by |
|----------------|------------------|------------------|-------------|
| R001 | 2023-06-01 09:00 | 2023-06-01 10:30 | Smith |
| R002 | 2023-06-01 11:00 | 2023-06-01 12:00 | Johnson |
| R003 | 2023-06-01 13:30 | 2023-06-01 15:00 | Brown |
| R004 | 2023-06-02 10:00 | 2023-06-02 11:00 | Wilson |
anti_join(reservation_data, by = join_by(overlaps(start_time, end_time, start_time, end_time)))
From the source dataframe, returns only rows that do not meet the condition with rows in the target dataframe. Extracts meetings that don't have overlapping times with reservations. Meeting with id=M004 is included because it doesn't overlap with any reservation.
meeting_id | start_time | end_time | room |
---|---|---|---|
M004 | 2023-06-03 10:00 | 2023-06-03 11:00 | Room C |
Source data (store hours):
| store_id | opening_time | closing_time |
|----------|--------------|--------------|
| S001 | 09:00 | 21:00 |
| S002 | 10:00 | 22:00 |
| S003 | 08:30 | 20:00 |
| S004 | 07:00 | 15:00 |
Event data:
| event_id | start_time | end_time | description |
|----------|------------|----------|-------------|
| E001 | 08:00 | 10:00 | Morning Mtg |
| E002 | 11:00 | 13:00 | Seminar |
| E003 | 15:00 | 19:00 | Exhibition |
anti_join(event_data, by = join_by(within(opening_time, closing_time, start_time, end_time)))
From the source dataframe, returns only rows that do not meet the condition with any row in the target dataframe, i.e., stores where no event completely fits within the store's opening hours. In this example only store S004 remains.
store_id | opening_time | closing_time |
---|---|---|
S004 | 07:00 | 15:00 |
Summary
Sort rows by given column(s).
Syntax
arrange(<column(s)>
)
Used Together
desc
Example
arrange(POPULATION)
Sort by POPULATION values in ascending order.
arrange(desc(POPULATION))
Sort by POPULATION values in descending order.
arrange(POPULATION, PRODUCTION)
Sort by POPULATION, PRODUCTION columns in ascending order.
arrange(-POPULATION, -PRODUCTION)
Sort by POPULATION, PRODUCTION columns, both in descending order. This is equivalent to the following:
arrange(desc(POPULATION), desc(PRODUCTION))
Summary
Returns TRUE or FALSE based on whether a given value is within a given range (between the left and right values, inclusive).
Syntax
between(<column_num_date>
, <left>
, <right>
)
Arguments
Return Value
Logical
Example
filter(between(revenue, 1000, 6000))
Keep data whose revenue is between 1000 and 6000.
Summary
Bind multiple data frames by column.
Syntax
bind_cols(<data_set(s)>
)
Example
bind_cols(ECONOMY)
Add columns from a data frame 'ECONOMY' to the existing data frame.
Summary
Bind multiple data frames by row.
Syntax
bind_rows(<dataframe1>
, <dataframe2>
, ..., .id = <id_column_name>
, current_df_name = <current_dataframe_name>
, force_data_type = <TRUE|FALSE>
, use_col_index_as_col_name = <TRUE|FALSE>
, encoding = <encoding>
)
Arguments
Example
Existing data:
DATA2013
date | sales | product_id |
---|---|---|
11/05/2013 | 5000 | 1 |
12/20/2013 | 2000 | 2 |
Additional data to bind:
DATA2014
date | sales | product_id |
---|---|---|
01/05/2014 | 1000 | 1 |
02/20/2014 | 2000 | 2 |
DATA2015
date | sales | product_id |
---|---|---|
01/05/2015 | 3000 | 1 |
02/20/2015 | 3000 | 2 |
bind_rows(DATA2014)
Add a data frame 'DATA2014' to the existing data.
date | sales | product_id |
---|---|---|
11/05/2013 | 5000 | 1 |
12/20/2013 | 2000 | 2 |
01/05/2014 | 1000 | 1 |
02/20/2014 | 2000 | 2 |
bind_rows(DATA2014, DATA2015)
Add rows from a data frame 'DATA2014' and 'DATA2015' to the existing data.
date | sales | product_id |
---|---|---|
11/05/2013 | 5000 | 1 |
12/20/2013 | 2000 | 2 |
01/05/2014 | 1000 | 1 |
02/20/2014 | 2000 | 2 |
01/05/2015 | 3000 | 1 |
02/20/2015 | 3000 | 2 |
bind_rows(DATA2014, DATA2015, force_data_type = TRUE)
Add rows from a data frame 'DATA2014' and 'DATA2015' to the existing data and force all columns data types as character, then reevaluate columns types after the merge.
For example, if the product_id column only has character values in DATA2015, its data type is Character. If you try to bind this to other data frames whose product_id column data type is Integer, it gives an error because of this data type mismatch.
date | sales | product_id |
---|---|---|
01/05/2015 | 3000 | a |
02/20/2015 | 3000 | b |
In this case you can set the force_data_type argument to TRUE. This will create a data frame like below:
date | sales | product_id |
---|---|---|
11/05/2013 | 5000 | 1 |
12/20/2013 | 2000 | 2 |
01/05/2014 | 1000 | 1 |
02/20/2014 | 2000 | 2 |
01/05/2015 | 3000 | a |
02/20/2015 | 3000 | b |
bind_rows(DATA2014, DATA2015, current_df_name = "DATA2013", id_column_name = "df_name")
Add rows from the data frames 'DATA2014' and 'DATA2015' to DATA2013. This creates a column called 'df_name' that uses "DATA2013", "DATA2014", and "DATA2015" to mark which data frame each row came from.
df_name | date | sales | product_id |
---|---|---|---|
DATA2013 | 11/05/2013 | 5000 | 1 |
DATA2013 | 12/20/2013 | 2000 | 2 |
DATA2014 | 01/05/2014 | 1000 | 1 |
DATA2014 | 02/20/2014 | 2000 | 2 |
DATA2015 | 01/05/2015 | 3000 | 1 |
DATA2015 | 02/20/2015 | 3000 | 2 |
Dataframe1 (with product-related column names)
| product_name | price | stock |
|--------------|-------|-------|
| Laptop | 80000 | 10 |
| Mouse | 3000 | 50 |
| Keyboard | 5000 | 30 |
Dataframe2 (with similar but different column names)
| product_name | price | quantity |
|--------------|-------|----------|
| Monitor | 25000 | 15 |
| USB drive | 2000 | 100 |
| Headphones | 8000 | 20 |
Normal column name-based binding
bind_rows(Dataframe1, Dataframe2)
Because the column names are different, they are not recognized as common columns, and all columns are combined as separate columns.
product_name | price | stock | product_name | price | quantity |
---|---|---|---|---|---|
Laptop | 80000 | 10 | NA | NA | NA |
Mouse | 3000 | 50 | NA | NA | NA |
Keyboard | 5000 | 30 | NA | NA | NA |
NA | NA | NA | Monitor | 25000 | 15 |
NA | NA | NA | USB drive | 2000 | 100 |
NA | NA | NA | Headphones | 8000 | 20 |
Column position-based binding
bind_rows(Dataframe1, Dataframe2, use_col_index_as_col_name = TRUE)
Combines based on column position, ignoring column names. The column names from the first dataframe (Dataframe1) are used in the combined result.
product_name | price | stock |
---|---|---|
Laptop | 80000 | 10 |
Mouse | 3000 | 50 |
Keyboard | 5000 | 30 |
Monitor | 25000 | 15 |
USB drive | 2000 | 100 |
Headphones | 8000 | 20 |
Summary
Summarize the data by either calling n() or sum() for each group.
Syntax
count(<column(s)>
, wt = <column_num_or_base_aggregate_expression>
, sort = <logical>
)
Arguments
Example
count(CARRIER)
Groups by 'CARRIER' and returns the number of rows for each 'CARRIER'.
count(CARRIER, sort = TRUE)
Groups by 'CARRIER' and returns the number of rows for each 'CARRIER' from the highest number to the lowest.
count(CARRIER, wt = ARR_DELAY)
Groups by 'CARRIER' and returns the sum total of 'ARR_DELAY'.
count(CARRIER, TAIL_NUM, wt = ARR_DELAY)
Groups by 'CARRIER' and 'TAIL_NUM', and returns the sum total of 'ARR_DELAY'.
count(CARRIER, wt = ARR_DELAY, sort = TRUE)
Groups by 'CARRIER' and returns the sum total of 'ARR_DELAY' and sorted by the total number.
count(CARRIER, wt = mean(ARR_DELAY, na.rm = TRUE))
Groups by 'CARRIER' and returns the average of 'ARR_DELAY' after removing NA values.
Summary
Returns TRUE if all the values up to the current position are TRUE.
Syntax
cumall(<condition>
)
Arguments
Return Value
Logical
Example
// X = TRUE, TRUE, FALSE, TRUE, FALSE
cumall(X)
Returns TRUE, TRUE, FALSE, FALSE, FALSE
Summary
Returns TRUE if any of the values up to the current position is TRUE.
Syntax
cumany(<condition>
)
Arguments
Return Value
Logical
Example
// X = TRUE, TRUE, FALSE, TRUE, FALSE
cumany(X)
Returns TRUE, TRUE, TRUE, TRUE, TRUE
Summary
Cumulative distribution. Proportion of all values less than or equal to the current rank. Missing values are left as is.
Syntax
cume_dist(<column_num>
)
Return Value
Numeric
Example
mutate(a = cume_dist(population))
Create a new column to have the cumulative distribution number based on 'population' values.
filter(cume_dist(population) < 0.25)
Select rows whose cumulative distribution number is less than 0.25.
Summary
Returns the cumulative mean (average).
Syntax
cummean(<column_num>
)
Arguments
Return Value
Numeric
Example
// X = 1, 2, 3, 2, 1
cummean(X)
Returns 1, 1.5, 2, 2, 1.8
Summary
Ranks with no gaps. Missing values are left as is.
Syntax
dense_rank(<column_num>
)
Return Value
Numeric
Example
mutate(population_rank = dense_rank(population))
Create a new column to have the rank number based on 'population' values.
filter(dense_rank(population) < 10)
Select rows whose rank numbers are less than 10.
Summary
Changes the sorting order to descending, e.g. 9 to 1 instead of 1 to 9, z to a instead of a to z.
Syntax
desc(<column>
)
Example
arrange(desc(POPULATION))
Sort by POPULATION values in a descending order.
arrange(desc(POPULATION), desc(PRODUCTION))
Sort by POPULATION, PRODUCTION columns in both descending order.
Summary
Select distinct/unique rows. Only the first row will be preserved. Set .keep_all argument to TRUE to keep all the original columns.
Syntax
distinct(<column(s)>
, .keep_all = <logical>
)
Arguments
Example
distinct()
Keep only unique/distinct rows based on all the columns.
distinct(CARRIER, TAIL_NUM, FL_NUM)
Keep only unique/distinct rows based on CARRIER, TAIL_NUM, FL_NUM columns.
distinct(CARRIER, .keep_all = TRUE)
Keep only unique/distinct rows based on CARRIER and return all the columns.
Summary
Select rows with conditions.
Syntax
filter(<condition>
)
Used Together
is.na, between
Example
filter(revenue < 6000)
Keep data whose revenue is less than 6000.
filter(revenue < 6000 & region == "West")
Keep data whose revenue is less than 6000 AND region is "West".
filter(revenue < 6000 | region == "West")
Keep data whose revenue is less than 6000 OR region is "West".
filter(between(revenue, 1000, 6000))
Keep data whose revenue is between 1000 and 6000.
filter(is.na(revenue))
Keep data whose revenue is NA.
filter(!is.na(revenue))
Keep data whose revenue is NOT NA.
Summary
Returns the first value of all the values.
Syntax
first(<column>
, order_by = <column(s)>
, default = <text>
)
Arguments
Example
summarize(total = first(name))
Create a new column for returning the first value for each group.
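A sketch of the order_by argument (the 'joined_date' column here is illustrative):
summarize(first_name = first(name, order_by = joined_date))
Returns the 'name' value from the row with the earliest 'joined_date' for each group.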
Overview
Performs a full outer join. Returns all rows from both the current dataframe (left side) and the target dataframe (right side), and all columns from both dataframes. When a row in one dataframe has no matching row in the other dataframe, columns from the non-matching side will contain NA values. If there are multiple matches, all combinations will be returned.
Syntax
full_join(<target_dataframe>
, by = join_by(<join_condition>
), suffix = c(<source_dataframe_suffix>
, <target_dataframe_suffix>
), na_matches = <NA_handling_method>
, ignore_case = <TRUE|FALSE>
)
Arguments
Complex join conditions with join_by
Examples
Source data:
| customer_id | customer_name | order_amount |
|-------------|---------------|--------------|
| 101 | Smith | 5000 |
| 102 | Johnson | 3000 |
| 103 | Brown | 7000 |
Customer attributes data:
| customer_id | region | age |
|-------------|------------|-----|
| 101 | New York | 35 |
| 102 | California | 42 |
| 104 | Texas | 28 |
full_join(customer_attributes, by = join_by(customer_id == customer_id))
Joins all rows from both dataframes. For customer_id=104 which has no match in the source dataframe, customer_name and order_amount columns will contain NA values. For customer_id=103 which has no match in the target dataframe, region and age columns will contain NA values.
customer_id | customer_name | order_amount | region | age |
---|---|---|---|---|
101 | Smith | 5000 | New York | 35 |
102 | Johnson | 3000 | California | 42 |
103 | Brown | 7000 | NA | NA |
104 | NA | NA | Texas | 28 |
Source data (employees):
| employee_id | job_code | name |
|-------------|----------|----------|
| 1 | A01 | Thompson |
| 2 | A02 | Wilson |
| 3 | NA | Davis |
Job data:
| job_code | job_title |
|----------|-------------|
| A01 | Sales |
| A02 | Development |
| A03 | Finance |
| NA | Unassigned |
full_join(job_data, by = join_by(job_code == job_code), na_matches = "na")
Joins all rows from both dataframes. With na_matches="na", NA values are considered matching. For job_code=A03 which has no match in the source dataframe, employee_id and name columns will contain NA values.
employee_id | job_code | name | job_title |
---|---|---|---|
1 | A01 | Thompson | Sales |
2 | A02 | Wilson | Development |
3 | NA | Davis | Unassigned |
NA | A03 | NA | Finance |
Source data (user data):
| user_id | name |
|---------|-------|
| 1 | john |
| 2 | MARY |
| 3 | David |
Profile data:
| user_id | name | age |
|---------|-------|-----|
| 1 | John | 25 |
| 2 | mary | 30 |
| 4 | Steve | 35 |
full_join(profile_data, by = join_by(name == name), ignore_case = TRUE)
Joins all rows from both dataframes, ignoring case differences. For name=Steve (user_id=4) which has no match in the source dataframe, the source user_id column will contain NA value. For name=David (user_id=3) which has no match in the target dataframe, the target user_id and age columns will contain NA values.
user_id.x | name | user_id.y | age |
---|---|---|---|
1 | john | 1 | 25 |
2 | MARY | 2 | 30 |
3 | David | NA | NA |
NA | Steve | 4 | 35 |
Source data (product data):
| product_id | product_name | price |
|------------|--------------|-------|
| P001 | Laptop | 80000 |
| P002 | Mouse | 3000 |
| P003 | Keyboard | 5000 |
Sale period data:
| product_id | sale_start_date | sale_end_date | discount_rate |
|------------|-----------------|---------------|---------------|
| P001 | 2023-06-01 | 2023-06-15 | 10 |
| P001 | 2023-07-01 | 2023-07-15 | 15 |
| P002 | 2023-06-10 | 2023-06-20 | 20 |
| P004 | 2023-06-01 | 2023-06-30 | 25 |
full_join(sale_period_data, by = join_by(product_id == product_id, between(as.Date("2023-06-05"), sale_start_date, sale_end_date)))
Joins all rows from both dataframes. For combinations where "product_id" matches and "2023-06-05" falls within the sale period, all data is included. For other combinations, columns from the side that doesn't satisfy the condition will contain NA values.
product_id | product_name | price | sale_start_date | sale_end_date | discount_rate |
---|---|---|---|---|---|
P001 | Laptop | 80000 | 2023-06-01 | 2023-06-15 | 10 |
P001 | Laptop | 80000 | 2023-07-01 | 2023-07-15 | 15 |
P002 | Mouse | 3000 | 2023-06-10 | 2023-06-20 | 20 |
P003 | Keyboard | 5000 | NA | NA | NA |
P004 | NA | NA | 2023-06-01 | 2023-06-30 | 25 |
Source data (order data):
| order_id | order_date | amount |
|----------|------------|--------|
| 1001 | 2023-01-15 | 3000 |
| 1002 | 2023-02-20 | 5000 |
| 1003 | 2023-03-10 | 8000 |
Exchange rate data:
| rate_date | usd_jpy_rate |
|------------|--------------|
| 2023-01-01 | 130.5 |
| 2023-02-01 | 132.0 |
| 2023-03-01 | 133.5 |
| 2023-04-01 | 135.2 |
full_join(exchange_rate_data, by = join_by(closest(order_date >= rate_date)))
Joins all rows from both dataframes. For each order date, finds the closest preceding exchange rate. For rate_date=2023-04-01 which has no corresponding order, order_id and amount columns will contain NA values.
order_id | order_date | amount | rate_date | usd_jpy_rate |
---|---|---|---|---|
1001 | 2023-01-15 | 3000 | 2023-01-01 | 130.5 |
1002 | 2023-02-20 | 5000 | 2023-02-01 | 132.0 |
1003 | 2023-03-10 | 8000 | 2023-03-01 | 133.5 |
NA | NA | NA | 2023-04-01 | 135.2 |
Source data (meeting schedule):
| meeting_id | start_time | end_time | room |
|------------|------------------|------------------|--------|
| M001 | 2023-06-01 10:00 | 2023-06-01 11:30 | Room A |
| M002 | 2023-06-01 13:00 | 2023-06-01 14:00 | Room B |
| M003 | 2023-06-02 09:30 | 2023-06-02 12:00 | Room A |
Reservation data:
| reservation_id | start_time | end_time | reserved_by |
|----------------|------------------|------------------|-------------|
| R001 | 2023-06-01 09:00 | 2023-06-01 10:30 | Smith |
| R002 | 2023-06-01 11:00 | 2023-06-01 12:00 | Johnson |
| R003 | 2023-06-01 13:30 | 2023-06-01 15:00 | Brown |
| R004 | 2023-06-02 10:00 | 2023-06-02 11:00 | Wilson |
| R005 | 2023-06-03 10:00 | 2023-06-03 12:00 | Taylor |
full_join(reservation_data, by = join_by(overlaps(start_time, end_time, start_time, end_time)))
Joins all rows from both dataframes. For combinations where meeting times overlap with reservation times, all data is included. For reservation_id=R005 which has no overlapping meeting, meeting_id and room columns will contain NA values.
meeting_id | start_time | end_time | room | reservation_id | reserved_by |
---|---|---|---|---|---|
M001 | 2023-06-01 10:00 | 2023-06-01 11:30 | Room A | R001 | Smith |
M001 | 2023-06-01 10:00 | 2023-06-01 11:30 | Room A | R002 | Johnson |
M002 | 2023-06-01 13:00 | 2023-06-01 14:00 | Room B | R003 | Brown |
M003 | 2023-06-02 09:30 | 2023-06-02 12:00 | Room A | R004 | Wilson |
NA | NA | NA | NA | R005 | Taylor |
Source data (store hours):
| store_id | opening_time | closing_time |
|----------|--------------|--------------|
| S001 | 09:00 | 21:00 |
| S002 | 10:00 | 22:00 |
| S003 | 08:30 | 20:00 |
Event data:
| event_id | start_time | end_time | description |
|----------|------------|----------|-------------|
| E001 | 08:00 | 10:00 | Morning Mtg |
| E002 | 11:00 | 13:00 | Seminar |
| E003 | 15:00 | 19:00 | Exhibition |
| E004 | 19:30 | 22:30 | Party |
full_join(event_data, by = join_by(within(opening_time, closing_time, start_time, end_time)))
Joins all rows from both dataframes. For combinations where events completely fit within store opening hours, all data is included. For event_id=E001 and event_id=E004 which don't fit within any store's hours, store_id, opening_time, and closing_time columns will contain NA values.
store_id | opening_time | closing_time | event_id | start_time | end_time | description |
---|---|---|---|---|---|---|
S001 | 09:00 | 21:00 | E002 | 11:00 | 13:00 | Seminar |
S002 | 10:00 | 22:00 | E002 | 11:00 | 13:00 | Seminar |
S003 | 08:30 | 20:00 | E002 | 11:00 | 13:00 | Seminar |
S001 | 09:00 | 21:00 | E003 | 15:00 | 19:00 | Exhibition |
S002 | 10:00 | 22:00 | E003 | 15:00 | 19:00 | Exhibition |
S003 | 08:30 | 20:00 | E003 | 15:00 | 19:00 | Exhibition |
NA | NA | NA | E001 | 08:00 | 10:00 | Morning Mtg |
NA | NA | NA | E004 | 19:30 | 22:30 | Party |
Summary
funs provides a flexible way to generate a named list of functions for input to other functions like summarise_each.
Syntax
funs(<function_name(s)>
)
Arguments
A list of functions specified by:
* Their name: "mean"
* The function itself: mean
* A call to the function with . as a dummy parameter: mean(., na.rm = TRUE)
Example
funs(mean, "mean", mean(., na.rm = TRUE))
# Override default names
funs(m1 = mean, m2 = "mean", m3 = mean(., na.rm = TRUE))
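A sketch of feeding the result to summarize_each (the ARR_DELAY column is illustrative):
summarize_each(funs(m = mean(., na.rm = TRUE)), ARR_DELAY)
Calculates the NA-removed mean of ARR_DELAY, naming the output column with the 'm' suffix.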
Summary
Converts the data frame into a grouped data frame where the following operations will be performed for each group. Grouping should be done before aggregating values.
Syntax
group_by(<column(s)>
)
Example
group_by(COUNTRY)
Group the data by COUNTRY.
group_by(REGION, COUNTRY)
Group the data by REGION and COUNTRY.
group_by(FIRSTNAME + LASTNAME)
Group by a dynamically generated value of 'FIRSTNAME + LASTNAME'
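Grouping typically pairs with a following aggregation step, for example (the SALES column is illustrative):
group_by(COUNTRY)
summarize(total_sales = sum(SALES, na.rm = TRUE))
Returns one row per COUNTRY with the total of SALES.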
Overview
Performs an inner join. Returns rows that match in both the current dataframe (left side) and the target dataframe (right side), and all columns from both dataframes. Rows without matches are excluded from the result. If there are multiple matches, all combinations will be returned.
Syntax
inner_join(<target_dataframe>
, by = join_by(<join_condition>
), suffix = c(<source_dataframe_suffix>
, <target_dataframe_suffix>
), na_matches = <NA_handling_method>
, ignore_case = <TRUE|FALSE>
)
Arguments
Complex join conditions with join_by
Examples
Source data:
| customer_id | customer_name | order_amount |
|-------------|---------------|--------------|
| 101 | Smith | 5000 |
| 102 | Johnson | 3000 |
| 103 | Brown | 7000 |
Customer attributes data:
| customer_id | region | age |
|-------------|------------|-----|
| 101 | New York | 35 |
| 102 | California | 42 |
| 104 | Texas | 28 |
inner_join(customer_attributes, by = join_by(customer_id == customer_id))
Returns only rows where "customer_id" matches in both dataframes. Customer_id=103 and customer_id=104 are excluded from the result because they exist in only one of the dataframes.
customer_id | customer_name | order_amount | region | age |
---|---|---|---|---|
101 | Smith | 5000 | New York | 35 |
102 | Johnson | 3000 | California | 42 |
Source data (employees):
| employee_id | job_code | name |
|-------------|----------|----------|
| 1 | A01 | Thompson |
| 2 | A02 | Wilson |
| 3 | NA | Davis |
Job data:
| job_code | job_title |
|----------|-------------|
| A01 | Sales |
| A02 | Development |
| A03 | Finance |
| NA | Unassigned |
inner_join(job_data, by = join_by(job_code == job_code), na_matches = "na")
Returns only rows where "job_code" matches in both dataframes. With na_matches="na", NA values are considered matching. Job_code=A03 is excluded from the result because it exists only in the job data.
employee_id | job_code | name | job_title |
---|---|---|---|
1 | A01 | Thompson | Sales |
2 | A02 | Wilson | Development |
3 | NA | Davis | Unassigned |
Source data (user data):
| user_id | name |
|---------|-------|
| 1 | john |
| 2 | MARY |
| 3 | David |
Profile data:
| user_id | name | age |
|---------|-------|-----|
| 1 | John | 25 |
| 2 | mary | 30 |
| 4 | Steve | 35 |
inner_join(profile_data, by = join_by(name == name), ignore_case = TRUE)
Returns only rows where "name" matches in both dataframes, ignoring case differences. Name="john" matches name="John", and name="MARY" matches name="mary". Name="David" and name="Steve" are excluded because they exist in only one of the dataframes.
user_id.x | name | user_id.y | age |
---|---|---|---|
1 | john | 1 | 25 |
2 | MARY | 2 | 30 |
Source data (product data):
| product_id | product_name | price |
|------------|--------------|-------|
| P001 | Laptop | 80000 |
| P002 | Mouse | 3000 |
| P003 | Keyboard | 5000 |
Sale period data:
| product_id | sale_start_date | sale_end_date | discount_rate |
|------------|-----------------|---------------|---------------|
| P001 | 2023-06-01 | 2023-06-15 | 10 |
| P001 | 2023-07-01 | 2023-07-15 | 15 |
| P002 | 2023-06-10 | 2023-06-20 | 20 |
| P004 | 2023-06-01 | 2023-06-30 | 25 |
inner_join(sale_period_data, by = join_by(product_id == product_id, between(as.Date("2023-06-05"), sale_start_date, sale_end_date)))
Returns only rows where "product_id" matches and "2023-06-05" falls within the sale period. The second period for product_id=P001 (July) and the period for product_id=P002 (starting from 2023-06-10) are excluded because 2023-06-05 is not within those periods. Product_id=P003 has no sale period, and product_id=P004 doesn't exist in the product data, so they're also excluded.
product_id | product_name | price | sale_start_date | sale_end_date | discount_rate |
---|---|---|---|---|---|
P001 | Laptop | 80000 | 2023-06-01 | 2023-06-15 | 10 |
Source data (order data):
| order_id | order_date | amount |
|----------|------------|--------|
| 1001 | 2023-01-15 | 3000 |
| 1002 | 2023-02-20 | 5000 |
| 1003 | 2023-03-10 | 8000 |
Exchange rate data:
| rate_date | usd_jpy_rate |
|------------|--------------|
| 2023-01-01 | 130.5 |
| 2023-02-01 | 132.0 |
| 2023-03-01 | 133.5 |
| 2023-04-01 | 135.2 |
inner_join(exchange_rate_data, by = join_by(closest(order_date >= rate_date)))
Returns only rows where there's a matching condition between dataframes. For each order date, finds the closest preceding exchange rate. Rate_date=2023-04-01 is excluded because there's no corresponding order.
order_id | order_date | amount | rate_date | usd_jpy_rate |
---|---|---|---|---|
1001 | 2023-01-15 | 3000 | 2023-01-01 | 130.5 |
1002 | 2023-02-20 | 5000 | 2023-02-01 | 132.0 |
1003 | 2023-03-10 | 8000 | 2023-03-01 | 133.5 |
Source data (meeting schedule):
| meeting_id | start_time | end_time | room |
|------------|------------------|------------------|--------|
| M001 | 2023-06-01 10:00 | 2023-06-01 11:30 | Room A |
| M002 | 2023-06-01 13:00 | 2023-06-01 14:00 | Room B |
| M003 | 2023-06-02 09:30 | 2023-06-02 12:00 | Room A |
Reservation data:
| reservation_id | start_time | end_time | reserved_by |
|----------------|------------------|------------------|-------------|
| R001 | 2023-06-01 09:00 | 2023-06-01 10:30 | Smith |
| R002 | 2023-06-01 11:00 | 2023-06-01 12:00 | Johnson |
| R003 | 2023-06-01 13:30 | 2023-06-01 15:00 | Brown |
| R004 | 2023-06-02 10:00 | 2023-06-02 11:00 | Wilson |
| R005 | 2023-06-03 10:00 | 2023-06-03 12:00 | Taylor |
inner_join(reservation_data, by = join_by(overlaps(start_time, end_time, start_time, end_time)))
Returns only rows where meeting times overlap with reservation times. Reservation_id=R005 is excluded because it doesn't overlap with any meeting.
meeting_id | start_time | end_time | room | reservation_id | reserved_by |
---|---|---|---|---|---|
M001 | 2023-06-01 10:00 | 2023-06-01 11:30 | Room A | R001 | Smith |
M001 | 2023-06-01 10:00 | 2023-06-01 11:30 | Room A | R002 | Johnson |
M002 | 2023-06-01 13:00 | 2023-06-01 14:00 | Room B | R003 | Brown |
M003 | 2023-06-02 09:30 | 2023-06-02 12:00 | Room A | R004 | Wilson |
Source data (store hours):
| store_id | opening_time | closing_time |
|----------|--------------|--------------|
| S001 | 09:00 | 21:00 |
| S002 | 10:00 | 22:00 |
| S003 | 08:30 | 20:00 |
Event data:
| event_id | start_time | end_time | description |
|----------|------------|----------|-------------|
| E001 | 08:00 | 10:00 | Morning Mtg |
| E002 | 11:00 | 13:00 | Seminar |
| E003 | 15:00 | 19:00 | Exhibition |
| E004 | 19:30 | 22:30 | Party |
inner_join(event_data, by = join_by(within(opening_time, closing_time, start_time, end_time)))
Returns only rows where events completely fit within store opening hours. Event_id=E001 and event_id=E004 are excluded because they don't satisfy the condition with any store.
store_id | opening_time | closing_time | event_id | start_time | end_time | description |
---|---|---|---|---|---|---|
S001 | 09:00 | 21:00 | E002 | 11:00 | 13:00 | Seminar |
S002 | 10:00 | 22:00 | E002 | 11:00 | 13:00 | Seminar |
S003 | 08:30 | 20:00 | E002 | 11:00 | 13:00 | Seminar |
S001 | 09:00 | 21:00 | E003 | 15:00 | 19:00 | Exhibition |
S002 | 10:00 | 22:00 | E003 | 15:00 | 19:00 | Exhibition |
S003 | 08:30 | 20:00 | E003 | 15:00 | 19:00 | Exhibition |
Summary
Provides access to a row at a given physical offset prior to the current row.
Syntax
lag(<column>
, n = <number>
, default = <text>
, order_by = <column(s)>
)
Arguments
Example
// x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
lag(x)
Returns NA 1 2 3 4 5 6 7 8 9
lag(x, 2)
Returns NA NA 1 2 3 4 5 6 7 8
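A sketch of the order_by argument (the SALES and DATE columns are illustrative):
mutate(prev_sales = lag(SALES, 1, order_by = DATE))
Returns the previous SALES value when rows are ordered by DATE, with NA for the first row.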
Summary
Returns the last value of all the values.
Syntax
last(<column>
, order_by = <column(s)>
, default = <text>
)
Arguments
Example
summarize(total = last(name))
Create a new column for returning the last value for each group.
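A sketch of the order_by argument (the 'joined_date' column here is illustrative):
summarize(last_name = last(name, order_by = joined_date))
Returns the 'name' value from the row with the latest 'joined_date' for each group.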
Summary
Provides access to a row at a given physical offset that follows the current row.
Syntax
lead(<column>
, n = <number>
, default = <text>
, order_by = <column(s)>
)
Arguments
Example
// x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
lead(x)
Returns 2 3 4 5 6 7 8 9 10 NA
lead(x, 2)
Returns 3 4 5 6 7 8 9 10 NA NA
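A sketch of the order_by argument (the SALES and DATE columns are illustrative):
mutate(next_sales = lead(SALES, 1, order_by = DATE))
Returns the next SALES value when rows are ordered by DATE, with NA for the last row.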
Overview
Performs a left outer join. Returns all rows from the current dataframe (left side) and all columns from both the current dataframe and the target dataframe (right side). When a row in the left dataframe has no matching row in the right dataframe, the columns added from the right side will contain NA values. If there are multiple matches, all combinations will be returned.
Syntax
left_join(<target_dataframe>
, by = join_by(<join_condition>
), suffix = c(<source_dataframe_suffix>
, <target_dataframe_suffix>
), na_matches = <NA_handling_method>
, ignore_case = <TRUE|FALSE>
)
Arguments
Complex join conditions with join_by
Examples
Source data: | customer_id | customer_name | order_amount | |-------------|---------------|--------------| | 101 | Smith | 5000 | | 102 | Johnson | 3000 | | 103 | Brown | 7000 |
Customer attributes data: | customer_id | region | age | |-------------|-------------|-----| | 101 | New York | 35 | | 102 | California | 42 | | 104 | Texas | 28 |
left_join(customer_attributes, by = join_by(customer_id == customer_id))
Keeps all rows from the source dataframe and joins with the "customer_attributes" dataframe using the "customer_id" column as the key. For customer_id=103 which has no match in the target dataframe, region and age columns will contain NA values.
customer_id | customer_name | order_amount | region | age |
---|---|---|---|---|
101 | Smith | 5000 | New York | 35 |
102 | Johnson | 3000 | California | 42 |
103 | Brown | 7000 | NA | NA |
Source data (employees): | employee_id | job_code | name | |-------------|----------|----------| | 1 | A01 | Thompson | | 2 | A02 | Wilson | | 3 | NA | Davis |
Job data: | job_code | job_title | |----------|--------------| | A01 | Sales | | A02 | Development | | NA | Unassigned |
left_join(job_data, by = join_by(job_code == job_code), na_matches = "na")
Keeps all rows from the source dataframe and joins with the "job_data" dataframe using the "job_code" column as the key. NA values are also used as join keys since na_matches="na".
employee_id | job_code | name | job_title |
---|---|---|---|
1 | A01 | Thompson | Sales |
2 | A02 | Wilson | Development |
3 | NA | Davis | Unassigned |
Source data (user data): | user_id | name | |---------|-------| | 1 | john | | 2 | MARY | | 3 | David |
Profile data: | user_id | name | age | |---------|-------|-----| | 1 | John | 25 | | 2 | mary | 30 | | 4 | Steve | 35 |
left_join(profile_data, by = join_by(name == name), ignore_case = TRUE)
Keeps all rows from the source dataframe and joins with the "profile_data" dataframe using the "name" column as the key, ignoring case differences.
user_id.x | name | user_id.y | age |
---|---|---|---|
1 | john | 1 | 25 |
2 | MARY | 2 | 30 |
3 | David | NA | NA |
Source data (product data): | product_id | product_name | price | |------------|--------------|-------| | P001 | Laptop | 80000 | | P002 | Mouse | 3000 | | P003 | Keyboard | 5000 |
Sale period data: | product_id | sale_start_date | sale_end_date | discount_rate | |------------|-----------------|---------------|---------------| | P001 | 2023-06-01 | 2023-06-15 | 10 | | P001 | 2023-07-01 | 2023-07-15 | 15 | | P002 | 2023-06-10 | 2023-06-20 | 20 |
left_join(sale_period_data, by = join_by(product_id == product_id, between(as.Date("2023-06-05"), sale_start_date, sale_end_date)))
Keeps all rows from the source dataframe and joins with matching products where "2023-06-05" falls within the sale period.
product_id | product_name | price | sale_start_date | sale_end_date | discount_rate |
---|---|---|---|---|---|
P001 | Laptop | 80000 | 2023-06-01 | 2023-06-15 | 10 |
P002 | Mouse | 3000 | NA | NA | NA |
P003 | Keyboard | 5000 | NA | NA | NA |
Source data (order data): | order_id | order_date | amount | |----------|------------|--------| | 1001 | 2023-01-15 | 3000 | | 1002 | 2023-02-20 | 5000 | | 1003 | 2023-03-10 | 8000 |
Exchange rate data: | rate_date | usd_jpy_rate | |------------|--------------| | 2023-01-01 | 130.5 | | 2023-02-01 | 132.0 | | 2023-03-01 | 133.5 | | 2023-04-01 | 135.2 |
left_join(exchange_rate_data, by = join_by(closest(order_date >= rate_date)))
Keeps all rows from the source dataframe and joins with the closest exchange rate that is on or before each order date.
order_id | order_date | amount | rate_date | usd_jpy_rate |
---|---|---|---|---|
1001 | 2023-01-15 | 3000 | 2023-01-01 | 130.5 |
1002 | 2023-02-20 | 5000 | 2023-02-01 | 132.0 |
1003 | 2023-03-10 | 8000 | 2023-03-01 | 133.5 |
Source data (meeting schedule): | meeting_id | start_time | end_time | room | |------------|----------------------|----------------------|----------| | M001 | 2023-06-01 10:00 | 2023-06-01 11:30 | Room A | | M002 | 2023-06-01 13:00 | 2023-06-01 14:00 | Room B | | M003 | 2023-06-02 09:30 | 2023-06-02 12:00 | Room A |
Reservation data: | reservation_id | start_time | end_time | reserved_by | |----------------|----------------------|----------------------|-------------| | R001 | 2023-06-01 09:00 | 2023-06-01 10:30 | Smith | | R002 | 2023-06-01 11:00 | 2023-06-01 12:00 | Johnson | | R003 | 2023-06-01 13:30 | 2023-06-01 15:00 | Brown | | R004 | 2023-06-02 10:00 | 2023-06-02 11:00 | Wilson |
left_join(reservation_data, by = join_by(overlaps(start_time, end_time, start_time, end_time)))
Keeps all rows from the source dataframe and joins with reservations that overlap with meeting times.
meeting_id | start_time | end_time | room | reservation_id | reserved_by |
---|---|---|---|---|---|
M001 | 2023-06-01 10:00 | 2023-06-01 11:30 | Room A | R001 | Smith |
M001 | 2023-06-01 10:00 | 2023-06-01 11:30 | Room A | R002 | Johnson |
M002 | 2023-06-01 13:00 | 2023-06-01 14:00 | Room B | R003 | Brown |
M003 | 2023-06-02 09:30 | 2023-06-02 12:00 | Room A | R004 | Wilson |
Source data (store hours): | store_id | opening_time | closing_time | |----------|--------------|--------------| | S001 | 09:00 | 21:00 | | S002 | 10:00 | 22:00 | | S003 | 08:30 | 20:00 |
Event data: | event_id | start_time | end_time | description | |----------|------------|----------|---------------| | E001 | 08:00 | 10:00 | Morning Mtg | | E002 | 11:00 | 13:00 | Seminar | | E003 | 15:00 | 19:00 | Exhibition | | E004 | 19:30 | 22:30 | Party |
left_join(event_data, by = join_by(within(opening_time, closing_time, start_time, end_time)))
Keeps all rows from the source dataframe and joins with events that completely fit within store hours.
store_id | opening_time | closing_time | event_id | start_time | end_time | description |
---|---|---|---|---|---|---|
S001 | 09:00 | 21:00 | E002 | 11:00 | 13:00 | Seminar |
S001 | 09:00 | 21:00 | E003 | 15:00 | 19:00 | Exhibition |
S002 | 10:00 | 22:00 | E002 | 11:00 | 13:00 | Seminar |
S002 | 10:00 | 22:00 | E003 | 15:00 | 19:00 | Exhibition |
S003 | 08:30 | 20:00 | E002 | 11:00 | 13:00 | Seminar |
S003 | 08:30 | 20:00 | E003 | 15:00 | 19:00 | Exhibition |
Summary
Returns ranks. Ties get the minimum rank. Missing values are left as is.
Syntax
min_rank(<column_num>
)
Return Value
Numeric
Example
mutate(population_rank = min_rank(population))
Create a new column to have the rank number based on 'population' values. The smaller the numbers are the higher the ranks are.
mutate(population_rank = min_rank(desc(population)))
Create a new column to have the rank number based on 'population' values. The bigger the numbers are the higher the ranks are.
filter(min_rank(population) < 10)
Select rows whose rank numbers are less than 10.
Summary
Mutate creates new column(s) with given expressions and preserves existing columns.
Syntax
mutate(<new_column>
= <expression>
, ...)
Example
mutate(profit = revenue - expense)
Create 'profit' column based on the calculation result of 'revenue - expense'.
Summary
Returns the count of the rows for each group.
Syntax
n()
Arguments
Return Value
Numeric
Example
summarize(total = n())
Create a new column for calculating the number of entries for each group.
Summary
Returns the count of unique values.
Syntax
n_distinct(<column(s)>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
summarize(total = n_distinct(TAIL_NUM))
Create a new column for calculating the count of unique values of TAIL_NUM for each group.
summarize(total = n_distinct(STATE, CITY))
Create a new column for calculating the count of unique combinations of STATE and CITY for each group.
Summary
Returns the nth value of all the values.
Syntax
nth(<column>
, n = <number>
, order_by = <column(s)>
, default = <text>
)
Arguments
Example
summarize(total = nth(name, 5))
Returns the fifth value for each group.
summarize(total = nth(name, -5))
Returns the fifth value from the end for each group.
Summary
Divides data into a specified number (n) of "equal frequency" groups (with approximately the same number of rows in each group) and returns an integer value (from 1 to n) indicating which group each row belongs to.
Note that this "equal frequency division" distributes rows evenly across groups, which differs from equal width (same value range) division.
Syntax
ntile(<column>
, <number>
)
Arguments
Return Value
Numeric
Example
ntile(sales_amount, 4)
Divides "sales_amount" into 4 equal frequency groups. This splits the data so each group contains the same number of rows. For example, if there are 8 rows total, each group will contain 2 rows.
| ID | sales_amount | quartile_group |
|----|-------------|---------------|
| 1 | 10000 | 1 |
| 2 | 15000 | 1 |
| 3 | 20000 | 2 |
| 4 | 25000 | 2 |
| 5 | 30000 | 3 |
| 6 | 35000 | 3 |
| 7 | 40000 | 4 |
| 8 | 45000 | 4 |
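The quartile_group column in the table above would typically be created with mutate(). A sketch using the same column names:
mutate(quartile_group = ntile(sales_amount, 4))
Create a 'quartile_group' column assigning each row to one of 4 equal frequency groups.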
Summary
Ranks rescaled to [0, 1]. Missing values are left as is.
Syntax
percent_rank(<column_num>
)
Return Value
Numeric
Example
mutate(population_rank = percent_rank(population))
Create a new column to have the percent rank number based on 'population' values.
filter(percent_rank(population) < .25)
Select rows whose percent rank numbers are less than .25.
Summary
Rename existing column names.
Syntax
rename(<new_column_name>
= <column>
, ...)
Example
rename(state = X1)
Rename X1 to 'state'.
rename(state = X1, country = X2)
Rename X1 to 'state' and X2 to 'country'.
Overview
Performs a right outer join. Returns all rows from the target dataframe (right side) and all columns from both the current dataframe (left side) and the target dataframe. When a row in the right dataframe has no matching row in the left dataframe, the columns added from the left side will contain NA values. If there are multiple matches, all combinations will be returned.
Syntax
right_join(<target_dataframe>
, by = join_by(<join_condition>
), suffix = c(<source_dataframe_suffix>
, <target_dataframe_suffix>
), na_matches = <NA_handling_method>
, ignore_case = <TRUE|FALSE>
)
Arguments
Complex join conditions with join_by
Examples
Source data: | customer_id | customer_name | order_amount | |-------------|---------------|--------------| | 101 | Smith | 5000 | | 102 | Johnson | 3000 | | 103 | Brown | 7000 |
Customer attributes data: | customer_id | region | age | |-------------|-------------|-----| | 101 | New York | 35 | | 102 | California | 42 | | 104 | Texas | 28 |
right_join(customer_attributes, by = join_by(customer_id == customer_id))
Keeps all rows from the target dataframe (customer_attributes) and joins with the source dataframe using the "customer_id" column as the key. For customer_id=104 which has no match in the source dataframe, customer_name and order_amount columns will contain NA values.
customer_id | customer_name | order_amount | region | age |
---|---|---|---|---|
101 | Smith | 5000 | New York | 35 |
102 | Johnson | 3000 | California | 42 |
104 | NA | NA | Texas | 28 |
Source data (employees): | employee_id | job_code | name | |-------------|----------|----------| | 1 | A01 | Thompson | | 2 | A02 | Wilson | | 3 | NA | Davis |
Job data: | job_code | job_title | |----------|--------------| | A01 | Sales | | A02 | Development | | A03 | Finance | | NA | Unassigned |
right_join(job_data, by = join_by(job_code == job_code), na_matches = "na")
Keeps all rows from the target dataframe (job_data) and joins with the source dataframe using the "job_code" column as the key. NA values are also used as join keys since na_matches="na". For job_code=A03 which has no match in the source dataframe, employee_id and name columns will contain NA values.
employee_id | job_code | name | job_title |
---|---|---|---|
1 | A01 | Thompson | Sales |
2 | A02 | Wilson | Development |
NA | A03 | NA | Finance |
3 | NA | Davis | Unassigned |
Source data (user data): | user_id | name | |---------|-------| | 1 | john | | 2 | MARY | | 3 | David |
Profile data: | user_id | name | age | |---------|-------|-----| | 1 | John | 25 | | 2 | mary | 30 | | 4 | Steve | 35 |
right_join(profile_data, by = join_by(name == name), ignore_case = TRUE)
Keeps all rows from the target dataframe (profile_data) and joins with the source dataframe using the "name" column as the key, ignoring case differences. For name=Steve (user_id=4) which has no match in the source dataframe, the source user_id column will contain an NA value.
user_id.x | name | user_id.y | age |
---|---|---|---|
1 | john | 1 | 25 |
2 | MARY | 2 | 30 |
NA | Steve | 4 | 35 |
Source data (product data): | product_id | product_name | price | |------------|--------------|-------| | P001 | Laptop | 80000 | | P002 | Mouse | 3000 | | P003 | Keyboard | 5000 |
Sale period data: | product_id | sale_start_date | sale_end_date | discount_rate | |------------|-----------------|---------------|---------------| | P001 | 2023-06-01 | 2023-06-15 | 10 | | P001 | 2023-07-01 | 2023-07-15 | 15 | | P002 | 2023-06-10 | 2023-06-20 | 20 | | P004 | 2023-06-01 | 2023-06-30 | 25 |
right_join(sale_period_data, by = join_by(product_id == product_id, between(as.Date("2023-06-05"), sale_start_date, sale_end_date)))
Keeps all rows from the target dataframe (sale_period_data) and joins with matching products where "2023-06-05" falls within the sale period. For rows that don't satisfy the condition or have no matching product_id, product_name and price columns will contain NA values.
product_id | product_name | price | sale_start_date | sale_end_date | discount_rate |
---|---|---|---|---|---|
P001 | Laptop | 80000 | 2023-06-01 | 2023-06-15 | 10 |
P001 | NA | NA | 2023-07-01 | 2023-07-15 | 15 |
P002 | NA | NA | 2023-06-10 | 2023-06-20 | 20 |
P004 | NA | NA | 2023-06-01 | 2023-06-30 | 25 |
Source data (order data): | order_id | order_date | amount | |----------|------------|--------| | 1001 | 2023-01-15 | 3000 | | 1002 | 2023-02-20 | 5000 | | 1003 | 2023-03-10 | 8000 |
Exchange rate data: | rate_date | usd_jpy_rate | |------------|--------------| | 2023-01-01 | 130.5 | | 2023-02-01 | 132.0 | | 2023-03-01 | 133.5 | | 2023-04-01 | 135.2 |
right_join(exchange_rate_data, by = join_by(closest(order_date >= rate_date)))
Keeps all rows from the target dataframe (exchange_rate_data) and joins with the closest order that is on or after each rate date. For rate_date=2023-04-01 which has no matching order date, order_id and amount columns will contain NA values.
order_id | order_date | amount | rate_date | usd_jpy_rate |
---|---|---|---|---|
1001 | 2023-01-15 | 3000 | 2023-01-01 | 130.5 |
1002 | 2023-02-20 | 5000 | 2023-02-01 | 132.0 |
1003 | 2023-03-10 | 8000 | 2023-03-01 | 133.5 |
NA | NA | NA | 2023-04-01 | 135.2 |
Source data (meeting schedule): | meeting_id | start_time | end_time | room | |------------|----------------------|----------------------|----------| | M001 | 2023-06-01 10:00 | 2023-06-01 11:30 | Room A | | M002 | 2023-06-01 13:00 | 2023-06-01 14:00 | Room B | | M003 | 2023-06-02 09:30 | 2023-06-02 12:00 | Room A |
Reservation data: | reservation_id | start_time | end_time | reserved_by | |----------------|----------------------|----------------------|-------------| | R001 | 2023-06-01 09:00 | 2023-06-01 10:30 | Smith | | R002 | 2023-06-01 11:00 | 2023-06-01 12:00 | Johnson | | R003 | 2023-06-01 13:30 | 2023-06-01 15:00 | Brown | | R004 | 2023-06-02 10:00 | 2023-06-02 11:00 | Wilson | | R005 | 2023-06-03 10:00 | 2023-06-03 12:00 | Taylor |
right_join(reservation_data, by = join_by(overlaps(start_time, end_time, start_time, end_time)))
Keeps all rows from the target dataframe (reservation_data) and joins with meetings that overlap with reservation times. For reservation_id=R005 which has no overlapping meeting, meeting_id and room columns will contain NA values.
meeting_id | start_time | end_time | room | reservation_id | reserved_by |
---|---|---|---|---|---|
M001 | 2023-06-01 10:00 | 2023-06-01 11:30 | Room A | R001 | Smith |
M001 | 2023-06-01 10:00 | 2023-06-01 11:30 | Room A | R002 | Johnson |
M002 | 2023-06-01 13:00 | 2023-06-01 14:00 | Room B | R003 | Brown |
M003 | 2023-06-02 09:30 | 2023-06-02 12:00 | Room A | R004 | Wilson |
NA | NA | NA | NA | R005 | Taylor |
Source data (store hours): | store_id | opening_time | closing_time | |----------|--------------|--------------| | S001 | 09:00 | 21:00 | | S002 | 10:00 | 22:00 | | S003 | 08:30 | 20:00 |
Event data: | event_id | start_time | end_time | description | |----------|------------|----------|---------------| | E001 | 08:00 | 10:00 | Morning Mtg | | E002 | 11:00 | 13:00 | Seminar | | E003 | 15:00 | 19:00 | Exhibition | | E004 | 19:30 | 22:30 | Party |
right_join(event_data, by = join_by(within(opening_time, closing_time, start_time, end_time)))
Keeps all rows from the target dataframe (event_data) and joins with stores where events completely fit within store hours. For events that don't fit within any store's hours (e.g., E001 and E004), store_id, opening_time, and closing_time columns will contain NA values.
store_id | opening_time | closing_time | event_id | start_time | end_time | description |
---|---|---|---|---|---|---|
NA | NA | NA | E001 | 08:00 | 10:00 | Morning Mtg |
S001 | 09:00 | 21:00 | E002 | 11:00 | 13:00 | Seminar |
S002 | 10:00 | 22:00 | E002 | 11:00 | 13:00 | Seminar |
S003 | 08:30 | 20:00 | E002 | 11:00 | 13:00 | Seminar |
S001 | 09:00 | 21:00 | E003 | 15:00 | 19:00 | Exhibition |
S002 | 10:00 | 22:00 | E003 | 15:00 | 19:00 | Exhibition |
S003 | 08:30 | 20:00 | E003 | 15:00 | 19:00 | Exhibition |
NA | NA | NA | E004 | 19:30 | 22:30 | Party |
Summary
Returns row numbers. Equivalent to rank with ties broken by row order.
Syntax
row_number()
Arguments
Return Value
Numeric
Example
mutate(rowid = row_number())
Returns a unique row number for each row.
Summary
Sample a given fraction of rows from the data frame.
Syntax
sample_frac(<number>
, weight=<column_num>
, replace=<logical>
)
Arguments
Example
sample_frac(.2)
Randomly select 20% of the rows.
sample_frac(.2, weight = ARR_DELAY)
Randomly select 20% of the rows, weighted by the ARR_DELAY column.
Summary
Sample n rows from the data frame.
Syntax
sample_n(<number>
, weight=<column_num>
, replace=<logical>
)
Arguments
Example
sample_n(100)
Randomly select 100 rows.
sample_n(100, weight = ARR_DELAY)
Randomly select 100 rows, weighted by the ARR_DELAY column.
Summary
Select column(s) by column names.
Syntax
select(<column(s)>
,...)
select(-<column(s)>
,...)
select(starts_with(<text>
, ignore.case = <logical>
))
select(ends_with(<text>
, ignore.case = <logical>
))
select(contains(<text>
, ignore.case = <logical>
))
select(matches(<text>
, ignore.case = <logical>
))
select(num_range(<text>
, <start_num>:<end_num>
))
select(any_of(<text1>
, <text2>
, ...))
select(<column>
, everything())
Arguments
Used Together
starts_with, ends_with, contains, matches, num_range, any_of, everything
Example
select(X, Y)
Select column X and Y.
select(-X, -Y)
Remove columns X and Y.
select(starts_with("X", ignore.case = TRUE))
Select only columns whose name starts with "X"
select(ends_with("ABR", ignore.case = TRUE))
Select only columns whose name ends with "ABR"
select(contains("ABR", ignore.case = TRUE))
Select only columns whose name contains "ABR"
select(matches("[:digit:]+", ignore.case = TRUE))
Select only columns whose names contain digits (regular expression)
select(num_range("X", 1:6))
Selects all columns (numerically) from X1 to X6.
select(any_of(c("x", "y", "z")))
Selects any of the listed columns that exist, without erroring when some are missing.
select(X, everything())
Reorder columns: keep the column "X" in the front
select_if(is.numeric)
Select only numeric columns.
Overview
Returns only those rows from the current dataframe that have matching values in the target dataframe. Only columns from the current dataframe are returned; columns from the target dataframe are not included. Unlike inner_join, duplicates do not occur even if there are multiple matching rows.
Syntax
semi_join(<target_dataframe>
, by = join_by(<join_condition>
))
Arguments
Complex join conditions with join_by
Examples
Source data: | customer_id | customer_name | order_amount | |-------------|---------------|--------------| | 101 | Smith | 5000 | | 102 | Johnson | 3000 | | 103 | Brown | 7000 |
Customer attributes data: | customer_id | region | age | |-------------|-------------|-----| | 101 | New York | 35 | | 102 | California | 42 | | 104 | Texas | 28 |
semi_join(customer_attributes, by = join_by(customer_id == customer_id))
From the source dataframe, returns only rows where "customer_id" has a match in the target dataframe. Columns from the target dataframe are not included.
customer_id | customer_name | order_amount |
---|---|---|
101 | Smith | 5000 |
102 | Johnson | 3000 |
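To illustrate the note above about duplicates: if the target dataframe hypothetically contained two rows for customer_id=101, the two joins would differ as follows.
semi_join(customer_attributes, by = join_by(customer_id == customer_id))
Still returns a single row for Smith, because semi_join only checks for the existence of a match.
inner_join(customer_attributes, by = join_by(customer_id == customer_id))
Returns two rows for Smith, one for each matching row in the target dataframe.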
Source data (employees): | employee_id | job_code | name | |-------------|----------|----------| | 1 | A01 | Thompson | | 2 | A02 | Wilson | | 3 | B01 | Davis | | 4 | C01 | Brown |
Job data: | job_code | job_title | |----------|--------------| | A01 | Sales | | A02 | Development | | B01 | Finance |
semi_join(job_data, by = join_by(job_code == job_code))
From the source dataframe, returns only rows where "job_code" has a match in the target dataframe. Job code "C01" doesn't exist in the job data, so it's excluded from the results.
employee_id | job_code | name |
---|---|---|
1 | A01 | Thompson |
2 | A02 | Wilson |
3 | B01 | Davis |
Source data (product data): | product_id | product_name | price | |------------|--------------|-------| | P001 | Laptop | 80000 | | P002 | Mouse | 3000 | | P003 | Keyboard | 5000 |
Sale period data: | product_id | sale_start_date | sale_end_date | discount_rate | |------------|-----------------|---------------|---------------| | P001 | 2023-06-01 | 2023-06-15 | 10 | | P001 | 2023-07-01 | 2023-07-15 | 15 | | P002 | 2023-06-10 | 2023-06-20 | 20 | | P004 | 2023-06-01 | 2023-06-30 | 25 |
semi_join(sale_period_data, by = join_by(product_id == product_id, between(as.Date("2023-06-05"), sale_start_date, sale_end_date)))
From the source dataframe, returns only rows where "product_id" matches in the target dataframe and "2023-06-05" falls within the sale period. The period 2023-06-01 to 2023-06-15 for product_id=P001 satisfies this condition.
product_id | product_name | price |
---|---|---|
P001 | Laptop | 80000 |
Source data (order data): | order_id | order_date | amount | |----------|------------|--------| | 1001 | 2023-01-15 | 3000 | | 1002 | 2023-02-20 | 5000 | | 1003 | 2023-03-10 | 8000 | | 1004 | 2023-04-05 | 4000 |
Exchange rate data: | rate_date | usd_jpy_rate | |------------|--------------| | 2023-01-01 | 130.5 | | 2023-02-01 | 132.0 | | 2023-03-01 | 133.5 |
semi_join(exchange_rate_data, by = join_by(closest(order_date >= rate_date)))
From the source dataframe, returns only rows that meet the condition with rows in the target dataframe. Extracts orders that have a closest rate date that is on or before the order date. Order with id=1004 is excluded because there's no corresponding rate date.
order_id | order_date | amount |
---|---|---|
1001 | 2023-01-15 | 3000 |
1002 | 2023-02-20 | 5000 |
1003 | 2023-03-10 | 8000 |
Source data (meeting schedule): | meeting_id | start_time | end_time | room | |------------|----------------------|----------------------|----------| | M001 | 2023-06-01 10:00 | 2023-06-01 11:30 | Room A | | M002 | 2023-06-01 13:00 | 2023-06-01 14:00 | Room B | | M003 | 2023-06-02 09:30 | 2023-06-02 12:00 | Room A | | M004 | 2023-06-03 10:00 | 2023-06-03 11:00 | Room C |
Reservation data: | reservation_id | start_time | end_time | reserved_by | |----------------|----------------------|----------------------|-------------| | R001 | 2023-06-01 09:00 | 2023-06-01 10:30 | Smith | | R002 | 2023-06-01 11:00 | 2023-06-01 12:00 | Johnson | | R003 | 2023-06-01 13:30 | 2023-06-01 15:00 | Brown | | R004 | 2023-06-02 10:00 | 2023-06-02 11:00 | Wilson |
semi_join(reservation_data, by = join_by(overlaps(start_time, end_time, start_time, end_time)))
From the source dataframe, returns only rows that meet the condition with rows in the target dataframe. Extracts meetings that have overlapping times with reservations. Meeting with id=M004 is excluded because it doesn't overlap with any reservation.
meeting_id | start_time | end_time | room |
---|---|---|---|
M001 | 2023-06-01 10:00 | 2023-06-01 11:30 | Room A |
M002 | 2023-06-01 13:00 | 2023-06-01 14:00 | Room B |
M003 | 2023-06-02 09:30 | 2023-06-02 12:00 | Room A |
Source data (store hours): | store_id | opening_time | closing_time | |----------|--------------|--------------| | S001 | 09:00 | 21:00 | | S002 | 10:00 | 22:00 | | S003 | 08:30 | 20:00 | | S004 | 07:00 | 15:00 |
Event data: | event_id | start_time | end_time | description | |----------|------------|----------|---------------| | E001 | 08:00 | 10:00 | Morning Mtg | | E002 | 11:00 | 13:00 | Seminar | | E003 | 15:00 | 19:00 | Exhibition |
semi_join(event_data, by = join_by(within(opening_time, closing_time, start_time, end_time)))
From the source dataframe, returns only rows that meet the condition with rows in the target dataframe. Extracts stores where at least one event fits completely within the store's opening hours; store S004 is excluded in this result.
store_id | opening_time | closing_time |
---|---|---|
S001 | 09:00 | 21:00 |
S002 | 10:00 | 22:00 |
S003 | 08:30 | 20:00 |
Summary
Select rows by position. This is not supported for relational databases, in which case you should use filter() with the row_number() function.
Syntax
slice(<numeric>
)
slice(<numeric>:<numeric>
)
Arguments
Example
slice(1) : equivalent to -> filter(row_number() == 1)
Select only the first row.
slice(5:n()) : equivalent to -> filter(between(row_number(), 5, n()))
Select only rows starting from the 5th to the last.
slice(n()) : equivalent to -> filter(row_number() == n())
Select only the last row
slice((n()-9):n())
Select only the last 10 rows
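On relational databases, where slice() is not supported, the same last-10-rows selection can be written with filter(). A sketch:
filter(row_number() > n() - 10)
Select only the last 10 rows.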
Summary
Summarize column values to a single value by using aggregate functions. Make sure you have already grouped the data frame using group_by().
Syntax
summarize(<new_column>
= <aggregate_expression>
, ...)
Example
summarise(avg = mean(revenue))
Create a new column 'avg' to store average values of 'revenue' for each group.
summarize(avg = mean(revenue), total = sum(revenue))
Create new columns 'avg' for 'average of revenue' and 'total' for 'sum of revenue' for each group.
Summary
Summarize the data by either calling n() or sum() on top of the grouped data. The group_by() operation should have been done beforehand.
Syntax
tally(wt = <column>
, sort = <logical>
)
Arguments
Example
tally()
Returns the number of rows for each group.
tally(wt = ARR_DELAY)
Returns the sum total of 'ARR_DELAY' for each group.
tally(wt = ARR_DELAY, sort = TRUE)
Returns the sum total of 'ARR_DELAY' and sorted by the total number.
Summary
Select the top or bottom N data points within each group based on a specified column.
This function allows you to filter the dataset to keep only the top or bottom N data points for each group according to a given measure.
Syntax
top_n(<number>
, <column_num>
)
top_n(<number>
, desc(<column_num>
))
Arguments
Used Together
desc
Example
Original Data
Country | Population (millions) | GDP (trillion yen) |
---|---|---|
Japan | 126 | 500 |
USA | 331 | 2100 |
China | 1400 | 1500 |
India | 1380 | 300 |
Germany | 83 | 400 |
top_n(3, `Population (millions)`)
Select the top 3 countries with the highest population.
Country | Population (millions) | GDP (trillion yen) |
---|---|---|
China | 1400 | 1500 |
India | 1380 | 300 |
USA | 331 | 2100 |
top_n(-2, `Population (millions)`)
Select the bottom 2 countries with the lowest population. This is equivalent to top_n(2, desc(`Population (millions)`)).
Country | Population (millions) | GDP (trillion yen) |
---|---|---|
Germany | 83 | 400 |
Japan | 126 | 500 |
group_by(Region) %>% top_n(1, `Sales (10,000 yen)`)
Select the store with the highest sales in each region.
Region | Store | Sales (10,000 yen) |
---|---|---|
Tokyo | Store C | 1500 |
Osaka | Store D | 1100 |
Nagoya | Store G | 900 |
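Note that top_n() keeps all tied rows, so more than N rows can be returned when there are ties. In recent dplyr versions, slice_max() is the recommended alternative; a sketch using the same hypothetical columns:
group_by(Region) %>% slice_max(`Sales (10,000 yen)`, n = 1)
Select the store with the highest sales in each region; ties are also kept by default.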
Summary
Transmute adds new columns and drops existing columns.
Syntax
transmute(<new_column>
= <expression>
, ...)
Example
transmute(profit = revenue - expense)
Create 'profit' column based on the calculation result of 'revenue - expense', and drop 'revenue' and 'expense' columns.
Summary
Ungroup existing grouping
Syntax
ungroup()
Example
ungroup()
Ungroup the existing groups.
Summary
Rounds date/time up to the nearest integer value of the specified time unit.
Syntax
ceiling_date(<column_date>
, unit = <unit>
)
Arguments
Return Value
POSIXct
Example
# A <- ymd_hms("2015-10-01 06:10:15")
ceiling_date(A, "second")
returns "2015-10-01 06:10:15 UTC"
ceiling_date(A, "minute")
returns "2015-10-01 06:11:00 UTC"
ceiling_date(A, "hour")
returns "2015-10-01 07:00:00 UTC"
ceiling_date(A, "day")
returns "2015-10-02 UTC"
ceiling_date(A, "week")
returns "2015-10-04 UTC"
ceiling_date(A, "month")
returns "2015-11-01 UTC"
ceiling_date(A, "quarter")
returns "2016-01-01 UTC"
ceiling_date(A, "year")
returns "2016-01-01 UTC"
Summary
Converts numeric data calculated as a fraction of the year to date data type.
Syntax
date_decimal(<column_num>
, tz = <timezone>
)
Arguments
Return Value
POSIXct
Example
date_decimal(2015.953)
Returns "2015-12-14 20:16:47 UTC"
Summary
Extract Day
Syntax
day(<column_date>
)
Return Value
Numeric
Example
day("2015-10-01 06:15:30")
returns 1
Summary
Convert a number to day periods so you can use it for further calculations.
Syntax
days(<column_num>
)
Return Value
Period
Example
days(1)
returns "1d 0H 0M 0S"
ymd_hms("2015-10-01 06:15:30") + days(1)
returns "2015-10-02 06:15:30 UTC"
Summary
Converts date data to numeric data type by calculating the date to a fraction of the year.
Syntax
decimal_date(<column_date>
)
Arguments
Return Value
Numeric
Example
decimal_date(ymd("2015-12-15"))
Returns 2015.953
Summary
Convert Character or Number to Date / Period when data contains only Date, but not Time.
Syntax
dmy(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
Date
Example
dmy("10-01-2015")
returns "2015-10-01"
dmy("10/01/2015")
returns "2015-10-01"
dmy("Created on 10 1 2015")
returns "2015-10-01"
dmy("1 Oct, 2015", locale = "English")
returns "2015-10-01"
Summary
Convert Character or Number to Date / Period when data contains Date and Time.
Syntax
dmy_h(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
POSIXct
Example
dmy_h("01-10-2015 06:10")
returns "2015-10-01 06:10:00 UTC"
dmy_h("Created on 1 10 2015, at 6:10AM")
returns "2015-10-01 06:10:00 UTC"
dmy_h("10-01-2015 06", truncated = 1)
returns "2015-10-01 06:00:00 UTC"
dmy_h("Oct 01 2015 06", locale = "English")
returns "2015-10-01 06:00:00 UTC"
Summary
Convert Character or Number to Date / Period when data contains Date and Time.
Syntax
dmy_hm(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
POSIXct
Example
dmy_hm("01-10-2015 06:10")
returns "2015-10-01 06:10:00 UTC"
dmy_hm("Created on 1 10 2015, at 6:10AM")
returns "2015-10-01 06:10:00 UTC"
dmy_hm("01-10-2015 06", truncated = 1)
returns "2015-10-01 06:00:00 UTC"
dmy_hm("01 Oct 2015 06:00",locale = "English)
returns "2015-10-01 06:00:00 UTC"
Summary
Convert Character or Number to Date / Period when data contains Date and Time.
Syntax
dmy_hms(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
POSIXct
Example
dmy_hms("01-10-2015 06:10:15")
returns "2015-10-01 06:10:15 UTC"
dmy_hms("01/10/2015 06:10:15")
returns "2015-10-01 06:10:15 UTC"
dmy_hms("Created on 1 10 2015, at 06:10:15 AM")
returns "2015-10-01 06:10:15 UTC"
dmy_hms("01-10-2015 06")
returns NA, because there is not enough data for Date and Time parsing.
dmy_hms("01-10-2015 06", truncated = 2)
returns "2015-10-01 06:00:00 UTC" as expected
dmy_hms("01 Oct 2015 06:00:00", locale = "English")
returns "2015-10-01 06:00:00 UTC"
Summary
Create a duration
Syntax
duration(<column_num>
, units = <units>
)
Arguments
Return Value
Duration
Example
duration(1.5, "minutes")
returns 90s
duration(mins = 1.5)
returns 90s
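Because a duration is an exact number of seconds, it can be added directly to date/time values. A sketch:
ymd_hms("2015-10-01 06:15:30") + duration(90, "seconds")
returns "2015-10-01 06:17:00 UTC"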
Summary
Convert Character or Number to Date / Period when data contains only Date, but not Time.
Syntax
dym(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
Date
Example
dym("01-2015-10")
returns "2015-10-01"
dym("01/2015/10")
returns "2015-10-01"
dym("Created on 1 2015 10")
returns "2015-10-01"
dym("1 2015, Oct", locale = "English")
returns "2015-10-01"
Summary
Rounds date/time down to the nearest integer value of the specified time unit.
Syntax
floor_date(<column_date>
, unit = <unit>
)
Arguments
Return Value
POSIXct
Example
# A <- ymd_hms("2015-10-01 06:10:15")
floor_date(A, "second")
returns "2015-10-01 06:10:15 UTC"
floor_date(A, "minute")
returns "2015-10-01 06:10:00 UTC"
floor_date(A, "hour")
returns "2015-10-01 06:00:00 UTC"
floor_date(A, "day")
returns "2015-10-01 UTC"
floor_date(A, "week")
returns "2015-09-27 UTC"
floor_date(A, "month")
returns "2015-10-01 UTC"
floor_date(A, "quarter")
returns "2015-10-01 UTC"
floor_date(A, "year")
returns "2015-01-01 UTC"
Summary
Re-registers the data with a given timezone.
Syntax
force_tz(<column_date>
, tzone = <timezone>
)
Arguments
Return Value
POSIXct
Example
as.POSIXct("2014-01-01 00:15:00", tz = "UTC")
Returns "2014-01-01 00:15:00 UTC".
force_tz(as.POSIXct("2014-01-01 00:15:00", tz = "UTC"), tzone = "America/Los_Angeles")
Returns "2014-01-01 00:15:00 PST". Notice that the timezone is now "PST" instead of the original "UTC", while the clock time is unchanged.
Summary
Get current time in your local timezone
Syntax
here()
Return Value
POSIXct
Example
here()
returns "2015-12-31 00:04:59 PST"
Summary
Convert text to Period data type if a given column holds values that look like Hours and Minutes
Syntax
hm(<column>
, tz = <timezone>
)
Arguments
Return Value
Period
Example
hm("06:10")
returns "6H 10M 0S"
hm("06:10", "UTC")
returns "6H 10M 0S"
Summary
Create hms data. It is a difftime stored as seconds.
Syntax
hms(seconds = <column_num>
, minutes = <column_num>
, hours = <column_num>
, days = <column_num>
)
Arguments
Return Value
hms
Example
hms(seconds = 20, minutes = 1, hours = 5)
returns 20 + 1 * 60 + 5 * 60 * 60 = 18080 seconds difftime
Summary
Extract hour
Syntax
hour(<column_date>
)
Return Value
Integer
Example
hour("2015-10-01 06:15:30")
returns 6
Summary
Convert a number to hour periods so you can use it for further calculations.
Syntax
hours(<column_num>
)
Return Value
Period
Example
hours(1)
returns "1H 0M 0S"
ymd_hms("2015-10-01 06:15:30") + hours(1)
returns "2015-10-01 07:15:30 UTC"
Summary
Return the interval between two dates. You can get the duration from the result either by dividing it by a given unit such as days or weeks, or by converting it to numeric data type with as.numeric().
Syntax
interval(<column_date>
, <column_date>
)
Return Value
Interval
Example
interval("2015-10-01", "2015-10-18")
returns 2015-10-01 UTC--2015-10-18 UTC
interval("2015-10-01", "2015-10-18") %/% days(1)
returns 17
as.numeric(interval("2015-10-01", "2015-10-18"), units = "days")
returns 17
interval("2015-10-01", "2015-10-18") %/% weeks(1)
returns 2
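interval() can also be used inside mutate() on date columns. A minimal sketch, assuming hypothetical 'start_date' and 'end_date' columns:
mutate(days_between = interval(start_date, end_date) %/% days(1))
Create a 'days_between' column holding the number of whole days between the two dates.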
Summary
Returns true if a given object is Date type.
Syntax
is.Date
Arguments
Return Value
Logical
Example
mutate_if(is.Date, wday)
Extract day of the week information from Date columns.
mutate_if(is.Date, funs(week_day = wday))
Extract day of the week information from Date columns and create new columns.
mutate_if(is.Date, wday, label = TRUE)
Extract day of the week information using wday function with 'label = TRUE' argument from Date columns.
Summary
Returns true if a given object is difftime type.
Syntax
is.difftime
Arguments
Return Value
Logical
Example
mutate_if(is.difftime, as.numeric)
Convert difftime type columns to numeric type.
Summary
Returns true if a given object is duration type.
Syntax
is.duration
Arguments
Return Value
Logical
Example
mutate_if(is.duration, as.numeric)
Convert duration type columns to numeric type.
Summary
Returns true if a given object is POSIXct type.
Syntax
is.POSIXct
Arguments
Return Value
Logical
Example
mutate_if(is.POSIXct, wday)
Extract day of the week information from POSIXct columns.
mutate_if(is.POSIXct, funs(week_day = wday))
Extract day of the week information from POSIXct columns and create new columns.
mutate_if(is.POSIXct, wday, label = TRUE)
Extract day of the week information using the wday function with the 'label = TRUE' argument from POSIXct columns.
Summary
Extract week numbers of the year. Weeks start from Monday.
Syntax
isoweek(<column_date>
)
Return Value
Numeric
Example
isoweek(ymd(c("2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04", "2020-01-05", "2020-01-06", "2020-01-07", "2020-01-08")))
returns 1,1,1,1,1,2,2,2 (2020-01-06 is Monday)
Summary
Extract Day of Month
Syntax
mday(<column_date>
)
Return Value
Numeric
Example
mday("2015-10-01 06:15:30")
returns 1
Summary
Convert Character or Number to Date / Period when data contains only Date, but not Time.
Syntax
mdy(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
Date
Example
mdy("01-10-2015")
returns "2015-10-01"
mdy("01/10/2015")
returns "2015-10-01"
mdy("Created on 1 10 2015")
returns "2015-10-01"
mdy("Oct, 1 2015", locale = "English")
returns "2015-10-01"
Summary
Convert Character or Number to Date / Period when data contains Date and Time.
Syntax
mdy_h(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
POSIXct
Example
mdy_h("10-01-2015 06")
returns "2015-10-01 06:00:00 UTC"
mdy_h("Created on 10 1 2015 6 AM")
returns "2015-10-01 06:00:00 UTC"
mdy_h("Oct 1, 2015 06 AM", locale = "English")
returns "2015-10-01 06:00:00 UTC"
Summary
Convert Character or Number to Date / Period when data contains Date and Time.
Syntax
mdy_hm(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
POSIXct
Example
mdy_hm("10-01-2015T06:10")
returns "2015-10-01 06:10:00 UTC"
mdy_hm("Created on 10 1 2015, at 6:10 AM")
returns "2015-10-01 06:10:00 UTC"
mdy_hm("Oct 1, 2015 06:10 AM", locale = "English")
returns "2015-10-01 06:10:00 UTC"
Summary
Convert Character or Number to Date / Period when data contains Date and Time.
Syntax
mdy_hms(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
POSIXct
Example
mdy_hms("10-01-2015T06:10:15")
returns "2015-10-01 06:10:15 UTC"
mdy_hms("10/01/2015T06:10:15")
returns "2015-10-01 06:10:15 UTC"
mdy_hms("Created on 10 1 2015")
returns "2015-10-01 06:10:15 UTC"
mdy_hms("10-01-2015 06")
returns NA, because there is not enough data for Date and Time parsing.
mdy_hms("10-01-2015 06", truncated = 2)
returns "2015-10-01 06:00:00 UTC" as expected
mdy_hms("Oct 1, 2015 06:10:15 AM", locale = "English")
returns "2015-10-01 06:10:15 UTC"
Summary
Convert a number to milliseconds periods so you can use it for further calculations.
Syntax
milliseconds(<column_num>
)
Return Value
Period
Example
milliseconds(1)
returns "0.001S"
ymd_hms("2015-10-01 06:15:30") + milliseconds(1)
returns "2015-10-01 06:15:30 UTC" (the added millisecond is below the default display precision)
Summary
Extract minute
Syntax
minute(<column_date>
)
Return Value
Integer
Example
minute("2015-10-01 06:15:30")
returns 15
Summary
Convert a number to minute periods so you can use it for further calculations.
Syntax
minutes(<column_num>
)
Return Value
Period
Example
minutes(1)
returns "1M 0S"
ymd_hms("2015-10-01 06:15:30") + minutes(1)
returns "2015-10-01 06:16:30 UTC"
Summary
Extract Month
Syntax
month(<column_date>
, label = <logical>
, abbr = <logical>
, locale = <locale>
)
Arguments
Return Value
Numeric or Text
Example
month("2015-10-01 06:15:30")
returns 10
month("2015-10-01 06:15:30", label = TRUE)
returns October
month("2015-10-01 06:15:30", label = TRUE, abbr = TRUE)
returns Oct
Summary
Convert text to Period data type if a given column holds values that look like Minutes and Seconds
Syntax
ms(<column>
, tz = <timezone>
)
Arguments
Return Value
Period
Example
ms("10:15")
returns "10M 15S"
ms("10:15", "UTC")
returns "10M 15S"
Summary
Convert Character or Number to Date / Period when data contains only Date, but not Time.
Syntax
myd(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
Date
Example
myd("01-2015-10")
returns "2015-01-10"
myd("01/2015/10")
returns "2015-01-10"
myd("Created on 1 2015 10")
returns "2015-01-10"
myd("Oct, 2015 1 ", locale = "English")
returns "2015-10-01"
Summary
Returns current date and time. A similar function 'today()' returns only date.
Syntax
now(tzone = <timezone>
)
Arguments
Return Value
POSIXct
Example
now()
returns "2015-12-31 00:04:59 PST"
now("GMT")
returns "2015-12-31 08:04:59 GMT"
now("America/Los_Angeles")
returns "2015-12-31 00:04:59 PST"
Summary
Convert data to Date/Time data type.
Syntax
parse_date_time(<column>
, orders = <date_time_order>
, locale = <locale>
, tz = <timezone>
, truncated = <integer>
, quiet = <logical>
, exact = <logical>
, lt = <logical>
)
Arguments
orders - Specify the order of the date/time components in the original data (e.g. "ymd", "mdy HM").
locale - The default is the session default locale. Sets the date/time related configuration like time zone, encoding, decimal mark, and day/month names.
tz (Optional) - Specify which time zone to parse the date with. You can find a list of time zones.
truncated (Optional) - The default is 0. Number of components that can be missing.
quiet (Optional) - The default is FALSE. If TRUE, it suppresses the error when the format is wrong.
exact (Optional) - The default is FALSE. If TRUE, the orders parameter must be an explicit format (e.g. "%m/%d/%y %I:%M").
lt (Optional) - The default is FALSE. If FALSE, it returns POSIXct. If TRUE, it returns POSIXlt.
Return Value
POSIXct
Example
parse_date_time("03/15/05 1:30 PM", "mydHMp")
Returns "2015-03-05 13:30:00 PDT"
lubridate::parse_date_time("1:30 PM", "HMp")
Returns "0000-01-01 13:30:00 PDT. The default to each parameters are 0 and it always returns POSIXct"
Summary
Extract Day of Quarter
Syntax
qday(<column_date>
)
Return Value
Numeric
Example
qday("2015-10-01 06:15:30")
returns 1, because October 1st is the first day of the fourth quarter
Summary
Extract Quarter
Syntax
quarter(<column_date>
)
Return Value
Numeric
Example
quarter("2015-08-01 06:15:30")
returns 3
quarter("2015-06-16 07:42:29")
returns 2
Summary
Rolls a date back to the last day of the previous month or, optionally, to the first day of the current month.
Syntax
rollback(<column_date>
, roll_to_first = <logical>
, preserve_hms = <logical>
)
Arguments
Return Value
POSIXct
Example
rollback("2015-03-03 06:15:30")
returns "2010-02-28 06:15:30 UTC"
rollback("2015-03-03 06:15:30", roll_to_first = TRUE)
returns "2010-03-01 06:15:30 UTC"
rollback("2015-03-03 06:15:30", preserve_hms = FALSE)
returns "2010-02-28 UTC"
Summary
Rounds date/time to the nearest integer value of the specified time unit.
Syntax
round_date(<column_date>
, unit = <unit>
)
Arguments
Return Value
POSIXct
Example
round_date("2015-10-01 06:10:15 UTC", "second")
returns "2015-10-01 06:10:15 UTC"
round_date("2015-10-01 06:10:15 UTC", "minute")
returns "2015-10-01 06:10:00 UTC"
round_date("2015-10-01 06:10:15 UTC", "hour")
returns "2015-10-01 06:00:00 UTC"
round_date("2015-10-01 06:10:15 UTC", "day")
returns "2015-10-01 UTC"
round_date("2015-10-01 06:10:15 UTC", "week")
returns "2015-10-04 UTC"
round_date("2015-10-01 06:10:15 UTC", "month")
returns "2015-10-01 UTC"
round_date("2015-10-01 06:10:15 UTC", "quarter")
returns "2015-10-01 UTC"
round_date("2015-10-01 06:10:15 UTC", "year")
returns "2015-01-01 UTC"
Summary
Extract second
Syntax
second(<column_date>
)
Return Value
Integer
Example
second("2015-10-01 06:15:30")
returns 30
Summary
Convert a number to second periods so you can use it for further calculations.
Syntax
seconds(<column_num>
)
Return Value
Period
Example
seconds(1)
returns "1S"
ymd_hms("2015-10-01 06:15:30") + seconds(1)
returns "2015-10-01 06:15:31 UTC"
Summary
Calculate the exact time length between two dates.
Syntax
time_length(<interval_or_duration>
, unit = <unit>
)
Arguments
Return Value
Numeric
Example
time_length(interval("2015-10-01", "2015-10-18"), "day")
returns 17
time_length(interval("2015-10-01", "2015-10-18"), "month")
returns 0.5483871
trunc(time_length(interval("2015-10-01", "2015-10-18"), "month"))
returns 0
round(time_length(interval("2015-10-01", "2015-10-18"), "month"))
returns 1
Summary
Returns current date without time. A similar function 'now()' returns date including time.
Syntax
today(tzone = <timezone>
)
Arguments
Return Value
Date
Example
today("GMT")
returns "2015-12-31"
today("America/Los_Angeles")
returns "2015-12-31"
Summary
Extract day of the week as a decimal number (01-07, Sunday is 1) or as text (e.g. Sunday). The text can be either full or abbreviated, e.g. Sunday vs. Sun.
Syntax
wday(<column_date>
, label = <logical>
, abbr = <logical>
, week_start = <number>
, locale = <locale>
)
Arguments
Return Value
Numeric or Text
Example
wday("2015-10-01 06:15:30")
returns 5
# A <- ymd_hms("2015-10-01 06:15:30")
wday(A, label = TRUE)
returns Thu
wday(A, label = TRUE, abbr = FALSE)
returns Thursday
Summary
Extract week numbers of the year. The 1st day of the 1st week always starts from January 1st regardless of the day of the week.
Syntax
week(<column_date>
)
Return Value
Numeric
Example
week(ymd(c("2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04", "2020-01-05", "2020-01-06", "2020-01-07", "2020-01-08")))
returns 1,1,1,1,1,1,1,2
Summary
Convert a number to week periods so you can use it for further calculations.
Syntax
weeks(<column_num>
)
Return Value
Period
Example
weeks(1)
returns "7d 0H 0M 0S"
ymd_hms("2015-10-01 06:15:30") + weeks(1)
returns "2015-10-08 06:15:30 UTC"
Summary
Returns Date / Time in a given time zone
Syntax
with_tz(<column_date>
, tzone = <timezone>
)
Arguments
Return Value
POSIXct
Example
with_tz(as.POSIXct("2014-01-01 00:15:00", tz = "UTC"), tzone = "America/Los_Angeles")
Returns "2013-12-31 16:15:00 PST". The original data is registered in UTC, and with_tz() function is returning it in Pacific Timezone.
with_tz(as.POSIXct("2014-01-01 00:15:00", tz = "UTC"))
Returns "2013-12-31 16:15:00 PST". with_tz() function returns the value in the default timezone (e.g. "America/Los_Angeles").
Summary
Extract Day of Year
Syntax
yday(<column_date>
)
Return Value
Numeric
Example
yday("2015-10-01 06:15:30")
returns 274
Summary
Convert Character or Number to Date / Period when data contains only Date, but not Time.
Syntax
ydm(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
Date
Example
ydm("2015-10-01")
returns "2015-01-10"
ydm("2015/10/01")
returns "2015-01-10"
ydm("Created on 2015 10 1")
returns "2015-01-10"
ydm("2015, 1 Oct", locale = "English")
returns "2015-10-01"
Summary
Convert Character or Number to Date / Period when data contains Date and Time.
Syntax
ydm_h(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
POSIXct
Example
ydm_h("2015-01-10T06")
returns "2015-10-01 06:00:00 UTC"
ydm_h("2015/01/10 06")
returns "2015-10-01 06:00:00 UTC"
ydm_h("Created on 15-01-10 at 6 AM")
returns "2015-10-01 06:00:00 UTC"
ydm_h("2015 01 Oct, at 6 AM", locale = "English")
returns "2015-10-01 06:00:00 UTC"
Summary
Convert Character or Number to Date / Period when data contains Date and Time.
Syntax
ydm_hm(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
POSIXct
Example
ydm_hm("2015-01-10 06:10")
returns "2015-10-01 06:10:00 UTC"
ydm_hm("2015/01/10 06:00 UTC")
returns "2015-10-01 06:10:15 UTC"
ydm_hm("Created on 15-01-10 at 06:10 AM")
returns "2015-10-01 06:10:00 UTC"
ydm_hm("2015 01 Oct, at 6:10 AM", locale = "English")
returns "2015-10-01 06:10:00 UTC"
Summary
Convert Character or Number to Date / Period when data contains Date and Time.
Syntax
ydm_hms(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
POSIXct
Example
ydm_hms("2015-01-10T06:10:15")
returns "2015-10-01 06:10:15 UTC"
ydm_hms("2015/01/10 06:10:15 UTC")
returns "2015-10-01 06:10:15 UTC"
ydm_hms("Created on 15-01-10 at 06:10:15 AM")
returns "2015-10-01 06:10:15 UTC"
ydm_hms("2015-01-10 06")
returns NA, because there is not enough data for Date and Time parsing.
ydm_hms("2015-01-10 06", truncated = 2)
returns "2015-10-01 06:00:00 UTC" as expected
ydm_hms("2015 01 Oct, at 6:10:15 AM", locale = "English")
returns "2015-10-01 06:10:15 UTC"
Summary
Extract Year
Syntax
year(<column_date>
)
Return Value
Numeric
Example
year("2015-10-01 06:15:30")
returns 2015
Summary
Convert a number to year periods so you can use it for further calculations.
Syntax
years(<column_num>
)
Return Value
Period
Example
ymd_hms("2015-10-01 06:15:30") + years(1)
returns "2016-10-01 06:15:30 UTC"
Summary
Convert Character or Number to Date / Period when data contains only Date, but not Time.
Syntax
ymd(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
Date
Example
ymd("2015-10-01")
returns "2015-10-01"
ymd("2015/10/01")
returns "2015-10-01"
ymd("Created on 2015 10 1")
returns "2015-10-01"
ymd("2015, Oct 1", locale = "English")
returns "2015-10-01"
Summary
Convert Character or Number to Date / Period when data contains Date and Time.
Syntax
ymd_h(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
POSIXct
Example
ymd_h("2015-10-01T06")
returns "2015-10-01 06:00:00 UTC"
ymd_h("2015/10/01 06")
returns "2015-10-01 06:00:00 UTC"
ymd_h("Created on 15-10-01 at 6 AM")
returns "2015-10-01 06:00:00 UTC"
ymd_h("2015, Oct 1 6AM", locale = "English")
returns "2015-10-01 06:00:00 UTC"
Summary
Convert Character or Number to Date / Period when data contains Date and Time.
Syntax
ymd_hm(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
POSIXct
Example
ymd_hm("2015-10-01 06:10")
returns "2015-10-01 06:10:00 UTC"
ymd_hm("2015/10/01 06:00 UTC")
returns "2015-10-01 06:10:15 UTC"
ymd_hm("Created on 15-10-01 at 06:10 AM")
returns "2015-10-01 06:10:00 UTC"
ymd_hm("2015, Oct 1 06:10 AM", locale = "English")
returns "2015-10-01 06:10:00 UTC"
Summary
Convert Character or Number to Date / Period when data contains Date and Time.
Syntax
ymd_hms(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
POSIXct
Example
ymd_hms("2015-10-01T06:10:15")
returns "2015-10-01 06:10:15 UTC"
ymd_hms("2015/10/01 06:10:15 UTC")
returns "2015-10-01 06:10:15 UTC"
ymd_hms("Created on 15-10-01 at 06:10:15 AM")
returns "2015-10-01 06:10:15 UTC"
ymd_hms("2015-10-01 06")
returns NA, because there is not enough data for Date and Time parsing.
ymd_hms("2015-10-01 06", truncated = 2)
returns "2015-10-01 06:00:00 UTC" as expected
ymd_hms("2015, Oct 1 06:10:15 AM", locale = "English")
returns "2015-10-01 06:10:15 UTC"
Summary
Computes the correlation of two column values, i.e. the covariance divided by the product of the standard deviations of the two columns.
Syntax
cor(<column_num>
, <column_num>
, use = <cor_na_operation>
, method = <cor_method>
)
Arguments
Example
summarize(correlation = cor(ARR_DELAY, DEP_DELAY))
Return the correlation value for ARR_DELAY and DEP_DELAY columns.
Summary
Computes the covariance of two column values.
Syntax
cov(<column_num>
, <column_num>
, use = <cor_na_operation>
, method = <cor_method>
)
Arguments
Example
summarize(covariance = cov(ARR_DELAY, DEP_DELAY))
Return the covariance value for ARR_DELAY and DEP_DELAY columns.
Summary
Computes the interquartile range of the values.
Syntax
IQR(<column_num_logic_date>
, na.rm = <logical>
, type = 1|2|3|4|5|6|7|8|9
)
Arguments
Return Value
Numeric
Example
# x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
IQR(x)
Returns 4.5
Summary
Returns the median absolute deviation of the values.
Syntax
mad(<column_num_logic_date>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
summarize(mad = mad(revenue))
Create a new column for calculating the median absolute deviation of revenue for each group.
Summary
Returns the numerical median value.
Syntax
median(<column_num_logic_date>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
summarize(revenue_median = median(revenue))
Create a new column for calculating the median value of revenue for each group.
Summary
Returns data augmented with fitted values and residuals.
Syntax
predict(<column_model>
, data = <column_data_set>
, newdata = <data_set>
, type.predict = "link"|"response"
)
Arguments
Example
predict(model) - assumes that the 'model' column contains a fitted model of 'lm', 'glm' or 'kmeans'. If it is lm, the output is like below.
.rownames | ARR_DELAY | DEP_DELAY | .fitted | .se.fit | .resid | .hat | .sigma | .cooksd | .std.resid |
---|---|---|---|---|---|---|---|---|---|
1 | -15 | -9 | -13.934 | 0.0247 | -1.066 | 0.0000035503 | 13.0886 | 7.8503e-9 | -0.0814 |
2 | -28 | -19 | -22.4116 | 0.035 | -5.5884 | 0.0000071333 | 13.0886 | 4.3348e-7 | -0.427 |
Summary
Returns sample quantiles corresponding to the given probabilities. The smallest observation corresponds to a probability of 0 and the largest to a probability of 1.
Syntax
quantile(<column_num_logic_date>
, probs = <percent_number>
, na.rm = <logical>
, [type = 1|2|3|4|5|6|7|8|9
])
Arguments
Return Value
Numeric
Example
# x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
quantile(x, 0.25)
Returns 3.25
quantile(x, .5)
Returns 5.5
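Multiple probabilities can be requested at once by passing a vector to probs. A sketch with the same x:
quantile(x, probs = c(0.25, 0.5, 0.75))
Returns 3.25, 5.5, 7.75.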
Summary
Returns the standard deviation of the values.
Syntax
sd(<column_num_logic_date>
, na.rm = <TRUE|FALSE>
)
Arguments
Return Value
Numeric
Example
summarize(sd = sd(revenue))
Create a new column for calculating the standard deviation of revenue for each group.
Summary
Returns the variance of the values.
Syntax
var(<column_num_logic_date>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
summarize(variance = var(revenue))
Create a new column for calculating the variance of revenue for each group.
Summary
Compares two strings lexicographically and determines their order. The result is returned as a numeric value: -1 (comes before), 0 (equal), or 1 (comes after).
Syntax
stri_compare(<text_column1>
, <text_column2>
, ..., opts_collator = NULL)
Arguments
Example
stri_compare(X1, Y1)
Compares strings in columns X1 and Y1 lexicographically, returning -1 if X1 comes before Y1, 0 if they're equal, and 1 if X1 comes after Y1.
X1 | Y1 | NEW |
---|---|---|
apple | banana | -1 |
banana | banana | 0 |
cat | banana | 1 |
stri_compare(X1, Y1, opts_collator = stri_opts_collator(strength = 2))
Compares strings without distinguishing between uppercase and lowercase.
X1 | Y1 | NEW |
---|---|---|
APPLE | apple | 0 |
banana | BANANA | 0 |
Cat | cat | 0 |
Summary
Converts a date-time column to a string based on the specified format.
Syntax
stri_datetime_format(<date_time_column>
, format = <format>
, tz = <timezone>
, locale = <locale>
)
Arguments
Example
stri_datetime_format(X1, format="MM/dd/uuuu")
Converts values in date-time column X1 to strings in month/day/year format.
X1 | NEW |
---|---|
2024-01-01 09:00:00 | 01/01/2024 |
2024-01-02 14:30:00 | 01/02/2024 |
2024-01-03 18:15:00 | 01/03/2024 |
Summary
Converts non-ASCII characters (such as Japanese multibyte characters or special symbols) in a string to Unicode escape sequences (\uXXXX format). ASCII characters (alphanumeric characters and basic symbols) are preserved as is.
Syntax
stri_escape_unicode(<string_column>
)
Arguments
Return Value
Character
Example
stri_escape_unicode(X1)
Converts non-ASCII characters in the strings in column X1 to Unicode escape sequences.
X1 | NEW |
---|---|
Hello | Hello |
日本語 | \u65E5\u672C\u8A9E |
♥ | \u2665 |
Summary
Returns the number of bytes required to store text data. In UTF-8 encoding, one character can be represented by 1-4 bytes, so the number of characters and the number of bytes may differ.
Syntax
stri_numbytes(<string>
)
Arguments
Example
stri_numbytes(X1)
Returns the number of bytes needed to store the strings in column X1.
X1 | NEW |
---|---|
hello | 5 |
こんにちは | 15 |
123 | 3 |
Summary
Generates random strings of specified length. Various character types can be used for generation, including ASCII characters and Unicode characters.
Syntax
stri_rand_strings(n = <number_of_strings>
, length = <length_of_each_string>
, pattern = <character_pattern>
)
Arguments
Example
stri_rand_strings(X1, length=5, pattern="[A-Z]")
Generates random strings of uppercase letters with a length of 5 characters based on the values in column X1.
X1 | NEW |
---|---|
1 | KXMPQ |
2 | RNWVY |
3 | LBFTH |
stri_rand_strings(X1, length=8, pattern="[0-9]")
Generates random strings of digits with a length of 8 characters based on the values in column X1.
X1 | NEW |
---|---|
1 | 47295836 |
2 | 91634507 |
3 | 25803914 |
stri_rand_strings(X1, length=6, pattern="[A-Za-z0-9]")
Generates random alphanumeric strings with a length of 6 characters based on the values in column X1.
X1 | NEW |
---|---|
1 | kR9nP2 |
2 | mX4vL8 |
3 | tB7hJ5 |
Summary
Modifier functions control the meaning of the pattern argument to stringr functions.
Syntax
regex(pattern, ignore_case = <logical>
, multiline = <logical>
, comments = <logical>
, dotall = <logical>
, locale = <character>
)
Arguments
Return Value
A stringr modifier object, i.e. a character vector with parent S3 class stringr_pattern.
Example
regex("Capital", ignore_case = TRUE)
Summary
Concatenates text values from multiple columns.
Syntax
str_c(<column(s)>
, sep = <text>
, collapse = <text>
)
Arguments
Return Value
Character
Example
str_c(CARRIER, TAIL_NUM)
Concatenate CARRIER and TAIL_NUM values.
str_c(CARRIER, TAIL_NUM, sep = "-")
Use dash as a separator to concatenate CARRIER and TAIL_NUM values.
summarize(combined_text = str_c(CARRIER, collapse = ","))
Use comma ',' as a separator to combine all the CARRIER names into one text for each group.
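A minimal sketch assuming stringr, contrasting sep (element-wise joining) with collapse (combining the whole vector into one string); the values are illustrative:
library(stringr)
carriers <- c("AA", "DL", "UA")
str_c(carriers, "01", sep = "-")   # element-wise: "AA-01" "DL-01" "UA-01"
str_c(carriers, collapse = ", ")   # single string: "AA, DL, UA"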
Summary
Convert text to a specified encoding, e.g. UTF-8.
Syntax
str_conv(<column_text>
, <encoding>
)
Arguments
Return Value
Character
Example
str_conv("Exploratory", "UTF-8")
Summary
Returns the number of letters or words in text data. You can also set a pattern to count only the matching text.
Syntax
str_count(<column_text>
, pattern = [<regular_expression>]
)
str_count(<column_text>
, pattern = [boundary("word")]
)
Arguments
Return Value
Numeric
Example
str_count("I'm traveling to New York on 1/1.")
returns 33, including punctuation and spaces
str_count("I'm traveling to New York on 1/1.", "New")
returns 1 by matching text 'New'
str_count("I'm traveling to New York on 1/1.", "[:alpha:]")
returns 22 counting only alphabet letters
str_count("I'm traveling to New York on 1/1.", "[:digit:]")
returns 2 counting only digit letters
str_count("I'm traveling to New York.", boundary("word"))
returns 5 counting as words
str_count("I'm traveling to New York on 1/1.", ".")
returns 33 because the regular expression dot matches any character
str_count("I'm traveling to New York on 1/1.", fixed("."))
returns 1 because fixed() searches for the dot as a literal character
Summary
Return TRUE or FALSE based on whether Text data contains a given text or not
Syntax
str_detect(<column_text>
, <pattern>
)
Arguments
Return Value
Logical
Example
str_detect("New York", "New")
returns TRUE because there is 'New' in the data.
str_detect("New York on 1/1.", "[:alpha:]")
returns TRUE because there are alphabet letters in the data.
str_detect("New York on 1/1.", "[:digit:]")
returns TRUE because there are numeric (digit) letters in the data.
str_detect("New York on 1/1.", "^New")
returns TRUE because there is "New" text at the beginning of the data.
str_detect("New York on 1/1.", "New$")
returns FALSE because there is no "New" text at the end of the data.
str_detect("New York on 1/1.", "N.w")
returns TRUE because 'New' starts with 'N' and ends with 'w'
str_detect("New York on 1/1.", fixed("N.w"))
returns FALSE because 'N.w' exact matching letters don't exist
str_detect("New York on 1/1.", c("New", "Old"))
returns TRUE for "New" and FALSE for "Old" because each pattern is checked against the text element-wise
Summary
Repeat text values.
Syntax
str_dup(<column_text>
, times = <number>
)
Arguments
Return Value
Character
Example
str_dup("wow", 2)
returns "wowwow".
Summary
Extracts the first substring that matches given letters or patterns
Syntax
str_extract(<column_text>
, <pattern>
)
Arguments
Return Value
Character
Example
str_extract(X1, "[A-Za-z]+")
Extracts the first sequence of alphabetic characters from column X1.
X1 | NEW |
---|---|
ABC123 | ABC |
123DEF | DEF |
A1B2C3 | A |
str_extract(X1, "[:alpha:]+")
Extracts the first sequence of alphabetic characters from column X1.
X1 | NEW |
---|---|
ABC123 | ABC |
123DEF | DEF |
A1B2C3 | A |
str_extract(X1, "[0-9]+")
Extracts the first sequence of digits from column X1.
X1 | NEW |
---|---|
ABC123 | 123 |
456DEF | 456 |
A1B2C3 | 1 |
str_extract(X1, "[:digit:]+")
Extracts the first sequence of digits from column X1.
X1 | NEW |
---|---|
ABC123 | 123 |
456DEF | 456 |
A1B2C3 | 1 |
Summary
Extracts all substrings that match given letters or patterns. Can be combined with the list_to_text function to convert the extracted strings into one concatenated text.
Syntax
str_extract_all(<column_text>
, <pattern>
)
Arguments
Return Value
List
Example
list_to_text(str_extract_all(X1, "[:alpha:]+"))
Extracts all letters from column X1.
X1 | NEW |
---|---|
Hello world | Hello, world |
ABC123DEF | ABC, DEF |
R2D2C3PO | R, D, C, PO |
list_to_text(str_extract_all(X1, "[a-zA-Z]+"))
Extracts all alphabetic sequences from column X1.
X1 | NEW |
---|---|
Hello123World | Hello, World |
R2D2C3PO | R, D, C, PO |
Data_Science_2023 | Data, Science |
list_to_text(str_extract_all(X1, "[:digit:]+"))
Extracts all numbers from column X1.
X1 | NEW |
---|---|
ABC123DEF456 | 123, 456 |
Phone: 555-1234 | 555, 1234 |
Year2023Month01 | 2023, 01 |
list_to_text(str_extract_all(X1, "[a-zA-Z0-9]+"))
Extracts alphabetic characters and numbers from column X1.
X1 | NEW |
---|---|
Hello_123_World! | Hello,123,World |
R2D2_C3PO | R2D2,C3PO |
Data-Science-2023 | Data,Science,2023 |
list_to_text(str_extract_all(X1, "[:alnum:]+"))
Extracts all characters and numbers from column X1.
X1 | NEW |
---|---|
Hello_123_World! | Hello, 123, World |
TYO-Tokyo-2023 | TYO, Tokyo, 2023 |
Data@Science#2023 | Data, Science, 2023 |
Summary
Returns the number of characters in text data.
Syntax
str_length(<column_text>
)
Return Value
Numeric
Example
str_length("I'm traveling to New York.")
Returns 26, including punctuation and spaces
Summary
Adds white spaces (or a given pad text) to text values to make the total length exactly a specified number.
Syntax
str_pad(<column_text>
, width = <number>
, side = "left"|"right"|"both"
, pad = <text>
)
Arguments
Return Value
Character
Example
str_pad("Exploratory", 3)
returns "Exploratory" because 3 is less than number of the original given text.
str_pad("Exploratory", 30, "left")
returns " Exploratory" to make the length of the text to be 30 by adding white spaces.
Summary
Replaces the first match of given letters or patterns with a replacement text
Syntax
str_replace(<column_text>
, <pattern>
, <replacement>
)
Arguments
Return Value
Character
Example
str_replace("I am traveling to New York!!", "York", "Jersey")
returns "I am traveling to New Jersey"
str_replace("I am traveling to New York!!", "[:punct:]", "")
returns "I am traveling to New York!" replacing only the first matching punctuation. You want to use str_replace_all() to replace them all.
str_replace("New York New", "^New", "Old")
returns "Old York New" by finding "New" at the beginning and replace it with "Old"
str_replace("New York New.", "New$", "Old")
returns "New York Old" by finding "New" at the end and replace it with "Old"
str_replace("New York", str_sub("New York", 4, 8), "")
returns "New".
Summary
Replaces all matches of given letters or patterns with a replacement text
Syntax
str_replace_all(<column_text>
, <pattern>
, <replacement>
)
Arguments
Return Value
Character
Example
str_replace_all("I am traveling to New York!!", "[:punct:]", "")
returns "I am traveling to New York"
Summary
Replaces NA values with a specific text
Syntax
str_replace_na(<column_text>
, <replacement>
)
Arguments
Return Value
Character
Example
str_replace_na(ARR_DELAY, "Not Available")
Returns "Not Available" for NA in ARR_DELAY column.
Summary
Splits a given text into multiple pieces by a given separator. It returns a list, so you may want to unnest the newly created column with the unnest() function.
Syntax
str_split(<column_text>
, pattern = <pattern>
, n = <numeric>
)
Arguments
Return Value
List
Example
str_split("a, b, c", ",")
returns "a" " b" " c" in a list format.
str_split("a, b, c", ",", 2)
returns "a" and " b, c" in a list format.
mutate(y = str_split(text_a, ",")) %>% unnest()
First, split a given text column; this will create a list column. Then, the unnest() function will break the list up by giving each element of the list its own row.
unnest(y = str_split(text_a, ","))
This is equivalent to the above command.
Summary
Extracts a substring from text values based on position
Syntax
str_sub(<column_text>
, <start>
, <end>
)
Arguments
Return Value
Character
Example
str_sub("Kevin Spacey", start = 7, end = 12)
returns "Spacey". Starts from 7 to 12.
str_sub("Kevin Spacey", start = 7)
returns "Spacey". Starts from 7 to the end.
str_sub(Text, start = -6)
returns "Spacey". Last 6 letters.
Summary
Convert text to lowercase
Syntax
str_to_lower(<column_text>
, locale = <locale>
)
Arguments
Return Value
Character
Example
str_to_lower("CALIFORNIA")
Returns "california"
Summary
Convert text to Title-case
Syntax
str_to_title(<column_text>
, locale = <locale>
)
Arguments
Return Value
Character
Example
str_to_title("CALIFORNIA")
Returns "California"
Summary
Convert text to UPPERCASE
Syntax
str_to_upper(<column_text>
, locale = <locale>
)
Arguments
Return Value
Character
Example
str_to_upper("california")
Returns "CALIFORNIA"
str_to_upper("i", "en")
Setting the English locale returns 'I'
str_to_upper("i", "tr")
Setting the Turkish locale returns 'İ' (capital dotted I)
Summary
Trims (or removes) leading and/or trailing white spaces in text values
Syntax
str_trim(<column_text>
, side = "both"|"left"|"right"
)
Arguments
Return Value
Character
Example
str_trim(" String with trailing and leading white space\t")
returns "String with trailing and leading white space".
str_trim("\n\nString with trailing and leading white space\n\n")
returns "String with trailing and leading white space".
Summary
Extract First / Last Nth Word in Text data
Syntax
word(<column_text>
, start = <number>
, end = <number>
, sep = <text>
)
Arguments
Return Value
Character
Example
word(text_column, 1, sep = " ")
Extracts the first word from the text column using space as separator.
text_column | new_column |
---|---|
I am traveling | I |
Hello World | Hello |
Data Analysis | Data |
word(text_column, 1, 3, sep = " ")
Extracts the 1st to 3rd words from the text column using space as separator.
text_column | new_column |
---|---|
I am traveling to | I am traveling |
This is good | This is good |
A B C D | A B C |
word(text_column, -1, sep = " ")
Extracts the last word from the text column using space as separator.
text_column | new_column |
---|---|
I am traveling | traveling |
Hello World | World |
Data Analysis | Analysis |
word(text_column, -2, -1, sep = " ")
Extracts the second-to-last and last words from the text column using space as separator.
text_column | new_column |
---|---|
I am traveling | am traveling |
Hello World | Hello World |
Big Data Analysis | Data Analysis |
word(text_column, 1, sep = "-")
Extracts the first word from the text column using hyphen as separator.
text_column | new_column |
---|---|
web-development | web |
user-interface-design | user |
back-end-server-configuration | back |
Summary
Complete a data frame by adding missing combinations of data.
Syntax
complete(<column(s)>
, fill = <list>
)
Arguments
Used Together
full_seq, nesting
Example
Sample data:
year | product | type | sales |
---|---|---|---|
2013 | iPhone | 5 | 100 |
2013 | iPad | mini | 50 |
2015 | iPhone | 6 | 200 |
complete(year, product)
Returns all the combinations of year and product values.
year | product | type | sales |
---|---|---|---|
2013 | iPad | mini | 50 |
2013 | iPhone | 5 | 100 |
2015 | iPad | NA | NA |
2015 | iPhone | 6 | 200 |
complete(year, product, fill = list(sales = 0))
Returns all the combinations of year and product values and fill with 0 for NA for sales column.
year | product | type | sales |
---|---|---|---|
2013 | iPad | mini | 50 |
2013 | iPhone | 5 | 100 |
2015 | iPad | NA | 0 |
2015 | iPhone | 6 | 200 |
complete(year = full_seq(year, period = 1), product)
Returns all the combinations of year and product values plus fill missing years (2014).
year | product | type | sales |
---|---|---|---|
2013 | iPad | mini | 50 |
2013 | iPhone | 5 | 100 |
2014 | iPad | NA | NA |
2014 | iPhone | NA | NA |
2015 | iPad | NA | NA |
2015 | iPhone | 6 | 200 |
complete(year, product, type)
Returns all the combinations of year, product, and type values.
year | product | type | sales |
---|---|---|---|
2013 | iPad | 5 | NA |
2013 | iPad | 6 | NA |
2013 | iPad | mini | 50 |
2013 | iPhone | 5 | 100 |
2013 | iPhone | 6 | NA |
2013 | iPhone | mini | NA |
2015 | iPad | 5 | NA |
2015 | iPad | 6 | NA |
2015 | iPad | mini | NA |
2015 | iPhone | 5 | NA |
2015 | iPhone | 6 | 200 |
2015 | iPhone | mini | NA |
complete(year, nesting(product, type))
By using the nesting() function you can restrict the combinations for some columns to those that actually exist in the data.
year | product | type | sales |
---|---|---|---|
2013 | iPad | mini | 50 |
2013 | iPhone | 5 | 100 |
2013 | iPhone | 6 | NA |
2015 | iPad | mini | NA |
2015 | iPhone | 5 | NA |
2015 | iPhone | 6 | 200 |
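A minimal runnable sketch assuming tidyr, rebuilding the sample data above:
library(tidyr)
df <- data.frame(
  year = c(2013, 2013, 2015),
  product = c("iPhone", "iPad", "iPhone"),
  type = c("5", "mini", "6"),
  sales = c(100, 50, 200)
)
# All year x product combinations; missing sales filled with 0
complete(df, year, product, fill = list(sales = 0))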
Summary
Expand a data frame by adding all the combination of given columns.
Syntax
expand(<column(s)>
)
Arguments
Used Together
full_seq, nesting
Example
Sample data:
year | product | type | sales |
---|---|---|---|
2013 | iPhone | 5 | 100 |
2013 | iPad | mini | 50 |
2015 | iPhone | 6 | 200 |
expand(year, product)
Returns all the combinations of year and product values.
year | product |
---|---|
2013 | iPad |
2013 | iPhone |
2015 | iPad |
2015 | iPhone |
expand(year = full_seq(year, period = 1), product)
Returns all the combinations of year and product values plus fill missing years (2014).
year | product |
---|---|
2013 | iPad |
2013 | iPhone |
2014 | iPad |
2014 | iPhone |
2015 | iPad |
2015 | iPhone |
expand(year, product, type)
Returns all the combinations of year, product, and type values.
year | product | type |
---|---|---|
2013 | iPad | 5 |
2013 | iPad | 6 |
2013 | iPad | mini |
2013 | iPhone | 5 |
2013 | iPhone | 6 |
2013 | iPhone | mini |
2015 | iPad | 5 |
2015 | iPad | 6 |
2015 | iPad | mini |
2015 | iPhone | 5 |
2015 | iPhone | 6 |
2015 | iPhone | mini |
expand(year, nesting(product, type))
By using the nesting() function you can fix a set of columns so that only the combinations of values that actually exist in the data are used.
year | product | type |
---|---|---|
2013 | iPad | mini |
2013 | iPhone | 5 |
2013 | iPhone | 6 |
2015 | iPad | mini |
2015 | iPhone | 5 |
2015 | iPhone | 6 |
Summary
Extract only the number from a given text. This is useful for strings that are supposed to be numbers with extra formatting (e.g. $1,200.34, -12%, X1, etc.).
Syntax
extract_numeric(<column>
)
Return Value
Numeric
Example
extract_numeric("$56,500.00")
Returns 56500
extract_numeric("-45%")
Returns -45
mutate(total_num = extract_numeric(total))
Create a new column that holds only the numbers extracted from the original column (the column names here are illustrative).
Summary
Fills missing values using the previous entry. This is especially convenient when values are omitted from some cells to avoid duplicated entries.
Syntax
fill(<column(s)>
, .direction = <fill_direction>
)
Arguments
Example
fill(POPULATION)
Fills using the previous appearing value for POPULATION column.
fill(POPULATION, .direction = "up")
Fills using the next appearing value for POPULATION column.
fill(POPULATION, PRODUCTION)
Fills using the previous appearing value for POPULATION and PRODUCTION columns.
fill(-UNEMPLOYMENT)
Fills using the previous appearing value for all the columns other than UNEMPLOYMENT column.
Summary
Generate a sequence of numbers based on the values in a given column.
Syntax
full_seq(<column_num>
, period = <number>
)
Arguments
Used Together
complete, expand
Example
Sample data:
year | product | type | sales |
---|---|---|---|
2013 | iPhone | 5 | 100 |
2013 | iPad | mini | 50 |
2015 | iPhone | 6 | 200 |
expand(year = full_seq(year, period = 1), product)
Returns all the combinations of year and product values plus fill missing years (2014).
year | product |
---|---|
2013 | iPad |
2013 | iPhone |
2014 | iPad |
2014 | iPhone |
2015 | iPad |
2015 | iPhone |
Without full_seq() function.
expand(year, product)
Returns all the combinations of year and product values.
year | product |
---|---|
2013 | iPad |
2013 | iPhone |
2015 | iPad |
2015 | iPhone |
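full_seq() on its own just fills in the gaps of a numeric sequence; a minimal sketch assuming tidyr:
library(tidyr)
full_seq(c(2013, 2015), period = 1)   # returns 2013 2014 2015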
Summary
Takes multiple columns and collapses into key-value pairs, duplicating all other columns as needed.
Syntax
gather(<new_key_column>
, <new_value_column>
, <column(s)>
, na.rm = <logical>
, convert = <logical>
)
Arguments
Example
gather(COUNTRY, POPULATION, 2:50, na.rm = FALSE)
Takes the 2nd column to the 50th column into two new columns, COUNTRY and POPULATION.
gather(COUNTRY, POPULATION, -YEAR)
Takes all the columns other than YEAR column into two new columns, COUNTRY and POPULATION.
gather(COUNTRY, POPULATION, starts_with("country"))
Takes the columns whose names start with "country" into two new columns, COUNTRY and POPULATION.
Summary
Nest a set of columns together as a list column.
Syntax
nest(<column(s)>
, .key = <text>
)
Arguments
Example
Sample Data
year | product | type | sales |
---|---|---|---|
2013 | iPhone | 5 | 100 |
2013 | iPad | mini | 50 |
2015 | iPhone | 6 | 200 |
nest(product, type, sales, .key = a)
This will create a new column that is a list of data frames and each row of the data frame holds 'product', 'type', 'sales' values.
year | a |
---|---|
2013 | list() |
2015 | list() |
Summary
Fixes multiple columns so that only the combinations of values that actually exist in those columns are used in a complete() or expand() operation.
Syntax
nesting(<column(s)>
)
Arguments
Used Together
complete, expand
Example
Sample data:
year | product | type | sales |
---|---|---|---|
2013 | iPhone | 5 | 100 |
2013 | iPad | mini | 50 |
2015 | iPhone | 6 | 200 |
expand(year, nesting(product, type))
By using the nesting() function you can fix a set of columns so that only the combinations of values that actually exist in the data are used.
year | product | type |
---|---|---|
2013 | iPad | mini |
2013 | iPhone | 5 |
2013 | iPhone | 6 |
2015 | iPad | mini |
2015 | iPhone | 5 |
2015 | iPhone | 6 |
Without nesting():
expand(year, product, type)
Returns all the combinations of year, product, and type values.
year | product | type |
---|---|---|
2013 | iPad | 5 |
2013 | iPad | 6 |
2013 | iPad | mini |
2013 | iPhone | 5 |
2013 | iPhone | 6 |
2013 | iPhone | mini |
2015 | iPad | 5 |
2015 | iPad | 6 |
2015 | iPad | mini |
2015 | iPhone | 5 |
2015 | iPhone | 6 |
2015 | iPhone | mini |
Summary
Separates a column with delimited values into multiple columns.
Syntax
separate(<column_text_date>
, into = c("<new_column_name(s)>
"), sep = <text>
, remove=<logical>
, convert=<logical>
, extra = "warn"|"drop"|"merge"
, fill = "warn"|"right"|"left"
)
Arguments
Example
separate(NAME, into = c("FIRSTNAME", "LASTNAME"), extra="drop")
Separates NAME column into FIRSTNAME and LASTNAME columns by non-alphanumeric values. It will drop the rest if any.
separate(NAME, into = c("FIRSTNAME", "LASTNAME"), remove = FALSE)
Separates NAME column into FIRSTNAME and LASTNAME columns by non-alphanumeric values. It will keep NAME column.
separate(NAME, into = c("FIRSTNAME", "LASTNAME"), extra="merge")
Separates NAME column into FIRSTNAME and LASTNAME columns by non-alphanumeric values. It will merge the rest into the last column.
separate(NAME, into = c("FIRSTNAME", "LASTNAME"), sep = "-")
Separates NAME column into FIRSTNAME and LASTNAME columns by dash "-".
separate(NAME, into = c("FIRSTNAME", "LASTNAME"), sep = 2)
Separates NAME column into FIRSTNAME and LASTNAME by a position of 2.
separate(DATE, into = c("Y", "M", "D"), convert = TRUE)
Separate DATE into Year, Month, and Day columns and set the data type for each column based on the heuristics.
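A minimal runnable sketch assuming tidyr; the NAME values are illustrative:
library(tidyr)
df <- data.frame(NAME = c("John Smith", "Jane Ann Doe"))
# extra = "merge" keeps anything beyond the first split in the last column
separate(df, NAME, into = c("FIRSTNAME", "LASTNAME"), extra = "merge")
# FIRSTNAME: "John", "Jane"   LASTNAME: "Smith", "Ann Doe"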
Summary
Spread a key-value pair across multiple columns. Missing values will be replaced with NA.
Syntax
spread(<key_column>
, <value_column>
, fill = <value>
, convert = <logical>
, drop = <logical>
, sep=<text>
)
Arguments
Example
spread(COUNTRY, POPULATION)
Spread a COUNTRY-POPULATION pair across multiple columns of COUNTRY.
spread(COUNTRY, POPULATION, fill = "NA")
Spread a COUNTRY-POPULATION pair across multiple columns of COUNTRY, and fill with "NA" text for missing values.
Summary
Unite multiple columns together into one column with given uniting characters. It will concatenate values with "_" and remove the original columns by default. Multiple columns specified in the list can be combined together at once.
Syntax
unite(<new_column>
, <column(s)>
, sep = <text>
, remove = <logical>
)
Arguments
Example
unite(NAME, FIRSTNAME, LASTNAME)
Unite FIRSTNAME and LASTNAME into a new column called NAME using _ as a uniting character.
unite(NAME, FIRSTNAME, LASTNAME, sep = "-")
Unite FIRSTNAME and LASTNAME into a new column called NAME using - as a uniting character.
Summary
Unnest a list column or a list of data frames by making each element of the list to be presented in its own row.
Syntax
unnest(<column_list>
, .drop = <logical>
, .id = <text>
, .sep = <text>
)
Arguments
Example
Sample Data with a list column
issue_id | assignee |
---|---|
720 | list(Kan, Kei) |
721 | list(Kan, Marie, Simon) |
unnest() or unnest(assignee)
Unnest 'assignee' column which is a list. This will give each value of the list its own row.
issue_id | assignee |
---|---|
720 | Kan |
720 | Kei |
721 | Kan |
721 | Marie |
721 | Simon |
Sample Data with a list of Data Frames:
issue_id | assignee |
---|---|
720 | list(name = Kan, id = 105, country = US ) |
721 | list(name = Kei, id = 106, country = Japan ) |
unnest()
Unnest 'assignee' column which is a list of data frames. This will create columns for each column of the nested data frame.
issue_id | name | id | country |
---|---|---|---|
720 | Kan | 105 | US |
721 | Kei | 106 | Japan |
unnest(.sep="-")
With the .sep argument, the resulting column names are the original list column name and the nested column name connected by the .sep value.
issue_id | assignee-name | assignee-id | assignee-country |
---|---|---|---|
720 | Kan | 105 | US |
721 | Kei | 106 | Japan |
Sample Data with two list columns
issue_id | assignee | category |
---|---|---|
720 | list(Kan, Kei) | list(Critical, Security) |
721 | list(Kan, Marie, Simon) | list(Critical, Minor Issue) |
unnest(assignee, .drop = FALSE)
Unnest 'assignee' column, which is a list, giving each value of the list its own row. With .drop = FALSE the other list column 'category' is kept.
issue_id | assignee | category |
---|---|---|
720 | Kan | list(Critical, Security) |
720 | Kei | list(Critical, Security) |
721 | Kan | list(Critical, Minor Issue) |
721 | Marie | list(Critical, Minor Issue) |
721 | Simon | list(Critical, Minor Issue) |
Summary
Removes parameters from a given URL.
Syntax
param_remove(<column_text>
, keys = <text>
)
Arguments
Return Value
Character
Example
param_remove(urls = "https://github.com/hideaki/tam/issues?q=is%3Aopen&ticket=open", keys = c("q","ticket"))
Returns "https://github.com/hideaki/tam/issues"
Summary
Decodes a given encoded URL. It assumes the character encoding is UTF-8.
Syntax
url_decode(<column_text>
)
Arguments * column - column that you want to decode.
Return Value
Character
Example
url_decode("http://exploratory.io%3fname%3dflight%20data%20analysis%20%28V1%29")
Returns "http://exploratory.io?name=flight data analysis (V1)"
Summary
Encodes a given URL. It assumes the character encoding is UTF-8.
Syntax
url_encode(<column_text>
)
Arguments * column - column that you want to encode.
Return Value
Character
Example
url_encode("http://exploratory.io?name=flight data analysis (V1)")
Returns "http://exploratory.io%3fname%3dflight%20data%20analysis%20%28V1%29"
Summary
Returns decoded query parameters from url.
Syntax
url_parameters(<column_text>
)
Arguments
Return Value
Character
Example
url_parameters("https://exploratory.io/?debug=%3Ctrue%3E#test")
returns "debug=<true>".
Summary
Returns the rolling (or moving) max value.
Syntax
roll_max(<column_num>
, n = <numeric>
, weights = NULL, by = <numeric>
, fill = <numeric>
, align = <alignment>
, normalize = <logical>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
rolling_max = roll_max(revenue, n = 3, align="center", fill = 0)
Returns a max value among three values starting from one previous and one next value, setting NA to 0.
year | revenue | rolling_max |
---|---|---|
2012 | 100 | 0 |
2013 | 200 | 300 |
2014 | 300 | 400 |
2015 | 400 | 500 |
2016 | 500 | 0 |
rolling_max = roll_max(revenue, n = 3, align="left", fill = 0)
Returns a max value among three values starting from the current value and the next two values, setting NA to 0.
year | revenue | rolling_max |
---|---|---|
2012 | 100 | 300 |
2013 | 200 | 400 |
2014 | 300 | 500 |
2015 | 400 | 0 |
2016 | 500 | 0 |
Summary
Calculate a moving average for a column. By specifying a window size, it is possible to return the moving average over the specified interval.
Syntax
roll_mean(<column_num>
, n = <numeric>
, weights = NULL, by = <numeric>
, fill = <numeric>
, align = <alignment>
, normalize = <logical>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
rolling_mean = roll_mean(revenue, n = 3, align="center", fill = 0)
Returns a mean value among three values starting from one previous and one next value, setting NA to 0.
year | revenue | rolling_mean |
---|---|---|
2012 | 100 | 0 |
2013 | 200 | 200 |
2014 | 300 | 300 |
2015 | 400 | 400 |
2016 | 500 | 0 |
rolling_mean = roll_mean(revenue, n = 3, align="left", fill = 0)
Returns a mean value among three values starting from the current value and the next two values, setting NA to 0.
year | revenue | rolling_mean |
---|---|---|
2012 | 100 | 200 |
2013 | 200 | 300 |
2014 | 300 | 400 |
2015 | 400 | 0 |
2016 | 500 | 0 |
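A minimal sketch reproducing the centered example above, assuming the RcppRoll package, which provides these rolling functions:
library(RcppRoll)
revenue <- c(100, 200, 300, 400, 500)
# Window of 3 centered on each value; positions without a full window get fill = 0
roll_mean(revenue, n = 3, align = "center", fill = 0)
# returns 0 200 300 400 0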
Summary
Returns the rolling (or moving) median value.
Syntax
roll_median(<column_num>
, n = <numeric>
, weights = NULL, by = <numeric>
, fill = <numeric>
, align = <alignment>
, normalize = <logical>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
rolling_median = roll_median(revenue, n = 3, align="center", fill = 0)
Returns a median value among three values starting from one previous and one next value, setting NA to 0.
year | revenue | rolling_median |
---|---|---|
2012 | 100 | 0 |
2013 | 200 | 200 |
2014 | 300 | 300 |
2015 | 400 | 400 |
2016 | 500 | 0 |
rolling_median = roll_median(revenue, n = 3, align="left", fill = 0)
Returns a median value among three values starting from the current value and the next two values, setting NA to 0.
year | revenue | rolling_median |
---|---|---|
2012 | 100 | 200 |
2013 | 200 | 300 |
2014 | 300 | 400 |
2015 | 400 | 0 |
2016 | 500 | 0 |
Summary
Returns the rolling (or moving) min value.
Syntax
roll_min(<column_num>
, n = <numeric>
, weights = NULL, by = <numeric>
, fill = <numeric>
, align = <alignment>
, normalize = <logical>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
rolling_max = roll_min(revenue, n = 3, align="center", fill = 0)
Returns a min value among three values starting from one previous and one next value, setting NA to 0.
year | revenue | rolling_min |
---|---|---|
2012 | 100 | 0 |
2013 | 200 | 100 |
2014 | 300 | 200 |
2015 | 400 | 300 |
2016 | 500 | 0 |
rolling_max = roll_min(revenue, n = 3, align="left", fill = 0)
Returns a min value among three values starting from the current value and the next two values, setting NA to 0.
year | revenue | rolling_min |
---|---|---|
2012 | 100 | 100 |
2013 | 200 | 200 |
2014 | 300 | 300 |
2015 | 400 | 0 |
2016 | 500 | 0 |
Summary
Returns the rolling (or moving) prod value.
Syntax
roll_prod(<column_num>
, n = <numeric>
, weights = NULL, by = <numeric>
, fill = <numeric>
, align = <alignment>
, normalize = <logical>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
rolling_prod = roll_prod(revenue, n = 3, align="center", fill = 0)
Returns a prod value among three values starting from one previous and one next value, setting NA to 0.
year | revenue | rolling_prod |
---|---|---|
2012 | 1 | 0 |
2013 | 2 | 6 |
2014 | 3 | 24 |
2015 | 4 | 60 |
2016 | 5 | 0 |
rolling_prod = roll_prod(revenue, n = 3, align="left", fill = 0)
Returns a prod value among three values starting from the current value and the next two values, setting NA to 0.
year | revenue | rolling_prod |
---|---|---|
2012 | 1 | 6 |
2013 | 2 | 24 |
2014 | 3 | 60 |
2015 | 4 | 0 |
2016 | 5 | 0 |
Summary
Returns the rolling (or moving) standard deviation value.
Syntax
roll_sd(<column_num>
, n = <numeric>
, weights = NULL, by = <numeric>
, fill = <numeric>
, align = <alignment>
, normalize = <logical>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
rolling_sd = roll_sd(revenue, n = 3, align="center", fill = 0)
Returns a standard deviation value among three values starting from one previous and one next value, setting NA to 0.
year | revenue | rolling_sd |
---|---|---|
2012 | 100 | 0 |
2013 | 200 | 100 |
2014 | 300 | 100 |
2015 | 400 | 100 |
2016 | 500 | 0 |
rolling_sd = roll_sd(revenue, n = 3, align="left", fill = 0)
Returns a standard deviation value among three values starting from the current value and the next two values, setting NA to 0.
year | revenue | rolling_sd |
---|---|---|
2012 | 100 | 100 |
2013 | 200 | 100 |
2014 | 300 | 100 |
2015 | 400 | 0 |
2016 | 500 | 0 |
Summary
Returns the rolling (or moving) sum value.
Syntax
roll_sum(<column_num>
, n = <numeric>
, weights = NULL, by = <numeric>
, fill = <numeric>
, align = <alignment>
, normalize = <logical>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
rolling_sum = roll_sum(revenue, n = 3, align="center", fill = 0)
Returns a sum value among three values starting from one previous and one next value, setting NA to 0.
year | revenue | rolling_sum |
---|---|---|
2012 | 100 | 0 |
2013 | 200 | 600 |
2014 | 300 | 900 |
2015 | 400 | 1200 |
2016 | 500 | 0 |
rolling_sum = roll_sum(revenue, n = 3, align="left", fill = 0)
Returns a sum value among three values starting from the current value and the next two values, setting NA to 0.
year | revenue | rolling_sum |
---|---|---|
2012 | 100 | 600 |
2013 | 200 | 900 |
2014 | 300 | 1200 |
2015 | 400 | 0 |
2016 | 500 | 0 |
Summary
Returns the rolling (or moving) variance value.
Syntax
roll_var(<column_num>
, n = <numeric>
, weights = NULL, by = <numeric>
, fill = <numeric>
, align = <alignment>
, normalize = <logical>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
rolling_var = roll_var(revenue, n = 3, align="center", fill = 0)
Returns a variance value among three values starting from one previous and one next value, setting NA to 0.
year | revenue | rolling_var |
---|---|---|
2012 | 100 | 0 |
2013 | 200 | 10000 |
2014 | 300 | 10000 |
2015 | 400 | 10000 |
2016 | 500 | 0 |
rolling_var = roll_var(revenue, n = 3, align="left", fill = 0)
Returns a variance value among three values starting from the current value and the next two values, setting NA to 0.
year | revenue | rolling_var |
---|---|---|
2012 | 100 | 10000 |
2013 | 200 | 10000 |
2014 | 300 | 10000 |
2015 | 400 | 0 |
2016 | 500 | 0 |
Summary
Removes column(s). It can be used along with any column(s) selection operation like select(), gather(), etc.
Syntax
-
Return Value
Example
select(-Country)
Remove Country column.
select(-starts_with("user"))
Remove all the columns whose names start with "user".
gather(Year, Population, -Country)
Gather all the columns except Country column into Year and Population columns.
Summary
Select column(s) with a predicate (conditional) function.
Syntax
select_if(<predicate_condition>
,...)
Arguments
Example
select_if(is.numeric)
Select only numeric data type columns.
select_if(is.character)
Select only character data type columns.
select_if(function(x) is.numeric(x) && mean(x) > 5)
Select only the columns that are numeric data type and their mean values are greater than 5.
Summary
Returns the column names that start with a given text.
Syntax
starts_with(<text>
, ignore.case = <logical>
)
Arguments
Example
select(starts_with("X", ignore.case = TRUE))
Select only columns whose name starts with "X"
Summary
Returns the column names that end with a given text.
Syntax
ends_with(<text>
, ignore.case = <logical>
)
Arguments
Example
select(ends_with("ABR", ignore.case = TRUE))
Select only columns whose name ends with "ABR"
Summary
Returns the column names that contain a given text.
Syntax
contains(<text>
, ignore.case = <logical>
)
Arguments
Example
select(contains("ABR", ignore.case = TRUE))
Select only columns whose name contains "ABR"
Summary
Returns the column names that match a given text or regular expression.
Syntax
matches(<text>
, ignore.case = <logical>
)
Arguments
Example
select(matches("[:digit:]+", ignore.case = TRUE))
Select only columns whose names contain digits (regular expression)
Summary
Returns the column names that start with a given text followed by numbers within a given range. It's useful especially when you have column names like X1, X2, X3, X4, etc.
Syntax
num_range(<text>
, <start_num>
:<end_num>
)
Arguments
Example
select(num_range("X", 1:5))
Selects all columns (numerically) from X1 to X5.
Summary
Returns the column names that are any of the given names.
Syntax
any_of(c(<text1>
, <text2>
, ...))
Arguments
Example
select(any_of("x", "y", "z”))
Selects columns whose names are "x", "y", and "z".
Summary
Returns all the column names. It's useful when you want to have particular column(s) first before everything else.
Syntax
everything()
Arguments
Example
select(X, everything())
Reorder columns: keep the column "X" before all the other columns.
Summary
Returns the columns for which the specified column selection function (e.g. is.numeric) returns TRUE.
Syntax
where(<columns_select_function>
)
Arguments
Example
select(where(is.numeric))
Select all numeric columns.
Summary
This function is used for selecting columns for other functions such as summarize_row.
Syntax
across(<column(s)>)
across(<columns_select_function>)
across(starts_with(<text>, ignore.case = <logical>))
across(ends_with(<text>, ignore.case = <logical>))
across(contains(<text>, ignore.case = <logical>))
across(matches(<text>, ignore.case = <logical>))
across(num_range(<text>, <start_num>:<end_num>))
Arguments
Example
mutate(Total = summarize_row(across(c(Sales_Consumer, Sales_Business, Sales_Government)), sum, na.rm = TRUE))
Sums up values from the specified columns for each row, and stores the sums in the Total column.
mutate(Total = summarize_row(across(where(is.numeric)), sum, na.rm = TRUE))
Sums up values from all numeric columns for each row, and stores the sums in the Total column.
Summary
Apply functions to all the columns.
Syntax
mutate_all(funs(<function(s)>
), ...)
Arguments
Example
mutate_all(funs(. * 0.5))
Multiply all the column values by 0.5 and override the existing columns.
mutate_all(funs(half = . * 0.5))
Multiply all the column values by 0.5 and create new columns whose names will end with "_half".
mutate_all(funs(min_rank, dense_rank))
Apply min_rank and dense_rank functions to all the column values and create new columns whose names will end with the function name (min_rank, dense_rank).
Summary
Apply functions to specified columns.
Syntax
mutate_at(vars(<column(s)>
), funs(<function(s)>
), ...)
Arguments
Example
mutate_at(vars(ends_with("DELAY")), min_rank)
Apply the 'min_rank' function to the columns whose names end with "DELAY".
mutate_at(vars(ends_with("DELAY")), funs(. * .5))
Multiply the columns whose names end with "DELAY" by 0.5.
mutate_at(c("ARR_DELAY", "DEP_DELAY"), min_rank)
Apply 'min_rank' function to "ARR_DELAY" and "DEP_DELAY" columns.
mutate_at(c(1,3), min_rank)
Apply 'min_rank' function to 1st and 3rd columns.
Summary
Apply functions to only the columns that match with a given condition.
Syntax
mutate_if(<predicate_condition>
, funs(<function(s)>
), ...)
Arguments
Example
mutate_if(is_numeric, mean)
Apply 'mean' function to all the columns that are numeric data type.
mutate_if(is.numeric, funs(. * 0.2))
Multiply all the numeric columns by 0.2 overriding existing columns.
mutate_if(is.numeric, funs(calc = . * 0.2))
Multiply all the numeric columns by 0.2 as new columns.
Summary
Apply functions to all the columns.
Syntax
summarize_all(funs(<function(s)>
), ...)
Arguments
Example
summarize_all(mean)
Apply mean function to all the columns values per group and override the existing columns.
summarize_all(mean, na.rm = TRUE)
Apply mean function with argument "na.rm = TRUE" to all the columns values per group.
summarize_all(funs(min, max))
Apply min and max functions to all the columns values per group and create new columns.
summarize_all(funs(average = mean), na.rm = TRUE)
Apply mean function to all the columns values per group and create new columns whose names ending with 'average'.
Summary
Apply functions to specified columns.
Syntax
summarize_at(vars(<column(s)>
), funs(<function(s)>
), ...)
Arguments
Example
summarize_at(vars(ends_with("DELAY")), mean)
Apply the 'mean' function to the columns whose names end with "DELAY".
summarize_at(vars(ends_with("DELAY")), mean, na.rm = TRUE)
Apply 'mean' function with "na.rm = TRUE" argument to the columns whose name ending with "DELAY".
summarize_at(c("ARR_DELAY", "DEP_DELAY"), mean)
Apply 'mean' function to "ARR_DELAY" and "DEP_DELAY" columns.
summarize_at(c(1,3), mean)
Apply 'mean' function to 1st and 3rd columns.
Summary
Apply functions to only the columns that match with a given condition.
Syntax
summarize_if(<predicate_condition>
, funs(<function(s)>
), ...)
Arguments
Example
summarize_if(is_numeric, mean)
Apply 'mean' function to all the columns that are numeric data type for each group.
Summary
Returns a list of user selected columns.
Syntax
vars(<column(s)>
)
Arguments
Example
vars(starts_with("ARR"))
Select columns whose name starting with "ARR".
vars(ARR_DELAY, ARR_TIME)
Select "ARR_DELAY" and "ARR_TIME" columns.
Summary
Returns values based on whether a given condition is satisfied or not.
Syntax
if_else(<condition>
, <return_value_when_TRUE>
, <return_value_when_FALSE>
, missing = NULL)
Arguments
Return Value
Vector
Example
mutate(category = if_else(price >= 500, "Expensive", "Cheap"))
price | category |
---|---|
1000 | Expensive |
500 | Expensive |
100 | Cheap |
50 | Cheap |
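A minimal sketch assuming dplyr, showing the missing argument that handles NA inputs:
library(dplyr)
price <- c(1000, 100, NA)
if_else(price >= 500, "Expensive", "Cheap", missing = "Unknown")
# returns "Expensive" "Cheap" "Unknown"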
Summary
Returns values based on multiple conditions. This is similar to CASE WHEN conditions in SQL. You can use a two-sided formula to define the condition and the mapping values, and use TRUE as the ELSE condition. See examples below.
Syntax
case_when(...)
Arguments
Return Value
Vector
Example
mutate(category = case_when(price > 900 ~ "Super Expensive", price >= 500 ~ "Expensive", price >= 100 ~ "Mild")))
price | category |
---|---|
1000 | Super Expensive |
500 | Expensive |
100 | Mild |
50 | NA |
mutate(category = case_when(price > 900 ~ "Super Expensive", price >= 500 ~ "Expensive", price >= 100 ~ "Mild", TRUE ~ "Cheap"))
price | category |
---|---|
1000 | Super Expensive |
500 | Expensive |
100 | Mild |
50 | Cheap |
Summary
Replace the missing values with a given value or the values that are at the same position in other columns.
Syntax
coalesce(<column>
, <column(s)>
)
Arguments
Return Value
Vector
Example
Original data:
price | price_old |
---|---|
1500 | 1000 |
NA | 500 |
NA | 100 |
50 | 10 |
mutate(price = coalesce(price, 0))
price | price_old |
---|---|
1500 | 1000 |
0 | 500 |
0 | 100 |
50 | 10 |
mutate(price = coalesce(price, price_old))
price | price_old |
---|---|
1500 | 1000 |
500 | 500 |
100 | 100 |
50 | 10 |
Summary
Replaces the matching values to NA.
Syntax
na_if(<column>
, <column_value>
)
Arguments
Return Value
Vector
Example
mutate(after = na_if(before, 999))
before | after |
---|---|
A | A |
G | G |
999 | NA |
Summary
Compares two numeric vectors (columns). This is a safe way of comparing if two vectors of floating point numbers are (pairwise) equal. This is safer than using ==, because it has a built in tolerance.
Syntax
near(<column_num>
, <column_num>
, tol = <numeric>
)
Arguments
Return Value
Logical
Example
sqrt(2) ^ 2 == 2
Returns FALSE.
near(sqrt(2) ^ 2, 2)
Returns TRUE.
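A minimal sketch assuming dplyr, with another classic floating point case:
library(dplyr)
0.1 + 0.2 == 0.3       # FALSE: binary floating point rounding error
near(0.1 + 0.2, 0.3)   # TRUE: equal within the built-in tolerance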
Summary
Recodes (or replaces) the original values with given values. You can replace numeric values either by the name with backticks or based on their position, and character values by their name.
Syntax
recode(<column>
, ..., .default = <text>
, .missing = <text>
)
Arguments
Return Value
Vector
Example
Original data is numeric:
ID |
---|
4716 |
4755 |
2432 |
mutate(name = recode(ID, `4716` = "Oracle", `4755` = "Rakuten"))
ID | name |
---|---|
4716 | Oracle |
4755 | Rakuten |
2432 | NA |
Since the original value 2432 doesn't have a replacement value and its data type (numeric) is different from the replacement values (character), its replacement value becomes NA. If the data types were the same, the original value would be used.
mutate(name = recode(ID, `4716` = "Oracle", `4755` = "Rakuten", .default = "Not Known"))
You can use '.default' argument to set the default value.
ID | name |
---|---|
4716 | Oracle |
4755 | Rakuten |
2432 | Not Known |
Original data is character:
ID |
---|
A |
G |
M |
mutate(name = recode(ID, "A" = "Apple", "G" = "Google"))
ID | name |
---|---|
A | Apple |
G | Google |
M | M |
Since the original value 'M' doesn't have a replacement value and its data type (character) is the same as the replacement values (character), the original value 'M' is used.
mutate(name = recode(ID, A = "Apple", G = "Google", .default = "Not known"))
You can use '.default' argument to set the default value.
ID | name |
---|---|
A | Apple |
G | Google |
M | Not known |
Summary
Recodes (or replaces) the original values with given values and create a factor column. You can replace numeric values either by the name with backticks or based on their position, and character values by their name.
Syntax
recode_factor(<column>
, ..., .default = <text>
, .missing = <text>
, .ordered = <logical>
)
Arguments
Return Value
Factor
Example
Original Data:
ID |
---|
M |
T |
S |
mutate(name = recode_factor(ID, S = "Sunday", M = "Mondy", T = "Tuesday", .ordered = TRUE))
ID | name |
---|---|
M | Monday |
T | Tuesday |
S | Sunday |
Summary
Batch converts multiple column names by applying a function.
Syntax
rename_with(.data = <dataframe_name>
, .fn = <function_to_transform_column_names>
, .cols = <columns_to_change>
)
Arguments * .data - A data frame, or a data frame extension like tibble. * .fn - The function to apply. Specifies a function to transform the selected column names. * .cols - Specifies which columns to change. Default is everything(), which selects all columns.
Example
rename_with(iris, ~ toupper(.x))
Converts all column names of the iris dataframe to uppercase.
SEPAL.LENGTH | SEPAL.WIDTH | PETAL.LENGTH | PETAL.WIDTH | SPECIES |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
Summary
Separates a column with delimited values into multiple rows.
Syntax
separate_rows(<column_text_date>
, sep = <text>
, convert=<logical>
)
Arguments
Example
Original data:
company | investors |
---|---|
Uber | Lowercase Capital, Benchmark Capital, Google Ventures |
Xiaomi | Digital Sky Technologies, QiMing Venture Partners, Qualcomm Ventures |
separate_rows(investors, sep=", ")
company | investors |
---|---|
Uber | Lowercase Capital |
Uber | Benchmark Capital |
Uber | Google Ventures |
Xiaomi | Digital Sky Technologies |
Xiaomi | QiMing Venture Partners |
Xiaomi | Qualcomm Ventures |
Summary
Unnest a list column by turning each element of a list-column into a column.
Syntax
unnest_wider(col = <column_list>
, names_sep = <text>
, names_repair = <text>
, simplify = <logical>
, ptype = <list>
, transform = <list>
)
Arguments
names_repair (Optional) - Used to check that output data frame has valid names. Must be one of the following options
simplify (Optional) - If TRUE, will attempt to simplify lists of length-1 vectors to an atomic vector.
ptype (Optional) - A named list of prototypes declaring the desired output type of each component. Use this argument if you want to check each element has the types you expect when simplifying.
transform (Optional) - A named list of transformation functions applied to each component. Use this argument if you want to transform or parse individual elements as they are hoisted.
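Since no example is listed above, here is a minimal sketch assuming tidyr and tibble, with illustrative data:
library(tidyr)
df <- tibble::tibble(
  issue_id = c(720, 721),
  assignee = list(
    list(name = "Kan", id = 105),
    list(name = "Kei", id = 106)
  )
)
# Each element of the list-column becomes its own column
unnest_wider(df, assignee)
# issue_id | name | id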
Summary
Unnest a list column by turning each element of a list-column into a row.
Syntax
unnest_longer(col = <column_list>
, values_to = <text>
, indices_to = <text>
, indices_include = <logical>
, names_repair = <text>
, simplify = <logical>
, ptype = <list>
, transform = <list>
)
Arguments
names_repair (Optional) - Used to check that output data frame has valid names. Must be one of the following options
simplify (Optional) - If TRUE, will attempt to simplify lists of length-1 vectors to an atomic vector.
ptype (Optional) - A named list of prototypes declaring the desired output type of each component. Use this argument if you want to check each element has the types you expect when simplifying.
transform (Optional) - A named list of transformation functions applied to each component. Use this argument if you want to transform or parse individual elements.
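Since no example is listed above, here is a minimal sketch assuming tidyr and tibble, with illustrative data:
library(tidyr)
df <- tibble::tibble(
  issue_id = c(720, 721),
  assignee = list(c("Kan", "Kei"), c("Kan", "Marie", "Simon"))
)
# Each element of the list-column becomes its own row
unnest_longer(df, assignee)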
Summary
hoist() allows you to selectively pull components of a list-column out into their own top-level columns.
Syntax
hoist(.col = <column_list>
, .remove = <logical>
, .simplify = <logical>
, .ptype = <list>
, .transform=<list>
)
Arguments
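A minimal sketch assuming tidyr and tibble, with illustrative data:
library(tidyr)
df <- tibble::tibble(
  issue_id = c(720, 721),
  assignee = list(
    list(name = "Kan", country = "US"),
    list(name = "Kei", country = "Japan")
  )
)
# Pull only 'name' out as a top-level column; the rest stays in assignee
hoist(df, assignee, name = "name")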
Summary
Drop rows that have NA values.
Syntax
drop_na(<column(s)>
,...)
drop_na(-<column(s)>
,...)
drop_na(starts_with(<text>
, ignore.case = <logical>
))
drop_na(ends_with(<text>
, ignore.case = <logical>
))
drop_na(contains(<text>
, ignore.case = <logical>
))
drop_na(matches(<text>
, ignore.case = <logical>
))
drop_na(num_range(<text>
, <start_num>:<end_num>
))
drop_na(one_of(<text1>
, <text2>
, ...))
drop_na(<column>
, everything())
Arguments
Example
Original data:
location | year | plant_date | harvest_date |
---|---|---|---|
Als | 1900 | NA | 35.02899 |
Als | 1901 | 10.00000 | 24.02899 |
Als | 1902 | 12.58863 | 45.02899 |
Bea | 1900 | 5.00000 | 18.00000 |
Bea | 1901 | 3.31119 | 10.00000 |
Bea | 1902 | 29.12891 | NA |
Bor | 1900 | 12.52136 | 18.14896 |
Bor | 1901 | NA | 10.64896 |
Bor | 1902 | 10.23056 | 20.64896 |
drop_na(ends_with("date"))
Returns a data frame without NA values in plant_date and harvest_date columns.
location | year | plant_date | harvest_date |
---|---|---|---|
Als | 1901 | 10.00000 | 24.02899 |
Als | 1902 | 12.58863 | 45.02899 |
Bea | 1900 | 5.00000 | 18.00000 |
Bea | 1901 | 3.31119 | 10.00000 |
Bor | 1900 | 12.52136 | 18.14896 |
Bor | 1902 | 10.23056 | 20.64896 |
Summary
Add rows with given column values.
Syntax
add_row()
Arguments
Return Value
Data Frame
Example
Original Data:
ID | name |
---|---|
4716 | Oracle |
4755 | Rakuten |
2432 | DeNA |
add_row(ID = 4689)
ID | name |
---|---|
4716 | Oracle |
4755 | Rakuten |
2432 | DeNA |
4689 | NA |
add_row(ID = 4689, name = "Yahoo")
ID | name |
---|---|
4716 | Oracle |
4755 | Rakuten |
2432 | DeNA |
4689 | Yahoo |
Summary
Parses characters, extracts numeric values, and converts them to the number data type.
Syntax
parse_number(<column>
, na = c("", "NA"), locale = <locale>
)
Arguments
Return Value
Numeric
Example
parse_number("12345")
Returns 12345
parse_number("12345.10")
Returns 12345.1
parse_number("$12,345.10")
Returns 12345.1
parse_number("$12.20M")
Returns 12.2
parse_number(as.Date("2015-01-30") - as.Date("2015-01-15"))
Returns 15
Summary
Convert data to double precision number data type.
Syntax
parse_double(<column>
, na = c("", "NA"), locale = <locale>
)
Arguments
Return Value
Double
Example
parse_double("12345")
Returns 12345
parse_double("12345.10")
Returns 12345.1
Summary
Convert data in the European number format (comma as the decimal separator) to double precision number data type.
Syntax
parse_euro_double(<column>
, na = c("", "NA"))
Arguments
Return Value
Double
Example
parse_euro_double("12345")
Returns 12345
parse_euro_double("12345.10")
Returns 12345.1
Summary
Convert data to integer number data type.
Syntax
parse_integer(<column>
, na = c("", "NA"), locale = <locale>
)
Arguments
Return Value
Integer
Example
parse_integer("12345")
Returns 12345
parse_integer("12345.10")
Returns 12345
Summary
Converts character strings containing time values into time objects (hms format). When combined with as.character(), it can effectively transform time data from various formats into standardized hms objects.
Syntax
parse_time(<column>
, format = <time_format>
, locale = <locale>
)
Arguments
Return Value
time (hms type time object)
Example
parse_time(as.character(time_column))
Converts standard time format strings to hms type time.
time_column | converted_time |
---|---|
13:10:05 | 13:10:05 |
09:45:30 | 09:45:30 |
23:59:59 | 23:59:59 |
R_Start parse_time(as.character(time_column), format = "%I:%M:%S %p") R_End
Converts 12-hour format (with AM/PM) time strings to 24-hour format hms type.
time_column | converted_time |
---|---|
10:20:15 PM | 22:20:15 |
08:45:00 AM | 08:45:00 |
12:00:00 PM | 12:00:00 |
parse_time(as.character(time_column), locale = locale(tz = "America/Los_Angeles"))
Converts time strings with a specific time zone specified.
time_column | converted_time |
---|---|
13:10:05 | 13:10:05 |
09:45:30 | 09:45:30 |
R_Start parse_time(as.character(time_column), locale = locale("ja")) R_End
Converts time strings in Japanese locale (including AM/PM notation).
time_column | converted_time |
---|---|
10:20:15 午後 | 22:20:15 |
08:45:00 午前 | 08:45:00 |
R_Start parse_time(as.character(time_column), format = "%H.%M") R_End
Converts period-separated time format (hour.minute) to standard time type.
time_column | converted_time |
---|---|
13.45 | 13:45:00 |
09.30 | 09:30:00 |
22.15 | 22:15:00 |
R_Start parse_time(as.character(time_column), format = "%H時%M分%S秒") R_End
Converts Japanese format time notation (hour-minute-second) to standard time type.
time_column | converted_time |
---|---|
13時45分30秒 | 13:45:30 |
09時15分00秒 | 09:15:00 |
23時59分59秒 | 23:59:59 |
Summary
Convert a given data to character data type.
Syntax
parse_character(<column>
, na = c("", "NA"), locale = <locale>
)
Arguments
Return Value
Character
Example
parse_character(123.11)
Returns "123.11" as character.
Summary
Parse data and convert to factor data type.
Syntax
parse_factor(x, levels, ordered = <logical>
, na = c("", "NA"), locale = <locale>
)
Arguments
Return Value
Factor
Example
columnA - "iMac", "iPod", "iPhone", "iPod", "iPhone"
parse_factor(columnA)
Returns iMac, iPod, iPhone, iPod, iPhone (It has Levels information of : iMac iPhone iPod)
Summary
Convert data to Logical (Boolean) data type.
Syntax
parse_logical(<column>
, na = c("", "NA"), locale = <locale>
)
Arguments
Return Value
Logical
Example
parse_logical("TRUE")
Returns TRUE
parse_logical("T")
Returns TRUE
Summary
Heuristically guess the data type for each column by reading the first 1000 rows, parse the data, and set appropriate data types for all the columns of the data frame.
Syntax
type_convert()
Arguments
Return Value
Data Frame
Example
type_convert()
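A minimal sketch assuming readr, with illustrative character data:
library(readr)
df <- data.frame(
  x = c("1", "2", "3"),
  d = c("2015-01-01", "2015-01-02", "2015-01-03"),
  stringsAsFactors = FALSE
)
# Character columns are parsed and retyped by the guessed type
str(type_convert(df))   # x becomes double, d becomes Date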
Summary
Convert Character or Number in year-month order to Date / Period when data contains only Date, but not Time.
Syntax
ym(<column>
, tz = <timezone>
, locale = <locale>
)
Arguments
Return Value
Date
Example
ym("2015-10")
returns "2015-10-01"
ym("2015/10")
returns "2015-10-01"
ym("Created on 2015 10")
returns "2015-10-01"
ym("2015, Oct", locale = "English")
returns "2015-10-01"
Summary
Convert Character or Number in month-year order to Date / Period when data contains only Date, but not Time.
Syntax
my(<column>
, tz = <timezone>
, locale = <locale>
)
Arguments
Return Value
Date
Example
returns "2015-01-01"
my("01/2015")
returns "2015-01-01"
my("Created on 1 2015")
returns "2015-01-01"
my("Oct, 2015", locale = "English")
returns "2015-10-01"
Summary
Convert text to Date data type if a given column holds values that look like Year and Quarter.
Syntax
yq(<column>
, tz = <timezone>
, locale = <locale>
)
Arguments
Return Value
Date
Example
yq("2016.2")
returns "2016-04-01"
yq("2016-03")
returns "2016-07-01"
Summary
Extract week numbers of the year. Weeks start from Sunday.
Syntax
epiweek(<column_date>
)
Return Value
Numeric
Example
epiweek(ymd(c("2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04", "2020-01-05", "2020-01-06", "2020-01-07", "2020-01-08")))
returns 1,1,1,1,2,2,2,2 (2020-01-05 is Sunday)
Summary
Convert a given POSIXct to Date.
Syntax
as_date(<column_date>
)
Arguments
Return Value
Date
Example
as_date(as_datetime("2015-10-10 13:10:05"))
Returns "2015-10-10"
Summary
Convert given data to date time (POSIXct).
Syntax
as_datetime(<column_date>
, format = <date_time_format>
, tz = <timezone>
, origin = <text>
)
Arguments
Return Value
POSIXct
Example
as_datetime("2015-10-10 13:10:05")
Returns "2015-10-10 13:10:05 PDT"
as_datetime("2015-10-10 13:10:05", tz = "America/Los_Angeles")
Returns "2015-10-10 13:10:05 PDT"
as_datetime("2015-10-10 13:10:05", tz = "Asia/Tokyo")
Returns "2015-10-10 13:10:05 JST"
as_datetime("05-10-15T13:10:05", format = "%d-%m-%yT%H:%M")
Returns "2015-10-05 13:10:00 PDT"
Summary
Remove leading and trailing white spaces and convert multiple consecutive white spaces into a single space.
Syntax
str_squish(<column_text>
)
Arguments
Return Value
Character
Example
str_squish(X1)
Remove leading and trailing white spaces and convert multiple consecutive white spaces into a single space.
X1 | NEW |
---|---|
String with spaces | String with spaces |
Hello World | Hello World |
Multiple white spaces | Multiple white spaces |
Summary
Truncate text to make the maximum length to be the specified number.
Syntax
str_trunc(<column_text>
, width = <number>
, side = "right"|"left"|"center"
, ellipsis = <text>
)
Arguments
Return Value
Character
Example
str_trunc("Exploratory", 6)
returns "Exp..."
str_trunc("Exploratory", 6, side = "left")
returns "...ory"
Summary
Remove letters that match with given letters or expressions.
Syntax
str_remove(<column_text>
, <pattern>
)
Arguments
Return Value
Character
Example
str_remove("I am traveling to New York!!", "New")
returns "I am traveling to York!!"
Summary
Remove letters that match with given letters or expressions.
Syntax
str_remove_all(<column_text>
, <pattern>
)
Arguments
Return Value
Character
Example
str_remove_all("I am traveling to New York and New Jersey!!", "new")
returns "I am traveling to York and Jersey!!"
Summary
Remove a word from a sentence.
Syntax
str_remove_word(<column_text>
, start = <number>
, end = <number>
, sep = <text>
)
Arguments
Return Value
Character
Example
str_remove_word(text_column, start = 1, sep = " ")
Removes the first word from the text column.
text_column | new_column |
---|---|
Hello World | World |
R Programming | Programming |
Data Analysis | Analysis |
str_remove_word(text_column, start = -1, sep = " ")
Removes the last word from the text column.
text_column | new_column |
---|---|
Hello World | Hello |
R Programming | R |
Data Analysis | Data |
str_remove_word(text_column, end = -1, sep = "-")
Removes the last hyphen-separated word from the text column.
text_column | new_column |
---|---|
user-name-id | user-name |
first-last-middle | first-last |
product-color-size | product-color |
Summary
Replace a word in a sentence with the given text.
Syntax
str_replace_word(<column_text>
, start = <number>
, end = <number>
, sep = <text>
, rep = <text>
)
Arguments
NOTE: For now, only the first and the last words work.
Return Value
Character
Example
str_replace_word(text_column, start = -1, sep = " ", rep = "England")
Replaces the last word in the text column with "England".
text_column | new_column |
---|---|
Hello World | Hello England |
R Programming | R England |
Text Analysis | Text England |
str_replace_word(text_column, start = 1, sep = " ", rep = "Data")
Replaces the first word in the text column with "Data".
text_column | new_column |
---|---|
Hello World | Data World |
R Programming | Data Programming |
Text Analysis | Data Analysis |
str_replace_word(text_column, start = 1, sep = "-", rep = "PREFIX")
Replaces the first hyphen-separated word in the text column with "PREFIX".
text_column | new_column |
---|---|
user-name-id | PREFIX-name-id |
first-last-middle | PREFIX-last-middle |
product-color-size | PREFIX-color-size |
Summary
Remove text enclosed by the given begin and end characters.
Syntax
str_remove_inside(<column_text>
, begin = <text>
, end = <text>
, all = <logical>
, include_special_chars = <logical>
)
Arguments
Return Value
Character
Example
exploratory::str_remove_inside("I am traveling to New York (2020)", begin = "(", end = ")")
returns "I am traveling to New York "
Summary
Replace text enclosed by the given begin and end characters with the provided text.
Syntax
str_replace_inside(<column_text>
, begin = <text>
, end = <text>
, rep = <text>
, all = <logical>
, include_special_chars = <logical>
)
Arguments
Return Value
Character
Example
exploratory::str_replace_inside("I am traveling to New (York)", begin = "(", end = ")", rep = "England")
returns "I am traveling to New England"
Summary
Remove a URL from text.
Syntax
str_remove_url(<column_text>
, position = <text>
)
Arguments
Return Value
Character
Example
exploratory::str_remove_url("I am traveling to New York http://example.com", position = "any")
returns "I am traveling to New York "
Summary
Replace an URL inside of characters with provided text.
Syntax
str_replace_url(<column_text>
, rep = <text>
)
Arguments
Return Value
Character
Example
exploratory::str_replace_url("I am traveling to New http://example.com/ny", rep = "York")
returns "I am traveling to New York"
Summary
Extracts a URL from text.
Syntax
str_extract_url(<column_text>
)
Arguments
Return Value
Character
Example
exploratory::str_extract_url("I am traveling to New http://example.com/ny")
returns "http://example.com/ny"
Summary
Removes emojis from text.
Syntax
str_remove_emoji(<column_text>
)
Arguments
Return Value
Character
Example
exploratory::str_remove_emoji("I am traveling to New York 🤩")
returns "I am traveling to New York "
Summary
Returns a summary information of a given model in a tidy (normalized data frame) format.
Syntax
model_info(<column_list>
, output = <model_info_output_type>
, ...)
Arguments
* output (Optional) - The default is "summary". This changes the type of output. It can be "summary", "variables", or "data", as shown in the examples below.
Returned Values
Example
The following examples assume that the 'model' column contains an 'lm' model fitted by the do() command.
model_info(model, output = "summary")
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|---|---|---|---|
0.8879 | 0.8879 | 13.0886 | 1716441.58 | 0 | 3 | -1729152.62 | 3458313.24 | 3458357.15 | 74228528.48 | 433295 |
model_info(model, output = "variables")
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -3.0966 | 0.0341 | -90.749 | 0 |
DEP_DELAY | 1.003 | 0.0005 | 1851.9463 | 0 |
DISTANCE | -0.0037 | 0 | -114.3364 | 0 |
model_info(model, output = "data")
.rownames | ARR_DELAY | DEP_DELAY | .fitted | .se.fit | .resid | .hat | .sigma | .cooksd | .std.resid |
---|---|---|---|---|---|---|---|---|---|
1 | -15 | -9 | -13.934 | 0.0247 | -1.066 | 0.0000035503 | 13.0886 | 7.8503e-9 | -0.0814 |
2 | -28 | -19 | -22.4116 | 0.035 | -5.5884 | 0.0000071333 | 13.0886 | 4.3348e-7 | -0.427 |
Summary
Converts numeric values representing seconds into hms (hours:minutes:seconds) time format.
Syntax
as.hms(<column_num>
)
Arguments
Return Value
hms
Example
as.hms(X1)
Converts the numerical values in column X1 to hms type, handling them in seconds.
X1 | NEW |
---|---|
20 | 00:00:20 |
3600 | 01:00:00 |
7320 | 02:02:00 |
Summary
Returns true if a given object is hms type.
Syntax
is.hms(<column>
)
Arguments
Return Value
Logical
Example
mutate_if(is.hms, as.character)
Convert hms columns into character columns. An example of the value is "05:01:20".
Summary
Removes empty rows or columns. Useful for removing rows or columns that are empty (NA, NULL, blank) from a data frame.
Syntax
remove_empty(<data_frame>
, which = <"rows"|"cols">
, quiet = <TRUE|FALSE>
)
Arguments
* data_frame - The data frame from which to remove empty rows or columns.
* which - Specifies what to remove:
  * "rows" - Removes empty rows
  * "cols" - Removes empty columns
  * c("rows", "cols") - Removes both empty rows and columns
* quiet (Optional) - Specifies whether to display information messages about removed rows or columns. Default is FALSE (displays messages).
Return Value
data.frame
Example
Original data:
name | age | city |
---|---|---|
Smith | 25 | London |
NA | NA | NA |
Johnson | 30 | Paris |
remove_empty(data_frame, which = "rows")
Removes rows where all columns are NA, NULL, or blank values.
name | age | city |
---|---|---|
Smith | 25 | London |
Johnson | 30 | Paris |
Original data:
name | age | city | notes |
---|---|---|---|
Smith | 25 | London | NA |
Johnson | 30 | Paris | NA |
Williams | 28 | Berlin | NA |
remove_empty(data_frame, which = "cols")
Removes columns where all rows are NA, NULL, or blank values.
name | age | city |
---|---|---|
Smith | 25 | London |
Johnson | 30 | Paris |
Williams | 28 | Berlin |
Original data:
name | age | city | notes |
---|---|---|---|
Smith | 25 | London | NA |
NA | NA | NA | NA |
Johnson | 30 | Paris | NA |
remove_empty(data_frame, which = c("rows", "cols"))
Removes both empty rows and columns.
name | age | city |
---|---|---|
Smith | 25 | London |
Johnson | 30 | Paris |
Summary
Make column names clean by using underscores (_), lowercase letters, and numbers for easier data handling and standardization.
Syntax
clean_names(<data_frame>
, case = <"Format">
)
Arguments
Return Value
data.frame
Example
clean_names(data_frame)
Original data:
Item Name | .transaction |
---|---|
banana | 1 |
apple | 1 |
carrot | 2 |
apple | 2 |
orange | 3 |
NA | 3 |
Returns data with clean column names using default settings (snake_case):
item_name | transaction |
---|---|
banana | 1 |
apple | 1 |
carrot | 2 |
apple | 2 |
orange | 3 |
NA | 3 |
clean_names(data_frame, case = "upper_camel")
Using case="upper_camel" (UpperCamelCase):
ItemName | Transaction |
---|---|
banana | 1 |
apple | 1 |
carrot | 2 |
apple | 2 |
orange | 3 |
NA | 3 |
Summary
Extracts rows that are duplicated in the specified column(s). If no columns are specified, it returns rows that are duplicated across all columns.
Syntax
get_dupes(<column(s)>
)
Arguments
Return Value
data.frame (containing only the duplicated rows from the original data frame, with an added "dupe_count" column showing the number of duplicates)
Example
get_dupes(data_frame, name)
Original data:
name | n |
---|---|
banana | 1 |
apple | 1 |
apple | 1 |
apple | 2 |
Returns rows that are duplicated in the name column (with a dupe_count column added):
name | n | dupe_count |
---|---|---|
apple | 1 | 3 |
apple | 1 | 3 |
apple | 2 | 3 |
get_dupes(data_frame, name, n)
Returns rows that are duplicated across both name and n columns:
name | n | dupe_count |
---|---|---|
apple | 1 | 2 |
apple | 1 | 2 |
Summary
Converts Excel numeric date values (e.g., 42370) to date format (e.g., 2016-01-01)
Syntax
excel_numeric_to_date(<column_num>
, date_system = "modern"|"mac pre-2011"
)
Arguments
Example
Original data:
user | last_login |
---|---|
Lisa | 42370 |
Emily | 42318 |
John | 42489 |
excel_numeric_to_date(last_login)
Converts Excel numeric date values to date format as shown below.
user | last_login |
---|---|
Lisa | 2016-01-01 |
Emily | 2015-11-10 |
John | 2016-04-29 |
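If the file came from Excel for Mac 2008 or earlier, which counts days from 1904 instead of 1900, you can pass the documented date_system argument. A minimal sketch:
excel_numeric_to_date(last_login, date_system = "mac pre-2011")
Converts the same numeric values using the 1904-based Mac date system, so the resulting dates differ from the "modern" results above.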
Return Value
Date
Summary
Create frequency table for a column in a data frame. Useful for quickly understanding data distribution with percentage calculations and NA handling.
Syntax
tabyl(<column_name>
, sort = <TRUE|FALSE>
, show_na = <TRUE|FALSE>
, show_missing_levels = <TRUE|FALSE>
)
Arguments
Return Value
data.frame (frequency table)
Example
tabyl(data_frame, item_name)
Original data:
item_name | transaction_id |
---|---|
banana | 1 |
apple | 1 |
NA | NA |
carrot | 2 |
apple | 2 |
NA | NA |
orange | 3 |
NA | 3 |
Shows frequency table for item_name column (including NA values):
item_name | n | percent | valid_percent |
---|---|---|---|
apple | 2 | 0.250 | 0.40 |
banana | 1 | 0.125 | 0.20 |
carrot | 1 | 0.125 | 0.20 |
orange | 1 | 0.125 | 0.20 |
NA | 3 | 0.375 | NA |
tabyl(data_frame, item_name, sort = TRUE)
Sorted by frequency (high to low):
item_name | n | percent | valid_percent |
---|---|---|---|
NA | 3 | 0.375 | NA |
apple | 2 | 0.250 | 0.40 |
banana | 1 | 0.125 | 0.20 |
carrot | 1 | 0.125 | 0.20 |
orange | 1 | 0.125 | 0.20 |
tabyl(data_frame, item_name, show_na = FALSE)
Excluding NA values:
item_name | n | percent |
---|---|---|
apple | 2 | 0.40 |
banana | 1 | 0.20 |
carrot | 1 | 0.20 |
orange | 1 | 0.20 |
Summary
Convert a given column to a factor. Compared to base R's as.factor, this function creates levels in the order in which they appear, which will be the same on every platform (base R sorts in the current locale which can vary from place to place).
Syntax
as_factor(<column_char_num_date>
)
Arguments
Example
x <- c("FR", "US", "JP")
as_factor(x)
Returns a factor with FR, US, JP with levels of FR, US, JP.
Summary
Anonymize factor levels. Replaces factor levels with arbitrary numeric identifiers. Neither the values nor the order of the levels are preserved.
Syntax
fct_anon(<column_char_num_factor>
, prefix = <text>
)
Arguments
Example
x <- factor(c("FR", "US", "JP"))
fct_anon(x)
Returns a factor with 1, 2, 3.
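The documented prefix argument can be used to label the anonymized levels. A minimal sketch:
fct_anon(x, prefix = "country_")
Returns a factor with levels such as country_1, country_2, country_3. Note that the assignment of identifiers to the original levels is randomized.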
Summary
Add additional levels to a factor
Syntax
fct_expand(<column_char_num_factor>
, <text>
)
Arguments
Example
x <- factor(c("FR", "US"))
fct_expand(x, "JP", "CA")
Returns a factor with levels, FR US JP CA.
Summary
Drop unused levels.
Syntax
fct_drop(<column_char_num_factor>
, only = <text>
)
Arguments
Example
x <- factor(c("FR", "US"), levels = c("FR", "US", "JP", "CA"))
fct_drop(x)
Returns a factor with levels of FR, US.
fct_drop(x, only = "CA")
Returns a factor with levels of FR, US, JP.
Summary
This gives missing values an explicit factor level, ensuring that they appear in charts.
Syntax
fct_explicit_na(<column_char_num_factor>
, na_level = <text>
)
Arguments
Example
x <- factor(c("FR", NA, "US", NA, "JP"))
fct_explicit_na(x, "No value")
Returns a factor with FR, No value, US, No value, JP and with levels of FR JP US No value
Summary
Moves least/most common factor levels into "Other" category.
Syntax
fct_lump(<column_char_num_factor>
, n = <number>
, prop = <number>
, other_level = <text>
, ties.method = <ties_method_type>
, w = <column_num>
)
Arguments
If both n and prop are missing, fct_lump lumps together the least frequent levels into "other", while ensuring that "other" is still the smallest level.
Example
mutate(STATE = fct_lump(STATE))
This keeps most of the values in the STATE column, lumping the least frequent values into the 'Other' level so that 'Other' still has the smallest count compared to all the other levels.
mutate(ORIGIN_STATE_ABR = fct_lump(ORIGIN_STATE_ABR, n = 10))
This keeps only the 10 most frequent States in the ORIGIN_STATE_ABR column and lumps all the others into the 'Other' level.
mutate(ORIGIN_STATE_ABR = fct_lump(ORIGIN_STATE_ABR, prop = .1))
This keeps only the States that account for more than 10% of all the entries in the ORIGIN_STATE_ABR column and lumps all the others into the 'Other' level.
mutate(ORIGIN_STATE_ABR = fct_lump(ORIGIN_STATE_ABR, n = 5, w = DISTANCE))
This keeps only the 5 States with the largest weight, calculated as the sum of DISTANCE for each State, in the ORIGIN_STATE_ABR column and lumps all the others into the 'Other' level.
Summary
Creates "Other" category by moving specified values to "Other", or by keeping only specified values from "Other".
Syntax
fct_other(<column_char_num_factor>
, keep = c("<column_value(s)>
"), drop = c("<column_value(s)>
"), other_level = <text>
)
Arguments
Example
mutate(STATE = fct_other(STATE, keep = c("CA", "AK", "AZ")))
This keeps only the levels for the specified values in the STATE column and moves all other values into the 'Other' level.
mutate(STATE = fct_other(STATE, drop = c("CA", "AK", "AZ")))
This keeps the levels that are not specified here and moves the specified values into the 'Other' level.
Summary
Reorder factor levels by first appearance.
Syntax
fct_inorder(<column_char_num_factor>
, ordered = <logical>
)
Arguments
Example
x <- factor(c("FR", "US", "US", "JP", "US", "JP"))
fct_inorder(x)
Returns a factor with levels in the following order, FR US JP.
x <- c("FR", "US", "US", "JP", "US", "JP") -- it can also be used on a Character data type column.
fct_inorder(x)
Returns a factor with levels in the following order, FR US JP.
Summary
Reorder factor levels by frequency (from the most frequent to the least).
Syntax
fct_infreq(<column_char_num_factor>
, ordered = <logical>
)
Arguments
Example
x <- factor(c("FR", "US", "US", "JP", "US", "JP"))
fct_infreq(x)
Returns a factor with levels in the following order, US JP FR.
x <- c("FR", "US", "US", "JP", "US", "JP") -- it can also be used on a Character data type column.
fct_infreq(x)
Returns a factor with levels in the following order, US JP FR.
Summary
Sets the level of a given factor column by moving any number of levels to any location.
Syntax
fct_relevel(<column_char_num_factor>
, ...)
Arguments
Example
x <- c("France", "Japan", "Australia", "United States") -- creating a character column.
fct_relevel(x, "Japan")
Returns a factor with levels in the order Japan, Australia, France, United States. The character column is converted to a factor with the default level setting, which is alphabetical order.
x <- factor(c("France", "Japan", "Australia", "United States"), levels = c("France", "Japan", "Australia", "United States")) -- creating a factor column with explicit levels.
fct_relevel(x, "Japan")
Returns a factor with levels in the order Japan, France, Australia, United States.
fct_relevel(x, "Japan", "United States")
Returns a factor with levels in the order Japan, United States, France, Australia.
Summary
Sets the order of values based on sorted values from another column and converts the data type to factor.
Syntax
fct_reorder(<column_char_num_factor>
, <column_num>
, .fun = <aggregate_function>
, ..., .desc = <logical>
)
Arguments
Example
fct_reorder(Country, Sales)
Sets the Country column levels by the median values of the Sales column in ascending order. Since the median Sales values are France(500) < Japan(1000) < USA(2000), the levels are set in this order.
Country | Sales | NEW |
---|---|---|
Japan | 1000 | Japan |
Japan | 1200 | Japan |
Japan | 800 | Japan |
USA | 2000 | USA |
USA | 2200 | USA |
USA | 1800 | USA |
France | 500 | France |
France | 600 | France |
France | 400 | France |
fct_reorder(Country, Sales, .desc = TRUE)
Sets the Country column levels by the median values of the Sales column in descending order. Since the median Sales values are USA(2000) > Japan(1000) > France(500), the levels are set in this order.
Country | Sales | NEW |
---|---|---|
Japan | 1000 | Japan |
Japan | 1200 | Japan |
Japan | 800 | Japan |
USA | 2000 | USA |
USA | 2200 | USA |
USA | 1800 | USA |
France | 500 | France |
France | 600 | France |
France | 400 | France |
fct_reorder(Country, Sales, .fun = sum, .desc = TRUE, na.rm = TRUE)
Sets the Country column levels by the sum of the Sales column in descending order. Since the sum of Sales values are USA(6000) > Japan(3000) > France(900), the levels are set in this order.
Country | Sales | NEW |
---|---|---|
Japan | 1000 | Japan |
Japan | 1200 | Japan |
Japan | 800 | Japan |
USA | 2000 | USA |
USA | 2200 | USA |
USA | 1800 | USA |
France | 500 | France |
France | NA | France |
France | 400 | France |
Summary
Reverse order of factor levels.
Syntax
fct_rev(<column_factor>
)
Arguments
Example
x <- factor(c("FR", "US", "JP"))
This creates a factor column with levels in the order FR, JP, US (alphabetical order by default).
fct_rev(x)
Returns a factor with levels in the order US, JP, FR.
Summary
Create extreme gradient boosting model for regression.
Syntax
xgboost_reg(watchlist_rate = <numeric>
, output_type = <xgb_reg_output_type>
, nrounds = <integer>
, booster = <xgboost_booster_type>
, eval_metric = <xgb_reg_evaluation_type>
, weight = <column_num>
, early_stopping_rounds = <integer>
)
Arguments
Return Value
Model
Example
Original data:
CARRIER | ARR_DELAY | DEP_DELAY | DISTANCE |
---|---|---|---|
UA | 10 | 7 | 300 |
UA | 20 | 30 | 1000 |
UA | -5 | 0 | 200 |
AA | 20 | 20 | 500 |
AA | -5 | 3 | 700 |
AA | 10 | 8 | 1000 |
%>% group_by(CARRIER)
%>% build_model(model_func = xgboost_reg, formula = DEP_DELAY ~ ARR_DELAY + DISTANCE, test_rate = 0.1)
Returns a data frame that stores an xgb.Booster model based on a formula defined as 'Predict DEP_DELAY based on ARR_DELAY and DISTANCE.' It also returns a column of original data.
CARRIER | source_data | model | .test_index |
---|---|---|---|
UA | source dataframe | xgb.Booster model | c(1) |
AA | source dataframe | xgb.Booster model | c(2) |
You can use model_coef or model_stats function to get the summary information about the models. Also, you can use prediction function to predict with the data that is stored in the same data frame.
Summary
Create extreme gradient boosting model for binary classification.
Syntax
xgboost_binary(watchlist_rate = <numeric>
, output_type = <xgb_binary_output_type>
, nrounds = <integer>
, booster = <xgboost_booster_type>
, eval_metric = <xgb_binary_evaluation_type>
, weight = <column_num>
, early_stopping_rounds = <integer>
)
Arguments
Return Value
Model
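Example
A minimal sketch following the same build_model pattern shown for xgboost_reg above, assuming a logical IS_DELAYED column as the target:
%>% group_by(CARRIER)
%>% build_model(model_func = xgboost_binary, formula = IS_DELAYED ~ DEP_DELAY + DISTANCE, test_rate = 0.1)
Returns a data frame that stores an xgb.Booster model for each CARRIER group. You can use model_coef or model_stats function to get the summary information about the models, and prediction_binary function to predict with the data that is stored in the same data frame.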
Summary
Create extreme gradient boosting model for multi class classification.
Syntax
xgboost_multi(watchlist_rate = <numeric>
, output_type = <xgb_multi_output_type>
, nrounds = <integer>
, booster = <xgboost_booster_type>
, eval_metric = <xgb_multi_evaluation_type>
, weight = <column_num>
, early_stopping_rounds = <integer>
)
Arguments
Return Value
Model
Example
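A minimal sketch following the same build_model pattern shown for xgboost_reg above, assuming a categorical CARRIER column as the target:
%>% build_model(model_func = xgboost_multi, formula = CARRIER ~ DEP_DELAY + DISTANCE, test_rate = 0.1)
Returns a data frame that stores an xgb.Booster model. You can use the prediction function to predict with the data that is stored in the same data frame.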
Summary
Create random forest model for regression.
Syntax
randomForestReg(subset = <column_logical>
, na.action = <'na_action_type'>
, ntree = <integer>
, mtry = <numeric>
, replace = <logical>
, strata = <column>
, sampsize = <integer>
, nodesize = <integer>
, maxnodes = <integer>
, importance = <logical>
, localImp = <logical>
, nPerm = <integer>
, proximity = <logical>
, oob.prox = <logical>
, keep.forest = <logical>
, corr.bias = <logical>
, keep.inbag = <logical>
)
Arguments
Return Value
Model
Example
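A minimal sketch following the same build_model pattern shown for xgboost_reg above, assuming the same flight delay data:
%>% build_model(model_func = randomForestReg, formula = ARR_DELAY ~ DEP_DELAY + DISTANCE, test_rate = 0.1)
Returns a data frame that stores a randomForest model. You can use model_stats function to get the summary information about the model, and prediction function to predict with the data that is stored in the same data frame.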
Summary
Create random forest model for binary classification.
Syntax
randomForestBinary(subset = <column_logical>
, na.action = <'na_action_type'>
, ntree = <integer>
, mtry = <numeric>
, replace = <logical>
, classwt = <numeric>
, cutoff = <numeric>
, strata = <column>
, sampsize = <integer>
, nodesize = <integer>
, maxnodes = <integer>
, importance = <logical>
, localImp = <logical>
, nPerm = <integer>
, proximity = <logical>
, oob.prox = <logical>
, norm.votes = <logical>
, keep.forest = <logical>
, corr.bias = <logical>
, keep.inbag = <logical>
)
Arguments
Return Value
Model
Example
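A minimal sketch following the same build_model pattern shown for xgboost_reg above, assuming a logical IS_DELAYED column as the target:
%>% build_model(model_func = randomForestBinary, formula = IS_DELAYED ~ DEP_DELAY + DISTANCE, test_rate = 0.1)
Returns a data frame that stores a randomForest model. You can use prediction_binary function to predict with the data that is stored in the same data frame.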
Summary
Create random forest model for multi class classification.
Syntax
randomForestMulti(subset = <column_logical>
, na.action = <'na_action_type'>
, ntree = <integer>
, mtry = <numeric>
, replace = <logical>
, classwt = <numeric>
, cutoff = <numeric>
, strata = <column>
, sampsize = <integer>
, nodesize = <integer>
, maxnodes = <integer>
, importance = <logical>
, localImp = <logical>
, nPerm = <integer>
, proximity = <logical>
, oob.prox = <logical>
, norm.votes = <logical>
, keep.forest = <logical>
, corr.bias = <logical>
, keep.inbag = <logical>
)
Arguments
Return Value
Model
Example
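A minimal sketch following the same build_model pattern shown for xgboost_reg above, assuming a categorical CARRIER column as the target:
%>% build_model(model_func = randomForestMulti, formula = CARRIER ~ DEP_DELAY + DISTANCE, test_rate = 0.1)
Returns a data frame that stores a randomForest model. You can use the prediction function to predict with the data that is stored in the same data frame.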
Summary
Extracts a value from a list data type column based on a position, or based on a column name and row number when the data inside the list is a data frame.
Syntax
list_extract(<column_list>
, position = <numeric_text>
, rownum = <numeric>
)
Arguments
Return Value
Character
Example
1) Column is List of Character
mutate(text = list_extract(categories, 1))
Returns the following.
category | text |
---|---|
c("Doctors", "Health & Medical") | Doctors |
c("Bars", "American (New)", "Nightlife") | Bars |
mutate(text = list_extract(categories, -1))
Returns the value of the 1st element counted from the last.
category | text |
---|---|
c("Doctors", "Health & Medical") | Health & Medical |
c("Bars", "American (New)", "Nightlife") | Nightlife |
2) Column is List of Data Frame
mutate(text = list_extract(labels, 1))
Returns the value of the 1st column and 1st row. The default is 1st row when not specified.
labels | text |
---|---|
list(name = c("Regression", "UI"), color = c("5319e7", "207de5")) | Regression |
mutate(text = list_extract(labels, -1))
Returns the value of the last column and the 1st row.
labels | text |
---|---|
list(name = c("Regression", "UI"), color = c("5319e7", "207de5")) | 5319e7 |
mutate(text = list_extract(labels, 1, 2))
Returns the value of the 1st column and 2nd row.
labels | text |
---|---|
list(name = c("Regression", "UI"), color = c("5319e7", "207de5")) | UI |
mutate(text = list_extract(labels, "name"))
Returns the value of a column, 'name', and the 1st row. The default is 1st row when not specified.
labels | text |
---|---|
list(name = c("Regression", "UI"), color = c("5319e7", "207de5")) | Regression |
Summary
Concatenates texts from all the elements of a list data type column.
Syntax
list_to_text(<column_list>
, sep = <text>
)
Arguments
Return Value
character
Example
mutate(category_text = list_to_text(category))
Returns the following.
category | category_text |
---|---|
c("Doctors", "Health & Medical") | Doctors, Health & Medical |
Nightlife | Nightlife |
c("Active Life", "Mini Golf", "Golf") | Active Life, Mini Golf, Golf |
c("Bars", "American (New)", "Nightlife") | Bars, American (New), Nightlife |
c("Bars", "American (Traditional)", "Nightlife") | Bars, American (Traditional), Nightlife |
Summary
Concatenates values from multiple columns into a list.
Syntax
list_concat(<column(s)>
, collapse = <logical>
)
Arguments
Return Value
list
Example
mutate(items = list_concat(item1, item2))
Returns the following.
Before:
name | item1 | item2 |
---|---|---|
Cathy | c("Banana", "Apple") | "Apple" |
Mary | "Lemon" | c("Lemon", "Banana") |
Lisa | c("Lemon", "Banana") | c("Banana", "Apple") |
After:
name | item1 | item2 | items |
---|---|---|---|
Cathy | c("Banana", "Apple") | "Apple" | c("Banana", "Apple", "Apple") |
Mary | "Lemon" | c("Lemon", "Banana") | c("Lemon", "Lemon", "Banana") |
Lisa | c("Lemon", "Banana") | c("Banana", "Apple") | c("Lemon", "Banana", "Banana", "Apple") |
summarize(categories = list_concat(category, collapse = TRUE))
Returns the following.
Before:
grouped by "region"
region | category |
---|---|
East | c("Doctors", "Health & Medical") |
East | Nightlife |
East | c("Active Life", "Mini Golf", "Golf") |
West | c("Bars", "American (New)", "Nightlife") |
West | c("Bars", "American (Traditional)", "Nightlife") |
After:
region | categories |
---|---|
East | c("Doctors", "Health & Medical", "Nightlife", "Active Life", "Mini Golf", "Golf") |
West | c("Bars", "American (New)", "Nightlife", "Bars", "American (Traditional)", "Nightlife") |
Summary
Returns number of elements inside a list data type column for each row.
Syntax
list_n(<column_list>
)
Arguments
Return Value
Numeric
Example
mutate(count = list_n(categories))
Returns the following.
category | count |
---|---|
c("Doctors", "Health & Medical") | 2 |
Nightlife | 1 |
c("Active Life", "Mini Golf", "Golf") | 3 |
c("Bars", "American (New)", "Nightlife", "Lounges", "Restaurants") | 5 |
c("Bars", "American (Traditional)", "Nightlife", "Restaurants") | 4 |
Summary
Centers and/or scales the numeric values of a column.
Syntax
normalize(<column_num>
, center = <logical>
, scale = <logical>
)
Arguments
* scale (Optional) - The default is TRUE. When it is TRUE and center is TRUE, scaling is done by dividing the centered values by the standard deviation, which is same as calculating (x - mean(x)) / sd(x). When it is TRUE and center is FALSE, scaling is done by dividing by the root-mean-square, which is same as calculating x / sqrt(sum(x^2)/(length(x)-1)). When it is a numeric value, input values are divided by this numeric value. When it is FALSE, no scaling is done.
Return Value
Numeric
Example
x <- c(1, 2, 3, 4, 5, 100)
normalize(x)
Returns -0.4584610 -0.4332246 -0.4079882 -0.3827518 -0.3575154 2.0399410. These are the same as the below.
(x - mean(x)) / sd(x)
normalize(x, center = FALSE)
Returns 0.02229944 0.04459888 0.06689832 0.08919776 0.11149720 2.22994404. This is the same as the below.
x / sqrt(sum(x^2)/(length(x)-1))
normalize(x, center = 0, scale = max(x))
Returns 0.01 0.02 0.03 0.04 0.05 1.00, which is x divided by max(x) (100).
Summary
Anonymize values by hashing algorithms.
Syntax
anonymize(<column>
, algo = <hash_algorithms>
, seed = <integer>
, chars = <character>
, n_chars = <integer>
)
Arguments
Example
Original data:
user | login |
---|---|
Lisa | 1482148710 |
Lisa | 1484148710 |
Emily | 1479913692 |
John | 1481128318 |
mutate(user = anonymize(user, algo = "murmur32"))
Returns anonymized values.
user | login |
---|---|
6fe21ea2 | 1482148710 |
6fe21ea2 | 1484148710 |
2f968d4a | 1479913692 |
9b5f08f7 | 1481128318 |
Return Value
Character
Summary
Converts Excel numeric date/time values to POSIXct.
Syntax
excel_numeric_to_datetime(<column_num>
, tz = <timezone>
)
Arguments
Example
mutate(Last_Login_Time_Converted = excel_numeric_to_datetime(Last_Login_Time))
Converts Last_Login_Time to POSIXct.
Customer | Last_Login_Time | Last_Login_Time_Converted |
---|---|---|
Lisa | 42370.5 | 2016-01-01 12:00:00 |
Emily | 42318.25 | 2015-11-10 06:00:00 |
John | 42489.75 | 2016-04-29 18:00:00 |
Return Value
POSIXct
Summary
Convert unix time numeric values to POSIXct.
Syntax
unixtime_to_datetime(<column_num>
)
Example
Original data:
user | last_login |
---|---|
Lisa | 1482148710 |
Emily | 1479913692 |
John | 1481128318 |
mutate(last_login = unixtime_to_datetime(last_login))
Returns a data frame with the last_login column converted to POSIXct.
user | last_login |
---|---|
Lisa | 2016-12-19 11:58:30 GMT |
Emily | 2016-11-23 15:08:12 GMT |
John | 2016-12-07 16:31:58 GMT |
Return Value
Date, POSIXct
Summary
Calculates coordinates by reducing dimensionality using SVD (Singular Value Decomposition).
Syntax
do_svd.kv(<subject_column>
, <key_column>
, <value_column>
, type = <svd_type>
, fill = <numeric>
, fun.aggregate = <aggregate_function>
, n_component = <integer>
, centering = <logical>
, output = <svd_output>
)
Arguments
Return Value
Data frame
Example
Original data:
author_name | keywords | tfidf_value |
---|---|---|
Cathy | party | 3.2 |
Cathy | exciting | 1.3 |
Cathy | friends | 0.8 |
Mary | study | 2.4 |
Mary | exam | 2.1 |
Mary | hard | 1.5 |
Lisa | sports | 1.9 |
Lisa | exciting | 1.8 |
Lisa | hard | 1.6 |
do_svd.kv(author_name, keywords, tfidf_value, n_component=2)
Reducing the number of the dimensions (keywords) to 2 and returning the result in a 'long' data form.
author_name | new.dimension | value |
---|---|---|
Cathy | 1 | -0.7131255 |
Cathy | 2 | 0.3976414 |
Lisa | 1 | 0.0121952 |
Lisa | 2 | -0.8164055 |
Mary | 1 | 0.7009303 |
Mary | 2 | 0.4187641 |
do_svd.kv(author_name, keywords, tfidf_value, n_component=2, output="wide")
Reducing the number of the dimensions (keywords) to 2 and returning the result in a 'wide' data form.
author_name | axis1 | axis2 |
---|---|---|
Cathy | -0.7131255 | 0.3976414 |
Lisa | 0.0121952 | -0.8164055 |
Mary | 0.7009303 | 0.4187641 |
do_svd.kv(author_name, keywords, tfidf_value, n_component=2, type="dimension")
Reducing the number of the dimensions (keywords) to 2 and returning the result by 'dimensions' instead of by 'group'.
top_key_words | new.dimension | value |
---|---|---|
exam | 1 | 0.4163953 |
exam | 2 | 0.3259924 |
exciting | 1 | -0.2560436 |
exciting | 2 | -0.3531243 |
friends | 1 | -0.1613866 |
friends | 2 | 0.1179235 |
hard | 1 | 0.3029449 |
hard | 2 | -0.2513704 |
party | 1 | -0.6455466 |
party | 2 | 0.4716940 |
sports | 1 | 0.0065547 |
sports | 2 | -0.5750138 |
study | 1 | 0.4758803 |
study | 2 | 0.3725628 |
do_svd.kv(author_name, keywords, tfidf_value, n_component=2, type="variance")
Reducing the number of the dimensions (keywords) to 2 and returning the 'variance' values for each dimension
new.dimension | value |
---|---|
1 | 3.534991 |
2 | 2.697623 |
Summary
Calculates coordinates by reducing dimensionality using SVD (Singular Value Decomposition).
Syntax
do_svd(<column(s)>
, skv = <character_vector>
, type = <svd_type>
, fill = <numeric>
, fun.aggregate = <aggregate_function>
, n_component = <integer>
, centering = <logical>
, output = <svd_output>
)
Arguments
Return Value
Data frame
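Example
A minimal sketch, assuming the same author_name / keywords / tfidf_value data shown for do_svd.kv above:
do_svd(skv = c("author_name", "keywords", "tfidf_value"), n_component = 2)
Returns the same kind of result as do_svd.kv(author_name, keywords, tfidf_value, n_component = 2), with the subject, key, and value columns passed as a character vector through the skv argument.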
Summary
Builds a linear regression model (lm) and stores it in a data frame.
Syntax
build_lm(<formula>
, test_rate = <numeric>
, subset = <numeric_vector>
, weights = <numeric_vector>
, na.action = <na_action_type>
, method = <lm_method_type>
, model = <logical>
, x = <logical>
, y = <logical>
, qr = <logical>
, singular.ok = <logical>
, contrasts = <list>
, offset = <numeric_vector>
)
Arguments
Return Value
Data frame
Example
Original data:
CARRIER | ARR_DELAY | DEP_DELAY | DISTANCE |
---|---|---|---|
UA | 10 | 7 | 300 |
UA | 20 | 30 | 1000 |
UA | -5 | 0 | 200 |
AA | 20 | 20 | 500 |
AA | -5 | 3 | 700 |
AA | 10 | 8 | 1000 |
%>% group_by(CARRIER)
%>% build_lm(ARR_DELAY ~ DEP_DELAY + DISTANCE, test_rate = 0.1)
Returns a data frame that stores an lm (Linear Regression) model based on a formula defined as 'Predict ARR_DELAY based on DEP_DELAY and DISTANCE.' It also returns a column of original data.
CARRIER | source_data | model | .test_index |
---|---|---|---|
UA | source dataframe | lm model | c(1) |
AA | source dataframe | lm model | c(2) |
You can use model_coef or model_stats function to get the summary information about the models. Also, you can use prediction function to predict with the data that is stored in the same data frame.
Summary
Calculates Survival Curve from survival time and survival status.
Syntax
do_survfit(<column_num>
, <column_num_logic>
, weights = <numeric_vector>
, subset = <numeric_vector>
, na.action = <na_action_type>
, id = <id_column>
, istate = <column>
, type = <survival_curve_type>
, error = <survfit_error_type>
, conf.type = <survfit_conf_int_type>
, conf.lower = <survfit_lower_conf_int_type>
, conf.int = <numeric>
, se.fit = <logical>
, influence = <logical>
)
Arguments
Return Value
Data frame
Example
Original data:
status | time |
---|---|
1 | 40 |
1 | 30 |
0 | 35 |
0 | 8 |
do_survfit(time, status)
Returns a data frame as follows.
time | n_risk | n_event | n_censor | estimate | std_error | conf_high | conf_low |
---|---|---|---|---|---|---|---|
8 | 4 | 0 | 1 | 1.0000000 | 0.0000000 | 1 | 1.0000000 |
30 | 3 | 1 | 0 | 0.6666667 | 0.4082483 | 1 | 0.2995071 |
35 | 2 | 0 | 1 | 0.6666667 | 0.4082483 | 1 | 0.2995071 |
40 | 1 | 1 | 0 | 0.0000000 | Inf | NA | NA |
Summary
Simulates Survival Curve for specified cohort based on a survival model.
Syntax
prediction_survfit(newdata = <cohort_data_set>
, na.action = <na_action_type>
, type = <coxph_survival_curve_type>
, conf.type = <survfit_conf_int_type>
, conf.int = <numeric>
, se.fit = <logical>
, censor = <logical>
, start.time = <numeric>
)
Arguments
Return Value
Data frame
Example
Original data: model data frame.
prediction_survfit(newdata = data.frame(age = c(50, 60), sex = c(0, 1)))
Returns a data frame as follows.
time | n_risk | n_event | n_censor | age_sex | estimate | std_error | conf_high | conf_low |
---|---|---|---|---|---|---|---|---|
8 | 4 | 0 | 1 | 50_0: | 1.0000000 | 0.0000000 | 1 | 1.0000000 |
30 | 3 | 1 | 0 | 50_0: | 0.6666667 | 0.4082483 | 1 | 0.2995071 |
35 | 2 | 0 | 1 | 50_0: | 0.6666667 | 0.4082483 | 1 | 0.2995071 |
40 | 1 | 1 | 0 | 50_0: | 0.0000000 | Inf | NA | NA |
Summary
Builds a logistic regression model and stores it in a data frame.
Syntax
build_lr(<formula>
, test_rate = <numeric>
, weights = <numeric_vector>
, subset = <numeric_vector>
, na.action = <na_action_type>
, start = <numeric_vector>
, etastart = <numeric_vector>
, mustart = <numeric_vector>
, offset = <numeric_vector>
, epsilon = <numeric>
, maxit = <integer>
, trace = <logical>
, model = <logical>
, method = <glm_method_type>
, contrasts = <list>
, x = <logical>
, y = <logical>
)
Arguments
Example
Original data:
CARRIER | IS_DELAYED | DEP_DELAY | DISTANCE |
---|---|---|---|
UA | TRUE | 7 | 300 |
UA | TRUE | 30 | 1000 |
UA | FALSE | 0 | 200 |
AA | TRUE | 20 | 500 |
AA | FALSE | 3 | 700 |
AA | TRUE | 8 | 1000 |
%>% group_by(CARRIER)
%>% build_lr(IS_DELAYED ~ DEP_DELAY + DISTANCE, test_rate = 0.1)
Returns a data frame that stores a GLM model based on a formula defined as 'Predict IS_DELAYED based on DEP_DELAY and DISTANCE.'
CARRIER | source_data | model | .test_index |
---|---|---|---|
UA | source dataframe | glm model | c(1) |
AA | source dataframe | glm model | c(2) |
You can use model_coef or model_stats function to get the summary information about the models. Also, you can use prediction function to predict with the data that is stored in the same data frame.
Summary
Builds generalized linear models (glm) and stores them in a data frame.
Syntax
build_glm(<formula>
, test_rate = <numeric>
, family = <glm_family_type>
, weights = <numeric_vector>
, subset = <numeric_vector>
, na.action = <na_action_type>
, start = <numeric_vector>
, etastart = <numeric_vector>
, mustart = <numeric_vector>
, offset = <numeric_vector>
, epsilon = <numeric>
, maxit = <integer>
, trace = <logical>
, model = <logical>
, method = <glm_method_type>
, contrasts = <list>
, x = <logical>
, y = <logical>
)
Arguments
Example
Original data:
CARRIER | ARR_DELAY | DEP_DELAY | DISTANCE |
---|---|---|---|
UA | 10 | 7 | 300 |
UA | 20 | 30 | 1000 |
UA | -5 | 0 | 200 |
AA | 20 | 20 | 500 |
AA | -5 | 3 | 700 |
AA | 10 | 8 | 1000 |
%>% group_by(CARRIER)
%>% build_glm(ARR_DELAY ~ DEP_DELAY + DISTANCE, test_rate = 0.1)
Returns a data frame that stores a GLM model based on a formula defined as 'Predict ARR_DELAY based on DEP_DELAY and DISTANCE.'
CARRIER | source_data | model | .test_index |
---|---|---|---|
UA | source dataframe | glm model | c(1) |
AA | source dataframe | glm model | c(2) |
You can use model_coef or model_stats function to get the summary information about the models. Also, you can use prediction function to predict with the data that is stored in the same data frame.
Summary
Builds a multinomial logistic regression model and stores it in a data frame.
Syntax
build_multinom(<formula>
, test_rate = <numeric>
, subset = <numeric_vector>
, weights = <numeric_vector>
, na.action = <na_action_type>
, Hess = <logical>
, summ = <multinom_summ_type>
, censored = <logical>
, model = <logical>
, contrasts = <list>
)
Arguments
Example
Original data:
CARRIER | ARR_DELAY | DEP_DELAY | DISTANCE |
---|---|---|---|
UA | 10 | 7 | 300 |
UA | 20 | 30 | 1000 |
DL | -5 | 0 | 200 |
AA | 20 | 20 | 500 |
AA | -5 | 3 | 700 |
AA | 10 | 8 | 1000 |
%>% build_multinom(CARRIER ~ ARR_DELAY + DEP_DELAY, test_rate = 0.1)
Returns a data frame that stores a multinom model based on a formula defined as 'Predict CARRIER based on ARR_DELAY and DEP_DELAY.'
source_data | model | .test_index |
---|---|---|
source dataframe | multinom model | c(1) |
You can use model_coef or model_stats function to get the summary information about the models. Also, you can use prediction function to predict with the data that is stored in the same data frame.
Summary
Builds Cox Proportional Hazard Model for survival analysis and store it in a data frame.
Syntax
build_coxph(<formula>
, test_rate = <numeric>
, subset = <numeric_vector>
, weights = <numeric_vector>
, na.action = <na_action_type>
, init = <numeric>
, ties = <coxph_tie_type>
, singular.ok = <logical>
, model = <logical>
, x = <logical>
, y = <logical>
, tt = <function>
)
Arguments
Example
Original data:
time | status | age | sex |
---|---|---|---|
40 | 2 | 55 | female |
30 | 2 | 68 | male |
35 | 1 | 59 | male |
8 | 1 | 80 | female |
build_coxph(survival::Surv(time, status) ~ age + sex)
Builds a Cox Proportional Hazard Model that estimates survival of the subjects based on age and sex. The resulting model is stored in the returning data frame. You can use model_coef or model_stats function to get the summary information about the models. Also, you can use prediction_coxph function to predict with the data that is stored in the same data frame.
Summary
Returns a data frame with regression information about a model.
Syntax
prediction(data = "training"|"test"|"newdata"
, data_frame = <data_set>
)
Arguments
* data (Optional) - The default is "training". Type of data to use in the prediction.
* data_frame (Optional) - This works only when data = "newdata". A name of another data frame to apply the model.
Return Value
Data Frame
Example
Original data: model data frame.
prediction(data = "test")
Returns a data frame with predicted values.
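To apply the model to another data frame, a minimal sketch using the documented data_frame argument (flight2018 is a hypothetical data frame name):
prediction(data = "newdata", data_frame = flight2018) # flight2018 is hypothetical
Returns a data frame with predicted values for the rows of flight2018.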
Summary
Returns a data frame with binary classification information about a model.
Syntax
prediction_binary(<predicted_probability_column>
, <actual_value_column>
, data = "training"|"test"|"newdata"
, threshold = <binary_metrics>
, data_frame = <data_set>
)
Arguments
Return Value
Data Frame
Example
Original data: binary classification model data frame.
prediction_binary(data = "test", threshold = 0.2)
Returns a data frame with predicted values of binary classification.
Summary
Returns a data frame with predicted values of Cox Proportional Hazard Model.
Syntax
prediction_coxph(data = "training"|"test"
, type.predict = <coxph_predict_type>
, type.residuals = <coxph_residuals_type>
)
Arguments
* data (Optional) - The default is "training". Type of data to use in the prediction.
* type.predict (Optional) - The default is "lp". The type of prediction. This can be "lp", "risk", or "expected".
* type.residuals (Optional) - The default is "martingale". The type of residuals. This can be "martingale", "deviance", "score", "schoenfeld", "scaledsch", "dfbeta", or "dfbetas".
Return Value
Data Frame
Example
Original data: coxph model data frame.
prediction_coxph(data = "test")
Returns a data frame with predicted values.
Summary
Returns a data frame with 'Parameter Estimates (Coefficients)' information about a model including the below.
Syntax
model_coef(conf_int = <conf_int_type>
, conf.level = <numeric>
)
Arguments
Return Value
Data Frame
Example
Original data: lm model or glm model
model_coef(conf_int = "default", conf.level = 0.90)
Returns a data frame with 'Parameter Estimates' information.
Summary
Returns a data frame with 'Summary of Fit' information about a model including the below.
Syntax
model_stats()
Example
Original data: lm model or glm model
model_stats()
Returns a data frame with 'Summary of Fit' information.
Return Value
Data Frame
Summary
Returns a data frame with anova test information about a model including the below.
Syntax
model_anova()
Return Value
Data Frame
Example
Original data: lm model or glm model
model_anova()
Returns a data frame with anova test information.
Summary
Returns a data frame with evaluation score of regression including the below.
Syntax
evaluate_regression(<predicted_value_column>
, <actual_value_column>
)
Arguments
Return Value
Data Frame
Example
Original data:
ARR_DELAY | predicted_value |
---|---|
10 | 7 |
20 | 30 |
-5 | 0 |
20 | 20 |
-5 | 3 |
10 | 8 |
evaluate_regression(predicted_value, ARR_DELAY)
Returns a data frame with evaluation scores of regression.
Summary
Returns a data frame with evaluation score of binary classification including the below.
Syntax
evaluate_binary(<predicted_probability_column>
, <actual_value_column>
, threshold = <binary_metrics>
)
Arguments
Return Value
Data Frame
Example
Original data:
IS_DELAYED | predicted_probability |
---|---|
TRUE | 0.82 |
TRUE | 0.67 |
FALSE | 0.34 |
TRUE | 0.98 |
FALSE | 0.19 |
TRUE | 0.26 |
evaluate_binary(predicted_probability, IS_DELAYED, threshold = "accuracy")
Returns a data frame with evaluation scores of binary classification.
Summary
Returns a data frame with evaluation score of multi classification including the below.
Syntax
evaluate_multi(<predicted_label_column>
, <actual_value_column>
)
Arguments
Return Value
Data Frame
Example
Original data:
CARRIER | predicted_label |
---|---|
AA | AA |
9E | 9E |
AS | AA |
9E | 9E |
AA | AA |
AS | FL |
evaluate_multi(predicted_label, CARRIER)
Returns a data frame with evaluation scores of multi classification.
micro_f_score | macro_f_score | accuracy | misclassification_rate |
---|---|---|---|
0.6666667 | 0.45 | 0.6666667 | 0.3333333 |
Summary
Returns coordinates of roc curve.
Syntax
do_roc(<predicted_probability_column>
, <actual_value_column>
)
Arguments
Return Value
Data Frame
Example
Original data:
IS_DELAYED | predicted_probability |
---|---|
TRUE | 0.82 |
TRUE | 0.67 |
FALSE | 0.34 |
TRUE | 0.98 |
FALSE | 0.19 |
TRUE | 0.26 |
do_roc(predicted_probability, IS_DELAYED)
Returns a data frame with true_positive_rate and false_positive_rate.
true_positive_rate | false_positive_rate |
---|---|
0.00 | 0.0 |
0.25 | 0.0 |
0.50 | 0.0 |
0.75 | 0.0 |
0.75 | 0.5 |
1.00 | 0.5 |
1.00 | 1.0 |
With this data, you can assign "false_positive_rate" to X axis and "true_positive_rate" to Y axis to draw ROC curve.
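For instance, a minimal sketch with ggplot2, assuming the do_roc result is stored in a data frame named roc_df (an illustrative name):
library(ggplot2)
# ROC curve with a dashed diagonal as the random-classifier reference
ggplot(roc_df, aes(x = false_positive_rate, y = true_positive_rate)) +
  geom_line() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed")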
Summary
Builds a clustering model (k-means) from variable columns and returns the summary of the model or the augmented data depending on the parameter value.
Syntax
build_kmeans.cols(<column(s)>
, centers = <integer>
, fill = <numeric>
, fun.aggregate = <aggregate_function>
, seed = <integer>
, iter.max = <integer>
, nstart = <integer>
, algorithm = <kmeans_algorithms>
, trace = <logical>
, normalize_data = <logical>
, augment = <logical>
, keep.source = <logical>
)
Arguments
augment_kmeans(model, source.data)
Return Value
Data frame
Example
Original data:
location | 1900 | 1901 | 1902 |
---|---|---|---|
Als | 35.02899 | 24.02899 | 45.02899 |
Bea | 18.00000 | 10.00000 | 27.00000 |
Bor | 18.14896 | 10.64896 | 20.64896 |
build_kmeans.cols(-location, centers=2, augment=TRUE)
Builds a K-means clustering model selecting all the columns other than 'location' and setting 2 as the number of the clusters, and scoring the original data right away. If you want to evaluate the model set 'augment' to FALSE.
location | X1900 | X1901 | X1902 | .cluster |
---|---|---|---|---|
Als | 35.02899 | 24.02899 | 45.02899 | 2 |
Bea | 18.00000 | 10.00000 | 27.00000 | 1 |
Bor | 18.14896 | 10.64896 | 20.64896 | 1 |
Summary
Builds a clustering model (k-means) from key-value columns and store the model into a generated data frame or augment the original data with the clustered ID.
Syntax
build_kmeans.kv(<subject_column>
, <key_column>
, <value_column>
, centers = <integer>
, seed = <integer>
, iter.max = <integer>
, nstart = <integer>
, algorithm = <kmeans_algorithms>
, trace = <logical>
, normalize_data = <logical>
, augment = <logical>
, keep.source = <logical>
)
Arguments
augment_kmeans(model, source.data)
Return Value
Data frame
Example
Original data:
location | year | harvest_date |
---|---|---|
Als | 1900 | 35.02899 |
Als | 1901 | 24.02899 |
Als | 1902 | 45.02899 |
Bea | 1900 | 18.00000 |
Bea | 1901 | 10.00000 |
Bea | 1902 | 27.00000 |
Bor | 1900 | 18.14896 |
Bor | 1901 | 10.64896 |
Bor | 1902 | 20.64896 |
build_kmeans.kv(location, year, harvest_date, centers=2, augment=TRUE)
Builds a K-means clustering model selecting 'location' as the subject to cluster, 'year' as the dimension, and 'harvest_date' as the value, and setting 2 as the number of the clusters, and scoring the original data right away. If you want to evaluate the model set 'augment' to FALSE.
location | year | harvest_date | .cluster |
---|---|---|---|
Als | 1900 | 35.02899 | 2 |
Als | 1901 | 24.02899 | 2 |
Als | 1902 | 45.02899 | 2 |
Bea | 1900 | 18.00000 | 1 |
Bea | 1901 | 10.00000 | 1 |
Bea | 1902 | 27.00000 | 1 |
Bor | 1900 | 18.14896 | 1 |
Bor | 1901 | 10.64896 | 1 |
Bor | 1902 | 20.64896 | 1 |
Summary
Builds a clustering model (k-means). Stores the model into a generated data frame or the augmented data depending on the parameter value.
Syntax
build_kmeans(<column(s)>
, skv = <character_vector>
, fill = <numeric>
, fun.aggregate = <aggregate_function>
, centers = <integer>
, iter.max = <integer>
, nstart = <integer>
, algorithm = <kmeans_algorithms>
, trace = <logical>
, normalize_data = <logical>
, seed = <integer>
, augment = <logical>
, keep.source = <logical>
)
Arguments
augment_kmeans(model, source.data)
Return Value
Data frame
Example
Original data:
location | 1900 | 1901 | 1902 |
---|---|---|---|
Als | 35.02899 | 24.02899 | 45.02899 |
Bea | 18.00000 | 10.00000 | 27.00000 |
Bor | 18.14896 | 10.64896 | 20.64896 |
build_kmeans(-location, centers=2, augment=TRUE)
Builds a K-means clustering model selecting all the columns other than 'location' and setting 2 as the number of the clusters, and scoring the original data right away. If you want to evaluate the model set 'augment' to FALSE.
location | X1900 | X1901 | X1902 | .cluster |
---|---|---|---|---|
Als | 35.02899 | 24.02899 | 45.02899 | 2 |
Bea | 18.00000 | 10.00000 | 27.00000 | 1 |
Bor | 18.14896 | 10.64896 | 20.64896 | 1 |
Original data:
location | year | harvest_date |
---|---|---|
Als | 1900 | 35.02899 |
Als | 1901 | 24.02899 |
Als | 1902 | 45.02899 |
Bea | 1900 | 18.00000 |
Bea | 1901 | 10.00000 |
Bea | 1902 | 27.00000 |
Bor | 1900 | 18.14896 |
Bor | 1901 | 10.64896 |
Bor | 1902 | 20.64896 |
build_kmeans(skv = c("location", "year", "harvest_date"), centers = 2, augment = TRUE)
Builds a K-means clustering model selecting 'location' as the subject to cluster, 'year' as the dimension, and 'harvest_date' as the value, and setting 2 as the number of the clusters, and scoring the original data right away. If you want to evaluate the model set 'augment' to FALSE.
location | year | harvest_date | .cluster |
---|---|---|---|
Als | 1900 | 35.02899 | 2 |
Als | 1901 | 24.02899 | 2 |
Als | 1902 | 45.02899 | 2 |
Bea | 1900 | 18.00000 | 1 |
Bea | 1901 | 10.00000 | 1 |
Bea | 1902 | 27.00000 | 1 |
Bor | 1900 | 18.14896 | 1 |
Bor | 1901 | 10.64896 | 1 |
Bor | 1902 | 20.64896 | 1 |
Summary
Execute t-test, which checks the differences of means between groups.
Syntax
do_t.test(<value_column>
, <group_column>
, alternative = <alternative_type>
, mu = <integer>
, paired = <logical>
, var.equal = <logical>
, conf.level = <numeric>
, subset = <column_logical>
, na.action = <na_action_type>
)
Arguments
Return Value
Data frame
Example
Original data:
location | year | harvest_date |
---|---|---|
Als | 1900 | 35.02899 |
Als | 1901 | 24.02899 |
Als | 1902 | 45.02899 |
Bor | 1900 | 18.14896 |
Bor | 1901 | 10.64896 |
Bor | 1902 | 20.64896 |
do_t.test(harvest_date, location)
Returns a data frame with one row that has t-test result.
Summary
Execute F-test, which checks the differences of variances between groups.
Syntax
do_var.test(<value_column>
, <group_column>
, ratio = <numeric>
, alternative = <alternative_type>
, conf.level = <numeric>
, na.action = <na_action_type>
)
Arguments
Return Value
Data frame
Example
Original data:
location | year | harvest_date |
---|---|---|
Als | 1900 | 35.02899 |
Als | 1901 | 24.02899 |
Als | 1902 | 45.02899 |
Bor | 1900 | 18.14896 |
Bor | 1901 | 10.64896 |
Bor | 1902 | 20.64896 |
do_var.test(harvest_date, location)
Returns a data frame with one row that has the F-test result.
Summary
Execute chi-squared contingency table tests and goodness-of-fit tests.
Syntax
do_chisq.test(<column(s)>
, correct = <logical>
, p = <column_num>
, rescale.p = <logical>
, simulate.p.value = <logical>
, B = <numeric>
)
Arguments
Return Value
Data frame
Example
Original data:
clarity | GIA | HRD | IGI |
---|---|---|---|
IF | 6 | 4 | 34 |
VS1 | 61 | 13 | 7 |
VS2 | 36 | 15 | 2 |
VVS1 | 15 | 23 | 14 |
VVS2 | 33 | 24 | 21 |
do_chisq.test(-clarity)
Returns a data frame with one row that has chi-squared test result.
statistic | p.value | parameter | method |
---|---|---|---|
112.7472 | 0 | 8 | Pearson's Chi-squared test |
Summary
Find rules about which items tend to occur together in transaction data.
Syntax
do_apriori(<subject_column>
, <key_column>
, minlen = <integer>
, maxlen = <integer>
, min_support = <numeric>
, max_support = <numeric>
, min_confidence = <numeric>
, lhs = <character_vector>
, rhs = <character_vector>
, max_basket_items = <numeric>
)
Arguments
Return Value
Data frame
Example
Original data:
item_name | transaction_id |
---|---|
banana | 1 |
apple | 1 |
carrot | 2 |
apple | 2 |
orange | 3 |
carrot | 3 |
apple | 4 |
carrot | 4 |
do_apriori(item_name, transaction_id)
Returns association rules about item names.
lhs | rhs | support | confidence | lift |
---|---|---|---|---|
| apple | 0.75 | 0.7500000 | 1.0000000 |
| carrot | 0.75 | 0.7500000 | 1.0000000 |
banana | apple | 0.25 | 1.0000000 | 1.3333333 |
orange | carrot | 0.25 | 1.0000000 | 1.3333333 |
apple | carrot | 0.50 | 0.6666667 | 0.8888889 |
carrot | apple | 0.50 | 0.6666667 | 0.8888889 |
do_apriori(item_name, transaction_id, minlen=2)
Returns association rules where the total number of item names in lhs and rhs are more than 2.
lhs | rhs | support | confidence | lift |
---|---|---|---|---|
banana | apple | 0.25 | 1.0000000 | 1.3333333 |
orange | carrot | 0.25 | 1.0000000 | 1.3333333 |
apple | carrot | 0.50 | 0.6666667 | 0.8888889 |
carrot | apple | 0.50 | 0.6666667 | 0.8888889 |
do_apriori(item_name, transaction_id, lhs=c( "apple", "orange"))
Returns association rules where apple or orange come to lhs.
lhs | rhs | support | confidence | lift |
---|---|---|---|---|
orange | carrot | 0.25 | 1.0000000 | 1.3333333 |
apple | carrot | 0.50 | 0.6666667 | 0.8888889 |
Summary
Detect anomaly in time series data frame.
Syntax
do_anomaly_detection(<column_date>
, <column_num>
, direction = <anomaly_direction_type>
, e_value = <logical>
, max_anoms = <numeric>
, alpha = <numeric>
, only_last = <anomaly_last_type>
, threshold = <anomaly_threshold_type>
, longterm = <logical>
, piecewise_median_period_weeks = <integer>
)
Arguments
Return Value
Data frame
Example
Original data:
dateHour | newUsers |
---|---|
... | ... |
2017-01-20 09:00:00 | 4 |
2017-01-20 10:00:00 | 2 |
2017-01-20 11:00:00 | 5 |
2017-01-20 12:00:00 | 31 |
2017-01-20 13:00:00 | 3 |
2017-01-20 14:00:00 | 3 |
... | ... |
do_anomaly_detection(dateHour, newUsers, direction = "both", e_value = TRUE)
Returns data frame with columns about anomaly data.
dateHour | newUsers | pos_anomaly | pos_value | neg_anomaly | neg_value | expected_value |
---|---|---|---|---|---|---|
... | ... | ... | ... | ... | ... | ... |
2017-01-20 09:00:00 | 4 | FALSE | NA | FALSE | NA | 4 |
2017-01-20 10:00:00 | 2 | FALSE | NA | FALSE | NA | 2 |
2017-01-20 11:00:00 | 5 | FALSE | NA | FALSE | NA | 5 |
2017-01-20 12:00:00 | 31 | TRUE | 31 | FALSE | NA | 5 |
2017-01-20 13:00:00 | 3 | FALSE | NA | FALSE | NA | 3 |
2017-01-20 14:00:00 | 3 | FALSE | NA | FALSE | NA | 3 |
... | ... | ... | ... | ... | ... | ... |
Summary
Add forecast data to time series data frame.
Syntax
do_prophet(<column_date>
, <column_num>
, <numeric>
, time_unit = <prophet_time_unit>
, include_history = <logical>
, fun.aggregate = <aggregate_function>
, growth = "linear"|"logistics"
, seasonality.prior.scale = <numeric>
, yearly.seasonality = <logical>
, weekly.seasonality = <logical>
, n.changepoints = <numeric>
, changepoint.prior.scale = <numeric>
, changepoints = <list>
, holidays.prior.scale = <numeric>
, holidays = <data_set>
, mcmc.samples =
Arguments
Return Value
Data frame
Example
Original data:
date | pageviews |
---|---|
... | ... |
2017-03-19 | 2383 |
2017-03-20 | 4441 |
2017-03-21 | 3147 |
2017-03-22 | 456 |
... | ... |
do_prophet(date, pageviews, 2, time_unit = "day")
Returns data frame with forecasted rows and columns about trend and seasonality.
date | pageviews | forecasted_value | forecasted_value_high | forecasted_value_low | trend | trend_high | trend_low | seasonal | seasonal_low | seasonal_high | yearly | yearly_low | yearly_high | weekly | weekly_low | weekly_high |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2017-03-19 | 2383 | 3200.6539 | 4150.7993 | 2199.57 | 1131.882 | 1131.882 | 1131.882 | 2068.7719 | 2068.7719 | 2068.7719 | 1014.3165 | 1014.3165 | 1014.3165 | -177.1076 | -177.1076 | -177.1076 |
2017-03-20 | 4441 | 3401.953 | 4330.5388 | 2401.9095 | 1089.3266 | 1089.3266 | 1089.3266 | 2312.6264 | 2312.6264 | 2312.6264 | 1258.171 | 1258.171 | 1258.171 | -174.3324 | -174.3324 | -174.3324 |
2017-03-21 | 3147 | 2723.3344 | 3655.037 | 1841.2349 | 1046.7712 | 1046.7712 | 1046.7712 | 1676.5631 | 1676.5631 | 1676.5631 | 622.1077 | 622.1077 | 622.1077 | 15.2078 | 15.2078 | 15.2078 |
2017-03-22 | 456 | 1209.1918 | 2152.4326 | 298.3122 | 1004.2159 | 1004.2159 | 1004.2159 | 204.976 | 204.976 | 204.976 | -849.4794 | -849.4794 | -849.4794 | 143.4983 | 143.4983 | 143.4983 |
2017-03-23 | NA | 654.9152 | 1581.3289 | -276.3799 | 876.5498 | 876.5498 | 876.5497 | -221.6346 | -221.6346 | -221.6346 | -1276.09 | -1276.09 | -1276.09 | 107.7315 | 107.7315 | 107.7315 |
2017-03-24 | NA | 2090.3595 | 3049.9388 | 1133.799 | 833.9944 | 833.9944 | 833.9944 | 1256.3651 | 1256.3651 | 1256.3651 | 201.9097 | 201.9097 | 201.9097 | 77.2003 | 77.2003 | 77.2003 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Summary
Estimate impact of an event (advertisement, etc.) on a market, by using other markets to form a synthetic control.
Syntax
do_market_impact(<column_date>
, <column_num>
, <column_text>
, target_market = <text>
, time_unit = <market_impact_time_unit>
, fun.aggregate = <aggregate_function>
, event_time = <text>
, output_type = "series"|"model_stats"|"model_coef"|"predictor_market_candidates"
, na_fill_type = "spline"|"interpolate"|"previous"|"value"
, na_fill_value = <numeric>
, distance_weight = <numeric>
, alpha = <numeric>
, niter = <numeric>
, standardize.data = <logical>
, prior.level.sd = <numeric>
, nseasons = <numeric>
, season.duration = <numeric>
, dynamic.regression = <logical>
)
Arguments
Return Value
Data frame
Example
Original data:
date | pageviews | country |
---|---|---|
... | ... | ... |
2017-03-19 | 2383 | Japan |
2017-03-19 | 3625 | UK |
2017-03-19 | 3825 | US |
2017-03-20 | 4441 | Japan |
2017-03-20 | 4721 | UK |
2017-03-20 | 8367 | US |
2017-03-21 | 383 | Japan |
2017-03-21 | 2693 | UK |
2017-03-21 | 3147 | US |
... | ... | ... |
do_market_impact(date, pageviews, country, target_market = "Japan", event_time = "2017-03-20")
Returns data frame with pageviews of target market, synthetic control calculated from pageviews of other markets, and estimated impact of the event which happened only to the target market.
Output data frame columns:
Summary
Use a row as column names.
Syntax
row_as_header(row_index = <integer>
, prefix = <character>
, clean_names = <logical>
)
Arguments
Return Value
Data frame
Example
Original data:
V1 | V2 |
---|---|
NA | NA |
item.name | transaction.id |
banana | 1 |
apple | 1 |
carrot | 2 |
apple | 2 |
orange | 3 |
carrot | 3 |
apple | 4 |
carrot | 4 |
row_as_header(row_index = 2, prefix = "c_", clean_names = TRUE)
Returns a data frame using the 2nd row as column names, with the prefix added and dots replaced by underscores.
c_item_name | c_transaction_id |
---|---|
NA | NA |
banana | 1 |
apple | 1 |
carrot | 2 |
apple | 2 |
orange | 3 |
carrot | 3 |
apple | 4 |
carrot | 4 |
Summary
Pivot columns into rows and columns. Values are the count of each row and column pair, or an aggregation of another column.
Syntax
pivot(<formula>
, value = <column>
, fill = <numeric>
, fun.aggregate = <character>
, na.rm = <logical>
)
Arguments
Return Value
Data frame
Example
Original data:
clarity | certification | colour | price |
---|---|---|---|
VVS2 | HRD | F | 10796 |
VS1 | GIA | F | 1551 |
VS1 | GIA | F | 4291 |
VVS2 | GIA | F | 5881 |
VS2 | GIA | H | 4585 |
VVS2 | GIA | H | 5193 |
pivot(clarity+colour~certification, value = price, fill=0, fun.aggregate = sum)
Returns data frame with rows of combinations of clarity and colour and with columns of certification. Values are sum of price and missing values are filled by 0.
clarity_colour | GIA | HRD |
---|---|---|
VS1_F | 5842 | 0 |
VS2_H | 4585 | 0 |
VVS2_F | 5881 | 10796 |
VVS2_H | 5193 | 0 |
Original data:
FL_DATE | ARR_DELAY | CARRIER |
---|---|---|
2018-10-01 | 10 | AA |
2018-10-02 | 20 | UA |
2018-10-03 | 30 | AA |
2018-10-04 | 40 | UA |
2018-11-01 | 20 | UA |
2018-11-02 | 10 | AA |
2018-11-03 | 20 | UA |
2018-11-04 | 30 | AA |
pivot(floor_date(FL_DATE, unit="month") ~ CARRIER, value = ARR_DELAY, fun.aggregate = mean)
Returns data frame with rows of FL_DATE (floored to month) and ARR_DELAY with columns of each CARRIER. Values are mean of ARR_DELAY.
FL_DATE | AA | UA |
---|---|---|
2018-10-01 | 20 | 30 |
2018-11-01 | 20 | 20 |
Summary
Calculates correlations for all the pairs of the variables (columns).
Syntax
do_cor.cols(<column(s)>
, use = <cor_na_operation>
, method = <cor_method>
, distinct = <logical>
, diag = <logical>
)
Arguments
Return Value
Data frame
Example
Original data:
year | Als | Bea | Bor |
---|---|---|---|
1900 | 35.02899 | 18 | 18.14896 |
1901 | 24.02899 | 10 | 10.64896 |
1902 | 45.02899 | 27 | 20.64896 |
do_cor.cols(-year)
pair.name.1 | pair.name.2 | value |
---|---|---|
Als | Bea | 0.9981135 |
Als | Bor | 0.9680283 |
Bea | Als | 0.9981135 |
Bea | Bor | 0.9508014 |
Bor | Als | 0.9680283 |
Bor | Bea | 0.9508014 |
Summary
Calculates correlations for all the pairs of subject columns.
Syntax
do_cor.kv(<subject_column>
, <key_column>
, <value_column>
, use = <cor_na_operation>
, method = <cor_method>
, distinct = <logical>
, diag = <logical>
, fill = <numeric>
, fun.aggregate = <aggregate_function>
)
Arguments
Return Value
Data frame
Example
Original data:
location | year | harvest_date |
---|---|---|
Als | 1900 | 35.02899 |
Als | 1901 | 24.02899 |
Als | 1902 | 45.02899 |
Bea | 1900 | 18.00000 |
Bea | 1901 | 10.00000 |
Bea | 1902 | 27.00000 |
Bor | 1900 | 18.14896 |
Bor | 1901 | 10.64896 |
Bor | 1902 | 20.64896 |
do_cor.kv(location, year, harvest_date)
location.x | location.y | value |
---|---|---|
Als | Bea | 0.9981135 |
Als | Bor | 0.9680283 |
Bea | Als | 0.9981135 |
Bea | Bor | 0.9508014 |
Bor | Als | 0.9680283 |
Bor | Bea | 0.9508014 |
Summary
Calculates correlations for all the pairs of the variables or subjects.
Syntax
do_cor(<column(s)>
, skv = <character_vector>
, use = <cor_na_operation>
, method = <cor_method>
, distinct = <logical>
, diag = <logical>
, fill = <numeric>
, fun.aggregate = <aggregate_function>
)
Arguments
Return Value
Data frame
Example
Original data:
year | Als | Bea | Bor |
---|---|---|---|
1900 | 35.02899 | 18 | 18.14896 |
1901 | 24.02899 | 10 | 10.64896 |
1902 | 45.02899 | 27 | 20.64896 |
do_cor(-year)
pair.name.1 | pair.name.2 | value |
---|---|---|
Als | Bea | 0.9981135 |
Als | Bor | 0.9680283 |
Bea | Als | 0.9981135 |
Bea | Bor | 0.9508014 |
Bor | Als | 0.9680283 |
Bor | Bea | 0.9508014 |
Original data:
location | year | harvest_date |
---|---|---|
Als | 1900 | 35.02899 |
Als | 1901 | 24.02899 |
Als | 1902 | 45.02899 |
Bea | 1900 | 18.00000 |
Bea | 1901 | 10.00000 |
Bea | 1902 | 27.00000 |
Bor | 1900 | 18.14896 |
Bor | 1901 | 10.64896 |
Bor | 1902 | 20.64896 |
do_cor( skv = c("location", "year", "harvest_date"))
location.x | location.y | value |
---|---|---|
Als | Bea | 0.9981135 |
Als | Bor | 0.9680283 |
Bea | Als | 0.9981135 |
Bea | Bor | 0.9508014 |
Bor | Als | 0.9680283 |
Bor | Bea | 0.9508014 |
Summary
Calculates the distance between each pair of the subjects.
Syntax
do_dist.kv(<subject_column>
, <key_column>
, <value_column>
, distinct = <logical>
, diag = <logical>
, method = <dist_method>
, fill = <numeric>
, fun.aggregate = <aggregate_function>
, p = <numeric>
)
Arguments
Return Value
Data frame
Example
Original data:
date | name | product | num |
---|---|---|---|
2015-06-12 | Lisa | orange | 5 |
2015-06-12 | Emily | apple | 3 |
2015-06-12 | John | carrot | 3 |
2015-06-13 | Emily | apple | 4 |
2015-06-13 | John | apple | 6 |
2015-06-16 | Lisa | orange | 2 |
2015-06-16 | John | carrot | 4 |
2015-06-16 | Emily | orange | 7 |
do_dist.kv(name, product, num)
This considers a matrix like the one below. Duplicated entries are aggregated with mean by default; you can change this with the fun.aggregate argument.
Emily | John | Lisa | |
---|---|---|---|
apple | 3.5 | 6.0 | 0.0 |
carrot | 0.0 | 3.5 | 0.0 |
orange | 7.0 | 0.0 | 3.5 |
Then it calculates the Euclidean distance between each pair of columns.
name.x | name.y | value |
---|---|---|
Emily | John | 8.215838 |
Emily | Lisa | 4.949747 |
John | Emily | 8.215838 |
John | Lisa | 7.778175 |
Lisa | Emily | 4.949747 |
Lisa | John | 7.778175 |
do_dist.kv(name, product, num, fun.aggregate=sum, method="manhattan")
This considers a matrix like the one below. Duplicated entries are aggregated with sum.
Emily | John | Lisa | |
---|---|---|---|
apple | 7 | 6 | 0 |
carrot | 0 | 7 | 0 |
orange | 7 | 0 | 7 |
Then it calculates the Manhattan distance between each pair of columns.
name.x | name.y | value |
---|---|---|
Emily | John | 15 |
Emily | Lisa | 7 |
John | Emily | 15 |
John | Lisa | 20 |
Lisa | Emily | 7 |
Lisa | John | 20 |
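The two steps described above (spread the key-value pairs into a subject-by-key matrix, then compute pairwise distances) can be sketched with standard R. This illustrates the mechanics only, not do_dist.kv's implementation:
library(dplyr)
library(tidyr)

purchases <- tribble(
  ~name,   ~product, ~num,
  "Lisa",  "orange", 5,
  "Emily", "apple",  3,
  "John",  "carrot", 3,
  "Emily", "apple",  4,
  "John",  "apple",  6,
  "Lisa",  "orange", 2,
  "John",  "carrot", 4,
  "Emily", "orange", 7
)

m <- purchases %>%
  group_by(product, name) %>%
  summarize(num = sum(num), .groups = "drop") %>%   # fun.aggregate = sum
  pivot_wider(names_from = name, values_from = num,
              values_fill = 0)                      # missing pairs become 0

dist(t(as.matrix(m[-1])), method = "manhattan")     # distances between persons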
Summary
Calculates the distance between each pair of the variables (columns).
Syntax
do_dist.cols(<column(s)>
, distinct = <logical>
, diag = <logical>
, method = <dist_method>
, p = <numeric>
)
Arguments
Return Value
Data frame
Example
Original data:
year | Als | Bea | Bor |
---|---|---|---|
1900 | 35.02899 | 18 | 18.14896 |
1901 | 24.02899 | 10 | 10.64896 |
1902 | 45.02899 | 27 | 20.64896 |
do_dist.cols(-year)
pair.name.1 | pair.name.2 | value |
---|---|---|
Als | Bea | 28.492868 |
Als | Bor | 32.532238 |
Bea | Als | 28.492868 |
Bea | Bor | 6.385847 |
Bor | Als | 32.532238 |
Bor | Bea | 6.385847 |
Summary
Calculates the distance between each pair of the variables or subjects.
Syntax
do_dist(<column(s)>
, skv = <character_vector>
, distinct = <logical>
, diag = <logical>
, method = <dist_method>
, p = <numeric>
, fill = <numeric>
, fun.aggregate = <aggregate_function>
)
Arguments
Return Value
Data frame
Example
Original data:
year | Als | Bea | Bor |
---|---|---|---|
1900 | 35.02899 | 18 | 18.14896 |
1901 | 24.02899 | 10 | 10.64896 |
1902 | 45.02899 | 27 | 20.64896 |
do_dist(-year)
pair.name.1 | pair.name.2 | value |
---|---|---|
Als | Bea | 28.492868 |
Als | Bor | 32.532238 |
Bea | Als | 28.492868 |
Bea | Bor | 6.385847 |
Bor | Als | 32.532238 |
Bor | Bea | 6.385847 |
Original data:
date | name | product | num |
---|---|---|---|
2015-06-12 | Lisa | orange | 5 |
2015-06-12 | Emily | apple | 3 |
2015-06-12 | John | carrot | 3 |
2015-06-13 | Emily | apple | 4 |
2015-06-13 | John | apple | 6 |
2015-06-16 | Lisa | orange | 2 |
2015-06-16 | John | carrot | 4 |
2015-06-16 | Emily | orange | 7 |
do_dist(skv = c("name", "product", "num"))
This considers a matrix like the one below. Duplicated entries are aggregated with mean by default; you can change this with the fun.aggregate argument.
Emily | John | Lisa | |
---|---|---|---|
apple | 3.5 | 6.0 | 0.0 |
carrot | 0.0 | 3.5 | 0.0 |
orange | 7.0 | 0.0 | 3.5 |
Then it calculates the Euclidean distance between each pair of columns.
name.x | name.y | value |
---|---|---|
Emily | John | 8.215838 |
Emily | Lisa | 4.949747 |
John | Emily | 8.215838 |
John | Lisa | 7.778175 |
Lisa | Emily | 4.949747 |
Lisa | John | 7.778175 |
do_dist(skv = c("location", "year", "harvest_date"), fun.aggregate=sum, method="manhattan")
This considers a matrix like the one below. Duplicated entries are aggregated with sum.
Emily | John | Lisa | |
---|---|---|---|
apple | 7 | 6 | 0 |
carrot | 0 | 7 | 0 |
orange | 7 | 0 | 7 |
Then it calculates the Manhattan distance between each pair of columns.
name.x | name.y | value |
---|---|---|
Emily | John | 15 |
Emily | Lisa | 7 |
John | Emily | 15 |
John | Lisa | 20 |
Lisa | Emily | 7 |
Lisa | John | 20 |
Summary
Cleans up text by removing escape characters (e.g. \n, \t), extra white spaces, extra periods, and leading and trailing spaces.
Syntax
str_clean(<column_text>
)
Arguments
Return Value
Character
Example
str_clean(" Exploratory ..io ")
returns "Exploratory.io".
Summary
Counts patterns in texts.
Syntax
str_count_all(<column_text>
, patterns = <character_vector>
, remove.zero = <logical>
)
Arguments
Return Value
List of data frame
Example
str_count_all("I ate banana, apple, and peach yesterday, and banana, peach today.", patterns=c("apple", "banana"), remove.zero=TRUE)
Returns a list column of data frames with 'apple' and 'banana' columns.
Summary
Normalizes text by replacing full-width alphanumeric and special characters with regular (half-width) ones, replacing half-width Kana characters with full-width Kana characters, and applying other normalization rules. It follows the Unicode Normalization Forms. This is a wrapper function around the stringi::stri_trans_nfkc function.
Syntax
str_normalize(<column_text>
)
Arguments
Return Value
Character
Example
str_normalize("ABC123+ー=") returns "ABC123+-=".
Summary
Extracts text enclosed by the given begin and end symbols. If multiple strings are extracted, they are returned as a list type, so wrap the call with the list_to_text function.
Syntax
str_extract_inside(<column_text>
, begin = <text>
, end = <text>
, include_special_chars = <TRUE|FALSE>
, all = <TRUE|FALSE>
)
Arguments
Return Value
Character
Example
list_to_text(str_extract_inside(company, begin = "(", end = ")", include_special_chars = FALSE, all = TRUE))
list_to_text(str_extract_inside(company, begin = "{", end = "}", include_special_chars = FALSE, all = TRUE))
list_to_text(str_extract_inside(company, begin = "[", end = "]", include_special_chars = FALSE, all = TRUE))
- Original Data: "Apple [Cupertino]"
- Result: Cupertino
list_to_text(str_extract_inside(company, begin = "'", end = "'", include_special_chars = FALSE, all = TRUE))
list_to_text(str_extract_inside(company, begin = '"', end = '"', include_special_chars = FALSE, all = TRUE))
Summary
Converts the data type of a column to logical type, transforming values into TRUE or FALSE.
If the "true_value" argument is not specified, it converts "true", "Yes", "yes", "1" to TRUE, and "false", "No", "no", "0" to FALSE.
If you want to make a specific value TRUE and all other values FALSE, specify the original value in true_value.
Even if the target column for logical conversion contains a mix of uppercase and lowercase letters, the conversion ignores case and focuses only on the string content.
Syntax
str_logical(<column_text>
, true_value = <text>
)
Arguments
Return Value
Logical
Example
Original data:
'Status' column is Character data type.
Status |
---|
yes |
no |
No |
NO |
YES |
Yes |
mutate(New_Status = str_logical(Status))
Result:
'New_Status' column is Logical data type.
Status | New_Status |
---|---|
yes | TRUE |
no | FALSE |
No | FALSE |
NO | FALSE |
YES | TRUE |
Yes | TRUE |
Original data:
'Status' column is Character data type.
Status |
---|
TRUE |
FalSE |
True |
False |
NA |
tRUE |
FALSE |
mutate(New_Status = str_logical(Status))
Result:
'New_Status' column is Logical data type.
Status | New_Status |
---|---|
TRUE | TRUE |
FalSE | FALSE |
True | TRUE |
False | FALSE |
NA | NA |
tRUE | TRUE |
FALSE | FALSE |
Original data:
'Status' column is Numeric data type.
Status |
---|
1 |
0 |
1 |
0 |
NA |
1 |
0 |
mutate(New_Status = str_logical(Status))
Result:
'New_Status' column is Logical data type.
Status | New_Status |
---|---|
1 | TRUE |
0 | FALSE |
1 | TRUE |
0 | FALSE |
NA | NA |
1 | TRUE |
0 | FALSE |
Original data:
'Status' column is Character data type.
Status |
---|
Sign Up |
Not Yet |
Sign Up |
Not Yet |
NA |
sign Up |
Not yet |
mutate(New_Status = str_logical(Status, "Sign Up")
Result:
'New_Status' column is Logical data type.
Status | New_Status |
---|---|
Sign Up | TRUE |
Not Yet | FALSE |
Sign Up | TRUE |
Not Yet | FALSE |
NA | NA |
sign Up | TRUE |
Not yet | FALSE |
Summary
Returns stop words like "a", "the", "and", etc.
Syntax
get_stopwords(lang = <language>
, include = <character_vector>
, exclude = <character_vector>
)
Arguments
Return Value
Character
Example
get_stopwords()
Returns a list of stop words like "a", "the", and "and".
Summary
Returns sentiment types of positive or negative based on word(s).
Syntax
word_to_sentiment(<column_text>
, lexicon = <sentiment_lexicon>
)
Arguments
Return Value
If lexicon argument is "bing" (default), Character.
If lexicon argument is "AFINN", Numeric.
If lexicon argument is "nrc", List.
Example
word_to_sentiment("good")
Return "positive".
word_to_sentiment("bad", lexicon="AFINN")
Return -3.
word_to_sentiment("cry", lexicon="nrc")
Return c("negative" "sadness").
Summary
Returns the sentiment score for a text sentence. A positive score indicates a positive sentence and a negative score indicates the opposite; 0 means 'neutral.'
Syntax
get_sentiment(<column_text>
)
Arguments
Return Value
Numeric
Example
get_sentiment("I'm happy")
Return 0.7071068.
get_sentiment("I'm sick.")
Return -0.7071068.
get_sentiment("I'm not so good.")
Return -0.5000000.
Summary
Stems words so that variations of a word with slightly different spellings can be recognized as the same word.
Syntax
stem_word(<column_text>
, language = <stem_language>
)
Arguments
Return Value
Character
Example
stem_word(c("stand","stands", "cheerful", "cheering"))
Return c("stand","stand","cheer","cheer").
Summary
Returns TRUE if a word is included in a list of the stop words defined by one of the dictionaries.
Syntax
is_stopword(<column_text>
, lang = <language>
, include = <character_vector>
, exclude = <character_vector>
)
Arguments
Return Value
Logical
Example
is_stopword(c("a", "and", "stopword", "the"))
Returns c(TRUE, TRUE, FALSE, TRUE).
Summary
Returns TRUE if a text is empty string or NA.
Syntax
is_empty(<column_text>
)
Arguments
Return Value
Logical
Example
is_empty(c("", NA, "Hello"))
Returns c(TRUE, TRUE, FALSE).
mutate(NEW = is_empty(X1))
Determines whether each value in column X1 is an empty string or NA and returns a logical value (TRUE or FALSE).
X1 | NEW |
---|---|
 | TRUE |
NA | TRUE |
Hello | FALSE |
filter(!is_empty(X1))
Keeps only the rows where X1 is neither an empty string nor NA.
Summary
Returns TRUE if a text contains only alphabetic characters.
Syntax
is_alphabet(<column_text>
)
Arguments
Return Value
Logical
Example
is_alphabet(c("1", "132", "32MB", "hello", "Hello World"))
Returns c(FALSE, FALSE, FALSE, TRUE, FALSE).
This is intended to be used on tokens; if applied to whole sentences, it returns FALSE.
Summary
Returns country names from IP addresses.
Syntax
ip_to_country(<column_text>
)
Arguments
Return Value
Character
Example
ip_to_country("133.43.96.45")
returns "Japan".
Summary
Returns domain (ex. "exploratory.io") from url.
Syntax
url_domain(<column_text>
)
Arguments
Return Value
Character
Example
url_domain("https://exploratory.io/login/")
returns "exploratory.io".
Summary
Returns fragment from url.
Syntax
url_fragment(<column_text>
)
Arguments
Return Value
Character
Example
url_fragment("https://exploratory.io/?debug=true#test")
returns "test".
Summary
Returns path from url.
Syntax
url_path(<column_text>
)
Arguments
Return Value
Character
Example
url_path("https://exploratory.io/reference/")
returns "reference/".
Summary
Returns port from url.
Syntax
url_port(<column_text>
)
Arguments
Return Value
Character
Example
url_port("https://exploratory.io:443")
returns "443".
Summary
Returns scheme (ex. "http", "https") from url.
Syntax
url_scheme(<column_text>
)
Arguments
Return Value
Character
Example
url_scheme("https://exploratory.io")
returns "https".
Summary
Returns suffix (ex. "com", "org") from url.
Syntax
url_suffix(<column_text>
)
Arguments
Return Value
Character
Example
url_suffix("https://exploratory.io")
returns "io".
url_suffix("http://sample.edu.co")
returns "edu.co".
Summary
Returns subdomain (ex. "www", "blog") from url.
Syntax
url_subdomain(<column_text>
)
Arguments
Return Value
Character
Example
url_subdomain("https://blog.exploratory.io")
returns "blog".
Summary
Returns top-level domain (ex. "com", "co") from url.
Syntax
url_tld(<column_text>
)
Arguments
Return Value
Character
Example
url_tld("http://sample.edu.co")
returns "co".
Summary
Returns decoded query parameter from url.
Syntax
url_param(<column_text>
, parameter_name = <character>
)
Arguments
* parameter_name - Name of the query parameter. If the url ends with "?user=ferret", the name is "user".
Return Value
Character
Example
url_param("http://example.com/?user=ferret", "user")
returns "ferret".
Summary
Converts country names or country codes to other formats of country codes or names (country names, continent names, region names, etc.). This is useful for standardizing country codes from different data sources when conducting international data analysis or creating reports.
Syntax
countrycode(<column of country names or country codes>
, origin = <format of country codes or country names before conversion>
, destination = <format of country codes or country/region names after conversion>
)
Arguments
Possible values for origin and destination arguments:
Return Value
Country names, country codes, continent names, or region names after conversion (depending on the specified destination format)
Example
countrycode(country_name, origin = "country.name", destination = "iso2c")
Converts country names in the country_name column to ISO 3166-1 alpha-2 (2-letter country codes). Abbreviations and common names are automatically recognized.
country_name | ISO2_code |
---|---|
U.S. | US |
United States | US |
Japan | JP |
China | CN |
South Korea | KR |
countrycode(ISO2_code, origin = "iso2c", destination = "country.name")
Converts 2-letter country codes in the ISO2_code column to official English country names.
ISO2_code | country_name |
---|---|
US | United States |
JP | Japan |
GB | United Kingdom |
DE | Germany |
FR | France |
countrycode(COW_code, origin = "cowc", destination = "country.name")
Converts Correlates of War string format country codes in the COW_code column to official English country names.
COW_code | country_name |
---|---|
USA | United States |
JPN | Japan |
GBR | United Kingdom |
RUS | Russia |
CHN | China |
countrycode(COW_numeric_code, origin = "cown", destination = "country.name")
Converts Correlates of War numeric codes in the COW_numeric_code column to official English country names.
COW_numeric_code | country_name |
---|---|
2 | United States |
4 | United Kingdom |
81 | Japan |
365 | Russia |
710 | China |
countrycode(ISO3_code, origin = "iso3c", destination = "iso2c")
Converts 3-letter country codes in the ISO3_code column to 2-letter country codes.
ISO3_code | ISO2_code |
---|---|
USA | US |
JPN | JP |
GBR | GB |
DEU | DE |
FRA | FR |
countrycode(country_name, origin = "country.name", destination = "continent")
Converts country names in the country_name column to the continent names to which they belong.
country_name | continent_name |
---|---|
United States | Americas |
Japan | Asia |
Germany | Europe |
South Africa | Africa |
Australia | Oceania |
countrycode(country_name, origin = "country.name", destination = "region")
Converts country names in the country_name column to region names classified into 7 regions.
country_name | region_name |
---|---|
United States | Northern America |
Japan | Eastern Asia |
Germany | Western Europe |
Brazil | South America |
Australia | Australia and New Zealand |
Summary
This function takes a column that has US State information and returns US State names, abbreviations, numeric codes, divisions, or regions, based on the 'output_type' argument. The US State information can be US State names, US State abbreviations (FIPS / ANSI codes), or US State numeric codes (FIPS), in any combination.
Syntax
statecode(<state_column>
, output_type = <statecode_type>
)
Arguments
Output Types:
Return Value
State name, State code, etc.
Example
statecode("CA", "name")
Return California.
statecode("CA", "num_code")
Return 06.
statecode("California", "num_code")
Return 06.
statecode("CA", "region")
Return West.
Summary
Generate US county code (FIPS - Federal Information Processing Standard) based on US State and County names.
Syntax
countycode(state = <state_column>
, county = <county_column>
)
Arguments
Return Value
County code
Example
countycode("California", "San Mateo")
Return 06081.
countycode("CA", "San Mateo")
Return 06081.
countycode("CA", "San Mateo County")
Return 06081.
Summary
Returns one token (e.g. word) per row after tokenizing a text.
Syntax
do_tokenize(<column_text>
, token = <token_type>
, keep_cols = <logical>
, to_lower = <logical>
, drop = <logical>
, output = <new_column_name>
, with_id = <logical>
, pattern = <pattern>
)
Arguments
Return Value
Data frame
Example
Original data:
index | text |
---|---|
First | It was a great party. |
Second | She has just left. She will be off tomorrow. |
do_tokenize(text)
index | document_id | sentence_id | token |
---|---|---|---|
First | 1 | 1 | it |
First | 1 | 1 | was |
First | 1 | 1 | a |
First | 1 | 1 | great |
First | 1 | 1 | party |
Second | 2 | 1 | she |
Second | 2 | 1 | has |
Second | 2 | 1 | just |
Second | 2 | 1 | left |
Second | 2 | 2 | she |
Second | 2 | 2 | will |
Second | 2 | 2 | be |
Second | 2 | 2 | off |
Second | 2 | 2 | tomorrow |
do_tokenize(text, token="sentences")
index | token |
---|---|
First | it was a great party. |
Second | she has just left. |
Second | she will be off tomorrow. |
Summary
Counts pairs of words (tokens) that co-occur within a group.
Syntax
pair_count(group = <column>
, value = <column>
, distinct = <logical>
, diag = <logical>
, sort = <logical>
)
Arguments
Return Value
Data frame
Example
pair_count(group=title, value=word, distinct=FALSE, diag = FALSE, sort=TRUE)
Returns a data frame with word.x and word.y columns, whose values come from the "word" column, and a value column that holds the count.
Summary
Calculates TF-IDF for each term against a group. TF-IDF is a weighting mechanism that calculates the importance of each word to each document by increasing the importance based on the term frequency while decreasing the importance based on the document frequency.
Syntax
do_tfidf(<document_column>
, <token_column>
, tf_weight = <tf_weight_type>
, idf_log_scale = <function>
, norm = "l2"|"l1"|FALSE
)
Arguments
* tf_weight - How to weight the term frequency. The log-scale option computes 1 + log(count of a term in a document).
* idf_log_scale - The log function used for the IDF part, computed as log_scale_function((the total number of documents) / (the number of documents which have the token)). This expresses how rare the token is in the set of documents. It might be worth trying log2 or log10; log2 increases the value faster and log10 increases it more slowly.
* norm - How to normalize the values for each document: "l2" (default), "l1", or FALSE.
Return Value
Data frame
Example
Original data:
document_id | token |
---|---|
1 | this |
1 | is |
1 | what |
1 | it |
1 | is |
2 | which |
2 | is |
2 | better |
do_tfidf(document_id, token)
document_id | token | count_per_doc | count_of_docs | tfidf |
---|---|---|---|---|
1 | is | 2 | 2 | 0.0000000 |
1 | it | 1 | 1 | 0.5773503 |
1 | this | 1 | 1 | 0.5773503 |
1 | what | 1 | 1 | 0.5773503 |
2 | better | 1 | 1 | 0.7071068 |
2 | is | 1 | 2 | 0.0000000 |
2 | which | 1 | 1 | 0.7071068 |
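The numbers in this table can be reproduced by hand. A minimal sketch, assuming the default tf weight is the raw term count, the default idf uses the natural log, and the default norm is "l2":
library(dplyr)

tokens <- tibble(
  document_id = c(1, 1, 1, 1, 1, 2, 2, 2),
  token = c("this", "is", "what", "it", "is", "which", "is", "better")
)

n_docs <- n_distinct(tokens$document_id)

tokens %>%
  count(document_id, token, name = "count_per_doc") %>%
  group_by(token) %>%
  mutate(count_of_docs = n_distinct(document_id)) %>%   # document frequency
  group_by(document_id) %>%
  mutate(tfidf = count_per_doc * log(n_docs / count_of_docs),
         tfidf = tfidf / sqrt(sum(tfidf^2))) %>%        # l2-normalize per document
  ungroup()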
Summary
Creates columns of n-grams connected within sentences.
Syntax
do_ngram(<token_column>
, <document_column>
, <sentence_column>
, maxn = <numeric>
, sep = <text>
)
Arguments
Return Value
Data frame
Example
Original data:
document_id | sentence_id | token |
---|---|---|
1 | 1 | it |
1 | 1 | is |
1 | 1 | good |
2 | 1 | she |
2 | 1 | left |
2 | 2 | she |
2 | 2 | will |
2 | 2 | come |
2 | 2 | tomorrow |
do_ngram(token, document_id, sentence_id, maxn=3)
document_id | sentence_id | gram | token |
---|---|---|---|
1 | 1 | 1 | it |
1 | 1 | 1 | is |
1 | 1 | 1 | good |
1 | 1 | 2 | it_is |
1 | 1 | 2 | is_good |
1 | 1 | 3 | it_is_good |
2 | 1 | 1 | she |
2 | 1 | 1 | left |
2 | 1 | 2 | she_left |
2 | 2 | 1 | she |
2 | 2 | 1 | will |
2 | 2 | 1 | come |
2 | 2 | 1 | tomorrow |
2 | 2 | 2 | she_will |
2 | 2 | 2 | will_come |
2 | 2 | 2 | come_tomorrow |
2 | 2 | 3 | she_will_come |
2 | 2 | 3 | will_come_tomorrow |
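For intuition, the 2-gram rows above can be derived with dplyr's lead() within each sentence; a minimal sketch (do_ngram itself generates all lengths from 1 up to maxn):
library(dplyr)

tokens <- tibble(
  document_id = c(1, 1, 1, 2, 2, 2, 2, 2, 2),
  sentence_id = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  token = c("it", "is", "good", "she", "left", "she", "will", "come", "tomorrow")
)

tokens %>%
  group_by(document_id, sentence_id) %>%
  mutate(bigram = if_else(is.na(lead(token)), NA_character_,
                          paste(token, lead(token), sep = "_"))) %>%
  ungroup()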
Summary
Calculates the similarity between each pair of the documents using the cosine similarity algorithm. Cosine similarity measures the cosine of the angle between two vectors in the multi-dimensional space.
Syntax
do_cosine_sim.kv(<subject_column>
, <key_column>
, <value_column>
, distinct = <logical>
, diag = <logical>
, fun.aggregate = <aggregate_function>
)
Arguments
Return Value
Data frame
Example
Original data:
document_id | token | value |
---|---|---|
Lisa | it | 0.8966972 |
Lisa | was | 0.2655087 |
Lisa | good | 0.3721239 |
Emily | she | 0.5728534 |
Emily | is | 0.9082078 |
Emily | nice | 0.2016819 |
John | she | 0.8983897 |
John | is | 0.9446753 |
John | good | 0.6607978 |
do_cosine_sim.kv(document_id, token, value)
document_id.x | document_id.y | value |
---|---|---|
Lisa | Emily | 0.0000000 |
Lisa | John | 0.1671574 |
Emily | Lisa | 0.0000000 |
Emily | John | 0.8595770 |
John | Lisa | 0.1671574 |
John | Emily | 0.8595770 |
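The values above follow the usual cosine formula, dot(x, y) / (|x| * |y|), treating tokens a document does not have as 0. A minimal sketch with standard R that reproduces the table:
library(dplyr)
library(tidyr)

d <- tribble(
  ~document_id, ~token, ~value,
  "Lisa",  "it",   0.8966972,
  "Lisa",  "was",  0.2655087,
  "Lisa",  "good", 0.3721239,
  "Emily", "she",  0.5728534,
  "Emily", "is",   0.9082078,
  "Emily", "nice", 0.2016819,
  "John",  "she",  0.8983897,
  "John",  "is",   0.9446753,
  "John",  "good", 0.6607978
)

wide <- d %>%
  pivot_wider(names_from = document_id, values_from = value, values_fill = 0)
m <- as.matrix(wide[-1])            # rows = tokens, columns = documents
norms <- sqrt(colSums(m^2))
crossprod(m) / outer(norms, norms)  # cosine similarity matrix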
Summary
Executes multidimensional scaling (MDS). Calculates approximate coordinates from the distances of entity pairs.
Syntax
do_cmdscale(<name1_column>
, <name2_column>
, <value_column>
, k = <integer>
)
Arguments
Return Value
Data frame
Example
Original data:
name.x | name.y | value |
---|---|---|
cow | horse | 36 |
cow | turtle | 82 |
horse | turtle | 48 |
do_cmdscale(name.x, name.y, value)
name | V1 | V2 |
---|---|---|
cow | -38.915551 | -3.435417 |
horse | -4.163866 | 5.962475 |
turtle | 43.079417 | -2.527058 |
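This is classical MDS, available in base R as cmdscale; a minimal sketch (coordinates are unique only up to rotation and reflection, so signs may differ from the table above):
d <- matrix(0, 3, 3, dimnames = list(c("cow", "horse", "turtle"),
                                     c("cow", "horse", "turtle")))
d["cow", "horse"]    <- d["horse", "cow"]    <- 36
d["cow", "turtle"]   <- d["turtle", "cow"]   <- 82
d["horse", "turtle"] <- d["turtle", "horse"] <- 48

cmdscale(as.dist(d), k = 2)  # approximate 2-D coordinates from the distances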
Summary
Impute NA by mean / median / mode values, manually entered values, or predicted values. It will build a linear regression model for the prediction.
Syntax
impute_na(<target_column>
, type = <impute_na_type>
, val = <column>
, <column(s)>
)
Arguments
* target_column - Column whose NA values should be imputed
* type - How to impute: "mean", "median", "mode", "value", or "predict"
* val - The value to impute with when type is "value"
* <column(s)> - Predictor columns used to build the linear regression model when type is "predict"
Example
mutate(New = impute_na(DEP_DELAY, type = "mean"))
Imputes missing values in the DEP_DELAY column with the mean value of that column.
CARRIER | DEP_DELAY | NEW |
---|---|---|
UA | 7 | 7 |
UA | NA | 8.75 |
UA | 0 | 0 |
AA | 20 | 20 |
AA | NA | 8.75 |
AA | 8 | 8 |
mutate(New = impute_na(DEP_DELAY, type = "median"))
Imputes missing values in the DEP_DELAY column with the median value of that column.
CARRIER | DEP_DELAY | NEW |
---|---|---|
UA | 7 | 7 |
UA | NA | 7.5 |
UA | 0 | 0 |
AA | 20 | 20 |
AA | NA | 7.5 |
AA | 8 | 8 |
mutate(New = impute_na(product_category, type = "mode"))
Imputes missing values in the product_category column with the most frequently occurring value in that column.
product_ID | product_category | NEW |
---|---|---|
001 | Food | Food |
002 | NA | Food |
003 | Clothing | Clothing |
004 | Food | Food |
005 | NA | Food |
006 | Daily Items | Daily Items |
mutate(New = impute_na(DEP_DELAY, type = "value", val = 10))
Imputes missing values in the DEP_DELAY column with the value 10.
CARRIER | ARR_DELAY | DEP_DELAY | DISTANCE | NEW |
---|---|---|---|---|
UA | 10 | 7 | 300 | 7 |
UA | 20 | NA | 1000 | 10 |
UA | -5 | 0 | 200 | 0 |
AA | 20 | 20 | 500 | 20 |
AA | -5 | NA | 700 | 10 |
AA | 10 | 8 | 100 | 8 |
mutate(New = impute_na(DEP_DELAY, type = "predict", CARRIER, ARR_DELAY, DISTANCE))
Imputes missing values in the DEP_DELAY column with predicted values using linear regression based on the CARRIER, ARR_DELAY, and DISTANCE columns.
CARRIER | ARR_DELAY | DEP_DELAY | DISTANCE | NEW |
---|---|---|---|---|
UA | 10 | 7 | 300 | 7 |
UA | 20 | NA | 1000 | 3.47 |
UA | -5 | 0 | 200 | 0 |
AA | 20 | 20 | 500 | 20 |
AA | -5 | NA | 700 | 3.59 |
AA | 10 | 8 | 100 | 8 |
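The prediction-based imputation can be sketched with base R's lm; this illustrates the approach only, since the model impute_na actually builds, and therefore the predicted values, may differ:
df <- data.frame(
  CARRIER   = c("UA", "UA", "UA", "AA", "AA", "AA"),
  ARR_DELAY = c(10, 20, -5, 20, -5, 10),
  DEP_DELAY = c(7, NA, 0, 20, NA, 8),
  DISTANCE  = c(300, 1000, 200, 500, 700, 100)
)

# fit on the complete rows, then predict only where DEP_DELAY is missing
fit <- lm(DEP_DELAY ~ CARRIER + ARR_DELAY + DISTANCE, data = df)
df$NEW <- ifelse(is.na(df$DEP_DELAY),
                 predict(fit, newdata = df),
                 df$DEP_DELAY)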
Summary
Detects outliers in a numeric column and returns labels indicating if values are outliers ('Upper' or 'Lower') or not ('Normal'). This function is useful for labeling outliers or for easily removing outliers when used in combination with a filter.
Syntax
detect_outlier(<column_num>
, type = <detect_outlier_type>
, threshold = <numeric>
)
Arguments
Example
mutate(New = detect_outlier(DISTANCE, type = "iqr"))
Detects outliers in the data using IQR and labels them as "Upper", "Normal", or "Lower".
CARRIER | DISTANCE | NEW |
---|---|---|
UA | 800 | Normal |
UA | 5000 | Upper |
UA | 600 | Normal |
AA | 500 | Normal |
AA | 700 | Normal |
AA | 10 | Lower |
filter(detect_outlier(population, "iqr") == "Normal")
Removes outliers detected using IQR on both sides.
state | population |
---|---|
California | 3979576 |
Texas | 2693976 |
mutate(outlier_label = detect_outlier(population, type = "percentile", threshold = 0.95))
Labels outliers using the 95th percentile method.
state | population | outlier_label |
---|---|---|
New York | 8336817 | Upper |
California | 3979576 | Normal |
Texas | 2693976 | Normal |
filter(detect_outlier(population, type = "percentile", threshold = 0.99) == "Normal")
Removes outliers detected using the 99th percentile.
state | population |
---|---|
New York | 8336817 |
California | 3979576 |
Texas | 2693976 |
mutate(outlier_label = detect_outlier(population, type = "standard_deviation", threshold = 2))
Labels values exceeding 2 standard deviations as outliers.
state | population | outlier_label |
---|---|---|
New York | 8336817 | Upper |
California | 3979576 | Normal |
Texas | 2693976 | Normal |
Vermont | 193976 | Lower |
filter(detect_outlier(population, type = "standard_deviation", threshold = 3) == "Normal")
Removes outliers detected using 3 standard deviations.
state | population |
---|---|
California | 3979576 |
Texas | 2693976 |
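For the "iqr" type, a minimal sketch of the rule, assuming the conventional 1.5 * IQR fences (an assumption, since detect_outlier's exact defaults are not documented here); it reproduces the labels in the first table above:
iqr_label <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  dplyr::case_when(
    x > q[2] + 1.5 * iqr ~ "Upper",
    x < q[1] - 1.5 * iqr ~ "Lower",
    TRUE ~ "Normal"
  )
}

iqr_label(c(800, 5000, 600, 500, 700, 10))
# "Normal" "Upper" "Normal" "Normal" "Normal" "Lower"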
Summary
Builds a K-means clustering model and labels each row with a cluster id.
Syntax
cluster(<column(s)>
, n_cluster = <integer>
)
Arguments
Example
Original data:
CARRIER | ARR_DELAY | DEP_DELAY | DISTANCE |
---|---|---|---|
UA | 10 | 7 | 800 |
UA | 20 | NA | 5000 |
UA | -5 | 0 | 600 |
AA | 20 | 20 | 500 |
AA | -5 | NA | 700 |
AA | 10 | 8 | 10 |
mutate(cluster = cluster(DEP_DELAY, DISTANCE))
Labels each row with a cluster id (factor) based on the DEP_DELAY and DISTANCE column values.
CARRIER | ARR_DELAY | DEP_DELAY | DISTANCE | cluster |
---|---|---|---|---|
UA | 10 | 7 | 800 | 2 |
UA | 20 | NA | 5000 | NA |
UA | -5 | 0 | 600 | 1 |
AA | 20 | 20 | 500 | 1 |
AA | -5 | NA | 700 | NA |
AA | 10 | 8 | 10 | 3 |
Summary
Creates a data frame with models built from the input data frame, using the given model function and arguments.
Syntax
build_model(model_func = <model_function>
, seed = <integer>
, test_rate = <numeric>
, <model_parameters>
)
Arguments
Example
Original data:
CARRIER | ARR_DELAY | DEP_DELAY | DISTANCE | WEIGHTS |
---|---|---|---|---|
UA | 10 | 7 | 300 | 0.8 |
UA | 20 | 30 | 1000 | 0.5 |
UA | -5 | 0 | 200 | 1.2 |
AA | 20 | 20 | 500 | 0.8 |
AA | -5 | 3 | 700 | 0.9 |
AA | 10 | 8 | 1000 | 0.5 |
%>% build_model(model_func = lme4::lmer, formula = ARR_DELAY ~ DEP_DELAY + (DISTANCE|CARRIER), test_rate = 0.1, weights = WEIGHTS)
Returns a data frame that stores a linear mixed-effects model. It also returns a column holding the original data.
source_data | model | .test_index |
---|---|---|
source dataframe | lme4 model | c(1) |
You can use the model_coef or model_stats function to get summary information about the models. You can also use the prediction function to predict with the data stored in the same data frame.
Summary
One-hot encodes a categorical column, producing a separate column for each categorical value; each new column holds 1 or 0 to indicate whether the row has the value that the column represents.
Syntax
one_hot(<column>
)
Example
Original data:
ARR_DELAY | CARRIER |
---|---|
10 | UA |
20 | UA |
-5 | UA |
20 | AA |
-5 | AA |
10 | AA |
one_hot(CARRIER)
ARR_DELAY | CARRIER_UA | CARRIER_AA |
---|---|---|
10 | 1 | 0 |
20 | 1 | 0 |
-5 | 1 | 0 |
20 | 0 | 1 |
-5 | 0 | 1 |
10 | 0 | 1 |
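The same encoding can be sketched with base R's model.matrix; column names and order differ slightly from one_hot's output:
df <- data.frame(ARR_DELAY = c(10, 20, -5, 20, -5, 10),
                 CARRIER   = c("UA", "UA", "UA", "AA", "AA", "AA"))

# "- 1" drops the intercept so every carrier gets its own indicator column
cbind(df["ARR_DELAY"], model.matrix(~ CARRIER - 1, df))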
Summary
Returns the most frequent value (mode) in a column.
Syntax
get_mode(<column>
, na.rm = <logical>
)
Arguments
* column - Column to get the mode of
* na.rm (Optional) - Whether NA should be ignored even if it is the most frequent value. The default is FALSE.
Example
Original data:
ARR_DELAY | CARRIER |
---|---|
10 | UA |
20 | UA |
-5 | UA |
20 | AA |
-5 | AA |
summarize(ARR_DELAY_mode = get_mode(ARR_DELAY), CARRIER_mode = get_mode(CARRIER))
ARR_DELAY_mode | CARRIER_mode |
---|---|
20 | UA |
Summary
Summarize (or aggregate) column values with specified aggregate functions. If grouping columns are provided then summarize the values for each group.
Syntax
summarize_group(group_cols=<column_list>
, group_funs = <function(s)>
, ...)
Arguments
* group_cols - Columns to group by
* group_funs - Aggregate functions (e.g. "none", "month") to apply to the group-by columns.
* ... - A pair of a newly created column name and an aggregation function with a column name. You can give multiple pairs to aggregate on multiple columns.
Example
Original data:
ARR_DELAY | CARRIER |
---|---|
10 | UA |
20 | UA |
30 | UA |
-20 | UA |
20 | AA |
-5 | AA |
-9 | AA |
summarize_group(group_cols=c("CARRIER"), group_funs=("none"), ARR_DELAY_mean = mean(ARR_DELAY))
CARRIER | ARR_DELAY_mean |
---|---|
UA | 10 |
AA | 2 |
Original data:
ARR_DELAY | DEP_DELAY | CARRIER |
---|---|---|
10 | 20 | UA |
20 | 15 | UA |
30 | -5 | UA |
-20 | 30 | UA |
20 | 10 | AA |
-5 | 20 | AA |
-9 | 30 | AA |
summarize_group(group_cols=c("CARRIER"), group_funs=("none"), ARR_DELAY_mean = mean(ARR_DELAY), DEP_DEALY_sum = sum(DEP_DELAY))
CARRIER | ARR_DELAY_mean | DEP_DELAY_sum |
---|---|---|
UA | 10 | 60 |
AA | 2 | 60 |
Original data:
FL_DATE | ARR_DELAY | DEP_DELAY |
---|---|---|
06/01/2018 | 10 | 20 |
06/12/2018 | 20 | 15 |
06/18/2018 | 30 | -5 |
06/21/2018 | -20 | 30 |
07/14/2018 | 20 | 10 |
07/16/2018 | -5 | 20 |
07/25/2018 | -9 | 30 |
summarize_group(group_cols=c("FL_DATE"), group_funs=("month"), ARR_DELAY_mean = mean(ARR_DELAY), DEP_DEALY_sum = sum(DEP_DELAY))
FL_DATE_month | ARR_DELAY_mean | DEP_DELAY_sum |
---|---|---|
6 | 10 | 60 |
7 | 2 | 60 |
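The last example corresponds to the following dplyr sketch (assuming CRAN dplyr/lubridate, not summarize_group itself):
library(dplyr)
library(lubridate)

flights <- tibble(
  FL_DATE = mdy(c("06/01/2018", "06/12/2018", "06/18/2018", "06/21/2018",
                  "07/14/2018", "07/16/2018", "07/25/2018")),
  ARR_DELAY = c(10, 20, 30, -20, 20, -5, -9),
  DEP_DELAY = c(20, 15, -5, 30, 10, 20, 30)
)

flights %>%
  group_by(FL_DATE_month = month(FL_DATE)) %>%   # group_funs = ("month")
  summarize(ARR_DELAY_mean = mean(ARR_DELAY),
            DEP_DELAY_sum  = sum(DEP_DELAY))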
Summary
Summarize (or aggregate) column values across columns for each row with the specified aggregate function.
Syntax
summarize_row(across(c(<column(s)>)), <aggregate_function>, ...)
summarize_row(across(where(<columns_select_function>)), <aggregate_function>, ...)
summarize_row(across(starts_with(<text>, ignore.case = <logical>)), <aggregate_function>, ...)
summarize_row(across(ends_with(<text>, ignore.case = <logical>)), <aggregate_function>, ...)
summarize_row(across(contains(<text>, ignore.case = <logical>)), <aggregate_function>, ...)
summarize_row(across(matches(<text>, ignore.case = <logical>)), <aggregate_function>, ...)
summarize_row(across(num_range(<text>, <start_num>:<end_num>)), <aggregate_function>, ...)
Arguments
Example
mutate(Total = summarize_row(across(c(Sales_Consumer, Sales_Business, Sales_Government)), sum))
Sums up values from the specified columns for each row, and stores the sums in the Total column. If any of the column values are NA, the result will be NA too.
mutate(Total = summarize_row(across(c(Sales_Consumer, Sales_Business, Sales_Government)), sum, na.rm = TRUE))
Same as the above example, except that even if some of the column values are NA, the result will be calculated from the rest of the values.
mutate(Total = summarize_row(across(where(is.numeric)), sum, na.rm = TRUE))
Sums up values from all numeric columns for each row, and stores the sums in the Total column.
Summary
Creates new columns or updates existing columns with specified functions such as window calculations. If grouping columns are provided, it performs the calculation for each group.
Syntax
mutate_group(group_cols=<column_list>
, group_funs = <function(s)>
, sort_cols=<column_list>
, sort_funs = <function(s)>
, ...)
Arguments
Example
Original data:
ARR_DELAY | CARRIER |
---|---|
10 | UA |
20 | UA |
30 | UA |
20 | UA |
20 | AA |
5 | AA |
9 | AA |
mutate_group(group_cols=c("CARRIER"), group_funs=("none"), ARR_DELAY_cumsum = cumsum(ARR_DELAY))
CARRIER | ARR_DELAY_cumsum |
---|---|
UA | 10 |
UA | 30 |
UA | 60 |
UA | 80 |
AA | 20 |
AA | 25 |
AA | 34 |
Original data:
ARR_DELAY | DEP_DELAY | CARRIER |
---|---|---|
10 | 20 | UA |
20 | 15 | UA |
30 | -5 | UA |
-20 | 30 | UA |
20 | 10 | AA |
-5 | 20 | AA |
-9 | 30 | AA |
mutate_group(group_cols=c("CARRIER"), group_funs=("none"), ARR_DELAY_dfprev = ARR_DELAY - lag(ARR_DELAY), DEP_DEALY_dfprev = DEP_DELAY - lag(DEP_DELAY))
CARRIER | ARR_DELAY_dfprev | DEP_DELAY_dfprev |
---|---|---|
UA | NA | NA |
UA | 10 | -5 |
UA | 10 | -20 |
UA | -50 | 35 |
AA | NA | NA |
AA | -25 | 10 |
AA | -4 | 10 |
Summary
Returns the sum of all the values that satisfy given conditions.
Syntax
sum_if(<column_num_logic>
, <condition>
,... , na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
summarize(total = sum_if(revenue, product == 'PC' & country %in% c("Japan", "United States", "United Kingdom")))
Creates a new column calculating the sum of revenue where product is 'PC' and country is "Japan", "United States", or "United Kingdom", for each group.
summarize(total = sum_if(quantity, priority == 'P1' | country == "United States"))
Creates a new column calculating the sum of quantity where priority is 'P1' or country is "United States".
Summary
Returns the ratio of the sum of all the values that satisfy given conditions to the sum of all the values.
Syntax
sum_if_ratio(<column_num_logic>
, <condition>
,... , na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
summarize(total = sum_if_ratio(revenue, product == 'PC' & country %in% c("Japan", "United States", "United Kingdom")))
Creates a new column calculating the ratio of the sum of revenue where product is 'PC' and country is "Japan", "United States", or "United Kingdom" to the total sum of revenue, for each group.
summarize(total = sum_if_ratio(quantity, priority == 'P1' | country == "United States"))
Creates a new column calculating the ratio of the sum of quantity where priority is 'P1' or country is "United States" to the total sum of quantity.
Summary
Summarizes the data by counting the number of rows that satisfy given conditions.
Syntax
count_if(<column>
, <condition>
, ... , na.rm = <logical>
)
Arguments
Example
summarize(country_count = count_if(country, product == 'PC' | quantity > 10))
Returns the number of rows where product is 'PC' or quantity is greater than 10, for each group.
Summary
Returns the ratio of the number of rows that satisfy given conditions to the total number of rows.
Syntax
count_if_ratio(<column>
, <condition>
, ... , na.rm = <logical>
)
Arguments
Example
summarize(country_count_ratio = count_if_ratio(country, product == 'PC' | quantity > 10))
Returns the ratio of the number of rows where product is 'PC' or quantity is greater than 10 to the total number of rows, for each group.
Summary
Counts the number of unique values of a column among the rows that satisfy given conditions.
Syntax
count_unique_if(<column>
, <condition>
, ... , na.rm = <logical>
)
Arguments
Example
summarize(country_count = count_unique_if(country, product == 'PC' & quantity > 10))
Returns the number of unique countries among rows where product is 'PC' and quantity is greater than 10, for each group.
Summary
Returns the ratio of the number of unique values of a column among the rows that satisfy given conditions to the total number of unique values.
Syntax
count_unique_if_ratio(<column>
, <condition>
, ... , na.rm = <logical>
)
Arguments
Example
summarize(country_count = count_unique_if_ratio(country, product == 'PC' & quantity > 10))
Returns the ratio of the number of unique countries among rows where product is 'PC' and quantity is greater than 10 to the total number of unique countries, for each group.
Summary
Returns the numerical average (mean) of the values that satisfy given conditions.
Syntax
mean_if(<column_num_logic_date>
, <condition>
, ... , na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
summarize(revenue_mean = mean_if(revenue, product == 'PC' | quantity > 10))
Creates a new column calculating the average revenue where product is 'PC' or quantity is greater than 10, for each group.
Summary
Returns the numerical average (mean) of the values that satisfy given conditions. This is an alias of mean_if.
Syntax
average_if(<column_num_logic_date>
, <condition>
, ... , na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
summarize(revenue_mean = average_if(revenue, product == 'PC' | quantity > 10))
Creates a new column calculating the average revenue where product is 'PC' or quantity is greater than 10, for each group.
Summary
Returns the numerical median of the values that satisfy given conditions.
Syntax
median_if(<column_num_logic_date>
, <condition>
, ... , na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
summarize(revenue_median = median_if(revenue, product == 'PC' | quantity > 10))
Creates a new column calculating the median revenue where product is 'PC' or quantity is greater than 10, for each group.
Summary
Returns the maximum value from a numeric or date column that meets the specified condition.
Syntax
max_if(<column>
, <condition>
, ... , na.rm = <logical>
)
Arguments
Return Value
Numeric, Date, or DateTime (same type as the input column)
Example
max_if(purchase_amount, customer_category == "premium")
Returns the maximum "purchase_amount" for customers whose "customer_category" is "premium".
order_id | customer_category | purchase_amount | result |
---|---|---|---|
O001 | regular | 15000 | 58000 |
O002 | premium | 58000 | 58000 |
O003 | premium | 42000 | 58000 |
O004 | regular | 25000 | 58000 |
max_if(purchase_date, product_name == "Smartphone X")
Returns the latest date (most recent date) when "Smartphone X" was purchased.
order_id | product_name | purchase_date | result |
---|---|---|---|
O001 | Tablet Y | 2023-06-15 | 2023-05-22 |
O002 | Smartphone X | 2023-02-10 | 2023-05-22 |
O003 | Laptop PC Z | 2023-04-05 | 2023-05-22 |
O004 | Smartphone X | 2023-05-22 | 2023-05-22 |
Summary
Returns the minimum value from a numeric or date column that meets the specified condition.
Syntax
min_if(<column>
, <condition>
, ..., na.rm = <logical>
)
Arguments
Return Value
Numeric, Date, or DateTime (same type as the input column)
Example
min_if(price, product_category == "electronics")
Returns the minimum "price" for products whose "product_category" is "electronics".
product_id | product_category | price | result |
---|---|---|---|
E001 | electronics | 25000 | 12000 |
F001 | food | 5000 | 12000 |
E002 | electronics | 12000 | 12000 |
C001 | clothing | 8000 | 12000 |
min_if(purchase_date, product_name == "Laptop A")
Returns the earliest date when "Laptop A" was purchased.
order_id | product_name | purchase_date | result |
---|---|---|---|
O001 | Tablet B | 2023-03-15 | 2023-01-20 |
O002 | Laptop A | 2023-01-20 | 2023-01-20 |
O003 | Smartphone C | 2023-02-10 | 2023-01-20 |
O004 | Laptop A | 2023-04-05 | 2023-01-20 |
Summary
Returns the count of the rows for each group.
Syntax
count_rows()
Arguments
Return Value
Numeric
Example
summarize(total = count_rows())
Creates a new column for calculating the number of entries for each group.
Summary
Returns the count of unique values.
Syntax
count_unique(<column(s)>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
summarize(total = count_unique(TAIL_NUM))
Creates a new column calculating the count of unique values of TAIL_NUM for each group.
summarize(total = count_unique(STATE, CITY))
Creates a new column calculating the count of unique combinations of STATE and CITY for each group.
Summary
Returns either Weekday or Weekend based on the provided date column value.
Syntax
weekend(<column_date>
)
Arguments
* column - Date Column to get Weekday or Weekend
Return Value
Factor (either Weekday or Weekend)
Example
weekend("2015-10-01")
returns Weekday
weekend("2015-10-02")
returns Weekend
Summary
Alias of is_jholiday. Returns TRUE if the provided date is a Japanese Holiday.
Syntax
is_japanese_holiday(<column_date>
)
Arguments
* column - Date column to check whether it is a Japanese Holiday
Return Value
Logical (either TRUE or FALSE)
Example
is_japanese_holiday("2020-01-01")
returns TRUE
is_japanese_holiday("2020-01-05")
returns FALSE
Summary
Extracts the week of month. For example, if the given date is in the 1st week of the month, it returns 1.
Syntax
get_week_of_month(<column_date>
)
Return Value
Numeric
Example
get_week_of_month(as.Date("2019-06-01")) returns 1 get_week_of_month(as.Date("2019-06-25")) returns 5
Summary
Samples n rows from the data frame. This is the same as sample_n, except that it handles the case where the number of rows in the data is fewer than the specified n without throwing an error.
Syntax
sample_rows(<number>
, weight=<column_num>
, replace=<logical>
, seed=<number>
)
Arguments
Example
sample_rows(100)
Selects 100 random rows.
sample_rows(100, weight = ARR_DELAY)
Selects 100 random rows from the data, weighted by the ARR_DELAY column.
Summary
Returns a confidence interval range (half-width of confidence interval) of mean from given numeric data.
Syntax
confint_mean(<column_num>
, level=<numeric>
)
Arguments
Return Value
Numeric
Example
Original data:
DEPARTMENT | MONTHLY_INCOME |
---|---|
Sales | 2200 |
Sales | 3000 |
Sales | 5000 |
Sales | : |
R&D | 4000 |
R&D | 3200 |
R&D | 2200 |
R&D | : |
confint = confint_mean(MONTHLY_INCOME)
DEPARTMENT | confint |
---|---|
Sales | 25.32468 |
R&D | 17.9226 |
Summary
Returns a confidence interval range (half-width of confidence interval) of TRUE ratio from given logical data.
Syntax
confint_ratio(<column_logical>
, level=<numeric>
)
Arguments
Return Value
Numeric
Example
Original data:
DEPARTMENT | ATTRITION |
---|---|
Sales | TRUE |
Sales | FALSE |
Sales | FALSE |
Sales | : |
R&D | FALSE |
R&D | TRUE |
R&D | FALSE |
R&D | : |
confint = confint_ratio(ATTRITION)
DEPARTMENT | confint |
---|---|
Sales | 0.02065466 |
R&D | 0.02683791 |
Summary
Returns a confidence interval range (half-width of confidence interval) of mean from a size and a standard deviation of sample data.
Syntax
calc_confint_mean(<column_num>
, <column_num>
, level=<numeric>
)
Arguments
Return Value
Numeric
Example
Original data:
DEPARTMENT | SD | NUMBER_OF_PEOPLE |
---|---|---|
Sales | 577.4946 | 2000 |
R&D | 288.8194 | 1000 |
confint = calc_confint_mean(SD, NUMBER_OF_PEOPLE)
DEPARTMENT | SD | NUMBER_OF_PEOPLE | confint |
---|---|---|---|
Sales | 577.4946 | 2000 | 25.32468 |
R&D | 288.8194 | 1000 | 17.9226 |
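The confint values above match the usual t-based half-width, t * SD / sqrt(n); a quick check in base R (assuming the default level is 0.95):
qt(0.975, df = 2000 - 1) * 577.4946 / sqrt(2000)  # 25.32468 (Sales)
qt(0.975, df = 1000 - 1) * 288.8194 / sqrt(1000)  # 17.9226  (R&D)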
Summary
Returns a confidence interval range (half-width of confidence interval) of ratio from a size and a TRUE ratio of sample data.
Syntax
calc_confint_ratio(<column_num>
, <column_num>
, level=<numeric>
)
Arguments
Return Value
Numeric
Example
Original data:
DEPARTMENT | ATTRITION_RATIO | NUMBER_OF_PEOPLE |
---|---|---|
Sales | 0.33 | 2000 |
R&D | 0.25 | 1000 |
confint = calc_confint_ratio(ATTRITION_RATIO, NUMBER_OF_PEOPLE)
DEPARTMENT | ATTRITION_RATIO | NUMBER_OF_PEOPLE | confint |
---|---|---|---|
Sales | 0.33 | 2000 | 0.02065466 |
R&D | 0.25 | 1000 | 0.02683791 |
Summary
Calculates the cumulative sum of decaying effects. It is the same as cumsum when r (the second argument) is 1.
Syntax
cumsum_decayed(<column_num>
, <numeric>
)
Arguments
Return Value Numeric
Example
Original data:
Ad_Unit |
---|
100 |
200 |
100 |
100 |
mutate(Effective_Ad_Unit = cumsum_decayed(Ad_Unit, 0.1))
Ad_Unit | Effective_Ad_Unit |
---|---|
100 | 100 |
200 | 210 |
100 | 121 |
100 | 112.1 |
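Each output value is the current input plus r times the previous output, i.e. e[t] = x[t] + r * e[t-1]; a one-line check of the table above:
Reduce(function(acc, x) x + 0.1 * acc, c(100, 200, 100, 100), accumulate = TRUE)
# 100.0 210.0 121.0 112.1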
Summary
Calculates the period between two dates in years.
Syntax
years_between(<column_date>
, <column_date>
)
Arguments
Return Value
numeric
Example
years_between(ymd("2020-10-01), ymd("2021-10-01"))
returns 1
Summary
Calculates the period between two dates in months.
Syntax
months_between(<column_date>
, <column_date>
)
Arguments
Return Value
numeric
Example
months_between(ymd("2020-10-01), ymd("2021-10-01"))
returns 12
Summary
Calculates the period between two dates in weeks.
Syntax
weeks_between(<column_date>
, <column_date>
)
Arguments
Return Value
numeric
Example
weeks_between(ymd("2020-10-01), ymd("2021-10-01"))
returns 52.14286
Summary
Calculates the period between two dates in days.
Syntax
days_between(<column_date>
, <column_date>
)
Arguments
Return Value
numeric
Example
days_between(ymd("2020-10-01), ymd("2021-10-01"))
returns 365
Summary
Calculates the period between two datetimes in hours.
Syntax
hours_between(<column_date>
, <column_date>
)
Arguments
Return Value
numeric
Example
hours_between(ymd_hms("2020-10-01 05:00:00"), ymd_hms("2020-10-01 15:00:00"))
returns 10
Summary
Calculates the period between two datetimes in minutes.
Syntax
minutes_between(<column_date>
, <column_date>
)
Arguments
Return Value
numeric
Example
minutes_between(ymd_hms("2020-10-01 05:00:00"), ymd_hms("2020-10-01 15:00:00"))
returns 600
Summary
Calculates the period between two datetimes in seconds.
Syntax
seconds_between(<column_date>
, <column_date>
)
Arguments
Return Value
numeric
Example
seconds_between(ymd_hms("2020-10-01 05:00:00"), ymd_hms("2020-10-01 15:00:00"))
returns 36000
Summary
Returns the last date of the specified period (e.g., month) that the original date belongs to.
Syntax
last_date(<column_date>
, unit = "week"|"month"|"quarter"|"year"
, previous = <logical>
, week_start = <number>
)
Arguments
Return Value
Date
Example
last_date(ymd("2022-10-01"), unit = "month") returns "2022-10-31"
Summary
Returns the value of the previous period.
Syntax
ts_lag(<column_date>
, <column_num>
, unit = "day"|"week"|"month"|"quarter"|"year"
, n = <integer>
, na_fill_type = "previous"|"next"|"none"
)
Arguments
Return Value
numeric - The values with the time lag.
Example
ts_lag(ymd(c("2020-01-01","2021-01-01","2022-01-01")), c(1,2,3), unit = "year", n = 1) returns c(NA, 1, 2)
Summary
Returns the differences of the values in a time series compared to the values from the specified lag time ago.
Syntax
ts_diff(<column_date>
, <column_num>
, unit = "day"|"week"|"month"|"quarter"|"year"
, n = <integer>
, na_fill_type = "previous"|"next"|"none"
)
Arguments
Return Value
numeric
Example
ts_diff(ymd(c("2020-01-01","2021-01-01","2022-01-01")), c(1,2,3), unit = "year", n = 1) returns c(NA, 1, 1)
Summary
Returns the differences of the values as ratios, compared to the base values from the specified lag time ago in a time series.
Syntax
ts_diff_ratio(<column_date>
, <column_num>
, unit = "day"|"week"|"month"|"quarter"|"year"
, n = <integer>
, na_fill_type = "previous"|"next"|"none"
)
Arguments
Return Value
numeric
Example
ts_diff_ratio(ymd(c("2020-01-01","2021-01-01","2022-01-01")), c(1,2,3), unit = "year", n = 1) returns c(NA, 1, 0.5)
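On a regular, gap-free series this reduces to (value - lag(value)) / lag(value); a minimal dplyr sketch (ts_diff_ratio additionally aligns rows by the calendar unit and fills gaps per na_fill_type):
library(dplyr)

v <- c(1, 2, 3)         # yearly values for 2020, 2021, 2022
(v - lag(v)) / lag(v)   # NA 1.0 0.5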
Summary
Returns Likert's sigma values given a raw data column of survey answers with selected levels such as 1="Strongly disagree", 2="Disagree", 3="Neutral", 4="Agree", 5="Strongly agree".
Syntax
likert_sigma(<column_num>
)
Arguments
Return Value
Numeric
Example
Original data:
NAME | LIKES_ICE_CREAM |
---|---|
Joe | 5 |
Sally | 4 |
Bob | 2 |
John | 5 |
sigma = likert_sigma(LIKES_ICE_CREAM)
NAME | LIKES_ICE_CREAM | sigma |
---|---|---|
Joe | 5 | 0.7978846 |
Sally | 4 | -0.3246628 |
Bob | 2 | -1.2711063 |
John | 5 | 0.7978846 |
Summary
Removes the text before a specified delimiter or regex match.
Syntax
str_remove_before(<text_column>
, sep = <delimiter>
, include_sep = <TRUE|FALSE>
)
Arguments
Return Value
Character
Example
str_remove_before(X1, sep = ",")
Removes text before the comma in column X1, including the comma.
X1 | NEW |
---|---|
New York, USA | USA |
Tokyo, Japan | Japan |
Paris, France | France |
str_remove_before(X1, sep = ",", include_sep = FALSE)
Removes text before the comma in column X1, but keeps the comma.
X1 | NEW |
---|---|
New York, USA | , USA |
Tokyo, Japan | , Japan |
Paris, France | , France |
Summary
Removes the text after a specified delimiter or regex match.
Syntax
str_remove_after(<text_column>
, sep = <delimiter>
, include_sep = <TRUE|FALSE>
)
Arguments
Return Value
Character
Example
str_remove_after(X1, sep = ",")
Removes text after the comma in column X1, including the comma.
X1 | NEW |
---|---|
New York, USA | New York |
Tokyo, Japan | Tokyo |
Paris, France | Paris |
str_remove_after(X1, sep = ",", include_sep = FALSE)
Removes text after the comma in column X1, but keeps the comma.
X1 | NEW |
---|---|
New York, USA | New York, |
Tokyo, Japan | Tokyo, |
Paris, France | Paris, |
Summary
Removes characters from a specified range in text data.
Syntax
str_remove_range(<text_column>
, start = <start_position>
, end = <end_position>
)
Arguments
Return Value
Character
Example
str_remove_range(X1, start = 1, end = 3)
Removes characters from the 1st to the 3rd position in column X1.
X1 | NEW |
---|---|
Hello | lo |
World | ld |
Analysis | lysis |
str_remove_range(X1, start = -3)
Removes the last 3 characters from column X1.
X1 | NEW |
---|---|
Hello | He |
World | Wo |
Analysis | Analy |
Summary
Replaces the text before a specified delimiter or regex match with a specified string.
Syntax
str_replace_before(<text_column>
, sep = <delimiter>
, rep = <replacement_string>
, include_sep = <TRUE|FALSE>
)
Arguments
Return Value
Character
Example
str_replace_before(X1, sep = ",", rep = "City")
Replaces text before the comma in column X1 with "City".
X1 | NEW |
---|---|
New York, USA | City USA |
Tokyo, Japan | City Japan |
Paris, France | City France |
str_replace_before(X1, sep = ",", rep = "City", include_sep = FALSE)
Replaces text before the comma in column X1 with "City", but keeps the comma.
X1 | NEW |
---|---|
New York, USA | City, USA |
Tokyo, Japan | City, Japan |
Paris, France | City, France |
Summary
Replaces the text after a specified delimiter or regex match with a specified string.
Syntax
str_replace_after(<text_column>
, sep = <delimiter>
, rep = <replacement_string>
, include_sep = <TRUE|FALSE>
)
Arguments
Return Value
Character
Example
str_replace_after(X1, sep = ",", rep = "Country")
Replaces text after the comma in column X1 with "Country".
X1 | NEW |
---|---|
New York, USA | New York Country |
Tokyo, Japan | Tokyo Country |
Paris, France | Paris Country |
str_replace_after(X1, sep = ",", rep = "Country", include_sep = FALSE)
Replaces text after the comma in column X1 with "Country", but keeps the comma.
X1 | NEW |
---|---|
New York, USA | New York, Country |
Tokyo, Japan | Tokyo, Country |
Paris, France | Paris, Country |
Summary
Replaces characters in a specified range of text data with another string.
Syntax
str_replace_range(<text_column>
, start = <start_position>
, end = <end_position>
, replacement = <replacement_string>
)
Arguments
Return Value
Character
Example
str_replace_range(X1, start = 1, end = 3, replacement = "NEW")
Replaces characters from the 1st to the 3rd position in column X1 with "NEW".
X1 | NEW |
---|---|
Hello | NEWlo |
World | NEWld |
Analysis | NEWlysis |
str_replace_range(X1, start = -3, replacement = "END")
Replaces the last 3 characters in column X1 with "END".
X1 | NEW |
---|---|
Hello | HeEND |
World | WoEND |
Analysis | AnalyEND |
Summary
Extracts the text before a specified delimiter or regex match.
Syntax
str_extract_before(<text_column>
, sep = <delimiter>
, include_sep = <TRUE|FALSE>
)
Arguments
Return Value
Character
Example
str_extract_before(X1, sep = ",")
Extracts text before the comma in column X1, excluding the comma.
X1 | NEW |
---|---|
New York, USA | New York |
Tokyo, Japan | Tokyo |
Paris, France | Paris |
str_extract_before(X1, sep = ",", include_sep = TRUE)
Extracts text before the comma in column X1, including the comma.
X1 | NEW |
---|---|
New York, USA | New York, |
Tokyo, Japan | Tokyo, |
Paris, France | Paris, |
Summary
Extracts the text after a specified delimiter or regex match.
Syntax
str_extract_after(<text_column>
, sep = <delimiter>
, include_sep = <TRUE|FALSE>
)
Arguments
Return Value
Character
Example
str_extract_after(X1, sep = ",")
Extracts text after the comma in column X1, excluding the comma.
X1 | NEW |
---|---|
New York, USA | USA |
Tokyo, Japan | Japan |
Paris, France | France |
str_extract_after(X1, sep = ",", include_sep = TRUE)
Extracts text after the comma in column X1, including the comma.
X1 | NEW |
---|---|
New York, USA | , USA |
Tokyo, Japan | , Japan |
Paris, France | , France |
Summary
Balances imbalanced data using SMOTE (Synthetic Minority Oversampling Technique). Creates synthetic data for the minority class and appropriately adjusts the majority class data to help build unbiased machine learning models.
The target column must be limited to 0/1 numeric values, binary categorical data, or logical type columns.
Syntax
exp_balance(<imbalanced column>
, target_minority_perc = <target percentage of minority data>
, target_size = <target number of rows>
, max_synth_perc = <maximum percentage increase for minority data>
, k = <number of neighbors>
, seed = <seed value>
)
Arguments
Return Value
Data Frame
Example
Original data class distribution (flight delay data):
Flight_ID | Delayed |
---|---|
1001 | TRUE |
1002 | FALSE |
1003 | FALSE |
1004 | FALSE |
1005 | FALSE |
1006 | FALSE |
1007 | FALSE |
1008 | TRUE |
Since the proportion of Delayed = TRUE is small, creating a machine learning model with this data might result in predictions biased toward FALSE.
exp_balance(Delayed, target_minority_perc = 50, target_size = 10, max_synth_perc = 200, k = 5, seed = 0)
Adjusts the proportion of minority data (TRUE) to 50% with a target dataset size of 10 rows.
Flight_ID | Delayed | Synthesized |
---|---|---|
1001 | TRUE | FALSE |
1002 | FALSE | FALSE |
1003 | FALSE | FALSE |
1004 | FALSE | FALSE |
1005 | FALSE | FALSE |
1006 | FALSE | FALSE |
2001 | TRUE | TRUE |
2002 | TRUE | TRUE |
2003 | TRUE | TRUE |
2004 | TRUE | TRUE |
Synthetic data (Synthesized = TRUE) is added, increasing the amount of Delayed = TRUE data, thus balancing the dataset.
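Conceptually, SMOTE creates each synthetic minority row by interpolating between a minority row and one of its k nearest minority neighbors. A toy sketch of that interpolation step, using hypothetical numeric features (not exp_balance's implementation):
set.seed(0)
x  <- c(dep_delay = 35, distance = 500)  # a minority-class row (hypothetical)
nb <- c(dep_delay = 50, distance = 800)  # one of its k nearest minority neighbors
x + runif(1) * (nb - x)                  # synthetic row on the segment between them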
Summary
Divides numeric data into equal intervals (e.g., in steps of 500) and returns the corresponding group name. Can be combined with format_cut_output function to customize the display format of the intervals.
Syntax
exp_cut_by_step(<numeric_column>
, step = <interval_width>
, lower.range = <minimum_value>
, upper.range = <maximum_value>
, right = <TRUE|FALSE>
, include.outside.range = <TRUE|FALSE>
, dig.lab = <display_digits>
, as.factor = <TRUE|FALSE>
)
Arguments
* numeric_column - Numeric data to be divided into intervals
* step - Specifies the width of the interval (e.g., 1000, 5000, etc.)
* lower.range (optional) - Specifies the lower limit of the range. If NA, the minimum value of the data is used
* upper.range (optional) - Specifies the upper limit of the range. If NA, the maximum value of the data is used
* right (optional) - Specifies how to handle the endpoints of each interval. TRUE includes the right endpoint; FALSE includes the left endpoint. Default is FALSE
* include.outside.range (optional) - Specifies how to handle values outside the specified range. TRUE includes them, creating intervals for values below the lower limit and above the upper limit; FALSE treats them as NA. Default is TRUE
* dig.lab (optional) - Specifies the number of digits to display in the interval labels. Default is 3
* as.factor (optional) - Specifies the data type of the return value. TRUE returns a factor (preserves order); FALSE returns a string. Default is TRUE
Return Value
Factor
Example
format_cut_output(exp_cut_by_step(sales_amount, step = 10000), right = FALSE, decimal.digits = 0)
Divides the sales amount into equal intervals in steps of 10000. Classifies by including the lower limit of each group and returns the corresponding group name in integer notation.
sales_amount | interval |
---|---|
5000 | < 5,000 |
15000 | 5,000 - 15,000 |
25000 | 15,000 - 25,000 |
format_cut_output(exp_cut_by_step(sales_amount, step = 5000, lower.range = 0, upper.range = 20000, right = TRUE), right = TRUE, decimal.digits = 0)
Restricts the sales amount to a range from 0 to 20000 and divides it into equal intervals in steps of 5000. Classifies by including the upper limit of each group and returns the corresponding group name in integer notation.
sales_amount | interval |
---|---|
5000 | 0 - 5,000 |
15000 | 10,000 - 15,000 |
25000 | 20,000 < |
Summary
Divides numeric data into equal width groups and returns the corresponding group name. Can be combined with format_cut_output function to customize the display format of the intervals.
Syntax
exp_cut(<numeric_column>
, breaks = <number_of_intervals>
, lower.range = <minimum_value>
, upper.range = <maximum_value>
, right = <TRUE|FALSE>
, include.outside.range = <TRUE|FALSE>
, dig.lab = <display_digits>
, as.factor = <TRUE|FALSE>
)
Arguments
* numeric_column - Numeric data to be divided into intervals
* breaks - Number of intervals to create
* lower.range (optional) - Lower limit of the range. If NA, the minimum value of the data is used
* upper.range (optional) - Upper limit of the range. If NA, the maximum value of the data is used
* right (optional) - How to handle interval endpoints. TRUE includes the right endpoint of each interval; FALSE includes the left endpoint. Default is FALSE
* include.outside.range (optional) - How to handle values outside the specified range. TRUE creates intervals for values below the lower limit and above the upper limit; FALSE treats such values as NA. Default is TRUE
* dig.lab (optional) - Number of digits to display in the interval labels. Default is 3
* as.factor (optional) - Data type of the return value. TRUE returns a factor (preserves order); FALSE returns a character string. Default is TRUE
Return Value
Factor
Example
format_cut_output(exp_cut(sales_amount, breaks = 5), right = TRUE, decimal.digits = 2)
Divides the sales amount values into 5 equal-width groups based on the minimum and maximum values, and returns which group each value belongs to. Group labels are displayed with up to 2 decimal places.
sales_amount | group |
---|---|
1 | 1 - 5,000 |
4800 | 1 - 5,000 |
13000 | 10,000 - 15,000 |
25000 | 20,000 - 25,000 |
format_cut_output(exp_cut(sales_amount, breaks = 4, lower.range = 0, upper.range = 20000, right = TRUE), right = TRUE, decimal.digits = 2)
Divides the sales amount into 4 equal-width groups within the range of 0 to 20000, and returns which group each value belongs to. Each group includes its upper limit, and the group labels are displayed with up to 2 decimal places.
sales_amount | group |
---|---|
1 | 0.00 - 5,000.00 |
4800 | 0.00 - 5,000.00 |
13000 | 10,000.00 - 15,000.00 |
25000 | 20,000.00 < |
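Likewise, the fixed-count grouping parallels base R's cut() when breaks is given as a single number. A rough sketch of the analogy:
cut(c(1, 4800, 13000, 25000), breaks = 5)
This produces 5 equal-width bins spanning the data's minimum-to-maximum range.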
Summary
Applies the logistic (sigmoid) function, 1 / (1 + exp(-x)), mapping numeric values into the range (0, 1).
Syntax
logistic(<column_num>
)
Return Value
Numeric
Example
mutate(Y = logistic(X))
Creates a new column Y containing the result of the logistic function applied to column X.
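For reference, the same transformation can be written in base R. A minimal sketch reusing the column X from the example above:
mutate(Y = 1 / (1 + exp(-X)))
For inputs -2, 0, and 2 this yields approximately 0.119, 0.5, and 0.881.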
Summary
Returns TRUE if the provided date is a Japanese holiday.
Syntax
is_jholiday(<column_date>
)
Arguments
Return Value
Logical (either TRUE or FALSE)
Example
is_jholiday("2020-01-01")
returns TRUE
is_jholiday("2020-01-05")
returns FALSE
Summary
Converts between full-width (zenkaku) and half-width (hankaku) characters in a string.
Syntax
str_conv_zenhan(<text_column>
, to = "zenkaku"|"hankaku"
)
Arguments
Return Value
Character
Example
str_conv_zenhan(X1, "zenkaku")
Converts half-width characters to full-width characters.
X1 | NEW |
---|---|
ｶﾞｯ | ガッ |
ﾊﾛｰ | ハロー |
ABC | ＡＢＣ |
str_conv_zenhan(X1, to = "hankaku")
Converts full-width characters to half-width characters.
X1 | NEW |
---|---|
ガッ | ｶﾞｯ |
ハロー | ﾊﾛｰ |
ＡＢＣ | ABC |
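A comparable half-width to full-width conversion can be approximated with the stringi package's ICU transliterator. A sketch of the approximation, not the function's actual implementation:
stringi::stri_trans_general(c("ｶﾞｯ", "ABC"), "Halfwidth-Fullwidth")
This returns "ガッ" and "ＡＢＣ"; the "Fullwidth-Halfwidth" transliterator performs the reverse direction.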
Summary
Converts between hiragana and katakana characters in a string.
Syntax
str_conv_hirakana(<text_column>
, to = "hiragana"|"katakana"
)
Arguments
Return Value
Character
Example
str_conv_hirakana(X1, to = "hiragana")
Converts katakana characters to hiragana.
X1 | NEW |
---|---|
アイウエオ | あいうえお |
カタカナ | かたかな |
ヤマダ | やまだ |
str_conv_hirakana(X1, to = "katakana")
Converts hiragana characters to katakana.
X1 | NEW |
---|---|
あいうえお | アイウエオ |
かたかな | カタカナ |
やまだ | ヤマダ |
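A comparable conversion is also available through stringi's ICU transliterators. A sketch of the approximation, not the function's actual implementation:
stringi::stri_trans_general("カタカナ", "Katakana-Hiragana")
This returns "かたかな"; the "Hiragana-Katakana" transliterator performs the reverse direction.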
Summary
Splits a Japanese address string into prefecture, city, and street components.
Syntax
separate_japanese_address(<address_column>
, prefecture_col = <prefecture_column_name>
, city_col = <city_column_name>
, street_col = <street_column_name>
)
Arguments
Return Value
Character
Example
separate_japanese_address(address)
Splits the address string into prefecture, city, and street components.
address | prefecture | city | street |
---|---|---|---|
東京都千代田区丸の内1-1-1 | 東京都 | 千代田区 | 丸の内1-1-1 |
大阪府大阪市中央区北浜1-1-1 | 大阪府 | 大阪市中央区 | 北浜1-1-1 |
北海道札幌市中央区北1条西2丁目 | 北海道 | 札幌市中央区 | 北1条西2丁目 |
separate_japanese_address(address, prefecture_col = "prefecture", city_col = "city", street_col = "street_address")
Splits the address string with custom column names for prefecture, city, and street components.
address | prefecture | city | street_address |
---|---|---|---|
東京都千代田区丸の内1-1-1 | 東京都 | 千代田区 | 丸の内1-1-1 |
大阪府大阪市中央区北浜1-1-1 | 大阪府 | 大阪市中央区 | 北浜1-1-1 |
北海道札幌市中央区北1条西2丁目 | 北海道 | 札幌市中央区 | 北1条西2丁目 |
Summary
Converts Japanese era years (Reiwa, Heisei, Showa, etc.) to Western calendar years. Supports both kanji notation (令和三年) and alphabetic notation (R3).
Syntax
convert_jyear(<japanese_era_year>
, legacy = <TRUE|FALSE>
)
Arguments
Return Value
Numeric
Example
convert_jyear(japanese_era_year)
Converts Japanese era years to Western calendar years.
japanese_era_year | western_year |
---|---|
令和3年 | 2021 |
R3 | 2021 |
平成元年 | 1989 |
昭和64年 | 1989 |
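The conversion amounts to adding a fixed offset per era (Reiwa 1 = 2019, Heisei 1 = 1989, Showa 1 = 1926). A minimal arithmetic sketch for the Reiwa case:
reiwa_year <- 3
reiwa_year + 2018
This yields 2021, matching convert_jyear("令和3年") in the table above.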
Summary
Converts prefecture names between kanji format (e.g., 東京都) and romanized format (e.g., Tokyo-to). Useful for normalizing and standardizing address data.
Syntax
convert_prefecture(<prefecture_name_column>
, to = <"kanji"|"roman">
)
Arguments
Return Value
Character
Example
convert_prefecture(prefecture, to = "kanji")
Converts romanized prefecture names to kanji format.
prefecture | converted_prefecture |
---|---|
tokyo-to | 東京都 |
osaka-fu | 大阪府 |
hokkaido | 北海道 |
Summary
Converts prefecture names written in hiragana or katakana to kanji notation. Useful for normalizing address data and name matching processes.
Syntax
convert_prefecture_from_kana(<kana_prefecture_name>
)
Arguments
Return Value
Character
Example
convert_prefecture_from_kana(prefecture_hiragana)
Converts prefecture names in hiragana to kanji notation.
prefecture_hiragana | prefecture_kanji |
---|---|
とうきょうと | 東京都 |
おおさかふ | 大阪府 |
ホッカイドウ | 北海道 |
Summary
Retrieves the date for a specific weekday in a given year and month. For example, it can find "the 3rd Tuesday of May 2023." Useful for holiday settings and recurring schedule management.
Syntax
find_date_by_wday(<year>
, <month>
, <weekday>
, <occurrence>
)
Arguments
Return Value
Date
Example
find_date_by_wday(2023, 5, 2, 3)
Retrieves the date of the 3rd Tuesday in May 2023.
Input | Output |
---|---|
(2023, 5, 2, 3) | 2023-05-16 |
(2023, 1, 1, 1) | 2023-01-02 |
(2023, 12, 7, 4) | 2023-12-24 |
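For intuition, the lookup can be reproduced with a small base-R helper. A minimal sketch assuming the ISO weekday convention (1 = Monday through 7 = Sunday), which matches the Tuesday and Sunday rows above; the helper name nth_wday is hypothetical:
nth_wday <- function(year, month, wday, n) {
  first <- as.Date(sprintf("%04d-%02d-01", year, month))
  days <- seq(first, by = "day", length.out = 28)  # 4 full weeks cover occurrences 1 through 4
  days[as.integer(format(days, "%u")) == wday][n]  # %u is the ISO weekday (1 = Monday)
}
nth_wday(2023, 5, 2, 3)
This returns "2023-05-16", the 3rd Tuesday of May 2023.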
Summary
Standardizes prefecture names to a unified format. Absorbs differences such as the presence/absence of "都", "府", "県" suffixes, as well as abbreviated and formal name variations, to provide consistent notation.
Syntax
harmonize_prefecture_name(<prefecture_name>
, to = "long"|"short"
)
Arguments
Return Value
Character
Example
harmonize_prefecture_name(prefecture_abbreviated, to = "long")
Standardizes abbreviated prefecture names (short) to formal name format (long).
prefecture_abbreviated | prefecture_formal |
---|---|
東京 | 東京都 |
大阪府 | 大阪府 |
京都 | 京都府 |
Summary
Determines whether a string is a valid prefecture name. Returns FALSE if the string contains additional text such as city names or if it's an invalid prefecture name.
Syntax
is_prefecture(<string>
)
Arguments
Return Value
Logical
Example
is_prefecture(prefecture_name)
Evaluates whether the string is a valid prefecture name.
prefecture_name | is_valid_prefecture |
---|---|
東京都 | TRUE |
大阪 | TRUE |
東京都新宿区 | FALSE |
とうきょう | FALSE |
Summary
Determines whether a string is in the correct format for a Japanese postal code. Supports both 7-digit numeric format and hyphenated format.
Syntax
is_zipcode(<string_or_number>
)
Arguments
Return Value
Logical
Example
is_zipcode(zipcode)
Evaluates whether the string is in a valid postal code format.
zipcode | is_valid_zipcode |
---|---|
100-0001 | TRUE |
1000001 | TRUE |
100-001 | FALSE |
abc-defg | FALSE |
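The format check itself is a simple pattern match. An approximate regex sketch (an assumed equivalence, not the function's actual rule):
grepl("^[0-9]{3}-?[0-9]{4}$", c("100-0001", "1000001", "100-001", "abc-defg"))
This returns TRUE, TRUE, FALSE, FALSE, matching the table above.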
Summary
Converts between different Japanese character types (hiragana, katakana, full-width, half-width, etc.). By specifying the conversion function and target format, various character type conversions are supported.
Syntax
str_jconv(<string>
, fun = <conversion_function>
, to = <target_format>
)
Arguments
Return Value
Character
Example
str_jconv("カタカナの列", str_conv_hirakana, to = "hiragana")
Converts katakana to hiragana.
katakana_column | hiragana_column |
---|---|
カタカナ | かたかな |
アイウエオ | あいうえお |
Summary
Normalizes strings to a standardized format. Converts full-width alphanumeric characters to half-width, unifies space characters, removes duplicate spaces, and more.
Syntax
str_jnormalize(<string>
)
Arguments
Return Value
Character
Example
str_jnormalize(string)
Normalizes strings to a standardized format.
string | normalized_string |
---|---|
ＡＢＣ　１２３ | ABC 123 |
デ ー タ | データ |
Ｒ＆Ｄ部門 | R&D部門 |
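Part of this normalization corresponds to Unicode NFKC. A minimal stringi sketch of that piece (the full function additionally unifies and deduplicates spaces, which NFKC alone does not):
stringi::stri_trans_nfkc("ＡＢＣ　１２３")
This returns "ABC 123", converting the full-width alphanumerics and the ideographic space to half-width.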
Summary
Formats postal code strings. Can convert between numeric-only format and hyphenated format. Useful for data standardization and display format unification.
Syntax
zipcode_spacer(<zipcode>
, remove = <TRUE|FALSE>
)
Arguments
Return Value
Character
Example
zipcode_spacer(zipcode, remove = FALSE)
Adds hyphens to 7-digit postal codes; values that already contain a hyphen are kept in the hyphenated format.
zipcode | formatted_zipcode |
---|---|
1000001 | 100-0001 |
100-0001 | 100-0001 |
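The hyphen insertion can be approximated with a base-R substitution. A sketch of the approximation, not the function's actual implementation:
sub("^([0-9]{3})([0-9]{4})$", "\\1-\\2", c("1000001", "100-0001"))
This returns "100-0001" for both inputs, since already-hyphenated codes do not match the pattern and pass through unchanged.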
Summary
Converts Japanese era date strings (e.g., 令和3年1月1日, R3/1/1) to Western calendar dates (e.g., 2021-01-01). For Japanese-formatted dates that already use the Gregorian calendar, such as "yyyy年mm月" or "yyyy年mm月dd日", use the lubridate functions (ym or ymd) that match the date order (year, month, day) instead of convert_jdate; the lubridate parsers correctly handle the Japanese markers 年 (year), 月 (month), and 日 (day).
Syntax
convert_jdate(<japanese_date_string>
, legacy = <TRUE|FALSE>
)
Arguments
Return Value
Date
Example
convert_jdate(japanese_date_string)
Converts Japanese era dates to Western calendar dates.
japanese_date_string | western_date |
---|---|
令和3年1月1日 | 2021-01-01 |
R3/1/1 | 2021-01-01 |
平成31年4月30日 | 2019-04-30 |
Gregorian Year-Month Pattern (e.g., 2024年1月, 2021年09月)
mutate(year_month_column = ym(year_month_column))
Gregorian Year-Month-Day Pattern (e.g., 2024年1月1日, 2021年09月15日)
mutate(date_column = ymd(date_column))