! | Reverses the meaning of the following condition. It is a logical negation and has a similar meaning as 'Not'. |
all | Return TRUE if the condition is satisfied for all the rows or a group of rows. |
any | Return TRUE if the condition is satisfied for any of the rows or a group of rows. |
duplicated | Returns a logical value (TRUE or FALSE) indicating whether a value or row is duplicated. |
ifelse | Returns different values based on the conditional result. |
is.character | Returns true if a given object is character type. |
is.double | Returns true if a given object is double type. |
is.factor | Returns true if a given object is factor type. |
is.list | Returns true if a given object is list type. |
is.logical | Returns true if a given object is logical type. |
is.matrix | Returns true if a given object is matrix type. |
is.na | Returns TRUE when it is NA. |
is.null | Returns TRUE when it is null. |
is.numeric | Returns true if a given object is numeric type. |
isTRUE | Returns true if and only if a given value is a length-one logical vector whose only element is TRUE. |
which | Returns the TRUE indices of a logical object or based on the condition. |
xor | Performs an exclusive-or check across two values and return TRUE when only one of them is TRUE. |
is.Date | Returns true if a given object is Date type. |
is.difftime | Returns true if a given object is difftime type. |
is.duration | Returns true if a given object is duration type. |
is.POSIXct | Returns true if a given object is POSIXct type. |
str_detect | Return TRUE or FALSE based on whether Text data contains a given text or not |
if_else | Returns values based on whether a given condition is satisfied or not. |
case_when | Returns values based on multiple conditions. This is similar to CASE WHEN conditions in SQL. You can use two sided formula to define the condition and the mapping values, and use TRUE as ELSE condition. See examples below. |
coalesce | Replace the missing values with a given value or the values that are at the same position in other columns. |
na_if | Replaces the matching values to NA. |
near | Compares two numeric vectors (columns). This is a safe way of comparing if two vectors of floating point numbers are (pairwise) equal. This is safer than using ==, because it has a built in tolerance. |
recode_factor | Recodes (or replaces) the original values with given values and create a factor column. You can replace numeric values either by the name with backticks or based on their position, and character values by their name. |
parse_logical | Convert data to Logical (Boolean) data type. |
is.hms | Returns true if a given object is hms type. |
str_logical | Convert a character or numeric data type column to a logical column. If true_value argument is not provided, it treats "yes", "true", "1", and 1 as TRUE and "no", "false", "0", and 0 as FALSE. |
impute_na | Impute NA by average / median values, manually entered values, or predicted values. It will build a linear regression model for the prediction. |
get_mode | Returns the most frequent value (mode) in a column. |
summarize_group | Summarize (or aggregate) column values with specified aggregate functions. If grouping columns are provided then summarize the values for each group. |
summarize_row | Summarize (or aggregate) column values across columns for each row with the specified aggregate function. |
mutate_group | Create new columns or update existing columns with specified functions such as Window Calculations. If grouping columns are provided then it performs the calculation for each group. |
confint_ratio | Returns a confidence interval range (half-width of confidence interval) of TRUE ratio from given logical data. |
calc_confint_ratio | Returns a confidence interval range (half-width of confidence interval) of ratio from a size and a TRUE ratio of sample data. |
abbreviate | Abbreviate strings to at least minlength characters, such that they remain unique (if they were). First all spaces at the beginning of the string are stripped. Then (if necessary) any other spaces are stripped. Next, lower case vowels are removed (starting at the right) followed by lower case consonants. Finally if the abbreviation is still longer than minlength upper case letters are stripped. |
c | This is a generic function which combines its arguments. |
iconv | Convert a given text from one encoding to another. |
nchar | Returns # of letters or size of a given value. |
str_c | Concatenates multiple text from multiple columns. |
str_conv | Convert text to a specified encoding. e.g. UTF-8 |
str_count | Return number of letters or words in text data. You can also set a pattern to count only the matched pattern or text. |
str_detect | Return TRUE or FALSE based on whether Text data contains a given text or not |
str_dup | Repeat text values. |
str_extract | Extract only the letters that match given letters or patterns. |
str_extract_all | Extract all characters that match given letters or patterns. |
str_length | Return number of letters in text data. |
str_pad | Add white spaces to text values to make the total length exactly a specified number. |
str_replace | Replace letters that match given letters or expressions. |
str_replace_all | Replace letters that match given letters or expressions. |
str_replace_na | Replace NA values with a specific text. |
str_split | Split a given text into multiple text by a given separator. It will return a list, so you may want to unnest the newly created column with the unnest() function. |
str_sub | Extract letters from Text values based on the position |
str_to_lower | Convert text to lowercase |
str_to_title | Convert text to Title-case |
str_to_upper | Convert text to UPPERCASE |
str_trim | Trim (or remove) white spaces in Text values |
word | Extract First / Last Nth Word in Text data |
extract_numeric | Extract only the number from a given text. This is useful for strings that are supposed to be numbers with extra formatting (e.g. $1,200.34, -12%, X1, etc.). |
url_parameters | Returns decoded query parameters from url. |
coalesce | Replace the missing values with a given value or the values that are at the same position in other columns. |
na_if | Replaces the matching values to NA. |
recode | Recodes (or replaces) the original values with given values. You can replace numeric values either by the name with backticks or based on their position, and character values by their name. |
parse_number | Parse characters, extract numeric values, and convert to number data type. |
str_trunc | Truncate text to make the maximum length to be the specified number. |
str_remove | Remove letters that match with given letters or expressions. |
str_remove_all | Remove letters that match given letters or expressions. |
str_remove_word | Remove a word from a sentence. |
str_replace_word | Replace a word in a sentence with given text. |
str_remove_inside | Remove text inside of characters. |
str_replace_inside | Replace text inside of characters. |
str_remove_url | Remove a URL inside of characters. |
str_replace_url | Replace a URL inside of characters with provided text. |
str_extract_url | Extract a URL inside of characters. |
str_remove_emoji | Remove emojis from text. |
anonymize | Anonymize values by hashing algorithms. |
str_clean | Cleans up text by removing escape characters (e.g. \n, \t), extra white spaces, extra period, and leading and trailing spaces. |
str_count_all | Count patterns from texts |
str_normalize | Normalize text by replacing full-width alphabets, numbers, special characters with regular alphabets, numbers, special characters, replacing half-width Kana characters with full-width Kana characters, and applying other normalization rules. It follows the rule of Unicode Normalization Forms. This is a wrapper function around stringi::stri_trans_nfkc function. |
str_extract_inside | Extract text inside symbols with given begin and end symbol. |
str_logical | Convert a character or numeric data type column to a logical column. If true_value argument is not provided, it treats "yes", "true", "1", and 1 as TRUE and "no", "false", "0", and 0 as FALSE. |
get_stopwords | Returns stop words like "a", "the", "and", etc. |
word_to_sentiment | Returns sentiment types of positive or negative based on word(s). |
get_sentiment | Returns the sentiment score for a text sentence. The positive score indicates the positive sentence and the negative score indicates the opposite. 0 means 'neutral.' |
stem_word | Stem words so that words with slightly different spellings can be recognized as the same word. |
is_stopword | Returns TRUE if a word is included in a list of the stop words defined by one of the dictionaries. |
is_empty | Returns TRUE if a text is empty string or NA. |
is_alphabet | Returns TRUE if a text contains only alphabetic characters. |
ip_to_country | Returns country names from IP addresses. |
url_domain | Returns domain (ex. "exploratory.io") from url. |
url_fragment | Returns fragment from url. |
url_path | Returns path from url. |
url_port | Returns port from url. |
url_scheme | Returns scheme (ex. "http", "https") from url. |
url_suffix | Returns suffix (ex. "com", "org") from url. |
url_subdomain | Returns subdomain (ex. "www", "blog") from url. |
url_tld | Returns top-level domain (ex. "com", "co") from url. |
url_param | Returns decoded query parameter from url. |
countrycode | Map country names or codes to other codes or names (country names, continent names, etc.). |
statecode | This function takes a column that has US State information and returns either US State names, abbreviations, numeric codes, division, or region, based on 'output_type' argument setting. The US State information can be either US State names, US State abbreviations (FIPS / ANSI codes), or US State numeric code (FIPS) in any combination. |
countycode | Generate US county code (FIPS - Federal Information Processing Standard) based on US State and County names. |
impute_na | Impute NA by average / median values, manually entered values, or predicted values. It will build a linear regression model for the prediction. |
get_mode | Returns the most frequent value (mode) in a column. |
summarize_group | Summarize (or aggregate) column values with specified aggregate functions. If grouping columns are provided then summarize the values for each group. |
summarize_row | Summarize (or aggregate) column values across columns for each row with the specified aggregate function. |
mutate_group | Create new columns or update existing columns with specified functions such as Window Calculations. If grouping columns are provided then it performs the calculation for each group. |
abs | Returns the absolute value. |
acos | Returns the inverse cosine of a value, in radians |
acosh | Returns the inverse hyperbolic cosine of a number. |
asin | Returns the inverse sine of a value, in radians. |
asinh | Returns the inverse hyperbolic sine of a number. |
atan | Returns the inverse tangent of a value, in radians. |
atan2 | Returns the angle between the x-axis and a line segment from the origin (0,0) to specified coordinate pair (`x`,`y`), in radians. |
atanh | Returns the inverse hyperbolic tangent of a number. |
ceiling | Rounds a number up to the nearest integer. ceiling(3.475) is 4 |
cos | Returns the cosine of an angle provided in radians. |
cosh | Returns the hyperbolic cosine of any real number. |
cospi | Returns cos(pi * x) for a given value x. |
cut | Divides the range of values into intervals and returns, for each value, the interval it falls into. |
exp | Returns Euler's number, e (~2.718) raised to a power. |
floor | Rounds a number down to the nearest integer. floor(3.475) is 3. |
log | Returns logarithms, by default natural logarithms. |
log10 | Computes common (i.e., base 10) logarithms |
log1p | Computes log(1+x) accurately also for |x| << 1. |
log2 | Computes binary (i.e., base 2) logarithms. |
pmax | Returns the parallel maxima of the input values. |
pmin | Returns the parallel minima of the input values. |
round | Rounds a number to a certain number of decimal places. |
sign | Returns either 1, 0, or -1 depending on whether the input value is positive, zero, or negative, respectively. |
signif | Rounds the values in its first argument to the specified number of significant digits. |
sin | Returns the sine of an angle provided in radians. |
sinh | Returns the hyperbolic sine of any real number. |
sinpi | Returns sin(pi * x) for a given value x. |
sqrt | Returns the square root of the values. |
tan | Returns the tangent of an angle provided in radians. |
tanh | Returns the hyperbolic tangent of any real number. |
tanpi | Returns tan(pi * x) for a given value x. |
trunc | Truncates the numbers by omitting digits. |
coalesce | Replace the missing values with a given value or the values that are at the same position in other columns. |
near | Compares two numeric vectors (columns). This is a safe way of comparing if two vectors of floating point numbers are (pairwise) equal. This is safer than using ==, because it has a built in tolerance. |
recode | Recodes (or replaces) the original values with given values. You can replace numeric values either by the name with backticks or based on their position, and character values by their name. |
normalize | Centers and/or scales the numeric values of a column. |
impute_na | Impute NA by average / median values, manually entered values, or predicted values. It will build a linear regression model for the prediction. |
detect_outlier | Detect outlier values and return 'upper' and 'lower' labels. |
cluster | Build K-means clustering model and label each row with cluster id. |
get_mode | Returns the most frequent value (mode) in a column. |
summarize_group | Summarize (or aggregate) column values with specified aggregate functions. If grouping columns are provided then summarize the values for each group. |
summarize_row | Summarize (or aggregate) column values across columns for each row with the specified aggregate function. |
mutate_group | Create new columns or update existing columns with specified functions such as Window Calculations. If grouping columns are provided then it performs the calculation for each group. |
confint_mean | Returns a confidence interval range (half-width of confidence interval) of mean from given numeric data. |
calc_confint_mean | Returns a confidence interval range (half-width of confidence interval) of mean from a size and a standard deviation of sample data. |
cumsum_decayed | Calculates the cumulative sum of decaying effects. It is the same as cumsum when r (the second argument) is 1. |
ts_lag | Adds lag time to a time series data. |
ts_diff | Returns differences in values in a time series data since the specified lag time ago. |
ts_diff_ratio | Returns the differences of the values in ratio compared to the base values from the lag time ago in a time series. |
likert_sigma | Returns Likert's sigma values given a raw data column of survey answers with selected levels such as 1="Strongly disagree", 2="Disagree", 3="Neutral", 4="Agree", 5="Strongly agree". |
logistic | Logistic function. |
all | Return TRUE if the condition is satisfied for all the rows or a group of rows. |
any | Return TRUE if the condition is satisfied for any of the rows or a group of rows. |
length | Returns the length of the values. |
max | Returns the maximum value in a numeric column. |
mean | Returns the numerical average (mean) value. |
min | Returns the minimum value in a numeric column. |
prod | Returns the product value by multiplying a series of numbers in a given data together. |
sum | Returns the sum of all the values. |
unique | Returns the unique values. |
n | Returns the count of the rows for each group. |
n_distinct | Returns the count of unique values. |
nth | Returns the nth value of all the values. |
IQR | Computes the interquartile range of the values. |
mad | Returns the median absolute deviation of the values. |
median | Returns the numerical median value. |
quantile | Returns sample quantiles corresponding to the given probabilities. The smallest observation corresponds to a probability of 0 and the largest to a probability of 1. |
sd | Returns the standard deviation of the values. |
var | Returns the variance of the values. |
get_mode | Returns the most frequent value (mode) in a column. |
summarize_group | Summarize (or aggregate) column values with specified aggregate functions. If grouping columns are provided then summarize the values for each group. |
sum_if | Returns the sum of all the values that satisfy given conditions. |
sum_if_ratio | Returns the ratio of the sum of all the values that satisfy given conditions to the sum of all the values. |
count_if | Summarize the data by counting number of rows that satisfy given conditions. |
count_if_ratio | Returns the ratio of the number of rows that satisfy given conditions to the total number of rows. |
count_unique_if | Counts the number of the unique values of a column that come from rows that satisfy given conditions. |
count_unique_if_ratio | Return the ratio of the number of the unique values of a column that come from rows that satisfy given conditions to the total number of unique values. |
mean_if | Returns the numerical average (mean) of the values that satisfy given conditions. |
average_if | Returns the numerical average (mean) of the values that satisfy given conditions. This is an alias of mean_if. |
median_if | Returns the numerical median of the values that satisfy given conditions. |
max_if | Returns the maximum value in a numeric column that satisfies given conditions. |
min_if | Returns the minimum value in a numeric column that satisfies given conditions. |
count_rows | Returns the count of the rows for each group. |
count_unique | Returns the count of unique values. |
as.character | Convert a given data to character data type. |
as.Date | Convert a given data to Date data type. Date data type doesn't include Time. The default formats follow the rules of the ISO 8601 international standard which expresses a day as "2001-02-03". If the given data is a number representing an internal system number (# of days since some origin), provide the date to be used as 'origin'. |
as.double | Convert data to double precision number data type. It is identical to numeric. |
as.factor | Encode a column as a factor. |
as.integer | Convert data to integer number data type. |
as.logical | Convert data to Logical (Boolean) data type. |
as.numeric | Convert data to numeric data type. |
as.POSIXct | Convert data to Date/Time data type. |
factor | Encode a column as a factor. |
parse_date_time | Convert data to Date/Time data type. |
parse_number | Parse characters, extract numeric values, and convert to number data type. |
parse_double | Convert data to double precision number data type. |
parse_euro_double | Convert data to double precision number data type. |
parse_integer | Convert data to integer number data type. |
parse_time | Convert data to Time data type. |
parse_character | Convert a given data to character data type. |
parse_factor | Parse data and convert to factor data type. |
parse_logical | Convert data to Logical (Boolean) data type. |
as_date | Convert a given POSIXct to Date. |
as_datetime | Convert a given data to Date/Time (POSIXct). |
as.hms | Convert a given data to hms data type. It is a difftime treated as seconds. |
excel_numeric_to_date | Convert an Excel date serial number to Date. |
excel_numeric_to_datetime | Convert an Excel date/time serial number to POSIXct. |
unixtime_to_datetime | Convert unix time numeric values to POSIXct. |
chartr | Translates each character in x that is specified in old to the corresponding character specified in new. |
list | Construct a list |
rep | Repeats given numbers. |
rev | Reverses the entries in a given data. |
desc | Change the sorting order to a descending order. e.g. 9 to 1 instead of 1 to 9, z to a instead of a to z. |
cummax | Returns the cumulative maxima. |
cummin | Returns the cumulative minima. |
cumprod | Returns the cumulative products. |
cumsum | Returns the cumulative sums. |
cumall | Returns TRUE if all the values up to the current position are TRUE. |
cumany | Returns TRUE if any of the values up to the current position is TRUE. |
cume_dist | Cumulative distribution. Proportion of all values less than or equal to the current rank. Missing values are left as is. |
cummean | Returns the cumulative mean (average). |
dense_rank | Ranks with no gaps. Missing values are left as is. |
first | Returns the first value of all the values. |
lag | Provides access to a row at a given physical offset prior to that position. |
last | Returns the last value of all the values. |
lead | Provides access to a row at a given physical offset that follows the current row. |
min_rank | Ranks. Ties get min rank. Missing values are left as is. |
ntile | Breaks the column values into n buckets. Missing values are left as is. |
percent_rank | Ranks rescaled to [0, 1]. Missing values are left as is. |
row_number | Returns row numbers. Equivalent to Rank. |
roll_max | Returns the rolling (or moving) max value. |
roll_mean | Returns the rolling (or moving) mean value. |
roll_median | Returns the rolling (or moving) median value. |
roll_min | Returns the rolling (or moving) min value. |
roll_prod | Returns the rolling (or moving) prod value. |
roll_sd | Returns the rolling (or moving) standard deviation value. |
roll_sum | Returns the rolling (or moving) sum value. |
roll_var | Returns the rolling (or moving) variance value. |
mutate_group | Create new columns or update existing columns with specified functions such as Window Calculations. If grouping columns are provided then it performs the calculation for each group. |
difftime | Calculates the difference between two given dates/times. |
months | Convert a number to month periods so you can use it for further calculations. |
ceiling_date | Rounds date/time up to the nearest integer value of the specified time unit. |
date_decimal | Converts numeric data calculated as a fraction of the year to date data type. |
day | Extract Day |
days | Convert a number to day periods so you can use it for further calculations. |
decimal_date | Converts date data to numeric data type by calculating the date to a fraction of the year. |
dmy | Convert Character or Number to Date / Period when data contains only Date, but not Time. |
dmy_h | Convert Character or Number to Date / Period when data contains Date and Time. |
dmy_hm | Convert Character or Number to Date / Period when data contains Date and Time. |
dmy_hms | Convert Character or Number to Date / Period when data contains Date and Time. |
duration | Create a duration |
dym | Convert Character or Number to Date / Period when data contains only Date, but not Time. |
floor_date | Rounds date/time down to the nearest integer value of the specified time unit. |
force_tz | Re-registers the data with a given timezone. |
here | Get current time in your local timezone |
hm | Convert text to Period data type if a given column holds values that look like Hours and Minutes |
hms | Create hms data. It is a difftime treated as seconds. |
hour | Extract hour |
hours | Convert a number to hour periods so you can use it for further calculations. |
interval | Return intervals between two dates. You can use this result to get the duration by either dividing it by a given unit such as days, weeks, etc., or converting it to numeric data type with "as.numeric()". |
isoweek | Extract week numbers of the year. Weeks start from Monday. |
mday | Extract Day of Month |
mdy | Convert Character or Number to Date / Period when data contains only Date, but not Time. |
mdy_h | Convert Character or Number to Date / Period when data contains Date and Time. |
mdy_hm | Convert Character or Number to Date / Period when data contains Date and Time. |
mdy_hms | Convert Character or Number to Date / Period when data contains Date and Time. |
milliseconds | Convert a number to milliseconds periods so you can use it for further calculations. |
minute | Extract minute |
minutes | Convert a number to minute periods so you can use it for further calculations. |
month | Extract Month |
ms | Convert text to Period data type if a given column holds values that look like Minutes and Seconds |
myd | Convert Character or Number to Date / Period when data contains only Date, but not Time. |
now | Returns current date and time. A similar function 'today()' returns only date. |
parse_date_time | Convert data to Date/Time data type. |
qday | Extract Day of Quarter |
quarter | Extract Quarter |
rollback | Calculate the last day of the previous month or the first day of the month based on a given date. |
round_date | Rounds date/time to the nearest integer value of the specified time unit. |
second | Extract second |
seconds | Convert a number to second periods so you can use it for further calculations. |
time_length | Calculate the exact time length between two dates. |
today | Returns current date without time. A similar function 'now()' returns date including time. |
wday | Extract day of the week as a decimal number (01-07, Sunday is 1) or as Text (e.g. Sunday). The Text can be a full text or an abbreviated text (e.g., Sunday vs. Sun). |
week | Extract week numbers of the year. The 1st day of the 1st week always starts from January 1st regardless of the day of the week. |
weeks | Convert a number to week periods so you can use it for further calculations. |
with_tz | Returns Date / Time in a given time zone |
yday | Extract Day of Year |
ydm | Convert Character or Number to Date / Period when data contains only Date, but not Time. |
ydm_h | Convert Character or Number to Date / Period when data contains Date and Time. |
ydm_hm | Convert Character or Number to Date / Period when data contains Date and Time. |
ydm_hms | Convert Character or Number to Date / Period when data contains Date and Time. |
year | Extract Year |
years | Convert a number to year periods so you can use it for further calculations. |
ymd | Convert Character or Number to Date / Period when data contains only Date, but not Time. |
ymd_h | Convert Character or Number to Date / Period when data contains Date and Time. |
ymd_hm | Convert Character or Number to Date / Period when data contains Date and Time. |
ymd_hms | Convert Character or Number to Date / Period when data contains Date and Time. |
coalesce | Replace the missing values with a given value or the values that are at the same position in other columns. |
recode | Recodes (or replaces) the original values with given values. You can replace numeric values either by the name with backticks or based on their position, and character values by their name. |
parse_time | Convert data to Time data type. |
ym | Convert Character or Number to Date / Period when data contains only Date, but not Time. |
my | Convert Character or Number to Date / Period when data contains only Date, but not Time. |
yq | Convert text to Date data type if a given column holds values that look like Year and Quarter. |
epiweek | Extract week numbers of the year. Weeks start from Sunday. |
as_date | Convert a given POSIXct to Date. |
as.hms | Convert a given data to hms data type. It is a difftime treated as seconds. |
impute_na | Impute NA by average / median values, manually entered values, or predicted values. It will build a linear regression model for the prediction. |
get_mode | Returns the most frequent value (mode) in a column. |
summarize_group | Summarize (or aggregate) column values with specified aggregate functions. If grouping columns are provided then summarize the values for each group. |
summarize_row | Summarize (or aggregate) column values across columns for each row with the specified aggregate function. |
mutate_group | Create new columns or update existing columns with specified functions such as Window Calculations. If grouping columns are provided then it performs the calculation for each group. |
weekend | Returns either Weekday or Weekend based on the provided date column value. |
is_japanese_holiday | Alias to is_jholiday. Returns TRUE if the provided date is a Japanese Holiday. |
get_week_of_month | Extract Week of Month. For example, if the given date is in the 1st week of the month, it returns `1`. |
years_between | Calculate period between two dates in years. |
months_between | Calculate period between two dates in months. |
weeks_between | Calculate period between two dates in weeks. |
days_between | Calculate period between two dates in days. |
hours_between | Calculate period between two dates in hours. |
minutes_between | Calculate period between two dates in minutes. |
seconds_between | Calculate period between two dates in seconds. |
last_date | Returns the last date of the specified period (e.g., month) that the original date belongs to. |
ts_lag | Adds lag time to a time series data. |
ts_diff | Returns differences in values in a time series data since the specified lag time ago. |
ts_diff_ratio | Returns the differences of the values in ratio compared to the base values from the lag time ago in a time series. |
is_jholiday | Returns TRUE if the provided date is a Japanese Holiday. |
intersect | Keep the rows that appear in all of the given data frames. |
setdiff | Keep the rows that appear in the first data frame but not the other data frames. |
union | Keep the rows that appear in one of the given data frames. |
anti_join | Return all rows from the current data frame where there are not matching values in the target, keeping just columns from the current. |
arrange | Sort rows by given column(s). |
bind_cols | Bind multiple data frames by column. |
bind_rows | Bind multiple data frames by row. |
distinct | Select distinct/unique rows. Only the first row will be preserved. Set .keep_all argument to TRUE to keep all the original columns. |
filter | Select rows with conditions. |
full_join | Return all rows and all columns from both the current data frame and the target data frame. |
group_by | Converts the data frame into a grouped data frame where the following operations will be performed based on the groups. Grouping should be done before you want to aggregate values. |
inner_join | Return all rows from the current data frame where there are matching values in the target, and all columns from the current and the target. |
left_join | Return all rows from the current data frame, and all columns from the current and the target. Rows in the current with no match in the target will have NA values in the new columns. If there are multiple matches between the current and the target, all combinations of the matches are returned. |
mutate | Mutate creates new column(s) with given expressions and preserves existing columns. |
rename | Rename existing column names. |
right_join | Return all rows from the target data frame, and all columns from the current and the target |
sample_frac | Sample n fraction of rows from the data frame. |
sample_n | Sample n rows from the data frame. |
select | Select column(s) by column names. |
semi_join | Return all rows from the current data frame where there are matching values in the target data frame, keeping just columns from the current. |
slice | Select rows by positions. This is not supported for relational databases, in which case you should use filter() with row_number() function. |
summarize | Summarize column values into a single value by using aggregate functions. Make sure you have already grouped the data frame using group_by(). |
top_n | Select the top n entries based on a given measure in each group. |
transmute | Transmute adds new columns and drops existing columns. |
ungroup | Ungroup existing grouping |
predict | Returns data augmented with fitted values and residuals. |
complete | Complete a data frame by adding missing combinations of data. |
expand | Expand a data frame by adding all the combination of given columns. |
fill | Fills missing values using the previous entry. This is especially convenient when values are omitted from some cells to avoid duplicated entries. |
gather | Takes multiple columns and collapses into key-value pairs, duplicating all other columns as needed. |
nest | Nest a set of columns together as a list column. |
separate | Separates a column with delimited values into multiple columns. |
spread | Spread a key-value pair across multiple columns. Missing values will be replaced with NA. |
unite | Unite multiple columns together into one column with given uniting characters. It will concatenate values with "_" and remove the original columns by default. Multiple columns specified in the list can be combined together at once. |
unnest | Unnest a list column or a list of data frames by making each element of the list to be presented in its own row. |
select_if | Select column(s) with a predicate (conditional) function. |
mutate_all | Apply functions to all the columns. |
mutate_at | Apply functions to specified columns. |
mutate_if | Apply functions to only the columns that match with a given condition. |
summarize_all | Apply functions to all the columns. |
summarize_at | Apply functions to specified columns. |
summarize_if | Apply functions to only the columns that match with a given condition. |
separate_rows | Separates a column with delimited values into multiple rows. |
drop_na | Drop rows that have NA value. |
type_convert | Heuristically guess the data type for each column by reading the first 1000 rows, parse the data, and set appropriate data types for all the columns of the data frame. |
model_info | Returns a summary information of a given model in a tidy (normalized data frame) format. |
remove_empty_rows | Remove rows whose column values are all NAs. |
remove_empty_cols | Remove columns whose values are all NAs. |
clean_names | Make column names clean by using only _ character, lowercase letters, and numbers. |
get_dupes | Get rows that are duplicated. |
convert_to_NA | Convert specific values to NA. |
tabyl | Create frequency table. |
do_svd.kv | Calculates coordinates by reducing dimensionality using SVD (Singular Value Decomposition). |
do_svd | Calculates coordinates by reducing dimensionality using SVD (Singular Value Decomposition). |
build_lm | Builds a linear regression model (lm) and stores it in a data frame. |
build_lr | Builds a logistic regression model and stores it in a data frame. |
build_glm | Builds generalized linear models (glm) and stores them in a data frame. |
build_multinom | Builds a multinomial logistic regression model and stores it in a data frame. |
prediction | Returns a data frame with regression information about a model. |
prediction_binary | Returns a data frame with binary classification information about a model. |
prediction_coxph | Returns a data frame with predicted values of Cox Proportional Hazard Model. |
model_coef | Returns a data frame with 'Parameter Estimates (Coefficients)' information about a model, including: term - the term in lm that is estimated; estimate - the estimated coefficient (the larger the number, the more positive effect the variable has); std_error - standard error (how much prediction error there is in the variable); t_ratio - estimate / std_error (this value in the t distribution gives p_value); p_value - the probability that the variable is not effective; conf_low - the lower bound of the confidence interval; conf_high - the upper bound of the confidence interval. |
model_stats | Returns a data frame with 'Summary of Fit' information about a model, including: r_square - R square; r_square_adj - adjusted R square; root_mean_square_error - root mean square error; f_ratio - F ratio; p_value - P value; df - degrees of freedom; null_deviance - null deviance; df_for_null_model - degrees of freedom for the null model; log_likelihood - log likelihood; deviance - deviance; AIC - Akaike's information criterion; BIC - Bayesian information criterion; residual_df - residual degrees of freedom. |
model_anova | Returns a data frame with ANOVA test information about a model, including: df - degrees of freedom; sum_of_squares; mean_square; f_ratio; p_value; deviance; residual_df; residual_deviance. |
evaluate_regression | Returns a data frame with evaluation scores of regression, including: r_squared; explained_variance; mean_square_error; misclassification_error; root_mean_square_error; mean_absolute_error; mean_absolute_percentage_error. |
evaluate_binary | Returns a data frame with evaluation scores of binary classification, including: AUC; f_score; accuracy; misclassification_rate; precision; recall; specificity; true_positive - number of positive predictions that actually are positive; false_positive - number of positive predictions that actually are negative; true_negative - number of negative predictions that actually are negative; false_negative - number of negative predictions that actually are positive; test_size - the number of tested data points; threshold - the threshold value for prediction. |
evaluate_multi | Returns a data frame with evaluation scores of multi-class classification, including: micro_f_score; macro_f_score; accuracy; misclassification_rate. |
do_roc | Returns coordinates of roc curve. |
build_kmeans.cols | Builds a clustering model (k-means) from variable columns and returns the summary of the model or the augmented data depending on the parameter value. |
build_kmeans.kv | Builds a clustering model (k-means) from key-value columns and stores the model in a generated data frame or augments the original data with the cluster ID. |
build_kmeans | Builds a clustering model (k-means). Stores the model into a generated data frame or the augmented data depending on the parameter value. |
do_t.test | Execute t-test, which checks differences of means of variables. |
do_var.test | Execute F-test, which checks the differences of variances between groups. |
do_chisq.test | Execute chi-squared contingency table tests and goodness-of-fit tests. |
do_apriori | Find rules about which items tend to occur together in transaction data. |
do_anomaly_detection | Detect anomalies in a time series data frame. |
do_prophet | Add forecast data to a time series data frame. |
do_market_impact | Estimate impact of an event (advertisement, etc.) on a market, by using other markets to form a synthetic control. |
row_as_header | Use a row as column names. |
pivot | Pivot columns into rows and columns. Values are count of pairs of rows and columns or aggregation of another column. |
do_cor.cols | Calculates correlations for all the pairs of the variables (columns). |
do_cor.kv | Calculates correlations for all the pairs of subject columns. |
do_cor | Calculates correlations for all the pairs of the variables or subjects. |
do_dist.kv | Calculate the distances between each of the pairs. |
do_dist.cols | Calculate distances of each of the pairs. |
do_dist | Calculate distances of each of the pairs of the variables or subjects. |
do_cmdscale | Execute multidimensional scaling (MDS). Calculate approximated coordinations from distances of entity pairs. |
build_model | Create data frame with models from input data frame by model function and arguments. |
one_hot | One-hot encodes a categorical column, producing a separate column for each categorical value, each of which has values of 1 or 0 indicating whether a row has the value the column represents. |
sample_rows | Sample n rows from the data frame. This is the same as sample_n except that it handles the case where the number of rows in the data is fewer than the specified n without throwing an error. |
levels | Returns the levels of a given factor. |
coalesce | Replace the missing values with a given value or the values that are at the same position in other columns. |
recode | Recodes (or replaces) the original values with given values. You can replace numeric values either by the name with backticks or based on their position, and character values by their name. |
as_factor | Convert a given column to a factor. Compared to base R's as.factor, this function creates levels in the order in which they appear, which will be the same on every platform (base R sorts in the current locale which can vary from place to place). |
fct_anon | Anonymize factor levels. Replaces factor levels with arbitrary numeric identifiers. Neither the values nor the order of the levels are preserved. |
fct_expand | Add additional levels to a factor |
fct_drop | Drop unused levels. |
fct_explicit_na | This gives missing values an explicit factor level, ensuring that they appear in the charts. |
fct_lump | Moves least/most common factor levels into "Other" category. |
fct_other | Creates "Other" category by moving specified values to "Other", or by keeping only specified values from "Other". |
fct_inorder | Reorder factor levels by first appearance. |
fct_infreq | Reorder factor levels by frequency (from the most frequent to the least). |
fct_relevel | Sets the level of a given factor column by moving any number of levels to any location. |
fct_reorder | Reorder factor levels by sorting based on another variable |
fct_rev | Reverse order of factor levels. |
impute_na | Impute NA by average / median values, manually entered values, or predicted values. It will build a linear regression model for the prediction. |
cluster | Build K-means clustering model and label each row with cluster id. |
get_mode | Returns the most frequent value (mode) in a column. |
summarize_group | Summarize (or aggregate) column values with specified aggregate functions. If grouping columns are provided then summarize the values for each group. |
summarize_row | Summarize (or aggregate) column values across columns for each row with the specified aggregate function. |
mutate_group | Create new columns or update existing columns with specified functions such as Window Calculations. If grouping columns are provided then it performs the calculation for each group. |
range | Returns the min and max values of a given data. |
between | Return TRUE or FALSE based on whether a given value is within a given range (between). |
cor | Computes the correlation of two column values. The covariance divided by the product of the standard deviations of the two column values. |
cov | Computes the covariance of two column values. |
str_detect | Return TRUE or FALSE based on whether Text data contains a given text or not |
list_extract | Extract a value of a list data type column based on a position or name when the data inside the list is a data frame. |
list_to_text | Concatenates texts from all the elements of a list data type column. |
list_concat | Concatenates values from multiple columns into a list. |
list_n | Returns number of elements inside a list data type column for each row. |
full_seq | Generate a sequence of numbers based on the values in a given column. |
nesting | Fix multiple columns so that only the combination of actually existing values in those columns will be used for complete() or expand() operation. |
param_remove | Removes parameters from a given URL. |
url_decode | Decodes a given encoded URL. It assumes the character encoding is UTF-8. |
url_encode | Encodes a given URL. It assumes the character encoding is UTF-8. |
url_parameters | Returns decoded query parameters from url. |
url_domain | Returns domain (ex. "exploratory.io") from url. |
url_fragment | Returns fragment from url. |
url_path | Returns path from url. |
url_port | Returns port from url. |
url_scheme | Returns scheme (ex. "http", "https") from url. |
url_suffix | Returns suffix (ex. "com", "org") from url. |
url_subdomain | Returns subdomain (ex. "www", "blog") from url. |
url_tld | Returns top-level domain (ex. "com", "co") from url. |
url_param | Returns decoded query parameter from url. |
- | Removes column(s). It can be used along with any column(s) selection operation like select(), gather(), etc. |
starts_with | Returns the column names that start with a given text. |
ends_with | Returns the column names that end with a given text. |
contains | Returns the column names that contain a given text. |
matches | Returns the column names that match a given text. |
num_range | Returns the column names that start with a given text followed by numbers within a given range. It's useful especially when you have column names like X1, X2, X3, X4, etc. |
any_of | Returns the column names that are any of the given names. |
everything | Returns all the column names. It's useful when you want to have particular column(s) first before everything else. |
where | Returns the columns for which the specified column selection function (e.g. is.numeric) returns TRUE. |
across | This function is used for selecting columns for other functions such as summarize_row. |
vars | Returns a list of user selected columns. |
as_date | Convert a given POSIXct to Date. |
weekend | Returns either Weekday or Weekend based on the provided date column value. |
is_japanese_holiday | Alias to is_jholiday. Returns TRUE if the provided date is a Japanese Holiday. |
is_jholiday | Returns TRUE if the provided date is a Japanese Holiday. |
xgboost_reg | Create extreme gradient boosting model for regression. |
xgboost_binary | Create extreme gradient boosting model for binary classification. |
xgboost_multi | Create extreme gradient boosting model for multiclass classification. |
randomForestReg | Create random forest model for regression. |
randomForestBinary | Create random forest model for binary classification. |
randomForestMulti | Create random forest model for multi class classification. |
do_survfit | Calculates Survival Curve from survival time and survival status. |
prediction_survfit | Simulates Survival Curve for specified cohort based on a survival model. |
build_coxph | Builds Cox Proportional Hazard Model for survival analysis and store it in a data frame. |
prediction_coxph | Returns a data frame with predicted values of Cox Proportional Hazard Model. |
ip_to_country | Returns country names from IP addresses. |
do_tokenize | Returns one token (e.g. word) per row after tokenizing a text. |
pair_count | Count pairs of words (tokens) that cooccur within a group |
do_tfidf | Calculates TF-IDF for each term against a group. TF-IDF is a weighting mechanism that calculates the importance of each word to each document by increasing the importance based on the term frequency while decreasing the importance based on the document frequency. |
do_ngram | Create columns of n-grams connected in sentences. |
do_cosine_sim.kv | Calculates the similarity between each pair of the documents using the cosine similarity algorithm. Cosine similarity measures the cosine of the angle between two vectors in the multi-dimensional space. |
summarize_group | Summarize (or aggregate) column values with specified aggregate functions. If grouping columns are provided then summarize the values for each group. |
mutate_group | Create new columns or update existing columns with specified functions such as Window Calculations. If grouping columns are provided then it performs the calculation for each group. |
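The case_when entry above says "See examples below" but its detail section is not included here, so the following is a minimal hypothetical sketch (the score and grade column names are made up for illustration):
mutate(grade = case_when(score >= 90 ~ "A", score >= 80 ~ "B", TRUE ~ "C"))
Assigns "A" when score is 90 or greater, "B" when it is 80 or greater, and "C" otherwise. The two-sided formulas are evaluated in order, and TRUE acts as the ELSE condition.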
Summary
Reverses the meaning of the following condition. It is a logical negation and has a similar meaning as 'Not'.
Syntax
!
Return Value
Example
filter(!is.na(Population))
Keep the row if Population value is NOT NA
filter(!str_detect(City, "Francisco"))
Keep the row if City value does NOT include text 'Francisco'.
filter(!Sales > 500)
Keep the row if Sales value is NOT greater than 500.
Summary
Abbreviate strings to at least minlength characters, such that they remain unique (if they were). First all spaces at the beginning of the string are stripped. Then (if necessary) any other spaces are stripped. Next, lower case vowels are removed (starting at the right) followed by lower case consonants. Finally if the abbreviation is still longer than minlength upper case letters are stripped.
Syntax
abbreviate(<column>
, minlength = <number>
, dot = <logical>
)
Arguments
Return Value
Character
Example
abbreviate("Exploratory")
Returns "Expl"
abbreviate("Exploratory", 3)
Returns "Exp"
Summary
Returns the absolute value.
Syntax
abs(<column_num>
)
Return Value
Numeric
Example
mutate(abs = abs(ARR_DELAY))
Summary
Returns the inverse cosine of a value, in radians
Syntax
acos(<column_num>
)
Return Value
Numeric
Example
mutate(a = acos(X1))
Create a new column for calculating the acos.
Summary
Returns the inverse hyperbolic cosine of a number.
Syntax
acosh(<column_num>
)
Return Value
Numeric
Example
mutate(a = acosh(X1))
Create a new column for calculating the acosh.
Summary
Return TRUE if the condition is satisfied for all the rows or a group of rows.
Syntax
all(<condition>
, na.rm = <logical>
)
Arguments
Return Value
Logical
Example
summarize(is_matched_all = all(userid == username))
Return TRUE if all the userid values match the username values, otherwise FALSE.
summarize(is_matched_all = all(revenue > 500))
Return TRUE if all the revenue values are greater than 500, otherwise FALSE.
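The na.rm argument shown in the syntax above is not exercised by these examples; a minimal sketch reusing the hypothetical revenue column (the result column name is_all_big is made up):
summarize(is_all_big = all(revenue > 500, na.rm = TRUE))
Return TRUE if all the non-NA revenue values are greater than 500; NA values are ignored instead of turning the result into NA.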
Summary
Return TRUE if the condition is satisfied for any of the rows or a group of rows.
Syntax
any(<condition>
, na.rm = <logical>
)
Arguments
Return Value
Logical
Example
summarize(is_matched_any = any(userid == username))
Return TRUE if any of the userid values match the username values, otherwise FALSE.
summarize(is_matched_any = any(revenue > 500))
Return TRUE if any of the revenue values are greater than 500, otherwise FALSE.
Summary
Convert a given data to character data type.
Syntax
as.character(<column>
)
Return Value
Character
Example
as.character(123.11)
Returns "123.11" as character.
Summary
Convert a given data to Date data type. Date data type doesn't include Time. The default formats follow the rules of the ISO 8601 international standard which expresses a day as "2001-02-03". If the given data is a number representing an internal system number (# of days since some origin), provide the date to be used as 'origin'.
Syntax
as.Date(<column>
, format = <date_format>
, tz = <timezone>
, origin = <text>
)
Arguments
Return Value
Date
Example
as.Date("2015-10-10")
Returns "2015-10-10"
as.Date("2015-10-10", tz = "America/Los_Angeles")
Returns "2015-10-10"
as.Date("10-10-10", format = "%d-%m-%y")
Returns "2010-10-10"
as.Date(35981, origin = "1899-12-30") # This is recommended for Windows Excel data.
Returns "1998-07-05"
as.Date(34519, origin = "1904-01-01") # This is recommended for Mac Excel data.
Returns "1998-07-05"
Summary
Convert data to double precision number data type. It is identical to numeric.
Syntax
as.double(<column>
)
Return Value
Double
Example
as.double("12345")
Returns 12345
as.double("12345.10")
Returns 12345.1
Summary
Encode a column as a factor.
Syntax
as.factor(<column>
)
Arguments
Return Value
Factor
Example
columnA - "iMac", "iPod", "iPhone", "iPod", "iPhone"
as.factor(columnA)
Returns iMac, iPod, iPhone, iPod, iPhone (It has Levels information of : iMac iPhone iPod)
Summary
Convert data to integer number data type.
Syntax
as.integer(<column>
)
Return Value
Integer
Example
as.integer("12345")
Returns 12345
as.integer("12345.10")
Returns 12345
Summary
Convert data to Logical (Boolean) data type.
Syntax
as.logical(<column>
)
Return Value
Logical
Example
as.logical("TRUE")
Returns TRUE
as.logical("true")
Returns TRUE
as.logical("T")
Returns TRUE
as.logical("True")
Returns TRUE
as.logical(1)
Returns TRUE
Summary
Convert data to numeric data type.
Syntax
as.numeric(<column>
, units = <difftime_units>
)
Arguments
Return Value
Numeric
Example
as.numeric("12345")
Returns 12345
as.numeric("12345.10")
Returns 12345.1
as.numeric(as.Date("2015-01-30") - as.Date("2015-01-15"), units = "days")
Returns 15
as.numeric(as.Date("2015-01-30") - as.Date("2015-01-15"), units = "weeks")
Returns 2.142857
Summary
Convert data to Date/Time data type.
Syntax
as.POSIXct(<column>
, format = <date_time_format>
, tz = <timezone>
, origin = <text>
)
Arguments
Return Value
POSIXct
Example
as.POSIXct("2015-10-10 13:10:05")
Returns "2015-10-10 13:10:05 PDT"
as.POSIXct("2015-10-10 13:10:05", tz = "America/Los_Angeles")
Returns "2015-10-10 13:10:05 PDT"
as.POSIXct("2015-10-10 13:10:05", tz = "Asia/Tokyo")
Returns "2015-10-10 13:10:05 JST"
as.POSIXct("05-10-15T13:10:05", format = "%d-%m-%yT%H:%M")
Returns "2015-10-05 13:10:00 PDT"
Summary
Returns the inverse sine of a value, in radians.
Syntax
asin(<column_num>
)
Return Value
Numeric
Example
mutate(a = asin(X1))
Create a new column for calculating the asin.
Summary
Returns the inverse hyperbolic sine of a number.
Syntax
asinh(<column_num>
)
Return Value
Numeric
Example
mutate(a = asinh(X1))
Create a new column for calculating the asinh.
Summary
Returns the inverse tangent of a value, in radians.
Syntax
atan(<column_num>
)
Return Value
Numeric
Example
mutate(a = atan(X1))
Create a new column for calculating the atan.
Summary
Returns the angle between the x-axis and a line segment from the origin (0,0) to a specified coordinate pair (x, y), in radians.
Syntax
atan2(<column_num>
, <column_num>
)
Return Value
Numeric
Example
mutate(a = atan2(X1, Y1))
Create a new column for calculating the atan2.
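As a concrete numeric illustration with literal values rather than columns:
atan2(1, 1)
Returns 0.7853982 (pi / 4), the angle from the x-axis to the point (1, 1).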
Summary
Returns the inverse hyperbolic tangent of a number.
Syntax
atanh(<column_num>
)
Return Value
Numeric
Example
mutate(a = atanh(X1))
Create a new column for calculating the atanh.
Summary
This is a generic function which combines its arguments.
Syntax
c(<value1>
, <value2>
, ... , recursive = <logical>
)
Arguments
Example
c(1:10)
Returns 1 2 3 4 5 6 7 8 9 10
c(1:10, 1:3)
Returns 1 2 3 4 5 6 7 8 9 10 1 2 3
c("a", "b", "c")
Returns "a" "b" "c"
Summary
Rounds a number up to the nearest integer. ceiling(3.475) is 4
Syntax
ceiling(<column_num>
, digits = <number>
)
Arguments
Return Value
Numeric
Example
ceiling(3.475, digits=2)
Returns 3.48
Summary
Translates each character in x that is specified in old to the corresponding character specified in new.
Syntax
chartr(<old_text>
, <new_text>
, <column_text>
)
Arguments
Return Value
Character
Example
x = "abcdef 123456"
chartr("abc", "xyz", x)
Returns "xyzdef 123456"
chartr("a-f", "u-z", x)
Returns "uvwxyz 123456"
chartr("a-f", "xxxxxx", x)
Returns "xxxxxx 123456"
chartr("123", "000", x)
Returns "abcdef 000456"
Summary
Returns the cosine of an angle provided in radians.
Syntax
cos(<column_num>
)
Return Value
Numeric
Example
mutate(a = cos(X1))
Create a new column for calculating the cos.
Summary
Returns the hyperbolic cosine of any real number.
Syntax
cosh(<column_num>
)
Return Value
Numeric
Example
mutate(a = cosh(X1))
Create a new column for calculating the cosh.
Summary
Returns cos(pi * x) for a given numeric value x.
Syntax
cospi(<column_num>
)
Return Value
Numeric
Example
mutate(a = cospi(X1))
Create a new column for calculating the cospi.
Summary
Returns the cumulative maxima.
Syntax
cummax(<column_num>
)
Return Value
Numeric
Example
// X = c(1, 2, 3, 2, 1)
cummax(X)
Returns 1, 2, 3, 3, 3
Summary
Returns the cumulative minima.
Syntax
cummin(<column_num>
)
Return Value
Numeric
Example
// X = c(1, 2, 3, 2, 1)
cummin(X)
Returns 1, 1, 1, 1, 1
Summary
Returns the cumulative products.
Syntax
cumprod(<column_num>
)
Return Value
Numeric
Example
// X = c(1, 2, 3, 2, 1)
cumprod(X)
Returns 1, 2, 6, 12, 12
Summary
Returns the cumulative sums.
Syntax
cumsum(<column_num>
)
Return Value
Numeric
Example
// X = c(1, 2, 3, 2, 1)
cumsum(X)
Returns 1, 3, 6, 8, 9
Summary
Divides the range of values into intervals and returns, for each value, the interval it falls into.
Syntax
cut(<column_num>
, breaks = <number>
, labels = <text>
, include.lowest = <logical>
, right = <logical>
, dig.lab = <number>
, ordered_result = <logical>
)
Arguments
Example
//x = c(1,2,3,4,5,6,7,8,9,10)
cut(x, breaks = 5)
Returns (0.991,2.8] (0.991,2.8] (2.8,4.6] (2.8,4.6] (4.6,6.4] (4.6,6.4] (6.4,8.2] (6.4,8.2] (8.2,10] (8.2,10]
cut(x, breaks = 2)
Returns (0.991,5.5] (0.991,5.5] (0.991,5.5] (0.991,5.5] (0.991,5.5] (5.5,10] (5.5,10] (5.5,10] (5.5,10] (5.5,10]
cut(x, breaks = 2, labels = c(1,2))
Returns 1 1 1 1 1 2 2 2 2 2
cut(x, breaks = 2, labels = c("A", "B"))
Returns A A A A A B B B B B
mutate(a = cut(FL_DATE, breaks = 5, start.on.monday = FALSE))
Returns one of the five buckets based on the date value
Summary
Calculates the difference between two given dates/times.
Syntax
difftime(<column_date>
, <column_date>
, tz = <timezone>
, units = <difftime_units>
)
Arguments
Return Value
Difftime
Example
difftime("2015-12-10", "2015-12-30"))
Returns Time difference of -20 days.
difftime("2015-12-10", "2015-12-30", unit = "hours")
Returns Time difference of -480 hours.
as.numeric(difftime("2015-12-10", "2015-12-30"))
Returns -20.
as.numeric(difftime("2015-12-10", "2015-12-30"), units = "hours")
Returns -480.
Summary
Returns a logical value (TRUE or FALSE) indicating if it is a duplicated value or rows.
Syntax
duplicated(<column(s)>
, incomparables = <logical>
)
Arguments
Return Value
Logical
Example
// x <- c("a", "b", "c", "a", "b")
duplicated(x)
Returns FALSE FALSE FALSE TRUE TRUE
duplicated(x, incomparables = "a")
Returns FALSE FALSE FALSE FALSE TRUE
mutate(a = duplicated(ARR_TIME))
Returns TRUE or FALSE for each row
filter(duplicated(ARR_TIME, ARR_DELAY))
Keep only the duplicated rows
summarise(a = sum(!duplicated(ARR_TIME)))
Returns # of NOT duplicated values for each group. This is the same as the one below.
summarise(a = sum(length(unique(ARR_TIME))))
Summary
Returns Euler's number, e (~2.718) raised to a power.
Syntax
exp(<column_num>
)
Return Value
Numeric
Example
mutate(revenue_exp = exp(revenue))
Returns the exponential value of revenue
Summary
Encode a column as a factor.
Syntax
factor(<column>
, levels = <text>
, labels = <text>
, exclude = <text>
, ordered = <logical>
, nmax = <number>
)
Arguments
Return Value
Factor
Example
mutate(color = factor(color, levels = c("Gold","Silver","Bronze"), ordered=TRUE))
Before:
color | counts |
---|---|
Bronze | 8 |
Silver | 12 |
Gold | 15 |
After:
color | counts |
---|---|
Gold | 15 |
Silver | 12 |
Bronze | 8 |
Summary
Rounds a number down to the nearest integer. floor(3.475) is 3.
Syntax
floor(<column_num>
, digits = <number>
)
Arguments
Return Value
Numeric
Example
floor(3.475, digits=2)
Returns 3.47
Summary
Convert a given text from one encoding to another.
Syntax
iconv(<column>
, from = <encoding>
, to = <encoding>
, sub = <text>
, mark = <logical>
, toRaw = <logical>
)
Arguments
Return Value
Character
Example
iconv(x, "ISO_8859-1", "UTF-8")
Converts a given text from "ISO_8859-1" to "UTF-8".
Summary
Returns different values based on the conditional result.
Syntax
ifelse(<condition>
, <return_value_when_TRUE>
, <return_value_when_FALSE>
)
Arguments
Return Value
Example
//x <- 1:5
ifelse(x > 3, "Bigger", "Smaller")
Returns "Smaller" "Smaller" "Smaller" "Bigger" "Bigger"
Summary
Keep the rows that appear in all of the given data frames.
Syntax
Intersect(<data_set(s)>
, ...)
Example
Intersect(DATA_2015)
Keep the rows that appear in the original data and DATA_2015.
Summary
Returns true if a given object is character type.
Syntax
is.character
Return Value
Logical
Example
mutate_if(is.character, str_to_lower)
Convert all the letters to lowercase for all the character type columns.
Summary
Returns true if a given object is double type.
Syntax
is.double
Return Value
Logical
Example
mutate_if(is.double, round)
Round all the double type columns.
mutate_if(is.double, funs(. * 0.2))
Multiply all the double columns by 0.2 overriding existing columns.
mutate_if(is.double, funs(calc = . * 0.2))
Multiply all the double columns by 0.2 as new columns.
summarize_if(is.double, mean)
Calculate the mean (average) for all the double columns.
Summary
Returns true if a given object is factor type.
Syntax
is.factor
Return Value
Logical
Example
mutate_if(is.factor, as.character)
Convert all the factor type columns to character type.
Summary
Returns true if a given object is list type.
Syntax
is.list
Return Value
Logical
Example
select_if(is.list)
Select only list type columns.
Summary
Returns true if a given object is logical type.
Syntax
is.logical
Return Value
Logical
Example
mutate_if(is.logical, as.numeric)
Convert the logical values (TRUE, FALSE) to numeric values (1,0).
summarize_if(is.logical, sum)
Count the number of TRUE values for all the logical type columns.
Summary
Returns true if a given object is matrix type.
Syntax
is.matrix
Return Value
Logical
Example
mutate_if(is.matrix, scale)
Scale and center all the columns of matrices in all the matrix type columns.
Summary
Returns TRUE when it is NA.
Syntax
is.na(<column>
)
Return Value
Logical
Example
filter(is.na(ARR_DELAY))
Keep only rows where ARR_DELAY is NA
filter(!is.na(ARR_DELAY))
Keep only rows where ARR_DELAY is NOT NA
Summary
Returns TRUE when it is null.
Syntax
is.null(<column>
)
Return Value
Logical
Example
filter(is.null(ARR_DELAY))
Keep only rows where ARR_DELAY is null
filter(!is.null(ARR_DELAY))
Keep only rows where ARR_DELAY is NOT null
Summary
Returns true if a given object is numeric type.
Syntax
is.numeric
Return Value
Logical
Example
mutate_if(is.numeric, funs(. * 0.2))
Multiply all the numeric columns by 0.2 overriding existing columns.
mutate_if(is.numeric, funs(calc = . * 0.2))
Multiply all the numeric columns by 0.2 as new columns.
summarize_if(is.numeric, mean)
Calculate the mean (average) for all the numeric columns.
Summary
Returns true if and only if a given value is a length-one logical vector whose only element is TRUE.
Syntax
isTRUE(<value>
)
Return Value
Logical
Example
isTRUE(TRUE)
Returns TRUE
isTRUE(FALSE)
Returns FALSE
summarise(a = isTRUE(mean(ARR_DELAY) > 120))
Returns TRUE or FALSE based on the average of ARR_DELAY for each group.
Summary
Returns the length of the values.
Syntax
length(<column>
)
Return Value
Numeric
Example
summarize(text_length = length(TAIL_NUM))
Returns # of values for each group
summarize(text_length = length(unique(TAIL_NUM)))
Returns # of unique values of 'TAIL_NUM' for each group
Summary
Returns the levels of a given factor.
Syntax
levels(<column>
)
Example
// x <- as.factor(c("a", "b", "c", "d", "e", "a", "b", "a"))
levels(x)
Returns "a" "b" "c" "d" "e"
summarise(n = length(levels(as.factor(flight$CARRIER))))
Returns 16
Summary
Construct a list
Syntax
list(<column>
)
Return Value
List
Example
list(sales = 0, profit = 10)
Create a named list of [sales = 0, profit = 10]
complete(year, product, fill = list(sales = 0))
Set 0 for 'sales' column for missing values.
Summary
Returns logarithms, by default natural logarithms.
Syntax
log(<column_num>
, base = <number>
)
Arguments
Return Value
Numeric
Example
mutate(Sales_log = log(Sales))
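A sketch of the base argument (the 'Sales' column is from the example above); for instance, log(100, base = 10) returns 2.
mutate(Sales_log10 = log(Sales, base = 10))
Create a new column with the base-10 logarithm of Sales.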
Summary
Computes common (i.e., base 10) logarithms.
Syntax
log10(<column_num>
)
Arguments
Return Value
Numeric
Example
mutate(Sales_log = log10(Sales))
Summary
Computes log(1+x) accurately also for |x| << 1.
Syntax
log1p(<column_num>
)
Arguments
Return Value
Numeric
Example
mutate(Sales_log = log1p(Sales))
Summary
Computes binary (i.e., base 2) logarithms.
Syntax
log2(<column_num>
)
Arguments
Return Value
Numeric
Example
mutate(Sales_log = log2(Sales))
Summary
Returns the maximum value in a numeric column.
Syntax
max(<column>
, na.rm = <logical>
)
Arguments
Example
summarize(revenue_max = max(revenue))
Create a new column for calculating the max value of revenue for each group.
Summary
Returns the numerical average (mean) value.
Syntax
mean(<column_num_logic_date>
, na.rm = <logical>
, trim = <number>
)
Arguments
Return Value
Numeric
Example
summarize(revenue_mean = mean(revenue))
Create a new column for calculating the average value of revenue for each group.
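A sketch of the optional arguments (assuming 'revenue' may contain NA values); trim drops a fraction of the lowest and highest values before averaging.
summarize(revenue_mean = mean(revenue, na.rm = TRUE, trim = 0.1))
Calculate the average of revenue for each group after removing NA values and trimming 10% of the values from each end.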
Summary
Returns the minimum value in a numeric column.
Syntax
min(<column>
, na.rm = <logical>
)
Arguments
Example
summarize(revenue_min = min(revenue))
Create a new column for calculating the minimum of revenue for each group.
Summary
Convert a number to month periods so you can use it for further calculations.
Syntax
months(<column_num>
)
Return Value
Period
Example
months(1)
returns "1m 0d 0H 0M 0S"
ymd_hms("2015-10-01 06:15:30") + months(1)
returns "2015-11-01 06:15:30 UTC"
Summary
Returns # of letters or size of a given value.
Syntax
nchar(<column>
, type = "chars"|"bytes"|"width"
, allowNA = <logical>
)
Arguments
Return Value
Numeric
Example
nchar("Exploratory")
Returns 11
nchar("西田")
Returns 2
nchar("西田", type = "bytes")
Returns 6
Summary
Returns the parallel maxima of the input values.
Syntax
pmax(<column_num_logic_date>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
x <- c(1, 2, 3, 4, 5)
y <- c(3, 3, 3, 3, 3)
pmax(x, y)
Returns 3 3 3 4 5
mutate(a = pmax(ARR_DELAY, DEP_DELAY, na.rm = TRUE))
Returns a bigger value between ARR_DELAY and DEP_DELAY values.
Summary
Returns the parallel minima of the input values.
Syntax
pmin(<column_num_logic_date>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
x <- c(1, 2, 3, 4, 5)
y <- c(3, 3, 3, 3, 3)
pmin(x, y)
Returns 1 2 3 3 3
mutate(a = pmin(ARR_DELAY, DEP_DELAY, na.rm = TRUE))
Returns a smaller value between ARR_DELAY and DEP_DELAY values.
Summary
Returns the product value by multiplying a series of numbers in a given data together.
Syntax
prod(<column_num_logic>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
// x <- 1:5
prod(x)
Returns 120
Summary
Returns the minimum and maximum values of a given data.
Syntax
range(<column>
, na.rm = <logical>
)
Arguments
Return Value
Example
// x <- c(1, 2, 3, 4, 5)
range(x)
Returns 1 5
filter(ARR_DELAY %in% range(DEP_DELAY, na.rm = TRUE))
Keep only rows where ARR_DELAY is in the range of DEP_DELAY. This is the same as the one below.
filter(between(ARR_DELAY, range(DEP_DELAY, na.rm = TRUE)[1], range(DEP_DELAY, na.rm = TRUE)[2]))
Summary
Repeats given numbers.
Syntax
rep(<column>
, times = <number>
, length.out = <number>
, each = <number>
)
Arguments
Return Value
Numeric
Example
// x <- c(1, 2, 3, 4, 5)
rep(x, 2)
Returns 1 2 3 4 5 1 2 3 4 5
rep(x, times = 2)
Returns 1 2 3 4 5 1 2 3 4 5
rep(x, length.out = 8)
Returns 1 2 3 4 5 1 2 3
rep(x, times = 2, each = 2)
Returns 1 1 2 2 3 3 4 4 5 5 1 1 2 2 3 3 4 4 5 5
rep(x, length.out = 8, each = 2)
Returns 1 1 2 2 3 3 4 4
Summary
Reverses the entries in a given data.
Syntax
rev(<column>
)
Return Value
Example
// x <- c(1, 2, 3, 4, 5)
rev(x)
Returns 5 4 3 2 1
Summary
Rounds a number to a certain number of decimal places.
Syntax
round(<column_num>
, digits = <number>
)
Arguments
Return Value
Numeric
Example
round(3.475, digits=2)
Returns 3.48
Summary
Returns a sequence of numbers.
Syntax
seq(from = <number>
, to = <number>
, by = <number>
, length.out = <number>
, along.with = <column>
)
Arguments
Return Value
Numeric
Example
seq(1, 5)
Returns 1 2 3 4 5
seq(1, 5, by = 2)
Returns 1 3 5
seq(1:5)
Returns 1 2 3 4 5
seq(0, 1, length.out = 11)
Returns 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
seq(ymd("2017-04-01"), by = "month", length.out = 6)
Returns "2017-04-01" "2017-05-01" "2017-06-01" "2017-07-01" "2017-08-01" "2017-09-01" - It will generate date values for the next 6 months.
Summary
Keep the rows that appear in the first data frame but not the other data frames.
Syntax
setdiff(<data_set(s)>
, ...)
Example
setdiff(DATA_2015)
Keep the rows that appear in the original data frame but not in DATA_2015
Summary
Returns either 1, 0, or -1 depending on whether the input value is positive, zero, or negative, respectively.
Syntax
sign(<column_num>
)
Return Value
Numeric
Example
// x <- c(1, 2, 10, 0, -1, -2, -10)
sign(x)
Returns 1 1 1 0 -1 -1 -1
Summary
Rounds the values in its first argument to the specified number of significant digits.
Syntax
signif(<column_num>
, digits = <number>
)
Arguments
Return Value
Numeric
Example
signif(3.475, digits=2)
Returns 3.5
Summary
Returns the sine of an angle provided in radians.
Syntax
sin(<column_num>
)
Return Value
Numeric
Example
mutate(a = sin(X1))
Create a new column for calculating the sin.
Summary
Returns the hyperbolic sine of any real number.
Syntax
sinh(<column_num>
)
Return Value
Numeric
Example
mutate(a = sinh(X1))
Create a new column for calculating the sinh.
Summary
Returns sin(pi * x) for a given value x.
Syntax
sinpi(<column_num>
)
Return Value
Numeric
Example
mutate(a = sinpi(X1))
Create a new column for calculating the sinpi.
Summary
Returns the square root of the values.
Syntax
sqrt(<column_num>
)
Return Value
Numeric
Example
mutate(revenue_sq = sqrt(revenue))
Returns the square root of revenue values
Summary
Returns the sum of all the values.
Syntax
sum(<column_num_logic>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
summarize(total = sum(revenue))
Create a new column for calculating the sum of revenue for each group.
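A sketch of the na.rm argument (assuming 'revenue' may contain NA values); without it, a single NA makes the whole sum NA.
summarize(total = sum(revenue, na.rm = TRUE))
Create a new column for calculating the sum of revenue for each group, ignoring NA values.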
Summary
Returns the tangent of an angle provided in radians.
Syntax
tan(<column_num>
)
Return Value
Numeric
Example
mutate(a = tan(X1))
Create a new column for calculating the tan.
Summary
Returns the hyperbolic tangent of any real number.
Syntax
tanh(<column_num>
)
Return Value
Numeric
Example
mutate(a = tanh(X1))
Create a new column for calculating the tanh.
Summary
Returns tan(pi * x) for a given value x.
Syntax
tanpi(<column_num>
)
Return Value
Numeric
Example
mutate(a = tanpi(X1))
Create a new column for calculating the tanpi.
Summary
Truncates the numbers by omitting digits.
Syntax
trunc(<column_num>
)
Return Value
Numeric
Example
trunc(3.475)
Returns 3
Summary
Keep the rows that appear in one of the given data frames.
Syntax
union(<data_set(s)>
, ...)
Example
union(DATA_2015)
Keep the rows that appear in either or both the original data frame and DATA_2015.
Summary
Returns the unique values of a given data, removing duplicates.
Syntax
unique(<column>
, incomparables = <logical>
, fromLast = <logical>
, nmax = <number>
)
Arguments
Return Value
Numeric
Example
summarize(text_length = length(unique(TAIL_NUM)))
Returns # of unique values of 'TAIL_NUM' for each group
Summary
Returns the TRUE indices of a logical object or based on the condition.
Syntax
which(<condition>
, arr.ind = <logical>
, useNames = <logical>
)
Arguments
Return Value
Numeric
Example
x <- c("a", "b", "c", "d", "e")
which(x == "c")
Returns 3
select(which(sapply(., is.numeric)))
Returns only numeric columns.
Summary
Performs an exclusive-or check across two values and return TRUE when only one of them is TRUE.
Syntax
xor(<value1>
, <value2>
)
Return Value
Logical
Example
xor(TRUE, TRUE)
Returns FALSE
xor(FALSE, TRUE)
Returns TRUE
xor(FALSE, FALSE)
Returns FALSE
Summary
Return all rows from the current data frame where there are no matching values in the target, keeping just columns from the current.
Syntax
anti_join(<data_set>
, by = "<column>
")
anti_join(<data_set>
, by = c("<source_column>
" = "<target_column>
"))
Arguments
Example
anti_join(AIRPORT)
Join with AIRPORT data frame using all variables with common names.
anti_join(AIRPORT, by = "CODE")
Join with AIRPORT data frame using CODE columns.
anti_join(AIRPORT, by = c("DEST" = "CODE"))
Join with AIRPORT data frame using DEST column from the original data frame and CODE from the target.
Summary
Sort rows by given column(s).
Syntax
arrange(<column(s)>
)
Used Together
desc
Example
arrange(POPULATION)
Sort by POPULATION values in ascending order.
arrange(desc(POPULATION))
Sort by POPULATION values in descending order.
arrange(POPULATION, PRODUCTION)
Sort by POPULATION and PRODUCTION columns in ascending order.
arrange(-POPULATION, -PRODUCTION)
Sort by POPULATION and PRODUCTION columns, both in descending order. This is equivalent to the following.
arrange(desc(POPULATION), desc(PRODUCTION))
Summary
Returns TRUE or FALSE based on whether a given value is within a given range (between).
Syntax
between(<column_num_date>
, <left>
, <right>
)
Arguments
Return Value
Logical
Example
filter(between(revenue, 1000, 6000))
Keep data whose revenue is between 1000 and 6000.
Summary
Bind multiple data frames by column.
Syntax
bind_cols(<data_set(s)>
)
Example
bind_cols(ECONOMY)
Add columns from a data frame 'ECONOMY' to the existing data frame.
Summary
Bind multiple data frames by row.
Syntax
bind_rows(<data_set(s)>
, id_column_name = <text>
, current_df_name=<text>
, force_data_type=<logical>
, encoding=<encoding>
)
Arguments
Example
Existing data:
DATA2013
date | sales | product_id |
---|---|---|
11/05/2013 | 5000 | 1 |
12/20/2013 | 2000 | 2 |
Additional data to bind:
DATA2014
date | sales | product_id |
---|---|---|
01/05/2014 | 1000 | 1 |
02/20/2014 | 2000 | 2 |
DATA2015
date | sales | product_id |
---|---|---|
01/05/2015 | 3000 | 1 |
02/20/2015 | 3000 | 2 |
bind_rows(DATA2014)
Add a data frame 'DATA2014' to the existing data.
date | sales | product_id |
---|---|---|
11/05/2013 | 5000 | 1 |
12/20/2013 | 2000 | 2 |
01/05/2014 | 1000 | 1 |
02/20/2014 | 2000 | 2 |
bind_rows(DATA2014, DATA2015)
Add rows from a data frame 'DATA2014' and 'DATA2015' to the existing data.
date | sales | product_id |
---|---|---|
11/05/2013 | 5000 | 1 |
12/20/2013 | 2000 | 2 |
01/05/2014 | 1000 | 1 |
02/20/2014 | 2000 | 2 |
01/05/2015 | 3000 | 1 |
02/20/2015 | 3000 | 2 |
bind_rows(DATA2014, DATA2015, force_data_type = TRUE)
Add rows from the data frames 'DATA2014' and 'DATA2015' to the existing data and force all column data types to character, then reevaluate the column types after the merge.
For example, if the product_id column only has character values for DATA2015, the product_id column data type is Character. And if you try to bind this to other data frames whose product_id column data type is Integer, it gives an error because of this data type mismatch.
date | sales | product_id |
---|---|---|
01/05/2015 | 3000 | a |
02/20/2015 | 3000 | b |
In this case you can set force_data_type
argument as TRUE. This will create a data frame like below:
date | sales | product_id |
---|---|---|
11/05/2013 | 5000 | 1 |
12/20/2013 | 2000 | 2 |
01/05/2014 | 1000 | 1 |
02/20/2014 | 2000 | 2 |
01/05/2015 | 3000 | a |
02/20/2015 | 3000 | b |
bind_rows(DATA2014, DATA2015, current_df_name = "DATA2013", id_column_name = "df_name")
Add rows from the data frames 'DATA2014' and 'DATA2015' to DATA2013. This will create a column called 'df_name' and use "DATA2013", "DATA2014", and "DATA2015" to mark which of the three data frames each row came from.
df_name | date | sales | product_id |
---|---|---|---|
DATA2013 | 11/05/2013 | 5000 | 1 |
DATA2013 | 12/20/2013 | 2000 | 2 |
DATA2014 | 01/05/2014 | 1000 | 1 |
DATA2014 | 02/20/2014 | 2000 | 2 |
DATA2015 | 01/05/2015 | 3000 | 1 |
DATA2015 | 02/20/2015 | 3000 | 2 |
Summary
Summarize the data by either calling n() or sum() for each group.
Syntax
count(<column(s)>
, wt = <column_num_or_base_aggregate_expression>
, sort = <logical>
)
Arguments
Example
count(CARRIER)
Groups by 'CARRIER' and returns the number of rows for each 'CARRIER'.
count(CARRIER, sort = TRUE)
Groups by 'CARRIER' and returns the number of rows for each 'CARRIER' from the highest number to the lowest.
count(CARRIER, wt = ARR_DELAY)
Groups by 'CARRIER' and returns the sum total of 'ARR_DELAY'.
count(CARRIER, TAIL_NUM, wt = ARR_DELAY)
Groups by 'CARRIER' and 'TAIL_NUM', and returns the sum total of 'ARR_DELAY'.
count(CARRIER, wt = ARR_DELAY, sort = TRUE)
Groups by 'CARRIER' and returns the sum total of 'ARR_DELAY' and sorted by the total number.
count(CARRIER, wt = mean(ARR_DELAY, na.rm = TRUE))
Groups by 'CARRIER' and returns the average of 'ARR_DELAY' after removing NA values.
Summary
Returns TRUE if all the values up to the current position are TRUE.
Syntax
cumall(<condition>
)
Arguments
Return Value
Logical
Example
// X = TRUE, TRUE, FALSE, TRUE, FALSE
cumall(X)
Returns TRUE, TRUE, FALSE, FALSE, FALSE
Summary
Returns TRUE if any of the values up to the current position is TRUE.
Syntax
cumany(<condition>
)
Arguments
Return Value
Logical
Example
// X = TRUE, TRUE, FALSE, TRUE, FALSE
cumany(X)
Returns TRUE, TRUE, TRUE, TRUE, TRUE
Summary
Cumulative distribution. Proportion of all values less than or equal to the current rank. Missing values are left as is.
Syntax
cume_dist(<column_num>
)
Return Value
Numeric
Example
mutate(population_cume_dist = cume_dist(population))
Create a new column to have the cumulative distribution number based on 'population' values.
filter(cume_dist(population) < .25)
Select rows whose cumulative distribution numbers are less than .25.
Summary
Returns the cumulative mean (average).
Syntax
cummean(<column_num>
)
Arguments
Return Value
Numeric
Example
// X = 1, 2, 3, 2, 1
cummean(X)
Returns 1, 1.5, 2, 2, 1.8
Summary
Ranks with no gaps. Missing values are left as is.
Syntax
dense_rank(<column_num>
)
Return Value
Numeric
Example
mutate(population_rank = dense_rank(population))
Create a new column to have the rank number based on 'population' values.
filter(dense_rank(population) < 10)
Select rows whose rank numbers are less than 10.
Summary
Change the sorting order to a descending order. e.g. 9 to 1 instead of 1 to 9, z to a instead of a to z.
Syntax
desc(<column>
)
Example
arrange(desc(POPULATION))
Sort by POPULATION values in descending order.
arrange(desc(POPULATION), desc(PRODUCTION))
Sort by POPULATION and PRODUCTION columns, both in descending order.
Summary
Select distinct/unique rows. Only the first row will be preserved. Set .keep_all argument to TRUE to keep all the original columns.
Syntax
distinct(<column(s)>
, .keep_all = <logical>
)
Arguments
Example
distinct()
Keep only unique/distinct rows based on all the columns.
distinct(CARRIER, TAIL_NUM, FL_NUM)
Keep only unique/distinct rows based on CARRIER, TAIL_NUM, FL_NUM columns.
distinct(CARRIER, .keep_all = TRUE)
Keep only unique/distinct rows based on CARRIER and return all the columns.
Summary
Select rows with conditions.
Syntax
filter(<condition>
)
Used Together
is.na, between,
Example
filter(revenue < 6000)
Keep data whose revenue is less than 6000.
filter(revenue < 6000 & region == "West")
Keep data whose revenue is less than 6000 AND region is "West".
filter(revenue < 6000 | region == "West")
Keep data whose revenue is less than 6000 OR region is "West".
filter(between(revenue, 1000, 6000))
Keep data whose revenue is between 1000 and 6000.
filter(is.na(revenue))
Keep data whose revenue is NA.
filter(!is.na(revenue))
Keep data whose revenue is NOT NA.
Summary
Returns the first value of all the values.
Syntax
first(<column>
, order_by = <column(s)>
, default = <text>
)
Arguments
Example
summarize(total = first(name))
Create a new column for returning the first value for each group.
Summary
Return all rows and all columns from both the current data frame and the target data frame.
Syntax
full_join(<data_set>
, by = "<column>
", suffix = c("<source_suffix>
", "<target_suffix>
"))
full_join(<data_set>
, by = c("<source_column>
" = "<target_column>
"), suffix = c("<source_suffix>
", "<target_suffix>
"))
Arguments
Example
full_join(AIRPORT)
Join with AIRPORT data frame using all variables with common names.
full_join(AIRPORT, by = "CODE")
Join with AIRPORT data frame using CODE columns.
full_join(AIRPORT, by = c("DEST" = "CODE")
Join with AIRPORT data frame using DEST column from the original data frame and CODE from the target.
Summary
funs provides a flexible way to generate a named list of functions for input to other functions like summarise_each.
Syntax
funs(<function_name(s)>
)
Arguments
A list of functions specified by:
* Their name, "mean"
* The function itself, mean
* A call to the function with . as a dummy parameter, mean(., na.rm = TRUE)
Example
funs(mean, "mean", mean(., na.rm = TRUE))
# Override default names
funs(m1 = mean, m2 = "mean", m3 = mean(., na.rm = TRUE))
Summary
Converts the data frame into a grouped data frame where the following operations will be performed based on the groups. Grouping should be done before you aggregate values.
Syntax
group_by(<column(s)>
)
Example
group_by(COUNTRY)
Group the data by COUNTRY.
group_by(REGION, COUNTRY)
Group the data by REGION and COUNTRY.
group_by(FIRSTNAME + LASTNAME)
Group by a dynamically generated value of 'FIRSTNAME + LASTNAME'
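A minimal sketch of grouping followed by an aggregation (the 'revenue' column is illustrative):
group_by(COUNTRY)
summarize(total = sum(revenue, na.rm = TRUE))
Group the data by COUNTRY and then calculate the sum of revenue for each country.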
Summary
Return all rows from the current data frame where there are matching values in the target, and all columns from the current and the target.
Syntax
inner_join(<data_set>
, by = "<column>
", suffix = c("<source_suffix>
", "<target_suffix>
"))
inner_join(<data_set>
, by = c("<source_column>
" = "<target_column>
"), suffix = c("<source_suffix>
", "<target_suffix>
"))
Arguments
Example
inner_join(AIRPORT)
Join with AIRPORT data frame using all variables with common names.
inner_join(AIRPORT, by = "CODE")
Join with AIRPORT data frame using CODE columns.
inner_join(AIRPORT, by = c("DEST" = "CODE"))
Join with AIRPORT data frame using DEST column from the original data frame and CODE from the target.
Summary
Provides access to a row at a given physical offset prior to that position.
Syntax
lag(<column>
, n = <number>
, default = <text>
, order_by = <column(s)>
)
Arguments
Example
// x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
lag(x)
Returns NA 1 2 3 4 5 6 7 8 9
lag(x, 2)
Returns NA NA 1 2 3 4 5 6 7 8
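A sketch of a common use inside mutate (assuming a numeric 'revenue' column already sorted in the order you care about):
mutate(revenue_diff = revenue - lag(revenue))
Create a new column with the difference between each row's revenue and the previous row's revenue; the first row becomes NA.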
Summary
Returns the last value of all the values.
Syntax
last(<column>
, order_by = <column(s)>
, default = <text>
)
Arguments
Example
summarize(total = last(name))
Create a new column for returning the last value for each group.
Summary
Provides access to a row at a given physical offset that follows the current row.
Syntax
lead(<column>
, n = <number>
, default = <text>
, order_by = <column(s)>
)
Arguments
Example
// x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
lead(x)
Returns 2 3 4 5 6 7 8 9 10 NA
lead(x, 2)
Returns 3 4 5 6 7 8 9 10 NA NA
Summary
Return all rows from the current data frame, and all columns from the current and the target. Rows in the current with no match in the target will have NA values in the new columns. If there are multiple matches between the current and the target, all combinations of the matches are returned.
Syntax
left_join(<data_set>
, by = "<column>
", suffix = c("<source_suffix>
", "<target_suffix>
"))
left_join(<data_set>
, by = c("<source_column>
" = "<target_column>
"), suffix = c("<source_suffix>
", "<target_suffix>
"))
Arguments
Example
left_join(AIRPORT)
Join with AIRPORT data frame using all variables with common names.
left_join(AIRPORT, by = "CODE")
Join with AIRPORT data frame using CODE columns.
left_join(AIRPORT, by = c("DEST" = "CODE")
Join with AIRPORT data frame using DEST column from the original data frame and CODE from the target.
left_join(AIRPORT, by = c("DEST" = "CODE", "MONTH" = "MON")
Join with AIRPORT data frame using DEST and MONTH columns from the original data frame and CODE and MON from the target respectively.
Summary
Ranks. Ties get min rank. Missing values are left as is.
Syntax
min_rank(<column_num>
)
Return Value
Numeric
Example
mutate(population_rank = min_rank(population))
Create a new column to have the rank number based on 'population' values. The smaller the numbers are the higher the ranks are.
mutate(population_rank = min_rank(desc(population)))
Create a new column to have the rank number based on 'population' values. The bigger the numbers are the higher the ranks are.
filter(min_rank(population) < 10)
Select rows whose rank numbers are less than 10.
Summary
Mutate creates new column(s) with given expressions and preserves existing columns.
Syntax
mutate(<new_column>
= <expression>
, ...)
Example
mutate(profit = revenue - expense)
Create 'profit' column based on the calculation result of 'revenue - expense'.
Summary
Returns the count of the rows for each group.
Syntax
n()
Arguments
Return Value
Numeric
Example
summarize(total = n())
Create a new column for calculating the number of entries for each group.
Summary
Returns the count of unique values.
Syntax
n_distinct(<column(s)>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
summarize(total = n_distinct(TAIL_NUM))
Create a new column for calculating the count of unique values of TAIL_NUM for each group.
summarize(total = n_distinct(STATE, CITY))
Create a new column for calculating the count of unique combinations of STATE and CITY for each group.
Summary
Returns the nth value of all the values.
Syntax
nth(<column>
, n = <number>
, order_by = <column(s)>
, default = <text>
)
Arguments
Example
summarize(total = nth(name, 5))
Returns the fifth value for each group.
summarize(total = nth(name, -5))
Returns the fifth value from the end for each group.
Summary
Breaks the column values into n buckets. Missing values are left as is.
Syntax
ntile(<column>
, <number>
)
Arguments
Return Value
Numeric
Example
mutate(quartile = ntile(revenue, 4))
Breaks 'revenue' column values into 4 buckets to generate quartile.
Summary
Ranks rescaled to [0, 1]. Missing values are left as is.
Syntax
percent_rank(<column_num>
)
Return Value
Numeric
Example
mutate(population_rank = percent_rank(population))
Create a new column to have the percent rank number based on 'population' values.
filter(percent_rank(population) < .25)
Select rows whose percent rank numbers are less than .25.
Summary
Rename existing column names.
Syntax
rename(<new_column_name>
= <column>
, ...)
Example
rename(state = X1)
Rename X1 with 'state'
rename(state = X1, country = X2)
Rename X1 with 'state' and X2 with 'country'
Summary
Return all rows from the target data frame, and all columns from the current and the target.
Syntax
right_join(<data_set>
, by = "<column>
", suffix = c("<source_suffix>
", "<target_suffix>
"))
right_join(<data_set>
, by = c("<source_column>
" = "<target_column>
"), suffix = c("<source_suffix>
", "<target_suffix>
"))
Arguments
Example
right_join(AIRPORT)
Join with AIRPORT data frame using all variables with common names.
right_join(AIRPORT, by = "CODE")
Join with AIRPORT data frame using CODE columns.
right_join(AIRPORT, by = c("DEST" = "CODE"))
Join with AIRPORT data frame using DEST column from the original data frame and CODE from the target.
Summary
Returns row numbers. Equivalent to Rank.
Syntax
row_number()
Arguments
Return Value
Numeric
Example
mutate(rowid = row_number())
Returns a unique row number for each row.
Summary
Sample a fraction of rows from the data frame.
Syntax
sample_frac(<number>
, weight=<column_num>
, replace=<logical>
)
Arguments
Example
sample_frac(.2)
Select a randomly selected 20% of the data.
sample_frac(.2, weight = ARR_DELAY)
Select a randomly selected 20% of the data with a weight on the ARR_DELAY column.
Summary
Sample n rows from the data frame.
Syntax
sample_n(<number>
, weight=<column_num>
, replace=<logical>
)
Arguments
Example
sample_n(100)
Select 100 randomly selected rows.
sample_n(100, weight = ARR_DELAY)
Select 100 randomly selected rows from the data with a weight on ARR_DELAY column.
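A sketch of the replace argument:
sample_n(100, replace = TRUE)
Select 100 randomly selected rows sampled with replacement, so the same row can appear more than once.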
Summary
Select column(s) by column names.
Syntax
select(<column(s)>
,...)
select(-<column(s)>
,...)
select(starts_with(<text>
, ignore.case = <logical>
))
select(ends_with(<text>
, ignore.case = <logical>
))
select(contains(<text>
, ignore.case = <logical>
))
select(matches(<text>
, ignore.case = <logical>
))
select(num_range(<text>
, <start_num>:<end_num>
))
select(any_of(c(<text1>
, <text2>
, ...)))
select(<column>
, everything())
Arguments
Used Together
starts_with, ends_with, contains, matches, num_range, any_of, everything
Example
select(X, Y)
Select column X and Y.
select(-X,- Y)
Remove column X and Y.
select(starts_with("X", ignore.case = TRUE))
Select only columns whose name starts with "X"
select(ends_with("ABR", ignore.case = TRUE))
Select only columns whose name ends with "ABR"
select(contains("ABR", ignore.case = TRUE))
Select only columns whose name contains "ABR"
select(matches("[:digit:]+", ignore.case = TRUE))
Select only columns whose names contain digits (regular expression)
select(num_range("X", 1:6))
Selects all columns (numerically) from X1 to X6.
select(any_of("x", "y", "z”))
Selects columns provided inside the brackets.
select(X, everything())
Reorder columns: keep the column “X" in the front
select_if(is.numeric)
Select only numeric columns.
Summary
Return all rows from the current data frame where there are matching values in the target data frame, keeping just columns from the current.
Syntax
semi_join(<data_set>
, by = "<column>
")
semi_join(<data_set>
, by = c("<source_column>
" = "<target_column>
"))
Arguments
Example
semi_join(AIRPORT)
Join with AIRPORT data frame using all variables with common names.
semi_join(AIRPORT, by = "CODE")
Join with AIRPORT data frame using CODE columns.
semi_join(AIRPORT, by = c("DEST" = "CODE"))
Join with AIRPORT data frame using DEST column from the original data frame and CODE from the target.
Summary
Select rows by positions. This is not supported for relational databases, in which case you should use filter() with row_number() function.
Syntax
slice(<numeric>
)
slice(<numeric>:<numeric>
)
Arguments
Example
slice(1) : equivalent to -> filter(row_number() == 1)
Select only the first row.
slice(5:n()) : equivalent to -> filter(between(row_number(), 5, n()))
Select only rows starting from the 5th to the last.
slice(n()) : equivalent to -> filter(row_number() == n())
Select only the last row
slice((n() - 9):n())
Select only the last 10 rows.
Summary
Summarize column values to a single value by using aggregate functions. Make sure you have already grouped the data frame using group_by().
Syntax
summarize(<new_column>
= <aggregate_expression>
, ...)
Example
summarise(avg = mean(revenue))
Create a new column 'avg' to store average values of 'revenue' for each group.
summarize(avg = mean(revenue), total = sum(revenue))
Create new columns 'avg' for 'average of revenue' and 'total' for 'sum of revenue' for each group.
Summary
Summarize the data by either calling n() or sum() on top of the grouped data. group_by() operation should have been done before.
Syntax
tally(wt = <column>
, sort = <logical>
)
Arguments
Example
tally()
Returns the number of rows for each group.
tally(wt = ARR_DELAY)
Returns the sum total of 'ARR_DELAY' for each group.
tally(wt = ARR_DELAY, sort = TRUE)
Returns the sum total of 'ARR_DELAY' and sorted by the total number.
Summary
Select the top n entries based on a given measure in each group.
Syntax
top_n(<number>
, <column_num>
)
top_n(<number>
, desc(<column_num>
))
Arguments
Used Together
desc
Example
top_n(10, POPULATION)
Select the top 10 countries based on POPULATION.
top_n(-10, POPULATION)
Select the bottom 10 countries based on POPULATION. This is equivalent to the next command.
top_n(10, desc(POPULATION))
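A sketch of the per-group behavior (assuming the data has already been grouped, for example by REGION):
group_by(REGION)
top_n(3, POPULATION)
Select the top 3 rows by POPULATION within each REGION.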
Summary
Transmute adds new columns and drops existing columns.
Syntax
transmute(<new_column>
= <expression>
, ...)
Example
transmute(profit = revenue - expense)
Create 'profit' column based on the calculation result of 'revenue - expense', and drop 'revenue' and 'expense' columns.
Summary
Ungroup existing grouping
Syntax
ungroup()
Example
ungroup()
Ungroup the existing groups.
Summary
Rounds date/time up to the nearest integer value of the specified time unit.
Syntax
ceiling_date(<column_date>
, unit = <unit>
)
Arguments
Return Value
POSIXct
Example
ceiling_date(A, "second")
returns "2015-10-01 06:10:15 UTC"
ceiling_date(A, "minute")
returns "2015-10-01 06:11:00 UTC"
ceiling_date(A, "hour")
returns "2015-10-01 07:00:00 UTC"
ceiling_date(A, "day")
returns "2015-10-02 UTC"
ceiling_date(A, "week")
returns "2015-10-04 UTC"
ceiling_date(A, "month")
returns "2015-11-01 UTC"
ceiling_date(A, "quarter")
returns "2016-01-01 UTC"
ceiling_date(A, "year")
returns "2016-01-01 UTC"
Summary
Converts numeric data calculated as a fraction of the year to date data type.
Syntax
date_decimal(<column_num>
, tz = <timezone>
)
Arguments
Return Value
POSIXct
Example
date_decimal(2015.953)
Returns "2015-12-14 20:16:47 UTC"
Summary
Extract Day
Syntax
day(<column_date>
)
Return Value
Numeric
Example
day("2015-10-01 06:15:30")
returns 1
Summary
Convert a number to day periods so you can use it for further calculations.
Syntax
days(<column_num>
)
Return Value
Period
Example
days(1)
returns "1d 0H 0M 0S"
ymd_hms("2015-10-01 06:15:30") + days(1)
returns "2015-10-02 06:15:30 UTC"
Summary
Converts date data to numeric data type by calculating the date to a fraction of the year.
Syntax
decimal_date(<column_date>
)
Arguments
Return Value
Numeric
Example
decimal_date(ymd("2015-12-15"))
Returns 2015.953
Summary
Convert Character or Number to Date / Period when data contains only Date, but not Time.
Syntax
dmy(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
Date
Example
dmy("10-01-2015")
returns "2015-10-01 UTC"
dmy("10/01/2015")
returns "2015-10-01 UTC"
dmy("Created on 10 1 2015")
returns "2015-10-01 UTC"
Summary
Convert Character or Number to Date / Period when data contains Date and Time.
Syntax
dmy_h(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
POSIXct
Example
dmy_h("10-01-2015 06:10")
returns "2015-10-01 06:10:00 UTC"
dmy_h("Created on 10 1 2015, at 6:10AM")
returns "2015-10-01 06:10:00 UTC"
dmy_h("10-01-2015 06", truncated = 1)
returns "2015-10-01 06:00:00 UTC"
Summary
Convert Character or Number to Date / Period when data contains Date and Time.
Syntax
dmy_hm(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
POSIXct
Example
dmy_hm("01-10-2015 06:10")
returns "2015-10-01 06:10:00 UTC"
dmy_hm("Created on 1 10 2015, at 6:10AM")
returns "2015-10-01 06:10:00 UTC"
dmy_hm("01-10-2015 06", truncated = 1)
returns "2015-10-01 06:00:00 UTC"
Summary
Convert Character or Number to Date / Period when data contains Date and Time.
Syntax
dmy_hms(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
POSIXct
Example
dmy_hms("01-10-2015 06:10:15")
returns "2015-10-01 06:10:15 UTC"
dmy_hms("01/10/2015 06:10:15")
returns "2015-10-01 06:10:15 UTC"
dmy_hms("Created on 1 10 2015, at 06:10:15 AM")
returns "2015-10-01 06:10:15 UTC"
dmy_hms("01-10-2015 06")
returns NA, because there is not enough data for Date and Time parsing.
dmy_hms("01-10-2015 06", truncated = 2)
returns "2015-10-01 06:00:00 UTC" as expected
Summary
Create a duration
Syntax
duration(<column_num>
, units = <units>
)
Arguments
Return Value
Duration
Example
duration(1.5, "minutes")
returns 90s
duration(mins = 1.5)
returns 90s
Summary
Convert Character or Number to Date / Period when data contains only Date, but not Time.
Syntax
dym(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
Date
Example
dym("01-2015-10")
returns "2015-10-01 UTC"
dym("01/2015/10")
returns "2015-10-01 UTC"
dym("Created on 1 2015 10")
returns "2015-10-01 UTC"
Summary
Rounds date/time down to the nearest integer value of the specified time unit.
Syntax
floor_date(<column_date>
, unit = <unit>
)
Arguments
Return Value
POSIXct
Example
floor_date(A, "second")
returns "2015-10-01 06:10:15 UTC"
floor_date(A, "minute")
returns "2015-10-01 06:10:00 UTC"
floor_date(A, "hour")
returns "2015-10-01 06:00:00 UTC"
floor_date(A, "day")
returns "2015-10-01 UTC"
floor_date(A, "week")
returns "2015-09-27 UTC"
floor_date(A, "month")
returns "2015-10-01 UTC"
floor_date(A, "quarter")
returns "2015-10-01 UTC"
floor_date(A, "year")
returns "2015-01-01 UTC"
Summary
Re-registers the data with a given timezone.
Syntax
force_tz(<column_date>
, tzone = <timezone>
)
Arguments
Return Value
POSIXct
Example
as.POSIXct("2014-01-01 00:15:00", tz = "UTC")
Returns "2014-01-01 00:15:00 UTC".
force_tz(as.POSIXct("2014-01-01 00:15:00", tz = "America/Los_Angeles"))
Returns "2014-01-01 00:15:00 PST". Notice that the timezone is now "PST" instead of the original "UTC"
Summary
Get current time in your local timezone
Syntax
here()
Return Value
POSIXct
Example
here()
returns "2015-12-31 00:04:59 PST"
Summary
Convert text to Period data type if a given column holds values that look like Hours and Minutes
Syntax
hm(<column>
, tz = <timezone>
)
Arguments
Return Value
Period
Example
hm("06:10")
returns "6H 10M 0S"
hm("06:10", "UTC")
returns "6H 10M 0S"
Summary
Create hms data. It's a difftime represented in seconds.
Syntax
hms(seconds = <column_num>
, minutes = <column_num>
, hours = <column_num>
, days = <column_num>
)
Arguments
Return Value
hms
Example
hms(seconds = 20, minutes = 1, hours = 5)
returns 20 + 1 * 60 + 5 * 60 * 60 = 18080 seconds difftime
Summary
Extract hour
Syntax
hour(<column_date>
)
Return Value
Integer
Example
hour("2015-10-01 06:15:30")
returns 6
Summary
Convert a number to hour periods so you can use it for further calculations.
Syntax
hours(<column_num>
)
Return Value
Period
Example
hours(1)
returns "1H 0M 0S"
ymd_hms("2015-10-01 06:15:30") + hours(1)
returns "2015-10-01 07:15:30 UTC"
Summary
Return intervals between two dates. You can use this result to get the duration by either dividing by a given units such as days, weeks, etc, OR converting it to numeric data type with "as.numeric()".
Syntax
interval(<column_date>
, <column_date>
)
Return Value
Interval
Example
interval("2015-10-01", "2015-10-18")
returns 2015-10-01 UTC--2015-10-18 UTC
interval("2015-10-01", "2015-10-18") %/% days(1)
returns 17
as.numeric(interval("2015-10-01", "2015-10-18"), units = "days")
returns 17
interval("2015-10-01", "2015-10-18") %/% weeks(1)
returns 2
Summary
Returns true if a given object is Date type.
Syntax
is.Date
Arguments
Return Value
Logical
Example
mutate_if(is.Date, wday)
Extract day of the week information from Date columns.
mutate_if(is.Date, funs(week_day = wday))
Extract day of the week information from Date columns and create new columns.
mutate_if(is.Date, wday, label = TRUE)
Extract day of the week information using wday function with 'label = TRUE' argument from Date columns.
Summary
Returns true if a given object is difftime type.
Syntax
is.difftime
Arguments
Return Value
Logical
Example
mutate_if(is.difftime, as.numeric)
Convert difftime type columns to numeric type.
Summary
Returns true if a given object is duration type.
Syntax
is.duration
Arguments
Return Value
Logical
Example
mutate_if(is.duration, as.numeric)
Convert duration type columns to numeric type.
Summary
Returns true if a given object is POSIXct type.
Syntax
is.POSIXct
Arguments
Return Value
Logical
Example
mutate_if(is.POSIXct, wday)
Extract day of the week information from Date columns.
mutate_if(is.POSIXct, funs(week_day = wday))
Extract day of the week information from Date columns and create new columns.
mutate_if(is.POSIXct, wday, label = TRUE)
Extract day of the week information using wday function with 'label = TRUE' argument from Date columns.
Summary
Extract week numbers of the year. Weeks start from Monday.
Syntax
isoweek(<column_date>
)
Return Value
Numeric
Example
isoweek(ymd(c("2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04", "2020-01-05", "2020-01-06", "2020-01-07", "2020-01-08")))
returns 1,1,1,1,1,2,2,2 (2020-01-06 is Monday)
Summary
Extract Day of Month
Syntax
mday(<column_date>
)
Return Value
Numeric
Example
mday("2015-10-01 06:15:30")
returns 1
Summary
Convert Character or Number to Date / Period when data contains only Date, but not Time.
Syntax
mdy(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
Date
Example
mdy("01-10-2015")
returns "2015-10-01 UTC"
mdy("01/10/2015")
returns "2015-10-01 UTC"
mdy("Created on 1 10 2015")
returns "2015-10-01 UTC"
Summary
Convert Character or Number to Date / Period when data contains Date and Time.
Syntax
mdy_h(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
POSIXct
Example
mdy_h("10-01-2015 06")
returns "2015-10-01 06:00:00 UTC"
mdy_h("Created on 10 1 2015 6 AM")
returns "2015-10-01 06:00:00 UTC"
Summary
Convert Character or Number to Date / Period when data contains Date and Time.
Syntax
mdy_hm(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
POSIXct
Example
mdy_hm("10-01-2015T06:10")
returns "2015-10-01 06:10:00 UTC"
mdy_hm("Created on 10 1 2015, at 6:10 AM")
returns "2015-10-01 06:10:00 UTC"
Summary
Convert Character or Number to Date / Period when data contains Date and Time.
Syntax
mdy_hms(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
POSIXct
Example
mdy_hms("10-01-2015T06:10:15")
returns "2015-10-01 06:10:15 UTC"
mdy_hms("10/01/2015T06:10:15")
returns "2015-10-01 06:10:15 UTC"
mdy_hms("Created on 10 1 2015")
returns "2015-10-01 06:10:15 UTC"
mdy_hms("10-01-2015 06")
returns NA, because there is not enough data for Date and Time parsing.
mdy_hms("10-01-2015 06", truncated = 2)
returns "2015-10-01 06:00:00 UTC" as expected
Summary
Convert a number to milliseconds periods so you can use it for further calculations.
Syntax
milliseconds(<column_num>
)
Return Value
Period
Example
milliseconds(1)
returns "1S"
ymd_hms("2015-10-01 06:15:30") + milliseconds(1)
returns "2015-10-01 06:15:31 UTC"
Summary
Extract minute
Syntax
minute(<column_date>
)
Return Value
Integer
Example
minute("2015-10-01 06:15:30")
returns 15
Summary
Convert a number to minute periods so you can use it for further calculations.
Syntax
minutes(<column_num>
)
Return Value
Period
Example
minutes(1)
returns "1M 0S"
ymd_hms("2015-10-01 06:15:30") + minutes(1)
returns "2015-10-01 06:16:30 UTC"
Summary
Extract Month
Syntax
month(<column_date>
, label = <logical>
, abbr = <logical>
, locale = <locale>
)
Arguments
Return Value
Numeric or Text
Example
month("2015-10-01 06:15:30")
returns 10
month("2015-10-01 06:15:30", label = TRUE)
returns October
month("2015-10-01 06:15:30", label = TRUE, abbr = TRUE)
returns Oct
Summary
Convert text to Period data type if a given column holds values that look like Minutes and Seconds
Syntax
ms(<column>
, tz = <timezone>
)
Arguments
Return Value
Period
Example
ms("10:15")
returns "10M 15S"
ms("10:15", "UTC")
returns "10M 15S"
Summary
Convert Character or Number to Date / Period when data contains only Date, but not Time.
Syntax
myd(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
Date
Example
myd("01-2015-10")
returns "2015-01-10 UTC"
myd("01/2015/10")
returns "2015-01-10 UTC"
myd("Created on 1 2015 10")
returns "2015-01-10 UTC"
Summary
Returns current date and time. A similar function 'today()' returns only date.
Syntax
now(tzone = <timezone>
)
Arguments
Return Value
POSIXct
Example
now()
returns "2015-12-31 08:04:59 PST"
now("GMT")
returns "2015-12-31 08:04:59 PST"
now("America/Los_Angeles")
returns "2015-12-31 00:04:59 PST"
Summary
Convert data to Date/Time data type.
Syntax
parse_date_time(<column>
, orders = <date_time_order>
, locale = <locale>
, tz = <timezone>
, truncated = <integer>
, quiet = <logical>
, exact = <logical>
, lt = <logical>
)
Arguments
orders - Specify the order of parameters in the original data. The supported parameters include y, m, d, H, M, S, and p.
locale - The default is the session default locale. Set the date/time related configuration like time zone, encoding, decimal mark, and day/month names.
tz (Optional) - Specify which time zone to parse the date with. You can find a list of time zones.
truncated (Optional) - The default is 0. Number of parameters that can be missing.
quiet (Optional) - The default is FALSE. If TRUE, it suppresses the error when the format is wrong.
exact (Optional) - The default is FALSE. If TRUE, the orders parameter must be an explicit format (e.g. "%m/%d/%y %I:%M").
lt (Optional) - The default is FALSE. If FALSE, it returns POSIXct. If TRUE, it returns POSIXlt.
Return Value
POSIXct
Example
parse_date_time("03/15/05 1:30 PM", "mydHMp")
Returns "2015-03-05 13:30:00 PDT"
lubridate::parse_date_time("1:30 PM", "HMp")
Returns "0000-01-01 13:30:00 PDT. The default to each parameters are 0 and it always returns POSIXct"
Summary
Extract Day of Quarter
Syntax
qday(<column_date>
)
Return Value
Numeric
Example
qday("2015-10-01 06:15:30")
returns 1
Summary
Extract Quarter
Syntax
quarter(<column_date>
)
Return Value
Numeric
Example
quarter("2015-08-01 06:15:30")
returns 3
quarter("2015-06-16 07:42:29")
returns 2
Summary
Calculates the last day of the previous month or the first day of the current month based on a given date.
Syntax
rollback(<column_date>
, roll_to_first = <logical>
, preserve_hms = <logical>
)
Arguments
Return Value
POSIXct
Example
rollback("2015-03-03 06:15:30")
returns "2010-02-28 06:15:30 UTC"
rollback("2015-03-03 06:15:30", roll_to_first = TRUE)
returns "2010-03-01 06:15:30 UTC"
rollback("2015-03-03 06:15:30", preserve_hms = FALSE)
returns "2010-02-28 UTC"
Summary
Rounds date/time to the nearest integer value of the specified time unit.
Syntax
round_date(<column_date>
, unit = <unit>
)
Arguments
Return Value
POSIXct
Example
round_date("2015-10-01 06:10:15 UTC", "second")
returns "2015-10-01 06:10:15 UTC"
round_date("2015-10-01 06:10:15 UTC", "minute")
returns "2015-10-01 06:10:00 UTC"
round_date("2015-10-01 06:10:15 UTC", "hour")
returns "2015-10-01 06:00:00 UTC"
round_date("2015-10-01 06:10:15 UTC", "day")
returns "2015-10-01 UTC"
round_date("2015-10-01 06:10:15 UTC", "week")
returns "2015-10-04 UTC"
round_date("2015-10-01 06:10:15 UTC", "month")
returns "2015-10-01 UTC"
round_date("2015-10-01 06:10:15 UTC", "quarter")
returns "2015-10-01 UTC"
round_date("2015-10-01 06:10:15 UTC", "year")
returns "2015-01-01 UTC"
Summary
Extract second
Syntax
second(<column_date>
)
Return Value
Integer
Example
second("2015-10-01 06:15:30")
returns 30
Summary
Convert a number to second periods so you can use it for further calculations.
Syntax
seconds(<column_num>
)
Return Value
Period
Example
seconds(1)
returns "1S"
ymd_hms("2015-10-01 06:15:30") + seconds(1)
returns "2015-10-01 06:15:31 UTC"
Summary
Calculate the exact time length between two dates.
Syntax
time_length(<interval_or_duration>
, unit = <unit>
)
Arguments
Return Value
Numeric
Example
time_length(interval("2015-10-01", "2015-10-18"), "day")
returns 17
time_length(interval("2015-10-01", "2015-10-18"), "month")
returns 0.5483871
trunc(time_length(interval("2015-10-01", "2015-10-18"), "month"))
returns 0
round(time_length(interval("2015-10-01", "2015-10-18"), "month"))
returns 1
Summary
Returns current date without time. A similar function 'now()' returns date including time.
Syntax
today(tzone = <timezone>
)
Arguments
Return Value
Date
Example
today("GMT")
returns "2015-12-31"
today("America/Los_Angeles")
returns "2015-12-31"
Summary
Extract day of the week as a decimal number (01-07, Sunday is 1) or as Text (e.g. Sunday). The Text can be either full (e.g. Sunday) or abbreviated (e.g. Sun).
Syntax
wday(<column_date>
, label = <logical>
, abbr = <logical>
, week_start = <number>
, locale = <locale>
)
Arguments
Return Value
Numeric or Text
Example
wday("2015-10-01 06:15:30")
returns 5
// A <- ymd_hms("2015-10-01 06:15:30")
wday(A, label = TRUE)
returns Thurs
wday(A, label = TRUE, abbr=FALSE)
returns Thursday
Summary
Extract week numbers of the year. The 1st day of the 1st week always starts from January 1st regardless of the day of the week.
Syntax
week(<column_date>
)
Return Value
Numeric
Example
week(ymd(c("2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04", "2020-01-05", "2020-01-06", "2020-01-07", "2020-01-08")))
returns 1,1,1,1,1,1,1,2
Summary
Convert a number to week periods so you can use it for further calculations.
Syntax
weeks(<column_num>
)
Return Value
Period
Example
weeks(1)
returns "7d 0H 0M 0S"
ymd_hms("2015-10-01 06:15:30") + weeks(1)
returns "2015-10-08 06:15:30 UTC"
Summary
Returns Date / Time in a given time zone
Syntax
with_tz(<column_date>
, tzone = <timezone>
)
Arguments
Return Value
POSIXct
Example
with_tz(as.POSIXct("2014-01-01 00:15:00", tz = "UTC"), tzone = "America/Los_Angeles")
Returns "2013-12-31 16:15:00 PST". The original data is registered in UTC, and with_tz() function is returning it in Pacific Timezone.
with_tz(as.POSIXct("2014-01-01 00:15:00", tz = "UTC"))
Returns "2013-12-31 16:15:00 PST". with_tz() function returns the value in the default timezone (e.g. "America/Los_Angeles").
Summary
Extract Day of Year
Syntax
yday(<column_date>
)
Return Value
Numeric
Example
yday("2015-10-01 06:15:30")
returns 274
Summary
Convert Character or Number to Date / Period when data contains only Date, but not Time.
Syntax
ydm(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
Date
Example
ydm("2015-10-01")
returns ""2015-01-10 UTC""
ydm("2015/10/01")
returns ""2015-01-10 UTC""
ydm("Created on 2015 10 1")
returns ""2015-01-10 UTC""
Summary
Convert Character or Number to Date / Period when data contains Date and Time.
Syntax
ydm_h(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
POSIXct
Example
ydm_h("2015-01-10T06")
returns "2015-10-01 06:00:00 UTC"
ydm_h("2015/01/10 06")
returns "2015-10-01 06:00:00 UTC"
ydm_h("Created on 15-01-10 at 6 AM")
returns "2015-10-01 06:00:00 UTC"
Summary
Convert Character or Number to Date / Period when data contains Date and Time.
Syntax
ydm_hm(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
POSIXct
Example
ydm_hm("2015-01-10 06:10")
returns "2015-10-01 06:10:00 UTC"
ydm_hm("2015/01/10 06:00 UTC")
returns "2015-10-01 06:10:15 UTC"
ydm_hm("Created on 15-01-10 at 06:10 AM")
returns "2015-10-01 06:10:00 UTC"
Summary
Convert Character or Number to Date / Period when data contains Date and Time.
Syntax
ydm_hms(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
POSIXct
Example
ydm_hms("2015-01-10T06:10:15")
returns "2015-10-01 06:10:15 UTC"
ydm_hms("2015/01/10 06:10:15 UTC")
returns "2015-10-01 06:10:15 UTC"
ydm_hms("Created on 15-01-10 at 06:10:15 AM")
returns "2015-10-01 06:10:15 UTC"
ydm_hms("2015-01-10 06")
returns NA, because there is not enough data for Date and Time parsing.
ydm_hms("2015-01-10 06", truncated = 2)
returns "2015-10-01 06:00:00 UTC" as expected
Summary
Extract Year
Syntax
year(<column_date>
)
Return Value
Numeric
Example
year("2015-10-01 06:15:30")
returns 2015
Summary
Convert a number to year periods so you can use it for further calculations.
Syntax
years(<column_num>
)
Return Value
Period
Example
ymd_hms("2015-10-01 06:15:30") + years(1)
returns "2016-10-01 06:15:30 UTC"
Summary
Convert Character or Number to Date / Period when data contains only Date, but not Time.
Syntax
ymd(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
Date
Example
ymd("2015-10-01")
returns "2015-10-01 UTC"
ymd("2015/10/01")
returns "2015-10-01 UTC"
ymd("Created on 2015 10 1")
returns "2015-10-01 UTC"
Summary
Convert Character or Number to Date / Period when data contains Date and Time.
Syntax
ymd_h(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
POSIXct
Example
ymd_h("2015-10-01T06")
returns "2015-10-01 06:00:00 UTC"
ymd_h("2015/10/01 06")
returns "2015-10-01 06:00:00 UTC"
ymd_h("Created on 15-10-01 at 6 AM")
returns "2015-10-01 06:00:00 UTC"
Summary
Convert Character or Number to Date / Period when data contains Date and Time.
Syntax
ymd_hm(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
POSIXct
Example
ymd_hm("2015-10-01 06:10")
returns "2015-10-01 06:10:00 UTC"
ymd_hm("2015/10/01 06:00 UTC")
returns "2015-10-01 06:10:15 UTC"
ymd_hm("Created on 15-10-01 at 06:10 AM")
returns "2015-10-01 06:10:00 UTC"
Summary
Convert Character or Number to Date / Period when data contains Date and Time.
Syntax
ymd_hms(<column>
, tz = <timezone>
, locale = <locale>
, truncated = <number>
)
Arguments
Return Value
POSIXct
Example
ymd_hms("2015-10-01T06:10:15")
returns "2015-10-01 06:10:15 UTC"
ymd_hms("2015/10/01 06:10:15 UTC")
returns "2015-10-01 06:10:15 UTC"
ymd_hms("Created on 15-10-01 at 06:10:15 AM")
returns "2015-10-01 06:10:15 UTC"
ymd_hms("2015-10-01 06")
returns NA, because there is not enough data for Date and Time parsing.
ymd_hms("2015-10-01 06", truncated = 2)
returns "2015-10-01 06:00:00 UTC" as expected
Summary
Computes the correlation of two column values. The covariance divided by the product of the standard deviations of the two column values.
Syntax
cor(<column_num>
, <column_num>
, use = <cor_na_operation>
, method = <cor_method>
)
Arguments
Example
summarize(correlation = cor(ARR_DELAY, DEP_DELAY))
Return the correlation value for ARR_DELAY and DEP_DELAY columns.
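A sketch of the use argument (assuming the two columns may contain NA values); "pairwise.complete.obs" is one of the values accepted by the underlying cor function.
summarize(correlation = cor(ARR_DELAY, DEP_DELAY, use = "pairwise.complete.obs"))
Return the correlation computed only from rows where both ARR_DELAY and DEP_DELAY are present.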
Summary
Computes the covariance of two column values.
Syntax
cov(<column_num>
, <column_num>
, use = <cor_na_operation>
, method = <cor_method>
)
Arguments
Example
summarize(covariance = cov(ARR_DELAY, DEP_DELAY))
Return the covariance value for ARR_DELAY and DEP_DELAY columns.
Summary
Computes the interquartile range of the values.
Syntax
IQR(<column_num_logic_date>
, na.rm = <logical>
, type = 1|2|3|4|5|6|7|8|9
)
Arguments
Return Value
Numeric
Example
// x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
IQR(x)
Returns 4.5
Summary
Returns the median absolute deviation of the values.
Syntax
mad(<column_num_logic_date>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
summarize(mad = mad(revenue))
Create a new column for calculating the median absolute deviation of revenue for each group.
Summary
Returns the numerical median value.
Syntax
median(<column_num_logic_date>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
summarize(revenue_median = median(revenue))
Create a new column for calculating the median value of revenue for each group.
Summary
Returns data augmented with fitted values and residuals.
Syntax
predict(<column_model>
, data = <column_data_set>
, newdata = <data_set>
, type.predict = "link"|"response"
)
Arguments
Example
predict(model) - assume that 'model' column contains a fitted model of 'lm', 'glm' or 'kmeans'. If it's lm, the output is like below.
.rownames | ARR_DELAY | DEP_DELAY | .fitted | .se.fit | .resid | .hat | .sigma | .cooksd | .std.resid |
---|---|---|---|---|---|---|---|---|---|
1 | -15 | -9 | -13.934 | 0.0247 | -1.066 | 0.0000035503 | 13.0886 | 7.8503e-9 | -0.0814 |
2 | -28 | -19 | -22.4116 | 0.035 | -5.5884 | 0.0000071333 | 13.0886 | 4.3348e-7 | -0.427 |
Summary
Returns sample quantiles corresponding to the given probabilities. The smallest observation corresponds to a probability of 0 and the largest to a probability of 1.
Syntax
quantile(<column_num_logic_date>
, probs = <percent_number>
, na.rm = <logical>
, [type = 1|2|3|4|5|6|7|8|9
])
Arguments
Return Value
Numeric
Example
//x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
quantile(x, 0.25)
Returns 3.25
quantile(x, .5)
Returns 5.5
Summary
Returns the standard deviation of the values.
Syntax
sd(<column_num_logic_date>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
summarize(sd = sd(revenue))
Create a new column for calculating the standard deviation of revenue for each group.
Summary
Returns the variance of the values.
Syntax
var(<column_num_logic_date>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
summarize(variance = var(revenue))
Create a new column for calculating the variance of revenue for each group.
Summary
Concatenates multiple text from multiple columns.
Syntax
str_c(<column(s)>
, sep = <text>
, collapse = <text>
)
Arguments
Return Value
Character
Example
str_c(CARRIER, TAIL_NUM)
Concatenate CARRIER and TAIL_NUM values.
str_c(CARRIER, TAIL_NUM, sep = "-")
Use dash as a separator to concatenate CARRIER and TAIL_NUM values.
summarize(combined_text = str_c(CARRIER, collapse = ","))
Use comma ',' as a separator to combine all the CARRIER names into one text for each group.
Summary
Convert text to a specified encoding. e.g. UTF-8
Syntax
str_conv(<column_text>
, <encoding>
)
Arguments
Return Value
Character
Example
str_conv("Exploratory", "UTF-8")
Summary
Return number of letters or words in text data. You can also set a pattern to count only the matched pattern or text.
Syntax
str_count(<column_text>
, pattern = [<regular_expression>]
)
str_count(<column_text>
, pattern = [boundary("word")]
)
Arguments
Return Value
Numeric
Example
str_count("I'm traveling to New York on 1/1.")
returns 33 including punctuations and space
str_count("I'm traveling to New York on 1/1.", "New")
returns 1 by matching text 'New'
str_count("I'm traveling to New York on 1/1.", "[:alpha:]")
returns 22 counting only alphabet letters
str_count("I'm traveling to New York on 1/1.", "[:digit:]")
returns 2 counting only digit letters
str_count("I'm traveling to New York.", boundary("word"))
returns 5 counting as words
str_count("I'm traveling to New York on 1/1.", ".")
returns 33 because dot matches anything
str_count("I'm traveling to New York on 1/1.", fixed("."))
returns 1 because dot is searching dot as letter with 'fixed()'
Summary
Return TRUE or FALSE based on whether Text data contains a given text or not
Syntax
str_detect(<column_text>
, <pattern>
)
Arguments
Return Value
Logical
Example
str_detect("New York", "New")
returns TRUE because there is 'New' in the data.
str_detect("New York on 1/1.", "[:alpha:]")
returns TRUE because there are alphabet letters in the data.
str_detect("New York on 1/1.", "[:digit:]")
returns TRUE because there are numeric (digit) letters in the data.
str_detect("New York on 1/1.", "^New")
returns TRUE because there is "New" text at the beginning of the data.
str_detect("New York on 1/1.", "New$")
returns FALSE because there is no "New" text at the end of the data.
str_detect("New York on 1/1.", "N.w")
returns TRUE because 'New' starts with 'N' and ends with 'w'
str_detect("New York on 1/1.", fixed("N.w"))
returns FALSE because 'N.w' exact matching letters don't exist
str_detect("New York on 1/1.", c("New", "Old"))
returns TRUE because a given text matches with one of matching texts
Summary
Repeat text values.
Syntax
str_dup(<column_text>
, times = <number>
)
Arguments
Return Value
Character
Example
str_dup("wow", 2)
returns "wowwow".
Summary
Extract the first text that matches given letters or patterns
Syntax
str_extract(<column_text>
, <pattern>
)
Arguments
Return Value
Character
Example
str_extract("ABCabc123", "[A-Za-z]+")
returns "ABCabc" extracting only alphabet letters.
str_extract("ABCabc123", "[:alpha:]+")
returns "ABCabc" extracting only alphabet letters.
str_extract("ABCabc123", "[0-9]+")
returns "123" extracting only numbers
str_extract("ABCabc123", "[:digit:]+")
returns "123" extracting only numbers
Summary
Extract all the texts that match given letters or patterns
Syntax
str_extract_all(<column_text>
, <pattern>
)
Arguments
Return Value
List
Example
str_extract_all("Hello world", "[:alpha:]+")
Extracts all matching texts and returns c("Hello", "world").
str_extract("Hello world", "[:alpha:]+")
Extracts the first matching text and returns "Hello".
Summary
Return number of letters in text data.
Syntax
str_length(<column_text>
)
Return Value
Numeric
Example
str_length("I'm traveling to New York.")
Returns 26 including punctuations and space
Summary
Add white spaces (or a given padding character) to text values to make the total length exactly a specified number.
Syntax
str_pad(<column_text>
, width = <number>
, side = "left"|"right"|"both"
, pad = <text>
)
Arguments
Return Value
Character
Example
str_pad("Exploratory", 3)
returns "Exploratory" because 3 is less than number of the original given text.
str_pad("Exploratory", 30, "left")
returns " Exploratory" to make the length of the text to be 30 by adding white spaces.
Summary
Replace the first occurrence of letters that match given letters or regular expressions
Syntax
str_replace(<column_text>
, <pattern>
, <replacement>
)
Arguments
Return Value
Character
Example
str_replace("I am traveling to New York!!", "York", "Jersey")
returns "I am traveling to New Jersey"
str_replace("I am traveling to New York!!", "[:punct:]", "")
returns "I am traveling to New York!" replacing only the first matching punctuation. You want to use str_replace_all() to replace them all.
str_replace("New York New", "^New", "Old")
returns "Old York New" by finding "New" at the beginning and replace it with "Old"
str_replace("New York New.", "New$", "Old")
returns "New York Old" by finding "New" at the end and replace it with "Old"
str_replace("New York", str_sub("New York", 4, 8), "")
returns "New".
Summary
Replace all occurrences of letters that match given letters or regular expressions
Syntax
str_replace_all(<column_text>
, <pattern>
, <replacement>
)
Arguments
Return Value
Character
Example
str_replace_all("I am traveling to New York!!", "[:punct:]", "")
returns "I am traveling to New York"
Summary
Replace NA values to a specific Text
Syntax
str_replace_na(<column_text>
, <replacement>
)
Arguments
Return Value
Character
Example
str_replace_na(ARR_DELAY, "Not Available")
Returns "Not Available" for NA in ARR_DELAY column.
Summary
Split a given text into multiple texts by a given separator. It returns a list, so you will want to unnest the newly created column with the unnest() function.
Syntax
str_split(<column_text>
, pattern = <pattern>
, n = <numeric>
)
Arguments
Return Value
List
Example
str_split("a, b, c", ",")
returns "a" " b" " c" in a list format.
str_split("a, b, c", ",", 2)
returns "a" and " b, c" in a list format.
mutate(y = str_split(text_a, ",")) %>% unnest()
First split a given text column, this will create a list column. Then, unnest() function will break the list by giving each element of the list its own row.
unnest(y = str_split(text_a, ","))
This is equivalent to the above command.
Summary
Extract letters from Text values based on the position
Syntax
str_sub(<column_text>
, <start>
, <end>
)
Arguments
Return Value
Character
Example
str_sub("Kevin Spacey", start = 7, end = 12)
returns "Spacey". Starts from 7 to 12.
str_sub("Kevin Spacey", start = 7)
returns "Spacey". Starts from 7 to the end.
str_sub("Kevin Spacey", start = -6)
returns "Spacey", the last 6 letters.
Summary
Convert text to lowercase
Syntax
str_to_lower(<column_text>
, locale = <locale>
)
Arguments
Return Value
Character
Example
str_to_lower("CALIFORNIA")
Returns "california"
Summary
Convert text to Title-case
Syntax
str_to_title(<column_text>
, locale = <locale>
)
Arguments
Return Value
Character
Example
str_to_title("CALIFORNIA")
Returns "California"
Summary
Convert text to UPPERCASE
Syntax
str_to_upper(<column_text>
, locale = <locale>
)
Arguments
Return Value
Character
Example
str_to_upper("california")
Returns "CALIFORNIA"
str_to_upper("i", "en")
Setting English locale would return 'i'
str_to_upper("i", "tr")
Setting Turkish locale would return dotted 'I' in capital
Summary
Trim (or remove) white spaces in Text values
Syntax
str_trim(<column_text>
, side = "both"|"left"|"right"
)
Arguments
Return Value
Character
Example
str_trim(" String with trailing and leading white space\t")
returns "String with trailing and leading white space".
str_trim("\n\nString with trailing and leading white space\n\n")
returns "String with trailing and leading white space".
Summary
Extract First / Last Nth Word in Text data
Syntax
word(<column_text>
, start = <number>
, end = <number>
, sep = <text>
)
Arguments
Return Value
Character
Example
word("I am traveling to New York", 1)
returns "I".
word("I am traveling to New York", 1, 3)
returns "I am traveling"
word("I am traveling to New York", 2)
returns "am".
word("I am traveling to New York", -1)
returns "York"
word("I am traveling to New York", 2, -1)
returns "am traveling to New York"
word("123-456-789", 1, sep = "-")
returns "123".
word("123.456.789", 1, sep = ".")
does not return "123" as expected, because the dot is treated as a regular expression that matches any character. You want to use fixed() to make the dot literally mean a dot.
word("123.4568.789", 1, sep = fixed("."))
returns "123".
Summary
Complete a data frame by adding missing combinations of data.
Syntax
complete(<column(s)>
, fill = <list>
)
Arguments
Used Together
full_seq, nesting
Example
Sample data:
year | product | type | sales |
---|---|---|---|
2013 | iPhone | 5 | 100 |
2013 | iPad | mini | 50 |
2015 | iPhone | 6 | 200 |
complete(year, product)
Returns all the combinations of year and product values.
year | product | type | sales |
---|---|---|---|
2013 | iPad | mini | 50 |
2013 | iPhone | 5 | 100 |
2015 | iPad | NA | NA |
2015 | iPhone | 6 | 200 |
complete(year, product, fill = list(sales = 0))
Returns all the combinations of year and product values and fill with 0 for NA for sales column.
year | product | type | sales |
---|---|---|---|
2013 | iPad | mini | 50 |
2013 | iPhone | 5 | 100 |
2015 | iPad | NA | 0 |
2015 | iPhone | 6 | 200 |
complete(year = full_seq(year, period = 1), product)
Returns all the combinations of year and product values plus fill missing years (2014).
year | product | type | sales |
---|---|---|---|
2013 | iPad | mini | 50 |
2013 | iPhone | 5 | 100 |
2014 | iPad | NA | NA |
2014 | iPhone | NA | NA |
2015 | iPad | NA | NA |
2015 | iPhone | 6 | 200 |
complete(year, product, type)
Returns all the combinations of year, product, and type values.
year | product | type | sales |
---|---|---|---|
2013 | iPad | 5 | NA |
2013 | iPad | 6 | NA |
2013 | iPad | mini | 50 |
2013 | iPhone | 5 | 100 |
2013 | iPhone | 6 | NA |
2013 | iPhone | mini | NA |
2015 | iPad | 5 | NA |
2015 | iPad | 6 | NA |
2015 | iPad | mini | NA |
2015 | iPhone | 5 | NA |
2015 | iPhone | 6 | 200 |
2015 | iPhone | mini | NA |
complete(year, nesting(product, type))
By using nesting() function you can set the combination of values to be the ones that really exist in data for some columns.
year | product | type | sales |
---|---|---|---|
2013 | iPad | mini | 50 |
2013 | iPhone | 5 | 100 |
2013 | iPhone | 6 | NA |
2015 | iPad | mini | NA |
2015 | iPhone | 5 | NA |
2015 | iPhone | 6 | 200 |
Summary
Expand a data frame by adding all the combination of given columns.
Syntax
expand(<column(s)>
)
Arguments
Used Together
full_seq, nesting
Example
Sample data:
year | product | type | sales |
---|---|---|---|
2013 | iPhone | 5 | 100 |
2013 | iPad | mini | 50 |
2015 | iPhone | 6 | 200 |
expand(year, product)
Returns all the combinations of year and product values.
year | product |
---|---|
2013 | iPad |
2013 | iPhone |
2015 | iPad |
2015 | iPhone |
expand(year = full_seq(year, period = 1), product)
Returns all the combinations of year and product values plus fill missing years (2014).
year | product |
---|---|
2013 | iPad |
2013 | iPhone |
2014 | iPad |
2014 | iPhone |
2015 | iPad |
2015 | iPhone |
expand(year, product, type)
Returns all the combinations of year, product, and type values.
year | product | type |
---|---|---|
2013 | iPad | 5 |
2013 | iPad | 6 |
2013 | iPad | mini |
2013 | iPhone | 5 |
2013 | iPhone | 6 |
2013 | iPhone | mini |
2015 | iPad | 5 |
2015 | iPad | 6 |
2015 | iPad | mini |
2015 | iPhone | 5 |
2015 | iPhone | 6 |
2015 | iPhone | mini |
expand(year, nesting(product, type))
By using nesting() function you can set a certain combination of column values to be fixed and have them nested inside other columns.
year | product | type |
---|---|---|
2013 | iPad | mini |
2013 | iPhone | 5 |
2013 | iPhone | 6 |
2015 | iPad | mini |
2015 | iPhone | 5 |
2015 | iPhone | 6 |
Summary
Extract only the number from a given text. This is useful for strings that are supposed to be numbers with extra formatting (e.g. $1,200.34, -12%, X1, etc.).
Syntax
extract_numeric(<column>
)
Return Value
Numeric
Example
extract_numeric("$56,500.00")
Returns 56500
extract_numeric("-45%")
Returns -45
mutate(price_num = extract_numeric(price))
Create a new column that has only the numbers extracted from the original column (the 'price' and 'price_num' column names here are just for illustration).
Summary
Fills missing values using the previous entry. This is especially convenient when values are omitted in some cells to avoid showing duplicated entries.
Syntax
fill(<column(s)>
, .direction = <fill_direction>
)
Arguments
Example
fill(POPULATION)
Fills using the previous appearing value for POPULATION column.
fill(POPULATION, .direction = "up")
Fills using the next appearing value for POPULATION column.
fill(POPULATION, PRODUCTION)
Fills using the previous appearing value for POPULATION and PRODUCTION columns.
fill(-UNEMPLOYMENT)
Fills using the previous appearing value for all the columns other than UNEMPLOYMENT column.
Summary
Generate a sequence of numbers based on the values in a given column.
Syntax
full_seq(<column_num>
, period = <number>
)
Arguments
Used Together
complete, expand,
Example
Sample data:
year | product | type | sales |
---|---|---|---|
2013 | iPhone | 5 | 100 |
2013 | iPad | mini | 50 |
2015 | iPhone | 6 | 200 |
expand(year = full_seq(year, period = 1), product)
Returns all the combinations of year and product values plus fill missing years (2014).
year | product |
---|---|
2013 | iPad |
2013 | iPhone |
2014 | iPad |
2014 | iPhone |
2015 | iPad |
2015 | iPhone |
Without full_seq() function.
expand(year, product)
Returns all the combinations of year and product values.
year | product |
---|---|
2013 | iPad |
2013 | iPhone |
2015 | iPad |
2015 | iPhone |
Summary
Takes multiple columns and collapses into key-value pairs, duplicating all other columns as needed.
Syntax
gather(<new_key_column>
, <new_value_column>
, <column(s)>
, na.rm = <logical>
, convert = <logical>
)
Arguments
Example
gather(COUNTRY, POPULATION, 2:50, na.rm = FALSE)
Takes the 2nd column to the 50th column into two new columns, COUNTRY and POPULATION.
gather(COUNTRY, POPULATION, -YEAR)
Takes all the columns other than YEAR column into two new columns, COUNTRY and POPULATION.
gather(COUNTRY, POPULATION, starts_with("country"))
Takes the columns whose names start with "country" into two new columns, COUNTRY and POPULATION.
Summary
Nest a set of columns together as a list column.
Syntax
nest(<column(s)>
, .key = <text>
)
Arguments
Example
Sample Data
year | product | type | sales |
---|---|---|---|
2013 | iPhone | 5 | 100 |
2013 | iPad | mini | 50 |
2015 | iPhone | 6 | 200 |
nest(product, type, sales, .key = a)
This will create a new column that is a list of data frames and each row of the data frame holds 'product', 'type', 'sales' values.
year | a |
---|---|
2013 | list() |
2015 | list() |
Summary
Fix multiple columns so that only the combination of actually existing values in those columns will be used for complete() or expand() operation.
Syntax
nesting(<column(s)>
)
Arguments
Used Together
complete, expand,
Example
Sample data:
year | product | type | sales |
---|---|---|---|
2013 | iPhone | 5 | 100 |
2013 | iPad | mini | 50 |
2015 | iPhone | 6 | 200 |
expand(year, nesting(product, type))
By using nesting() function you can set a certain combination of column values to be fixed and have them nested inside other columns.
year | product | type |
---|---|---|
2013 | iPad | mini |
2013 | iPhone | 5 |
2013 | iPhone | 6 |
2015 | iPad | mini |
2015 | iPhone | 5 |
2015 | iPhone | 6 |
Without nesting():
expand(year, product, type)
Returns all the combinations of year, product, and type values.
year | product | type |
---|---|---|
2013 | iPad | 5 |
2013 | iPad | 6 |
2013 | iPad | mini |
2013 | iPhone | 5 |
2013 | iPhone | 6 |
2013 | iPhone | mini |
2015 | iPad | 5 |
2015 | iPad | 6 |
2015 | iPad | mini |
2015 | iPhone | 5 |
2015 | iPhone | 6 |
2015 | iPhone | mini |
Summary
Separates a column with delimited values into multiple columns.
Syntax
separate(<column_text_date>
, into = c("<new_column_name(s)>
"), sep = <text>
, remove=<logical>
, convert=<logical>
, extra = "warn"|"drop"|"merge"
, fill = "warn"|"right"|"left"
)
Arguments
Example
separate(NAME, into = c("FIRSTNAME", "LASTNAME"), extra="drop")
Separates NAME column into FIRSTNAME and LASTNAME columns by non-alphanumeric values. It will drop the rest if any.
separate(NAME, into = c("FIRSTNAME", "LASTNAME"), remove = FALSE)
Separates NAME column into FIRSTNAME and LASTNAME columns by non-alphanumeric values. It will keep NAME column.
separate(NAME, into = c("FIRSTNAME", "LASTNAME"), extra="merge")
Separates NAME column into FIRSTNAME and LASTNAME columns by non-alphanumeric values. It will merge the rest into the last column.
separate(NAME, into = c("FIRSTNAME", "LASTNAME"), sep = "-")
Separates NAME column into FIRSTNAME and LASTNAME columns by dash "-".
separate(NAME, into = c("FIRSTNAME", "LASTNAME"), sep = 2)
Separates NAME column into FIRSTNAME and LASTNAME by a position of 2.
separate(DATE, into = c("Y", "M", "D"), convert = TRUE)
Separate DATE into Year, Month, and Day columns and set the data type for each column based on the heuristics.
Summary
Spread a key-value pair across multiple columns. Missing values will be replaced with NA.
Syntax
spread(<key_column>
, <value_column>
, fill = <value>
, convert = <logical>
, drop = <logical>
, sep=<text>
)
Arguments
Example
spread(COUNTRY, POPULATION)
Spread a COUNTRY-POPULATION pair across multiple columns of COUNTRY.
spread(COUNTRY, POPULATION, fill = "NA")
Spread a COUNTRY-POPULATION pair across multiple columns of COUNTRY, and fill with "NA" text for missing values.
Summary
Unite multiple columns together into one column with given uniting characters. It will concatenate values by “_" and remove the original column by default. Multiple columns specified in the list can be combined together at once.
Syntax
unite(<new_column>
, <column(s)>
, sep = <text>
, remove = <logical>
)
Arguments
Example
unite(NAME, FIRSTNAME, LASTNAME)
Unite FIRSTNAME and LASTNAME into a new column called NAME using _ as a uniting character.
unite(NAME, FIRSTNAME, LASTNAME, sep = "-")
Unite FIRSTNAME and LASTNAME into a new column called NAME using "-" (dash) as a separator.
Summary
Unnest a list column or a list of data frames by making each element of the list to be presented in its own row.
Syntax
unnest(<column_list>
, .drop = <logical>
, .id = <text>
, .sep = <text>
)
Arguments
Example
Sample Data with a list column
issue_id | assignee |
---|---|
720 | list(Kan, Kei) |
721 | list(Kan, Marie, Simon) |
unnest() or unnest(assignee)
Unnest 'assignee' column which is a list. This will give each value of the list its own row.
issue_id | assignee |
---|---|
720 | Kan |
720 | Kei |
721 | Kan |
721 | Marie |
721 | Simon |
Sample Data with a list of Data Frames:
issue_id | assignee |
---|---|
720 | list(name = Kan, id = 105, country = US ) |
721 | list(name = Kei, id = 106, country = Japan ) |
unnest()
Unnest 'assignee' column which is a list of data frames. This will create columns for each column of the nested data frame.
issue_id | name | id | country |
---|---|---|---|
720 | Kan | 105 | US |
721 | Kei | 106 | Japan |
unnest(.sep="-")
With the .sep argument, the resulting column names are prefixed with the original list column name, connected by the .sep value.
issue_id | assignee-name | assignee-id | assignee-country |
---|---|---|---|
720 | Kan | 105 | US |
721 | Kei | 106 | Japan |
Sample Data with two list columns
issue_id | assignee | category |
---|---|---|
720 | list(Kan, Kei) | list(Critical, Security) |
721 | list(Kan, Marie, Simon) | list(Critical, Minor Issue) |
unnest(assignee, .drop = FALSE)
Unnest 'assignee' column which is a list. This will give each value of the list its own row.
issue_id | assignee | category |
---|---|---|
720 | Kan | list(Critical, Security) |
720 | Kei | list(Critical, Security) |
721 | Kan | list(Critical, Minor Issue) |
721 | Marie | list(Critical, Minor Issue) |
721 | Simon | list(Critical, Minor Issue) |
Summary
Removes parameters from a given URL.
Syntax
param_remove(<column_text>
, keys = <text>
)
Arguments
Return Value
Character
Example
param_remove(urls = "https://github.com/hideaki/tam/issues?q=is%3Aopen&ticket=open", keys = c("q","ticket"))
Returns "https://github.com/hideaki/tam/issues"
Summary
Decodes a given encoded URL. It assumes the character encoding is UTF-8.
Syntax
url_decode(<column_text>
)
Arguments * column - column that you want to decode.
Return Value
Character
Example
url_decode("http://exploratory.io%3fname%3dflight%20data%20analysis%20%28V1%29")
Returns "http://exploratory.io?name=flight data analysis (V1)"
Summary
Encodes a given URL. It assumes the character encoding is UTF-8.
Syntax
url_encode(<column_text>
)
Arguments * column - column that you want to encode.
Return Value
Character
Example
url_encode("http://exploratory.io?name=flight data analysis (V1)")
Returns "http://exploratory.io%3fname%3dflight%20data%20analysis%20%28V1%29"
Summary
Returns decoded query parameters from url.
Syntax
url_parameters(<column_text>
)
Arguments
Return Value
Character
Example
url_parameters("https://exploratory.io/?debug=%3Ctrue%3E#test")
returns "debug=<true>".
Summary
Returns the rolling (or moving) max value.
Syntax
roll_max(<column_num>
, n = <numeric>
, weights = NULL, by = <numeric>
, fill = <numeric>
, align = <alignment>
, normalize = <logical>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
rolling_max = roll_max(revenue, n = 3, align="center", fill = 0)
Returns a max value among three values starting from one previous and one next value, setting NA to 0.
year | revenue | rolling_max |
---|---|---|
2012 | 100 | 0 |
2013 | 200 | 300 |
2014 | 300 | 400 |
2015 | 400 | 500 |
2016 | 500 | 0 |
rolling_max = roll_max(revenue, n = 3, align="left", fill = 0)
Returns a max value among three values starting from the current value and the next two values, setting NA to 0.
year | revenue | rolling_max |
---|---|---|
2012 | 100 | 300 |
2013 | 200 | 400 |
2014 | 300 | 500 |
2015 | 400 | 0 |
2016 | 500 | 0 |
Summary
Returns the rolling (or moving) mean value.
Syntax
roll_mean(<column_num>
, n = <numeric>
, weights = NULL, by = <numeric>
, fill = <numeric>
, align = <alignment>
, normalize = <logical>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
rolling_mean = roll_mean(revenue, n = 3, align="center", fill = 0)
Returns a mean value among three values starting from one previous and one next value, setting NA to 0.
year | revenue | rolling_mean |
---|---|---|
2012 | 100 | 0 |
2013 | 200 | 200 |
2014 | 300 | 300 |
2015 | 400 | 400 |
2016 | 500 | 0 |
rolling_mean = roll_mean(revenue, n = 3, align="left", fill = 0)
Returns a mean value among three values starting from the current value and the next two values, setting NA to 0.
year | revenue | rolling_mean |
---|---|---|
2012 | 100 | 200 |
2013 | 200 | 300 |
2014 | 300 | 400 |
2015 | 400 | 0 |
2016 | 500 | 0 |
Summary
Returns the rolling (or moving) median value.
Syntax
roll_median(<column_num>
, n = <numeric>
, weights = NULL, by = <numeric>
, fill = <numeric>
, align = <alignment>
, normalize = <logical>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
rolling_median = roll_median(revenue, n = 3, align="center", fill = 0)
Returns a median value among three values starting from one previous and one next value, setting NA to 0.
year | revenue | rolling_median |
---|---|---|
2012 | 100 | 0 |
2013 | 200 | 200 |
2014 | 300 | 300 |
2015 | 400 | 400 |
2016 | 500 | 0 |
rolling_median = roll_median(revenue, n = 3, align="left", fill = 0)
Returns a median value among three values starting from the current value and the next two values, setting NA to 0.
year | revenue | rolling_median |
---|---|---|
2012 | 100 | 200 |
2013 | 200 | 300 |
2014 | 300 | 400 |
2015 | 400 | 0 |
2016 | 500 | 0 |
Summary
Returns the rolling (or moving) min value.
Syntax
roll_min(<column_num>
, n = <numeric>
, weights = NULL, by = <numeric>
, fill = <numeric>
, align = <alignment>
, normalize = <logical>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
rolling_max = roll_min(revenue, n = 3, align="center", fill = 0)
Returns a min value among three values starting from one previous and one next value, setting NA to 0.
year | revenue | rolling_min |
---|---|---|
2012 | 100 | 0 |
2013 | 200 | 100 |
2014 | 300 | 200 |
2015 | 400 | 300 |
2016 | 500 | 0 |
rolling_max = roll_min(revenue, n = 3, align="left", fill = 0)
Returns a min value among three values starting from the current value and the next two values, setting NA to 0.
year | revenue | rolling_min |
---|---|---|
2012 | 100 | 100 |
2013 | 200 | 200 |
2014 | 300 | 300 |
2015 | 400 | 0 |
2016 | 500 | 0 |
Summary
Returns the rolling (or moving) prod value.
Syntax
roll_prod(<column_num>
, n = <numeric>
, weights = NULL, by = <numeric>
, fill = <numeric>
, align = <alignment>
, normalize = <logical>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
rolling_prod = roll_prod(revenue, n = 3, align="center", fill = 0)
Returns a prod value among three values starting from one previous and one next value, setting NA to 0.
year | revenue | rolling_prod |
---|---|---|
2012 | 1 | 0 |
2013 | 2 | 6 |
2014 | 3 | 24 |
2015 | 4 | 60 |
2016 | 5 | 0 |
rolling_prod = roll_prod(revenue, n = 3, align="left", fill = 0)
Returns a prod value among three values starting from the current value and the next two values, setting NA to 0.
year | revenue | rolling_prod |
---|---|---|
2012 | 1 | 6 |
2013 | 2 | 24 |
2014 | 3 | 60 |
2015 | 4 | 0 |
2016 | 5 | 0 |
Summary
Returns the rolling (or moving) standard deviation value.
Syntax
roll_sd(<column_num>
, n = <numeric>
, weights = NULL, by = <numeric>
, fill = <numeric>
, align = <alignment>
, normalize = <logical>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
rolling_sd = roll_sd(revenue, n = 3, align="center", fill = 0)
Returns a standard deviation value among three values starting from one previous and one next value, setting NA to 0.
year | revenue | rolling_sd |
---|---|---|
2012 | 100 | 0 |
2013 | 200 | 100 |
2014 | 300 | 100 |
2015 | 400 | 100 |
2016 | 500 | 0 |
rolling_sd = roll_sd(revenue, n = 3, align="left", fill = 0)
Returns a standard deviation value among three values starting from the current value and the next two values, setting NA to 0.
year | revenue | rolling_sd |
---|---|---|
2012 | 100 | 100 |
2013 | 200 | 100 |
2014 | 300 | 100 |
2015 | 400 | 0 |
2016 | 500 | 0 |
Summary
Returns the rolling (or moving) sum value.
Syntax
roll_sum(<column_num>
, n = <numeric>
, weights = NULL, by = <numeric>
, fill = <numeric>
, align = <alignment>
, normalize = <logical>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
rolling_sum = roll_sum(revenue, n = 3, align="center", fill = 0)
Returns a sum value among three values starting from one previous and one next value, setting NA to 0.
year | revenue | rolling_sum |
---|---|---|
2012 | 100 | 0 |
2013 | 200 | 600 |
2014 | 300 | 900 |
2015 | 400 | 1200 |
2016 | 500 | 0 |
rolling_sum = roll_sum(revenue, n = 3, align="left", fill = 0)
Returns a sum value among three values starting from the current value and the next two values, setting NA to 0.
year | revenue | rolling_sum |
---|---|---|
2012 | 100 | 600 |
2013 | 200 | 900 |
2014 | 300 | 1200 |
2015 | 400 | 0 |
2016 | 500 | 0 |
Summary
Returns the rolling (or moving) variance value.
Syntax
roll_var(<column_num>
, n = <numeric>
, weights = NULL, by = <numeric>
, fill = <numeric>
, align = <alignment>
, normalize = <logical>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
rolling_var = roll_var(revenue, n = 3, align="center", fill = 0)
Returns a variance value among three values starting from one previous and one next value, setting NA to 0.
year | revenue | rolling_var |
---|---|---|
2012 | 100 | 0 |
2013 | 200 | 10000 |
2014 | 300 | 10000 |
2015 | 400 | 10000 |
2016 | 500 | 0 |
rolling_var = roll_var(revenue, n = 3, align="left", fill = 0)
Returns a variance value among three values starting from the current value and the next two values, setting NA to 0.
year | revenue | rolling_var |
---|---|---|
2012 | 100 | 10000 |
2013 | 200 | 10000 |
2014 | 300 | 10000 |
2015 | 400 | 0 |
2016 | 500 | 0 |
Summary
Removes column(s). It can be used along with any column(s) selection operation like select(), gather(), etc.
Syntax
-
Return Value
Example
select(-Country)
Remove Country column.
select(-starts_with("user"))
Remove all the columns whose names start with "user".
gather(Year, Population, -Country)
Gather all the columns except Country column into Year and Population columns.
Summary
Select column(s) with a predicate (conditional) function.
Syntax
select_if(<predicate_condition>
,...)
Arguments
Example
select_if(is.numeric)
Select only numeric data type columns.
select_if(is.character)
Select only character data type columns.
select_if(function(x) is.numeric(x) && mean(x) > 5)
Select only the columns that are numeric data type and their mean values are greater than 5.
Summary
Returns the column names that starts with a given text.
Syntax
starts_with(<text>
, ignore.case = <logical>
)
Arguments
Example
select(starts_with("X", ignore.case = TRUE))
Select only columns whose name starts with "X"
Summary
Returns the column names that ends with a given text.
Syntax
ends_with(<text>
, ignore.case = <logical>
)
Arguments
Example
select(ends_with("ABR", ignore.case = TRUE))
Select only columns whose name ends with "ABR"
Summary
Returns the column names that contain a given text.
Syntax
contains(<text>
, ignore.case = <logical>
)
Arguments
Example
select(contains("ABR", ignore.case = TRUE))
Select only columns whose name contains with "ABR"
Summary
Returns the column names that matches with a given text.
Syntax
matches(<text>
, ignore.case = <logical>
)
Arguments
Example
select(matches("[:digit:]+", ignore.case = TRUE))
Select only columns whose names contains digits (regular expression)
Summary
Returns the column names that starts with a given text with numbers that are within a given range. It's useful especially when you have column names like X1, X2, X3, X4, etc.
Syntax
num_range(<text>
, <start_num>
:<end_num>
)
Arguments
Example
select(num_range("X", 1:5))
Selects all columns (numerically) from X1 to X5.
Summary
Returns the column names that are any of the given names.
Syntax
any_of(<text1>
, <text2>
, ...)
Arguments
Example
select(any_of("x", "y", "z”))
Selects columns whose names are "x", "y", and "z".
Summary
Returns all the column names. It's useful when you want to have particular column(s) first before everything else.
Syntax
everything()
Arguments
Example
select(X, everything())
Reorder columns: keep the column "X" before all the other columns.
Summary
Returns the columns for which the specified column selection function (e.g. is.numeric) returns TRUE.
Syntax
where(<columns_select_function>
)
Arguments
Example
select(where(is.numeric))
Select all numeric columns.
Summary
This function is used for selecting columns for other functions such as summarize_row.
Syntax
across(<column(s)>
)
across(c(<column(s)>
))
across(where(<columns_select_function>
))
across(starts_with(<text>
, ignore.case = <logical>
))
across(ends_with(<text>
, ignore.case = <logical>
))
across(contains(<text>
, ignore.case = <logical>
))
across(matches(<text>
, ignore.case = <logical>
))
across(num_range(<text>
, <start_num>:<end_num>
))
Arguments
Example
mutate(Total = summarize_row(across(c(Sales_Consumer, Sales_Business, Sales_Government)), sum, na.rm = TRUE))
Sums up values from the specified columns for each row, and stores the sums in the Total column.
mutate(Total = summarize_row(across(where(is.numeric)), sum, na.rm = TRUE))
Sums up values from all numeric columns for each row, and stores the sums in the Total column.
Summary
Apply functions to all the columns.
Syntax
mutate_all(funs(<function(s)>
), ...)
Arguments
Example
mutate_all(funs(. * 0.5))
Multiply all the column values by 0.5 and override the existing columns.
mutate_all(funs(half = . * 0.5))
Multiply all the column values by 0.5 and create new columns whose names will end with "_half".
mutate_all(funs(min_rank, dense_rank))
Apply min_rank and dense_rank functions to all the column values and create new columns whose names will end with the function name (min_rank, dense_rank).
Summary
Apply functions to specified columns.
Syntax
mutate_at(vars(<column(s)>
), funs(<function(s)>
), ...)
Arguments
Example
mutate_at(vars(ends_with("DELAY")), min_rank)
Apply 'min_rank' function to the columns whose name ending with "DELAY".
mutate_at(vars(ends_with("DELAY")), funs(. * .5))
Multiply 0.5 to the columns whose name ending with "DELAY".
mutate_at(c("ARR_DELAY", "DEP_DELAY"), min_rank)
Apply 'min_rank' function to "ARR_DELAY" and "DEP_DELAY" columns.
mutate_at(c(1,3), min_rank)
Apply 'min_rank' function to 1st and 3rd columns.
Summary
Apply functions to only the columns that match with a given condition.
Syntax
mutate_if(<predicate_condition>
, funs(<function(s)>
), ...)
Arguments
Example
mutate_if(is.numeric, mean)
Apply 'mean' function to all the columns that are numeric data type.
mutate_if(is.numeric, funs(. * 0.2))
Multiply all the numeric columns by 0.2 overriding existing columns.
mutate_if(is.numeric, funs(calc = . * 0.2))
Multiply all the numeric columns by 0.2 as new columns.
Summary
Apply functions to all the columns.
Syntax
summarize_all(funs(<function(s)>
), ...)
Arguments
Example
summarize_all(mean)
Apply mean function to all the column values per group and override the existing columns.
summarize_all(mean, na.rm = TRUE)
Apply mean function with argument "na.rm = TRUE" to all the column values per group.
summarize_all(funs(min, max))
Apply min and max functions to all the column values per group and create new columns.
summarize_all(funs(average = mean), na.rm = TRUE)
Apply mean function to all the column values per group and create new columns whose names end with 'average'.
Summary
Apply functions to specified columns.
Syntax
summarize_at(vars(<column(s)>
), funs(<function(s)>
), ...)
Arguments
Example
summarize_at(vars(ends_with("DELAY")), mean)
Apply 'mean' function to the columns whose name ending with "DELAY".
summarize_at(vars(ends_with("DELAY")), mean, na.rm = TRUE)
Apply 'mean' function with "na.rm = TRUE" argument to the columns whose name ending with "DELAY".
summarize_at(c("ARR_DELAY", "DEP_DELAY"), mean)
Apply 'mean' function to "ARR_DELAY" and "DEP_DELAY" columns.
summarize_at(c(1,3), mean)
Apply 'mean' function to 1st and 3rd columns.
Summary
Apply functions to only the columns that match with a given condition.
Syntax
summarize_if(<predicate_condition>
, funs(<function(s)>
), ...)
Arguments
Example
summarize_if(is.numeric, mean)
Apply 'mean' function to all the columns that are numeric data type for each group.
Summary
Returns a list of user selected columns.
Syntax
vars(<column(s)>
)
Arguments
Example
vars(starts_with("ARR"))
Select columns whose name starting with "ARR".
vars(ARR_DELAY, ARR_TIME)
Select "ARR_DELAY" and "ARR_TIME" columns.
Summary
Returns values based on whether a given condition is satisfied or not.
Syntax
if_else(<condition>
, <return_value_when_TRUE>
, <return_value_when_FALSE>
, missing = NULL)
Arguments
Return Value
Vector
Example
mutate(category = if_else(price >= 500, "Expensive", "Cheap"))
price | category |
---|---|
1000 | Expensive |
500 | Expensive |
100 | Cheap |
50 | Cheap |
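A minimal sketch of the optional 'missing' argument, assuming the price column may contain NA (column names follow the example above):
mutate(category = if_else(price >= 500, "Expensive", "Cheap", missing = "Unknown"))
Returns "Unknown" instead of NA for the rows where price is NA.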
Summary
Returns values based on multiple conditions. This is similar to CASE WHEN conditions in SQL. You can use two sided formula to define the condition and the mapping values, and use TRUE as ELSE condition. See examples below.
Syntax
case_when(...)
Arguments
Return Value
Vector
Example
mutate(category = case_when(price > 900 ~ "Super Expensive", price >= 500 ~ "Expensive", price >= 100 ~ "Mild"))
price | category |
---|---|
1000 | Super Expensive |
500 | Expensive |
100 | Mild |
50 | NA |
mutate(category = case_when(price > 900 ~ "Super Expensive", price >= 500 ~ "Expensive", price >= 100 ~ "Mild", TRUE ~ "Cheap"))
price | category |
---|---|
1000 | Super Expensive |
500 | Expensive |
100 | Mild |
50 | Cheap |
Summary
Replace the missing values with a given value or the values that are at the same position in other columns.
Syntax
coalesce(<column>
, <column(s)>
)
Arguments
Return Value
Vector
Example
Original data:
price | price_old |
---|---|
1500 | 1000 |
NA | 500 |
NA | 100 |
50 | 10 |
mutate(price = coalesce(price, 0))
price | price_old |
---|---|
1500 | 1000 |
0 | 500 |
0 | 100 |
50 | 10 |
mutate(price = coalesce(price, price_old))
price | price_old |
---|---|
1500 | 1000 |
500 | 500 |
100 | 100 |
50 | 10 |
Summary
Replaces the matching values to NA.
Syntax
na_if(<column>
, <column_value>
)
Arguments
Return Value
Vector
Example
mutate(after = na_if(before, 999))
before | after |
---|---|
A | A |
G | G |
999 | NA |
Summary
Compares two numeric vectors (columns). This is a safe way of comparing if two vectors of floating point numbers are (pairwise) equal. This is safer than using ==, because it has a built in tolerance.
Syntax
near(<column_num>
, <column_num>
, tol = <numeric>
)
Arguments
Return Value
Logical
Example
sqrt(2) ^ 2 == 2
Returns FALSE.
near(sqrt(2) ^ 2, 2)
Returns TRUE.
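A minimal sketch of the optional 'tol' (tolerance) argument:
near(1.5, 1.45, tol = 0.1)
Returns TRUE because the difference (0.05) is within the given tolerance of 0.1.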
Summary
Recodes (or replaces) the original values with given values. You can replace numeric values either by the name with backticks or based on their position, and character values by their name.
Syntax
recode(<column>
, ..., .default = <text>
, .missing = <text>
)
Arguments
Return Value
Vector
Example
Original data is numeric:
ID |
---|
4716 |
4755 |
2432 |
mutate(name = recode(ID, `4716` = "Oracle", `4755` = "Rakuten"))
ID | name |
---|---|
4716 | Oracle |
4755 | Rakuten |
2432 | NA |
Since the original value 2432 doesn't have a replacement value and its data type (numeric) is different from the replacement values (character), its replacement value becomes NA. If the data types were the same, the original value would be kept.
mutate(name = recode(ID, `4716` = "Oracle", `4755` = "Rakuten", .default = "Not Known"))
You can use '.default' argument to set the default value.
ID | name |
---|---|
4716 | Oracle |
4755 | Rakuten |
2432 | Not Known |
Original data is character:
ID |
---|
A |
G |
M |
mutate(name = recode(ID, "A" = "Apple", "G" = "Google"))
ID | name |
---|---|
A | Apple |
G | Google |
M | M |
Since the original value 'M' doesn't have a replacement value but its data type (character) is the same as the replacement values (character), the original value 'M' is kept.
mutate(name = recode(ID, A = "Apple", G = "Google", .default = "Not known"))
You can use '.default' argument to set the default value.
ID | name |
---|---|
A | Apple |
G | Google |
M | Not Known |
Summary
Recodes (or replaces) the original values with given values and create a factor column. You can replace numeric values either by the name with backticks or based on their position, and character values by their name.
Syntax
recode_factor(<column>
, ..., .default = <text>
, .missing = <text>
, .ordered = <logical>
)
Arguments
Return Value
Factor
Example
Original Data:
ID |
---|
M |
T |
S |
mutate(name = recode_factor(ID, S = "Sunday", M = "Mondy", T = "Tuesday", .ordered = TRUE))
ID | name |
---|---|
M | Monday |
T | Tuesday |
S | Sunday |
Summary
Separates a column with delimited values into multiple rows.
Syntax
separate_rows(<column_text_date>
, sep = <text>
, convert=<logical>
)
Arguments
Example
Original data:
company | investors |
---|---|
Uber | Lowercase Capital, Benchmark Capital, Google Ventures |
Xiaomi | Digital Sky Technologies, QiMing Venture Partners, Qualcomm Ventures |
separate_rows(investors, sep=", ")
company | investors |
---|---|
Uber | Lowercase Capital |
Uber | Benchmark Capital |
Uber | Google Ventures |
Xiaomi | Digital Sky Technologies |
Xiaomi | QiMing Venture Partners |
Xiaomi | Qualcomm Ventures |
Summary
Unnest a list column by turning each element of a list-column into a column.
Syntax
unnest_wider(col = <column_list>
, names_sep = <text>
, names_repair = <text>
, simplify = <logical>
, ptype = <list>
, transform = <list>
)
Arguments
names_repair (Optional) - Used to check that output data frame has valid names. Must be one of the following options
simplify (Optional) - If TRUE, will attempt to simplify lists of length-1 vectors to an atomic vector.
ptype (Optional) - A named list of prototypes declaring the desired output type of each component. Use this argument if you want to check each element has the types you expect when simplifying.
transform (Optional) - A named list of transformation functions applied to each component. Use this function if you want transform or parse individual elements as they are hoisted.
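Example
A minimal sketch, assuming a list column named 'assignee' whose elements are named lists such as list(name = "Kan", id = 105) (the column and field names are hypothetical):
unnest_wider(assignee)
Turns each component of the 'assignee' list (name, id) into its own column, producing new 'name' and 'id' columns.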
Summary
Unnest a list column by turning each element of a list-column into a row.
Syntax
unnest_longer(col = <column_list>
, values_to = <text>
, indices_to = <text>
, indices_include = <logical>
, names_repair = <text>
, simplify = <logical>
, ptype = <list>
, transform = <list>
)
Arguments
names_repair (Optional) - Used to check that output data frame has valid names. Must be one of the following options
simplify (Optional) - If TRUE, will attempt to simplify lists of length-1 vectors to an atomic vector.
ptype (Optional) - A named list of prototypes declaring the desired output type of each component. Use this argument if you want to check each element has the types you expect when simplifying.
transform (Optional) - A named list of transformation functions applied to each component. Use this function if you want transform or parse individual elements.
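Example
A minimal sketch, assuming a list column named 'assignee' where each element is a vector of names (the column name is hypothetical):
unnest_longer(assignee)
Gives each element of the 'assignee' list its own row, duplicating the other columns as needed.
unnest_longer(assignee, indices_include = TRUE)
Also adds a column holding each element's position (or name) within the original list.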
Summary
Hoist allows you to selectively pull components of a list-column out into their own top-level columns.
Syntax
hoist(.col = <column_list>
, .remove = <logical>
, .simplify = <logical>
, .ptype = <list>
, .transform=<list>
)
Arguments
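Example
A minimal sketch, assuming a list column named 'assignee' whose elements are named lists with 'name' and 'id' components (the column and component names are hypothetical):
hoist(assignee, assignee_name = "name", assignee_id = "id")
Pulls the 'name' and 'id' components out of the 'assignee' list column into new top-level columns 'assignee_name' and 'assignee_id'.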
Summary
Drop rows that have NA value.
Syntax
drop_na(<column(s)>
,...)
drop_na(-<column(s)>
,...)
drop_na(starts_with(<text>
, ignore.case = <logical>
))
drop_na(ends_with(<text>
, ignore.case = <logical>
))
drop_na(contains(<text>
, ignore.case = <logical>
))
drop_na(matches(<text>
, ignore.case = <logical>
))
drop_na(num_range(<text>
, <start_num>:<end_num>
))
drop_na(one_of(<text1>
, <text2>
, ...))
drop_na(<column>
, everything())
Arguments
Example
Sample data:
Original data:
location | year | plant_date | harvest_date |
---|---|---|---|
Als | 1900 | NA | 35.02899 |
Als | 1901 | 10.00000 | 24.02899 |
Als | 1902 | 12.58863 | 45.02899 |
Bea | 1900 | 5.00000 | 18.00000 |
Bea | 1901 | 3.31119 | 10.00000 |
Bea | 1902 | 29.12891 | NA |
Bor | 1900 | 12.52136 | 18.14896 |
Bor | 1901 | NA | 10.64896 |
Bor | 1902 | 10.23056 | 20.64896 |
drop_na(ends_with("date"))
Returns a data frame without NA values in plant_date and harvest_date columns.
location | year | plant_date | harvest_date |
---|---|---|---|
Als | 1901 | 10.00000 | 24.02899 |
Als | 1902 | 12.58863 | 45.02899 |
Bea | 1900 | 5.00000 | 18.00000 |
Bea | 1901 | 3.31119 | 10.00000 |
Bor | 1900 | 12.52136 | 18.14896 |
Bor | 1902 | 10.23056 | 20.64896 |
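drop_na() can also be called without specifying any column (a minimal sketch):
drop_na()
Drops every row that has an NA value in any of the columns.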
Summary
Add rows with given column values.
Syntax
add_row()
Arguments
Return Value
Data Frame
Example
Original Data:
ID | name |
---|---|
4716 | Oracle |
4755 | Rakuten |
2432 | DeNA |
add_row(ID = 4689)
ID | name |
---|---|
4716 | Oracle |
4755 | Rakuten |
2432 | DeNA |
4689 | NA |
add_row(ID = 4689, name = "Yahoo")
ID | name |
---|---|
4716 | Oracle |
4755 | Rakuten |
2432 | DeNA |
4689 | Yahoo |
Summary
Parse characters, extract numeric values, and convert to number data type, including values with extra formatting such as currency symbols and thousand separators.
Syntax
parse_number(<column>
, na = c("", "NA"), locale = <locale>
)
Arguments
Return Value
Numeric
Example
parse_number("12345")
Returns 12345
parse_number("12345.10")
Returns 12345.1
parse_number("$12,345.10")
Returns 12345.1
parse_number("$12.20M")
Returns 12.2
parse_number(as.Date("2015-01-30") - as.Date("2015-01-15"))
Returns 15
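A minimal sketch of the optional 'locale' argument for European style numbers, where "," is the decimal mark and "." is the grouping mark:
parse_number("1.234,56", locale = locale(decimal_mark = ",", grouping_mark = "."))
Returns 1234.56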
Summary
Convert data to double precision number data type.
Syntax
parse_double(<column>
, na = c("", "NA"), locale = <locale>
)
Arguments
Return Value
Double
Example
parse_double("12345")
Returns 12345
parse_double("12345.10")
Returns 12345.1
Summary
Convert data to double precision number data type, treating comma (",") as the decimal mark (European number format).
Syntax
parse_euro_double(<column>
, na = c("", "NA"))
Arguments
Return Value
Double
Example
parse_euro_double("12345")
Returns 12345
parse_euro_double("12345.10")
Returns 12345.1
Summary
Convert data to integer number data type.
Syntax
parse_integer(<column>
, na = c("", "NA"), locale = <locale>
)
Arguments
Return Value
Integer
Example
parse_integer("12345")
Returns 12345
parse_integer("12345.10")
Returns 12345
Summary
Convert data to Time data type.
Syntax
parse_time(<column>
, format = <time_format>
, locale = <locale>
)
Arguments
Return Value
time
Example
parse_time("13:10:05")
Returns "13:10:05"
parse_time("13:10:05", locale = locale(tz = "America/Los_Angeles"))
Returns "13:10:05"
parse_time("10:20:15 PM")
Returns "22:20:15"
parse_time("10:20:15 午後", local = locale("ja"))
Returns "22:20:15"
Summary
Convert a given data to character data type.
Syntax
parse_character(<column>
, na = c("", "NA"), locale = <locale>
)
Arguments
Return Value
Character
Example
parse_character(123.11)
Returns "123.11" as character.
Summary
Parse data and convert to factor data type.
Syntax
parse_factor(x, levels, ordered = <logical>
, na = c("", "NA"), locale = <locale>
)
Arguments
Return Value
Factor
Example
columnA - "iMac", "iPod", "iPhone", "iPod", "iPhone"
parse_factor(columnA)
Returns iMac, iPod, iPhone, iPod, iPhone (It has Levels information of : iMac iPhone iPod)
Summary
Convert data to Logical (Boolean) data type.
Syntax
parse_logical(<column>
, na = c("", "NA"), locale = <locale>
)
Arguments
Return Value
Logical
Example
parse_logical("TRUE")
Returns TRUE
parse_logical("T")
Returns TRUE
Summary
Heuristically guess the data type for each column by reading the first 1000 rows, parse the data, and set appropriate data types for all the columns of the data frame.
Syntax
type_convert()
Arguments
Return Value
Data Frame
Example
type_convert()
Summary
Convert Character or Number to Date / Period when data contains only Date, but not Time.
Syntax
ym(<column>
, tz = <timezone>
, locale = <locale>
)
Arguments
Return Value
Date
Example
ym("2015-10")
returns "2015-10-01 UTC"
ym("2015/10")
returns "2015-10-01 UTC"
ym("Created on 2015 10")
returns "2015-10-01 UTC"
Summary
Convert Character or Number to Date / Period when data contains only Date, but not Time.
Syntax
my(<column>
, tz = <timezone>
, locale = <locale>
)
Arguments
Return Value
Date
Example
my("01-2015")
returns "2015-01-01 UTC"
my("01/2015")
returns "2015-01-01 UTC"
my("Created on 1 2015")
returns "2015-01-01 UTC"
Summary
Convert text to Date data type if a given column holds values that look like Year and Quarter.
Syntax
yq(<column>
, tz = <timezone>
, locale = <locale>
)
Arguments
Return Value
Date
Example
yq("2016.2")
returns "2016-04-01"
yq("2016-03")
returns "2016-07-01"
Summary
Extract week numbers of the year. Weeks start from Sunday.
Syntax
epiweek(<column_date>
)
Return Value
Numeric
Example
epiweek(ymd(c("2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04", "2020-01-05", "2020-01-06", "2020-01-07", "2020-01-08")))
returns 1,1,1,1,2,2,2,2 (2020-01-05 is Sunday)
Summary
Convert a given POSIXct to Date.
Syntax
as_date(<column_date>
)
Arguments
Return Value
Date
Example
as_date(as_datetime("2015-10-10 13:10:05"))
Returns "2015-10-10"
Summary
Convert given data to date/time (POSIXct) data type.
Syntax
as_datetime(<column_date>
, format = <date_time_format>
, tz = <timezone>
, origin = <text>
)
Arguments
Return Value
POSIXct
Example
as_datetime("2015-10-10 13:10:05")
Returns "2015-10-10 13:10:05 PDT"
as_datetime("2015-10-10 13:10:05", tz = "America/Los_Angeles")
Returns "2015-10-10 13:10:05 PDT"
as_datetime("2015-10-10 13:10:05", tz = "Asia/Tokyo")
Returns "2015-10-10 13:10:05 JST"
as_datetime("05-10-15T13:10:05", format = "%d-%m-%yT%H:%M")
Returns "2015-10-05 13:10:00 PDT"
Summary
Truncate text to make the maximum length to be the specified number.
Syntax
str_trunc(<column_text>
, width = <number>
, side = "right"|"left"|"center"
, ellipsis = <text>
)
Arguments
Return Value
Character
Example
str_trunc("Exploratory", 6)
returns "Exp..."
str_trunc("Exploratory", 6, side = "left")
returns "...ory"
Summary
Remove letters that match with given letters or expressions.
Syntax
str_remove(<column_text>
, <pattern>
)
Arguments
Return Value
Character
Example
str_remove("I am traveling to New York!!", "New")
returns "I am traveling to York!!"
Summary
Remove letters that match with given letters or expressions
Syntax
str_remove_all(<column_text>
, <pattern>
)
Arguments
Return Value
Character
Example
str_remove_all("I am traveling to New York and New Jersey!!", "new")
returns "I am traveling to York and Jersey!!"
Summary
Remove a word from sentence.
Syntax
str_remove_word(<column_text>
, start = <number>
, end = <number>
, sep = <text>
)
Arguments
Return Value
Character
Example
str_remove_word("I am traveling to New York!!", start = 1)
returns "am traveling to York!!"
Summary
Replace word from a sentence with given text.
Syntax
str_replace_word(<column_text>
, start = <number>
, end = <number>
, sep = <text>
, rep = <text>
)
Arguments
NOTE: For now only the first and the last work.
Return Value
Character
Example
exploratory::str_replace_word("I am traveling to New York", start = -1, sep = " ", rep = "England")
returns "I am traveling to New England"
Summary
Remove the text between given begin and end characters.
Syntax
str_remove_inside(<column_text>
, begin = <text>
, end = <text>
, all = <logical>
, include_special_chars = <logical>
)
Arguments
Return Value
Character
Example
exploratory::str_remove_inside("I am traveling to New York (2020)", begin = "(", end = ")")
returns "I am traveling to New York "
Summary
Replace the text between given begin and end characters with provided text.
Syntax
str_replace_inside(<column_text>
, begin = <text>
, end = <text>
, rep = <text>
, all = <logical>
, include_special_chars = <logical>
)
Arguments
Return Value
Character
Example
exploratory::str_replace_inside("I am traveling to New (York)", begin = "(", end = ")", rep = "England")
returns "I am traveling to New England"
Summary
Remove a URL inside of the text.
Syntax
str_remove_url(<column_text>
, position = <text>
)
Arguments
Return Value
Character
Example
exploratory::str_remove_url("I am traveling to New York http://example.com", position = "any")
returns "I am traveling to New York "
Summary
Replace a URL inside of the text with provided text.
Syntax
str_replace_url(<column_text>
, rep = <text>
)
Arguments
Return Value
Character
Example
exploratory::str_replace_url("I am traveling to New http://example.com/ny", rep = "York")
returns "I am traveling to New York"
Summary
Extract a URL from the text.
Syntax
str_extract_url(<column_text>
)
Arguments
Return Value
Character
Example
exploratory::str_extract_url("I am traveling to New http://example.com/ny")
returns "http://example.com/ny"
Summary
Remove emoji from the text.
Syntax
str_remove_emoji(<column_text>
)
Arguments
Return Value
Character
Example
exploratory::str_remove_emoji("I am traveling to New York 🤩")
returns "I am traveling to New York "
Summary
Returns a summary information of a given model in a tidy (normalized data frame) format.
Syntax
model_info(<column_list>
, output = <model_info_output_type>
, ...)
Arguments
* output (Optional) - The default is "summary". This changes the type of output. Can be one of the following.
Returned Values
Example
Following examples assumes that 'model' column contains a fitted model of 'lm' by do() command.
model_info(model, output = "summary")
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|---|---|---|---|
0.8879 | 0.8879 | 13.0886 | 1716441.58 | 0 | 3 | -1729152.62 | 3458313.24 | 3458357.15 | 74228528.48 | 433295 |
model_info(model, output = "variables")
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -3.0966 | 0.0341 | -90.749 | 0 |
2 DEP_DELAY | 1.003 | 0.0005 | 1851.9463 | 0 |
3 DISTANCE | -0.0037 | 0 | -114.3364 | 0 |
model_info(model, output = "data")
.rownames | ARR_DELAY | DEP_DELAY | .fitted | .se.fit | .resid | .hat | .sigma | .cooksd | .std.resid |
---|---|---|---|---|---|---|---|---|---|
1 | -15 | -9 | -13.934 | 0.0247 | -1.066 | 0.0000035503 | 13.0886 | 7.8503e-9 | -0.0814 |
2 | -28 | -19 | -22.4116 | 0.035 | -5.5884 | 0.0000071333 | 13.0886 | 4.3348e-7 | -0.427 |
Summary
Convert given data to hms data type. It is a difftime treated as seconds.
Syntax
as.hms(<column>
)
Arguments
Return Value
hms
Example
as.hms(20)
returns 20 seconds difftime
Summary
Returns true if a given object is hms type.
Syntax
is.hms(<column>
)
Arguments
Return Value
Logical
Example
mutate_if(is.hms, as.character)
Convert hms columns into character columns. An example of the value is "05:01:20".
Summary
Remove rows whose column values are all NAs.
Syntax
remove_empty_rows()
Example
Original data:
item_name | transaction_id |
---|---|
banana | 1 |
apple | 1 |
NA | NA |
carrot | 2 |
apple | 2 |
NA | NA |
orange | 3 |
NA | 3 |
remove_empty_rows()
Returns the data without all NA rows.
item_name | transaction_id |
---|---|
banana | 1 |
apple | 1 |
carrot | 2 |
apple | 2 |
orange | 3 |
NA | 3 |
Summary
Remove columns whose values are all NAs.
Syntax
remove_empty_cols()
Example
Original data:
item_name | transaction_id | tag |
---|---|---|
banana | 1 | NA |
apple | 1 | NA |
carrot | 2 | NA |
apple | 2 | NA |
orange | 3 | NA |
NA | 3 | NA |
remove_empty_cols()
Returns the data without all NA columns.
item_name | transaction_id |
---|---|
banana | 1 |
apple | 1 |
carrot | 2 |
apple | 2 |
orange | 3 |
NA | 3 |
Summary
Make column names clean by using only _ character, lowercase letters, and numbers.
Syntax
clean_names()
Example
Original data:
Item Name | .transaction |
---|---|
banana | 1 |
apple | 1 |
carrot | 2 |
apple | 2 |
orange | 3 |
NA | 3 |
clean_names()
Returns data with clean column names.
item_name | _transaction |
---|---|
banana | 1 |
apple | 1 |
carrot | 2 |
apple | 2 |
orange | 3 |
NA | 3 |
Summary
Get rows that are duplicated.
Syntax
get_dupes(<column(s)>
)
Example
Original data:
name | n |
---|---|
banana | 1 |
apple | 1 |
apple | 1 |
apple | 2 |
get_dupes(name)
Returns rows that are duplicated in name column.
name | n |
---|---|
apple | 1 |
apple | 1 |
apple | 2 |
get_dupes()
Returns rows that are duplicated in all columns.
name | n |
---|---|
apple | 1 |
apple | 1 |
Summary
Convert Excel's numeric date representation to Date.
Syntax
excel_numeric_to_date(<column_num>
, date_system = "modern"|"mac pre-2011"
)
Arguments
Example
Original data:
user | last_login |
---|---|
Lisa | 42370 |
Emily | 42318 |
John | 42489 |
mutate(last_login = excel_numeric_to_date(last_login))
Returns a data frame with the last_login column converted to Date.
user | last_login |
---|---|
Lisa | 2016-01-01 |
Emily | 2015-11-10 |
John | 2016-04-29 |
Return Value
Date
Summary
Convert specific values to NA.
Syntax
convert_to_NA(strings = <character_vector>
)
Arguments
Example
Original data:
item_name | transaction_id |
---|---|
banana | 1 |
apple | 1 |
-99 | -99 |
carrot | 2 |
apple | 2 |
-99 | -99 |
orange | 3 |
-99 | -99 |
convert_to_NA("-99")
Convert -99 to NA even if it's numeric.
item_name | transaction_id |
---|---|
banana | 1 |
apple | 1 |
NA | NA |
carrot | 2 |
apple | 2 |
NA | NA |
orange | 3 |
NA | NA |
Summary
Create frequency table.
Syntax
tabyl(<column(s)>
)
Arguments
Example
Original data:
item_name | transaction_id |
---|---|
banana | 1 |
apple | 1 |
NA | NA |
carrot | 2 |
apple | 2 |
NA | NA |
orange | 3 |
NA | 3 |
tabyl(item_name)
Show ratio of item_name with NA and without it.
item_name | n | percent | valid_percent |
---|---|---|---|
apple | 2 | 0.25 | 0.4 |
banana | 1 | 0.125 | 0.2 |
carrot | 1 | 0.125 | 0.2 |
orange | 1 | 0.125 | 0.2 |
NA | 3 | 0.375 | NA |
Summary
Convert a given column to a factor. Compared to base R's as.factor, this function creates levels in the order in which they appear, which will be the same on every platform (base R sorts in the current locale which can vary from place to place).
Syntax
as_factor(<column_char_num_date>
)
Arguments
Example
x <- c("FR", "US", "JP")
as_factor(x)
Returns a factor with FR, US, JP with levels of FR, US, JP.
Summary
Anonymize factor levels. Replaces factor levels with arbitrary numeric identifiers. Neither the values nor the order of the levels are preserved.
Syntax
fct_anon(<column_char_num_factor>
, prefix = <text>
)
Arguments
Example
x <- factor(c("FR", "US", "JP"))
fct_anon(x)
Returns a factor with 1, 2, 3.
Summary
Add additional levels to a factor.
Syntax
fct_expand(<column_char_num_factor>
, <text>
)
Arguments
Example
x <- factor(c("FR", "US"))
fct_expand(x, "JP", "CA")
Returns a factor with levels, FR US JP CA.
Summary
Drop unused levels.
Syntax
fct_drop(<column_char_num_factor>
, only = <text>
)
Arguments
Example
x <- factor(c("FR", "US"), levels = c("FR", "US", "JP", "CA"))
fct_drop(x)
Returns a factor with levels of FR, US.
fct_drop(x, only = "CA")
Returns a factor with levels of FR, US, JP.
Summary
This gives missing values an explicit factor level, ensuring that they appear in charts.
Syntax
fct_explicit_na(<column_char_num_factor>
, na_level = <text>
)
Arguments
Example
x <- factor(c("FR", NA, "US", NA, "JP"))
fct_explicit_na(x, "No value")
Returns a factor with FR, No value, US, No value, JP and with levels of FR JP US No value
Summary
Moves least/most common factor levels into "Other" category.
Syntax
fct_lump(<column_char_num_factor>
, n = <number>
, prop = <number>
, other_level = <text>
, ties.method = <ties_method_type>
, w = <column_num>
)
Arguments
If both n and prop are missing, fct_lump lumps together the least frequent levels into "other", while ensuring that "other" is still the smallest level.
Example
mutate(STATE = fct_lump(STATE))
This keeps most of the values in the STATE column and lumps the least frequent ones into an 'Other' level, while ensuring that 'Other' still has the smallest count compared to all the other levels.
mutate(ORIGIN_STATE_ABR = fct_lump(ORIGIN_STATE_ABR, n = 10))
This keeps only the 10 most frequent States in the ORIGIN_STATE_ABR column and adds an 'Other' level that lumps all the others together.
mutate(ORIGIN_STATE_ABR = fct_lump(ORIGIN_STATE_ABR, prop = .1))
This keeps only the States that make up more than 10% of all the entries in the ORIGIN_STATE_ABR column and adds an 'Other' level that lumps all the others together.
mutate(ORIGIN_STATE_ABR = fct_lump(ORIGIN_STATE_ABR, n = 5, w = DISTANCE))
This keeps only the 5 States with the largest weight in the ORIGIN_STATE_ABR column, where the weight is calculated as the sum of DISTANCE for each State, and adds an 'Other' level that lumps all the others together.
Summary
Creates "Other" category by moving specified values to "Other", or by keeping only specified values from "Other".
Syntax
fct_other(<column_char_num_factor>
, keep = c("<column_value(s)>
"), drop = c("<column_value(s)>
"), other_level = <text>
)
Arguments
Example
mutate(STATE = fct_other(STATE, keep = c("CA", "AK", "AZ")))
This keeps only the levels for the specified values in the STATE column and moves all other values to an 'Other' level.
mutate(STATE = fct_other(STATE, drop = c("CA", "AK", "AZ")))
This keeps the levels that are not specified here in the STATE column and moves the specified values to an 'Other' level.
Summary
Reorder factor levels by order of first appearance.
Syntax
fct_inorder(<column_char_num_factor>
, ordered = <logical>
)
Arguments
Example
x <- factor(c("FR", "US", "US", "JP", "US", "JP"))
fct_inorder(x)
Returns a factor with levels in the following order, FR US JP.
x <- c("FR", "US", "US", "JP", "US", "JP") -- it can also be used for a Character data type column.
fct_inorder(x)
Returns a factor with levels in the following order, FR US JP.
Summary
Reorder factor levels by frequency (from the most frequent to the least).
Syntax
fct_infreq(<column_char_num_factor>
, ordered = <logical>
)
Arguments
Example
x <- factor(c("FR", "US", "US", "JP", "US", "JP"))
fct_infreq(x)
Returns a factor with levels in the following order, US JP FR.
x <- c("FR", "US", "US", "JP", "US", "JP") -- it can also be used for a Character data type column.
fct_infreq(x)
Returns a factor with levels in the following order, US JP FR.
Summary
Sets the levels of a given factor column by moving any number of levels to any location.
Syntax
fct_relevel(<column_char_num_factor>
, ...)
Arguments
Example
x <- c("France", "Japan", "Australia", "United States") -- creating a character column.
fct_relevel(x, "Japan")
Returns a factor with levels Japan, Australia, France, United States. The character column is converted to a factor with the default level setting, which is alphabetical order.
x <- factor(x, levels = c("France", "Japan", "Australia", "United States")) -- creating a factor column with the specified levels.
fct_relevel(x, "Japan")
Returns a factor with levels Japan, France, Australia, United States.
fct_relevel(x, "Japan", "United States")
Returns a factor with levels Japan, United States, France, Australia.
Summary
Reorder factor levels by sorting based on another variable.
Syntax
fct_reorder(<column_char_num_factor>
, <column_num>
, .fun = <aggregate_function>
, ..., .desc = <logical>
)
Arguments
Example
mutate(CARRIER = fct_reorder(CARRIER, ARR_DELAY))
This makes the CARRIER column a factor with the following level order. Levels: VX 9E DL FL UA AS AA B6 MQ US WN YV EV OO HA F9 (VX has the smallest median ARR_DELAY value.)
mutate(CARRIER = fct_reorder(CARRIER, ARR_DELAY, .desc = TRUE))
This makes the CARRIER column a factor with the following level order. Levels: F9 HA EV OO B6 MQ US WN YV AA AS UA 9E DL FL VX (F9 has the largest median ARR_DELAY value.)
mutate(CARRIER = fct_reorder(CARRIER, DISTANCE))
This makes the CARRIER column a factor with the following level order. Levels: HA YV OO 9E EV MQ WN US FL DL F9 AS AA B6 UA VX (HA has the smallest median DISTANCE value.)
mutate(CARRIER = fct_reorder(CARRIER, ARR_DELAY, .fun = sum, na.rm= TRUE))
This makes the CARRIER column a factor with the following level order. Levels: DL FL VX AS UA HA US WN YV F9 9E B6 AA MQ OO EV (DL has the smallest total ARR_DELAY value.)
Summary
Reverse order of factor levels.
Syntax
fct_rev(<column_factor>
)
Arguments
Example
x <- factor(c("FR", "US", "JP"))
This creates a factor column with levels in the order FR, JP, US (alphabetical order by default).
fct_rev(x)
Returns a factor with levels in the order US, JP, FR.
Summary
Create extreme gradient boosting model for regression.
Syntax
xgboost_reg(watchlist_rate = <numeric>
, output_type = <xgb_reg_output_type>
, nrounds = <integer>
, booster = <xgboost_booster_type>
, eval_metric = <xgb_reg_evaluation_type>
, weight = <column_num>
, early_stopping_rounds = <integer>
)
Arguments
Return Value
Model
Example
Original data:
CARRIER | ARR_DELAY | DEP_DELAY | DISTANCE |
---|---|---|---|
UA | 10 | 7 | 300 |
UA | 20 | 30 | 1000 |
UA | -5 | 0 | 200 |
AA | 20 | 20 | 500 |
AA | -5 | 3 | 700 |
AA | 10 | 8 | 1000 |
%>% group_by(CARRIER)
%>% build_model(model_func = xgboost_reg, formula = DEP_DELAY ~ ARR_DELAY + DISTANCE, test_rate = 0.1)
Returns a data frame that stores a xgb.Booster model based on a formula defined as 'Predict DEP_DELAY based on ARR_DELAY and DISTANCE.' It also returns a column of original data.
CARRIER | source_data | model | .test_index |
---|---|---|---|
UA | source dataframe | xgb.Booster model | c(1) |
AA | source dataframe | xgb.Booster model | c(2) |
You can use model_coef or model_stats function to get the summary information about the models. Also, you can use prediction function to predict with the data that is stored in the same data frame.
Summary
Create extreme gradient boosting model for binary classification.
Syntax
xgboost_binary(watchlist_rate = <numeric>
, output_type = <xgb_binary_output_type>
, nrounds = <integer>
, booster = <xgboost_booster_type>
, eval_metric = <xgb_binary_evaluation_type>
, weight = <column_num>
, early_stopping_rounds = <integer>
)
Arguments
Return Value
Model
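Example
A minimal sketch, assuming the same build_model pattern shown for xgboost_reg and a data frame with a logical IS_DELAYED column and numeric DEP_DELAY and DISTANCE columns (hypothetical column names):
%>% build_model(model_func = xgboost_binary, formula = IS_DELAYED ~ DEP_DELAY + DISTANCE, nrounds = 5, test_rate = 0.1)
Returns a data frame that stores an xgb.Booster model for predicting IS_DELAYED. You can use model_stats to get the summary information about the model, and prediction_binary to predict with the data that is stored in the same data frame.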
Summary
Create extreme gradient boosting model for multi class classification.
Syntax
xgboost_multi(watchlist_rate = <numeric>
, output_type = <xgb_multi_output_type>
, nrounds = <integer>
, booster = <xgboost_booster_type>
, eval_metric = <xgb_multi_evaluation_type>
, weight = <column_num>
, early_stopping_rounds = <integer>
)
Arguments
Return Value
Model
Example
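A minimal sketch, assuming the same build_model pattern shown for xgboost_reg and a data frame with a categorical CARRIER column and numeric ARR_DELAY and DEP_DELAY columns (hypothetical column names):
%>% build_model(model_func = xgboost_multi, formula = CARRIER ~ ARR_DELAY + DEP_DELAY, nrounds = 5, test_rate = 0.1)
Returns a data frame that stores an xgb.Booster model for predicting CARRIER. You can use model_stats to get the summary information about the model, and prediction to predict with the data that is stored in the same data frame.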
Summary
Create random forest model for regression.
Syntax
randomForestReg(subset = <column_logical>
, na.action = <'na_action_type'>
, ntree = <integer>
, mtry = <numeric>
, replace = <logical>
, strata = <column>
, sampsize = <integer>
, nodesize = <integer>
, maxnodes = <integer>
, importance = <logical>
, localImp = <logical>
, nPerm = <integer>
, proximity = <logical>
, oob.prox = <logical>
, keep.forest = <logical>
, corr.bias = <logical>
, keep.inbag = <logical>
)
Arguments
Return Value
Model
Example
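A minimal sketch, assuming the same build_model pattern shown for xgboost_reg and a data frame with numeric ARR_DELAY, DEP_DELAY, and DISTANCE columns (hypothetical column names):
%>% build_model(model_func = randomForestReg, formula = ARR_DELAY ~ DEP_DELAY + DISTANCE, ntree = 100, test_rate = 0.1)
Returns a data frame that stores a randomForest regression model. You can use model_stats to get the summary information about the model, and prediction to predict with the data that is stored in the same data frame.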
Summary
Create random forest model for binary classification.
Syntax
randomForestBinary(subset = <column_logical>
, na.action = <'na_action_type'>
, ntree = <integer>
, mtry = <numeric>
, replace = <logical>
, classwt = <numeric>
, cutoff = <numeric>
, strata = <column>
, sampsize = <integer>
, nodesize = <integer>
, maxnodes = <integer>
, importance = <logical>
, localImp = <logical>
, nPerm = <integer>
, proximity = <logical>
, oob.prox = <logical>
, norm.votes = <logical>
, keep.forest = <logical>
, corr.bias = <logical>
, keep.inbag = <logical>
)
Arguments
Return Value
Model
Example
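A minimal sketch, assuming the same build_model pattern shown for xgboost_reg and a data frame with a logical IS_DELAYED column and numeric DEP_DELAY and DISTANCE columns (hypothetical column names):
%>% build_model(model_func = randomForestBinary, formula = IS_DELAYED ~ DEP_DELAY + DISTANCE, ntree = 100, test_rate = 0.1)
Returns a data frame that stores a randomForest classification model. You can use model_stats to get the summary information about the model, and prediction_binary to predict with the data that is stored in the same data frame.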
Summary
Create random forest model for multi class classification.
Syntax
randomForestMulti(subset = <column_logical>
, na.action = <'na_action_type'>
, ntree = <integer>
, mtry = <numeric>
, replace = <logical>
, classwt = <numeric>
, cutoff = <numeric>
, strata = <column>
, sampsize = <integer>
, nodesize = <integer>
, maxnodes = <integer>
, importance = <logical>
, localImp = <logical>
, nPerm = <integer>
, proximity = <logical>
, oob.prox = <logical>
, norm.votes = <logical>
, keep.forest = <logical>
, corr.bias = <logical>
, keep.inbag = <logical>
)
Arguments
Return Value
Model
Example
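A minimal sketch, assuming the same build_model pattern shown for xgboost_reg and a data frame with a categorical CARRIER column and numeric ARR_DELAY and DEP_DELAY columns (hypothetical column names):
%>% build_model(model_func = randomForestMulti, formula = CARRIER ~ ARR_DELAY + DEP_DELAY, ntree = 100, test_rate = 0.1)
Returns a data frame that stores a randomForest classification model. You can use model_stats to get the summary information about the model, and prediction to predict with the data that is stored in the same data frame.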
Summary
Extract a value from a list data type column based on a position, or based on a column name and row number when the data inside the list is a data frame.
Syntax
list_extract(<column_list>
, position = <numeric_text>
, rownum = <numeric>
)
Arguments
Return Value
Character
Example
1) Column is List of Character
mutate(text = list_extract(categories, 1))
Returns the following.
category | text |
---|---|
c("Doctors", "Health & Medical") | Doctors |
c("Bars", "American (New)", "Nightlife") | Bars |
mutate(text = list_extract(categories, -1))
Returns the value of the 1st element counted from the last.
category | text |
---|---|
c("Doctors", "Health & Medical") | Health & Medical |
c("Bars", "American (New)", "Nightlife") | Nightlife |
2) Column is List of Data Frame
mutate(text = list_extract(labels, 1))
Returns the value of the 1st column and 1st row. The default is 1st row when not specified.
labels | text |
---|---|
list(name = c("Regression", "UI"), color = c("5319e7", "207de5")) | Regression |
mutate(text = list_extract(labels, -1))
Returns the value of the last column and 1st row.
labels | text |
---|---|
list(name = c("Regression", "UI"), color = c("5319e7", "207de5")) | 5319e7 |
mutate(text = list_extract(labels, 1, 2))
Returns the value of the 1st column and 2nd row.
labels | text |
---|---|
list(name = c("Regression", "UI"), color = c("5319e7", "207de5")) | UI |
mutate(text = list_extract(labels, "name"))
Returns the value of a column, 'name', and the 1st row. The default is 1st row when not specified.
labels | text |
---|---|
list(name = c("Regression", "UI"), color = c("5319e7", "207de5")) | Regression |
Summary
Concatenates texts from all the elements of a list data type column.
Syntax
list_to_text(<column_list>
, sep = <text>
)
Arguments
Return Value
character
Example
mutate(category_text = list_to_text(category))
Returns the following.
category | category_text |
---|---|
c("Doctors", "Health & Medical") | Doctors, Health & Medical |
Nightlife | Nightlife |
c("Active Life", "Mini Golf", "Golf") | Active Life, Mini Golf, Golf |
c("Bars", "American (New)", "Nightlife") | Bars, American (New), Nightlife |
c("Bars", "American (Traditional)", "Nightlife") | Bars, American (Traditional), Nightlife |
Summary
Concatenates values from multiple columns into a list.
Syntax
list_concat(<column(s)>
, collapse = <logical>
)
Arguments
Return Value
list
Example
mutate(items = list_concat(item1, item2))
Returns the following.
Before:
name | item1 | item2 |
---|---|---|
Cathy | c("Banana", "Apple") | "Apple" |
Mary | "Lemon" | c("Lemon", "Banana") |
Lisa | c("Lemon", "Banana") | c("Banana", "Apple") |
After:
name | item1 | item2 | items |
---|---|---|---|
Cathy | c("Banana", "Apple") | "Apple" | c("Banana", "Apple", "Apple") |
Mary | "Lemon" | c("Lemon", "Banana") | c("Lemon", "Lemon", "Banana") |
Lisa | c("Lemon", "Banana") | c("Banana", "Apple") | c("Lemon", "Banana", "Banana", "Apple") |
summarize(categories = list_concat(category, collapse = TRUE))
Returns the following.
Before:
grouped by "region"
region | category |
---|---|
East | c("Doctors", "Health & Medical") |
East | Nightlife |
East | c("Active Life", "Mini Golf", "Golf") |
West | c("Bars", "American (New)", "Nightlife") |
West | c("Bars", "American (Traditional)", "Nightlife") |
After:
region | categories |
---|---|
East | c("Doctors", "Health & Medical", "Nightlife", "Active Life", "Mini Golf", "Golf") |
West | c("Bars", "American (New)", "Nightlife", "Bars", "American (Traditional)", "Nightlife") |
Summary
Returns number of elements inside a list data type column for each row.
Syntax
list_n(<column_list>
)
Arguments
Return Value
Numeric
Example
mutate(count = list_n(categories)) Returns the following.
category | count |
---|---|
c("Doctors", "Health & Medical") | 2 |
Nightlife | 1 |
c("Active Life", "Mini Golf", "Golf") | 3 |
c("Bars", "American (New)", "Nightlife", "Lounges", "Restaurants") | 5 |
c("Bars", "American (Traditional)", "Nightlife", "Restaurants") | 4 |
Summary
Centers and/or scales the numeric values of a column.
Syntax
normalize(<column_num>
, center = <logical>
, scale = <logical>
)
Arguments
scale - When scale is TRUE and center is TRUE, values are standardized as (x - mean(x)) / sd(x). When scale is TRUE and center is FALSE, scaling is done by dividing by the root-mean-square, which is the same as calculating x / sqrt(sum(x^2)/(length(x)-1)). When scale is a numeric value, input values are divided by that numeric value. When scale is FALSE, no scaling is done.
Return Value
Numeric
Example
x <- c(1, 2, 3, 4, 5, 100)
normalize(x)
Returns -0.4584610 -0.4332246 -0.4079882 -0.3827518 -0.3575154 2.0399410. These are the same as the below.
(x - mean(x)) / sd(x)
normalize(x, center = FALSE)
Returns 0.02229944 0.04459888 0.06689832 0.08919776 0.11149720 2.22994404. This is the same as the below.
x / sqrt(sum(x^2)/(length(x)-1))
normalize(x, center = 0, scale = max(x))
Returns 0.01 0.02 0.03 0.04 0.05 1.00
Summary
Anonymize values by hashing algorithms.
Syntax
anonymize(<column>
, algo = <hash_algorithms>
, seed = <integer>
, chars = <character>
, n_chars = <integer>
)
Arguments
Example
Original data:
user | login |
---|---|
Lisa | 1482148710 |
Lisa | 1484148710 |
Emily | 1479913692 |
John | 1481128318 |
mutate(user = anonymize(user, algo = "murmur32")) Returns anonymized values.
user | login |
---|---|
6fe21ea2 | 1482148710 |
6fe21ea2 | 1484148710 |
2f968d4a | 1479913692 |
9b5f08f7 | 1481128318 |
Return Value
Character
Summary
Convert Excel numeric date/time values to POSIXct.
Syntax
excel_numeric_to_datetime(<column_num>
, tz = <timezone>
)
Arguments
Example
Original data:
user | last_login |
---|---|
Lisa | 42370.5 |
Emily | 42318.25 |
John | 42489.75 |
mutate(last_login = excel_numeric_to_datetime(last_login)) Returns data frame with last_login column as POSIXct.
user | last_login |
---|---|
Lisa | 2016-01-01 12:00:00 |
Emily | 2015-11-10 06:00:00 |
John | 2016-04-29 18:00:00 |
Return Value
POSIXct
Summary
Convert unix time numeric values to POSIXct.
Syntax
unixtime_to_datetime(<column_num>
)
Example
Original data:
user | last_login |
---|---|
Lisa | 1482148710 |
Emily | 1479913692 |
John | 1481128318 |
mutate(last_login = unixtime_to_datetime(last_login)) Returns data frame with last_login column as POSIXct.
user | last_login |
---|---|
Lisa | 2016-12-19 11:58:30 GMT |
Emily | 2016-11-23 15:08:12 GMT |
John | 2016-12-07 16:31:58 GMT |
Return Value
Date, POSIXct
Summary
Calculates coordinates by reducing dimensionality using SVD (Singular Value Decomposition).
Syntax
do_svd.kv(<subject_column>
, <key_column>
, <value_column>
, type = <svd_type>
, fill = <numeric>
, fun.aggregate = <aggregate_function>
, n_component = <integer>
, centering = <logical>
, output = <svd_output>
)
Arguments
Return Value
Data frame
Example
Original data:
author_name | keywords | tfidf_value |
---|---|---|
Cathy | party | 3.2 |
Cathy | exciting | 1.3 |
Cathy | friends | 0.8 |
Mary | study | 2.4 |
Mary | exam | 2.1 |
Mary | hard | 1.5 |
Lisa | sports | 1.9 |
Lisa | exciting | 1.8 |
Lisa | hard | 1.6 |
do_svd.kv(author_name, keywords, tfidf_value, n_component=2)
Reducing the number of the dimensions (keywords) to 2 and returning the result in a 'long' data form.
author_name | new.dimension | value |
---|---|---|
Cathy | 1 | -0.7131255 |
Cathy | 2 | 0.3976414 |
Lisa | 1 | 0.0121952 |
Lisa | 2 | -0.8164055 |
Mary | 1 | 0.7009303 |
Mary | 2 | 0.4187641 |
do_svd.kv(author_name, keywords, tfidf_value, n_component=2, output="wide")
Reducing the number of the dimensions (keywords) to 2 and returning the result in a 'wide' data form.
author_name | axis1 | axis2 |
---|---|---|
Cathy | -0.7131255 | 0.3976414 |
Lisa | 0.0121952 | -0.8164055 |
Mary | 0.7009303 | 0.4187641 |
do_svd.kv(author_name, keywords, tfidf_value, n_component=2, type="dimension")
Reducing the number of the dimensions (keywords) to 2 and returning the result by 'dimensions' instead of by 'group'.
top_key_words | new.dimension | value |
---|---|---|
exam | 1 | 0.4163953 |
exam | 2 | 0.3259924 |
exciting | 1 | -0.2560436 |
exciting | 2 | -0.3531243 |
friends | 1 | -0.1613866 |
friends | 2 | 0.1179235 |
hard | 1 | 0.3029449 |
hard | 2 | -0.2513704 |
party | 1 | -0.6455466 |
party | 2 | 0.4716940 |
sports | 1 | 0.0065547 |
sports | 2 | -0.5750138 |
study | 1 | 0.4758803 |
study | 2 | 0.3725628 |
do_svd.kv(author_name, keywords, tfidf_value, n_component=2, type="variance")
Reducing the number of the dimensions (keywords) to 2 and returning the 'variance' values for each dimension
new.dimension | value |
---|---|
1 | 3.534991 |
2 | 2.697623 |
Summary
Calculates coordinates by reducing dimensionality using SVD (Singular Value Decomposition).
Syntax
do_svd(<column(s)>
, skv = <character_vector>
, type = <svd_type>
, fill = <numeric>
, fun.aggregate = <aggregate_function>
, n_component = <integer>
, centering = <logical>
, output = <svd_output>
)
Arguments
Return Value
Data frame
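Example
A minimal sketch, assuming the same author_name / keywords / tfidf_value data shown for do_svd.kv; the skv argument names the subject, key, and value columns:
do_svd(skv = c("author_name", "keywords", "tfidf_value"), n_component = 2)
Returns the same result as do_svd.kv(author_name, keywords, tfidf_value, n_component = 2), reducing the keywords to 2 dimensions in a 'long' data form.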
Summary
Builds a linear regression model (lm) and stores it in a data frame.
Syntax
build_lm(<formula>
, test_rate = <numeric>
, subset = <numeric_vector>
, weights = <numeric_vector>
, na.action = <na_action_type>
, method = <lm_method_type>
, model = <logical>
, x = <logical>
, y = <logical>
, qr = <logical>
, singular.ok = <logical>
, contrasts = <list>
, offset = <numeric_vector>
)
Arguments
Return Value
Data frame
Example
Original data:
CARRIER | ARR_DELAY | DEP_DELAY | DISTANCE |
---|---|---|---|
UA | 10 | 7 | 300 |
UA | 20 | 30 | 1000 |
UA | -5 | 0 | 200 |
AA | 20 | 20 | 500 |
AA | -5 | 3 | 700 |
AA | 10 | 8 | 1000 |
%>% group_by(CARRIER)
%>% build_lm(ARR_DELAY ~ DEP_DELAY + DISTANCE, test_rate = 0.1)
Returns a data frame that stores a lm (Linear Regression) model based on a formula defined as 'Predict ARR_DELAY based on DEP_DELAY and DISTANCE.' It also returns a column of original data.
CARRIER | source_data | model | .test_index |
---|---|---|---|
UA | source dataframe | lm model | c(1) |
AA | source dataframe | lm model | c(2) |
You can use model_coef or model_stats function to get the summary information about the models. Also, you can use prediction function to predict with the data that is stored in the same data frame.
Summary
Calculates Survival Curve from survival time and survival status.
Syntax
do_survfit(<column_num>
, <column_num_logic>
, weights = <numeric_vector>
, subset = <numeric_vector>
, na.action = <na_action_type>
, id = <id_column>
, istate = <column>
, type = <survival_curve_type>
, error = <survfit_error_type>
, conf.type = <survfit_conf_int_type>
, conf.lower = <survfit_lower_conf_int_type>
, conf.int = <numeric>
, se.fit = <logical>
, influence = <logical>
)
Arguments
Return Value
Data frame
Example
Original data:
status | time |
---|---|
1 | 40 |
1 | 30 |
0 | 35 |
0 | 8 |
do_survfit(time, status) Returns a data frame as follows.
time | n_risk | n_event | n_censor | estimate | std_error | conf_high | conf_low |
---|---|---|---|---|---|---|---|
8 | 4 | 0 | 1 | 1.0000000 | 0.0000000 | 1 | 1.0000000 |
30 | 3 | 1 | 0 | 0.6666667 | 0.4082483 | 1 | 0.2995071 |
35 | 2 | 0 | 1 | 0.6666667 | 0.4082483 | 1 | 0.2995071 |
40 | 1 | 1 | 0 | 0.0000000 | Inf | NA | NA |
Summary
Simulates Survival Curve for specified cohort based on a survival model.
Syntax
prediction_survfit(newdata = <cohort_data_set>
, na.action = <na_action_type>
, type = <coxph_survival_curve_type>
, conf.type = <survfit_conf_int_type>
, conf.int = <numeric>
, se.fit = <logical>
, censor = <logical>
, start.time = <numeric>
)
Arguments
Return Value
Data frame
Example
Original data: model data frame.
prediction_survfit(newdata = data.frame(age = c(50, 60), sex = c(0, 1))) Returns a data frame as follows.
time | n_risk | n_event | n_censor | age_sex | estimate | std_error | conf_high | conf_low |
---|---|---|---|---|---|---|---|---|
8 | 4 | 0 | 1 | 50_0: | 1.0000000 | 0.0000000 | 1 | 1.0000000 |
30 | 3 | 1 | 0 | 50_0: | 0.6666667 | 0.4082483 | 1 | 0.2995071 |
35 | 2 | 0 | 1 | 50_0: | 0.6666667 | 0.4082483 | 1 | 0.2995071 |
40 | 1 | 1 | 0 | 50_0: | 0.0000000 | Inf | NA | NA |
Summary
Builds a logistic regression model and stores it in a data frame.
Syntax
build_lr(<formula>
, test_rate = <numeric>
, weights = <numeric_vector>
, subset = <numeric_vector>
, na.action = <na_action_type>
, start = <numeric_vector>
, etastart = <numeric_vector>
, mustart = <numeric_vector>
, offset = <numeric_vector>
, epsilon = <numeric>
, maxit = <integer>
, trace = <logical>
, model = <logical>
, method = <glm_method_type>
, contrasts = <list>
, x = <logical>
, y = <logical>
)
Arguments
Example
Original data:
CARRIER | IS_DELAYED | DEP_DELAY | DISTANCE |
---|---|---|---|
UA | TRUE | 7 | 300 |
UA | TRUE | 30 | 1000 |
UA | FALSE | 0 | 200 |
AA | TRUE | 20 | 500 |
AA | FALSE | 3 | 700 |
AA | TRUE | 8 | 1000 |
%>% group_by(CARRIER)
%>% build_lr(IS_DELAYED ~ DEP_DELAY + DISTANCE, test_rate = 0.1)
Returns a data frame that stores a GLM (logistic regression) model based on a formula defined as 'Predict IS_DELAYED based on DEP_DELAY and DISTANCE.'
CARRIER | source_data | model | .test_index |
---|---|---|---|
UA | source dataframe | glm model | c(1) |
AA | source dataframe | glm model | c(2) |
You can use model_coef or model_stats function to get the summary information about the models. Also, you can use prediction function to predict with the data that is stored in the same data frame.
Summary
Builds a generalized linear model (glm) and stores it in a data frame.
Syntax
build_glm(<formula>
, test_rate = <numeric>
, family = <glm_family_type>
, weights = <numeric_vector>
, subset = <numeric_vector>
, na.action = <na_action_type>
, start = <numeric_vector>
, etastart = <numeric_vector>
, mustart = <numeric_vector>
, offset = <numeric_vector>
, epsilon = <numeric>
, maxit = <integer>
, trace = <logical>
, model = <logical>
, method = <glm_method_type>
, contrasts = <list>
, x = <logical>
, y = <logical>
)
Arguments
Example
Original data:
CARRIER | ARR_DELAY | DEP_DELAY | DISTANCE |
---|---|---|---|
UA | 10 | 7 | 300 |
UA | 20 | 30 | 1000 |
UA | -5 | 0 | 200 |
AA | 20 | 20 | 500 |
AA | -5 | 3 | 700 |
AA | 10 | 8 | 1000 |
%>% group_by(CARRIER)
%>% build_glm(ARR_DELAY ~ DEP_DELAY + DISTANCE, test_rate = 0.1)
Returns a data frame that stores a GLM model based on a formula defined as 'Predict ARR_DELAY based on DEP_DELAY and DISTANCE.'
CARRIER | source_data | model | .test_index |
---|---|---|---|
UA | source dataframe | glm model | c(1) |
AA | source dataframe | glm model | c(2) |
You can use model_coef or model_stats function to get the summary information about the models. Also, you can use prediction function to predict with the data that is stored in the same data frame.
Summary
Builds a multinomial logistic regression model and stores it in a data frame.
Syntax
build_multinom(<formula>
, test_rate = <numeric>
, subset = <numeric_vector>
, weights = <numeric_vector>
, na.action = <na_action_type>
, Hess = <logical>
, summ = <multinom_summ_type>
, censored = <logical>
, model = <logical>
, contrasts = <list>
)
Arguments
Example
Original data:
CARRIER | ARR_DELAY | DEP_DELAY | DISTANCE |
---|---|---|---|
UA | 10 | 7 | 300 |
UA | 20 | 30 | 1000 |
DL | -5 | 0 | 200 |
AA | 20 | 20 | 500 |
AA | -5 | 3 | 700 |
AA | 10 | 8 | 1000 |
%>% build_multinom(CARRIER ~ ARR_DELAY + DEP_DELAY, test_rate = 0.1)
Returns a data frame that stores a multinom model based on a formula defined as 'Predict CARRIER based on ARR_DELAY and DEP_DELAY.'
source_data | model | .test_index |
---|---|---|
source dataframe | multinom model | c(1) |
You can use model_coef or model_stats function to get the summary information about the models. Also, you can use prediction function to predict with the data that is stored in the same data frame.
Summary
Builds Cox Proportional Hazard Model for survival analysis and stores it in a data frame.
Syntax
build_coxph(<formula>
, test_rate = <numeric>
, subset = <numeric_vector>
, weights = <numeric_vector>
, na.action = <na_action_type>
, init = <numeric>
, ties = <coxph_tie_type>
, singular.ok = <logical>
, model = <logical>
, x = <logical>
, y = <logical>
, tt = <function>
)
Arguments
Example
Original data:
time | status | age | sex |
---|---|---|---|
40 | 2 | 55 | female |
30 | 2 | 68 | male |
35 | 1 | 59 | male |
8 | 1 | 80 | female |
build_coxph(survival::Surv(time, status) ~ age + sex) Builds a Cox Proportional Hazard Model that estimates survival of the subjects based on age and sex. The resulting model is stored in returning data frame. You can use model_coef or model_stats function to get the summary information about the models. Also, you can use prediction_coxph function to predict with the data that is stored in the same data frame.
Summary
Returns a data frame with regression information about a model.
Syntax
prediction(data = "training"|"test"|"newdata"
, data_frame = <data_set>
)
Arguments * data (Optional) - The default is "training". Type of data to use in the prediction. * data_frame (Optional) - This works only when data = "newdata". A name of another data frame to apply the model.
Return Value
Data Frame
Example
Original data: model data frame.
prediction(data = "test")
Returns a data frame with predicted values.
Summary
Returns a data frame with binary classification information about a model.
Syntax
prediction_binary(<predicted_probability_column>
, <actual_value_column>
, data = "training"|"test"|"newdata"
, threshold = <binary_metrics>
, data_frame = <data_set>
)
Arguments
Return Value
Data Frame
Example
Original data: binary classification model data frame.
prediction_binary(data = "test", threshold = 0.2)
Returns a data frame with predicted values of binary classification.
Summary
Returns a data frame with predicted values of Cox Proportional Hazard Model.
Syntax
prediction_coxph(data = "training"|"test"
, type.predict = <coxph_predict_type>
, type.residuals = <coxph_residuals_type>
)
Arguments * data (Optional) - The default is "training". Type of data to use in the prediction. * type.predict (Optional) - The default is "lp". The type of prediction. This can be * "lp" * "risk" * "expected" * type.residuals (Optional) - The default is "martingale". The type of residuals. This can be * "martingale" * "deviance" * "score" * "schoenfeld" * "scaledsch" * "dfbeta" * "dfbetas"
Return Value
Data Frame
Example
Original data: coxph model data frame.
prediction_coxph(data = "test")
Returns a data frame with predicted values.
Summary
Returns a data frame with 'Parameter Estimates (Coefficients)' information about a model including the below.
Syntax
model_coef(conf_int = <conf_int_type>
, conf.level = <numeric>
)
Arguments
Return Value
Data Frame
Example
Original data: lm model or glm model
model_coef(conf_int = "default", conf.level = 0.90)
Returns a data frame with 'Parameter Estimates' information.
Summary
Returns a data frame with 'Summary of Fit' information about a model including the below.
Syntax
model_stats()
Example
Original data: lm model or glm model
model_stats()
Returns a data frame with 'Summary of Fit' information.
Return Value
Data Frame
Summary
Returns a data frame with anova test information about a model including the below.
Syntax
model_anova()
Return Value
Data Frame
Example
Original data: lm model or glm model
model_anova()
Returns a data frame with anova test information.
Summary
Returns a data frame with evaluation score of regression including the below.
Syntax
evaluate_regression(<predicted_value_column>
, <actual_value_column>
)
Arguments
Return Value
Data Frame
Example
Original data:
ARR_DELAY | predicted_value |
---|---|
10 | 7 |
20 | 30 |
-5 | 0 |
20 | 20 |
-5 | 3 |
10 | 8 |
evaluate_regression(predicted_value, ARR_DELAY)
Returns a data frame with evaluation scores of regression.
Summary
Returns a data frame with evaluation score of binary classification including the below.
Syntax
evaluate_binary(<predicted_probability_column>
, <actual_value_column>
, threshold = <binary_metrics>
)
Arguments
Return Value
Data Frame
Example
Original data:
IS_DELAYED | predicted_probability |
---|---|
TRUE | 0.82 |
TRUE | 0.67 |
FALSE | 0.34 |
TRUE | 0.98 |
FALSE | 0.19 |
TRUE | 0.26 |
evaluate_binary(predicted_probability, IS_DELAYED, threshold = "accuracy")
Returns a data frame with evaluation scores of binary classification.
Summary
Returns a data frame with evaluation score of multi classification including the below.
Syntax
evaluate_multi(<predicted_label_column>
, <actual_value_column>
)
Arguments
Return Value
Data Frame
Example
Original data:
CARRIER | predicted_label |
---|---|
AA | AA |
9E | 9E |
AS | AA |
9E | 9E |
AA | AA |
AS | FL |
evaluate_multi(predicted_label, CARRIER)
Returns a data frame with evaluation scores of multi classification.
micro_f_score | macro_f_score | accuracy | misclassification_rate |
---|---|---|---|
0.6666667 | 0.45 | 0.6666667 | 0.3333333 |
Summary
Returns coordinates of roc curve.
Syntax
do_roc(<predicted_probability_column>
, <actual_value_column>
)
Arguments
Return Value
Data Frame
Example
Original data:
IS_DELAYED | predicted_probability |
---|---|
TRUE | 0.82 |
TRUE | 0.67 |
FALSE | 0.34 |
TRUE | 0.98 |
FALSE | 0.19 |
TRUE | 0.26 |
do_roc(predicted_probability, IS_DELAYED)
Returns a data frame with true_positive_rate and false_positive_rate.
true_positive_rate | false_positive_rate |
---|---|
0.00 | 0.0 |
0.25 | 0.0 |
0.50 | 0.0 |
0.75 | 0.0 |
0.75 | 0.5 |
1.00 | 0.5 |
1.00 | 1.0 |
With this data, you can assign "false_positive_rate" to X axis and "true_positive_rate" to Y axis to draw ROC curve.
Summary
Builds a clustering model (k-means) from variable columns and returns the summary of the model or the augmented data depending on the parameter value.
Syntax
build_kmeans.cols(<column(s)>
, centers = <integer>
, fill = <numeric>
, fun.aggregate = <aggregate_function>
, seed = <integer>
, iter.max = <integer>
, nstart = <integer>
, algorithm = <kmeans_algorithms>
, trace = <logical>
, normalize_data = <logical>
, augment = <logical>
, keep.source = <logical>
)
Arguments
augment_kmeans(model, source.data)
Return Value
Data frame
Example
Original data:
location | 1900 | 1901 | 1902 |
---|---|---|---|
Als | 35.02899 | 24.02899 | 45.02899 |
Bea | 18.00000 | 10.00000 | 27.00000 |
Bor | 18.14896 | 10.64896 | 20.64896 |
build_kmeans.cols(-location, centers=2, augment=TRUE)
Builds a K-means clustering model selecting all the columns other than 'location' and setting 2 as the number of the clusters, and scoring the original data right away. If you want to evaluate the model set 'augment' to FALSE.
location | X1900 | X1901 | X1902 | .cluster |
---|---|---|---|---|
Als | 35.02899 | 24.02899 | 45.02899 | 2 |
Bea | 18.00000 | 10.00000 | 27.00000 | 1 |
Bor | 18.14896 | 10.64896 | 20.64896 | 1 |
Summary
Builds a clustering model (k-means) from key-value columns and store the model into a generated data frame or augment the original data with the clustered ID.
Syntax
build_kmeans.kv(<subject_column>
, <key_column>
, <value_column>
, centers = <integer>
, seed = <integer>
, iter.max = <integer>
, nstart = <integer>
, algorithm = <kmeans_algorithms>
, trace = <logical>
, normalize_data = <logical>
, augment = <logical>
, keep.source = <logical>
)
Arguments
augment_kmeans(model, source.data)
Return Value
Data frame
Example
Original data:
location | year | harvest_date |
---|---|---|
Als | 1900 | 35.02899 |
Als | 1901 | 24.02899 |
Als | 1902 | 45.02899 |
Bea | 1900 | 18.00000 |
Bea | 1901 | 10.00000 |
Bea | 1902 | 27.00000 |
Bor | 1900 | 18.14896 |
Bor | 1901 | 10.64896 |
Bor | 1902 | 20.64896 |
build_kmeans.kv(location, year, harvest_date, centers=2, augment=TRUE)
Builds a K-means clustering model selecting 'location' as the subject to cluster, 'year' as the dimension, and 'harvest_date' as the value, and setting 2 as the number of the clusters, and scoring the original data right away. If you want to evaluate the model set 'augment' to FALSE.
location | year | harvest_date | .cluster |
---|---|---|---|
Als | 1900 | 35.02899 | 2 |
Als | 1901 | 24.02899 | 2 |
Als | 1902 | 45.02899 | 2 |
Bea | 1900 | 18.00000 | 1 |
Bea | 1901 | 10.00000 | 1 |
Bea | 1902 | 27.00000 | 1 |
Bor | 1900 | 18.14896 | 1 |
Bor | 1901 | 10.64896 | 1 |
Bor | 1902 | 20.64896 | 1 |
Summary
Builds a clustering model (k-means). Stores the model into a generated data frame or the augmented data depending on the parameter value.
Syntax
build_kmeans(<column(s)>
, skv = <character_vector>
, fill = <numeric>
, fun.aggregate = <aggregate_function>
, centers = <integer>
, iter.max = <integer>
, nstart = <integer>
, algorithm = <kmeans_algorithms>
, trace = <logical>
, normalize_data = <logical>
, seed = <integer>
, augment = <logical>
, keep.source = <logical>
)
Arguments
augment_kmeans(model, source.data)
Return Value
Data frame
Example
Original data:
location | 1900 | 1901 | 1902 |
---|---|---|---|
Als | 35.02899 | 24.02899 | 45.02899 |
Bea | 18.00000 | 10.00000 | 27.00000 |
Bor | 18.14896 | 10.64896 | 20.64896 |
build_kmeans(-location, centers=2, augment=TRUE)
Builds a K-means clustering model selecting all the columns other than 'location' and setting 2 as the number of the clusters, and scoring the original data right away. If you want to evaluate the model set 'augment' to FALSE.
location | X1900 | X1901 | X1902 | .cluster |
---|---|---|---|---|
Als | 35.02899 | 24.02899 | 45.02899 | 2 |
Bea | 18.00000 | 10.00000 | 27.00000 | 1 |
Bor | 18.14896 | 10.64896 | 20.64896 | 1 |
Original data:
location | year | harvest_date |
---|---|---|
Als | 1900 | 35.02899 |
Als | 1901 | 24.02899 |
Als | 1902 | 45.02899 |
Bea | 1900 | 18.00000 |
Bea | 1901 | 10.00000 |
Bea | 1902 | 27.00000 |
Bor | 1900 | 18.14896 |
Bor | 1901 | 10.64896 |
Bor | 1902 | 20.64896 |
build_kmeans(skv = c("location", "year", "harvest_date"), centers = 2, augment = TRUE)
Builds a K-means clustering model selecting 'location' as the subject to cluster, 'year' as the dimension, and 'harvest_date' as the value, and setting 2 as the number of the clusters, and scoring the original data right away. If you want to evaluate the model set 'augment' to FALSE.
location | year | harvest_date | .cluster |
---|---|---|---|
Als | 1900 | 35.02899 | 2 |
Als | 1901 | 24.02899 | 2 |
Als | 1902 | 45.02899 | 2 |
Bea | 1900 | 18.00000 | 1 |
Bea | 1901 | 10.00000 | 1 |
Bea | 1902 | 27.00000 | 1 |
Bor | 1900 | 18.14896 | 1 |
Bor | 1901 | 10.64896 | 1 |
Bor | 1902 | 20.64896 | 1 |
Summary
Execute t-test, which checks the difference of means between groups.
Syntax
do_t.test(<value_column>
, <group_column>
, alternative = <alternative_type>
, mu = <integer>
, paired = <logical>
, var.equal = <logical>
, conf.level = <numeric>
, subset = <column_logical>
, na.action = <na_action_type>
)
Arguments
Return Value
Data frame
Example
Original data:
location | year | harvest_date |
---|---|---|
Als | 1900 | 35.02899 |
Als | 1901 | 24.02899 |
Als | 1902 | 45.02899 |
Bor | 1900 | 18.14896 |
Bor | 1901 | 10.64896 |
Bor | 1902 | 20.64896 |
do_t.test(harvest_date, location)
Returns a data frame with one row that has t-test result.
Summary
Execute F-test, which checks the differences of variances between groups.
Syntax
do_var.test(<value_column>
, <group_column>
, ratio = <numeric>
, alternative = <alternative_type>
, conf.level = <numeric>
, na.action = <na_action_type>
)
Arguments
Return Value
Data frame
Example
Original data:
location | year | harvest_date |
---|---|---|
Als | 1900 | 35.02899 |
Als | 1901 | 24.02899 |
Als | 1902 | 45.02899 |
Bor | 1900 | 18.14896 |
Bor | 1901 | 10.64896 |
Bor | 1902 | 20.64896 |
do_var.test(harvest_date, location)
Returns a data frame with one row that has F-test result.
Summary
Execute chi-squared contingency table tests and goodness-of-fit tests.
Syntax
do_chisq.test(<column(s)>
, correct = <logical>
, p = <column_num>
, rescale.p = <logical>
, simulate.p.value = <logical>
, B = <numeric>
)
Arguments
Return Value
Data frame
Example
Original data:
clarity | GIA | HRD | IGI |
---|---|---|---|
IF | 6 | 4 | 34 |
VS1 | 61 | 13 | 7 |
VS2 | 36 | 15 | 2 |
VVS1 | 15 | 23 | 14 |
VVS2 | 33 | 24 | 21 |
do_chisq.test(-clarity)
Returns a data frame with one row that has chi-squared test result.
statistic | p.value | parameter | method |
---|---|---|---|
112.7472 | 0 | 8 | Pearson's Chi-squared test |
Summary
Find rules about items that tend to occur together in transaction data.
Syntax
do_apriori(<subject_column>
, <key_column>
, minlen = <integer>
, maxlen = <integer>
, min_support = <numeric>
, max_support = <numeric>
, min_confidence = <numeric>
, lhs = <character_vector>
, rhs = <character_vector>
, max_basket_items = <numeric>
)
Arguments
Return Value
Data frame
Example
Original data:
item_name | transaction_id |
---|---|
banana | 1 |
apple | 1 |
carrot | 2 |
apple | 2 |
orange | 3 |
carrot | 3 |
apple | 4 |
carrot | 4 |
do_apriori(item_name, transaction_id)
Returns association rules about item names.
lhs | rhs | support | confidence | lift |
---|---|---|---|---|
| apple | 0.75 | 0.7500000 | 1.0000000 |
| carrot | 0.75 | 0.7500000 | 1.0000000 |
banana | apple | 0.25 | 1.0000000 | 1.3333333 |
orange | carrot | 0.25 | 1.0000000 | 1.3333333 |
apple | carrot | 0.50 | 0.6666667 | 0.8888889 |
carrot | apple | 0.50 | 0.6666667 | 0.8888889 |
do_apriori(item_name, transaction_id, minlen=2)
Returns association rules where the total number of item names in lhs and rhs are more than 2.
lhs | rhs | support | confidence | lift |
---|---|---|---|---|
banana | apple | 0.25 | 1.0000000 | 1.3333333 |
orange | carrot | 0.25 | 1.0000000 | 1.3333333 |
apple | carrot | 0.50 | 0.6666667 | 0.8888889 |
carrot | apple | 0.50 | 0.6666667 | 0.8888889 |
do_apriori(item_name, transaction_id, lhs=c( "apple", "orange"))
Returns association rules where apple or orange come to lhs.
lhs | rhs | support | confidence | lift |
---|---|---|---|---|
orange | carrot | 0.25 | 1.0000000 | 1.3333333 |
apple | carrot | 0.50 | 0.6666667 | 0.8888889 |
Summary
Detect anomalies in a time series data frame.
Syntax
do_anomaly_detection(<column_date>
, <column_num>
, direction = <anomaly_direction_type>
, e_value = <logical>
, max_anoms = <numeric>
, alpha = <numeric>
, only_last = <anomaly_last_type>
, threshold = <anomaly_threshold_type>
, longterm = <logical>
, piecewise_median_period_weeks = <integer>
)
Arguments
Return Value
Data frame
Example
Original data:
dateHour | newUsers |
---|---|
... | ... |
2017-01-20 09:00:00 | 4 |
2017-01-20 10:00:00 | 2 |
2017-01-20 11:00:00 | 5 |
2017-01-20 12:00:00 | 31 |
2017-01-20 13:00:00 | 3 |
2017-01-20 14:00:00 | 3 |
... | ... |
do_anomaly_detection(dateHour, newUsers, direction = "both", e_value = TRUE)
Returns data frame with columns about anomaly data.
dateHour | newUsers | pos_anomaly | pos_value | neg_anomaly | neg_value | expected_value |
---|---|---|---|---|---|---|
... | ... | ... | ... | ... | ... | ... |
2017-01-20 09:00:00 | 4 | FALSE | NA | FALSE | NA | 4 |
2017-01-20 10:00:00 | 2 | FALSE | NA | FALSE | NA | 2 |
2017-01-20 11:00:00 | 5 | FALSE | NA | FALSE | NA | 5 |
2017-01-20 12:00:00 | 31 | TRUE | 31 | FALSE | NA | 5 |
2017-01-20 13:00:00 | 3 | FALSE | NA | FALSE | NA | 3 |
2017-01-20 14:00:00 | 3 | FALSE | NA | FALSE | NA | 3 |
... | ... | ... | ... | ... | ... | ... |
Summary
Add forecast data to time series data frame.
Syntax
do_prophet(<column_date>
, <column_num>
, <numeric>
, time_unit = <prophet_time_unit>
, include_history = <logical>
, fun.aggregate = <aggregate_function>
, growth = "linear"|"logistic"
, seasonality.prior.scale = <numeric>
, yearly.seasonality = <logical>
, weekly.seasonality = <logical>
, n.changepoints = <numeric>
, changepoint.prior.scale = <numeric>
, changepoints = <list>
, holidays.prior.scale = <numeric>
, holidays = <data_set>
, mcmc.samples = <integer>
)
Arguments
Return Value
Data frame
Example
Original data:
date | pageviews |
---|---|
... | ... |
2017-03-19 | 2383 |
2017-03-20 | 4441 |
2017-03-21 | 3147 |
2017-03-22 | 456 |
... | ... |
do_prophet(date, pageviews, 2, time_unit = "day")
Returns data frame with forecasted rows and columns about trend and seasonality.
date | pageviews | forecasted_value | forcasted_value_high | forcasted_value_low | trend | trend_high | trend_low | seasonal | seasonal_low | seasonal_high | yearly | yearly_low | yearly_high | weekly | weekly_low | weekly_high |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2017-03-19 | 2383 | 3200.6539 | 4150.7993 | 2199.57 | 1131.882 | 1131.882 | 1131.882 | 2068.7719 | 2068.7719 | 2068.7719 | 1014.3165 | 1014.3165 | 1014.3165 | -177.1076 | -177.1076 | -177.1076 |
2017-03-20 | 4441 | 3401.953 | 4330.5388 | 2401.9095 | 1089.3266 | 1089.3266 | 1089.3266 | 2312.6264 | 2312.6264 | 2312.6264 | 1258.171 | 1258.171 | 1258.171 | -174.3324 | -174.3324 | -174.3324 |
2017-03-21 | 3147 | 2723.3344 | 3655.037 | 1841.2349 | 1046.7712 | 1046.7712 | 1046.7712 | 1676.5631 | 1676.5631 | 1676.5631 | 622.1077 | 622.1077 | 622.1077 | 15.2078 | 15.2078 | 15.2078 |
2017-03-22 | 456 | 1209.1918 | 2152.4326 | 298.3122 | 1004.2159 | 1004.2159 | 1004.2159 | 204.976 | 204.976 | 204.976 | -849.4794 | -849.4794 | -849.4794 | 143.4983 | 143.4983 | 143.4983 |
2017-03-23 | NA | 654.9152 | 1581.3289 | -276.3799 | 876.5498 | 876.5498 | 876.5497 | -221.6346 | -221.6346 | -221.6346 | -1276.09 | -1276.09 | -1276.09 | 107.7315 | 107.7315 | 107.7315 |
2017-03-24 | NA | 2090.3595 | 3049.9388 | 1133.799 | 833.9944 | 833.9944 | 833.9944 | 1256.3651 | 1256.3651 | 1256.3651 | 201.9097 | 201.9097 | 201.9097 | 77.2003 | 77.2003 | 77.2003 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Summary
Estimate impact of an event (advertisement, etc.) on a market, by using other markets to form a synthetic control.
Syntax
do_market_impact(<column_date>
, <column_num>
, <column_text>
, target_market = <text>
, time_unit = <market_impact_time_unit>
, fun.aggregate = <aggregate_function>
, event_time = <text>
, output_type = "series"|"model_stats"|"model_coef"|"predictor_market_candidates"
, na_fill_type = "spline"|"interpolate"|"previous"|"value"
, na_fill_value = <numeric>
, distance_weight = <numeric>
, alpha = <numeric>
, niter = <numeric>
, standardize.data = <logical>
, prior.level.sd = <numeric>
, nseasons = <numeric>
, season.duration = <numeric>
, dynamic.regression = <logical>
)
Arguments
Return Value
Data frame
Example
Original data:
date | pageviews | country |
---|---|---|
... | ... | ... |
2017-03-19 | 2383 | Japan |
2017-03-19 | 3625 | UK |
2017-03-19 | 3825 | US |
2017-03-20 | 4441 | Japan |
2017-03-20 | 4721 | UK |
2017-03-20 | 8367 | US |
2017-03-21 | 383 | Japan |
2017-03-21 | 2693 | UK |
2017-03-21 | 3147 | US |
... | ... | ... |
do_market_impact(date, pageviews, country, target_market = "Japan", event_time = "2017-03-20")
Returns data frame with pageviews of target market, synthetic control calculated from pageviews of other markets, and estimated impact of the event which happened only to the target market.
Output data frame columns:
Summary
Use a row as column names.
Syntax
row_as_header(row_index = <integer>
, prefix = <character>
, clean_names = <logical>
)
Arguments
Return Value
Data frame
Example
Original data:
V1 | V2 |
---|---|
NA | NA |
item.name | transaction.id |
banana | 1 |
apple | 1 |
carrot | 2 |
apple | 2 |
orange | 3 |
carrot | 3 |
apple | 4 |
carrot | 4 |
row_as_header(row_index = 2, prefix = "c_", clean_names = TRUE)
Returns a data frame using the 2nd row as column names, with the prefix added and dots replaced by underscores.
c_item_name | c_transaction_id |
---|---|
NA | NA |
banana | 1 |
apple | 1 |
carrot | 2 |
apple | 2 |
orange | 3 |
carrot | 3 |
apple | 4 |
carrot | 4 |
Summary
Pivot columns into rows and columns. Values are counts of the row/column pairs, or an aggregation of another column.
Syntax
pivot(<formula>
, value = <column>
, fill = <numeric>
, fun.aggregate = <character>
, na.rm = <logical>
)
Arguments
Return Value
Data frame
Example
Original data:
clarity | certification | colour | price |
---|---|---|---|
VVS2 | HRD | F | 10796 |
VS1 | GIA | F | 1551 |
VS1 | GIA | F | 4291 |
VVS2 | GIA | F | 5881 |
VS2 | GIA | H | 4585 |
VVS2 | GIA | H | 5193 |
pivot(clarity+colour~certification, value = price, fill=0, fun.aggregate = sum)
Returns data frame with rows of combinations of clarity and colour and with columns of certification. Values are sum of price and missing values are filled by 0.
clarity_colour | GIA | HRD |
---|---|---|
VS1_F | 5842 | 0 |
VS2_H | 4585 | 0 |
VVS2_F | 5881 | 10796 |
VVS2_H | 5193 | 0 |
Original data:
FL_DATE | ARR_DELAY | CARRIER |
---|---|---|
2018-10-01 | 10 | AA |
2018-10-02 | 20 | UA |
2018-10-03 | 30 | AA |
2018-10-04 | 40 | UA |
2018-11-01 | 20 | UA |
2018-11-02 | 10 | AA |
2018-11-03 | 20 | UA |
2018-11-04 | 30 | AA |
pivot(floor_date(FL_DATE, unit="month") ~ CARRIER, value = ARR_DELAY, fun.aggregate = mean)
Returns data frame with rows of FL_DATE (floored to month) and ARR_DELAY with columns of each CARRIER. Values are mean of ARR_DELAY.
FL_DATE | AA | UA |
---|---|---|
2018-10-01 | 20 | 30 |
2018-11-01 | 20 | 20 |
Summary
Calculates correlations for all the pairs of the variables (columns).
Syntax
do_cor.cols(<column(s)>
, use = <cor_na_operation>
, method = <cor_method>
, distinct = <logical>
, diag = <logical>
)
Arguments
Return Value
Data frame
Example
Original data:
year | Als | Bea | Bor |
---|---|---|---|
1900 | 35.02899 | 18 | 18.14896 |
1901 | 24.02899 | 10 | 10.64896 |
1902 | 45.02899 | 27 | 20.64896 |
do_cor.cols(-year)
pair.name.1 | pair.name.2 | value |
---|---|---|
Als | Bea | 0.9981135 |
Als | Bor | 0.9680283 |
Bea | Als | 0.9981135 |
Bea | Bor | 0.9508014 |
Bor | Als | 0.9680283 |
Bor | Bea | 0.9508014 |
Summary
Calculates correlations for all the pairs of subject columns.
Syntax
do_cor.kv(<subject_column>
, <key_column>
, <value_column>
, use = <cor_na_operation>
, method = <cor_method>
, distinct = <logical>
, diag = <logical>
, fill = <numeric>
, fun.aggregate = <aggregate_function>
)
Arguments
Return Value
Data frame
Example
Original data:
location | year | harvest_date |
---|---|---|
Als | 1900 | 35.02899 |
Als | 1901 | 24.02899 |
Als | 1902 | 45.02899 |
Bea | 1900 | 18.00000 |
Bea | 1901 | 10.00000 |
Bea | 1902 | 27.00000 |
Bor | 1900 | 18.14896 |
Bor | 1901 | 10.64896 |
Bor | 1902 | 20.64896 |
do_cor.kv(location, year, harvest_date)
location.x | location.y | value |
---|---|---|
Als | Bea | 0.9981135 |
Als | Bor | 0.9680283 |
Bea | Als | 0.9981135 |
Bea | Bor | 0.9508014 |
Bor | Als | 0.9680283 |
Bor | Bea | 0.9508014 |
Summary
Calculates correlations for all the pairs of the variables or subjects.
Syntax
do_cor(<column(s)>
, skv = <character_vector>
, use = <cor_na_operation>
, method = <cor_method>
, distinct = <logical>
, diag = <logical>
, fill = <numeric>
, fun.aggregate = <aggregate_function>
)
Arguments
Return Value
Data frame
Example
Original data:
year | Als | Bea | Bor |
---|---|---|---|
1900 | 35.02899 | 18 | 18.14896 |
1901 | 24.02899 | 10 | 10.64896 |
1902 | 45.02899 | 27 | 20.64896 |
do_cor(-year)
pair.name.1 | pair.name.2 | value |
---|---|---|
Als | Bea | 0.9981135 |
Als | Bor | 0.9680283 |
Bea | Als | 0.9981135 |
Bea | Bor | 0.9508014 |
Bor | Als | 0.9680283 |
Bor | Bea | 0.9508014 |
Original data:
location | year | harvest_date |
---|---|---|
Als | 1900 | 35.02899 |
Als | 1901 | 24.02899 |
Als | 1902 | 45.02899 |
Bea | 1900 | 18.00000 |
Bea | 1901 | 10.00000 |
Bea | 1902 | 27.00000 |
Bor | 1900 | 18.14896 |
Bor | 1901 | 10.64896 |
Bor | 1902 | 20.64896 |
do_cor( skv = c("location", "year", "harvest_date"))
location.x | location.y | value |
---|---|---|
Als | Bea | 0.9981135 |
Als | Bor | 0.9680283 |
Bea | Als | 0.9981135 |
Bea | Bor | 0.9508014 |
Bor | Als | 0.9680283 |
Bor | Bea | 0.9508014 |
Summary
Calculate the distances between each of the pairs.
Syntax
do_dist.kv(<subject_column>
, <key_column>
, <value_column>
, distinct = <logical>
, diag = <logical>
, method = <dist_method>
, fill = <numeric>
, fun.aggregate = <aggregate_function>
, p = <numeric>
)
Arguments
Return Value
Data frame
Example
Original data:
date | name | product | num |
---|---|---|---|
2015-06-12 | Lisa | orange | 5 |
2015-06-12 | Emily | apple | 3 |
2015-06-12 | John | carrot | 3 |
2015-06-13 | Emily | apple | 4 |
2015-06-13 | John | apple | 6 |
2015-06-16 | Lisa | orange | 2 |
2015-06-16 | John | carrot | 4 |
2015-06-16 | Emily | orange | 7 |
do_dist.kv(name, product, num)
This considers a matrix like the following. Duplicated entries are aggregated by mean by default, and you can change this with the fun.aggregate argument.
Emily | John | Lisa | |
---|---|---|---|
apple | 3.5 | 6.0 | 0.0 |
carrot | 0.0 | 3.5 | 0.0 |
orange | 7.0 | 0.0 | 3.5 |
Then calculate euclidean distances between pairs of each column.
name.x | name.y | value |
---|---|---|
Emily | John | 8.215838 |
Emily | Lisa | 4.949747 |
John | Emily | 8.215838 |
John | Lisa | 7.778175 |
Lisa | Emily | 4.949747 |
Lisa | John | 7.778175 |
do_dist.kv(name, product, num, fun.aggregate=sum, method="manhattan")
This considers a matrix like the following. Duplicated entries are aggregated by sum.
Emily | John | Lisa | |
---|---|---|---|
apple | 7 | 6 | 0 |
carrot | 0 | 7 | 0 |
orange | 7 | 0 | 7 |
Then calculate manhattan distances between pairs of each column.
name.x | name.y | value |
---|---|---|
Emily | John | 15 |
Emily | Lisa | 7 |
John | Emily | 15 |
John | Lisa | 20 |
Lisa | Emily | 7 |
Lisa | John | 20 |
Summary
Calculate distances of each of the pairs.
Syntax
do_dist.cols(<column(s)>
, distinct = <logical>
, diag = <logical>
, method = <dist_method>
, p = <numeric>
)
Arguments
Return Value
Data frame
Example
Original data:
year | Als | Bea | Bor |
---|---|---|---|
1900 | 35.02899 | 18 | 18.14896 |
1901 | 24.02899 | 10 | 10.64896 |
1902 | 45.02899 | 27 | 20.64896 |
do_dist.cols(-year)
pair.name.1 | pair.name.2 | value |
---|---|---|
Als | Bea | 28.492868 |
Als | Bor | 32.532238 |
Bea | Als | 28.492868 |
Bea | Bor | 6.385847 |
Bor | Als | 32.532238 |
Bor | Bea | 6.385847 |
Summary
Calculate distances of each of the pairs of the variables or subjects.
Syntax
do_dist(<column(s)>
, skv = <character_vector>
, distinct = <logical>
, diag = <logical>
, method = <dist_method>
, p = <numeric>
, fill = <numeric>
, fun.aggregate = <aggregate_function>
)
Arguments
Return Value
Data frame
Example
Original data:
year | Als | Bea | Bor |
---|---|---|---|
1900 | 35.02899 | 18 | 18.14896 |
1901 | 24.02899 | 10 | 10.64896 |
1902 | 45.02899 | 27 | 20.64896 |
do_dist(-year)
pair.name.1 | pair.name.2 | value |
---|---|---|
Als | Bea | 28.492868 |
Als | Bor | 32.532238 |
Bea | Als | 28.492868 |
Bea | Bor | 6.385847 |
Bor | Als | 32.532238 |
Bor | Bea | 6.385847 |
Original data:
date | name | product | num |
---|---|---|---|
2015-06-12 | Lisa | orange | 5 |
2015-06-12 | Emily | apple | 3 |
2015-06-12 | John | carrot | 3 |
2015-06-13 | Emily | apple | 4 |
2015-06-13 | John | apple | 6 |
2015-06-16 | Lisa | orange | 2 |
2015-06-16 | John | carrot | 4 |
2015-06-16 | Emily | orange | 7 |
do_dist(skv = c("name", "product", "num"))
This considers a matrix like the following. Duplicated entries are aggregated by mean by default, and you can change this with the fun.aggregate argument.
Emily | John | Lisa | |
---|---|---|---|
apple | 3.5 | 6.0 | 0.0 |
carrot | 0.0 | 3.5 | 0.0 |
orange | 7.0 | 0.0 | 3.5 |
Then calculate euclidean distances between pairs of each column.
name.x | name.y | value |
---|---|---|
Emily | John | 8.215838 |
Emily | Lisa | 4.949747 |
John | Emily | 8.215838 |
John | Lisa | 7.778175 |
Lisa | Emily | 4.949747 |
Lisa | John | 7.778175 |
do_dist(skv = c("name", "product", "num"), fun.aggregate=sum, method="manhattan")
This considers a matrix like the following. Duplicated entries are aggregated by sum.
Emily | John | Lisa | |
---|---|---|---|
apple | 7 | 6 | 0 |
carrot | 0 | 7 | 0 |
orange | 7 | 0 | 7 |
Then calculate manhattan distances between pairs of each column.
name.x | name.y | value |
---|---|---|
Emily | John | 15 |
Emily | Lisa | 7 |
John | Emily | 15 |
John | Lisa | 20 |
Lisa | Emily | 7 |
Lisa | John | 20 |
Summary
Cleans up text by removing escape characters (e.g. \n, \t), extra white spaces, extra periods, and leading and trailing spaces.
Syntax
str_clean(<column_text>
)
Arguments
Return Value
Character
Example
str_clean(" Exploratory ..io ")
returns "Exploratory.io".
Summary
Count patterns in texts.
Syntax
str_count_all(<column_text>
, patterns = <character_vector>
, remove.zero = <logical>
)
Arguments
Return Value
List of data frame
Example
str_count_all("I ate banana, apple, and peach yesterday, and banana, peach today.", patterns=c("apple", "banana"), remove.zero=TRUE)
Returns a list column of data frames with 'apple' and 'banana' columns.
Summary
Normalize text by replacing full-width alphabets, numbers, special characters with regular alphabets, numbers, special characters, replacing half-width Kana characters with full-width Kana characters, and applying other normalization rules. It follows the rule of Unicode Normalization Forms. This is a wrapper function around stringi::stri_trans_nfkc function.
Syntax
str_normalize(<column_text>
)
Arguments
Return Value
Character
Example
str_normalize("ABC123+ー=") returns "ABC123+-=".
Summary
Extract text inside a pair of the given begin and end symbols.
Syntax
str_extract_inside(<column_text>
, begin = <text>
, end = <text>
)
Arguments
Return Value
Character
Example
str_extract_inside(company, begin = "(", end = ")")
str_extract_inside(company, begin = "{", end = "}")
str_extract_inside(company, begin = "[", end = "]")
str_extract_inside(company, begin = "'", end = "'")
str_extract_inside(company, begin = '"', end = '"')
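For example, assuming a hypothetical company value of "Apple (California)", str_extract_inside(company, begin = "(", end = ")") would return "California".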
Summary
Convert a character or numeric data type column to a logical column.
If true_value argument is not provided, it treats "yes", "true", "1", and 1 as TRUE and "no", "false", "0", and 0 as FALSE.
Syntax
str_logical(<column_text>
, true_value = <text>
)
Arguments
Return Value
Logical
Example
Original data:
'Status' column is Character data type.
Status |
---|
yes |
no |
No |
NO |
YES |
Yes |
mutate(New_Status = str_logical(Status))
Result:
'New_status' column is Logical data type.
Status | New_Status |
---|---|
yes | TRUE |
no | FALSE |
No | FALSE |
NO | FALSE |
YES | TRUE |
Yes | TRUE |
Original data:
'Status' column is Character data type.
Status |
---|
TRUE |
FalSE |
True |
False |
NA |
tRUE |
FALSE |
mutate(New_Status = str_logical(Status))
Result:
'New_status' column is Logical data type.
Status | new_status |
---|---|
TRUE | TRUE |
FalSE | FALSE |
True | TRUE |
False | FALSE |
NA | NA |
tRUE | TRUE |
FALSE | FALSE |
Original data:
'Status' column is Character data type.
Status |
---|
1 |
0 |
1 |
0 |
NA |
1 |
0 |
mutate(New_Status = str_logical(Status))
Result:
'New_status' column is Logical data type.
Status | new_status |
---|---|
1 | TRUE |
0 | FALSE |
1 | TRUE |
0 | FALSE |
NA | NA |
1 | TRUE |
0 | FALSE |
Original data:
'Status' column is Character data type.
Status |
---|
Sign Up |
Not Yet |
Sign Up |
Not Yet |
NA |
sign Up |
Not yet |
mutate(New_Status = str_logical(Status, "Sign Up"))
Result:
'New_status' column is Logical data type.
Status | new_status |
---|---|
Sign Up | TRUE |
Not Yet | FALSE |
Sign Up | TRUE |
Not Yet | FALSE |
NA | NA |
sign Up | TRUE |
Not yet | FALSE |
Summary
Returns stop words like "a", "the", "and", etc.
Syntax
get_stopwords(lang = <language>
, include = <character_vector>
, exclude = <character_vector>
)
Arguments
Return Value
Character
Example
get_stopwords()
Return a list of the stop words like "a", "the", "and".
Summary
Returns sentiment types of positive or negative based on word(s).
Syntax
word_to_sentiment(<column_text>
, lexicon = <sentiment_lexicon>
)
Arguments
Return Value
If lexicon argument is "bing" (default), Character.
If lexicon argument is "AFINN", Numeric.
If lexicon argument is "nrc", List.
Example
word_to_sentiment("good")
Return "positive".
word_to_sentiment("bad", lexicon="AFINN")
Return -3.
word_to_sentiment("cry", lexicon="nrc")
Return c("negative" "sadness").
Summary
Returns the sentiment score for a text sentence. A positive score indicates a positive sentence, a negative score indicates the opposite, and 0 means 'neutral.'
Syntax
get_sentiment(<column_text>
)
Arguments
Return Value
Numeric
Example
get_sentiment("I'm happy")
Return 0.7071068.
get_sentiment("I'm sick.")
Return -0.7071068.
get_sentiment("I'm not so good.")
Return -0.5000000.
Summary
Stems words so that words that are almost the same but spelled slightly differently can be recognized as the same word.
Syntax
stem_word(<column_text>
, language = <stem_language>
)
Arguments
Return Value
Character
Example
stem_word(c("stand","stands", "cheerful", "cheering"))
Return c("stand","stand","cheer","cheer").
Summary
Returns TRUE if a word is included in the list of stop words defined by one of the dictionaries.
Syntax
is_stopword(<column_text>
, lang = <language>
, include = <character_vector>
, exclude = <character_vector>
)
Arguments
Return Value
Logical
Example
is_stopword(c("a", "and", "stopword", "the"))
Return c(TRUE, TRUE, FALSE, TRUE).
Summary
Returns TRUE if a text is empty string or NA.
Syntax
is_empty(<column_text>
)
Arguments
Return Value
Logical
Example
is_empty(c("", " ", NA, "The", "\n"))
Return c(TRUE, TRUE, TRUE, FALSE, TRUE).
Summary
Returns TRUE if a text contains only alphabetic characters.
Syntax
is_alphabet(<column_text>
)
Arguments
Return Value
Logical
Example
is_alphabet(c("1", "132", "32MB", "hello", "Hello World"))
Return c(FALSE, FALSE, FALSE, TRUE, FALSE).
This is expected to be used on tokens; if it is applied to sentences (which contain spaces), it returns FALSE.
Summary
Returns country names from IP addresses.
Syntax
ip_to_country(<column_text>
)
Arguments
Return Value
Character
Example
ip_to_country("133.43.96.45")
returns "Japan".
Summary
Returns domain (ex. "exploratory.io") from url.
Syntax
url_domain(<column_text>
)
Arguments
Return Value
Character
Example
url_domain("https://exploratory.io/login/")
returns "exploratory.io".
Summary
Returns fragment from url.
Syntax
url_fragment(<column_text>
)
Arguments
Return Value
Character
Example
url_fragment("https://exploratory.io/?debug=true#test")
returns "test".
Summary
Returns path from url.
Syntax
url_path(<column_text>
)
Arguments
Return Value
Character
Example
url_path("https://exploratory.io/reference/")
returns "reference/".
Summary
Returns port from url.
Syntax
url_port(<column_text>
)
Arguments
Return Value
Character
Example
url_port("https://exploratory.io:443")
returns "443".
Summary
Returns scheme (ex. "http", "https") from url.
Syntax
url_scheme(<column_text>
)
Arguments
Return Value
Character
Example
url_scheme("https://exploratory.io")
returns "https".
Summary
Returns suffix (ex. "com", "org") from url.
Syntax
url_suffix(<column_text>
)
Arguments
Return Value
Character
Example
url_suffix("https://exploratory.io")
returns "io".
url_suffix("http://sample.edu.co")
returns "edu.co".
Summary
Returns subdomain (ex. "www", "blog") from url.
Syntax
url_subdomain(<column_text>
)
Arguments
Return Value
Character
Example
url_subdomain("https://blog.exploratory.io")
returns "blog".
Summary
Returns top-level domain (ex. "com", "co") from url.
Syntax
url_tld(<column_text>
)
Arguments
Return Value
Character
Example
url_tld("http://sample.edu.co")
returns "co".
Summary
Returns decoded query parameter from url.
Syntax
url_param(<column_text>
, parameter_name = <character>
)
Arguments
* parameter_name - Name of the query parameter. If the url ends with "?user=ferret", the name is "user".
Return Value
Character
Example
url_param("http://example.com/?user=ferret", "user")
returns "ferret".
Summary
Map country names or codes to other codes or names (country names, continent names, etc.).
Syntax
countrycode(<column>
, origin = <origin_countrycode_type>
, destination = <destination_countrycode_type>
)
Arguments
Values for origin and destination arguments:
Return Value
Country name, id, etc.
Example
countrycode("U.S.", "country.name", "iso2c")
Return US.
countrycode("United States", "country.name", "iso2c")
Return US.
countrycode("US", "iso2c", "country.name")
Return USA.
countrycode("USA", "cowc", "country.name")
Return United States.
countrycode(2, "cown", "country.name")
Return United States.
Summary
This function takes a column that has US State information and returns either US State names, abbreviations, numeric codes, division, or region, based on 'output_type' argument setting. The US State information can be either US State names, US State abbreviations (FIPS / ANSI codes), or US State numeric code (FIPS) in any combination.
Syntax
statecode(<state_column>
, output_type = <statecode_type>
)
Arguments
Output Types:
Return Value
State name, State code, etc.
Example
statecode("CA", "name")
Return California.
statecode("CA", "num_code")
Return 06.
statecode("California", "num_code")
Return 06.
statecode("CA", "region")
Return West.
Summary
Generate US county code (FIPS - Federal Information Processing Standard) based on US State and County names.
Syntax
countycode(state = <state_column>
, county = <county_column>
)
Arguments
Return Value
County code
Example
countycode("California", "San Mateo")
Return 06081.
countycode("CA", "San Mateo")
Return 06081.
countycode("CA", "San Mateo County")
Return 06081.
Summary
Returns one token (e.g. word) per row after tokenizing a text.
Syntax
do_tokenize(<column_text>
, token = <token_type>
, keep_cols = <logical>
, to_lower = <logical>
, drop = <logical>
, output = <new_column_name>
, with_id = <logical>
, pattern = <pattern>
)
Arguments
Return Value
Data frame
Example
Original data:
index | text |
---|---|
First | It was a great party. |
Second | She has just left. She will be off tomorrow. |
do_tokenize(text)
index | document_id | sentence_id | token |
---|---|---|---|
First | 1 | 1 | it |
First | 1 | 1 | was |
First | 1 | 1 | a |
First | 1 | 1 | great |
First | 1 | 1 | party |
Second | 2 | 1 | she |
Second | 2 | 1 | has |
Second | 2 | 1 | just |
Second | 2 | 1 | left |
Second | 2 | 2 | she |
Second | 2 | 2 | will |
Second | 2 | 2 | be |
Second | 2 | 2 | off |
Second | 2 | 2 | tomorrow |
do_tokenize(text, token="sentences")
index | token |
---|---|
First | it was a great party. |
Second | she has just left. |
Second | she will be off tomorrow. |
Summary
Count pairs of words (tokens) that cooccur within a group
Syntax
pair_count(group = <column>
, value = <column>
, distinct = <logical>
, diag = <logical>
, sort = <logical>
)
Arguments
Return Value
Data frame
Example
pair_count(group=title, value=word, distinct=FALSE, diag = FALSE, sort=TRUE)
Return a data frame with word.x and word.y which is from "word" column and value that has count.
Summary
Calculates TF-IDF for each term against a group. TF-IDF is a weighting mechanism that calculates the importance of each word to each document by increasing the importance based on the term frequency while decreasing the importance based on the document frequency.
Syntax
do_tfidf(<document_column>
, <token_column>
, tf_weight = <tf_weight_type>
, idf_log_scale = <function>
, norm = "l2"|"l1"|FALSE
)
Arguments
* tf_weight - How to weight the term frequency, for example 1 + log(count of a term in a document).
* idf_log_scale - The log function used for IDF: log_scale_function((the total number of documents) / (the number of documents which have the token)). This measures how rare the token is in the set of documents. It might be worth trying log2 or log10; log2 increases the value more quickly and log10 increases it more slowly.
Return Value
Data frame
Example
Original data:
document_id | token |
---|---|
1 | this |
1 | is |
1 | what |
1 | it |
1 | is |
2 | which |
2 | is |
2 | better |
do_tfidf(document_id, token)
document_id | token | count_per_doc | count_of_docs | tfidf |
---|---|---|---|---|
1 | is | 2 | 2 | 0.0000000 |
1 | it | 1 | 1 | 0.5773503 |
1 | this | 1 | 1 | 0.5773503 |
1 | what | 1 | 1 | 0.5773503 |
2 | better | 1 | 1 | 0.7071068 |
2 | is | 1 | 2 | 0.0000000 |
2 | which | 1 | 1 | 0.7071068 |
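The tfidf values above can be roughly reproduced by hand, assuming raw term counts for tf, a natural-log IDF, and l2 normalization (the apparent defaults); this is just a sketch of the arithmetic, not the internal implementation.
tf  <- c(is = 2, it = 1, this = 1, what = 1)            # counts within document 1
idf <- log(2 / c(is = 2, it = 1, this = 1, what = 1))   # log(total docs / docs containing token)
raw <- tf * idf
raw / sqrt(sum(raw^2))   # is = 0, it/this/what = 0.5773503 after l2 normalization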
Summary
Creates columns of n-grams by connecting tokens within sentences.
Syntax
do_ngram(<token_column>
, <document_column>
, <sentence_column>
, maxn = <numeric>
, sep = <text>
)
Arguments
Return Value
Data frame
Example
Original data:
document_id | sentence_id | token |
---|---|---|
1 | 1 | it |
1 | 1 | is |
1 | 1 | good |
2 | 1 | she |
2 | 1 | left |
2 | 2 | she |
2 | 2 | will |
2 | 2 | come |
2 | 2 | tomorrow |
do_ngram(token, document_id, sentence_id, maxn=3)
document_id | sentence_id | gram | token |
---|---|---|---|
1 | 1 | 1 | it |
1 | 1 | 1 | is |
1 | 1 | 1 | good |
1 | 1 | 2 | it_is |
1 | 1 | 2 | is_good |
1 | 1 | 3 | it_is_good |
2 | 1 | 1 | she |
2 | 1 | 1 | left |
2 | 1 | 2 | she_left |
2 | 2 | 1 | she |
2 | 2 | 1 | will |
2 | 2 | 1 | come |
2 | 2 | 1 | tomorrow |
2 | 2 | 2 | she_will |
2 | 2 | 2 | will_come |
2 | 2 | 2 | come_tomorrow |
2 | 2 | 3 | she_will_come |
2 | 2 | 3 | will_come_tomorrow |
Summary
Calculates the similarity between each pair of the documents using the cosine similarity algorithm. Cosine similarity measures the cosine of the angle between two vectors in the multi-dimensional space.
Syntax
do_cosine_sim.kv(<subject_column>
, <key_column>
, <value_column>
, distinct = <logical>
, diag = <logical>
, fun.aggregate = <aggregate_function>
)
Arguments
Return Value
Data frame
Example
Original data:
document_id | token | value |
---|---|---|
Lisa | it | 0.8966972 |
Lisa | was | 0.2655087 |
Lisa | good | 0.3721239 |
Emily | she | 0.5728534 |
Emily | is | 0.9082078 |
Emily | nice | 0.2016819 |
John | she | 0.8983897 |
John | is | 0.9446753 |
John | good | 0.6607978 |
do_cosine_sim.kv(document_id, token, value)
document_id.x | document_id.y | value |
---|---|---|
Lisa | Emily | 0.0000000 |
Lisa | John | 0.1671574 |
Emily | Lisa | 0.0000000 |
Emily | John | 0.8595770 |
John | Lisa | 0.1671574 |
John | Emily | 0.8595770 |
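For reference, the Emily-John value above can be checked with the plain cosine formula (dot product divided by the product of the vector norms) over the shared token space; a sketch only.
emily <- c(she = 0.5728534, is = 0.9082078, nice = 0.2016819, good = 0)
john  <- c(she = 0.8983897, is = 0.9446753, nice = 0,         good = 0.6607978)
sum(emily * john) / (sqrt(sum(emily^2)) * sqrt(sum(john^2)))   # ~0.859577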
Summary
Executes multidimensional scaling (MDS). It calculates approximate coordinates from the distances of entity pairs.
Syntax
do_cmdscale(<name1_column>
, <name2_column>
, <value_column>
, k = <integer>
)
Arguments
Return Value
Data frame
Example
Original data:
name.x | name.y | value |
---|---|---|
cow | horse | 36 |
cow | turtle | 82 |
horse | turtle | 48 |
do_cmdscale(name.x, name.y, value)
name | V1 | V2 |
---|---|---|
cow | -38.915551 | -3.435417 |
horse | -4.163866 | 5.962475 |
turtle | 43.079417 | -2.527058 |
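A comparable result can be obtained with base R's cmdscale() on the same distances; note that MDS coordinates are only determined up to sign and rotation, so the exact numbers may differ from the table above.
d <- as.dist(matrix(c( 0, 36, 82,
                      36,  0, 48,
                      82, 48,  0), nrow = 3,
                    dimnames = list(c("cow", "horse", "turtle"),
                                    c("cow", "horse", "turtle"))))
cmdscale(d, k = 2)   # two approximate coordinates per animal, as in V1/V2 above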
Summary
Impute NA by average / median values, manually entered values, or predicted values. It will build a linear regression model for the prediction.
Syntax
impute_na(<target_column>
, type = <impute_na_type>
, val = <column>
, <column(s)>
)
Arguments
* column(s) - Columns used as predictors when type is "predict".
Example
Original data:
CARRIER | ARR_DELAY | DEP_DELAY | DISTANCE |
---|---|---|---|
UA | 10 | 7 | 300 |
UA | 20 | NA | 1000 |
UA | -5 | 0 | 200 |
AA | 20 | 20 | 500 |
AA | -5 | NA | 700 |
AA | 10 | 8 | 1000 |
mutate(DEP_DELAY = impute_na(DEP_DELAY, type = "predict", CARRIER, ARR_DELAY, DISTANCE))
NA in DEP_DELAY is filled with values predicted from the CARRIER, ARR_DELAY, and DISTANCE columns using linear regression.
CARRIER | ARR_DELAY | DEP_DELAY | DISTANCE |
---|---|---|---|
UA | 10 | 7 | 300 |
UA | 20 | 3.47 | 1000 |
UA | -5 | 0 | 200 |
AA | 20 | 20 | 500 |
AA | -5 | 3.59 | 700 |
AA | 10 | 8 | 1000 |
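The predicted values above can be reproduced with a plain linear model fit on the complete rows and predict() on the rows with NA; this is a sketch of the idea and not necessarily impute_na's exact internals.
df <- data.frame(
  CARRIER   = c("UA", "UA", "UA", "AA", "AA", "AA"),
  ARR_DELAY = c(10, 20, -5, 20, -5, 10),
  DEP_DELAY = c(7, NA, 0, 20, NA, 8),
  DISTANCE  = c(300, 1000, 200, 500, 700, 1000)
)
fit <- lm(DEP_DELAY ~ CARRIER + ARR_DELAY + DISTANCE, data = df)  # NA rows are dropped when fitting
missing <- is.na(df$DEP_DELAY)
df$DEP_DELAY[missing] <- predict(fit, newdata = df[missing, ])
df$DEP_DELAY   # approx. 7, 3.47, 0, 20, 3.59, 8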
mutate(DEP_DELAY = impute_na(DEP_DELAY, type = "value", val = 10))
NA is replaced by 10
CARRIER | ARR_DELAY | DEP_DELAY | DISTANCE |
---|---|---|---|
UA | 10 | 7 | 300 |
UA | 20 | 10 | 1000 |
UA | -5 | 0 | 200 |
AA | 20 | 20 | 500 |
AA | -5 | 10 | 700 |
AA | 10 | 8 | 1000 |
Summary
Detect outlier values and return 'upper' and 'lower' labels.
Syntax
detect_outlier(<column_num>
, type = <detect_outlier_type>
, threshold = <numeric>
)
Arguments
Example
Original data:
CARRIER | ARR_DELAY | DEP_DELAY | DISTANCE |
---|---|---|---|
UA | 10 | 7 | 800 |
UA | 20 | NA | 5000 |
UA | -5 | 0 | 600 |
AA | 20 | 20 | 500 |
AA | -5 | NA | 700 |
AA | 10 | 8 | 10 |
mutate(outlier = detect_outlier(DISTANCE, type = "iqr"))
Detect outliers by labelling the data as "upper", "normal" or "lower".
CARRIER | ARR_DELAY | DEP_DELAY | DISTANCE | outlier |
---|---|---|---|---|
UA | 10 | 7 | 800 | normal |
UA | 20 | NA | 5000 | upper |
UA | -5 | 0 | 600 | normal |
AA | 20 | 20 | 500 | normal |
AA | -5 | NA | 700 | normal |
AA | 10 | 8 | 10 | lower |
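The labels above are consistent with the standard IQR rule (values beyond 1.5 * IQR outside the quartiles are flagged); the sketch below illustrates that rule and is not necessarily the exact internal implementation.
distance <- c(800, 5000, 600, 500, 700, 10)
q <- quantile(distance, c(0.25, 0.75))
iqr <- q[2] - q[1]
dplyr::case_when(
  distance > q[2] + 1.5 * iqr ~ "upper",
  distance < q[1] - 1.5 * iqr ~ "lower",
  TRUE ~ "normal"
)   # "normal" "upper" "normal" "normal" "normal" "lower"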
Summary
Build K-means clustering model and label each row with cluster id.
Syntax
cluster(<column(s)>
, n_cluster = <integer>
)
Arguments
Example
Original data:
CARRIER | ARR_DELAY | DEP_DELAY | DISTANCE |
---|---|---|---|
UA | 10 | 7 | 800 |
UA | 20 | NA | 5000 |
UA | -5 | 0 | 600 |
AA | 20 | 20 | 500 |
AA | -5 | NA | 700 |
AA | 10 | 8 | 10 |
mutate(cluster = cluster(DEP_DELAY, DISTANCE))
Label each row with the cluster id (factor) based on DEP_DELAY and DISTANCE columns values.
CARRIER | ARR_DELAY | DEP_DELAY | DISTANCE | cluster |
---|---|---|---|---|
UA | 10 | 7 | 800 | 2 |
UA | 20 | NA | 5000 | NA |
UA | -5 | 0 | 600 | 1 |
AA | 20 | 20 | 500 | 1 |
AA | -5 | NA | 700 | NA |
AA | 10 | 8 | 10 | 3 |
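Conceptually this is similar to running base R's kmeans() on the (standardized) columns and carrying the cluster assignment back to the rows, with rows containing NA left unlabeled; the sketch below only illustrates the idea, and its cluster ids are arbitrary labels that may not match the example.
df <- data.frame(DEP_DELAY = c(7, NA, 0, 20, NA, 8),
                 DISTANCE  = c(800, 5000, 600, 500, 700, 10))
complete <- complete.cases(df)
set.seed(1)
km <- kmeans(scale(df[complete, ]), centers = 3)
cluster_id <- rep(NA_integer_, nrow(df))
cluster_id[complete] <- km$cluster   # rows with NA stay NA, as in the result above
cluster_id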
Summary
Creates a data frame that stores models built from the input data frame using the specified model function and arguments.
Syntax
build_model(model_func = <model_function>
, seed = <integer>
, test_rate = <numeric>
, <model_parameters>
)
Arguments
Example
Original data:
CARRIER | ARR_DELAY | DEP_DELAY | DISTANCE | WEIGHTS |
---|---|---|---|---|
UA | 10 | 7 | 300 | 0.8 |
UA | 20 | 30 | 1000 | 0.5 |
UA | -5 | 0 | 200 | 1.2 |
AA | 20 | 20 | 500 | 0.8 |
AA | -5 | 3 | 700 | 0.9 |
AA | 10 | 8 | 1000 | 0.5 |
%>% build_model(model_func = lme4::lmer, formula = ARR_DELAY ~ DEP_DELAY + (DISTANCE|CARRIER), test_rate = 0.1, weights = WEIGHTS)
Returns a data frame that stores a linear mixed-effects model. It also returns a column of original data.
source_data | model | .test_index |
---|---|---|
source dataframe | lme4 model | c(1) |
You can use the model_coef or model_stats function to get summary information about the models. You can also use the prediction function to predict with the data stored in the same data frame.
Summary
One-hot encodes a categorical column, producing a separate column for each categorical value, each of which has a value of 1 or 0 indicating whether the row has the value that the column represents.
Syntax
one_hot(<column>
)
Example
Original data:
ARR_DELAY | CARRIER |
---|---|
10 | UA |
20 | UA |
-5 | UA |
20 | AA |
-5 | AA |
10 | AA |
one_hot(CARRIER)
ARR_DELAY | CARRIER_UA | CARRIER_AA |
---|---|---|
10 | 1 | 0 |
20 | 1 | 0 |
-5 | 1 | 0 |
20 | 0 | 1 |
-5 | 0 | 1 |
10 | 0 | 1 |
Summary
Returns the most frequent value (mode) in a column.
Syntax
get_mode(<column>
, na.rm = <logical>
)
Arguments
* column - Column to get mode of
* na.rm (Optional) - The default is FALSE. Whether NA should be ignored even if it is the most frequent value.
Example
Original data:
ARR_DELAY | CARRIER |
---|---|
10 | UA |
20 | UA |
-5 | UA |
20 | AA |
-5 | AA |
summarize(ARR_DELAY_mode = get_mode(ARR_DELAY), CARRIER_mode = get_mode(CARRIER))
ARR_DELAY_mode | CARRIER_mode |
---|---|
20 | UA |
Summary
Summarize (or aggregate) column values with specified aggregate functions. If grouping columns are provided then summarize the values for each group.
Syntax
summarize_group(group_cols=<column_list>
, group_funs = <function(s)>
, ...)
Arguments
* group_cols - Columns to group by
* group_funs - Aggregate functions to apply to the group-by columns.
* ... - A pair of a newly created column name and an aggregation function with a column name. You can give multiple pairs to aggregate on multiple columns.
Example
Original data:
ARR_DELAY | CARRIER |
---|---|
10 | UA |
20 | UA |
30 | UA |
-20 | UA |
20 | AA |
-5 | AA |
-9 | AA |
summarize_group(group_cols=c("CARRIER"), group_funs=("none"), ARR_DELAY_mean = mean(ARR_DELAY))
CARRIER | ARR_DELAY_mean |
---|---|
UA | 10 |
AA | 2 |
Original data:
ARR_DELAY | DEP_DELAY | CARRIER |
---|---|---|
10 | 20 | UA |
20 | 15 | UA |
30 | -5 | UA |
-20 | 30 | UA |
20 | 10 | AA |
-5 | 20 | AA |
-9 | 30 | AA |
summarize_group(group_cols=c("CARRIER"), group_funs=("none"), ARR_DELAY_mean = mean(ARR_DELAY), DEP_DEALY_sum = sum(DEP_DELAY))
CARRIER | ARR_DELAY_mean | DEP_DELAY_sum |
---|---|---|
UA | 10 | 60 |
AA | 2 | 60 |
Original data:
FL_DATE | ARR_DELAY | DEP_DELAY |
---|---|---|
06/01/2018 | 10 | 20 |
06/12/2018 | 20 | 15 |
06/18/2018 | 30 | -5 |
06/21/2018 | -20 | 30 |
07/14/2018 | 20 | 10 |
07/16/2018 | -5 | 20 |
07/25/2018 | -9 | 30 |
summarize_group(group_cols=c("FL_DATE"), group_funs=("month"), ARR_DELAY_mean = mean(ARR_DELAY), DEP_DEALY_sum = sum(DEP_DELAY))
FL_DATE_month | ARR_DELAY_mean | DEP_DELAY_sum |
---|---|---|
6 | 10 | 60 |
7 | 2 | 60 |
Summary
Summarize (or aggregate) column values across columns for each row with the specified aggregate function.
Syntax
summarize_row(across(c(<column(s)>)), <aggregate_function>, ...)
summarize_row(across(where(<columns_select_function>)), <aggregate_function>, ...)
summarize_row(across(<select_helper>(<text>, ignore.case = <logical>)), <aggregate_function>, ...)
summarize_row(across(<select_helper>(<text>, <start_num>:<end_num>)), <aggregate_function>, ...)
Arguments
Example
mutate(Total = summarize_row(across(c(Sales_Consumer, Sales_Business, Sales_Government)), sum))
Sums up values from the specified columns for each row, and stores the sums in the Total column. If any of the column values are NA, the result will be NA too.
mutate(Total = summarize_row(across(c(Sales_Consumer, Sales_Business, Sales_Government)), sum, na.rm = TRUE))
Same as the above example, except that even if some of the column values are NA, the result will be calculated from the rest of the values.
mutate(Total = summarize_row(across(where(is.numeric)), sum, na.rm = TRUE))
Sums up values from all numeric columns for each row, and stores the sums in the Total column.
Summary
Create new columns or update existing columns with specified functions such as Window Calculations. If grouping columns are provided then it performs the calculation for each group.
Syntax
mutate_group(group_cols=<column_list>
, group_funs = <function(s)>
, sort_cols=<column_list>
, sort_funs = <function(s)>
, ...)
Arguments
Example
Original data:
ARR_DELAY | CARRIER |
---|---|
10 | UA |
20 | UA |
30 | UA |
20 | UA |
20 | AA |
5 | AA |
9 | AA |
mutate_group(group_cols=c("CARRIER"), group_funs=("none"), ARR_DELAY_cumsum = cumsum(ARR_DELAY))
CARRIER | ARR_DELAY_cumsum |
---|---|
UA | 10 |
UA | 30 |
UA | 60 |
UA | 80 |
AA | 20 |
AA | 25 |
AA | 34 |
Original data:
ARR_DELAY | DEP_DELAY | CARRIER |
---|---|---|
10 | 20 | UA |
20 | 15 | UA |
30 | -5 | UA |
-20 | 30 | UA |
20 | 10 | AA |
-5 | 20 | AA |
-9 | 30 | AA |
mutate_group(group_cols=c("CARRIER"), group_funs=("none"), ARR_DELAY_dfprev = ARR_DELAY - lag(ARR_DELAY), DEP_DELAY_dfprev = DEP_DELAY - lag(DEP_DELAY))
CARRIER | ARR_DELAY_dfprev | DEP_DELAY_dfprev |
---|---|---|
UA | NA | NA |
UA | 10 | -5 |
UA | 10 | -20 |
UA | -50 | 35 |
AA | NA | NA |
AA | -25 | 10 |
AA | -4 | 10 |
Summary
Returns the sum of all the values that satisfy given conditions.
Syntax
sum_if(<column_num_logic>
, <condition>
,... , na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
summarize(total = sum_if(revenue, product == 'PC' & country %in% c("Japan", "United States", "United Kingdom")))
Creates a new column with the sum of revenue for rows where product is 'PC' and country is either "Japan", "United States", or "United Kingdom", for each group.
summarize(total = sum_if(quantity, priority == 'P1' | country == "United States"))
Creates a new column with the sum of quantity for rows where priority is 'P1' or country is "United States".
Summary
Returns the ratio of the sum of all the values that satisfy given conditions to the sum of all the values.
Syntax
sum_if_ratio(<column_num_logic>
, <condition>
,... , na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
summarize(total = sum_if_ratio(revenue, product == 'PC' & country %in% c("Japan", "United States", "United Kingdom")))
Creates a new column with the ratio of the sum of revenue for rows where product is 'PC' and country is either "Japan", "United States", or "United Kingdom" to the total sum of revenue, for each group.
summarize(total = sum_if_ratio(quantity, priority == 'P1' | country == "United States"))
Creates a new column with the ratio of the sum of quantity for rows where priority is 'P1' or country is "United States" to the total sum of quantity.
Summary
Summarizes the data by counting the number of rows that satisfy given conditions.
Syntax
count_if(<column>
, <condition>
, ... , na.rm = <logical>
)
Arguments
Example
summarize(country_count = count_if(country, product == 'PC' | quantity > 10))
Returns the number of rows where product is 'PC' or quantity is greater than 10, for each group.
Summary
Returns the ratio of the number of rows that satisfy given conditions to the total number of rows.
Syntax
count_if_ratio(<column>
, <condition>
, ... , na.rm = <logical>
)
Arguments
Example
summarize(country_count_ratio = count_if_ratio(country, product == 'PC' | quantity > 10))
Returns the ratio of the number of rows where product is 'PC' or quantity is greater than 10 to the total number of rows, for each group.
Summary
Counts the number of the unique values of a column that come from rows that satisfy given conditions.
Syntax
count_unique_if(<column>
, <condition>
, ... , na.rm = <logical>
)
Arguments
Example
summarize(country_count = count_unique_if(country, product == 'PC' & quantity > 10)) returns the number of unique countries for rows where product is 'PC' and quantity is greater than 10 for each group.
Summary
Returns the ratio of the number of unique values of a column that come from rows satisfying given conditions to the total number of unique values.
Syntax
count_unique_if_ratio(<column>
, <condition>
, ... , na.rm = <logical>
)
Arguments
Example
summarize(country_count = count_unique_if_ratio(country, product == 'PC' & quantity > 10)) Returns the ratio of the number of unique countries for rows where a product is 'PC' and quantity is greater than 10 to the number of unique countries for each group.
Summary
Returns the numerical average (mean) of the values that satisfy given conditions.
Syntax
mean_if(<column_num_logic_date>
, <condition>
, ... , na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
summarize(revenue_mean = mean_if(revenue, product == 'PC' | quantity > 10))
Creates a new column with the average revenue for rows where product is 'PC' or quantity is greater than 10, for each group.
Summary
Returns the numerical average (mean) of the values that satisfy given conditions. This is an alias of mean_if.
Syntax
average_if(<column_num_logic_date>
, <condition>
, ... , na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
summarize(revenue_mean = average_if(revenue, product == 'PC' | quantity > 10))
Creates a new column with the average revenue for rows where product is 'PC' or quantity is greater than 10, for each group.
Summary
Returns the numerical median of the values that satisfy given conditions.
Syntax
median_if(<column_num_logic_date>
, <condition>
, ... , na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
summarize(revenue_median = median_if(revenue, product == 'PC' | quantity > 10))
Creates a new column with the median revenue for rows where product is 'PC' or quantity is greater than 10, for each group.
Summary
Returns the maximum value in a numeric column that satisfies given conditions.
Syntax
max_if(<column>
, <condition>
, ... , na.rm = <logical>
)
Arguments
Example
summarize(revenue_max = max_if(revenue, product == 'PC' | quantity > 10))
Creates a new column with the maximum revenue for rows where product is 'PC' or quantity is greater than 10, for each group.
Summary
Returns the minimum value in a numeric column that satisfies given conditions.
Syntax
min_if(<column>
, <condition>
, ..., na.rm = <logical>
)
Arguments
Example
summarize(revenue_min = min_if(revenue, product == 'PC' | quantity > 10))
Creates a new column with the minimum revenue for rows where product is 'PC' or quantity is greater than 10, for each group.
Summary
Returns the count of the rows for each group.
Syntax
count_rows()
Arguments
Return Value
Numeric
Example
summarize(total = count_rows())
Creates a new column for calculating the number of entries for each group.
Summary
Returns the count of unique values.
Syntax
count_unique(<column(s)>
, na.rm = <logical>
)
Arguments
Return Value
Numeric
Example
summarize(total = count_unique(TAIL_NUM))
Create a new column for calculating the count of unique values of TAIL_NUM for each group.
summarize(total = count_unique(STATE, CITY))
Create a new column for calculating the count of unique combinations of STATE and CITY for each group.
Summary
Returns either Weekday or Weekend based on the provided date column value.
Syntax
weekend(<column_date>
)
Arguments
* column - Date column to get Weekday or Weekend
Return Value
Factor (either Weekday or Weekend)
Example
weekend("2015-10-01")
returns Weekday
weekend("2015-10-02")
returns Weekend
Summary
Alias to is_jholiday. Returns TRUE if the provided date is a Japanese Holiday.
Syntax
is_japanese_holiday(<column_date>
)
Arguments
* column - Date column that you want to check for being a Japanese Holiday
Return Value
Logical (either TRUE or FALSE)
Example
is_japanese_holiday("2020-01-01")
returns TRUE
is_japanese_holiday("2020-01-05")
returns FALSE
Summary
Extract Week of Month. For example, if the given date is in the 1st week of the month, it returns 1.
Syntax
get_week_of_month(<column_date>
)
Return Value
Numeric
Example
get_week_of_month(as.Date("2019-06-01")) returns 1
get_week_of_month(as.Date("2019-06-25")) returns 5
Summary
Sample n rows from the data frame. This is the same as sample_n, except that it handles the case where the number of rows in the data is fewer than the specified n without throwing an error.
Syntax
sample_rows(<number>
, weight=<column_num>
, replace=<logical>
, seed=<number>
)
Arguments
Example
sample_rows(100)
Select 100 randomly selected rows.
sample_rows(100, weight = ARR_DELAY)
Select 100 randomly selected rows from the data with a weight on ARR_DELAY column.
Summary
Returns a confidence interval range (half-width of confidence interval) of mean from given numeric data.
Syntax
confint_mean(<column_num>
, level=<numeric>
)
Arguments
Return Value
Numeric
Example
Original data:
DEPARTMENT | MONTHLY_INCOME |
---|---|
Sales | 2200 |
Sales | 3000 |
Sales | 5000 |
Sales | : |
R&D | 4000 |
R&D | 3200 |
R&D | 2200 |
R&D | : |
confint = confint_mean(MONTHLY_INCOME)
DEPARTMENT | confint |
---|---|
Sales | 25.32468 |
R&D | 17.9226 |
Summary
Returns a confidence interval range (half-width of confidence interval) of TRUE ratio from given logical data.
Syntax
confint_ratio(<column_logical>
, level=<numeric>
)
Arguments
Return Value
Numeric
Example
Original data:
DEPARTMENT | ATTRITION |
---|---|
Sales | TRUE |
Sales | FALSE |
Sales | FALSE |
Sales | : |
R&D | FALSE |
R&D | TRUE |
R&D | FALSE |
R&D | : |
confint = confint_ratio(ATTRITION)
DEPARTMENT | confint |
---|---|
Sales | 0.02065466 |
R&D | 0.02683791 |
Summary
Returns a confidence interval range (half-width of confidence interval) of mean from a size and a standard deviation of sample data.
Syntax
calc_confint_mean(<column_num>
, <column_num>
, level=<numeric>
)
Arguments
Return Value
Numeric
Example
Original data:
DEPARTMENT | SD | NUMBER_OF_PEOPLE |
---|---|---|
Sales | 577.4946 | 2000 |
R&D | 288.8194 | 1000 |
confint = calc_confint_mean(SD, NUMBER_OF_PEOPLE)
DEPARTMENT | SD | NUMBER_OF_PEOPLE | confint |
---|---|---|---|
Sales | 577.4946 | 2000 | 25.32468 |
R&D | 288.8194 | 1000 | 17.9226 |
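The confint values above are consistent with the usual t-based half-width, qt(1 - (1 - level) / 2, n - 1) * sd / sqrt(n); the check below assumes the default level of 0.95 and is not necessarily the exact internal formula.
qt(0.975, 2000 - 1) * 577.4946 / sqrt(2000)   # ~25.32468 (Sales)
qt(0.975, 1000 - 1) * 288.8194 / sqrt(1000)   # ~17.9226  (R&D)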
Summary
Returns a confidence interval range (half-width of confidence interval) of ratio from a size and a TRUE ratio of sample data.
Syntax
calc_confint_ratio(<column_num>
, <column_num>
, level=<numeric>
)
Arguments
Return Value
Numeric
Example
Original data:
DEPARTMENT | ATTRITION_RATIO | NUMBER_OF_PEOPLE |
---|---|---|
Sales | 0.33 | 2000 |
R&D | 0.25 | 1000 |
confint = calc_confint_ratio(ATTRITION_RATIO, NUMBER_OF_PEOPLE)
DEPARTMENT | ATTRITION_RATIO | NUMBER_OF_PEOPLE | confint |
---|---|---|---|
Sales | 0.33 | 2000 | 0.02065466 |
R&D | 0.25 | 1000 | 0.02683791 |
Summary
Calculates the cumulative sum of decaying effects. It is the same as cumsum when r (the second argument) is 1.
Syntax
cumsum_decayed(<column_num>
, <numeric>
)
Arguments
Return Value Numeric
Example
Original data:
Ad_Unit |
---|
100 |
200 |
100 |
100 |
mutate(Effective_Ad_Unit = cumsum_decayed(Ad_Unit, 0.1))
Ad_Unit | Effective_Ad_Unit |
---|---|
100 | 100 |
200 | 210 |
100 | 121 |
100 | 112.1 |
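The example follows the recurrence effective[i] = value[i] + r * effective[i - 1]; a minimal sketch of that recurrence (not the internal implementation) is:
cumsum_decayed_sketch <- function(x, r) Reduce(function(acc, v) v + r * acc, x, accumulate = TRUE)
cumsum_decayed_sketch(c(100, 200, 100, 100), 0.1)   # 100.0 210.0 121.0 112.1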
Summary
Calculate period between two dates in years.
Syntax
years_between(<column_date>
, <column_date>
)
Arguments
Return Value
numeric
Example
years_between(ymd("2020-10-01"), ymd("2021-10-01"))
returns 1
Summary
Calculate period between two dates in months.
Syntax
months_between(<column_date>
, <column_date>
)
Arguments
Return Value
numeric
Example
months_between(ymd("2020-10-01"), ymd("2021-10-01"))
returns 12
Summary
Calculate period between two dates in weeks.
Syntax
weeks_between(<column_date>
, <column_date>
)
Arguments
Return Value
numeric
Example
weeks_between(ymd("2020-10-01"), ymd("2021-10-01"))
returns 52.14286
Summary
Calculate period between two dates in days.
Syntax
days_between(<column_date>
, <column_date>
)
Arguments
Return Value
numeric
Example
days_between(ymd("2020-10-01"), ymd("2021-10-01"))
returns 365
Summary
Calculate period between two dates in hours.
Syntax
hours_between(<column_date>
, <column_date>
)
Arguments
Return Value
numeric
Example
hours_between(ymd_hms("2020-10-01 05:00:00"), ymd_hms("2020-10-01 15:00:00"))
returns 10
Summary
Calculate period between two dates in minutes.
Syntax
minutes_between(<column_date>
, <column_date>
)
Arguments
Return Value
numeric
Example
minutes_between(ymd_hms("2020-10-01 05:00:00"), ymd_hms("2020-10-01 15:00:00"))
returns 600
Summary
Calculate period between two dates in seconds.
Syntax
seconds_between(<column_date>
, <column_date>
)
Arguments
Return Value
numeric
Example
seconds_between(ymd_hms("2020-10-01 05:00:00"), ymd_hms("2020-10-01 15:00:00"))
returns 36000
Summary
Returns the last date of the specified period (e.g., month) that the original date belongs to.
Syntax
last_date(<column_date>
, unit = "week"|"month"|"quarter"|"year"
, previous = <logical>
, week_start = <number>
)
Arguments
Return Value
Date
Example
last_date(ymd("2022-10-01"), unit = "month")
returns "2022-10-31"
Summary
Returns the values of a time series shifted by the specified lag time.
Syntax
ts_lag(<column_date>
, <column_num>
, unit = "day"|"week"|"month"|"quarter"|"year"
, n = <integer>
, na_fill_type = "previous"|"next"|"none"
)
Arguments
Return Value
numeric - The values with the time lag.
Example
ts_lag(c(1,2,3), ymd(c("2020-01-01","2021-01-01","2022-01-01")), unit = "year", n = 1)
returns c(NA, 1, 2)
Summary
Returns the differences in values in time series data since the specified lag time ago.
Syntax
ts_diff(<column_date>
, <column_num>
, unit = "day"|"week"|"month"|"quarter"|"year"
, n = <integer>
, na_fill_type = "previous"|"next"|"none"
)
Arguments
Return Value
numeric
Example
ts_diff(c(1,2,3), ymd(c("2020-01-01","2021-01-01","2022-01-01")), unit = "year", n = 1)
returns c(NA, 1, 1)
Summary
Returns the differences of the values, as ratios relative to the base values from the specified lag time ago, in time series data.
Syntax
ts_diff_ratio(<column_date>
, <column_num>
, unit = "day"|"week"|"month"|"quarter"|"year"
, n = <integer>
, na_fill_type = "previous"|"next"|"none"
)
Arguments
Return Value
numeric
Example
ts_diff_ratio(c(1,2,3), ymd(c("2020-01-01","2021-01-01","2022-01-01")), unit = "year", n = 1)
returns c(NA, 1, 0.5)
Summary
Returns Likert's sigma values given a raw data column of survey answers with selected levels such as 1="Strongly disagree", 2="Disagree", 3="Neutral", 4="Agree", 5="Strongly agree".
Syntax
likert_sigma(<column_num>
)
Arguments
Return Value
Numeric
Example
Original data:
NAME | LIKES_ICE_CREAM |
---|---|
Joe | 5 |
Sally | 4 |
Bob | 2 |
John | 5 |
sigma = likert_sigma(LIKES_ICE_CREAM)
NAME | LIKES_ICE_CREAM | sigma |
---|---|---|
Joe | 5 | 0.7978846 |
Sally | 4 | -0.3246628 |
Bob | 2 | -1.2711063 |
John | 5 | 0.7978846 |
Summary
Logistic function.
Syntax
logistic(<column_num>
)
Return Value
Numeric
Example
mutate(Y = logistic(X))
Create a new column Y with the result of logistic function applied to column X.
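Presumably this is the standard logistic (sigmoid) function, 1 / (1 + exp(-x)); the sketch below shows that formula for reference rather than the guaranteed implementation.
logistic_sketch <- function(x) 1 / (1 + exp(-x))
logistic_sketch(c(-2, 0, 2))   # 0.1192029 0.5000000 0.8807971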
Summary
Returns TRUE if the provided date is a Japanese Holiday.
Syntax
is_jholiday(<column_date>
)
Arguments
Return Value
Logical (either TRUE or FALSE)
Example
is_jholiday("2020-01-01")
returns TRUE
is_jholiday("2020-01-05")
returns FALSE