Utilities

Range Joins

range_join performs inequality-based joins between DataFrames, useful when matching on date ranges or other interval conditions. Conditions are specified with the Conditions struct.

WRDSMerger.ConditionsType
Conditions(fun::Function, l::Union{Symbol,String}, r::Union{Symbol,String})
Conditions(l, fun, r)

A condition for use with range_join. Specifies a comparison function fun (e.g., <=, >=) applied between column l from the left DataFrame and column r from the right DataFrame.

source
WRDSMerger.range_joinFunction
function range_join(
    df1::DataFrame,
    df2::DataFrame,
    on,
    conditions::Array{Conditions};
    minimize=nothing,
    join_conditions::Union{Array{Symbol}, Symbol}=:and,
    validate::Tuple{Bool, Bool}=(false, false),
    jointype::Symbol=:inner
)

Joins the dataframes based on a series of conditions, designed to work with ranges

Arguments

  • df1::DataFrame: left DataFrame
  • df2::DataFrame: right DataFrame
  • on: either array of column names or matched pairs
  • conditions::Array{Conditions}: array of Conditions, which specifies the function (<=, >, etc.), left and right columns
  • jointype::Symbol=:inner: type of join, options are :inner, :outer, :left, and :right
  • minimize: either nothing or an array of column names or matched pairs, minimization will take place in order
  • join_conditions::Union{Array{Symbol}, Symbol}: defaults to :and, otherwise an array of symbols that is 1 less than the length of conditions that the joins will happen in (:or or :and)
  • validate::Tuple{Bool, Bool}: Whether to validate, this works differently than the equivalent in DataFrames joins, here, validate insures that a single row from the dataframe is not duplicated. So validate=(true, false) means that there are no duplicated rows from the left dataframe in the result.

Example

df1 = DataFrame(
    firm=1:10,
    date=Date.(2013, 1:10, 1:10)
)

df2 = df1[:, :]
df2[!, :date_low] = df2.date .- Day(2)
df2[!, :date_high] = df2.date .+ Day(2)
select!(df2, Not(:date))

range_join(
    df1,
    df2,
    [:firm],
    [
        Conditions(<, :date, :date_high),
        Conditions(>, :date, :date_low)
    ],
    join_conditions=[:and]
)
source

Internal Helpers

WRDSMerger.modify_col!Function

modify_col! tries to identify the real type of a column, especially for strings that are actually dates or floats that are actually integers. Almost all data downloaded from WRDS correctly specifies dates, but all numbers are stored as float8, even items like year which should be an integer. This uses multiple dispatch to check if all elements in a given column are compatible with changing type and then changes the type of the column. Note that for strings this function does not by default try to convert integer like strings to integer (such as GVKey), it only converts strings that look like a date, datetime, or time.

source