Linking Internals

Underlying Methodology of `convert_identifier`

WRDSMerger.convert_identifier — Function

convert_identifier(::Type{ID}, x::T1, dt::Date; vargs...) where {ID<:AbstractIdentifier, T1<:AbstractIdentifier}

convert_identifier(
    ::Type{ID},
    x::T1,
    dt::Date,
    data::Dict{T1, Vector{<:AbstractLinkPair{T1, ID}}}=data;
    vargs...
) where {ID<:AbstractIdentifier, T1<:AbstractIdentifier}

Converts an identifier (T1) to a different identifier (ID). In its most generic form, this throws a MethodError implying there is not a function that exists to directly link T1 -> ID.

Calling new_link_method calls a macro to create different versions of convert_identifier to provide links between identifiers. If these are direct links (which means there is a AbstractLinkPair that links the two identifiers, such as Permno -> Permco or Permno -> NCusip), then this provides a one step method to link these two identifiers.

For other identifiers, there is not a link table that provides a direct link (such as SecID -> GVKey). In those cases, new_link_method will find a path between the two (in the case of SecID -> GVKey, SecID -> NCusip -> Permno -> GVKey). Each case of convert_identifier only does 1 step in the process, so convert_identifier(GVKey, SecID(1), today()) would call convert_identifier(Permno, SecID(1), today()).

Note

There are two slightly different behaviors for the direct links of convert_identifier. When linking a SecurityIdentifier -> FirmIdentifier, the function might retry if a link is not found with the parent identifier of the security. For example, when trying to link NCusip -> GVKey, the default behavior is to try NCusip -> Permno -> GVKey. However, suppose there is not a matching NCusip -> Permno, the function will try again with NCusip6 -> Permno. The logic is that it should not matter if the Permno does not perfectly match the NCusip if the end goal is to find a relevant GVKey. This behavior can be disabled by using allow_parent_firm=false.

LinkPairs

This packages primary storage method for links is an AbstractLinkPair, which is typically a LinkPair:

WRDSMerger.LinkPair — Type

function LinkPair(
    parent::T1,
    child::T2,
    dt1::Union{Missing, Date, String}=Date(0, 1, 1),
    dt2::Union{Missing, Date, String}=Date(9999, 12, 31),
    priority::Real=0.0
) where {T1<:AbstractIdentifier, T2<:AbstractIdentifier}

function LinkPair(
    parent::T1,
    child::T2,
    dt1::Union{Missing, Date, String},
    dt2::Union{Missing, Date, String},
    linkprim::String,
    linktype::String,
) where {T1<:Union{GVKey, Permno, Permco}, T2<:Union{GVKey, Permno, Permco}}

LinkPair is the basic structure that provides a link between two identifiers. These are defined as a single direction link (T1 -> T2) that is valid between a specific date range (inclusive) and has a given priority (higher is better). Priority is useful if there are overlapping T1 -> T2 items. For example, a FirmIdentifier likely has multiple SecurityIdentifiers that relate to it. One common way to pick between different SecurityIdentifiers is to pick the one with the large market cap as the primary.

If defining a new identifier that has other methods of choosing priorities (such as a String indicating priority), it can help to define a function that converts these strings into a number. An example of this exists for linking GVKey -> Permno or Permco (and the reverse), which take in linkprim and linktype and convert those to the appropriate priority.

Conceptually, a LinkPair provides a one direction link between T1 -> T2. These are typically stored in a dictionary for quick lookup:

Dict{T1, Vector{LinkPair{T1, T2}}}()

and this package adds a function to Dict to convert an abstract vector of LinkPairs to such a dictionary:

Base.Dict — Type

Dict(data::AbstractVector{L}) where {T1, T2, L<:AbstractLinkPair{T1, T2}}

Converts a vector of AbstractLinkPairs to a dictionary where each T1 is a key in the dictionary and the values are vectors of L. It also checks whether those vectors ever have overlapping inconsistent priorities.

Defining New `AbstractLinkPair`s

While this package currently makes use of LinkPair, it might be easier for other identifiers to define a more complex AbstractLinkPair. A key component of an AbstractLinkPair is being able to compare when one link should be used as opposed to another, which this package refers to as priority. In some cases, there might be multiple values that determine a priority. For example, the link between GVKey <-> Permno has two columns, depending on the direction (e.g., going from Permno -> GVKey, "LC" > "LU" > "LS"...). This package converts these into a single number with decimals ("LC" = 8, "LU" = 7... and the other column, "P" = 0.3, "C" = 0.2 ..., added together). This is switched when defining the link between GVKey -> Permno ("LC" = 0.8, "P" = 3). An alternative way to define this would be to create a separate AbstractLinkPair type that would avoid adding and, perhaps, be clearer on methodology. For example, something like:

struct CrspCompLinkPair{T1<:AbstractIdentifier, T2<:AbstractIdentifier} <: AbstractLinkPair{T1, T2}
    parent::T1
    child::T2
    dt1::Date# first date valid
    dt2::Date# last date valid
    comp_crsp_priority::Int
    crsp_comp_priority::Int
    function CrspCompLinkPair(
        t1::T1,
        t2::T2,
        dt1::Date,
        dt2::Date,
        linktype::AbstractString,
        linkprim::AbstractString
    ) where {T1, T2}
        comp_crsp_priority = if linkprim == "P"
            3
        elseif linkprim == "C"
            2
        elseif linkprim == "J"
            1
        end
        crsp_comp_priority = if linktype == "LC"
            8
        elseif linktype == "LU"
            7
        elseif linktype == "LS"
            6
        elseif linktype == "LX"
            5
        elseif linktype == "LD"
            4
        elseif linktype == "LN"
            3
        elseif linktype == "NR"
            2
        elseif linktype == "NU"
            1
        end
        new{T1, T2}(t1, t2, dt1, dt2, comp_crsp_priority, crsp_comp_priority)
    end
end

While most of the default functions for AbstractLinkPair would work with this new type (parentID, childID, min_date, max_date, Base.in), the one that does not is priority, which determines which AbstractLinkPair is preferable. Since the direction of the link matters, two new priority functions are required:

function WRDSMerger.priority(data::CrspCompLinkPair{GVKey, T2}) where {T2<:AbstractIdentifier}
    data.comp_crsp_priority + data.crsp_comp_priority / 10
end

function WRDSMerger.priority(data::CrspCompLinkPair{T1, GVKey}) where {T1<:AbstractIdentifier}
    data.crsp_comp_priority + data.comp_crsp_priority / 10
end

While this case is not used by default in this package, following similar methodology could allow for more complex priority structures.

Linking Download Functions

WRDSMerger.download_crsp_links — Function

download_crsp_links(db; main_table="crsp.stocknames", stockfile="crsp.dsf")

Runs the following SQL code (tables are changeable by setting the main_table and stockfile keyword arguments):

select a.*, b.mkt_cap from crsp.stocknames a
        left join (
            select s.permno, s.namedt, s.nameenddt, avg(d.mkt_cap) as mkt_cap from crsp.stocknames s
                inner join (select permno, date, abs(prc) * shrout as mkt_cap from crsp.dsf) as d
                on s.permno = d.permno and s.namedt <= d.date and s.nameenddt >= d.date
            group by s.permno, s.namedt, s.nameenddt
            ) b
            on a.permno = b.permno and a.namedt = b.namedt and a.nameenddt = b.nameenddt

and returns a DataFrame.

WRDSMerger.download_comp_crsp_links — Function

download_comp_crsp_links(db; main_table="crsp_a_ccm.ccmxpf_linkhist")

Runs the following SQL code (table is changeable by setting the main_table keyword argument):

SELECT * FROM crsp_a_ccm.ccmxpf_linkhist

and returns the resulting DataFrame

WRDSMerger.download_comp_cik_links — Function

download_comp_cik_links(db; main_table="comp.company")

Runs the following SQL code (table is changeable by setting the main_table keyword argument):

SELECT * FROM comp.company

and returns the resulting DataFrame

WRDSMerger.download_ibes_links — Function

download_ibes_links(db; main_table="wrdsapps.ibcrsphist")

Runs the following SQL code (table is changeable by setting the main_table keyword argument):

SELECT * FROM wrdsapps.ibcrsphist

and returns the resulting DataFrame

WRDSMerger.download_option_crsp_links — Function

download_option_crsp_links(db; main_table="optionm_all.secnmd")

Runs the following SQL code (table is changeable by setting the main_table keyword argument):

SELECT * FROM optionm_all.secnmd

and returns the resulting DataFrame

WRDSMerger.download_ravenpack_links — Function

download_ravenpack_links(db; main_table="ravenpack.rp_entity_mapping", cusip_list="crsp.stocknames")

Runs the following SQL code (tables are changeable by setting the main_table and cusip_list keyword arguments):

SELECT rp_entity_id, data_value as ncusip, range_start, range_end FROM ravenpack.rp_entity_mapping as a
            inner join (select distinct ncusip from crsp.stocknames) as b
            on left(a.data_value, 8) = b.ncusip

and returns a DataFrame.

Changing The Priority for Permno

A single company can have many securities, therefore, there might be multiple options when linking these items. For example, a single GVKey or Permco will match to multiple Permnos. In some tables in WRDS (such as in the case of GVKey <-> Permno), there are explicit primary identifier markers provided, improving the match. In others, there are not (as in Permco <-> Permno). This is a particular problem for Permno since this package prioritizes matches through Permno (as discussed in Supremacy of Permno).

The most common method to resolve these matches is to find the Permno that has the largest market capitalization on the day of the match since that should be the primary identifier. This is difficult to do in a package like this where the values are, ideally, predetermined. Therefore, the default behavior is to average the market capitalization over the period of the link and choose the higher average market capitalization. This behavior is convenient (requiring only a single SQL download), but potentially inconsistent with the end goal. Specifically, if one link has a lower average market capitalization (perhaps due to a long time window where the value was lower) than another link, this package might pick the Permno with a smaller market capitalization on the day of the match.

This is a proposed alternative that makes use of the AbnormalReturns.jl package to provide a quick lookup of the market capitalization just before the link:

First, stock price data is required:

using WRDSMerger, DataFramesMeta, AbnormalReturns
df = raw_sql(wrds_conn, "SELECT permno, date, abs(prc) * shrout AS mkt_cap FROM crsp.dsf")
@rtransform!(df, :mkt_cap = coalesce(:mkt_cap, 0.0))

Note

It is recommended to provide some filter on the WRDS download as the crsp.dsf file has over 100 million rows, downloading this data takes a lot of ram, peaking at ~20 GB. Most obviously, selecting dates beyond a certain point helps a lot.

AbnormalReturns needs a market calendar, instead of downloading something, just reuse the dates from df and load that into a MarketData object:

mkt_data = MarketData(
    @by(df, :date, :full_mkt = mean(:mkt_cap)),
    df
)

Then we need to redefine how WRDSMerger goes about choosing between two links when the outcome is a Permno. It is also important to do some error checking since AbnormalReturns does not accept cases when the date is out of the range available or the Permno is not in the dataset. WRDSMerger determines priority uses the is_higher_priority function, which checks the priority of two AbstractLinkPairs and compares them. Therefore, changing the priority function slightly when the outcome is a Permno will create the necessary changes:

function WRDSMerger.priority(
    data::AbstractLinkPair{T1, Permno},
    dt::Date;
    mkt_data=mkt_data # need the market data defined above
) where {T1}
    if dt < AbnormalReturns.cal_dt_min(mkt_data.calendar) || dt > AbnormalReturns.cal_dt_min(mkt_data.calendar)
        return 0.0
    end
    if dt > AbnormalReturns.cal_dt_min(mkt_data.calendar)
        # typically, the market cap on the day before is checked
        # but it is also important to avoid going outside the calendar
        # range
        dt = BusinessDays.advancebdays(mkt_data.calendar, dt, -1)
    end
    permno_val = WRDSMerger.childID(data)
    if haskey(mkt_data.firmdata, permno_val)
        coalesce(
            mkt_data[permno_val, dt, :mkt_cap], # returns value or missing
            0.0
        )
    else
        0.0
    end
end

This method is obviously slower than the default setup, but would provide the market capitalization on the day before the match.

This is not the default in this package since many of these operations are costly, particularly downloading the data.

Adding New Identifiers

There are likely other identifiers in WRDS that are not included by default in this package, making it necessary to define a new identifier. This is quite easy. First, define a new type:

struct IdentiferName <: FirmIdentifier
    val::String
    IdentifierName(x::AbstractString) = new(x)
end

WRDSMerger.value(x::IdentifierName) = x.val

Replacing FirmIdentifier with SecurityIdentifier if necessary and choosing between String or Int or some other type.

Next, provide the information that links this new identifier to some other identifier in the package. This is done by calling new_link_method:

WRDSMerger.new_link_method — Function

new_link_method(data::Vector{L}) where {L<:AbstractLinkPair}

new_link_method(data::Dict{T1, Vector{L}}) where {T1, ID, L<:AbstractLinkPair{T1, ID}}

function new_link_method(
    ::Type{T1},
    ::Type{ID};
    current_links = all_pairs(AbstractIdentifier, AbstractIdentifier)
) where {ID<:AbstractIdentifier, T1<:AbstractIdentifier}

Creates a new convert_identifier method to link T1 -> ID. See detailed notes under convert_identifier. If a vector of AbstractLinkPair or a dictionary is passed, this creates a direct link method, while passing two types will attempt to find a path between the two identifiers and define the appropriate function.

Note

all_pairs is a relatively slow function, needing to repeatedly check what methods are available. Therefore, if needing to create many new methods, it is best to run all_pairs once and pass that for each new T1 -> ID that needs to be created.

Specifically the method with a vector of AbstractLinkPairs or the dictionary version. Therefore, you need to create a vector of these links, I will assume use of the LinkPair type, but this can be adjusted as discussed in Defining New AbstractLinkPairs. A LinkPair requires 5 elements: the ID it is coming from (parent ID), the ID it is going to (child ID), a start and end date, and a priority (though the start and end date and priority have defaults). Therefore, it is easiest if you create a DataFrame that has similar data (i.e., a column of parent ID, child ID, start date, end date, priority). This package then has a function that allows you to create the bi-directional links required, create_link_pair:

WRDSMerger.create_link_pair — Function

function create_link_pair(
    ::Type{LP},
    ::Type{T1},
    ::Type{T2},
    df::DataFrame,
    cols...
) where {T1<:AbstractIdentifier, T2<:AbstractIdentifier, LP<:AbstractLinkPair}

Generic function that creates an AbstractLinkPair based on the types and a DataFrame. cols... should be a list of column names in the DataFrame, the first being ready to convert to type T1 and the second ready to convert to type T2. This function returns a tuple of two dictionaries: (Dict{T1, LP{T1, T2}},Dict{T2, LP{T2, T1}}) which is easily passed to new_link_method.

Example

create_link_pair(
    LinkPair,
    Permno,
    NCusip,
    df,
    :permno,
    :ncusip,
    :namedt,
    :nameenddt,
    :priority
)

Since this returns a tuple of dictionaries, each needs to be passed to new_link_method to create the bidirectional links. Then, to create links beyond just T1 <-> T2, call create_all_links().

Other Functions

WRDSMerger.choose_best_match — Function

function choose_best_match(
    data::AbstractVector{L},
    dt::Date;
    allow_inexact_date=true,
    args...
) where {L<:AbstractLinkPair}

Picks the best identifier based on the vector of links provided.

Args

allow_inexact_date=true: If true, and the length of the supplied vector is 1, then is will return that value even if the supplied date does not fit within the link.

WRDSMerger.check_priority_errors — Function

This function tests whether there are any dates that are in multiple AbstractLinkPairs and those links have equivalent priority. If this function returns true, then there is at least a date where there is no distinction between two links. The way choose_best_match works, the first in the vector will be chosen.

WRDSMerger.is_higher_priority — Function

is_higher_priority(
    data1::AbstractLinkPair{T1, T2},
    data2::AbstractLinkPair{T1, T2},
    args...
) where {T1, T2}

Determines whether data1 has higher priority than data2. args... are automatically passed to the priorityfunction, which can then deal with special circumstances (currently passed as the date of the match). However, none of the default settings use this.

WRDSMerger.identify_overlaps — Function

This function looks for overlapping periods. It takes a list of all dates and checks if individual sub periods are a subset of multiple periods.

WRDSMerger.value — Function

value(x::AbstractIdentifier)

Converts an identifier into a common Julia type (typically Int or String).

WRDSMerger.all_pairs — Function

function all_pairs(
    a::Type{<:AbstractIdentifier},
    b::Type{<:AbstractIdentifier};
    out = Vector{Tuple{DataType, DataType}}(),
    test_fun=base_method_exists
)

Generates a vector of tuples for which a method exists. It specifically looks for base types (not abstract types).

test_fun has two values defined:

base_method_exists looks for methods that have the two types provided and that link is a direct link (e.g., Permno <-> NCusip), as opposed to a method that takes more than a single step
method_is_missing looks for methods that do not exist, this is designed to look for cases where a new method is needed, typically taking more than a single step to complete