Linking Internals
Underlying Methodology of convert_identifier
WRDSMerger.convert_identifier — Functionconvert_identifier(::Type{ID}, x::T1, dt::Date; vargs...) where {ID<:AbstractIdentifier, T1<:AbstractIdentifier}
convert_identifier(
::Type{ID},
x::T1,
dt::Date,
data::Dict{T1, Vector{<:AbstractLinkPair{T1, ID}}}=data;
vargs...
) where {ID<:AbstractIdentifier, T1<:AbstractIdentifier}Converts an identifier (T1) to a different identifier (ID). In its most generic form, this throws a MethodError implying there is not a function that exists to directly link T1 -> ID.
Calling new_link_method calls a macro to create different versions of convert_identifier to provide links between identifiers. If these are direct links (which means there is a AbstractLinkPair that links the two identifiers, such as Permno -> Permco or Permno -> NCusip), then this provides a one step method to link these two identifiers.
For other identifiers, there is not a link table that provides a direct link (such as SecID -> GVKey). In those cases, new_link_method will find a path between the two (in the case of SecID -> GVKey, SecID -> NCusip -> Permno -> GVKey). Each case of convert_identifier only does 1 step in the process, so convert_identifier(GVKey, SecID(1), today()) would call convert_identifier(Permno, SecID(1), today()).
There are two slightly different behaviors for the direct links of convert_identifier. When linking a SecurityIdentifier -> FirmIdentifier, the function might retry if a link is not found with the parent identifier of the security. For example, when trying to link NCusip -> GVKey, the default behavior is to try NCusip -> Permno -> GVKey. However, suppose there is not a matching NCusip -> Permno, the function will try again with NCusip6 -> Permno. The logic is that it should not matter if the Permno does not perfectly match the NCusip if the end goal is to find a relevant GVKey. This behavior can be disabled by using allow_parent_firm=false.
LinkPairs
This packages primary storage method for links is an AbstractLinkPair, which is typically a LinkPair:
WRDSMerger.LinkPair — Typefunction LinkPair(
parent::T1,
child::T2,
dt1::Union{Missing, Date, String}=Date(0, 1, 1),
dt2::Union{Missing, Date, String}=Date(9999, 12, 31),
priority::Real=0.0
) where {T1<:AbstractIdentifier, T2<:AbstractIdentifier}
function LinkPair(
parent::T1,
child::T2,
dt1::Union{Missing, Date, String},
dt2::Union{Missing, Date, String},
linkprim::String,
linktype::String,
) where {T1<:Union{GVKey, Permno, Permco}, T2<:Union{GVKey, Permno, Permco}}LinkPair is the basic structure that provides a link between two identifiers. These are defined as a single direction link (T1 -> T2) that is valid between a specific date range (inclusive) and has a given priority (higher is better). Priority is useful if there are overlapping T1 -> T2 items. For example, a FirmIdentifier likely has multiple SecurityIdentifiers that relate to it. One common way to pick between different SecurityIdentifiers is to pick the one with the large market cap as the primary.
If defining a new identifier that has other methods of choosing priorities (such as a String indicating priority), it can help to define a function that converts these strings into a number. An example of this exists for linking GVKey -> Permno or Permco (and the reverse), which take in linkprim and linktype and convert those to the appropriate priority.
Conceptually, a LinkPair provides a one direction link between T1 -> T2. These are typically stored in a dictionary for quick lookup:
Dict{T1, Vector{LinkPair{T1, T2}}}()and this package adds a function to Dict to convert an abstract vector of LinkPairs to such a dictionary:
Base.Dict — TypeDict(data::AbstractVector{L}) where {T1, T2, L<:AbstractLinkPair{T1, T2}}Converts a vector of AbstractLinkPairs to a dictionary where each T1 is a key in the dictionary and the values are vectors of L. It also checks whether those vectors ever have overlapping inconsistent priorities.
Defining New AbstractLinkPairs
While this package currently makes use of LinkPair, it might be easier for other identifiers to define a more complex AbstractLinkPair. A key component of an AbstractLinkPair is being able to compare when one link should be used as opposed to another, which this package refers to as priority. In some cases, there might be multiple values that determine a priority. For example, the link between GVKey <-> Permno has two columns, depending on the direction (e.g., going from Permno -> GVKey, "LC" > "LU" > "LS"...). This package converts these into a single number with decimals ("LC" = 8, "LU" = 7... and the other column, "P" = 0.3, "C" = 0.2 ..., added together). This is switched when defining the link between GVKey -> Permno ("LC" = 0.8, "P" = 3). An alternative way to define this would be to create a separate AbstractLinkPair type that would avoid adding and, perhaps, be clearer on methodology. For example, something like:
struct CrspCompLinkPair{T1<:AbstractIdentifier, T2<:AbstractIdentifier} <: AbstractLinkPair{T1, T2}
parent::T1
child::T2
dt1::Date# first date valid
dt2::Date# last date valid
comp_crsp_priority::Int
crsp_comp_priority::Int
function CrspCompLinkPair(
t1::T1,
t2::T2,
dt1::Date,
dt2::Date,
linktype::AbstractString,
linkprim::AbstractString
) where {T1, T2}
comp_crsp_priority = if linkprim == "P"
3
elseif linkprim == "C"
2
elseif linkprim == "J"
1
end
crsp_comp_priority = if linktype == "LC"
8
elseif linktype == "LU"
7
elseif linktype == "LS"
6
elseif linktype == "LX"
5
elseif linktype == "LD"
4
elseif linktype == "LN"
3
elseif linktype == "NR"
2
elseif linktype == "NU"
1
end
new{T1, T2}(t1, t2, dt1, dt2, comp_crsp_priority, crsp_comp_priority)
end
endWhile most of the default functions for AbstractLinkPair would work with this new type (parentID, childID, min_date, max_date, Base.in), the one that does not is priority, which determines which AbstractLinkPair is preferable. Since the direction of the link matters, two new priority functions are required:
function WRDSMerger.priority(data::CrspCompLinkPair{GVKey, T2}) where {T2<:AbstractIdentifier}
data.comp_crsp_priority + data.crsp_comp_priority / 10
end
function WRDSMerger.priority(data::CrspCompLinkPair{T1, GVKey}) where {T1<:AbstractIdentifier}
data.crsp_comp_priority + data.comp_crsp_priority / 10
endWhile this case is not used by default in this package, following similar methodology could allow for more complex priority structures.
Linking Download Functions
WRDSMerger.download_crsp_links — Functiondownload_crsp_links(db; main_table="crsp.stocknames", stockfile="crsp.dsf")Runs the following SQL code (tables are changeable by setting the main_table and stockfile keyword arguments):
select a.*, b.mkt_cap from crsp.stocknames a
left join (
select s.permno, s.namedt, s.nameenddt, avg(d.mkt_cap) as mkt_cap from crsp.stocknames s
inner join (select permno, date, abs(prc) * shrout as mkt_cap from crsp.dsf) as d
on s.permno = d.permno and s.namedt <= d.date and s.nameenddt >= d.date
group by s.permno, s.namedt, s.nameenddt
) b
on a.permno = b.permno and a.namedt = b.namedt and a.nameenddt = b.nameenddtand returns a DataFrame.
WRDSMerger.download_comp_crsp_links — Functiondownload_comp_crsp_links(db; main_table="crsp_a_ccm.ccmxpf_linkhist")Runs the following SQL code (table is changeable by setting the main_table keyword argument):
SELECT * FROM crsp_a_ccm.ccmxpf_linkhistand returns the resulting DataFrame
WRDSMerger.download_comp_cik_links — Functiondownload_comp_cik_links(db; main_table="comp.company")Runs the following SQL code (table is changeable by setting the main_table keyword argument):
SELECT * FROM comp.companyand returns the resulting DataFrame
WRDSMerger.download_ibes_links — Functiondownload_ibes_links(db; main_table="wrdsapps.ibcrsphist")Runs the following SQL code (table is changeable by setting the main_table keyword argument):
SELECT * FROM wrdsapps.ibcrsphistand returns the resulting DataFrame
WRDSMerger.download_option_crsp_links — Functiondownload_option_crsp_links(db; main_table="optionm_all.secnmd")Runs the following SQL code (table is changeable by setting the main_table keyword argument):
SELECT * FROM optionm_all.secnmdand returns the resulting DataFrame
WRDSMerger.download_ravenpack_links — Functiondownload_ravenpack_links(db; main_table="ravenpack.rp_entity_mapping", cusip_list="crsp.stocknames")Runs the following SQL code (tables are changeable by setting the main_table and cusip_list keyword arguments):
SELECT rp_entity_id, data_value as ncusip, range_start, range_end FROM ravenpack.rp_entity_mapping as a
inner join (select distinct ncusip from crsp.stocknames) as b
on left(a.data_value, 8) = b.ncusipand returns a DataFrame.
Changing The Priority for Permno
A single company can have many securities, therefore, there might be multiple options when linking these items. For example, a single GVKey or Permco will match to multiple Permnos. In some tables in WRDS (such as in the case of GVKey <-> Permno), there are explicit primary identifier markers provided, improving the match. In others, there are not (as in Permco <-> Permno). This is a particular problem for Permno since this package prioritizes matches through Permno (as discussed in Supremacy of Permno).
The most common method to resolve these matches is to find the Permno that has the largest market capitalization on the day of the match since that should be the primary identifier. This is difficult to do in a package like this where the values are, ideally, predetermined. Therefore, the default behavior is to average the market capitalization over the period of the link and choose the higher average market capitalization. This behavior is convenient (requiring only a single SQL download), but potentially inconsistent with the end goal. Specifically, if one link has a lower average market capitalization (perhaps due to a long time window where the value was lower) than another link, this package might pick the Permno with a smaller market capitalization on the day of the match.
This is a proposed alternative that makes use of the AbnormalReturns.jl package to provide a quick lookup of the market capitalization just before the link:
First, stock price data is required:
using WRDSMerger, DataFramesMeta, AbnormalReturns
df = raw_sql(wrds_conn, "SELECT permno, date, abs(prc) * shrout AS mkt_cap FROM crsp.dsf")
@rtransform!(df, :mkt_cap = coalesce(:mkt_cap, 0.0))It is recommended to provide some filter on the WRDS download as the crsp.dsf file has over 100 million rows, downloading this data takes a lot of ram, peaking at ~20 GB. Most obviously, selecting dates beyond a certain point helps a lot.
AbnormalReturns needs a market calendar, instead of downloading something, just reuse the dates from df and load that into a MarketData object:
mkt_data = MarketData(
@by(df, :date, :full_mkt = mean(:mkt_cap)),
df
)Then we need to redefine how WRDSMerger goes about choosing between two links when the outcome is a Permno. It is also important to do some error checking since AbnormalReturns does not accept cases when the date is out of the range available or the Permno is not in the dataset. WRDSMerger determines priority uses the is_higher_priority function, which checks the priority of two AbstractLinkPairs and compares them. Therefore, changing the priority function slightly when the outcome is a Permno will create the necessary changes:
function WRDSMerger.priority(
data::AbstractLinkPair{T1, Permno},
dt::Date;
mkt_data=mkt_data # need the market data defined above
) where {T1}
if dt < AbnormalReturns.cal_dt_min(mkt_data.calendar) || dt > AbnormalReturns.cal_dt_min(mkt_data.calendar)
return 0.0
end
if dt > AbnormalReturns.cal_dt_min(mkt_data.calendar)
# typically, the market cap on the day before is checked
# but it is also important to avoid going outside the calendar
# range
dt = BusinessDays.advancebdays(mkt_data.calendar, dt, -1)
end
permno_val = WRDSMerger.childID(data)
if haskey(mkt_data.firmdata, permno_val)
coalesce(
mkt_data[permno_val, dt, :mkt_cap], # returns value or missing
0.0
)
else
0.0
end
endThis method is obviously slower than the default setup, but would provide the market capitalization on the day before the match.
This is not the default in this package since many of these operations are costly, particularly downloading the data.
Adding New Identifiers
There are likely other identifiers in WRDS that are not included by default in this package, making it necessary to define a new identifier. This is quite easy. First, define a new type:
struct IdentiferName <: FirmIdentifier
val::String
IdentifierName(x::AbstractString) = new(x)
end
WRDSMerger.value(x::IdentifierName) = x.valReplacing FirmIdentifier with SecurityIdentifier if necessary and choosing between String or Int or some other type.
Next, provide the information that links this new identifier to some other identifier in the package. This is done by calling new_link_method:
WRDSMerger.new_link_method — Functionnew_link_method(data::Vector{L}) where {L<:AbstractLinkPair}
new_link_method(data::Dict{T1, Vector{L}}) where {T1, ID, L<:AbstractLinkPair{T1, ID}}
function new_link_method(
::Type{T1},
::Type{ID};
current_links = all_pairs(AbstractIdentifier, AbstractIdentifier)
) where {ID<:AbstractIdentifier, T1<:AbstractIdentifier}Creates a new convert_identifier method to link T1 -> ID. See detailed notes under convert_identifier. If a vector of AbstractLinkPair or a dictionary is passed, this creates a direct link method, while passing two types will attempt to find a path between the two identifiers and define the appropriate function.
all_pairs is a relatively slow function, needing to repeatedly check what methods are available. Therefore, if needing to create many new methods, it is best to run all_pairs once and pass that for each new T1 -> ID that needs to be created.
Specifically the method with a vector of AbstractLinkPairs or the dictionary version. Therefore, you need to create a vector of these links, I will assume use of the LinkPair type, but this can be adjusted as discussed in Defining New AbstractLinkPairs. A LinkPair requires 5 elements: the ID it is coming from (parent ID), the ID it is going to (child ID), a start and end date, and a priority (though the start and end date and priority have defaults). Therefore, it is easiest if you create a DataFrame that has similar data (i.e., a column of parent ID, child ID, start date, end date, priority). This package then has a function that allows you to create the bi-directional links required, create_link_pair:
WRDSMerger.create_link_pair — Functionfunction create_link_pair(
::Type{LP},
::Type{T1},
::Type{T2},
df::DataFrame,
cols...
) where {T1<:AbstractIdentifier, T2<:AbstractIdentifier, LP<:AbstractLinkPair}Generic function that creates an AbstractLinkPair based on the types and a DataFrame. cols... should be a list of column names in the DataFrame, the first being ready to convert to type T1 and the second ready to convert to type T2. This function returns a tuple of two dictionaries: (Dict{T1, LP{T1, T2}},Dict{T2, LP{T2, T1}}) which is easily passed to new_link_method.
Example
create_link_pair(
LinkPair,
Permno,
NCusip,
df,
:permno,
:ncusip,
:namedt,
:nameenddt,
:priority
)Since this returns a tuple of dictionaries, each needs to be passed to new_link_method to create the bidirectional links. Then, to create links beyond just T1 <-> T2, call create_all_links().
Other Functions
WRDSMerger.choose_best_match — Functionfunction choose_best_match(
data::AbstractVector{L},
dt::Date;
allow_inexact_date=true,
args...
) where {L<:AbstractLinkPair}Picks the best identifier based on the vector of links provided.
Args
allow_inexact_date=true: If true, and the length of the supplied vector is 1, then is will return that value even if the supplied date does not fit within the link.
WRDSMerger.check_priority_errors — FunctionThis function tests whether there are any dates that are in multiple AbstractLinkPairs and those links have equivalent priority. If this function returns true, then there is at least a date where there is no distinction between two links. The way choose_best_match works, the first in the vector will be chosen.
WRDSMerger.is_higher_priority — Functionis_higher_priority(
data1::AbstractLinkPair{T1, T2},
data2::AbstractLinkPair{T1, T2},
args...
) where {T1, T2}Determines whether data1 has higher priority than data2. args... are automatically passed to the priorityfunction, which can then deal with special circumstances (currently passed as the date of the match). However, none of the default settings use this.
WRDSMerger.identify_overlaps — FunctionThis function looks for overlapping periods. It takes a list of all dates and checks if individual sub periods are a subset of multiple periods.
WRDSMerger.value — Functionvalue(x::AbstractIdentifier)Converts an identifier into a common Julia type (typically Int or String).
WRDSMerger.all_pairs — Functionfunction all_pairs(
a::Type{<:AbstractIdentifier},
b::Type{<:AbstractIdentifier};
out = Vector{Tuple{DataType, DataType}}(),
test_fun=base_method_exists
)Generates a vector of tuples for which a method exists. It specifically looks for base types (not abstract types).
test_fun has two values defined:
base_method_existslooks for methods that have the two types provided and that link is a direct link (e.g., Permno <-> NCusip), as opposed to a method that takes more than a single stepmethod_is_missinglooks for methods that do not exist, this is designed to look for cases where a new method is needed, typically taking more than a single step to complete