Linking Internals
Underlying Methodology of convert_identifier
WRDSMerger.convert_identifier — Function
convert_identifier(::Type{ID}, x::T1, dt::Date; vargs...) where {ID<:AbstractIdentifier, T1<:AbstractIdentifier}
convert_identifier(
::Type{ID},
x::T1,
dt::Date,
data::Dict{T1, Vector{<:AbstractLinkPair{T1, ID}}}=data;
vargs...
) where {ID<:AbstractIdentifier, T1<:AbstractIdentifier}Converts an identifier (T1) to a different identifier (ID). In its most generic form, this throws a MethodError implying there is not a function that exists to directly link T1 -> ID.
Calling new_link_method calls a macro to create different versions of convert_identifier to provide links between identifiers. If these are direct links (which means there is a AbstractLinkPair that links the two identifiers, such as Permno -> Permco or Permno -> NCusip), then this provides a one step method to link these two identifiers.
For other identifiers, there is not a link table that provides a direct link (such as SecID -> GVKey). In those cases, new_link_method will find a path between the two (in the case of SecID -> GVKey, SecID -> NCusip -> Permno -> GVKey). Each case of convert_identifier only does 1 step in the process, so convert_identifier(GVKey, SecID(1), today()) would call convert_identifier(Permno, SecID(1), today()).
There are two slightly different behaviors for the direct links of convert_identifier. When linking a SecurityIdentifier -> FirmIdentifier, the function might retry if a link is not found with the parent identifier of the security. For example, when trying to link NCusip -> GVKey, the default behavior is to try NCusip -> Permno -> GVKey. However, suppose there is not a matching NCusip -> Permno, the function will try again with NCusip6 -> Permno. The logic is that it should not matter if the Permno does not perfectly match the NCusip if the end goal is to find a relevant GVKey. This behavior can be disabled by using allow_parent_firm=false.
LinkPairs
This packages primary storage method for links is an AbstractLinkPair, which is typically a LinkPair:
WRDSMerger.AbstractLinkPair — Type
AbstractLinkPair{T1<:AbstractIdentifier, T2<:AbstractIdentifier}Abstract supertype for all link pairs. A link pair represents a directional relationship from identifier type T1 to T2 over a date range with an associated priority. See LinkPair for the concrete default implementation and Defining New AbstractLinkPairs for how to create custom subtypes.
Required interface for subtypes: parentID, childID, min_date, max_date, priority.
WRDSMerger.LinkPair — Type
function LinkPair(
parent::T1,
child::T2,
dt1::Union{Missing, Date, String}=Date(0, 1, 1),
dt2::Union{Missing, Date, String}=Date(9999, 12, 31),
priority::Real=0.0
) where {T1<:AbstractIdentifier, T2<:AbstractIdentifier}LinkPair is the basic structure that provides a link between two identifiers. These are defined as a single direction link (T1 -> T2) that is valid between a specific date range (inclusive) and has a given priority (higher is better). Priority is useful if there are overlapping T1 -> T2 items. For example, a FirmIdentifier likely has multiple SecurityIdentifiers that relate to it. One common way to pick between different SecurityIdentifiers is to pick the one with the larger market cap as the primary.
If the source data provides priority as something other than a number (such as a String indicating priority), convert it to a numeric value before constructing the LinkPair. See gvkey_crsp_priority and crsp_gvkey_priority for examples of functions that convert CRSP/Compustat link priority strings into numeric priorities.
Conceptually, a LinkPair provides a one direction link between T1 -> T2. These are typically stored in a dictionary for quick lookup:
Dict{T1, Vector{LinkPair{T1, T2}}}()and this package adds a function to Dict to convert an abstract vector of LinkPairs to such a dictionary:
Base.Dict — Type
Dict(data::AbstractVector{L}) where {T1, T2, L<:AbstractLinkPair{T1, T2}}Converts a vector of AbstractLinkPairs to a dictionary where each T1 is a key in the dictionary and the values are vectors of L. It also checks whether those vectors ever have overlapping inconsistent priorities.
Defining New AbstractLinkPairs
While this package currently makes use of LinkPair, it might be easier for other identifiers to define a more complex AbstractLinkPair. A key component of an AbstractLinkPair is being able to compare when one link should be used as opposed to another, which this package refers to as priority. In some cases, there might be multiple values that determine a priority. For example, the link between GVKey <-> Permno has two columns, depending on the direction (e.g., going from Permno -> GVKey, "LC" > "LU" > "LS"...). This package converts these into a single number with decimals ("LC" = 8, "LU" = 7... and the other column, "P" = 0.3, "C" = 0.2 ..., added together). This is switched when defining the link between GVKey -> Permno ("LC" = 0.8, "P" = 3). An alternative way to define this would be to create a separate AbstractLinkPair type that would avoid adding and, perhaps, be clearer on methodology. For example, something like:
struct CrspCompLinkPair{T1<:AbstractIdentifier, T2<:AbstractIdentifier} <: AbstractLinkPair{T1, T2}
parent::T1
child::T2
dt1::Date# first date valid
dt2::Date# last date valid
comp_crsp_priority::Int
crsp_comp_priority::Int
function CrspCompLinkPair(
t1::T1,
t2::T2,
dt1::Date,
dt2::Date,
linktype::AbstractString,
linkprim::AbstractString
) where {T1, T2}
comp_crsp_priority = if linkprim == "P"
3
elseif linkprim == "C"
2
elseif linkprim == "J"
1
end
crsp_comp_priority = if linktype == "LC"
8
elseif linktype == "LU"
7
elseif linktype == "LS"
6
elseif linktype == "LX"
5
elseif linktype == "LD"
4
elseif linktype == "LN"
3
elseif linktype == "NR"
2
elseif linktype == "NU"
1
end
new{T1, T2}(t1, t2, dt1, dt2, comp_crsp_priority, crsp_comp_priority)
end
endWhile most of the default functions for AbstractLinkPair would work with this new type (parentID, childID, min_date, max_date, Base.in), the one that does not is priority, which determines which AbstractLinkPair is preferable. Since the direction of the link matters, two new priority functions are required:
function WRDSMerger.priority(data::CrspCompLinkPair{GVKey, T2}) where {T2<:AbstractIdentifier}
data.comp_crsp_priority + data.crsp_comp_priority / 10
end
function WRDSMerger.priority(data::CrspCompLinkPair{T1, GVKey}) where {T1<:AbstractIdentifier}
data.crsp_comp_priority + data.comp_crsp_priority / 10
endWhile this case is not used by default in this package, following similar methodology could allow for more complex priority structures.
Linking Download Functions
WRDSMerger.download_crsp_links — Function
download_crsp_links(conn, main_table="crsp.stocknames", stockfile="crsp.dsf")Runs the following SQL code (tables are changeable by setting the main_table and stockfile keyword arguments):
select a.*, b.mkt_cap from crsp.stocknames a
left join (
select s.permno, s.namedt, s.nameenddt, avg(d.mkt_cap) as mkt_cap from crsp.stocknames s
inner join (select permno, date, abs(prc) * shrout as mkt_cap from crsp.dsf) as d
on s.permno = d.permno and s.namedt <= d.date and s.nameenddt >= d.date
group by s.permno, s.namedt, s.nameenddt
) b
on a.permno = b.permno and a.namedt = b.namedt and a.nameenddt = b.nameenddtand returns a DataFrame.
WRDSMerger.download_crsp_links_v2 — Function
download_crsp_links_v2(conn, main_table="crsp.stocknames_v2", stockfile="crsp.dsf_v2")CRSP V2 equivalent of download_crsp_links. Uses dlycap (daily market cap) instead of calculating abs(prc) * shrout, and dlycaldt instead of date.
Runs the following SQL code (tables are changeable by setting the main_table and stockfile arguments):
select a.*, b.mkt_cap from crsp.stocknames_v2 a
left join (
select s.permno, s.namedt, s.nameenddt, avg(d.dlycap) as mkt_cap from crsp.stocknames_v2 s
inner join (select permno, dlycaldt, dlycap from crsp.dsf_v2) as d
on s.permno = d.permno and s.namedt <= d.dlycaldt and s.nameenddt >= d.dlycaldt
group by s.permno, s.namedt, s.nameenddt
) b
on a.permno = b.permno and a.namedt = b.namedt and a.nameenddt = b.nameenddtand returns a DataFrame.
WRDSMerger.download_comp_crsp_links — Function
download_comp_crsp_links(conn, main_table="crsp_a_ccm.ccmxpf_linkhist")Runs the following SQL code (table is changeable by setting the main_table keyword argument):
SELECT * FROM crsp_a_ccm.ccmxpf_linkhistand returns the resulting DataFrame
WRDSMerger.download_comp_cik_links — Function
download_comp_cik_links(conn, main_table="comp.company")Runs the following SQL code (table is changeable by setting the main_table keyword argument):
SELECT * FROM comp.companyand returns the resulting DataFrame
WRDSMerger.download_ibes_links — Function
download_ibes_links(conn, main_table="wrdsapps.ibcrsphist")Runs the following SQL code (table is changeable by setting the main_table keyword argument):
SELECT * FROM wrdsapps.ibcrsphistand returns the resulting DataFrame
WRDSMerger.download_option_crsp_links — Function
download_option_crsp_links(conn, main_table="optionm_all.secnmd")Runs the following SQL code (table is changeable by setting the main_table keyword argument):
SELECT * FROM optionm_all.secnmdand returns the resulting DataFrame
WRDSMerger.download_ravenpack_links — Function
download_ravenpack_links(conn, main_table="ravenpack.rp_entity_mapping", cusip_list="crsp.stocknames")Runs the following SQL code (tables are changeable by setting the main_table and cusip_list keyword arguments):
SELECT rp_entity_id, data_value as ncusip, range_start, range_end FROM ravenpack.rp_entity_mapping as a
inner join (select distinct ncusip from crsp.stocknames) as b
on left(a.data_value, 8) = b.ncusipand returns a DataFrame.
Changing The Priority for Permno
A single company can have many securities, therefore, there might be multiple options when linking these items. For example, a single GVKey or Permco will match to multiple Permnos. In some tables in WRDS (such as in the case of GVKey <-> Permno), there are explicit primary identifier markers provided, improving the match. In others, there are not (as in Permco <-> Permno). This is a particular problem for Permno since this package prioritizes matches through Permno (as discussed in Supremacy of Permno).
The most common method to resolve these matches is to find the Permno that has the largest market capitalization on the day of the match since that should be the primary identifier. This is difficult to do in a package like this where the values are, ideally, predetermined. Therefore, the default behavior is to average the market capitalization over the period of the link and choose the higher average market capitalization. This behavior is convenient (requiring only a single SQL download), but potentially inconsistent with the end goal. Specifically, if one link has a lower average market capitalization (perhaps due to a long time window where the value was lower) than another link, this package might pick the Permno with a smaller market capitalization on the day of the match.
This is a proposed alternative that makes use of the AbnormalReturns.jl package to provide a quick lookup of the market capitalization just before the link:
First, stock price data is required:
using WRDSMerger, DataFramesMeta, AbnormalReturns
df = raw_sql(wrds_conn, "SELECT permno, date, abs(prc) * shrout AS mkt_cap FROM crsp.dsf")
@rtransform!(df, :mkt_cap = coalesce(:mkt_cap, 0.0))It is recommended to provide some filter on the WRDS download as the crsp.dsf file has over 100 million rows, downloading this data takes a lot of ram, peaking at ~20 GB. Most obviously, selecting dates beyond a certain point helps a lot.
AbnormalReturns needs a market calendar, instead of downloading something, just reuse the dates from df and load that into a MarketData object:
mkt_data = MarketData(
@by(df, :date, :full_mkt = mean(:mkt_cap)),
df
)Then we need to redefine how WRDSMerger goes about choosing between two links when the outcome is a Permno. It is also important to do some error checking since AbnormalReturns does not accept cases when the date is out of the range available or the Permno is not in the dataset. WRDSMerger determines priority uses the is_higher_priority function, which checks the priority of two AbstractLinkPairs and compares them. Therefore, changing the priority function slightly when the outcome is a Permno will create the necessary changes:
function WRDSMerger.priority(
data::AbstractLinkPair{T1, Permno},
dt::Date;
mkt_data=mkt_data # need the market data defined above
) where {T1}
if dt < AbnormalReturns.cal_dt_min(mkt_data.calendar) || dt > AbnormalReturns.cal_dt_min(mkt_data.calendar)
return 0.0
end
if dt > AbnormalReturns.cal_dt_min(mkt_data.calendar)
# typically, the market cap on the day before is checked
# but it is also important to avoid going outside the calendar
# range
dt = BusinessDays.advancebdays(mkt_data.calendar, dt, -1)
end
permno_val = WRDSMerger.childID(data)
if haskey(mkt_data.firmdata, permno_val)
coalesce(
mkt_data[permno_val, dt, :mkt_cap], # returns value or missing
0.0
)
else
0.0
end
endThis method is obviously slower than the default setup, but would provide the market capitalization on the day before the match.
This is not the default in this package since many of these operations are costly, particularly downloading the data.
Adding New Identifiers
There are likely other identifiers in WRDS that are not included by default in this package, making it necessary to define a new identifier. This is quite easy. First, define a new type:
struct IdentiferName <: FirmIdentifier
val::String
IdentifierName(x::AbstractString) = new(x)
end
WRDSMerger.value(x::IdentifierName) = x.valReplacing FirmIdentifier with SecurityIdentifier if necessary and choosing between String or Int or some other type.
Next, provide the information that links this new identifier to some other identifier in the package. This is done by calling new_link_method:
WRDSMerger.new_link_method — Function
new_link_method(data::Vector{L}) where {L<:AbstractLinkPair}
new_link_method(data::Dict{T1, Vector{L}}) where {T1, ID, L<:AbstractLinkPair{T1, ID}}
function new_link_method(
::Type{T1},
::Type{ID};
current_links = all_pairs(AbstractIdentifier, AbstractIdentifier)
) where {ID<:AbstractIdentifier, T1<:AbstractIdentifier}Creates a new convert_identifier method to link T1 -> ID. See detailed notes under convert_identifier. If a vector of AbstractLinkPair or a dictionary is passed, this creates a direct link method, while passing two types will attempt to find a path between the two identifiers and define the appropriate function.
When a direct link is created (via a vector or dictionary), this also defines an identifier_data method for the corresponding (ID, T1) pair, providing access to the underlying link dictionary.
all_pairs is a relatively slow function, needing to repeatedly check what methods are available. Therefore, if needing to create many new methods, it is best to run all_pairs once and pass that for each new T1 -> ID that needs to be created.
Specifically the method with a vector of AbstractLinkPairs or the dictionary version. Therefore, you need to create a vector of these links, I will assume use of the LinkPair type, but this can be adjusted as discussed in Defining New AbstractLinkPairs. A LinkPair requires 5 elements: the ID it is coming from (parent ID), the ID it is going to (child ID), a start and end date, and a priority (though the start and end date and priority have defaults). Therefore, it is easiest if you create a DataFrame that has similar data (i.e., a column of parent ID, child ID, start date, end date, priority). This package then has a function that allows you to create the bi-directional links required, create_link_pair:
WRDSMerger.create_link_pair — Function
function create_link_pair(
::Type{LP},
::Type{T1},
::Type{T2},
df::DataFrame,
sym1::Symbol,
sym2::Symbol,
dt1::Union{Symbol, Missing}=missing,
dt2::Union{Symbol, Missing}=missing,
priority_sym::Union{Symbol, Missing}=missing,
priority_sym2::Union{Symbol, Missing}=missing
) where {T1<:AbstractIdentifier, T2<:AbstractIdentifier, LP<:AbstractLinkPair}Generic function that creates an AbstractLinkPair based on the types and a DataFrame. sym1 and sym2 are the column names in df whose values will be converted to types T1 and T2, respectively. dt1 and dt2 are optional column names for the start and end dates of each link. priority_sym is an optional column name used as the priority when building links from T1 to T2, and priority_sym2 is an optional separate priority column for the reverse (T2 to T1) direction; when priority_sym2 is missing, priority_sym is used for both directions, and when both are missing a default priority of 0.0 is used.
The function selects only the relevant columns, drops rows where sym1 or sym2 are missing, and returns a tuple of two dictionaries: (Dict{T1, LP{T1, T2}}, Dict{T2, LP{T2, T1}}) which is easily passed to new_link_method.
Example
create_link_pair(
LinkPair,
Permno,
NCusip,
df,
:permno,
:ncusip,
:namedt,
:nameenddt,
:priority
)Since this returns a tuple of dictionaries, each needs to be passed to new_link_method to create the bidirectional links. Then, to create links beyond just T1 <-> T2, call create_all_links().
Core Functions
WRDSMerger.create_all_links — Function
create_all_links()Create indirect links between all identifier types that do not yet have a direct link method. This should be called after all generate_* functions have been run. It finds all missing identifier pairs and creates linking methods that route through intermediate identifiers (preferring paths through Permno, see Supremacy of Permno).
WRDSMerger.download_all_links — Function
download_all_links(conn, funs=[...], save_dfs=true)Convenience function that calls each function in funs (passing conn) and then calls create_all_links. If save_dfs=true (default), returns a vector of the DataFrames produced by each function. The default funs are: generate_crsp_links, generate_comp_crsp_links, generate_comp_cik_links, generate_ibes_links, generate_option_crsp_links, generate_ravenpack_links.
Other Functions
WRDSMerger.choose_best_match — Function
function choose_best_match(
data::AbstractVector{L},
dt::Date;
allow_inexact_date=true,
args...
) where {L<:AbstractLinkPair}Picks the best identifier based on the vector of links provided.
Args
allow_inexact_date=true: If true, and the length of the supplied vector is 1, then is will return that value even if the supplied date does not fit within the link.
WRDSMerger.check_priority_errors — Function
This function tests whether there are any dates that are in multiple AbstractLinkPairs and those links have equivalent priority. If this function returns true, then there is at least a date where there is no distinction between two links. The way choose_best_match works, the first in the vector will be chosen.
The algorithm uses a sweep-line approach: intervals are sorted by start date to efficiently find overlapping pairs. When an overlap between two links with different children is found, a representative date is checked against all active links to determine whether a higher-priority link resolves the tie. This avoids false positives where two low-priority links tie but a third higher-priority link takes precedence.
WRDSMerger.is_higher_priority — Function
is_higher_priority(
data1::AbstractLinkPair{T1, T2},
data2::AbstractLinkPair{T1, T2},
args...
) where {T1, T2}Determines whether data1 has higher priority than data2. args... are automatically passed to the priorityfunction, which can then deal with special circumstances (currently passed as the date of the match). However, none of the default settings use this.
WRDSMerger.value — Function
value(x::AbstractIdentifier)Converts an identifier into a common Julia type (typically Int or String).
WRDSMerger.all_pairs — Function
all_pairs(
a::Type{<:AbstractIdentifier},
b::Type{<:AbstractIdentifier}=a;
out = Vector{Tuple{DataType, DataType}}(),
test_fun = base_method_exists
)Returns a vector of (T1, T2) tuples representing every concrete identifier pair for which test_fun(T1, T2) returns true. Both a and b are expanded to their concrete subtypes (via InteractiveUtils.subtypes) before testing, and self-pairs (T1 == T2) are skipped.
By default, test_fun is base_method_exists, which checks whether an identifier_data method exists for the pair — i.e., whether a direct link table has been loaded. This makes all_pairs useful for discovering which direct links are currently available.
all_pairs can be slow because it must inspect methods for every combination of concrete subtypes. When creating many indirect links with new_link_method, call all_pairs once and pass the result via the current_links keyword to avoid repeated work.
Examples
# All direct links currently loaded:
all_pairs(AbstractIdentifier, AbstractIdentifier)
# Check which pairs are still missing a convert_identifier method:
all_pairs(AbstractIdentifier, AbstractIdentifier; test_fun=method_is_missing)WRDSMerger.identifier_data — Function
identifier_data(::Type{ID}, ::Type{T1}) where {ID<:AbstractIdentifier, T1<:AbstractIdentifier}Returns the underlying Dict{T1, Vector{<:AbstractLinkPair{T1, ID}}} that maps source identifiers of type T1 to their link records targeting type ID.
This function is automatically defined by new_link_method whenever a direct link (i.e., one backed by an AbstractLinkPair table) is created. It can be useful for inspecting the raw link data or for advanced workflows such as serializing the dictionary for faster future loading.
If no direct link data has been loaded for the requested pair, a MethodError will be thrown.
Examples
# After links have been created:
data = identifier_data(GVKey, Permno)WRDSMerger.crsp_gvkey_priority — Function
crsp_gvkey_priority(linkprim::AbstractString, linktype::AbstractString) -> Float64Converts CRSP/Compustat link descriptor strings into a numeric priority for links going from a CRSP identifier (Permno or Permco) to GVKey. Higher values indicate a stronger link.
The linktype flag (LC > LU > LS > LX > LD > LN > NR > NU) receives the larger weight, reflecting that the link quality matters most when starting from a security identifier. linkprim (P > C > J) serves as a tiebreaker.
See also gvkey_crsp_priority for the reverse direction.
WRDSMerger.gvkey_crsp_priority — Function
gvkey_crsp_priority(linkprim::AbstractString, linktype::AbstractString) -> Float64Converts CRSP/Compustat link descriptor strings into a numeric priority for links going from GVKey to a CRSP identifier (Permno or Permco). Higher values indicate a stronger link.
The linkprim flag (P > C > J) receives the larger weight, reflecting that the primary security match matters most when starting from a firm identifier. linktype (LC > LU > LS > LX > LD > LN > NR > NU) serves as a tiebreaker.
See also crsp_gvkey_priority for the reverse direction.