For mnemonics, why does Data Buffet use its own system of geo codes, instead of the official codes issued by each national provider?
The challenge
Data Buffet republishes data from hundreds of providers, over a span of decades, so reusing their identifiers directly would invite conflicts and collisions:
- For the same area, different sources may use different codes.
- For the same code, different sources may have different meanings.
- A single source may change its system over time, using different codes for the same area.
- A single source may retain a code despite changes to an area (its composition or boundaries).
Moreover, we aim to achieve these goals with our geo codes:
- Uniformity among areas at the same geo level
- Distinction between areas of different types
- Visible relation between subnational areas and their nation
- Isolation of different vintages of a taxonomy
- Limited impact from coding changes that are a "distinction without a difference"
- Amenability to wild card expressions, so that related areas can be selected as a group
At the national level
Take the United Kingdom as an example: Eurostat uses the code "UK", while the IMF has used both "112" and "GBR". We elected to build upon the ISO 3166 alpha-3 standard, using the fixed prefix "I" followed by "GBR"; hence, "IGBR".
The fixed prefix means all national data can be retrieved with the wild card ".I^^^", and precludes collision with our 1990-delineation geo codes for U.S. metro areas, which consist of three letters. For example, IARE "United Arab Emirates" vs. ARE "Arecibo, Puerto Rico".
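The prefix rule and the wild card can be illustrated with a minimal Python sketch. It is not Data Buffet's implementation: the function names are hypothetical, the sketch assumes that "^" in a wild card stands for exactly one character, and it matches only the geo code portion (the part after the leading "." of the wild card shown above).

```python
import re


def national_geo_code(iso_alpha3: str) -> str:
    """Build a national geo code: fixed prefix "I" + ISO 3166 alpha-3 code."""
    code = iso_alpha3.upper()
    if not re.fullmatch(r"[A-Z]{3}", code):
        raise ValueError(f"not an alpha-3 code: {iso_alpha3!r}")
    return "I" + code


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a wild card into a regex.

    Assumption (not confirmed by the source): "^" matches exactly one
    character; every other character is matched literally.
    """
    parts = ["." if ch == "^" else re.escape(ch) for ch in pattern]
    return re.compile("".join(parts))


# Geo codes taken from the examples in this article.
geo_codes = ["IGBR", "IARE", "ARE", "IUSA_MALN"]

national = wildcard_to_regex("I^^^")
matches = [g for g in geo_codes if national.fullmatch(g)]
```

Under these assumptions, the three-letter metro code "ARE" cannot match the four-character national pattern, so "IARE" and "ARE" stay distinct.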
Note that we assign "national" geo codes to areas when statistically convenient, even if they are not sovereign (e.g., Hong Kong SAR or Puerto Rico) or their status is disputed (Taiwan ROC).
At the metro level, for the U.S.
For metro areas (more generally, "core-based statistical areas" or CBSAs), the maintainer is the U.S. Office of Management and Budget (OMB). Each CBSA is a composite of contiguous counties or county-equivalent areas, selected according to the results of the decennial census and the American Community Survey. When a delineation is revised, the numeric code may change, or it may be retained even though the composition changes. Either circumstance prompts us to define a new geo code. For example:
Census | Bulletin | Code | Name | Components | Geo code |
1990 | | 0120 | Albany, GA | Dougherty, Lee | ALN |
2000 | | 10500 | Albany, GA | Baker, Dougherty, Lee, Terrell, Worth | MALN |
2010 | 18-03 | 10500 | Albany, GA | Baker, Dougherty, Lee, Terrell, Worth | IUSA_MALN |
2010 | 18-04 | 10500 | Albany, GA | Dougherty, Lee, Terrell, Worth | IUSA_MABY |
Individual sources do not adopt the new delineations in lockstep. As a transitional measure, if a source reports using delineation "A", we may construct a supplemental dataset under delineation "B". This is possible only with distinct geo codes.
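To see why distinct geo codes matter here, consider the following minimal Python sketch built from the Albany, GA rows above. The data structure and function names are hypothetical. Summing county values is only an assumed way to construct a supplemental series, suitable at best for additive concepts, and the county values are invented placeholders, not real data.

```python
from typing import NamedTuple


class Delineation(NamedTuple):
    census: int
    bulletin: str        # empty where the table lists no bulletin
    omb_code: str
    counties: frozenset


# Rows of the Albany, GA table, keyed by geo code.
ALBANY_GA = {
    "ALN": Delineation(1990, "", "0120", frozenset({"Dougherty", "Lee"})),
    "MALN": Delineation(2000, "", "10500", frozenset(
        {"Baker", "Dougherty", "Lee", "Terrell", "Worth"})),
    "IUSA_MALN": Delineation(2010, "18-03", "10500", frozenset(
        {"Baker", "Dougherty", "Lee", "Terrell", "Worth"})),
    "IUSA_MABY": Delineation(2010, "18-04", "10500", frozenset(
        {"Dougherty", "Lee", "Terrell", "Worth"})),
}


def geo_codes_for_omb_code(omb_code: str) -> list:
    """All geo codes that share one OMB numeric code."""
    return sorted(g for g, d in ALBANY_GA.items() if d.omb_code == omb_code)


def aggregate(geo_code: str, county_values: dict) -> float:
    """Assumed construction: sum county values over a delineation's components."""
    return sum(county_values[c] for c in ALBANY_GA[geo_code].counties)


# Invented placeholder values, for illustration only.
placeholder_values = {"Baker": 1, "Dougherty": 40, "Lee": 10, "Terrell": 4, "Worth": 8}

total_18_03 = aggregate("IUSA_MALN", placeholder_values)
total_18_04 = aggregate("IUSA_MABY", placeholder_values)
```

The OMB code 10500 alone maps to three geo codes, so it cannot say which set of counties a series covers. The geo code can, which is what allows series under two delineations to be stored side by side.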
At the subnational level, for Europe
Under the Eurostat NUTS standard, an area may be terminated, merged, renamed, or created; in these cases, we assign a new geo code.
There are also cases where the boundaries do not change (same physical territory) but the area is nonetheless assigned a new identifier ("code change" or "recoded"); for this case, Data Buffet policy is to leave our geo code unchanged. Here are four examples in two countries:
NUTS vintages | NUTS level | Mutation | Code 1 | Code 2 | Name | Geo code |
2003 to 2006 | 3 | Code change | DEE21 | DEE02 | Halle (Saale), Kreisfreie Stadt | IDEU_15002 |
2003 to 2006 | 3 | Code change | DEE31 | DEE03 | Magdeburg, Kreisfreie Stadt | IDEU_15003 |
2003 to 2006 | 3 | Code change | DEE3B | DEE04 | Altmarkkreis Salzwedel | IDEU_15081 |
2013 to 2016 | 3 | Code change | UKM21 | UKM71 | Angus and Dundee City | IGBR_ADU |
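The recoding policy can be pictured as a crosswalk in which both NUTS codes resolve to the same geo code. The following Python sketch uses the four rows of the table above. The mapping layout and function names are hypothetical, and it assumes that "Code 1" belongs to the earlier vintage and "Code 2" to the later one.

```python
# (NUTS vintage, NUTS code) -> Data Buffet geo code, from the table above.
# Assumption: "Code 1" is the earlier vintage's code, "Code 2" the later one's.
RECODES = {
    (2003, "DEE21"): "IDEU_15002",
    (2006, "DEE02"): "IDEU_15002",
    (2003, "DEE31"): "IDEU_15003",
    (2006, "DEE03"): "IDEU_15003",
    (2003, "DEE3B"): "IDEU_15081",
    (2006, "DEE04"): "IDEU_15081",
    (2013, "UKM21"): "IGBR_ADU",
    (2016, "UKM71"): "IGBR_ADU",
}


def geo_code(vintage: int, nuts_code: str) -> str:
    """Resolve a vintage-specific NUTS code to its geo code."""
    return RECODES[(vintage, nuts_code)]


def nuts_codes_for(geo: str) -> list:
    """List every NUTS code that has identified the same territory."""
    return sorted(code for (_, code), g in RECODES.items() if g == geo)


# A pure recode leaves the geo code unchanged across vintages.
same_area = geo_code(2013, "UKM21") == geo_code(2016, "UKM71")
```

Because the geo code does not change, a series for Angus and Dundee City keeps one identifier across the 2013 and 2016 vintages, even though the source relabeled the area.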
Updates
- Nov 2017 - Initial version
- Jun 2019 - Considerations re: OMB codes for U.S. metro areas