Strengths and Limitations of CMS Administrative Data in Research


This article identifies 1) common strengths of Medicare and Medicaid administrative data and 2) broad limitations that researchers should consider when requesting and using the data.
Origin of Health Services Utilization, or “Claims” Data

Health services utilization data, commonly referred to as claims data, are derived from reimbursement information or the payment of bills. As a general rule, the pieces of information required to determine payment or reimbursement are of higher quality than other information reported on a claim. The available CMS data also include enrollment data, which are the basis for determining whose bills qualify to be paid by Medicare.

Strengths of CMS Administrative Data

Several characteristics make administrative data useful in health services research.

Clinical Validity

Medicare data contain information about covered services used by enrollees in the program. Examples include:

  • Admission and discharge dates
  • Diagnoses
  • Procedures
  • Source of care

Demographic data, such as age, date of birth, race, place of residence and date of death, are also included in these administrative datasets and are considered largely reliable and valid. Files containing this type of information about all enrolled Medicare beneficiaries are known as “denominator” files.

Linkage to Other CMS Datasets

There are numerous data files available, each containing different types of information on topics such as utilization, enrollment, provider characteristics and patient assessments. Fortunately, it is possible to link these files. For example, a “denominator” file containing demographic information can be linked to a claims file containing utilization information for the same beneficiaries.
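As a minimal sketch of this kind of linkage, the Python example below joins a small set of claims records onto a denominator file using a shared beneficiary identifier. The field names (bene_id, birth_year, admit_date, dx) and all record values are hypothetical and chosen only for illustration; actual CMS files use their own layouts, which are documented in the data dictionaries.

```python
from collections import defaultdict

# Hypothetical denominator (enrollment) records, one per beneficiary.
denominator = [
    {"bene_id": "A001", "birth_year": 1940, "sex": "F"},
    {"bene_id": "A002", "birth_year": 1935, "sex": "M"},
    {"bene_id": "A003", "birth_year": 1948, "sex": "F"},
]

# Hypothetical claims records; a beneficiary may have zero, one, or many.
claims = [
    {"bene_id": "A001", "admit_date": "2010-03-02", "dx": "4280"},
    {"bene_id": "A001", "admit_date": "2010-07-15", "dx": "486"},
    {"bene_id": "A003", "admit_date": "2010-01-20", "dx": "1629"},
]


def link_claims_to_denominator(denominator, claims, key="bene_id"):
    """Attach each beneficiary's claims to their denominator record.

    Every enrollee is kept, including those with no claims, so the
    result can still serve as the denominator for rate calculations.
    """
    claims_by_bene = defaultdict(list)
    for claim in claims:
        claims_by_bene[claim[key]].append(claim)
    return [
        {**person, "claims": claims_by_bene.get(person[key], [])}
        for person in denominator
    ]


linked = link_claims_to_denominator(denominator, claims)
claim_counts = {row["bene_id"]: len(row["claims"]) for row in linked}
```

Keeping beneficiaries with no claims is a deliberate choice in this sketch: dropping them would turn the enrolled population into a population of users only.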

Population Coverage

It is estimated that over 98 percent of adults age 65 and over are enrolled in Medicare, making Medicare data one of the richest sources of utilization information in the country. Furthermore, over 99 percent of deaths in the US among persons age 65 and older are accounted for by the Medicare program. There are over 45 million beneficiaries enrolled in the Medicare program today, allowing for detailed sub-group analysis with reduced concerns about loss of statistical power.


Research using the CMS data is a cost-effective way to analyze a large segment of the Medicare population, especially when compared with the alternative of requesting individual patients’ medical charts. The data also allow access to claims information across multiple providers for a given beneficiary while providing a consistent reporting format.

Convenient Linkage to External Data Sources

Below is a list of some external datasets that can be linked to the Medicare utilization and enrollment data:

  • US Census
  • Cancer registries (e.g. SEER/Medicare)
  • Other providers (e.g. VA, Medicaid)
  • National death index/State vital statistics
  • Surveys (e.g. Health and Retirement Study)
  • Provider Information

Depending on the availability of identifying/common variables, other external data sources may be linked to the Medicare data. Linking can take place either at the group level (based on geography, place of service, etc.) or at the person level (through SSN or Medicare ID).
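Group-level linkage can be sketched in the same spirit. The Python example below attaches an area-level measure to each beneficiary by a shared geographic code. The county codes, the median_income field and all values are made up for illustration and do not come from any real file.

```python
# Made-up area-level records keyed by a hypothetical county code.
area_data = {
    "C01": {"median_income": 60000},
    "C02": {"median_income": 52000},
}

# Hypothetical beneficiary records carrying the same geographic code.
beneficiaries = [
    {"bene_id": "A001", "county_code": "C01"},
    {"bene_id": "A002", "county_code": "C02"},
    {"bene_id": "A003", "county_code": "C99"},  # no matching area record
]


def attach_area_data(beneficiaries, area_data, geo_key="county_code"):
    """Copy the area-level measure onto each person-level record.

    Beneficiaries whose code has no match get None rather than being
    dropped, so unmatched records remain visible to the analyst.
    """
    linked = []
    for person in beneficiaries:
        area = area_data.get(person[geo_key], {})
        linked.append({**person, "median_income": area.get("median_income")})
    return linked


with_area = attach_area_data(beneficiaries, area_data)
unmatched = [row["bene_id"] for row in with_area if row["median_income"] is None]
```

Note that in a group-level link every beneficiary in the same area receives the same value, so the linked measure describes the area and not the individual.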

Data Availability

CMS data files are complete and available relatively quickly after the close of a given calendar or fiscal year. For example, Medicare enrollment information for each calendar year contained in the Master Beneficiary Summary File is generally available the following fall. Similarly, calendar year utilization files are more than 98 percent complete by the summer of the following year and available for release soon after.

Broad Limitations of CMS Administrative Data

Record of Care Received

Conditions must be diagnosed in order to appear in the utilization files; however, some diseases such as hypertension, depression and diabetes are often under-diagnosed. In addition, while the files provide a reliable record of the care received by the beneficiary, they do not provide information on the care needed. It is difficult to study disease recurrence in detail since all the data may reveal is the start of a new treatment.

Another important point is that services that providers know in advance will be denied may be inconsistently submitted as bills and, therefore, inconsistently recorded in the files.

Diagnosis Information

Diagnosis information may not be comprehensive enough in some cases to allow detailed analysis. For example, a cancer diagnosis can be found as an ICD-9 diagnosis code in the data (e.g. lung cancer is 162.xx), but no information on stage or histology is included in the Medicare claims data.
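To illustrate what the claims can and cannot tell you here, the Python sketch below flags claims whose diagnosis code falls in the 162.xx range. It assumes, for illustration only, that codes may appear either with or without the decimal point; the field name dx and the sample records are hypothetical. The flag identifies the diagnosis but, as noted above, says nothing about stage or histology.

```python
def is_lung_cancer_icd9(code):
    """Return True if an ICD-9 diagnosis code falls in the 162.xx range.

    The decimal point is removed first so that "162.9" and "1629"
    are treated the same way (an assumption about how codes are stored).
    """
    normalized = code.replace(".", "").strip().upper()
    return normalized.startswith("162")


# Hypothetical claims with a single diagnosis field each.
sample_claims = [
    {"bene_id": "A001", "dx": "4280"},
    {"bene_id": "A002", "dx": "162.9"},
    {"bene_id": "A003", "dx": "1623"},
    {"bene_id": "A004", "dx": "V1011"},
]

lung_cancer_benes = [
    claim["bene_id"] for claim in sample_claims if is_lung_cancer_icd9(claim["dx"])
]
```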

The data do contain information on chronic diseases; however, knowing that someone has a chronic disease does not reveal how long they have had the condition (incidence vs. prevalence) or the severity of their condition.

Another limitation related to diagnosis information is that the Part D prescription drug event file contains no diagnosis codes. Because many drugs and procedures have multiple indications, it can be difficult to interpret the reason for a given prescription.

Inconsistencies in Use of Coding Systems for Procedures by Care Setting

Procedures are coded using different coding systems depending on the care setting. For example, inpatient care is coded using ICD-9 procedure codes (4 digits), while physician/supplier and durable medical equipment data are coded using CPT and HCPCS codes. Furthermore, hospital outpatient care is coded as a mix of CPT and revenue center (hospital billing center) codes. Currently, only a less-than-perfect crosswalk exists between ICD-9 codes and CPT codes.
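One way to keep these differences in view when planning an analysis is a simple lookup of coding systems by setting, sketched below in Python. The setting labels and function names are hypothetical; the mapping itself only restates the paragraph above and is not an official CMS specification.

```python
# Coding systems for procedures by care setting, as described in the text.
# The setting labels are hypothetical names chosen for this sketch.
PROCEDURE_CODING_BY_SETTING = {
    "inpatient": ("ICD-9 procedure",),
    "physician_supplier": ("CPT", "HCPCS"),
    "durable_medical_equipment": ("CPT", "HCPCS"),
    "hospital_outpatient": ("CPT", "revenue center"),
}


def coding_systems_for(setting):
    """Return the procedure coding systems used in a care setting."""
    try:
        return PROCEDURE_CODING_BY_SETTING[setting]
    except KeyError:
        raise ValueError(f"Unknown care setting: {setting!r}") from None


def share_coding_system(setting_a, setting_b):
    """True if two settings have at least one coding system in common.

    When this is False, comparing procedures across the two settings
    requires a crosswalk between code systems.
    """
    return bool(set(coding_systems_for(setting_a)) & set(coding_systems_for(setting_b)))


needs_crosswalk = not share_coding_system("inpatient", "physician_supplier")
directly_comparable = share_coding_system("physician_supplier", "hospital_outpatient")
```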

Limited Clinical Information

Physiological measurements such as blood pressure, pulse, and cardiac ejection fraction are absent from the utilization files. In addition, results of common tests such as PSA, angiography and pathological tests are not included. Exact timing of events can also be difficult to discern. Specifically, neither the time from admission to a given event nor timestamps for dates of service can be found in the data.

Exclusions in Utilization Data

Outlined below are several types of services and care that are not contained in the Medicare data.

1) Prior to Part D, Medicare had no pharmacy benefit; therefore, outpatient medications could not be studied. With Part D, studies will have to take plan formularies into account.

2) Covered services for which claims are not submitted are not included in the data (e.g. immunizations provided through grocery-store immunization clinics).

3) Some services are not covered by Medicare and would, therefore, not be included.

4) Prior to the release of Medicare Advantage Encounter data, no information was available for Part B services provided to managed care enrollees.

5) Prior to the release of Medicare Advantage Encounter data, little information (and that of largely unknown quality) was available about hospitalizations for managed care enrollees.

6) Not all beneficiaries have Part D coverage, and not all beneficiaries with Part D coverage will have Part D utilization information contained in the files.

7) Encounter data do not include information on payments to providers.

Variable Quality

A good rule of thumb for judging the reliability of a given data field is this: if the information affects payment, then the quality of that information will be better.

With this in mind, note that different types of care may be subject to different payment rules. This implies that, for example, comorbidity and severity-of-illness information may be inconsistently recorded if it is subject to varying payment rules. In addition, some components of treatment may not be included in bills (and therefore in the claims data) if reimbursement rates are very low, even if the treatment is provided.


In general, data elements provided by CMS provide consistent and reliable information. When assessing data quality, a good place to start is with the data dictionaries created for each file. The data dictionaries (record layouts) contain information about some assumptions, data combinations, limitations, etc. These are an important tool to use when designing your study and analyzing your data, and they can be found on our website under the CMS Data section.