Recent Developments
Automated Removal of Protected Health Information From Electronic Medical Records
Lisa Skrzycki

With the advent of Electronic Health Records (EHRs), researchers are eager to use HIPAA-compliant de-identified patient health records for research. However, the complex and varied nature of textual documents poses special problems for researchers and technical staff seeking to use the data.
Keywords: electronic health records, HIPAA, patient privacy, health policy

Published: 16 August 2012
Cite as: Skrzycki L. Automated Removal of Protected Health Information From Electronic Medical Records.
Bull Health L Policy. 2012;1(1): e7.


On April 26, 2004, former President George W. Bush called for all American hospitals to have electronic health records within ten years.1 Nearly eight years later, over three-fourths of American hospitals employ some type of Electronic Record System (ERS).2 As medical systems expand their ERS to improve efficiency and reduce errors,3 academia and industry are eager to take advantage of potential secondary uses of EHRs.4

HIPAA’s Safe Harbor
As direct patient authorization for use of EHRs in secondary research is often impractical, the Safe Harbor provision of the Health Insurance Portability and Accountability Act (HIPAA) gives researchers access to EHRs if 18 types of Protected Health Information (PHI) are removed.5 PHI is any unique identifying number, characteristic, or code that could be used to identify an individual patient, such as name, age, address, license number, or geographic information.6 Of all the EHR document types, narrative texts (e.g., provider notes, discharge summaries, pathology reports) present the greatest hurdle to de-identification efforts due to both their variation and complexity.7 Researchers are often required to remove PHI from these documents manually, expending already scarce resources.8 Efforts to automate this process have met with varied success.9

Issue: Identifying and Comparing Automated Methods for Removing Protected Health Information

Perhaps unsurprisingly, most research on automatic de-identification to date focuses on the relatively easier task of removing PHI found in structured or form data, rather than PHI in narrative texts.10 De-identification systems can be broadly classified into “pattern-matching” and “machine-learning” types.

Pattern-Matching Systems

Pattern-matching systems seek to identify PHI based on large dictionaries, which typically require months of manual input by programmers experienced in informatics.11 Dictionaries come in two general forms: “PHI-like,” in which terms typically considered PHI, such as proper names, geographic locales, and healthcare institutions, are targeted for removal, and “non-PHI-like,” in which certain terms, such as biomedical phrases, are protected from removal.12
Creating the dictionary database is the most cumbersome aspect of implementing pattern-matching systems. Once completed, pattern-matching systems offer researchers the ability to quickly and easily modify the dictionary set to fit the particular dataset in use.13 However, these systems are not easily transferable to other institutions14 and require developers to anticipate all possible PHI patterns, including transcription errors and unconventional abbreviations.15
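
The dictionary approach described above can be illustrated with a minimal Python sketch. It is not the implementation of any system reviewed in the cited literature. The dictionary contents, the `scrub` function, and the `[REDACTED]` mask are all hypothetical, and real dictionaries are far larger. The sketch assumes that a "non-PHI-like" entry takes priority over a "PHI-like" entry, so that a term such as "Parkinson" survives even though it is also a surname.

```python
import re

# Hypothetical, tiny dictionaries for illustration only.
PHI_LIKE = {"smith", "jones", "parkinson", "pittsburgh"}   # names, locales
NON_PHI_LIKE = {"parkinson", "crohn", "foley"}             # protected biomedical terms

WORD = re.compile(r"[A-Za-z]+")

def scrub(text, phi_like=PHI_LIKE, non_phi_like=NON_PHI_LIKE, mask="[REDACTED]"):
    """Mask every word found in the PHI-like dictionary, unless the
    non-PHI-like dictionary protects it (assumed precedence)."""
    def replace(match):
        word = match.group(0)
        key = word.lower()
        if key in non_phi_like:
            return word          # protected term, keep as written
        if key in phi_like:
            return mask          # dictionary hit, remove
        return word              # unknown word, keep
    return WORD.sub(replace, text)

print(scrub("Mr. Smith of Pittsburgh has Parkinson disease."))
# Mr. [REDACTED] of [REDACTED] has Parkinson disease.

print(scrub("Pt Smtih seen today"))
# Pt Smtih seen today   (the transposed name is not in the dictionary)
```

The second call shows the limitation noted above: a transcription error such as "Smtih" passes through untouched unless a developer has anticipated it.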

Machine-Learning Systems
The most recently developed automatic de-identification processes incorporate some form of machine learning. Informatics experts train the systems' algorithms on large bodies of text that have been annotated to distinguish PHI from non-PHI.16 These annotated texts have the additional advantage that they can be shared among different groups, as was done in a recent de-identification challenge.17 Significant labor is required to create both the annotated training texts and the computer software,18 but once completed, these systems are more easily transferable and tend to perform better than systems based solely on pattern matching.19
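
To make the idea of training on annotated text concrete, the following Python sketch fits a small naive Bayes token classifier to four hand-labeled sentences. Naive Bayes is chosen here only because it is short and self-contained; the cited sources do not say that any reviewed system uses it. The training sentences, the three features (the word, its capitalization, and the preceding word), and the function names are all illustrative assumptions.

```python
import math
from collections import Counter, defaultdict

# Hypothetical annotated corpus: each token is labeled "PHI" or "O" (other).
TRAIN = [
    [("Mr.", "O"), ("Smith", "PHI"), ("was", "O"), ("admitted", "O")],
    [("Dr.", "O"), ("Jones", "PHI"), ("reviewed", "O"), ("the", "O"), ("chart", "O")],
    [("Patient", "O"), ("denies", "O"), ("chest", "O"), ("pain", "O")],
    [("Mrs.", "O"), ("Lee", "PHI"), ("reports", "O"), ("chest", "O"), ("pain", "O")],
]

def features(tokens, i):
    """Illustrative features: the word, its capitalization, the previous word."""
    word = tokens[i]
    prev = tokens[i - 1].lower() if i > 0 else "<s>"
    return {"word": word.lower(), "cap": word[0].isupper(), "prev": prev}

def train(sentences):
    label_counts = Counter()
    feat_counts = defaultdict(Counter)   # (label, feature name) -> value counts
    vocab = defaultdict(set)             # feature name -> values seen in training
    for sent in sentences:
        tokens = [tok for tok, _ in sent]
        for i, (_, label) in enumerate(sent):
            label_counts[label] += 1
            for name, value in features(tokens, i).items():
                feat_counts[(label, name)][value] += 1
                vocab[name].add(value)
    return label_counts, feat_counts, vocab

def predict(model, tokens, alpha=1.0):
    """Label each token with the class of highest smoothed log-probability."""
    label_counts, feat_counts, vocab = model
    total = sum(label_counts.values())
    labels = []
    for i in range(len(tokens)):
        best_label, best_score = None, -math.inf
        for label, n in label_counts.items():
            score = math.log(n / total)
            for name, value in features(tokens, i).items():
                count = feat_counts[(label, name)][value]
                score += math.log((count + alpha) / (n + alpha * len(vocab[name])))
            if score > best_score:
                best_label, best_score = label, score
        labels.append(best_label)
    return labels

model = train(TRAIN)
print(predict(model, ["Dr.", "Patel", "reports", "pain"]))
# ['O', 'PHI', 'O', 'O']
```

"Patel" never appears in the training sentences, yet it is flagged because it is capitalized and follows "Dr."; context of this kind is what a dictionary alone cannot supply. A corpus this small is for demonstration only and says nothing about real-world accuracy.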

The most effective systems combined facets of pattern matching and machine learning. However, no system provided for complete de-identification of all 18 classes of PHI.20 Both types of system placed the greatest emphasis on removal of patient names,21 but were most effective in removing number-based data (addresses, telephone and fax numbers, identification numbers).22 Reliability of PHI removal varied substantially across document types.23 Pattern-matching systems fared better with uncommon PHI, but machine-learning systems fared best overall.24 Every system faced problems of false positives, called “over-scrubbing,” in which non-PHI terms were improperly targeted for removal.25
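
Over-scrubbing and missed PHI can be counted by comparing a system's output against a human-annotated reference, token by token. The Python sketch below reports both counts together with precision and recall, two standard measures that are an assumption here and are not drawn from the text above. The `score` function and the sample labels are hypothetical.

```python
def score(gold, predicted):
    """Compare equal-length lists of 'PHI' / 'O' labels, one per token."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == "PHI" and p == "PHI" for g, p in pairs)  # PHI correctly removed
    fp = sum(g == "O" and p == "PHI" for g, p in pairs)    # over-scrubbing
    fn = sum(g == "PHI" and p == "O" for g, p in pairs)    # PHI left in the text
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"over_scrubbed": fp, "missed": fn,
            "precision": precision, "recall": recall}

# Tokens: Mr. / Smith / glucose / 120 / on / 3/4/2011
gold      = ["O", "PHI", "O", "O",   "O", "PHI"]
predicted = ["O", "PHI", "O", "PHI", "O", "O"]   # lab value scrubbed, date missed

print(score(gold, predicted))
# {'over_scrubbed': 1, 'missed': 1, 'precision': 0.5, 'recall': 0.5}
```

Low precision corresponds to over-scrubbing, which erodes the research value of the record; low recall corresponds to PHI left in place, which is the privacy failure.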

Next Steps

Further research is required before these systems may be implemented on a wide scale. No automated system can ever be perfect, and the greatest successes thus far have been limited to a relatively small number of document types.26 More attention must also be paid to issues of over-scrubbing, where important information such as laboratory test values is accidentally erased.27 Additionally, some researchers have questioned whether the complete removal of all PHI is sufficient to characterize an EHR as “de-identified,” especially where social history or rare medical conditions are concerned.28

EHRs offer great promise for researchers, but significant barriers to de-identification must be overcome before secondary uses of EHRs can become widespread. Greater cooperation between research institutions is needed, with a focus on processes that can reliably de-identify PHI across all document types.

Competing Interests: None reported
Acknowledgments: None reported

Lisa Skrzycki is a third-year student at California Western School of Law, and is a member of Law Review and the Pro Bono Honors Society. Ms. Skrzycki received her B.A. in Sociology from the University of Pittsburgh.

References (Bluebook)
1. President George W. Bush, Remarks to the American Association of Community Colleges: Convention in Minneapolis, Minnesota (Apr. 26, 2004), available at http://www.presidency.ucsb.edu/ws/index.php?pid=72610&st=&st1#axzz1cTBi0SPW.
2. See Ashish K. Jha et al., Use of Electronic Health Records in U.S. Hospitals, 360 New Eng. J. Med. 1628, 1631 (2009).
3. Id. at 1628.
4. See Charles Safran et al., Toward a National Framework for the Secondary Use of Health Data: An American Medical Informatics Association White Paper, 14 J. Am. Med. Informatics Ass'n 1, 1-2 (2007).
5. See 45 C.F.R. § 164.514 (2011).
6. 45 C.F.R. § 164.514(b)(2) (2011).
7. Stephane Meystre et al., Extracting Information from Textual Documents in the Electronic Health Record: A Review of Recent Research, 2008 Y.B. Med. Informatics 138, 139 (2008).
8. D.A. Dorr et al., Assessing the Difficulty and Time Cost of De-identification in Clinical Narratives, 45 Methods Informatics Med. 246, 251 (2006).
9. See Stephane Meystre et al., Automatic De-identification of Textual Documents in the Electronic Health Record: A Review of Recent Research, 10 BMC Med. Res. Methodology 1, 2 (Aug. 2, 2010), available at http://www.biomedcentral.com/1471-2288/10/70.
10. See id.
11. See id. at 5.
12. Id.
13. Id.
14. See Meystre, supra note 7, at 141.
15. Meystre, supra note 9, at 5-7.
16. Id. at 7.
17. See Özlem Uzuner et al., Evaluating the State-of-the-Art in Automatic De-identification, 14 J. Am. Med. Informatics Ass'n 550, 551 (2007).
18. See Meystre, supra note 9, at 5.
19. See id. at 7.
20. See id. at 14.
21. Id. at 3.
22. Dorr, supra note 8, at 250.
23. See Meystre, supra note 9, at 1.
24. See id.
25. Id. at 14-15.
26. See id. at 3.
27. Id. at 14-15.
28. See Jeanmarie Mayer et al., Inductive Creation of an Annotation Schema and a Reference Standard for De-identification of VA Electronic Clinical Notes, 2009 AMIA Ann. Symp. Proc. 416, 419 (2009), available at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2815367/pdf/amia-f2009-416.pdf (last visited Nov. 26, 2011).

© Institute of Health Law Studies 2012
All rights reserved
e-ISSN: 2168-6513