Anonymizing Datasets at Scale Leveraging Databricks Interoperability

A key challenge for data-driven companies across a wide range of industries is how to leverage the benefits of analytics at scale when working with Personally Identifiable Information (PII). In this blog, we walk through how to leverage Databricks and the 3rd party Faker library to anonymize sensitive datasets in the context of broader analytical pipelines.

Problem Overview

Data anonymization is often the first step performed when preparing data for analysis. It is important that anonymization preserves the integrity of the data. This blog focuses on a simple example that requires the anonymization of four fields: name, email, social security number and phone number. This sounds easy, right? The challenge is we must preserve the semantic relationships and distributions from the source dataset in our target dataset. This allows us to later analyze and mine the data for important patterns related to the semantic relationships and distribution. So, what happens if there are multiple rows per user? Real data often contains redundant data like rows with repeated full names, emails etcetera that may or may not have different features associated with them. Therefore, we will need to maintain a mapping of profile information.

We want to anonymize the data but maintain the relationship between a given person and their features. So, if John Doe appears in our data set multiple times, we want to anonymize all of John Doe’s PII data while maintaining the relationship between John Doe and the different feature sets associated with John. However, we also must guard against someone reverse engineering our anonymization. To do this, we will use a python package, known as Faker, designed to generate fake data. Each call to the method fake.name() yields a different random result. This means that the same source file will yield a different set of anonymized data in the target file each time Faker is run against the source file. However, the semantic relationships and distributions will be maintained.

Anonymizing Datasets at Scale Leveraging Databricks Interoperability

Figure 1 Source

Figure 2 Target

The Solution: Anonymizing Data While Maintaining Semantic Relationships and Distributions.

First, let’s install Faker for anonymization and unicodecsv so we can handle unicode strings without a hassle.

%sh pip install Faker unicodecsv

Second, let’s import our packages into the Databricks Notebook. The unicodecsv is a drop-in replacement for Python 2.7’s csv module which supports unicode strings without a hassle. Faker is a Python package that generates fake data.

import unicodecsv as csv from faker import Factory

Third, let’s build the function that will anonymize the data

def anonymize_rows(rows): """ Rows is an iterable of dictionaries that contain name and email fields that need to be anonymized. """ # Load faker faker = Factory.create() # Create mappings of names, emails, social security numbers, and phone numbers to faked names & emails. names = defaultdict(faker.name) emails = defaultdict(faker.email) ssns = defaultdict(faker.ssn) phone_numbers = defaultdict(faker.phone_number) # Iterate over the rows from the file and yield anonymized rows. for row in rows: # Replace name and email fields with faked fields. row["name"] = names[row["name"]] row["email"] = emails[row["email"]] row["ssn"] = ssns[row["ssn"]] row["phone_number"] = phone_numbers[row["phone_number"]] # Yield the row back to the caller yield row

Next, let’s build the function that takes our source file and outputs our target file:

def anonymize(source, target): """ The source argument is a path to a CSV file containing data to anonymize, while target is a path to write the anonymized CSV data to. """ with open(source, 'rU') as f: with open(target, 'w') as o: # Use the DictReader to easily extract fields reader = csv.DictReader(f) writer = csv.DictWriter(o, reader.fieldnames) # Read and anonymize data, writing to target file. for row in anonymize_rows(reader): writer.writerow(row)

Finally, we will pass our source file to the anonymize function and provide the name and location of the target file we would like it to create. Here we are using the DBFS functionality of Databricks, see the DBFS documentation to learn more about how it works.

anonymize("/dbfs/databricks/mydir/test.txt", "/dbfs/databricks/mydir/anonymized.txt")

You can see all the code in a Databricks notebookhere.

Our source data, shown in Figure 3 , has the same patients, highlighted in green appearing multiple times, with different types of care, highlighted in red.

Figure 3

Our anonymization process should anonymize the patient’s PII while maintaining the semantic relationships and distributions. Figure 4 demonstrates that the relationships and distributions are maintained while the PII is completely anonymized.

Figure 4

Figure 5demonstrates that our anonymization is completely random because when we run the source dataset against Faker again we get a completely different set of anonymized data while still maintaining relationships and distributions.

Figure 5

What’s next

The example code we used in this blog is available as aDatabricks notebook, you can try it with a free trial of Databricks. If you are interested in more examples of Apache Spark in action, check out my earlier blog on asset optimization in the oil & gas industry .

This just one of many examples of how Databricks can seamlessly enable innovation in the healthcare industry. In the next installation of this blog, we will build a predictive model using this anonymized data. If you want to get started with Databricks, sign-up for a free trial orcontact us.

Anonymizing Datasets at Scale Leveraging Databricks Interoperability

Trending Articles

SM3268AB 8CE三星量产无法格式化

[下载工具]Think4V utubedown(Youtube高清视频下载工具) v2.1.6 官方版2.1.3

出售: SINE Othello 電源線

博讯｜张磊帮助下，李源潮的儿子被耶鲁录取

FullEventLogView 1.73 免安裝中文版 - 事件檢視器取代工具

同門四角戀？李沛旭喇舌「小郭雪芙」曾智希，蔡淑臻拍完婚紗...怒毀婚

五代RAV4 降車身（機械車位因素）

[攻略] 《魔獸世界》6.2.2 白色魚人蛋再現！來去收編魚人寶寶特基！

jetBrains Product crack 2024 Java based

2013 KUGA 6G轉動方向盤會聽到摳摳摳的異音，有人知道原因嗎?

【豌豆字幕組】[藥屋少女的呢喃（藥師少女的獨語）/ Kusuriya no Hitorigoto][25][繁體][1080P][MP4]

好用的照片后期处理软件【DxO PhotoLab Elite 5.4.0.4765 (x64) 多语言便携版】..

出售: Thixar Silence Plus 啫喱板

df-dferh-01 中国区 Android 安装 Google Play Store 后报错的解决办法

三條崙討海人故事…重建烏倉寮憶43年前船難

致喬立建設道歉聲明

[一般] 神州全地圖掉寶資料

方易通7862 8/128G 無360 刷機

動感校園小記者・瑪利諾修院學校｜採訪王瑋駿陳晞文帶領試玩風帆

有藍電流行車紀錄器分享文嗎