# 6. Data Privacy-Preserving Techniques

{% hint style="info" %}
About [Zümrüt MÜFTÜOĞLU](https://www.linkedin.com/in/zumrut-muftuoglu-98704537/?originalSubdomain=tr)
{% endhint %}

This week, we hosted Zümrüt Müftüoğlu, who works at the Digital Transformation Office of the Presidency of Türkiye and is an expert on data privacy. She talked to us about the balance between privacy and security and the technologies that increase privacy.

> <mark style="color:orange;">“Data is the pollution problem of the information age, and protecting privacy is the environmental challenge.” —</mark> *<mark style="color:orange;">Bruce Schneier</mark>*

### Data vs Information

![Data versus Information](https://cdn-images-1.medium.com/max/800/1*sGkRx_TlVYs_afukl3Qxrw.png)

As an example: the number of likes on a social media post is a single element of data. When it is combined with other social media engagement statistics, such as followers, comments, and shares, a company can infer which social media platforms perform best and which platforms it should focus on to engage its audience more effectively.

#### **Data Classification**

Carnegie Mellon University offers a proposal for sensitivity-based classification of data in the context of information security.

![](https://cdn-images-1.medium.com/max/800/1*R3f6akduscOBeZ_s6q2GEw.png)

![https://www.cmu.edu/iso/governance/guidelines/data-classification.html](https://cdn-images-1.medium.com/max/800/1*nrL26SuQNhNGhm-YWy9qXw.png)

#### **Types and Structures of Data**

* **Location Data** — the location history of a person.
* **Graph Data** — data describing a social network, communication network, or physical network.
* **Time Series** — data that is updated over time, such as census information.

### Privacy threats in data analytics

**Surveillance:** Many organizations, such as retailers and e-commerce companies, study their customers’ buying habits and try to come up with various targeted offers and value-added services.

**Disclosure:** A third-party data analyst can link sensitive information with freely available external data sources such as census data.

**Discrimination:** Discrimination is bias or unequal treatment that can occur when a person’s private information is disclosed.

**Personal embarrassment and abuse:** Whenever a person’s private information is disclosed, it can even lead to personal embarrassment or abuse.

Governments and regulatory agencies are often considered the institutions most responsible on this subject: because governments can enforce privacy regulations, they have the power to ensure that data stakeholders comply with them. At the same time, through careless use of social media applications such as Facebook and Instagram, users themselves upload personal data to the public domain, which creates privacy threats. As privacy threats and their consequences have grown, so has awareness among users, and with it the demand for privacy protection. In response, countries have begun to create privacy laws and regulations; the most prominent are the European Union’s GDPR (General Data Protection Regulation) and India’s Personal Data Protection law. Some applications are shown in the table below, along with their privacy risks.

![Application vs. privacy risk.](https://cdn-images-1.medium.com/max/800/1*HxNIbk38JB5P-_fpBTzfVQ.png)

### Security vs Privacy

![](https://cdn-images-1.medium.com/max/800/1*zMmk5K4EzmOTlUdmHOaeNg.png)

**Security** is about protecting data. It means protection against unauthorized access to data. We implement security controls to limit who can access information.

**Privacy** is about protecting user identity, although it is sometimes difficult to define.

However, we also encounter areas where the two concepts overlap.

Let’s try to clarify the difference with an example. A company you shop with has access to a great deal of your personal data. Measures such as choosing secure systems and software to store that data safely against third parties, and processing it only if you have given permission, fall under the heading of security. Determining the conditions under which the company’s own personnel, such as cashiers, may access this data falls under the heading of privacy.

> A study examined the security and privacy practices of more than 300 HIV outpatient clinics in Vietnam. The research concluded that “most staff have appropriate safeguards and practices in place to ensure data security; however, improvements are still needed, particularly in protecting patient privacy for data access, sharing and transmission.”[\*](https://www.hiv.gov/blog/difference-between-security-and-privacy-and-why-it-matters-your-program#:~:text=Security%20is%20about%20the%20safeguarding,the%20unauthorized%20access%20of%20data.)

#### **Privacy Failures in History**

* [Massachusetts Health Records](https://epic.org/wp-content/uploads/privacy/reidentification/Sweeney_Article.pdf) (1990s)
* [AOL Search Logs](https://www.wikiwand.com/en/AOL_search_log_release) (2006)
* [Netflix Prize](https://ieeexplore.ieee.org/document/4531148) (2007)
* [Facebook Ads](https://www.wikiwand.com/en/Facebook–Cambridge_Analytica_data_scandal) (2010)
* [New York City Taxi Trips](https://www.theguardian.com/technology/2014/jun/27/new-york-taxi-details-anonymised-data-researchers-warn) (2014)

### Privacy-Enhancing Technologies (PETs)

![Taxonomy of privacy-enhancing technologies.
N. Kaaniche et al. https://www.sciencedirect.com/science/article/pii/S0167404815000668](https://cdn-images-1.medium.com/max/800/1*T7lDNst4ONIrh8RCvgJzAw.png)

<figure><img src="/files/hWweWGODIbMvUtgNq2cO" alt=""><figcaption><p><a href="https://www.private-ai.com/wp-content/uploads/2021/10/PETs-Decision-Tree.pdf">Privacy Enhancing Technologies Decision Tree by PrivateAI</a></p></figcaption></figure>

#### **Differential Privacy**

Differential privacy is a system designed to share general patterns about the group(s) covered by a dataset while withholding information about the individuals in it. Differential privacy is not a single algorithm; it is a framework that provides provable data privacy. We can describe the approach with a simple example. Each individual in a group adds random noise in the range of +100/−100 to the amount of money in their pocket. We do not want to know who has how much; we want to know what we are left with when the group pools its pockets. For example, one person in the group has $55 in their pocket and a noise value of −15 beside it. This means they report (55 + (−15)) = $40, and that person’s privacy is protected. Because each person draws their noise independently, the reported amounts reveal no systematic relationship to the true amounts. This noise can be added to the input or to the output of the system.

The magic ingredient here is the law of large numbers from probability and statistics. According to the law, the larger the sample size, the closer the sample mean gets to the mean of the entire population. In other words, if there are enough individuals in the dataset, when the average of the noisy reports is taken, the noise cancels out and the obtained average is close to the true average. Each individual report is just a random number, so we learn the average amount in people’s pockets without learning the amount in any one individual’s pocket. In this way, privacy is preserved. The same idea applies if we replace the amount of money in a pocket with cancer versus healthy status, as shown in the figure. Those who want to dive deeper into the subject should check the reference of the figure.
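The pocket-money example can be sketched in a few lines of Python. Note that the ±100 uniform noise and the dollar amounts here are illustrative only, not a calibrated differential-privacy mechanism; real deployments use carefully calibrated Laplace or Gaussian noise:

```python
import random

def noisy_report(true_value, noise_range=100):
    """Each individual adds independent uniform noise in [-noise_range, +noise_range]."""
    return true_value + random.uniform(-noise_range, noise_range)

random.seed(0)
true_values = [random.uniform(0, 100) for _ in range(100_000)]  # dollars in each pocket
reports = [noisy_report(v) for v in true_values]                # what each person reveals

true_mean = sum(true_values) / len(true_values)
noisy_mean = sum(reports) / len(reports)

print(f"true mean:  {true_mean:.2f}")
print(f"noisy mean: {noisy_mean:.2f}")  # close to the true mean: the noise averages out
```

Any single report tells us almost nothing about that person’s true amount, yet with 100,000 reports the averaged noise shrinks toward zero, so the group statistic survives.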

![How to deploy machine learning with differential privacy](https://cdn-images-1.medium.com/max/800/0*WzXs_vBYQI7gd3B1)

<mark style="color:green;">**Source:**</mark> Further explanation by *Nicolas Papernot and Ian Goodfellow*, prompted by students’ questions: [Privacy and machine learning: two unexpected allies?](http://www.cleverhans.io/privacy/2018/04/29/privacy-and-machine-learning.html)

#### **Homomorphic Encryption**

Homomorphic encryption is another method of providing input privacy for AI. It is a form of encryption that allows computation to be performed on encrypted data. Homomorphic encryption addresses privacy concerns not only for data owners but also for model owners, whose valuable intellectual property is embodied in the model itself. Therefore, if the model is to be used in an untrusted environment, its parameters can be kept encrypted as well.

**Advantages:**

* Inference can run directly on encrypted data, so the model owner never sees the customer’s private data and therefore cannot leak or misuse it.
* It does not require interaction between data and model owners to perform the calculation.

**Disadvantages:**

* It requires high computing power.
* It is limited to certain types of calculations.
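A minimal illustration of the homomorphic property uses textbook RSA, which is multiplicatively homomorphic: multiplying two ciphertexts yields a ciphertext of the product of the plaintexts. The tiny primes below are purely for demonstration; unpadded RSA is not secure in practice, and real homomorphic-encryption schemes (Paillier, BFV, CKKS) are far more capable:

```python
# Textbook RSA satisfies Enc(a) * Enc(b) mod n == Enc(a * b mod n),
# so a party can compute on ciphertexts without ever decrypting the inputs.
p, q = 61, 53
n = p * q                      # public modulus (toy size!)
phi = (p - 1) * (q - 1)
e = 17                         # public exponent
d = pow(e, -1, phi)            # private exponent (modular inverse, Python 3.8+)

def encrypt(m):
    return pow(m, e, n)

def decrypt(c):
    return pow(c, d, n)

a, b = 7, 6
c_product = (encrypt(a) * encrypt(b)) % n   # multiply ciphertexts only
print(decrypt(c_product))                    # 42 == a * b, computed while encrypted
```

The party multiplying the ciphertexts never needs the private key `d`; only the result’s owner decrypts, which is exactly the separation that makes encrypted inference possible.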

![MLaaS is one of the most exciting applications of Homomorphic Encryption.](https://cdn-images-1.medium.com/max/800/0*icMLz2-Iby33UrTh.png)

<mark style="color:green;">**Source:**</mark> [Here’s a link to Andrew Trask’s article on privacy-preserving security](http://iamtrask.github.io/2017/06/05/homomorphic-surveillance/) if you want to dive deeper.

#### **Federated Learning**

Sharing and processing the huge amount of data needed for AI often poses privacy and security risks. Another method to overcome these challenges is Federated Learning.

Federated Learning is an approach to bringing code to data rather than bringing data to code. Thus, it addresses issues such as data privacy, ownership, and locality.

* Certain techniques are used to compress model updates.
* Clients compute higher-quality updates (e.g., several local training steps) instead of single gradient steps.
* Noise is added by the server before performing the aggregation, to hide the influence of any individual on the learned model.
* If gradient updates are too large, they are clipped.
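The clip-then-noise-then-average step above can be sketched in plain Python. This uses scalar updates for simplicity (real systems clip by vector norm), and the clipping threshold and noise scale are illustrative, not calibrated for a formal privacy guarantee:

```python
import random

def clip(update, max_norm=1.0):
    """Scale an update down if its magnitude exceeds max_norm."""
    norm = abs(update)
    return update if norm <= max_norm else update * (max_norm / norm)

def aggregate(client_updates, max_norm=1.0, noise_scale=0.1):
    """Clip each client's update, average, and add server-side noise."""
    clipped = [clip(u, max_norm) for u in client_updates]
    avg = sum(clipped) / len(clipped)
    return avg + random.gauss(0.0, noise_scale / len(client_updates))

random.seed(42)
updates = [0.3, -0.2, 5.0, 0.1]   # one client sends an oversized update
print(aggregate(updates))          # the 5.0 update is clipped to 1.0 before averaging
```

Clipping bounds how much any single client can move the global model, and the added noise masks the residual influence of each individual, which is the core of differentially private federated averaging.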

![Federated Learning](https://cdn-images-1.medium.com/max/800/0*fSNnvUfBMM_BFa3P.PNG)

**Federated learning differs from conventional decentralized computing:**

* Client devices such as smartphones have limited network bandwidth.
* Their ability to transfer large amounts of data is limited; the upload speed is usually lower than the download speed.
* Client devices are not always available to participate in training; conditions such as internet connection quality and charging status must be suitable.
* The data available on a device is updated quickly and is not always the same.
* For client devices, not participating in the training is also an option.
* The number of available client devices is huge but inconsistent.
* Federated learning provides distributed training and aggregation across a large population while protecting privacy.
* Data is often statistically heterogeneous, as it is user-specific and auto-correlated.

Zümrüt Müftüoğlu, our guest lecturer, emphasized that there is a trade-off between privacy and data sharing. Next week we will discuss legal cases and approaches to IoT applications in our lecture.

#### ***References:***

1. [What is Differential Privacy?](https://becominghuman.ai/what-is-differential-privacy-1fd7bf507049)
2. [Differential Privacy Basics Series — (Part 1) Introduction with examples](https://becominghuman.ai/what-is-differential-privacy-1fd7bf507049)
3. [What is Homomorphic Encryption?](https://blog.openmined.org/what-is-homomorphic-encryption/)
4. [What is federated learning?](https://bdtechtalks.com/2021/08/09/what-is-federated-learning/)
5. [Introduction to Federated Learning and Privacy Preservation using PySyft and PyTorch](https://blog.openmined.org/federated-learning-additive-secret-sharing-pysyft/)
6. [Instilling Responsible and Reliable AI Development with Federated Learning — Faisal Zaman, Accenture The Dock](https://medium.com/accenture-the-dock/instilling-responsible-and-reliable-ai-development-with-federated-learning-d23c366c5efd)
