Incorporating Security and Privacy in Machine Learning Projects

Joy Bose
6 min read · Jul 24, 2021

Machine learning (ML) projects need to address security and privacy when they are deployed. This article introduces some security concepts and describes ways in which security and privacy can be incorporated into ML projects.

Security and Privacy of a software system (including an ML system)

Security here refers to protecting a software system from attacks by malicious hackers and other agents. To make a software system more secure, one has to consider the design of the software system right from the inception and be aware of security considerations in its architecture. In an ML system, an attacker might gain unauthorized access to the system and tamper with the input or output data or with the ML model. It is important to preemptively identify the threats and make the system secure.

Privacy refers to protecting the data and identity of the users of the software. For example, in an ML system, the user’s personal demographic data (such as age, name, address, mobile number, email address or their interests) could be collected as input and used to make predictions or recommendations based on the user profile. Such data could be used to identify the user, which would violate their privacy. As with security, the system should be designed with privacy in mind from the start.

Steps in security and privacy analysis

Security and privacy analysis involves the following steps:

Step 1: Identify the different entities in your software, including databases, ML models, input data, preprocessed data, output predictions, APIs, flows of data across entities, user interfaces, etc.

Step 2: Draw a data flow diagram (DFD) of the system that shows the data flowing between its entities. One can use the freely downloadable Microsoft Threat Modeling Tool, other threat modeling tools, or general diagram tools such as draw.io.

A simple DFD drawn using Microsoft Threat Modeling Tool
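Before drawing the diagram, it can help to write the entities and flows from Steps 1 and 2 down as data. The following is a minimal sketch and not the format of any particular tool: the class names and the example entities ("Prediction API", "Feature store") are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class ElementType(Enum):
    EXTERNAL_ENTITY = "external entity"
    PROCESS = "process"
    DATA_STORE = "data store"


@dataclass(frozen=True)
class Element:
    name: str
    kind: ElementType


@dataclass(frozen=True)
class DataFlow:
    source: str
    target: str
    data: str


@dataclass
class DataFlowDiagram:
    elements: dict = field(default_factory=dict)
    flows: list = field(default_factory=list)

    def add_element(self, name, kind):
        self.elements[name] = Element(name, kind)

    def add_flow(self, source, target, data):
        # Refuse flows between elements that were never declared.
        for end in (source, target):
            if end not in self.elements:
                raise KeyError(f"unknown element: {end}")
        self.flows.append(DataFlow(source, target, data))

    def flows_touching(self, name):
        # All flows that enter or leave the named element.
        return [f for f in self.flows if name in (f.source, f.target)]


# Hypothetical ML serving system, for illustration only.
dfd = DataFlowDiagram()
dfd.add_element("User", ElementType.EXTERNAL_ENTITY)
dfd.add_element("Prediction API", ElementType.PROCESS)
dfd.add_element("ML model", ElementType.PROCESS)
dfd.add_element("Feature store", ElementType.DATA_STORE)
dfd.add_flow("User", "Prediction API", "input data")
dfd.add_flow("Prediction API", "Feature store", "preprocessed data")
dfd.add_flow("Prediction API", "ML model", "preprocessed data")
dfd.add_flow("ML model", "Prediction API", "output predictions")
```

Listing the flows that touch one element, for example dfd.flows_touching("ML model"), gives a quick checklist of the data paths to examine for that element in the later steps.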

Step 3: Based on the DFD, identify the threats using a threat modeling methodology such as CIA (confidentiality, integrity, availability) or STRIDE. STRIDE is a threat classification model first developed by Microsoft. It groups threats into six categories: Spoofing (an attacker impersonates a legitimate user or component to gain access to your system), Tampering (they modify data or ML models in your system), Repudiation (they can plausibly deny having accessed or tampered with your system, for example because of missing logs), Information disclosure (privacy breach or data leak), Denial of service (they make one of your resources, such as an API or service, unavailable) and Elevation of privilege (the attacker gains unauthorized superuser access).

Identifying threats in each software component, API, data flow, etc. using the STRIDE methodology
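One common way to apply STRIDE to a DFD is "STRIDE-per-element", where each type of diagram element is checked against only the categories that usually apply to it. The sketch below follows the commonly cited version of that chart; the function name and the example components are hypothetical, and the mapping should be treated as a starting checklist rather than a complete analysis.

```python
STRIDE = {
    "S": "Spoofing",
    "T": "Tampering",
    "R": "Repudiation",
    "I": "Information disclosure",
    "D": "Denial of service",
    "E": "Elevation of privilege",
}

# Categories usually considered for each DFD element type.
# Assumption: repudiation is listed for data stores, although it mainly
# matters when the store holds logs or audit data.
APPLICABLE = {
    "external entity": "SR",
    "process": "STRIDE",
    "data store": "TRID",
    "data flow": "TID",
}


def candidate_threats(components):
    """Return (component name, threat category) pairs to review.

    components is an iterable of (name, element type) tuples.
    """
    return [
        (name, STRIDE[letter])
        for name, kind in components
        for letter in APPLICABLE[kind]
    ]


# Hypothetical components of an ML system, for illustration only.
components = [
    ("User", "external entity"),
    ("Prediction API", "process"),
    ("Model registry", "data store"),
    ("API to model", "data flow"),
]

for name, threat in candidate_threats(components):
    print(f"{name}: {threat}")
```

Each printed pair is a question to answer, such as "can the model registry be tampered with?", not a confirmed threat.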

The Microsoft Threat Modeling Tool can also generate a report of the different security risks in the system, based on the DFD and the various components.

Extract from a risk assessment report generated by the Microsoft Threat Modeling Tool

Step 4: For privacy analysis, similarly identify the threats to each of the components in the DFD using a privacy methodology such as TRIM (Transfer, Retention, Inference, Minimization) or LINDDUN (which stands for linkability, identifiability, non-repudiation, detectability, disclosure of information, unawareness and non-compliance).

Step 5: Based on the previous analysis, list all the identified security and privacy threats to the system. Evaluate each threat by its likelihood (for example, how easy it is for the attacker to gain access) and its impact (for example, whether a leak of the customer’s data would cause a minor or a major problem). The OWASP risk rating methodology can be used to evaluate the threats. The OWASP risk rating calculator, based on this methodology, provides a consistent way to evaluate different kinds of security risks and get a risk vector for each one.
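The arithmetic behind the OWASP risk rating can be sketched in a few lines. In that methodology each factor is scored from 0 to 9, the likelihood and impact scores are averages of their factors, each average is mapped to LOW, MEDIUM or HIGH, and the two levels are combined into an overall severity. The sketch below is a simplification: it takes a single list of impact factors, whereas the methodology estimates technical and business impact separately, and the example scores are made up.

```python
from statistics import mean


def level(score):
    """Map a 0-9 score to the LOW / MEDIUM / HIGH bands."""
    if not 0 <= score <= 9:
        raise ValueError("score must be between 0 and 9")
    if score < 3:
        return "LOW"
    if score < 6:
        return "MEDIUM"
    return "HIGH"


# Overall severity, indexed by (likelihood level, impact level).
SEVERITY = {
    ("LOW", "LOW"): "Note",
    ("MEDIUM", "LOW"): "Low",
    ("HIGH", "LOW"): "Medium",
    ("LOW", "MEDIUM"): "Low",
    ("MEDIUM", "MEDIUM"): "Medium",
    ("HIGH", "MEDIUM"): "High",
    ("LOW", "HIGH"): "Medium",
    ("MEDIUM", "HIGH"): "High",
    ("HIGH", "HIGH"): "Critical",
}


def rate(likelihood_factors, impact_factors):
    """Return (likelihood score, impact score, overall severity)."""
    likelihood = mean(likelihood_factors)
    impact = mean(impact_factors)
    return likelihood, impact, SEVERITY[(level(likelihood), level(impact))]


# Made-up scores for a hypothetical "training data leak" threat.
likelihood_factors = [5, 2, 7, 1, 3, 6, 9, 2]  # threat agent + vulnerability factors
impact_factors = [9, 7, 5, 8]                  # e.g. confidentiality, integrity, ...

likelihood, impact, severity = rate(likelihood_factors, impact_factors)
print(f"likelihood={likelihood:.3f} impact={impact:.2f} severity={severity}")
```

Sorting the threat list by severity then gives a rough order in which to handle the threats in the next step.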

Step 6: For each identified and analyzed threat, decide whether to mitigate the risk, avoid it, or accept it. For example, a privacy threat might be mitigated by applying a method such as differential privacy, or simply by masking user-identifiable data such as phone numbers and email addresses.
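Both mitigations mentioned in Step 6 can be illustrated briefly. The sketch below masks email addresses and long digit runs with simple regular expressions, and adds Laplace noise to a count, which is the basic mechanism behind differential privacy for counting queries. The patterns, placeholder strings and function names are assumptions made for this example; real data usually needs more careful patterns, and a production system would more likely rely on a vetted differential privacy library.

```python
import random
import re

# Deliberately simple patterns: they will miss some formats.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
DIGIT_RUN_RE = re.compile(r"\d{6,}")


def mask_identifiers(text):
    """Replace email addresses and long digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return DIGIT_RUN_RE.sub("[NUMBER]", text)


def noisy_count(true_count, epsilon, rng=random):
    """Laplace mechanism for a counting query (sensitivity 1).

    A smaller epsilon means more noise and stronger privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = 1.0 / epsilon  # sensitivity / epsilon
    # The difference of two exponential samples with mean `scale`
    # follows a Laplace distribution with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


print(mask_identifiers("Contact joy@example.com or 9876543210"))
print(noisy_count(120, epsilon=0.5))
```

Masking removes direct identifiers from stored or logged data, while the noisy count limits what a published aggregate reveals about any single user.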

Hardening and vulnerability assessment

Hardening and vulnerability assessment (VA) are two other practices in designing a secure system.

Hardening refers to making changes to the software system in such a way as to reduce the possibility of attacks by external agents, for example by blocking unused ports or encrypting the data flows across different components of the system.

Vulnerability assessment (VA) refers to using tools to scan the ports and other components of the system or application, and highlighting the issues discovered.
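To make the idea of scanning ports concrete, here is a minimal sketch of a TCP port check using only the Python standard library. It is far simpler than a real vulnerability scanner, the function name is hypothetical, and it should only be run against hosts you are authorized to test.

```python
import socket


def open_ports(host, ports, timeout=0.5):
    """Return the ports on `host` that accept a TCP connection."""
    found = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            # connect_ex returns 0 on success and an error code otherwise.
            if sock.connect_ex((host, port)) == 0:
                found.append(port)
    return found
```

For example, open_ports("127.0.0.1", range(8000, 8010)) lists which of those local ports are listening. Any port that shows up here but is not needed by the application is a candidate for closing during hardening.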

Secure Coding

Secure coding refers to writing code in a way that does not introduce unnecessary security vulnerabilities into the system.

Some guidelines for secure coding in Python are available here.

Tools such as bandit can perform static code analysis on Python code and highlight places where the code can be made more secure. A list of static code analysis tools for different languages is here.
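As an illustration of the kind of issue secure coding avoids, consider parsing a user-supplied configuration string. Calling eval on such input would execute arbitrary code, and calls to eval are among the patterns that static analyzers such as bandit report. The sketch below uses ast.literal_eval from the standard library instead, which only accepts plain Python literals. The function name and the hyperparameter scenario are hypothetical.

```python
import ast


def parse_hyperparameters(text):
    """Parse a user-supplied literal such as "{'lr': 0.01, 'epochs': 5}".

    Using eval(text) here would run any expression the user sends.
    ast.literal_eval only accepts literals (numbers, strings, dicts, ...).
    """
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError) as exc:
        raise ValueError("input is not a plain Python literal") from exc
    if not isinstance(value, dict):
        raise ValueError("expected a dict of hyperparameters")
    return value


print(parse_hyperparameters("{'lr': 0.01, 'epochs': 5}"))

try:
    parse_hyperparameters("__import__('os').system('echo hacked')")
except ValueError as err:
    print("rejected:", err)
```

Running bandit over a project, for example with bandit -r followed by the source directory, is one way to find similar patterns across a codebase.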

References

Some useful references on security in ML projects, and in software projects generally, are as follows:

References on Security and threat modeling

  1. Microsoft Threat Modeling Tool https://www.microsoft.com/en-us/securityengineering/sdl/threatmodeling
  2. STRIDE cards https://owasp.org/www-project-cornucopia/ and https://www.riskio.co.uk/stride
  3. Threat modeling card game https://www.microsoft.com/en-in/download/details.aspx?id=20303
  4. Microsoft. Threat Modeling AI/ML Systems and Dependencies https://docs.microsoft.com/en-us/security/engineering/threat-modeling-aiml
  5. Microsoft. Failure Modes in Machine Learning. https://docs.microsoft.com/en-us/security/engineering/failure-modes-in-machine-learning
  6. Threat Modeling of Connected Cars using STRIDE. https://alissaknight.medium.com/threat-modeling-of-connected-cars-using-stride-e8184764eb0a

References on Privacy

LINDDUN: Downloads of useful resources including user guides and tutorials https://www.linddun.org/downloads

LINDDUN: a privacy threat analysis framework. https://people.cs.kuleuven.be/~kim.wuyts/LINDDUN/LINDDUN.pdf

Books on security and secure coding

Some good books for an introduction to security and secure coding in software projects (these also apply to ML projects) are as follows:

  1. Threat Modeling: Designing for Security by Adam Shostack
  2. Threat Modeling: A Practical Guide for Development Teams by Izar Tarandach, Matthew J. Coles. O’Reilly
  3. Secure Coding: Principles and Practices by Mark G. Graff, Kenneth R. van Wyk. O’Reilly
  4. Java Coding Guidelines: 75 Recommendations for Reliable and Secure Programs (SEI Series in Software Engineering) by Fred Long, Dhruv Mohindra. Addison Wesley

(Note: Books on secure coding guidelines for Java and C++ are available, but a book on Python secure coding guidelines does not seem to be available yet.)

References on Hardening and Vulnerability Assessment

References on Secure Coding in Python (static code analysis)

  1. How To Secure Python Web App Using Bandit https://soshace.com/how-to-secure-python-web-app-using-bandit/
  2. Bandit (Free tool for security analysis in Python code) https://bandit.readthedocs.io/en/latest/
  3. Fortify Static code analysis tool: Visual Studio Code extension https://marketplace.visualstudio.com/items?itemName=fortifyvsts.fortify-extension-for-vs-code

Joy Bose

Working as a software developer in machine learning projects. Interested in the intersection between technology, machine learning, society and well being.