Netflix Data Anonymization: Analysing The Challenges


Task: A 2000-word report on the research and findings of the potential security and privacy issues faced when dealing with data and analytics. The associated constraints from a regulatory, governance and ethical perspective will also be presented.


Executive Summary
In the past few years, there has been an immense rise in the number of publicly released databases that contain micro-data about individuals, such as choices and preferences, movie ratings, and transactional details. There are several reasons behind this trend. Improving recommendation systems and sharing details about government schemes and policies through the justice system are among the main ones. However, the practice also has negative aspects. It increases the probability of information security risks and attacks, because the private details of the users may be compromised as a result, and users may not wish to share their personal details and opinions with the entire world. This has led to the development of several techniques to anonymize a database before it is released, each designed to mitigate specific attacks and threats on such data sets. This report gives a brief overview of the Netflix data challenge and the controversy surrounding it. It also provides an overview of the anonymization techniques available today and their downsides. Finally, it presents the ethical issues surrounding facial recognition systems.

Netflix Data Anonymization

Anonymization is not merely a process of removing unique identifiers of the users, such as social security numbers or email addresses. It must also account for auxiliary information about these users that can be captured from other sources and used to de-anonymize the records. One such de-anonymization technique is to identify and compute joins between the released database and those other sources.

One of the best-known examples is the de-anonymization of the Netflix dataset. The company published more than 100 million movie ratings provided by roughly 480,000 customers. It did so as a challenge to the public to build a better recommendation system than the one the organization was already using.

In the Netflix dataset, anonymization was done by removing the personal details of the users and replacing their names with random numbers to protect data privacy. Arvind Narayanan and Vitaly Shmatikov, two researchers from the University of Texas at Austin, carried out the de-anonymization of the Netflix data. They compared the ratings and timestamps in the dataset with the public information available on the Internet Movie Database (IMDb) web portal. Where a subscriber had also rated movies publicly on IMDb, the overlap could be enough to identify that subscriber's record.
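The comparison of ratings and timestamps against public information can be illustrated with a small linkage sketch. This is a simplified illustration, not the researchers' actual algorithm: the records, the tolerances (`rating_tol`, `day_tol`), and the `min_gap` rule for deciding that a match stands out are all assumptions made for the example.

```python
from datetime import date

# Hypothetical "anonymized" release: random subscriber ID -> {movie: (rating, date)}.
released = {
    1041: {"Movie A": (5, date(2005, 3, 1)), "Movie B": (2, date(2005, 3, 9)),
           "Movie C": (4, date(2005, 4, 2))},
    2377: {"Movie A": (4, date(2005, 1, 15)), "Movie D": (3, date(2005, 2, 1))},
    5120: {"Movie B": (1, date(2005, 6, 20)), "Movie C": (5, date(2005, 6, 22)),
           "Movie E": (3, date(2005, 7, 1))},
}

# Hypothetical auxiliary knowledge about one person, e.g. from public reviews.
auxiliary = {"Movie A": (5, date(2005, 3, 3)), "Movie C": (4, date(2005, 4, 5))}


def match_score(record, aux, rating_tol=1, day_tol=14):
    """Count auxiliary ratings that approximately match the released record."""
    score = 0
    for movie, (aux_rating, aux_day) in aux.items():
        if movie not in record:
            continue
        rating, day = record[movie]
        close_rating = abs(rating - aux_rating) <= rating_tol
        close_date = abs((day - aux_day).days) <= day_tol
        if close_rating and close_date:
            score += 1
    return score


def link(released, aux, min_gap=1):
    """Return the best-matching ID, or None if it does not clearly stand out."""
    ranked = sorted(((match_score(r, aux), rid) for rid, r in released.items()),
                    reverse=True)
    (best, best_id), (second, _) = ranked[0], ranked[1]
    return best_id if best - second >= min_gap else None


candidate = link(released, auxiliary)
print(candidate)
```

Only two approximately known ratings are needed here to single out one record, which is why replacing names with random numbers offers little protection on its own.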

High-dimensional sparse dataset
In a variety of applications, data sets are represented as high-dimensional sparse vectors, and the Netflix dataset is one such case. Other examples include text processing and computer vision applications. For activities such as clustering and classification, it is essential to compute similarity scores between objects. However, it is not easy to develop similarity measures for such data sets, because only a small and usually unknown subset of the features is relevant to the specific activity being carried out. For example, in drug discovery, chemical compounds can be represented by sparse features such as their 3D properties, yet only a few of these properties are actually significant in determining the nature and target receptors of a compound. In text classification, a document is represented as a sparse bag of words, and only a subset of the words is usually enough to distinguish between topics.
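As a minimal sketch of a similarity score on sparse data, the example below stores each document as a bag of words that holds only the words that occur, and compares documents with cosine similarity. Cosine similarity is one common choice and is an assumption here; the source does not name a specific measure, and the sample sentences are made up.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Sparse representation: only words that actually occur are stored."""
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dict-like counts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the smaller vector
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


doc_a = bag_of_words("the movie was great great")
doc_b = bag_of_words("great movie")
doc_c = bag_of_words("stock prices fell")

sim_ab = cosine_similarity(doc_a, doc_b)  # share "great" and "movie"
sim_ac = cosine_similarity(doc_a, doc_c)  # share no words
print(sim_ab, sim_ac)
```

Because only shared non-zero features contribute to the dot product, two records that overlap on a handful of rare features can score as highly similar even when the full feature space is very large.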

Data Anonymization
Data confidentiality is an issue that has existed for a long time. It is not just a new-age problem caused by the digitization of data sets. One example is the confidentiality of citizens' information in the US census, which was recognized as a concern in the mid-1800s. Because the census data was publicly accessible, the Census Bureau began removing personally identifying data properties from it.

The bureau continued to anonymize its data with whatever techniques it could implement to reduce the risk of revealing people's identities. These included rounding, adding random values to uniquely identifying pieces of information, suppressing and swapping cells, sampling, and various others. The bureau started using computers in the early 1950s, and computerized anonymization techniques followed in the 1960s. Computer systems also brought enhanced filtering and cross-tabulation. Because analysts were permitted to query the data sets, these capabilities gave them the ability to de-anonymize the data. For example, an analyst who knew certain properties of a person, such as date of birth, gender, and zip code, could query the data using this information (Data Anonymization Approach for Data Privacy, 2015, pp. 1534-1539). Attributes that together identify a record almost uniquely are referred to as pseudo-identifiers (PsID), also called quasi-identifiers. The combination of zip code, date of birth, and gender is the pseudo-identifier in this example (Guo and Zhang, 2014, pp. 1852-1867). Analysts could similarly query the data sets using other attributes and ranges of values.
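The querying described above can be sketched as follows. The records, the attribute names, and the `query` helper are hypothetical and serve only to illustrate how a combination of zip code, date of birth, and gender can narrow a data set down to one row.

```python
from collections import Counter

# Hypothetical records with names already removed.
records = [
    {"zip": "78701", "dob": "1980-02-14", "gender": "F", "diagnosis": "asthma"},
    {"zip": "78701", "dob": "1980-02-14", "gender": "F", "diagnosis": "flu"},
    {"zip": "78701", "dob": "1975-07-30", "gender": "M", "diagnosis": "diabetes"},
    {"zip": "78705", "dob": "1991-11-02", "gender": "F", "diagnosis": "migraine"},
]

PSID = ("zip", "dob", "gender")  # the pseudo-identifier from the example above


def psid_key(record):
    """Project a record onto its pseudo-identifier attributes."""
    return tuple(record[attr] for attr in PSID)


def query(records, **known):
    """Return every record that matches all attribute values the analyst knows."""
    return [r for r in records if all(r[a] == v for a, v in known.items())]


# How many records share each pseudo-identifier combination?
group_sizes = Counter(psid_key(r) for r in records)
unique_rows = [r for r in records if group_sizes[psid_key(r)] == 1]

# An analyst who knows one person's zip code, date of birth, and gender:
hits = query(records, zip="78705", dob="1991-11-02", gender="F")
print(len(unique_rows), hits)
```

Any record whose pseudo-identifier combination occurs only once is exposed to this kind of query, even though no name or social security number is stored.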

One of the earliest methods of anonymizing data is to add noise to it. For instance, noise may be added to numerical values so that the exact result is not returned. A random integer could be added to each answer to modify the result. The method works only when the noise is random and cannot be predicted. However, the technique has its own share of flaws. For example, if every answer to the same query carries a fresh zero-mean random noise sample, the analyst can repeat the query several times and average the answers to eliminate the noise.
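The averaging flaw can be demonstrated with a short simulation. The true count, the noise range of -50 to 50, and the number of repetitions are arbitrary assumptions chosen for illustration.

```python
import random
import statistics

TRUE_COUNT = 1000        # hypothetical exact answer the noise is meant to hide
rng = random.Random(0)   # fixed seed so the sketch is repeatable


def noisy_query():
    """Answer the same query with a fresh zero-mean random integer added."""
    return TRUE_COUNT + rng.randint(-50, 50)


def averaging_attack(n_queries):
    """Repeat the query and average the answers to cancel out the noise."""
    return statistics.mean(noisy_query() for _ in range(n_queries))


single_answer = noisy_query()
estimate = averaging_attack(10_000)
print(single_answer, estimate)
```

A single answer may be off by up to 50, while the average of many answers converges on the true count, because independent zero-mean noise cancels out as the number of samples grows.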

Several anonymization techniques have been developed since then. Two of the most popular are K-Anonymity and Differential Privacy. However, the practical application of these techniques remains limited (Prakash, 2018, pp. 2083-2087).

The usability of differential privacy as an anonymization technique is also poor. This may surprise some readers, because market giants such as Apple, Google, and Uber have declared that they have used the technique successfully, and the assessment appears to contradict those claims.

In differential privacy, deployments that honor composability and use a value of ε that would generally be regarded as strongly anonymous (i.e. ε < 1) allow only a small number of queries, perhaps a few tens. Certain mechanisms, such as the ones used by Google and Apple, avoid composability by making assumptions about the correlation between attributes and about the repeated use of the same attribute. This permits an unlimited number of queries; however, the noise involved is high, and the ability to observe correlations between ranges of values is lost.
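The limit on the number of queries follows from sequential composition, where the ε spent by each query adds up against a fixed total. The sketch below assumes a counting query with sensitivity 1, the Laplace mechanism, and a hypothetical `PrivateCounter` class; the per-query ε of 0.05 is an arbitrary choice, and none of this describes how any particular company deploys the technique.

```python
import random


class PrivateCounter:
    """Answers a counting query with Laplace noise under a fixed total budget."""

    def __init__(self, true_count, total_epsilon, seed=0):
        self.true_count = true_count
        self.remaining = total_epsilon
        self.rng = random.Random(seed)

    def query(self, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Small tolerance guards against floating-point drift in the budget.
        if epsilon > self.remaining + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon  # sequential composition: epsilons add up
        scale = 1.0 / epsilon      # sensitivity of a counting query is 1
        # The difference of two exponentials with mean `scale` is Laplace(0, scale).
        noise = (self.rng.expovariate(1.0 / scale)
                 - self.rng.expovariate(1.0 / scale))
        return self.true_count + noise


counter = PrivateCounter(true_count=1000, total_epsilon=1.0)
answers = [counter.query(0.05) for _ in range(20)]  # 20 * 0.05 uses the budget

try:
    counter.query(0.05)
    refused = False
except RuntimeError:
    refused = True
print(len(answers), refused)
```

With a total ε of 1 and 0.05 spent per query, the budget is used up after 20 answers. Spending less per query allows more queries but increases the noise scale of each answer, which is the usability trade-off described above.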

Challenges in Data Anonymity
Large Volumes of Data: Anonymizing large volumes of data raises particular issues. The results of anonymization are difficult to validate at this scale, and the process itself is slow. As a result, the overall success rate for anonymizing large data sets is poor.

Dynamic Data: Stability of the data is one of the basic requirements for anonymization. Dynamic data sets change frequently and are therefore highly unstable (Praveena and Smys, 2016, pp. 12-14), which can make them very difficult to anonymize.

Streaming Data: Two main factors make streaming data extremely difficult to anonymize. The first is the incompleteness of the data sets, because data arrives in the system in discrete pieces and without a proper structure. The second is the reorganization of the data sets, which degrades the effectiveness of the anonymization results.

Challenges related to the web-based data: Various models and algorithms have been developed to anonymize relational data. However, these models and methods cannot be applied to the data now obtained from social media and other Web 2.0 channels, which is even more challenging to anonymize than relational data.

Modeling an attacker's background knowledge is also harder for such data than for relational data sets. In social media and Web 2.0 data, many attributes may be used to identify records, such as neighborhood graphs, labels of vertices and edges, induced subgraphs, and combinations of these.

Essential preparation needed for the Data Analytics Contest
The data involves several kinds of attributes. Personal identifiers are attributes that uniquely identify an individual, for example social security numbers. Quasi-identifiers are attributes that can be linked with external data to identify an individual. Sensitive attributes, such as healthcare information, are those that individuals do not want disclosed. The remaining attributes are non-sensitive.

Different techniques are used to preserve the privacy of these attributes, such as generalization, suppression, anatomization, permutation, and perturbation. In addition to these techniques, K-Anonymity and Differential Privacy should also be used.
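Two of the techniques listed above, generalization and suppression, can be combined to reach K-Anonymity, as in the following minimal sketch. The rows, the choice of zip code and age as quasi-identifiers, and the generalization rules (three-digit zip prefix, ten-year age bands) are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical rows: zip and age are quasi-identifiers, condition is sensitive.
rows = [
    {"zip": "78701", "age": 34, "condition": "asthma"},
    {"zip": "78702", "age": 37, "condition": "flu"},
    {"zip": "78705", "age": 31, "condition": "diabetes"},
    {"zip": "78751", "age": 62, "condition": "migraine"},
]


def generalize(row):
    """Generalization: coarsen the zip code to a prefix and the age to a decade."""
    decade = row["age"] // 10 * 10
    return {
        "zip": row["zip"][:3] + "**",
        "age": f"{decade}-{decade + 9}",
        "condition": row["condition"],
    }


def k_anonymize(rows, k):
    """Generalize every row, then suppress rows whose group is smaller than k."""
    generalized = [generalize(r) for r in rows]
    sizes = Counter((r["zip"], r["age"]) for r in generalized)
    return [r for r in generalized if sizes[(r["zip"], r["age"])] >= k]


safe_rows = k_anonymize(rows, k=2)
for row in safe_rows:
    print(row)
```

After generalization, the first three rows share the same quasi-identifier values and are kept, while the fourth row remains alone in its group and is suppressed. The result is 2-anonymous with respect to zip code and age, at the cost of losing one record and some precision.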

Ethical Issues in Image Recognition for Law Enforcement
There are ethical issues involved in the use of facecams for security and surveillance, because they pose threats to civil liberties. The deployment of smart CCTV cameras in public places violates the basic right to privacy. It is therefore essential to establish a trade-off between security and ethical compliance.

Facial recognition is a widely used technique with the potential to limit anonymity. With the growth of social media networks, it is now publicly deployed. The face of an individual is a unique identifier, so the use of facial recognition raises issues of privacy violation (Introna, 2005). Certain principles should be followed to avoid the ethical issues involved. These include collecting consent from the owner of the data, along with consent for the use, sharing, and access of the information.

Facial recognition is also used for policing purposes, usually in controlled spaces. Even so, ethical principles may be violated through risks to privacy and exposure of information.

Data sets are essential today, because business functions rely on the data and information available to them. However, the proper use and storage of this information is also the responsibility of the business organizations, which are accountable for it. Data anonymization and the protection of ethical and privacy principles must be carried out as required by the nature of the data sets involved. A combination of different data anonymization techniques may be used to carry out the process effectively.

Data Anonymization Approach for Data Privacy, 2015. International Journal of Science and Research (IJSR), 4(12), pp.1534-1539.

Praveena, A. and Smys, D., 2016. Anonymization in Social Networks: A Survey on the issues of Data Privacy in Social Network Sites. International Journal of Engineering and Computer Science, 5(3), pp.12-14.

