Preparation for the 7COM1073 Foundations of Data Science Assignment: To complete the tasks set in this piece of work, you need to load the data using Pandas and change the labels: change ‘positive’ to the value 1 and ‘negative’ to the value 0.
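As a minimal sketch of this preparation step: the file name, the feature columns and the name of the label column are not given in this brief, so the in-memory CSV and the column name `label` below are hypothetical stand-ins for the real data file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the assignment data; with the real file this
# would be a call such as pd.read_csv(<path to the provided file>).
csv_text = """f1,f2,label
0.5,1.2,positive
1.1,0.3,negative
0.9,2.4,positive
"""
df = pd.read_csv(io.StringIO(csv_text))

# Map the string labels to integers: positive -> 1, negative -> 0.
# "label" is an assumed column name, not one taken from the brief.
df["label"] = df["label"].map({"positive": 1, "negative": 0})

print(df["label"].tolist())
```

Using `map` with an explicit dictionary turns any unexpected label into `NaN`, which makes stray or misspelt values easy to spot before modelling.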
Divide the data set into a training set (from now on referred to as training set (I)) and a test set: write Python code so that the first 500 rows of the original data comprise training set (I); the remaining rows of the original dataset form the test set. Use Python code to check and report how many data points are labelled 0 (negative) in the training set and the test set, respectively, and how many data points are labelled 1 (positive) in the training set and the test set, respectively. (4 marks)
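The split and the label counts could be sketched as follows. The data here are synthetic: the total of 768 rows, the eight feature columns and the column name `label` are assumptions made only so that the example runs on its own.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the relabelled data (row count and column names
# are assumptions, not taken from the brief).
rng = np.random.default_rng(0)
n_rows = 768
df = pd.DataFrame(rng.normal(size=(n_rows, 8)),
                  columns=[f"f{i}" for i in range(1, 9)])
df["label"] = rng.integers(0, 2, size=n_rows)

# First 500 rows form training set (I); the remaining rows form the test set.
train_I = df.iloc[:500]
test = df.iloc[500:]

for name, part in [("training set (I)", train_I), ("test set", test)]:
    # reindex guarantees both classes appear even if one has no members.
    counts = part["label"].value_counts().reindex([0, 1], fill_value=0)
    print(f"{name}: {counts.loc[0]} negative (0), {counts.loc[1]} positive (1)")
```

Positional slicing with `iloc` keeps the original row order, which is what "the first 500 rows" requires; a shuffled split would not satisfy this task.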
Task 2: PCA Analysis on the training set (10 marks)
a) Normalise the training set and the test set using StandardScaler() (Hint: the parameters should come from the training set only) (2 marks).
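The hint in a) can be illustrated with a short sketch on synthetic arrays (the shapes and values are assumptions). The scaler is fitted on the training set only and the same mean and standard deviation are then reused for the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the training and test feature matrices.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(500, 8))
X_test = rng.normal(loc=5.0, scale=2.0, size=(268, 8))

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # mean and std learned here only
X_test_std = scaler.transform(X_test)        # reuses the training mean and std

print(X_train_std.mean(axis=0).round(6))
print(X_test_std.mean(axis=0).round(3))
```

The test-set means will generally not be exactly zero after scaling. That is expected, because no test-set statistics were used.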
b) Perform a PCA analysis on the training data set (I) and plot a scree plot to report the variance captured by each principal component (3 marks).
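One possible shape for the scree plot is sketched below on synthetic data (the eight features and their scales are assumptions). It plots the explained variance ratio of each component; plotting the raw `explained_variance_` values instead would be an equally reasonable reading of the task.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for training set (I), standardised before PCA.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 8)) * np.arange(1, 9)
X_train_std = StandardScaler().fit_transform(X_train)

# Keep all components so the scree plot covers the full spectrum.
pca = PCA()
pca.fit(X_train_std)
ratios = pca.explained_variance_ratio_

fig, ax = plt.subplots()
components = np.arange(1, len(ratios) + 1)
ax.plot(components, ratios, marker="o")
ax.set_xlabel("Principal component")
ax.set_ylabel("Explained variance ratio")
ax.set_title("Scree plot")
# In an interactive session, plt.show() displays the figure.
```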
c) Plot two subplots in one figure: in one subplot, project the training set onto the space of the first two principal components and colour the training data points according to their class; in the other subplot, project the training set onto the space of the third and fourth principal components and again colour the data points according to their class (5 marks).
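A sketch of the two-panel figure follows, using synthetic data and labels (both assumptions). It colours the training set by class in both panels, which is one reading of the task; the colour names and marker size are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for training set (I) and its 0/1 labels.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 8))
y_train = (X_train[:, 0] + X_train[:, 2] > 0).astype(int)

X_train_std = StandardScaler().fit_transform(X_train)
scores = PCA().fit_transform(X_train_std)  # column k holds PC(k+1) scores

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
# Left panel: PC1 vs PC2; right panel: PC3 vs PC4 (zero-based column pairs).
for ax, (i, j) in zip(axes, [(0, 1), (2, 3)]):
    for cls, colour in [(0, "tab:blue"), (1, "tab:orange")]:
        mask = y_train == cls
        ax.scatter(scores[mask, i], scores[mask, j],
                   color=colour, s=10, label=f"class {cls}")
    ax.set_xlabel(f"PC{i + 1}")
    ax.set_ylabel(f"PC{j + 1}")
    ax.legend()
fig.tight_layout()
```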
Task 3: Perform classification using the logistic regression model with a regularisation term (13 marks)
a) In your report, describe the model you have used, including (6 marks):
What is the cost function? You need to give a mathematical expression describing it.
Which optimization algorithm has been used in your code?
Which regularisation term have you used?
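For orientation only, one common way of writing a regularised logistic regression cost is given below. It assumes an L2 (ridge) penalty, labels y_i in {0, 1}, m training points, weights w, intercept b and regularisation strength lambda; none of these choices is fixed by this brief, and your report should describe the model you actually used.

```latex
\hat{y}_i = \sigma\!\left(\mathbf{w}^\top \mathbf{x}_i + b\right),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}

J(\mathbf{w}, b) =
  -\frac{1}{m} \sum_{i=1}^{m}
    \Big[\, y_i \log \hat{y}_i + (1 - y_i) \log\!\left(1 - \hat{y}_i\right) \Big]
  + \frac{\lambda}{2m} \lVert \mathbf{w} \rVert_2^2
```

If scikit-learn's `LogisticRegression` is used with its default settings, the penalty is L2 and the solver is `lbfgs`. The library parameterises the trade-off through `C`, an inverse regularisation strength, so its objective is a rescaled form of the expression above rather than an identical one.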
b) Define your own function ([num1, index1, num2, index2]=misPatterns(predictions, labels)) using Python. The inputs of this function should be the predictions and labels in the test set, and the outputs of this function should be the number (num1) of misclassified patterns whose label is 1 but which were given a prediction of 0 and their indices (index1) in the test set, and the number (num2) of misclassified patterns whose label is 0 but which were given a prediction of 1 and their indices (index2) in the test set. (4 marks)
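One way such a function might look is sketched below. It assumes that "indices" means zero-based positions within the test set and that both inputs are array-like sequences of 0s and 1s of equal length; the toy inputs are made up for illustration.

```python
import numpy as np


def misPatterns(predictions, labels):
    """Count and locate the two kinds of misclassification.

    Returns [num1, index1, num2, index2], where index1 holds the positions
    with label 1 but prediction 0, and index2 holds the positions with
    label 0 but prediction 1.
    """
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    index1 = np.where((labels == 1) & (predictions == 0))[0]  # false negatives
    index2 = np.where((labels == 0) & (predictions == 1))[0]  # false positives
    return [len(index1), index1, len(index2), index2]


# Toy usage with made-up predictions and labels.
num1, index1, num2, index2 = misPatterns(predictions=[1, 0, 0, 1, 1],
                                         labels=[1, 1, 0, 0, 1])
print(num1, index1.tolist(), num2, index2.tolist())
```

Converting both inputs with `np.asarray` lets the function accept lists, NumPy arrays or pandas Series. With a pandas Series the returned positions are positional, not the original DataFrame index.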
c) Train the model on the training set and report the performance on the test set using the precision score and results obtained using the misPatterns function you have defined in b). (3 marks)
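A hedged sketch of training and evaluation is shown below. The data are synthetic, and the model relies on scikit-learn's default L2 penalty with an arbitrary `max_iter`; these are assumptions, not requirements of the task. The misclassification counts are computed inline here, and in your own code they would come from your misPatterns function.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 768 rows, 8 features, labels driven by two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(768, 8))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=768) > 0).astype(int)
X_train, y_train = X[:500], y[:500]
X_test, y_test = X[500:], y[500:]

# Scale with training-set statistics only.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# Default settings: L2 penalty; C is the inverse regularisation strength.
model = LogisticRegression(C=1.0, max_iter=1000)
model.fit(X_train_std, y_train)
pred = model.predict(X_test_std)

precision = precision_score(y_test, pred)
false_neg = np.flatnonzero((y_test == 1) & (pred == 0))  # label 1, predicted 0
false_pos = np.flatnonzero((y_test == 0) & (pred == 1))  # label 0, predicted 1
print(f"precision on test set: {precision:.3f}")
print(f"label 1 predicted 0: {len(false_neg)}, label 0 predicted 1: {len(false_pos)}")
```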
Task 4: Investigate how the number of features in the training dataset affects the model performance on the validation set (18 marks)
a) Divide the training dataset (I) into a smaller training set (II) and a validation set using train_test_split and report the number of points in each set. Usually, 20%-30% of the data points in the whole training set are used as validation data. The exact ratio is your choice. (2 marks)
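The split in a) might be sketched like this. The 25% validation ratio, the fixed `random_state` and the use of `stratify` are illustrative choices, not requirements, and the arrays are synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for training set (I): 500 rows, 8 features.
rng = np.random.default_rng(0)
X_train_I = rng.normal(size=(500, 8))
y_train_I = rng.integers(0, 2, size=500)

# Hold out 25% of training set (I) as validation data; stratify keeps the
# class proportions similar in both parts.
X_train_II, X_val, y_train_II, y_val = train_test_split(
    X_train_I, y_train_I, test_size=0.25, random_state=0, stratify=y_train_I
)

print("training set (II):", len(X_train_II))
print("validation set:", len(X_val))
```

Fixing `random_state` makes the reported set sizes and any later scores reproducible between runs.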
b) Use the training set (II) to train 8 logistic regression models, with 8 different feature sets. That is:
the first one uses the 1st feature only; the second one uses the 1st and 2nd features; the third one uses the 1st, 2nd and 3rd features; the fourth one uses the first 4 features; and so on. In other words, the nth feature set should make use of the first n features.
Measure the precision score on both the training set (II) and the validation set. Report the results by plotting them in a figure: that is, a plot of the precision score against the number of features used in each model. There should be two curves in this figure: one for the training set (II) and the other for the validation set (10 marks).
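The loop over the eight nested feature sets could be sketched as below. The data are synthetic, the 25% validation split is an illustrative choice, and `zero_division=0` is an added safeguard for models that predict no positives at all.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for training set (I): 500 rows, 8 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = (X[:, 0] + X[:, 3] + rng.normal(scale=0.7, size=500) > 0).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
scaler = StandardScaler().fit(X_tr)  # statistics from training set (II) only
X_tr, X_val = scaler.transform(X_tr), scaler.transform(X_val)

n_features = np.arange(1, 9)
train_scores, val_scores = [], []
for n in n_features:
    # The nth model sees only the first n feature columns.
    model = LogisticRegression(max_iter=1000).fit(X_tr[:, :n], y_tr)
    train_scores.append(
        precision_score(y_tr, model.predict(X_tr[:, :n]), zero_division=0))
    val_scores.append(
        precision_score(y_val, model.predict(X_val[:, :n]), zero_division=0))

fig, ax = plt.subplots()
ax.plot(n_features, train_scores, marker="o", label="training set (II)")
ax.plot(n_features, val_scores, marker="s", label="validation set")
ax.set_xlabel("Number of features")
ax.set_ylabel("Precision score")
ax.legend()
```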
c) Report the number of features you consider best for this work and explain why you chose it. (3 marks)
d) Use the selected number of features to train the model and report the performance on the test set (3 marks).
Task 5: Writing a report (10 marks)
In this report, you need to summarise what you have done, which model you have used, what results you have obtained, and your findings and conclusions. The highest marks will be given to reports with outstanding presentation and clarity, no significant grammatical, spelling or structural errors, and an outstanding level of analysis with critical evaluation and reflection where required.
7COM1073 Foundations of Data Science Assignment assesses the following module Learning Outcomes:
- Have knowledge and understanding of the fundamental mathematical ideas behind data science;
- Have knowledge and understanding of relevant computational algorithms and the fundamentals of probability, information and statistical methods;
- Have knowledge and understanding of producing and appreciating algorithmic definitions to provide useful data science analysis;
- Be able to apply basic mathematical skills to simple data science problems;
- Be able to implement algorithms and programs to analyze a given dataset;
- Be able to make sensible recommendations of the nature of the data analyzed.