7COM1018 Data Mining Assignment
A data set of text is provided on Canvas. Analyse this data using the WEKA toolkit and tools introduced within this module. You should attempt the following tasks:
1. Look at the individual texts and their target classes in a text editor, and find 5 keywords that you believe are indicative of one target class and 5 keywords indicative of the other. This step should be done manually, not using WEKA. Explain why you chose these keywords.
2. Convert the text dataset into TWO different datasets in ARFF format. Explain the conversion techniques and parameters that you have used, and justify your choice of parameters for each dataset. For example, you may make one dataset with stemming enabled, and another without stemming.
3. Perform some pre-processing on the two datasets. Explain what pre-processing you did, why you think it is helpful, and what impact it has on the data.
4. For each dataset, produce a table and a graph of classification performance against training set size for the following three classifiers: decision tree (J48), Naïve Bayes and Support Vector Machine. For the Support Vector Machine you will have to determine the kernel, kernel parameter and C.
5. For each dataset, train a decision tree on the entire dataset and look at its representation. Which keywords is the decision tree using? Are they the same as those you selected in (1)?
6. Write a conclusion covering at least:
a. how well each classifier performs on classifying the text documents
b. the keywords which identify the two classes
c. which of your choice of conversion techniques and parameters from (2) you think was most effective.
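To illustrate what the conversion task produces, the following is a minimal Python sketch of turning raw texts into two ARFF files, one with stemming and one without. It is not WEKA's own converter: the functions `crude_stem`, `tokenize` and `to_arff`, the attribute name `doc_class`, and the three example texts are all hypothetical, and the suffix stripper is a toy stand-in for a real stemming algorithm.

```python
import re

def crude_stem(token):
    """Toy suffix stripper (assumption: a stand-in, not a real stemming algorithm)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def tokenize(text, stem=False):
    """Lower-case the text and keep alphabetic tokens only."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens] if stem else tokens

def to_arff(docs, labels, relation, stem=False):
    """Build an ARFF string with one numeric word-count attribute per word."""
    tokenized = [tokenize(d, stem) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    classes = sorted(set(labels))
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {w} numeric" for w in vocab]
    # "doc_class" contains an underscore, so it cannot collide with a word token.
    lines.append("@attribute doc_class {" + ",".join(classes) + "}")
    lines += ["", "@data"]
    for toks, label in zip(tokenized, labels):
        counts = [str(toks.count(w)) for w in vocab]
        lines.append(",".join(counts + [label]))
    return "\n".join(lines) + "\n"

# Hypothetical example texts and classes, not taken from the assignment data.
docs = [
    "Cheap loans offered, apply today",
    "Meeting moved to Monday",
    "Loan offers ending today",
]
labels = ["spam", "ham", "spam"]

plain = to_arff(docs, labels, "texts_plain")
stemmed = to_arff(docs, labels, "texts_stemmed", stem=True)
```

Comparing the number of `@attribute` lines in `plain` and `stemmed` shows the effect of the parameter choice: stemming merges word variants such as "loan" and "loans" into one attribute, so the stemmed dataset has fewer attributes.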
Explain the steps you have taken to complete each task in your report. Screenshots should be used sparingly. In total, your report should contain no more than 10 pages.
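
The table of classification performance against training set size can be pictured with the sketch below. It is written in plain Python and assumes a small hand-written multinomial Naïve Bayes in place of the WEKA classifiers, which should still be used for the actual experiments. The names `train_nb`, `predict_nb` and `learning_curve`, and the toy documents, are hypothetical.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model on tokenised documents."""
    word_counts = {}
    class_counts = Counter(labels)
    vocab = set()
    for toks, y in zip(docs, labels):
        word_counts.setdefault(y, Counter()).update(toks)
        vocab.update(toks)
    return {"words": word_counts, "classes": class_counts, "vocab": vocab}

def predict_nb(model, toks):
    """Return the class with the highest log-posterior (add-one smoothing)."""
    total_docs = sum(model["classes"].values())
    vocab_size = len(model["vocab"])
    best, best_score = None, -math.inf
    for y, n_docs in sorted(model["classes"].items()):
        counts = model["words"][y]
        denom = sum(counts.values()) + vocab_size
        score = math.log(n_docs / total_docs)
        for t in toks:
            if t in model["vocab"]:  # words never seen in training are ignored
                score += math.log((counts[t] + 1) / denom)
        if score > best_score:
            best, best_score = y, score
    return best

def learning_curve(train, test, sizes):
    """Return (training set size, accuracy on test) for each requested size."""
    rows = []
    for n in sizes:
        subset = train[:n]
        model = train_nb([d for d, _ in subset], [y for _, y in subset])
        correct = sum(predict_nb(model, d) == y for d, y in test)
        rows.append((n, correct / len(test)))
    return rows

# Hypothetical toy data; classes alternate so every prefix contains both.
train = [
    ("cheap loan offer".split(), "spam"),
    ("meeting on monday".split(), "ham"),
    ("loan offer today".split(), "spam"),
    ("project meeting notes".split(), "ham"),
    ("cheap offer ends today".split(), "spam"),
    ("monday project review".split(), "ham"),
]
test = [
    ("cheap loan today".split(), "spam"),
    ("project review meeting".split(), "ham"),
]

rows = learning_curve(train, test, sizes=[2, 4, 6])
for n, acc in rows:
    print(f"{n}\t{acc:.2f}")
```

Each row is one line of the table and one point on the graph. On real data the training subsets should be drawn at random (ideally stratified by class) and the test set kept fixed across sizes, so that changes in accuracy reflect the amount of training data rather than the order of the file.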