                                  Course Code: SEEM4630                                                          Page 1 of 3
                                  Course Examinations Midterm, 2012-2013
                                  Course Code & Title : SEEM4630 E-Commerce Data Mining
                                  Time allowed        : 2 hours 0 minutes
                                  Student I.D. No.    : ....................          Seat No. : ...................
                                  The questions ask for explanations. The explanations should be concise descriptions of your
                                  understanding. Greater marks will be awarded for answers that are simple, short and concrete
                                  than for answers of a sketchy and rambling nature. Marks will be lost for giving information that
                                  is irrelevant to a question.
                                  Question 1 [30 marks] Data Preprocessing
                                     (a) What is the best distance (or similarity) measure for each of the following applications?
                                                 (1) calculate driving distance between two locations in Downtown New York;
                                                 (2) compare similar diseases with a set of medical test results as positive or negative;
                                                 (3) find similar web documents to a keyword query.
                                                                                                                              [9 marks]
                                     (b) For the following group of data: 200, 400, 800, 1000, 2000, 2200, normalize them with
                                                 min = 0 and max = 100.                                                       [6 marks]
                                     (c) For the above group of data, partition them into two bins by each of the following methods:
                                                 (1) equal-width partitioning,
                                                 (2) equal-frequency partitioning.
                                                                                                                                                                                                                                                              [6 marks]
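A hand calculation for parts (b) and (c) can be checked with a minimal Python sketch such as the one below. It assumes the usual min-max formula v' = (v - min) / (max - min) * (new_max - new_min) + new_min, equal-width bins that cut the value range into intervals of equal length, and equal-frequency bins that each hold the same number of sorted values. The function names are illustrative and are not part of the course material.

```python
def min_max_normalize(values, new_min=0.0, new_max=100.0):
    """Rescale values linearly so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def equal_width_bins(values, k):
    """Split the range [min, max] into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        # The maximum value is placed in the last bin instead of a (k+1)-th one.
        idx = min(int((v - lo) / width), k - 1)
        bins[idx].append(v)
    return bins


def equal_frequency_bins(values, k):
    """Split the sorted values into k bins with (nearly) the same number of items."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


data = [200, 400, 800, 1000, 2000, 2200]
normalized = min_max_normalize(data)
width_bins = equal_width_bins(data, 2)
frequency_bins = equal_frequency_bins(data, 2)
print(normalized)
print(width_bins)
print(frequency_bins)
```

Other binning conventions exist (for example, different handling of values that fall exactly on a bin boundary), so the sketch reflects one reasonable reading of the question.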
               (d) For the following two vectors, p = [1,1,0,0,0,0,1,0,0,0] and q = [0,1,0,0,0,0,1,0,1,0],
                    compute the following similarities:
                    (1) Simple Matching Similarity,
                    (2) Jaccard Similarity,
                    (3) Cosine Similarity.
                                                                                                        [9 marks]
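The three similarities in part (d) can be cross-checked with a short Python sketch. It assumes the standard definitions for binary vectors: simple matching counts both 1-1 and 0-0 agreements over all positions, Jaccard ignores 0-0 agreements, and cosine is the dot product divided by the product of the vector lengths. The helper names are illustrative only.

```python
import math


def match_counts(p, q):
    """Return (f11, f10, f01, f00): counts of each pairing of binary values."""
    f11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    f10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    f01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    f00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    return f11, f10, f01, f00


def simple_matching(p, q):
    f11, f10, f01, f00 = match_counts(p, q)
    return (f11 + f00) / (f11 + f10 + f01 + f00)


def jaccard(p, q):
    f11, f10, f01, _ = match_counts(p, q)
    return f11 / (f11 + f10 + f01)


def cosine(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


p = [1, 1, 0, 0, 0, 0, 1, 0, 0, 0]
q = [0, 1, 0, 0, 0, 0, 1, 0, 1, 0]
smc = simple_matching(p, q)
jac = jaccard(p, q)
cos = cosine(p, q)
print(smc, jac, cos)
```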
              Question 2 [30 marks] Decision Tree Induction
              Consider the training dataset shown in Table 1.
                                                     A    B    Class Label
                                                     0    1    c1
                                                     0    0    c2
                                                     1    1    c1
                                                     0    1    c1
                                                     1    0    c1
                                                     0    0    c2
                                                     1    1    c1
                                                     0    0    c2
                                                     1    0    c1
                                                     1    0    c2
                                      Table 1: A Training Dataset for Questions 2 and 3
               (a) Calculate the gain in the Gini index when splitting on attributes A and B, respectively.
                    Show your calculation details. According to the gain, which one will you choose as the first
                    attribute to split in the decision tree induction?                                [15 marks]
               (b) Calculate the gain in the misclassification error when splitting on attributes A and B,
                    respectively. Show your calculation details. According to the gain, which one will you
                    choose as the first attribute to split in the decision tree induction?             [15 marks]
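The gains asked for in Question 2 can be verified with a minimal Python sketch. It assumes the common definition of gain as the parent node's impurity minus the size-weighted average impurity of the child nodes, with Gini = 1 - sum of squared class fractions and misclassification error = 1 - largest class fraction. The `ROWS` list transcribes Table 1, and the function names are illustrative.

```python
from collections import Counter

# Table 1 transcribed as (A, B, class label).
ROWS = [
    (0, 1, "c1"), (0, 0, "c2"), (1, 1, "c1"), (0, 1, "c1"), (1, 0, "c1"),
    (0, 0, "c2"), (1, 1, "c1"), (0, 0, "c2"), (1, 0, "c1"), (1, 0, "c2"),
]


def gini(labels):
    """Gini index: 1 minus the sum of squared class fractions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def misclassification_error(labels):
    """Error rate when every record is assigned the majority class."""
    n = len(labels)
    return 1.0 - max(Counter(labels).values()) / n


def gain(rows, attr_index, impurity):
    """Parent impurity minus the size-weighted impurity of the children."""
    parent = impurity([r[-1] for r in rows])
    weighted_children = 0.0
    for value in sorted({r[attr_index] for r in rows}):
        subset = [r[-1] for r in rows if r[attr_index] == value]
        weighted_children += len(subset) / len(rows) * impurity(subset)
    return parent - weighted_children


for name, impurity in (("Gini", gini), ("Error", misclassification_error)):
    for attr_name, attr_index in (("A", 0), ("B", 1)):
        print(name, attr_name, round(gain(ROWS, attr_index, impurity), 4))
```

The attribute with the larger gain under a given impurity measure is the one that measure would select for the first split.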
            Question 3 [20 marks] Naive Bayes Classification
            Consider the training dataset shown in Table 1, and answer the following questions.
              (a) Compute the conditional probabilities P(A = 1|C = c1), P(A = 0|C = c1), P(B = 1|C = c1),
                  P(B = 0|C = c1), P(A = 1|C = c2), P(A = 0|C = c2), P(B = 1|C = c2), and
                  P(B = 0|C = c2).                                                        [12 marks]
              (b) Use the computed conditional probabilities to predict the class label for a test sample
                  (A = 1, B = 0) using the naive Bayes approach.                            [8 marks]
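For Question 3, a minimal Python sketch of the naive Bayes calculation is given below. It assumes plain relative-frequency estimates (no Laplace or m-estimate smoothing) and scores each class by prior times the product of the per-attribute conditional probabilities. Exact fractions are used so the results can be compared directly with a hand calculation; the function names are illustrative.

```python
from fractions import Fraction

# Table 1 transcribed as (A, B, class label).
ROWS = [
    (0, 1, "c1"), (0, 0, "c2"), (1, 1, "c1"), (0, 1, "c1"), (1, 0, "c1"),
    (0, 0, "c2"), (1, 1, "c1"), (0, 0, "c2"), (1, 0, "c1"), (1, 0, "c2"),
]


def prior(rows, label):
    """P(C = label), estimated as a relative frequency."""
    return Fraction(sum(1 for r in rows if r[2] == label), len(rows))


def conditional(rows, attr_index, value, label):
    """P(attribute = value | C = label), estimated without smoothing."""
    in_class = [r for r in rows if r[2] == label]
    matches = sum(1 for r in in_class if r[attr_index] == value)
    return Fraction(matches, len(in_class))


def score(rows, sample, label):
    """Prior times the product of conditionals (the naive independence assumption)."""
    result = prior(rows, label)
    for attr_index, value in enumerate(sample):
        result *= conditional(rows, attr_index, value, label)
    return result


sample = (1, 0)  # test sample (A = 1, B = 0)
scores = {label: score(ROWS, sample, label) for label in ("c1", "c2")}
predicted = max(scores, key=scores.get)
print(scores, predicted)
```

With smoothing, the zero-valued conditional probability in the table would become non-zero, so the numbers would differ from this unsmoothed sketch.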
            Question 4 [20 marks] Classification Accuracy and Cost
            Table 2 shows a confusion matrix and a cost matrix for a two-class problem. Calculate the
            following measures:
                              Predicted +   Predicted -              Predicted +   Predicted -
                     True +       100            40          True +        -1           100
                     True -        60           300          True -        20             0
                            (a) Confusion Matrix                     (b) Cost Matrix
                                Table 2: Confusion and Cost Matrices for Question 4
              (a) Accuracy,                                                                [4 marks]
              (b) Misclassification cost,                                                  [4 marks]
              (c) Precision,                                                               [4 marks]
              (d) Recall,                                                                  [4 marks]
              (e) F-measure.                                                               [4 marks]
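The five measures in Question 4 can be checked with a short Python sketch. It assumes that "+" is the positive class, that the misclassification cost is the total cost (each confusion-matrix count multiplied by the matching cost-matrix entry, then summed), and that F-measure means the balanced F1 score. The variable names are illustrative.

```python
# Counts from Table 2(a), with "+" treated as the positive class.
tp, fn = 100, 40   # True + row: predicted +, predicted -
fp, tn = 60, 300   # True - row: predicted +, predicted -

# Costs from Table 2(b), in the same cell order.
cost_tp, cost_fn = -1, 100
cost_fp, cost_tn = 20, 0

total = tp + fn + fp + tn
accuracy = (tp + tn) / total

# Total cost: sum over cells of count times cost.
total_cost = tp * cost_tp + fn * cost_fn + fp * cost_fp + tn * cost_tn

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_measure = 2 * precision * recall / (precision + recall)

print(accuracy, total_cost, precision, recall, f_measure)
```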
            -End-
