I.J. Intelligent Systems and Applications, 2018, 11, 11-19
Published Online November 2018 in MECS (http://www.mecs-press.org/)
DOI: 10.5815/ijisa.2018.11.02

Medical Big Data Classification Using a Combination of Random Forest Classifier and K-Means Clustering

R. Saravana kumar
Professor, Department of Computer Science and Engineering, Dayananda Sagar Academy of Technology and Management, Bangalore
E-mail: saravanaram0516@gmail.com

P. Manikandan
Professor, Computer Science and Engineering Department, Malla Reddy Engineering College for Women, Maisammaguda, Secunderabad, Telangana
E-mail: parasumani001@gmail.com

Received: 19 February 2018; Accepted: 20 May 2018; Published: 08 November 2018

Abstract—The random forest classifier is an efficient classification algorithm that has recently been used in many big data applications. Medical big data comprises large, complex data such as patient records, medicine details, and staff data. Such massive data is not easy to classify and handle efficiently: traditional methods such as the Linear Classifier K-Nearest Neighbor and Random Clustering K-Nearest Neighbor give lower accuracy and carry a risk of deleted or missing data. Hence we adapt random forest classification with K-means clustering to address the complexity and accuracy issues. In this paper, the medical big data is first partitioned into clusters by the k-means algorithm based upon some dimension. Each cluster is then classified by the random forest classifier, which generates decision trees and classifies the data based upon the specified criteria. Compared with existing systems, the experimental results indicate that the proposed algorithm increases accuracy.

Index Terms—Decision trees, k-means clustering, medical big data, random forest, classification.

I.  INTRODUCTION

Massive and complex data, usually measured in exabytes ($10^{18}$ bytes), are referred to as big data. Large volumes of sensitive data are used frequently in organizations in fields such as biomedicine, IT, and banking. Such data is difficult to manage, standardize [1], and secure [2, 3] using conventional databases and other data analysis tools. Medical big data involves many issues regarding structure, storage, and analysis. Several techniques have been followed [4-8] to increase accuracy, reduce cost, and improve the efficiency of big data processing. Big data work uses traditional classification algorithms such as decision trees, support vector machines, Naive Bayes, neural networks, and k Nearest Neighbors (kNN), as well as the Differential Evolution (DE) [9] algorithm, machine learning [10] algorithms, and Big Data Analytics (BDA) [11, 12].

Fast finding of the nearest samples, and selecting representative samples or eliminating certain samples, are the two traditional KNN approaches used in big data. For training, the earlier version of SVM-KNN [13, 14] has to compute the query distances, which is comparatively slow, and in big data the computational complexity is high. The task of partitioning a feature space into fuzzy classes is called fuzzy classification. The feature space is specified by fuzzy regions, which are maintained using fuzzy rules [15]. Feature subset selection and linear discriminant analysis are the two methods used in the neuro-fuzzy classifier; they are used to evaluate the important feature subsets. Hence, for training the neuro-fuzzy classifier, the characteristics of the data distribution are restored in the feature space [16]. The main drawback of this method is the large number of fuzzy rules it uses. The fuzzy neural classifier algorithm resulted in weak identification of data, and hence increased cost and time.

In this paper, we propose a combined clustering and classification technique to classify medical big data. The proposed technique jointly executes the k-means clustering method and the RF (Random Forest) classification method. Compared with other clustering algorithms, k-means clustering gives better performance, since it takes less time to partition high-dimensional datasets. RF is an efficient, non-parametric learning method that is easy to interpret and explain. At first, the k-means clustering method is used to separate the high-dimensional medical data into parts, where each partition is considered a cluster. The difference between each cluster member and the mean of the cluster value is
computed after the mean of each cluster is estimated. Then the clustered information is classified using the RF classification technique. This RF technique can handle datasets as large as required while giving the support needed to recognize the underrepresented class effectively. With the proposed RF approach, the big medical data will be reduced accurately and efficiently.

The rest of the paper is organized as follows. Section 2 briefly reviews the related works. Section 3 elaborates the training and testing process. The results achieved by the proposed technique and the related discussion are given in Section 4, and the paper is concluded in Section 5.

II.  RELATED WORKS

Gang Luo [17] introduced a system named Predict-ML for transforming big clinical data into several datasets used in various applications. The results were predicted automatically, and the main advantages of the system are less time and reduced cost. The predictive model can guide personalized medicine and clinical decision making. However, the software takes several years to build fully and hence is not easily affordable.

Mahardhika Pratama et al. [18] presented a new evolving interval type-2 fuzzy rule-based classifier (eT2Class) for more flexibility in dynamic data streams. Compared with state-of-the-art EFCs, that method produces more reliable classification rates while retaining a more compact and parsimonious rule base. Referring to the summarization and generalization power of the data streams, the initial data stream was pruned and the fuzzy rules were grown automatically. That technique achieved accuracy and reliability. However, prediction and management of big data remained complex.

P. Gutiérrez et al. [19] designed a KNN rule classifier based on GPU devices to overcome the dependency between dataset size and the GPU memory requirement. The method uses efficient CPU-GPU communication. It keeps GPU memory usage stable irrespective of dataset size and reduces the run time significantly, from hours to minutes. The design is best suited to lazy learning algorithms such as the KNN rule. The time complexity was reduced and the run-time performance was improved. However, computing the nearest distance for every training dataset was complex for big data.

Entesar Althagafy and M. Rizwan Jameel Qureshi [15] introduced a novel architecture to reduce the complexity of big data. Big data in IT companies includes large, complex data that was difficult to analyze efficiently. To improve quality of service (QoS), this system integrates the Amazon Web Service (AWS) remote cloud, Eucalyptus, and Hadoop. All incoming requests were accepted by the AWS remote cloud and intelligently forwarded to the most suitable Eucalyptus cloud. This architecture overcame the complexity and enhanced the performance against incoming requests from multiple clients; hence the time and cost complexity was reduced and the QoS was increased. Data security and privacy was the major limitation.

Zhang Yaoxue et al. [20] presented a survey on cloud computing and analyzed its related distributed computing technologies. Two promising computing paradigms, transparent computing and fog computing, were introduced to support the big data of IoT. The computational performance against big incoming data requests from multiple clients was improved, but security for sensitive data was weaker.

Yichuan Wang and Nick Hajli [21] introduced a big data analytics-enabled business value model. The concept of information lifecycle management (ILM) was used to explain all the phases of the information life cycle in big data architecture. By analyzing secondary data consisting of big data cases specifically in the healthcare context, the work also explores three path-to-value chains for reaching big data analytics success. It thus provides healthcare practitioners with a new methodology for analyzing big data for business transformation and offers a more detailed investigation of big data analytics implementation. Although it is flexible in dealing with big data, there was no proper method to retain and manage data efficiently.

P. Marcheschi [22] implemented the HL7 (Health Level Seven) CDA technique to reduce the challenges faced by big data in radiology and other healthcare organizations. Standard radiology benefited greatly from the presence of DICOM (Digital Imaging and Communications in Medicine), which allows developers to implement faster. The spread of the FHIR standard simplifies developers' work by making the process of document creation less abstract. The implementation is simplified by the presence of more templates, and its simplicity and completeness made it more successful. The lack of a "plug and play" solution that helps standardize data was the major drawback.

D. Dimitrov [23] introduced mIOT (medical Internet of Things) and big data for devices that track real-time health data, devices that auto-administer therapies, and devices that constantly monitor health indicators when a patient self-administers a therapy. Wireless devices can be implemented in smart phones, and the time spent by end users can be reduced by that method. Diagnosis can be made based upon the symptoms. However, the method cannot determine the exact health condition of users and so cannot be fully trusted. Accordingly, modifications were made by updating current data, and the method should also be made user friendly.

III.  PROPOSED METHOD

The training process and the testing process are the two processes introduced in this section. In the training process, the medical big data is first partitioned using K-means clustering. Then the partitioned datasets are randomly selected, and decision trees are generated for each dataset. In the testing process, each test sample is classified from the values generated in the decision trees.
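
As a rough illustration of the K-means partitioning used in the training process, the following pure-Python sketch alternates between assigning each point to its nearest centroid (by squared Euclidean distance) and recomputing each centroid as the mean of its assigned points, stopping when no point changes cluster. The function name `kmeans`, the toy data, the seed, and the choice of initial centroids sampled from the data are assumptions made for illustration and are not taken from the paper.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(points, k, seed=0, max_iter=100):
    """Hypothetical helper: partition `points` into k clusters."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct points sampled from the data.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Data assignment: each point goes to the centroid at minimum squared distance.
        new_labels = [
            min(range(k), key=lambda i: squared_distance(centroids[i], p))
            for p in points
        ]
        if new_labels == labels:  # no point changed cluster, so stop
            break
        labels = new_labels
        # Centroid update: mean of all points assigned to each cluster.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Toy data: two well-separated groups of 2-D points.
data = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (10.0, 10.0), (11.0, 11.0), (12.0, 12.0)]
labels, centroids = kmeans(data, k=2)
```

Because the starting centroids are random, real data may need several runs with different seeds, keeping the run with the smallest total distance.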
[Figure 1: flowchart. Training process: Medical big data -> K-means clustering -> Training sample set -> Randomization -> Sample subsets 1..n -> Decision trees 1..n. Testing process: Test sample -> Decision trees 1..n -> Class A, Class B, ..., Class n -> Majority voting -> Final class.]

Fig.1. Random forest Classification using K-means Clustering
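
The sample-subset and voting stages of Fig. 1 can be illustrated with a small pure-Python sketch. Each "tree" here is only a one-feature decision stump trained on a bootstrap sample, standing in for the unpruned decision trees of a real random forest, and no random predictor selection is shown. The helper names (`bootstrap_sample`, `train_stump`, `majority_vote`), the toy data, and the seed are assumptions for illustration, not the authors' implementation.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one sample subset per tree)."""
    return [rng.choice(rows) for _ in rows]

def train_stump(rows):
    """Hypothetical stand-in for a decision tree: one threshold on one feature.

    rows are (feature_value, label) pairs. Candidate thresholds are midpoints
    between consecutive distinct feature values; the one with the fewest
    misclassified rows wins.
    """
    xs = sorted({x for x, _ in rows})
    overall = Counter(y for _, y in rows).most_common(1)[0][0]
    # Baseline: no split, always predict the overall majority label.
    best = (sum(y != overall for _, y in rows), None, overall, overall)
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = Counter(y for x, y in rows if x <= threshold)
        right = Counter(y for x, y in rows if x > threshold)
        left_label = left.most_common(1)[0][0]
        right_label = right.most_common(1)[0][0]
        errors = (sum(left.values()) - left[left_label]
                  + sum(right.values()) - right[right_label])
        if errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    if threshold is None:
        return lambda x: overall
    return lambda x: left_label if x <= threshold else right_label

def majority_vote(trees, x):
    """Final class = the label predicted by the most trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

rng = random.Random(42)
# Toy training set: one feature, class "A" for small values, "B" for large ones.
training = [(1, "A"), (2, "A"), (3, "A"), (4, "A"),
            (10, "B"), (11, "B"), (12, "B"), (13, "B")]
forest = [train_stump(bootstrap_sample(training, rng)) for _ in range(15)]
final_class_low = majority_vote(forest, 2.5)
final_class_high = majority_vote(forest, 11.5)
```

Individual stumps can be wrong when their bootstrap sample happens to miss one class, which is why the final class is taken from the vote across all trees rather than from any single tree.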
The classification of medical big data using random forest is shown in Fig. 1. Using k-means clustering, the medical big data is partitioned into a number of groups called clusters. Then the cluster data is classified with the RF classification method, which starts with decision tree generation. For each split, random forest selects a random subset of predictors. Even on smaller sample set sizes, this random selection further reduces variance and hence increases accuracy. Here, the random values are generated for each decision tree, and finally the best class is selected by majority voting.

A.  Training Process

The medical big data contains massive and complex data such as management information, staff details, and medicinal information. The random forest classifier is used to classify such data accurately and efficiently. In the training process, clustering is first done to partition the big data for the selection of random subsets. The decision trees are generated from the random subsets.

B.  K-means clustering

1)  Choosing K: K-means clustering uses repeated refinement to produce the best result. The inputs are the dataset and the number of clusters K. The K-means algorithm determines the clusters and the data set labels for a particular pre-chosen K. To find the number of clusters in the data, the K-means clustering algorithm is run for a range of K values and the results are compared. In general, there is no method for determining the exact value of K, but an accurate estimate can be obtained using certain methods. The most commonly used method is to compare, across different values of K, the mean distance between data points and their cluster centroid. Increasing K always decreases this metric, because the distance to data points is reduced whenever the number of clusters is increased, and it reaches the extreme of zero when K equals the number of data points. K can also be determined using methods such as cross-validation, information criteria, the information theoretic jump method, the silhouette method, and the G-means algorithm. The distribution of data points across groups can be monitored for each K, which provides insight into how the algorithm is splitting the data. Each data point in the data set is a collection of features. The algorithm starts with the selection of initial cluster centroids, which are either randomly generated or selected from the data set. The algorithm then iterates between two steps:

Step 1: Data assignment: Each cluster in the dataset is defined by its centroid. In the data assignment step, each data point is assigned to the centroid at minimum distance, based on the squared Euclidean distance. More formally, if $c_i$ is a centroid in the set $C$ of centroids, then each data point $x$ is assigned to a cluster based on
$$\underset{c_i \in C}{\arg\min}\ \operatorname{dist}(c_i, x)^2 \tag{1}$$

Here, in equation 1, the minimum distance between a data point and the centroids is calculated using the Euclidean distance. Let $S_i$ be the set of data points assigned to the $i$th cluster centroid.

Step 2: Centroid update: The cluster centroids are recomputed. In equation 2, this is done by taking the mean of all data points assigned to that centroid's cluster.

$$c_i = \frac{1}{|S_i|} \sum_{x_i \in S_i} x_i \tag{2}$$

The two steps are repeated until no data points change clusters. This algorithm is guaranteed to converge to a result. The result may not be the best possible outcome, meaning that assessing more than one run of the algorithm with randomized starting centroids may give a better output.

C.  Random Forest Algorithm

Input: Training datasets, test sample
Output: Majority vote from all individual trained trees after classification.
Let $T_{trees}$ be the number of trees to build.
For each of $T_{trees}$ iterations:

1.   Select a new bootstrap sample from the training set.
2.   Grow an unpruned tree on the bootstrap sample.
3.   At each internal node, randomly select $m_{try}$ predictors and determine the best split using only these predictors.

The best split of the dataset is chosen based upon the regression or classification of data. $T_{trees}$ is selected efficiently by building trees until the entire dataset is split. For a better split, $m_{try}$ is the number of predictors, and they are randomly selected from the dataset. Bagging is the special case of random forest where $m_{try} = k$, with $k$ the total number of predictors. Random forest retains many benefits of decision trees, such as handling missing values and both continuous and categorical predictors. The forest model is built first, and the forest is then used to make predictions. In general, the random forest can be specified as follows.

Step 1: At first the dataset is created as

$$S = \begin{bmatrix} f_{x1} & f_{y1} & \cdots & Z_{1} \\ \vdots & \vdots & & \vdots \\ f_{xn} & f_{yn} & \cdots & Z_{n} \end{bmatrix} \tag{3}$$

where $f_{x1}$ is feature $x$ of the 1st sample and $f_{yn}$ is feature $y$ of the $n$th sample. Likewise, the samples of the $Z$th feature are determined. The randomized training dataset is depicted in equation 3.

Step 2: Next, the random data subsets are created based upon the criteria. These subsets are called decision trees.

Decision tree 1:

$$S_1 = \begin{bmatrix} f_{x12} & f_{y12} & \cdots & Z_{12} \\ \vdots & \vdots & & \vdots \\ f_{x35} & f_{y35} & \cdots & Z_{35} \end{bmatrix} \tag{4}$$

Decision tree 2:

$$S_2 = \begin{bmatrix} f_{x2} & f_{y2} & \cdots & Z_{2} \\ \vdots & \vdots & & \vdots \\ f_{x20} & f_{y20} & \cdots & Z_{20} \end{bmatrix} \tag{5}$$

Decision tree $t$:

$$S_t = \begin{bmatrix} f_{x4} & f_{y4} & \cdots & Z_{4} \\ \vdots & \vdots & & \vdots \\ f_{x12} & f_{y12} & \cdots & Z_{12} \end{bmatrix} \tag{6}$$

where $S_1, S_2, \ldots, S_t$ are subsets drawn from the dataset $S$. Equations 4, 5, and 6 show that $S_1$ is decision tree 1, $S_2$ is decision tree 2, and $S_t$ is decision tree $t$; hence $t$ decision trees are generated from the given dataset.

Step 3: The final decision values are evaluated from the decision trees. Using random forest, all the values are compared for the test sample and the majority vote is determined. The majority vote thus obtained is taken as the required class label.

D.  Testing Process

Whether the decision is true or false is predicted initially by two types of values. If the set contains samples of different patterns, the entire data set is split into single-pattern subsamples. When the set contains only samples belonging to a single pattern, the decision tree is composed of a leaf. The feature selection is improved by a purity measure at each node. The general dataset can be converted into decision trees as in equation 7.

$$F(x) = F(x, a) = \sum_{m=1}^{M} c_m\, I(x \in R_m) \tag{7}$$
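
Equation 7 expresses a decision tree as a piecewise-constant function: the input space is split into $M$ regions $R_m$, and the tree returns the constant $c_m$ of the region that contains $x$. The sketch below evaluates such a function in pure Python for a one-dimensional input. Reading $c_m$ as the value or class code stored at leaf $m$ and $I(\cdot)$ as the indicator function is the usual interpretation and is assumed here, since this excerpt does not define the symbols; the interval regions and the name `tree_predict` are hypothetical.

```python
def indicator(condition):
    """I(.) in equation 7: 1 if the condition holds, else 0."""
    return 1 if condition else 0

def tree_predict(x, regions):
    """Evaluate F(x) = sum over m of c_m * I(x in R_m) for 1-D interval regions.

    regions is a list of (low, high, c_m) tuples describing disjoint
    half-open intervals [low, high), so at most one indicator is 1.
    """
    return sum(c_m * indicator(low <= x < high) for low, high, c_m in regions)

# Hypothetical tree with M = 3 leaves; c_m is a numeric class code.
leaves = [
    (float("-inf"), 5.0, 0),   # R_1: x < 5       -> class 0
    (5.0, 9.0, 1),             # R_2: 5 <= x < 9  -> class 1
    (9.0, float("inf"), 2),    # R_3: x >= 9      -> class 2
]
prediction = tree_predict(6.5, leaves)  # only R_2 contains 6.5, so F(x) = c_2
```

In a forest, one such function is evaluated per tree and the resulting class codes are combined by majority vote.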
                                           
                                                                                                  
                                                                                                  
Copyright © 2018 MECS