I.J. Intelligent Systems and Applications, 2018, 11, 11-19
Published Online November 2018 in MECS (http://www.mecs-press.org/)
DOI: 10.5815/ijisa.2018.11.02
Medical Big Data Classification Using a Combination of Random Forest Classifier and K-Means Clustering
R. Saravana kumar
Professor, Department of Computer Science and Engineering, Dayananda Sagar Academy of Technology and Management, Bangalore
E-mail: saravanaram0516@gmail.com
P. Manikandan
Professor, Department of Computer Science and Engineering, Malla Reddy Engineering College for Women, Maisammaguda, Secunderabad, Telangana
E-mail: parasumani001@gmail.com
Received: 19 February 2018; Accepted: 20 May 2018; Published: 08 November 2018
Abstract—The random forest classifier is an efficient classification algorithm that has recently been used in many big data applications. Medical big data comprises large, complex data such as patient records, medicine details, and staff data. Such massive data is not easy to classify and handle efficiently. Traditional methods such as Linear Classifier K-Nearest Neighbor and Random Clustering K-Nearest Neighbor give low accuracy and carry a risk of deleted or missing data. Hence we adapt random forest classification using the K-means clustering algorithm to overcome the complexity and accuracy issues. In this paper, the medical big data is first partitioned into clusters by the k-means algorithm based upon some dimension. Each cluster is then classified by the random forest classifier, which generates decision trees and classifies the data based upon the specified criteria. Compared with existing systems, the experimental results indicate that the proposed algorithm increases accuracy.

Index Terms—Decision trees, k-means clustering, medical big data, random forest, classification.

I. INTRODUCTION
Massive and complex data, usually measured in exabytes ($10^{18}$ bytes), are referred to as big data. Large volumes of sensitive data are used frequently in organizations in fields such as biomedicine, IT, and banking. Such data is difficult to manage, standardize [1], and secure [2, 3] using conventional databases and other data analysis tools. Medical big data raises many issues regarding structure, storage, and analysis. Several techniques have been proposed [4-8] to increase accuracy, reduce cost, and improve the efficiency of big data processing. Big data uses traditional classification algorithms such as decision trees, support vector machines, Naive Bayes, neural networks, and k-Nearest Neighbors (kNN), as well as the Differential Evolution (DE) [9] algorithm, Machine Learning [10] algorithms, and Big Data Analytics (BDA) [11, 12].
The two traditional KNN approaches used in big data are finding the nearest samples quickly and selecting representative samples or eliminating certain samples. For training, the earlier version of SVM-KNN [13, 14] has to compute the query distances, which is comparatively slow. In big data, however, the computational complexity is high. The task of partitioning a feature space into fuzzy classes is called fuzzy classification. The feature space is divided into fuzzy regions, which are maintained using fuzzy rules [15]. Feature subset selection and linear discriminant analysis are the two methods used in the neuro-fuzzy classifier. These methods are used to evaluate the important feature subsets. Hence, for training the neuro-fuzzy classifier, the characteristics of the data distribution are restored in feature space [16]. The use of many fuzzy rules is the main drawback of this method. The fuzzy neural classifier identifies data weakly, and hence cost and time increase.
In this paper, we propose a combined clustering and classification technique to classify medical big data. The proposed technique jointly executes the k-means clustering method and the RF (Random Forest) classification method. K-means clustering performs better than other clustering algorithms, since it takes less time to partition high-dimensional datasets. RF is an efficient non-parametric learning method that is easy to interpret and explain. First, the k-means clustering method separates the high-dimensional medical data into partitions, where each partition is considered a cluster. The difference between each cluster member and the mean of the cluster value is
Copyright © 2018 MECS I.J. Intelligent Systems and Applications, 2018, 11, 11-19
computed after the mean of each cluster is estimated. Then, the clustered information is classified using the RF classification technique. To effectively recognize the underrepresented class, the RF technique can handle datasets as large as required while giving the necessary support. With the proposed RF approach, the big medical data will be reduced accurately and efficiently.
The rest of the paper is organized as follows. Section 2 briefly reviews the related works. Section 3 elaborates the training and testing processes. The results achieved by the proposed technique and the related discussion are given in Section 4, and the paper is concluded in Section 5.

II. RELATED WORKS
Gang Luo [17] introduced a system named Predict-ML for transforming big clinical data into several datasets that were used in various applications. The results were predicted automatically; the main advantages of the system are less time and reduced cost. The predictive model can guide personalized medicine and clinical decision making. The software takes several years to build fully and hence is not easily affordable.
Mahardhika Pratama et al. [18] presented a new evolving interval type-2 fuzzy rule-based classifier (eT2Class) for more flexibility in dynamic data streams. Compared with state-of-the-art EFCs, that method produces more reliable classification rates while retaining a more compact and parsimonious rule base. Referring to the summarization and generalization power of data streams, the initial data stream was pruned and the fuzzy rules were grown automatically. That technique achieved accuracy and reliability. However, prediction and management of big data remained complex.
P. Gutiérrez et al. [19] designed a KNN rule classifier based on GPU devices to overcome the dependency between dataset size and the GPU memory requirement. This method provides efficient CPU-GPU communication. Irrespective of dataset size, the GPU memory usage stays stable, and the run time for addressing a dataset is reduced significantly, from hours to minutes. The design is best suited to lazy learning algorithms such as the KNN rule. The time complexity was reduced and the run-time performance was improved. However, computing the nearest distances for every training dataset remained complex for big data.
Entesar Althagafy and M. Rizwan Jameel Qureshi [15] introduced a novel architecture to reduce the complexity of big data. The big data in IT companies includes large, complex data that is difficult to analyze efficiently. To improve quality of service (QoS), this system integrates the Amazon Web Services (AWS) remote cloud, Eucalyptus, and Hadoop. All incoming requests are accepted by the AWS remote cloud and forwarded intelligently to the most suitable Eucalyptus cloud. This architecture overcame the complexity and enhanced the performance against incoming requests from multiple clients; hence time and cost complexity were reduced and QoS was increased. Data security and privacy were the major limitations.
Zhang Yaoxue et al. [20] presented a survey on cloud computing and analyzed its related distributed computing technologies. To support the big data of the IoT, two promising computing paradigms were introduced: transparent computing and fog computing. The computational performance against big incoming data requests from multiple clients was improved, but security for sensitive data was weak.
Yichuan Wang and Nick Hajli [21] introduced a big data analytics-enabled business value model. The concept of ILM explains all phases of the information life cycle in the big data architecture. By analyzing secondary data consisting of big data cases specifically in the healthcare context, the work also explores three path-to-value chains for reaching big data analytics success. It thus provides healthcare practitioners with a new methodology for analyzing big data for business transformation and offers a more detailed investigation of big data analytics implementation. Although the approach is flexible in dealing with big data, it has no proper method for retaining and managing data efficiently.
P. Marcheschi [22] implemented the HL7 (Health Level 7) CDA technique to reduce the challenges faced by big data in radiology and other healthcare organizations. Standard radiology benefited greatly from the presence of DICOM (Digital Imaging and Communications in Medicine), and developers can implement it faster. The dissemination of the FHIR standard simplifies developers' work by making document creation less abstract. The implementation is simplified by the presence of more templates, and it proved more successful because of its simplicity and completeness. The lack of a "plug and play" solution that helps standardize data was the major drawback.
D. Dimitrov [23] discussed the mIoT (medical Internet of Things) and big data: devices that track real-time health data, devices that auto-administer therapies, and devices that constantly monitor health indicators when a patient self-administers a therapy. Wireless devices can be implemented in smartphones, and that method can reduce the time spent by end users. Diagnosis can be based upon the symptoms. However, because it cannot determine the exact health condition of users, this method cannot be fully trusted. Accordingly, modifications were made by updating current data, and the method should also be made user friendly.

III. PROPOSED METHOD
This section introduces two processes: the training process and the testing process. In the training process, the medical big data is first partitioned using K-means clustering. Then the partitioned datasets are randomly selected, and decision trees are generated for each dataset. In the testing process, each test sample is classified from the values generated in the decision trees.
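The testing step described above, in which the outputs of the individual decision trees for one test sample are combined into a single label, can be illustrated with a minimal Python sketch. The function name `majority_vote` and the example labels are illustrative assumptions, not taken from the paper; ties go to the label seen first, a choice the paper does not specify.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the label predicted by the most trees.

    Hypothetical helper: ties go to the label that appears first,
    because Counter.most_common keeps first-seen order for equal counts.
    """
    if not tree_predictions:
        raise ValueError("need at least one tree prediction")
    return Counter(tree_predictions).most_common(1)[0][0]

# Three hypothetical trees vote on one test sample.
final_class = majority_vote(["Class A", "Class B", "Class A"])
```

Here `final_class` is the label chosen by two of the three trees.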
[Figure 1: flowchart. Training process: medical big data, K-means clustering, training sample set, randomization, sample subsets 1 to n, decision trees 1 to n. Testing process: the test sample is passed to each decision tree, the trees output Class A, Class B, ..., Class n, and majority voting yields the final class.]
Fig.1. Random forest Classification using K-means Clustering
The classification of medical big data using random forest is shown in Fig. 1. Using k-means clustering, the medical big data is partitioned into a number of groups called clusters. Then the cluster data is classified with the RF classification method, which starts with decision tree generation. For each split, random forest selects a random subset of predictors, even on smaller sample set sizes. Random selection further reduces variance and hence increases accuracy. Here, random values are generated for each decision tree, and finally the best class is selected by majority voting.

A. Training Process
The medical big data contains massive and complex data such as management information, staff details, and medicinal information. The random forest classifier is used to classify such data accurately and efficiently. In the training process, the big data is first partitioned by clustering so that random subsets can be selected. The decision trees are generated from the random subsets.

B. K-means clustering
1) Choosing K: K-means clustering uses repeated refinement to produce the best result. The dataset and the number of clusters K are given as input. The K-means algorithm determines the clusters and the dataset labels for a particular pre-chosen K. To find the number of clusters in the data, the K-means clustering algorithm is run for a range of K values and the results are compared. In general there is no method for determining the exact value of K, but an accurate estimate can be obtained using certain methods. The most commonly used method is to compare, across different values of K, the mean distance between data points and their cluster centroid. Increasing the number of clusters always reduces the distance to data points, so this metric always decreases as K increases, reaching the extreme of zero when K equals the number of data points. K can also be determined using methods such as cross-validation, information criteria, the information theoretic jump method, the silhouette method, and the G-means algorithm. Monitoring the distribution of data points across groups for each K provides insight into how the algorithm is splitting the data. Each data point in the data set is a collection of features. The algorithm starts by selecting initial cluster centroids, which are either randomly generated or selected from the data set. The algorithm then iterates between two steps:
Step 1: Data assignment: Each centroid defines one of the clusters. In the data assignment step, each data point is assigned to its nearest centroid, based on the squared Euclidean distance. More formally, if $c_i$ is the collection of centroids in set $C$, then each data point $x$ is assigned to a cluster based on
$$\underset{c_i \in C}{\arg\min}\; dist(c_i, x)^2 \tag{1}$$
In equation 1, the Euclidean distance is used to find the centroid at minimum distance from each data point. Let $S_i$ be the set of data point assignments for the $i$th cluster centroid.
Step 2: Centroid update: The cluster centroids are recomputed. As equation 2 shows, this is done by taking the mean of all data points assigned to that centroid's cluster.
$$c_i = \frac{1}{|S_i|} \sum_{x_i \in S_i} x_i \tag{2}$$
The two steps are repeated until no data points change clusters. This algorithm is guaranteed to converge to a result. The result may not be the best possible outcome, so running the algorithm more than once with randomized starting centroids may give a better output.

C. Random Forest Algorithm
Input: Training datasets, test sample
Output: Majority vote from all individual trained trees after classification.
Let $T_{trees}$ be the number of trees to build.
For each of $T_{trees}$ iterations:
1. Select a new bootstrap sample from the training set.
2. Grow an unpruned tree on the bootstrap sample.
3. At each internal node, randomly select $m_{try}$ predictors and determine the best split using only these predictors.
The best split of the dataset is chosen based upon the regression or classification of the data. $T_{trees}$ is selected efficiently by building trees until the entire dataset is split. For a better split, $m_{try}$ is the number of predictors, which are randomly selected from the dataset. Bagging is the special case of random forest where $m_{try} = k$, that is, all predictors are considered at each split. Random forest retains many benefits of decision trees, such as handling missing values and both continuous and categorical predictors. The forest model is built first, and the forest is then used to make predictions. In general, the random forest can be specified as follows.
Step 1: First the dataset is created as
$$S = \begin{bmatrix} f_{x1} & f_{y1} & \cdots & Z_{1} \\ \vdots & \vdots & & \vdots \\ f_{xn} & f_{yn} & \cdots & Z_{n} \end{bmatrix} \tag{3}$$
where $f_{x1}$ is feature $x$ of the 1st sample and $f_{yn}$ is feature $y$ of the $n$th sample; likewise, the samples of the $Z$th feature are determined. The randomized training dataset is depicted in equation 3.
Step 2: Next, random data subsets are created based upon the criteria. These subsets are called decision trees.
Decision tree 1:
$$S_1 = \begin{bmatrix} f_{x12} & f_{y12} & \cdots & Z_{12} \\ \vdots & \vdots & & \vdots \\ f_{x35} & f_{y35} & \cdots & Z_{35} \end{bmatrix} \tag{4}$$
Decision tree 2:
$$S_2 = \begin{bmatrix} f_{x2} & f_{y2} & \cdots & Z_{2} \\ \vdots & \vdots & & \vdots \\ f_{x20} & f_{y20} & \cdots & Z_{20} \end{bmatrix} \tag{5}$$
Decision tree $t$:
$$S_t = \begin{bmatrix} f_{x4} & f_{y4} & \cdots & Z_{4} \\ \vdots & \vdots & & \vdots \\ f_{x12} & f_{y12} & \cdots & Z_{12} \end{bmatrix} \tag{6}$$
where $S_1, S_2, \dots, S_t$ are drawn from $S$. Equations 4, 5, and 6 show that $S_1$ is decision tree 1, $S_2$ is decision tree 2, and $S_t$ is decision tree $t$; hence the decision trees are generated from the given dataset.
Step 3: The final decision values are evaluated from the decision trees, and using random forest all the values are compared with the test sample to determine the majority vote. The majority vote thus obtained is taken as the required class label.

D. Testing Process
Whether the decision is true or false is predicted initially by two types of values. If the set contains samples of different patterns, the entire data set is divided into single-pattern subsamples. When the set contains only samples belonging to a single pattern, the decision tree is composed of a leaf. Feature selection is improved by the purity measure of each node. As in equation 7, the general dataset can be converted into decision trees.
$$F(x) = F(x, a) = F\,y\,T(x) = \sum_{m=1}^{M} c_m\, I(x \in R_m) \tag{7}$$
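The two K-means steps of Section III.B (the assignment rule of equation 1 and the centroid update of equation 2) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy points, the fixed starting centroids, and the rule that an empty cluster keeps its previous centroid are all assumptions introduced here.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points (used in equation 1)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centroids):
    """Step 1: index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)),
            key=lambda i: squared_distance(centroids[i], x))
        for x in points
    ]

def update(points, labels, centroids):
    """Step 2: move each centroid to the mean of its points (equation 2)."""
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [x for x, label in zip(points, labels) if label == i]
        if not members:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(old)
            continue
        new_centroids.append(
            tuple(sum(x[d] for x in members) / len(members)
                  for d in range(len(old)))
        )
    return new_centroids

def k_means(points, centroids, max_iter=100):
    """Alternate the two steps until no point changes cluster."""
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
        centroids = update(points, labels, centroids)
    return labels, centroids

# Toy two-feature records; starting centroids are fixed for repeatability.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
labels, centroids = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

On this toy input the first two points share one cluster and the last two share the other. In practice the starting centroids would be randomized and the run repeated, as the text notes.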
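The training loop of Section III.C (bootstrap sample, random choice of $m_{try}$ predictors, one tree per iteration, majority vote at prediction time) can be sketched as below. This is a simplified illustration, not the paper's implementation: each "tree" here is a depth-1 decision stump chosen by Gini impurity rather than an unpruned tree, so the $m_{try}$ predictors are drawn once per tree instead of at every internal node. All function names, the toy data, and the seed are assumptions introduced for the example.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_stump(rows, labels, features):
    """Best single split (feature, threshold, left label, right label)."""
    majority = Counter(labels).most_common(1)[0][0]
    best = (None, None, majority, majority)  # fallback: predict the majority
    best_score = gini(labels)
    for f in features:
        for threshold in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left)
                     + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best_score = score
                best = (f, threshold,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    return best

def train_forest(rows, labels, n_trees, m_try, rng):
    """Build n_trees stumps, each on its own bootstrap sample."""
    forest = []
    n, n_features = len(rows), len(rows[0])
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]       # bootstrap sample
        features = rng.sample(range(n_features), m_try)  # m_try predictors
        forest.append(best_stump([rows[i] for i in idx],
                                 [labels[i] for i in idx], features))
    return forest

def predict(forest, row):
    """Majority vote over the predictions of all stumps."""
    votes = []
    for f, threshold, left_label, right_label in forest:
        if f is None or row[f] <= threshold:
            votes.append(left_label)
        else:
            votes.append(right_label)
    return Counter(votes).most_common(1)[0][0]

# Toy records with two features and two classes (illustrative only).
rows = [(1.0, 5.0), (2.0, 6.0), (1.5, 5.5), (8.0, 1.0), (9.0, 2.0), (8.5, 1.5)]
labels = ["A", "A", "A", "B", "B", "B"]
forest = train_forest(rows, labels, n_trees=15, m_try=1, rng=random.Random(0))
prediction = predict(forest, (1.2, 5.2))
```

Setting `m_try` to the total number of features makes every stump consider all predictors, which corresponds to the bagging special case mentioned in the text.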