Source: www.cse.ust.hk
Also adapted from sources
Tan, Steinbach, Kumar (TSK) Book:
Introduction to Data Mining
Weka Book: Witten and Frank (WF):
Data Mining
Han and Kamber (HK Book):
Data Mining
BI Book is denoted as “BI Chapter #...”
BI1.4 Business Intelligence Architectures
• Data Sources
– Gather and integrate data
– Challenges
• Data Warehouses and Data Marts
– Extract, transform and load data
– Multidimensional Exploratory Analysis
• Data Mining and Data Analytics
– Extraction of Information and Knowledge from Data
– Build Models of Prediction
• An example
– Building a telecom customer retention model
• Given a customer’s telecom behavior, predict if the customer will stay or leave
– KDDCUP 2010 Data
BI3: Data Warehousing
• Data warehouse:
– Repository for the data available for BI and Decision Support Systems
– Internal Data, external Data and Personal Data
– Internal data:
• Back office: transactional records, orders, invoices, etc.
• Front office: call center, sales office, marketing campaigns, etc.
• Web-based: sales transactions on e-commerce websites
– External:
• Market surveys, GIS systems
– Personal: data about individuals
– Metadata: data about a whole data set, systems, etc., e.g., the structure used in the data warehouse or the number of records in a data table.
• Data marts: subset of data warehouse for one function (e.g.,
marketing).
• OLAP (online analytical processing): set of tools that perform BI analysis and support decision making.
• OLTP (online transaction processing): transaction-related online tools, focusing on dynamic data.
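The relationship between a data warehouse, a data mart and an OLAP-style summary can be illustrated with a small sketch. Everything here is an assumption made for illustration: the records, the field names (`function`, `region`, `quarter`, `amount`) and the helper `roll_up` are invented and do not come from the BI book or from any real system.

```python
from collections import defaultdict

# Toy "warehouse" table. Every record below is invented for illustration.
warehouse = [
    {"function": "marketing", "region": "north", "quarter": "Q1", "amount": 120.0},
    {"function": "marketing", "region": "south", "quarter": "Q1", "amount": 80.0},
    {"function": "marketing", "region": "north", "quarter": "Q2", "amount": 150.0},
    {"function": "sales", "region": "north", "quarter": "Q1", "amount": 300.0},
    {"function": "sales", "region": "south", "quarter": "Q2", "amount": 200.0},
]

# Data mart: the subset of the warehouse that serves one function.
marketing_mart = [row for row in warehouse if row["function"] == "marketing"]


def roll_up(rows, dimension, measure="amount"):
    """Sum a measure along one dimension (a simple OLAP-style aggregation)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[dimension]] += row[measure]
    return dict(totals)


by_region = roll_up(marketing_mart, "region")
by_quarter = roll_up(marketing_mart, "quarter")
print(by_region)
print(by_quarter)
```

The same mart is summarized along two different dimensions without touching the underlying records, which is the kind of read-only analysis OLAP tools target, in contrast to the record-by-record updates of OLTP.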
Working with Data: BI Chap 7
• Let’s first consider an example dataset
• Univariate Analysis (7.1)
• Histograms
– Empirical density = e_h/m, where e_h is the number of values that belong to class h and m is the total number of values.
– X-axis = value range
– Y-axis = empirical density

Example dataset (independent variables: Outlook, Temp, Humidity, Windy; dependent variable: Play)

Outlook   Temp  Humidity  Windy  Play
sunny     85    85        FALSE  no
sunny     80    90        TRUE   no
overcast  83    86        FALSE  yes
rainy     70    96        FALSE  yes
rainy     68    80        FALSE  yes
rainy     65    70        TRUE   no
overcast  64    65        TRUE   yes
sunny     72    95        FALSE  no
sunny     69    70        FALSE  yes
rainy     75    80        FALSE  yes
sunny     75    70        TRUE   yes
overcast  72    90        TRUE   yes
overcast  81    75        FALSE  yes
rainy     71    91        TRUE   no
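The empirical density e_h/m can be computed directly from the Temp column of the example dataset. This is a minimal sketch: the class width of 5 and the function name `empirical_density` are assumptions, since the slide does not fix how the value range is split into classes.

```python
from collections import Counter

# Temp column of the example dataset, in row order.
temps = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]


def empirical_density(values, width):
    """Return {(low, high): e_h / m} for half-open classes [low, high)."""
    m = len(values)
    # e_h: number of values falling in the class that starts at `low`.
    counts = Counter((v // width) * width for v in values)
    return {(low, low + width): counts[low] / m for low in sorted(counts)}


density = empirical_density(temps, width=5)  # width=5 is an assumed choice

# Text histogram: X-axis = value range, Y-axis = empirical density.
for (low, high), p in density.items():
    print(f"[{low}, {high}): {p:.3f} " + "#" * round(p * len(temps)))
```

Because each value falls into exactly one class, the densities sum to 1. A different class width changes the shape of the histogram but not that property.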
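Measures of Dispersion
• Variance: $\bar{\sigma}^2 = \frac{1}{m-1} \sum_{i=1}^{m} (x_i - \bar{\mu})^2$, where $\bar{\mu}$ is the sample mean
• Standard deviation: $\bar{\sigma} = \left( \frac{1}{m-1} \sum_{i=1}^{m} (x_i - \bar{\mu})^2 \right)^{1/2}$
• Normal Distribution: interval $(\bar{\mu} \pm r\bar{\sigma})$
– r=1: contains approximately 68% of the observed values
– r=2: approximately 95% of the observed values
– r=3: nearly 100% (about 99.7%) of the values
– Thus, if a sample falls outside $(\bar{\mu} \pm 3\bar{\sigma})$, it may be an outlier
• Thm 7.1 (Chebyshev’s Theorem): Let r >= 1 and let (x1, x2, …, xm) be a group of m values. At least a fraction $(1 - 1/r^2)$ of the values will fall within the interval $(\bar{\mu} \pm r\bar{\sigma})$.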
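These dispersion measures can be checked on the Temp column of the example dataset. This is a minimal sketch using Python's standard `statistics` module, whose `variance` and `stdev` divide by m - 1 as in the formulas above. The helper `fraction_within` is a hypothetical name introduced here for illustration.

```python
import statistics

# Temp column of the example dataset, in row order.
temps = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]

mean = statistics.mean(temps)
var = statistics.variance(temps)  # sample variance, divides by m - 1
std = statistics.stdev(temps)     # square root of the sample variance


def fraction_within(values, center, spread, r):
    """Fraction of values inside the interval center +/- r * spread."""
    low, high = center - r * spread, center + r * spread
    return sum(low <= v <= high for v in values) / len(values)


for r in (1, 2, 3):
    observed = fraction_within(temps, mean, std, r)
    bound = 1 - 1 / r**2  # lower bound from Chebyshev's theorem
    print(f"r={r}: observed {observed:.3f}, Chebyshev bound {bound:.3f}")

# Rule of thumb from the slide: values outside mean +/- 3*std may be outliers.
outliers = [v for v in temps if abs(v - mean) > 3 * std]
print("possible outliers:", outliers)
```

The Chebyshev bound holds for any group of values, so it is looser than the 68/95/99.7 percentages, which apply only when the data are approximately normal.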