Data Analysis Project
- WBS (Work Breakdown Structure) Plan Definition
- Experimental Design for Analysis
- Data Resources
Work Breakdown Structure Plan Definition
Identify Business Problem
- WBS (Work Breakdown Structure) Planning
- Computing Resource Operation Planning
- Literature Review & Research
- Problem Diagnosis
- Project Initiation
Data Acquisition & Preprocessing
Objective: transform or drop data before it is used, to ensure data quality and enhance downstream model performance (a minimal preprocessing sketch follows this list).
- Analytical Table Definition
- Data Quality Assurance
- Missing Value Detection and Preprocessing
- Deletion
- Imputation
- Anomaly Detection and Preprocessing
- Point Anomaly (an Outlier)
- Grubbs' Test
- Collective Anomaly
- Deviation-based Outliers
- Interquartile-based Outliers
- Isolation Forest
- Minimum Covariance Determinant
- Local Outlier Factor (Local Anomaly)
- One-Class SVM
- Contextual Anomaly
- Imbalanced Sample Detection and Preprocessing
- Sampling
- Undersampling
- Oversampling
- Dimensionality Reduction
- Manifold Learning
- Isomap
- Locally Linear Embedding
- Multidimensional Scaling (MDS)
- Spectral Embedding
- t-distributed Stochastic Neighbor Embedding (t-SNE)
- Matrix Decomposition
- Principal Component Analysis (PCA)
- Factor Analysis
- Truncated Singular Value Decomposition (SVD)
- Data Transformation
- Quantile Transformation
- Normal Quantile Transformation
- Uniform Quantile Transformation
- Scaling Transformation
- Minmax Scaling
- Standard Scaling
- Robust Scaling
- Gaussian-like Transformation
- Yeo-Johnson Transformation
- Box-Cox Transformation (for Positive Data)
- Discretization Transformation
- Uniform Discretization Transformation
- K-Means Discretization Transformation
- Quantile Discretization Transformation
- Quantification for Categorical Variables
- Numerical Encoding
- Label Encoding
- Ordinal Encoding
- OneHot Encoding
- Numerical Embedding
- Feature Selection & Extraction
- Information Values
- Variance Inflation Factor (VIF)
- Chi-Squared Test
- ANOVA Correlation Coefficient (Linearity)
- Kendall’s Rank Coefficient (Non-Linearity)
- Mutual Information
- Recursive Feature Elimination (RFE, RFECV: Feature Importance, Coefficient)
- Sequential Feature Selection (Forward, Backward, Stepwise Selection: AIC, BIC, R-squared, Accuracy)
- Selection From Model (Threshold)
- Heuristic Selection on Baseline Model
- Noise Filtering
- Kalman Filter
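A minimal preprocessing sketch with scikit-learn, combining missing-value imputation, scaling, and one-hot encoding; the toy frame, column names, and chosen strategies are illustrative assumptions, not prescribed by this plan:
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline

df = pd.DataFrame({'age': [25, None, 40], 'grade': ['a', 'b', 'a']})
numeric = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())  # impute, then scale
categorical = OneHotEncoder(handle_unknown='ignore')
preprocess = ColumnTransformer([('num', numeric, ['age']),
                                ('cat', categorical, ['grade'])])
preprocess.fit_transform(df)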
Exploratory/Confirmatory Data Analysis
Objective: analyze data sets to summarize their main characteristics (a minimal hypothesis-testing sketch follows this list).
- Data Summarization
- Descriptive Statistics
- Aggregation-based Frequency & Percentile Analysis
- Entropy: Class Imbalance
- Probability Analysis
- Posterior: Naive Bayes Analysis, Discriminant Analysis
- Data Tendency: Decision Tree Analysis
- Data Pattern: Probability Function Analysis (Fitting a Distribution)
- Independence Analysis between Random Variables
- Chi-squared Test
- Contingency Table
- Correlation Analysis
- Tetrachoric Correlation between Binary Categorical Variables
- Polychoric Correlation between Ordinal Categorical Variables
- Cramér's V Correlation between Nominal Categorical Variables
- Pearson Correlation between Continuous Variables (Linear Relationship)
- Spearman’s Rank Correlation between Ordinal or Continuous Variables (Monotonic, Possibly Non-Linear Relationship)
- Kendall’s Rank Correlation between Ordinal or Continuous Variables
- Mutual Information
- Association Analysis between Categorical Random Variables
- Support, Confidence, Lift
- Information Quantification
- Information Values
- Feature Importance
- Tree-based Feature Importance
- Coefficient-based Feature Importance
- Permutation Importance
- Data Visualization
- 1-Dimensional Visualization
- Categorical Variables
- Pie Plot: Count Plot
- Bar Plot: Count Plot
- Continuous Variables
- Line Plot
- Box Plot
- Hist Plot
- 2-Dimensional Visualization
- Categorical Variables & Categorical Variables
- Heatmap using Cross Table
- Categorical Variables & Continuous Variables
- Box Plot by Category
- Continuous Variables & Continuous Variables
- Scatter Plot
- Contour Plot
- Statistical Hypothesis Testing
- Randomness Test
- Equality Test for Mean of Random Variable from a Normal Distribution
- Z Test
- (Independent / Paired) T Test
- One-way ANOVA
- post hoc
- Fisher's Least Significant Difference (LSD)
- Bonferroni
- Scheffe
- Tukey HSD
- Non-Parametric Statistical Tests
- Mann-Whitney U Test ~ T Test
- Wilcoxon Signed-Rank Test ~ T Test
- Friedman Test ~ ANOVA
- Kruskal-Wallis H Test ~ ANOVA
- Spearman’s Rank Correlation ~ Correlation
- Kendall’s Rank Correlation ~ Correlation
- Equality Test for Variance of Random Variable from Normal Distribution (Homoscedasticity)
- F Test
- Normality Test
- QQ Plot
- Shapiro–Wilk Test
- Kolmogorov–Smirnov Test
- Lilliefors Test
- Anderson–Darling Test
- Jarque–Bera Test
- Pearson's Chi-Squared Test
- D'Agostino's K-squared Test
- Stationarity Test
- ADF (Augmented Dickey-Fuller) Test
- ADF-GLS Test
- PP (Phillips–Perron) Test
- KPSS (Kwiatkowski–Phillips–Schmidt–Shin) Test
- Heteroscedasticity Test
- Goldfeld–Quandt Test
- Breusch–Pagan Test
- Bartlett's Test
- Autocorrelation Test
- Ljung–Box Test
- Portmanteau Test
- Breusch–Godfrey Test
- Durbin–Watson Test
- Independence Test
- Chi-Squared Test
- Correlation Test
- Pearson's Correlation
- Spearman’s Rank Correlation
- Kendall’s Rank Correlation
- Granger Causality Test
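A minimal hypothesis-testing sketch with scipy.stats covering a few of the tests above; the simulated samples are an illustrative assumption:
import numpy as np
from scipy import stats

a = np.random.normal(0.0, 1.0, size=100)
b = np.random.normal(0.5, 1.0, size=100)
stats.shapiro(a)          # normality (Shapiro-Wilk)
stats.levene(a, b)        # equality of variances
stats.ttest_ind(a, b)     # independent two-sample t test
stats.mannwhitneyu(a, b)  # non-parametric alternative to the t test
stats.pearsonr(a, b)      # correlation test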
Analysis and Modeling
Objective: analyze data sets to extract valuable information and insights (a minimal model-selection and evaluation sketch follows this list).
- Association Analysis
- Analysis of Variance (ANOVA)
- (Between/Within/Mixed) Two-way ANOVA
- (Between/Within/Mixed) Three-way ANOVA
- (Between/Within/Mixed) Multi-way ANOVA
- Regression Analysis
- Logistic Regression
- Generalized Linear Regression
- Cluster Analysis
- Time Series Analysis
- Stationary Process Analysis
- Impulse Response Analysis
- Granger Causality Analysis
- Factor Analysis
- Exploratory Factor Analysis
- Confirmatory Factor Analysis
- Predictive Modeling
- Target Model Review
- Machine Learning Core Models
- Parametric Algorithms
- Regression
- Naive Bayes
- Discriminant Analysis
- Neural Network
- Non-Parametric Algorithms
- Support Vector Machine
- k-Nearest Neighbors
- Decision Trees
- Reinforcement Learning
- Model Selection
- Information Criterion
- Akaike Information Criterion (AIC)
- Bayesian Information Criterion (BIC)
- Data Splitters
- Leave Out
- Leave-One-Out
- Leave-P-Out
- Leave-One-Group-Out
- Leave-P-Groups-Out
- Shuffle Split
- Shuffle Split
- Stratified Shuffle Split
- Group Shuffle Split
- K-Fold
- K-Fold
- Repeated K-Fold
- Stratified K-Fold
- Repeated Stratified K-Fold
- Group K-Fold
- Time Series Split
- Hyper-Parameter Optimizers
- Validation Curve
- Random Search
- Grid Search
- Bayesian HPO
- Model Evaluation
- Task: Inference
- Task: Classification
- Confusion Matrix
- ROC and PR Curve
- Cost-Sensitive Learning
- Task: Regression
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- Mean Absolute Error (MAE)
- Mean Absolute Percentage Error (MAPE)
- Task: Clustering
- MLOps/ModelOps Construction
- Data Drift Monitoring
- Data-Model Pipelining
- Offline Data
- Source Repository
- Feature Store
- Model Registry
- ML Metadata Store
- Analysis Result Summary Report
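A minimal model-selection and evaluation sketch; the random-forest estimator, parameter grid, and simulated data are illustrative assumptions, and any splitter or metric from the lists above could be substituted:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.metrics import confusion_matrix

X, y = make_classification(n_samples=300, random_state=0)
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid={'n_estimators': [50, 100]},
                      cv=StratifiedKFold(n_splits=5))
search.fit(X, y)
confusion_matrix(y, search.predict(X))  # evaluate (in-sample, for brevity)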
Project Performance Management
- Key Performance Indicator (KPI) Design
- User Guide Manual
Experimental Design for Analysis
Analysis Manuals
1. Architecture Design for Analysis Modeling
import pandas

# Subclassing pandas.DataFrame so analysis results behave like a DataFrame
class Result(pandas.DataFrame):
    def __init__(self):
        super(Result, self).__init__()

summary = Result()
summary['A'] = [1, 2, 3]
summary['B'] = [10, 20, 30]
supplementary = pandas.DataFrame([1, 2, 3], columns=['A'])
summary.merge(supplementary, on='A')  # join supplementary results on column 'A'
import pandas as pd

class Analysis:
    def __init__(self):
        self.__storage__ = pd.Series(data=[1, 2, 3], index=['A', 'B', 'C'])
        # Dynamically created container classes for assigning analysis artifacts
        self.assign_type_A = type('assignment1', (object,), {})()
        self.assign_type_B = type('assignment2', (dict,), {})()
        self.assign_type_C = type('assignment3', (pd.Series,), {})()
        self.assign_type_D = type('assignment4', (pd.DataFrame,), {})()

    def __getitem__(self, idx):
        return self.__storage__[idx]

    @property
    def storage(self):
        return self.__storage__

anlys = Analysis()
anlys.storage                      # read-only access through the property
anlys['A']                         # index access through __getitem__
anlys.assign_type_A.item = None    # plain object: attribute assignment only
anlys.assign_type_B.item = None
anlys.assign_type_C.item = None
anlys.assign_type_D.item = None
anlys.assign_type_B['key'] = None  # dict/Series/DataFrame subclasses also take key assignment
anlys.assign_type_C['key'] = None
anlys.assign_type_D['key'] = None
anlys.assign_type_A, anlys.assign_type_B, anlys.assign_type_C, anlys.assign_type_D
Library: sklearn
# https://scikit-learn.org/stable/developers/develop.html
from sklearn.datasets import make_classification, make_regression
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import make_column_transformer
from sklearn.base import BaseEstimator, TransformerMixin, ClassifierMixin, RegressorMixin, MultiOutputMixin
from sklearn.pipeline import make_pipeline, make_union
from sklearn.model_selection import RepeatedStratifiedKFold, RepeatedKFold, GridSearchCV, cross_validate

# BaseEstimator: .get_params(), .set_params()
# TransformerMixin: .fit_transform()
class Transformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X

# ClassifierMixin: .score() (mean accuracy)
class Classifier(BaseEstimator, ClassifierMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def predict(self, X):
        return X

    def predict_proba(self, X):
        return X

# RegressorMixin: .score() (R-squared)
class Regressor(BaseEstimator, RegressorMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def predict(self, X):
        return X
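The imported pipeline and validation utilities can wire these skeletons together. A minimal sketch: the pass-through Transformer feeds a LogisticRegression stand-in (an assumption, since the do-nothing Classifier skeleton above cannot produce meaningful predictions):
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
pipeline = make_pipeline(Transformer(), LogisticRegression())
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
scores = cross_validate(pipeline, X, y, cv=cv, scoring='accuracy')
scores['test_score'].mean()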
Library: tensorflow
# https://www.tensorflow.org/guide/keras/custom_layers_and_models
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import models

class CustomLayer(layers.Layer):
    def __init__(self, units=32, name=None):
        super(CustomLayer, self).__init__(name=name)
        self.units = units

    def build(self, input_shape):
        # Lazily create weights once the input shape is known
        self.w = self.add_weight(
            shape=(input_shape[-1], self.units),
            initializer="random_normal",
            trainable=True,
        )
        self.b = self.add_weight(
            shape=(self.units,), initializer="random_normal", trainable=True
        )

    def call(self, X, training=None):
        return tf.matmul(X, self.w) + self.b

    def get_config(self):
        return {"units": self.units, "name": self.name}

class CustomModel(models.Model):
    def __init__(self, **kwargs):
        super(CustomModel, self).__init__(**kwargs)
        self.dense_1 = layers.Dense(64, activation='relu', name='L1')
        self.dense_2 = CustomLayer(units=32)
        self.dense_3 = layers.Dense(10, name='L2')

    def call(self, inputs):
        x = self.dense_1(inputs)
        x = self.dense_2(x)
        x = self.dense_3(x)
        return x

# training
model = CustomModel(name='CustomModel')
model.compile(optimizer="Adam", loss="mse", metrics=["mae"])
history = model.fit(tf.random.normal(shape=(100, 100)), tf.random.normal(shape=(100, 10)))
history.params   # training configuration
history.history  # per-epoch metrics
Library: pytorch
# https://pytorch.org/tutorials/beginner/introyt/modelsyt_tutorial.html
import torch
import torch.nn as nn
from torch import optim
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

class TorchDataset(Dataset):
    def __init__(self):
        self.x_data = torch.tensor(
            [[73, 80, 75],
             [93, 88, 93],
             [89, 91, 90],
             [96, 98, 100],
             [73, 66, 70]]).type(dtype=torch.FloatTensor)
        self.y_data = torch.tensor([[152], [185], [180], [196], [142]]).type(dtype=torch.FloatTensor)

    def __len__(self):
        return self.y_data.size()[0]

    def __getitem__(self, idx):
        x = self.x_data[idx]
        y = self.y_data[idx]
        return x, y

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.linear1 = nn.Linear(3, 3)
        self.linear2 = nn.Linear(3, 1)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.linear1(x)
        x = self.linear2(self.relu(x))
        return x

class Criterion(nn.Module):
    def __init__(self):
        super(Criterion, self).__init__()
        self.mse = nn.MSELoss()

    def forward(self, hypothesis, target):
        return self.mse(hypothesis, target)
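A minimal training-loop sketch using the classes above; the SGD optimizer, the small learning rate (the raw features are large, so a big step diverges), and the epoch count are illustrative assumptions:
dataset = TorchDataset()
loader = DataLoader(dataset, batch_size=2, shuffle=True)
model = Model()
criterion = Criterion()
optimizer = optim.SGD(model.parameters(), lr=1e-5)
for epoch in range(100):
    for x, y in loader:
        optimizer.zero_grad()          # clear gradients from the previous step
        loss = criterion(model(x), y)  # forward pass + loss
        loss.backward()                # backpropagation
        optimizer.step()               # parameter update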
Library: statsmodels
# https://www.statsmodels.org/stable/examples/notebooks/generated/statespace_custom_models.html
import numpy as np
import statsmodels.tsa.api as smt
from statsmodels.tsa.statespace.tools import constrain_stationary_univariate, unconstrain_stationary_univariate

# Construct the model
class AR2(smt.statespace.MLEModel):
    def __init__(self, endog):
        # Initialize the state space model
        super(AR2, self).__init__(endog, k_states=2, k_posdef=1,
                                  initialization='stationary')
        # Setup the fixed components of the state space representation
        self['design'] = [1, 0]
        self['transition'] = np.array(
            [[0, 1],
             [0, 0]]
        ).T
        self['selection'] = np.array(
            [[1],
             [0]]
        )
        # Setup parameter names
        self._param_names = ['param.phi1', 'param.phi2', 'param.sigma2']

    def transform_params(self, params):
        # params: unconstrained to constrained
        phi1 = constrain_stationary_univariate(params[0:1])
        phi2 = constrain_stationary_univariate(params[1:2])
        sigma2 = params[2]**2
        constrained = np.r_[phi1, phi2, sigma2]
        return constrained

    def untransform_params(self, params):
        # params: constrained to unconstrained
        phi1 = unconstrain_stationary_univariate(params[0:1])
        phi2 = unconstrain_stationary_univariate(params[1:2])
        sigma2 = params[2]**0.5
        unconstrained = np.r_[phi1, phi2, sigma2]
        return unconstrained

    # Describe how parameters enter the model
    def update(self, params, transformed=True, **kwargs):
        params = super(AR2, self).update(params, transformed, **kwargs)
        self['transition', 0, :] = params[:2]
        self['state_cov', 0, 0] = params[2]

    # Specify start parameters and parameter names
    @property
    def start_params(self):
        return [0, 0, 1]  # these are very simple

# Create and fit the model
y = smt.ArmaProcess(ar=[1, -.3, .2], ma=[1]).generate_sample(1000, burnin=50)
model = AR2(endog=y)
result = model.fit(disp=False)
result.summary()
Library: arch
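A minimal sketch, assuming the intended example was a GARCH(1, 1) volatility model; the simulated returns and settings are illustrative:
import numpy as np
from arch import arch_model

returns = np.random.normal(0, 1, size=1000)
am = arch_model(returns, mean='Constant', vol='GARCH', p=1, q=1)
res = am.fit(disp='off')  # suppress optimizer output
res.summary()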
Library: pymc3
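A minimal sketch, assuming the intended example was Bayesian estimation of a normal mean and scale via MCMC; the priors and simulated data are illustrative:
import numpy as np
import pymc3 as pm

y = np.random.normal(1.0, 2.0, size=200)
with pm.Model():
    mu = pm.Normal('mu', mu=0, sigma=10)      # prior on the mean
    sigma = pm.HalfNormal('sigma', sigma=10)  # prior on the scale
    pm.Normal('obs', mu=mu, sigma=sigma, observed=y)
    trace = pm.sample(1000, tune=1000)
pm.summary(trace)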
Automatic Differentiation
Library: pytorch
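A minimal reverse-mode autodiff sketch; the differentiated function is an illustrative assumption:
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()  # y = x1^2 + x2^2
y.backward()        # reverse-mode automatic differentiation
x.grad              # tensor([4., 6.]) = dy/dx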
Library: tensorflow
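A minimal GradientTape sketch; the differentiated function is an illustrative assumption:
import tensorflow as tf

x = tf.Variable([2.0, 3.0])
with tf.GradientTape() as tape:
    y = tf.reduce_sum(x ** 2)
tape.gradient(y, x)  # [4., 6.] = dy/dx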
Library: scipy
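scipy offers numerical (finite-difference) rather than automatic differentiation; a minimal sketch with approx_fprime, where the differentiated function is an illustrative assumption:
import numpy as np
from scipy.optimize import approx_fprime

f = lambda x: np.sum(x ** 2)
approx_fprime(np.array([2.0, 3.0]), f, 1e-8)  # approximately [4., 6.]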
Matrix Operation
Library: numpy
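A minimal sketch of common matrix operations; the matrices are illustrative assumptions:
import numpy as np

A = np.array([[1., 2.], [3., 4.]])
b = np.array([1., 0.])
A @ A                  # matrix product
np.linalg.inv(A)       # inverse
np.linalg.solve(A, b)  # solve Ax = b
np.linalg.eig(A)       # eigendecomposition
np.linalg.svd(A)       # singular value decomposition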
Library: pytorch
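A minimal sketch with torch.linalg, assuming a reasonably recent torch (>= 1.9); the matrices are illustrative:
import torch

A = torch.tensor([[1., 2.], [3., 4.]])
b = torch.tensor([[1.], [0.]])
A @ A                     # matrix product
torch.linalg.inv(A)       # inverse
torch.linalg.solve(A, b)  # solve Ax = b
torch.linalg.svd(A)       # singular value decomposition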
Library: tensorflow
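A minimal sketch with tf.linalg; the matrices are illustrative assumptions:
import tensorflow as tf

A = tf.constant([[1., 2.], [3., 4.]])
b = tf.constant([[1.], [0.]])
tf.linalg.matmul(A, A)  # matrix product
tf.linalg.inv(A)        # inverse
tf.linalg.solve(A, b)   # solve Ax = b
tf.linalg.svd(A)        # singular value decomposition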
Sampling for Simulation
Library: numpy
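A minimal sketch of simulation draws with numpy's Generator API; the seed and shapes are illustrative assumptions:
import numpy as np

rng = np.random.default_rng(seed=0)
rng.normal(0, 1, size=(10, 3))          # Normal(0, 1)
rng.uniform(0, 1, size=(10, 3))         # Uniform on [0, 1)
rng.exponential(1 / 100, size=(10, 3))  # Exponential (scale = 1 / rate)
rng.binomial(1, 0.9, size=(10, 3))      # Bernoulli as Binomial(n=1)
rng.geometric(0.9, size=(10, 3))        # Geometric(p=0.9)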
Library: pytorch
import torch

# In-place samplers fill an existing tensor with draws from the given distribution
torch.Tensor(10, 3).normal_(0, 1)      # Normal(mean=0, std=1)
torch.Tensor(10, 3).uniform_(0, 1)     # Uniform on [0, 1)
torch.Tensor(10, 3).exponential_(100)  # Exponential with rate lambd=100
torch.Tensor(10, 3).bernoulli_(0.9)    # Bernoulli(p=0.9)
torch.Tensor(10, 3).geometric_(0.9)    # Geometric(p=0.9)
Library: tensorflow
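A minimal sketch with tf.random; the shapes and parameters are illustrative assumptions:
import tensorflow as tf

tf.random.set_seed(0)
tf.random.normal(shape=(10, 3), mean=0.0, stddev=1.0)  # Gaussian
tf.random.uniform(shape=(10, 3), minval=0, maxval=1)   # uniform
tf.random.poisson(shape=(10, 3), lam=5.0)              # Poisson
tf.random.categorical(tf.math.log([[0.3, 0.3, 0.4]]), num_samples=10)  # categorical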
Library: scipy
from scipy import stats

stats.bernoulli.rvs(p=2/3, size=1000)                  # Bernoulli(p)
stats.binom.rvs(n=5, p=2/3, size=1000)                 # Binomial(n, p)
stats.poisson.rvs(mu=5, size=1000)                     # Poisson(mu)
stats.multinomial.rvs(n=5, p=[.3, .3, .4], size=1000)  # Multinomial; p must sum to 1
stats.randint.rvs(0, 10, size=1000)                    # discrete uniform on [0, 10)
stats.geom.rvs(p=.2, size=1000)                        # Geometric(p)
stats.nbinom.rvs(n=5, p=.2, size=1000)                 # Negative Binomial(n, p)
stats.norm.rvs(0, 1, size=1000)                        # Normal(loc, scale)
stats.uniform.rvs(0, 10, size=1000)                    # Uniform on [loc, loc + scale]
stats.expon.rvs(0, 1, size=1000)                       # Exponential(loc, scale)
Data Resources
Korea
Law & Press & Society Issues
LIKMS (Bill Information System): https://likms.assembly.go.kr/bill/main.do
KOSIS (KOrean Statistical Information Service): https://kosis.kr/index/index.do
지표누리 (National Indicators Portal): https://www.index.go.kr/
Economy & Finance
BOK ECOS (Bank of Korea Economic Statistics System): https://ecos.bok.or.kr/
KRX (Korea Exchange Market Data System): http://data.krx.co.kr/contents/MDC/MAIN/main/index.cmd
FNGuide: https://comp.fnguide.com/
Value Line: https://www.valueline.co.kr/
Abroad