
Data Analysis Project

ecstasy 2023. 5. 7. 17:34
  • Work Breakdown Structure (WBS) Plan Definition
  • Experimental Design for Analysis
  • Data Resources

Work Breakdown Structure Plan Definition

Identify Business Problem

  • WBS(Work Breakdown Structure) Planning
  • Computing Resource Operation Planning
  • Literature Review & Research
  • Problem Diagnosis
  • Project Initiation

Data Acquisition & Preprocessing

Objective: transform or drop data before it is used in order to ensure or enhance downstream performance; a short sketch follows this list

  • Analytical Table Definition
  • Data Quality Assurance
    • Missing Value Detection and Preprocessing
      • Deletion
      • Imputation
    • Anomaly Detection and Preprocessing
      • Point Anomaly (an Outlier)
        • Grubbs' Test
      • Collective Anomaly 
        • Deviation-based Outliers
        • Interquartile-based Outliers
        • Isolation Forest
        • Minimum Covariance Determinant
        • Local Outlier Factor (Local Anomaly)
        • One-Class SVM
      • Contextual Anomaly
    • Imbalanced Sample Detection and Preprocessing
      • Sampling
        • Undersampling
        • Oversampling
  • Dimensionality Reduction
    • Manifold Learning
      • Isomap
      • Locally Linear Embedding
      • Multidimensional Scaling (MDS)
      • Spectral Embedding
      • t-distributed Stochastic Neighbor Embedding (t-SNE)
    • Matrix Decomposition
      • Principal Component Analysis (PCA)
      • Factor Analysis
      • Truncated Singular Value Decomposition (SVD)
  • Data Transformation
    • Quantile Transformation
      • Normal Quantile Transformation
      • Uniform Quantile Transformation
    • Scaling Transformation
      • Min-Max Scaling
      • Standard Scaling
      • Robust Scaling
    • Gaussian-like Transformation
      • Yeo-Johnson Transformation
      • Box-Cox Transformation (for positive data)
    • Discretization Transformation
      • Uniform Discretization Transformation
      • K-Means Discretization Transformation
      • Quantile Discretization Transformation
    • Quantification for Categorical Variables
      • Numerical Encoding 
        • Label Encoding
        • Ordinal Encoding
        • OneHot Encoding
      • Numerical Embedding
  • Feature Selection & Extraction
    • Information Values
    • Variance Inflation Factor (VIF)
    • Chi-Squared Test
    • ANOVA Correlation Coefficient (Linearity)
    • Kendall’s Rank Coefficient (Non-Linearity)
    • Mutual Information
    • Recursive Feature Elimination (RFE, RFECV: Feature Importance, Coefficient)
    • Sequential Feature Selection (Forward, Backward, Stepwise Selection: AIC, BIC, R-squared, Accuracy)
    • Selection From Model (Threshold)
    • Heuristic Selection on Baseline Model
  • Noise Filtering
    • Kalman Filter
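
A minimal preprocessing sketch covering a few of the steps listed above (assumption: a synthetic dataset and default scikit-learn settings stand in for the project's actual analytical table):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer          # missing value imputation
from sklearn.preprocessing import StandardScaler  # standard scaling
from sklearn.ensemble import IsolationForest      # collective anomaly detection
from sklearn.decomposition import PCA             # dimensionality reduction

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X[::50, 0] = np.nan                               # inject a few missing values

X = SimpleImputer(strategy='mean').fit_transform(X)
X = StandardScaler().fit_transform(X)
inlier_mask = IsolationForest(random_state=0).fit_predict(X) == 1
X, y = X[inlier_mask], y[inlier_mask]
X_reduced = PCA(n_components=3).fit_transform(X)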

Exploratory/Confirmatory Data Analysis

Objective: analyzing data sets to summarize their main characteristics; see the sketch after this list

  • Data Summarization
    • Descriptive Statistics
      • Aggregation-based Frequency & Percentile Analysis
        • Entropy: Class Imbalance
    • Probability Analysis
      • Posterior: Naive Bayes Analysis, Discriminant Analysis
      • Data Tendency: Decision Tree Analysis
      • Data Pattern: Probability Function Analysis Fitting Distribution
      • Independency Analysis between Random Variables
        • Chi-squared Test
          • Contingency Table 
        • Correlation Analysis
          • Tetrachoric Correlation between Binary Categorical Variables
          • Polychoric Correlation between Ordinal Categorical Variables
          • Cramer’s Correlation between Nominal Categorical Variables
          • Pearson Correlation between Continuous Variables (Linear Relationship)
          • Spearman’s Rank Correlation between a Categorical and a Continuous Variable (Non-Linear Relationship of Monotonic Function)
          • Kendall’s Rank Correlation between a Categorical and a Continuous Variable
        • Mutual Information
      • Association Analysis between Categorical Random Variables
        • Support, Confidence, Lift
    • Information Quantitation
      • Information Values
      • Feature Importance
        • Tree-based Feature Importance
        • Coefficient-based Feature Importance
      • Permutation Importance
  • Data Visualization
    • 1-Dimensional Visualization 
      • Categorical Variables
        • Pie Plot: Count Plot
        • Bar Plot: Count Plot
      • Continuous Variables
        • Line Plot
        • Box Plot
        • Hist Plot
    • 2-Dimensional Visualization
      • Categorical Variables & Categorical Variables
        • Heatmap using Cross Table
      • Categorical Variables & Continuous Variables
        • Box Plot by Category
      • Continuous Variables & Continuous Variables
        • Scatter Plot
        • Contour Plot
  • Statistical Hypothesis Testing
    • Randomness Test
    • Equality Test for Mean of Random Variable from a Normal Distribution
      • Z Test
      • (Independent / Paired) T Test
      • One-way ANOVA
        • post hoc
          • Fisher's Least Significant Difference (LSD)
          • Bonferroni
          • Scheffe
          • Tukey HSD 
    • Non-Parametric Statistical Test
      • Mann-Whitney U Test ~ T Test
      • Wilcoxon Signed-Rank Test ~ T Test
      • Friedman Test ~ ANOVA
      • Kruskal-Wallis H Test ~ ANOVA
      • Spearman’s Rank Correlation ~ Correlation
      • Kendall’s Rank Correlation ~ Correlation
    • Equality Test for Variance of Random Variable from Normal Distribution (Homoscedasticity)
      • F Test
    • Normality Test
      • QQ Plot
      • Shapiro–Wilk Test
      • Kolmogorov–Smirnov Test
      • Lilliefors Test
      • Anderson–Darling Test
      • Jarque–Bera Test
      • Pearson's Chi-Squared Test
      • D'Agostino's K-squared Test
    • Stationarity Test
      • ADF (Augmented Dickey-Fuller) Test
      • ADF-GLS Test
      • PP (Phillips–Perron) Test
      • KPSS (Kwiatkowski–Phillips–Schmidt–Shin) Test
    • Heteroscedasticity Test
      • Goldfeld–Quandt Test
      • Breusch–Pagan Test
      • Bartlett's Test
    • Autocorrelation Test
      • Ljung–Box Test
      • Portmanteau Test
      • Breusch–Godfrey Test
      • Durbin–Watson Test
    • Independence Test
      • Chi-Squared Test
    • Correlation Test
      • Pearson's Correlation
      • Spearman’s Rank Correlation
      • Kendall’s Rank Correlation
    • Granger Causality Test
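
A minimal EDA sketch for a few of the summaries and tests listed above (assumption: a small synthetic DataFrame stands in for the project's analytical table):

import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame({'x': np.random.normal(0, 1, 300),
                   'y': np.random.normal(0, 1, 300),
                   'g': np.random.choice(['a', 'b'], 300)})

df.describe()                                    # descriptive statistics
df[['x', 'y']].corr(method='pearson')            # correlation analysis
stats.shapiro(df['x'])                           # normality test (Shapiro-Wilk)
stats.ttest_ind(df.loc[df['g'] == 'a', 'x'],     # independent t-test between groups
                df.loc[df['g'] == 'b', 'x'])
stats.mannwhitneyu(df.loc[df['g'] == 'a', 'x'],  # non-parametric counterpart
                   df.loc[df['g'] == 'b', 'x'])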

Analysis and Modeling

Objective: analyzing data sets to extract valuable information and insights; see the sketch after this list.

  • Association Analysis
  • Analysis of Variance (ANOVA)
    • (Between/Within/Mixed) Two-way ANOVA
    • (Between/Within/Mixed) Three-way ANOVA
    • (Between/Within/Mixed) Multi-way ANOVA
  • Regression Analysis
    • Logistic Regression
    • Generalized Linear Regression
  • Cluster Analysis
  • Time Series Analysis
    • Stationary Process Analysis
    • Impulse Response Analysis
    • Granger Causality Analysis
  • Factor Analysis
    • Exploratory Factor Analysis
    • Confirmatory Factor Analysis
  • Predictive Modeling
    • Target Model Review 
      • Machine Learning Core Models 
        • Parametric Algorithms
          • Regression
          • Naive Bayes
          • Discriminant Analysis
          • Neural Network
        • Non-Parametric Algorithms
          • Support Vector Machine
          • k-Nearest Neighbors
          • Decision Trees
      • Reinforcement Learning
    • Model Selection
      • Information Criterion
        • Akaike Information Criterion (AIC)
        • Bayesian Information Criterion (BIC)
      • Data Spliters
        • Leave Out
          • Leave-One Out
          • Leave-P Out
          • Leave-One Group Out
          • Leave-P Groups Out
        • Shuffle Split
          • Shuffle Split
          • Stratified Shuffle Split
          • Group Shuffle Split
        • K-Fold
          • K-Fold
          • Repeated K-Fold 
          • Stratified K-Fold
          • Repeated Stratified K-Fold
          • Group K-Fold
        • Time Series Split
      • Hyper-Parameter Optimizers
        • Validation Curve
        • Random Search
        • Grid Search
        • Bayesian HPO
  • Model Evaluation
    • Task: Inference
    • Task: Classification
      • Confusion Matrix
      • ROC and PR Curve
      • Cost-Sensitive Learning
    • Task: Regression
      • Mean Squared Error (MSE)
      • Root Mean Squared Error (RMSE)
      • Mean Absolute Error (MAE)
      • Mean Absolute Percentage Error (MAPE)
    • Task: Clustering
  • MLOps/ModelOps Construction
    • Data Drift Monitoring
    • Data-Model Pipelining
      • Offline Data
      • Source Repository
      • Feature Store
      • Model Registry
      • ML Metadata Store
  • Analysis Result Summary Report
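
A minimal predictive-modeling and evaluation sketch for a few of the steps listed above (assumption: a synthetic classification task and a scikit-learn decision tree stand in for the project's actual model candidates):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
confusion_matrix(y_test, model.predict(X_test))           # classification evaluation
roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])  # ROC-based evaluation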

Project Performance Management

  • Key Performance Indicator (KPI) Design
  • User Guide Manual

 


Experimental Design for Analysis

Analysis Manuals

1. 

 

 

Architecture Design for Analysis Modeling

import pandas

# Result: a thin pandas.DataFrame subclass used as the container for analysis outputs
class Result(pandas.DataFrame):
    def __init__(self):
        super(Result, self).__init__()

summary = Result()
summary['A'] = [1, 2, 3]
summary['B'] = [10, 20, 30]

# supplementary table joined onto the summary by the shared key column 'A'
supplementary = pandas.DataFrame([1, 2, 3], columns=['A'])
summary.merge(supplementary, on='A')
import pandas as pd

# Analysis: wraps an internal pandas.Series and exposes it through indexing and a property;
# the assign_type_* attributes are instances of dynamically created classes, illustrating
# different storage types for assigned results
class Analysis:
    def __init__(self):
        self.__storage__ = pd.Series(data=[1, 2, 3], index=['A', 'B', 'C'])
        self.assign_type_A = type('assignment1', (object,), {})()
        self.assign_type_B = type('assignment2', (dict,), {})()
        self.assign_type_C = type('assignment3', (pd.Series,), {})()
        self.assign_type_D = type('assignment4', (pd.DataFrame,), {})()

    def __getitem__(self, idx):
        return self.__storage__[idx]

    @property
    def storage(self):
        return self.__storage__

anlys = Analysis()
anlys.storage    # access the internal Series through the property
anlys['A']       # access through __getitem__

# attribute assignment works on all four storage types
anlys.assign_type_A.item = None
anlys.assign_type_B.item = None
anlys.assign_type_C.item = None
anlys.assign_type_D.item = None
# key-based assignment works on the dict-, Series-, and DataFrame-based types
anlys.assign_type_B['key'] = None
anlys.assign_type_C['key'] = None
anlys.assign_type_D['key'] = None
anlys.assign_type_A, anlys.assign_type_B, anlys.assign_type_C, anlys.assign_type_D

 

Library: sklearn

# https://scikit-learn.org/stable/developers/develop.html

from sklearn.datasets import make_classification, make_regression
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import make_column_transformer
from sklearn.base import BaseEstimator, TransformerMixin, ClassifierMixin, RegressorMixin, MultiOutputMixin
from sklearn.pipeline import make_pipeline, make_union
from sklearn.model_selection import RepeatedStratifiedKFold, RepeatedKFold, GridSearchCV, cross_validate

# BaseEstimator: .get_params(), .set_params()
# TransformerMixin: .fit_transform()
# ClassifierMixin: .score() (mean accuracy), RegressorMixin: .score() (R-squared)

class Transformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X

class Classifier(BaseEstimator, ClassifierMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self

    def predict(self, X):
        return X

    def predict_proba(self, X):
        return X
    
class Regressor(BaseEstimator, RegressorMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self

    def predict(self, X):
        return X
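
A minimal usage sketch for the skeletons above (assumption: the identity Transformer is chained with scikit-learn's LogisticRegression, which is not part of the original skeleton, purely so the pipeline can be scored):

from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
pipeline = make_pipeline(Transformer(), LogisticRegression())

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
cross_validate(pipeline, X, y, cv=cv, scoring='accuracy')['test_score']

# hyper-parameter optimization over the pipeline (step name follows make_pipeline's lower-casing)
search = GridSearchCV(pipeline, param_grid={'logisticregression__C': [0.1, 1.0, 10.0]}, cv=cv)
search.fit(X, y)
search.best_params_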

Library: tensorflow

# https://www.tensorflow.org/guide/keras/custom_layers_and_models
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import models

class CustomLayer(layers.Layer):
    def __init__(self, units=32, name=None):
        super(CustomLayer, self).__init__(name=name)
        self.units = units

    def build(self, input_shape):
        self.w = self.add_weight(
            shape=(input_shape[-1], self.units),
            initializer="random_normal",
            trainable=True,
        )
        self.b = self.add_weight(
            shape=(self.units,), initializer="random_normal", trainable=True
        )

    def call(self, X, training=None):
        return tf.matmul(X, self.w) + self.b

    def get_config(self):
        return {"units": self.units, "name": self.name}

class CustomModel(models.Model):
    def __init__(self, **kwargs):
        super(CustomModel, self).__init__(**kwargs)
        self.dense_1 = layers.Dense(64, activation='relu', name='L1')
        self.dense_2 = CustomLayer(units=32)
        self.dense_3 = layers.Dense(10, name='L2')

    def call(self, inputs):
        x = self.dense_1(inputs)
        x = self.dense_2(x)
        x = self.dense_3(x)
        return x

# training
model = CustomModel(name='CustomModel')
model.compile(optimizer="Adam", loss="mse", metrics=["mae"])
history = model.fit(tf.random.normal(shape=(100,100)), tf.random.normal(shape=(100,10)))
history.params
history.history

Library: pytorch

# https://pytorch.org/tutorials/beginner/introyt/modelsyt_tutorial.html

import torch
import torch.nn as nn
from torch import optim
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

class TorchDataset(Dataset):
    def __init__(self):
        self.x_data = torch.tensor(
            [[73, 80, 75],
             [93, 88, 93],
             [89, 91, 90],
             [96, 98, 100],
             [73, 66, 70]]).type(dtype=torch.FloatTensor)
        self.y_data = torch.tensor([[152], [185], [180], [196], [142]]).type(dtype=torch.FloatTensor)
        
    def __len__(self):
        return self.y_data.size()[0]

    def __getitem__(self, idx):
        x = self.x_data[idx]
        y = self.y_data[idx]
        return x, y

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.linear1 = nn.Linear(3,3)
        self.linear2 = nn.Linear(3,1)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.linear1(x)
        x = self.linear2(self.relu(x))
        
        return x

class Criterion(nn.Module):
    def __init__(self):
        super(Criterion, self).__init__()
        self.mse = nn.MSELoss()

    def forward(self, hypothesis, target):
        return self.mse(hypothesis, target)
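
A minimal training-loop sketch wiring the pieces above together (assumption: the batch size, learning rate, and epoch count are illustrative only):

dataset = TorchDataset()
loader = DataLoader(dataset, batch_size=2, shuffle=True)
model = Model()
criterion = Criterion()
optimizer = optim.SGD(model.parameters(), lr=1e-5)

for epoch in range(100):
    for x, y in loader:
        optimizer.zero_grad()           # reset accumulated gradients
        loss = criterion(model(x), y)   # forward pass and loss
        loss.backward()                 # back-propagation
        optimizer.step()                # parameter update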

Library: statsmodels

# https://www.statsmodels.org/stable/examples/notebooks/generated/statespace_custom_models.html
import numpy as np
import statsmodels.tsa.api as smt
from statsmodels.tsa.statespace.tools import constrain_stationary_univariate, unconstrain_stationary_univariate


# Construct the model
class AR2(smt.statespace.MLEModel):
    def __init__(self, endog):
        # Initialize the state space model
        super(AR2, self).__init__(endog, k_states=2, k_posdef=1,
                                  initialization='stationary')

        # Setup the fixed components of the state space representation
        self['design'] = [1, 0]
        self['transition'] = np.array(
            [[0, 1],
             [0, 0]]
        ).T
        self['selection'] = np.array(
            [[1],
             [0]]
        )

        # Setup parameter names
        self._param_names = ['param.phi1', 'param.phi2', 'param.sigma2']

    def transform_params(self, params):
        # params: unconstrained to constrained
        phi1 = constrain_stationary_univariate(params[0:1])
        phi2 = constrain_stationary_univariate(params[1:2])
        sigma2 = params[2]**2
        
        constrained = np.r_[phi1, phi2, sigma2]
        return constrained
    
    def untransform_params(self, params):
        # params: constrained to unconstrained
        phi1 = unconstrain_stationary_univariate(params[0:1])
        phi2 = unconstrain_stationary_univariate(params[1:2])
        sigma2 = params[2]**0.5
        
        unconstrained = np.r_[phi1, phi2, sigma2]
        return unconstrained

    # Describe how parameters enter the model
    def update(self, params, transformed=True, **kwargs):
        params = super(AR2, self).update(params, transformed, **kwargs)

        self['transition', 0, :] = params[:2]
        self['state_cov', 0, 0] = params[2]

    # Specify start parameters and parameter names
    @property
    def start_params(self):
        return [0,0,1]  # these are very simple

# Create and fit the model
y = smt.ArmaProcess(ar=[1, -.3, .2], ma=[1]).generate_sample(1000, burnin=50)
model = AR2(endog=y)
result = model.fit(disp=False)
result.summary()
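
A quick cross-check sketch (assumption: the built-in SARIMAX(2, 0, 0) model, fit to the same simulated series, should recover similar AR coefficients):

sarimax_result = smt.SARIMAX(y, order=(2, 0, 0)).fit(disp=False)
sarimax_result.params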

Library: arch

#
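
A minimal GARCH(1, 1) sketch with the arch package (assumption: simulated white-noise returns stand in for real return data):

import numpy as np
from arch import arch_model

returns = np.random.normal(0, 1, size=1000)
garch = arch_model(returns, mean='Constant', vol='GARCH', p=1, q=1)
garch_result = garch.fit(disp='off')
garch_result.summary()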

Library: pymc3

#
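
A minimal Bayesian estimation sketch with PyMC3 (assumptions: PyMC3 >= 3.8 so the sigma keyword is available, and weakly informative priors chosen only for illustration):

import numpy as np
import pymc3 as pm

data = np.random.normal(loc=1.0, scale=2.0, size=500)

with pm.Model() as model:
    mu = pm.Normal('mu', mu=0, sigma=10)        # prior on the mean
    sigma = pm.HalfNormal('sigma', sigma=10)    # prior on the standard deviation
    obs = pm.Normal('obs', mu=mu, sigma=sigma, observed=data)
    trace = pm.sample(1000, tune=1000)

pm.summary(trace)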

 

 

Automatic Differentiation

Library: pytorch

#
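
A minimal autograd sketch (assumption: differentiating y = x**3 at x = 2, where dy/dx = 3 * x**2 = 12):

import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3
y.backward()   # populate x.grad with dy/dx
x.grad         # tensor(12.)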

Library: tensorflow

#
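
A minimal GradientTape sketch computing the same derivative:

import tensorflow as tf

x = tf.Variable(2.0)
with tf.GradientTape() as tape:
    y = x ** 3
tape.gradient(y, x)   # 12.0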

Library: scipy

#
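
SciPy does not provide true automatic differentiation; a finite-difference approximation is the usual stand-in (a minimal sketch of the same example):

import numpy as np
from scipy.optimize import approx_fprime

f = lambda x: x[0] ** 3
approx_fprime(np.array([2.0]), f, 1e-6)   # approximately array([12.])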

 

 

 

Matrix Operation

Library: numpy

#
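
A minimal sketch of common matrix operations with NumPy:

import numpy as np

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[5., 6.], [7., 8.]])
b = np.array([1., 2.])

A @ B                  # matrix multiplication
A.T                    # transpose
np.linalg.inv(A)       # inverse
np.linalg.det(A)       # determinant
np.linalg.solve(A, b)  # solve A x = b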

Library: pytorch

#
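
The same operations with PyTorch tensors (assumption: a recent PyTorch where the torch.linalg namespace is available):

import torch

A = torch.tensor([[1., 2.], [3., 4.]])
B = torch.tensor([[5., 6.], [7., 8.]])
b = torch.tensor([1., 2.])

A @ B                     # or torch.matmul(A, B)
A.T                       # transpose
torch.linalg.inv(A)       # inverse
torch.linalg.det(A)       # determinant
torch.linalg.solve(A, b)  # solve A x = b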

Library: tensorflow

#
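
The same operations with TensorFlow tensors:

import tensorflow as tf

A = tf.constant([[1., 2.], [3., 4.]])
B = tf.constant([[5., 6.], [7., 8.]])
b = tf.constant([[1.], [2.]])   # column vector, since tf.linalg.solve expects a matrix

tf.matmul(A, B)        # matrix multiplication
tf.transpose(A)        # transpose
tf.linalg.inv(A)       # inverse
tf.linalg.det(A)       # determinant
tf.linalg.solve(A, b)  # solve A x = b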

 

 

Sampling for simulation

Library: pytorch

import torch

torch.Tensor(10,3).normal_(0,1)
torch.Tensor(10,3).uniform_(0,1)
torch.Tensor(10,3).exponential_(100)
torch.Tensor(10,3).bernoulli_(0.9)
torch.Tensor(10,3).geometric_(0.9)

Library: numpy

#
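
A minimal sketch of analogous draws with NumPy's Generator API:

import numpy as np

rng = np.random.default_rng(0)
rng.normal(0, 1, size=(10, 3))
rng.uniform(0, 1, size=(10, 3))
rng.exponential(scale=1.0, size=(10, 3))
rng.binomial(n=1, p=0.9, size=(10, 3))   # Bernoulli(0.9) draws
rng.geometric(p=0.9, size=(10, 3))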

Library: tensorflow

#
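
A minimal sketch of analogous draws with TensorFlow's random ops:

import tensorflow as tf

tf.random.normal(shape=(10, 3), mean=0.0, stddev=1.0)
tf.random.uniform(shape=(10, 3), minval=0.0, maxval=1.0)
tf.random.gamma(shape=(10, 3), alpha=1.0)                  # Gamma(1, 1), i.e. the Exponential(1) distribution
tf.random.poisson(shape=(10, 3), lam=5.0)
tf.cast(tf.random.uniform(shape=(10, 3)) < 0.9, tf.int32)  # Bernoulli(0.9) draws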

Library: scipy

from scipy import stats, special

stats.bernoulli.rvs(p=2/3, size=1000)
stats.binom.rvs(n=5, p=2/3, size=1000)
stats.poisson.rvs(mu=5, size=1000)
stats.multinomial.rvs(n=5, p=[.3, .3, .4], size=1000)  # probabilities must sum to 1
stats.randint.rvs(0, 10, size=1000)
stats.geom.rvs(p=.2, size=1000)
stats.nbinom.rvs(n=5, p=.2, size=1000)
stats.norm.rvs(0,1, size=1000)
stats.uniform.rvs(0,10, size=1000)
stats.expon.rvs(0, 1, size=1000)

 

 

 

 


Data Resources

Korea

국회 의안정보시스템 (National Assembly Bill Information System): https://likms.assembly.go.kr/bill/main.do

통계청 국가통계포털 (Statistics Korea, KOSIS): https://kosis.kr/index/index.do

한국은행 경제통계시스템 (Bank of Korea Economic Statistics System, ECOS): https://ecos.bok.or.kr/

한국거래소 정보데이터시스템 (Korea Exchange Information Data System): http://data.krx.co.kr/contents/MDC/MAIN/main/index.cmd

 

 

Law & Press & Society Issues

LIKMS (Bill Information System, 의안정보시스템): https://likms.assembly.go.kr/bill/main.do

KOSIS (Korean Statistical Information Service, 국가통계포털): https://kosis.kr/index/index.do

지표누리 (Korea national indicator portal): https://www.index.go.kr/

 

 

Economy & Finance

BOK ECOS (Bank of Korea Economic Statistics System, 한국은행 경제통계시스템): https://ecos.bok.or.kr/

KRX (Korea Exchange Information Data System, 한국거래소 정보데이터시스템): http://data.krx.co.kr/contents/MDC/MAIN/main/index.cmd

FNGuide: https://comp.fnguide.com/

Value Line: https://www.valueline.co.kr/

 

 

 

Abroad

 

 

 

 

 

 


Reference