Industry-specific framework for feature engineering in machine learning

MvMukesh/FeatureEngineering-Framework-ML

FeatureEngineering-Framework-ML


  1. Variable Types
  • Categorical Variables: Nominal and Ordinal
  • Numerical Variables: Discrete and continuous
  • Mixed variables: strings and numbers
  • Datetime variables
Variable Types Code + Blog Link Video Link
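
As a rough illustration of the variable types listed above, the sketch below builds a small made-up pandas DataFrame and separates its columns by dtype. All column names and values are hypothetical and chosen only for the example.

```python
import pandas as pd

# Hypothetical toy data: one column per variable type from the list above.
df = pd.DataFrame({
    "grade": ["low", "high", "medium", "low"],          # categorical, ordinal
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],        # categorical, nominal
    "n_children": [0, 2, 1, 3],                         # numerical, discrete
    "income": [32000.5, 81000.0, 45500.25, 60000.0],    # numerical, continuous
    "cabin": ["C85", "E46", "B28", "A6"],               # mixed: letters and digits
    "signup": ["2021-01-05", "2021-02-17", "2021-03-01", "2021-03-09"],
})

# Datetime variables usually arrive as strings and need explicit parsing.
df["signup"] = pd.to_datetime(df["signup"])

# An ordinal variable carries an order; a nominal one does not.
df["grade"] = pd.Categorical(
    df["grade"], categories=["low", "medium", "high"], ordered=True
)

numerical_cols = df.select_dtypes(include="number").columns.tolist()
datetime_cols = df.select_dtypes(include="datetime").columns.tolist()

print(numerical_cols)
print(datetime_cols)
print(df["grade"].min())  # smallest level of the ordered categorical
```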
  1. Variable Characteristics
  • Missing Data
  • Cardinality
  • Category Frequency
  • Distributions
  • Outliers
  • Magnitude
Variable Characteristics Code + Blog Link Video Link
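
The characteristics above can be inspected with a few pandas calls. This is a minimal sketch on invented data; the 1.5 x IQR fence used for outliers is one common convention, assumed here rather than taken from the repository.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column and one extreme amount.
df = pd.DataFrame({
    "colour": ["red", "blue", "red", None, "green", "red", "blue", "red"],
    "amount": [10.0, 12.0, 11.0, 13.0, np.nan, 9.0, 12.0, 250.0],
})

# Missing data: fraction of missing values per column.
missing_fraction = df.isnull().mean()

# Cardinality: number of distinct categories (missing values are not counted).
cardinality = df["colour"].nunique()

# Category frequency: share of each category among the observed values.
category_freq = df["colour"].value_counts(normalize=True)

# Outliers: values outside the 1.5 * IQR fences (assumed convention).
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df["amount"] < lower) | (df["amount"] > upper)]

# Magnitude and distribution: summary statistics per numerical column.
summary = df["amount"].describe()
```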
  1. Missing Data Imputation
  • For Numerical Variables
    • Mean and Median Imputation
    • Arbitrary value imputation
    • End of Tail Imputation
  • For Categorical Variables
    • Frequent category imputation
    • Adding a missing category
  • Random Sample Imputation
  • Adding a missing indicator
  • Imputation with Scikit-learn
  • Imputation with Feature-engine
Missing Data Imputation Code + Blog Link Video Link
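
A minimal sketch of several of the imputation techniques above, written in plain pandas. The data, the column names, and the `impute` helper are hypothetical. The statistics are learned on the training set only and then reused on the test set.

```python
import numpy as np
import pandas as pd

# Hypothetical train/test split with missing values in both columns.
train = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0, 41.0],
    "city": ["Pune", "Delhi", None, "Pune", "Pune"],
})
test = pd.DataFrame({"age": [np.nan, 30.0], "city": [None, "Delhi"]})

# Learn the imputation values from the training data only.
median_age = train["age"].median()
frequent_city = train["city"].mode()[0]

def impute(df: pd.DataFrame) -> pd.DataFrame:
    """Apply median, missing-indicator, frequent-category and
    missing-category imputation using the training statistics."""
    out = df.copy()
    out["age_missing"] = out["age"].isnull().astype(int)   # missing indicator
    out["age"] = out["age"].fillna(median_age)             # median imputation
    out["city_frequent"] = out["city"].fillna(frequent_city)
    out["city_labelled"] = out["city"].fillna("Missing")   # extra category
    return out

imputed_test = impute(test)
print(imputed_test)
```

scikit-learn's `SimpleImputer` and the Feature-engine imputers wrap the same idea in fit/transform objects.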
  1. Multivariate Imputation
  • MICE
  • KNN imputation
Multivariate Imputation Code + Blog Link Video Link
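
Both techniques are available in scikit-learn, so a small sketch is possible with the library's own classes. The array is invented for the example, and `IterativeImputer` is a MICE-inspired estimator that returns a single imputation, which may differ from the MICE variant the linked material uses.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

# Hypothetical data: the second column is twice the first, with one gap.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],
    [5.0, 10.0],
])
observed = ~np.isnan(X)

# KNN imputation: average the feature over the 2 nearest rows,
# with distances computed on the features both rows have observed.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

# MICE-style imputation: each feature with gaps is modelled
# from the other features in a round-robin fashion.
X_iter = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)

print(X_knn[3], X_iter[3])
```

Here the two nearest rows to `[4, nan]` are `[3, 6]` and `[5, 10]`, so the KNN imputer fills the gap with their mean.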
  1. Categorical Variable Encoding
  • Traditional Techniques
    • One hot encoding: simple and of frequent categories
    • Ordinal / Label encoding: arbitrary and ordered
    • Count / Frequency encoding
  • Monotonic Relationship
    • Target mean encoding
    • Weight of evidence
    • Ordered label encoding
  • Alternative Techniques
    • Binary encoding
    • Feature hashing
    • Probability Ratio
  • For Rare Labels
    • One hot encoding of frequent categories
    • Grouping of rare categories
    • Rare Label encoding
    • Encoding with Scikit-learn
    • Encoding with category encoders
Categorical Variable Encoding Code + Blog Link Video Link
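
A minimal pandas sketch of four of the encodings above: one hot, count, target mean, and grouping of rare categories. The data and the 20% rarity threshold are assumptions made for the example. In practice the mappings would be learned on the training set only, since target-based encodings otherwise leak the target.

```python
import pandas as pd

# Hypothetical training data with a binary target.
train = pd.DataFrame({
    "city": ["Pune", "Pune", "Delhi", "Delhi", "Pune", "Mumbai", "Goa", "Pune"],
    "target": [1, 0, 1, 1, 1, 0, 0, 1],
})

# One hot encoding: one binary column per category.
one_hot = pd.get_dummies(train["city"], prefix="city")

# Count encoding: replace each category by its number of occurrences.
count_map = train["city"].value_counts().to_dict()
train["city_count"] = train["city"].map(count_map)

# Target mean encoding: replace each category by the mean target value.
mean_map = train.groupby("city")["target"].mean().to_dict()
train["city_target_mean"] = train["city"].map(mean_map)

# Rare label grouping: categories below the threshold become "Rare".
freq = train["city"].value_counts(normalize=True)
rare_labels = freq[freq < 0.2].index  # threshold is an assumption
train["city_grouped"] = train["city"].where(
    ~train["city"].isin(rare_labels), "Rare"
)

print(train)
```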
  1. Variable Transformation
  • Mathematical Transformations
    • Logarithmic
    • Exponential / Power
    • Reciprocal
    • Box-Cox
    • Yeo-Johnson
  • Discretisation
    • Unsupervised
      • Equal-width
      • Equal-frequency
      • K-means
    • Supervised
      • Decision Tree
  • Other
    • Transformation with Scikit-learn
Variable Transformation Code + Blog Link Video Link
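
The mathematical transformations above can be sketched with NumPy and SciPy on a made-up, right-skewed sample. The logarithmic, reciprocal, and Box-Cox transformations require strictly positive values, while Yeo-Johnson also accepts zero and negative values.

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed, strictly positive sample.
x = np.array([1.0, 2.0, 2.5, 3.0, 4.0, 8.0, 15.0, 40.0])

log_x = np.log(x)              # logarithmic
reciprocal_x = 1.0 / x         # reciprocal
sqrt_x = np.power(x, 0.5)      # power transformation with exponent 0.5

# Box-Cox and Yeo-Johnson estimate their lambda by maximum likelihood
# when no lambda is passed, and return (transformed values, lambda).
boxcox_x, boxcox_lambda = stats.boxcox(x)
yeojohnson_x, yeojohnson_lambda = stats.yeojohnson(x)

print(f"Box-Cox lambda: {boxcox_lambda:.3f}")
print(f"Yeo-Johnson lambda: {yeojohnson_lambda:.3f}")
```

In scikit-learn, `PowerTransformer` offers `method="box-cox"` and `method="yeo-johnson"` with the usual fit/transform interface.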
  1. Discretisation
  • Arbitrary
  • Equal-frequency discretisation
  • Equal-width discretisation
  • K-means discretisation
  • Discretisation with trees
  • Discretisation with Scikit-learn
  • Discretisation with Feature-engine
Discretisation Code + Blog Link Video Link
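
A short pandas sketch of arbitrary, equal-width, and equal-frequency discretisation on an invented series. The bin edges and labels for the arbitrary case are assumptions; K-means and tree-based discretisation are not shown.

```python
import pandas as pd

# Hypothetical variable with one large value.
x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Arbitrary discretisation: the analyst chooses the bin edges.
arbitrary = pd.cut(x, bins=[0, 5, 50, 200], labels=["low", "mid", "high"])

# Equal-width: 3 intervals of the same width across the value range.
equal_width = pd.cut(x, bins=3, labels=False)

# Equal-frequency: 2 intervals holding the same number of observations.
equal_freq = pd.qcut(x, q=2, labels=False)

print(arbitrary.value_counts().sort_index())
print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```

With skewed data, equal-width bins can leave almost every observation in the first bin, which is one reason equal-frequency bins are often preferred.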
  1. Outliers
  • Discretisation
  • Capping / Censoring
  • Trimming / Truncation
Outliers Code + Blog Link Video Link
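
Capping and trimming can be sketched as follows. The 1.5 x IQR fences are an assumed convention for locating outliers; other rules (for example mean plus or minus three standard deviations, or fixed percentiles) follow the same pattern.

```python
import pandas as pd

# Hypothetical sample with one extreme value.
s = pd.Series([9.0, 10.0, 11.0, 12.0, 12.0, 13.0, 250.0])

q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Capping / censoring: pull extreme values back to the fences.
capped = s.clip(lower=lower, upper=upper)

# Trimming / truncation: drop the observations outside the fences.
trimmed = s[(s >= lower) & (s <= upper)]

print(capped.tolist())
print(trimmed.tolist())
```

Capping keeps every row, whereas trimming shrinks the dataset.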
  1. Feature Scaling
  • Standardisation (commonly used)
  • MinMaxScaling (commonly used)
  • MaxAbsoluteScaling
  • RobustScaling
  • Scaling to absolute maxima
  • Scaling to median & quantiles
  • Scaling to unit norm

Models affected by feature magnitude

  • Linear & Logistic Regression
  • SVM
  • KNN
  • K-means Clustering
  • LDA
  • PCA
  • Neural Networks

Models insensitive to feature magnitude - Tree Based Models

  • Classification & Regression Trees
  • Random Forest
  • Gradient Boosted Trees
Feature Scaling Code + Blog Link Video Link
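
The scaling methods above reduce to short formulas, sketched here with NumPy on a made-up matrix so each one is visible. In practice the statistics would be learned on the training set and reused on new data, which is what scikit-learn's `StandardScaler`, `MinMaxScaler`, `MaxAbsScaler`, `RobustScaler`, and `Normalizer` do.

```python
import numpy as np

# Hypothetical feature matrix: rows are observations, columns are features.
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [4.0, 40.0],
])

# Standardisation: zero mean and unit variance per feature.
standardised = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max scaling: each feature mapped to [0, 1].
minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Scaling to the absolute maximum: each feature divided by max |x|.
maxabs = X / np.abs(X).max(axis=0)

# Robust scaling: centre on the median, divide by the interquartile range.
q25, q75 = np.percentile(X, [25, 75], axis=0)
robust = (X - np.median(X, axis=0)) / (q75 - q25)

# Scaling to unit norm: each row (observation) divided by its L2 norm.
unit_norm = X / np.linalg.norm(X, axis=1, keepdims=True)
```

Note that unit-norm scaling works per observation, while the other methods work per feature.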
  1. Mixed variables
  • Creating new variables from strings and numbers
Mixed variables Code + Blog Link Video Link
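
Two common shapes of mixed variable can be handled with pandas string methods, as in this sketch. The `cabin` and `open_accounts` columns and their values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    # Letters and digits inside the same value.
    "cabin": ["C85", "E46", "B28", "A6"],
    # Some values are numbers, others are labels.
    "open_accounts": ["3", "A", "7", "B"],
})

# Case 1: split each value into a categorical and a numerical part.
df["cabin_letter"] = df["cabin"].str.extract(r"([A-Za-z]+)", expand=False)
df["cabin_number"] = pd.to_numeric(
    df["cabin"].str.extract(r"(\d+)", expand=False)
)

# Case 2: send numbers to one new variable and labels to another.
df["accounts_numeric"] = pd.to_numeric(df["open_accounts"], errors="coerce")
df["accounts_label"] = df["open_accounts"].where(df["accounts_numeric"].isnull())

print(df)
```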
  1. Datetime Variables
  • Extracting day, month, week, semester, year, etc.
  • Extracting hour, minute, second, etc.
  • Capturing Elapsed time
    • Time between transactions
    • Age
  • Working with timezones
Datetime Code + Blog Link Video Link
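
The datetime features above map onto the pandas `.dt` accessor. The sketch below uses invented timestamps and customer ids, and assumes the raw timestamps are recorded in UTC before converting them to a local timezone.

```python
import pandas as pd

# Hypothetical transaction timestamps for two customers.
df = pd.DataFrame({
    "customer": ["a", "a", "b", "a"],
    "ts": pd.to_datetime([
        "2023-01-15 08:30:00",
        "2023-03-01 17:45:10",
        "2023-07-04 00:05:00",
        "2023-03-11 09:00:00",
    ]),
})

# Calendar and clock parts.
df["year"] = df["ts"].dt.year
df["month"] = df["ts"].dt.month
df["day_of_week"] = df["ts"].dt.dayofweek        # Monday = 0
df["semester"] = (df["ts"].dt.month > 6).astype(int) + 1
df["hour"] = df["ts"].dt.hour

# Elapsed time: whole days between consecutive transactions per customer.
df = df.sort_values(["customer", "ts"])
df["days_since_prev"] = df.groupby("customer")["ts"].diff().dt.days

# Timezones: declare the source timezone (assumed UTC), then convert.
ts_utc = df["ts"].dt.tz_localize("UTC")
ts_local = ts_utc.dt.tz_convert("Asia/Kolkata")

print(df)
print(ts_local)
```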
  1. Text
  • Characters, Words, Unique words
  • Lexical diversity
  • Sentences, Paragraphs
  • Bag of Words
  • TF-IDF
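
The text features above can be sketched with the standard library alone. The documents are made up, lexical diversity is taken here as unique words divided by total words (definitions vary), and the TF-IDF formula is the plain textbook form `tf * log(N / df)`, which is an assumption: scikit-learn's `TfidfVectorizer` uses a smoothed variant and will give different numbers.

```python
import math
import re
from collections import Counter

# Hypothetical corpus.
docs = ["the cat sat on the mat", "the dog sat", "a cat and a dog"]
text = "The cat sat on the mat. The dog sat too! Did it rain?"

# Simple counts on a single text.
n_chars = len(text)
n_sentences = len([s for s in re.split(r"[.!?]+", text) if s.strip()])

# Whitespace tokenisation, one token list per document.
tokenised = [doc.lower().split() for doc in docs]
n_words = [len(tokens) for tokens in tokenised]
n_unique = [len(set(tokens)) for tokens in tokenised]
lexical_diversity = [u / w for u, w in zip(n_unique, n_words)]

# Bag of words: term counts per document.
bow = [Counter(tokens) for tokens in tokenised]

# TF-IDF (textbook form, assumed): term frequency * log(N / document frequency).
n_docs = len(docs)
doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
tfidf = [
    {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in Counter(tokens).items()
    }
    for tokens in tokenised
]

print(bow[0])
print(tfidf[0])
```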
  1. Transactions & Time Series
  • Aggregate data
    • Number of payments in last 3, 6, 12 months
    • Time since last transaction
    • Total spending in last month
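
A minimal sketch of these aggregations with pandas, computed relative to a reference date. The transactions, the reference date, and the `window` helper are hypothetical.

```python
import pandas as pd

# Hypothetical transaction log.
tx = pd.DataFrame({
    "customer": ["a", "a", "a", "b", "b", "b"],
    "date": pd.to_datetime([
        "2023-01-10", "2023-10-05", "2023-12-20",
        "2023-06-01", "2023-12-01", "2023-12-15",
    ]),
    "amount": [50.0, 20.0, 30.0, 100.0, 40.0, 10.0],
})
ref = pd.Timestamp("2023-12-31")  # date at which features are computed

def window(months: int) -> pd.DataFrame:
    """Transactions strictly after (ref - months)."""
    return tx[tx["date"] > ref - pd.DateOffset(months=months)]

# Number of payments in the last 3, 6 and 12 months.
n_payments = pd.DataFrame({
    f"n_{m}m": window(m).groupby("customer").size() for m in (3, 6, 12)
}).fillna(0).astype(int)

# Time since last transaction, in days.
days_since_last = (ref - tx.groupby("customer")["date"].max()).dt.days

# Total spending in the last month.
spend_1m = window(1).groupby("customer")["amount"].sum()

print(n_payments)
print(days_since_last)
print(spend_1m)
```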
  1. Feature Combination
  • Ratio: total debt divided by income --> debt-to-income ratio
  • Sum: debt across different credit cards --> total debt
  • Subtraction: income minus expenses --> disposable income
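
The three combinations above are plain column arithmetic. A short sketch with hypothetical column names and values:

```python
import pandas as pd

# Hypothetical monthly figures per customer.
df = pd.DataFrame({
    "income": [4000.0, 2500.0],
    "expenses": [2500.0, 2000.0],
    "debt_card_a": [500.0, 0.0],
    "debt_card_b": [1500.0, 1000.0],
})

# Sum: debt on the individual cards --> total debt.
df["total_debt"] = df[["debt_card_a", "debt_card_b"]].sum(axis=1)

# Ratio: total debt relative to income --> debt-to-income ratio.
df["debt_to_income"] = df["total_debt"] / df["income"]

# Subtraction: income minus expenses --> disposable income.
df["disposable_income"] = df["income"] - df["expenses"]

print(df)
```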
  1. Pipelines
  • Classification Pipeline
  • Regression Pipeline
  • Pipeline with cross-validation
Pipelines Code + Blog Link Video Link
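
As one possible shape for a classification pipeline with cross-validation, the sketch below chains imputation, scaling, and encoding with a model using scikit-learn. The synthetic data, the column names, and the choice of logistic regression are assumptions for illustration, not the repository's own pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data: the target depends on income plus noise.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "income": rng.normal(50, 10, n),
    "age": rng.integers(18, 70, n).astype(float),
    "city": pd.Series(rng.choice(["Pune", "Delhi", "Mumbai"], n)).astype(object),
})
y = (X["income"] + rng.normal(0, 5, n) > 50).astype(int)
X.loc[rng.choice(n, 20, replace=False), "income"] = np.nan  # inject gaps

# Numerical columns: median imputation, then standardisation.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer([
    ("num", numeric, ["income", "age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

# Every preprocessing step is refitted inside each fold, so no
# statistics leak from the validation split into training.
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())
```

A regression pipeline follows the same structure with a regressor as the final step and a regression metric passed through `scoring`.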