This repository contains the original MATLAB code for DeepInsight, as described in the paper "DeepInsight: A methodology to transform a non-image data to an image for convolution neural network architecture".
This package provides a Python version of the image transformation procedure of DeepInsight. It is not guaranteed to give the same results as the published MATLAB code and should be considered experimental.
python3 -m pip -q install git+https://github.com/alok-ai-lab/DeepInsight.git#egg=DeepInsight
The following is a walkthrough of standard usage of the ImageTransformer class.
from pyDeepInsight import ImageTransformer, LogScaler
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
Load example TCGA data
expr_file = r"./examples/data/tcga.rnaseq_fpkm_uq.example.txt.gz"
expr = pd.read_csv(expr_file, sep="\t")
y = expr['project'].values
X = expr.iloc[:, 1:].values
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=23, stratify=y)
X_train.shape
(480, 5000)
Normalize the data to values between 0 and 1. The normalization procedure used below is described as norm-2 in the DeepInsight paper's supplementary information.
ln = LogScaler()
X_train_norm = ln.fit_transform(X_train)
X_test_norm = ln.transform(X_test)
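For reference, the following is a minimal sketch of a log-plus-min-max scaler of this kind. It is an illustration only, not the actual LogScaler implementation; the pseudocount and the use of global rather than per-feature extrema are assumptions.
import numpy as np

# Illustrative sketch only -- not the actual LogScaler implementation.
class SimpleLogMinMaxScaler:
    def fit(self, X):
        X_log = np.log2(X + 1)  # log-transform with an assumed pseudocount of 1
        self._min, self._max = X_log.min(), X_log.max()
        return self

    def transform(self, X):
        X_log = np.log2(X + 1)
        # scale to [0, 1] using the training extrema, clipping unseen values
        return np.clip((X_log - self._min) / (self._max - self._min), 0, 1)

    def fit_transform(self, X):
        return self.fit(X).transform(X)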
Initialize the image transformer. There are three built-in feature extraction options, 'tsne', 'pca', and 'kpca', to align with the original MATLAB implementation.
it = ImageTransformer(feature_extractor='tsne',
pixels=50, random_state=1701,
n_jobs=-1)
Alternatively, any class instance with a .fit_transform() method that returns a 2-dimensional array of extracted features can be provided to the ImageTransformer class. This allows for customization of the feature extraction procedure.
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, perplexity=30, metric='cosine',
random_state=1701, n_jobs=-1)
it = ImageTransformer(feature_extractor=tsne, pixels=50)
Train the image transformer on the training data. Setting plot=True produces a plot showing the reduced features (blue points), the convex hull (red), and the minimum bounding rectangle (green) prior to rotation.
plt.figure(figsize=(5, 5))
_ = it.fit(X_train_norm, plot=True)
The feature density matrix can be extracted from the trained transformer in order to view the overall feature overlap, i.e. how many features map to each pixel.
fdm = it.feature_density_matrix()
fdm[fdm == 0] = np.nan
plt.figure(figsize=(10, 7))
ax = sns.heatmap(fdm, cmap="viridis", linewidths=0.01,
linecolor="lightgrey", square=True)
ax.xaxis.set_major_locator(ticker.MultipleLocator(5))
ax.yaxis.set_major_locator(ticker.MultipleLocator(5))
for _, spine in ax.spines.items():
spine.set_visible(True)
_ = plt.title("Genes per pixel")
It is possible to update the pixel size without retraining.
px_sizes = [25, (25, 50), 50, 100]
fig, ax = plt.subplots(1, len(px_sizes), figsize=(25, 7))
for ix, px in enumerate(px_sizes):
it.pixels = px
fdm = it.feature_density_matrix()
fdm[fdm == 0] = np.nan
cax = sns.heatmap(fdm, cmap="viridis", linewidth=0.01,
linecolor="lightgrey", square=True,
ax=ax[ix], cbar=False)
cax.set_title('Dim {} x {}'.format(*it.pixels))
for _, spine in cax.spines.items():
spine.set_visible(True)
cax.xaxis.set_major_locator(ticker.MultipleLocator(5))
cax.yaxis.set_major_locator(ticker.MultipleLocator(5))
plt.tight_layout()
it.pixels = 50
The trained transformer can then be used to transform sample data to image matrices.
X_train_img = it.transform(X_train_norm)
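A quick sanity check on the result. One image is produced per sample; whether each image is a grayscale matrix or an RGB array may depend on the package version.
X_train_img.shape  # e.g. (480, 50, 50) or (480, 50, 50, 3)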
Fit and transform can be done in a single step.
X_train_img = it.fit_transform(X_train_norm)
Plotting the image matrices of the first four samples of the training set:
fig, ax = plt.subplots(1, 4, figsize=(25, 7))
for i in range(4):
ax[i].imshow(X_train_img[i])
ax[i].title.set_text("Train[{}] - class '{}'".format(i, y_train[i]))
plt.tight_layout()
Transforming the testing data is done in the same way as transforming the training data.
X_test_img = it.transform(X_test_norm)
fig, ax = plt.subplots(1, 4, figsize=(25, 7))
for i in range(4):
ax[i].imshow(X_test_img[i])
ax[i].title.set_text("Test[{}] - class '{}'".format(i, y_test[i]))
plt.tight_layout()
The image matrices can then be used as input to a CNN model.
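As a minimal sketch of this final step, assuming PyTorch is available and that X_train_img is a float array of shape (n_samples, pixels, pixels, 3) with values in [0, 1], the images can be converted to tensors and passed to a small CNN. The architecture below is illustrative, not the one used in the paper.
import torch
import torch.nn as nn

# Illustrative only: a small CNN, not the architecture from the paper.
# Assumes X_train_img has shape (n_samples, 50, 50, 3).
X_tensor = torch.from_numpy(X_train_img).float().permute(0, 3, 1, 2)  # NHWC -> NCHW
classes, y_idx = np.unique(y_train, return_inverse=True)
y_tensor = torch.from_numpy(y_idx)

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 12 * 12, len(classes)),  # 50 px -> 25 -> 12 after two 2x2 poolings
)
logits = model(X_tensor)                        # (n_samples, n_classes)
loss = nn.CrossEntropyLoss()(logits, y_tensor)  # training loop omitted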