Section IV: Data Visualization in Notebooks

By brsanders, 8 November, 2024

Summary of Key Points

One of the key strengths of computational notebooks is their ability to integrate code, data, and rich visualizations in the same environment. Data visualization helps in understanding patterns, trends, and insights from data, and notebooks provide powerful libraries and tools to create a wide variety of plots and graphs.

1. Common Visualization Libraries

In Python-based notebooks (like Jupyter or Google Colab), several popular libraries are used for creating visualizations. Some of the most widely used ones include:

Matplotlib: A versatile and powerful plotting library used for creating static, animated, and interactive visualizations.
Seaborn: Built on top of Matplotlib, Seaborn provides a high-level interface for creating attractive and informative statistical graphics.
Plotly: A library for interactive plots that can be zoomed, panned, and customized easily.
Altair: A declarative statistical visualization library that generates visualizations using concise code.
Bokeh: Great for interactive, web-ready visualizations.

2. Basic Plots with Matplotlib

Matplotlib is often the go-to library for creating basic visualizations like line charts, bar charts, scatter plots, and histograms. To use it, you’ll first need to import the library:

import matplotlib.pyplot as plt

Here are some common plots:

Line Plot:

import numpy as np

x = np.linspace(0, 10, 100)
y = np.sin(x)

plt.plot(x, y)
plt.title("Sine Wave")
plt.xlabel("x-axis")
plt.ylabel("y-axis")
plt.show()

Bar Chart:

categories = ['A', 'B', 'C', 'D']
values = [4, 7, 1, 8]

plt.bar(categories, values)

plt.title("Bar Chart Example")
plt.show()

Histogram:

data = np.random.randn(1000)
plt.hist(data, bins=30)

plt.title("Histogram of Random Data")
plt.show()
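
Scatter plots are listed among the basic chart types but are not shown above. The following is a minimal sketch, assuming simulated data; the variable names and the fixed random seed are illustrative choices, not part of the original examples.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated data (illustrative only): y depends linearly on x, plus noise
rng = np.random.default_rng(seed=0)
x_vals = rng.uniform(0, 10, size=100)
y_vals = 2.0 * x_vals + rng.normal(0, 2, size=100)

fig, ax = plt.subplots()
points = ax.scatter(x_vals, y_vals, alpha=0.7)  # alpha makes overlapping points visible
ax.set_title("Scatter Plot Example")
ax.set_xlabel("x")
ax.set_ylabel("y")
plt.show()
```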

3. Advanced Plots with Seaborn

Seaborn makes it easy to create visually appealing and informative statistical plots with minimal code. It is particularly useful for creating complex plots, such as pair plots and heatmaps.

You can install Seaborn if it’s not already available:

pip install seaborn

Here are a few examples:

Scatter Plot with Seaborn:

import seaborn as sns
import pandas as pd

# Example dataset
df = pd.DataFrame({
    'x': np.random.rand(50),
    'y': np.random.rand(50),
    'category': np.random.choice(['A', 'B'], 50)
})

sns.scatterplot(x='x', y='y', hue='category', data=df)
plt.title("Scatter Plot with Seaborn")
plt.show()

Heatmap:

# Heatmap of a random 10 x 12 matrix (values between 0 and 1, not correlations)
data = np.random.rand(10, 12)
sns.heatmap(data, annot=True, cmap='coolwarm')
plt.title("Heatmap Example")
plt.show()

4. Interactive Visualizations with Plotly

If you need interactive plots (e.g., zooming, panning, hover tooltips), Plotly is a good choice. Unlike the static images that Matplotlib renders by default in a notebook, Plotly figures respond to the mouse directly in the output cell.

Install Plotly using pip:

pip install plotly

Here’s a simple example of an interactive scatter plot:

import plotly.express as px

df = px.data.iris() # Using Plotly's built-in Iris dataset
fig = px.scatter(df, x='sepal_width', y='sepal_length', color='species')
fig.show()

Plotly also supports a wide range of chart types, from 3D scatter plots to geographic maps.

5. Saving Visualizations

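You can save any plot created in a notebook to an image file for later use, such as in reports or presentations. With Matplotlib, call savefig() in the same cell that draws the plot and before plt.show(); once the figure has been shown and closed, savefig() writes an empty image: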

plt.savefig('my_plot.png')

Similarly, Plotly figures can be saved as static images or as HTML files for sharing or embedding on websites. Static image export relies on the separate kaleido package (pip install kaleido):

fig.write_image('plotly_figure.png') # Save as PNG
fig.write_html('plotly_figure.html') # Save as interactive HTML
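
A minimal sketch of saving a Matplotlib figure with explicit resolution and trimmed margins. The file name and the dpi value are illustrative assumptions.

```python
from pathlib import Path

import numpy as np
import matplotlib.pyplot as plt

x_vals = np.linspace(0, 10, 100)

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x_vals, np.sin(x_vals))
ax.set_title("Sine Wave")

# Save first, then show. dpi sets the resolution and bbox_inches="tight"
# trims the whitespace around the axes.
saved = Path("sine_wave.png")  # illustrative file name
fig.savefig(saved, dpi=300, bbox_inches="tight")
plt.show()
```

Changing the extension (for example to .pdf or .svg) makes savefig() write that format instead.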

6. Combining Multiple Plots

In both Matplotlib and Seaborn, you can combine multiple plots into a single figure. This is particularly useful for comparative visualizations or displaying several datasets at once.

Subplots with Matplotlib:

fig, axs = plt.subplots(1, 2, figsize=(10, 4))

# First plot
axs[0].plot(x, np.sin(x))
axs[0].set_title('Sine Wave')

# Second plot
axs[1].plot(x, np.cos(x))
axs[1].set_title('Cosine Wave')

plt.show()

Boxplot of Gene Expression (Seaborn): A boxplot allows you to compare the distribution of gene expression levels across different samples.

import pandas as pd

# Example gene expression data for different samples
data = {
    'Gene': ['Gene1', 'Gene2', 'Gene3', 'Gene1', 'Gene2', 'Gene3', 'Gene1', 'Gene2', 'Gene3'],
    'Sample': ['Sample1', 'Sample1', 'Sample1', 'Sample2', 'Sample2', 'Sample2', 'Sample3', 'Sample3', 'Sample3'],
    'Expression': [50, 60, 70, 55, 65, 72, 58, 63, 75]
}
df = pd.DataFrame(data)

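# Create a boxplot: one box per sample, summarizing expression across the three genes
# (adding hue='Gene' here would leave a single value in each box)
plt.figure(figsize=(8,6))
sns.boxplot(x='Sample', y='Expression', data=df)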
plt.title('Gene Expression Across Samples')
plt.show()

7. Variant Frequency Visualization (GWAS Data)

In genomics, visualizing the frequency of genetic variants across populations or genomic locations is common in Genome-Wide Association Studies (GWAS).

Manhattan Plot for GWAS Results (Matplotlib + Seaborn): A Manhattan plot visualizes genetic variants across the genome and their significance levels in GWAS.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Simulated data for Manhattan plot
chromosomes = np.tile([1, 2, 3, 4], 50) # Repeated chromosome numbers
positions = np.random.randint(1, 1000000, size=200) # Random positions within each chromosome
p_values = np.random.uniform(0.0001, 1, size=200) # Random p-values for significance

# Create a DataFrame; the last column stores -log10(p), so name it accordingly
gwas_data = pd.DataFrame({'Chromosome': chromosomes, 'Position': positions, 'neg_log10_p': -np.log10(p_values)})

# Plot the Manhattan plot
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Position', y='neg_log10_p', hue='Chromosome', palette='tab20', data=gwas_data, legend=False)
# p=0.05 is used only because the data are simulated; real GWAS usually apply a genome-wide threshold of 5e-8
plt.axhline(y=-np.log10(0.05), color='red', linestyle='--', label='Significance threshold (p=0.05)')
plt.legend() # Display the threshold label
plt.title('Manhattan Plot of GWAS Results')
plt.xlabel('Genomic Position')
plt.ylabel('-log10(p-value)')
plt.show()
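
The plot above places every chromosome on the same position axis, so points from different chromosomes overlap. Manhattan plots usually lay the chromosomes side by side instead. A minimal sketch of that layout, assuming (hypothetically) four chromosomes of equal length and simulated p-values:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=3)
n_per_chrom = 50
chrom_length = 1_000_000  # hypothetical length, identical for all chromosomes

gwas = pd.DataFrame({
    "chrom": np.repeat([1, 2, 3, 4], n_per_chrom),
    "pos": rng.integers(1, chrom_length, size=4 * n_per_chrom),
    "p": rng.uniform(1e-4, 1, size=4 * n_per_chrom),
})
gwas["neg_log10_p"] = -np.log10(gwas["p"])

# Offset each chromosome so that it starts where the previous one ends
gwas["cum_pos"] = gwas["pos"] + (gwas["chrom"] - 1) * chrom_length

fig, ax = plt.subplots(figsize=(10, 4))
colors = ["tab:blue", "tab:orange"]  # alternate colors between chromosomes
for i, (chrom, grp) in enumerate(gwas.groupby("chrom")):
    ax.scatter(grp["cum_pos"], grp["neg_log10_p"], s=10, color=colors[i % 2])

# Label each chromosome at the midpoint of its span
ax.set_xticks([(c - 0.5) * chrom_length for c in [1, 2, 3, 4]])
ax.set_xticklabels(["1", "2", "3", "4"])
ax.set_xlabel("Chromosome")
ax.set_ylabel("-log10(p-value)")
ax.set_title("Manhattan Plot with Chromosomes Side by Side")
plt.show()
```

With real data, the offsets would come from the actual chromosome lengths rather than a single constant.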

8. Interactive Variant Analysis with Plotly

Interactive visualizations can help you explore complex genomic data more intuitively. Let’s use Plotly to create an interactive scatter plot of allele frequencies across different samples.

Interactive Scatter Plot of Allele Frequencies (Plotly):

import plotly.express as px
import pandas as pd

# Simulated allele frequency data
df = pd.DataFrame({
    'Sample': ['Sample1', 'Sample1', 'Sample2', 'Sample2', 'Sample3', 'Sample3'],
    'Allele': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Frequency': [0.45, 0.55, 0.60, 0.40, 0.50, 0.50]
})

# Create the interactive scatter plot
fig = px.scatter(df, x='Sample', y='Frequency', color='Allele', title='Allele Frequencies Across Samples')
fig.show()

9. Population Genetics Visualization

Principal Component Analysis (PCA) of Genomic Data (scikit-learn): PCA is often used to visualize the genetic structure of populations.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Simulated genomic data (allele frequencies or genotypes)
X = np.random.rand(100, 10) # 100 samples, 10 variants

# PCA
pca = PCA(n_components=2)
components = pca.fit_transform(X)

# Plot the first two principal components
plt.figure(figsize=(8, 6))
# The colors are random placeholder values; with real data, pass each sample's population label instead
plt.scatter(components[:, 0], components[:, 1], c=np.random.rand(100), cmap='viridis', s=50)
plt.title('PCA of Genomic Data')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.colorbar(label='Population (random placeholder)')

plt.show()
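
The example above colors the points with random values, so no population structure is visible. The following sketch simulates two hypothetical populations with different allele frequencies, colors the samples by population, and reports the variance explained by each component. The population names, sample sizes, and 0/1/2 genotype coding are assumptions made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(seed=4)
n_per_pop, n_variants = 50, 20

# Two hypothetical populations, each with its own allele frequency per variant
freqs_a = rng.uniform(0.1, 0.9, size=n_variants)
freqs_b = rng.uniform(0.1, 0.9, size=n_variants)

# Genotypes coded as 0, 1, or 2 copies of the alternate allele
geno_a = rng.binomial(2, freqs_a, size=(n_per_pop, n_variants))
geno_b = rng.binomial(2, freqs_b, size=(n_per_pop, n_variants))
genotypes = np.vstack([geno_a, geno_b])
labels = np.array(["Pop A"] * n_per_pop + ["Pop B"] * n_per_pop)

pca = PCA(n_components=2)
pcs = pca.fit_transform(genotypes)
var_pct = pca.explained_variance_ratio_ * 100  # percent of variance per component

fig, ax = plt.subplots(figsize=(8, 6))
for pop in ["Pop A", "Pop B"]:
    mask = labels == pop
    ax.scatter(pcs[mask, 0], pcs[mask, 1], label=pop, s=40)
ax.set_xlabel(f"PC1 ({var_pct[0]:.1f}% of variance)")
ax.set_ylabel(f"PC2 ({var_pct[1]:.1f}% of variance)")
ax.set_title("PCA of Simulated Genotypes")
ax.legend(title="Population")
plt.show()
```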

10. Best Practices for Genomics Data Visualization

When creating genomic visualizations, it’s essential to follow some best practices:

Use Appropriate Color Palettes: For categorical data (e.g., alleles, chromosomes), use distinct color palettes to differentiate between groups.
Label Axes and Legends: Always label the x and y axes clearly, and include a legend that explains what the different colors or symbols represent.
Log Transformations: When visualizing p-values (as in GWAS), consider using log transformations to better display significance (e.g., -log10(p)).
Handle Large Datasets Carefully: Genomics data can be large. Use libraries that handle large data efficiently or generate plots from summary statistics (e.g., allele frequencies, p-values).
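
One way to act on the last point is to thin the data before plotting: keep every variant above a cutoff and only a random sample of the rest. This is a sketch under stated assumptions; the cutoff of 3, the 1% sampling rate, and the simulated p-values are all illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=5)
n_variants = 200_000
# Simulated p-values; the lower bound avoids taking the log of zero
neg_log10_p = -np.log10(rng.uniform(1e-10, 1.0, size=n_variants))
positions = np.arange(n_variants)

# Keep every variant above the cutoff, plus a 1% random sample of the rest
cutoff = 3.0  # hypothetical: -log10(p) > 3, i.e. p < 0.001
strong = neg_log10_p > cutoff
weak_idx = np.flatnonzero(~strong)
sampled_weak = rng.choice(weak_idx, size=len(weak_idx) // 100, replace=False)
keep = np.concatenate([np.flatnonzero(strong), sampled_weak])

fig, ax = plt.subplots(figsize=(10, 4))
# rasterized=True stores the points as an image, which keeps vector output small
ax.scatter(positions[keep], neg_log10_p[keep], s=2, rasterized=True)
ax.set_xlabel("Variant index")
ax.set_ylabel("-log10(p-value)")
ax.set_title(f"Thinned plot: {len(keep):,} of {n_variants:,} variants")
plt.show()
```

The thinning preserves all of the points a reader is likely to care about while drawing far fewer markers near the bottom of the plot.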

Application Tools

n/a

Learning Activities

n/a