Memo for Rewriting Code in Python and R (Data Visualization Edition)

This is the data visualization section of the Python and R conversion memo.
Visualization is an important step in data analysis. This section summarizes commonly used ways to graph data in Python and R.

Visualization tools in Python and R

Python: matplotlib
R: standard (base) graphics or ggplot2

Preparing the data sample (iris)

Here we visualize the iris dataset, which is widely used in statistics, machine learning, and other fields. R ships iris as a standard dataset; in Python you have to prepare it separately. It can be loaded from the datasets bundled with seaborn or scikit-learn, but here we export it from R as a CSV file and read that file in Python.
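The scikit-learn route mentioned above could look like the following sketch. It assumes scikit-learn is installed. The variable name iris_sk and the renaming to R-style column names are choices made here for illustration, so that a DataFrame built this way has the same column names as the CSV exported from R.

```python
import pandas as pd
from sklearn.datasets import load_iris

# load_iris(as_frame=True) returns the measurements as a pandas DataFrame
bunch = load_iris(as_frame=True)
iris_sk = bunch.data.copy()

# Rename scikit-learn's columns to the names R uses
iris_sk.columns = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"]

# The target holds integer codes (0, 1, 2); map them to the species names
iris_sk["Species"] = pd.Categorical.from_codes(bunch.target, bunch.target_names)

# R numbers rows from 1, so mirror that in the index
iris_sk.index = range(1, len(iris_sk) + 1)

print(iris_sk.head())
```

Note that Species is a pandas categorical here, whereas reading the CSV gives plain strings.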

Iris dataset

Consists of 150 samples covering three species of iris (setosa, versicolor, virginica).

Components (column names).
Sepal.Length
Sepal.Width
Petal.Length
Petal.Width
Species

R

> iris
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1            5.1         3.5          1.4         0.2     setosa
2            4.9         3.0          1.4         0.2     setosa
3            4.7         3.2          1.3         0.2     setosa
4            4.6         3.1          1.5         0.2     setosa
5            5.0         3.6          1.4         0.2     setosa

Export the iris dataset from R as a CSV file.

write.csv(iris, "iris.csv")

Python
Read iris.csv from Python. index_col=0 turns the row names written by R into the index.

>>> import pandas as pd
>>> iris = pd.read_csv('iris.csv', index_col=0)
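Why index_col=0 is needed can be seen without running R. The sketch below feeds pandas a few lines laid out the way write.csv writes them (row names in a first column with an empty header); the three rows and the names csv_text, with_index, and without_index are made up for illustration.

```python
import io
import pandas as pd

# First lines of a file as write.csv() lays it out:
# the leading "" header belongs to the row-name column
csv_text = '''"","Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
"1",5.1,3.5,1.4,0.2,"setosa"
"2",4.9,3,1.4,0.2,"setosa"
"3",4.7,3.2,1.3,0.2,"setosa"
'''

# With index_col=0 the row names become the index
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Without it they end up as an extra data column
without_index = pd.read_csv(io.StringIO(csv_text))

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```

Without index_col=0, pandas keeps the row names as a column called "Unnamed: 0", which would then show up in describe() and in plots of all numeric columns.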

Histogram

Example of a histogram with 9 bins whose range runs from the minimum to the maximum of the data.

Python

>>> import numpy as np
>>> import matplotlib.pyplot as plt

>>> x = iris["Sepal.Length"]
# Array of 10 evenly spaced bin edges from min to max (10 edges make 9 bins)
>>> tmp = np.linspace(min(x),max(x),10)
# round(2) rounds the edges to 2 decimal places.
>>> plt.hist(x,bins=tmp.round(2), color='lightgray', ec='black')
>>> plt.show()

R

> x <- iris$Sepal.Length
> hist(x, breaks=seq(min(x),max(x),length.out=10), right=FALSE)

R(ggplot)

> library(ggplot2)
> x <- iris$Sepal.Length
> ggplot(iris, aes(x=Sepal.Length))+geom_histogram(breaks=seq(min(x),max(x),length.out=10), closed="left")
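The bin edges used in the histogram examples above can be checked numerically. The following sketch uses a short made-up array in place of iris["Sepal.Length"] and calls np.histogram, which bins the same way plt.hist does: every bin is closed on the left and open on the right, except the last one, which also includes its right edge. That corresponds to right=FALSE in hist and closed="left" in geom_histogram.

```python
import numpy as np

# Hypothetical sample standing in for iris["Sepal.Length"]
x = np.array([4.3, 4.9, 5.0, 5.1, 5.8, 6.4, 7.0, 7.9])

# 10 evenly spaced edges from min to max delimit 9 bins
edges = np.linspace(x.min(), x.max(), 10).round(2)

# np.histogram applies the same binning rule as plt.hist
counts, _ = np.histogram(x, bins=edges)

print(edges)
print(counts)
```

Because the last bin is closed on both sides, the maximum value is counted and the counts add up to the number of observations.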

Exporting the graph in PNG format

Python

>>> import numpy as np
>>> import matplotlib.pyplot as plt

>>> figsize_px = np.array([800, 600])
>>> dpi = 72
>>> figsize_inch = figsize_px / dpi
>>> fig, ax = plt.subplots(figsize=figsize_inch, dpi=dpi)
>>> x = iris["Sepal.Length"]
>>> tmp = np.linspace(min(x),max(x),10)
>>> ax.hist(iris["Sepal.Length"],bins=tmp.round(2),color='lightgray', ec='black')
>>> fig.savefig("hist_iris_py.png")

R

# Exporting a standard histogram graph.
> png("hist_iris_R.png", width=800, height=600)
> x <- iris$Sepal.Length
> hist(x, breaks=seq(min(x),max(x),length.out=10), right=FALSE)
> dev.off()

# Exporting a histogram graph using ggplot.
> library(ggplot2)
> png("hist_iris_R_ggp.png", width=800, height=600)
> x <- iris$Sepal.Length
> ggplot(iris, aes(x=Sepal.Length))+geom_histogram(breaks=seq(min(x),max(x),length.out=10), closed="left")
> dev.off()
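In the Python export above, the pixel size of the image is the figure size in inches multiplied by dpi, which is why 800 x 600 pixels is divided by dpi before being passed as figsize. The sketch below checks this by saving to memory and reading the size back from the PNG header. The data is made up, and the Agg backend is selected only so that the example runs without a display.

```python
import io
import struct

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no window is needed
import matplotlib.pyplot as plt
import numpy as np

figsize_px = np.array([800, 600])
dpi = 72
fig, ax = plt.subplots(figsize=figsize_px / dpi, dpi=dpi)

# Hypothetical data; any plot will do for checking the size
ax.hist([4.3, 5.0, 5.1, 5.8, 6.4, 7.9], bins=9, color="lightgray", ec="black")

# Save into memory instead of a file, using the same dpi as the figure
buf = io.BytesIO()
fig.savefig(buf, format="png", dpi=dpi)

# A PNG stores width and height as big-endian integers at bytes 16-23
width, height = struct.unpack(">II", buf.getvalue()[16:24])
print(width, height)
```

Passing a different dpi to savefig would scale the pixel size accordingly while leaving the size in inches unchanged.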

Scatter plot

Python

>>> import matplotlib.pyplot as plt
>>> iris.plot('Sepal.Length','Sepal.Width',kind='scatter')
>>> plt.show()

R

> plot(iris$Sepal.Length, iris$Sepal.Width, pch=20, col="blue")

# pch: plotting symbol (20 is a small filled circle).

R(ggplot)

> library(ggplot2)
> ggplot(iris, aes(x=Sepal.Length,y=Sepal.Width))+geom_point()
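The base R call sets the symbol and color with pch and col, while the Python call above uses the defaults. If a closer visual match is wanted, the marker and color keywords, which pandas passes through to matplotlib, are one way to get it. The following is a sketch on a few made-up rows; the name sample is hypothetical, and marker="." is only an approximate counterpart of pch=20.

```python
import matplotlib
matplotlib.use("Agg")  # drop this line to show the plot interactively
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical rows standing in for the iris DataFrame
sample = pd.DataFrame({
    "Sepal.Length": [5.1, 4.9, 7.0, 6.4, 6.3],
    "Sepal.Width": [3.5, 3.0, 3.2, 3.2, 3.3],
})

# marker="." draws small filled dots; color plays the role of col in R
ax = sample.plot("Sepal.Length", "Sepal.Width", kind="scatter",
                 marker=".", color="blue")
plt.show()
```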

Box plot

Python

>>> import matplotlib.pyplot as plt
>>> iris.boxplot(by='Species',column='Sepal.Length')
>>> plt.show()

R

> boxplot(formula=Sepal.Length~Species, data=iris)

R(ggplot)

> ggplot(iris,aes(x=Species, y=Sepal.Length))+geom_boxplot()

Bar graph

Python

>>> import matplotlib.pyplot as plt
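# float_format only changes how numbers are printed, not the plot
>>> pd.options.display.float_format=('{:.2f}'.format)
# describe() summarizes the numeric columns; keep the mean and std of each
>>> df = iris.describe().transpose()[['mean','std']]
# Plot the mean of each column as a bar
>>> df.plot(y='mean', kind='bar')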
>>> plt.show()

R

> library(tidyverse)
> df <- psych::describe(iris[,-5])
> tmp <- select(.data=df, mean)
> x <- tmp[,1]
> names(x) <- rownames(tmp)
> barplot(x,ylim=c(0,round(max(x))))

R(ggplot)

> library(tidyverse)
> df <- psych::describe(iris[,-5]) 
> df <- select(.data=df, mean) 
> tmp <- rownames(df) 
> df %>% ggplot(aes(x=factor(tmp, levels=tmp),y=mean))+geom_col()+xlab(NULL)
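The describe() table in the Python example holds the standard deviation as well as the mean. If error bars are wanted, one possible sketch is to pass the std column as yerr to matplotlib's bar, where capsize sets the width of the caps. The rows below are made up, and the names sample and stats are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # drop this line to show the plot interactively
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical rows standing in for the numeric iris columns
sample = pd.DataFrame({
    "Sepal.Length": [5.1, 4.9, 7.0, 6.4, 6.3],
    "Sepal.Width": [3.5, 3.0, 3.2, 3.2, 3.3],
    "Petal.Length": [1.4, 1.4, 4.7, 4.5, 6.0],
    "Petal.Width": [0.2, 0.2, 1.4, 1.5, 2.5],
})

# One row per column, with its mean and standard deviation
stats = sample.describe().transpose()[["mean", "std"]]

fig, ax = plt.subplots()
# yerr draws the error bars; capsize only has an effect together with yerr
ax.bar(list(stats.index), stats["mean"].to_numpy(),
       yerr=stats["std"].to_numpy(), capsize=10,
       color="lightgray", edgecolor="black")
plt.show()
```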

Displaying the average for each species in a bar graph

Python

>>> import matplotlib.pyplot as plt
>>> group=iris.groupby('Species')
# Mean of every numeric column for each species
>>> df=group.agg('mean')
>>> df.plot(kind='bar')
>>> plt.show()

R

> library(tidyverse)
> mygroup <- iris %>% group_by(Species)
> df <- mygroup %>%
summarize(across(everything(),mean)) %>%
pivot_longer(-Species)
# names.arg labels each bar with its species
> barplot(df$value,names.arg=df$Species)

R(ggplot)

> mygroup <- iris %>% group_by(Species)
> df <- mygroup %>%
summarize(across(everything(),mean)) %>%
pivot_longer(-Species)
> df %>% ggplot(aes(x=Species, y=value, fill=name))+
geom_col(position="dodge")
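The R examples reshape the per-species means into long format with pivot_longer before plotting. In pandas, melt is a rough counterpart. The sketch below uses a few made-up rows; the names sample, wide, and long are hypothetical, and the column names name and value are chosen to match pivot_longer's defaults.

```python
import pandas as pd

# Hypothetical rows standing in for the iris DataFrame
sample = pd.DataFrame({
    "Sepal.Length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "Sepal.Width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "Species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Wide format: one row per species, one column per measurement
wide = sample.groupby("Species").mean()

# Long format: one row per (species, measurement) pair
long = wide.reset_index().melt(id_vars="Species",
                               var_name="name", value_name="value")
print(long)
```

The rows come out in a different order than with pivot_longer: melt stacks one measurement column after another, while pivot_longer goes through the table row by row.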