Memo for Rewriting Code in Python and R (Data Frame Edition)

July 31, 2023

This is the data frame section of the Python and R conversion memo.
In Python, data frames are handled with the pandas library, so let's import it first.

Python

>>> import pandas as pd




Create a data frame

Python

>>> df = pd.DataFrame({
	'name': ["Yamada", "Yamakawa", "Yamamoto", "Yamazaki"],
	'height': [170, 168, 176, 182],
	'weight': [62, 60, 70, 80],
	'age': [35, 40, 28, 33],
	'blood': ["A", "O", "B", "A"]})

>>> df
       name  height  weight  age blood
0    Yamada     170      62   35     A
1  Yamakawa     168      60   40     O
2  Yamamoto     176      70   28     B
3  Yamazaki     182      80   33     A

※ In Python (pandas), row numbers (the index) start from 0.

R

> df <- data.frame(
    name = c("Yamada", "Yamakawa", "Yamamoto", "Yamazaki"),
    height = c(170, 168, 176, 182),
    weight = c(62, 60, 70, 80),
    age = c(35, 40, 28, 33),
    blood = c("A", "O", "B", "A"))
> df
      name height weight age blood
1   Yamada    170     62  35     A
2 Yamakawa    168     60  40     O
3 Yamamoto    176     70  28     B
4 Yamazaki    182     80  33     A

※ In R, row numbers start from 1.

Get the number of rows and columns of a data frame

Python

# Get the number of rows and columns at once.
>>> df.shape
(4, 5)

# Get only the number of rows.
>>> df.shape[0]
4

# Get only the number of columns.
>>> df.shape[1]
5
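
As a small supplementary sketch (the variable names n_rows and n_cols are only illustrative), the shape tuple can also be unpacked in one step, and the built-in len gives the row count directly, much like nrow in R.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Yamada", "Yamakawa", "Yamamoto", "Yamazaki"],
    "height": [170, 168, 176, 182],
    "weight": [62, 60, 70, 80],
    "age": [35, 40, 28, 33],
    "blood": ["A", "O", "B", "A"],
})

# shape is a plain tuple, so it can be unpacked directly.
n_rows, n_cols = df.shape
print(n_rows, n_cols)            # 4 5

# len() on a data frame counts rows; len() on its columns counts columns.
print(len(df), len(df.columns))  # 4 5
```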

R

# Get the number of rows and columns at once.
> dim(df)
[1] 4 5

# Get only the number of rows.
> nrow(df)
[1] 4

# Get only the number of columns.
> ncol(df)
[1] 5

 

Get column names of a data frame

Python

>>> df.columns
Index(['name', 'height', 'weight', 'age', 'blood'], dtype='object')

R

> colnames(df)
[1] "name"   "height" "weight" "age"    "blood" 

 

Get an overview (summary) of a data frame

Python

>>> df.describe()
           height     weight        age
count    4.000000   4.000000   4.000000
mean   174.000000  68.000000  34.000000
std      6.324555   9.092121   4.966555
min    168.000000  60.000000  28.000000
25%    169.500000  61.500000  31.750000
50%    173.000000  66.000000  34.000000
75%    177.500000  72.500000  36.250000
max    182.000000  80.000000  40.000000

・Values returned by the describe method

count : Number of non-missing data points
mean : Mean value
std : Standard deviation
min : Minimum value
25% : 25th percentile (first quartile)
50% : 50th percentile (second quartile) (median)
75% : 75th percentile (third quartile)
max : Maximum value
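
By default, describe only covers the numeric columns, whereas R's summary also lists name and blood. A minimal sketch of getting closer to the R output, assuming the same sample data: passing include="all" adds the string columns, for which pandas reports count, unique, top and freq instead of quartiles. The variable name overview is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Yamada", "Yamakawa", "Yamamoto", "Yamazaki"],
    "height": [170, 168, 176, 182],
    "weight": [62, 60, 70, 80],
    "age": [35, 40, 28, 33],
    "blood": ["A", "O", "B", "A"],
})

# include="all" adds the non-numeric columns to the summary.
overview = df.describe(include="all")

# Show only the rows that are filled in for the string columns.
print(overview.loc[["count", "unique", "top", "freq"], ["name", "blood"]])
```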

R

> summary(df)
     name               height          weight          age           blood          
 Length:4           Min.   :168.0   Min.   :60.0   Min.   :28.00   Length:4          
 Class :character   1st Qu.:169.5   1st Qu.:61.5   1st Qu.:31.75   Class :character  
 Mode  :character   Median :173.0   Median :66.0   Median :34.00   Mode  :character  
                    Mean   :174.0   Mean   :68.0   Mean   :34.00                     
                    3rd Qu.:177.5   3rd Qu.:72.5   3rd Qu.:36.25                     
                    Max.   :182.0   Max.   :80.0   Max.   :40.00

・Values returned by summary:

Min : Minimum value
1st Qu : First quartile (25th percentile)
Median : Median (50th percentile)
Mean : Mean value
3rd Qu : Third quartile (75th percentile)
Max : Maximum value

Get the value of a specific row in a data frame

Python

# Get the data in the 3rd row (position 2, because positions start from 0).
>>> df.iloc[2]
name      Yamamoto
height         176
weight          70
age             28
blood            B
Name: 2, dtype: object

# Get only the value of the 'height' column in the 3rd row.
>>> df.iloc[2].height
176
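
A single cell can also be read in one step, which is the closer counterpart of R's df[3, 'height']. The sketch below, using the same sample data, contrasts loc (selection by index label) with iloc (selection by integer position); the two only coincide while the index is still 0, 1, 2, and so on.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Yamada", "Yamakawa", "Yamamoto", "Yamazaki"],
    "height": [170, 168, 176, 182],
    "weight": [62, 60, 70, 80],
    "age": [35, 40, 28, 33],
    "blood": ["A", "O", "B", "A"],
})

# loc selects by index label and column name.
by_label = df.loc[2, "height"]
# iloc selects by integer position (row 2, column 1).
by_position = df.iloc[2, 1]
print(by_label, by_position)      # 176 176

# After filtering, labels and positions no longer coincide.
type_a = df[df["blood"] == "A"]   # keeps index labels 0 and 3
print(type_a.iloc[1]["name"])     # Yamazaki (second row by position)
print(type_a.loc[3, "name"])      # Yamazaki (row labelled 3)
```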

R

# Get the value of the data in the 3rd row.
> df[3,]
      name height weight age blood
3 Yamamoto    176     70  28     B

# Get only the value of the 'height' column in the 3rd row.
> df[3,'height']
[1] 176

 

Get the data from the first row of a data frame

Python

>>> df.head(1)
     name  height  weight  age blood
0  Yamada     170      62   35     A

R

> head(df, n=1)
    name height weight age blood
1 Yamada    170     62  35     A

 

Get the data from the last row of a data frame

Python

>>> df.tail(1)
       name  height  weight  age blood
3  Yamazaki     182      80   33     A
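
Note that head(1) and tail(1) return a one-row data frame, as the R functions do. When only the values of that row are needed, integer-position indexing is an alternative; a short sketch with the same sample data (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Yamada", "Yamakawa", "Yamamoto", "Yamazaki"],
    "height": [170, 168, 176, 182],
    "weight": [62, 60, 70, 80],
    "age": [35, 40, 28, 33],
    "blood": ["A", "O", "B", "A"],
})

last_frame = df.tail(1)   # DataFrame with a single row
last_row = df.iloc[-1]    # Series holding that row's values

print(type(last_frame).__name__)  # DataFrame
print(type(last_row).__name__)    # Series
print(last_row["name"])           # Yamazaki
```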

R

> tail(df,n=1)
      name height weight age blood
4 Yamazaki    182     80  33     A

 

Calculate the sum of the data

Python

>>> df['height'].sum()
696
>>> df.sum().height
696

R

> sum(df$height)
[1] 696

 

Calculate the mean of the data

Python

# Get the mean value of the 'height' column.
>>> df['height'].mean()
174.0
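# numeric_only=True skips the string columns; pandas 2.0 and later raises a TypeError without it.
>>> df.mean(numeric_only=True).height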
174.0

R

# Get the mean value of the 'height' column.
> mean(df$height)
[1] 174

 

Calculate the median of the data

Python

>>> df['height'].median()
173.0
>>> df.median(numeric_only=True).height
173.0

R

> median(df$height)
[1] 173

 

Calculate the variance of the data

Python

>>> df['height'].var()
40.0
>>> df.var(numeric_only=True).height
40.0

R

> var(df$height)
[1] 40

 

Calculate the standard deviation of the data

Python

>>> df['height'].std()
6.324555320336759
>>> df.std(numeric_only=True).height
6.324555320336759

R

> sd(df$height)
[1] 6.324555

 

Find the maximum value in the data

Python

>>> df['height'].max()
182
>>> df.max().height
182

R

> max(df$height)
[1] 182

 

Find the minimum value in the data

Python

>>> df['height'].min()
168
>>> df.min().height
168

R

> min(df$height)
[1] 168

 

Count the number of data entries

Python

>>> df['height'].count()
4
>>> df.count().height
4
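
The statistics from the preceding sections can also be collected in a single call. The following is a sketch rather than a recommended replacement: it selects the numeric columns by name and passes a list of function names to agg, which returns one row per statistic. The names numeric and stats are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Yamada", "Yamakawa", "Yamamoto", "Yamazaki"],
    "height": [170, 168, 176, 182],
    "weight": [62, 60, 70, 80],
    "age": [35, 40, 28, 33],
    "blood": ["A", "O", "B", "A"],
})

# Restrict the frame to the numeric columns before aggregating.
numeric = df[["height", "weight", "age"]]

# One row per statistic, one column per variable.
stats = numeric.agg(["sum", "mean", "median", "var", "std", "max", "min", "count"])
print(stats)
```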

R

> length(df$height)
[1] 4
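
The two results agree here only because the column has no missing values. pandas' count skips missing values, while R's length (like Python's len) counts every element; the closer R equivalent of count would be sum(!is.na(df$height)). A small Python sketch with a made-up series containing one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the third height is missing.
heights = pd.Series([170, 168, np.nan, 182])

print(heights.count())       # 3 (missing values are skipped)
print(len(heights))          # 4 (every element is counted)
print(heights.isna().sum())  # 1 (number of missing values)
```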

Perform a search on the data

Python

# Get the data of individuals with blood type A.
>>> df[df["blood"] == "A"]
       name  height  weight  age blood
0    Yamada     170      62   35     A
3  Yamazaki     182      80   33     A

# The same search using the query method.
>>> df.query('blood == "A"')
       name  height  weight  age blood
0    Yamada     170      62   35     A
3  Yamazaki     182      80   33     A

R

# Get the data of individuals with blood type A.
> df[df$blood == "A",]
      name height weight age blood
1   Yamada    170     62  35     A
4 Yamazaki    182     80  33     A

# The same search using dplyr's filter function.
> library(dplyr)
> df %>% filter(blood == "A")
      name height weight age blood
1   Yamada    170     62  35     A
2 Yamazaki    182     80  33     A
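
In the outputs above, the dplyr result is renumbered from 1, whereas base R subsetting and both pandas searches keep the original row labels. If a fresh 0-based index is wanted in pandas, reset_index can be applied to the result; a minimal sketch with the same sample data (type_a and renumbered are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Yamada", "Yamakawa", "Yamamoto", "Yamazaki"],
    "height": [170, 168, 176, 182],
    "weight": [62, 60, 70, 80],
    "age": [35, 40, 28, 33],
    "blood": ["A", "O", "B", "A"],
})

type_a = df[df["blood"] == "A"]
print(list(type_a.index))      # [0, 3]

# drop=True discards the old labels instead of keeping them as a column.
renumbered = type_a.reset_index(drop=True)
print(list(renumbered.index))  # [0, 1]
```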

Perform a search with multiple conditions

Python

# Get the data of individuals with blood type A and weight above 70 kg.
>>> df[(df["blood"] == "A") & (df["weight"] > 70)]
       name  height  weight  age blood
3  Yamazaki     182      80   33     A

# The same search using the query method.
>>> df.query('blood == "A" & weight > 70')
       name  height  weight  age blood
3  Yamazaki     182      80   33     A
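
The parentheses around each condition in the boolean-indexing form are required, because & binds more tightly than == and > in Python. The same pattern extends to "or" (|) and "not" (~); the following sketch uses the same sample data, with conditions chosen purely for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Yamada", "Yamakawa", "Yamamoto", "Yamazaki"],
    "height": [170, 168, 176, 182],
    "weight": [62, 60, 70, 80],
    "age": [35, 40, 28, 33],
    "blood": ["A", "O", "B", "A"],
})

# Blood type A OR weight of 70 kg or more.
a_or_heavy = df[(df["blood"] == "A") | (df["weight"] >= 70)]
print(list(a_or_heavy["name"]))  # ['Yamada', 'Yamamoto', 'Yamazaki']

# NOT blood type A.
not_a = df[~(df["blood"] == "A")]
print(list(not_a["name"]))       # ['Yamakawa', 'Yamamoto']
```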

R

# Get the data of individuals with blood type A and weight above 70 kg.
> df[df$weight > 70 & df$blood=="A",]
      name height weight age blood
4 Yamazaki    182     80  33     A

# The same search using dplyr's filter function.
> df %>% filter(blood == "A" & weight > 70)
      name height weight age blood
1 Yamazaki    182     80  33     A

Add or remove rows in a data frame

Python

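# Add row data using the append method.
# Note: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; on current versions use the concat approach below.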
>>> df1 = df.append({
    'name': "Yamaguchi",
    'height': 174,
    'weight': 75,
    'age': 48,
    'blood': "AB"},
    ignore_index=True)

# Create the row data to be added in a variable named 'tmp' and then add the row data using the concat function.
>>> tmp = pd.DataFrame({
    'name': ["Yamaguchi"],
    'height': [174],
    'weight': [75],
    'age': [48],
    'blood': ["AB"]})
>>> df1 = pd.concat([df, tmp], ignore_index=True)

# Check the added data.
>>> df1
        name  height  weight  age blood
0     Yamada     170      62   35     A
1   Yamakawa     168      60   40     O
2   Yamamoto     176      70   28     B
3   Yamazaki     182      80   33     A
4  Yamaguchi     174      75   48    AB

# Remove data by specifying the index number using the drop function.
>>> df1 = df1.drop(index=4)
>>> df1
       name  height  weight  age blood
0    Yamada     170      62   35     A
1  Yamakawa     168      60   40     O
2  Yamamoto     176      70   28     B
3  Yamazaki     182      80   33     A

R

# Create the data frame row to be added in a variable named 'tmp' and then add the data using the rbind function.
> tmp <- data.frame(
    name = c("Yamaguchi"),
    height = c(174),
    weight = c(75),
    age = c(48),
    blood = c("AB"))

> df1 <- rbind(df, tmp)

# Check the added data.
> df1
       name height weight age blood
1    Yamada    170     62  35     A
2  Yamakawa    168     60  40     O
3  Yamamoto    176     70  28     B
4  Yamazaki    182     80  33     A
5 Yamaguchi    174     75  48    AB

# Remove a row by putting a minus sign before its row number.
> df1 <- df1[-5, ]
> df1
      name height weight age blood
1   Yamada    170     62  35     A
2 Yamakawa    168     60  40     O
3 Yamamoto    176     70  28     B
4 Yamazaki    182     80  33     A

Add or remove columns in a data frame

Python

# Add the 'gender' column using the assign function.
>>> df2 = df.assign(gender=["m","f","m","f"])

# Add the 'gender' column to a copied data frame (copy() is needed; plain assignment would only create a second name for df).
>>> df2 = df.copy()
>>> df2["gender"] = ["m", "f", "m", "f"]

# Check the added data.
>>> df2
       name  height  weight  age blood gender
0    Yamada     170      62   35     A      m
1  Yamakawa     168      60   40     O      f
2  Yamamoto     176      70   28     B      m
3  Yamazaki     182      80   33     A      f

# Remove the added 'gender' column.
>>> df2 = df2.drop('gender', axis=1)
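
One difference from R is worth illustrating here. In R, df2 <- df behaves like a copy once df2 is modified, but in Python plain assignment only gives the same data frame a second name. The sketch below, using the same sample data and the illustrative names alias and copied, shows that a column added through an alias also appears in df, while a frame made with copy() stays independent.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Yamada", "Yamakawa", "Yamamoto", "Yamazaki"],
    "height": [170, 168, 176, 182],
    "weight": [62, 60, 70, 80],
    "age": [35, 40, 28, 33],
    "blood": ["A", "O", "B", "A"],
})

copied = df.copy()   # independent data frame
alias = df           # second name for the very same object

alias["gender"] = ["m", "f", "m", "f"]

print(alias is df)                   # True
print("gender" in df.columns)        # True  (df itself was modified)
print("gender" in copied.columns)    # False (the copy is untouched)
```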

R

# Add the 'gender' column using the mutate function.
> library(dplyr)
> df2 <- df %>% mutate(gender = c("m","f","m","f"))

# Add the 'gender' column to a copied data frame.
> df2 <- df
> df2["gender"] <- c("m","f","m","f")

# Remove the added 'gender' column.
> df2 <- df2[, -6]

Export a data frame to a CSV file

Python

>>> df.to_csv('mydata.csv')

R

> write.csv(df, "mydata.csv")

Import a data frame from a CSV file

Python

>>> df = pd.read_csv('mydata.csv', index_col=0)
>>> df
       name  height  weight  age blood
0    Yamada     170      62   35     A
1  Yamakawa     168      60   40     O
2  Yamamoto     176      70   28     B
3  Yamazaki     182      80   33     A
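
The index_col=0 argument is needed above because to_csv writes the index as an unnamed first column by default. As an alternative sketch, assuming the index carries no information worth saving, index=False can be passed when writing, and the data can then be read back without index_col. The example writes to an in-memory string instead of mydata.csv so that it does not touch the disk; csv_text and restored are illustrative names.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "name": ["Yamada", "Yamakawa", "Yamamoto", "Yamazaki"],
    "height": [170, 168, 176, 182],
    "weight": [62, 60, 70, 80],
    "age": [35, 40, 28, 33],
    "blood": ["A", "O", "B", "A"],
})

# With no path given, to_csv returns the CSV text; index=False omits the index column.
csv_text = df.to_csv(index=False)
print(csv_text.splitlines()[0])   # name,height,weight,age,blood

# Read the text back; no index_col is needed because no index was written.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored.shape)             # (4, 5)
```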

R

> df <- read.csv("mydata.csv", row.names = 1)
> df
      name height weight age blood
1   Yamada    170     62  35     A
2 Yamakawa    168     60  40     O
3 Yamamoto    176     70  28     B
4 Yamazaki    182     80  33     A