Memo for Rewriting Code in Python and R (Data Frame Edition)
This is the data frame section of the Python and R conversion memo.
In Python, data frames are handled with the pandas library. Let's import it first.
Python
>>> import pandas as pd
- 1. Creating a data frame.
- 2. Get the number of rows and columns in a data frame
- 3. Get column names of a data frame
- 4. Get an overview (summary) of a data frame
- 5. Get the value of a specific row in a data frame
- 6. Get the data from the first row of a data frame
- 7. Get the data from the last row of a data frame
- 8. Calculate the sum of the data
- 9. Calculate the mean of the data
- 10. Calculate the median of the data
- 11. Calculate the variance of the data
- 12. Calculate the standard deviation of the data
- 13. Find the maximum value in the data
- 14. Find the minimum value in the data
- 15. Count the number of data entries
- 16. Perform a search on the data
- 17. Add or remove rows in a data frame
- 18. Add or remove columns in a data frame
- 19. Export a data frame to a CSV file
- 20. Import a data frame from a CSV file
Creating a data frame.
Python
>>> df = pd.DataFrame({
'name': ["Yamada", "Yamakawa", "Yamamoto", "Yamazaki"],
'height': [170, 168, 176, 182],
'weight': [62, 60, 70, 80],
'age': [35, 40, 28, 33],
'blood': ["A", "O", "B", "A"]})
>>> df
name height weight age blood
0 Yamada 170 62 35 A
1 Yamakawa 168 60 40 O
2 Yamamoto 176 70 28 B
3 Yamazaki 182 80 33 A
※ In Python (pandas), row numbers start from 0.
R
> df <- data.frame(
name = c("Yamada", "Yamakawa", "Yamamoto", "Yamazaki"),
height = c(170, 168, 176, 182), weight = c(62, 60, 70, 80),
age = c(35, 40, 28, 33),
blood = c("A", "O", "B", "A"))
> df
name height weight age blood
1 Yamada 170 62 35 A
2 Yamakawa 168 60 40 O
3 Yamamoto 176 70 28 B
4 Yamazaki 182 80 33 A
※ In R, row numbers start from 1.
Get the number of rows and columns in a data frame
Python
# Get the number of rows and columns at once.
>>> df.shape
(4, 5)
# Get only the number of rows.
>>> df.shape[0]
4
# Get only the number of columns.
>>> df.shape[1]
5
R
# Get the number of rows and columns at once.
> dim(df)
[1] 4 5
# Get only the number of rows.
> nrow(df)
[1] 4
# Get only the number of columns.
> ncol(df)
[1] 5
Get column names of a data frame
Python
>>> df.columns
Index(['name', 'height', 'weight', 'age', 'blood'], dtype='object')
R
> colnames(df)
[1] "name" "height" "weight" "age" "blood"
Get an overview (summary) of a data frame
Python
>>> df.describe()
height weight age
count 4.000000 4.000000 4.000000
mean 174.000000 68.000000 34.000000
std 6.324555 9.092121 4.966555
min 168.000000 60.000000 28.000000
25% 169.500000 61.500000 31.750000
50% 173.000000 66.000000 34.000000
75% 177.500000 72.500000 36.250000
max 182.000000 80.000000 40.000000
・Values returned by the describe function
count : Number of data points
mean : Mean value
std : Standard deviation
min : Minimum value
25% : 25th percentile (first quartile)
50% : 50th percentile (second quartile) (median)
75% : 75th percentile (third quartile)
max : Maximum value
R
> summary(df)
name height weight age blood
Length:4 Min. :168.0 Min. :60.0 Min. :28.00 Length:4
Class :character 1st Qu.:169.5 1st Qu.:61.5 1st Qu.:31.75 Class :character
Mode :character Median :173.0 Median :66.0 Median :34.00 Mode :character
Mean :174.0 Mean :68.0 Mean :34.00
3rd Qu.:177.5 3rd Qu.:72.5 3rd Qu.:36.25
Max. :182.0 Max. :80.0 Max. :40.00
・Values returned by summary:
Min : Minimum value
1st Qu : First quartile (25th percentile)
Median : Median (50th percentile)
Mean : Mean value
3rd Qu : Third quartile (75th percentile)
Max : Maximum value
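One difference between the two outputs above: R's summary covers every column, while describe reports only the numeric columns by default. The following is a minimal sketch, assuming the same df as above, of passing include='all' so that the string columns are summarized as well. The variable name overview is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ["Yamada", "Yamakawa", "Yamamoto", "Yamazaki"],
    'height': [170, 168, 176, 182],
    'weight': [62, 60, 70, 80],
    'age': [35, 40, 28, 33],
    'blood': ["A", "O", "B", "A"]})

# include='all' adds the non-numeric columns to the summary.
# For string columns pandas reports count, unique, top and freq;
# the numeric statistics (mean, std, ...) are NaN for those columns.
overview = df.describe(include='all')

# Show only the rows that are meaningful for the string columns.
print(overview.loc[['count', 'unique', 'top', 'freq'], ['name', 'blood']])
```

Here unique is the number of distinct values, top is the most frequent value, and freq is how often it occurs.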
Get the value of a specific row in a data frame
Python
# Get the value of the data in the 3rd row.
>>> df.iloc[2]
name      Yamamoto
height         176
weight          70
age             28
blood            B
Name: 2, dtype: object
# Get only the value of the 'height' column in the 3rd row.
>>> df.iloc[2].height
176
R
# Get the value of the data in the 3rd row.
> df[3,]
name height weight age blood
3 Yamamoto 176 70 28 B
# Get only the value of the 'height' column in the 3rd row.
> df[3,'height']
[1] 176
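In pandas, iloc selects by position (counting from 0), while loc selects by index label. The two agree on a freshly created data frame but diverge once the rows are reordered. A small sketch, assuming the same df as above (sorted_df, by_position and by_label are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ["Yamada", "Yamakawa", "Yamamoto", "Yamazaki"],
    'height': [170, 168, 176, 182],
    'weight': [62, 60, 70, 80],
    'age': [35, 40, 28, 33],
    'blood': ["A", "O", "B", "A"]})

# Sorting reorders the rows but keeps their original index labels.
sorted_df = df.sort_values('height')

# iloc[0] is whichever row is now first: the shortest person.
# Use ['name'] rather than .name here: on a row, .name returns the row label.
by_position = sorted_df.iloc[0]['name']

# loc[0] is the row whose label is 0, wherever it ended up.
by_label = sorted_df.loc[0, 'name']

print(by_position, by_label)
```

R's df[3,] is positional, so it corresponds to iloc rather than loc.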
Get the data from the first row of a data frame
Python
>>> df.head(1)
name height weight age blood
0 Yamada 170 62 35 A
R
> head(df, n=1)
name height weight age blood
1 Yamada 170 62 35 A
Get the data from the last row of a data frame
Python
>>> df.tail(1)
name height weight age blood
3 Yamazaki 182 80 33 A
R
> tail(df,n=1)
name height weight age blood
4 Yamazaki 182 80 33 A
Calculate the sum of the data
Python
>>> df['height'].sum()
696
>>> df.sum().height
696
R
> sum(df$height)
[1] 696
Calculate the mean of the data
Python
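# Get the mean value of the 'height' column.
>>> df['height'].mean()
174.0
# For the whole data frame, numeric_only=True skips the string columns (needed in pandas 2.0 and later).
>>> df.mean(numeric_only=True).height
174.0
R
# Get the mean value of the 'height' column.
> mean(df$height)
[1] 174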
Calculate the median of the data
Python
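>>> df['height'].median()
173.0
# For the whole data frame, numeric_only=True skips the string columns (needed in pandas 2.0 and later).
>>> df.median(numeric_only=True).height
173.0
R
> median(df$height)
[1] 173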
Calculate the variance of the data
Python
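>>> df['height'].var()
40.0
# For the whole data frame, numeric_only=True skips the string columns (needed in pandas 2.0 and later).
>>> df.var(numeric_only=True).height
40.0
R
> var(df$height)
[1] 40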
Calculate the standard deviation of the data
Python
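>>> df['height'].std()
6.324555320336759
# For the whole data frame, numeric_only=True skips the string columns (needed in pandas 2.0 and later).
>>> df.std(numeric_only=True).height
6.324555320336759
R
> sd(df$height)
[1] 6.324555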
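The two results match because both pandas std and R's sd compute the sample standard deviation (dividing by n - 1). NumPy's std divides by n by default, so it gives a different number for the same data. A short sketch using the height values from above (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

height = pd.Series([170, 168, 176, 182])

# pandas uses ddof=1 by default, the same convention as sd() in R.
sample_std = height.std()

# ddof=0 divides by n instead of n - 1 (population standard deviation).
population_std = height.std(ddof=0)

# NumPy defaults to ddof=0, so it matches the population value.
numpy_std = np.std(height.to_numpy())

print(sample_std, population_std, numpy_std)
```

The same ddof argument applies to var.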
Find the maximum value in the data
Python
>>> df['height'].max()
182
>>> df.max().height
182
R
> max(df$height)
[1] 182
Find the minimum value in the data
Python
>>> df['height'].min()
168
>>> df.min().height
168
R
> min(df$height)
[1] 168
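The sum, mean, median, variance, standard deviation, maximum and minimum shown so far can also be collected in a single call with agg. The following is a sketch, assuming the same df as above. Selecting the numeric columns first with select_dtypes is one way to keep the string columns out of the calculation.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ["Yamada", "Yamakawa", "Yamamoto", "Yamazaki"],
    'height': [170, 168, 176, 182],
    'weight': [62, 60, 70, 80],
    'age': [35, 40, 28, 33],
    'blood': ["A", "O", "B", "A"]})

# Keep only the numeric columns (height, weight, age).
numeric = df.select_dtypes(include='number')

# One row per statistic, one column per numeric column.
stats = numeric.agg(['sum', 'mean', 'median', 'var', 'std', 'max', 'min'])

print(stats)
```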
Count the number of data entries
Python
>>> df['height'].count()
4
>>> df.count().height
4
R
> length(df$height)
[1] 4
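The two counts agree here because the data has no missing values. In general, count in pandas skips missing values, whereas length in R counts every element. A sketch with a hypothetical column that has one missing measurement:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second measurement is missing.
height = pd.Series([170, np.nan, 176, 182])

non_missing = height.count()    # counts only the non-NaN entries
total = len(height)             # counts every entry, like length() in R
missing = height.isna().sum()   # number of NaN entries

print(non_missing, total, missing)
```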
Perform a search on the data
Python
# Get the data of individuals with blood type A.
>>> df[df["blood"] == "A"]
name height weight age blood
0 Yamada 170 62 35 A
3 Yamazaki 182 80 33 A
# Get the data of individuals with blood type A.
>>> df.query('blood =="A"')
name height weight age blood
0 Yamada 170 62 35 A
3 Yamazaki 182 80 33 A
R
# Get the data of individuals with blood type A.
> df[df$blood == "A",]
name height weight age blood
1 Yamada 170 62 35 A
4 Yamazaki 182 80 33 A
# Get the data of individuals with blood type A.
> library(dplyr)
> df %>% filter(blood == "A")
name height weight age blood
1 Yamada 170 62 35 A
2 Yamazaki 182 80 33 A
Perform a search with multiple conditions.
Python
# Get the data of individuals with blood type A who weigh more than 70 kg.
>>> df[(df["blood"] == "A") & (df["weight"] > 70)]
name height weight age blood
3 Yamazaki 182 80 33 A
# Get the data of individuals with blood type A who weigh more than 70 kg.
>>> df.query('blood == "A" & weight > 70')
name height weight age blood
3 Yamazaki 182 80 33 A
R
# Get the data of individuals with blood type A who weigh more than 70 kg.
> df[df$weight > 70 & df$blood=="A",]
name height weight age blood
4 Yamazaki 182 80 33 A
# Get the data of individuals with blood type A who weigh more than 70 kg.
> df %>% filter(blood == "A" & weight > 70)
name height weight age blood
1 Yamazaki 182 80 33 A
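In the boolean-mask form, each condition needs its own parentheses because & binds more tightly than == and > in Python. With query, the conditions can also refer to Python variables by prefixing them with @. A sketch assuming the same df as above (min_weight and blood_type are hypothetical search parameters):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ["Yamada", "Yamakawa", "Yamamoto", "Yamazaki"],
    'height': [170, 168, 176, 182],
    'weight': [62, 60, 70, 80],
    'age': [35, 40, 28, 33],
    'blood': ["A", "O", "B", "A"]})

min_weight = 70     # hypothetical threshold
blood_type = "A"    # hypothetical blood type to search for

# '@name' inside the query string refers to a Python variable.
result = df.query('blood == @blood_type and weight > @min_weight')

# The equivalent boolean-mask form; note the parentheses around each condition.
same = df[(df['blood'] == blood_type) & (df['weight'] > min_weight)]

print(result)
```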
Add or remove rows in a data frame
Python
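# Add row data using the append function (deprecated in pandas 1.4 and removed in pandas 2.0; on current versions use the concat function shown next).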
>>> df1 = df.append({
'name': "Yamaguchi",
'height': 174,
'weight': 75,
'age': 48,
'blood': "AB"},
ignore_index=True)
# Create the row data to be added in a variable named 'tmp' and then add the row data using the concat function.
>>> tmp = pd.DataFrame({
'name': ["Yamaguchi"],
'height': [174],
'weight': [75],
'age': [48],
'blood': ["AB"]})
>>> df1 = pd.concat([df, tmp], ignore_index=True)
# Check the added data.
>>> df1
name height weight age blood
0 Yamada 170 62 35 A
1 Yamakawa 168 60 40 O
2 Yamamoto 176 70 28 B
3 Yamazaki 182 80 33 A
4 Yamaguchi 174 75 48 AB
# Remove data by specifying the index number using the drop function.
>>> df1 = df1.drop(index=4)
>>> df1
name height weight age blood
0 Yamada 170 62 35 A
1 Yamakawa 168 60 40 O
2 Yamamoto 176 70 28 B
3 Yamazaki 182 80 33 A
R
# Create the data frame row to be added in a variable named 'tmp' and then add the data using the rbind function.
> tmp <- data.frame(
name = c("Yamaguchi"),
height = c(174),
weight = c(75),
age = c(48),
blood = c("AB"))
> df1 <- rbind(df, tmp)
# Check the added data.
> df1
name height weight age blood
1 Yamada 170 62 35 A
2 Yamakawa 168 60 40 O
3 Yamamoto 176 70 28 B
4 Yamazaki 182 80 33 A
5 Yamaguchi 174 75 48 AB
# Remove data by adding a minus sign to the index number.
> df1 <-df1[-5,]
> df1
name height weight age blood
1 Yamada 170 62 35 A
2 Yamakawa 168 60 40 O
3 Yamamoto 176 70 28 B
4 Yamazaki 182 80 33 A
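When several rows need to be added, one approach is to collect them in a list and call concat once, rather than concatenating row by row. The sketch below assumes the same df as above. The two new records are hypothetical. It also shows removing rows by condition instead of by index number.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ["Yamada", "Yamakawa", "Yamamoto", "Yamazaki"],
    'height': [170, 168, 176, 182],
    'weight': [62, 60, 70, 80],
    'age': [35, 40, 28, 33],
    'blood': ["A", "O", "B", "A"]})

# Hypothetical new records, one dict per row.
new_rows = [
    {'name': "Yamaguchi", 'height': 174, 'weight': 75, 'age': 48, 'blood': "AB"},
    {'name': "Yamashita", 'height': 165, 'weight': 58, 'age': 52, 'blood': "O"},
]

# Build one data frame from the list and concatenate a single time.
df_all = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)

# Remove rows by condition: keep only people younger than 45.
df_under_45 = df_all[df_all['age'] < 45]

print(df_all)
print(df_under_45)
```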
Add or remove columns in a data frame
Python
# Add the 'gender' column using the assign function.
>>> df2 = df.assign(gender=["m","f","m","f"])
# Add the 'gender' column to a copy of the data frame (copy() keeps the original df unchanged; a plain df2 = df would modify df as well).
>>> df2 = df.copy()
>>> df2["gender"]=["m","f","m","f"]
# Check the added data.
>>> df2
name height weight age blood gender
0 Yamada 170 62 35 A m
1 Yamakawa 168 60 40 O f
2 Yamamoto 176 70 28 B m
3 Yamazaki 182 80 33 A f
# Remove the added 'gender' column.
>>> df2 = df2.drop('gender', axis=1)
R
# Add the 'gender' column using the mutate function.
> library(dplyr)
> df2 <- df %>% mutate(gender = c("m","f","m","f"))
# Add the 'gender' column to a copied data frame.
> df2 <- df
> df2["gender"] <- c("m","f","m","f")
# Remove the added 'gender' column.
> df2 <- df2[, -6]
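One point to watch when moving between the two languages: in R, df2 <- df behaves as an independent copy, but in Python, df2 = df only gives the same object a second name. A small sketch of the difference, using a shortened, hypothetical version of the data frame:

```python
import pandas as pd

df = pd.DataFrame({'name': ["Yamada", "Yamakawa"], 'height': [170, 168]})

alias = df          # a second name for the same object
clone = df.copy()   # an independent copy

clone['gender'] = ["m", "f"]   # only the copy gains this column
alias['age'] = [35, 40]        # df gains this column too

print(list(df.columns))
print(list(clone.columns))
```

After running this, df has the 'age' column but not 'gender', and clone has 'gender' but not 'age'.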
Export a data frame to a CSV file
Python
>>> df.to_csv('mydata.csv')
R
> write.csv(df, "mydata.csv")
Import a data frame from a CSV file
Python
>>> df = pd.read_csv('mydata.csv', index_col=0)
>>> df
name height weight age blood
0 Yamada 170 62 35 A
1 Yamakawa 168 60 40 O
2 Yamamoto 176 70 28 B
3 Yamazaki 182 80 33 A
R
> df <- read.csv("mydata.csv", row.names=1)
> df
name height weight age blood
1 Yamada 170 62 35 A
2 Yamakawa 168 60 40 O
3 Yamamoto 176 70 28 B
4 Yamazaki 182 80 33 A
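By default, to_csv writes the row index as an unnamed first column, which is why index_col=0 is passed when reading the file back. An alternative is to leave the index out with index=False. The sketch below assumes the same df as above and uses an in-memory buffer (io.StringIO) in place of a file so that it can be run without touching the disk.

```python
import io
import pandas as pd

df = pd.DataFrame({
    'name': ["Yamada", "Yamakawa", "Yamamoto", "Yamazaki"],
    'height': [170, 168, 176, 182],
    'weight': [62, 60, 70, 80],
    'age': [35, 40, 28, 33],
    'blood': ["A", "O", "B", "A"]})

# Write the CSV text into a buffer without the row index.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# The text has no index column, so index_col is not needed when reading.
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text.splitlines()[0])
print(restored)
```

The R counterpart of index=False is write.csv(df, "mydata.csv", row.names=FALSE).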




