Memo for Rewriting Code in Python and R (Data Frame Edition)
This is the data frame section of the Python and R conversion memo.
In Python, data frame manipulation is performed by importing the pandas module. Let’s import it first.
Python
>>> import pandas as pd
- 1. Creating a data frame.
- 2. To obtain the number of rows and columns in a data frame
- 3. Get column names of a data frame
- 4. Get an overview (summary) of a data frame
- 5. Get the value of a specific row in a data frame
- 6. Get the data from the first row of a data frame
- 7. Get the data from the last row of a data frame
- 8. Calculate the sum of the data
- 9. Calculate the mean of the data
- 10. Calculate the median of the data
- 11. Calculate the variance of the data
- 12. Calculate the standard deviation of the data
- 13. Find the maximum value in the data
- 14. Find the minimum value in the data
- 15. Count the number of data entries
- 16. Perform a search on the data
- 17. Add or remove rows in a data frame
- 18. Add or remove columns in a data frame
- 19. Export a data frame to a CSV file
- 20. Import a data frame from a CSV file
Creating a data frame.
Python
>>> df = pd.DataFrame({
...     'name': ["Yamada", "Yamakawa", "Yamamoto", "Yamazaki"],
...     'height': [170, 168, 176, 182],
...     'weight': [62, 60, 70, 80],
...     'age': [35, 40, 28, 33],
...     'blood': ["A", "O", "B", "A"]})
>>> df
       name  height  weight  age blood
0    Yamada     170      62   35     A
1  Yamakawa     168      60   40     O
2  Yamamoto     176      70   28     B
3  Yamazaki     182      80   33     A
※ In Python (pandas), the row index starts from 0.
R
> df <- data.frame(
+   name = c("Yamada", "Yamakawa", "Yamamoto", "Yamazaki"),
+   height = c(170, 168, 176, 182),
+   weight = c(62, 60, 70, 80),
+   age = c(35, 40, 28, 33),
+   blood = c("A", "O", "B", "A"))
> df
      name height weight age blood
1   Yamada    170     62  35     A
2 Yamakawa    168     60  40     O
3 Yamamoto    176     70  28     B
4 Yamazaki    182     80  33     A
※ In R, row numbers start from 1.
To obtain the number of rows and columns in a data frame
Python
# Get the number of rows and columns at once.
>>> df.shape
(4, 5)
# Get only the number of rows.
>>> df.shape[0]
4
# Get only the number of columns.
>>> df.shape[1]
5
R
# Get the number of rows and columns at once.
> dim(df)
[1] 4 5
# Get only the number of rows.
> nrow(df)
[1] 4
# Get only the number of columns.
> ncol(df)
[1] 5
Get column names of a data frame
Python
>>> df.columns
Index(['name', 'height', 'weight', 'age', 'blood'], dtype='object')
R
> colnames(df)
[1] "name"   "height" "weight" "age"    "blood"
Get an overview (summary) of a data frame
Python
>>> df.describe()
           height     weight        age
count    4.000000   4.000000   4.000000
mean   174.000000  68.000000  34.000000
std      6.324555   9.092121   4.966555
min    168.000000  60.000000  28.000000
25%    169.500000  61.500000  31.750000
50%    173.000000  66.000000  34.000000
75%    177.500000  72.500000  36.250000
max    182.000000  80.000000  40.000000
・Values returned by the describe method:
count : Number of data points
mean : Mean value
std : Standard deviation
min : Minimum value
25% : 25th percentile (first quartile)
50% : 50th percentile (second quartile) (median)
75% : 75th percentile (third quartile)
max : Maximum value
R
> summary(df)
     name               height          weight          age           blood
 Length:4           Min.   :168.0   Min.   :60.0   Min.   :28.00   Length:4
 Class :character   1st Qu.:169.5   1st Qu.:61.5   1st Qu.:31.75   Class :character
 Mode  :character   Median :173.0   Median :66.0   Median :34.00   Mode  :character
                    Mean   :174.0   Mean   :68.0   Mean   :34.00
                    3rd Qu.:177.5   3rd Qu.:72.5   3rd Qu.:36.25
                    Max.   :182.0   Max.   :80.0   Max.   :40.00
・Values returned by summary:
Min. : Minimum value
1st Qu. : First quartile (25th percentile)
Median : Median (50th percentile)
Mean : Mean value
3rd Qu. : Third quartile (75th percentile)
Max. : Maximum value
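One difference between the two outputs above is that df.describe() reports only the numeric columns by default, while summary(df) also lists the character columns. The following is a minimal sketch, assuming the same sample data as in section 1, of asking pandas to describe every column with include="all". The variable names numeric_summary and full_summary are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Yamada", "Yamakawa", "Yamamoto", "Yamazaki"],
    "height": [170, 168, 176, 182],
    "weight": [62, 60, 70, 80],
    "age": [35, 40, 28, 33],
    "blood": ["A", "O", "B", "A"],
})

# Default: only the numeric columns are summarized.
numeric_summary = df.describe()

# include="all" adds the non-numeric columns; statistics that do not
# apply to a column (for example the mean of 'blood') are shown as NaN.
full_summary = df.describe(include="all")

print(list(numeric_summary.columns))
print(list(full_summary.columns))
print(full_summary.loc[["unique", "top", "freq"], "blood"])
```

For the non-numeric columns, the extra rows unique, top, and freq give the number of distinct values, the most frequent value, and how often it occurs.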
Get the value of a specific row in a data frame
Python
# Get the data in the 3rd row.
>>> df.iloc[2]
name      Yamamoto
height         176
weight          70
age             28
blood            B
Name: 2, dtype: object
# Get only the value of the 'height' column in the 3rd row.
>>> df.iloc[2].height
176
R
# Get the data in the 3rd row.
> df[3,]
      name height weight age blood
3 Yamamoto    176     70  28     B
# Get only the value of the 'height' column in the 3rd row.
> df[3,'height']
[1] 176
Get the data from the first row of a data frame
Python
>>> df.head(1)
     name  height  weight  age blood
0  Yamada     170      62   35     A
R
> head(df, n=1)
    name height weight age blood
1 Yamada    170     62  35     A
Get the data from the last row of a data frame
Python
>>> df.tail(1)
       name  height  weight  age blood
3  Yamazaki     182      80   33     A
R
> tail(df, n=1)
      name height weight age blood
4 Yamazaki    182     80  33     A
Calculate the sum of the data
Python
>>> df['height'].sum()
696
>>> df.sum().height
696
R
> sum(df$height)
[1] 696
Calculate the mean of the data
Python
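# Get the mean value of the 'height' column.
>>> df['height'].mean()
174.0
# On the whole data frame, pass numeric_only=True; pandas 2.0 and later
# raise a TypeError on the string columns otherwise.
>>> df.mean(numeric_only=True).height
174.0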
R
# Get the mean value of the 'height' column.
> mean(df$height)
[1] 174
Calculate the median of the data
Python
>>> df['height'].median()
173.0
# numeric_only=True skips the string columns (required in pandas 2.0 and later).
>>> df.median(numeric_only=True).height
173.0
R
> median(df$height)
[1] 173
Calculate the variance of the data
Python
>>> df['height'].var()
40.0
# numeric_only=True skips the string columns (required in pandas 2.0 and later).
>>> df.var(numeric_only=True).height
40.0
R
> var(df$height)
[1] 40
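Both var() calls above return 40 because pandas and R compute the sample variance, dividing by n - 1. The following is a minimal sketch, using only Python's standard statistics module and the height values from section 1, that contrasts this with the population variance (dividing by n). The variable names are illustrative only.

```python
import statistics

heights = [170, 168, 176, 182]

# Sample variance (divide by n - 1): matches df['height'].var() and R's var().
sample_var = statistics.variance(heights)

# Population variance (divide by n): pandas offers this as var(ddof=0).
population_var = statistics.pvariance(heights)

print(sample_var, population_var)
```

The same n - 1 convention applies to std() in pandas and sd() in R. NumPy's np.var and np.std, by contrast, divide by n unless ddof=1 is passed, so results can differ when code is moved between the libraries.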
Calculate the standard deviation of the data
Python
>>> df['height'].std()
6.324555320336759
# numeric_only=True skips the string columns (required in pandas 2.0 and later).
>>> df.std(numeric_only=True).height
6.324555320336759
R
> sd(df$height)
[1] 6.324555
Find the maximum value in the data
Python
>>> df['height'].max()
182
>>> df.max().height
182
R
> max(df$height)
[1] 182
Find the minimum value in the data
Python
>>> df['height'].min()
168
>>> df.min().height
168
R
> min(df$height)
[1] 168
Count the number of data entries
Python
>>> df['height'].count()
4
>>> df.count().height
4
R
> length(df$height)
[1] 4
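The two calls above agree only because the column has no missing values: pandas count() counts the non-missing entries, whereas R's length() counts every element, NA included. The following is a minimal sketch with a hypothetical series that contains one missing value.

```python
import pandas as pd

# Hypothetical data: the second height is missing.
heights = pd.Series([170, None, 176, 182], name="height")

non_missing = heights.count()   # entries that are not NaN
total = len(heights)            # all entries, like length() in R
missing = heights.isna().sum()  # number of missing entries

print(non_missing, total, missing)
```

In R, sum(!is.na(df$height)) gives the number of non-missing values, which is the closer match to count().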
Perform a search on the data
Python
# Get the data of individuals with blood type A.
>>> df[df["blood"] == "A"]
       name  height  weight  age blood
0    Yamada     170      62   35     A
3  Yamazaki     182      80   33     A
# The same search using the query method.
>>> df.query('blood == "A"')
       name  height  weight  age blood
0    Yamada     170      62   35     A
3  Yamazaki     182      80   33     A
R
# Get the data of individuals with blood type A.
> df[df$blood == "A",]
      name height weight age blood
1   Yamada    170     62  35     A
4 Yamazaki    182     80  33     A
# The same search using the filter function from dplyr.
> library(dplyr)
> df %>% filter(blood == "A")
      name height weight age blood
1   Yamada    170     62  35     A
2 Yamazaki    182     80  33     A
Perform a search with multiple conditions.
Python
# Get the data of individuals with blood type A and weight above 70 kg.
>>> df[(df["blood"] == "A") & (df["weight"] > 70)]
       name  height  weight  age blood
3  Yamazaki     182      80   33     A
# The same search using the query method.
>>> df.query('blood == "A" & weight > 70')
       name  height  weight  age blood
3  Yamazaki     182      80   33     A
R
# Get the data of individuals with blood type A and weight above 70 kg.
> df[df$weight > 70 & df$blood == "A",]
      name height weight age blood
4 Yamazaki    182     80  33     A
# The same search using the filter function from dplyr.
> df %>% filter(blood == "A" & weight > 70)
      name height weight age blood
1 Yamazaki    182     80  33     A
Add or remove rows in a data frame
Python
# Add row data using the append method.
# Note: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so this works only on older versions.
>>> df1 = df.append({
...     'name': "Yamaguchi", 'height': 174, 'weight': 75, 'age': 48, 'blood': "AB"},
...     ignore_index=True)
# Create the row data to be added in a variable named 'tmp' and then add it using the concat function (works on current pandas versions).
>>> tmp = pd.DataFrame({
...     'name': ["Yamaguchi"], 'height': [174], 'weight': [75], 'age': [48], 'blood': ["AB"]})
>>> df1 = pd.concat([df, tmp], ignore_index=True)
# Check the added data.
>>> df1
        name  height  weight  age blood
0     Yamada     170      62   35     A
1   Yamakawa     168      60   40     O
2   Yamamoto     176      70   28     B
3   Yamazaki     182      80   33     A
4  Yamaguchi     174      75   48    AB
# Remove data by specifying the index label using the drop method.
>>> df1 = df1.drop(index=4)
>>> df1
       name  height  weight  age blood
0    Yamada     170      62   35     A
1  Yamakawa     168      60   40     O
2  Yamamoto     176      70   28     B
3  Yamazaki     182      80   33     A
R
# Create the row to be added as a data frame named 'tmp' and then add it using the rbind function.
> tmp <- data.frame(
+   name = c("Yamaguchi"),
+   height = c(174),
+   weight = c(75),
+   age = c(48),
+   blood = c("AB"))
> df1 <- rbind(df, tmp)
# Check the added data.
> df1
       name height weight age blood
1    Yamada    170     62  35     A
2  Yamakawa    168     60  40     O
3  Yamamoto    176     70  28     B
4  Yamazaki    182     80  33     A
5 Yamaguchi    174     75  48    AB
# Remove data by adding a minus sign to the row number.
> df1 <- df1[-5,]
> df1
      name height weight age blood
1   Yamada    170     62  35     A
2 Yamakawa    168     60  40     O
3 Yamamoto    176     70  28     B
4 Yamazaki    182     80  33     A
Add or remove columns in a data frame
Python
# Add the 'gender' column using the assign method.
>>> df2 = df.assign(gender=["m","f","m","f"])
# Add the 'gender' column to a copied data frame.
# Use copy() here: a plain "df2 = df" would only give the same object a second name, so df would change too.
>>> df2 = df.copy()
>>> df2["gender"] = ["m","f","m","f"]
# Check the added data.
>>> df2
       name  height  weight  age blood gender
0    Yamada     170      62   35     A      m
1  Yamakawa     168      60   40     O      f
2  Yamamoto     176      70   28     B      m
3  Yamazaki     182      80   33     A      f
# Remove the added 'gender' column.
>>> df2 = df2.drop('gender', axis=1)
R
# Add the 'gender' column using the mutate function.
> library(dplyr)
> df2 <- df %>% mutate(gender = c("m","f","m","f"))
# Add the 'gender' column to a copied data frame.
> df2 <- df
> df2["gender"] <- c("m","f","m","f")
# Remove the added 'gender' column.
> df2 <- df2[, -6]
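When adding a column to a "copied" data frame, the two languages behave differently. In R, df2 <- df has copy semantics, so modifying df2 leaves df untouched. In pandas, plain assignment copies nothing: both names refer to the same object, and a column added through one name appears under the other. The following is a minimal sketch of the difference, using a shortened version of the sample data; the names alias and independent are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Yamada", "Yamakawa"],
    "height": [170, 168],
})

alias = df                # second name for the same object
independent = df.copy()   # separate data frame

independent["gender"] = ["m", "f"]   # does not touch df
columns_after_copy_edit = list(df.columns)

alias["gender"] = ["m", "f"]         # also changes df
columns_after_alias_edit = list(df.columns)

print(alias is df, independent is df)
print(columns_after_copy_edit)
print(columns_after_alias_edit)
```

The assign method sidesteps the question because it returns a new data frame and leaves the original unchanged.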
Export a data frame to a CSV file
Python
>>> df.to_csv('mydata.csv')
R
> write.csv(df, "mydata.csv")
Import a data frame from a CSV file
Python
>>> df = pd.read_csv('mydata.csv', index_col=0)
>>> df
       name  height  weight  age blood
0    Yamada     170      62   35     A
1  Yamakawa     168      60   40     O
2  Yamamoto     176      70   28     B
3  Yamazaki     182      80   33     A
R
> df <- read.csv("mydata.csv", row.names=1)
> df
      name height weight age blood
1   Yamada    170     62  35     A
2 Yamakawa    168     60  40     O
3 Yamamoto    176     70  28     B
4 Yamazaki    182     80  33     A
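The index_col=0 and row.names=1 arguments are needed above because both to_csv and write.csv write the row labels as the first column by default. The following is a minimal sketch of the alternative on the pandas side, leaving the index out of the file with index=False. The shortened data, the temporary directory, and the variable names are only for illustration.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    "name": ["Yamada", "Yamakawa"],
    "height": [170, 168],
    "blood": ["A", "O"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "mydata.csv")

    # index=False leaves the row labels out of the file ...
    df.to_csv(path, index=False)

    # ... so the file can be read back without index_col.
    restored = pd.read_csv(path)

print(restored)
```

In R, write.csv(df, "mydata.csv", row.names = FALSE) has the same effect, after which read.csv("mydata.csv") needs no row.names argument.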