Memo for Rewriting Code in Python and R (Data Frame Edition)

May 22, 2023 / July 31, 2023

This is the data frame section of the Python and R conversion memo.
In Python, data frames are provided by the pandas module, so let's import it first.

Python

>>> import pandas as pd

1. Creating a data frame.
2. Get the number of rows and columns of a data frame
3. Get column names of a data frame
4. Get an overview (summary) of a data frame
- 4.1. Values returned by the describe function
- 4.2. Values returned by summary
5. Get the value of a specific row in a data frame
6. Get the data from the first row of a data frame
7. Get the data from the last row of a data frame
8. Calculate the sum of the data
9. Calculate the mean of the data
10. Calculate the median of the data
11. Calculate the variance of the data
12. Calculate the standard deviation of the data
13. Find the maximum value in the data
14. Find the minimum value in the data
15. Count the number of data entries
16. Perform a search on the data
- 16.1. Perform a search with multiple conditions.
17. Add or remove rows in a data frame
18. Add or remove columns in a data frame
19. Export a data frame to a CSV file
20. Import a data frame from a CSV file

Creating a data frame.

Python

>>> df = pd.DataFrame({
	'name': ["Yamada", "Yamakawa", "Yamamoto", "Yamazaki"],
	'height': [170, 168, 176, 182],
	'weight': [62, 60, 70, 80],
	'age': [35, 40, 28, 33],
	'blood': ["A", "O", "B", "A"]})

>>> df
       name  height  weight  age blood
0    Yamada     170      62   35     A
1  Yamakawa     168      60   40     O
2  Yamamoto     176      70   28     B
3  Yamazaki     182      80   33     A

※ In Python, row indices start at 0.

R

> df <- data.frame(
    name = c("Yamada", "Yamakawa", "Yamamoto", "Yamazaki"),
    height = c(170, 168, 176, 182),
    weight = c(62, 60, 70, 80),
    age = c(35, 40, 28, 33),
    blood = c("A", "O", "B", "A"))
> df
      name height weight age blood
1   Yamada    170     62  35     A
2 Yamakawa    168     60  40     O
3 Yamamoto    176     70  28     B
4 Yamazaki    182     80  33     A

※ In R, row indices start at 1.

Get the number of rows and columns of a data frame

Python

# Get the number of rows and columns at once.
>>> df.shape
(4, 5)

# Get only the number of rows.
>>> df.shape[0]
4

# Get only the number of columns.
>>> df.shape[1]
5

R

# Get the number of rows and columns at once.
> dim(df)
[1] 4 5

# Get only the number of rows.
> nrow(df)
[1] 4

# Get only the number of columns.
> ncol(df)
[1] 5

Get column names of a data frame

Python

>>> df.columns
Index(['name', 'height', 'weight', 'age', 'blood'], dtype='object')

R

> colnames(df)
[1] "name"   "height" "weight" "age"    "blood"
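
As a small supplement on the Python side, the column names can also be taken out as a plain list, and the dtypes attribute gives a per-column type overview (loosely comparable to what str(df) shows in R). This is only a minimal sketch that rebuilds the same df as above; the variable names column_names and column_types are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ["Yamada", "Yamakawa", "Yamamoto", "Yamazaki"],
    'height': [170, 168, 176, 182],
    'weight': [62, 60, 70, 80],
    'age': [35, 40, 28, 33],
    'blood': ["A", "O", "B", "A"]})

# Convert the Index object into an ordinary Python list.
column_names = df.columns.tolist()

# One dtype per column, returned as a Series indexed by column name.
column_types = df.dtypes

print(column_names)
print(column_types)
```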

Get an overview (summary) of a data frame

Python

>>> df.describe()
           height     weight        age
count    4.000000   4.000000   4.000000
mean   174.000000  68.000000  34.000000
std      6.324555   9.092121   4.966555
min    168.000000  60.000000  28.000000
25%    169.500000  61.500000  31.750000
50%    173.000000  66.000000  34.000000
75%    177.500000  72.500000  36.250000
max    182.000000  80.000000  40.000000

・Values returned by the describe function

count : Number of data points
mean : Mean value
std : Standard deviation
min : Minimum value
25% : 25th percentile (first quartile)
50% : 50th percentile (second quartile) (median)
75% : 75th percentile (third quartile)
max : Maximum value

R

> summary(df)
     name               height          weight          age           blood          
 Length:4           Min.   :168.0   Min.   :60.0   Min.   :28.00   Length:4          
 Class :character   1st Qu.:169.5   1st Qu.:61.5   1st Qu.:31.75   Class :character  
 Mode  :character   Median :173.0   Median :66.0   Median :34.00   Mode  :character  
                    Mean   :174.0   Mean   :68.0   Mean   :34.00                     
                    3rd Qu.:177.5   3rd Qu.:72.5   3rd Qu.:36.25                     
                    Max.   :182.0   Max.   :80.0   Max.   :40.00

・Values returned by summary:

Min : Minimum value
1st Qu : First quartile (25th percentile)
Median : Median (50th percentile)
Mean : Mean value
3rd Qu : Third quartile (75th percentile)
Max : Maximum value
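
R's summary also reports the character columns, while describe skips them by default. If a similar overview is wanted in Python, passing include='all' to describe is one option. The following is a minimal sketch on the same df; the variable name overview is chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ["Yamada", "Yamakawa", "Yamamoto", "Yamazaki"],
    'height': [170, 168, 176, 182],
    'weight': [62, 60, 70, 80],
    'age': [35, 40, 28, 33],
    'blood': ["A", "O", "B", "A"]})

# include='all' adds the non-numeric columns to the overview.
# They get count, unique, top and freq; the numeric statistics are NaN for them.
overview = df.describe(include='all')
print(overview)

# Number of distinct blood types and the most frequent one.
print(overview.loc['unique', 'blood'], overview.loc['top', 'blood'])
```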

Get the value of a specific row in a data frame

Python

# Get the value of the data in the 3rd row.
>>> df.iloc[2]
name      Yamamoto
height         176
weight          70
age             28
blood            B
Name: 2, dtype: object

# Get only the value of the 'height' column in the 3rd row.
>>> df.iloc[2].height
176

R

# Get the value of the data in the 3rd row.
> df[3,]
      name height weight age blood
3 Yamamoto    176     70  28     B

# Get only the value of the 'height' column in the 3rd row.
> df[3,'height']
[1] 176
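
One point that is easy to trip over when converting from R: iloc selects by position, while loc selects by index label. The two agree on a fresh data frame, but not after filtering. The sketch below uses the same df to show the difference; the variable names are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ["Yamada", "Yamakawa", "Yamamoto", "Yamazaki"],
    'height': [170, 168, 176, 182],
    'weight': [62, 60, 70, 80],
    'age': [35, 40, 28, 33],
    'blood': ["A", "O", "B", "A"]})

# Filtering keeps the original index labels (0 and 3 here).
subset = df[df['blood'] == 'A']

# iloc: the second row of the subset, counted by position.
by_position = subset.iloc[1]

# loc: the row whose index label is 3.
by_label = subset.loc[3]

# loc also takes a row label and a column name in one step,
# which is close to df[3, 'height'] in R (apart from the 0-based label).
height = df.loc[2, 'height']

print(by_position['name'], by_label['name'], height)
```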

Get the data from the first row of a data frame

Python

>>> df.head(1)
     name  height  weight  age blood
0  Yamada     170      62   35     A

R

> head(df, n=1)
    name height weight age blood
1 Yamada    170     62  35     A

Get the data from the last row of a data frame

Python

>>> df.tail(1)
       name  height  weight  age blood
3  Yamazaki     182      80   33     A

R

> tail(df, n=1)
      name height weight age blood
4 Yamazaki    182     80  33     A

Calculate the sum of the data

Python

>>> df['height'].sum()
696
>>> df.sum().height
696

R

> sum(df$height)
[1] 696

Calculate the mean of the data

Python

# Get the mean value of the 'height' column.
>>> df['height'].mean()
174.0
>>> df.mean(numeric_only=True).height
174.0

R

# Get the mean value of the 'height' column.
> mean(df$height)
[1] 174
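
Since pandas 2.0, calling mean on a whole data frame that contains string columns raises a TypeError unless numeric_only=True is passed. The sketch below shows two ways to get the mean of every numeric column at once on the same df; select_dtypes is an alternative introduced here, not something the memo above uses.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ["Yamada", "Yamakawa", "Yamamoto", "Yamazaki"],
    'height': [170, 168, 176, 182],
    'weight': [62, 60, 70, 80],
    'age': [35, 40, 28, 33],
    'blood': ["A", "O", "B", "A"]})

# Option 1: let mean skip the non-numeric columns.
means = df.mean(numeric_only=True)

# Option 2: keep only the numeric columns first, then aggregate.
means_selected = df.select_dtypes(include='number').mean()

print(means)
```

The same numeric_only argument is accepted by median, var and std.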

Calculate the median of the data

Python

>>> df['height'].median()
173.0
>>> df.median(numeric_only=True).height
173.0

R

> median(df$height)
[1] 173

Calculate the variance of the data

Python

>>> df['height'].var()
40.0
>>> df.var(numeric_only=True).height
40.0

R

> var(df$height)
[1] 40

Calculate the standard deviation of the data

Python

>>> df['height'].std()
6.324555320336759
>>> df.std(numeric_only=True).height
6.324555320336759

R

> sd(df$height)
[1] 6.324555
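
The pandas and R results agree because both use the sample standard deviation (dividing by n - 1). NumPy's np.std divides by n by default, so switching between pandas and NumPy can silently change the result. The following sketch compares the three on the height values; the variable names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

heights = pd.Series([170, 168, 176, 182])

# pandas default is ddof=1 (sample standard deviation), the same as sd() in R.
sample_sd = heights.std()

# ddof=0 gives the population standard deviation.
population_sd = heights.std(ddof=0)

# NumPy defaults to ddof=0, so it matches population_sd, not sample_sd.
numpy_sd = np.std(heights.to_numpy())

print(sample_sd, population_sd, numpy_sd)
```

The same ddof argument applies to var.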

Find the maximum value in the data

Python

>>> df['height'].max()
182
>>> df.max().height
182

R

> max(df$height)
[1] 182

Find the minimum value in the data

Python

>>> df['height'].min()
168
>>> df.min().height
168

R

> min(df$height)
[1] 168

Count the number of data entries

Python

>>> df['height'].count()
4
>>> df.count().height
4

R

> length(df$height)
[1] 4
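
The two results match here only because the data has no missing values: count in pandas skips missing entries, whereas length in R counts every element including NA. The sketch below uses a hypothetical height column with one missing value (not part of the data above) to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value, for illustration only.
heights = pd.Series([170, np.nan, 176, 182])

# count() ignores missing values.
non_missing = heights.count()

# len() counts every element, which is closer to length() in R.
total = len(heights)

# Number of missing values.
missing = heights.isna().sum()

print(non_missing, total, missing)
```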

Perform a search on the data

Python

# Get the data of individuals with blood type A.
>>> df[df["blood"] == "A"]
       name  height  weight  age blood
0    Yamada     170      62   35     A
3  Yamazaki     182      80   33     A

# Get the data of individuals with blood type A.
>>> df.query('blood == "A"')
       name  height  weight  age blood
0    Yamada     170      62   35     A
3  Yamazaki     182      80   33     A

R

# Get the data of individuals with blood type A.
> df[df$blood == "A",]
      name height weight age blood
1   Yamada    170     62  35     A
4 Yamazaki    182     80  33     A

# Get the data of individuals with blood type A.
> library(dplyr)
> df %>% filter(blood == "A")
      name height weight age blood
1   Yamada    170     62  35     A
2 Yamazaki    182     80  33     A

Perform a search with multiple conditions.

Python

# Get the data of individuals with blood type A who weigh more than 70 kg.
>>> df[(df["blood"] == "A") & (df["weight"] > 70)]
       name  height  weight  age blood
3  Yamazaki     182      80   33     A

# Get the data of individuals with blood type A who weigh more than 70 kg.
>>> df.query('blood == "A" & weight > 70')
       name  height  weight  age blood
3  Yamazaki     182      80   33     A

R

# Get the data of individuals with blood type A who weigh more than 70 kg.
> df[df$weight > 70 & df$blood == "A",]
      name height weight age blood
4 Yamazaki    182     80  33     A

# Get the data of individuals with blood type A who weigh more than 70 kg.
> df %>% filter(blood == "A" & weight > 70)
      name height weight age blood
1 Yamazaki    182     80  33     A
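
Whether the boundary value is included depends on the operator: > 70 excludes a weight of exactly 70, while >= 70 includes it. The sketch below shows this on the same df using blood type B (Yamamoto weighs exactly 70), and adds an "or" condition and isin as further variations. These variations are supplementary examples, not part of the memo above.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ["Yamada", "Yamakawa", "Yamamoto", "Yamazaki"],
    'height': [170, 168, 176, 182],
    'weight': [62, 60, 70, 80],
    'age': [35, 40, 28, 33],
    'blood': ["A", "O", "B", "A"]})

# ">=" keeps the row with weight exactly 70; ">" drops it.
inclusive = df[(df['blood'] == 'B') & (df['weight'] >= 70)]
strict = df[(df['blood'] == 'B') & (df['weight'] > 70)]

# "or" condition: blood type A, or weight of 70 kg or more.
a_or_heavy = df.query('blood == "A" or weight >= 70')

# isin: match any of several values in one column.
o_or_b = df[df['blood'].isin(['O', 'B'])]

print(inclusive['name'].tolist(), len(strict))
print(a_or_heavy['name'].tolist())
print(o_or_b['name'].tolist())
```

In the bracket form, each condition needs its own parentheses because & and | bind more tightly than the comparison operators.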

Add or remove rows in a data frame

Python

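# Add row data using the append function (deprecated in pandas 1.4 and removed in pandas 2.0; on newer versions use the concat approach below).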
>>> df1 = df.append({
    'name': "Yamaguchi",
    'height': 174,
    'weight': 75,
    'age': 48,
    'blood': "AB"},
    ignore_index=True)

# Create the row data to be added in a variable named 'tmp' and then add the row data using the concat function.
>>> tmp = pd.DataFrame({
    'name': ["Yamaguchi"],
    'height': [174],
    'weight': [75],
    'age': [48],
    'blood': ["AB"]})
>>> df1 = pd.concat([df, tmp], ignore_index=True)

# Check the added data.
>>> df1
        name  height  weight  age blood
0     Yamada     170      62   35     A
1   Yamakawa     168      60   40     O
2   Yamamoto     176      70   28     B
3   Yamazaki     182      80   33     A
4  Yamaguchi     174      75   48    AB

# Remove data by specifying the index number using the drop function.
>>> df1 = df1.drop(index=4)
>>> df1
       name  height  weight  age blood
0    Yamada     170      62   35     A
1  Yamakawa     168      60   40     O
2  Yamamoto     176      70   28     B
3  Yamazaki     182      80   33     A

R

# Create the row to be added as a data frame named 'tmp', then add it using the rbind function.
> tmp <- data.frame(
    name = c("Yamaguchi"),
    height = c(174),
    weight = c(75),
    age = c(48),
    blood = c("AB"))

> df1 <- rbind(df, tmp)

# Check the added data.
> df1
       name height weight age blood
1    Yamada    170     62  35     A
2  Yamakawa    168     60  40     O
3  Yamamoto    176     70  28     B
4  Yamazaki    182     80  33     A
5 Yamaguchi    174     75  48    AB

# Remove data by adding a minus sign to the index number.
> df1 <- df1[-5,]
> df1
      name height weight age blood
1   Yamada    170     62  35     A
2 Yamakawa    168     60  40     O
3 Yamamoto    176     70  28     B
4 Yamazaki    182     80  33     A
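
When several rows need to be added in Python, it is usually simpler to collect them in a list and call concat once rather than adding them one at a time. The sketch below assumes this approach; the second row ("Yamashita" and its values) is hypothetical data made up for the illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ["Yamada", "Yamakawa", "Yamamoto", "Yamazaki"],
    'height': [170, 168, 176, 182],
    'weight': [62, 60, 70, 80],
    'age': [35, 40, 28, 33],
    'blood': ["A", "O", "B", "A"]})

# Rows to add, one dict per row. The second row is hypothetical.
new_rows = [
    {'name': "Yamaguchi", 'height': 174, 'weight': 75, 'age': 48, 'blood': "AB"},
    {'name': "Yamashita", 'height': 165, 'weight': 58, 'age': 30, 'blood': "O"},
]

# Build one data frame from the list and concatenate it in a single call.
df1 = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)
rows_after_add = len(df1)

# drop accepts a list of index labels, so several rows can be removed at once.
df1 = df1.drop(index=[4, 5])
rows_after_drop = len(df1)

print(rows_after_add, rows_after_drop)
```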

Add or remove columns in a data frame

Python

# Add the 'gender' column using the assign function.
>>> df2 = df.assign(gender=["m","f","m","f"])

# Add the 'gender' column to a copy of the data frame (copy() keeps the original df unchanged).
>>> df2 = df.copy()
>>> df2["gender"] = ["m","f","m","f"]

# Check the added data.
>>> df2
       name  height  weight  age blood gender
0    Yamada     170      62   35     A      m
1  Yamakawa     168      60   40     O      f
2  Yamamoto     176      70   28     B      m
3  Yamazaki     182      80   33     A      f

# Remove the added 'gender' column.
>>> df2 = df2.drop('gender', axis=1)

R

# Add the 'gender' column using the mutate function.
> library(dplyr)
> df2 <- df %>% mutate(gender = c("m","f","m","f"))

# Add the 'gender' column to a copied data frame.
> df2 <- df
> df2["gender"] <- c("m","f","m","f")

# Remove the added 'gender' column.
> df2 <- df2[, -6]
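
A difference worth noting here: in R, df2 <- df behaves as a copy (copy-on-modify), but in Python a plain assignment only binds a second name to the same object. Calling copy is what produces an independent data frame. The sketch below checks this on the same df; the variable names are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ["Yamada", "Yamakawa", "Yamamoto", "Yamazaki"],
    'height': [170, 168, 176, 182],
    'weight': [62, 60, 70, 80],
    'age': [35, 40, 28, 33],
    'blood': ["A", "O", "B", "A"]})

# Plain assignment: both names refer to the same object.
alias = df
same_object = alias is df

# copy() creates an independent data frame.
df_copy = df.copy()
df_copy["gender"] = ["m", "f", "m", "f"]

# The original is unchanged because only the copy was modified.
original_untouched = "gender" not in df.columns

print(same_object, original_untouched)
```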

Export a data frame to a CSV file

Python

>>> df.to_csv('mydata.csv')

R

> write.csv(df, "mydata.csv")

Import a data frame from a CSV file

Python

>>> df = pd.read_csv('mydata.csv', index_col=0)
>>> df
       name  height  weight  age blood
0    Yamada     170      62   35     A
1  Yamakawa     168      60   40     O
2  Yamamoto     176      70   28     B
3  Yamazaki     182      80   33     A

R

> df <- read.csv("mydata.csv", row.names=1)
> df
      name height weight age blood
1   Yamada    170     62  35     A
2 Yamakawa    168     60  40     O
3 Yamamoto    176     70  28     B
4 Yamazaki    182     80  33     A
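
The index_col=0 argument is needed above because to_csv writes the row index as an unnamed first column by default. If the index carries no information, one alternative is to leave it out when writing. The sketch below assumes that case and writes to a temporary directory purely so that the example cleans up after itself.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    'name': ["Yamada", "Yamakawa", "Yamamoto", "Yamazaki"],
    'height': [170, 168, 176, 182],
    'weight': [62, 60, 70, 80],
    'age': [35, 40, 28, 33],
    'blood': ["A", "O", "B", "A"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'mydata.csv')

    # index=False leaves the row index out of the file.
    df.to_csv(path, index=False)

    # No index_col is needed because the file has no index column.
    restored = pd.read_csv(path)

print(restored)
```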
