Filling Test Data Using Python's Data Science Modules
Check which version of Python MySQL Shell is running .
MySQL 127.0.0.1:33060+ ssl cookbook Py > import sys
MySQL 127.0.0.1:33060+ ssl cookbook Py > sys.version
3.9.6 (default, Aug 25 2021, 16:22:38)
[GCC 8.5.0 20210514 (Red Hat 8.5.0-3)]
MySQL 127.0.0.1:33060+ ssl cookbook Py > sys.path.append("/home/maxwell/.local/lib/python3.9/site-packages")
Data filling step-by-step
MySQL 127.0.0.1:33060+ ssl cookbook Py >
MySQL 127.0.0.1:33060+ ssl cookbook Py > import numpy as np
MySQL 127.0.0.1:33060+ ssl cookbook Py > import pandas as pd
MySQL 127.0.0.1:33060+ ssl cookbook Py > import random
MySQL 127.0.0.1:33060+ ssl cookbook Py > import string
MySQL 127.0.0.1:33060+ ssl cookbook Py > from datetime import datetime
MySQL 127.0.0.1:33060+ ssl cookbook Py > from datetime import timedelta
MySQL 127.0.0.1:33060+ ssl cookbook Py > num_rows=1000
MySQL 127.0.0.1:33060+ ssl cookbook Py >
Names and genders
[maxwell@MaxwellDBA datasets]$ pwd
/home/maxwell/datasets
[maxwell@MaxwellDBA datasets]$ head top-male-and-female-names.csv
Rank,Female Name,Count,Male Name,Count_
1,Mary,54276,John,108533
2,Margaret,49170,William,87239
3,Elizabeth,36556,James,69987
4,Sarah,28230,David,62774
5,Patricia,20689,Robert,56511
6,Catherine,19713,Michael,51768
7,Susan,19165,Peter,44758
8,Helen,18881,Thomas,42467
9,Emma,18192,George,39195
[maxwell@MaxwellDBA datasets]$
MySQL 127.0.0.1:33060+ ssl cookbook Py >
MySQL 127.0.0.1:33060+ ssl cookbook Py > import sys
MySQL 127.0.0.1:33060+ ssl cookbook Py > sys.version
3.9.6 (default, Aug 25 2021, 16:22:38)
[GCC 8.5.0 20210514 (Red Hat 8.5.0-3)]
MySQL 127.0.0.1:33060+ ssl cookbook Py > sys.path.append("/home/maxwell/.local/lib/python3.9/site-packages")
MySQL 127.0.0.1:33060+ ssl cookbook Py >
MySQL 127.0.0.1:33060+ ssl cookbook Py > import numpy as np
MySQL 127.0.0.1:33060+ ssl cookbook Py > import pandas as pd
MySQL 127.0.0.1:33060+ ssl cookbook Py > import random
MySQL 127.0.0.1:33060+ ssl cookbook Py > import string
MySQL 127.0.0.1:33060+ ssl cookbook Py > from datetime import datetime
MySQL 127.0.0.1:33060+ ssl cookbook Py > from datetime import timedelta
MySQL 127.0.0.1:33060+ ssl cookbook Py > num_rows=1000
MySQL 127.0.0.1:33060+ ssl cookbook Py > female_names=pd.read_csv("/home/maxwell/datasets/top-male-and-female-names.csv",usecols=["Female Name"]).rename(columns={'Female Name': 'name'})
MySQL 127.0.0.1:33060+ ssl cookbook Py > male_names=pandas.read_csv("/home/maxwell/datasets/top-male-and-female-names.csv",usecols=["Male Name"]).rename(columns={'Male Name': 'name'})
MySQL 127.0.0.1:33060+ ssl cookbook Py > male_names=pd.read_csv("/home/maxwell/datasets/top-male-and-female-names.csv",usecols=["Male Name"]).rename(columns={'Male Name': 'name'})
MySQL 127.0.0.1:33060+ ssl cookbook Py > female_names['gender']=(['F']*female_names.count()['name'])
MySQL 127.0.0.1:33060+ ssl cookbook Py > male_names['gender']=(['M']*male_names.count()['name'])
MySQL 127.0.0.1:33060+ ssl cookbook Py > names=pd.concat([female_names, male_names],ignore_index=True)
MySQL 127.0.0.1:33060+ ssl cookbook Py > import os
MySQL 127.0.0.1:33060+ ssl cookbook Py > os.getcwd()
/home/maxwell
MySQL 127.0.0.1:33060+ ssl cookbook Py > os.chdir('/home/maxwell/datasets')
MySQL 127.0.0.1:33060+ ssl cookbook Py > os.getcwd()
/home/maxwell/datasets
MySQL 127.0.0.1:33060+ ssl cookbook Py > names
name gender
0 Mary F
1 Margaret F
2 Elizabeth F
3 Sarah F
4 Patricia F
5 Catherine F
6 Susan F
7 Helen F
8 Emma F
9 John M
10 William M
11 James M
12 David M
13 Robert M
14 Michael M
15 Peter M
16 Thomas M
17 George M
MySQL 127.0.0.1:33060+ ssl cookbook Py > names=names.sample(num_rows, replace=True).set_index(pd.Series(range(num_rows)))
MySQL 127.0.0.1:33060+ ssl cookbook Py >
Surnames
MySQL 127.0.0.1:33060+ ssl cookbook Py > surnames=pd.read_csv("/home/maxwell/datasets/Names_2010Census.csv",usecols=["name"],skipfooter=1,engine='python').rename(columns={'name': 'surname'})
MySQL 127.0.0.1:33060+ ssl cookbook Py > surnames=surnames.sample(num_rows,replace=True).set_index(pd.Series(range(num_rows)))
MySQL 127.0.0.1:33060+ ssl cookbook Py >
Diagnoses
MySQL 127.0.0.1:33060+ ssl cookbook Py > diagnoses=pd.read_csv('/home/maxwell/datasets/diagnosis.csv')
MySQL 127.0.0.1:33060+ ssl cookbook Py > diagnoses
diagnosis frequency
0 Acute coronary syndrome 2.1
1 Alcoholic liver disease 0.3
2 Pneumonia 3.6
3 Chronic obstructive pulmonary disease 2.1
4 Gastro-intestinal bleed 0.8
5 Heart failure 0.8
6 Sepsis 0.8
7 Urinary tract infection 2.4
8 Data Phobia 6.2
MySQL 127.0.0.1:33060+ ssl cookbook Py > diagnoses=diagnoses.sample(num_rows,replace=True,weights=diagnoses['frequency']).set_index(pd.Series(range(num_rows)))
Results
MySQL 127.0.0.1:33060+ ssl cookbook Py > results=pandas.DataFrame({"result":["R","N","D"],"frequency":[6,3,1]})
MySQL 127.0.0.1:33060+ ssl cookbook Py > results=results.sample(num_rows,replace=True,weights=results['frequency']).set_index(pd.Series(range(num_rows)))
MySQL 127.0.0.1:33060+ ssl cookbook Py >
The table
MySQL 127.0.0.1:33060+ ssl cookbook Py > patients=session.get_schema('cookbook').get_table('patients')
MySQL 127.0.0.1:33060+ ssl cookbook Py > # we'll start start the loop
MySQL 127.0.0.1:33060+ ssl cookbook Py > for i in range(num_rows):
National ID
national_id=str(random.randrange(10,99)) +''.join(random.sample(string.ascii_uppercase, 2)) + str(random.randrange(100000, 999999))
Age
MySQL 127.0.0.1:33060+ ssl cookbook Py > age=random.randrange(15,99)
Dates a patient spent in the hospital
MySQL 127.0.0.1:33060+ ssl cookbook Py > date_arrived=datetime.strptime('2020-01-01','%Y-%m-%d')+ timedelta(days=random.randrange(365))
MySQL 127.0.0.1:33060+ ssl cookbook Py > date_departed=date_arrived+timedelta(days=random.randrange(60))
MySQL 127.0.0.1:33060+ ssl cookbook Py > date_arrived=date_arrived.strftime('%Y-%m-%d')
MySQL 127.0.0.1:33060+ ssl cookbook Py > date_departed=date_departed.strftime('%Y-%m-%d')
MySQL 127.0.0.1:33060+ ssl cookbook Py >
Putting it all together
import numpy as np
import pandas as pd
import random
import string
from datetime import datetime
from datetime import timedelta
num_rows=1000
def generate_patients_data(num_rows):
# read datasets
# names and genders
female_names=pd.read_csv("/home/maxwell/datasets/top-male-and-female-names.csv",
usecols=["Female Name"]).rename(columns={'Female Name':'name'})
female_names['gender']=(['F']*female_names.count()['name'])
male_names=pd.read_csv("/home/maxwell/datasets/top-male-and-female-names.csv",
usecols=["Male Name"]).rename(columns={'Male Name':'name'})
male_names['gender']=(['M']*male_names.count()['name'])
names=pd.concat([female_names,male_names],ignore_index=True)
surnames=pd.read_csv("/home/maxwell/datasets/Names_2010Census.csv",
usecols=['name'],skipfooter=1,engine='python').rename(columns={'name':'surname'})
# diagnoses
diagnoses=pd.read_csv('/home/maxwell/datasets/diagnosis.csv')
# Possible results
results=pd.DataFrame({
"result":["R","N","D"],
"frequency":[6,3,1]})
# Start building data
diagnoses=diagnoses.sample(
num_rows,replace=True,
weights=diagnoses['frequency']).set_index(pd.Series(range(num_rows)))
results=results.sample(num_rows,replace=True,weights=results['frequency']).set_index(pd.Series(range(num_rows)))
names=names.sample(num_rows,replace=True).set_index(pd.Series(range(num_rows)))
surnames=surnames.sample(num_rows,replace=True).set_index(pd.Series(range(num_rows)))
# Get table object
patients=session.get_schema('cookbook').get_table('patients')
# Loop, inserting rows
for i in range(num_rows):
national_id=str(random.randrange(10,99))+ \
''.join(random.sample(string.ascii_uppercase,2))+ \
str(random.randrange(100000,999999))
age=random.randrange(15,99)
date_arrived=datetime.strptime('2020-01-01','%Y-%m-%d')+ \
timedelta(days=random.randrange(365))
date_departed=date_arrived+timedelta(days=random.randrange(60))
date_arrived=date_arrived.strftime('%Y-%m-%d')
date_departed=date_departed.strftime('%Y-%m-%d')
name=names['name'][i]
gender=names['gender'][i]
surname=surnames['surname'][i]
result=results['result'][i]
diagnosis=diagnoses['diagnosis'][i]
patients.insert(
'national_id','name','surname',
'gender','age','diagnosis',
'result','date_arrived','date_departed'
).values(
national_id,name,surname,
gender,age,diagnosis,
result,date_arrived,date_departed
).execute()
MySQL 127.0.0.1:33060+ ssl cookbook Py > def generate_patients_data(num_rows):
-> # read datasets
-> # names and genders
-> female_names=pd.read_csv("/home/maxwell/datasets/top-male-and-female-names.csv",
-> usecols=["Female Name"]).rename(columns={'Female Name':'name'})
-> female_names['gender']=(['F']*female_names.count()['name'])
-> male_names=pd.read_csv("/home/maxwell/datasets/top-male-and-female-names.csv",
-> usecols=["Male Name"]).rename(columns={'Male Name':'name'})
-> male_names['gender']=(['M']*male_names.count()['name'])
-> names=pd.concat([female_names,male_names],ignore_index=True)
-> surnames=pd.read_csv("/home/maxwell/datasets/Names_2010Census.csv",
-> usecols=['name'],skipfooter=1,engine='python').rename(columns={'name':'surname'})
-> # diagnoses
-> diagnoses=pd.read_csv('/home/maxwell/datasets/diagnosis.csv')
-> # Possible results
-> results=pd.DataFrame({
-> "result":["R","N","D"],
-> "frequency":[6,3,1]})
-> # Start building data
-> diagnoses=diagnoses.sample(
-> num_rows,replace=True,
-> weights=diagnoses['frequency']).set_index(pd.Series(range(num_rows)))
-> results=results.sample(num_rows,replace=True,weights=results['frequency']).set_index(pd.Series(range(num_rows)))
-> names=names.sample(num_rows,replace=True).set_index(pd.Series(range(num_rows)))
-> surnames=surnames.sample(num_rows,replace=True).set_index(pd.Series(range(num_rows)))
-> # Get table object
-> patients=session.get_schema('cookbook').get_table('patients')
-> # Loop, inserting rows
-> for i in range(num_rows):
-> national_id=str(random.randrange(10,99))+ \
-> ''.join(random.sample(string.ascii_uppercase,2))+ \
-> str(random.randrange(100000,999999))
-> age=random.randrange(15,99)
-> date_arrived=datetime.strptime('2020-01-01','%Y-%m-%d')+ \
-> timedelta(days=random.randrange(365))
-> date_departed=date_arrived+timedelta(days=random.randrange(60))
-> date_arrived=date_arrived.strftime('%Y-%m-%d')
-> date_departed=date_departed.strftime('%Y-%m-%d')
-> name=names['name'][i]
-> gender=names['gender'][i]
-> surname=surnames['surname'][i]
-> result=results['result'][i]
-> diagnosis=diagnoses['diagnosis'][i]
-> patients.insert(
-> 'national_id','name','surname',
-> 'gender','age','diagnosis',
-> 'result','date_arrived','date_departed'
-> ).values(
-> national_id,name,surname,
-> gender,age,diagnosis,
-> result,date_arrived,date_departed
-> ).execute()
->
MySQL 127.0.0.1:33060+ ssl cookbook Py > \sql truncate table patients
Query OK, 0 rows affected (0.0244 sec)
MySQL 127.0.0.1:33060+ ssl cookbook Py > session.get_schema('cookbook').get_table('patients').count()
0
MySQL 127.0.0.1:33060+ ssl cookbook Py > generate_patients_data(1000)
MySQL 127.0.0.1:33060+ ssl cookbook Py > session.get_schema('cookbook').get_table('patients').count()
1000
MySQL 127.0.0.1:33060+ ssl cookbook Py > session.get_schema('cookbook').get_table('patients').select().limit(10)
+----+-------------+-----------+----------------+--------+-----+-----------------+-------------------------+--------+--------------+---------------+
| id | national_id | name | surname | gender | age | additional_data | diagnosis | result | date_arrived | date_departed |
+----+-------------+-----------+----------------+--------+-----+-----------------+-------------------------+--------+--------------+---------------+
| 1 | 79AF555574 | Emma | BALKEY | F | 58 | NULL | Pneumonia | R | 2020-12-12 | 2021-01-12 |
| 2 | 11YS819517 | Elizabeth | PAQUETTE | F | 60 | NULL | Heart failure | R | 2020-08-18 | 2020-09-09 |
| 3 | 46YL924212 | David | GLICKER | M | 73 | NULL | Urinary tract infection | R | 2020-08-07 | 2020-08-12 |
| 4 | 86JD612318 | George | LOPEZBELTRAN | M | 52 | NULL | Acute coronary syndrome | R | 2020-08-16 | 2020-09-17 |
| 5 | 55KW110008 | George | OROZCOGONZALEZ | M | 25 | NULL | Urinary tract infection | R | 2020-07-15 | 2020-08-09 |
| 6 | 24HB929221 | Mary | KIRKELIE | F | 81 | NULL | Heart failure | R | 2020-06-30 | 2020-08-21 |
| 7 | 32HE856471 | David | CANZONA | M | 33 | NULL | Acute coronary syndrome | R | 2020-09-21 | 2020-11-07 |
| 8 | 64BK628444 | John | GANDIA | M | 67 | NULL | Urinary tract infection | D | 2020-06-20 | 2020-06-25 |
| 9 | 46GC444803 | John | SHERIFF | M | 73 | NULL | Pneumonia | R | 2020-10-28 | 2020-12-12 |
| 10 | 70WC944263 | Sarah | WILKS | F | 95 | NULL | Acute coronary syndrome | R | 2020-04-13 | 2020-05-01 |
+----+-------------+-----------+----------------+--------+-----+-----------------+-------------------------+--------+--------------+---------------+
10 rows in set (0.0008 sec)
MySQL 127.0.0.1:33060+ ssl cookbook Py >
Reusing Your Scripts for MySQL Shell
MySQL 127.0.0.1:33060+ ssl cookbook Py >
MySQL 127.0.0.1:33060+ ssl cookbook Py > import numpy as np
MySQL 127.0.0.1:33060+ ssl cookbook Py > import pandas as pd
MySQL 127.0.0.1:33060+ ssl cookbook Py > \source /home/maxwell/datasets/mysql_shell/generate_patients_data.py
MySQL 127.0.0.1:33060+ ssl cookbook Py > \sql
Switching to SQL mode... Commands end with ;
Fetching table and column names from `cookbook` for auto-completion... Press ^C to stop.
MySQL 127.0.0.1:33060+ ssl cookbook SQL >
MySQL 127.0.0.1:33060+ ssl cookbook SQL > \source /home/maxwell/datasets/tables/patients.sql
Query OK, 0 rows affected (0.0002 sec)
Query OK, 0 rows affected (0.0136 sec)
Query OK, 0 rows affected (0.0002 sec)
Query OK, 0 rows affected (0.0179 sec)
Query OK, 0 rows affected (0.0003 sec)
Query OK, 20 rows affected (0.0042 sec)
Records: 20 Duplicates: 0 Warnings: 0
+----+-------------+-----------+-----------+--------+-----+-----------------+---------------+--------+--------------+---------------+
| id | national_id | name | surname | gender | age | additional_data | diagnosis | result | date_arrived | date_departed |
+----+-------------+-----------+-----------+--------+-----+-----------------+---------------+--------+--------------+---------------+
| 1 | 89AR642465 | Mary | Smith | F | 24 | NULL | Appendicitis | R | 2020-01-15 | 2020-01-20 |
| 2 | 71GE601633 | John | Johnson | M | 57 | NULL | Asthma | N | 2020-03-10 | 2020-03-25 |
| 3 | 15QT023022 | Margaret | Williams | F | 50 | NULL | Breast Cancer | D | 2020-05-08 | 2020-08-07 |
| 4 | 84DC051879 | William | Brown | M | 64 | NULL | Arthritis | N | 2020-05-10 | 2020-05-27 |
| 5 | 54XT916306 | Elizabeth | Jones | F | 39 | NULL | Data Phobia | R | 2020-04-17 | 2020-06-03 |
| 6 | 06VZ647400 | James | Garcia | M | 86 | NULL | Diabetes | D | 2020-08-09 | 2020-08-15 |
| 7 | 02WS884704 | Sarah | Miller | F | 31 | NULL | Diabetes | R | 2020-08-10 | 2020-08-20 |
| 8 | 78FS043029 | David | Davis | M | 75 | NULL | Arthritis | N | 2020-06-13 | 2020-06-22 |
| 9 | 45MY529190 | Patricia | Rodriguez | F | 35 | NULL | Diabetes | N | 2020-02-18 | 2020-03-02 |
| 10 | 30NC108735 | Robert | Martinez | M | 35 | NULL | Appendicitis | R | 2020-09-12 | 2020-09-18 |
| 11 | 38BP394037 | Catherine | Hernandez | F | 61 | NULL | Asthma | R | 2020-07-13 | 2020-07-19 |
| 12 | 96HE717336 | Michael | Lopez | M | 33 | NULL | Data Phobia | D | 2020-09-15 | 2020-09-20 |
| 13 | 26ES505212 | Susan | Gonzalez | F | 81 | NULL | Data Phobia | R | 2020-10-01 | 2020-10-28 |
| 14 | 04WT954962 | Peter | Wilson | M | 86 | NULL | Diabetes | N | 2020-09-26 | 2020-10-07 |
| 15 | 65XJ029932 | Helen | Anderson | F | 57 | NULL | Data Phobia | R | 2020-10-06 | 2020-10-26 |
| 16 | 59DW008178 | Thomas | Thomas | M | 53 | NULL | Data Phobia | R | 2020-09-15 | 2020-10-26 |
| 17 | 99XC682639 | Emma | Taylor | F | 69 | NULL | Breast Cancer | R | 2020-07-18 | 2020-09-15 |
| 18 | 72TY932806 | George | Moore | M | 20 | NULL | Data Phobia | R | 2020-11-01 | 2020-11-10 |
| 19 | 28VU492728 | Alice | Jackson | F | 72 | NULL | Arthritis | N | 2020-10-01 | 2020-11-10 |
| 20 | 09SK434607 | Richard | Martin | M | 23 | NULL | Asthma | N | 2020-11-18 | 2020-11-28 |
+----+-------------+-----------+-----------+--------+-----+-----------------+---------------+--------+--------------+---------------+
20 rows in set (0.0008 sec)
MySQL 127.0.0.1:33060+ ssl cookbook SQL >