Monday, August 31, 2020

SQL JOIN

 SQL JOIN


If two tables already exist, and the 2 tales have a dependency column between them, this permits you to do a JOIN operation.


The format looks like this:


SELECT column1 column2 FROM dependent table INNER JOIN independent table ON interconnected dependent table's column3 = independent table's column4


 Columns 1 2 3 and 4 can all be different.


Column1 --> Column3 --> Column4 --> Column2


will return Column1, Column2 pairs


There is an optional AS keyword variable for aliasing a database x.y variable.


A LEFT JOIN shows all named rows of Column1 even if no match. A None is returned for Column2 in the cases of no matches; this may omit some Column2 values from being displayed.


A RIGHT JOIN guarantees to return all unique Column2 values, even if no matches to Column1 exist. This may omit some Column1 values and will return None for Column1 if there is no match.


Example


players table

columns: ID name sport


Sport table

columns: ID name


players table - the dependent table

1 John 3

2 Sam 2

3 Mary 


Mary has no sport


Sport table - the INdependent table

1 baseball

2 golf

3 basketball


SELECT players.name sport.name FROM players INNER JOIN sport ON players.sport = sport.id


will return


('John', 'basketball')

('Sam', 'golf')


If you don't want to use players.name, you can shorten it as ply


players.name AS ply


LEFT JOIN


SELECT players.name sport.name FROM players LEFT JOIN sport ON players.sport = sport.id


will return all players names no matter what


('John', 'basketball')

('Sam', 'golf')

('Mary', 'None)


Note that baseball is omitted.


RIGHT JOIN


SELECT players.name sport.name FROM players RIGHT JOIN sport ON players.sport = sport.id


will return all sport names no matter what


('John', 'basketball')

('Sam', 'golf')

(None, 'baseball')


Note that Mary is omitted.


Another thing to remember about LEFT and RIGHT JOIN is that multiplicity of rows may occur, one-for-one with multiple matches on the crossover column.


SELECT A.a, B.b

FROM A

RIGHT JOIN B ON A.c = B.c


results in all rows of B are guaranteed to be returned. If any row of B does not contain a match with A on the mutually shared-in-common column c, then column c will be NULL in the A.a field in the result for that row in B. If any row of B does contain 1 or more matches with A on the mutually shared-in-common column c, then column c will be populated uniquely one-for-one in the A.a field with the one or more rows of B that match. This means that the resultant number of rows will contain  the number of rows equal to the total number of matches against A.c on B.c

-----------------------------------


SELECT A.a, B.b

FROM A

LEFT JOIN B ON A.c = B.c


results in all rows of A are guaranteed to be returned. If any row of A does not contain a match with B on the mutually shared-in-common column c, then column c will be NULL in the B.b field in the result for that row in A. If any row of A does contain 1 or more matches with B on the mutually shared-in-common column c, then column c will be populated uniquely one-for-one in the B.b field with the one or more rows of A that match. This means that the resultant number of rows will contain  the number of rows equal to the total number of matches against B.c on A.c



Friday, August 28, 2020

SQL: ORDER BY - DELETE FROM - DROP TABLE - UPDATE - LIMIT/OFFSET

 SQL: ORDER BY  - DELETE FROM - 

DROP TABLE - UPDATE - LIMIT/OFFSET


ORDER BY is ascending by default but can be descending with the DESC additional keyword.


"SELECT * FROM players ORDER BY name"


"SELECT * FROM players ORDER BY name DESC"


name is a column in the values part of the table players.


-----------------------


DELETE FROM table_name WHERE condition


deletes records only.

forgetting to use WHERE will delete all records


A .commit() is required.


Use %s in the WHERE qualifier to prevent SQL injections.


sql = "DELETE FROM players WHERE name = %s"

nam = ('Smith',)



mycursor.execute(sql, nam)

mydb.commit()


--------------------------


sql = "DROP TABLE IF EXISTS table_name"


removes table and all of its records.

This will remove all records and no .commit() needed. The optional IF EXISTS will prevent an error message.


------------------------------


UPDATE is a more complex SQL statement.


The format is:


sql_comm = "UPDATE table_name SET column_name='user_input' WHERE column_name='user_value'"


If you forget the WHERE conditional, then all records will be updated to user_input.


A .commit() call is required and it is best to use %s holder to prevent SQL injection.


Note that when 2 or more placeholders are used, then a trailing comma is not needed in the tuple.


sql_command = "UPDATE players SET name = %s WHERE name = %s"


person = ('Jonathan', 'John')


mycursor.execute(sql_command, person)


mydatab.commit()


print(mycursor.rowcount, "record(s) affected")


.rowcount() gives the number of matches.


-----------------------


Use LIMIT at the end of the SELECT statement to restrict the first # matches.


sql_command = "SELECT * FROM players WHERE name = %s LIMIT 3"


nam = ('Steve',)


mycursor.execute(sql_command, nam)


just returns the first 3 names of Steve, even if there are 4 or more matches.


Since results are returned in ascending key order, the optional OFFSET keyword will stagger down the list and thus skip the OFFSET # of first (early) matches.


sql_command = "SELECT * FROM players WHERE name = %s LIMIT 3 OFFSET 3"


makes the first returned match be the 4th actual match overall in the table.



Thursday, August 27, 2020

SQL: INSERT - SELECT - WHERE LIKE

 SQL: INSERT - SELECT - WHERE LIKE


After inserting a row make sure to issue a .commit() command to get the row officially in the database. The  .commit() is done on the variable returned from the connector call, not the cursor object.


Use .executemany() to concurrently insert 2 or more rows. The rows must be set up as tuples and the tuples are then in a list. It is OK to mix single and double quotes to mark the values in each row.


Either .execute() or .executemany() contain the actual SQL commands that must be in single or double quotes.

import mysql.connector

mydatab = mysql.connector.connect(
  host="localhost",
  user="yourusername",
  password="yourpassword",
  database="mydatabase"
)

mycursor = mydatab.cursor()


sql = 'INSERT INTO players (name, age) VALUES (%s, %s)'

val = [

  ("Peter", "25"),

  ('Sam', '31'),

  ('Josie', "28")

  ]

# It is Ok to mix and match single and double quote in the same call


mycursor.executemany(sql, val)


mydatab.commit()


-------------------------------


SELECT * for all rows

SELECT column names separated by commas to only return those columns, but still gets all rows.


Either way, you still need a .fetchall() and a for loop to view all matches.


mysqlcursor.execute("SELECT name, address FROM players")

mysqlresult = mysqlcursor.fetchall()

for x in mysqlresult:
  print(x)


--------------------------------


WHERE is used for searching.


WHERE column_name=value

or

WHERE column_name LIKE '%value%' 

with % as wildcard.


To prevent SQL injections, use a %s as a placeholder for the end-user input value to be passed in via a variable. This adds an extra step to capture the end-user input as a string.

sql = "SELECT * FROM players WHERE age = %s"
age = ("31")


with age containing the user input at a prompt.


----------------------------------


Here is a really clever example that permits an end-user to provide an ad hoc search query with an arbitrary keyword wildcarded. However, it is not clear if this construct is vulnerable to SQL injection.


import mysql.connector


mydatab = mysql.connector.connect(

  host="localhost",

   user="your username",

  password="fill in",

  database="mydatabase"

)


# search query phrase

user_input = "Jo"


mycursor = mydatab.cursor()

name_stem = '\'%' + user_input + '%\''


sql = "SELECT * FROM players WHERE name LIKE " + name_stem

print(sql)


mycursor.execute(sql)


mysqlresult = mycursor.fetchall()


for x in mysqlresult:

    print(x)

  

This will find all players with Jo somewhere in their name.


To help combat SQL injection, the following version of the same code is preferred.


import mysql.connector


mydatab = mysql.connector.connect(

  host="localhost",

  user="your username",

  password="fill in",

  database="mydatabase"

)


user_input = "Jo" 

# captured from a GUI or command line prompt.


mycursor = mydatab.cursor()


name_stem = ("%" + str(user_input) + "%")

 # need pair of % to wildcard


print(name_stem)

print(type(name_stem))


sql = "SELECT * FROM customers WHERE name LIKE %s"

# parameter substitution

print(sql)


nm = (name_stem,) 


print(nm)

print(type(nm))


# for some strange reason, it needs to a tuple with an empty trailing comma!!??


mycursor.execute(sql, nm) # nm goes in for %s


mysqlresult = mycursor.fetchall()


for x in mysqlresult:

    print(x) # matches John Josie Joseph Jo 

  




Wednesday, August 26, 2020

Installing a MySQL Database

 Installing a MySQL Database


Get a free MySQL database package for Windows OS from:


https://dev.mysql.com/downloads/installer/


and choose either the off-line or web enabled installer download. You will have the option to create a free Oracle account which I did.

Note the TCP/IP connection port.

Create a root user and DB Admin user, the latter run as a standard system account.

I created all 4 possible error logs.

I choose the Full installation when prompted. Along the way you may be prompted to install Microsoft C++ redistributables which are free and the MySQL installer takes care of it for you. The only thing you have to install manually and separately is the MS Visual Studio C++ (unless you already have it), which I don't need, so I skipped that.

The full installation has a lot of programs, so be patient and you may have to retry some of them a 2nd and 3rd time to get them to download properly, all within the installer  - there is a retry button. 


Next install the mysql-connector-python driver using PIP. Do NOT install the mysql-connector package as that is something else.


Here is how to check that everything is installed properly and that  the various components can talk to each other:


################ test 1 ######################

import mysql.connector

print(dir(mysql.connector.connect))

# If the two statements above run without errors, then the MySQL connector driver is working properly.


################ test 2 ######################

mydb = mysql.connector.connect(

  host="localhost",

  user="your standard system username",

  password="you fill in"

)

print(dir(mydb))

print(mydb)

# if the connect function and the two print statements run without errors, then you can properly make a connection from a python script to the MySQL database.

However, if you get: 

"mysql.connector.errors.NotSupportedError: Authentication plugin 'caching_sha2_password' is not supported" 

then you probably installed both mysql-connector and mysql-connector-python. Uninstall mysql-connector and try again.


################ test 3 ######################

mycursor = mydb.cursor()

mycursor.execute("CREATE DATABASE mydatabase")


# if the 2 statements above run without error, then a new database is created.


################ test 4 ######################

# mycursor = mydb.cursor()

# mycursor.execute("CREATE DATABASE mydatabase")

mycursor.execute("SHOW DATABASES")

for x in mycursor:
  print(x)


Once you have made a database, check to make sure it really is there with the SHOW command and a loop; the MySQL installed with some default already created databases and you want to be able to see them all.

It is not enough to just issue a SHOW command, you still have to loop to see them. If you try to loop without first running SHOW, the loop will exit because mycursor will have length zero for the its internal database count.


################ test 5 ######################

import mysql.connector

mydb = mysql.connector.connect(

  host="localhost",

  user="standard system username",

  password="fill in",

database="mydatabase"

) # Once the database is first successfully created and checked for existence,  just put the database name in the connector call as above.


mycursor.execute("CREATE TABLE customers (name VARCHAR(255), address VARCHAR(255))")


If the above runs without error, then you can create a table within a database.


################ test 6 ######################

##### mycursor.execute("CREATE TABLE customers (name

##### VARCHAR(255), address VARCHAR(255))")

mycursor.execute("SHOW TABLES")

for x in mycursor:
  print(x)

to prove the table still exists without errors.


Once you've completed those 6 tests, it is time to assign a Primary Key to the already existing database.


mycursor.execute("ALTER TABLE customers ADD COLUMN id INT AUTO_INCREMENT PRIMARY KEY")


Now show tables is still OK after adding the Primary key


mycursor.execute("SHOW TABLES")


for y in mycursor:

    print(y)

  

Tuesday, August 25, 2020

ufunc

 Ufunc


ufunc operates only on ndarrays. The purpose of ufunc is to vectorize loops. 


Let's say we want to add two ndarrays of the same dimensions. We can use python's built-in .zip() function and a for loop:


import numpy as np


# ndarrays must be the same size

no1 = [7 8 9 10 1 12 13 14]

no2 = [17 28 9 120 11 122 1300 314]

no3 = []


# i iterates over no1 and j iterates over no2

for i,j in zip(no1, no2)

    z.append(i+j)


print(no3)


# result is

[24, 36, 18, 130, 12, 134, 1313, 328]


Or we can use ndarrays and its .add() method.


import numpy as np


# ndarrays must be the same size

no1 = np.ndarray([7, 8, 9, 10, 1, 12, 13, 14])

no2 = np.ndarray([17, 28, 9, 120, 11, 122, 1300, 314])

no3 = np.add(no1, no2)

print(no3)


# result is same as before


[  24   36   18  130   12  134 1313  328]


What happens if the two arrays are not the same size:


import numpy as np


# ndarrays must be the same size

no1 = np.ndarray([7, 8, 9, 10, 1, 12, 13]) # 7 elements

no2 = np.ndarray([17, 28, 9, 120, 11, 122, 1300, 314]) # 8 elements

no3 = np.add(no1, no2)

print(no3)


ValueError: operands could not be broadcast together with shapes (8,) (7,) 


What happens if the two arrays are the same size, but with different dimensions:

import numpy as np


# ndarrays must be the same size

no1 = np.ndarray([7], [8], [9], [10], [1], [12], [13], [14]]) # 8  elements in one column

no2 = np.ndarray([17, 28, 9, 120, 11, 122, 1300, 314]) # 8 elements in one row

no3 = np.add(no1, no2)

print(no1)


# results in:


[[  24   35   16  127   18  129 1307  321]

 [  25   36   17  128   19  130 1308  322]

 [  26   37   18  129   20  131 1309  323]

 [  27   38   19  130   21  132 1310  324]

 [  18   29   10  121   12  123 1301  315]

 [  29   40   21  132   23  134 1312  326]

 [  30   41   22  133   24  135 1313  327]

 [  31   42   23  134   25  136 1314  328]]


Which is 8 by 8 square and all of the normally summed elements make a trace along the diagonal. The off-diagonal elements go something like this: take no2, transpose it into a column vector, then shift it from left to right, hold while the element of no1 is added to each of no2, then shift the original no2 again one column to the right, and repeat.


There is also a .multiply() method


import numpy as np


x = [1, 2, 3, 4]

y = [4, 5, 6, 7]

z = np.multiply(x,y)


print(z)


[ 4 10 18 28]


Finally, .zip() can produced some unusual results when used in non-standard ways


import numpy as np

x = [1, 2, 3, 4]

y = [4, 5, 6, 7]

z = []


for j in zip(y, x):

  z.append(j + j)

print(z)


[(4, 1, 4, 1), (5, 2, 5, 2), (6, 3, 6, 3), (7, 4, 7, 4)]


z = []


for i in zip(y, x):

  z.append(i + i)

print(z)


[(4, 1, 4, 1), (5, 2, 5, 2), (6, 3, 6, 3), (7, 4, 7, 4)]


z = []


for i in zip(y, x):

  z.append(i)

print(z)


[(4, 1), (5, 2), (6, 3), (7, 4)]


In these non-standard cases, there is no addition done by zip, but rather appending and repeating of elements.





Monday, August 24, 2020

Random Functions and Distributions Part 2

 


Uniform Distributions


random.uniform() returns a float between 0 and 1 but if you use matplotlib and seaborn to graph the distribution, the curve has tails above 1 and below zero. The higher the sample size parameter, the more the distribution becomes flat top but it still has the tails.


The seaborn.distplot() has a label="some term here" parameter.


A logistic distribution has fatter tails than a normal distribution. 


Multinomial Distribution


The multinomial function has 3 parameters:


1. The 1st argument is the number of samples.

2. 2nd argument is a list of probabilities expressed numerically or as fractions, or both. Oddly, the probabilities do not have to sum to one and can be less than one. If the probabilities sum to more than one, python will generate an error. You are permitted to insert zeros.

3. The size=(#,#,#) argument where # is an integer and an array is generated. If there is only one #, i.e. a 1-D array, then the return variable is a row vector.


Examples 


from numpy import random


x = random.multinomial(n=44, pvals=[0, 0, 0, 0, 0, 1/6])


print(x)


# result is:


[ 0  0  0  0  0 44]


from numpy import random


# probabilities sum to one but there are two zeros

x = random.multinomial(n=44, pvals=[0, 2/6, 1/6, 0, 2/6, 1/6])


print(x)


# result is:


[ 0 15  9  0 14  6]


from numpy import random


# probabilities sum to one but there are two zeros

x = random.multinomial(n=1, pvals=[0, 2/6, 1/6, 0, 2/6, 1/6])


print(x)


# result is:


[0 1 0 0 0 0]


from numpy import random


# probabilities sum to one but there are two zeros

x = random.multinomial(n=12, pvals=[0, 2/6, 1/6, 0, 2/6, 1/6], size=(3,4))


print(x)


# result is:

# Note that the 3rd dimension is the number of distinct probabilities in pvals. 

# So the total number of elements is 3 * 4 *6 = 72.

# Note that columns where pval =0 have a zero as the element returned.


[[[0 3 0 0 5 4]

  [0 6 1 0 5 0]

  [0 4 4 0 3 1]

  [0 3 0 0 7 2]]


 [[0 5 2 0 0 5]

  [0 3 2 0 5 2]

  [0 4 3 0 3 2]

  [0 4 1 0 4 3]]


 [[0 4 1 0 5 2]

  [0 2 1 0 5 4]

  [0 4 3 0 1 4]

  [0 9 0 0 2 1]]]


Friday, August 14, 2020

Random functions and distributions Part 1

 Random functions and distributions


You can create multidimensional arrays of random integers and floats with an optional size = (x,y,z) parameter, where x is the #rows and y is the #columns.


from numpy import random


The .choice() method of a numpy object can take as a parameter an array which can be a mix of datatypes. It also accepts an array of probabilities that must sum to 1.


from numpy import random


# works fine with floats, integers, and text

x = random.choice([13.7, 43, 7, 'test', 2, True], p=[0.1, 0.2, 0.5, 0.0, 0.1, 0.1], size=(300))

# notice that 'test' will never appear because it has prob = 0.0.

print(x)

# result is:

['2' '13.7' '43' '7' '13.7' 'True' '7' '2' '2' '7' '7' '7' '7' '7' '7'

 '43' '7' '43' '7' '7' '13.7' '7' '2' '7' '7' '7' '2' '7' '7' '7' '7' '7'

 '2' '7' '7' '7' '43' '7' '2' '13.7' '7' '7' '7' '43' '13.7' '7' '2'

 'True' '43' '7' '43' 'True' '43' '7' '43' '7' '43' 'True' '43' '7' '43'

 '2' '7' '13.7' '2' '7' '7' 'True' '7' '7' '43' '7' '7' '7' '13.7' '43'

 '2' '7' '7' '7' '7' '43' '43' '2' '7' '7' '7' '43' '7' '13.7' '2' '7' '2'

 '7' '43' 'True' '7' '13.7' '7' 'True' '43' '7' 'True' '7' 'True' '2' '7'

 '7' '7' 'True' '43' '7' '7' '7' '7' '43' '7' '7' '2' '2' '2' '7' '7' '7'

 '7' '7' '43' '7' '7' '7' 'True' 'True' '7' '13.7' '2' '7' '7' '43' '43'

 '7' '43' '7' '43' '7' '7' '7' 'True' '7' '7' '7' '7' 'True' '43' '2' '2'

 '7' '13.7' '7' '2' '7' '43' '7' '7' '7' '13.7' '13.7' '7' '7' '7' '2' '7'

 '7' '13.7' '7' '13.7' '43' '43' '2' '2' '7' '7' '7' '7' '2' '7' '7' '43'

 '7' '7' '7' 'True' '7' '43' '7' 'True' '13.7' '43' '2' '7' '43' 'True'

 '7' '7' '43' '43' '13.7' '13.7' '43' '2' '13.7' 'True' '7' '43' '43' '7'

 '7' '43' 'True' 'True' '43' '13.7' '43' '7' '13.7' '7' '13.7' '7' 'True'

 'True' '13.7' '2' '7' '43' '13.7' '43' '43' 'True' '7' '43' 'True' '43'

 '2' '7' '7' '7' '7' '43' 'True' '7' '7' '7' '43' '7' 'True' '7' '7'

 '13.7' '2' 'True' '7' '7' '7' '2' 'True' '13.7' '43' '7' '7' '7' '7' '7'

 '43' '7' '2' '7' '7' '43' '2' '2' '7' '7' '7' '43' '13.7' '7' '7' '7' '7'

 '7' 'True' '43' '7' '2' '7' '7' '7' 'True' '43' '7' '7']




.shuffle() changes the original array.


.permutation() returns a new array.


The seaborn module, along with the matlibplot module, can be utilized for plotting and making histograms of both continuous and discrete probability distributions.


import matplotlib.pyplot as plt

import seaborn as sns


Somehow the .show() member function of a matplotlib.pyplot object already knows what needs to be plotted from having run one or more seaborn .distplot() first!!??


Poisson and Binomial distributions converge when the number of trials is very large and the p is near zero, meaning much less than even 0.1.


random.randint(maxint) returns an integer no larger than maxint at random.


.random() returns a float out to 16 decimal places.



Normal distribution


from numpy import random

loc is the mean. scale is the standard deviation

x = random.normal(loc=3, scale=0.5, size=(4, 13))

# remember that .random() returns floats

print(x)


# result is:


[[2.80136473 3.54880004 2.89150748 2.14326926 3.03565892 3.41711576

  3.32692697 3.35579836 2.8661686  3.76114488 3.01858288 3.39992108

  2.62656407]

 [3.03767037 2.54722319 3.00753337 3.31534086 3.31006815 3.46443564

  3.36939275 3.24040843 2.65063788 3.59313176 2.84027165 2.21936811

  2.07943729]

 [2.35956149 3.70604286 2.91394761 2.00746178 3.45957505 3.49191263

  2.56781576 2.47446244 3.04135718 2.99737503 3.32111672 2.14935568

  3.21543959]

 [2.82624186 2.79458755 3.20454706 2.97537827 3.04814055 2.16203024

  2.87561735 3.03233987 2.32153526 3.14072811 3.62589382 3.16902181

  3.50404782]]



Binomial distributions are discrete.



from numpy import random


x = random.binomial(n=100, p=0.5, size=(20,5))


Note that the product of the size dimensions must equal n.


print(x)

# result and notice how values cluster around 50 which is n*p = 100*0.5 = 50

[[53 48 43 47 59]

 [46 44 52 52 59]

 [53 49 52 51 53]

 [46 51 56 62 52]

 [49 52 51 52 51]

 [42 48 52 53 49]

 [51 50 45 50 48]

 [47 49 51 54 47]

 [41 38 47 42 53]

 [53 50 45 57 42]

 [63 51 39 49 47]

 [53 43 42 49 48]

 [53 52 59 55 58]

 [54 48 51 57 51]

 [45 46 47 48 48]

 [51 42 44 53 48]

 [46 39 48 44 44]

 [52 53 46 45 46]

 [52 41 48 55 49]

 [51 42 47 46 57]]


when you see a kde parameter that is set to True in a .distplot() call, you will get a fitted curve drawn based on the histogram bars.



Normal and Binomial distributions converge when the number of Binomial trials times the probability is near the mean of the Normal distribution (and the scale factor goes as the square root of the number of trials times the previous scale factor).



Poisson distributions are discrete and deal with the number of times a specific event can recur.


from numpy import random

import matplotlib.pyplot as plt

import seaborn as sns


# lam is the targeted number of occurences.


# careful that hist and label and kde are parameters to distplot

# .poisson() is responsible for the size and lam values

x = random.poisson(lam=2, size=(9,5))

sns.distplot(x, hist=False, label='poisson')


plt.show()

print(x)


[[2 3 6 7 2]

 [1 0 2 1 3]

 [2 3 0 0 1]

 [2 4 3 2 2]

 [2 2 4 3 2]

 [4 4 3 0 0]

 [2 2 4 1 1]

 [2 5 1 1 0]

 [3 1 1 1 2]]


# notice that the largest discrete value generated is 6, but the curve

# plots beyond that to higher values.




----------------------------------


from numpy import random

import matplotlib.pyplot as plt

import seaborn as sns


# Notice that the result of the random module function serves as input for the seaborn function.


sns.distplot(random.normal(loc=20, scale=3, size=1000), hist=False, label='normal')

sns.distplot(random.poisson(lam=8, size=1000), hist=False, label='poisson')


plt.show()