Python

Python is the primary language used by data scientists.
It features some of the most useful scientific computing and machine learning libraries, such as NumPy, TensorFlow, and PyTorch.

Installation

Use Anaconda.

Add C:\Users\[username]\Anaconda3\Scripts to your PATH
Run conda init bash in your Bash shell
Run conda config --set auto_activate_base false

Usage

How to use Python 3.

pip

Pip is the package manager for Python.
Your package requirements should be written to requirements.txt.
Install all requirements using pip install -r requirements.txt

Syntax

Ternary Operator

Reference

is_nice = True
state = "nice" if is_nice else "not nice"

Lambda Function

lambda x: x * 2

Spread

Reference

# Avoid naming a variable "tuple", since that shadows the built-in type
myfun(*my_tuple)
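
As a minimal sketch of the spread syntax: `*` unpacks a sequence into positional arguments and `**` unpacks a dictionary into keyword arguments. The function `describe` and its arguments are hypothetical and exist only for illustration.

```python
def describe(name, age, city="unknown"):
    """Hypothetical function used to show argument unpacking."""
    return f"{name} is {age} and lives in {city}"

args = ("Ada", 36)
kwargs = {"city": "London"}

# * spreads the tuple into positional arguments,
# ** spreads the dict into keyword arguments.
message = describe(*args, **kwargs)

# * also works inside list and tuple literals.
combined = [0, *args, 99]
```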

For loops

# Normal for loop
for i in range(5):
  pass

# 2D for loop
for i, j in np.ndindex((5, 5)):
  pass

Strings

String Interpolation

Reference
Python has 3 syntax variations for string interpolation.

name = 'World'
program = 'Python'
print(f'Hello {name}! This is {program}')

print("%s %s" % ('Hello','World',))

name = 'world'
program ='python'
print('Hello {}! This is {}.'.format(name, program))
print('Hello {name}! This is {program}.'.format(name=name, program=program))

# Format to two decimal places
print(f"Accuracy: {accuracy:.02f}%")

# Format an int to 2 digits, padding with a leading zero
print(f"Size: {size:02d}")
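
The format specifiers can be checked with concrete values. This is a small sketch; the values of `accuracy` and `size` are made up for illustration.

```python
accuracy = 93.456
size = 7

# .2f rounds to two decimal places
two_decimals = f"Accuracy: {accuracy:.2f}%"

# 02d pads an integer with zeros to a width of 2
zero_padded = f"Size: {size:02d}"

# >5 right-aligns the value in a field of width 5
aligned = f"[{'ab':>5}]"
```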

Arrays

Use Numpy to provide array functionality

Array Indexing

NumPy Indexing
NumPy has very powerful indexing. See the reference above.

Filesystem

Paths

Use os.path

import os.path as path

my_file = path.join("folder_1", "my_great_dataset.tar.gz")
# On Windows: "folder_1\\my_great_dataset.tar.gz"; on Linux/macOS: "folder_1/my_great_dataset.tar.gz"

# Get the filename with extension
filename = path.basename(my_file)
# "my_great_dataset.tar.gz"

# Get the filename without extension
filename_no_ext = path.splitext(filename)[0]
# Note that splitext returns ("my_great_dataset.tar", ".gz")

If using Python >=3.4, you also have pathlib

from pathlib import Path

p = Path("my_folder")

# Join paths
pp = Path(p, "files.tar.gz")

pp.suffix      # returns ".gz"
pp.suffixes    # returns [".tar", ".gz"]
pp.name        # returns "files.tar.gz"
pp.parent      # returns Path("my_folder")
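
When a plain string is needed from a Path (for example, for a library that does not accept path objects), the conversion is explicit. The sketch below reuses the illustrative file names from the example above and also shows the `/` join operator.

```python
import os
from pathlib import Path

pp = Path("my_folder", "files.tar.gz")

as_str = str(pp)            # plain string using the platform's separator
as_fspath = os.fspath(pp)   # same result, via the os.PathLike protocol

# The / operator joins path components
joined = pp.parent / "other.txt"

# .resolve() returns an absolute Path object, not a string
resolved = pp.resolve()
```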

Notes

One annoyance with pathlib.Path is that you need to convert paths to strings manually
- This can be done with str() or os.fspath(). Note that .resolve() returns an absolute Path, not a string.
"No really, pathlib is great" by Trey Hunner

List all files in a folder

Reference

import os
import os.path as path

gaze_directory = "Gaze_txt_files"
# List of folders in root folder
gaze_folders = [path.join(gaze_directory, x) for x in os.listdir(gaze_directory)]
# List of files 2 folders down
gaze_files = [path.join(x, y) for x in gaze_folders for y in os.listdir(x)]
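
The same listing can be sketched with pathlib. The folder and file names here are invented, and a temporary directory is created so the example is self-contained.

```python
import tempfile
from pathlib import Path

# Build a small throwaway tree so the example is self-contained
with tempfile.TemporaryDirectory() as root:
    root = Path(root)
    (root / "subject_1").mkdir()
    (root / "subject_1" / "a.txt").write_text("1")
    (root / "subject_1" / "b.txt").write_text("2")
    (root / "subject_2").mkdir()
    (root / "subject_2" / "c.txt").write_text("3")

    # Direct children that are directories
    folders = sorted(p.name for p in root.iterdir() if p.is_dir())
    # Files exactly two levels down
    files = sorted(p.name for p in root.glob("*/*") if p.is_file())
    # Files at any depth
    all_txt = sorted(p.name for p in root.rglob("*.txt"))
```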

Read/Write entire text file into a list

Reading
[1]

with open('C:/path/numbers.txt') as f:
    lines = f.read().splitlines()

Writing
[2]

with open('your_file.txt', 'w') as f:
    f.write("\n".join(my_list))

Directories

Create, Move, and Delete directories or folders

import os, shutil, time
import os.path as path

# Create a directory
os.makedirs("new_dir", exist_ok=True)
# or os.makedirs(os.path.dirname("new_dir/my_file.txt"), exist_ok=True)

# Delete an empty directory
os.rmdir(dir_path)

# Delete an empty or non-empty directory
shutil.rmtree(dir_path)
# Wait until it is deleted
while os.path.isdir(dir_path):
  time.sleep(0.01)

Copying or moving a file or folder

Copying
Shutil docs

import shutil

# Copy a file
shutil.copy2('original.txt', 'duplicate.txt')

# Move a file
shutil.move('original.txt', 'my_folder/original.txt')

Regular Expressions (Regex)

Reference

import re
my_regex = re.compile(r'height:(\d+)cm')
my_match = my_regex.match("height:33cm")
print(my_match[1])
# 33

Notes

re.match will return None if there is no match
re.match matches from the beginning of the string
Use re.search to match from anywhere in the string
Use re.findall to find all occurrences from anywhere in the string
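
The difference between the three functions can be seen on one input. This is a minimal sketch with a made-up string in which the pattern does not occur at the start.

```python
import re

text = "width:10cm height:33cm height:41cm"
pattern = re.compile(r"height:(\d+)cm")

# match only succeeds at the start of the string, so this is None
at_start = pattern.match(text)

# search finds the first occurrence anywhere in the string
first = pattern.search(text)
first_height = first[1] if first else None

# findall returns every captured group as a list of strings
all_heights = pattern.findall(text)
```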

Spawning Processes

Use subprocess to spawn other programs.

import subprocess
subprocess.run(["ls", "-l"], cwd="/")
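
To read a program's output back into Python, subprocess.run can capture it. The sketch below launches a child Python interpreter through sys.executable (an assumption made so the example does not depend on `ls` being installed).

```python
import subprocess
import sys

# Run a child Python process and collect what it prints
result = subprocess.run(
    [sys.executable, "-c", "print('hello from child')"],
    capture_output=True,  # collect stdout and stderr (Python 3.7+)
    text=True,            # decode bytes to str
    check=True,           # raise CalledProcessError on a non-zero exit code
)
output = result.stdout.strip()
```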

Timing Code

StackOverflow
Python Time Documentation

time.time() returns the seconds since the epoch as a float.
You can also use timeit to time over several iterations.

import time

t0 = time.time()
code_block
t1 = time.time()

total = t1-t0
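
For timing over several iterations, a sketch using timeit might look like the following. The `work` function is a hypothetical stand-in for the code being measured, and time.perf_counter is shown as an alternative clock intended for measuring short intervals.

```python
import time
import timeit

def work():
    """Hypothetical workload to be timed."""
    return sum(range(1000))

# perf_counter is a monotonic, high-resolution clock intended for timing
t0 = time.perf_counter()
work()
elapsed = time.perf_counter() - t0

# timeit calls the function `number` times and returns the total seconds
total = timeit.timeit(work, number=1000)
per_call = total / 1000
```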

requests

Use the requests library to download files and scrape webpages
See Get and post requests in Python

Get Request

import requests
url = R"https://www.google.com"
req = requests.get(url)
req.text

# To save to disk
with open("google.html", "wb") as f:
  f.write(req.content)

Post Request

data = {'api_dev_key':API_KEY, 
        'api_option':'paste', 
        'api_paste_code':source_code, 
        'api_paste_format':'python'} 

# sending post request and saving response as response object 
r = requests.post(url = API_ENDPOINT, data = data) 

# extracting response text
pastebin_url = r.text 
print("The pastebin URL is:%s"%pastebin_url)

Download a file

SO Answer

import os
import os.path as path
import shutil
import uuid

import requests

def download_file(url, folder=None, filename=None):
    if filename is None:
        filename = path.basename(url)
    if folder is None:
        folder = os.getcwd()
    full_path = path.join(folder, filename)
    temp_path = path.join(folder, str(uuid.uuid4()))
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(temp_path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                if chunk:
                    f.write(chunk)
    shutil.move(temp_path, full_path)
    return full_path

if main

What does if __name__ == "__main__" do?

If you are writing a script with functions you want to be included in other scripts, use __name__ to detect if your script is being run or being imported.

if __name__ == "__main__":
  # Do something here; a statement is required after the comment
  pass

iterators and iterables

Iterables include lists, NumPy arrays, and tuples.
To create an iterator, pass an iterable to the iter() function.

my_arr = [1,2,3,4]
my_iter = iter(my_arr)

# Python 3 uses the built-in next() instead of the Python 2 .next() method
v1 = next(my_iter)

itertools contains many helper functions for interacting with iterables and iterators.
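
A few of these helpers in action, as a small sketch using only the standard library:

```python
from itertools import chain, count, islice

my_iter = iter([1, 2, 3])
first = next(my_iter)        # consumes the first element
rest = list(my_iter)         # the iterator resumes where it left off
done = next(my_iter, None)   # the default is returned once exhausted

# chain concatenates iterables lazily
chained = list(chain([1, 2], (3, 4)))

# count is infinite, so take a finite slice of it
evens = list(islice(count(0, 2), 4))
```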

zip

documentation

zip takes any number of iterables (typically two or more) and combines them into an iterator of tuples, stopping at the shortest input

i.e. zip([a1, ...], [b1,...]) = [(a1, b1), ...]
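
A short sketch with made-up lists, showing that the result is an iterator (so it is wrapped in list() to inspect it) and that unequal lengths are truncated:

```python
names = ["a", "b", "c"]
scores = [90, 80]

# zip stops at the shortest input, so "c" is dropped
pairs = list(zip(names, scores))

# zip(*pairs) transposes the pairs back into separate tuples
unzipped_names, unzipped_scores = zip(*pairs)

# A common use is building a dictionary from two sequences
lookup = dict(zip(names, scores))
```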

enumerate

documentation

enumerate adds indices to an iterable

i.e. enumerate([a1,...], start=0) = [(0, a1), (1, a2), ...]
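
A brief sketch with an invented list; note that enumerate returns an iterator, and that `start` changes only the reported index:

```python
fruits = ["apple", "orange", "pear"]

# The default start is 0
indexed = list(enumerate(fruits))

# start changes only the reported index, not which items are visited
labels = [f"{i}. {fruit}" for i, fruit in enumerate(fruits, start=1)]
```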

slice

itertools.islice will allow you to create a slice from an iterable

from itertools import islice
import numpy as np

a = np.arange(5)
b = islice(a, 3)
list(b) # [0,1,2]

Exceptions

See https://docs.python.org/3/library/exceptions.html

Raising

raise ValueError("You have bad inputs")

assert 1 == 1, "Something is very wrong if 1 != 1"

Try Catch/Except

try:
  something_which_may_raise()
except AssertionError as error:
  do_fallback()
  raise # Raise the previous error.
else:
  do_something_if_no_exception()
finally:
  finish_program_and_cleanup()
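
The order in which the clauses run can be traced with a small sketch. The helpers `parse_positive` and `safe_parse` are hypothetical and only record which branches execute.

```python
events = []

def parse_positive(text):
    """Hypothetical helper: parse a positive integer or raise ValueError."""
    value = int(text)  # int() raises ValueError on bad input
    if value <= 0:
        raise ValueError(f"expected a positive number, got {value}")
    return value

def safe_parse(text):
    try:
        value = parse_positive(text)
    except ValueError as error:
        events.append(f"error: {error}")
        return None
    else:
        events.append("ok")    # runs only when no exception was raised
        return value
    finally:
        events.append("done")  # always runs, even after a return

good = safe_parse("5")
bad = safe_parse("-1")
```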

Classes

Static and Class methods

See realpython

class MyClass:
    def method(self):
        return 'instance method called', self

    @classmethod
    def classmethod(cls):
        return 'class method called', cls

    @staticmethod
    def staticmethod():
        return 'static method called'

Notes

Note that the Google Python style guide discourages the use of static methods.
- Class methods should only be used to define alternative constructors (e.g. from_matrix).
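
An alternative constructor might be sketched as follows. The `Point` class and its `from_string` method are hypothetical; the point is that `cls` refers to whichever class the method is called on, so subclasses construct their own type.

```python
class Point:
    """Hypothetical class used to show an alternative constructor."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    @classmethod
    def from_string(cls, text):
        # cls is the class being called, so subclasses get their own type
        x, y = (float(part) for part in text.split(","))
        return cls(x, y)

class LabeledPoint(Point):
    label = "unnamed"

p = Point.from_string("1.5,2")
lp = LabeledPoint.from_string("0,0")
```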

Multithreading

threading

import threading

Use threading.Thread to create a thread.
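
A minimal sketch of starting and joining threads. The `worker` function is hypothetical, and each thread writes to its own list slot so no lock is needed.

```python
import threading

results = [None] * 3

def worker(index):
    # Each thread writes to its own slot, so no lock is needed here
    results[index] = index * index

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for the thread to finish
```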

concurrency

In Python 3.2+, concurrent.futures gives you access to thread pools.

import os
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed

executor = ThreadPoolExecutor(max_workers=os.cpu_count())
thread_lock = threading.Lock()
total = 0

def do_something(a, b):
  # total is a module-level variable, so it must be declared global before assignment
  global total
  with thread_lock:
    total += a + b
    return total

my_futures = []
for i in range(5):
  future = executor.submit(do_something, 1, 2+i)
  my_futures.append(future)

for future in as_completed(my_futures):
   future.result()
executor.shutdown()

len(os.sched_getaffinity(0)) returns the number of logical CPUs available to the Python process (only on some Unix platforms such as Linux).
If max_workers is None, it defaults to 5 * os.cpu_count() in Python 3.5 through 3.7, and to min(32, os.cpu_count() + 4) from Python 3.8 onward.
- os.cpu_count() returns the number of logical CPUs (i.e. hardware threads)
executor.shutdown() waits for all submitted jobs to finish, but no further jobs can be submitted after it has been called, even from other threads.
Individual list operations such as append are thread-safe in CPython, but compound operations (e.g. read-modify-write) and most other shared state require a thread lock or semaphore.
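
The executor can also be used as a context manager, which calls shutdown on exit. The sketch below assumes a trivial, hypothetical `square` function and a pool of 4 workers.

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

# Leaving the with block calls shutdown(wait=True)
with ThreadPoolExecutor(max_workers=4) as executor:
    # map returns results in input order, regardless of completion order
    squares = list(executor.map(square, range(5)))
```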

Data Structures

Tuples

Tuples are immutable lists. This means that they have a fixed size and fixed elements, though the elements themselves may be mutable. In general, they perform marginally faster than lists, so prefer tuples over lists when the contents do not need to change, especially as parameters to functions.

Typically people use tuples as structs, i.e. objects with structure such as coordinates. See StackOverflow: Difference between lists and tuples.

# Tuple with one element
m_tuple = (1,)

# Tuple with multiple elements
vals = (1,2,3, "car")

# Return a tuple
def int_divide(a, b):
  return a // b, a % b
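
A short sketch of unpacking, immutability, and struct-like use. `Coordinate` is a hypothetical name, and collections.namedtuple is one standard-library way to give tuple fields names.

```python
from collections import namedtuple

def int_divide(a, b):
    return a // b, a % b

# Unpack the returned tuple into separate names
quotient, remainder = int_divide(7, 2)

# Tuples reject item assignment
point = (1, 2)
try:
    point[0] = 5
    mutated = True
except TypeError:
    mutated = False

# namedtuple gives struct-like field names while staying a tuple
Coordinate = namedtuple("Coordinate", ["x", "y"])
c = Coordinate(x=3, y=4)
```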

Lists

Lists are the default sequence data structure in Python.
A lot of functional programming can be done with lists.

groceries = ["apple", "orange"]

groceries.reverse()
# ["orange", "apple"]

groceries_str = ",".join(groceries)
# "orange,apple"

groceries_str.split(",")
# ["orange", "apple"]

# Note that functions such as map, enumerate, and range return iterables
# which you can iterate over in a for loop
# You can also convert these to lists by calling list() if necessary

list(enumerate(groceries))
# [(0, "orange"), (1, "apple")]

Dictionaries

Dictionaries are hashmaps in Python

# Create a dictionary
my_map = {}
# Or
my_map = {1: "a", 2: "b"}

# Check if a key is in a dictionary
# O(1)
1 in my_map 

# Check if a value is in a dictionary
# Usually you should have a second dictionary if you need this functionality
# O(n)
'a' in my_map.values()

# Loop through dictionary
for k in my_map:
   print(k)
# With key and value
for k, v in my_map.items():
   print(k, v)
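
The "second dictionary" for value lookups can be built with a comprehension. This sketch assumes the values are unique and hashable; it also shows get, which returns a default instead of raising KeyError.

```python
my_map = {1: "a", 2: "b"}

# Inverse mapping for O(1) value lookups (assumes unique, hashable values)
inverse = {v: k for k, v in my_map.items()}

# get returns a default instead of raising KeyError for a missing key
missing = my_map.get(3, "not found")
```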

Numpy

See also CuPy, a NumPy-compatible interface implemented with CUDA for GPU acceleration. Large speedups are possible for big arrays.

random

Legacy code uses functions from np.random.*.

New code should initialize a rng using np.random.default_rng().
See Random Generator for more details.

import numpy as np

rng = np.random.default_rng()

# Random integer between [0, 6)
rng.integers(0, 6)
# array of 5 random integers
rng.integers(0, 6, size=5)

Anaconda

How to use Anaconda:

# Create an environment
conda create -n tf2 python=3.6

# Activate an environment
conda activate tf2

# Change version of Python
conda install python=3.7

# Update all packages
conda update --all

Documentation

conda install

Notes

Use flag --force-reinstall to reinstall packages

JSON

Documentation

import json

# Encode/Stringify (pretty, because of indent)
json.dumps({}, indent=2)

# Decode/Parse
json.loads("{}")

# Write to file
with open("my_data.json", "w") as f:
  json.dump(my_data, f, indent=2)

# Read from file
with open("my_data.json", "r") as f:
  my_data = json.load(f)

Notes

Using json.dump(data, f) will dump without pretty printing
- Add indent parameter for pretty printing.
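
The effect of indent can be seen by comparing the two outputs. A small sketch with made-up data:

```python
import json

data = {"name": "model", "layers": [1, 2]}

# Without indent, the output is a single line
compact = json.dumps(data)

# With indent, each key goes on its own line; sort_keys orders the keys
pretty = json.dumps(data, indent=2, sort_keys=True)

# Both forms parse back to the same object
round_trip = json.loads(pretty)
```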

Type Annotations

Python 3 supports type annotations. However, they are not enforced at runtime.
You can check types ahead of time using pytype.

def add_two_values(a: float, b: float) -> float:
    return a + b
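
That annotations are not enforced at runtime can be demonstrated directly. In this sketch the call passes strings despite the float annotations, and Python runs it anyway; a static checker would be expected to flag it.

```python
def add_two_values(a: float, b: float) -> float:
    return a + b

# Annotations are stored as metadata but not checked when the function runs
hints = add_two_values.__annotations__

# Passing strings violates the annotations, yet the call still succeeds
result = add_two_values("ab", "cd")
```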

Images

Pillow (PIL)

pip install pillow

import numpy as np
from PIL import Image, ImageOps

img = Image.open("my_image.png")
# Converts to a uint8 array; the shape is (H,W,4) for an RGBA image
img_array = np.array(img)

ImageOps.flip(img) - Returns the image flipped vertically (top to bottom)
ImageOps.mirror(img) - Returns the image flipped horizontally (left to right)

Bilinear Interpolation

Copied from https://stackoverflow.com/questions/12729228/simple-efficient-bilinear-interpolation-of-images-in-numpy-and-python

Bilinear Interpolation function

def bilinear_interpolate(im, x, y):
    """
    Basic bilinear interpolation
    :param im:
    :param x:
    :param y:
    :return:
    """
    x = np.asarray(x)
    y = np.asarray(y)

    x0 = np.floor(x).astype(int)
    x1 = x0 + 1
    y0 = np.floor(y).astype(int)
    y1 = y0 + 1

    x0 = np.clip(x0, 0, im.shape[1] - 1)
    x1 = np.clip(x1, 0, im.shape[1] - 1)
    y0 = np.clip(y0, 0, im.shape[0] - 1)
    y1 = np.clip(y1, 0, im.shape[0] - 1)

    Ia = im[y0, x0]
    Ib = im[y1, x0]
    Ic = im[y0, x1]
    Id = im[y1, x1]

    wa = (x1 - x) * (y1 - y)
    wb = (x1 - x) * (y - y0)
    wc = (x - x0) * (y1 - y)
    wd = (x - x0) * (y - y0)
    if len(Ia.shape) > len(wa.shape):
        wa = wa[..., np.newaxis]
        wb = wb[..., np.newaxis]
        wc = wc[..., np.newaxis]
        wd = wd[..., np.newaxis]

    return wa * Ia + wb * Ib + wc * Ic + wd * Id

Libraries

Other notable libraries.

Matplotlib

Matplotlib is the main library used for making graphs.
Examples
Gallery

Alternatively, there are also Python bindings for ggplot2

configargparse

ConfigArgParse is the same as argparse except it allows you to use config files as args.

import configargparse

parser = configargparse.ArgParser()
parser.add('-c', '--config', is_config_file=True, help='config file path')

# Parse all args; exits with an error message on unknown args.
parser.parse_args()

# Parse only known args.
parser.parse_known_args()

If you want to use bools without store_true or store_false, you need to define a str2bool function: Stack Overflow Answer

str2bool

import argparse

def str2bool(val):
  """Converts the string value to a bool.
  Args:
    val: string representing true or false
  Returns: 
    bool
  """
  if isinstance(val, bool):
    return val
  if val.lower() in ('yes', 'true', 't', 'y', '1'):
    return True
  elif val.lower() in ('no', 'false', 'f', 'n', '0'):
    return False
  else:
    raise argparse.ArgumentTypeError('Boolean value expected.')

#...
parser.add_argument("--augment",
                    type=str2bool,
                    help="Augment",
                    default=False)