Bash for data science

8 minute read

As a data professional, chances are you already know Python, one of the most powerful general purpose languages which has libraries to do almost anything conveniently. Why, then, to learn Bash and the Unix shell in general? Some of the use cases I’ve found in my work:

  • Using file system locally/remotely
  • Monitoring system resources locally/remotely
  • Inspecting data files remotely (locally I find it faster to just spin up a Python console with Pandas)
  • Automating any jobs that make use of different command line tools
  • Piping together complex preprocessing steps that involve compiling software or syncing with a db (possibly through Docker)
  • Building the glue between query/preprocessing/train/deploy stages
  • Piping together custom utilities that make use of tools you would use from CLI anyway (e.g Git)
  • Makefile is often used as a poor man’s DAG framework for data science, for example in the cookiecutter project

Setup

Note that on Mac you’re by default stuck with BSD utilities that are less powerful than GNU ones. You can however do brew install coreutils and then use them with a g prefix, e.g. split becomes gsplit.

You’ll probably also want to update the Mac Bash version with brew install bash.

If any of the commands do not exist, then on Mac you can run brew install command usually.

On Mac, you probably want to install iTerm2 instead of the default Terminal.

Basics

Initiate scripts with either #!/bin/bash or #!/usr/bin/env bash (on Mac you might have multiple versions of Bash - which -a bash).

Put double quotes around "$VARIABLE"s to preserve the value.

set -euox pipefail is a nice default configuration (in scripts), which provides:

  • raising an error in case any command in the script fails
  • printing the commands as they are run
  • raising an error in case of unset variables

Use Ctrl+R in terminal for searching through command history, history to see it completely.

Use info command or man command or help command or command --help to get help with a command. whatis command also prints the first line description from the man manual.

Aliases

An alias works as a shortcut that can be used for a long and tedious command such as

connecting to a remote machine: alias remote="ssh user@bigawesomemachine.cloud

connecting to a database: alias db="psql -h database.redshift.amazonaws.com -d live -U database_user -p 5439"

activating a Conda environment: alias pyenv="source activate py_39_latest

and even just pure laziness: alias jn="jupyter notebook"

alias repo="cd /Users/john/work/team_repo"

alias l="ls -lah"

Add these to ~/.bash_profile for them to persist over shell sessions.

Inspecting files

Use tree to list contents of directories in a tree-like format.

Read one end of a file with head -n 10 file or tail -n 10 file

Observe a dynamically changing file, such as a log

tail -f file.log

Find a file in the current (nested) directory

find . -name data.csv

Browse a .csv file where “,” is the column separator

cat data.csv | column -t -s "," | less -S

Get the value for a key from a plaintext configuration that is not necessarily sourcable in Bash sed -n 's/^MODEL_DATA_PATH = //p' model_conf.py

You can do a Pandas-style value counts on a .csv column with cut, for example for the fifth column in a semicolon separated file

cut -d ";" -f 5 data.csv | sort | uniq -c | sort

Compare files with diff file1 file2 or cmp file1 file2 and directories with diff <(ls directory1) <(ls directory2).

Use grep to search files for a (regex) pattern. This command takes a lot of useful options, so better to read info grep. grep+cut makes for a poweful pattern to filter rows on a pattern and get the column you care about, for example to get the latest commit hash:

git log -1 | grep '^commit' | cut -d " " -f 2

sort and uniq can be used together to deal with duplicates, for example to count the number of duplicate lines in a .csv file

sort data.csv | uniq -d | wc -l

Control structures

Fast if-then-else:

[[ "$ENV" == prod ]] && bash run_production_job.sh || bash run_test_job.sh

Checking for file existence: [[ -f configuration.file ]] && bash run_training_job.sh || echo "Configuration missing"

Numeric comparisons:

[[ "$MODEL_ACCURACY" -ge 0.8 ]] && bash deploy_model.sh || echo "Insufficient accuracy" | tee error.log

Bash is a programming language, so it supports the regular for/while/break/continue structures. For example, to create backup files:

for i in `ls .csv`; do cp "$i" "$i".bak ; done

Definining functions in Bash is also straightforward and might be a useful alternative to aliases for more complex logic

view_map() {
    open "https://www.google.com/maps/search/$1,$2/"
}

For example the above opens a Google maps browser window in Tallinn with view_map 59.43 24.74

Use xargs to execute a command for multiple arguments: for example remove all local Git branches that have been merged to master.

git branch --merged master | grep -v "master" | xargs git branch -d

IO basics

Append to a file

python server.py >> server.log

Write stdout and stderr to a file and console

python run_something.py 2>&1 | tee something.log

Write all output to the void (discard it)

python run_something.py > /dev/null 2>&1

Manipulating string

You can generate “a list” of strings with brace expansion: mkdir /var/project/{data,models,conf,outputs}

There are a few nifty string manipulation functionalities built into Bash such as

${string//substring/replacement}

for string replace or lower to upper case using translation

cat lower_case_file | tr 'a-z' 'A-Z'.

However, there are much more powerful languages such awk and sed built in to handle any string manipulation. You can also use Perl or Python with regular expressions for string manipulation of arbitrary complexity.

To get started with regular expressions, what I’ve found useful is interactive tutorials such as RegexOne and playgrounds such as regexr.

Manipulating files

Ensure that a file exists:

if [[ ! -f stuff/parent_of_files/files/necessary.file ]]; then
	mkdir -p stuff/parent_of_files/files
	touch files/necessary.file
fi

Compressing and decompressing a directory

tar -cvzf archive_name.tar.gz content_directory

tar -xvzf archive_name.tar.gz -C target_directory

Merge two .csv files that have an identical index

join file1.csv file2.csv

Merge N .csv files with a separate header file

cat header.csv file1.csv file2.csv ... > target_file.csv

Use split to separate a file into chunks, for example for a .csv file (keeping the header)

tail -n +2 file.csv | split -l 4 - split_
for file in split_*
do
    head -n 1 file.csv > tmp_file
    cat "$file" >> tmp_file
    mv -f tmp_file "$file"
done

Take a random sample from a large .csv file (this will load it in memory though):

shuf -n 10 data.csv and if you need to keep the header but write a subsample to another file:

head -n 1 data.csv > data_sample.csv
shuf -n 10000 <(tail -n +2 data.csv) >> data_sample.csv

Change a file separator with sed sed 's/;/,/g' data.csv > data.csv

Send a file to a remote machine

scp /Users/andrei/local_directory/conf.py user@hostname:/home/andrei/conf.py

Managing resources and jobs

What is eating up all my disk space? ncdu provides an interactive view. An alternative without additional installs is du -hs * | sort -h

What/who is slowing down the entire machine? htop for the colorful version, top otherwise.

uptime - self explanatory.

Leave a long running model/script executing in the background on a server and write the outputs to model.log nohup python long_running_model.py > model.log & This also print the PID you can use to kill the process with, if it’s been a while you can find the PID with ps -ef | grep long_running_model.py

Send a request with (nested) payload to a local Flask service

curl \
  -H 'Content-Type: application/json' \
  -X POST \
  -d '{"model_type": "neuralnet", "features": {"measurement_1": 50002.3, "measurement_2": -13, "measurement_3": 1.001}}' \
  http://localhost:5000/invoke

Is a process occupying a port? lsof -i :port_number

To kill jobs, try kill PID first and kill -9 PID as the last resort.

Variables

Basics

Use printenv to see your environment variables. To set a permanent environment variable, add it to your ~/.bash_profile: export MLFLOW_TRACKING_URI=http://mlflow.remoteserver Now you can access os.environ["MLFLOW_TRACKING_URI"] in Python.

Use env to run a command with a custom environment:

env -i INNER_SHELL=True bash

Use local to declare local scope variables in a function.

Special parameters

There are a few, but some useful ones

$? - exit status of the last command

$# - number of positional arguments for a script

$$ - process ID

$_ - absolute path of the shell

$0 - name of the script

[[ $? == 0 ]] && echo "Last command succeeded" || echo "Last command failed"

Variable manipulation

Some tricks you can do with variables:

If parameter is not set, use a default value: ${MODEL_DIR:-"/Users/jack/work/models"} or set it to another value: ${MODEL_DIR:=$WORK_DIR} or raise an error with a message: ${MODEL_DIR:?'No directory exists!'}

Command line arguments

$# - the number of command line arguments

You could always use the simple $1, $2, $3 to use positional arguments passed together with a script.

For simple single letter named arguments, you can use the builtin getopts.

while getopts ":m:s:e:" opt; do
    case ${opt} in
        m) MODEL_TAG="$OPTARG"
        ;;
        s) DATA_START_DATE="$OPTARG"
        ;;
        e) DATA_END_DATE="$OPTARG"
        ;;
        \?) echo "Incorrect usage!"
        ;;
    esac
done

Varia

Python code can be run inline in pipes - meaning you can use it to replace awk/sed/perl if necessary:

echo "Hello World" | python3 -c "import sys; import re; input = sys.stdin.read(); output = re.sub('Hello World', 'Privet Mir', input); print(output)"

Or do pretty much anything:

model_accuracy=$(python -c 'from sklearn.svm import SVC; from sklearn.multiclass import OneVsRestClassifier; from sklearn.metrics import accuracy_score; from sklearn.preprocessing import LabelBinarizer; X = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]; y = [0, 0, 1, 1, 2]; classif = OneVsRestClassifier(estimator=SVC(random_state=0)); y_preds = classif.fit(X, y).predict(X); print(accuracy_score(y, y_preds))';)

Another way to feed multi line input to a command would be using the here document, for example:

python <<HEREDOC
import sys
for p in sys.path:
  print(p)
HEREDOC

where HEREDOC acts as the EOF encoding and the script is executed only after the EOF has been met.

Use the jq command for working on the command line with JSON. Examples.

Resources

Bash Guide for Beginners - Decent resource for an intro into scripting

GNU Coreutils manual - Learn what you can do with default tools

Data Science at the Command Line - if you like the shell so much that you’d like to do EDA/modelling there :)

Thanks for some tips and tricks to Mark Cowan who actually knows some Bash :)