pyutils_sh package

Module contents

pyutils_sh

An assortment of Python utilities for my personal projects. It includes functions for aggregating different types of survey data, grading scantron exams, and calculating various statistics, along with general Python helper functions.

Documentation

Documentation is available via docstrings provided with the code, and an online API reference found at ReadTheDocs.

To view documentation for a function or module, first make sure the package has been imported:

>>> import pyutils_sh

Then, use the built-in help function to view the docstring for any function or module:

>>> help(pyutils_sh.exam.grade_scantron)

Modules

battery
Functions for aggregating subject data from Cognitive Battery (https://github.com/sho-87/cognitive-battery)
exam
Functions for aggregating different types of data from school exams (e.g. student grades)
gaze
Functions for analyzing gaze/eye-tracking data
image
Functions for analyzing images
stats
Tools for calculating different types of statistics
survey
Tools for aggregating and analyzing data from different surveys
utils
General utility functions used for Python programming

Submodules

pyutils_sh.battery module

Functions for aggregating subject data from Cognitive Battery (https://github.com/sho-87/cognitive-battery)

aggregate_ant(data, sub_num, response_type='full')[source]

Aggregate data from the ANT task.

Calculates various summary statistics for the ANT task for a given subject.

Parameters:
  • data (dataframe) – Pandas dataframe containing a single subject’s trial data for the task.
  • sub_num (str) – Subject number to which the data file belongs.
  • response_type ({‘full’, ‘correct’, ‘incorrect’}, optional) – Whether the summary data should be calculated using all trials, only correct trials, or only incorrect trials. Not all tasks support this option.
Returns:

stats – List containing the calculated data for the subject.

Return type:

list
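
For example, assuming a single subject’s ANT trial data has been exported to a CSV file (the filename here is hypothetical), the data can be loaded with pandas and aggregated like so:

>>> import pandas as pd
>>> import pyutils_sh
>>> ant_data = pd.read_csv('sub_101_ant.csv')
>>> stats = pyutils_sh.battery.aggregate_ant(ant_data, '101', response_type='full')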

aggregate_digit_span(data, sub_num)[source]

Aggregate data from the digit span task.

Calculates various summary statistics for the digit span task for a given subject.

Parameters:
  • data (dataframe) – Pandas dataframe containing a single subject’s trial data for the task.
  • sub_num (str) – Subject number to which the data file belongs.
Returns:

stats – List containing the calculated data for the subject.

Return type:

list

aggregate_flanker(data, sub_num, response_type='full')[source]

Aggregate data from the Flanker task.

Calculates various summary statistics for the Flanker task for a given subject.

Parameters:
  • data (dataframe) – Pandas dataframe containing a single subject’s trial data for the task.
  • sub_num (str) – Subject number to which the data file belongs.
  • response_type ({‘full’, ‘correct’, ‘incorrect’}, optional) – Whether the summary data should be calculated using all trials, only correct trials, or only incorrect trials. Not all tasks support this option.
Returns:

stats – List containing the calculated data for the subject.

Return type:

list

aggregate_mrt(data, sub_num)[source]

Aggregate data from the MRT task.

Calculates various summary statistics for the MRT task for a given subject.

Parameters:
  • data (dataframe) – Pandas dataframe containing a single subject’s trial data for the task.
  • sub_num (str) – Subject number to which the data file belongs.
Returns:

stats – List containing the calculated data for the subject.

Return type:

list

aggregate_ravens(data, sub_num)[source]

Aggregate data from the Raven’s Matrices task.

Calculates various summary statistics for the Raven’s Matrices task for a given subject.

Parameters:
  • data (dataframe) – Pandas dataframe containing a single subject’s trial data for the task.
  • sub_num (str) – Subject number to which the data file belongs.
Returns:

stats – List containing the calculated data for the subject.

Return type:

list

aggregate_sart(data, sub_num)[source]

Aggregate data from the SART task.

Calculates various summary statistics for the SART task for a given subject.

Parameters:
  • data (dataframe) – Pandas dataframe containing a single subject’s trial data for the task.
  • sub_num (str) – Subject number to which the data file belongs.
Returns:

stats – List containing the calculated data for the subject.

Return type:

list

aggregate_sternberg(data, sub_num, response_type='full')[source]

Aggregate data from the Sternberg task.

Calculates various summary statistics for the Sternberg task for a given subject.

Parameters:
  • data (dataframe) – Pandas dataframe containing a single subject’s trial data for the task.
  • sub_num (str) – Subject number to which the data file belongs.
  • response_type ({‘full’, ‘correct’, ‘incorrect’}, optional) – Whether the summary data should be calculated using all trials, only correct trials, or only incorrect trials. Not all tasks support this option.
Returns:

stats – List containing the calculated data for the subject.

Return type:

list

aggregate_wide(dir_battery, dir_output, response_type='full', use_file=False, save=True)[source]

Aggregate data from all battery tasks.

Takes a directory containing individual subject data files created from the Cognitive Battery, and calculates summary statistics for all subjects across all tasks. A single output summary file is created containing the aggregated battery data.

Parameters:
  • dir_battery (str) – Path to the directory containing subject data files created by the Cognitive Battery.
  • dir_output (str) – Path to the directory where the output summary file will be saved. A file named ‘battery_data.csv’ will be created in this directory.
  • response_type ({‘full’, ‘correct’, ‘incorrect’}, optional) – Whether the summary data should be calculated using all trials, only correct trials, or only incorrect trials. Not all tasks support this option.
  • use_file (bool, optional) – If True, aggregated battery data will be imported from the existing summary file instead of being re-aggregated.
  • save (bool, optional) – If True, save an output summary file to the output directory. If False, no file is saved, but a dataframe is still returned from this function.
Returns:

all_data – Pandas dataframe containing the aggregated summary data for all tasks.

Return type:

dataframe
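
A minimal usage sketch, assuming the subject data files live in a local directory (both paths here are hypothetical):

>>> import pyutils_sh
>>> all_data = pyutils_sh.battery.aggregate_wide(
...     'data/battery', 'data/summary', response_type='full', save=True)

This writes ‘battery_data.csv’ to the output directory and returns the aggregated dataframe.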

pyutils_sh.exam module

Functions for aggregating and analyzing exam-related data, such as calculating student exam performance.

grade_scantron(input_scantron, correct_answers, drops=[], item_value=1, incorrect_threshold=0.5)[source]

Calculate student grades from scantron data.

Compiles data collected from a scantron machine (5-option multiple choice exam) and calculates grades for each student. Also provides descriptive statistics of exam performance, lists the questions “most” students answered incorrectly (as defined by incorrect_threshold), and saves the distribution of answers for those poorly performing questions.

This function receives one scantron text file and produces two output files. Splitting of the scantron data is specific to each scantron machine. The indices used in this function are correct for the scantron machine in the UBC Psychology department as of 2015; they need to be adjusted for different machines.

Scantron exams can be finicky, and students who fill out their scantrons incorrectly need to be considered. Make sure to manually inspect the text file output by the scantron machine for missing answers before running this function; it does not correct for human error in filling out the scantron.

Parameters:
  • input_scantron (string) – Path to the .txt file produced by the scantron machine.
  • correct_answers (list) – A list of strings containing the correct exam answers. For example: ["A", "E", "D", "A", "B"]. The order must match the order of presentation on the exam (i.e. the first list item must correspond to the first exam question).
  • drops (list, optional) – List of integers containing question numbers that should be excluded from calculation of grades. For example: [1, 5] will not include questions 1 and 5 when calculating exam scores.
  • item_value (int, optional) – Integer representing how many points each exam question is worth.
  • incorrect_threshold (float between [0., 1.], optional) – Poorly performing questions are those where few students got the correct answer. This parameter sets the threshold at which an item is considered poor. For example, a threshold of 0.4 means that a poor item is considered to be one where less than 40% of students chose the correct answer.
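
For example, to grade an exam where each question is worth 2 points, questions 1 and 5 are dropped, and an item counts as poorly performing when fewer than 40% of students answered it correctly (the file path and answer key here are hypothetical):

>>> import pyutils_sh
>>> answers = ["A", "E", "D", "A", "B"]
>>> pyutils_sh.exam.grade_scantron('exam1_scantron.txt', answers,
...                                drops=[1, 5], item_value=2,
...                                incorrect_threshold=0.4)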

pyutils_sh.gaze module

Functions for calculating various gaze/eye-tracking related statistics.

cross_correlation(person1, person2, framerate=25, constrain_seconds=2)[source]

Calculate cross-correlation between two gaze signals.

This function takes two lists/arrays of data, each containing an individual’s coded gaze data from an eye-tracker, and calculates the normalized maximum cross-correlation value with its associated lag.

It also returns the cross-correlation value at 0 lag, as well as the entire normalized array as a Python list.

A negative lag value means person2 lagged behind person1 by x frames, e.g.:

>>> A = [0, 1, 1, 1, 0, 0, 0]
>>> B = [0, 0, 0, 1, 1, 1, 0]
>>> cross_correlation(A, B)

A positive lag value means person1 lagged behind person2 by x frames, e.g.:

>>> A = [0, 0, 0, 1, 1, 1, 0]
>>> B = [0, 1, 1, 1, 0, 0, 0]
>>> cross_correlation(A, B)

Parameters:
  • person1 (ndarray or list) – 1D array of person 1’s gaze over time, coded as 0 = not looking, 1 = looking. The values represent whether the person was looking at a target at a particular point in time.
  • person2 (ndarray or list) – 1D array of person 2’s gaze over time, coded as 0 = not looking, 1 = looking. The values represent whether the person was looking at a target at a particular point in time.
  • framerate (int, optional) – The framerate (frames per second) of the eye-tracker.
  • constrain_seconds (int, optional) – Number of seconds to constrain the cross-correlation values by. The returned lags and cross-correlations will be centered around 0 lag by this many seconds.
Returns:

  • max_R (float) – Maximum (normalized) cross-correlation value.
  • max_lag_adj (float) – Lag at which max cross-correlation occurs.
  • zero_R (float) – Cross-correlation value at 0 lag.
  • norm_array (list) – A list of all (normalized) cross-correlation values.
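
Since four values are returned, a typical call unpacks all of them (using the A and B lists from the examples above):

>>> max_R, max_lag_adj, zero_R, norm_array = pyutils_sh.gaze.cross_correlation(
...     A, B, framerate=25, constrain_seconds=2)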

pyutils_sh.image module

Functions for analyzing images.

boxcount(img, k)[source]

Internal box counting function used by pyutils_sh.image.fractal_dimension().

From https://github.com/rougier/numpy-100 (#87)

Parameters:
  • img (ndarray) – Thresholded grayscale image for box counting.
  • k (ndarray) – Array of box sizes to use.
Returns:

count – Number of boxes that contain part of the thresholded image.

Return type:

int
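
For reference, the numpy-100 recipe (#87) credited above counts, for a single box size k, the boxes that partially cover the thresholded image; a minimal sketch of that recipe (not necessarily this exact implementation, which accepts an array of sizes) is:

>>> import numpy as np
>>> def boxcount_single(Z, k):
...     # Sum the pixels inside each k-by-k box of the binary image Z
...     S = np.add.reduceat(
...         np.add.reduceat(Z, np.arange(0, Z.shape[0], k), axis=0),
...         np.arange(0, Z.shape[1], k), axis=1)
...     # Count boxes that are neither empty nor completely filled
...     return len(np.where((S > 0) & (S < k * k))[0])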

fractal_dimension(img, threshold=0.5, mean_threshold=True, plot=False)[source]

Calculate (Minkowski–Bouligand) fractal dimension.

From https://github.com/rougier/numpy-100 (#87)

Parameters:
  • img (ndarray) – Grayscale image for box counting.
  • threshold (float between [0., 1.], optional) – Value at which to binarize the image.
  • mean_threshold (bool, optional) – If True, binarize the image at its mean value.
  • plot (bool, optional) – Display a plot of the thresholded image.
Returns:

fd – Fractal dimension value for the image.

Return type:

float

rgb2gray(img)[source]

Convert RGB image to grayscale.

Parameters:
  • img (ndarray) – Normalized (/255) image array from scipy.misc.imread() or imageio.imread().
Returns:

gray – Grayscale image array.

Return type:

ndarray
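
Putting rgb2gray() and fractal_dimension() together, a typical pipeline might look like this (the filename is hypothetical):

>>> import imageio
>>> import pyutils_sh
>>> img = imageio.imread('texture.png') / 255  # normalize to [0, 1]
>>> gray = pyutils_sh.image.rgb2gray(img)
>>> fd = pyutils_sh.image.fractal_dimension(gray, mean_threshold=True)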

pyutils_sh.stats module

Tools for calculating different types of statistics, such as effect size estimates.

cohens_d(g1_m, g1_sd, g1_n, g2_m, g2_sd, g2_n)[source]

Calculate Cohen’s d for two independent samples.

This calculation takes the mean difference between the two groups and divides it by the pooled standard deviation.

Parameters:
  • g1_m (float) – Mean value for group 1.
  • g1_sd (float) – Standard deviation for group 1.
  • g1_n (int) – Sample size of group 1.
  • g2_m (float) – Mean value for group 2.
  • g2_sd (float) – Standard deviation for group 2.
  • g2_n (int) – Sample size of group 2.
Returns:

d – Standardized effect size (Cohen’s d) for the group difference.

Return type:

float
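
For two independent samples, the pooled standard deviation is conventionally sqrt(((g1_n - 1) * g1_sd**2 + (g2_n - 1) * g2_sd**2) / (g1_n + g2_n - 2)). A call with hypothetical group summaries looks like:

>>> import pyutils_sh
>>> d = pyutils_sh.stats.cohens_d(g1_m=5.2, g1_sd=1.1, g1_n=30,
...                               g2_m=4.6, g2_sd=1.3, g2_n=28)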

pyutils_sh.survey module

Functions for aggregating and analyzing different types of survey data.

ipaq_long_aggregate(q_map, domains=False)[source]

Aggregate self-reported activity values into IPAQ summary data.

Calculates MET-minutes and an IPAQ category for each individual based on self-reported physical activity levels. The scoring follows the official IPAQ scoring guide as closely as possible.

Section 7.4 (Truncation rules) of the scoring guide is extremely unclear about how to truncate time data for the long-form IPAQ; the rule does not allow for the separation of weekly time or weekly METs, so it has not been followed here.

The American College of Sports Medicine (ACSM) publishes minimum recommended physical activity levels. In addition to the IPAQ categorical variable, an ACSM activity variable is calculated, indicating whether the individual met those minimum recommended levels.

Parameters:
  • q_map (dict) – Dictionary mapping question names to separate Pandas columns/series. This allows the function to use a consistent internal name for each column of the survey, regardless of how the columns are named in your data.

    The dictionary keys follow a strict naming scheme of qXX, where XX is the question number. Time variables need to be split, with hours and minutes stored under separate keys; the naming scheme for the time variables is qXX_h and qXX_m. The dictionary must also contain a column of subject numbers under the key sub_num. All questions must be included in the dictionary (41 IPAQ questions, plus 1 subject number).

    An example dictionary might look like this:

    >>> q_map = {'sub_num': data['subject_id'],
                 'q1': data['IPAQ_1'],
                 'q2': data['IPAQ_2'],
                 'q3_h': data['IPAQ_3_1'],
                 'q3_m': data['IPAQ_3_2'],
                 ...
                 'q27_h': data['IPAQ_27_1'],
                 'q27_m': data['IPAQ_27_2']}
    
  • domains (bool, optional) – If True, MET minutes and time values will be included separately for each IPAQ activity domain.

Returns:

aggregated – Pandas dataframe containing the calculated IPAQ summary data.

Return type:

dataframe
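
With a q_map constructed as above, the aggregation is a single call; setting domains=True also breaks the results out by IPAQ activity domain:

>>> aggregated = pyutils_sh.survey.ipaq_long_aggregate(q_map, domains=True)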

ipaq_to_minutes(hours, mins)[source]

Convert hours and minutes into minutes, following IPAQ data cleaning rules.

Internal function used by survey.ipaq_long_aggregate(). Takes an hours column and a minutes column (Pandas series) and calculates total time in minutes. Follows the IPAQ data cleaning rules outlined in the scoring guide, such as handling out-of-bound hour values (no conversion) and removing individuals who reported too large a time value (> 24 hours).

Parameters:
  • hours (series) – Pandas series containing reported hours for all participants.
  • mins (series) – Pandas series containing reported minutes for all participants.
Returns:

converted – Pandas series containing time spent in minutes.

Return type:

series

pas_aggregate(q_map)[source]

Aggregate PAS questionnaire data.

The original development paper: Aadahl & Jørgensen (2003), “Validation of a New Self-Report Instrument for Measuring Physical Activity”.

Parameters:

q_map (dict) – Dictionary mapping question names to separate Pandas columns/series. This allows the function to use a consistent internal name for each column of the survey, regardless of how the columns are named in your data.

The dictionary keys follow a strict naming scheme of ‘a’, ‘b’ … ‘i’, where each key represents a PAS category. Time variables need to be split, with hours and minutes stored under separate keys; the naming scheme for the time variables is x_hours and x_mins. The dictionary must also contain a column of subject numbers under the key sub_num.

An example dictionary might look like this:

>>> pas_qmap = {'sub_num': df['subNum'],
                'a_hours': df['PAS_a_hours'],
                'a_mins': df['PAS_a_mins'],
                'b_hours': df['PAS_b_hours'],
                'b_mins': df['PAS_b_mins'] ... }
Returns:

aggregated – Pandas dataframe containing the calculated PAS summary data.

Return type:

dataframe
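
With the mapping above, the aggregation is then:

>>> aggregated = pyutils_sh.survey.pas_aggregate(pas_qmap)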

pyutils_sh.utils module

Utility functions for general purpose Python programming.

get_path(f=sys.argv[0])[source]

Get path to, and name of, a file.

Parameters:
  • f (str, optional) – Full path to a file. Defaults to the currently executing Python file.
Returns:

  • directory (str) – Path to the directory containing the file.
  • filename (str) – Name of the file.
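
For example (the path here is hypothetical):

>>> import pyutils_sh
>>> directory, filename = pyutils_sh.utils.get_path('/path/to/data.csv')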