Matlab Object Oriented Model Fitting Tutorial

I frequently fit computational models to behavioral and neuroimaging datasets using statistical, reinforcement learning, and utility frameworks.  However, I typically write separate scripts for each specific study.  To make this process easier I have developed a new matlab function that allows anyone to easily fit any type of model to a multi-subject dataset using very simple commands.  To use it, simply clone my github toolbox repository and add the matlab folder to your matlab path.

Model Fitting Overview

The general idea of model fitting is to posit a theory about how something might work and test it by examining how well the theory can account for a dataset given a set of free parameters.  In principle, I try to evaluate how well the model performs (i.e., the goodness of fit) relative to other competing theories.  This comparison assumes that the model with the fewest number of free parameters that has the best goodness of fit for the most subjects is the best model to explain the data.  To calculate the goodness of fit, it is often necessary to determine the optimal free parameters specified in the model, a process referred to as parameter estimation.   The most common type of estimation techniques are Least Squares Estimation, which attempts to minimize the squared difference between the model prediction and the observed data, and Maximum Likelihood Estimation, which calculates the likelihood of a model explaining a dataset given a set of parameters.  I recommend reading Myung (2003) for a tutorial on maximum likelihood.  I typically use a two-stage multi-level framework to estimate parameters for a group.  This is identical to neuroimaging analyses in which parameters are estimated for each individual subject and then averaged across participants.   For hypothesis testing, I often use non-parametric statistical tests (e.g., signed rank test) to perform model comparison on model fits penalized for their number of free parameters (e.g., AIC, BIC ).  I also occasionally use F and chi square tests to perform nested model comparisons and the vuong statistic for non-nested model comparisons.  More recently, I have been playing with the stan statistical language which has both R and Python wrappers to perform hierarchical bayesian parameter estimation.  For a more in depth overview of this process, I highly recommend an excellent chapter by Nathaniel Daw.

Computational Model Class ( comp_model() )

The computational model ( comp_model() ) class creates an object instance that is used to fit a computational model to a multi-subject dataset. The object uses the design_matrix() class for the data set, which has rudimentary functionality similar to an R data.frame().  There are additional fields for the model and parameters for the model fitting procedure such as the parameter constraints, number of iterations, and type of estimation (e.g., maximum likelihood or least squares).  I use Matlab’s fmincon optimization function to find the parameters that result in the best fit of the model.  This constrained optimization function is very efficient, but has a tendency to occasionally get stuck in a local minima.  Using more iterations will rerun the model multiple times with different initial starting values which decreases the likelihood of fmincon getting stuck in local minimas, but increases the amount of time it takes to fit the models.

Current Methods for comp_model (inherits from design_matrix class too)

  • avg_aic – display average AIC value
  • avg_bic – display average BIC value
  • avg_params – display average parameter estimates
  • comp_model – class constructor
  • fit_model – estimate parameters using model
  • get_aic – extract all subject’s AIC values
  • get_bic – extract all subject’s BIC values
  • get_params – extract all subject’s estimated parameters
  • plot – plot average model predictions across subjects
  • summary – display summary table for model
  • save – save object as .mat file
  • write_tables – write out parameter estimates and trial-to-trial predictions to csv data frame.

Example Usage

First we will load some example data using the importdata function and set some of the defaults settings for the optimization algorithm.  The comp_model() function assumes that data is in the long format and that the subject indicator is the first column of the data frame.

% Load data
basedir = '~/Dropbox/TreatmentExpectations';
dat = importdata(fullfile(basedir, 'Data','Seattle_Sona','seattle_ses_by_session.csv'));

% Set optimization parameters for fmincon (OPTIONAL)
options = optimset(@fmincon);
options = optimset(options, 'TolX', 0.00001, 'TolFun', 0.00001, 'MaxFunEvals', 900000000, 'LargeScale','off');

Next, we will create comp_model() class instance for an example linear model.  This requires that we specify a data frame, cell array of column names, and a model name. The Model Name must refer to function containing the model, which is located on the matlab path (See example ‘linear_model’ function below).  We can also specify some additional parameters that are required for the model fitting procedure.

    • nStart – the number of iterations to repeat model estimation – selects the iteration associated with the best model fit. Higher numbers decrease the likelihood of fmincon getting stuck in a local minima, but increase the computational time.
    • param_min – vector of lower bound of parameters
    • param_max – vector of upper bound of parameters
    • esttype – type of parameter estimation (‘SSE’ – minimize sum of squared error; ‘LLE’ – maximize log likelihood; ‘LE’ – maximize likelihood)
lin = comp_model(dat.data,dat.textdata,'linear_model','nStart',10, 'param_min',[-5, -20], 'param_max', [60, 20], 'esttype','SSE');

911x8 comp_model array with properties:

model: 'linear_model'
param_min: [-5 -20]
param_max: [60 20]
nStart: 10
esttype: 'SSE'
params: []
trial: []
dat: [911x8 double]
varname: {'subj' 'group' 'sess' 'se_count' 'se_sum_intensity' 'any_action_taken' 'hamtot' 'bditot'}
fname: ''

Once the comp_model() object has been created with all of the necessary setup parameters, the model can be fit to the data with following command.

lin = lin.fit_model();

911x8 comp_model array with properties:

model: 'linear_model'
param_min: [-5 -20]
param_max: [60 20]
nStart: 10
esttype: 'SSE'
params: [77x6 double]
trial: [911x4 double]
dat: [911x8 double]
varname: {'subj' 'group' 'sess' 'se_count' 'se_sum_intensity' 'any_action_taken' 'hamtot' 'bditot'}
fname: ''

This adds data to two new fields, params and trial.

  • params – is the parameters estimated for each subject. Rows are individual subjects. Columns are {‘Subject’, ‘Estimated Parameters (specific to each model)’, ‘Model Fit’, ‘AIC’, ‘BIC’}
  • trial – is the trial by trial data and predicted values for all subjects stacked together.

The overall average results from the model can be quickly viewed using the summary() method.

summary(lin)

Summary of Model: linear_model
-----------------------------------------
Average Parameters: 18.1073
Average AIC: 35.8264
Average BIC: 36.3802
Average SSE: -1.86
Number of Subjects: 77
-----------------------------------------

The ‘params’ and ‘trial’ data frames can be written to separate .csv files.  This can be helpful to open these data in a separate program such as R or Python for plotting or futher analyses.

lin.write_tables(fullfile(basedir,'Analysis','Modeling'))

The overall object instance can be saved as .mat file. This is helpful as sometimes model estimation can take a long time especially if using a high number of iterations.

lin.save(fullfile(basedir,'Analysis','Modeling','Linear_ModelFit.mat'))

Example_PlotThe average predicted values from the model can be quickly plotted using the plot() function.  It is necessary to specifiy the particular columns from obj.trial to plot as these will be specific to the data set and model. This method has some rudimentary options to customize the plot.

 

plot(lin, [3,4], 'title', 'Linear Model', 'xlabel','session', 'ylabel', 'Average BDI', 'legend', {'Predicted','Observed'})

Example Model Function

Here is an example function of a very simple linear model. Functions can be completely flexible, but need to have the free parameter (xpar) and data as inputs and output the model fit (e.g., sse).  This is so fmincon can optimize the parameters for this function by minimizing the Sum of Squared Error (sse – for this example).  This function needs to be in a separate function file on the matlab path.

function sse = linear_model(xpar, data)
% Fit Linear Decay Treatment Model

global trialout %this allows trial to be saved to comp_model() object

% Model Parameters
beta0 = xpar(1); % Intercept
beta1 = xpar(2); % Slope

% Parse Data
obssx = data(:,8); %BDI symptom scores

%Model Initial Values
sse = 0; % initial value sum of squared error
time = 1:size(data,1);
time = time - mean(time); %center time variable

% This model is looping through every trial. Obviously this isn't
% necessary for this specific model, but it is for more dynamic models that
% change with respect to time such as RL models.
for t = 1:length(obssx)

%Calculate symptom decay using linear decay
predsx(t) = beta0 + beta1 * time(t); %linear trend of symptom change

% update sum of squared error (sse)
sse = sse + (predsx(t) - obssx(t))^2;

end

%Output trial by trial results - saved as obj.trial
trialout = [ones(t,1)*data(1,1) (1:t)', obssx, predsx(1:t)'];

end % model

 Model Comparison

For a quick example of testing a hypothesis by comparing competing models we will use Matlab’s robust non-parametric signrank test to determine if the test model (i.e. linear_expect_model) fits the data significantly better than the null model (i.e. linear_model) using the get_aic() command.  We use the AIC metric to compare the models as the test model has 3 free parameters compared to the null model which only has 2.

[P,H,STATS] = signrank(get_aic(lin),get_aic(lin_expect))

P =
	8.8387e-08

H =
	1

STATS =
          zval: 5.3491
    signedrank: 2555

In this example, we see that the test Linear Expectation model fits the data significantly better than the null Linear model, supporting our hypothesis.

Let me know if you have any comments or suggestions on how to improve the software.

 

 

Advertisements

Collaborative Project Management Software

Have you ever had difficulty keeping track of all of your various tasks, projects, and collaborations?  Many effective scientists I know use project management software to keep track of projects and deadlines.  Popular choices are often based on David Allen’s Getting Things Done philosophy such as omnifocus and remember the milk.  I have personally used omnifocus for many years with mixed success.  I enjoyed that my tasks were synced across my computers and devices, but never felt like I was using it beyond a virtual “sticky pad”.  Plus I always found it a bit tedious to enter all of the information associated with tasks.  However, I routinely rely on many cloud based collaborative tools such as dropbox to share files with collaborators and across computers, google apps for collaborative scheduling, google hangouts for video conferencing, and hackpad, for collaborative note taking and writing.  I have recently discovered cloud based collaborative project management software, and anticipate it will dramatically improve my efficiency in constant multi-tasking across dozens of collaborative projects.  These software packages have become popular in tech startups, but have been relatively slow to infiltrate the scientific community, which I suspect is because our needs are a bit different.  Here are a few examples.

Scientists Needs:

  1. Scientists work in teams, but are mostly interested in their individual careers/projects.  This is why dropbox and google apps work well.   Each scientist has their own account and can work with other people, but in the end they are in control of their data and interface.  Collaborations occur within and between laboratories and personnel are in a constant state of flux (i.e., primarily undergraduate, graduate, postdoc trainees). Many cloud based project management softwares are designed to be run in a stable business setting and do not work well with the ever changing laboratory landscape.
  2. Scientists need an interface for planning, tracking progress and discussing.  These needs are obviously shared with many other disciplines, but scientists are generally accustomed to technology such as laboratory notebooks, which track progress of data collection, analyses, results, brainstorms, and literature searches.
  3. Scientists are not as well financed as their industry counterparts and have limited pockets of money from startup and grants to pay for software.  Scientists are accustomed to academic pricing models and freeware, and are not used to the software as a service (SAAS) model, particularly when revenue sources from grants are not constant.
  4. Scientists are busy and won’t use things that are too complicated, take a lot of effort, or are unintuitive.  This is one of the reasons why I think Apple products have such deep penetration into the academic community.  Many scientists have little patience for the instability and insecurity of windows, or the complexity of linux operating systems.
  5. Scientists are fairly naive when it comes to management.  We are trained through an apprenticeship model, and are generally unaccustomed to managing a lot of personnel and projects.  In addition, we have limited resources and do not typically hire management other than lab managers.  Thus, we have less training and resources to manage teams compared to other industries.
  6. Scientists like stability.  Nobody wants to spend their precious time or money in learning technology that becomes quickly obsolete or discontinued (anyone remember google wave? or apple xserves?).

Examples:

Fortunately, there are a number of really nice software packages being developed that will likely be helpful in improving efficient management of projects and collaborations

2014-03-06_00-11-53

Trello: A free online collaborative tool started by Joel Spolsky (one of the founders of stackoverflow) and based on the kanban philosophy.  The general idea is that there are organizations with multiple people, boards for specific projects, lists within a board, and cards within lists that can be discussed and tasked to specific individuals.  Each board can have different permissions and every user has unlimited boards.  Kanban encourages cards to move across different stages of production represented as lists.  This software is very simple and visually intuitive and I think works well for many different tasks such as brainstorming and collaborating on specific projects.  This software is well suited to be adopted by scientists because it is free and based on interacting individuals rather than a central system for a particular organization.  I think it nicely complements other cloud based apps such as google calendar, dropbox, and hackpad.

Pros:

  • Free!
  • Web interface, tablet, and mobile apps are intuitive, fast, and well implemented.
  • Unlimited users and boards
  • Works well for individual research projects within a lab and across labs.  Relatively easily to invite outside collaborators to specific boards.
  • Works well as an individual project manager board.
  • Email and dropbox integration.
  • Has an open API, which adds to it’s flexible nature.

Cons:

  • No real dashboard feature which allows you to see upcoming tasks and updates across boards.
  • Discussions are limited to specific cards.  There is no general discussion board.
  • Calendar app is only for projects associated with a specific board.  No integrated shared calendar (can link calendar feeds to google calendar though).
  • No collaborative writing option.  (but try hackpad!)

2014-03-06_00-16-16

Basecamp:  One of the oldest and most popular cloud based collaborative projective management softwares and resulted in the development of Ruby on Rails.  This software has projects, which can have tasks, discussions, files, and text documents. Overall, I think this is a great software for many of the important aspects of lab based collaborations.  It does an excellent job of facilitating discussions and centralizing scheduling. However, it might be difficult to get all members of a large lab to participate as it is expensive, has a limited number of projects, and stays with the organization not the user.

Pros:

  • Very intuitive and clean interface
  • Very nice permissions.  People can be clustered into groups and everything can different permissions depending on who should be involved.
  • Can discuss anything (e.g., tasks, files, etc.).  Discussions do not have to be tied to a specific task.
  • Nice calendar integration.  Can have separate calendars for specific projects, lab events, personnel availability, resource availability, etc.
  • Nice mobile apps.
  • Nice individual user dashboard to see upcoming tasks and updates on all projects
  • Unlimited users.
  • Really nice email integration and there are lots of third party integrations.
  • Can pay for individual boards at $15 a month per board if you leave the lab.

Cons:

  • Expensive.  ($50 a month for 40 projects, $3k a year for unlimited projects)
  • Limited projects.  This was a very limiting factor in our lab where we needed at least 100 boards just for our ongoing research projects.
  • Not very convenient to take with you if you leave a lab that is highly integrated with basecamp.
  • Poor collaborative writing implementation.

Additional Software Packages:

  • Asana
  • ActiveCollab
  • Producteev
  • Flow
  • Redbooth
  • Huddle
  • Proofhub
  • Smart sheet
  • Wrike
  • crowdbase
  • Binfire
  • Breeze
  • BCSocial
  • Freedcamp
  • Podio
  • Wridea
  • Gantter
  • OpenProject
  • LibrePlan
  • ProjectLibra

Thoughts?

Anyone else had positive or negative experiences with a particular software in the context of a scientific collaboration?