/* This is an example HY-PHY Batch File.



   It reads in a '#' nucleotide dataset data/hiv.nuc and estimates
   maximum ln-likelihood based on the tree contained in the data file,
   using the general non-reversible 12 parameter (with 3 constraints)
   model. The tree from the data file is unrooted. 

   

   Output is printed out as a Newick Style tree with branch lengths
   representing the number of expected substitutions per branch (which
   is the default setting for nucleotide models w/o rate variation).


   Sergei L. Kosakovsky Pond and Spencer V. Muse 

   November 2002. 

*/

/* 1. Read in the data and store the result in a DataSet variable.*/

DataSet 		nucleotideSequences = ReadDataFile ("data/hiv.nuc");

/* 2. Filter the data, specifying that all of the data is to be used
	  and that it is to be treated as nucleotides.*/
	 
DataSetFilter	filteredData = CreateFilter (nucleotideSequences,1);

/* 3. Collect observed nucleotide frequencies from the filtered data. observedFreqs will
	  store the vector of frequencies. */

HarvestFrequencies (observedFreqs, filteredData, 1, 1, 1);

/* 4. Define the 12 parameter non-reversible substitution matrix. 
The constraints on some of the substitution rates are necessary to
ensure that observed nucleotide frequencies are equilibrium 
frequencies for the model.

'*' is defined to be -(sum of off-diag row elements) */

/* frequency ratios */
	 
r0 = observedFreqs[0]/observedFreqs[3];
r1 = observedFreqs[1]/observedFreqs[3];
r2 = observedFreqs[2]/observedFreqs[3];

/* All the global rate parameters are defined relative to 
the rate for A->C. For instance, CG represents the ratio
of the rates C->G/A->C. */

global AG;
global AT;
global CA;
global CG;
global CT;
global GA;
global GC;
global GT;

/* note that these constraints are 
satisfied if we restrict the model to the
general reversible case */

global TA:=AT+(1-CA)*r1__+(AG-GA)*r2__;
global TC:=CT+(CA-1)*r0__+(CG-GC)*r2__;
global TG:=GT+(GA-AG)*r0__+(GC-CG)*r1__;


NRRateMatrix = 
		{{*,t,t*AG,t*AT}
		 {t*CA,*,t*CG,t*CT}
		 {t*GA,t*GC,*,t*GT}
		 {t*TA,t*TC,t*TG,*}};

/*5.  Define the Non-Rev model, by combining the substitution matrix with the vector of observed (equilibrium)
	  frequencies. */
	  
Model 	NRM = (NRRateMatrix, observedFreqs);

/*6.  Now we can define the tree variable, using the tree string read from the data file,
	  and, by default, assigning the last defined model (NRM) to all tree branches. */

Tree	givenTree = DATAFILE_TREE;

/*7.  Since all the likelihood function ingredients (data, tree, equilibrium frequencies)
	  have been defined we are ready to construct the likelihood function. */

LikelihoodFunction  theLnLik = (filteredData, givenTree);

/*8.  Maximize the likelihood function, storing parameter values in the matrix paramValues */

Optimize (paramValues, theLnLik);

/*9.  Print the tree with optimal branch lengths to the console. */

fprintf  (stdout, theLnLik);

