% Code by Ella Bingham 2005-2006, partially based on Heikki Mannila's old codes % % Spectral analysis of data, and possible reduction of data due to % (nearly) disconnected components in the data matrix. % % Inputs: % % data - a data matrix with sites as rows and taxa as columns. % % taxoninfo, siteinfo - auxiliary information about those taxa and sites % that are present in the data matrix % % similarity_measure - how the similarity matrix is formed. 'dot' for plain % dot product similarity, 'wdot' for weighted dot product similarity (takes % into account the total number of attribute appearances in the whole data). % % outlier_criterion - a coefficient c such that an observation is discarded % if its value in the 2nd smallest eigenvector (called "spcoeff" here) of the % Laplacian matrix is more than c*std(spcoeff) apart from mean(spcoeff). % % Outputs: % % data, taxoninfo, siteinfo - reduced versions of the corresponding input % arguments, resulting from outlier removal. Not sorted. % % sorted_data, sorted_siteinfo - reduced versions of the corresponding input % arguments, resulting from outlier removal. Rows (sites) of sorted_data and % entries of sorted_siteinfo are sorted according to the spectral % coefficients of the sites. % % spcoeff - vector of spectral coefficients of sites; eigenvector % corresponding to the 2nd smallest eigenvalue of the Laplacian matrix. % function [data,taxoninfo,siteinfo,sorted_data,sorted_siteinfo,spcoeff] = spectral_analysis(data,taxoninfo,siteinfo,similarity_measure,outlier_criterion) % Discard the sites that give rise to disconnected parts in the similarity % matrix. Update siteinfo accordingly. [data,siteinfo] = findconnecteddata(data,siteinfo,similarity_measure); % Iterative process: Remove the nearly disconnected components of the data % by removing outliers, that is, sites whose spectral coefficient differ % more than outlier_criterion*std from the mean of all spectral % coefficients. In case a taxon then becomes empty, it must be removed too. % spectral coefficients: the eigenvector corresponding to the % 2nd smallest eigenvalue of the Laplacian matrix of the data rows [spcoeff,simm,lapm] = laplacian(data,similarity_measure); not_outlier = abs(spcoeff-mean(spcoeff))0)); disp(['dropped ' num2str(size(data,2)-length(n2)) ' columns']) data = data(:,n2); % Update the contents of taxoninfo correspondingly reduced_taxoninfo = taxoninfo; names = fieldnames(taxoninfo); if ~isempty(names) for field=1:max(size(names)) fieldcontents = getfield(taxoninfo,char(names(field))); reduced_taxoninfo=setfield(reduced_taxoninfo,char(names(field)),fieldcontents(n2)); end end taxoninfo = reduced_taxoninfo; [spcoeff,simm,lapm] = laplacian(data,similarity_measure); not_outlier = (abs(spcoeff-mean(spcoeff))