data_analysis [Wiki]

1. GSRI

1.1. Short description

1.1.1. An approach for “gene set analysis”, i.e. for assessing whether a group of functionally related features (genes, RNAs, proteins or …) is regulated.

1.1.2. The GSRI estimates the fraction of significantly regulated features.

1.1.3. For estimation, the empirical cumulative distribution function (ECDF) of the p-values is analyzed. An iterative estimation procedure is used to quantify the deviation from a uniform distribution of p-values (which corresponds to a diagonal line for the ECDF). It also enables calculation of standard errors for the fraction as well as significance statements.

1.1.4. In contrast to other similar approaches, no reference gene set which is NOT regulated (e.g. “all genes”) is required.

1.1.5. The most prominent similar approach is GSEA (gene set enrichment analysis)
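
The idea of estimating a regulated fraction from p-values alone can be illustrated with a minimal sketch. It is not the iterative GSRI procedure and not the API of any R package: it swaps in a simple Storey-type estimate, which uses the fact that p-values of unregulated features are uniformly distributed, so the share of p-values above a threshold indicates how many features are unregulated. The function name `regulated_fraction` and the threshold `lam` are assumptions made for this illustration.

```python
def regulated_fraction(pvalues, lam=0.5):
    """Storey-type estimate of the fraction of regulated features.

    Illustrative stand-in for the iterative GSRI estimation, not the
    GSRI algorithm itself. `lam` is an assumed tuning threshold.
    """
    if not pvalues:
        raise ValueError("at least one p-value is required")
    # Unregulated features have uniform p-values, so a share (1 - lam)
    # of them is expected above lam.
    share_above = sum(1 for p in pvalues if p > lam) / len(pvalues)
    unregulated = share_above / (1.0 - lam)
    # Clip to [0, 1] because the raw estimate can leave this range.
    return min(max(1.0 - unregulated, 0.0), 1.0)


# Synthetic gene set: 700 unregulated features (p-values on a uniform grid)
# and 300 regulated features (very small p-values).
unregulated_p = [(k + 0.5) / 700 for k in range(700)]
regulated_p = [0.001] * 300
r_hat = regulated_fraction(unregulated_p + regulated_p)
print(round(r_hat, 3))
```

As in the GSRI, only the p-values of the gene set itself enter the estimate; no reference set of unregulated genes is needed.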

1.2. Applicability/restrictions/pitfalls

1.2.1. The approach has been applied several times in application projects, where it worked as intended.

1.2.2. Drawback: Collaborators tend to prefer more prominent approaches.

1.3. Code availability

1.3.1. R-package “les” on Bioconductor

1.4. Publications from the Timmer group

1.4.1.

1.5. Side remark

1.5.1. Weighting the individual p-values leads to LES (section 2).

2. LES

2.1. Short description

2.2. Local estimate of the fraction of significant p-values

2.3. The approach has been developed for tiling arrays.

2.4. The approach can be applied if p-values from statistical tests are available in a spatial order (e.g. along the genome)

2.5. The GSRI estimates the fraction of significantly regulated features.

2.6. A smoothing window is applied in combination with an estimation similar to GSRI.

2.7. It enables significance statements about whether, at a given position, a significant fraction of p-values deviates from the uniform distribution.

2.8. The outcome can be used to rank genomic regions, i.e. for finding regions of interest.

2.9. In contrast to other similar approaches, no reference gene set which is NOT regulated (e.g. “all genes”) is required.

2.10. The most prominent similar approach is GSEA (gene set enrichment analysis)
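
The windowed estimate described in 2.2 to 2.7 can be sketched as follows. This is an illustration under stated assumptions, not the implementation in the “les” package: the triangular window, the Storey-type estimate of the unregulated share, and the names `local_regulated_fraction`, `half_width` and `lam` are all chosen for this example only.

```python
def local_regulated_fraction(positions, pvalues, half_width, lam=0.5):
    """Window-weighted estimate of the regulated fraction at each position.

    Illustrative sketch only: triangular weights and a Storey-type
    estimate are assumptions, not the procedure of the "les" package.
    """
    if len(positions) != len(pvalues):
        raise ValueError("positions and pvalues must have equal length")
    profile = []
    for center in positions:
        weight_sum = 0.0
        weight_above = 0.0
        for pos, p in zip(positions, pvalues):
            # Triangular window: weight 1 at the center, 0 at half_width.
            w = max(1.0 - abs(pos - center) / half_width, 0.0)
            weight_sum += w
            if p > lam:
                weight_above += w
        # Weighted share of p-values above lam, rescaled by (1 - lam).
        unregulated = weight_above / (weight_sum * (1.0 - lam))
        profile.append(min(max(1.0 - unregulated, 0.0), 1.0))
    return profile


# Synthetic probes along a genome: background p-values alternate between
# 0.25 and 0.75, and probes 80..119 carry very small p-values.
positions = list(range(200))
pvalues = [0.25 if i % 2 == 0 else 0.75 for i in positions]
for i in range(80, 120):
    pvalues[i] = 0.001
profile = local_regulated_fraction(positions, pvalues, half_width=10)

# Rank positions by the local estimate to find regions of interest.
top_position = max(positions, key=lambda i: profile[i])
```

Positions inside the regulated stretch receive estimates near one, background positions receive estimates near zero, so sorting by the profile yields a ranking of genomic regions.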

3. Applicability/restrictions/pitfalls

3.1. The R package was implemented by Julian Gehring. He was one of the most experienced R programmers in our group and later became a member of Wolfgang Huber’s lab (a major Bioconductor group). He used Bioconductor classes.

3.2. There were no other projects with similar data, i.e. where the approach could be applied.

4. Code availability

4.1. R-package “les” on Bioconductor

5. Publications from the Timmer group

5.1. Julian Gehring’s Master’s thesis

6. TSSi

6.1. Short description

6.1.1. Transcription Start Site Identification (TSSi) based on sequencing reads

6.1.2. The data did not uniquely indicate TSSs.

6.1.3. The approach has been applied to predict TSSs in the Physcomitrella patens genome. The results were available in the standard genome browser for this organism.

6.2. Applicability/restrictions/pitfalls

6.2.1. Similar data are probably no longer produced, so the approach might be obsolete.

6.3. Availability

6.3.1. R-package TSSi

6.4. Publication from the Timmer group

6.4.1.

7. Optimal Transformations

7.1. Short description

7.1.1. The Mean Optimal Transformation Approach (MOTA) was suggested for investigating non-identifiabilities.

7.1.2. It is based on the alternating conditional expectation (ACE) algorithm.

7.1.3. Non-parametric method based on kernel estimation to unravel arbitrary dependencies in data

7.1.4. It also works for relations that are not functions, e.g. a circle.
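
A minimal bivariate sketch of the alternating conditional expectation idea is given below. It is neither MOTA nor the interface of “acepack”: the Gaussian Nadaraya-Watson smoother, the relative bandwidth, the fixed number of iterations and all function names are assumptions made for illustration. The example uses y = x², where the linear correlation between x and y vanishes although y depends on x completely.

```python
import math


def kernel_smooth(target, given, bandwidth):
    """Nadaraya-Watson estimate of E[target | given] at each sample point."""
    smoothed = []
    for g in given:
        weights = [math.exp(-0.5 * ((g - h) / bandwidth) ** 2) for h in given]
        total = sum(w * t for w, t in zip(weights, target))
        smoothed.append(total / sum(weights))
    return smoothed


def center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def std(values):
    centered = center(values)
    return math.sqrt(sum(v * v for v in centered) / len(centered))


def standardize(values):
    scale = std(values)
    return [v / scale for v in center(values)]


def correlation(a, b):
    sa, sb = standardize(a), standardize(b)
    return sum(u * v for u, v in zip(sa, sb)) / len(sa)


def ace_bivariate(x, y, n_iter=10, rel_bandwidth=0.2):
    """Alternate between phi(x) = E[theta | x] and theta(y) = E[phi | y].

    rel_bandwidth (kernel width relative to the standard deviation) and
    n_iter are assumed tuning values, not recommendations.
    """
    hx = rel_bandwidth * std(x)
    hy = rel_bandwidth * std(y)
    theta = standardize(y)
    phi = center(kernel_smooth(theta, x, hx))
    for _ in range(n_iter):
        # theta is rescaled to unit variance to exclude the trivial solution.
        theta = standardize(kernel_smooth(phi, y, hy))
        phi = center(kernel_smooth(theta, x, hx))
    return theta, phi


x = [i / 50.0 - 1.0 for i in range(101)]
y = [v * v for v in x]
theta, phi = ace_bivariate(x, y)
raw_corr = correlation(x, y)        # close to zero for the parabola
ace_corr = correlation(theta, phi)  # large after the transformations
```

Every smoothing step evaluates a kernel estimate over all sample points, which is one reason why kernel-based approaches of this kind are practical mainly for low-dimensional problems.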

7.2. Applicability/restrictions/pitfalls

7.2.1. Because it is based on kernel estimation, the approach is restricted to low-dimensional problems.

7.3. Code availability

7.3.1. R-package MOTA (not maintained any more, see CRAN archive)

7.3.2. ACE is available as the R package “acepack”

7.3.3. Matlab code for ACE is available internally (ask Clemens)

7.4. Other related methods

7.4.1. ACE

7.5. Publications from the Timmer group

7.5.1. Hengl S et al. Data-based identifiability analysis of nonlinear dynamical models (2007)

7.6. Publication from external groups

7.6.1. Breiman & Friedman. Estimating optimal transformations for multiple regression and correlation. (1985)

8. Error Models

9. Retarded Transient Function

9.1. Under development

9.2. Short description

9.2.1. An explicit function whose shape is very similar to that of ODE solutions for signalling pathways

9.2.2. If only small amounts of data (observables) are available, the approach might serve as an alternative to traditional ODE modelling.

9.2.3. The approach provides self-explanatory parameters (amplitudes, response times, time-scales)

9.2.4. It can be fitted directly to data in order to obtain an explicit function describing the time dependency (like a smoothing spline)

9.2.5. It can be fitted to ODEs in order to obtain an approximation of the dynamics as an explicit function (e.g. for multiscale models)
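
To illustrate what an explicit function with self-explanatory parameters can look like, the sketch below combines a sustained and a transient response after a delay. The functional form and all names (`transient_response`, `a_sus`, `a_trans`, `tau_rise`, `tau_decay`, `t_delay`, `offset`) are hypothetical and chosen for this example; they are not necessarily the function used in the D2D example folder.

```python
import math


def transient_response(t, a_sus, a_trans, tau_rise, tau_decay,
                       t_delay=0.0, offset=0.0):
    """Hypothetical sustained-plus-transient response curve.

    a_sus, a_trans      : amplitudes of the sustained and transient parts
    tau_rise, tau_decay : time-scales of the onset and of the decay
    t_delay             : response time before which nothing happens
    offset              : baseline level
    """
    s = max(t - t_delay, 0.0)
    rise = 1.0 - math.exp(-s / tau_rise)
    sustained = a_sus * rise
    transient = a_trans * rise * math.exp(-s / tau_decay)
    return offset + sustained + transient


times = [0.5 * k for k in range(101)]  # 0, 0.5, ..., 50
curve = [
    transient_response(t, a_sus=1.0, a_trans=2.0, tau_rise=1.0,
                       tau_decay=5.0, t_delay=2.0, offset=0.5)
    for t in times
]
baseline = curve[0]   # equals the offset before the delay
peak = max(curve)     # overshoot caused by the transient part
```

Each parameter can be read off the curve: the baseline before the delay, the overshoot produced by the transient amplitude, and the plateau at offset plus sustained amplitude for late times.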

9.3. Availability

9.3.1. D2D is used for fitting

9.3.2. See: D2D Example folder (ToyModels/TransientFunction)

9.4. Applicability

9.4.1. Fitting is very robust

9.4.2. For data, the outcome is great in 90% of cases

9.4.3. For approximating ODEs, the performance depends on the model. The accuracy is better than the uncertainties of the data.

9.5. Publication

9.5.1. Submitted