If you use funscoR please cite:
Ochoa et al. The functional landscape of the human phosphoproteome. bioRxiv (2019). https://doi.org/10.1101/541656
First, install funscoR from GitHub. This requires the devtools package to be installed.

    devtools::install_github("evocellnet/funscoR")

To get started, load the packages and some sample data.
A reference human phosphoproteome is provided in the `phosphoproteome` object. This data frame lists phosphorylated residues in the format shown below.
| acc | position | residue |
|---|---|---|
| A0A075B6Q4 | 24 | S |
| A0A075B6Q4 | 35 | S |
| A0A075B6Q4 | 57 | S |
| A0A075B6Q4 | 68 | S |
| A0A075B6Q4 | 71 | S |
| A0A075B6Q4 | 72 | S |
A different reference phosphoproteome can be used as the starting point, provided it follows the same format. Be aware that certain annotations for the new reference set of sites may be required to achieve equivalent performance.
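As a minimal sketch of what "the same format" might look like, the snippet below builds a small custom reference set with the three columns shown in the table above. The object name `my_phosphoproteome` and the sanity checks are illustrative assumptions and are not part of funscoR.

```r
# Hypothetical custom reference set mirroring the columns of `phosphoproteome`:
# acc (protein accession), position (residue number), residue (S, T or Y).
# The rows are copied from the example table above.
my_phosphoproteome <- data.frame(
  acc = c("A0A075B6Q4", "A0A075B6Q4", "A0A075B6Q4"),
  position = c(24L, 35L, 57L),
  residue = c("S", "S", "S"),
  stringsAsFactors = FALSE
)

# Illustrative checks written in base R (these are not funscoR functions).
required_cols <- c("acc", "position", "residue")
stopifnot(all(required_cols %in% names(my_phosphoproteome)))
stopifnot(all(my_phosphoproteome$residue %in% c("S", "T", "Y")))
stopifnot(!any(duplicated(my_phosphoproteome[, c("acc", "position")])))
```

Checking the column names and residue codes up front can make format problems easier to spot before any annotation step is run.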
A gold standard of known regulatory sites is required to train the model. Parsed annotations from PhosphoSitePlus (see license) are provided in this package in the `psp` object.
| acc | position |
|---|---|
| O95786 | 854 |
| Q8TD46 | 302 |
| P60484 | 380 |
| Q9UPY3 | 1016 |
| P35367 | 142 |
| O43684 | 19 |
A different gold standard can be used in the downstream analysis as long as the provided data frame contains the acc and position columns.
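The following sketch shows one way a custom gold standard with `acc` and `position` columns could be matched against a set of sites. All accessions and positions here are invented placeholders, and the matching logic only illustrates the idea; it is not a description of how funscoR uses the gold standard internally.

```r
# Hypothetical gold standard: only `acc` and `position` are needed.
# The accessions below are placeholders, not real UniProt entries.
my_gold_standard <- data.frame(
  acc = c("ACC0001", "ACC0002"),
  position = c(15L, 302L),
  stringsAsFactors = FALSE
)

# Hypothetical sites to be labelled against the gold standard.
sites <- data.frame(
  acc = c("ACC0001", "ACC0001", "ACC0002", "ACC0003"),
  position = c(15L, 40L, 302L, 7L),
  stringsAsFactors = FALSE
)

# Build a combined accession/position key and flag sites found in the gold standard.
site_key <- paste(sites$acc, sites$position, sep = "_")
gold_key <- paste(my_gold_standard$acc, my_gold_standard$position, sep = "_")
sites$is_regulatory <- site_key %in% gold_key

print(sites)
```

A quick overlap check like this can help confirm that a custom gold standard uses the same accession and numbering scheme as the reference phosphoproteome.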
An extensive functional annotation of the human phosphoproteome is provided as part of this package. The available annotations are listed below.
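
    # List all feature data sets shipped with the package
    library(dplyr)    # %>%, filter(), select()
    library(stringr)  # str_detect()
    library(knitr)    # kable()
    data(package = "funscoR")$results %>%
      as.data.frame() %>%
      filter(str_detect(Item, "feature")) %>%
      select(Item) %>%
      kable()

| Item |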
|---|
| feature_disopred |
| feature_disprot |
| feature_domains |
| feature_elm |
| feature_evmut |
| feature_exac |
| feature_foldx |
| feature_hotspots |
| feature_interfaces |
| feature_ms_pride |
| feature_neighPTMs |
| feature_netphorest |
| feature_paxdb |
| feature_proteinlength |
| feature_ptmdb_age |
| feature_ptmdb_coregulation |
| feature_ptmdb_counts |
| feature_ptmdb_regulation |
| feature_pwm_match |
| feature_scratch1Dfeatures |
| feature_sift_scores |
| feature_spectral_counts |
| feature_topology |
| feature_transitpeptides |
Each annotation object can be used independently. For example, the `feature_ptmdb_age` object, shown below, contains the ancestral reconstruction of all available phosphosites. The column `w0_mya` contains the inferred age of the last common ancestor for the phosphosite, and `w3_mya` contains the equivalent information using a window of +/-3 residues to assess conservation. More information about the dataset can be found using `?feature_ptmdb_age`.
| acc | residue | position | w0_mya | w0_ancestor_name | w3_mya | w3_ancestor_name |
|---|---|---|---|---|---|---|
| P84085 | T | 4 | 96 | Boreoeutheria | 96 | Boreoeutheria |
| P84085 | S | 6 | 96 | Boreoeutheria | 96 | Boreoeutheria |
| P84085 | S | 10 | 96 | Boreoeutheria | 96 | Boreoeutheria |
| P84085 | S | 103 | 96 | Boreoeutheria | 824 | Bilateria |
| P84085 | S | 137 | 96 | Boreoeutheria | 0 | Homo sapiens |
| P84085 | S | 150 | 96 | Boreoeutheria | 96 | Boreoeutheria |
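To illustrate using an annotation table on its own, the sketch below rebuilds the rows shown above as a plain data frame (a subset of the columns) and selects the sites where the single-residue and window-based age estimates differ. The object names `age` and `differs` are illustrative; with the package installed, the same filter could presumably be applied to the real annotation object.

```r
# Rows copied from the example table above (ancestor name columns omitted).
age <- data.frame(
  acc = rep("P84085", 6),
  residue = c("T", "S", "S", "S", "S", "S"),
  position = c(4L, 6L, 10L, 103L, 137L, 150L),
  w0_mya = rep(96, 6),
  w3_mya = c(96, 96, 96, 824, 0, 96),
  stringsAsFactors = FALSE
)

# Keep the sites where the +/-3 window estimate disagrees with the
# single-residue estimate.
differs <- age[age$w3_mya != age$w0_mya, ]

print(differs[, c("acc", "position", "w0_mya", "w3_mya")])
```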
To train a model, you will typically want to annotate the phosphoproteome with all the available features. The `annotate_sites` function can be used for this.
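A minimal sketch of the annotation step is given below. It assumes that `annotate_sites()` accepts the reference data frame as its first argument and returns a data frame, which matches how `annotated_phos` is used in the preprocessing step but is not confirmed here. The `requireNamespace()` guard is only there so the snippet does nothing when funscoR is not installed.

```r
# Result stays NULL unless funscoR is available.
annotated_phos <- NULL

if (requireNamespace("funscoR", quietly = TRUE)) {
  library(funscoR)

  # Load the bundled reference phosphoproteome (object name as described above).
  data("phosphoproteome", package = "funscoR")

  # Assumption: annotate_sites() takes the reference data frame directly.
  annotated_phos <- annotate_sites(phosphoproteome)

  # Peek at the first few annotated columns.
  str(annotated_phos, list.len = 5)
}
```

Depending on the number of features, this step may take a while on the full reference set.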
A preprocessing step must be run to ensure the features are provided to the model in the proper form. The function `preprocess_features` defaults to a series of methods, but additional tuning can be applied using the `methods=` and `features_to_exclude=` arguments. Different preprocessing steps are necessary for “ST” and “Y” residues, as some of the features are exclusive to each of the sets.

    ## preprocess features for training
    ST_features <- preprocess_features(annotated_phos, "ST")
    Y_features <- preprocess_features(annotated_phos, "Y")

Once the features are ready, a model can be trained using a provided gold standard. The default algorithm is a Gradient Boosting Machine with a series of hyperparameters optimized for the default set. Different algorithms and settings can be provided using the `parameters=` argument. The training process can be parallelized using the doParallel package if the `ncores` argument is greater than 1.
Given an annotated phosphoproteome with preprocessed features and a trained model, new functional scores can be predicted for “ST” and “Y” separately.