Citation

If you use funscoR please cite:

Ochoa et al. The functional landscape of the human phosphoproteome. bioRxiv (2019). https://doi.org/10.1101/541656

Installation

First, install funscoR from GitHub. This requires the devtools package, which can be installed from CRAN with install.packages("devtools").

devtools::install_github("evocellnet/funscoR")

Getting started

To get started, fire up the packages and load some sample data.

## Load the required packages
library(funscoR)
library(knitr)
library(dplyr)
library(stringr)

Datasets

Phosphoproteome

A reference human phosphoproteome is provided in the phosphoproteome object. The data frame lists phosphorylated residues by protein accession (acc), position and residue, as shown below.

phosphoproteome %>%
  head() %>%
  kable()

acc	position	residue
A0A075B6Q4	24	S
A0A075B6Q4	35	S
A0A075B6Q4	57	S
A0A075B6Q4	68	S
A0A075B6Q4	71	S
A0A075B6Q4	72	S

A different reference phosphoproteome can be used as a starting point if it is provided in the same format. Be aware that some annotations for the reference set of sites might be required to achieve equivalent performance.
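As a minimal sketch of the expected format, a custom reference set could be assembled as a data frame with the same three columns. The rows below are copied from the preview above; the object name my_phosphoproteome and the sanity checks are illustrative assumptions, not part of the package.

```r
# Hypothetical custom reference set with the same columns as `phosphoproteome`
my_phosphoproteome <- data.frame(
  acc = c("A0A075B6Q4", "A0A075B6Q4", "A0A075B6Q4"),
  position = c(24L, 35L, 57L),
  residue = c("S", "S", "S"),
  stringsAsFactors = FALSE
)

# Basic sanity checks before using the table downstream
stopifnot(
  identical(colnames(my_phosphoproteome), c("acc", "position", "residue")),
  all(my_phosphoproteome$residue %in% c("S", "T", "Y")),
  !any(duplicated(my_phosphoproteome[, c("acc", "position")]))
)
```

Checks like these only confirm the shape of the table; they say nothing about whether the sites are covered by the bundled annotations.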

Gold standard

A gold standard of known regulatory sites is required to train the model. Parsed annotations from PhosphoSitePlus (see license) are provided in this package in the object psp.

psp %>%
  head() %>%
  kable()

acc	position
O95786	854
Q8TD46	302
P60484	380
Q9UPY3	1016
P35367	142
O43684	19

A different gold standard can be used in the downstream analysis as long as the provided data frame contains the acc and position columns.
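The sketch below illustrates the minimal format of a custom gold standard and how its sites can be matched against a site table on acc and position. The rows are taken from the previews above; the names my_gold, sites and is_regulatory are hypothetical, and how the package matches sites internally is not described here.

```r
# Toy site table (rows taken from the previews above)
sites <- data.frame(
  acc = c("A0A075B6Q4", "O95786", "P60484"),
  position = c(24L, 854L, 380L),
  stringsAsFactors = FALSE
)

# Hypothetical custom gold standard: only `acc` and `position` are required
my_gold <- data.frame(
  acc = c("O95786", "P60484"),
  position = c(854L, 380L),
  stringsAsFactors = FALSE
)

# Flag sites that appear in the gold standard by matching on acc + position
site_key <- paste(sites$acc, sites$position, sep = "_")
gold_key <- paste(my_gold$acc, my_gold$position, sep = "_")
sites$is_regulatory <- site_key %in% gold_key
```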

Phosphoproteome annotation

An extensive functional annotation for the human phosphoproteome is provided as part of this package. The available annotations are listed below.

data(package = "funscoR")$results %>% 
  as.data.frame() %>% 
  filter(str_detect(Item, "feature")) %>% 
  select(Item) %>%
  kable()

Item
feature_disopred
feature_disprot
feature_domains
feature_elm
feature_evmut
feature_exac
feature_foldx
feature_hotspots
feature_interfaces
feature_ms_pride
feature_neighPTMs
feature_netphorest
feature_paxdb
feature_proteinlength
feature_ptmdb_age
feature_ptmdb_coregulation
feature_ptmdb_counts
feature_ptmdb_regulation
feature_pwm_match
feature_scratch1Dfeatures
feature_sift_scores
feature_spectral_counts
feature_topology
feature_transitpeptides

Each of the annotation objects can be used independently. For example, the feature_ptmdb_age object contains the ancestral reconstruction of all available phosphosites. The column w0_mya contains the inferred age of the last common ancestor for the phosphosite. w3_mya contains the equivalent information using a window of +/-3 residues to assess conservation. More information about the dataset can be found using ?feature_ptmdb_age.

feature_ptmdb_age %>%
  head() %>%
  kable()

acc	residue	position	w0_mya	w0_ancestor_name	w3_mya	w3_ancestor_name
P84085	T	4	96	Boreoeutheria	96	Boreoeutheria
P84085	S	6	96	Boreoeutheria	96	Boreoeutheria
P84085	S	10	96	Boreoeutheria	96	Boreoeutheria
P84085	S	103	96	Boreoeutheria	824	Bilateria
P84085	S	137	96	Boreoeutheria	0	Homo sapiens
P84085	S	150	96	Boreoeutheria	96	Boreoeutheria
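As a small illustration of working with one annotation on its own, the sketch below rebuilds the six rows shown above (ancestor name columns omitted) and selects the sites where the window-based age differs from the single-residue age. The object names age and differs are illustrative.

```r
# The six rows from the preview above, without the ancestor name columns
age <- data.frame(
  acc = rep("P84085", 6),
  residue = c("T", "S", "S", "S", "S", "S"),
  position = c(4L, 6L, 10L, 103L, 137L, 150L),
  w0_mya = rep(96, 6),
  w3_mya = c(96, 96, 96, 824, 0, 96),
  stringsAsFactors = FALSE
)

# Sites where the +/-3 window gives a different age than the residue alone
differs <- age[age$w0_mya != age$w3_mya, c("acc", "position", "w0_mya", "w3_mya")]
print(differs)
```

With the package loaded, the same filter could be applied to the full table, for example with feature_ptmdb_age %>% filter(w0_mya != w3_mya).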

In order to train a model, you might be interested in annotating the phosphoproteome with all the available features. You can use the annotate_sites function for this.

## annotate phosphoproteome with features
annotated_phos <- annotate_sites(phosphoproteome)
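The internals of annotate_sites are not shown here. Conceptually, annotating a site table amounts to joining feature tables onto it by accession and position, which the toy example below illustrates with base R. This is an illustration of the general idea under that assumption, not the package's implementation; the object names are hypothetical and the rows come from the previews above.

```r
# Toy site table: two sites with age information and one without
sites <- data.frame(
  acc = c("P84085", "P84085", "A0A075B6Q4"),
  position = c(4L, 103L, 24L),
  stringsAsFactors = FALSE
)

# Toy feature table (subset of the feature_ptmdb_age preview)
age <- data.frame(
  acc = c("P84085", "P84085"),
  position = c(4L, 103L),
  w0_mya = c(96, 96),
  w3_mya = c(96, 824),
  stringsAsFactors = FALSE
)

# Left join: keep every site, fill in features where available, NA otherwise
annotated <- merge(sites, age, by = c("acc", "position"), all.x = TRUE)
print(annotated)
```

Sites without a matching feature row end up with missing values, which is one reason a preprocessing step is needed before training.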

Model training

A preprocessing step must be run to ensure the features are properly provided to the model. The function preprocess_features defaults to a series of methods, but additional tuning can be applied using the methods= and features_to_exclude= arguments. Different preprocessing steps are necessary for “ST” and “Y” residues, as some of the features are exclusive to each of the sets.

## preprocess features for training
ST_features <- preprocess_features(annotated_phos, "ST")
Y_features <- preprocess_features(annotated_phos, "Y")

Once the features are ready, a model can be trained using a provided gold standard. The default algorithm is a Gradient Boosting Machine with a series of hyperparameters optimized to the default set. Different algorithms and settings can be provided using the parameters= argument. The training process can be parallelized using the doParallel package if the ncores parameter exceeds 1.

## train new model
ST_model <- train_funscore(ST_features, "ST", psp, ncores = 4)
Y_model <- train_funscore(Y_features, "Y", psp, ncores = 4)
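The value of ncores = 4 above is only an example. One way to choose it, sketched here with the parallel package that ships with R, is to use the number of detected cores minus one. The variable names are illustrative.

```r
# Detect available cores; detectCores() can return NA on some platforms
available <- parallel::detectCores()
if (is.na(available)) available <- 1L

# Leave one core free, but never go below 1
ncores <- max(1L, available - 1L)

# The value can then be passed to the training call shown above, e.g.:
# ST_model <- train_funscore(ST_features, "ST", psp, ncores = ncores)
```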

Predicting functional scores

Given an annotated phosphoproteome with preprocessed features and a trained model, new functional scores can be predicted for “ST” and “Y” separately.

## predict funscoR scores for all sites
ST_scores <- predict_funscore(ST_features, ST_model, ncores = 4)
Y_scores <- predict_funscore(Y_features, Y_model, ncores = 4)

## gather all predictions
all_scores <- bind_rows(ST_scores, Y_scores) %>%
  mutate(probabilities = log_scaling(probabilities))
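Once scores are gathered, a common next step is to rank sites and keep the highest-scoring ones. The sketch below uses a toy stand-in for all_scores with made-up values. Only the probabilities column name comes from the code above; the sites column and the 0.5 cutoff are assumptions for illustration.

```r
# Toy stand-in for `all_scores`; `sites` is a hypothetical identifier column
toy_scores <- data.frame(
  sites = c("site_a", "site_b", "site_c"),
  probabilities = c(0.12, 0.87, 0.45),
  stringsAsFactors = FALSE
)

# Sort from highest to lowest score
ranked <- toy_scores[order(toy_scores$probabilities, decreasing = TRUE), ]

# Keep sites above an arbitrary, illustrative cutoff
top_sites <- ranked[ranked$probabilities >= 0.5, ]
print(top_sites)
```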

funscoR: Functional scoring of human phosphosites