correlation and copulas: part 2

Andrea C.
4 min read · Apr 4, 2018

We have seen here that there is a problem with correlation coefficients: they tend to work well only with linear or monotonic dependencies. Since there are many situations in which this is not the case, it would be nice to have something more… advanced.

In addition, there is another point at stake. Imagine your boss gives you two different datasets to analyze. Both consist of bivariate samples, and he wants to know something about their dependence structure: are they different? Similar? And if so, to what degree? So, we start by looking at the scatter plots.

dataset #1
dataset #2

They are of course very different, as we can easily see, and they also have very different correlation values: 0.7 vs. 0.31. So, apparently, the conclusion would be that we are looking at populations with very different dependence structures, but… we must remember the caveats related to the Pearson coefficient.

So, here comes the key idea: if we can somehow “normalize” both samples by transforming their marginals into a fixed, convenient distribution, all the information about the dependence structure will be contained in the resulting joint distribution.

The convenient distribution is the uniform distribution, and the resulting joint distribution is what statisticians call a copula function.

Therefore, a copula is a multivariate cumulative distribution whose marginals are all uniformly distributed on [0, 1].

Let’s now apply this “normalization” to our datasets.

the copula of dataset #1 …
… and the copula of dataset #2

What now? Not only do both datasets exhibit the same correlation coefficient, they actually appear to be the same data! What have we done?

By transforming the marginals, we have transferred all the information about the dependence into the joint distribution, so that it is now decoupled from the marginal distributions.
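As a concrete illustration, here is a minimal sketch of this “normalization” (my own sketch, not necessarily the code behind the plots above): the uniform pseudo-observations are obtained with a rank-based empirical cdf, and the bivariate sample below is a made-up stand-in for dataset #1.

```python
import numpy as np
from scipy.stats import rankdata

def to_copula(data):
    """Map each column of an (n, d) sample to uniform pseudo-observations
    using the rank-based empirical cdf."""
    n = data.shape[0]
    # ranks run from 1 to n; dividing by n + 1 keeps the values inside (0, 1)
    return np.column_stack([rankdata(col) / (n + 1) for col in data.T])

# made-up bivariate sample standing in for dataset #1
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.7 * x + rng.normal(size=1000)

u = to_copula(np.column_stack([x, y]))
# each column of u is (approximately) uniform on [0, 1];
# a scatter plot of u shows the copula of the sample
```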

The transformation of the marginals is possible thanks to another key concept to keep in mind: given a variable x, distributed according to any (univariate) distribution f(x), the variable y = F(x) is uniformly distributed, where F is its cumulative distribution function.

This transformation is called the probability integral transform:

probability integral transformation of x
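To see the probability integral transform in action, here is a small sketch (assuming SciPy; the exponential distribution and its scale are just an arbitrary example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# x follows an arbitrary univariate distribution (exponential, for example)
x = rng.exponential(scale=2.0, size=10_000)

# y = F(x): push the sample through its own cumulative distribution function
y = stats.expon(scale=2.0).cdf(x)

# y should now be uniform on [0, 1]; a Kolmogorov-Smirnov test agrees
print(stats.kstest(y, "uniform"))  # large p-value: no evidence against uniformity
```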

The theoretical framework of copulas is given by Sklar’s theorem. It states that any multivariate cumulative distribution H(x1, x2, …, xn) can be expressed in terms of its univariate marginal distribution functions F1, …, Fn and a copula C, which contains the dependence structure between the variables:

H(x1, x2, …, xn) = C(F1(x1), F2(x2), …, Fn(xn))
Moreover, this representation is unique, in the sense that only one copula exists for the given multivariate distribution, at least when the marginals are continuous (more generally, it is unique on the ranges of the marginals, but I will not go into these details).
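As a quick numerical check of the decomposition (a sketch, assuming SciPy’s multivariate_normal.cdf is available; the correlation and evaluation point are arbitrary), take a bivariate normal, whose copula is the Gaussian copula:

```python
import numpy as np
from scipy import stats

rho = 0.6
H = stats.multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]])

def gaussian_copula(u1, u2):
    # C(u1, u2) = H(F1^{-1}(u1), F2^{-1}(u2)), with standard normal marginals
    return H.cdf([stats.norm.ppf(u1), stats.norm.ppf(u2)])

x1, x2 = 0.3, -0.8
lhs = H.cdf([x1, x2])                                          # H(x1, x2)
rhs = gaussian_copula(stats.norm.cdf(x1), stats.norm.cdf(x2))  # C(F1(x1), F2(x2))
print(lhs, rhs)  # the two values coincide, as Sklar's theorem states
```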

I also want to mention another useful application of copulas: the generation of multivariate distributions for simulation and model-fitting purposes. When dealing with multivariate data, only a few sampling functions are available (you can check common statistical packages such as R, NumPy/SciPy, or Matlab). So, while it is easy to handle standard distributions, the multivariate normal being the most famous example, general modeling options are actually quite limited.

Moreover, while the marginals of a multivariate distribution are usually assumed to belong to the same distribution family, there are many concrete situations in which this hypothesis is not very realistic. As a matter of fact, modeling the dependence we actually need for our analysis often proves to be a hard task. To mention just one example, think about modeling the dependence within a financial portfolio composed of a large number of assets, each with different embedded sources of risk.

So, what if we need a multivariate distribution equipped with “custom” marginals? Let’s say, for instance, that we need to generate samples from a bivariate distribution whose marginals are gamma and Cauchy, respectively. Here is the recipe (a code sketch follows the list):

  1. Generate data from a well-known distribution, specifying the desired dependence structure. For instance, we could use the evergreen multivariate normal with a convenient covariance matrix.
  2. Turn the Gaussian marginals into uniform ones by applying the cdf transformation. At this point we obtain the copula of our data.
  3. Apply the inverse cdf transformation to each variable, where each inverse cdf corresponds to the desired marginal. In our case, these are the inverses of the gamma and Cauchy cdfs.
  4. Et voilà, the bets are placed!
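Here is a minimal sketch of the recipe (assuming NumPy/SciPy; the correlation value and the gamma shape parameter are arbitrary choices for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# 1. sample from a bivariate normal with the desired dependence structure
rho = 0.7
z = rng.multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]], size=5000)

# 2. Gaussian marginals -> uniform marginals: this is the copula of our data
u = stats.norm.cdf(z)

# 3. uniform marginals -> desired marginals, via the inverse cdfs (ppf in SciPy)
x1 = stats.gamma.ppf(u[:, 0], a=2.0)  # gamma marginal (shape chosen arbitrarily)
x2 = stats.cauchy.ppf(u[:, 1])        # Cauchy marginal

# 4. (x1, x2) now has gamma and Cauchy marginals, with the normal's dependence
```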

By the way, this is exactly the process used to generate the datasets above.

the process for generation of multivariate data

This is an example of a general method that can easily be applied to a large number of variables, with arbitrary marginals and dependence structures. It is also worth noting that several families of copulas exist and have been studied for applications in fields such as quantitative finance, civil engineering, and climate research.

If you are interested, you will find the code used for the copula datasets on gist.

