P A D É

Padé
 is about calculating partial derivatives from cloud data. The data is merely a set of learning examples (points) with numerical attributes and numerical class variable.

Padé is a tool for discovering the qualitative relations in numerical data.

EXPERIMENTS


ARTIFICIAL DATA SETS

We sampled some mathematical functions and compared the results of numerical derivation to each of Padé's methods. Numerical derivation needs to know the undelying function while Padé is given only sampled points with no information about the function. Table 1 presents the results... shows how Padé can handle simple smooth functions, discontinuities and multi-valued function.

Table 1: The comparison of Padé's methods on artificial datasets generated by sampling (uniform random) the functions in 1000 points, no noise added. We used Mathematica to calculate true numerical derivatives in these points and compare their signs to Padé's results. The percentages in the cells show on how many Padé got the qualitative derivative right.

f(x,y)First TriangleStar RegressionTube Regression
df/dxdf/dydf/dxdf/dydf/dxdf/dy
x2 - y2
Plot3D[x ^2 - y^2, {x, -10, 10}, {y, -10, 10}]
x2y2.FT.dx
97.8%
x2y2.FT.dy
98.4%
x2y2.SR.dx
94.8%
x2y2.SR.dy
93.7%
x2y2.TR.dx
96.5%
x2y2.TR.dy
97.1%
x y
Plot3D[x y, {x, -10, 10}, {y, -10, 10}]
xy.FT.dx
99.3%
xy.FT.dy
99.6%
xy.SR.dx
92.0%
xy.SR.dy
93.6%
xy.TR.dx
99.8%
xy.TR.dy
99.4%
sin(x) sin(y)
Plot3D[Sin[x] Sin[y], {x, -7, 7}, {y, -7, 7}]

85.9%

88.6%

85.1%

83.4%

50.7%

54.7
im(arcsin((x + i y)4))
Plot3D[Im[ArcSin[(x + I y)^4]], {x, -2, 2}, {y, -2, 2}]

92.8%

93.5%

85.2%

87.3%

94.0%

96.4%
im(arctanh((x + i y)3))
Plot3D[Im[ArcTanh[(x + I y)^3]], {x, -2, 2}, {y, -2, 2}]

92.2%

93.3%

79.0%

83.1%

87.6%

91.9%
(x2 + y2) and ( - x2 - y2)
Plot3D[{x^2 + y^2, -x^2 - y^2}, {x, -2, 2}, {y, -2, 2},

63.8%

62.3%

50.7%

50.4%

97.1%

96.4%

Dimensionality
In the following experiment we show how robust Padé is working in high dimensions (e.g. 100 dimensional space). Again, we take the function x2 - y2, sample it randomly in interval [-10,10] x [-10,10] obtaining 1000 points (= learning examples). We blow the current 2D space to 100D space by adding attributes with random values from [-10,10] to the existing attributes, x and y. We use Padé's Tube regression to calculate the derivatives in this high dimensional space. We find qualitative decision trees to be most suitable to present the results (Table 2).

Table 2: Partial derivatives by Tube regression modeled by qualitative decision trees. We would like to point out two things: 1) the derivatives df/dx and df/dy are mainly correct inspite of high dimensional space in which they were computed (the trees for f = Q(x) and f = Q(y) make sense).
df / dxdf / dy

Note: Among Padé's methods only Tube regression works in high dimensions. First triangle and Star regression are based on Delaunay triangulation for which we use Qhull that supports only low dimensions (up to 8).

Time Complexity (Scalability)
First triangle and Star regression methods start with Delaunay triangulation which in average takes O(n log n). In dimensions greater than 3 time complexity of the qhull algorithm, which we used in Padé, depends on a number of details described in~\cite{qhull}. The current implementation does not work in dimensions higher than 8 which is effectively another constraint.

Having the space triangulated, First triangle then needs to find the triangle lying in the desired direction, which requires computing the determinant of a d-dimensional matrix (where d is the number of attributes). Such a triangle needs to be found for every point in space, for every attribute by which we compute the derivative. The running time strongly depends on the number of triangles that surround each point, which can not be estimated and usually rises exponentially with the number of dimensions. In practice, the method is fast on low dimensional data and gets slower when the number of dimensions increases.

Star regression is, in the first step, also heavily constrained by the triangulation algorithm. In the second step, it computes a star for each point which in the current implementation takes O(n2), where n is the number of points. Univariate linear regression on  the star then takes O(s), where s is the number of points in the star (usually 5 to 10).

Tube regression's time complexity is O(d n2), where d is again the dimension of the attribute space and n is the number of examples. The more complex step is finding appropriate nearest neighbors while regression itself is, like in Star regression, negligable and takes O(s) where s is the number of points in the tube. It is consistently the slowest of all methods.

Table 3 shows running times for Padé's methods on several domains. Please note that this experiment mostly reflects the implementation issues, which at this time were not our main concern. Most of the code is in Python. In Star regression, finding a star for each point is implemented in C++.  The experiments were performed on a 2 GHz laptop with 2GB of RAM.

Table 3: Run times (in seconds) for each of Padé's method calculating all the derivatives of class variable w.r.t. each attribute. The times in brackets denote the additional time needed for Delaunay triangulation where it is needed.
number of
attributes (d)
number of
examples (n)
First triangleStar regressionTube regression
x2 - y2210002.9 (0.016)0.4 (0.016)10.15
x y210003.48 (0.016)0.34 (0.016)10.77
sin(x) sin(y)210002.84 (0.016)0.37 (0.016)10.14
im(arcsin((x + i y)4))210002.89 (0.016)0.33 (0.016)10.51
im(arctanh((x + i y)3))210002.9 (0.016)0.36 (0.016)11.46
(x2 + y2) and ( - x2 - y2)210003.24 (0.016)0.41 (0.016)10.98
x2 - y221000034.15 (0.14)5.96 (0.14)1087.55
x2 - y2101000//54.79
/ qhull program ran out of memory.

Table 4: Running times of all three methods on data sets of different sizes. All times are in seconds.The experiments were performed on a 2 GHz laptop with 2GB of RAM.
Number of
attributes
Number of examples
10002000500010000
First
Triangle
Star
Regression
Tube
Regression
First
Triangle
Star
Regression
Tube
Regression
First
Triangle
Star
Regression
Tube
Regression
First
Triangle
Star
Regression
Tube
Regression
2
2.710.4510.37
5.251.0341.77
14.86.53259.92
42.9828.991159.34
4
242.785.0922.52
449.3519.2986.14
1082.55114.41532.08
2280.12473.262444.58
6
33049.391387.7932.62
/3908.69134.53
//857.95
//3924.64
8
//45.18
//176.94
//1163.02
//5143.41
10
//60.14
//232.04
//1627.23
//6694.18
/ program ran out of memory.


Noise
Regarding noise, Padé is no wizard but it can guess general trends quite well. Tube regression is very robust while First triangle and Star regression are certainly not the methods to be used on noisy data.

The image on the right shows an intersection of the x2 - y2 surface with an x-z plane to show the amount of noise we added to the class variable.

We used qualitative decision trees to model Padé's results. The splits should ideally be on 0 but the error is not so big considering the domain being [-100,100] x [-100,100]. The distributions of the examples in the leaves is given in thebrackets.




Parametric Padé

All the examples above have one thing in common - they are all static domains. In static domains we use spatial information to calculate the derivatives.
In dynamic domains things are a bit different, although they may visualy seem to be the same. Consider a set of learning examples obtained by sampling the parametric curve:

x = t Cos[t], y = t Sin[t], z = 2 t;      t in [-10Pi, 10Pi].

Looking only at the set of points, not knowing about the underlying function, which is usually the case in machine learning, we could as well imagine that a surface, similar to the last example in Table 1 was sampled. The difference between these two cases is that in the first case, each learning example has its own time stamp - we know the time at which it was sampled. Using this information we can see that the definition of the point's neighborhood is different. The points that are close together in time are true neighbors. The curve may actually return to the very same point after a long time but it will likely have different properties (e.g. velocity, acceleration) as before.

Parametric Padé takes time into account and uses the chain rule to calculate the qualitative dependencies between the class variable and the attributes. The following table presents the results of parametric Padé on a data set obtained by sampling the above parametric curve rather than the surface from Table 1.

x = t Cos[t], y = t Sin[t], z = 2 t, t in [-10Pi, 10Pi]df / dxdf / dy