Hide

Problem B
Linear Regression

Implement simple linear regression in d dimensions:

f(x)=w0+w1x1+w2x2++wdxd.

In other words, choose the values w0 through wd that make the best linear fit of a set of (x,y) example pairs. The error of the prediction f(x) for target value y is (f(x)y)2 (i.e. squared error).

\includegraphics[width=0.5\textwidth ]{simple.png}
Figure 1: Plots of training set (top) and testing set (bottom) for the sample data. For this problem, the training set is given as input, and the testing set is listed as output (but not given to your submission).

Input

Input for your submission is a training set of n examples in d dimensions.

The first line contains two integers 1n10000 and 1d10. Following are n lines, each containing d+1 real numbers. For line i (1in), the first d numbers are xi,1 through xi,d, and the last is for the corresponding yi value. Each real value has at most 5 digits past the decimal point and is in the range [2109,2109].

It is guaranteed that no feature is constant and each pair of features is not collinear. Specifically, let μj=1ni=1nxi,j be the mean of feature j. Then it is guaranteed that

0.001i=1n|xi,jμj|

(no feature is constant), and also that for any pair of features 1j<kd:

|i=1n(xi,jμj)(xi,kμk)|i=1n(xi,jμj)2i=1n(xi,kμk)20.99

(no pair of features is collinear).

Output

Implement an algorithm to create a linear model that fits the training data. Recommended approaches are gradient descent, least-squares estimation using the Moore-Penrose pseudoinverse, or ridge regression.

Output d+1 real values corresponding to the coefficients w0 to wd in the definition of f(x) above.

Note that the sample output shown is actually the test data on which a model would be evaluated (see below). It is in the same format as the input. A reasonable output for the given sample input might be:

8.60271 1.19910

which would give a score of about 96.31 on the test data, according to the metric below.

Scoring

For each test file, the output of your submission will be used to construct a linear model to evaluate on another set of data, called the testing set. The test set comes from a distribution similar to the training set, but is not known to your submission.

Assume that the test data has m examples {(x1,y1),,(xm,ym)}. The score your model receives for each test case is based on the following:

100(1i=1m(f(xi)yi)2i=1m(μyyi)2)

where μy=1mi=1myi is the average y value. A value close to 100 is good, as that indicates the linear model f is predicting y well. If you are familiar with the r2 coefficient of determination, this score is equal to 100r2. We clamp the score to be within the range [0,100].

The overall score for your submission is the average score over all hidden test files. The sample data is excluded from this average.

Sample Input 1 Sample Output 1
5 1
1 10
2 11
3 12
4 13
5 15
5 1
0 9.5
2 10
6 16.5
7 17.5
8 18
Hide

Please log in to submit a solution to this problem

Log in