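From now on I will try to use practical algorithms in my Kaggle write-ups, since it is time to program with my résumé in mind. I will also try to write paper-reading notes and similar posts in English.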

Background

Optiver Realized Volatility Prediction Competition.
This Kaggle project is about trying different methods to predict volatility for trading firms. Accurate volatility estimates are essential for trading options, and volatility is also an essential measure related to the price of the underlying product.
In short, we have to find the most effective approach to minimize RMSPE.

Given Data

dataset
├── book_test.parquet
├── book_train.parquet
├── trade_test.parquet
└── trade_train.parquet

Each folder contains one sub-folder per stock, named stock_id=n. The columns are:
- trade ['time_id', 'seconds_in_bucket', 'price', 'size', 'order_count']
- book ['time_id', 'seconds_in_bucket', 'bid_price1', 'ask_price1', 'bid_price2', 'ask_price2', 'bid_size1', 'ask_size1', 'bid_size2', 'ask_size2']
- train ['stock_id', 'time_id', 'target']
- test ['stock_id', 'time_id', 'row_id']

financial concepts

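Example order book:

| bid | price | ask |
|-----|-------|-----|
|     | 151   | 196 |
|     | 150   | 189 |
|     | 149   | 148 |
|     | 148   | 221 |
| 251 | 147   |     |
| 351 | 146   |     |
| 300 | 145   |     |
| 20  | 144   |     |
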
1.Content of an order book
- A list of buy and sell orders sorted by price, showing the number of shares being bid on or offered at each price point.
- In the given data, 'bid' is how many shares buyers want to buy at that price, and 'ask' is how many shares sellers offer at that price.
Each order book and trade book belongs to exactly one stock.
2.Trade procedure
- A trade happens when seller S offers shares and buyer B bids for them at the same price.
- B can raise his/her bid price and buy the shares offered by S.
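
To make the trade procedure concrete, here is a minimal sketch using the example book above. The `asks` dictionary and the `buy` helper are hypothetical names for illustration only, and the matching rule is deliberately simplified (the buyer takes the cheapest offers first, up to a limit price); real exchanges apply more detailed rules.

```python
# Ask side of the example book above: price -> number of shares offered.
asks = {151: 196, 150: 189, 149: 148, 148: 221}

def buy(asks, limit_price, shares):
    """Fill a buy order against the asks, cheapest first (hypothetical helper).

    Returns a list of (price, filled_shares) and updates `asks` in place.
    """
    fills = []
    for price in sorted(asks):
        if price > limit_price or shares == 0:
            break
        filled = min(shares, asks[price])
        asks[price] -= filled
        shares -= filled
        fills.append((price, filled))
        if asks[price] == 0:
            del asks[price]  # this price level is exhausted
    return fills

# B bids 147: no seller offers at 147 or lower, so nothing trades.
print(buy(asks, limit_price=147, shares=20))  # []
# B raises the bid to 148 and buys 20 of the shares offered there.
print(buy(asks, limit_price=148, shares=20))  # [(148, 20)]
print(asks[148])                              # 201
```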
3.Liquidity
There are several statistics that analysts use to estimate the liquidity of an order book.
- WAP (weighted averaged price) takes both the price level and the size of the best orders into account:
$$\text{WAP} = \frac{\text{BidPrice}_1 \cdot \text{AskSize}_1 + \text{AskPrice}_1 \cdot \text{BidSize}_1}{\text{BidSize}_1 + \text{AskSize}_1}$$
Code for the WAP calculation, which adds a 'wap' column:
# weight each side's best price by the size on the opposite side
book_parquet['wap'] = (
    book_parquet['bid_price1'] * book_parquet['ask_size1']
    + book_parquet['ask_price1'] * book_parquet['bid_size1']
) / (book_parquet['bid_size1'] + book_parquet['ask_size1'])
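
As a quick sanity check, the WAP formula can be evaluated on the best bid and ask of the example book above (251 shares bid at 147, 221 shares offered at 148). This is only a sketch: the one-row `book` frame is made up for illustration and is not competition data.

```python
import pandas as pd

# One hypothetical book snapshot built from the best levels of the example book.
book = pd.DataFrame({
    "bid_price1": [147.0], "bid_size1": [251],
    "ask_price1": [148.0], "ask_size1": [221],
})

book["wap"] = (
    book["bid_price1"] * book["ask_size1"]
    + book["ask_price1"] * book["bid_size1"]
) / (book["bid_size1"] + book["ask_size1"])

print(round(book["wap"].iloc[0], 4))  # 147.5318
```

Because the bid side is larger (251 vs 221), the WAP sits slightly above the mid price of 147.5, closer to the ask.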

4.Log returns
Another vital measure, used to compare the price of a stock at two points in time (for example, yesterday and today).
Let $S_t$ be the price of the stock at time $t$; the log return between $t_1$ and $t_2$ is
$$r_{t_1, t_2} = \log{\frac{S_{t_2}}{S_{t_1}}}$$
Note that the host wants competitors to use WAP to compute log returns, and assumes that log returns have zero mean.
The code for LogReturn is as follows; it adds the result to the book table.
Additionally, we should drop the NaN rows:

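import numpy as np

def LogReturn(WAP):
    # log(S_t / S_{t-1}) between consecutive rows
    return np.log(WAP).diff()

# compute per time_id so that returns do not span two buckets
book_parquet['log_return'] = book_parquet.groupby('time_id')['wap'].transform(LogReturn)
# drop NaN rows (the first row of each time_id has no previous WAP)
book_parquet = book_parquet[~book_parquet['log_return'].isnull()]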

5.Realized Volatility
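Volatility is usually described as the annualized standard deviation of log returns. In this competition, the realized volatility $\sigma$ is computed as the square root of the sum of squared log returns over consecutive book updates: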
$$\sigma = \sqrt{\sum\limits_t{r^2_{t-1,t}}}$$
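
The formula above can be sketched in a few lines. The function names `log_return` and `realized_volatility` and the WAP values are assumptions made for illustration; in practice the input would be the WAP series of one stock within one time_id.

```python
import numpy as np
import pandas as pd

def log_return(wap: pd.Series) -> pd.Series:
    """Log return between consecutive observations: log(S_t / S_{t-1})."""
    return np.log(wap).diff().dropna()

def realized_volatility(log_returns: pd.Series) -> float:
    """Square root of the sum of squared log returns."""
    return float(np.sqrt(np.sum(log_returns ** 2)))

# Made-up WAP values for one stock in one time bucket.
wap = pd.Series([1.000, 1.001, 0.999, 1.002])
vol = realized_volatility(log_return(wap))
print(vol)
```

Four prices give three log returns, so the sum runs over three terms here.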

Different stocks have different volatility characteristics, so a 'stock_id' column should be added to each stock's data:

# i is the id of the stock whose data is currently loaded
stock_id = i
book_parquet.loc[:,'stock_id'] = stock_id

Evaluation

The evaluation metric is Root Mean Square Percentage Error, as:
$$\text{RMSPE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} ((y_i - \hat{y}_i)/y_i)^2}$$
The formula above can be implemented as:

import numpy as np

def RMSPE(yhat, data):
    # data must expose get_label(), e.g. an xgboost.DMatrix or lightgbm.Dataset
    y = data.get_label()
    elements = ((y - yhat) / y) ** 2
    return float(np.sqrt(np.sum(elements) / len(y)))
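
For checking predictions outside of a training loop, the same metric can be written against plain arrays. This is a minimal sketch; the name `rmspe` and the sample numbers are made up for illustration.

```python
import numpy as np

def rmspe(y_true, y_pred) -> float:
    """Root mean square percentage error on array-like inputs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(((y_true - y_pred) / y_true) ** 2)))

# Both predictions are off by 10% of their target, so the result is about 0.1.
print(rmspe([1.0, 2.0], [1.1, 1.8]))
```

Since the error is relative to the target, the same absolute miss is penalized more heavily on a low-volatility row than on a high-volatility one.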

Method(s)

I looked through the Discussion board and found that most people are using XGBoost and LightGBM. I will start with the data processing module. The baseline is implemented with XGBoost, and LightGBM will be done later.

data processing

First, check how we should process the Parquet files.
Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
The host provided code for processing the columnar files.
I am also going to try running this method and data on Spark. The code will be released later on GitHub.
Process code:
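
The host's code is not reproduced here. As a stand-in, the following is a minimal sketch of how one stock's partition might be read with pandas. It assumes the layout shown under Given Data (a `dataset` root with `stock_id=n` sub-folders) and an installed Parquet engine such as pyarrow; `partition_path` and `load_partition` are hypothetical helper names, not part of the host's code.

```python
from pathlib import Path

import pandas as pd

def partition_path(kind: str, split: str, stock_id: int, root: str = "dataset") -> Path:
    """Path of one stock's partition, e.g. dataset/book_train.parquet/stock_id=0."""
    return Path(root) / f"{kind}_{split}.parquet" / f"stock_id={stock_id}"

def load_partition(kind: str, split: str, stock_id: int, root: str = "dataset") -> pd.DataFrame:
    """Read one partition and attach its stock_id as a column."""
    df = pd.read_parquet(str(partition_path(kind, split, stock_id, root)))
    # Set stock_id explicitly, since it may not be present as a column
    # when a single partition folder is read on its own.
    df["stock_id"] = stock_id
    return df

# Usage (needs the competition data on disk):
# book = load_partition("book", "train", stock_id=0)
```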



APIs

You can use lightgbm.plot_importance (https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.plot_importance.html) to see the feature importance of your model.