To better track energy consumption, the government wants energy suppliers to install smart meters in every home in England, Wales and Scotland. There are more than 26 million homes for the energy suppliers to reach, with the goal of every home having a smart meter by 2020.
This rollout of meters is led by the European Union, which asked all member governments to look at smart meters as part of measures to upgrade the energy supply and tackle climate change. After an initial study, the British government decided to adopt smart meters as part of its plan to update the country's ageing energy system.
This dataset is a refactored version of data from the London Datastore. It contains energy consumption readings for a sample of 5,567 London households that took part in the UK Power Networks-led Low Carbon London project between November 2011 and February 2014. The smart-meter data covers electricity consumption only.
Goal: forecast the electricity load, taking weather, time, season, holidays and demand response into account.
Data source: https://www.kaggle.com/jeanmidev/smart-meters-in-london
Processing pipeline:
- Merge all the raw data files
- Normalize the inconsistent household counts using each household's daily energy consumption
- Explore the relationship between factors such as weather conditions and energy consumption
- Add UK bank-holiday data to the daily-level data as an indicator
- Fit a SARIMAX model
- Fit an LSTM model
Directory structure:
Environment:
Keras==2.0.2
TensorFlow==1.15.5
scikit-learn==0.24.1
Data consolidation
import pandas as pd

# We forecast future energy demand, so we only need the daily total per household:
# keep just day, LCLid and energy_sum from every block file.
def first():
    for num in range(0, 112):
        df = pd.read_csv("data/daily_dataset/block_" + str(num) + ".csv")
        df = df[['day', 'LCLid', 'energy_sum']]
        df.to_csv("data/hc/hc_" + str(num) + ".csv", index=False)
    fout = open("data/energy.csv", "a")
    # write the first file with its header ...
    for line in open("data/hc/hc_0.csv"):
        fout.write(line)
    # ... then append the remaining files without their headers
    # (starting from 1, not 0, so block 0 is not written twice)
    for num in range(1, 112):
        f = open("data/hc/hc_" + str(num) + ".csv")
        f.readline()  # skip the header
        for line in f:
            fout.write(line)
        f.close()
    fout.close()
first()
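The consolidation loop above can also be condensed with `pd.concat`. The sketch below assumes the same `block_<n>.csv` layout; `consolidate` and the temporary directory are illustrative, not part of the original pipeline.

```python
import os
import tempfile
import pandas as pd

def consolidate(block_dir, out_path, n_blocks):
    # Read each block, keep only the needed columns, write one combined CSV.
    frames = []
    for num in range(n_blocks):
        df = pd.read_csv(os.path.join(block_dir, "block_%d.csv" % num))
        frames.append(df[['day', 'LCLid', 'energy_sum']])
    pd.concat(frames, ignore_index=True).to_csv(out_path, index=False)

# toy demonstration with two tiny blocks
tmp = tempfile.mkdtemp()
for num, rows in enumerate([2, 3]):
    pd.DataFrame({'day': ['2013-01-01'] * rows,
                  'LCLid': ['MAC%d' % i for i in range(rows)],
                  'energy_sum': [1.0] * rows}).to_csv(
        os.path.join(tmp, 'block_%d.csv' % num), index=False)
consolidate(tmp, os.path.join(tmp, 'energy.csv'), 2)
combined = pd.read_csv(os.path.join(tmp, 'energy.csv'))
print(len(combined))  # 5
```

Writing one combined file with a single header avoids the header-skipping bookkeeping of the line-by-line approach.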
Data collection differs across households, so we use energy per household, rather than the raw total, as the forecast target.
There are also a great many distinct households, so we aggregate across them: the final goal is to forecast overall consumption, not consumption at the household level.
energy = pd.read_csv('data/energy.csv')
housecount = energy.groupby('day')[['LCLid']].nunique()
#print(housecount.head(4))
energy = energy.groupby('day')[['energy_sum']].sum()
energy = energy.merge(housecount, on = ['day'])
energy = energy.reset_index()
energy.count()
energy.day = pd.to_datetime(energy.day,format='%Y-%m-%d').dt.date
energy['avg_energy'] = energy['energy_sum']/energy['LCLid']
#print(energy.describe())
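As a sanity check, the groupby-merge-divide normalization above behaves like this on toy data (the household IDs and readings are made up, and the merge is condensed slightly with explicit `reset_index` calls):

```python
import pandas as pd

energy = pd.DataFrame({
    'day': ['2013-01-01', '2013-01-01', '2013-01-02', '2013-01-02', '2013-01-02'],
    'LCLid': ['MAC1', 'MAC2', 'MAC1', 'MAC2', 'MAC3'],
    'energy_sum': [10.0, 14.0, 9.0, 12.0, 15.0],
})
# households reporting on each day
housecount = energy.groupby('day')[['LCLid']].nunique().reset_index()
# total consumption on each day
daily = energy.groupby('day')[['energy_sum']].sum().reset_index()
daily = daily.merge(housecount, on='day')
daily['avg_energy'] = daily['energy_sum'] / daily['LCLid']
print(daily['avg_energy'].tolist())  # [12.0, 12.0]
```

Dividing by the per-day household count is what makes days with different numbers of reporting meters comparable.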
Weather data
weather = pd.read_csv('data/weather_daily_darksky.csv')
print(weather.head(4))
# daily-level weather data comes from the Dark Sky API included in the dataset
weather['day']= pd.to_datetime(weather['time']) # day is given as timestamp
weather['day']= pd.to_datetime(weather['day'],format='%Y%m%d').dt.date
# select the numeric variables
weather = weather[['temperatureMax', 'windBearing', 'dewPoint', 'cloudCover', 'windSpeed',
'pressure', 'apparentTemperatureHigh', 'visibility', 'humidity',
'apparentTemperatureLow', 'apparentTemperatureMax', 'uvIndex',
'temperatureLow', 'temperatureMin', 'temperatureHigh',
'apparentTemperatureMin', 'moonPhase','day']]
weather = weather.dropna()
### Relationship between weather conditions and electricity consumption
weather_energy = energy.merge(weather,on='day')
#print(weather_energy.head(2))
#***1. Temperature***
# Energy and temperature show an inverse relationship:
# at low temperatures, consumption likely rises through the use of heaters and similar appliances.
plt.rcParams['font.sans-serif'] = ['SimHei']  # step 1: use a font that can render CJK characters
plt.rcParams['axes.unicode_minus'] = False    # step 2: render the minus sign correctly on the axes
def drow1():
    fig, ax1 = plt.subplots(figsize=(20, 5))
    ax1.plot(weather_energy.day, weather_energy.temperatureMax, color='tab:orange', label='temperatureMax')
    ax1.plot(weather_energy.day, weather_energy.temperatureMin, color='tab:pink', label='temperatureMin')
    ax1.set_ylabel('Temperature')
    ax1.legend()
    ax2 = ax1.twinx()
    ax2.plot(weather_energy.day, weather_energy.avg_energy, color='tab:blue', label='avg_energy')
    ax2.set_ylabel('Average Energy/Household', color='tab:blue')
    ax2.legend(bbox_to_anchor=(0.0, 1.02, 1.0, 0.102))
    plt.title('Energy Consumption and Temperature')
    fig.tight_layout()
    plt.show()
#***2. Humidity***
# Humidity and average energy consumption appear to follow the same trend.
def drow2():
    fig, ax1 = plt.subplots(figsize=(20, 5))
    ax1.plot(weather_energy.day, weather_energy.humidity, color='tab:orange')
    ax1.set_ylabel('Humidity', color='tab:orange')
    ax2 = ax1.twinx()
    ax2.plot(weather_energy.day, weather_energy.avg_energy, color='tab:blue')
    ax2.set_ylabel('Average Energy/Household', color='tab:blue')
    plt.title('Energy Consumption and Humidity')
    fig.tight_layout()
    plt.show()
#***3. Cloud Cover***
def drow3():
    fig, ax1 = plt.subplots(figsize=(20, 5))
    ax1.plot(weather_energy.day, weather_energy.cloudCover, color='tab:orange')
    ax1.set_ylabel('Cloud Cover', color='tab:orange')
    ax2 = ax1.twinx()
    ax2.plot(weather_energy.day, weather_energy.avg_energy, color='tab:blue')
    ax2.set_ylabel('Average Energy/Household', color='tab:blue')
    plt.title('Energy Consumption and Cloud Cover')
    fig.tight_layout()
    plt.show()
#***4. Visibility***
#> Visibility does not appear to affect energy consumption:
# it is likely an outdoor factor, and its rise or fall is unlikely to change consumption inside the home.
def drow4():
    fig, ax1 = plt.subplots(figsize=(20, 5))
    ax1.plot(weather_energy.day, weather_energy.visibility, color='tab:orange')
    ax1.set_ylabel('Visibility', color='tab:orange')
    ax2 = ax1.twinx()
    ax2.plot(weather_energy.day, weather_energy.avg_energy, color='tab:blue')
    ax2.set_ylabel('Average Energy/Household', color='tab:blue')
    plt.title('Energy Consumption and Visibility')
    fig.tight_layout()
    plt.show()
#***5. Wind Speed***
#> Like visibility, wind speed appears to be an outdoor factor and does not affect energy consumption.
def drow5():
    fig, ax1 = plt.subplots(figsize=(20, 5))
    ax1.plot(weather_energy.day, weather_energy.windSpeed, color='tab:orange')
    ax1.set_ylabel('Wind Speed', color='tab:orange')
    ax2 = ax1.twinx()
    ax2.plot(weather_energy.day, weather_energy.avg_energy, color='tab:blue')
    ax2.set_ylabel('Average Energy/Household', color='tab:blue')
    plt.title('Energy Consumption and Wind Speed')
    fig.tight_layout()
    plt.show()
#***6. UV Index***
#> The UV index is inversely related to energy consumption.
def drow6():
    fig, ax1 = plt.subplots(figsize=(20, 5))
    ax1.plot(weather_energy.day, weather_energy.uvIndex, color='tab:orange')
    ax1.set_ylabel('UV Index', color='tab:orange')
    ax2 = ax1.twinx()
    ax2.plot(weather_energy.day, weather_energy.avg_energy, color='tab:blue')
    ax2.set_ylabel('Average Energy/Household', color='tab:blue')
    plt.title('Energy Consumption and UV Index')
    fig.tight_layout()
    plt.show()
#***7. Dew Point***
#> The dew point is a function of humidity and temperature, so it shows a similar relationship.
def drow7():
    fig, ax1 = plt.subplots(figsize=(20, 5))
    ax1.plot(weather_energy.day, weather_energy.dewPoint, color='tab:orange')
    ax1.set_ylabel('Dew Point', color='tab:orange')
    ax2 = ax1.twinx()
    ax2.plot(weather_energy.day, weather_energy.avg_energy, color='tab:blue')
    ax2.set_ylabel('Average Energy/Household', color='tab:blue')
    plt.title('Energy Consumption and Dew Point')
    fig.tight_layout()
    plt.show()
Correlations between weather variables and energy consumption:
- Energy is strongly positively correlated with humidity and strongly negatively correlated with temperature.
- Dew point and UV index are multicollinear with temperature, so they are dropped.
- Cloud cover and visibility are multicollinear with humidity, so they are dropped.
- Pressure and moon phase have minimal correlation with energy, so they are dropped.
- Wind speed has a low correlation with energy.
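Multicollinearity calls like the ones above can be checked with `DataFrame.corr()`. The sketch below uses synthetic data in which a dew-point-like variable is built directly from temperature, so its high correlation is by construction; the real dataset should be inspected the same way.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
temp = rng.normal(12.0, 6.0, 500)
df = pd.DataFrame({
    'temperatureMax': temp,
    'dewPoint': 0.8 * temp + rng.normal(0.0, 1.0, 500),  # collinear by construction
    'windSpeed': rng.normal(4.0, 1.0, 500),              # independent
})
corr = df.corr()
# dewPoint tracks temperature closely; windSpeed does not
print(corr.loc['temperatureMax', 'dewPoint'] > 0.9)
print(abs(corr.loc['temperatureMax', 'windSpeed']) < 0.3)
```

Dropping one variable from each highly correlated pair keeps the regressors closer to independent, which stabilizes models such as SARIMAX.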
Cluster analysis
The weather data has many variables, but not all of them are useful.
Cluster analysis lets us test whether a day's overall weather can be summarized from granular variables such as temperature and precipitation.
# scaling
scaler = MinMaxScaler()
weather_scaled = scaler.fit_transform(weather_energy[['temperatureMax','humidity','windSpeed']])
# optimum K
Nc = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in Nc]
score = [kmeans[i].fit(weather_scaled).score(weather_scaled) for i in range(len(kmeans))]
# elbow curve
def drow8():
    plt.plot(Nc, score)
    plt.xlabel('Number of Clusters')
    plt.ylabel('Score')
    plt.title('Elbow Curve')
    plt.show()
kmeans = KMeans(n_clusters=3, max_iter=600, algorithm = 'auto')
kmeans.fit(weather_scaled)
weather_energy['weather_cluster'] = kmeans.labels_
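The scale-then-cluster step can be illustrated on four made-up weather rows (columns stand in for temperatureMax, humidity, windSpeed; `random_state` is fixed so the result is reproducible):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# two cold-humid days and two hot-dry days (toy values)
X = np.array([[2.0, 0.90, 5.0],
              [3.0, 0.85, 6.0],
              [24.0, 0.40, 3.0],
              [25.0, 0.35, 2.0]])
X_scaled = MinMaxScaler().fit_transform(X)  # scale first so no feature dominates
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled).labels_
# the two cold days share one label, the two hot days share the other
print(labels[0] == labels[1] and labels[2] == labels[3] and labels[0] != labels[2])
```

Min-max scaling matters here: without it, temperature (tens of degrees) would swamp humidity (a 0-1 fraction) in the Euclidean distance.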
# relationships between the weather variables and the clusters
def drow9():
    plt.figure(figsize=(20, 5))
    plt.subplot(1, 3, 1)
    plt.scatter(weather_energy.weather_cluster, weather_energy.temperatureMax)
    plt.title('Weather Cluster vs. Temperature')
    plt.subplot(1, 3, 2)
    plt.scatter(weather_energy.weather_cluster, weather_energy.humidity)
    plt.title('Weather Cluster vs. Humidity')
    plt.subplot(1, 3, 3)
    plt.scatter(weather_energy.weather_cluster, weather_energy.windSpeed)
    plt.title('Weather Cluster vs. WindSpeed')
    plt.show()
def drow10():
    fig, ax1 = plt.subplots(figsize=(10, 7))
    ax1.scatter(weather_energy.temperatureMax,
                weather_energy.humidity,
                s=weather_energy.windSpeed * 10,
                c=weather_energy.weather_cluster)
    ax1.set_xlabel('Temperature')
    ax1.set_ylabel('Humidity')
    plt.show()
ARIMAX
### UK Bank Holidays
holiday = pd.read_csv('data/uk_bank_holidays.csv')
holiday['Bank holidays'] = pd.to_datetime(holiday['Bank holidays'],format='%Y-%m-%d').dt.date
#holiday.head(4)
weather_energy = weather_energy.merge(holiday, left_on = 'day',right_on = 'Bank holidays',how = 'left')
weather_energy['holiday_ind'] = np.where(weather_energy['Bank holidays'].isna(),0,1)
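On toy dates, the left merge plus `np.where` above produces the indicator like this (the holiday row is invented for illustration):

```python
import datetime
import numpy as np
import pandas as pd

days = pd.DataFrame({'day': [datetime.date(2013, 12, 24), datetime.date(2013, 12, 25)]})
holiday = pd.DataFrame({'Bank holidays': [datetime.date(2013, 12, 25)],
                        'Type': ['Christmas Day']})
# left merge keeps every day; non-holidays get NaN in the holiday columns
merged = days.merge(holiday, left_on='day', right_on='Bank holidays', how='left')
merged['holiday_ind'] = np.where(merged['Bank holidays'].isna(), 0, 1)
print(merged['holiday_ind'].tolist())  # [0, 1]
```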
### ARIMAX
weather_energy['Year'] = pd.DatetimeIndex(weather_energy['day']).year
weather_energy['Month'] = pd.DatetimeIndex(weather_energy['day']).month
weather_energy.set_index(['day'],inplace=True)
#** train-test split: hold out the last 30 days as the test set **
model_data = weather_energy[['avg_energy','weather_cluster','holiday_ind']]
train = model_data.iloc[0:(len(model_data) - 30)]
test = model_data.iloc[len(train):len(model_data)]
train['avg_energy'].plot(figsize=(25, 4))
test['avg_energy'].plot(figsize=(25, 4))
test.head(1)
# **ACF PACF **
plot_acf(train.avg_energy,lags=100)
plt.show()
plot_pacf(train.avg_energy,lags=50)
plt.show()
t = sm.tsa.adfuller(train.avg_energy, autolag='AIC')
pd.Series(t[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
# differencing function
def difference(dataset, interval):
    diff = list()
    for i in range(interval, len(dataset)):
        value = dataset.iloc[i] - dataset.iloc[i - interval]
        diff.append(value)
    return diff
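For example, applying `difference` with interval 1 to a short series yields the day-over-day changes (the function is repeated so the snippet runs standalone):

```python
import pandas as pd

def difference(dataset, interval):
    # lag-`interval` differences of a pandas Series
    diff = list()
    for i in range(interval, len(dataset)):
        diff.append(dataset.iloc[i] - dataset.iloc[i - interval])
    return diff

s = pd.Series([10.0, 12.0, 11.0, 15.0])
print(difference(s, 1))  # [2.0, -1.0, 4.0]
```

The differenced series is one element shorter per lag, which is why the ADF test above is rerun on `difference(train.avg_energy, 1)` rather than on the raw series.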
t = sm.tsa.adfuller(difference(train.avg_energy,1), autolag='AIC')
pd.Series(t[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
# The seasonal component is fairly weak, but the trend is strong: electricity use drops
# noticeably in summer (April to September), likely because daylight hours are longer.
s = sm.tsa.seasonal_decompose(train.avg_energy,freq=12)
s.seasonal.plot(figsize=(20,5))
s.trend.plot(figsize=(20,5))
s.resid.plot(figsize=(20,5))
endog = train['avg_energy']
exog = sm.add_constant(train[['weather_cluster','holiday_ind']])
#mod = sm.tsa.statespace.SARIMAX(endog=endog, exog=exog, order=(7,1,1),seasonal_order=(1,1, 0, 12),trend='c')
mod = sm.tsa.statespace.SARIMAX(endog=endog, order=(7,1,1),seasonal_order=(1,1, 0, 12),trend='c')
model_fit = mod.fit()
model_fit.summary()
#**Model Fit**
train['avg_energy'].plot(figsize=(25,10))
model_fit.fittedvalues.plot()
plt.show()
#**Forecast**
predict = model_fit.predict(start = len(train),end = len(train)+len(test)-1)
#predict = model_fit.predict(start = len(train),end = len(train)+len(test)-1,exog = sm.add_constant(test[['weather_cluster','holiday_ind']]))
test['predicted'] = predict.values
#print(test.tail(5))
'''
avg_energy weather_cluster holiday_ind predicted
day
2014-02-23 11.673756 0 0 11.556584
2014-02-24 10.586235 0 0 10.708762
2014-02-25 10.476498 0 0 11.445902
2014-02-26 10.375366 0 0 11.869046
2014-02-27 10.537250 0 0 11.484142
'''
test['residual'] = abs(test['avg_energy']-test['predicted'])
MAE = test['residual'].sum()/len(test)
MAPE = (abs(test['residual'])/test['avg_energy']).sum()*100/len(test)
print("MAE:", MAE)    # mean absolute error: 0.5851807015623742 (smaller means better accuracy)
print("MAPE:", MAPE)  # mean absolute percentage error: 5.237486909890385 (predictions deviate from the actuals by about 5% on average)
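The two error metrics computed above reduce to this on a two-point toy example:

```python
import pandas as pd

actual = pd.Series([10.0, 20.0])
predicted = pd.Series([9.0, 22.0])
residual = (actual - predicted).abs()
mae = residual.sum() / len(actual)                    # mean absolute error
mape = (residual / actual).sum() * 100 / len(actual)  # mean absolute percentage error
print(mae, mape)  # 1.5 10.0
```

MAE is in the units of the target (kWh per household here), while MAPE is scale-free, which makes it easier to compare across datasets.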
model_fit.resid.plot(figsize= (30,5))
model_fit.fittedvalues.plot(figsize = (30,5))
test.predicted.plot()
#print(test['predicted'].tail(5))
'''
2014-02-23 11.556584
2014-02-24 10.708762
2014-02-25 11.445902
2014-02-26 11.869046
2014-02-27 11.484142
'''
LSTM
dataframe = weather_energy.loc[:,'avg_energy']
dataset = dataframe.values
dataset = dataset.astype('float32')
# convert series to supervised learning
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1
    df = pd.DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j + 1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j + 1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j + 1, i)) for j in range(n_vars)]
    # put it all together
    agg = pd.concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg
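Running `series_to_supervised` on a short sequence makes the lag layout explicit; the function is repeated here so the snippet is self-contained:

```python
import pandas as pd

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1
    df = pd.DataFrame(data)
    cols, names = list(), list()
    for i in range(n_in, 0, -1):   # input sequence (t-n, ... t-1)
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j + 1, i)) for j in range(n_vars)]
    for i in range(0, n_out):      # forecast sequence (t, t+1, ... t+n)
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j + 1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j + 1, i)) for j in range(n_vars)]
    agg = pd.concat(cols, axis=1)
    agg.columns = names
    if dropnan:
        agg.dropna(inplace=True)   # drop rows whose lags fall before the series start
    return agg

out = series_to_supervised([1.0, 2.0, 3.0, 4.0, 5.0], n_in=2, n_out=1)
print(list(out.columns))    # ['var1(t-2)', 'var1(t-1)', 'var1(t)']
print(out.values.tolist())  # [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
```

With `n_in=7`, as used above, each row carries the previous week of `avg_energy` as features and the current day as the target.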
reframed = series_to_supervised(dataset, 7,1)
print(reframed.head(3))
reframed['weather_cluster'] = weather_energy.weather_cluster.values[7:]
reframed['holiday_ind']= weather_energy.holiday_ind.values[7:]
reframed = reframed.reindex(['weather_cluster', 'holiday_ind','var1(t-7)', 'var1(t-6)', 'var1(t-5)', 'var1(t-4)', 'var1(t-3)','var1(t-2)', 'var1(t-1)', 'var1(t)'], axis=1)
reframed = reframed.values
#Normalization
scaler = MinMaxScaler(feature_range=(0, 1))
reframed = scaler.fit_transform(reframed)
# train/test split
train = reframed[:(len(reframed)-30), :]
test = reframed[(len(reframed)-30):len(reframed), :]
train_X, train_y = train[:, :-1], train[:, -1]
test_X, test_y = test[:, :-1], test[:, -1]
# reshape the input to 3D [samples, timesteps, features]
train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)
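The reshape above only inserts a singleton time-step axis, as Keras LSTMs expect 3D input; on dummy data (illustrative shapes, 9 features as in this notebook):

```python
import numpy as np

train_X = np.arange(18, dtype='float32').reshape(2, 9)  # 2 samples, 9 features
train_X_3d = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
print(train_X_3d.shape)  # (2, 1, 9)
```

An alternative layout would be 7 time steps of 1 feature each for the lag columns; this notebook instead treats all 9 columns as one time step.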
#**Modelling**
# design network
model = Sequential()
model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(Dense(1))
model.compile(loss='mae', optimizer='adam')
# fit network
history = model.fit(train_X, train_y, epochs=50, batch_size=72, verbose=2, shuffle=False)
# plot history
plt.plot(history.history['loss'], label='train')
plt.legend()
plt.show()
# make a prediction
yhat = model.predict(test_X)
test_X = test_X.reshape(test_X.shape[0], test_X.shape[2])
# invert scaling for forecast: the target must be the LAST column,
# matching the column order the scaler was fitted on
inv_yhat = np.concatenate((test_X, yhat), axis=1)
inv_yhat = scaler.inverse_transform(inv_yhat)
# invert scaling for actual
test_y = test_y.reshape((len(test_y), 1))
inv_y = np.concatenate((test_X, test_y), axis=1)
inv_y = scaler.inverse_transform(inv_y)
#Performance
act = [i[9] for i in inv_y]      # last column is the actual average energy
pred = [i[9] for i in inv_yhat]  # last column is the predicted average energy
# calculate RMSE
import math
rmse = math.sqrt(mean_squared_error(act, pred))
print('Test RMSE: %.3f' % rmse)
predicted_lstm = pd.DataFrame({'predicted':pred,'avg_energy':act})
predicted_lstm['avg_energy'].plot(figsize=(25,10),color = 'red')
predicted_lstm['predicted'].plot(color = 'blue')
plt.show()