In this paper, we introduce an online learning method for modeling the properties of an office building. Unlike conventional control methods, in which the building is modeled with a simulator or through offline learning, our building model is updated adaptively according to the dynamic response of the real environment. Using the building model to predict the environment, the proposed action agent controls the heating, ventilation, and air conditioning (HVAC) system more intelligently by scheduling the temperature reference point. To learn the model online and improve the agent, we address two practical and seldom-discussed issues. The first is data bias: the initially collected training dataset only partially reveals the statistical mapping between the control input and the environment response, so the trained model may generalize poorly. To overcome this, we propose a data augmentation method that embeds physical logic so that a proper initial model can be trained. An online learning process then improves the model's generality during the system operation phase. The second issue is the constraint on agent exploration for discovering unseen data samples. During business hours, to keep employees comfortable, a control agent is not allowed to explore the control space randomly. To balance data collection and control stability, we introduce a hybrid control strategy that considers both the human control rule and the agent action. A confidence score for the agent model is also estimated automatically to determine which control strategy to apply. Our experiments were carried out in an office building. The results show that the proposed method outperforms conventional methods and is superior in terms of control stability.
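The hybrid control strategy summarized above can be illustrated with a minimal sketch. It is not the method evaluated in this paper: the confidence measure (the fraction of recent model predictions that fall within a tolerance), the threshold value, and all function and parameter names are assumptions chosen only to make the idea concrete.

```python
from typing import Sequence


def confidence_from_errors(errors: Sequence[float], tolerance: float = 1.0) -> float:
    """Hypothetical confidence score: the fraction of recent temperature
    prediction errors whose magnitude is within `tolerance` (degrees).
    An empty history yields 0.0, i.e. no confidence in the model."""
    if not errors:
        return 0.0
    within = sum(1 for e in errors if abs(e) <= tolerance)
    return within / len(errors)


def select_setpoint(rule_setpoint: float,
                    agent_setpoint: float,
                    confidence: float,
                    threshold: float = 0.7) -> float:
    """Hybrid control (assumed gating rule): apply the agent's setpoint
    only when the model confidence reaches `threshold`; otherwise fall
    back to the human control rule."""
    if confidence >= threshold:
        return agent_setpoint
    return rule_setpoint


# Example: three of four recent predictions were within 1 degree.
score = confidence_from_errors([0.2, -0.5, 1.5, 0.1])
setpoint = select_setpoint(rule_setpoint=24.0, agent_setpoint=25.5, confidence=score)
```

Under these assumptions, a model with little or no prediction history always defers to the human rule, which reflects the stated goal of avoiding uncontrolled exploration during business hours.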