Object detection is a fundamental component of creating interactive videos. In this thesis, we propose a new method for object detection that combines object recognition with tracking in a single neural network. Specifically, we use GoogLeNet as a feature extractor, then apply a long short-term memory (LSTM) network to refine the feature vectors extracted by GoogLeNet using context from the feature vectors of the previous frame. We feed the output of the LSTM to a classifier and a regressor, as in the OverFeat network, to obtain predicted confidences and predicted bounding boxes. We pre-train the feature extractor on the ImageNet dataset and evaluate our network on the OTB100 dataset, comparing against results obtained without tracking. Our model performs better at predicting objects in frames affected by occlusion and background clutter, and produces more consistent object bounding boxes across frames.
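The pipeline described above (per-frame feature extraction, a recurrent state carrying context between frames, and separate confidence and box heads) can be sketched as follows. This is a minimal illustration with stand-in components, not the thesis implementation: the dimensions, the random-projection "feature extractor" replacing GoogLeNet, and the single-layer LSTM cell are all assumptions made for the sketch.

```python
import numpy as np

# Hypothetical dimensions; the actual values are not specified here.
FEAT_DIM = 1024   # size of the per-frame feature vector
HIDDEN = 256      # LSTM hidden-state size
NUM_CLS = 2       # e.g. object vs. background confidence

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """Minimal LSTM cell: four gates computed from [input, previous hidden]."""
    def __init__(self, in_dim, hid_dim):
        self.W = rng.standard_normal((4 * hid_dim, in_dim + hid_dim)) * 0.01
        self.b = np.zeros(4 * hid_dim)

    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, g, o = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c_new = f * c + i * np.tanh(g)   # update cell memory
        h_new = o * np.tanh(c_new)       # expose gated hidden state
        return h_new, c_new

# Stand-in weights: a random projection replaces GoogLeNet, and two linear
# heads replace the OverFeat-style classifier and box regressor.
W_feat = rng.standard_normal((3, FEAT_DIM)) * 0.01
W_cls = rng.standard_normal((NUM_CLS, HIDDEN)) * 0.01   # confidence head
W_box = rng.standard_normal((4, HIDDEN)) * 0.01         # (x, y, w, h) head

def extract_features(frame):
    """Stand-in for the feature extractor: pooled pixels -> R^FEAT_DIM."""
    return frame.mean(axis=(0, 1)) @ W_feat

def detect_sequence(frames):
    """Run the pipeline over a clip. The LSTM state carries context from
    earlier frames, which is what lets predictions stay consistent when a
    single frame is degraded by occlusion or clutter."""
    lstm = LSTMCell(FEAT_DIM, HIDDEN)
    h, c = np.zeros(HIDDEN), np.zeros(HIDDEN)
    outputs = []
    for frame in frames:
        feat = extract_features(frame)   # per-frame appearance feature
        h, c = lstm.step(feat, h, c)     # refine with temporal context
        conf = sigmoid(W_cls @ h)        # predicted confidences
        box = W_box @ h                  # predicted bounding box
        outputs.append((conf, box))
    return outputs

# Usage: a clip of 5 RGB frames of size 32x32.
clip = rng.random((5, 32, 32, 3))
outs = detect_sequence(clip)
```

Each frame yields one confidence vector and one box; because the hidden state persists across the loop, the per-frame predictions are conditioned on what the network saw before, unlike a single-frame detector applied independently to each frame.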