Statistical language models are essential components of modern text-based and voice-driven systems for human-computer interaction. These models estimate the probability of a word occurring in a given context. The most common language models are combinations of frequency-based maximum likelihood estimates. Because they often rely on a discrete enumeration of predictive contexts (e.g., n-grams), these models fail to capture or exploit statistical regularities across contexts. In this paper, we show how to learn hierarchical, distributed representations of word contexts that maximize the predictive value of a statistical language model. The representations are initialized by unsupervised algorithms for linear and nonlinear dimensionality reduction, then fed as input to a hierarchical mixture of experts, where each expert is a multinomial distribution over predicted words. While the distributed representations in our model are inspired by the neural probabilistic language model of Bengio et al., our particular architecture enables us to work with significantly larger vocabularies and training corpora. For example, on a large-scale bigram modeling task involving a sixty-thousand-word vocabulary and a training corpus of three million sentences, we demonstrate consistent improvement over class-based bigram models. We also discuss extensions of our approach to longer multiword contexts.
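
As a rough illustration of the architecture summarized above, the following minimal sketch computes a next-word distribution as a gated mixture of multinomial experts, with the gate driven by a continuous representation of the context word. It is not the model evaluated in the paper. For brevity it uses a single-level (flat) softmax gate rather than a hierarchy, and random vectors stand in for representations obtained by dimensionality reduction. All parameters are untrained. The names `next_word_distribution`, `gate_weights`, `experts`, and `embeddings`, as well as the sizes `VOCAB`, `DIM`, and `EXPERTS`, are assumptions made only for this example.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of real-valued scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def normalize(weights):
    """Scale non-negative weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def next_word_distribution(context_vec, gate_weights, experts):
    """Return P(next word | context) as a gated mixture of multinomials.

    context_vec  : continuous representation of the context word
    gate_weights : one weight vector per expert (flat gate, an assumption)
    experts      : one multinomial over the vocabulary per expert
    """
    # Gate score for each expert is a dot product with the context vector.
    scores = [sum(g * x for g, x in zip(row, context_vec)) for row in gate_weights]
    gate = softmax(scores)
    vocab_size = len(experts[0])
    # Mix the experts' multinomials using the gate probabilities.
    return [
        sum(gate[k] * experts[k][v] for k in range(len(experts)))
        for v in range(vocab_size)
    ]


# Toy sizes and random, untrained parameters (illustrative assumptions only).
rng = random.Random(0)
VOCAB, DIM, EXPERTS = 5, 3, 4
embeddings = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(VOCAB)]
gate_weights = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(EXPERTS)]
experts = [
    normalize([rng.random() + 1e-3 for _ in range(VOCAB)]) for _ in range(EXPERTS)
]

# Bigram-style prediction: distribution over the next word given word index 2.
probs = next_word_distribution(embeddings[2], gate_weights, experts)
```

Because the gate depends on a continuous representation rather than on a discrete context identity, context words with nearby representations receive similar mixtures of experts. This is one way such a model can share statistical strength across contexts.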