Systems and methods are provided for generating an image of a posed human figure or other subject using a neural network that is trained to translate a set of points to realistic images by reconstructing projected surfaces directly in the pixel space or image space. Input to the image generation process may include parameterized control features, such as body shape parameters, pose parameters and/or a virtual camera position. These input parameters may be applied to a three-dimensional model that is used to generate the set of points, such as a sparsely populated image of color and depth information at vertices of the three-dimensional model, before additional image generation occurs directly in the image space. The visual appearance or identity of the synthesized human in successive output images may remain consistent, such that the output is both controllable and predictable.
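The intermediate representation described above, a sparsely populated image holding color and depth at the projected vertices of the three-dimensional model, can be illustrated with a minimal sketch. It assumes the vertices have already been posed and transformed into camera coordinates, and it uses a simple pinhole projection with a nearest-depth test. The function name `render_sparse_rgbd`, the `focal_length` parameter, and the pinhole model are hypothetical choices made for illustration and are not taken from the source; the neural network that would fill in the remaining pixels is not shown.

```python
import math
from typing import List, Optional, Tuple

Vertex = Tuple[float, float, float]            # (x, y, z) in camera coordinates
Color = Tuple[int, int, int]                   # (r, g, b), each 0-255
Pixel = Optional[Tuple[int, int, int, float]]  # (r, g, b, depth), or None if empty


def render_sparse_rgbd(
    vertices: List[Vertex],
    colors: List[Color],
    width: int,
    height: int,
    focal_length: float,
) -> List[List[Pixel]]:
    """Project colored vertices into a sparse color-and-depth image.

    Assumption: a pinhole camera at the origin looking along +z, with the
    principal point at the image center. Pixels that no vertex lands on
    stay None, which is what makes the image sparse.
    """
    image: List[List[Pixel]] = [[None] * width for _ in range(height)]
    cx, cy = width / 2.0, height / 2.0

    for (x, y, z), (r, g, b) in zip(vertices, colors):
        if z <= 0:
            continue  # vertex is behind the camera
        u = math.floor(focal_length * x / z + cx)
        v = math.floor(focal_length * y / z + cy)
        if not (0 <= u < width and 0 <= v < height):
            continue  # vertex projects outside the image
        current = image[v][u]
        # Keep only the vertex closest to the camera at each pixel.
        if current is None or z < current[3]:
            image[v][u] = (r, g, b, z)
    return image


# Hypothetical example data: four vertices, one of them behind the camera.
example_vertices = [(0.0, 0.0, 2.0), (0.0, 0.0, 1.0), (1.0, 0.0, 2.0), (0.0, 0.0, -1.0)]
example_colors = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (9, 9, 9)]
sparse = render_sparse_rgbd(example_vertices, example_colors, 4, 4, 2.0)
filled = sum(pixel is not None for row in sparse for pixel in row)
```

In this example two vertices project to the same pixel and the nearer one is kept, so only two of the sixteen pixels are populated. Changing the pose, shape, or camera inputs would move the projected points while each vertex keeps its assigned color, which is one way to read the abstract's statement that the output remains controllable and consistent.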