Cross-modal biometric matching is a scarcely explored field, yet it has several important applications and can further strengthen existing security systems. In this paper, a framework for cross-modal biometric matching is presented in which faces of an individual are generated from his/her voice clips, and the synthesized faces are then evaluated with a face classification network. Generative Adversarial Networks (GANs) have become a recent trend in deep learning and are widely used for image synthesis. We exploit advances in Convolutional Neural Networks (CNNs) for feature extraction and in generative networks for image synthesis. In our experiments, we compare the performance of Variational Autoencoders (VAE), Conditional Generative Adversarial Networks (C-GAN), and Regularized Conditional Generative Adversarial Networks (RC-GAN). We show that RC-GAN, i.e., a C-GAN with a regularization term added to its loss, generates faces corresponding to the true identity of the voice clips with the best accuracy of 84.52%, whereas the VAE produces less noisy images with the highest PSNR of 28.276 dB but an accuracy of only 72.61%.
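The abstract describes RC-GAN as a C-GAN whose loss carries an added regularization term. The following is a minimal sketch of how such a regularized conditional generator objective could look; it assumes an L1 reconstruction term weighted by a hypothetical `lambda_reg` and uses simple stand-in `Generator`/`Discriminator` modules, since the paper's actual voice-conditioned architectures and the exact form of its regularizer are not given here.

```python
import torch
import torch.nn as nn

# Stand-in modules for illustration only: the paper's real networks
# condition face synthesis on voice embeddings and are more elaborate.
class Generator(nn.Module):
    def __init__(self, cond_dim=128, noise_dim=100, img_dim=64 * 64 * 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cond_dim + noise_dim, 512), nn.ReLU(),
            nn.Linear(512, img_dim), nn.Tanh(),
        )

    def forward(self, cond, noise):
        # Concatenate the conditioning vector (e.g., a voice embedding) with noise.
        return self.net(torch.cat([cond, noise], dim=1))

class Discriminator(nn.Module):
    def __init__(self, cond_dim=128, img_dim=64 * 64 * 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cond_dim + img_dim, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 1), nn.Sigmoid(),
        )

    def forward(self, cond, img):
        # Judge real/fake jointly with the conditioning vector, as in a C-GAN.
        return self.net(torch.cat([cond, img], dim=1))

def rc_gan_generator_loss(disc_out_fake, fake_img, real_img, lambda_reg=10.0):
    """Adversarial loss plus an L1 regularization term (assumed form of the
    'regularization factor'); lambda_reg is a hypothetical weight."""
    bce = nn.BCELoss()
    adv = bce(disc_out_fake, torch.ones_like(disc_out_fake))  # fool the discriminator
    reg = nn.L1Loss()(fake_img, real_img)                     # pull the output toward the true face
    return adv + lambda_reg * reg

# Example usage with random tensors standing in for voice embeddings and faces.
G, D = Generator(), Discriminator()
voice_emb = torch.randn(8, 128)
noise = torch.randn(8, 100)
real_faces = torch.rand(8, 64 * 64 * 3) * 2 - 1
fake_faces = G(voice_emb, noise)
g_loss = rc_gan_generator_loss(D(voice_emb, fake_faces), fake_faces, real_faces)
print(g_loss.item())
```

Setting `lambda_reg = 0` recovers a plain C-GAN generator objective, which is one way to read the comparison between C-GAN and RC-GAN reported above.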