Vectorization of documents

Abstract	Embodiments of the invention include methods, systems, and computer program products for document vectorization. Aspects include receiving, by a processor, a plurality of documents, each having a plurality of words. The processor, utilizing a vector embeddings engine, generates a vector to represent each of the plurality of words in the plurality of documents. An image representation is created for each document in the plurality of documents, and a word probability is generated for each of the plurality of words in the plurality of documents. A position for each word probability in the image is determined based on the vector associated with each word, and a compression operation is performed on the images to produce a compact representation for the plurality of documents.
Authors	ShreeRanjani SrirangamSridharan (IBM US) Raghu Ganti (IBM US) Mudhakar Srivatsa (IBM US) Yeon-Sup Lim (IBM US)
Date	Jan-2020
Venue	U.S. Patent Application 16/032,764, filed January 16, 2020
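
The steps summarized in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the application: the toy two-dimensional embeddings in EMBEDDINGS, the 4x4 image size, the use of in-document relative frequency as the word probability, and truncated SVD as the compression operation. The function names document_image and compress are hypothetical.

```python
from collections import Counter

import numpy as np

# Hypothetical 2-D word embeddings with coordinates in [0, 1). A real system
# would obtain vectors from a trained embeddings engine; these are made up.
EMBEDDINGS = {
    "cat": (0.10, 0.20),
    "dog": (0.15, 0.80),
    "car": (0.85, 0.10),
    "road": (0.90, 0.60),
}


def document_image(words, embeddings, size=4):
    """Build a size x size image whose pixels hold word probabilities.

    Assumption: the word probability is the word's relative frequency in
    the document, and its pixel is chosen by scaling the word's vector.
    """
    image = np.zeros((size, size))
    counts = Counter(w for w in words if w in embeddings)
    total = sum(counts.values())
    if total == 0:
        return image  # no known words: leave the image empty
    for word, count in counts.items():
        x, y = embeddings[word]
        row = min(int(y * size), size - 1)
        col = min(int(x * size), size - 1)
        image[row, col] += count / total
    return image


def compress(images, k=2):
    """Assumed compression step: truncated SVD over the flattened images."""
    flat = np.stack([img.ravel() for img in images])
    centered = flat - flat.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T  # one k-dimensional row per document


docs = [
    "cat dog cat".split(),
    "car road car road".split(),
    "dog cat road".split(),
]
images = [document_image(d, EMBEDDINGS) for d in docs]
compact = compress(images, k=2)
print(compact.shape)
```

In this sketch, words with nearby vectors land in nearby pixels, so documents with similar vocabulary produce similar images before compression. Any other position mapping or compression method could be substituted without changing the overall structure.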