Embodiments of the invention include methods, systems, and computer program products for document vectorization. Aspects include receiving, by a processor, a plurality of documents, each having a plurality of words. The processor, utilizing a vector embeddings engine, generates a vector to represent each of the plurality of words in the plurality of documents. An image representation is created for each document in the plurality of documents, and a word probability is generated for each of the plurality of words in the plurality of documents. A position for each word probability is determined in the image based on the vector associated with each word, and a compression operation is performed on the images to produce a compact representation for the plurality of documents.
- ShreeRanjani SrirangamSridharan (IBM US)
- Raghu Ganti (IBM US)
- Mudhakar Srivatsa (IBM US)
- Yeon-Sup Lim (IBM US)
U.S. Patent Application 16/032,764, filed January 16, 2020
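
The steps named in the abstract (word vectors, per-document image, word probabilities placed by vector, compression) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the application: `embed` is a hypothetical hash-based stand-in for a vector embeddings engine, the word probability is taken to be relative frequency within the document, the image is an 8x8 grid, and the compression operation is plain average pooling.

```python
import hashlib
from collections import Counter

GRID = 8  # image side length (assumed for illustration)
POOL = 2  # pooling block size used as the compression step (assumed)


def embed(word):
    """Hypothetical stand-in for a vector embeddings engine.

    Maps a word to a deterministic 2-D point in [0, 1) x [0, 1).
    A real engine would produce learned, semantically meaningful vectors.
    """
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    x = int.from_bytes(digest[:4], "big") / 2**32
    y = int.from_bytes(digest[4:8], "big") / 2**32
    return x, y


def document_image(words, grid=GRID):
    """Build a grid x grid image whose pixels hold word probabilities."""
    counts = Counter(words)
    total = sum(counts.values())
    image = [[0.0] * grid for _ in range(grid)]
    for word, count in counts.items():
        x, y = embed(word)
        # The pixel position is derived from the word's vector.
        row, col = int(y * grid), int(x * grid)
        # Assumed probability: relative frequency within the document.
        image[row][col] += count / total
    return image


def compress(image, pool=POOL):
    """Average-pool the image over non-overlapping pool x pool blocks."""
    size = len(image) // pool
    return [
        [
            sum(
                image[r * pool + i][c * pool + j]
                for i in range(pool)
                for j in range(pool)
            ) / (pool * pool)
            for c in range(size)
        ]
        for r in range(size)
    ]


def vectorize(documents):
    """Return one compact image per input document."""
    return [compress(document_image(doc.lower().split())) for doc in documents]
```

Calling `vectorize(["the cat sat on the mat", "dogs chase cats"])` returns two 4x4 grids, one per document. With a learned embedding in place of the hash, related words would land in nearby pixels, which is presumably what makes a pooling-style compression preserve useful structure.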