Vectorization of documents

Abstract Embodiments of the invention include methods, systems, and computer program products for document vectorization. Aspects include receiving, by a processor, a plurality of documents, each having a plurality of words. The processor, utilizing a vector embeddings engine, generates a vector to represent each of the plurality of words in the plurality of documents. An image representation is created for each document in the plurality of documents, and a word probability is generated for each of the plurality of words in the plurality of documents. A position for each word probability is determined in the image based on the vector associated with each word, and a compression operation is performed on the images to produce a compact representation for the plurality of documents.
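The pipeline in the abstract can be illustrated with a minimal Python sketch. It is not the claimed method. Several details are assumptions made only for illustration: the two-dimensional vectors in the hypothetical EMBEDDINGS table stand in for the output of a vector embeddings engine, the word probability is taken to be a word's relative frequency within its document, and the compression operation is taken to be sum-pooling over non-overlapping blocks. The names document_image and compress are hypothetical.

```python
from collections import Counter

import numpy as np

# Hypothetical 2-D word vectors with coordinates in [0, 1).
# A real vector embeddings engine would supply these (assumption).
EMBEDDINGS = {
    "cat": (0.10, 0.20),
    "dog": (0.15, 0.80),
    "car": (0.90, 0.40),
}


def document_image(words, embeddings, size=8):
    """Build a size x size image for one document.

    Each known word's probability (assumed here to be its relative
    frequency in the document) is written to the pixel selected by
    the word's vector.
    """
    image = np.zeros((size, size))
    counts = Counter(w for w in words if w in embeddings)
    total = sum(counts.values())
    for word, count in counts.items():
        x, y = embeddings[word]
        row = min(int(y * size), size - 1)  # vector -> pixel row
        col = min(int(x * size), size - 1)  # vector -> pixel column
        image[row, col] += count / total
    return image


def compress(image, block=2):
    """Sum-pool non-overlapping block x block tiles (assumed compression)."""
    size = image.shape[0]
    tiles = image.reshape(size // block, block, size // block, block)
    return tiles.sum(axis=(1, 3))


image = document_image(["cat", "cat", "dog", "car"], EMBEDDINGS)
compact = compress(image)
```

In this sketch, words with nearby vectors land in nearby pixels, so pooling merges their probability mass while the total mass of the image is preserved.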
Authors
  • ShreeRanjani SrirangamSridharan (IBM US)
  • Raghu Ganti (IBM US)
  • Mudhakar Srivatsa (IBM US)
  • Yeon-Sup Lim (IBM US)
Date Jan-2020
Venue U.S. Patent Application 16/032,764, filed January 16, 2020