Embedding

A learned numerical representation of text, image or other media

Encoder

Decoder

"reset"

«перезагрузить» (Russian: "to reboot, reset")

[0.222233,0.334322,0.2313545,0.222322,2.33224422,8.332352424,7.974....]
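As a minimal sketch of how two such vectors can be compared, the snippet below computes cosine similarity, a common (though not the only) measure of closeness between embeddings. The three-dimensional vectors are made-up stand-ins, not real model output; real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, invented for illustration only.
reset_en = [0.22, 0.33, 0.23]   # stands in for "reset"
reset_ru = [0.21, 0.35, 0.22]   # stands in for «перезагрузить»
weather = [0.90, -0.40, 0.05]   # stands in for an unrelated text

print(cosine_similarity(reset_en, reset_ru))  # close to 1: similar direction
print(cosine_similarity(reset_en, weather))   # much lower: different direction
```

A score near 1 means the vectors point in nearly the same direction; a score near 0 means they are close to orthogonal.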

Learning: Collocation & Classification

Contrastive Representation Learning

A method for learning embeddings in which similar (positive) pairs of texts are pulled closer together and pushed relatively far from dissimilar (negative) texts

Want the math?

Triplet loss

The loss is a single number that is small when the anchor is near the positive and far from the negative, and large otherwise. During training, the model adjusts its weights to reduce this loss, so that it generates embeddings in which paired texts are near one another in vector space.
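The idea can be sketched with the standard margin-based triplet loss, max(0, d(anchor, positive) - d(anchor, negative) + margin). The Euclidean distance, the margin of 1.0, and the two-dimensional vectors are assumptions chosen for illustration; this is not necessarily the exact objective used to train Nomic Embed.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther away than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Hypothetical 2-D embeddings, invented for illustration only.
anchor = [0.0, 0.0]
positive = [0.0, 1.0]        # distance 1.0 from the anchor
far_negative = [3.0, 4.0]    # distance 5.0 from the anchor
near_negative = [0.0, 0.5]   # distance 0.5 from the anchor (closer than the positive)

print(triplet_loss(anchor, positive, far_negative))   # 0.0: nothing left to fix
print(triplet_loss(anchor, positive, near_negative))  # 1.5: model is penalized
```

When the negative sits closer to the anchor than the positive does, the loss is positive, and gradient descent would move the embeddings until the ordering is corrected.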

Training Data for Nomic Embed

query_text: Seeking Energy Independence, Europe Faces Heated Fracking Debate

document_text: While watching the turmoil in Ukraine unfold, you may feel as though it has little to do with the United States, but the conflict is stirring a contentious debate in Europe over a topic familiar to many Americans: fracking...

paired_text: Seeking Energy Independence, Europe Faces Heated Fracking Debate While watching the turmoil in Ukraine unfold, you may feel as though it has little to do with the United States, but the conflict is stirring a contentious debate in Europe over a topic familiar to many Americans: fracking....

∼235M text pairs

task-specific for question/answer, classification and clustering

What do computers really mean by "semantic text similarity"?

Types of Similarity

lexical: words or vocabulary

semantic: meaning or semantic content

orthographic: spelling and writing

stylistic: common style or authorship

thematic: themes or concepts

https://bit.ly/prozhito-map

"Scalable Information Cartography"

Bread Prices

Music

topical

^Пятница\W (entries that open with the word "Friday")

lexical

^Понедельник\W (entries that open with the word "Monday")

lexical
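The two patterns above are regular expressions that match entries opening with a weekday name followed by a non-word character. A minimal sketch of applying them with Python's `re` module follows; the sample entries and the `opening_label` helper are hypothetical and not taken from the diary corpus.

```python
import re

# The two patterns shown on the map; \W is Unicode-aware for str patterns in Python 3.
FRIDAY = re.compile(r"^Пятница\W")
MONDAY = re.compile(r"^Понедельник\W")


def opening_label(text):
    """Return a weekday label if the entry opens with that weekday, else None."""
    if FRIDAY.match(text):
        return "Friday"
    if MONDAY.match(text):
        return "Monday"
    return None


# Hypothetical diary openings, invented for illustration only.
entries = [
    "Пятница. Хлеб подорожал.",    # "Friday. Bread got more expensive."
    "Понедельник, утро. Дождь.",   # "Monday, morning. Rain."
    "Вчера был концерт.",          # "Yesterday there was a concert."
]

for entry in entries:
    print(opening_label(entry), "|", entry)
```

Entries grouped this way share surface vocabulary rather than meaning, which is why such clusters are labeled lexical.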

Person

stylistic

least coherent cluster: "Paris" (mixture of Russian and French)

Nomic topic: medium

5,349 diary entries

average inter-cluster distance: 26.500696

maximum distance: 75.331070
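The slides do not say how these distance figures were computed. One plausible reading, sketched below purely as an assumption, is the mean and maximum Euclidean distance between cluster centroids. The cluster names and 2-D points are invented toy data.

```python
import math
from itertools import combinations


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inter_cluster_distances(clusters):
    """Return (average, maximum) distance over all pairs of cluster centroids."""
    centers = [centroid(points) for points in clusters.values()]
    distances = [euclidean(a, b) for a, b in combinations(centers, 2)]
    return sum(distances) / len(distances), max(distances)


# Hypothetical clusters of 2-D points, invented for illustration only.
clusters = {
    "bread prices": [[0.0, 0.0], [2.0, 0.0]],   # centroid (1, 0)
    "music": [[1.0, 3.0], [1.0, 5.0]],          # centroid (1, 4)
    "paris": [[4.0, 3.0], [4.0, 5.0]],          # centroid (4, 4)
}

average, maximum = inter_cluster_distances(clusters)
print(average, maximum)  # 4.0 5.0
```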

2,305 entries

thematic

Length of Text

orthographic

So, "semantic similarity" is similar meaning and lexical, orthographic, thematic, and/or stylistic similarity...

What can we do with that?

Identify patterns of diary writing across authors (for example, clusters of similar writing about weather or prices)
Trace the re-use and citation of language, political phrases, and slogans in personal diaries, using vector search for common Soviet slogans and sayings: самокритика ("self-criticism"), Работа над собой ("working on oneself"), ленин жил ленин жив... ("Lenin lived, Lenin lives...")
Measure the uniformity or diversity of writing about war and specific experiences of war, and compare it across multiple diary writers
Track changing similarity over time: identify diaries with high similarity at one time step, compute their similarity at other time steps, and follow how it changes
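Two of the ideas above, vector search for a slogan and tracking similarity across time steps, can be sketched as follows. This assumes every entry and query has already been embedded by some model; the vectors, entry ids, years, and function names here are hypothetical toy values, not real data or a real library API.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero, equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def search(query_vec, corpus, top_k=3):
    """Rank (entry_id, vector) pairs by similarity to the query, best first."""
    scored = [(cosine(query_vec, vec), entry_id) for entry_id, vec in corpus]
    scored.sort(reverse=True)
    return [(entry_id, score) for score, entry_id in scored[:top_k]]


def similarity_over_time(diary_a, diary_b):
    """Compare two diaries at every time step that both of them cover."""
    shared_steps = sorted(set(diary_a) & set(diary_b))
    return {step: cosine(diary_a[step], diary_b[step]) for step in shared_steps}


# Hypothetical embeddings, invented for illustration only.
slogan_vec = [1.0, 0.0]  # stands in for an embedded slogan such as "самокритика"
corpus = [("entry-1", [0.9, 0.1]), ("entry-2", [0.0, 1.0]), ("entry-3", [0.5, 0.5])]
hits = search(slogan_vec, corpus, top_k=2)
print([entry_id for entry_id, _ in hits])  # ['entry-1', 'entry-3']

# One vector per diary per year (for example, the mean of that year's entries).
diary_a = {1941: [1.0, 0.0], 1942: [1.0, 0.0]}
diary_b = {1941: [1.0, 0.0], 1942: [0.0, 1.0]}
trend = similarity_over_time(diary_a, diary_b)
print(trend)  # {1941: 1.0, 1942: 0.0}
```

Representing each diary by one vector per time step is a design choice made for this sketch; comparing entry by entry, or using a sliding window, would be equally reasonable.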

lexical

Similarity-Driven Research

— Andrew Janco
