Armando B. Mendes

WildChat

Posted by Armando Brito Mendes | Filed under data sets, LLMs

https://wildchat.allen.ai/

Um data set com um milhão de perguntas e respostas do chatGPT

The WildChat Dataset is a corpus of 1 million real-world user-ChatGPT interactions, characterized by a wide range of languages and a diversity of user prompts. It was constructed by offering free access to ChatGPT and GPT-4 in exchange for consensual chat history collection. Using this dataset, we finetuned Meta’s Llama-2 and created WildLlama-7b-user-assistant, a chatbot which is able to predict both user prompts and assistant responses.
To learn more: dataset / model / paper

Tags: chatGPT, diálogos, NKP

Read more | Comments off | May 24th, 2024

Look into the machine’s mind

Posted by Armando Brito Mendes | Filed under LLMs, visualização

clique no link para visitar a app

Uma web app capaz de explorar os vários caminhos obtidos da resposta “what is intelligence” do chatGPT

the data
Using the chatgpt api, I ran the same completion prompt “Intelligence is “ hundreds of times (setting the temperature quite high, at 1.6, for more diverse responses). Given a text, a Large Language Model assigns a probability for the word (token) to come, and it just repeats this process until a completion is…well, complete.

semantic space (behind)
Each text (a prompt completion or a sub-sequence) has an embedding: a position in a 1536-dimensions space (I call it semantic space, or s²₁₅₃₆). For each response there’s a trajectory through s²₁₅₃₆ that corresponds to each sub-sequence of words, example: “Intelligence is “ → “Intelligence is the” → “Intelligence is the ability” → “Intelligence is the ability to” → … → full completion.

Because I cannot visualize a 1536-dimensions space (yet), I use a popular technique called Principal Components Analysis that tells me, for the set of points I have, what are the most important (principal) dimensions, and allows me to rotate the highly dimensional space so when I look through it, projected into only 3 dimensions, the points are scattered as much as possible. It’s the best (linear)possible reduction of dimensions. In fewer words: it compresses a highly dimensional space into few dimensions while preserving as much info as it can. More or less the same as when for drawing something you choose a perspective (you rotate the object), so it provides the most relevant information. I call this new space s²₃, and it’s what I visualize.

What you see in the cube is a tree of trajectories that bifurcate. All start with “Intelligence is “ and progress towards longer and less probable sub-sequences of responses. It’s a different representation of the same tree being visualized on the right (both visualizations communicate).

The tree visualization (right)
Visualizes all collected completions. It also represents the calculated probability of a word following a text (because the sample is small, this is only a good approximation for the initial levels of the tree), so “Intelligence is the “ will be followed by “ability” ~75% of the times, at 1.6 temperature. If temperature was lower this probability would rise, until achieving certainity at temperature=0.

By hovering a word, which corresponds to a point in a sub-sequence, you can see in the cube the trajectory from the prompt to all the completions that start with that sub-sequence.

Try other prompts:
· Chatgpt is
· Best thing about AI is
· When
· Santiago Ortiz is (yes, this is a selfai. What I found interesting is that it’s ~50% truth ~50% bs, and it feels like it describes alternative versions of my self in the multiverse)
· My dream
· Tell me a story:
· Intelligence is

references
Simulating my friend Philippe, where I explain embeddings, and how they are used to run semantic search and to find the proper knowledge from a corpus to use it as context for LLMs prompts
A deeper explanation of LLMs, next token prediction, temperature and embeddings, by Stephen Wolfram
English by degrees the original Next Word prediction model by Claude Shannon

moebio for more experiments and data proyects

Tags: bifurcações, chatGPT, word network

Read more | Comments off | March 26th, 2024

About

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Donec libero. Suspendisse bibendum. Cras id urna. Morbi tincidunt, orci ac convallis aliquam, lectus turpis varius lorem, eu posuere nunc justo tempus leo. Donec mattis, purus nec placerat bibendum, dui pede condimentum odio, ac blandit ante orci ut diam.

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Donec libero. Suspendisse bibendum. Cras id urna. Learn more...

Armando B. Mendes

WildChat

Look into the machine’s mind

Categorias de Posts

Palavras chave mais usadas

Arquivo

Recent Posts

Recent Comments

About