
Commit 05cfefc: Multimodal and Multi-Aspect Topic Modeling (#1232)
1 parent: 04f4225

46 files changed: 1559 additions & 248 deletions

Some content in this large commit is hidden by default.
README.md (46 additions & 19 deletions)
@@ -21,8 +21,10 @@ BERTopic supports
[**long-document**](https://maartengr.github.io/BERTopic/getting_started/distribution/distribution.html),
[**hierarchical**](https://maartengr.github.io/BERTopic/getting_started/hierarchicaltopics/hierarchicaltopics.html),
[**class-based**](https://maartengr.github.io/BERTopic/getting_started/topicsperclass/topicsperclass.html),
-[**dynamic**](https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html), and
-[**online**](https://maartengr.github.io/BERTopic/getting_started/online/online.html) topic modeling. It even supports visualizations similar to LDAvis!
+[**dynamic**](https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html),
+[**online**](https://maartengr.github.io/BERTopic/getting_started/online/online.html),
+[**multimodal**](https://maartengr.github.io/BERTopic/getting_started/multimodal/multimodal.html), and
+[**multi-aspect**](https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html) topic modeling. It even supports visualizations similar to LDAvis!

Corresponding medium posts can be found [here](https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6?source=friends_link&sk=0b5a470c006d1842ad4c8a3057063a99), [here](https://towardsdatascience.com/interactive-topic-modeling-with-bertopic-1ea55e7d73d8?sk=03c2168e9e74b6bda2a1f3ed953427e4) and [here](https://towardsdatascience.com/using-whisper-and-bertopic-to-model-kurzgesagts-videos-7d8a63139bdf?sk=b1e0fd46f70cb15e8422b4794a81161d). For a more detailed overview, you can read the [paper](https://arxiv.org/abs/2203.05794) or see a [brief overview](https://maartengr.github.io/BERTopic/algorithm/algorithm.html).

@@ -37,10 +39,14 @@ pip install bertopic
If you want to install BERTopic with other embedding models, you can choose one of the following:

```bash
+# Embedding models
pip install bertopic[flair]
pip install bertopic[gensim]
pip install bertopic[spacy]
pip install bertopic[use]
+
+# Vision topic modeling
+pip install bertopic[vision]
```

## Getting Started
@@ -71,7 +77,7 @@ topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
```

-After generating topics and their probabilities, we can access the frequent topics that were generated:
+After generating topics and their probabilities, we can access all of the topics together with their topic representations:

```python
>>> topic_model.get_topic_info()
@@ -82,10 +88,11 @@ Topic Count Name
1    466    32_jesus_bible_christian_faith
2    441    2_space_launch_orbit_lunar
3    381    22_key_encryption_keys_encrypted
+...
```

-The `-1` topic refers to all outlier documents and are typically ignored. Next, let's take a look at the most
-frequent topic that was generated:
+The `-1` topic refers to all outlier documents and is typically ignored. Each word in a topic describes the underlying theme of that topic and can be used
+for interpreting that topic. Next, let's take a look at the most frequent topic that was generated:

```python
>>> topic_model.get_topic(0)
@@ -115,14 +122,28 @@ Think! It's the SCSI card doing... 49 49_windows_drive_dos_file windows - dr
1) I have an old Jasmine drive...    49    49_windows_drive_dos_file    windows - drive - docs...    0.038983    ...
```

-> **Note**
->
+> 🔥 **Tip**
> Use `BERTopic(language="multilingual")` to select a model that supports 50+ languages.

-## Visualize Topics
+## Fine-tune Topic Representations
+
+In BERTopic, there are a number of different [topic representations](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html) that we can choose from. They are all quite different from one another and give interesting perspectives and variations of topic representations. A great start is `KeyBERTInspired`, which for many users increases coherence and reduces stopwords in the resulting topic representations:
+
+```python
+from bertopic.representation import KeyBERTInspired
+
+# Fine-tune your topic representations
+representation_model = KeyBERTInspired()
+topic_model = BERTopic(representation_model=representation_model)
+```
+
+> 🔥 **Tip**
+> Instead of iterating over all of these different topic representations, you can model them simultaneously with [multi-aspect topic representations](https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html) in BERTopic.
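To build intuition for what `KeyBERTInspired` does, here is a rough plain-Python sketch of the underlying idea: rerank candidate topic words by the similarity of their embeddings to an embedding of the topic itself. The `cosine` and `rerank` helpers and the toy 2-d vectors below are illustrative assumptions, not BERTopic's actual implementation:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm

def rerank(candidates, topic_embedding, top_n=2):
    """Keep the candidate words whose embeddings lie closest to the topic embedding."""
    ranked = sorted(candidates,
                    key=lambda w: cosine(candidates[w], topic_embedding),
                    reverse=True)
    return ranked[:top_n]

# Toy embeddings: the topic points along the first axis, so the
# generic word "the" scores low and gets filtered out.
topic_embedding = [1.0, 0.0]
candidates = {"launch": [0.9, 0.1], "the": [0.1, 0.9], "orbit": [0.8, 0.3]}
print(rerank(candidates, topic_embedding))  # ['launch', 'orbit']
```

In the real model, the word and topic embeddings come from the embedding model rather than hand-written vectors, but the reranking step is the same shape.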
+## Visualizations
After having trained our BERTopic model, we can iteratively go through hundreds of topics to get a good
-understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation.
-Instead, we can visualize the topics that were generated in a way very similar to
+understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. Instead, we can use one of the [many visualization options](https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html) in BERTopic.
+For example, we can visualize the topics that were generated in a way very similar to
[LDAvis](https://github.com/cpsievert/LDAvis):

```python
@@ -131,16 +152,21 @@ topic_model.visualize_topics()

<img src="images/topic_visualization.gif" width="60%" height="60%" align="center" />

-Find all possible visualizations with interactive examples in the documentation
-[here](https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html).
-
## Modularity
By default, the main steps for topic modeling with BERTopic are sentence-transformers, UMAP, HDBSCAN, and c-TF-IDF run in sequence. However, BERTopic assumes some independence between these steps, which makes it quite modular. In other words, BERTopic not only allows you to build your own topic model but also to explore several topic modeling techniques on top of your customized topic model:

https://user-images.githubusercontent.com/25746895/218420473-4b2bb539-9dbe-407a-9674-a8317c7fb3bf.mp4

-You can swap out any of these models or even remove them entirely. Starting with the embedding step, you can find out how to do this [here](https://maartengr.github.io/BERTopic/getting_started/embeddings/embeddings.html) and more about the underlying algorithm and assumptions [here](https://maartengr.github.io/BERTopic/algorithm/algorithm.html).
+You can swap out any of these models or even remove them entirely. The following steps are completely modular:
+
+1. [Embedding](https://maartengr.github.io/BERTopic/getting_started/embeddings/embeddings.html) documents
+2. [Reducing dimensionality](https://maartengr.github.io/BERTopic/getting_started/dim_reduction/dim_reduction.html) of embeddings
+3. [Clustering](https://maartengr.github.io/BERTopic/getting_started/clustering/clustering.html) reduced embeddings into topics
+4. [Tokenization](https://maartengr.github.io/BERTopic/getting_started/vectorizers/vectorizers.html) of topics
+5. [Weight](https://maartengr.github.io/BERTopic/getting_started/ctfidf/ctfidf.html) tokens
+6. [Represent topics](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html) with one or [multiple](https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html) representations
+
+To find out more about the underlying algorithm and its assumptions, see [here](https://maartengr.github.io/BERTopic/algorithm/algorithm.html).
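The weighting in step 5 is class-based TF-IDF (c-TF-IDF): all documents in a topic are concatenated into one class document, and each term t in class c is scored as tf(t, c) · log(1 + A / f(t)), where A is the average number of words per class and f(t) is the term's frequency across all classes. The `c_tf_idf` helper and toy corpus below are a minimal sketch of that formula, not BERTopic's implementation:

```python
from collections import Counter
from math import log

def c_tf_idf(topic_docs):
    """Class-based TF-IDF: concatenate each topic's documents into one
    class document, then weight term t in class c as
    tf(t, c) * log(1 + A / f(t)), where A is the average word count per
    class and f(t) is the frequency of t across all classes."""
    tf = {topic: Counter(word for doc in docs for word in doc.split())
          for topic, docs in topic_docs.items()}
    totals = Counter()
    for counts in tf.values():
        totals.update(counts)
    avg_words_per_class = sum(totals.values()) / len(tf)
    return {topic: {t: n * log(1 + avg_words_per_class / totals[t])
                    for t, n in counts.items()}
            for topic, counts in tf.items()}

scores = c_tf_idf({
    "tech": ["the gpu driver update", "the gpu memory driver"],
    "space": ["the rocket launch orbit", "the lunar orbit launch"],
})
# A term specific to one topic ("gpu") outranks a term shared by all topics ("the").
```

This is why c-TF-IDF pushes stopword-like terms, which occur in every class, toward the bottom of each topic representation.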

## Functionality
BERTopic has many functions that can quickly become overwhelming. To alleviate this issue, you will find an overview
@@ -183,12 +209,13 @@ public attributes that can be used to access model information.
| `.probabilities_` | The probabilities that are generated for each document if HDBSCAN is used. |
| `.topic_sizes_` | The size of each topic. |
| `.topic_mapper_` | A class for tracking topics and their mappings anytime they are merged/reduced. |
| `.topic_representations_` | The top *n* terms per topic and their respective c-TF-IDF values. |
| `.c_tf_idf_` | The topic-term matrix as calculated through c-TF-IDF. |
+| `.topic_aspects_` | The different aspects, or representations, of each topic. |
| `.topic_labels_` | The default labels for each topic. |
| `.custom_labels_` | Custom labels for each topic as generated through `.set_topic_labels`. |
| `.topic_embeddings_` | The embeddings for each topic if `embedding_model` was used. |
| `.representative_docs_` | The representative documents for each topic if HDBSCAN is used. |
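The new `.topic_aspects_` attribute introduced by this commit holds one representation per named aspect. As a purely hypothetical illustration of its shape (the aspect names, topic ids, terms, and weights below are made up, and the real keys depend on the `representation_model` you pass):

```python
# Hypothetical shape of `.topic_aspects_`:
# aspect name -> topic id -> ranked (term, weight) pairs.
topic_aspects = {
    "Main": {0: [("windows", 0.024), ("drive", 0.021)]},
    "KeyBERT": {0: [("windows", 0.41), ("dos", 0.33)]},
}

# Look up one aspect of topic 0:
terms = [term for term, _ in topic_aspects["KeyBERT"][0]]
print(terms)  # ['windows', 'dos']
```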

### Variations
