Please read our Documentation: The Origin of Bunka

Bunkatopics

<img src="docs/images/logo.png" width="35%" height="35%" align="right" />

Bunkatopics is a package for data cleaning, topic modeling visualization and frame analysis. Its primary goal is to help developers gain insights from unstructured data, which can in turn support cleaning the datasets used to fine-tune LLMs. Bunkatopics is built on well-known libraries such as sentence_transformers, langchain and transformers, which makes it straightforward to integrate into various environments.

Discover the different use cases through our Google Colab notebooks:

| Theme | Google Colab Link |
|---|---|
| Visual Topic Modeling with Bunka | Open In Colab |
| Cleaning dataset for fine-tuning LLM using Bunka | Open In Colab |
| Understanding a dataset using Frame Analysis with Bunka | Open In Colab |
| Full Introduction to Topic Modeling, Data Cleaning and Frame Analysis with Bunka | Open In Colab |

Installation via Pip

pip install bunkatopics

Installation via Git Clone

git clone https://github.com/charlesdedampierre/BunkaTopics.git
cd BunkaTopics
pip install -e .

Quick Start

Uploading Sample Data

To get started, let's load a sample of Medium articles into Bunkatopics:

from datasets import load_dataset
docs = load_dataset("bunkalab/medium-sample-technology")["train"]["title"] # 'docs' is a list of text [text1, text2, ..., textN]

Choose Your Embedding Model

Bunkatopics integrates with Hugging Face's extensive collection of embedding models. You can select from a wide range of models, but be mindful of their size: larger models take longer to download and to embed your documents.

# Load Embedding model
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer(model_name_or_path="all-MiniLM-L6-v2")

# Load Projection Model
import umap
projection_model = umap.UMAP(
                n_components=2,
                random_state=42)

from bunkatopics import Bunka

bunka = Bunka(embedding_model=embedding_model, 
            projection_model=projection_model)  # the language is automatically detected, make sure the embedding model is adapted

# Fit Bunka to your text data
bunka.fit(docs)

from sklearn.cluster import KMeans
clustering_model = KMeans(n_clusters=15)
>>> bunka.get_topics(name_length=5, custom_clustering_model=clustering_model)  # name_length sets the number of terms used to describe each topic

Topics are described by the most specific terms belonging to the cluster.

| topic_id | topic_name | size | percent |
|---|---|---|---|
| bt-12 | technology - Tech - Children - student - days | 322 | 10.73 |
| bt-11 | blockchain - Cryptocurrency - sense - Cryptocurrencies - Impact | 283 | 9.43 |
| bt-7 | gadgets - phone - Device - specifications - screen | 258 | 8.6 |
| bt-8 | software - Kubernetes - ETL - REST - Salesforce | 258 | 8.6 |
| bt-1 | hackathon - review - Recap - Predictions - Lessons | 257 | 8.57 |
| bt-4 | Reality - world - cities - future - Lot | 246 | 8.2 |
| bt-14 | Product - Sales - day - dream - routine | 241 | 8.03 |
| bt-0 | Words - Robots - discount - NordVPN - humans | 208 | 6.93 |
| bt-2 | Internet - Overview - security - Work - Development | 202 | 6.73 |
| bt-13 | Course - Difference - Step - science - Point | 192 | 6.4 |
| bt-6 | quantum - Cars - Way - Game - quest | 162 | 5.4 |
| bt-3 | Objects - Strings - app - Programming - Functions | 119 | 3.97 |
| bt-5 | supply - chain - revolution - Risk - community | 119 | 3.97 |
| bt-9 | COVID - printing - Car - work - app | 89 | 2.97 |
| bt-10 | Episode - HD - Secrets - TV | 44 | 1.47 |
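
One way to read "most specific terms" is as terms that are frequent inside a cluster and rare outside it. The sketch below illustrates that idea on a toy corpus with a simple score (in-cluster count times the share of the term's occurrences that fall in the cluster). The score, the toy documents and the `specific_terms` helper are assumptions made for illustration, not Bunka's actual term-extraction method.

```python
from collections import Counter

def specific_terms(cluster_docs, all_docs, top_n=3):
    """Rank terms by in-cluster count weighted by how exclusive they are to the cluster."""
    cluster_counts = Counter(w for d in cluster_docs for w in d.lower().split())
    corpus_counts = Counter(w for d in all_docs for w in d.lower().split())
    scores = {
        term: count * (count / corpus_counts[term])
        for term, count in cluster_counts.items()
    }
    # Highest score first; ties are broken alphabetically for a stable order
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]

# Toy corpus (made up for illustration)
crypto = ["blockchain payments", "blockchain cryptocurrency wallet", "cryptocurrency review"]
gadgets = ["phone screen review", "phone battery"]

print(specific_terms(crypto, crypto + gadgets, top_n=2))
```

A term such as "review", which also appears in the other cluster, is ranked lower than terms that occur only in the cluster being described.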

Visualize Your Topics

Finally, let's visualize the topics that Bunka has computed for your text data:

>>> bunka.visualize_topics(width=800, height=800, colorscale='delta')
<img src="docs/images/topic_modeling_raw_YlGnBu.png" width="70%" height="70%" align="center" />

Topic Modeling with GenAI Summarization of Topics

Explore the power of Generative AI for summarizing topics!

from langchain.llms import OpenAI

llm = OpenAI(openai_api_key = 'OPEN_AI_KEY')

Note: It is recommended to use an instruct model, i.e. a model that has been fine-tuned to follow instructions or hold a conversation. Otherwise, the results might be meaningless.

# Obtain clean topic names using Generative Model
bunka.get_clean_topic_name(llm=llm)

Check the top documents for every topic!

>>> bunka.df_top_docs_per_topic_

Finally, let's visualize the topics again. We can choose from different colorscales.

>>> bunka.visualize_topics(width=800, height=800)
Colorscales shown: YlGnBu (Image 1), Portland (Image 2), delta (Image 3), Blues (Image 4).

We can now access the newly named topics:

>>> bunka.df_topics_
| topic_id | topic_name | size | percent |
|---|---|---|---|
| bt-1 | Cryptocurrency Impact | 345 | 12.32 |
| bt-3 | Data Management Technologies | 243 | 8.68 |
| bt-14 | Everyday Life | 230 | 8.21 |
| bt-0 | Digital Learning Campaign | 225 | 8.04 |
| bt-12 | Business Development | 223 | 7.96 |
| bt-2 | Technology Devices | 212 | 7.57 |
| bt-10 | Market Predictions Recap | 201 | 7.18 |
| bt-4 | Comprehensive Learning Journey | 187 | 6.68 |
| bt-6 | Future of Work | 185 | 6.61 |
| bt-11 | Internet Discounts | 175 | 6.25 |
| bt-5 | Technological Urban Water Management | 172 | 6.14 |
| bt-9 | Electric Vehicle Technology | 145 | 5.18 |
| bt-8 | Programming Concepts | 116 | 4.14 |
| bt-13 | Quantum Technology Industries | 105 | 3.75 |
| bt-7 | High Definition Television (HDTV) | 36 | 1.29 |
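
The percent column appears to be each topic's share of all documents assigned to a topic. A minimal sketch of that arithmetic, using the sizes from the table above (the `percent` helper is hypothetical and not part of Bunka):

```python
# Sizes copied from the table above: topic_id -> number of documents
sizes = {
    "bt-1": 345, "bt-3": 243, "bt-14": 230, "bt-0": 225, "bt-12": 223,
    "bt-2": 212, "bt-10": 201, "bt-4": 187, "bt-6": 185, "bt-11": 175,
    "bt-5": 172, "bt-9": 145, "bt-8": 116, "bt-13": 105, "bt-7": 36,
}

total = sum(sizes.values())  # total number of documents across all topics

def percent(topic_id):
    """Share of documents assigned to a topic, rounded to two decimals."""
    return round(100 * sizes[topic_id] / total, 2)

print(total, percent("bt-1"), percent("bt-7"))
```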

Visualise Dimensions on topics

dataset = load_dataset("bunkalab/medium-sample-technology-tags")['train']
docs = list(dataset['title'])
ids = list(dataset['doc_id'])
tags = list(dataset['tags'])

metadata = {'tags':tags}

from bunkatopics import Bunka

bunka = Bunka()

# Fit Bunka to your text data
bunka.fit(docs=docs, ids=ids, metadata=metadata)
bunka.get_topics(n_clusters=10)
bunka.visualize_topics(color='tags', width=800, height=800) # Adjust the color
<img src="docs/images/bunka_color.png" width="70%" height="70%" align="center" />

Manually Cleaning the topics

If you are not happy with the resulting topics, you can change them manually. Click on Apply changes when you are done. In the example, we changed the topic Cryptocurrency Impact to Cryptocurrency and Internet Discounts to Advertising.

>>> bunka.manually_clean_topics()
<img src="docs/images/manually_change_topics.png" width="40%" height="20%" align="center" />

Removing Data based on topics for fine-tuning purposes

You have the flexibility to construct a customized dataset by excluding topics that do not align with your interests. For instance, in the provided example, we omitted topics associated with Advertising and High-Definition television, as these clusters primarily contain promotional content that we prefer not to include in our model's training data.

>>> bunka.clean_data_by_topics()
<img src="docs/images/fine_tuning_dataset.png" width="40%" height="20%" align="center" />
>>> bunka.df_cleaned_
| doc_id | content | topic_id | topic_name |
|---|---|---|---|
| 873ba315 | Invisibilize Data With JavaScript | bt-8 | Programming Concepts |
| 1243d58f | Why End-to-End Testing is Important for Your Team | bt-3 | Data Management Technologies |
| 45fb8166 | This Tiny Wearable Device Uses Your Body Heat... | bt-2 | Technology Devices |
| a122d1d2 | Digital Policy Salon: The Next Frontier | bt-0 | Digital Learning Campaign |
| 1bbcfc1c | Preparing Hardware for Outdoor Creative Technology Installations | bt-5 | Technological Urban Water Management |
| 79580c34 | Angular Or React ? | bt-8 | Programming Concepts |
| af0b08a2 | Ed-Tech Startups Are Cashing in on Parents’ Insecurities | bt-0 | Digital Learning Campaign |
| 2255c350 | Former Google CEO Wants to Create a Government-Funded University to Train A.I. Coders | bt-6 | Future of Work |
| d2bc4b33 | Applying Action & The Importance of Ideas | bt-12 | Business Development |
| 5219675e | Why You Should (not?) Use Signal | bt-2 | Technology Devices |
| ... | ... | ... | ... |
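
The effect of excluding topics can be pictured as a plain filter over (document, topic) rows. The sketch below applies that idea to a few rows in plain Python. It does not call Bunka, `drop_topics` is a hypothetical helper rather than the implementation of `clean_data_by_topics`, and the third row is made up so that the filter has something to remove.

```python
def drop_topics(rows, excluded_topic_names):
    """Keep only the rows whose topic is not in the excluded set."""
    excluded = set(excluded_topic_names)
    return [row for row in rows if row["topic_name"] not in excluded]

rows = [
    # The first two rows are copied from the table above
    {"doc_id": "873ba315", "content": "Invisibilize Data With JavaScript",
     "topic_id": "bt-8", "topic_name": "Programming Concepts"},
    {"doc_id": "79580c34", "content": "Angular Or React ?",
     "topic_id": "bt-8", "topic_name": "Programming Concepts"},
    # Made-up row, added only to demonstrate the exclusion
    {"doc_id": "hypothetical-1", "content": "Placeholder promotional title",
     "topic_id": "bt-11", "topic_name": "Advertising"},
]

cleaned = drop_topics(rows, ["Advertising", "High Definition Television (HDTV)"])
print([row["doc_id"] for row in cleaned])
```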

Bourdieu Map

The Bourdieu map provides a two-dimensional, unsupervised scale for visualizing texts. Each region of the map represents a distinct topic, characterized by its most specific terms. Clusters are formed, and their names are summarized using Generative AI.

The value of this visualization lies in the ability to define the axes yourself, creating continuums that reveal how the data is distributed. The concept draws inspiration from the work of the French sociologist Pierre Bourdieu, who used two-dimensional maps to project items and gain insights.
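
To make the notion of an axis concrete: one plausible way to place a document on a continuum is to compare its embedding with the embeddings of the two pole phrases. The sketch below is an assumption made for illustration, not Bunka's actual computation. It uses hand-written toy vectors instead of a real embedding model, and `axis_position` is a hypothetical helper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def axis_position(doc_vec, left_vec, right_vec):
    """Negative values lean toward the left pole, positive toward the right."""
    return cosine(doc_vec, right_vec) - cosine(doc_vec, left_vec)

# Toy 2-D "embeddings" (made up); a real model would produce these vectors
business = [1.0, 0.0]  # left pole, e.g. "This is about business"
politics = [0.0, 1.0]  # right pole, e.g. "This is about politics"
doc = [0.8, 0.6]       # a document closer to the business pole

x = axis_position(doc, business, politics)
print(x < 0)  # the document leans toward the left (business) pole
```

Repeating the same computation with a second pair of poles would give a vertical coordinate, so each document gets a position on the two-dimensional map.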

from langchain.llms import HuggingFaceHub

# Define the HuggingFaceHub instance with the repository ID and API token
llm = HuggingFaceHub(
    repo_id='mistralai/Mistral-7B-v0.1',
    huggingfacehub_api_token="HF_TOKEN"
)

## Bourdieu Fig
bourdieu_fig = bunka.visualize_bourdieu(
        llm=llm,
        x_left_words=["This is about business"],
        x_right_words=["This is about politics"],
        y_top_words=["this is about startups"],
        y_bottom_words=["This is about governments"],
        height=800,
        width=800,
        clustering=True,
        topic_n_clusters=10,
        density=False,
        convex_hull=True,
        radius_size=0.2,
        min_docs_per_cluster = 5, 
        label_size_ratio_clusters=80)
>>> bourdieu_fig.show()
Example maps: positive/negative vs humans/machines (Image 1), politics/business vs humans/machines (Image 2), politics/business vs positive/negative (Image 3), politics/business vs startups/governments (Image 4).

Saving and loading Bunka

bunka.save_bunka("bunka_dump")
...

from bunkatopics import Bunka
bunka = Bunka().load_bunka("bunka_dump")
>>> bunka.get_topics(n_clusters = 15)

Loading custom embeddings (Beta)

'''
ids = ['doc_1', 'doc_2'...., 'doc_n']
embeddings = [[0.05121125280857086,
  -0.03985324501991272,
  -0.05017390474677086,
  -0.03173152357339859,
  -0.07367539405822754,
  0.0331297293305397,
  -0.00685789855197072...]]

'''

pre_computed_embeddings = [{'doc_id': doc_id, 'embedding': embedding} for doc_id, embedding in zip(ids, embeddings)]
...

from bunkatopics import Bunka
bunka = Bunka()
bunka.fit(docs=docs, ids = ids, pre_computed_embeddings = pre_computed_embeddings)


from sklearn.cluster import KMeans
clustering_model = KMeans(n_clusters=15)
>>> bunka.get_topics(name_length=5, 
                    custom_clustering_model=clustering_model)# Specify the number of terms to describe each topic
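
Because each entry pairs a `doc_id` with its vector, it can help to check the inputs before calling `fit`. The sketch below builds the list in the format shown above and runs a few sanity checks. The toy ids and vectors are made up, the uniqueness check is an assumption about what is sensible rather than a documented requirement, and `build_pre_computed_embeddings` is a hypothetical helper, not part of Bunka.

```python
def build_pre_computed_embeddings(ids, embeddings):
    """Pair each doc id with its embedding, after basic consistency checks."""
    if len(ids) != len(embeddings):
        raise ValueError("ids and embeddings must have the same length")
    # Assumption: doc ids are expected to be unique
    if len(set(ids)) != len(ids):
        raise ValueError("doc ids must be unique")
    dims = {len(vec) for vec in embeddings}
    if len(dims) > 1:
        raise ValueError(f"embeddings have inconsistent dimensions: {sorted(dims)}")
    return [
        {"doc_id": doc_id, "embedding": list(vec)}
        for doc_id, vec in zip(ids, embeddings)
    ]

# Toy inputs (made up); real embeddings typically have hundreds of dimensions
ids = ["doc_1", "doc_2"]
embeddings = [[0.05, -0.04, -0.05], [0.03, -0.07, 0.01]]

pre_computed_embeddings = build_pre_computed_embeddings(ids, embeddings)
print(pre_computed_embeddings[0])
```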

Front-end (Beta)

This is a beta feature. First, clone the repository and install the package:

git clone https://github.com/charlesdedampierre/BunkaTopics.git
cd BunkaTopics
pip install -e .

cd web # go to the web directory
npm install # install the needed React packages

Then, in Python:

from bunkatopics import Bunka

from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer(model_name_or_path="all-MiniLM-L6-v2")

# Initialize Bunka with your chosen model
bunka = Bunka(embedding_model=embedding_model) 

# Fit Bunka to your text data
bunka.fit(docs)
bunka.get_topics(n_clusters=15, name_length=3) # Specify the number of terms to describe each topic
>>> bunka.start_server() # A server will open on your computer at http://localhost:3000/
<img src="docs/images/bunka_server.png" width="100%" height="100%" align="center" />