A Korean Version of Semantic Text Segmentation with Embedding
Given a text document d and a desired number of segments k, this repo shows how to segment the document into k semantically homogeneous segments.
The general approach is as follows:
- Convert the words in d into embedding vectors using the GloVe model.
- For every word sequence s in d, represent its average meaning (centroid) as the mean of the embeddings of all words in s.
- Measure the error of the centroid for a sequence s as the average cosine distance between the centroid and the embeddings of all words in s.
- Segment the document with a greedy heuristic that iteratively chooses the best split point p.
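The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this repo: the function names `segment_error` and `greedy_segment` are hypothetical, the input is assumed to be a NumPy array with one embedding per word, and scoring a split by the length-weighted error of the two resulting parts is an assumption, since the description above does not say how the "best" split point is scored.

```python
import numpy as np


def segment_error(vectors):
    """Average cosine distance between the centroid and each word vector."""
    centroid = vectors.mean(axis=0)
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(centroid)
    # Guard against division by zero for all-zero vectors or centroids.
    cos_sim = (vectors @ centroid) / np.maximum(norms, 1e-12)
    return float(np.mean(1.0 - cos_sim))


def weighted_error(vectors):
    """Assumption: weight the average error by segment length so that
    splitting off a single word is not trivially the best choice."""
    return len(vectors) * segment_error(vectors)


def greedy_segment(vectors, k):
    """Return the sorted split points that divide `vectors` into k segments."""
    n = len(vectors)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of words")
    boundaries = [0, n]
    while len(boundaries) - 1 < k:
        best_gain, best_p = None, None
        # Try every split point inside every current segment.
        for start, end in zip(boundaries[:-1], boundaries[1:]):
            if end - start < 2:
                continue
            base = weighted_error(vectors[start:end])
            for p in range(start + 1, end):
                cost = weighted_error(vectors[start:p]) + weighted_error(vectors[p:end])
                gain = base - cost
                if best_gain is None or gain > best_gain:
                    best_gain, best_p = gain, p
        boundaries.append(best_p)
        boundaries.sort()
    return boundaries[1:-1]


# Toy example: three "topics" of two words each, using made-up 3-d embeddings.
toy = np.array([[1, 0, 0], [1, 0, 0],
                [0, 1, 0], [0, 1, 0],
                [0, 0, 1], [0, 0, 1]], dtype=float)
print(greedy_segment(toy, 3))
```

Each round rescans every candidate split point, so this sketch favors readability over speed; a real implementation would likely cache segment errors.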
The class text_segmentation_class in text_segmetnation.py contains functions to convert the document words into GloVe embeddings and to choose the split points. The notebook semantc_text_segmentation_example.ipynb demonstrates how to use the class.
- Uses the GloVe vectors from word-embeddings.zip, provided at https://github.com/ratsgo/embedding/releases
- Converts GloVe to word2vec format using https://github.com/jroakes/glove-to-word2vec/blob/master/convert.py
- all_doc_tokens: the document tokenized into morphemes (using Okt.morphs())
- token_index: indices into all_doc_tokens
- doc_tokens: only the nouns among all_doc_tokens
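For readers curious about what the GloVe-to-word2vec step listed above involves: the word2vec text format is the GloVe text format with one extra header line giving the vocabulary size and the vector dimension. The sketch below illustrates that idea only; it is not the code of the linked convert.py, the function name `glove_to_word2vec` is hypothetical, and it assumes a space-separated GloVe text file with one token per line and no spaces inside tokens.

```python
def glove_to_word2vec(glove_path, word2vec_path):
    """Write a word2vec-format text file by prepending a "<count> <dim>" header.

    Returns (count, dim). Assumes a space-separated GloVe text file.
    """
    with open(glove_path, encoding="utf-8") as src:
        # Skip blank lines so they do not inflate the vocabulary count.
        lines = [line.rstrip("\n") for line in src if line.strip()]
    if not lines:
        raise ValueError("GloVe file is empty")
    # Each line is: token value_1 value_2 ... value_dim
    dim = len(lines[0].split(" ")) - 1
    with open(word2vec_path, "w", encoding="utf-8") as dst:
        dst.write(f"{len(lines)} {dim}\n")
        for line in lines:
            dst.write(line + "\n")
    return len(lines), dim
```

A file written this way should be loadable with gensim's `KeyedVectors.load_word2vec_format(path, binary=False)`, assuming gensim is installed.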