Commit 920b46e

add optional steps to load data and setup drive/colab (LSA typically starts on day 2)
1 parent 9d3a39b commit 920b46e

1 file changed

Lines changed: 22 additions & 0 deletions

File tree

episodes/06-lsa.md

@@ -67,10 +67,32 @@ LSA requires two steps- first we must create a TF-IDF matrix, which we have alre
Next, we will perform dimensional reduction using a technique called SVD.

### Worked Example: LSA

In case you are starting from a fresh notebook, you will need to (1) mount your Google Drive, (2) add the helper code to your path, and (3) load the data.csv file.

```python
# Run this cell to mount your Google Drive.
from google.colab import drive
drive.mount('/content/drive')

# Show existing Colab notebooks and the helpers.py file.
from os import listdir
wksp_dir = '/content/drive/My Drive/Colab Notebooks/text-analysis/code'
listdir(wksp_dir)

# Add the folder to Colab's path so we can import the helper functions.
import sys
sys.path.insert(0, wksp_dir)

# Read the data back in.
from pandas import read_csv
data = read_csv("/content/drive/My Drive/Colab Notebooks/text-analysis/data/data.csv")
```

Mathematically, these "latent semantic" dimensions are derived from our TF-IDF matrix, so let's begin there. From the previous lesson:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# max_df=.6 removes terms that appear in more than 60% of our documents
# (overly common words like "the", "a", "an"); min_df=.1 removes terms that
# appear in less than 10% of our documents (overly rare words like specific
# character names, typos, or punctuation the tokenizer doesn't understand).
vectorizer = TfidfVectorizer(input='filename', max_df=.6, min_df=.1)
tfidf = vectorizer.fit_transform(list(data["Lemma_File"]))
print(tfidf.shape)
```
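
As a preview of the dimensional reduction step named above, here is a minimal self-contained sketch using scikit-learn's `TruncatedSVD`, a standard way to perform LSA on a sparse TF-IDF matrix. The tiny toy corpus and the choice of 2 components are illustrative assumptions, not the lesson's data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Toy corpus standing in for the lesson's documents (hypothetical data).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
    "logs and mats are objects",
]

# Build a TF-IDF matrix; raw strings here, unlike the lesson's filename input.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# Reduce the TF-IDF matrix to 2 "latent semantic" dimensions via truncated SVD.
svd = TruncatedSVD(n_components=2, random_state=42)
lsa = svd.fit_transform(tfidf)

print(tfidf.shape)  # (4, number_of_terms)
print(lsa.shape)    # (4, 2): one row per document, one column per latent dimension
```

Each row of `lsa` is a document's coordinates in the reduced space, which is what the worked example goes on to visualize.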
