Scripts developed for the paper "Qui murmure à l'oreille des député·es" (currently under review). Most scripts are designed to run on a GPU with 20 GB of memory. You may not be able to reproduce some of the steps below (in particular the installation of cuML) on a CPU-only machine.
Many functions and variables used in several parts of the code are defined in utils.py; for example, the text preprocessing is defined there.
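As an illustration only (the authoritative steps live in utils.py and may differ), tweet preprocessing for this kind of pipeline typically strips retweet markers, URLs, and mentions. A hypothetical sketch:

```python
import re

def preprocess(text: str) -> str:
    """Hypothetical tweet cleaning; the actual steps are defined in utils.py."""
    text = re.sub(r"^RT\s+", "", text)        # drop a leading retweet marker
    text = re.sub(r"https?://\S+", "", text)  # strip URLs
    text = re.sub(r"@\w+", "", text)          # strip user mentions
    return re.sub(r"\s+", " ", text).strip()  # normalize whitespace

print(preprocess("RT @UEFrance: 🆕 Estimation du taux d'inflation https://t.co/xyz"))
```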
- Clone this repository.
- Install dependencies (recommended Python version: 3.12.4):

```
cd reproduction_wlwf
pip install -r requirements.txt
```

- Install cuML with the correct specification depending on your CUDA version (the command below is for CUDA 12):
```
pip install --extra-index-url=https://pypi.nvidia.com "cudf-cu12==26.4.*" "dask-cudf-cu12==26.4.*" "cuml-cu12==26.4.*" "cugraph-cu12==26.4.*" "nx-cugraph-cu12==26.4.*" "cuxfilter-cu12==26.4.*" "cucim-cu12==26.4.*" "pylibraft-cu12==26.4.*" "raft-dask-cu12==26.4.*" "cuvs-cu12==26.4.*"
```

The data_source folder should be organized as follows:

```
data_source
├── congress
│   ├── lr
│   │   ├── 2022-06-20.csv
│   │   ├── 2022-06-21.csv
│   │   ├── 2022-06-22.csv
│   │   ...
│   ├── majority
│   │   ├── 2022-06-20.csv
│   │   ├── 2022-06-21.csv
│   │   ├── 2022-06-22.csv
│   │   ...
│   ├── nupes
│   │   ├── 2022-06-20.csv
│   │   ├── 2022-06-21.csv
│   │   ├── 2022-06-22.csv
│   │   ...
│   └── rn
│       ├── 2022-06-20.csv
│       ├── 2022-06-21.csv
│       ├── 2022-06-22.csv
│       ...
├── media
│   ├── 2022-06-20.csv
│   ├── 2022-06-21.csv
│   ├── 2022-06-22.csv
│   ...
├── supporter
│   ├── lr
│   │   ├── 2022-06-20.csv
│   │   ├── 2022-06-21.csv
│   │   ├── 2022-06-22.csv
│   │   ...
│   ...
...
```
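To catch layout mistakes early, a small sanity check can verify that the expected subfolders exist before running the pipeline. This helper is illustrative only and not part of the repository; the paths are taken from the tree above:

```python
from pathlib import Path

def check_layout(root: str) -> list[str]:
    """Return the expected data_source subfolders that are missing under root."""
    expected = [
        "data_source/congress/lr",
        "data_source/congress/majority",
        "data_source/congress/nupes",
        "data_source/congress/rn",
        "data_source/media",
        "data_source/supporter/lr",
    ]
    return [p for p in expected if not (Path(root) / p).is_dir()]

missing = check_layout(".")
if missing:
    print("missing folders:", missing)
```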
The csv files should have the following columns: id, text, user_id, retweeted_id, to_userid (if the tweet is a reply, the id of the author of the tweet replied to), and to_tweetid (if the tweet is a reply, the id of the tweet replied to).
| text | id | retweeted_id | user_id | to_userid | to_tweetid |
|---|---|---|---|---|---|
| RT @UEFrance: 🆕 Estimation du taux d'infla… | 1587218214638985216 | 1587030788331159553 | 347374931 | | |
| RT @UEFrance: 🆕 Estimation du taux d'infla… | 1587218515227992065 | 1587030788331159553 | 2250213234 | | |
| @bertrand149 @alma_dufour Exact. 7 Mds ça f… | 1587218672078159873 | | 328510700 | 1430392861 | 1587207053017288705 |
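Before launching the pipeline, you can check that a day file matches this schema. A minimal sketch using only the standard library (the check_columns helper is illustrative, not part of the repository; the column set comes from the description above):

```python
import csv
import io

EXPECTED_COLUMNS = {"id", "text", "user_id", "retweeted_id", "to_userid", "to_tweetid"}

def check_columns(f) -> bool:
    """Return True if the open csv file exposes exactly the expected columns."""
    reader = csv.DictReader(f)
    return set(reader.fieldnames or []) == EXPECTED_COLUMNS

# In-memory sample standing in for e.g. data_source/congress/lr/2022-06-20.csv
sample = io.StringIO(
    "id,text,user_id,retweeted_id,to_userid,to_tweetid\n"
    "1587218672078159873,@bertrand149 Exact.,328510700,,1430392861,1587207053017288705\n"
)
print(check_columns(sample))  # → True
```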
Example to encode congress data:

```
python 01_encode_with_sbert.py congress
```

You can choose a group among the following categories: congress, attentive, media, supporter.
Example to encode congress data from another location where you have stored the data_source folder:

```
python 01_encode_with_sbert.py congress --origin_path /distant_store/reproduction_wlwf
```

--origin_path defaults to the current repository, but you can also point it to another root for your file tree. Be careful to respect the structure of files and folders within this repository.

NB: If you are using Windows, use "\" instead of "/" in your paths.
Example to run the script from another location where you have stored the data_source folder:

```
python 02_run_umap.py --origin_path /distant_store/reproduction_wlwf/
```

Example to run the script for congress and media:
```
python 03_run_bertopic.py --origin_path /distant_store/reproduction_wlwf/ --public congress,media
```

--origin_path has the same function as in the 01_encode_with_sbert.py script; be careful to keep the same origin_path between the two scripts. --public lets you choose the group(s) used to run the model (by default, all groups are included). You can choose among the following publics: congress, attentive, media, supporter. Several groups can be listed, separated by commas.
NB: If you are using Windows, use "\" instead of "/" in your paths.
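For reference, here is a sketch of how a comma-separated --public option can be parsed and validated. This is a hypothetical helper; the actual argument handling in 03_run_bertopic.py may differ:

```python
import argparse

VALID_PUBLICS = {"congress", "attentive", "media", "supporter"}

def parse_publics(value: str) -> list[str]:
    """Split a comma-separated list of publics and reject unknown names."""
    publics = [p.strip() for p in value.split(",") if p.strip()]
    unknown = set(publics) - VALID_PUBLICS
    if unknown:
        raise argparse.ArgumentTypeError(f"unknown public(s): {sorted(unknown)}")
    return publics

parser = argparse.ArgumentParser()
parser.add_argument("--public", type=parse_publics, default=sorted(VALID_PUBLICS))
args = parser.parse_args(["--public", "congress,media"])
print(args.public)  # → ['congress', 'media']
```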
This script produces 3 types of outputs:
- time series (one file per topic), located in data_prod/dashboard/bertopic/data/
- keywords associated with each topic (one file per topic), located in data_prod/dashboard/bertopic/img/
- representative tweets (one file per public), located in data_prod/dashboard/bertopic/representative_docs...
First, run this script to create a file with the correct format for the VAR model script:

```
python 04_structure_data_for_VAR.py
```

This generates a file located at data_prod/dashboard/general_TS.csv.
To run the VAR model, use the script with the following options:
- tests: generates data_prod/var/issue-level/infos_topics.csv. For each topic, this file reports the stationarity type (stationary for each group, stationary in difference for each group, or a different type of stationarity between groups), the AIC/HQ/FPE/SC selection criteria, and the minimum number of lags necessary to avoid autocorrelation of the residuals. This file must exist to use the estimate option.
- estimate: estimates the VAR model for each topic and computes the GIRFs. For each topic, it generates a VAR model parameter file located at data_prod/var/issue-level/var_model_{topic_number}.Rdata and a GIRF file data_prod/var/issue-level/var_girf_topic_{topic_number}.Rdata. The VAR model files must exist to use the tests_post option.
- tests_post: generates data_prod/var/issue-level/post_checks.csv. For each topic, it reports the maximum of the absolute values of the roots (stability check), whether the absence of autocorrelation of the residuals is assessed (Portmanteau test), and whether the normality of the residuals is accepted (multivariate Jarque-Bera test).
- number_irf: the number of days of cumulated GIRF used for some outputs (default: 40).
If the estimation files exist, this script generates data_prod/var/irf_data.csv, which contains the cumulated GIRF data for the number of days chosen with the number_irf option.
Here is an example generating cumulated IRFs for 30 days with all options:

```
Rscript 05-var-models.r --tests --estimate --tests_post --number_irf 30
```

To build the dashboard, run:

```
python 06_dashboard.py
```

This command will create one html page per topic, and a general index page in the docs folder.
Once the website is created, you can serve it using the following command:

```
python -m http.server -d docs
```

The website will then be visible in your browser at http://127.0.0.1:8000/