Building Your First Dataset
With a fully built QuantGov corpus and an estimator, it is now possible
to build a full set of analyses into a dataset. Generally that will mean
running a number of natural language analyses as well as a number of
machine learning analyses.
Running NLP Analyses
Let’s start by running two built-in QuantGov analyses: the standard word count and the conditional count, which gives one measure of a document's complexity. Run these two commands:
quantgov nlp count_words corpus-fr-2016 -o wordcount.csv
quantgov nlp count_conditionals corpus-fr-2016 -o conditionals.csv
Running Machine Learning Analyses
Now we are ready to add in the machine learning estimates from the
estimator we trained earlier. Start by copying the
is_world_classifier.qge file from the estimator
into the directory where you created the corpora and estimators.
Now run the following command:
quantgov ml estimate is_world_classifier.qge corpus-fr-2016 -o is_world.csv
Open the resulting
is_world.csv in a spreadsheet editor or statistical
package, and you will see the familiar QuantGov analysis format: one
column for each index level, and another for the analysis value itself.
In this case, the results are True and False values because we trained a
binary classification model.
In this particular case, we can also see how well our classifier did because the true value is right there in the first level of the index. In my results, the classifier performed decently well out-of-the-box on standard metrics: 93% accuracy, with a precision of .84, a recall of .7, and an F1 score of .77; in practice, however, it would generally be appropriate to further customize the estimator before training to improve those scores even more.
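As a rough sketch of how such scores can be computed, the snippet below derives accuracy, precision, recall, and F1 from paired true and predicted labels using only the standard library. The column names (`section`, `document`, `is_world`) and the sample rows are assumptions made for illustration; the actual headers in is_world.csv depend on your corpus index and may differ.

```python
import csv
import io

# Hypothetical sample in the "index levels + value" layout described above.
# Column names and rows are illustrative, not real QuantGov output.
SAMPLE = """section,document,is_world
world,doc-1,True
world,doc-2,True
world,doc-3,False
other,doc-4,True
other,doc-5,False
other,doc-6,False
"""


def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for boolean labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
# The true label comes from the first index level; the prediction is the value.
y_true = [row["section"] == "world" for row in rows]
y_pred = [row["is_world"] == "True" for row in rows]
metrics = classification_metrics(y_true, y_pred)
```

To apply this to real results, replace the inline sample with the contents of is_world.csv and adjust the column names to match its header.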
QuantGov can also produce probability estimates instead of
classifications, by adding the
--probability flag. Run the following command:
quantgov ml estimate is_world_classifier.qge corpus-fr-2016 --probability -o is_world_prob.csv
Open the resulting
is_world_prob.csv and you will see that instead of
True and False values, the estimates are probabilities of belonging to
the “world” section of the Federal Register. Sorting these results by
the probability shows that the documents with a very high probability
tend to actually be “world” section documents, while those with a very
low probability generally are not—exactly what we would want.
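The sorting step described here can be sketched in a few lines of standard-library Python. The column names and sample values are hypothetical, and the 0.5 cut-off is only an illustrative threshold, not necessarily the one QuantGov applies when it produces classifications.

```python
import csv
import io

# Hypothetical probability output; real column names may differ.
SAMPLE_PROB = """section,document,is_world_prob
world,doc-1,0.97
other,doc-2,0.03
world,doc-3,0.41
other,doc-4,0.62
"""


def rank_by_probability(csv_text, prob_column):
    """Return rows sorted from highest to lowest probability."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return sorted(rows, key=lambda r: float(r[prob_column]), reverse=True)


def classify(rows, prob_column, threshold=0.5):
    """Turn probabilities into True/False labels at the given cut-off."""
    return [float(r[prob_column]) >= threshold for r in rows]


ranked = rank_by_probability(SAMPLE_PROB, "is_world_prob")
labels = classify(ranked, "is_world_prob", threshold=0.5)
```

One reason to keep the probabilities is that the cut-off becomes a choice: raising the threshold trades recall for precision, and lowering it does the reverse.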
Combining into a Single Dataset
We now have four pieces of data: word count, conditional count, is-world classification, and is-world probability. They could certainly be circulated separately, but QuantGov analyses are also designed to be easily merged together using the index.
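To make the idea of merging on the index concrete, here is a dependency-free sketch of an inner join on shared index columns. The index column names (`section`, `document`), the value column names, and the sample values are assumptions for illustration only.

```python
import csv
import io

# Two hypothetical analysis files that share the same index columns.
WORDS = """section,document,words
world,doc-1,1200
other,doc-2,450
"""
CONDITIONALS = """section,document,conditionals
world,doc-1,14
other,doc-2,3
"""


def merge_on_index(left_text, right_text, index_columns):
    """Inner-join two CSV texts on the given index columns."""
    left = list(csv.DictReader(io.StringIO(left_text)))
    right = csv.DictReader(io.StringIO(right_text))
    # Map each index tuple in the right-hand file to its row.
    lookup = {tuple(r[c] for c in index_columns): r for r in right}
    merged = []
    for row in left:
        key = tuple(row[c] for c in index_columns)
        if key in lookup:  # keep only rows present in both files
            merged.append({**row, **lookup[key]})
    return merged


merged = merge_on_index(WORDS, CONDITIONALS, ["section", "document"])
```

Each merged row carries the index columns once, followed by one value column per analysis, which is the shape a combined dataset needs.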
For example, create a file called
combine_datasets.py in the folder
with your analysis results with these contents:
import argparse

import pandas as pd


def parse_args():
    parser = argparse.ArgumentParser()
    # Each positional argument is a CSV path, read into a DataFrame on parsing
    parser.add_argument('dataset', nargs='+', type=pd.read_csv)
    parser.add_argument('-o', '--outfile', required=True)
    return parser.parse_args()


def main():
    args = parse_args()
    # Start from the first dataset and merge each remaining one into it;
    # merge() joins on the columns the frames share (normally the index levels)
    results = args.dataset[0]
    for df in args.dataset[1:]:
        results = results.merge(df)
    results.to_csv(args.outfile, index=False)


if __name__ == "__main__":
    main()
Now in that folder, run the command:
python combine_datasets.py -o fr2016_isworld.csv wordcount.csv conditionals.csv is_world.csv is_world_prob.csv
The fr2016_isworld.csv file will have all the results
generated, ready for further analysis and distribution. Similar scripts
or boilerplate are simple to produce for most statistical analysis