Natural Language Analysis
The QuantGov library includes a number of built-in utilities to perform
some natural language analysis on a corpus. These can be run using the
quantgov nlp set of commands. All nlp analyses are run at the document
level and output a CSV. By default, results are printed to standard
output, but an output file may be specified with the -o option.
For example, the command
quantgov nlp count_words corpora/mycorpus -o wordcount.csv
would produce a word count for the corpus contained in the folder
corpora/mycorpus and save it in the file wordcount.csv. The output
will include one column per index label, with the index label name as a
header, followed by the results of the analysis. The results of
analyses may therefore be merged with other data, using any data or
statistical analysis software, to create a combined dataset.
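As a rough illustration of such a merge, the sketch below joins a word-count table to a second table on a shared index column, using only Python's standard library. The column names (file, words, agency) and the sample rows are assumptions made for this example; the actual headers depend on the corpus's index labels and on the analysis that was run.

```python
import csv
import io

# Hypothetical output of a word-count run: one index column ("file")
# followed by the analysis result ("words"). Real headers will differ.
WORDCOUNT_CSV = """file,words
doc_a,1200
doc_b,450
"""

# Hypothetical metadata keyed on the same index column.
METADATA_CSV = """file,agency
doc_a,EPA
doc_b,DOT
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def merge_on(left, right, key):
    """Inner-join two lists of row dicts on a shared key column."""
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]


merged = merge_on(read_rows(WORDCOUNT_CSV), read_rows(METADATA_CSV), "file")
for row in merged:
    print(row["file"], row["words"], row["agency"])
```

With real output, the in-memory strings would be replaced by opening the saved CSV file, and the join key would be whichever index labels the corpus defines.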
Note that not all analyses may be aggregated above the document level; users should familiarize themselves with the metrics produced and use them appropriately.
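To see why aggregation needs care, consider an average such as sentence length. Averaging the document-level averages is not the same as averaging over every sentence in the corpus, unless each document is weighted by its sentence count. The figures below are made up for illustration, and the sketch assumes per-document sentence counts are available from some other source.

```python
# Made-up per-document figures: (number of sentences, average sentence length).
docs = [
    (10, 30.0),  # short document with long sentences
    (90, 10.0),  # long document with short sentences
]

# Naive aggregation: the unweighted mean of the document-level averages.
naive = sum(avg for _, avg in docs) / len(docs)

# Corpus-level average: total words divided by total sentences.
total_words = sum(n * avg for n, avg in docs)
total_sentences = sum(n for n, _ in docs)
pooled = total_words / total_sentences

print(naive)
print(pooled)
```

The two values differ because the naive mean gives the short document the same weight as the long one.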
- count_words: Count the number of words in a document. The
regular expression defining a word is
\b\w+\b by default; this may be overridden with the
- count_occurrences: Count the non-overlapping occurrences of a
list of words, phrases, or regular expressions. The longest items in
the list take precedence. A total column may be specified with the
- shannon_entropy: Calculate Shannon entropy for the document.
- conditional_counter: Count occurrences of the words and phrases “if”, “but”, “except”, “provided”, “when”, “where”, “whenever”, “unless”, “notwithstanding”, “in the event”, and “in no event”.
- sentence_length: Calculate average sentence length. Requires
- sentiment_analysis: Produce a quantification of polarity and
subjectivity for a document. Requires
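The sketch below shows one way the simpler metrics above could be computed with Python's standard library. It is not QuantGov's implementation. The function names mirror the command names for readability only, and several choices are assumptions of this sketch: terms are matched as literal strings, matching is case-insensitive, and Shannon entropy is computed over word frequencies in bits.

```python
import math
import re
from collections import Counter

WORD_PATTERN = re.compile(r"\b\w+\b")  # default word pattern given in the text

# Conditional terms listed for conditional_counter.
CONDITIONALS = [
    "if", "but", "except", "provided", "when", "where", "whenever",
    "unless", "notwithstanding", "in the event", "in no event",
]


def count_words(text, pattern=WORD_PATTERN):
    """Count matches of the word pattern in the text."""
    return len(pattern.findall(text))


def count_occurrences(text, terms):
    """Count non-overlapping matches of each term, longest terms first.

    Putting longer terms earlier in the alternation means that, where two
    terms could match at the same position, the longer one is counted.
    """
    terms = [t.lower() for t in terms]
    ordered = sorted(terms, key=len, reverse=True)
    combined = re.compile(
        "|".join(rf"\b{re.escape(t)}\b" for t in ordered), re.IGNORECASE
    )
    counts = {t: 0 for t in terms}
    for match in combined.finditer(text):
        counts[match.group(0).lower()] += 1
    return counts


def shannon_entropy(text, pattern=WORD_PATTERN):
    """Shannon entropy (in bits) of the word-frequency distribution."""
    words = [w.lower() for w in pattern.findall(text)]
    if not words:
        return 0.0
    total = len(words)
    freqs = Counter(words)
    return -sum((n / total) * math.log2(n / total) for n in freqs.values())


sample = "If it rains, the event is off, unless notified; in no event later."
conditional_counts = count_occurrences(sample, CONDITIONALS)
print(count_words(sample))
print(conditional_counts)
print(sum(conditional_counts.values()))  # one possible "total" column
print(shannon_entropy(sample))
```

In the sample, the standalone word "event" and the "if" inside "notified" are not counted as conditionals, because each term is anchored at word boundaries.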