The QuantGov Corpus

Within the QuantGov framework, a corpus is a set of documents together with a driver that serves those documents, each identified by a unique index value.

Basic Structure

A corpus represents a set of documents, and implements the corpus driver interface. The root directory for a corpus should contain a Python module named driver.py that defines the QuantGov corpus driver for that corpus. Additionally, a corpus can contain a file called metadata.csv that contains any additional information about individual documents that may be relevant. For example, the metadata.csv generated by the CFR corpus includes which agency and department authored each individual CFR part.

Starting a New Corpus

The fastest way to start a new corpus is to use the command quantgov start corpus NAME where NAME is the name for the corpus. This command will copy the skeleton corpus from https://github.com/quantgov/corpus. The skeleton corpus implements a RecursiveDirectoryCorpusDriver (see below), and expects a set of text files in a directory data/clean relative to the corpus base directory.

The Corpus Driver Interface

Each corpus should contain a Python module named driver.py. This driver serves two important functions. First, it specifies how the corpus should be indexed. An index is one or more values that, taken together, uniquely identify each document in the corpus. An index can be as simple as an id number, or it can be more descriptive. For the CFR corpus, each document, representing a single subdivision called a part, is identified by three pieces of metadata: the year of the CFR edition in which the part was printed, the title to which the part belongs, and the part number. Second, the driver streams the documents themselves for analysis.

The driver.py file should define a variable called driver which holds the driver object. The class of the driver object should be a subclass of the quantgov.corpus.CorpusDriver class, which implements a standard interface for identifying document labels and for streaming those documents for analysis. The quantgov.corpus module has several built-in driver classes which will cover many cases:

RecursiveDirectoryCorpusDriver

The RecursiveDirectoryCorpusDriver serves files stored in a directory (also called a folder) on the local computer. The driver is called “recursive” because it will enter into subdirectories recursively to serve all files found. The index labels, which are specified in the object constructor, should match how many levels of directories deep the corpus files are stored.

For example, as discussed above, the CFR corpus has three index levels: year, title, and part. To use the RecursiveDirectoryCorpusDriver, we would organize our files so that there is one directory for each year. Within each year directory, we would have one directory for each title that appears in that year. Then, within the title directories, we would have a text file holding each part. This means that the 1997 edition of Title 26 Part 1 would be stored in the file data/clean/1997/26/1.txt. The corpus driver for this circumstance would be as simple as:

import quantgov as qg

from pathlib import Path

BASE_DIR = Path(__file__).resolve().parent

driver = qg.corpus.RecursiveDirectoryCorpusDriver(
    directory=BASE_DIR.joinpath('data', 'clean'),
    index_labels=('year', 'title', 'part')
)
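
To make the mapping from directory levels to index values concrete, here is a minimal sketch that uses only the standard library. It is not QuantGov's implementation: the helper name iter_index_and_path is hypothetical, and the sketch assumes that every document sits exactly as many levels deep as there are index labels.

```python
import tempfile
from pathlib import Path


def iter_index_and_path(directory, index_labels):
    """Yield (index, path) pairs for files nested len(index_labels) deep.

    Hypothetical helper for illustration only; not part of quantgov.
    """
    directory = Path(directory)
    depth = len(index_labels)
    for path in sorted(directory.rglob('*')):
        if not path.is_file():
            continue
        # Path relative to the data directory, without the file extension
        parts = path.relative_to(directory).with_suffix('').parts
        if len(parts) == depth:
            yield dict(zip(index_labels, parts)), path


# Build a tiny throwaway corpus: data/clean/1997/26/1.txt
with tempfile.TemporaryDirectory() as tmp:
    clean = Path(tmp, 'data', 'clean')
    target = clean / '1997' / '26' / '1.txt'
    target.parent.mkdir(parents=True)
    target.write_text('Text of 26 CFR Part 1', encoding='utf-8')

    found = list(iter_index_and_path(clean, ('year', 'title', 'part')))
    indices = [index for index, _ in found]
```

Each directory name, and finally the file name without its extension, supplies one index value, which is why the number of labels has to match the nesting depth.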

NamePatternCorpusDriver

The NamePatternCorpusDriver uses a regular expression to specify the name pattern of files in a given folder. All files are expected to be in the specified folder, and the file names are expected to contain all the elements of the index.

Again using the CFR as an example, in this case we might have the 1997 edition of Title 26 Part 1 in the file data/clean/1997-26-1.txt. We would then specify each part of the index in named pattern groups. The resulting driver would look like this:

import quantgov as qg

from pathlib import Path

BASE_DIR = Path(__file__).resolve().parent

driver = qg.corpus.NamePatternCorpusDriver(
    pattern=r'(?P<year>\d+)-(?P<title>\d+)-(?P<part>\d+)',
    directory=BASE_DIR.joinpath('data', 'clean')
)
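
To see what the named groups in the pattern capture, the following sketch applies the same regular expression to a file name with Python's re module. The helper index_from_name is hypothetical and only illustrates the matching step; how the driver itself treats non-matching files is not shown here.

```python
import re

# Same pattern as in the driver above: one named group per index level
PATTERN = re.compile(r'(?P<year>\d+)-(?P<title>\d+)-(?P<part>\d+)')


def index_from_name(filename):
    """Return the index values encoded in a file name, or None if no match.

    Hypothetical helper for illustration only; not part of quantgov.
    """
    match = PATTERN.match(filename)
    if match is None:
        return None
    return match.groupdict()


index = index_from_name('1997-26-1.txt')   # year, title, and part captured
skipped = index_from_name('README.txt')    # name does not fit the pattern
```

The group names become the index level names, so they play the same role as index_labels in the RecursiveDirectoryCorpusDriver.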

IndexDriver

The IndexDriver simply reads a csv file where each row represents a document. The last column is assumed to be the path to the file containing that document. Any preceding columns are assumed to be the index. The first row is assumed to be a header, and the column headers for the index columns are used as the index level names.

For the CFR example, the index csv file might look like this:

year,title,part,path
1997,26,1,data/clean/97-26-01.txt
1997,26,2,data/clean/97-26-02.txt
1997,26,3,data/other/Y1997T26P02.txt

In a spreadsheet editor, the csv would look more like this:

year    title    part    path
1997    26       1       data/clean/97-26-01.txt
1997    26       2       data/clean/97-26-02.txt
1997    26       3       data/other/Y1997T26P02.txt

Note that no particular pattern is needed for the file path; they may be as esoteric as your data. Note also that the use of absolute paths in an index csv will likely hamper portability.

Assuming the file above is saved in data/index.csv, the driver.py file for this corpus would be as follows:

import quantgov as qg

from pathlib import Path

BASE_DIR = Path(__file__).resolve().parent

driver = qg.corpus.IndexDriver(index=BASE_DIR.joinpath('data', 'index.csv'))
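
The column convention described above (header row, index columns first, path last) can be sketched with the standard csv module. This is an illustration of the layout only, not the IndexDriver's own code, and the function name read_index is hypothetical.

```python
import csv
import io

# The example index from above, inlined so the sketch is self-contained
INDEX_CSV = """year,title,part,path
1997,26,1,data/clean/97-26-01.txt
1997,26,2,data/clean/97-26-02.txt
1997,26,3,data/other/Y1997T26P02.txt
"""


def read_index(handle):
    """Return (index_labels, rows), where each row is (index, path).

    Hypothetical helper for illustration only; not part of quantgov.
    """
    reader = csv.reader(handle)
    header = next(reader)                 # first row is the header
    index_labels = tuple(header[:-1])     # every column but the last
    rows = [(tuple(row[:-1]), row[-1]) for row in reader if row]
    return index_labels, rows


index_labels, rows = read_index(io.StringIO(INDEX_CSV))
```

Because the path is read from the last column rather than derived from the index, the files can live anywhere and follow any naming scheme.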

Custom Drivers

Users may also define their own driver classes. Custom drivers should subclass quantgov.corpus.structures.CorpusDriver and override the stream method. The stream method should generate quantgov.corpus.structures.Document objects, which hold both the index and the text of the document.

Using the Corpus to Stream Documents

To load a corpus, users should use the quantgov.load_driver function, which takes the path to a corpus as the argument and returns the driver defined in that corpus. The driver has two methods of importance to users:

  • CorpusDriver.stream generates Document objects representing each document in the corpus.
  • CorpusDriver.get_streamer returns a CorpusStreamer object wrapping the CorpusDriver.stream generator.

A Document object is a named tuple with two attributes: index, which contains the index, and text, which contains the text of the document.

A CorpusStreamer object is an iterator which can be used to iterate over the Document objects in the corpus, and which keeps track of how many documents have been served and the indices of all documents seen. The served indices can be accessed in the index attribute. The count of documents already streamed can be accessed in the documents_streamed attribute. Finally, the finished attribute indicates whether the corpus has finished streaming.
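
The bookkeeping described above can be illustrated without installing QuantGov. The sketch below is a stand-in, not the library's code: SketchStreamer and the stream function are hypothetical, the sample documents are made up, and only the attribute names (index, text, documents_streamed, finished) are taken from the description above.

```python
from collections import namedtuple

# Stand-in for quantgov's Document: a named tuple with index and text
Document = namedtuple('Document', ['index', 'text'])


class SketchStreamer:
    """Hypothetical stand-in that mimics the bookkeeping described above."""

    def __init__(self, documents):
        self._documents = iter(documents)
        self.index = []               # indices of documents served so far
        self.documents_streamed = 0   # count of documents served so far
        self.finished = False         # True once the source is exhausted

    def __iter__(self):
        return self

    def __next__(self):
        try:
            document = next(self._documents)
        except StopIteration:
            self.finished = True
            raise
        self.index.append(document.index)
        self.documents_streamed += 1
        return document


def stream():
    """Generate a few made-up documents, as a driver's stream method would."""
    yield Document(index=('1997', '26', '1'), text='first part')
    yield Document(index=('1997', '26', '2'), text='second part')


streamer = SketchStreamer(stream())
word_counts = [len(doc.text.split()) for doc in streamer]
```

After the loop, the streamer still holds the indices in the order the documents were served, which makes it possible to line up any per-document results (here, word counts) with the documents they came from.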