The QuantGov Corpus
Within the QuantGov framework, a corpus is a set of documents together with a driver that can serve those documents, each identified by a unique index value.
A corpus implements the corpus driver interface. The root directory
for a corpus should contain a Python module, driver.py, that defines
the QuantGov corpus driver for that corpus. Additionally, a corpus can
contain a file called metadata.csv that holds any additional
information about individual documents that may be relevant. For
example, the metadata.csv generated by the CFR corpus records which
agency and department authored each individual CFR part.
Starting a New Corpus
The fastest way to start a new corpus is to use the command
quantgov start corpus NAME where NAME is the name for the corpus. This
command will copy the skeleton corpus from
https://github.com/quantgov/corpus. The skeleton corpus implements a
RecursiveDirectoryCorpusDriver (see below), and expects a set of text
files in a directory
data/clean relative to the corpus base directory.
The Corpus Driver Interface
Each corpus should contain a Python module named driver.py. The
driver serves two important functions. First, it specifies how the corpus
should be indexed. An index is one or more values that, taken together,
uniquely identify each document in the corpus. An index can be as simple
as an id number, or it can be more descriptive. For the CFR corpus, each
document, representing a single subdivision called a part, is
identified by three pieces of metadata: the year of the CFR edition in
which the part was printed, the title to which the part belongs, and the
part number. Second, the driver serves the documents themselves for
analysis. The driver.py file should define a variable called driver that
holds the driver object. The class of the driver object should be a
subclass of the
quantgov.corpus.CorpusDriver class, which implements a
standard interface for identifying document labels and for streaming
those documents for analysis. The
quantgov.corpus module has several
built-in driver classes which will cover many cases:
RecursiveDirectoryCorpusDriver serves files stored in a directory
(also called a folder) on the local computer. The driver is called
“recursive” because it will enter into subdirectories recursively to
serve all files found. The index labels, which are specified in the
object constructor, should match how many levels of directories deep the
corpus files are stored.
For example, as discussed above, the CFR corpus has three index levels:
year, title, and part. To use the RecursiveDirectoryCorpusDriver, we
would organize our files so that there was one directory for each year.
Within each year directory, we would have one folder for each title that
appears in that year. Then, within the title directories, we would have
a text file holding each part. This means that the file for the 1997
edition of Title 26 Part 1 would sit in the Title 26 folder inside the
1997 directory. The corpus driver for this arrangement would be as
simple as:
    import quantgov as qg
    from pathlib import Path

    # Resolve paths relative to the corpus base directory
    BASE_DIR = Path(__file__).resolve().parent

    # One index label per directory level: year/title/part
    driver = qg.corpus.RecursiveDirectoryCorpusDriver(
        directory=BASE_DIR.joinpath('data', 'clean'),
        index_labels=('year', 'title', 'part')
    )
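To make the link between directory depth and index labels concrete, here is a minimal standard-library sketch of the idea. It is not QuantGov's implementation: the helper names iter_index_and_path, build_demo_tree, and demo, and the file names used in the demo tree, are hypothetical and chosen only for illustration.

```python
import tempfile
from pathlib import Path


def iter_index_and_path(directory, index_labels):
    """Yield (index, path) pairs for files nested len(index_labels) levels deep.

    Illustrative only; this is an assumption about the general idea,
    not how QuantGov's driver is actually written.
    """
    directory = Path(directory)
    depth = len(index_labels)
    for path in sorted(directory.rglob('*')):
        if not path.is_file():
            continue
        # Directory names plus the file stem supply the index values
        parts = path.relative_to(directory).with_suffix('').parts
        if len(parts) == depth:
            yield dict(zip(index_labels, parts)), path


def build_demo_tree(base):
    """Create a tiny year/title/part tree with hypothetical file names."""
    for year, title, part in [('1997', '26', '1'),
                              ('1997', '26', '2'),
                              ('1998', '26', '1')]:
        folder = Path(base, year, title)
        folder.mkdir(parents=True, exist_ok=True)
        folder.joinpath(part + '.txt').write_text('text of part ' + part)


def demo():
    """Build the demo tree in a temporary directory and list its indices."""
    with tempfile.TemporaryDirectory() as tmp:
        build_demo_tree(tmp)
        labels = ('year', 'title', 'part')
        return [index for index, _ in iter_index_and_path(tmp, labels)]
```

Calling demo() returns one dictionary per file, keyed by the three index labels, which mirrors why the number of labels should match the number of directory levels.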
NamePatternCorpusDriver uses a regular expression to specify the
name pattern of files in a given folder. All files are expected to be in
the specified folder, and the file names are expected to contain all the
elements of the index.
Again using the CFR as an example, in this case we might store the 1997
edition of Title 26 Part 1 in a single file whose name contains the
year, title, and part. We would then specify each part of the index as a
named group in the pattern.
The resulting driver would look like this:
    import quantgov as qg
    from pathlib import Path

    # Resolve paths relative to the corpus base directory
    BASE_DIR = Path(__file__).resolve().parent

    # Each named group in the pattern becomes one index level
    driver = qg.corpus.NamePatternCorpusDriver(
        pattern=r'(?P<year>\d+)-(?P<title>\d+)-(?P<part>\d+)',
        directory=BASE_DIR.joinpath('data', 'clean')
    )
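The way named groups turn a file name into index values can be checked with Python's re module alone. The sketch below reuses the pattern from the driver above; the helper index_from_name and the file name 1997-26-1.txt are hypothetical, and whether QuantGov matches against the full name or only the stem is not something this sketch assumes.

```python
import re

# Same pattern as in the driver definition: one named group per index level
PATTERN = re.compile(r'(?P<year>\d+)-(?P<title>\d+)-(?P<part>\d+)')


def index_from_name(file_name):
    """Return the index values encoded in a file name, or None if no match."""
    match = PATTERN.match(file_name)
    return match.groupdict() if match else None


# Hypothetical file name that follows the year-title-part convention
example = index_from_name('1997-26-1.txt')
```

A name that does not follow the convention yields None here, which suggests why every file in the folder is expected to carry all elements of the index.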
IndexDriver simply reads a csv file where each row represents a
document. The last column is assumed to be the path to the file
containing that document. Any preceding columns are assumed to be the
index. The first row is assumed to be a header, and the column headers
for the index columns are used as the index level names.
For the CFR example, the index csv file might look like this:
    year,title,part,path
    1997,26,1,data/clean/97-26-01.txt
    1997,26,2,data/clean/97-26-02.txt
    1997,26,3,data/other/Y1997T26P02.txt
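In a spreadsheet editor, the csv would look more like this:

    year   title   part   path
    1997   26      1      data/clean/97-26-01.txt
    1997   26      2      data/clean/97-26-02.txt
    1997   26      3      data/other/Y1997T26P02.txt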
Note that no particular pattern is needed for the file paths; they may be as esoteric as your data. Note also that the use of absolute paths in an index csv will likely hamper portability.
Assuming the file above is saved as index.csv in the corpus's data
directory, the driver.py file for this corpus would be as follows:
    import quantgov as qg
    from pathlib import Path

    # Resolve paths relative to the corpus base directory
    BASE_DIR = Path(__file__).resolve().parent

    driver = qg.corpus.IndexDriver(index=BASE_DIR.joinpath('data', 'index.csv'))
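The column convention described above (header row, index columns first, path last) can be illustrated with the standard csv module. This is only a sketch of the layout rule, not QuantGov's own parsing code; the function name read_index is hypothetical, and the CSV text is the example index shown earlier.

```python
import csv
import io

# The example index from above, held in memory for the demonstration
INDEX_CSV = """\
year,title,part,path
1997,26,1,data/clean/97-26-01.txt
1997,26,2,data/clean/97-26-02.txt
1997,26,3,data/other/Y1997T26P02.txt
"""


def read_index(handle):
    """Split an index csv into index labels and (index, path) rows."""
    reader = csv.reader(handle)
    header = next(reader)              # first row is the header
    index_labels = tuple(header[:-1])  # every column except the last
    rows = [(tuple(row[:-1]), row[-1]) for row in reader if row]
    return index_labels, rows


labels, rows = read_index(io.StringIO(INDEX_CSV))
```

Here labels holds the index level names taken from the header, and each entry in rows pairs an index tuple with the path from the last column.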
Users may also define their own driver classes. Custom drivers should
subclass quantgov.corpus.structures.CorpusDriver and override the
stream method. The stream method should generate
quantgov.corpus.structures.Document objects, which hold both the index
and the text of the document.
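The shape of a custom stream method can be sketched without importing QuantGov at all. In the example below, Document is a local stand-in that only assumes the two fields described in this section (index and text), and InMemoryDriver is a hypothetical class that does not subclass the real CorpusDriver; a real custom driver would do so and might need to satisfy further requirements not covered here.

```python
from collections import namedtuple

# Local stand-in for quantgov.corpus.structures.Document
# (assumed shape: an index plus the document text)
Document = namedtuple('Document', ['index', 'text'])


class InMemoryDriver:
    """Hypothetical driver-like class that serves documents from a dict.

    It mirrors only the stream method; it is not a QuantGov class.
    """

    def __init__(self, index_labels, documents):
        self.index_labels = tuple(index_labels)
        self.documents = dict(documents)  # maps index tuple -> text

    def stream(self):
        """Generate one Document per stored text, in index order."""
        for index, text in sorted(self.documents.items()):
            yield Document(index, text)


demo_driver = InMemoryDriver(
    ('year', 'title', 'part'),
    {('1997', '26', '2'): 'second part', ('1997', '26', '1'): 'first part'},
)
demo_documents = list(demo_driver.stream())
```

Writing stream as a generator keeps only one document in memory at a time, which matters when a corpus is too large to load at once.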
Using the Corpus to Stream Documents
To load a corpus, users should use the QuantGov driver-loading function,
which takes the path to a corpus as the argument and returns the driver
defined in that corpus. The driver has two methods of importance to
users:
- CorpusDriver.stream generates Document objects representing each
  document in the corpus.
- CorpusDriver.get_streamer returns a CorpusStreamer object wrapping the
  stream of documents.
The Document object is a named tuple with two attributes: index, which
contains the index, and text, which contains the text of the document.
The CorpusStreamer object is an iterator which can be used to iterate
over the Document objects in the corpus, and which keeps track of how
many Documents have been served and the indices of all documents seen.
The served indices can be accessed in the index attribute. The count
of documents already streamed can be accessed in the
documents_streamed attribute. Finally, a further attribute
indicates whether the corpus has finished streaming.
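The bookkeeping a streamer performs can be sketched as a small wrapper around any iterable of documents. This is a stand-in written for illustration, not QuantGov's CorpusStreamer: the class name TrackingStreamer is hypothetical, Document is a local named tuple, and the attribute name finished is an assumption, while index and documents_streamed follow the names given in this section.

```python
from collections import namedtuple

# Local stand-in for the Document named tuple (index, text)
Document = namedtuple('Document', ['index', 'text'])


class TrackingStreamer:
    """Hypothetical iterator that records what it has served so far."""

    def __init__(self, documents):
        self._iterator = iter(documents)
        self.index = []              # indices of all documents seen
        self.documents_streamed = 0  # count of documents served
        self.finished = False        # assumed name for the completion flag

    def __iter__(self):
        return self

    def __next__(self):
        try:
            document = next(self._iterator)
        except StopIteration:
            # Mark completion before signalling the end of iteration
            self.finished = True
            raise
        self.index.append(document.index)
        self.documents_streamed += 1
        return document


streamer = TrackingStreamer([
    Document(('1997', '26', '1'), 'first part'),
    Document(('1997', '26', '2'), 'second part'),
])
texts = [document.text for document in streamer]
```

Because the indices are recorded as documents pass through, an analysis can consume only the text and still line its results up with the right documents afterwards.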