Excel

I used Microsoft Excel for the bulk of my data joining and filtering between the three worksheets (across two datasets) that I had. For readers who wish to replicate this process, please note that Excel is oftentimes not the best choice for historical data manipulation. The tool, however, is sufficient if the user is particularly proficient in Excel functions and the dataset is labelled with unique identifiers.

Linking Rig and Custom House Information

The 'rig' and 'custom' house fields exist in a separate worksheet related to the main crew list entries via the 'crew list unique ID'. Joining the first two worksheets was as simple as using the following matching function:

=VLOOKUP(crewID,'Crew lists'!$A$2:$O$1000000,4)

The above function, used to join the 'rig' field to my main worksheet, has three inputs. The first input is the 'crew list unique ID. The second is the entire table that the function will be searching for, which is the second worksheet that contains our 'rig' and 'custom house' information. The last specifies the column number of the desired information to be linked; column 4 for 'rig' and 11 for 'custom house'. 

Since each cell now links to another cell, that is the value isn't hardcoded but only pointing to another value, the excel file slowed down quite significantly with the over 300,000 links made. After completing this process, I saved my file as CSV using Excel's 'Save As' tool to encode the links as hardcoded text and speed up the file for future processing. 

Filtering for New Bedford Voyages Only

Since all the 'custom house' fields were correctly and consistently populated, I easily used the Excel filter function to limit my dataset to voyages who recorded lay at the New Bedford customs house. Recall that this limitation of scope was done intentionally as the original authors of the data have noted that records are most accurate and complete for for 19th century voyages registered in New Bedford.

Expanding Acronyms for Ship Rig

To complete the data manipulation of the first file, I also needed to expand the acronyms for the ship's rig, or ship's configuration, accoridng to some legend. I found the legend for acronyms to full names here and was easily able to copy it into excel as the HTML code was formatted to display the data tabularly. Linking the acronym code to its full text value was as easy as running a VLOOKUP on the data as before:

=VLOOKUP(Rig,Legend!A$1:B$16,2)

Once more, I saved the file as a CSV to encode the links as text. 

Joining Datasets

Using the aforementioned VLOOKUP logic, using 'voyage ID' as the linking variable, I was able to easily link all the relevant voyage and vessel information (tonnage, build date, departure year, arrival year, bone/oil/sperm, and master) to my main dataset. Once more, saving the file as a CSV encoded my text leaving me with one CSV file to manipulate, clean, and later on, analyze. 

Extrapolate Voyage Length

I wanted to add my own datapoint of 'voyage length' into the dataset as I found that 'departure year' and 'arrival year' fields were largely complete for the data, and so, an additional datapoint could be helpful later on when visualizing. In order to do so, I used excel logic to subtract arrival year from departure year, using the placeholder '<1' if those two years were the same:

=IFERROR(YearIn-YearOut,"<1")