The 17 volumes of Massachusetts Soldiers and Sailors of the Revolutionary War, a Compilation from the Archives are a bit of a beast to search through.

Follow this post to receive updates, and please comment with any issues. Also, see the list of updates at the end of this post.

I did a little work this weekend to bypass all of that. I call this a cleaned up and combined text of all 17 Volumes. The raw text was taken from:
https://archive.org/details/massachusettssol00mass/page/134/mode/2up (this URL is for volume 1).

The text view was copied into files for each volume. I processed those files in Python to remove arbitrary end of lines, thus concatenating the text for each soldier. My goal was to be able to do wildcard searches per soldier such that I could find matches between company commanders and regiment commanders. For instance, searching for “ilder.*Doo” would find men who were in Capt. Abel Wilder’s company, Col. Ephraim Doolittle’s regt. That syntax is for Notepad++. I got tired of doing that manually in each of the 17 volumes. The combined text also removes the need to search across volumes for sound-alike names such as Ayres and Eyres.
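The same wildcard search can be sketched in Python. This is only an illustration, assuming the combined text holds one soldier's entry per line; the function name `find_entries` and the sample entries are made up for the example and are not quoted from the volumes.

```python
import re

def find_entries(lines, pattern):
    """Return every line (one soldier entry each) that matches the regex."""
    rx = re.compile(pattern)
    return [line for line in lines if rx.search(line)]

# Invented sample entries for illustration only.
sample = [
    "Doe, John, Private, Capt. Abel Wilder's co., Col. Ephraim Doolittle's regt.",
    "Roe, Richard, Private, Capt. Abel Wilder's co., Col. Example's regt.",
]

# Same pattern as the Notepad++ search: "ilder", anything, then "Doo".
hits = find_entries(sample, r"ilder.*Doo")
for entry in hits:
    print(entry)
```

Because "." does not match a newline by default, the pattern only matches when both names sit on the same line, which is why joining each soldier's text onto one line matters. To run this against the real file, pass an open file object in place of `sample`.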

The combined text gives a better basis to determine the composition of companies or regiments at various times; to inventory which towns the soldiers were from; to find the traits of soldiers from rolls that described the men (hair, eyes, height, ethnicity, occupation, etc.); and to identify which men were in certain battles. The last can take the form of a positive claim, such as an entry containing the words killed, wounded, captured, or lost items, or a tacit claim from being in the unit at the time of the battle.

This is a large file. I STRONGLY RECOMMEND using Notepad++ or doing a bit of programming in Python (DO NOT ATTEMPT IN: Notepad, WordPad, or even MS Word). Both Notepad++ and Python support wildcard and extended character searching, e.g., “\n\r”. Notepad++ is limited to one wildcard, so for complex sequences, Python is the only option (using the regex library and conditional logic to mimic an SQL query with multiple WHERE clauses).
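Here is a rough sketch of what "multiple WHERE clauses" could look like in Python, using the standard `re` module. It assumes one soldier's entry per line; the helper `matches_all` and the sample entries are hypothetical and only there to show the idea.

```python
import re

def matches_all(entry, required, excluded=()):
    """True if entry matches every regex in `required` and none in `excluded`."""
    return (all(re.search(p, entry) for p in required)
            and not any(re.search(p, entry) for p in excluded))

# Invented sample entries for illustration only.
sample = [
    "Doe, John, Private, Capt. Abel Wilder's co., Col. Ephraim Doolittle's regt.; reported wounded.",
    "Roe, Richard, Private, Capt. Abel Wilder's co., Col. Ephraim Doolittle's regt.",
    "Poe, Peter, Sergeant, Capt. Other's co.; reported killed.",
]

# Each pattern acts like one WHERE clause joined by AND.
required = [r"Wilder", r"Doolittle", r"killed|wounded|captured"]
selected = [entry for entry in sample if matches_all(entry, required)]
for entry in selected:
    print(entry)
```

Unlike a single pattern such as “ilder.*Doo”, the conditions here can appear in any order within the entry, and the `excluded` list gives a simple way to say "but not".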

The process of removing end of line characters described above generally worked, but issues with the OCR (Optical Character Recognition) process resulted in many false interpretations of characters and unintended insertions of new lines. So, I also started replacing some of the safer sequences:

Daxiei to Daniel x4
Daxiel to Daniel x68
Aakon to Aaron x3
Reubex to Reuben x35
AxLEX to Allen x7
Johx to John x452
JoHx to John x63
Jamks to James x7
AVigglesworth to Wigglesworth x36
Woodbridjje to Woodbridge
Bexoxi to Benoni x5
Bexjamix to Benjamin x98
Williaji to William x70
Joxathax to Jonathan x51
Lejiuel to Lemuel x5

If you are tempted to do something similar, please be aware that many seemingly safe find/replace pairs will produce unintended changes. Use Find first to review a large number of the prospective changes, so you can be certain you are not changing valid information.
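The "Find first" step can also be done in Python. The sketch below is an illustration only: the helper `preview` and the sample text are made up, and matching whole words with `\b` is one possible way to keep a replacement narrow, not necessarily how the list above was produced.

```python
import re
from collections import Counter

def preview(text, find):
    """Count the words that contain the literal string `find`."""
    words = re.findall(r"[A-Za-z']+", text)
    return Counter(w for w in words if find in w)

# Invented sample text for illustration only.
sample = "Johx Adams of Roxbury; Alexander Johxson; Johx Smith"

# A short pair such as "x" to "n" looks tempting but touches valid words.
print(preview(sample, "x"))

# A longer, whole-word pattern only touches the misread name.
print(preview(sample, "Johx"))
fixed = re.sub(r"\bJohx\b", "John", sample)
print(fixed)
```

Looking over the counted words before replacing shows that "x" would also change Roxbury and Alexander, while the whole-word pattern leaves everything except the standalone "Johx" alone.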

Sometimes the OCR process in combination with a poorly printed page created a mess. For instance:

An^TT.w SPr-eant Capt. Joseph Uichards’s co.; enlisted Aug. 11, 1779; ser'”‘^”^^ZZ 3 a”;n^.’detach,nent under Capt. Samuel Fisher at Rhode Island.

Should be:

Allen, Abijah, Sergeant, Capt. Joseph Richards’s co.; enlisted Aug. 11, 1779; service 1 mo. 3 days with detachment under Capt. Samuel Fisher at Rhode Island.

Thus, if you’re looking for a specific person, the scanned version on archive.org is your best option.

-CEF

Updates

April 11, 2021: removed more arbitrary end of line characters and fixed more OCR spelling issues, particularly surnames starting with A and B.

March 18, 2021: removed a thousand more arbitrary end of line characters and fixed more OCR spelling issues.

March 14, 2021: removed several thousand more arbitrary end of line characters and fixed more OCR spelling issues.

March 6, 2021: removed about 10,000 additional arbitrary end of line characters and fixed many more OCR spelling issues.
