The 17 volumes of Massachusetts Soldiers and Sailors of the Revolutionary War, a Compilation from the Archives are a bit of a beast to search through.

Follow this post to receive updates, and please comment with any issues. Also, see the list of updates at the end of this post.

I did a little work this weekend to bypass all of that. I call this a cleaned-up and combined text of all 17 volumes. The raw text was taken from:
https://archive.org/details/massachusettssol00mass/page/134/mode/2up (this URL is for volume 1).

The text view was copied into files for each volume. I processed those files in Python to remove the arbitrary end of lines, thus concatenating the text for each soldier onto a single line. My goal was to be able to do wildcard searches per soldier so that I could find matches between company commanders and regiment commanders. For instance, searching for “ilder.*Doo” would find men who were in Capt. Abel Wilder’s co., Col. Ephraim Doolittle’s regt. That syntax is for Notepad++. I got tired of doing that manually in each of the 17 volumes. The combined text also removes the need to search across volumes for sound-alike names such as Ayres and Eyres.
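For anyone who wants to try the same approach, here is a minimal Python sketch of the joining and searching steps. It is not the script used for this file. It assumes, as a guess about the layout, that a new entry begins with a line shaped like "Surname, Given"; the function names and the sample lines are invented for illustration.

```python
import re

# Assumption: a soldier's entry starts with "Surname, Given..." at the
# beginning of a line. Anything else is treated as a wrapped continuation.
ENTRY_START = re.compile(r"^[A-Z][A-Za-z'’]+, [A-Z]")


def join_records(lines):
    """Merge wrapped lines so that each soldier's entry sits on one line."""
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines left over from the page layout
        if ENTRY_START.match(line) or not records:
            records.append(line)          # start of a new entry
        else:
            records[-1] += " " + line     # continuation of the previous entry
    return records


def search(records, pattern):
    """Return the entries in which the regular expression matches."""
    rx = re.compile(pattern)
    return [r for r in records if rx.search(r)]


# Invented sample lines, wrapped the way the raw text view wraps them.
sample = [
    "Adams, John, Private, Capt. Abel Wilder's co.,",
    "Col. Ephraim Doolittle's regt.; muster roll dated",
    "Aug. 1, 1775.",
    "Allen, Abijah, Sergeant, Capt. Joseph Richards's co.;",
    "enlisted Aug. 11, 1779.",
]

records = join_records(sample)
hits = search(records, r"ilder.*Doo")
for hit in hits:
    print(hit)
```

A continuation line that happens to begin with a capitalized word and a comma would be split off as a false entry under this heuristic, so the output needs spot-checking against the scans.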

The combined text also gives a better basis to determine the composition of companies or regiments at various times; to inventory which towns the soldiers were from; to collect traits of soldiers from rolls that described the men (hair, eyes, height, ethnicity, occupation, etc.); and to identify which men were in certain battles. That last item can take the form of a positive claim, such as an entry containing the words killed, wounded, captured, or lost items, or a tacit claim from being in the unit at the time of the battle.

This is a large file. I STRONGLY RECOMMEND opening it in Notepad++ or doing a bit of programming in Python (DO NOT ATTEMPT IN: Notepad, WordPad, or even MS Word). Both Notepad++ and Python support wildcard and extended character searching, e.g., “\n\r”. Notepad++ is limited to one wildcard, so for complex sequences, Python is the only option (using the regex library and conditional logic to mimic an SQL query with multiple WHERE clauses).
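As a rough illustration of the "multiple WHERE clauses" idea, the sketch below filters one-line entries with several regular expressions at once, using Python's standard re module. The function name where, its parameters, and the sample entries are all hypothetical; it assumes each soldier's entry is already a single string.

```python
import re


def where(records, all_of=(), none_of=()):
    """Keep entries that match every pattern in all_of and no pattern in none_of.

    Loosely comparable to: WHERE a AND b AND NOT c
    """
    must = [re.compile(p, re.IGNORECASE) for p in all_of]
    must_not = [re.compile(p, re.IGNORECASE) for p in none_of]
    return [
        r for r in records
        if all(rx.search(r) for rx in must)
        and not any(rx.search(r) for rx in must_not)
    ]


# Invented one-line entries for demonstration only.
records = [
    "Adams, John, Private, Capt. Abel Wilder's co., Col. Ephraim Doolittle's regt.; enlisted April 26, 1775.",
    "Baker, Amos, Private, Capt. Abel Wilder's co., Col. Ephraim Doolittle's regt.; reported deserted.",
    "Allen, Abijah, Sergeant, Capt. Joseph Richards's co.; enlisted Aug. 11, 1779.",
]

matches = where(
    records,
    all_of=[r"Wilder", r"Doolittle"],   # company AND regiment
    none_of=[r"deserted"],              # AND NOT deserted
)
for m in matches:
    print(m)
```

Because each condition is a separate pattern, the order in which the words appear within an entry does not matter, which is harder to express in a single search string.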

The process of removing end-of-line characters described above generally worked, but issues with the OCR (Optical Character Recognition) process resulted in many false interpretations of characters and unintended insertions of new lines. So, I also started replacing some of the safer sequences:

Daxiei to Daniel x4
Daxiel to Daniel x68
Aakon to Aaron x3
Reubex to Reuben x35
AxLEX to Allen x7
Johx to John x452
JoHx to John x63
Jamks to James x7
AVigglesworth to Wigglesworth x36
Woodbridjje to Woodbridge
Bexoxi to Benoni x5
Bexjamix to Benjamin x98
Williaji to William x70
Joxathax to Jonathan x51
Lejiuel to Lemuel x5

If you are tempted to do something similar, please be aware that many seemingly safe find/replace pairs will produce unintended changes. Use Find first to review a large number of the prospective changes, so you can be certain you are not altering valid information.
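In Python, the same "find first" habit could look something like the sketch below: list every hit with a little surrounding context, review the list, and only then apply the replacement. The function names and the sample text are made up for illustration, and this is not the procedure used to produce the counts listed above.

```python
import re


def preview_matches(text, old, width=15):
    """Return each occurrence of `old` with surrounding context; text is unchanged."""
    contexts = []
    for m in re.finditer(re.escape(old), text):
        start = max(0, m.start() - width)
        end = min(len(text), m.end() + width)
        contexts.append(text[start:end])
    return contexts


def apply_replacement(text, old, new):
    """Replace every occurrence of `old`; return the new text and the hit count."""
    return text.replace(old, new), text.count(old)


# Invented sample text containing a typical OCR error (x read for n).
sample = "Johx Adams, Private. Daxiel Johxson, Corporal."

# Step 1: look before changing anything.
contexts = preview_matches(sample, "Johx")
for c in contexts:
    print(repr(c))

# Step 2: apply only after the preview looks safe.
fixed, count = apply_replacement(sample, "Johx", "John")
print(count, fixed)
```

The preview shows that the pattern also hits inside the longer word "Johxson". Here that side effect is welcome, but the same kind of hidden hit is how a valid name gets damaged.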

Sometimes the OCR process in combination with a poorly printed page created a mess. For instance:

An^TT.w SPr-eant Capt. Joseph Uichards’s co.; enlisted Aug. 11, 1779; ser'”‘^”^^ZZ 3 a”;n^.’detach,nent under Capt. Samuel Fisher at Rhode Island.

Should be:

Allen, Abijah, Sergeant, Capt. Joseph Richards’s co.; enlisted Aug. 11, 1779; service 1 mo. 3 days with detachment under Capt. Samuel Fisher at Rhode Island.

Thus, if you’re looking for a specific person, the scanned version on archive.org is your best option.

-CEF

Updates

April 11, 2021: removed more arbitrary end-of-line characters and fixed more OCR spelling issues, particularly surnames starting with A and B.

March 18, 2021: removed a thousand more arbitrary end-of-line characters and fixed more OCR spelling issues.

March 14, 2021: removed several thousand more arbitrary end-of-line characters and fixed more OCR spelling issues.

March 6, 2021: removed about 10,000 additional arbitrary end-of-line characters and fixed many more OCR spelling issues.

Feb 3, 2022: manually traversed surnames beginning with letters A, B, and C to correct spelling and letter case issues to improve searching. This dealt with common OCR transcription errors such as G instead of C, for names such as Cross, which appeared as Gross, and could not be handled en masse.

Feb 20, 2022: corrected surnames beginning with letter D per Feb 3 description.

April 21, 2022: corrected surnames beginning with letter E and F through Farrington per Feb 3 description.

May 28, 2022: corrected remaining surnames beginning with the letter F.

Sep 8, 2022: corrected surnames starting with G through Halloran. Now includes all of vol. VI and the start of vol. VII.
