CS6120 Natural Language Processing Assignment 0
Scenario
A regular expression (RE) is a sequence of characters that defines a search pattern. REs can be used for string searching and manipulation tasks, such as finding, replacing, or validating text. Regular expressions are a powerful tool, available in many programming languages, for handling text data. They are useful in data cleaning, parsing, and text preprocessing.
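As a minimal sketch of the three uses named above (finding, replacing, validating), the following uses Python's built-in `re` module. The sample string and the patterns are illustrative assumptions only; in particular, the email pattern is deliberately simplistic and is not a complete validator.

```python
import re

# Illustrative sample text (an assumption, not part of the assignment data).
text = "Contact: jane.doe@example.com, phone 617-555-0123."

# Finding: extract every substring shaped like NNN-NNN-NNNN.
phones = re.findall(r"\b\d{3}-\d{3}-\d{4}\b", text)

# Replacing: mask every digit in the text with '#'.
masked = re.sub(r"\d", "#", text)

# Validating: check that an entire string looks like a simple email address.
# fullmatch() succeeds only if the pattern covers the whole string.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
is_email = EMAIL.fullmatch("jane.doe@example.com") is not None

print(phones)
print(masked)
print(is_email)
```

`re.findall` returns all non-overlapping matches as a list, `re.sub` returns a new string with matches replaced, and `fullmatch` is the usual choice for validation because `search` would accept a valid fragment inside an otherwise invalid string.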
Task
This assignment has two parts:
- Part A) You are given a small CSV file with five short stories, one per row. The file also contains empty columns with header labels. Use RE to extract the information needed to fill the empty columns.
- Part B) Download all 5 volumes of "A System of Practical Medicine" from the Project Gutenberg library. Then use RE search to count the number of times each of the most common modern health conditions is mentioned in each text. Your objective is to create a DataFrame (df) with five rows, one for each volume. The df should contain one column per health condition, holding that condition's frequency within the volume. Here are the most frequent health conditions:
Heart disease
Cancer
Stroke
Respiratory diseases
Alzheimer's disease
Diabetes
Influenza and Pneumonia
Kidney diseases
Septicemia
Liver disease
Hypertension
Parkinson's disease
Chronic lower respiratory disease
Accidents/injuries
Osteoporosis
Asthma
Depression
Oral health issues
HIV/AIDS
Tuberculosis
Malaria
Dengue fever
Hepatitis
Epilepsy
Multiple sclerosis
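One possible way to approach the counting in Part B is sketched below, under stated assumptions: the `CONDITION_PATTERNS` dictionary, the `count_conditions` helper, and the two short sample strings are all hypothetical stand-ins (the real input would be the full text of each downloaded volume), and only four of the conditions are shown. How loosely each condition should be matched (plurals, hyphenation, synonyms) is a judgment call left to your solution.

```python
import re

# Hypothetical patterns for a small subset of the conditions listed above.
# \s+ lets a phrase match across the hard line breaks found in plain-text books.
CONDITION_PATTERNS = {
    "Heart disease": r"\bheart[\s-]+diseases?\b",
    "Cancer": r"\bcancers?\b",
    "Diabetes": r"\bdiabetes\b",
    "Tuberculosis": r"\btuberculosis\b",
}

def count_conditions(text, patterns=CONDITION_PATTERNS):
    """Return {condition name: number of case-insensitive matches in text}."""
    return {
        name: len(re.findall(pattern, text, flags=re.IGNORECASE))
        for name, pattern in patterns.items()
    }

# Stand-in strings; in the assignment these would be the five volume texts.
volumes = {
    "Volume 1": "Cancer of the stomach. Heart disease and heart-\ndisease; cancers.",
    "Volume 2": "Diabetes is discussed. Tuberculosis, TUBERCULOSIS.",
}

# One dict per volume: the volume label plus one count per condition.
rows = [{"volume": title, **count_conditions(text)} for title, text in volumes.items()]

for row in rows:
    print(row)
```

Assuming pandas is installed in your notebook, `pandas.DataFrame(rows).set_index("volume")` turns this list of dictionaries into a table with one row per volume and one column per condition.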
Expected Output
Please submit a fully executed Jupyter notebook that clearly identifies each question number and the steps taken. Make sure to add proper commentary to your solution.