Over the past week, my mentors and I have finalized a more thorough game plan for my project. I now have a much better understanding of what I need to do here. Thus far, I have been looking at the radiologist reports and classifying them into data sets with and without endorectal coils. Since many radiologist reports neglect to mention an endorectal coil, even if one was used, I spent a lot of time going through the MR images to find the endorectal coils, which are most easily seen in the sagittal and axial oblique T2 image series. Once I have completed the dataset in this way and am highly confident in the labels, the next step is to create a database.
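As a rough sketch of the first pass over the reports, something like the following could sort them into the two data sets and flag the ones that need a look at the images. The regular expressions and the three label names are my own assumptions for illustration; real reports are worded in many more ways than this covers.

```python
import re

# Assumed patterns: report wording varies, so these are illustrative only.
COIL_PATTERN = re.compile(r"\bendo-?rectal\b", re.IGNORECASE)
# A negation word followed, within the same sentence, by "endorectal".
NEGATION_PATTERN = re.compile(
    r"\b(no|not|without)\b[^.]*\bendo-?rectal\b", re.IGNORECASE
)

def classify_report(text):
    """Return 'coil', 'no_coil', or 'needs_image_review' for one report."""
    if NEGATION_PATTERN.search(text):
        return "no_coil"
    if COIL_PATTERN.search(text):
        return "coil"
    # The report is silent about the coil, so the images have to be checked.
    return "needs_image_review"

print(classify_report("MRI of the prostate performed with endorectal coil."))
print(classify_report("Exam performed without an endorectal coil."))
print(classify_report("T2 hypointense lesion in the peripheral zone."))
```

Anything labeled `needs_image_review` still has to be checked by hand in the T2 series, which is the slow part.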
I will set up a MySQL database to store all the data, including both text and images. Although all the required hardware has arrived, we still have to install various Python packages, frameworks, and libraries, in addition to MATLAB and MySQL. I will use text mining in Python for clustering and frequency analysis on the radiologist reports. This will show us which terms and phrases are most commonly used, and which tend to be grouped together, for each parameter. I will also compile a list of scan parameters, patient demographics, and clinical outcomes for each case. After setting up this database, we can perform some statistical analysis with SciPy to find out how to optimize the acquisition protocol. Depending on my progress, this whole process should take a month or more.
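To give an idea of what the frequency analysis might look like, here is a minimal sketch using only the Python standard library. The stopword list, the tokenizer, and the two sample reports are made up for illustration; the real analysis will probably use a dedicated text mining library and a much larger vocabulary.

```python
import re
from collections import Counter

# Assumed stopword list. "no" is deliberately kept, since negation matters
# in a radiology report.
STOPWORDS = {"the", "of", "in", "is", "a", "and", "with"}

def tokenize(text):
    """Lowercase the text, split on non-alphanumerics, drop stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOPWORDS]

def term_and_bigram_counts(reports):
    """Count single terms and adjacent term pairs across all reports."""
    terms, bigrams = Counter(), Counter()
    for report in reports:
        tokens = tokenize(report)
        terms.update(tokens)
        # Pairs are formed after stopword removal, so "lesion in the
        # peripheral" contributes the pair ("lesion", "peripheral").
        bigrams.update(zip(tokens, tokens[1:]))
    return terms, bigrams

# Made-up example reports, not real data.
reports = [
    "Lesion in the peripheral zone.",
    "No lesion in the peripheral zone.",
]
terms, bigrams = term_and_bigram_counts(reports)
print(terms.most_common(3))
print(bigrams.most_common(2))
```

Counting pairs as well as single terms is a cheap way to see which words travel together before trying any real clustering.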
After the text mining step, we will move on to the more interesting and complex image analysis task. Specifically, we want to use MATLAB/Python for texture analysis, feature extraction, and segmentation on the MR images. At this point, we will have to expand the MySQL database to incorporate the images, which will number in the tens of thousands. Unlike the text mining task, the image analysis task has optimizing the acquisition parameters as its main goal.
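One common family of texture features comes from the gray-level co-occurrence matrix (GLCM), which records how often pairs of gray levels appear next to each other. The sketch below is a pure-Python illustration on tiny made-up images, not the method we have settled on. The function names and the choice of contrast and energy as features are my own assumptions, and it expects an image with at least one valid neighbor pair.

```python
from collections import Counter

def glcm(image, dx=1, dy=0):
    """Normalized, non-symmetric co-occurrence of gray levels.

    image is a list of rows of integer gray levels. Each pixel is paired
    with its neighbor at column offset dx and row offset dy.
    """
    counts = Counter()
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[(image[r][c], image[r2][c2])] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

def contrast(p):
    """Weighted squared gray-level difference; 0 for a flat region."""
    return sum(prob * (i - j) ** 2 for (i, j), prob in p.items())

def energy(p):
    """Sum of squared probabilities; 1 when only one pair ever occurs."""
    return sum(prob ** 2 for prob in p.values())

flat = [[1, 1], [1, 1]]          # uniform patch
checker = [[0, 1], [1, 0]]       # alternating patch
print(contrast(glcm(flat)), energy(glcm(flat)))
print(contrast(glcm(checker)), energy(glcm(checker)))
```

A uniform patch gives zero contrast and maximal energy, while the alternating patch gives higher contrast and lower energy, which is the kind of difference texture features are meant to pick up.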
The complete database will then be used to develop a training database, and from there, we can perform more advanced machine learning with recurrent neural networks (RNNs) for the textual information and convolutional neural networks (CNNs) for the images. Here is what Dr. Panda drew to explain everything to me:
All in all, this project should take several months to complete, but I'm aiming to get it done as fast as possible. Although I'll probably make good progress on the text mining task by the end of April, I plan on staying over the summer, probably until August or September, to complete the image analysis task. I honestly think that I need more machine learning knowledge and experience in order to do a good job, so I will spend a lot of time on learning and researching machine learning with my mentors and on Stack Overflow.
I expect to install all the software and iron out some more details next week. I also met with Dr. Kawashima in the reading room today, and I learned more about what the radiologists are looking for in a prostate MR scan. Areas of interest in and around the prostate include the peripheral and transitional zones, invasion of the neurovascular bundles, effacement of fat, the seminal vesicles, and the lymph nodes. Basically, there are a bunch of biomarkers to look for, and I need to remember to keep an eye out for them. Suspicious markers can indicate malignancy.