March - May 2020
TEXT ANALYTICS ON PEDIATRICIANS
Technology: Text mining, Machine learning, Statistical analysis | Python, R, Excel
- The goal of this project was to analyze how pediatricians across Illinois are reviewed online on various medical aspects.
- Incorporated predictive analytics and text mining to examine how pediatricians are reviewed online based on selected key aspects.
- Performed sentiment analysis on scraped review text, followed by semi-supervised topic modeling, to calculate aspect-based scores for each pediatrician.
- Built a regression model of the review characteristics against the overall star rating to evaluate the key factors affecting a pediatrician’s rating score.
January - April 2020
DATA ANALYTICS ON LIVE MUSIC EVENTS IN THE UK
Technology: ETL processing, Data analytics and visualization | Python, Spark, PostgreSQL, Tableau
- The goal of this project was to provide live event recommendations based on previous data on artists, their performance and popularity, and event details.
- Extracted data on live music performances taken over a year from Spotify and MusicBrainz.
- Cleaned the extracted raw data and persisted it into a PostgreSQL database for further analytical processing.
- Visually analyzed the final loaded data using Tableau to derive inferences such as artist popularity in the UK and the most popular venues for hosting events.
September - November 2019
UNDERSTANDING CUSTOMER BEHAVIOUR AT VMWARE
Technology: Predictive analytics and visualization | Python, R, Microsoft Excel
- The goal of this project was to apply data analytics techniques to analyze and understand online customer activity at VMware solutions.
- Performed comprehensive analysis of the customer data to identify patterns that help improve personalized marketing for the company.
- Developed a series of predictive models on the historical data to derive insights about customer behaviour.
June - August 2020
Network Traffic Intrusion Detection
Technology: Big data analytics, Machine learning in Spark, Pipelines | Databricks, Python - Spark ML, PySpark, MLflow, boto3
- The goal of this project was to develop a machine learning model on TCP dump data from a U.S. Air Force LAN to predict whether a connection is normal or a DoS attack.
- The data, stored in Amazon S3, has around 5 million records. Created a Spark cluster in the Databricks environment to perform predictive analytics.
- Created Spark ML pipelines for a random forest classification model using Apache Spark’s MLlib package and achieved an F1 score of 98%.
- Serialized the best-performing model into a bundle using the MLflow MLeap format and exported it for prediction in other applications.
September - November 2019
Predicting Net Promoter Score to improve Patient experience at Manipal Hospitals
Technology: Predictive analytics and visualization | R, Microsoft Excel
- The goal of this project was to use machine learning and analytics techniques to improve the overall patient experience and enhance patient satisfaction at Manipal Hospitals.
- The data, comprising questionnaire responses and ordered ratings from patients, contains a large number of features, which were narrowed down using feature selection techniques such as stepwise logistic regression.
- Used machine learning techniques such as random forests and AdaBoost for prediction, evaluated with various performance measures.
October - December 2019
Customer Churn Analysis for cell2cell
Technology: Supervised and unsupervised machine learning, Statistical analysis | Python, Microsoft Excel
- The goal of this project was to apply machine learning techniques to identify customers of a telecom company who are likely to churn.
- Implemented various classification techniques to predict whether a customer would churn, in order to focus customer retention management.
- In parallel, employed regression models to predict the monthly revenue loss the company would incur on losing a customer.
- Performed customer segmentation using clustering techniques to group customers by profitability and help target the most profitable segments.
October - December 2019
Online News Popularity Analysis for Mashable dataset
Technology: Machine learning, Statistical analysis and visualization | Python, Microsoft Excel
- The goal of this project was to use data analytics tools to analyze the popularity of news articles across various categories.
- Devised a predictive model to estimate the number of shares an article on the Mashable website can get.
- Analyzed the articles by category to determine the popularity of each category.
November - December 2019
Visualization of the performance of Alaska Airlines
Technology: Data visualization | Tableau, Microsoft Excel
- The goal of this project was to analyze and report on the factors affecting flight delays at Alaska Airlines.
- Visually analyzed and compared the performance of Alaska Airlines based on data from the BTS (Bureau of Transportation Statistics) for the year 2018.
- Identified the competitor airline and performed comparisons through visualizations based on flight delay metrics.
January - March 2020
Prediction of severity of Real-time Car accidents in the United States
Technology: Predictive analytics and visualization | AWS - S3, SageMaker, Docker, Python - sklearn, pandas, plotly
- The goal of this project was to predict the severity of real-time car accidents, derive information about accident hotspot locations, and identify the key factors influencing severity.
- Performed a wide range of exploratory and geospatial analyses to visually interpret hotspot zones and driving factors.
- Trained machine learning models such as support vector machines and logistic regression to predict the severity of accidents given factors like weather, road, and temporal data.
- Deployed the best-performing model, a random forest, to Amazon SageMaker using the MLflow library and a Docker container on an Amazon EC2 instance.
March - April 2020
Recommendation on Amazon Fine foods
Technology: Machine learning, Big data analytics | Python - PySpark, scikit-learn
- The goal of this project was to develop a series of recommendation systems that recommend items to users based on various factors.
- Implemented two collaborative filtering approaches: one based on user similarity using the cosine similarity metric, and the other based on user-item similarity using matrix factorization.
- Developed another recommendation algorithm that recommends the most popular items to users.
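A minimal sketch of the user-similarity approach mentioned above, for illustration only. The toy ratings, the names `cosine_similarity` and `recommend`, and the similarity-weighted averaging step are assumptions and are not taken from the project itself.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two sparse rating dicts (item -> rating)."""
    shared = set(a) & set(b)
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(ratings, user, top_n=2):
    """Rank items the user has not rated by similarity-weighted ratings of other users."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue  # ignore users with no overlap in rated items
        for item, rating in other_ratings.items():
            if item in ratings[user]:
                continue  # only recommend unseen items
            scores[item] = scores.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    predicted = {item: scores[item] / weights[item] for item in scores}
    return sorted(predicted, key=predicted.get, reverse=True)[:top_n]


# Hypothetical toy data: user -> {item: star rating}
ratings = {
    "alice": {"tea": 5, "coffee": 4},
    "bob": {"tea": 5, "coffee": 4, "granola": 5, "chips": 1},
    "carol": {"chips": 5, "soda": 4},
}

print(recommend(ratings, "alice"))
```

With this toy data, only "bob" shares rated items with "alice", so the ranking follows his ratings of the items she has not seen.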
January - March 2020
Health Information Trends Survey Analytics
Technology: Machine learning, Statistical analysis and visualization | R language
- The goal of this project was to analyze how often a person accesses their own online medical record, based on survey results.
- The data comes from a survey on online medical record access conducted by the U.S. National Institutes of Health.
- Implemented machine learning techniques to predict the number of times a person would have accessed their EMR over a year.
- Used the results to identify the factors that influenced a person’s EMR activity, such as demographics and medical conditions.
January - March 2020
Screening for Chronic Disease
Technology: Machine learning, Statistical analysis and visualization | R language
- The goal of this project was to identify patient characteristics that indicate the likelihood of developing chronic kidney disease.
- Performed exploratory analysis on 8,000 patient records containing information such as patient demographics and other clinical parameters.
- Developed a logistic regression model to identify significant factors that contribute to the development of chronic kidney disease.
- Addressed the class-imbalance problem by identifying a suitable threshold for classifying a patient as 'likely to develop chronic kidney disease' using the Receiver Operating Characteristic (ROC) curve, and by evaluating the model on precision/recall rather than accuracy.
- This model can be implemented in hospitals as a screening tool to avoid prescribing unnecessary tests and thereby reduce healthcare costs.
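A minimal sketch of ROC-based threshold selection on imbalanced data, for illustration only. The toy labels and scores are hypothetical, and the selection criterion (Youden's J, the cutoff maximizing TPR minus FPR) is an assumption, since the bullets above do not state which criterion was used.

```python
def roc_points(y_true, y_score):
    """Return (threshold, fpr, tpr) for each distinct score used as a cutoff."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    for threshold in sorted(set(y_score), reverse=True):
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 0)
        points.append((threshold, fp / negatives, tp / positives))
    return points


def best_threshold(y_true, y_score):
    """Pick the cutoff maximizing Youden's J = TPR - FPR (assumed criterion)."""
    return max(roc_points(y_true, y_score), key=lambda p: p[2] - p[1])[0]


def precision_recall(y_true, y_score, threshold):
    """Precision and recall when scores >= threshold are classified positive."""
    tp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 1)
    predicted = sum(1 for s in y_score if s >= threshold)
    actual = sum(y_true)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


# Hypothetical toy data: 2 positives among 10 patients (imbalanced classes)
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
y_score = [0.05, 0.10, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.60]

cutoff = best_threshold(y_true, y_score)
print(cutoff, precision_recall(y_true, y_score, cutoff))
print(0.5, precision_recall(y_true, y_score, 0.5))
```

In this toy example the default cutoff of 0.5 misses one of the two positive cases, while the ROC-derived cutoff recovers both at the cost of one false positive, which is the trade-off a screening tool typically favors.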