Homework Projects
Master's thesis
For my Master's thesis I worked in Prof. Dr. Julia Vogt’s Medical Data Science Group on administrative claims patient data. The goal of the project was to find better representations for diseases annotated with the ICD standard. We captured co-occurrence information with Word2Vec and hierarchical information with HyperE. Our model then combined Euclidean and hyperbolic machine learning using PyTorch with graph neural networks using PyTorch Geometric.
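To illustrate why hyperbolic space suits hierarchical data such as the ICD tree, here is a minimal sketch of the distance function in the Poincaré ball, one common model of hyperbolic space. It assumes the Poincaré model and uses made-up 2-D points; it is not the thesis code, and the function names are hypothetical.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincaré ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)

def euclidean_distance(u, v):
    """Ordinary straight-line distance, for comparison."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical embeddings: a root concept at the origin, two leaves near the boundary.
root = (0.0, 0.0)
leaf_a = (0.9, 0.0)
leaf_b = (0.0, 0.9)

print("root to leaf_a:", poincare_distance(root, leaf_a))
print("leaf_a to leaf_b (hyperbolic):", poincare_distance(leaf_a, leaf_b))
print("leaf_a to leaf_b (Euclidean):", euclidean_distance(leaf_a, leaf_b))
```

Points near the boundary end up far from each other while staying comparatively close to the origin, which mirrors how leaves of a tree relate to each other and to the root.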
Homework projects
During the Master's degree, many projects had to be completed for the different courses. Here is a selection:
Advanced machine learning
The course Advanced Machine Learning includes four tasks that have to be completed:
- Task1, which tackles outlier detection, feature selection and data cleaning.
- Task2, where feature selection is again a problem, along with class imbalance.
- Task3 with time series data of different lengths.
Advanced systems lab project
The objective of the Advanced Systems Lab is to design, write and evaluate fast C code. As a test of our abilities we had to complete a project. Our group decided on the Baum-Welch algorithm, which is a special case of the expectation-maximization (EM) algorithm applied to hidden Markov models. As part of the project, our group reordered the steps of the EM algorithm, unrolled loops, inserted SIMD instructions and checked the performance with Valgrind.
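For readers unfamiliar with Baum-Welch, the following is a plain, unoptimized sketch of a single EM iteration for a discrete hidden Markov model. It is written in Python for readability and is not the group's C implementation; it omits scaling, so it is only suitable for short observation sequences (at least two observations). All names and the example parameters are hypothetical.

```python
def forward(A, B, pi, obs):
    """alpha[t][i] = P(o_1..o_t, state at t = i)."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([
            sum(prev[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
            for j in range(n)
        ])
    return alpha

def backward(A, B, obs):
    """beta[t][i] = P(o_{t+1}..o_T | state at t = i)."""
    n = len(A)
    beta = [[1.0] * n]
    for t in range(len(obs) - 2, -1, -1):
        nxt = beta[0]  # beta at time t + 1
        beta.insert(0, [
            sum(A[i][j] * B[j][obs[t + 1]] * nxt[j] for j in range(n))
            for i in range(n)
        ])
    return beta

def baum_welch_step(A, B, pi, obs):
    """One EM iteration; returns updated (A, B, pi) and the likelihood under the old parameters."""
    n, m, T = len(pi), len(B[0]), len(obs)
    alpha = forward(A, B, pi, obs)
    beta = backward(A, B, obs)
    likelihood = sum(alpha[-1])
    # E-step: posterior state (gamma) and transition (xi) probabilities.
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(n)]
             for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / likelihood
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    # M-step: re-estimate parameters from expected counts.
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
              sum(gamma[t][i] for t in range(T))
              for k in range(m)] for i in range(n)]
    return new_A, new_B, new_pi, likelihood

# Hypothetical 2-state, 2-symbol model and a short observation sequence.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
obs = [0, 1, 0, 0, 1, 1, 0]

A1, B1, pi1, lik0 = baum_welch_step(A, B, pi, obs)
_, _, _, lik1 = baum_welch_step(A1, B1, pi1, obs)
print("likelihood before and after one update:", lik0, lik1)
```

The nested sums over states and time steps are the kind of loops that reordering, unrolling and SIMD instructions target in a C implementation.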
Computational intelligence lab project
One main focus of the Computational Intelligence Lab is how to model data (images, text, etc.). Part of the course is a project, for which we chose sentiment analysis of tweets. We tried out BERT, ALBERT in different sizes, and lexical normalization with MoNoise.
Machine learning for healthcare
The course Machine Learning for Healthcare reviews the most relevant methods and applications of machine learning in biomedicine. To get hands-on experience we did some projects.
- Task1 performs heartbeat classification with Keras neural networks and also uses transfer learning.
- Task2 is split into two parts:
  - The first part is about classifying clinical notes by the disease they describe.
  - The second part was not finished; it tried to use “pre-seeded” LDA to search for desired articles.
- Task3 uses U-Net to segment CT scans.
Partisan responses
For the course Sequencing Legal DNA we had to do a course project. Our group decided on the big project of generating partisan responses based on U.S. congressional speeches.
The subtasks for this project were:
- Filter and pre-process the dataset
- Search for related speeches in the dataset based on a query
- Extract information from the resulting dataset using AllenNLP's module
- Use that information to build knowledge graphs
- Generate responses based on those knowledge graphs using GraphWriter
Since that pipeline did not work out as hoped, and to have a baseline, we also tested the text generation capabilities of GPT-2.
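As a rough illustration of the knowledge-graph step in the list above, the sketch below turns (subject, relation, object) triples into a simple adjacency structure. It assumes the extraction step yields such triples, which is not confirmed here; the triples are made up, and the function is hypothetical rather than part of the project code.

```python
from collections import defaultdict

def build_knowledge_graph(triples):
    """Map each subject to its list of (relation, object) edges, skipping duplicates."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        edge = (relation, obj)
        if edge not in graph[subject]:
            graph[subject].append(edge)
    return dict(graph)

# Made-up triples, purely for illustration.
triples = [
    ("senator", "supports", "bill"),
    ("bill", "funds", "schools"),
    ("senator", "supports", "bill"),  # duplicate, ignored
    ("senator", "opposes", "amendment"),
]

graph = build_knowledge_graph(triples)
for subject, edges in graph.items():
    for relation, obj in edges:
        print(subject, relation, obj)
```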
Notes
In my second semester it dawned on me that I could keep my notes for the mandatory lectures on the computer instead of in my bad handwriting.
Big data
There are many mandatory readings for the course big data.
Here is a selection of readings:
- Dynamo: Amazon’s Highly Available Key-value Store
- Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency
- Dremel: Interactive Analysis of Web-Scale Datasets
- JSON ECMA standard
- Understanding JSON Schema
- XML in a Nutshell
- HBase: The Definitive Guide
- Bigtable: A Distributed Storage System for Structured Data
- Hadoop: The Definitive Guide
- The Hadoop Distributed File System
Learning group
The natural language lab of ETH has a weekly learning group that discusses interesting papers.
Here are some that I have read:
Disease representations
For the Master's thesis I needed to read a number of papers for research. Some of those papers have notes in this folder.
Some of the papers I have studied:
- KAME: Knowledge-based Attention Model for Diagnosis Prediction in Healthcare
- BEHRT: Transformer for Electronic Health Records
- HyperCore: Hyperbolic and Co-graph Representation for Automatic ICD Coding
- Exploiting hierarchy in medical concept embedding
- Learning Electronic Health Records through Hyperbolic Embedding of Medical Ontologies