⭕ I just learned set theory and I can’t contain myself
Late to the Party 🎉 is about insights into real-world AI without the hype.
Hello internet,
I was a bit under the weather this week, but next week we’re finally releasing my MOOC segment on deep learning! Let’s dive into some other machine learning first!
The Latest Fashion
- This paper predicts images we are thinking of from brain scans! Fascinating but terrifying.
- A hackathon winner built a dark-ship detector with Stable Diffusion and GANs to spot illegal overfishing
- These animations for neural network education are amazing!
***Thanks to David for sending in the brain scan paper!***
Got this from a friend? Subscribe here!
My Current Obsession
I have spent so much energy on the ECMWF MOOC for ML in weather and climate prediction. So on Monday, you can finally see my segment on deep learning! Very excited!
And this one feels huge: I was featured in Interesting Engineering! They wrote an extensive, glowing review of my Skillshare course on AI art!
Thing I Like
There are two kinds of people in the world: those who have a favourite spoon and those who just realized there are people with favourite spoons. I just bought some “long small spoons”, and I am so happy. I had completely forgotten they exist until I saw a TikTok.
Hot off the Press
I have been doing a lot of behind-the-scenes work.
I added Calls for Proposals on PythonDeadlin.es for:
- EuroPython
- DjangoCon US
- Kiwi PyCon
Probably because I have been working so much on the ECMWF MOOC, I felt like adding a “teaching” section to my website. It showcases all the different ways I have taught different topics.
In Case You Missed It
The VS Code Twitter account made a top 10 list of extensions. Their picks were more web-design focused, so I shared my VS Code Extension Top 10, which is more data-science focused.
Machine Learning Insights
Last week I asked, “How can you select the most important features in a dataset?”, and here’s the gist of it:
Selecting the most important features in a dataset is a critical step in building machine learning models. It helps reduce the data's dimensionality and improve model performance by focusing on the most relevant information.
Let's first look at some potential steps and then analyze how we select features on a volcano dataset.
Here are some ideas, with a code sketch after the list:
- Correlation Matrix: We can calculate the correlation between each feature in the dataset and the target variable. Features with a high correlation to the target are likely direct predictors and should be retained. Additionally, we can calculate the cross-correlation between features to eliminate redundant information.
- Recursive Feature Elimination: This technique involves iteratively removing the least important features from the dataset until the desired number of features is reached. The model's performance after removing each feature determines its importance.
- Leave-One-Feature-Out: LOFO is similar to RFE in that we build multiple models and evaluate how the metrics change when leaving one feature out. However, it does not perform iterative elimination, which avoids falsely discarding features that only looked unimportant because other features were “masking” their usefulness.
- Feature Importance from Tree-Based Models: Models like decision trees, random forests, and XGBoost can provide feature importance scores. These scores can be used to rank the features and select the most important ones. They are, however, prone to overvaluing correlated features.
- Permutation Importance: This evaluates each feature on a trained model by scrambling that feature’s values and measuring how much the model’s performance declines.
- Principal Component Analysis (PCA): This dimensionality reduction technique transforms the original features into a smaller set of uncorrelated variables called principal components. The components can then be ranked by how much variance they explain, but PCA only captures linear combinations of features, which isn’t always optimal for feature selection.
- L1 Regularization: In linear models, adding an L1 regularization term to the objective function forces some model coefficients to become exactly zero, effectively performing feature selection. The non-zero coefficients correspond to the most important features. This is the sparsity that textbooks on L1 regularization talk about.
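Since a few of these click faster in code, here is a minimal sketch of four of the techniques on synthetic scikit-learn data. The dataset, model choices, and hyperparameters are purely illustrative, not a recipe:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Synthetic regression data: 10 features, of which only 4 are informative.
X, y = make_regression(n_samples=500, n_features=10, n_informative=4, random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(10)])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1. Correlation matrix: feature-target (and feature-feature) correlations.
corr = X_train.assign(target=y_train).corr()
print(corr["target"].drop("target").sort_values(ascending=False))

# 2. Recursive Feature Elimination, iteratively pruning down to 4 features.
rfe = RFE(RandomForestRegressor(random_state=0), n_features_to_select=4)
rfe.fit(X_train, y_train)
print("RFE keeps:", list(X.columns[rfe.support_]))

# 3. Permutation importance: scramble each feature, measure the score drop.
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
perm = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print(pd.Series(perm.importances_mean, index=X.columns).sort_values(ascending=False))

# 4. L1 regularization: Lasso drives unimportant coefficients to exactly zero.
lasso = Lasso(alpha=1.0).fit(X_train, y_train)
print("Lasso keeps:", list(X.columns[lasso.coef_ != 0]))
```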
In a volcano dataset, some potential features might include the volcano's height, the frequency of earthquakes in the area, the temperature of nearby hot springs, and the composition of volcanic gases. Using the ideas outlined above, we would start with a cross-correlation matrix to eliminate correlated features.
We might find that the frequency of earthquakes and the composition of volcanic gases are correlated time series, which is good to remember for further analysis. We can eliminate correlated features by hand or use Leave-One-Feature-Out with a cheap model, like XGBoost, to weigh the correlated features against each other. Then we can use the surviving features to build a predictive model, or further investigate the relationship between height, earthquakes, spring temperature, and volcanic activity, dropping the gas composition. That also has the added benefit of removing a possibly expensive and complicated measurement from our data pipeline!
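Here is a rough sketch of what that Leave-One-Feature-Out comparison could look like. The volcano columns and the synthetic data are entirely made up for illustration, and I am hand-rolling LOFO with cross-validation rather than using a dedicated library:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

# Hypothetical volcano data; in reality these would be measurements.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "height": rng.normal(2000, 500, n),
    "earthquake_frequency": rng.poisson(5, n).astype(float),
    "spring_temperature": rng.normal(60, 10, n),
})
# Make gas composition correlate with earthquakes, like in the example above.
df["gas_composition"] = df["earthquake_frequency"] + rng.normal(0, 0.5, n)
df["activity"] = (0.5 * df["earthquake_frequency"]
                  + 0.02 * df["spring_temperature"]
                  + rng.normal(0, 1, n))

features = ["height", "earthquake_frequency", "spring_temperature", "gas_composition"]
baseline = cross_val_score(XGBRegressor(), df[features], df["activity"], cv=5).mean()

for feature in features:
    rest = [f for f in features if f != feature]
    score = cross_val_score(XGBRegressor(), df[rest], df["activity"], cv=5).mean()
    # A big drop from the baseline means the feature carried unique signal;
    # almost no drop suggests redundancy (like our correlated gas column).
    print(f"without {feature}: {score:.3f} (baseline {baseline:.3f})")
```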
Data Stories
The space race was big, but how many space launches were there actually?
This YouTube video is a great visualization of the different launches, their names, and the country responsible for each rocket taking off.
[Source: YouTube]
Question of the Week
- What is one-hot encoding and what are its pros and cons?
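If you want a starting point, here is a tiny pandas sketch of what one-hot encoding does; the category column is made up, and the pros and cons are up to you:

```python
import pandas as pd

# A made-up categorical column to encode.
df = pd.DataFrame({"volcano_type": ["shield", "stratovolcano", "cinder_cone", "shield"]})

# Each category becomes its own 0/1 column, so a model can't
# mistakenly read an ordering into the category labels.
print(pd.get_dummies(df, columns=["volcano_type"]))
```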
Post your answers on Mastodon and tag me. I'd love to see what you come up with. Then I can include them in the next issue!
Tidbits from the Web
- I love FPV drone shots, and this summiting of the Matterhorn is stunning
- Unjaded Jade is a lovely reminder to be kind to yourself
- Linkin Park, but make it ✨antique✨
Jesper Dramsch is the creator of PythonDeadlin.es, ML.recipes, data-science-gui.de and the Latent Space Community.
I laid out my ethics, including my stance on sponsorships, in case you're interested!