🎃 Carved pumpkins often catch me off gourd

Got this from a friend? Subscribe here!

                October 21, 2023

Late to the Party 🎉 is about insights into real-world AI without the hype.

Hello internet,

ah, it’s still early! Finally, I’m somewhat back on track. So, let’s enjoy some high-quality machine learning today! I’m quite proud of today’s ML insight!
The Latest Fashion

This little online game has you crack LLMs with different levels of security! I had so much fun!
The next iteration of the Scikit-learn online MOOC starts in 10 days!
Opening up ChatGPT, a full overview of transparency, accountability, and openness of different LLMs and APIs.

My Current Obsession
I had a lovely chat with some fellow creators on Friday, where we talked about YouTube stats and general creativity. It’s so lovely having these communities that you can engage in.
Some folks have started using the accountability space I created in the Latent Space, and that makes me very happy as well! (Bonus: someone told me I was the only teacher they ever had who told them about pandas profiler, which both surprised me and made me happy about my decisions on what to teach!)
People have been very receptive to our team making the AIFS data-driven weather forecasting system at ECMWF public. This was lovely, and I’m glad we’re making such a positive impact already!
I’m also really enjoying the Halloween decorations I’ve been putting up. It’s so lovely decorating your space.
Thing I Like
I was feeling a bit self-conscious because my beard was kinda doing what it wanted despite me really trying to make it look nice. Turns out the right tool really helps. My old grooming kit was slowly having technical difficulties. First time I tried out the Philips multi-groomer, it looked SO nice! All the different tools it comes with are very useful, too. Anyways, enough gushing over some blades!
Hot off the Press
I wrote a nice post about what I learned from working in weather forecasting for 2 years.
In Case You Missed It
I worked a bit on ML-recipes to make it more SEO-friendly and more valuable to everyone.
On Socials
My summary post about working in weather for 2 years has been very well-received.
I also re-shared my PhD thesis, and both Mastodon and LinkedIn enjoyed it quite a bit!
7 months ago, you heard about some strategies for structuring Python packages, and LinkedIn quite enjoyed that this week!
Python Deadlines
I found FlaskCon 2023 for later this year!
Machine Learning Insights
Last week I asked, How do you treat missing values in your data?, and here’s the gist of it:
Handling missing values in your data is a critical step in preparing it for machine learning. These missing values can wreak havoc on the accuracy of your models, potentially leading to incorrect predictions or even breaking the model. Here are some strategies for addressing missing data, with examples from meteorological data such as sea surface temperature (SST), which has no values over land:

Identify Missing Values:
Problem: To start, you need to pinpoint where the missing values are in your dataset. These are often represented as “NaN” or “null” values.
Example: In a weather dataset, you might encounter missing values in the “SST” column. As a simple example, these gaps tell you there is no sea surface temperature to record over land.
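
Here's a minimal sketch of that first step, assuming the data lives in a pandas DataFrame. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy grid: SST is undefined over land, so those rows are NaN.
df = pd.DataFrame(
    {
        "lat": [50.0, 50.0, 51.0, 51.0],
        "lon": [-5.0, 8.0, -5.0, 8.0],
        "sst": [13.2, np.nan, 12.9, np.nan],
    }
)

# Number of missing values in each column.
missing_per_column = df.isna().sum()

# Boolean mask marking the rows where SST is missing.
sst_missing = df["sst"].isna()

print(missing_per_column)
print(df[sst_missing])
```

Looking at where the gaps sit (here: the two points at lon 8.0) is usually more informative than the raw count alone.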

Using Tree-based Models:
Solution: The easiest and most reliable way to handle missing values is to use tree-based models such as decision trees or XGBoost. Many implementations of these models can deal with NaNs in the data automatically.
Example: When values are missing in our sea surface temperature data, the “missingness” itself can implicitly inform the model about the land-sea mask.

Remove Rows with Missing Values:
Solution: A straightforward approach is to eliminate rows containing missing values. This is suitable when the number of missing values is relatively small and won’t significantly impact the overall dataset.
Example: If your daily weather data has a few days with missing values in the “temperature” column, you can opt to remove those specific rows from your dataset. This works for observations that aren’t systematically missing.
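
As a rough sketch with pandas (the dates and temperatures are invented), it can help to check how much you are about to throw away before dropping anything:

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperatures with one missing day.
df = pd.DataFrame(
    {"temperature": [11.5, np.nan, 12.1, 12.4, 12.8, 13.0]},
    index=pd.date_range("2023-10-01", periods=6, freq="D"),
)

# Fraction of rows that would be lost by dropping.
missing_fraction = df["temperature"].isna().mean()

# Drop only the rows where the temperature itself is missing.
cleaned = df.dropna(subset=["temperature"])

print(f"dropping {missing_fraction:.0%} of rows")
print(cleaned)
```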

Fill in Missing Values:
Solution: When removing data isn’t ideal, you can fill in missing values with appropriate replacements. The replacement can be a constant value (e.g., 0) or a measure of central tendency (like the mean or median). Most research shows this is preferable to more sophisticated methods that only work on toy datasets.
Example: Suppose you have missing temperature data for some days; you can replace these gaps with the average temperature for the entire dataset or the climatology, which is a time-dependent average.
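
Both variants can be sketched in a few lines of pandas. The series below is synthetic, and the monthly mean is just one simple way to build a climatology; it is an assumption for illustration, not the only definition.

```python
import numpy as np
import pandas as pd

# Synthetic two-year daily temperature series with a seasonal cycle.
dates = pd.date_range("2021-01-01", "2022-12-31", freq="D")
rng = np.random.default_rng(0)
day_of_year = dates.dayofyear.to_numpy()
seasonal = 10.0 + 8.0 * np.sin(2 * np.pi * (day_of_year - 100) / 365.25)
temperature = pd.Series(seasonal + rng.normal(0.0, 1.0, len(dates)), index=dates)

# Knock out a few days to simulate gaps.
temperature.iloc[[10, 200, 400]] = np.nan

# Option 1: fill every gap with the mean of the whole series.
filled_mean = temperature.fillna(temperature.mean())

# Option 2: fill with a monthly climatology (mean of that calendar month).
climatology = temperature.groupby(temperature.index.month).transform("mean")
filled_clim = temperature.fillna(climatology)

print(filled_mean.iloc[[10, 200, 400]])
print(filled_clim.iloc[[10, 200, 400]])
```

In a train/test setup, the mean or climatology would be computed on the training period only and then applied to the test period.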

Interpolate Data:
Solution: Interpolation entails estimating missing values based on the values of neighbouring data points. This is valuable when you want to maintain the overall trend in the data, but it can lead to data leakage, which has to be avoided at all costs.
Example: If you have hourly temperature readings with some missing, interpolation can help estimate the missing values based on the temperatures recorded just before and after the gaps.
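
A minimal pandas sketch with made-up hourly readings. It also shows a forward fill for comparison, because interpolation looks at the value after the gap, and that is exactly where leakage can creep in for forecasting problems.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly temperatures with a two-hour gap and a missing last value.
hours = pd.date_range("2023-10-21 00:00", periods=6, freq="h")
temperature = pd.Series([10.0, np.nan, np.nan, 13.0, 14.0, np.nan], index=hours)

# Time-aware linear interpolation, restricted to gaps with data on both sides.
interpolated = temperature.interpolate(method="time", limit_area="inside")

# Forward fill only uses past values, so it cannot leak the future.
causal = temperature.ffill()

print(interpolated)
print(causal)
```

The interpolated series fills the inner gap with 11.0 and 12.0 and leaves the trailing value missing, while the forward fill repeats the last known reading.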

Predict Missing Values:
Solution: Machine learning techniques, such as regression, can predict missing values based on other features in your dataset. This is particularly useful when the missing values are related to other available data. This is even more prone to leakage and cascading failures of models. So we’re in dodgy territory here.
Example: If you lack humidity values, you can train a machine learning model to predict humidity based on temperature, pressure, and other variables in your weather dataset.
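
Here's a sketch of what that could look like, assuming scikit-learn and a plain linear regression. The data and the relationship between the variables are synthetic and only there to make the example run.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200

# Synthetic weather table (assumption): humidity depends on the other columns.
df = pd.DataFrame(
    {
        "temperature": rng.normal(15.0, 5.0, n),
        "pressure": rng.normal(1013.0, 8.0, n),
    }
)
df["humidity"] = (
    90.0
    - 1.5 * df["temperature"]
    + 0.05 * (df["pressure"] - 1013.0)
    + rng.normal(0.0, 1.0, n)
)

# Simulate 20 missing humidity readings.
df.loc[df.index[:20], "humidity"] = np.nan

features = ["temperature", "pressure"]
known = df["humidity"].notna()

# Fit only on rows where humidity was actually observed.
model = LinearRegression().fit(df.loc[known, features], df.loc[known, "humidity"])

# Keep a flag so downstream models know which values are estimates.
df["humidity_was_missing"] = ~known
df.loc[~known, "humidity"] = model.predict(df.loc[~known, features])

print(df.head())
```

To limit leakage, the imputation model would be fitted on the training split only, just like any other preprocessing step.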

Keep Missing Values as a Separate Category:
Solution: Oftentimes, it’s valuable to retain missing values as a distinct boolean category or to use them to convey specific information. The choice depends on your data and the problem at hand. I call this “missingness as a feature”.
Example: In a dataset classifying weather conditions as “sunny,” “rainy,” or “missing,” missing values could indicate that weather data wasn’t available for those instances. In an SST field, you get a sort of land-sea mask for free.
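
A short pandas sketch of “missingness as a feature”, with invented column names and values: a boolean flag for the numeric column, and an explicit “missing” label for the categorical one.

```python
import numpy as np
import pandas as pd

# Hypothetical table with a numeric and a categorical column, both with gaps.
df = pd.DataFrame(
    {
        "sst": [13.2, np.nan, 12.9, np.nan],
        "condition": ["sunny", None, "rainy", "sunny"],
    }
)

# Boolean flag: True where SST was missing (effectively a land-sea mask here).
df["sst_missing"] = df["sst"].isna()

# Fill the numeric gaps so models that cannot handle NaNs still work.
df["sst_filled"] = df["sst"].fillna(df["sst"].mean())

# For categories, "missing" simply becomes its own class.
df["condition"] = df["condition"].fillna("missing")

print(df)
```

The flag keeps the information that a value was absent, even after the gap itself has been filled.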

Collect More Data:
Solution: If missing values persist and are critical for your analysis, consider collecting more data. It’s often possible to obtain the missing data from other sources or by extending your data collection efforts. This is especially true if your data is missing systematically, reducing predictability in specific conditions or areas.
Example: If historical temperature data for a specific region is missing, collaboration with a local meteorological agency might help acquire the missing records.

Handling missing values in your data is a practical and vital aspect of data preprocessing. The choice of how to handle them depends on the nature of your data and the specific problem you’re addressing. It’s crucial to select the approach that best suits your dataset and ensures the reliability of your machine learning models. Additionally, we must be careful about data leakage and resist the urge to be overly “smart” about the interpolation. Most people I know who actually work with data will just replace values with the mean value and add a feature of “missingness”.
Sam Harrison had a great point about the reason the data is missing. In statistics, we would often distinguish “missing at random” from its opposite, data that is missing systematically (“missing not at random”). The missingness often carries a very different amount of information in each case. Sam then goes on to mention the “limit of detection”, something I hadn’t thought about, but it makes a lot of sense! What if our value is smaller than our sensors can register? Then we shouldn’t fill in the mean, but rather a value “as small as possible”! Loved that addition!
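
To make the limit-of-detection idea concrete, here's a small sketch. The detection limit, the rainfall values, and the choice to fill with half the limit are all assumptions for illustration; half the limit is one common convention, not a rule from Sam's comment.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor that cannot register rainfall below 0.2 mm.
DETECTION_LIMIT = 0.2

# Made-up readings; assume every NaN means "below the detection limit".
rain = pd.Series([1.4, np.nan, 0.6, np.nan, 2.3])

# Keep the flag, since "too small to measure" is information in itself.
below_limit = rain.isna()

# Fill with a small value under the limit instead of the mean.
rain_filled = rain.fillna(DETECTION_LIMIT / 2)

# For comparison: mean filling would invent fairly heavy rain in the gaps.
rain_mean_filled = rain.fillna(rain.mean())

print(rain_filled.tolist())
print(rain_mean_filled.tolist())
```

This only makes sense when you know the gaps come from the sensor limit. If the sensor was simply offline, a small fill value would be just as wrong as the mean.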
Data Stories
2023 has seen some of the most devastating wildfires in Canada.
Peter Atwood took the time to visualise both the fires and the resulting smoke over the year.
The bright spots show a heat source detected by infrared satellites.
The smoke is derived from aerosol data from the GEOS model by NASA.
Fascinating visualisation!

Watch the full visualisation in 4K
Question of the Week

What is the most over-hyped method in machine learning, in your opinion?

Post your answers on Mastodon and tag me. I’d love to see what you come up with. Then, I can include them in the next issue!
Tidbits from the Web

I just learned about this massive open-access library of books
The Internet Archive now has a scholarly research section for all you nerds out there!
Have a great weekend with this super cute seal being extra adorable

Jesper Dramsch is the creator of PythonDeadlin.es, ML.recipes, data-science-gui.de and the Latent Space Community.
I laid out my ethics including my stance on sponsorships, in case you're interested!
