Carved pumpkins often catch me off gourd
Late to the Party is about insights into real-world AI without the hype.
Hello internet,
Ah, it's still early! Finally, I'm somewhat back on track. So, let's enjoy some high-quality machine learning today! I'm quite proud of today's ML insight!
The Latest Fashion
- This little online game has you crack LLMs with different levels of security! I had so much fun!
- The next iteration of the Scikit-learn online MOOC starts in 10 days!
- Opening up ChatGPT, a full overview of transparency, accountability, and openness of different LLMs and APIs.
Got this from a friend? Subscribe here!
My Current Obsession
I had a lovely chat with some fellow creators on Friday, where we talked about YouTube stats and general creativity. It's so lovely having these communities that you can engage in.
Some folks have started utilising the accountability space I created in the Latent Space, and that makes me very happy as well! (Bonus: someone told me I was the only teacher they ever had who told them about pandas profiler, which was surprising and made me happy about my decisions on what to teach!)
People have been very receptive to our team making the AIFS data-driven weather forecasting system at ECMWF public. This was lovely, and I'm glad we're making such a positive impact already!
I'm also really enjoying the Halloween decorations I've been putting up. It's so lovely decorating your space.
Thing I Like
I was feeling a bit self-conscious because my beard was kinda doing what it wanted despite me really trying to make it look nice. Turns out the right tool really helps. My old grooming kit was slowly having technical difficulties. First time I tried out the Philips multi-groomer, it looked SO nice! All the different tools it comes with are very useful, too. Anyways, enough gushing over some blades!
Hot off the Press
I wrote a nice post about what I learned from working in weather forecasting for 2 years.
In Case You Missed It
I worked a bit on ML-recipes to make it more SEO-friendly and more valuable to everyone.
On Socials
My summary post about working in weather for 2 years has been very well-received.
I also re-shared my PhD thesis, and both Mastodon and LinkedIn enjoyed it quite a bit!
7 months ago, you heard about some strategies for structuring Python packages, and LinkedIn quite enjoyed that this week!
Python Deadlines
I found FlaskCon 2023 for later this year!
Machine Learning Insights
Last week I asked, "How do you treat missing values in your data?", and here's the gist of it:
Handling missing values in your data is a critical step in preparing it for machine learning. These missing values can wreak havoc on the accuracy of your models, potentially leading to incorrect predictions or even breaking the model. Here are some strategies for addressing missing data, with examples from meteorological data such as sea surface temperature (SST), which is missing over land:
-
Identify Missing Values:
Problem: To start, you need to pinpoint where the missing values are in your dataset. These are often represented as "NaN" or "null" values.
Example: In a weather dataset, you might encounter missing values in the "SST" column. As a simple example, these gaps imply that no sea surface temperature was recorded over land.
-
Using Tree-based Models:
Solution: The easiest and most reliable way to handle missing values is to use tree-based models like decision trees or XGBoost. These models can deal with NaNs in the data automatically.
Example: When values are missing in our sea surface temperature, the "missingness" itself can implicitly inform the model about the land-sea mask.
-
Remove Rows with Missing Values:
Solution: A straightforward approach is to eliminate rows containing missing values. This is suitable when the number of missing values is relatively small and won't significantly impact the overall dataset.
Example: If your daily weather data has a few days with missing values in the "temperature" column, you can opt to remove those specific rows from your dataset. This works for observations that aren't systematically missing.
-
Fill in Missing Values:
Solution: When removing data isn't ideal, you can fill in missing values with appropriate replacements. The replacement can be a constant value (e.g., 0) or a measure of central tendency (like the mean or median). Most research shows this is preferable to more sophisticated methods that only work on toy datasets.
Example: Suppose you have missing temperature data for some days; you can replace these gaps with the average temperature for the entire dataset or the climatology, which is a time-dependent average.
-
Interpolate Data:
Solution: Interpolation entails estimating missing values based on the values of neighbouring data points. This is valuable when you want to maintain the overall trend in the data, but it can lead to data leakage, which has to be avoided at all costs.
Example: If you have hourly temperature readings with some missing, interpolation can help estimate the missing values based on the temperatures recorded just before and after the gaps.
-
Predict Missing Values:
Solution: Machine learning techniques, such as regression, can predict missing values based on other features in your dataset. This is particularly useful when the missing values are related to other available data. However, it is even more prone to leakage and cascading model failures, so we're in dodgy territory here.
Example: If you lack humidity values, you can train a machine learning model to predict humidity based on temperature, pressure, and other variables in your weather dataset.
-
Keep Missing Values as a Separate Category:
Solution: Oftentimes, it's valuable to retain missing values as a distinct boolean category or to use them to convey specific information. The choice depends on your data and the problem at hand. I call this "missingness as a feature".
Example: In a dataset classifying weather conditions as "sunny", "rainy", or "missing", missing values could indicate that weather data wasn't available for those instances. In an SST field, you get a sort of land-sea mask for free.
-
Collect More Data:
Solution: If missing values persist and are critical for your analysis, consider collecting more data. It's often possible to obtain the missing data from other sources or by extending your data collection efforts. This is especially true if your data is missing systematically, reducing predictability in specific conditions or areas.
Example: If historical temperature data for a specific region is missing, collaboration with a local meteorological agency might help acquire the missing records.
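Several of the strategies in the list above can be sketched in a few lines of pandas. This is a minimal illustration, not a recommended pipeline: the DataFrame, its column names (`temperature`, `sst`, `sst_missing`) and all values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly weather data; column names and values are invented.
index = pd.date_range("2023-10-01", periods=8, freq=pd.Timedelta(hours=1))
weather = pd.DataFrame(
    {
        "temperature": [11.0, np.nan, 12.0, 12.5, np.nan, np.nan, 14.0, 13.5],
        "sst": [15.2, 15.1, np.nan, np.nan, np.nan, 15.4, 15.3, np.nan],
    },
    index=index,
)

# Identify: count the missing values in each column.
missing_counts = weather.isna().sum()

# Remove: drop the rows where temperature is missing.
dropped = weather.dropna(subset=["temperature"])

# Fill: replace temperature gaps with the column mean.
mean_filled = weather["temperature"].fillna(weather["temperature"].mean())

# Interpolate: estimate gaps linearly in time from neighbouring readings.
# Note that this uses the reading *after* each gap, which is exactly how
# future information can leak into a forecasting setup.
interpolated = weather["temperature"].interpolate(method="time")

# Missingness as a feature: a boolean column marking where SST is absent,
# which acts like a crude land-sea mask in this toy example.
weather["sst_missing"] = weather["sst"].isna()

print(missing_counts)
print(interpolated)
```

Which of these is appropriate depends on why the data is missing; the sketch only shows the mechanics.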
Handling missing values in your data is a practical and vital aspect of data preprocessing. The choice of how to handle them depends on the nature of your data and the specific problem you're addressing. It's crucial to select the approach that best suits your dataset and ensures the reliability of your machine learning models. Additionally, we must be careful about data leakage and resist the urge to be overly "smart" about the interpolation. Most people I know who actually work with data will just replace missing values with the mean and add a "missingness" feature.
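As a rough sketch of that "mean plus missingness feature" habit, the snippet below computes the mean on a training split only and reuses it for the test split, which is one simple way to keep test statistics from leaking into preprocessing. The helper `impute_with_flag`, the `humidity` column and the data are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical humidity readings in percent; values are invented.
train = pd.DataFrame({"humidity": [60.0, np.nan, 70.0, 80.0]})
test = pd.DataFrame({"humidity": [np.nan, 65.0]})


def impute_with_flag(frame, column, fill_value):
    """Return a copy with a boolean missingness column and the gaps filled."""
    out = frame.copy()
    # Record the missingness first, before the gaps are overwritten.
    out[f"{column}_missing"] = out[column].isna()
    out[column] = out[column].fillna(fill_value)
    return out


# The fill value comes from the training split only.
train_mean = train["humidity"].mean()

train_imputed = impute_with_flag(train, "humidity", train_mean)
test_imputed = impute_with_flag(test, "humidity", train_mean)

print(test_imputed)
```

The flag column lets a model learn whether the gap itself carries information, while the filled column stays usable by models that cannot handle NaNs.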
Sam Harrison had a great point about the reason the data is missing. In statistics, we would often look at "missing at random" and the opposite, a variable that is missing systematically. Those values often carry a very different amount of information in their missingness. Sam then goes on to mention the "limit of detection", something I hadn't thought about but that makes a lot of sense! What if our value is smaller than our sensors can register? Then we shouldn't fill values with the mean, but rather with a value "as small as possible"! Loved that addition!
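To make the limit-of-detection idea concrete, here is a small sketch. It assumes a hypothetical rain gauge that cannot resolve less than 0.2 mm and assumes that every gap is a below-detection reading, which real data would need to confirm. Filling with half the detection limit is one common convention; zero or the limit itself would be other defensible choices.

```python
import numpy as np
import pandas as pd

# Assumption: the (hypothetical) gauge cannot resolve less than 0.2 mm.
DETECTION_LIMIT_MM = 0.2

# Invented precipitation readings; NaN is assumed to mean "below detection".
rain = pd.DataFrame({"precip_mm": [1.4, np.nan, 0.6, np.nan, 3.0]})

# Keep the information that these readings were below the limit.
rain["below_detection"] = rain["precip_mm"].isna()

# Fill with a small value (half the limit) instead of the mean.
rain["precip_filled"] = rain["precip_mm"].fillna(DETECTION_LIMIT_MM / 2)

# For comparison: the mean of the observed values, which would badly
# overestimate readings that were too small to measure.
mean_fill = rain["precip_mm"].mean()

print(rain)
print(mean_fill)
```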
Data Stories
2023 has seen some of the most devastating wildfires in Canada.
Peter Atwood took the time to visualise both the fires and the resulting smoke over the year.
The bright spots show a heat source detected by infrared satellites.
The smoke is derived from aerosol data from the GEOS model by NASA.
Fascinating visualisation!
Watch the full visualisation in 4K
Question of the Week
- What is the most over-hyped method in machine learning, in your opinion?
Post your answers on Mastodon and tag me. I'd love to see what you come up with. Then, I can include them in the next issue!
Tidbits from the Web
- I just learned about this massive open-access library of books
- The Internet Archive now has a scholarly research section for all you nerds out there!
- Have a great weekend with this super cute seal being extra adorable
Jesper Dramsch is the creator of PythonDeadlin.es, ML.recipes, data-science-gui.de and the Latent Space Community.
I laid out my ethics including my stance on sponsorships, in case you're interested!