🌌 May the Fourth be with you!

Worried these links might be sponsored? Fret no more. They’re all organic, as per my ethics.

                May 4, 2024


                In this edition, we cover configurable models, efficient transformers, career roadmaps, feature drift in ML, and I finally have a date for my move!

Late to the Party 🎉 is about insights into real-world AI without the hype.

Hello internet,

I have just hit merge on a multi-thousand-line PR that makes our model at work completely configurable. It is work we have been doing as a team for almost 5 months now. Such a relief!
In this issue, I share efficient transformers, tech trees for your career, the removal of LAION-5B, and we answer the question of feature drift in machine learning models! In addition to my coding work, I did some open source work, had a song stuck in my head and I finally have a moving date! Also a re-design on PythonDeadlin.es!
Dive right in, let us! ~ Yoda probably.
The Latest Fashion

It’s a year old, so “ancient” in ML terms, but “A Survey on Efficient Transformer Training” has some good insights.
Developer Roadmaps is a Tech Tree for your career, and I think the idea is amazing!
Who would’ve guessed?! Indiscriminate image scraping is a bad idea. LAION-5B was removed due to CSAM.


My Current Obsession
I mentioned that I’m really curious about pixi, so I have started trying it on my personal projects. I have it running with my personal website and on pythondeadlin.es now. But I was struggling to understand some of the functionality. After going on Discord, where Ruben was super helpful, I decided to give back and update the Python tutorial a bit. I made a bit of a fool of myself, though, when Grammarly corrected conda to condom... So, if you do your writing with Grammarly and you write about Python, make sure to double-check, especially on public PRs.
I finally have a moving date for my new flat. I already had a company lined up, but they quoted me an extra 800€ after the fact when I reminded them I would be moving into a pedestrian zone. After a day of being very upset, I called another company. When they called me back, they unfortunately informed me that they would also have to increase the price—by 29.50€. This is a good reminder to shop around. I am so excited to move into the new place!
I worked on a slight re-design of PythonDeadlin.es, and this may be a coincidence, but the organisers of PyCon Japan updated my entry to add the deadline this week! I have also fixed a huge usability issue that few probably even noticed. Technically, the website remembers your selection of conferences you might be interested in. So when you don’t use PythonDeadlin.es daily, you may come back and think you see all conference deadlines, but actually, some are hidden. That selection now has a lifetime of a few days, to avoid any possible sadness that might ensue if someone misses a conference because they changed a setting a month ago.
I have a few talks coming up as well. An exciting one is representing the work at ECMWF at the “Science Night” in Bonn in mid-May.
Thing I Like
I haven’t gotten this one yet, but I saw a lot of autistic creators recommend these neck coolers. If you have any experience with them, let me know! I’d be super interested!
Hot off the Press
Python Deadlines
Unfortunately, I missed the CfP for PyCon Uganda, but they’re now on their second iteration!
But I did find the CfPs for Xtreme Python, PyCon APAC, PyCon France, PyCon Israel, and I got a PR for PyCon Japan!
PyCon Korea is without CfP for now, but we have conference dates!
Within the next two weeks the CfPs for PyData Eindhoven, PyCon Portugal, and PyCon Russia are closing.
Machine Learning Insights
Last week I asked, Can you explain the concept of feature drift and how it affects machine learning models?, and here’s the gist of it:
Feature drift is a concept in machine learning that refers to the change in the statistical properties of the model input features over time. (You've seen it in the book I sent you when you signed up for this newsletter.)
This shift can lead to a significant decline in the model's performance, as the predictions are based on patterns learned from historical data, which may no longer be relevant if the input data has changed. Depending on where you deploy your models, this can get you in regulatory hot water or endanger lives!
Understanding Feature Drift
It's helpful to think about the data that feeds into machine learning models to grasp feature drift. These models make predictions based on features—quantitative or categorical data points used as inputs. Imagine the underlying distribution or behaviour of these features changes after the model has finished training. In that case, the model will start making wrong predictions simply because it has made connections between certain patterns that do not apply anymore.
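To make that concrete, here is a minimal sketch of a model going stale. Everything in it is made up for illustration: the toy one-feature classifier, the helper names `fit_threshold` and `accuracy`, and the numbers. The "model" only learns a cut-off between two classes, and the drift is a plain shift of the feature by a constant.

```python
def fit_threshold(features, labels):
    """Learn a cut-off halfway between the two class means (toy model)."""
    class0 = [x for x, y in zip(features, labels) if y == 0]
    class1 = [x for x, y in zip(features, labels) if y == 1]
    return (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2


def accuracy(threshold, features, labels):
    """Share of samples where 'feature above threshold' matches the label."""
    predictions = [1 if x > threshold else 0 for x in features]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Hypothetical training data: class 0 sits around 2, class 1 around 8.
train_x = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = [0, 0, 0, 1, 1, 1]
threshold = fit_threshold(train_x, labels)

# Hypothetical drift: the same samples, but the feature moved up by 4.
drifted_x = [x + 4.0 for x in train_x]

print("accuracy on training-like data:", accuracy(threshold, train_x, labels))
print("accuracy after drift:", accuracy(threshold, drifted_x, labels))
```

The labels never changed, only the inputs did, and yet the learned cut-off now misclassifies part of class 0. Real models fail in less obvious ways, but the mechanism is the same.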
Causes of Feature Drift
Feature drift can occur for various reasons:

Changes in consumer behaviour: For example, changes in shopping habits before and during major holidays, during a pandemic or during economic shifts, like the current cost-of-living crisis.
Technological advances: New technologies can change how people interact with systems, altering the generated data and disrupting existing economies, as Uber and Amazon did. Think of phones basically destroying the market for consumer cameras and Walkmans. Or possibly a suspicious moon showing up and altering the gravitational field of your home planet.
Seasonal effects: Different times of the year can bring different behaviours or conditions, affecting the data. Depending on the training data, we have to be careful to represent the full spectrum of our real-world data.
External changes: These could be regulatory changes, like the EU increasing the world's interest in USB-C chargers. But these could also be climate change impacting extreme weather events.

Detecting and Mitigating Feature Drift
Detecting feature drift involves monitoring the statistical properties of the input features regularly and comparing them to the properties of the data used to train the model. Techniques such as statistical tests (e.g., the Kolmogorov-Smirnov test for continuous features and the Chi-squared test for categorical ones), visualization (scatter plots, histograms), or more complex methods like change point detection for time series can be employed.
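As a rough sketch of the statistical-test route, the two-sample Kolmogorov-Smirnov statistic can be computed with nothing but the standard library. The function names `ks_statistic` and `has_drifted`, the sample sizes, and the synthetic Gaussian data are assumptions for illustration. The drift flag uses the common large-sample critical value with a coefficient of about 1.358 for a significance level near 0.05; in practice you would more likely reach for a statistics library that also returns a p-value.

```python
import math
import random
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest gap between the empirical CDFs of two samples."""
    ref = sorted(reference)
    cur = sorted(current)
    max_gap = 0.0
    # Both CDFs are step functions, so the gap can only peak at an observed value.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap


def has_drifted(reference, current, coefficient=1.358):
    """Flag drift when the statistic exceeds the large-sample critical value.

    coefficient=1.358 corresponds to a significance level of roughly 0.05.
    """
    n, m = len(reference), len(current)
    critical = coefficient * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, current) > critical


# Synthetic data: a training-time feature and a production feature whose mean moved.
rng = random.Random(42)
training = [rng.gauss(0.0, 1.0) for _ in range(2000)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(2000)]

print("KS statistic:", ks_statistic(training, shifted))
print("drift flagged:", has_drifted(training, shifted))
```

Keep in mind that with a 0.05 level, about one in twenty checks on perfectly stable data will still raise a flag, so it pays to look at the size of the statistic and not only at the yes/no answer.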
To mitigate feature drift, you can:

Re-train models: Regularly update the model with new data. (I'm still proud of the figure in the book.)
Use adaptive models: Implement models that automatically adjust to changes in data over time. But that is often easier said than done.
Feature engineering: Modify or select features in ways that are less susceptible to drift. In geophysics, we often work with relative instead of absolute measurements for this!
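The feature engineering point can be sketched in a few lines. This is a toy illustration with made-up readings and a hypothetical helper `to_relative`, assuming the drift is a constant offset in a sensor baseline: the absolute values move, while the changes between consecutive readings stay put.

```python
def to_relative(values):
    """Turn absolute readings into changes between consecutive readings."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


def mean(values):
    return sum(values) / len(values)


# Hypothetical readings seen during training.
training_readings = [10.0, 10.5, 11.5, 11.0, 12.0]
# Hypothetical drift: same signal, but the sensor baseline moved up by 5.
drifted_readings = [value + 5.0 for value in training_readings]

print("shift in absolute mean:", mean(drifted_readings) - mean(training_readings))
print("relative features unchanged:",
      to_relative(training_readings) == to_relative(drifted_readings))
```

This only protects against offsets. If the spread or the shape of the distribution changes, relative features drift too, so monitoring and re-training are still needed.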

Feature Drift in Meteorology
In meteorology, feature drift can be particularly relevant due to changes in climate patterns. For instance, if a model is used to predict weather events based on historical weather data, shifts in climate could alter those underlying data distributions. Over time, the model's predictions become less accurate as the actual weather patterns deviate from past trends and extreme weather events possibly become more frequent. Regular updates to the model with recent data reflecting the new weather patterns might be crucial to maintain its accuracy. But honestly, we haven't done studies on the deterioration of models yet, so that's just a "this is what I'd anticipate".
Understanding and managing feature drift is essential for maintaining the robustness and reliability of machine learning models, especially in dynamic environments where data can evolve significantly over time.
Got this from a friend? Subscribe here!

Question of the Week

How can machine learning models be designed to be more adaptable to changing environmental conditions?

Post your answers on Mastodon and tag me. I'd love to see what you come up with. Then I can include them in the next issue!
Tidbits from the Web

I am “in this video”, and I don’t like it…
There’s a Tiktoker who summarises “niche tea” and it’s my favourite series right now.
I’ve had this song stuck in my head for 3 days now…

Jesper Dramsch is the creator of PythonDeadlin.es, ML.recipes, data-science-gui.de and the Latent Space Community.
I laid out my ethics, including my stance on sponsorships, in case you're interested!
