-
19 May 2022Downtime Friday 20th: Spring cleaning time for the database
The website will be unavailable tomorrow (Friday May 20th, 2022) for up to an hour. The downtime will be sometime between 3-6 pm CEST.
This is an annual "spring cleaning time" where I will run some manual queries on the database to trim stale data from very old, inactive accounts. The goal is to prevent the database from slowing down due to large indexes, as well as keep the full backups to a manageable size.
As I have done in the past I will use the following criteria :
- Delete flashcards from accounts inactive for 2 YEARS.
- Delete private stories where the user's account was inactive for 5 YEARS.
An inactive account is an account for which the user did not sign in within the period. For example if you sign in just once within the last 5 years, the account data is persisted. It is not necessary to do any activity - just sign in. In other words what I'm looking at is "the last time the user logged in".
For flashcards : I have found in the past from emails asking "how do I reset my account?" that users ; when they come back after a long break ; usually want to delete all their cards and start over. It makes sense for how the SRS works after all. So the 2 YEAR interval seems sensible to me. Keep in mind on Koohii a "flashcard" refers to the SRS status for each kanji - not stories.
For stories : we're looking at deleting private stories only, from old inactive accounts. This ensures the shared stories area is unaffected and as you'll see below, the shared stories only account for 7% of 7 million+ stories! Still, it's not unusual for users to come back after several years break, which is why I used a more sensible 5 year interval.
How to Backup Your Data
If you think you may take a long break, or have finished with Kanji Koohii for the foreseeable future - keep in mind your data is not "locked in". You can export all your stories from the My Stories page. This will download a CSV format file which you can load up in a typical spreadsheet app like Google Docs or Libre Office.
The stories export could be useful also for eg. printing and browsing offline.
You can also export your kanji flashcards from the Manage Flashcards section of the site. This is a CSV export and how you can use that data elsewhere is up to you. To clarify, this is the SRS status data.
On Koohii, stories are independent of kanji cards. Many users are only interested in the stories section of the site, other users are only interested in the SRS. Some use both. Just remember : stories are stored independently of flashcards. You can remove a card and add it back later and the corresponding story for that kanji will still be there.
More database nitty gritty and thoughts
I have found over the years that as long as the full gzipped database backup is under 1 GB, it is manageable for me. It doesn't take too long to download, and if I need to debug something, it doesn't take forever either, to "copy" that database in my development environment.
By running some statistics I have found in the past that the stories table in the database is, unsurprisingly, the most space consuming.
As of a recent May backup, there are over 7,294,500 stories! (yes, that is seven millions). Of those, "only" 513,800 are shared - that is just 7% !
So what this means, is when users go through RTK and input or copy shared stories for 2000 kanji, and then leave and never come back,... this just adds up to a lot of data over the years.
Thankfully due to the nature of this service, I can afford to trim some of that excess data: it turns out that those users who do in fact return (after several years break) often want to "reset" their account and start over. I know that because I'd get occasional emails like "hey, I haven't used koohii in a while, how can I reset my account ?".
More recently, the crude admin panel that I have lets me see when users delete accounts and create accounts - useful to pinpoint issues with registration, or abuse by spam bots. And I could see a user initiated an account deletion, and minutes later created a new account under the same name. I had a hunch that not all accounts deleted are necessarily users leaving and this confirmed it - many users want to "reset" their account after a long break.
While we're here let's have some numbers - as of a recent May backup:
Table Rows data_length index_length ---------------------------------------------------- reviews 8,513,399 194 MB 273 MB stories 7,294,513 1103 MB 191 MB custkeywords 1,834,099 58 MB 23 MB
The reviews (ie. flashcards) table may have an issue with how I designed it. I'm not sure. It's pretty silly you can see that the index itself is larger than the data... although I think it is not unusual when the average row length is small (24 bytes).
Based on some test runs tomorrow's "spring cleaning" will shave off 2,057,521 private stories (from accounts not used in 5+ years), of a total of 7,294,513. That is 28% of the stories table, and not negligible 300 MB trim. It will also shave 40% of the flashcards table (flashcards from accounts not used in 2+ years), another significant ~180 MB trim.
Hope you enjoyed this little foray into the database!
I will post an update when the site is back up tomorrow after the downtime.
By Month
- Oct 2024 (1)
- Sep 2024 (1)
- Jun 2024 (2)
- May 2024 (4)
- Apr 2024 (3)
- Mar 2024 (1)
- Feb 2024 (1)
- Dec 2023 (1)
- Nov 2023 (2)
- Oct 2023 (2)
- Apr 2023 (2)
- Mar 2023 (2)
- Feb 2023 (1)
- Jan 2023 (2)
- Dec 2022 (1)
- Nov 2022 (2)
- Oct 2022 (3)
- Sep 2022 (1)
- May 2022 (4)
- Apr 2022 (1)
- Feb 2022 (2)
- Jan 2022 (2)
- Dec 2021 (4)
- Nov 2021 (2)
- Oct 2021 (2)
- Sep 2021 (2)
- Aug 2021 (1)
- Apr 2021 (2)
- Feb 2021 (3)
- Jan 2021 (3)
- Dec 2020 (1)
- Nov 2020 (1)
- May 2020 (1)
- Apr 2020 (1)
- Jan 2020 (1)
- Oct 2019 (1)
- Sep 2019 (1)
- Aug 2019 (4)
- Jul 2019 (3)
- Jun 2019 (1)
- May 2019 (1)
- Mar 2019 (2)
- Jan 2019 (1)
- Nov 2018 (3)
- Oct 2018 (8)
- Sep 2018 (4)
- Aug 2018 (3)
- Jul 2018 (1)
- Jun 2018 (4)
- May 2018 (1)
- Apr 2018 (1)
- Mar 2018 (1)
- Jan 2018 (1)
- Dec 2017 (6)
- Nov 2017 (4)
- Oct 2017 (4)
- Sep 2017 (5)
- Aug 2017 (5)
- Jun 2017 (3)
- May 2017 (2)
- Apr 2017 (3)
- Mar 2017 (7)
- Feb 2017 (10)
- Jan 2017 (11)
- Dec 2016 (6)
- Nov 2016 (5)
- Oct 2016 (6)
- Sep 2016 (7)
- Aug 2016 (3)
- May 2016 (1)
- Mar 2016 (2)
- Jan 2016 (1)
- Dec 2015 (3)
- Nov 2015 (1)
- Oct 2015 (1)
- Sep 2015 (7)
- Jul 2015 (2)
- Jun 2015 (1)
- May 2015 (5)
- Apr 2015 (4)
- Mar 2015 (5)
- Feb 2015 (4)
- Jan 2015 (5)
- Dec 2014 (4)
- Nov 2014 (3)
- Oct 2014 (2)
- Jun 2014 (1)
- Apr 2014 (2)
- Mar 2014 (4)
- Feb 2014 (3)
- Jan 2014 (4)
- Dec 2013 (2)
- Oct 2013 (1)
- Sep 2013 (1)
- Jun 2013 (4)
- May 2013 (1)
- Mar 2013 (1)
- Jan 2013 (2)
- Oct 2012 (2)
- Aug 2012 (1)
- Jul 2012 (2)
- Jun 2012 (2)
- May 2012 (1)
- Mar 2012 (2)
- May 2011 (1)
- Apr 2011 (4)
- Mar 2011 (3)
- Feb 2011 (2)
- Jan 2011 (2)
- Dec 2010 (8)
- Nov 2010 (8)
- Oct 2010 (3)
- Sep 2010 (3)
- Aug 2010 (1)
- Jul 2010 (2)
- Jun 2010 (5)
- May 2010 (1)
- Apr 2010 (3)
- Mar 2010 (4)
- Feb 2010 (2)
- Jan 2010 (1)
- Dec 2009 (5)
- Nov 2009 (5)
- Oct 2009 (1)
- Aug 2009 (1)
- May 2009 (5)
- Apr 2009 (2)
- Mar 2009 (1)
- Feb 2009 (2)
- Jan 2009 (2)
- Nov 2008 (1)
- Oct 2008 (1)
- Sep 2008 (1)
- May 2008 (2)
- Apr 2008 (1)
- Feb 2008 (6)
- Jan 2008 (5)
- Dec 2007 (6)
- Oct 2007 (1)
- Sep 2007 (2)
- Aug 2007 (3)
- Jun 2007 (1)
- May 2007 (5)
- Apr 2007 (1)
- Mar 2007 (2)
- Feb 2007 (1)
- Jan 2007 (4)
- Dec 2006 (3)
- Aug 2006 (1)
- Jun 2006 (3)
- Apr 2006 (6)
- Mar 2006 (8)
- Feb 2006 (1)
- Jan 2006 (4)
- Nov 2005 (1)
- Oct 2005 (4)
- Sep 2005 (1)
- Aug 2005 (11)