Uptime September 2018

Uptime report for the past month:

api.rethumb.com
http://www.uptimedoctor.com/publicreport/vi083t3k/79788/1/345532

dashboard.rethumb.com
http://www.uptimedoctor.com/publicreport/vi083t3k/79788/1/346471

rethumb.com
http://www.uptimedoctor.com/publicreport/vi083t3k/79788/1/346472


Main Events

Some details on the main events of the past month.

🧯🔥 dashboard.rethumb.com down for 4h 55m:
  • As reported in Uptime August 2018, we had some issues with the machine running dashboard.rethumb.com. These issues are now resolved.
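As a rough illustration of what an outage of this length means for monthly availability, here is a minimal Python sketch. It assumes the whole 4h 55m fell inside a 30-day month and that there was no other downtime; the function name is hypothetical and the result is not a figure taken from the uptime report.

```python
from datetime import timedelta


def availability(downtime, period):
    """Return the percentage of the period during which the service was up."""
    # Dividing one timedelta by another yields a plain float ratio.
    return 100.0 * (1 - downtime / period)


# Assumption: a 30-day month with a single outage of 4h 55m.
month = timedelta(days=30)
outage = timedelta(hours=4, minutes=55)
print(f"{availability(outage, month):.2f}%")
```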

🧮🧪 API Performance

  • The following charts show the maximum execution time for a new image request (without using the CDN and not hitting any local cache).

August 2018

September 2018

  • These values should be as low as possible. Even so, there is a significant improvement from August to September: the values vary less.

  • For now, our primary goal is to avoid spikes, which occur when the system is under high load and scaling up.
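To make "less variation" and "spikes" concrete, here is a minimal Python sketch of how a series of execution times could be summarized. The function name, the threshold and the sample numbers are illustrative assumptions, not rethumb measurements or internal tooling.

```python
from statistics import pstdev


def summarize(times_s, spike_threshold_s):
    """Summarize request execution times given in seconds."""
    return {
        "max": max(times_s),
        # Population standard deviation: lower means less variation.
        "spread": pstdev(times_s),
        # Requests slower than the (hypothetical) threshold count as spikes.
        "spikes": sum(1 for t in times_s if t > spike_threshold_s),
    }


# Made-up samples for illustration only, not real measurements.
steady = summarize([1.0, 1.2, 1.1, 1.3], spike_threshold_s=3.0)
spiky = summarize([1.0, 1.2, 6.5, 1.1], spike_threshold_s=3.0)
print(steady)
print(spiky)
```

Two months can have a similar typical time and still differ a lot in spread and spike count, which is the difference the charts are meant to show.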

Uptime August 2018

Uptime report for the past month:

api.rethumb.com
http://www.uptimedoctor.com/publicreport/vi083t3k/79788/1/345532

dashboard.rethumb.com
http://www.uptimedoctor.com/publicreport/vi083t3k/79788/1/346471

rethumb.com
http://www.uptimedoctor.com/publicreport/vi083t3k/79788/1/346472


Main Events

Some details on the main events of the past month.

🧯🔥 dashboard.rethumb.com down for 45min:
  • The machine running dashboard.rethumb.com had a drastic performance degradation and we had to move the application to another host.

  • This issue started in August but was only solved in September.

  • This issue was similar to the one reported in Uptime February 2018.


Notes

Some changes in our infrastructure during the past month.

🖥️ ✈️ Servers migration:

  • We moved our main entry point api.rethumb.com to servers in Europe.

  • The impact should be minimal and transparent as the entry point is still behind the CloudFlare cache.

  • This also affects our clients using the CNAME feature.

Uptime May 2018

Uptime report for the past month:

api.rethumb.com
http://www.uptimedoctor.com/publicreport/vi083t3k/79788/1/345532

dashboard.rethumb.com
http://www.uptimedoctor.com/publicreport/vi083t3k/79788/1/346471

rethumb.com
http://www.uptimedoctor.com/publicreport/vi083t3k/79788/1/346472


Main Events

May 15, 2018 - Object Store Incidents

We had another issue with our main object store provider. As usual, the system used the backup service, but now we can do a fast switch and start using a new primary service without downtime.
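The difference between falling back on every failure and switching the primary can be sketched in a few lines of Python. This is only an illustration under assumed names: InMemoryStore, StoreRouter and switch_primary are hypothetical and do not describe rethumb's actual implementation.

```python
class ObjectStoreError(Exception):
    """Raised by a store that cannot serve the request."""


class InMemoryStore:
    """Stand-in for a real object store client (hypothetical)."""

    def __init__(self, name, objects, healthy=True):
        self.name = name
        self.objects = dict(objects)
        self.healthy = healthy

    def get(self, key):
        if not self.healthy:
            raise ObjectStoreError(f"{self.name} is unavailable")
        return self.objects[key]


class StoreRouter:
    """Reads from the primary store and falls back to the backup on failure."""

    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup

    def get(self, key):
        try:
            return self.primary.get(key)
        except ObjectStoreError:
            # Fallback: each request first pays for the failed primary call.
            return self.backup.get(key)

    def switch_primary(self, new_primary):
        # Fast switch: later requests go straight to the new primary.
        self.primary = new_primary


failing = InMemoryStore("primary", {"cat.jpg": b"..."}, healthy=False)
backup = InMemoryStore("backup", {"cat.jpg": b"..."})
router = StoreRouter(failing, backup)
router.get("cat.jpg")          # served by the backup after the primary fails
router.switch_primary(backup)  # stop calling the failing provider
router.get("cat.jpg")          # served directly by the new primary
```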

Incident:

https://status.digitalocean.com/incidents/ql22cg0mzsj4

Our Twitter message:


Notes

  • We migrated all our queues from Beanstalk to RabbitMQ. In the future we would like to write a blog post about our experience and why we made the change.

Uptime April 2018

Uptime report for the past month:

api.rethumb.com
http://www.uptimedoctor.com/publicreport/vi083t3k/79788/1/345532

dashboard.rethumb.com
http://www.uptimedoctor.com/publicreport/vi083t3k/79788/1/346471

rethumb.com
http://www.uptimedoctor.com/publicreport/vi083t3k/79788/1/346472


Main Events

Apr 24, 2018 - Object Store Incidents

For one hour we had an issue with our primary Object Store provider. The system continued to work using the backup service, although we would prefer a fast switch between providers instead of a fallback when the first one fails. This is being addressed in the upcoming release, v67.

Incident:

https://status.digitalocean.com/incidents/fhspl6w8yp1b

Our Twitter message:


Notes

  • On-going effort: we are in the process of migrating to a 100% container-based infrastructure in order to have more flexibility and improve our scalability response - more on this soon.

Uptime March 2018

Uptime report for the past month:

api.rethumb.com
http://www.uptimedoctor.com/publicreport/vi083t3k/79788/1/345532

dashboard.rethumb.com
http://www.uptimedoctor.com/publicreport/vi083t3k/79788/1/346471

rethumb.com
http://www.uptimedoctor.com/publicreport/vi083t3k/79788/1/346472

Main Events

No main events.

Our API is now stable, with no events in the past weeks.

Notes

  • We are in the process of migrating to a 100% container-based infrastructure in order to have more flexibility and improve our scalability response - more on this soon.

Uptime February 2018

Uptime report for the past month:

api.rethumb.com
http://www.uptimedoctor.com/publicreport/vi083t3k/79788/1/345532

dashboard.rethumb.com
http://www.uptimedoctor.com/publicreport/vi083t3k/79788/1/346471

rethumb.com
http://www.uptimedoctor.com/publicreport/vi083t3k/79788/1/346472

Main Events

Some details on the main events of the past month.

api.rethumb.com performance issues
  • Our main API servers received a Spectre and Meltdown update, but after the reboot the performance was 10x worse. This caused noticeable issues with the handling of requests and we had to move the servers to new hardware. We changed some of the servers to a new provider in order to avoid having all the servers in the same location.

  • Note: the slow performance is not directly related to the patch. Other machines also received the update and didn’t suffer any impact.

dashboard.rethumb.com down for 03hr 15min
  • Our dashboard suffered a major impact during the migration issues that we had in the past month. During the migration described above we also moved our dashboard to new hardware and a new location. The 3h downtime was mainly due to the poor CPU performance after the patch.

Final Notes

  • Some of our machines are now running in Europe instead of NY. This won’t have any major impact on our final users.

  • Our monitor for api.rethumb.com is hitting the CloudFlare cache. We will add new monitors with statistics from our own servers without the cache in front.
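One way a monitor could tell whether a response came from the CloudFlare cache or from the origin is to look at the CF-Cache-Status response header. The following Python sketch works on a plain dictionary of headers; the function name is hypothetical and this is not a description of the monitors we use.

```python
# Values Cloudflare documents for responses answered from its cache.
SERVED_FROM_CACHE = {"HIT", "STALE", "UPDATING", "REVALIDATED"}


def served_from_cloudflare_cache(headers):
    """Return True when the CF-Cache-Status header reports a cached response."""
    # Header names are case-insensitive, so normalize them before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    status = normalized.get("cf-cache-status", "")
    return status.upper() in SERVED_FROM_CACHE


print(served_from_cloudflare_cache({"CF-Cache-Status": "HIT"}))
print(served_from_cloudflare_cache({"CF-Cache-Status": "MISS"}))
```

A check that only ever sees cached responses measures the cache, not the servers behind it, which is why monitors without the cache in front are needed.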

Outage 14/Feb/2018

At the moment rethumb is having some performance issues. This is being addressed with our VPN provider.

The root cause is related to software upgrades to mitigate the Meltdown and Spectre issues.

Updates

17/Feb/2018

We have decided to start migrating our infrastructure to new providers.

18/Feb/2018

The first phase of the infrastructure migration is now done. We will continue working to migrate the remaining machines.

We expect to have the migration done later today. Until then, expect some slowdown when processing new images.

19/Feb/2018

Our infrastructure migration is now complete and the system is stable.

After this episode we will take some measures to prevent these issues in the future.

We will also take some additional measures such as:

  • Create a public dashboard with current service status.
  • Use our Twitter account to publish details about outages.
  • Use our blog to report outages and on-going efforts to mitigate them.

Release v66

Starting today, we will publish a post for every release of a new API version. These posts aim to share the internal changes, bugfixes and new features in each release of rethumb.

Release: v66
Date: 06/Feb/2018


#1 Bugfix

Fixed the fallback to the original image, which is used when the system can’t process the user request and has to send back the original image instead of a processed one.

To configure the timeout behaviour, users can go to “Source section > Timeout Action” in the Dashboard.
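The behaviour behind this setting can be pictured with a short Python sketch. It is only an illustration: the function, the on_timeout parameter and its "original" and "error" values are assumptions, not the actual options of the Timeout Action setting or rethumb's code.

```python
def render(original, process, on_timeout="original"):
    """Return the processed image, or apply the timeout action if processing times out."""
    try:
        return process(original)
    except TimeoutError:
        if on_timeout == "original":
            # Fallback: send back the untouched source image.
            return original
        # Any other (hypothetical) action surfaces the failure to the caller.
        raise


def too_slow(image):
    # Stand-in for a processing step that exceeds its time budget.
    raise TimeoutError("processing took too long")


print(render(b"source-bytes", too_slow))
```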