What are Time-of-Fault (TOF) bugs?
- Three components:
  - Trigger: a fault [1] that occurs at a specific moment.
  - Error: an unexpected state left behind by the untimely fault.
  - Failure: the system cannot properly handle that unexpected state.
- A real example: MR-3858
  - Trigger: the bug occurs only when a task attempt crashes in the middle of committing.
  - Error: the crashed attempt's ID remains recorded in a heap object as the committing attempt.
  - Failure: no relaunched task attempt can commit, and the job fails (hangs).
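The MR-3858 pattern above can be illustrated with a small, self-contained sketch. This is not Hadoop's code: the `CommitCoordinator` class, its methods, the `run_task` helper, and the `clear_stale_state` switch are all hypothetical names introduced only to show how a crash in the middle of committing can leave stale heap state that blocks every relaunched attempt. The cleanup path is one plausible repair, assumed for illustration, and not necessarily the actual patch.

```python
class CommitCoordinator:
    """Hypothetical stand-in for the component that grants commit permission."""

    def __init__(self):
        # Heap state: ID of the attempt currently allowed to commit, if any.
        self.committing_attempt = None

    def can_commit(self, attempt_id):
        # The first attempt to ask becomes the committing attempt;
        # every other attempt is refused.
        if self.committing_attempt is None:
            self.committing_attempt = attempt_id
        return self.committing_attempt == attempt_id

    def on_attempt_failed(self, attempt_id, clear_stale_state=False):
        # Buggy behavior (default): the crashed attempt's ID stays recorded.
        # Assumed repair: forget the attempt so a relaunch can commit.
        if clear_stale_state and self.committing_attempt == attempt_id:
            self.committing_attempt = None


def run_task(coordinator, crash_mid_commit, clear_stale_state=False, max_attempts=3):
    """Return the ID of the attempt that commits, or None if none can."""
    for n in range(max_attempts):
        attempt_id = f"attempt_{n}"
        if not coordinator.can_commit(attempt_id):
            continue  # Failure: the relaunched attempt is refused.
        if n == 0 and crash_mid_commit:
            # Trigger: the fault hits after permission was granted
            # but before the commit finished.
            coordinator.on_attempt_failed(attempt_id, clear_stale_state)
            continue
        return attempt_id
    return None


if __name__ == "__main__":
    print(run_task(CommitCoordinator(), crash_mid_commit=False))
    print(run_task(CommitCoordinator(), crash_mid_commit=True))
    print(run_task(CommitCoordinator(), crash_mid_commit=True, clear_stale_state=True))
```

In this sketch, a crash at any other moment (before `can_commit` is called, or after the commit finishes) leaves no stale state, which is why only the timing of the fault exposes the bug.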
What is FCatch?
FCatch is a research project that targets TOF bugs in cloud systems. It currently contains:
- A TOF bug model and a TOF bug benchmark suite.
- A tool that automatically predicts TOF bugs from correct runs with a low false-positive rate.
We are working on providing better solutions to make software systems more reliable, especially for distributed and cloud systems.
Publications
- FCatch: Automatically detecting time-of-fault bugs in cloud systems. Haopeng Liu, Xu Wang, Guangpu Li, Shan Lu, Feng Ye and Chen Tian, ASPLOS’18
@inproceedings{liu2018fcatch,
  title={FCatch: Automatically Detecting Time-of-fault Bugs in Cloud Systems},
  author={Liu, Haopeng and Wang, Xu and Li, Guangpu and Lu, Shan and Ye, Feng and Tian, Chen},
  booktitle={Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'18)},
  year={2018}
}
[1] Message loss or node crashes (How common are they?).