When software problems are not software problems

Apologies if this has already been asked, or if you think it really belongs in a wiki.

I am a software developer at a company that produces microthermal presses for the bioscience industry. I mainly interface with various bits of hardware (pneumatics, hydraulics, stepper motors, sensors, etc.) by developing a graphical interface in C++ for aspirating samples and printing them onto slides with microchips.

When I joined the company, I noticed that whenever a hardware-related problem occurred, the entire system would freeze, leaving no one any the wiser about the cause: hardware, software, misuse, etc. Since then, I have improved the situation somewhat by introducing software timeouts and exception handling to better identify and deal with hardware-related problems as they arise: for example, PLC commands that did not complete successfully, incorrect FPGA command responses, and various other conditions such as deadlocks. In addition, the software now logs a summary of the specific problem, informs the user, and gracefully exits the thread. This software is not embedded; it just talks to the hardware over serial ports.
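A minimal sketch of the "software timeout plus exception" idea described above, assuming the serial layer exposes a non-blocking read. `HardwareTimeout`, `waitForReply`, and `tryRead` are hypothetical names for illustration, not the poster's actual code.

```cpp
#include <chrono>
#include <functional>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical exception type: thrown when a device does not answer in time.
struct HardwareTimeout : std::runtime_error {
    explicit HardwareTimeout(const std::string& command)
        : std::runtime_error("Hardware timeout waiting for: " + command) {}
};

// Polls `tryRead` until it yields a reply or `limit` elapses.
// Assumption: `tryRead` is non-blocking; it returns true and fills `reply`
// once a complete response has arrived on the serial port.
std::string waitForReply(const std::string& command,
                         const std::function<bool(std::string&)>& tryRead,
                         std::chrono::milliseconds limit) {
    const auto deadline = std::chrono::steady_clock::now() + limit;
    std::string reply;
    while (!tryRead(reply)) {
        if (std::chrono::steady_clock::now() >= deadline) {
            throw HardwareTimeout(command);  // caller logs it and informs the user
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    return reply;
}
```

Because the exception carries the command name, the handler that catches it can log which exchange failed instead of leaving the interface frozen.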

Despite what has been achieved, the non-software guys still do not understand that in these cases the “software” problem they report to me is not really a software problem: the software reports the problem, but does not cause it. Don’t get me wrong, I like nothing more than coming down on software bugs like a ton of bricks and looking for ways to increase stability. I know the system well enough that I now have almost a sixth sense for these things.

No matter how many times I try to explain this, nothing really sinks in. They still report hardware problems (which eventually get fixed) as software problems.

I would like to hear from others who have experienced similar finger-pointing and what methods they used to deal with it.

UPDATE: Some great answers here that pretty much sing from the same hymn sheet: be more descriptive. I thought that identifying the command and bailing out cleanly when the hardware malfunctions was a good first step, but it is still not enough. The next step will be to map PLC commands, which are rather meaningless to the layman, to something more suggestive. “PLC command M71 timeout” becomes “Syringe system initialization error. Check for sufficient vacuum”, etc.
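The mapping from raw PLC commands to operator-facing text could be sketched as a simple lookup table. This is only an illustration: `describeTimeout` is a hypothetical function, and only the M71 entry comes from the post.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical lookup from raw PLC command codes to operator-facing text.
// Only the M71 entry is taken from the post; a real table would cover every command.
std::string describeTimeout(const std::string& plcCommand) {
    static const std::unordered_map<std::string, std::string> friendly = {
        {"M71", "Syringe system initialization error. Check for sufficient vacuum."},
    };
    const auto it = friendly.find(plcCommand);
    if (it != friendly.end()) {
        return it->second;
    }
    // Fall back to the raw code so unmapped commands are still reported.
    return "Hardware did not respond (PLC command " + plcCommand + " timeout).";
}
```

Keeping the fallback means a command that nobody has mapped yet still produces a message that names the hardware rather than failing silently.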

+7
error-reporting
8 answers

Perhaps when you report a problem, whether as a message to the user or as an entry in the log file, you need to state explicitly that it is the hardware that is not working:

"The stepper motor is not responding."

Unfortunately, since the software is what people see and interact with, they assume that the software is all there is.

+2

You could try flagging the error messages as “EQUIPMENT PROBLEM”. That might get your point across.

+2

There is no such thing as a non-software problem in a system like this. The software is the boss, and the boss cannot blame its tools for a failure.

If the underlying hardware does not work correctly, the software should tell the user exactly what is wrong with which component. If it does not, that is a software problem.

For example, if a TCP connection drops, the software should reconnect. If an FPGA response is wrong, the software should state exactly what the inputs and outputs were and which side was at fault. If it does not, that is a software problem.
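As a rough sketch of what "exactly what is wrong with which component" might look like in code, the software could record both sides of a failed exchange and format them into one message. `FaultReport` and `formatFault` are hypothetical names, and the example strings are invented for illustration.

```cpp
#include <sstream>
#include <string>

// Hypothetical record of one failed hardware exchange.
struct FaultReport {
    std::string component;  // e.g. "FPGA"
    std::string sent;       // what the software sent
    std::string expected;   // what a healthy device should answer
    std::string received;   // what actually came back (empty = no answer)
};

// Builds a message that names the component and shows both sides of the exchange.
std::string formatFault(const FaultReport& fault) {
    std::ostringstream out;
    out << "HARDWARE PROBLEM in " << fault.component << ": sent '" << fault.sent
        << "', expected '" << fault.expected << "', ";
    if (fault.received.empty()) {
        out << "received no response.";
    } else {
        out << "received '" << fault.received << "'.";
    }
    return out.str();
}
```

With the sent, expected, and received values side by side, whoever reads the report can see which side of the exchange misbehaved without having to ask the developer.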

+1

I agree with the other posters, but I wanted to add another perspective: it could be worse. They could spend days or weeks trying to fix a hardware problem and only find out later, when everyone is under the gun and losing their minds, that they were addressing the wrong thing and that it was actually a software problem. So count your blessings. If they always classify it as a software problem, at least you get to hear about it. Only then can you troubleshoot, perhaps add code to detect or work around the problem, and make the system a little better.

Plus, this is pretty much what every software developer everywhere has run into. Except that usually it is software versus the user, not software versus hardware. And in that case, too, there seems to be no known solution. There are many ways to work around the problem, but none to fix it. Hence the ever-growing list of acronyms for blaming the user without being rude: ID-ten-T error, PICNIC, PEBKAC, etc.

+1

"If what you do is not working, stop doing it and try something else"

As pointed out in other comments, this is a communication problem and, to a lesser extent, a perception problem. People are FAR more likely to blame whatever they understand least, so that they can feel like the victim. A motor can spark, shoot flames, and explode because someone heavily overloaded the feeder (despite warnings not to plastered all over it), but if the software stops responding, guess what caused the problem?

Since giving each of your users introductory EE and CS classes is completely out of the question, fall back on good ole communication. Its basis (mostly my opinion) is four things, in this order: what you observe, what you feel, what you think, and what needs to be done. With that in mind, I will put it into practice in this answer.

It looks like your users like blaming the software when some of the underlying hardware is the real issue (observe). Trying to explain this to the users is unproductive and a waste of time: it is not their job, and most of them will not care (feel). What you might want to try is talking to the engineering team about the parts they use, and exploring what works best with the software as a whole. Maybe there are some limitations that have never been addressed (think)? Changing the hardware, or just understanding it better, may be the real answer, along with more targeted errors and feedback for these users (do).

+1

Who is it that reports the problems?

If it is the end users, I don't think this is a problem. They just know that what they are trying to do does not work. It is not the user's responsibility to diagnose the problem. All they know is: "I tried to do X, Y should have happened, but Z happened instead." Everything beyond that is your problem.

If the hardware people insist that the problem is software, and the software people insist that the problem is hardware, you need to improve the software so that it diagnoses errors more accurately, as ChrisF and others noted.

If the higher-ups blame the software group for problems that are the hardware group's responsibility, and you are tired of taking the blame for other people's mistakes, well, I understand that. Again, as a software developer, you have the ability to create more accurate error messages. If you can say outright that “the stepper motor is not responding” or the like, then you have the “moral authority” to insist that someone run diagnostics on the stepper motor. Just saying “I'm pretty sure it's a hardware problem” won't win the argument.

+1

Development with testing (which does not necessarily mean "test-driven") is what you need to put resources into.

In principle, each subsystem should have a sufficiently thorough set of unit tests to identify problems before integration. Every time a problem arises, test the hardware so you can know for sure (or almost sure) that it is a hardware problem. This means that the equipment must be designed so that it can be thoroughly tested.
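One possible way to make each side testable in isolation is to put the hardware behind an interface and substitute test doubles. The sketch below is an assumption about how that might look; `StepperMotor`, `DeadMotor`, `HealthyMotor`, and `printSample` are hypothetical names.

```cpp
#include <string>

// Hypothetical abstraction over one piece of hardware.
class StepperMotor {
public:
    virtual ~StepperMotor() = default;
    // Returns true if the motor acknowledged the move.
    virtual bool moveTo(int position) = 0;
};

// Test double simulating a motor that never responds.
class DeadMotor : public StepperMotor {
public:
    bool moveTo(int) override { return false; }
};

// Test double simulating a healthy motor.
class HealthyMotor : public StepperMotor {
public:
    bool moveTo(int) override { return true; }
};

// Software under test: it must report a dead motor as a hardware fault.
std::string printSample(StepperMotor& motor, int slidePosition) {
    if (!motor.moveTo(slidePosition)) {
        return "The stepper motor is not responding.";
    }
    return "Sample printed.";
}
```

A unit test can pass in `DeadMotor` to confirm that the software blames the motor when the motor is at fault. A real driver implementing the same interface could then be exercised by a hardware self-test whenever a problem is reported.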

I was the integration manager for my college robotics team, and this tactic helped a lot.

Hope this helps.

0

First, make sure your users are likely to read and understand your error messages. Displaying "FPGA GS_WIDGIT_FROB command returned an invalid response 0xFF45001C. Turning off the id 576D controller. (Error 1Xf)" may be great for you, but the user is likely to hit "Ok" without reading it. Even if they do read it, it gives them no useful information. Either way, you get a phone call. Display "The Widgit Frobber requires maintenance" instead, while still logging all the gory details somewhere, and you'll probably get fewer calls.
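A small sketch of that split between what the user sees and what gets logged. `ErrorOutput` and `reportFault` are hypothetical names, and the "[HARDWARE]" log prefix is an assumption for illustration.

```cpp
#include <string>

// Hypothetical pair of outputs produced for a single hardware fault.
struct ErrorOutput {
    std::string userMessage;  // short text shown in the dialog
    std::string logEntry;     // full technical detail for the developer
};

// Produces a plain-language message for the user and keeps the raw detail for the log.
ErrorOutput reportFault(const std::string& deviceName, const std::string& detail) {
    ErrorOutput out;
    out.userMessage = "The " + deviceName + " requires maintenance.";
    out.logEntry = "[HARDWARE] " + deviceName + ": " + detail;
    return out;
}
```

For example, `reportFault("Widgit Frobber", "FPGA GS_WIDGIT_FROB command returned an invalid response 0xFF45001C")` would give the user one readable sentence while the hex code goes only to the log.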

Secondly, you know that this is a hardware problem, so do something about it! Have the software email support, or do whatever else it takes to start getting the problem fixed. If the user is forced to decide what action to take to fix it, you can bet that they will get it wrong, at least some of the time. If the user sees “The Widgit Frobber requires service. Equipment support has been notified (ticket number 234)”, they know that they do not need to do anything.

0
