Bugs: UFF transfer fail and too many open files.

Espen Wang espen at nepse.net
Wed Apr 6 15:28:27 CST 2005


Hi EggDev.

I'd be happy to dig up any information you may need to investigate
further.
Don't hesitate to contact me.

=====================================================================
1) INFORMATION ABOUT YOUR EGGDROP

1.1) Eggdrop version:
     1.6.17

1.2) Make type:
     ( ) dynamic
     (X) static
     ( ) debug
     ( ) sdebug

1.3) List of any options passed to ./configure:

None.

1.4) List of patches and/or modules you use:

All these are standard bundled modules:
ctcp
irc
server
encryption
transfer
share
compress
notes
console
channels
dns

=====================================================================
2) INFORMATION ABOUT TCL

2.1) Tcl library version:
     ( ) 7.0
     ( ) 7.1
     ( ) 7.2
     ( ) 7.3
     ( ) 7.4
     ( ) 7.5
     ( ) 7.6
     ( ) 8.0
     ( ) 8.1
     ( ) 8.2
     ( ) 8.3
     (X) 8.4
     ( ) 8.5
     ( ) Other - Which? ____

2.2) Tcl library patchlevel: 7
  eg; p1, p2, etc for Tcl versions up to 8.0p2
      or the 3rd part of the version number for 8.0.3 and newer

2.3) Tcl scripts used:
     [ ] alltools
     [ ] sentinel
     [ ] getops
     [X] others - Please mention all others:

SpeedNet 2.44 (http://nepse.net/files/speednet2.44.tgz)

=====================================================================
3) INFORMATION ABOUT THE OS

3.1) OS type:
     ( ) BeOS
     ( ) BSD/OS
     ( ) Cygwin
     ( ) Darwin/Mac OS X
     ( ) Dell SVR4
     (X) FreeBSD
     ( ) HP-UX
     ( ) IRIX
     ( ) Linux
     ( ) Lynx
     ( ) NetBSD
     ( ) NeXT
     ( ) OpenBSD
     ( ) OSF/Tru64
     ( ) QNX
     ( ) SINIX
     ( ) Solaris/SunOS
     ( ) Ultrix
     ( ) Other - Which? _____________

3.2) OS Version/Release: 5.3-STABLE

=====================================================================
4) BUG DETAILS

4.1) The logged last context (example: Last context: userent.c/973 []):

Still online.

4.2) If the bot wrote to the file DEBUG, copy the text -contents- of
     that file here (NOTE: It should be about 20 lines of info, but it
     could be a few lines more):

None.

4.3) Your comments and a description of the bug:

I discovered that the hub was unresponsive. From 'ps', I learned that the
eggdrop process was running at 99% CPU usage.
My investigation turned up nothing, so I killed it and restarted it.
Now I discovered that a leaf was having trouble accepting the userfile
transfer on link.
The filesystem on the leaf shell was mounted read-only. Hurray!
Since there was nothing I could do but alert the admin (still obviously
sleeping), I left the bots alone to play by themselves.
After a long run of "link - fail userfile transfer - unlink" cycles, the
error messages suddenly changed to this:

sidhatt = hub, myser = leaf
[22:53:00] sidhatt [21:53] ERROR writing user file to transfer.
[22:53:00] sidhatt [21:53] Failed to compress file
`.share.myser.1112820780': not a file.
[22:53:00] sidhatt [21:53] uff parsing failed
[22:53:00] sidhatt [21:53] Sending user file send request to myser

[22:52:59] myser [21:52] Downloading user file from sidhatt
[22:53:00] myser [21:52] Ending sharing with sidhatt (uff parsing
failed).
[22:53:00] myser [21:52] (Userlist download aborted.)

After some investigation, I stumbled over this:

[23:00:45] dizz .nettcl sidhatt exec ls
[23:00:46] sidhatt [22:00] #dizz# nettcl
[23:00:46] sidhatt [22:00] (net/log) [cmd] sidhatt: dizz requested
sidhatt tcl "exec ls"
[23:00:46] sidhatt sidhatt Tcl: couldn't create output pipe for command:
too many open files

The hub had too many open files, and thus couldn't create or open new
ones.
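
To confirm a descriptor leak like this, one option is to watch how many
descriptors a process holds over time. The sketch below is Python, not
Eggdrop code, and only inspects the process it runs in; the helper name
count_open_fds and the probe cap of 4096 are assumptions made for
illustration. It reads the per-process limit (RLIMIT_NOFILE) and probes
each descriptor number with os.fstat.

```python
import os
import resource


def count_open_fds(max_probe=4096):
    """Count descriptors open in this process by probing each number.

    max_probe is an arbitrary cap (an assumption) so the loop stays
    short even when the soft limit is very large or unlimited.
    """
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY:
        upper = max_probe
    else:
        upper = min(soft_limit, max_probe)

    open_count = 0
    for fd in range(upper):
        try:
            os.fstat(fd)
        except OSError:
            continue  # not an open descriptor
        open_count += 1
    return open_count


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("descriptor limit (soft/hard):", soft, hard)
    print("open descriptors:", count_open_fds())
```

For an already running bot on FreeBSD, "fstat -p <pid>" lists the files a
process has open, which should show the same growth from the outside.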

4.4) Can you cause the bug condition to repeat? If so, please outline
     step by step what causes the error:

It happened twice. I think the two incidents are related, but I could be
wrong. I have little information about the first incident. The second is
easier, as I still had control over the bot. I think it is reproducible.

1. Standard hub->leaf userfile share setup.
2. Enable compression on userfile transfers.
3. Deny write access to the userfile/temp path(?) on the leaf. (In my case
the filesystem was mounted read-only.)
4. Watch as the leaf links, fails to accept the transfer and gets
unlinked.
5. Repeat "4" for a long time. (Go the some runts :p)
6. The hub should now have too many file descriptors open, and fail to
create the temp file for the transfer.
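
The effect of the steps above can be imitated outside Eggdrop. The
following Python sketch is a simulation under stated assumptions, not the
hub's real code path: the function name leak_until_emfile, the file name
pattern and the lowered limit of 64 are all made up for illustration. Each
loop iteration stands for one "link - fail - unlink" cycle that opens a
temp file and never closes it, until open() fails with EMFILE ("too many
open files").

```python
import errno
import os
import resource
import shutil
import tempfile


def leak_until_emfile(soft_limit=64):
    """Leak one descriptor per simulated failed transfer.

    Returns the number of cycles completed before open() fails with
    EMFILE. The soft limit is lowered temporarily so this ends quickly.
    """
    old_soft, old_hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    workdir = tempfile.mkdtemp(prefix="share-sim-")
    leaked = []
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, old_hard))
        cycles = 0
        while True:
            # Hypothetical temp file name, loosely modelled on the log.
            path = os.path.join(workdir, ".share.leaf.%d" % cycles)
            try:
                fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
            except OSError as exc:
                if exc.errno == errno.EMFILE:
                    return cycles  # the process ran out of descriptors
                raise
            # The simulated transfer is aborted here and the descriptor
            # is never closed. This is the suspected bug.
            leaked.append(fd)
            cycles += 1
    finally:
        # Clean up so the calling process is usable again.
        for fd in leaked:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, old_hard))
        shutil.rmtree(workdir, ignore_errors=True)


if __name__ == "__main__":
    print("cycles before EMFILE:", leak_until_emfile())
```

With a real per-process limit in the thousands and one leaked descriptor
per failed link, this would match a failure that only shows up after many
cycles.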

4.5) Do you have ideas on what is wrong that causes this error?
     Please list them:
     
The hub fails to close the temp file descriptor when a leaf aborts a file
transfer.

4.6) Do you have ideas on how to correct it?  Please list them:

Make sure the file descriptor gets closed, even when a transfer is aborted.
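
As an illustration of that idea only, here is a Python sketch of the
"close on every path" pattern. It is not Eggdrop's transfer code, which is
written in C and works asynchronously; send_userfile, write_userfile and
start_transfer are hypothetical names, and removing the temp file
afterwards is an assumption about the desired cleanup.

```python
import os
import tempfile


def send_userfile(write_userfile, start_transfer):
    """Create a temp file, hand it to a transfer, and always clean up.

    write_userfile(fd) and start_transfer(path) are hypothetical
    callables standing in for the real steps. Either may raise.
    """
    fd, path = tempfile.mkstemp(prefix=".share.")
    try:
        write_userfile(fd)
        start_transfer(path)
    finally:
        # Runs on success and on every failure, so an aborted
        # transfer cannot leave the descriptor open.
        os.close(fd)
        os.unlink(path)
```

The point of the sketch is that the cleanup sits in one place that every
exit path passes through, rather than being repeated on each error branch.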

4.7) Other comments?

The bot had an uptime of about 2.5 months before I killed it. Experience
tells me memory leaks and similar problems can show up in the long run.

4.8) If the bot dumped a 'core' file when it crashed, it would be *very*
     useful if you could paste gdb's output during the following steps:
     First call gdb
         $ gdb eggdrop -c core
     and then enter 'bt' on gdb's command line:
         (gdb) bt
     Keep your core file for at least one week, so that the dev team
     can ask for further information if needed. However, don't send
     us the core file unless we ask for it.

     NOTE: If this is a bug you can reproduce, please compile with
           make debug and follow the above step. It can greatly help
           find and fix the bug.

None dumped.

=====================================================================


-- 
Best regards,
Espen Wang
[ contact(at)nepse(dot)net | http://nepse.net/ ]



